TPM recovery
TPM recovery is BMO’s runtime response to provider rate-limit and
token-budget errors. It is a bounded, classification-driven cascade that
attempts cap retry, same-provider model swap, a configured cross-provider
chain, and an opt-in local last-resort hop — in that order — before letting
the run fail. Operators inspect the live posture through a single
tpmrecovery.PostureSnapshot envelope shared across CLI, HTTP, the TUI
slash command, and the in-process agent tool.
The recovery cascade is implemented in internal/agent/llmerror
(classification + decision logic) and the coordinatorRunExecutor
(application). The posture surface is in internal/tpmrecovery and is
read-only: it observes the live cascade without changing decision
semantics.
When recovery fires
A provider error must classify as rate-limit or token-budget class
(typically HTTP 429, “TPM exceeded”, “rate limit reached”, or quota
errors) for the cascade to engage.
llmerror.EligibleForCrossProviderAfterTPMRecovery
is the single classifier; non-eligible errors (auth, schema, network reset,
provider 5xx without a rate-limit signature) bypass the cascade and surface
as terminal failures with no fallback applied.
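The eligibility rule above can be illustrated with a minimal Go sketch. It is not the code behind llmerror.EligibleForCrossProviderAfterTPMRecovery: the providerError type, the eligibleForCascade function, and the signature substrings are hypothetical stand-ins chosen only to mirror the examples named in this section.

```go
package main

import (
	"fmt"
	"strings"
)

// providerError is a hypothetical, simplified view of a provider failure.
type providerError struct {
	StatusCode int
	Message    string
}

// rateLimitSignatures are illustrative substrings taken from the examples
// in the text; the real classifier's list is not documented here.
var rateLimitSignatures = []string{"tpm exceeded", "rate limit reached", "quota"}

// eligibleForCascade reports whether an error looks like a rate-limit or
// token-budget failure. HTTP 429 qualifies on its own; any other status
// qualifies only when the message carries a rate-limit signature.
func eligibleForCascade(e providerError) bool {
	if e.StatusCode == 429 {
		return true
	}
	msg := strings.ToLower(e.Message)
	for _, sig := range rateLimitSignatures {
		if strings.Contains(msg, sig) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(eligibleForCascade(providerError{429, "Too Many Requests"}))      // true
	fmt.Println(eligibleForCascade(providerError{503, "TPM exceeded for model"})) // true
	fmt.Println(eligibleForCascade(providerError{401, "invalid api key"}))        // false
	fmt.Println(eligibleForCascade(providerError{500, "internal error"}))         // false
}
```

In this sketch a 5xx response is eligible only when its message carries a rate-limit signature, which matches the bypass rule for plain 5xx errors described above.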
When the cascade does engage, the run is annotated with a closed-enum
stage label: cap_retry, cross_provider:N (N is the zero-based hop index),
or tpm_local_last_resort:N. Same-provider model swaps via
large_model_fallback ride the cap-retry stage and are surfaced through
prose in the run summary, not through a distinct stage label.
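For tooling that consumes these labels, a small Go sketch of building and parsing them may help. The stageLabel and parseStage helpers are hypothetical and not part of BMO; the sketch assumes only the three label shapes listed above and reports a hop index of -1 for cap_retry, which carries none.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stageLabel builds a hop-indexed label such as "cross_provider:0".
func stageLabel(stage string, hop int) string {
	return stage + ":" + strconv.Itoa(hop)
}

// parseStage splits a label into its stage name and zero-based hop index.
// cap_retry has no index, so its hop is reported as -1.
func parseStage(label string) (string, int, error) {
	if label == "cap_retry" {
		return label, -1, nil
	}
	name, idx, ok := strings.Cut(label, ":")
	if !ok || (name != "cross_provider" && name != "tpm_local_last_resort") {
		return "", 0, fmt.Errorf("unknown stage label %q", label)
	}
	hop, err := strconv.Atoi(idx)
	if err != nil || hop < 0 {
		return "", 0, fmt.Errorf("bad hop index in %q", label)
	}
	return name, hop, nil
}

func main() {
	label := stageLabel("cross_provider", 2)
	stage, hop, err := parseStage(label)
	fmt.Println(label, stage, hop, err) // cross_provider:2 cross_provider 2 <nil>
}
```

Rejecting anything outside the three known shapes keeps the parser aligned with the closed enum, so a same-provider swap never shows up as its own stage.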
The four tiers
| Tier | Stage label(s) | Trigger |
|---|---|---|
| 1. Cap retry + same-provider model swap | cap_retry | Decision via llmerror.DecideRecoveryAction; same provider, optional swap to large_model_fallback. |
| 2. Cross-provider chain | cross_provider:0 … cross_provider:N-1 | Walks options.tpm_recovery.provider_model_fallback, bounded by max_provider_fallback_hops (default 3). |
| 3. Opt-in local last-resort | tpm_local_last_resort:0 … tpm_local_last_resort:N-1 | Only when options.tpm_recovery.allow_local_last_resort = true AND the configured target is a loopback base_url. Bounded by max_local_last_resort_hops (default 1). |
| 4. Terminal failure | no further stage | Surfaces as provider_recovery event with outcome=exhausted. |
Each tier is bounded and additive — a single run advances at most
max_provider_fallback_hops + max_local_last_resort_hops cross-process
hops before the cascade is declared exhausted.
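To make the hop bounds concrete, the Go sketch below lists the stage labels a run could pass through if every attempt keeps failing with an eligible error. The recoveryConfig struct and plannedStages function are hypothetical. The sketch also assumes the cross-provider walk stops at whichever is smaller, the chain length or max_provider_fallback_hops, which the text does not state explicitly.

```go
package main

import "fmt"

// recoveryConfig mirrors the documented option names; the struct itself is
// hypothetical and exists only for this sketch.
type recoveryConfig struct {
	ProviderModelFallback   []string // options.tpm_recovery.provider_model_fallback
	MaxProviderFallbackHops int      // documented default: 3
	AllowLocalLastResort    bool     // options.tpm_recovery.allow_local_last_resort
	LocalTargetIsLoopback   bool     // whether the configured base_url is loopback
	MaxLocalLastResortHops  int      // documented default: 1
}

// plannedStages returns, in order, the stage labels reachable before the
// cascade is exhausted.
func plannedStages(c recoveryConfig) []string {
	stages := []string{"cap_retry"}

	// Assumption: hops are capped by both the chain length and the bound.
	hops := len(c.ProviderModelFallback)
	if c.MaxProviderFallbackHops < hops {
		hops = c.MaxProviderFallbackHops
	}
	for i := 0; i < hops; i++ {
		stages = append(stages, fmt.Sprintf("cross_provider:%d", i))
	}

	// Tier 3 needs both the opt-in flag and a loopback target.
	if c.AllowLocalLastResort && c.LocalTargetIsLoopback {
		for i := 0; i < c.MaxLocalLastResortHops; i++ {
			stages = append(stages, fmt.Sprintf("tpm_local_last_resort:%d", i))
		}
	}
	return stages
}

func main() {
	cfg := recoveryConfig{
		// Placeholder selectors, not real model names.
		ProviderModelFallback:   []string{"provider-a/m1", "provider-b/m2", "provider-c/m3", "provider-d/m4"},
		MaxProviderFallbackHops: 3,
		AllowLocalLastResort:    true,
		LocalTargetIsLoopback:   true,
		MaxLocalLastResortHops:  1,
	}
	fmt.Println(plannedStages(cfg))
	// [cap_retry cross_provider:0 cross_provider:1 cross_provider:2 tpm_local_last_resort:0]
}
```

With the documented defaults, the run makes at most 3 + 1 = 4 hops after the cap-retry stage, even though the example chain lists four selectors.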
Operator surfaces
Every operator surface reads the same tpmrecovery.PostureSnapshot:
| Surface | Purpose |
|---|---|
| bmo config show-tpm-recovery | Default config-only summary (back-compat). |
| bmo config show-tpm-recovery --runtime | Live runtime posture: config flags, hop bounds, local-last-resort target host, provider cooldown summary, recent-event ring tail. |
| bmo config show-tpm-recovery --runtime --json | Same posture, machine-parseable JSON. |
| GET /v1/tpm-recovery/posture | HTTP route; same JSON envelope, gated by the standard requireAuth helper. |
| /recovery (TUI slash; aliases /tpm-recovery, /tpm_recovery, /tpm-recovery-posture) | Inline posture report in the chat transcript, sharing the same RenderPosture formatter. |
| list_recent_provider_recovery_events (native agent tool) | Bounded metadata-only ring entries as JSON, with closed-enum outcome filter. |
| tpmrecovery.BuildSnapshot / RenderPosture | Library accessors used by every surface above. |
The cross-surface JSON parity contract is enforced by
bmo/internal/cmd/cross_surface_parity_tpm_recovery_test.go. Any new field
on PostureSnapshot must surface on every arm or the test fails.
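A client for the HTTP surface could look like the following Go sketch. Only the route path comes from this page. The bearer-token header is an assumption about how requireAuth is satisfied, and the envelope is decoded into a generic map because the PostureSnapshot field names are not listed here. The fetchFromStub helper runs the client against an in-process stand-in server with a placeholder body, so the sketch executes without a live BMO instance.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// fetchPosture GETs the posture route and decodes the JSON envelope.
// Assumption: the server accepts an "Authorization: Bearer <token>" header.
func fetchPosture(client *http.Client, baseURL, token string) (map[string]any, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/v1/tpm-recovery/posture", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("posture request failed: %s", resp.Status)
	}
	var snapshot map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&snapshot); err != nil {
		return nil, err
	}
	return snapshot, nil
}

// fetchFromStub exercises fetchPosture against a stand-in server. The
// response body is a placeholder, not the real PostureSnapshot shape.
func fetchFromStub(token string) (map[string]any, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/v1/tpm-recovery/posture" {
			http.NotFound(w, r)
			return
		}
		if r.Header.Get("Authorization") != "Bearer example-token" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"placeholder_field": true}`)
	}))
	defer srv.Close()
	return fetchPosture(srv.Client(), srv.URL, token)
}

func main() {
	snapshot, err := fetchFromStub("example-token")
	fmt.Println(snapshot, err)
}
```

Decoding into a map keeps the client tolerant of new PostureSnapshot fields, which the parity test requires to appear on every surface.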
Recent-event ring
Every cascade entry into the executor records a metadata-only event on a process-global FIFO ring (default capacity 64). Each event carries:
- Stage (closed enum, see above)
- Outcome (running, recovered, exhausted)
- FromProvider / FromModel and ToProvider / ToModel
- FNV32-hashed SessionIDHash and RunIDHash
- Wall-clock timestamp
The ring is wholly metadata: no error bodies, no prompts, no full session
or run identifiers, no full base URLs. Redaction is enforced at the producer
boundary in coordinator_run_executor.go::publishProviderRecoveryStatus and
gated by a parity test that injects a synthetic sentinel.
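The ring and the identifier hashing can be sketched in Go as follows. This is an illustration of the pattern and not BMO's implementation: the event and ring types are hypothetical, the sketch omits the locking a process-global ring would need, and it assumes the FNV-1a variant because the page says only "FNV32".

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashID returns a 32-bit FNV hash of an identifier as 8 hex digits.
// Assumption: FNV-1a; the text does not say which FNV32 variant is used.
func hashID(id string) string {
	h := fnv.New32a()
	h.Write([]byte(id))
	return fmt.Sprintf("%08x", h.Sum32())
}

// event holds metadata only: no error bodies, prompts, or raw identifiers.
type event struct {
	Stage, Outcome           string
	SessionIDHash, RunIDHash string
}

// ring is a fixed-capacity FIFO: once full, each push evicts the oldest event.
type ring struct {
	buf   []event
	start int // index of the oldest event
	size  int // number of stored events
}

func newRing(capacity int) *ring {
	if capacity < 1 {
		capacity = 1
	}
	return &ring{buf: make([]event, capacity)}
}

func (r *ring) push(e event) {
	if r.size < len(r.buf) {
		r.buf[(r.start+r.size)%len(r.buf)] = e
		r.size++
		return
	}
	r.buf[r.start] = e // overwrite the oldest slot
	r.start = (r.start + 1) % len(r.buf)
}

// tail returns up to n of the most recent events, oldest first.
func (r *ring) tail(n int) []event {
	if n > r.size {
		n = r.size
	}
	if n < 0 {
		n = 0
	}
	out := make([]event, 0, n)
	for i := r.size - n; i < r.size; i++ {
		out = append(out, r.buf[(r.start+i)%len(r.buf)])
	}
	return out
}

func main() {
	r := newRing(64)
	r.push(event{
		Stage:         "cap_retry",
		Outcome:       "running",
		SessionIDHash: hashID("session-123"), // placeholder identifiers
		RunIDHash:     hashID("run-456"),
	})
	fmt.Println(r.tail(10))
}
```

Hashing at the point where the event is built means the raw identifiers never reach the ring, which is the same producer-boundary idea the redaction rule describes.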
Disable per-recipe with FailFast
Per-recipe fail-fast (settings.fail_fast: true, per-pipeline
fail_fast, or bmo recipe run --fail-fast) disables Tiers 2 and 3 only.
Tier 1 (cap retry + same-provider large_model_fallback) still applies,
because that path stays inside the same provider/key and does not engage
the cross-provider chain. Use FailFast when a recipe must fail loudly
rather than degrade silently across providers.
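The fail-fast gate reduces to a small predicate, sketched here in Go. The tierEnabled function is hypothetical and models only the fail-fast rule. It does not repeat the separate opt-in and loopback conditions that Tier 3 also requires.

```go
package main

import "fmt"

// tierEnabled reports whether a recovery tier (1 to 3) may run for a recipe,
// considering only the fail-fast setting.
func tierEnabled(tier int, failFast bool) bool {
	switch tier {
	case 1:
		// Cap retry and same-provider swap stay inside one provider/key.
		return true
	case 2, 3:
		// Cross-provider and local last-resort hops are disabled by fail-fast.
		return !failFast
	default:
		return false
	}
}

func main() {
	for tier := 1; tier <= 3; tier++ {
		fmt.Printf("tier %d with fail-fast: %v\n", tier, tierEnabled(tier, true))
	}
}
```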
Discover free fallbacks
bmo config discover-free-fallbacks is an operator-initiated catalog
helper that surfaces OpenRouter free-tier candidates that meet a strict
filter (text output, zero-priced, tool-capable). It is approval-list
generation, not runtime discovery — the runtime still walks only the
configured provider_model_fallback selectors. Use it to seed or refresh
your cross-provider chain after upstream catalog churn.
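The strict filter (text output, zero-priced, tool-capable) can be sketched as a Go predicate. The catalogModel struct is a hypothetical, simplified catalog entry and does not follow the real OpenRouter schema, and the model IDs in the example are placeholders.

```go
package main

import "fmt"

// catalogModel is a hypothetical catalog entry for illustration only.
type catalogModel struct {
	ID               string
	OutputModalities []string
	PromptPrice      float64
	CompletionPrice  float64
	SupportsTools    bool
}

// isFreeFallbackCandidate applies the three strict-filter conditions:
// the model produces text output, is zero-priced, and is tool-capable.
func isFreeFallbackCandidate(m catalogModel) bool {
	textOut := false
	for _, modality := range m.OutputModalities {
		if modality == "text" {
			textOut = true
			break
		}
	}
	zeroPriced := m.PromptPrice == 0 && m.CompletionPrice == 0
	return textOut && zeroPriced && m.SupportsTools
}

func main() {
	catalog := []catalogModel{
		{ID: "example/free-tools", OutputModalities: []string{"text"}, SupportsTools: true},
		{ID: "example/free-no-tools", OutputModalities: []string{"text"}},
		{ID: "example/paid-tools", OutputModalities: []string{"text"}, PromptPrice: 0.000001, SupportsTools: true},
	}
	for _, m := range catalog {
		if isFreeFallbackCandidate(m) {
			fmt.Println(m.ID) // example/free-tools
		}
	}
}
```

The output of such a filter is a candidate list for operator review. Nothing in it feeds the runtime until the selectors are added to provider_model_fallback.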
    bmo config discover-free-fallbacks
    bmo config discover-free-fallbacks --include-text-only        # opt-in to text-only models
    bmo config discover-free-fallbacks --include-openrouter-free  # opt-in to the bare openrouter/free router
Tracing recipe
When a run engaged the cascade and operators want the full hop genealogy:
1. /recovery (TUI) or bmo config show-tpm-recovery --runtime: confirm config flags, hop bounds, and recent-ring tail.
2. Filter the recent ring for the failing run’s hash:
       bmo agent-tool list_recent_provider_recovery_events --json \
         | jq '.events[] | select(.outcome=="exhausted")'
3. Tail the structured log for provider_recovery.changed events:
       bmo logs --tail 1000 | jq -c 'select(.msg|startswith("provider_recovery"))'
4. Pair cap_retry → cross_provider:N → tpm_local_last_resort:N by session_id_hash for the full recovery lifecycle.
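The pairing in the last step can be done programmatically once the events are exported. The Go sketch below groups events by session hash and orders each group by time. The recoveryEvent type, the lifecycles function, and the sample data are hypothetical, and the field names do not claim to match the JSON keys of the real tool output.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// recoveryEvent is a hypothetical, trimmed-down ring entry.
type recoveryEvent struct {
	Stage, Outcome, SessionIDHash string
	At                            time.Time
}

// lifecycles groups events by session hash and sorts each group by
// timestamp, so each slice reads as one session's hop sequence.
func lifecycles(events []recoveryEvent) map[string][]recoveryEvent {
	bySession := make(map[string][]recoveryEvent)
	for _, e := range events {
		bySession[e.SessionIDHash] = append(bySession[e.SessionIDHash], e)
	}
	for _, group := range bySession {
		sort.SliceStable(group, func(i, j int) bool {
			return group[i].At.Before(group[j].At)
		})
	}
	return bySession
}

// sampleEvents returns made-up, out-of-order events for two sessions.
func sampleEvents() []recoveryEvent {
	t0 := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	return []recoveryEvent{
		{"cross_provider:0", "running", "a1b2c3d4", t0.Add(2 * time.Second)},
		{"cap_retry", "running", "a1b2c3d4", t0},
		{"cap_retry", "recovered", "0f0f0f0f", t0.Add(time.Second)},
		{"tpm_local_last_resort:0", "exhausted", "a1b2c3d4", t0.Add(5 * time.Second)},
	}
}

func main() {
	for _, e := range lifecycles(sampleEvents())["a1b2c3d4"] {
		fmt.Println(e.Stage, e.Outcome)
	}
	// cap_retry running
	// cross_provider:0 running
	// tpm_local_last_resort:0 exhausted
}
```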
The recovery cascade also writes a v1 tpm recovery: … summary line to
the run residue (autopilot/run_residue/{run_id}.json) for both
coordinator and scheduled-recipe runs.
Scope and non-goals
- Decision semantics are stable. This iteration delivers observability and surface parity. RecoveryAction, EligibleForCrossProviderAfterTPMRecovery, and the cap-retry/same-provider/cross-provider/local-last-resort policy itself are unchanged.
- The degraded_recovery per-session HTTP route and agent tool remain the canonical disposition packet for a finished session. The new posture surface is the fleet view; degraded_recovery is the session view.
Related
- TPM recovery topic: maintainer-facing config reference, OpenRouter :exacto notes, and discovery filter details.
- Fleet metabolism: paired posture/ring pattern donor.
- Troubleshooting: operator-facing rate-limit entry with cross-link to this page.
- CLI reference: bmo config show-tpm-recovery flags.