TPM recovery

TPM recovery is BMO’s runtime response to provider rate-limit and token-budget errors. It is a bounded, classification-driven cascade that attempts cap retry, same-provider model swap, a configured cross-provider chain, and an opt-in local last-resort hop — in that order — before letting the run fail. Operators inspect the live posture through a single tpmrecovery.PostureSnapshot envelope shared across CLI, HTTP, the TUI slash command, and the in-process agent tool.

The recovery cascade is implemented in internal/agent/llmerror (classification + decision logic) and the coordinatorRunExecutor (application). The posture surface is in internal/tpmrecovery and is read-only: it observes the live cascade without changing decision semantics.

A provider error must classify as rate-limit or token-budget class — typically HTTP 429, “TPM exceeded”, “rate limit reached”, or quota errors — for the cascade to engage. llmerror.EligibleForCrossProviderAfterTPMRecovery is the single classifier; non-eligible errors (auth, schema, network reset, provider 5xx without a rate-limit signature) bypass the cascade and surface as terminal failures with no fallback applied.
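As a rough illustration of this eligibility gate, the sketch below classifies an error by HTTP status and message text. It is not the llmerror implementation: the `providerError` type, the `eligibleForCascade` name, and the signature substrings are assumptions chosen to mirror the examples in the paragraph above.

```go
package main

import "strings"

// providerError is a hypothetical, simplified view of a provider failure.
type providerError struct {
	Status  int    // HTTP status code, 0 if no response was received
	Message string // provider-supplied error text
}

// rateLimitSignatures are illustrative substrings only; the real classifier
// may match a different set.
var rateLimitSignatures = []string{
	"tpm exceeded",
	"rate limit reached",
	"quota",
}

// eligibleForCascade reports whether the error looks like a rate-limit or
// token-budget failure. Auth, schema, network, and plain 5xx errors return
// false and would surface as terminal failures.
func eligibleForCascade(e providerError) bool {
	if e.Status == 429 {
		return true
	}
	msg := strings.ToLower(e.Message)
	for _, sig := range rateLimitSignatures {
		if strings.Contains(msg, sig) {
			return true
		}
	}
	return false
}
```

Under these assumptions a 5xx response is eligible only when its message carries a rate-limit signature, which matches the "5xx without a rate-limit signature" exclusion described above.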

When the cascade does engage, the run is annotated with a closed-enum stage label: cap_retry, cross_provider:N (N is the zero-based hop index), or tpm_local_last_resort:N. Same-provider model swaps via large_model_fallback ride the cap-retry stage and are surfaced through prose in the run summary, not through a distinct stage label.
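Tooling that consumes these labels may want to split them into a kind and a hop index. The parser below is a minimal sketch of that; `parseStage` is a hypothetical helper, not part of BMO, and it reports a hop of -1 for cap_retry because that label carries no index.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseStage splits a stage label into its kind and zero-based hop index.
// cap_retry has no index, so its hop is reported as -1.
func parseStage(label string) (string, int, error) {
	if label == "cap_retry" {
		return label, -1, nil
	}
	kind, idx, ok := strings.Cut(label, ":")
	if !ok || (kind != "cross_provider" && kind != "tpm_local_last_resort") {
		return "", 0, fmt.Errorf("unknown stage label %q", label)
	}
	hop, err := strconv.Atoi(idx)
	if err != nil || hop < 0 {
		return "", 0, fmt.Errorf("bad hop index in %q", label)
	}
	return kind, hop, nil
}
```

Rejecting anything outside the three known kinds keeps the parser aligned with the closed enum, so a new label fails loudly instead of being silently misread.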

flowchart TB
  err["Provider error"]
  eligible{"EligibleForCrossProviderAfterTPMRecovery?"}
  cap["Tier 1: cap retry + same-provider large_model_fallback"]
  same{"Same-provider exhausted?"}
  cross["Tier 2: cross-provider chain (provider_model_fallback)"]
  crossExhaust{"Cross-provider hops exhausted? (max_provider_fallback_hops)"}
  local["Tier 3: opt-in local last-resort (allow_local_last_resort)"]
  fail["Terminal failure: provider_recovery exhausted"]
  err --> eligible
  eligible -- no --> fail
  eligible -- yes --> cap
  cap --> same
  same -- no --> cap
  same -- yes --> cross
  cross --> crossExhaust
  crossExhaust -- no --> cross
  crossExhaust -- yes --> local
  local --> fail
| Tier | Stage label(s) | Trigger |
| --- | --- | --- |
| 1. Cap retry + same-provider model swap | cap_retry | Decision via llmerror.DecideRecoveryAction; same provider, optional swap to large_model_fallback. |
| 2. Cross-provider chain | cross_provider:0 through cross_provider:N-1 | Walks options.tpm_recovery.provider_model_fallback, bounded by max_provider_fallback_hops (default 3). |
| 3. Opt-in local last-resort | tpm_local_last_resort:0 through tpm_local_last_resort:N-1 | Only when options.tpm_recovery.allow_local_last_resort = true AND the configured target is a loopback base_url. Bounded by max_local_last_resort_hops (default 1). |
| 4. Terminal failure | no further stage | Surfaces as provider_recovery event with outcome=exhausted. |
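
The loopback condition on Tier 3 can be checked roughly as follows. This is a sketch using only the Go standard library; `isLoopbackBaseURL` is a hypothetical name, and treating the literal host "localhost" as loopback is an assumption that BMO's own guard may or may not share.

```go
package main

import (
	"net"
	"net/url"
)

// isLoopbackBaseURL reports whether baseURL (scheme included) points at a
// loopback host such as 127.0.0.1 or ::1.
func isLoopbackBaseURL(baseURL string) bool {
	u, err := url.Parse(baseURL)
	if err != nil {
		return false
	}
	host := u.Hostname()
	if host == "localhost" {
		// Assumption: the name is trusted without a DNS lookup.
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}
```

A URL without a scheme yields an empty hostname here and is therefore rejected, which errs on the side of not engaging the local hop.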

Each tier is bounded and additive — a single run advances at most max_provider_fallback_hops + max_local_last_resort_hops cross-process hops before the cascade is declared exhausted.
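
To make the bound concrete, the sketch below computes the hop budget and lists the stage labels a run would pass through if every attempt failed. The `recoveryConfig` struct and both functions are hypothetical; only the option names in the comments come from the tiers described above. It also assumes a chain shorter than max_provider_fallback_hops simply ends early.

```go
package main

import "fmt"

// recoveryConfig mirrors the options.tpm_recovery keys; the struct itself
// is illustrative.
type recoveryConfig struct {
	MaxProviderFallbackHops int  // max_provider_fallback_hops (default 3)
	AllowLocalLastResort    bool // allow_local_last_resort
	MaxLocalLastResortHops  int  // max_local_last_resort_hops (default 1)
}

// maxHops returns the upper bound on Tier 2 plus Tier 3 hops for one run.
func maxHops(c recoveryConfig) int {
	total := c.MaxProviderFallbackHops
	if c.AllowLocalLastResort {
		total += c.MaxLocalLastResortHops
	}
	return total
}

// stagePlan lists, in order, the stage labels of a fully exhausted run.
// chainLen is the number of configured provider_model_fallback entries.
func stagePlan(c recoveryConfig, chainLen int) []string {
	plan := []string{"cap_retry"}
	hops := c.MaxProviderFallbackHops
	if chainLen < hops {
		hops = chainLen
	}
	for i := 0; i < hops; i++ {
		plan = append(plan, fmt.Sprintf("cross_provider:%d", i))
	}
	if c.AllowLocalLastResort {
		for i := 0; i < c.MaxLocalLastResortHops; i++ {
			plan = append(plan, fmt.Sprintf("tpm_local_last_resort:%d", i))
		}
	}
	return plan
}
```

With the documented defaults and the local hop enabled, `maxHops` is 3 + 1 = 4, so the plan holds cap_retry followed by at most four hop stages.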

Every operator surface reads the same tpmrecovery.PostureSnapshot:

| Surface | Purpose |
| --- | --- |
| bmo config show-tpm-recovery | Default config-only summary (back-compat). |
| bmo config show-tpm-recovery --runtime | Live runtime posture: config flags, hop bounds, local-last-resort target host, provider cooldown summary, recent-event ring tail. |
| bmo config show-tpm-recovery --runtime --json | Same posture, machine-parseable JSON. |
| GET /v1/tpm-recovery/posture | HTTP route; same JSON envelope, gated by the standard requireAuth helper. |
| /recovery (TUI slash; aliases /tpm-recovery, /tpm_recovery, /tpm-recovery-posture) | Inline posture report in the chat transcript, sharing the same RenderPosture formatter. |
| list_recent_provider_recovery_events (native agent tool) | Bounded metadata-only ring entries as JSON, with closed-enum outcome filter. |
| tpmrecovery.BuildSnapshot / RenderPosture | Library accessors used by every surface above. |

The cross-surface JSON parity contract is enforced by bmo/internal/cmd/cross_surface_parity_tpm_recovery_test.go. Any new field on PostureSnapshot must surface on every arm or the test fails.
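
Because every arm returns the same envelope, a client can treat the HTTP route as the generic entry point. The following is a sketch of building that request and decoding the response; the bearer-token header is an assumption about what requireAuth expects, and the body is decoded into a generic map because the PostureSnapshot field names are not listed here.

```go
package main

import (
	"encoding/json"
	"net/http"
	"strings"
)

// newPostureRequest builds a GET for the posture route. The Authorization
// scheme is assumed; substitute whatever your deployment requires.
func newPostureRequest(baseURL, token string) (*http.Request, error) {
	target := strings.TrimRight(baseURL, "/") + "/v1/tpm-recovery/posture"
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Accept", "application/json")
	return req, nil
}

// decodePosture parses the JSON envelope without assuming its fields.
func decodePosture(body []byte) (map[string]any, error) {
	var snapshot map[string]any
	if err := json.Unmarshal(body, &snapshot); err != nil {
		return nil, err
	}
	return snapshot, nil
}
```

The request can be sent with `http.DefaultClient.Do(req)` and the response body passed to `decodePosture` after reading it fully.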

Every cascade entry into the executor records a metadata-only event on a process-global FIFO ring (default capacity 64). Each event carries:

  • Stage (closed enum, see above)
  • Outcome (running, recovered, exhausted)
  • FromProvider / FromModel and ToProvider / ToModel
  • FNV32-hashed SessionIDHash and RunIDHash
  • Wall-clock timestamp

The ring is wholly metadata: no error bodies, no prompts, no full session or run identifiers, no full base URLs. Redaction is enforced at the producer boundary in coordinator_run_executor.go::publishProviderRecoveryStatus and gated by a parity test that injects a synthetic sentinel.
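
A bounded FIFO ring with hashed identifiers can be sketched as below. The types and names are illustrative rather than BMO's own, and the choice of the FNV-1a variant with hex encoding is an assumption; the text above only says the hashes are FNV32.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
	"time"
)

// recoveryEvent carries metadata only: no error bodies, prompts, or raw IDs.
type recoveryEvent struct {
	Stage, Outcome           string
	FromProvider, FromModel  string
	ToProvider, ToModel      string
	SessionIDHash, RunIDHash string
	At                       time.Time
}

// hashID reduces an identifier to a 32-bit FNV-1a digest in hex.
func hashID(id string) string {
	h := fnv.New32a()
	h.Write([]byte(id))
	return fmt.Sprintf("%08x", h.Sum32())
}

// eventRing is a fixed-capacity FIFO; the oldest entry is overwritten first.
type eventRing struct {
	mu   sync.Mutex
	buf  []recoveryEvent
	next int
	full bool
}

// newEventRing expects a positive capacity (the documented default is 64).
func newEventRing(capacity int) *eventRing {
	return &eventRing{buf: make([]recoveryEvent, capacity)}
}

func (r *eventRing) push(e recoveryEvent) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.buf[r.next] = e
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// snapshot returns a copy of the entries, oldest first.
func (r *eventRing) snapshot() []recoveryEvent {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.full {
		return append([]recoveryEvent(nil), r.buf[:r.next]...)
	}
	out := make([]recoveryEvent, 0, len(r.buf))
	out = append(out, r.buf[r.next:]...)
	return append(out, r.buf[:r.next]...)
}
```

Hashing at the point where the event is constructed, before it ever reaches the ring, is what keeps raw identifiers out of every downstream surface.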

Per-recipe fail-fast (settings.fail_fast: true, per-pipeline fail_fast, or bmo recipe run --fail-fast) disables Tiers 2 and 3 only. Tier 1 (cap retry + same-provider large_model_fallback) still applies, because that path stays inside the same provider/key and does not engage the cross-provider chain. Use FailFast when a recipe must fail loudly rather than degrade silently across providers.

bmo config discover-free-fallbacks is an operator-initiated catalog helper that surfaces OpenRouter free-tier candidates that meet a strict filter (text output, zero-priced, tool-capable). It is approval-list generation, not runtime discovery — the runtime still walks only the configured provider_model_fallback selectors. Use it to seed or refresh your cross-provider chain after upstream catalog churn.

bmo config discover-free-fallbacks
bmo config discover-free-fallbacks --include-text-only # opt-in to text-only models
bmo config discover-free-fallbacks --include-openrouter-free # opt-in to the bare openrouter/free router
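
The strict filter (text output, zero-priced, tool-capable) amounts to a conjunction of three checks. The sketch below shows that shape only; `catalogModel` and its fields are hypothetical and do not reflect the OpenRouter catalog schema or the helper's actual code, and the opt-in flags are not modelled.

```go
package main

// catalogModel is a hypothetical, flattened view of one catalog entry.
type catalogModel struct {
	ID              string
	OutputText      bool    // model produces text output
	PromptPrice     float64 // price per prompt token
	CompletionPrice float64 // price per completion token
	SupportsTools   bool    // model accepts tool calls
}

// strictFreeCandidates returns the IDs that pass all three checks, in
// catalog order. The result is an approval list to review, not a config.
func strictFreeCandidates(models []catalogModel) []string {
	var ids []string
	for _, m := range models {
		free := m.PromptPrice == 0 && m.CompletionPrice == 0
		if m.OutputText && free && m.SupportsTools {
			ids = append(ids, m.ID)
		}
	}
	return ids
}
```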

When a run engaged the cascade and operators want the full hop genealogy:

  1. /recovery (TUI) or bmo config show-tpm-recovery --runtime — confirm config flags, hop bounds, and recent-ring tail.
  2. Filter the recent ring for exhausted events, then look for the failing run’s hash among them:
    bmo agent-tool list_recent_provider_recovery_events --json \
    | jq '.events[] | select(.outcome=="exhausted")'
  3. Tail the structured log for provider_recovery.changed events:
    bmo logs --tail 1000 | jq -c 'select(.msg|startswith("provider_recovery"))'
  4. Pair the cap_retry, cross_provider:N, and tpm_local_last_resort:N events by session_id_hash to reconstruct the full recovery lifecycle.
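
The pairing in the last step can be automated once the events are decoded. The sketch below groups events by session hash and orders each group by time; the `ringEvent` type, its JSON tags other than session_id_hash, and the `unix_nano` timestamp field are assumptions about the tool's output, not its documented schema.

```go
package main

import "sort"

// ringEvent holds the subset of fields used for pairing.
type ringEvent struct {
	Stage         string `json:"stage"`
	Outcome       string `json:"outcome"`
	SessionIDHash string `json:"session_id_hash"`
	UnixNano      int64  `json:"unix_nano"` // hypothetical timestamp field
}

// lifecycles returns, per session hash, the stage labels in time order.
// The input slice is copied so the caller's ordering is left untouched.
func lifecycles(events []ringEvent) map[string][]string {
	sorted := append([]ringEvent(nil), events...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].UnixNano < sorted[j].UnixNano
	})
	out := make(map[string][]string)
	for _, e := range sorted {
		out[e.SessionIDHash] = append(out[e.SessionIDHash], e.Stage)
	}
	return out
}
```

A session whose sequence ends in tpm_local_last_resort:N with outcome exhausted has walked the whole cascade; one that stops at cap_retry recovered inside Tier 1 or was cut short by fail-fast.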

The recovery cascade also writes a v1 tpm recovery: … summary line to the run residue (autopilot/run_residue/{run_id}.json) for both coordinator and scheduled-recipe runs.

  • Decision semantics are stable. This iteration delivers observability and surface parity. RecoveryAction, EligibleForCrossProviderAfterTPMRecovery, and the cap-retry/same-provider/cross-provider/local-last-resort policy itself are unchanged.
  • The degraded_recovery per-session HTTP route and agent tool remain the canonical disposition packet for a finished session. The new posture surface is the fleet view; degraded_recovery is the session view.