Prompt caching

BMO does not add a second, client-side prompt “cache” in front of models. Savings and routing come from each provider’s API. BMO surfaces cache-related usage in costs and observability, and can optionally derive an OpenAI prompt cache key for coordination across turns.

  • Usage fields may include cache read and cache write (creation) counters when the provider maps them. These show up in session cost totals and the agent debugger, consistent with the provider’s own dashboards.
  • Prompt-budget snapshots also expose cache eligibility and presence on the final provider envelope: OpenAI cache key presence, Anthropic cache marker counts, approximate Anthropic cacheable system-prefix size, Gemini cached-content references, and whether usage is reported separately. These fields do not claim a cache hit; they explain whether the request was cache-shaped before provider execution.
  • With debug logging enabled, a turn that attributes non-zero cache read or cache creation tokens may emit a structured log line, "LLM cache usage non-zero for generation", carrying the session id, provider, model, and counts. See Troubleshooting and your host’s log-level controls (for example BMO_LOG_LEVEL).
  • For empirical proof, run bmo cache proof. Existing-run mode summarizes repeated --run-id values without provider calls. Live mode requires --live-provider and a prompt, performs a warm-up plus measured repeats, and writes a redacted JSON artifact under .bmo/evidence/ by default. A pass verdict requires provider-reported cache reads on a measured repeat plus a comparable prior iteration with the same provider, model, and stable-prefix or prompt-shape fingerprint; timing is recorded but does not prove cache leverage by itself.
  • bmo cache proof --tool-surface none enforces a toolless proof posture for the first provider request. If prompt-budget evidence sees effective tools, provider-envelope tools, or tool-schema tokens under that posture, the proof reports tool_surface_mismatch instead of treating the run as cache or timing evidence.
  • bmo config show-cache reports the effective prompt-caching config posture from the shell. It is intentionally config-only and does not claim access to the live in-process recent-event ring.
  • /cache in the TUI renders the live app-backed posture plus bounded recent cacheproof.* metadata.
  • Agent-native and MCP consumers can use get_prompt_cache_status and bmo_get_prompt_cache_status for the same structured live posture.
  • list_recent_cache_events remains the lower-level raw ring export when you need event-by-event metadata instead of the summarized posture.
  • Providers describe caching in terms of prefix stability and best-effort reuse. An optional prompt_cache_key can help align requests that share a long stable prefix; set it in provider_options at the provider or model level (Fantasy parses the OpenAI-typed map).
  • BMO opt-in key derivation (default off): set options.openai_prompt_cache_affinity = "session_tools". BMO then sets prompt_cache_key to a deterministic bmo:-prefixed hash of the session id and a sorted list of allowed tool names for that run (rotation when tools change). Adjust the hex length with options.openai_prompt_cache_key_hash_hex_chars (8–64 hex characters, default 32 when unset or zero).
  • A manually configured prompt_cache_key in provider_options is left unchanged; BMO will not override it.
  • Optional prompt_cache_retention ("in_memory" or "24h") can be set in the same merged provider_options map for Chat Completions and Responses; it is passed through Fantasy to the OpenAI API (see OpenAI prompt-caching docs for semantics and availability).
  • Azure OpenAI and other gateways can differ: treat cache hits, keys, and retention as best-effort and confirm against that provider’s docs.
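
The opt-in key derivation and the manual-key precedence described above can be sketched as follows. This is a minimal illustration, not BMO's implementation: the function names, the use of SHA-256, the NUL separator, and the clamping of out-of-range lengths are all assumptions made for the example. Only the bmo: prefix, the session-id and sorted-tool-names inputs, the 8–64 range, and the default of 32 come from the text.

```python
import hashlib

DEFAULT_HEX_CHARS = 32               # documented default when unset or zero
MIN_HEX_CHARS, MAX_HEX_CHARS = 8, 64  # documented range


def derive_prompt_cache_key(session_id, tool_names, hex_chars=0):
    """Hypothetical helper: deterministic key from session id + sorted tools."""
    if not hex_chars:
        hex_chars = DEFAULT_HEX_CHARS
    # Assumption: out-of-range values are clamped; BMO may reject them instead.
    hex_chars = max(MIN_HEX_CHARS, min(MAX_HEX_CHARS, hex_chars))
    # Sorting makes the key independent of tool order; a changed tool set
    # changes the input and therefore rotates the key.
    canonical = "\x00".join([session_id, *sorted(tool_names)])
    # Assumption: SHA-256 (64 hex chars) stands in for BMO's actual hash.
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "bmo:" + digest[:hex_chars]


def resolve_prompt_cache_key(provider_options, affinity, session_id, tool_names):
    """Hypothetical helper: a manually configured key always wins."""
    manual = provider_options.get("prompt_cache_key")
    if manual:
        return manual
    if affinity == "session_tools":
        return derive_prompt_cache_key(session_id, tool_names)
    return None  # derivation is off by default


if __name__ == "__main__":
    print(derive_prompt_cache_key("session-1", ["write", "read"]))
    print(resolve_prompt_cache_key({"prompt_cache_key": "team-a"},
                                   "session_tools", "session-1", ["read"]))
```

Under these assumptions, two runs in the same session with the same allowed tools share a key regardless of tool order, while adding or removing a tool yields a new key.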

Anthropic, Bedrock (Converse), Vercel Anthropic

  • Caching is driven by cache_control on message/tool blocks. BMO attaches an ephemeral marker to the last tool in the final filtered tool list so the stable tool tail matches the tools actually sent. If adaptive filtering leaves no tools, BMO does not set a tool marker (no panic). Reordering the same set of tools changes which tool is last, so a different tool carries the marker; this is expected.
  • When the system-prompt cache breakpoint is enabled, BMO also splits the system prompt into stable and volatile blocks and marks the stable block as ephemeral for Anthropic-family providers. Cache stability is classified separately from shedding priority, so high-priority live state such as the current run or active plan stays outside the cache-marked block.
  • Set BMO_DISABLE_ANTHROPIC_CACHE=true to disable attaching that marker. See Environment variables.
  • For Gemini, implicit context caching (often available on newer models) can reduce cost without a separate resource name; follow Google’s product docs.
  • Explicit cachedContents resources are created, updated, and deleted out of band; when you have a resource, pass its cached_content name in the merged provider_options (parsed by Fantasy's Google provider). Optionally, bmo cache gemini create --from-prompt-snapshot (see bmo cache gemini -h) prints a line you can paste; it uses a Gemini Developer API key (GOOGLE_API_KEY or GEMINI_API_KEY). Managing TTL, minimum cache size, and cost for stored caches is the operator’s responsibility; BMO does not run a long-lived automatic cache janitor for Gemini by default.
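
For the Anthropic-family marker placement described earlier in this section, the following sketch shows the shape of the resulting request fragments. It is an illustration under stated assumptions rather than BMO's code: mark_last_tool and build_system_blocks are hypothetical names, and the inputs are plain dictionaries. The cache_control value {"type": "ephemeral"} is the marker format Anthropic's Messages API accepts on tool definitions and system text blocks.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}


def mark_last_tool(tools):
    """Hypothetical helper: mark the last tool of the final filtered list.

    Returns a copy so the caller's tool definitions are not mutated.
    An empty list is returned unchanged, with no marker and no error.
    """
    marked = copy.deepcopy(tools)
    if marked:
        marked[-1]["cache_control"] = dict(EPHEMERAL)
    return marked


def build_system_blocks(stable_text, volatile_text):
    """Hypothetical helper: stable block is cache-marked, volatile is not."""
    blocks = []
    if stable_text:
        blocks.append({
            "type": "text",
            "text": stable_text,
            "cache_control": dict(EPHEMERAL),
        })
    if volatile_text:
        # Live state (current run, active plan) stays outside the marked block.
        blocks.append({"type": "text", "text": volatile_text})
    return blocks


if __name__ == "__main__":
    tools = [{"name": "read_file"}, {"name": "write_file"}]
    print(mark_last_tool(tools))
    print(mark_last_tool([]))
    print(build_system_blocks("You are BMO.", "Active plan: step 2"))
```

Because the marker always goes on whichever tool ends up last, passing the same tools in a different order moves the marker to a different tool, which matches the behavior noted above.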

For contributors: Code alignment, edge cases, system-prompt cache rationale, and Gemini cached-content implementation background are in the repository topics prompt-caching.md and gemini-cached-content.md.