Shadow Evals

Shadow evals let BMO capture eligible real runs, replay them in the background against one or more configured variants, and store the outcomes for later comparison. You can then generate proposals from comparable scorecards, record a decision on each (accept, reject, or defer), and promote accepted proposals to promoted assets with stable target refs. Rollback is supported and auditable.

These proposals are learning/promotion records. They are separate from Patch Proposals, which review and optionally apply agent-produced code changes.

This is an opt-in learning feature. It does not change or re-rank the live assistant response, and it never auto-promotes a model or prompt. It only records a baseline run, replays configured variants, saves scorecards, and lets you review proposals and record promotion/rollback manually.

If you use the multi-model arena with options.arena.auto_select_policy = "eval_shadow", BMO can pick a winner after a standalone arena round using the best shadow scorecard per candidate model. Artifacts must have been captured for that session and user message, and shadow eval capture and replay must have produced scorecards whose candidate_model matches each arena model_key. This is separate from proposal promotion: it only chooses which arena candidate to accept. See arena-verified.md for eligibility, workflow guardrails, and tie-break rules.

In the current runtime, automatic capture is limited to top-level runs from the normal app/session path. It excludes child-agent subruns and protocol surfaces such as A2A and gateway traffic.

Each captured artifact stores:

  • The original prompt snapshot
  • Prompt-trace segments from the run
  • The baseline response
  • Tool-call observations from the baseline run
  • Read-file snapshots and latest file snapshots from the workspace
  • Validation rules derived from the baseline run

If BMO cannot capture a safe, replayable artifact, it stores the artifact as non-replayable and does not queue shadow runs. This fail-closed path is used when prompt trace is missing, read snapshots are unavailable, captured paths fall outside the configured working directory, or the capture exceeds the size limit. Each artifact records a structured capture reason (e.g. no_read_snapshots, capture_over_bytes) for inspection and policy compliance.
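The fail-closed checks above can be pictured as a single function that returns a structured reason, or nothing when the artifact is replayable. This is a minimal sketch, not BMO's implementation: the function name, its parameters, the order of the checks, and the reason strings no_prompt_trace and path_outside_workdir are assumptions made for illustration. Only no_read_snapshots and capture_over_bytes are named on this page.

```python
from pathlib import Path
from typing import Optional, Sequence


def capture_reason(
    has_prompt_trace: bool,
    read_snapshot_paths: Optional[Sequence[str]],
    workdir: str,
    capture_bytes: int,
    max_capture_bytes: int = 1_048_576,
) -> Optional[str]:
    """Return a reason if the artifact must be stored as non-replayable, else None."""
    if not has_prompt_trace:
        return "no_prompt_trace"  # hypothetical reason name
    if not read_snapshot_paths:
        return "no_read_snapshots"  # reason name taken from this page
    root = Path(workdir).resolve()
    for rel in read_snapshot_paths:
        # Resolve ".." segments so a path cannot escape the working directory.
        if not (root / rel).resolve().is_relative_to(root):
            return "path_outside_workdir"  # hypothetical reason name
    if capture_bytes > max_capture_bytes:
        return "capture_over_bytes"  # reason name taken from this page
    return None
```

Any non-None result corresponds to the path where the artifact is kept for inspection but no shadow runs are queued.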

Capture can be constrained by scope policy: allow/deny lists for repo path (working directory), session ID, and task class. When an allow list is set, only matching runs are captured; when a deny list is set, matching runs are excluded. See the configuration reference for allow_repos, deny_repos, allow_sessions, deny_sessions, allow_task_classes, and deny_task_classes.
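One way to read the allow/deny rules is sketched here. The six field names mirror the configuration keys listed above, but the matching semantics are assumptions: this sketch uses exact string matches, lets a deny entry win over an allow entry, and requires all three dimensions to pass. Check the configuration reference for the actual behavior.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScopePolicy:
    allow_repos: List[str] = field(default_factory=list)
    deny_repos: List[str] = field(default_factory=list)
    allow_sessions: List[str] = field(default_factory=list)
    deny_sessions: List[str] = field(default_factory=list)
    allow_task_classes: List[str] = field(default_factory=list)
    deny_task_classes: List[str] = field(default_factory=list)


def _permitted(value: str, allow: List[str], deny: List[str]) -> bool:
    if value in deny:  # assumption: deny takes precedence over allow
        return False
    if allow and value not in allow:  # a non-empty allow list is exclusive
        return False
    return True


def should_capture(policy: ScopePolicy, repo: str, session_id: str, task_class: str) -> bool:
    """Return True only if the run passes the repo, session, and task-class checks."""
    return (
        _permitted(repo, policy.allow_repos, policy.deny_repos)
        and _permitted(session_id, policy.allow_sessions, policy.deny_sessions)
        and _permitted(task_class, policy.allow_task_classes, policy.deny_task_classes)
    )
```

With allow_repos = ["/work/app"] and deny_task_classes = ["exploration"], a run in /work/app is captured unless its task class is exploration, and runs from any other repo are skipped.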

When an artifact is replayable, BMO enqueues one replay run per configured variant.

Each replay:

  • Copies the source working tree into an isolated temp directory
  • Rehydrates tracked file snapshots into that temp workdir
  • Runs bmo run with the variant’s model, prompt prefix, and optional agent override
  • Reuses the same validation-rule semantics as eval scenarios
  • Writes a scorecard with file diffs, tool diffs, replay summary, and pass/fail status
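The first two replay steps (copy the tree, then rehydrate snapshots) can be sketched with the Python standard library. This illustrates the isolation idea only. The function name and the shape of the snapshots mapping (relative path to file content) are hypothetical, and the real worker is implemented in Go.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def prepare_replay_workdir(source_tree: str, snapshots: Dict[str, str]) -> Path:
    """Copy source_tree into a fresh temp dir, then overwrite files from snapshots."""
    temp_root = Path(tempfile.mkdtemp(prefix="shadow-replay-"))
    workdir = temp_root / "workdir"
    shutil.copytree(source_tree, workdir)  # the source tree itself is never modified
    root = workdir.resolve()
    for rel_path, content in snapshots.items():
        target = (root / rel_path).resolve()
        if not target.is_relative_to(root):
            raise ValueError(f"snapshot path escapes workdir: {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")  # restore captured state
    return workdir
```

Because snapshots are written over the copied tree, the replay sees the files as they were at capture time even if the source workspace has drifted since.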

Replay scorecards use one of three comparability states:

  • comparable - the replay completed cleanly and the captured evidence was sufficient
  • degraded - the replay completed, but evidence or workspace drift reduced confidence
  • non_comparable - the replay could not be compared reliably

Add a learning.shadow section to your config:

[learning.shadow]
enabled = true
capture_top_level_runs = true
sample_rate = 1.0
retention_days = 30
max_capture_bytes = 1048576
# Optional: restrict capture by repo path, session ID, or task class
# allow_repos = ["/path/to/allowed/repo"]
# deny_repos = ["/tmp"]
# allow_sessions = ["sess-1"]
# deny_task_classes = ["exploration"]
[[learning.shadow.variants]]
name = "candidate-sonnet"
model = "anthropic/claude-sonnet-4-20250514"
prompt_prefix = "Prefer concise code-review style output."
agent = "coder"
[[learning.shadow.variants]]
name = "candidate-task"
model = "openai/gpt-5"
agent = "task"

Use bmo eval posture first when you need a bounded maintainer readout across deterministic eval gates, shadow summary health, proposal state, and arena eval_shadow readiness:

# Print deterministic-first eval posture plus bounded shadow-eval health
bmo eval posture
# Emit machine-readable posture for automation
bmo eval posture --json

The posture command succeeds when learning.shadow is disabled. That is the safe opt-in state; the command reports that shadow-only evidence is unavailable and points maintainers back to deterministic before/after reports. When learning is enabled, the readout distinguishes empty, non-comparable, failing, usable, and stale shadow evidence without dumping captured prompts or artifacts.

Use the bmo eval shadow commands when the posture readout says deeper audit is needed:

# Print the bounded quality summary over recent runs, scorecards, and proposals
bmo eval shadow summary
# List recent artifacts and replay runs
bmo eval shadow list
# Filter by agent type or route source
bmo eval shadow list --agent-type coder --route-source app_run
# Show one captured artifact with linked runs and scorecards
bmo eval shadow show <artifact-id>
# Export runs and scorecards as JSONL
bmo eval shadow export --format jsonl
# Export only comparable runs for one model as CSV
bmo eval shadow export --format csv --candidate-model anthropic/claude-sonnet-4-20250514 --comparability comparable
# Delete an artifact (and its runs and scorecards) or a single run
bmo eval shadow delete artifact <artifact-id>
bmo eval shadow delete run <run-id>
# Purge artifacts older than retention (TTL)
bmo eval shadow purge --retention-days 30

bmo eval shadow summary prints the bounded quality snapshot used by the eval_shadow_summary tool, bmo eval posture, and the HTTP summary route. Use it before list, show, or export unless you already know which artifact or run needs inspection. bmo eval shadow show prints one JSON document containing the artifact and its linked runs and scorecards. bmo eval shadow export defaults to JSONL and also supports CSV. Use bmo eval shadow delete to remove specific artifacts or runs; use bmo eval shadow purge to remove artifacts older than a retention window (a manual TTL purge; the learning worker also purges automatically according to the configured retention).
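For offline review, a JSONL export can be tallied with a few lines of Python. This is a sketch under assumptions: it presumes each exported row carries candidate_model and comparability fields (the names match the CLI filters, but the export schema is not documented on this page), and it skips rows that lack either field.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def tally_comparability(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count rows per comparability state, grouped by candidate model."""
    tallies: Dict[str, Counter] = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the export
        row = json.loads(raw)
        model = row.get("candidate_model")  # assumed field name
        state = row.get("comparability")  # assumed field name
        if model is None or state is None:
            continue  # not a scorecard-like row
        tallies.setdefault(model, Counter())[state] += 1
    return tallies
```

If the export command writes to standard output (also an assumption), you could redirect it to a file and pass the open file object to tally_comparability, since file objects iterate line by line.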

For prompt, routing, arena, provider, or scenario changes, collect deterministic evidence before leaning on shadow learning:

bmo eval run eval/scenarios/prompt-stack/ --output before.json
bmo eval run eval/scenarios/prompt-stack/ --output after.json
bmo eval compare before.json after.json
bmo eval posture

Treat bmo eval compare regressions as blocking local evidence. Treat shadow summary state as advisory unless it has comparable passing scorecards from the same class of work and is fresh enough for the decision at hand. Live-provider evals and shadow replay remain opt-in because they depend on local credentials, provider behavior, and intentional spend.
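The guidance above amounts to a small decision rule, sketched here for teams that want to encode it in their own review tooling. Everything in it is hypothetical: BMO does not expose a ShadowSummary type or a gate function, and the seven-day freshness threshold is an arbitrary placeholder for "fresh enough for the decision at hand".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShadowSummary:
    comparable_passing: int  # number of comparable, passing scorecards
    same_task_class: bool  # evidence comes from the same class of work
    age_days: float  # age of the newest relevant scorecard


def gate(compare_regressions: int, shadow: Optional[ShadowSummary],
         max_age_days: float = 7.0) -> str:
    """Deterministic regressions block; shadow evidence only ever supports."""
    if compare_regressions > 0:
        return "blocked"
    if (
        shadow is not None
        and shadow.comparable_passing > 0
        and shadow.same_task_class
        and shadow.age_days <= max_age_days
    ):
        return "shadow_supported"
    return "shadow_advisory"
```

Note that shadow evidence can never override a deterministic regression in this sketch, which mirrors the ordering recommended above.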

If you run BMO as an HTTP server, the same data is available over JSON endpoints.

Artifacts and runs

  • GET /v1/eval/shadow/artifacts — list artifacts
  • GET /v1/eval/shadow/artifacts/{artifact_id} — get one artifact with runs and scorecards
  • DELETE /v1/eval/shadow/artifacts/{artifact_id} — delete an artifact and its runs/scorecards (204)
  • GET /v1/eval/shadow/runs — list runs (with scorecards)
  • DELETE /v1/eval/shadow/runs/{run_id} — delete a run and its scorecard (204)
  • GET /v1/eval/shadow/export?format=jsonl — stream artifacts, scorecards, proposals, and promoted assets as JSONL for offline review or archival.
  • GET /v1/eval/shadow/export?format=csv — export comparable shadow-eval rows as CSV for spreadsheet review and reporting.

The proposal and promoted-asset endpoints are documented in the Proposals and Promoted assets sections below.

Common filters include limit, artifact_id, session_id, agent_type, route_source, run_id, status, candidate_model, and comparability. Delete endpoints return 404 when the resource is missing or the learning service is unavailable.
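A small helper can build a filtered request URL for the runs endpoint. This is a sketch: the base URL is a placeholder for wherever your BMO server listens, and it assumes every common filter listed above is accepted as a query parameter on /v1/eval/shadow/runs, which this page does not state endpoint by endpoint.

```python
from urllib.parse import urlencode

# Filter names as listed on this page; per-endpoint support is an assumption.
COMMON_FILTERS = (
    "limit", "artifact_id", "session_id", "agent_type", "route_source",
    "run_id", "status", "candidate_model", "comparability",
)


def shadow_runs_url(base_url: str, **filters) -> str:
    """Build a GET URL for the shadow runs endpoint with the given filters."""
    unknown = set(filters) - set(COMMON_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    # Keep a stable parameter order and drop filters explicitly set to None.
    params = {k: filters[k] for k in COMMON_FILTERS if filters.get(k) is not None}
    url = base_url.rstrip("/") + "/v1/eval/shadow/runs"
    return f"{url}?{urlencode(params)}" if params else url
```

For example, shadow_runs_url("http://localhost:8080", comparability="comparable", limit=20) yields http://localhost:8080/v1/eval/shadow/runs?limit=20&comparability=comparable, where the host and port are hypothetical.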

After you have comparable passing scorecards, you can generate proposals grouped by task class and target surface. Each proposal is a candidate for turning repeated wins into a reusable asset (e.g. recipe, operating pack, or change contract). Proposals stay in a pending state until a human records a decision.
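The grouping step can be illustrated as follows. This is a sketch of the idea rather than BMO's generator: the scorecard field names (run_id, task_class, target_surface, comparability, passed) and the proposal dictionary shape are hypothetical, and the threshold of two runs per group follows the CLI default noted in the command reference on this page.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_for_proposals(scorecards: List[dict], min_comparable_runs: int = 2) -> List[dict]:
    """Group comparable passing scorecards and emit one pending proposal per group."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for sc in scorecards:
        # Only comparable, passing evidence counts toward a proposal.
        if sc["comparability"] == "comparable" and sc["passed"]:
            groups[(sc["task_class"], sc["target_surface"])].append(sc["run_id"])
    return [
        {
            "task_class": task_class,
            "target_surface": target_surface,
            "state": "pending",  # stays pending until a human decides
            "evidence_refs": run_ids,
        }
        for (task_class, target_surface), run_ids in sorted(groups.items())
        if len(run_ids) >= min_comparable_runs
    ]
```

Degraded and non_comparable scorecards never contribute evidence here, so a group with one clean run and several degraded ones produces no proposal.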

BMO never auto-promotes. You must:

  1. Generate proposals from comparable scorecards (optional; you can also create proposals manually via the API).
  2. Review proposal summary, evidence refs, and target surface.
  3. Record a decision: accepted, rejected, deferred, or rolled_back.

Proposal records are append-only audit artifacts. There is no delete/archive API or agent tool for proposals; use decisions, promotion, and rollback to advance their lifecycle while preserving history.

# List proposals (filter by state, task class, target surface)
bmo eval shadow proposals list
bmo eval shadow proposals list --state pending --limit 20
# Get one proposal
bmo eval shadow proposals get <proposal-id>
# Generate proposals from comparable passing scorecards (min 2 runs per group by default)
bmo eval shadow proposals generate
bmo eval shadow proposals generate --min-comparable-runs 3 --target-surface recipe
# Record a decision (accepted, rejected, deferred, rolled_back)
bmo eval shadow proposals decide <proposal-id> --outcome accepted --reviewer "alice" --rationale "LGTM"

The equivalent HTTP endpoints:

  • GET /v1/eval/shadow/proposals — list (query: limit, proposal_id, state, task_class, target_surface)
  • GET /v1/eval/shadow/proposals/{proposal_id} — get one
  • POST /v1/eval/shadow/proposals/generate — generate from scorecards (body: optional min_comparable_runs, target_surface, task_class)
  • POST /v1/eval/shadow/proposals/{proposal_id}/decide — record decision (body: outcome, reviewer, rationale)

When a proposal is accepted, you can promote it: record a stable target ref (e.g. recipe path, pack id, contract id) and who activated it. That creates a promoted asset (derived asset) in state active. Later you can roll back an active asset; the rollback preserves the rationale and updates the proposal’s audit trail.

Promotion in BMO only records the target ref and state. It does not create or mutate recipe files, operating packs, or change contracts itself; you use other workflows to create those assets and then record the ref when promoting.

# List promoted assets (filter by state, proposal, target type)
bmo eval shadow promoted list
bmo eval shadow promoted list --state active
# Get one promoted asset
bmo eval shadow promoted get <asset-id>
# Promote an accepted proposal (record target ref)
bmo eval shadow promoted promote <proposal-id> --target-type recipe --target-id "recipes/learned.yaml" --activated-by "alice"
# Roll back an active promoted asset
bmo eval shadow promoted rollback <asset-id> --rationale "Overfit to one run"

The equivalent HTTP endpoints:

  • GET /v1/eval/shadow/derived-assets — list (query: limit, asset_id, proposal_id, state, target_type)
  • GET /v1/eval/shadow/derived-assets/{asset_id} — get one
  • POST /v1/eval/shadow/proposals/{proposal_id}/promote — create promoted asset from accepted proposal (body: target_type, target_id, version_ref, activated_by)
  • POST /v1/eval/shadow/derived-assets/{asset_id}/rollback — mark asset rolled back (body: rationale)

Rollback records a decision on the underlying proposal so the full chain (proposal → promote → rollback) remains auditable.
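The proposal, promote, rollback chain can be modeled as an append-only audit list. This is an illustrative model only: the class and function names, the audit entry shape, and the rolled_back asset state name are assumptions, and the authoritative states live in the Go source.

```python
from dataclasses import dataclass, field
from typing import List, Optional

OUTCOMES = {"accepted", "rejected", "deferred", "rolled_back"}


@dataclass
class Proposal:
    proposal_id: str
    state: str = "pending"
    audit: List[dict] = field(default_factory=list)  # append-only history

    def decide(self, outcome: str, rationale: str, reviewer: Optional[str] = None) -> None:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.audit.append({"event": "decision", "outcome": outcome,
                           "reviewer": reviewer, "rationale": rationale})
        self.state = outcome


@dataclass
class PromotedAsset:
    asset_id: str
    proposal_id: str
    target_type: str
    target_id: str
    activated_by: str
    state: str = "active"


def promote(p: Proposal, asset_id: str, target_type: str, target_id: str,
            activated_by: str) -> PromotedAsset:
    if p.state != "accepted":
        raise ValueError("only accepted proposals can be promoted")
    p.audit.append({"event": "promote", "asset_id": asset_id, "target_id": target_id})
    return PromotedAsset(asset_id, p.proposal_id, target_type, target_id, activated_by)


def rollback(p: Proposal, asset: PromotedAsset, rationale: str) -> None:
    if asset.state != "active":
        raise ValueError("only active assets can be rolled back")
    asset.state = "rolled_back"  # assumed state name
    p.decide("rolled_back", rationale)  # rollback is recorded on the proposal too
```

Nothing is ever removed from the audit list, so after a rollback the history still shows the original acceptance and the promotion that followed it.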

Shadow replay variants are built on top of bmo run, which supports:

  • --agent to override the agent used for that one run
  • --system-prompt-prefix to prepend additional system guidance for that run

These flags are also useful for ad hoc local experiments outside the shadow worker.
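For ad hoc experiments driven from a script, the two flags can be assembled into an argument list. This sketch uses only the --agent and --system-prompt-prefix flags named above. It deliberately does not show how the prompt or model is passed, because this page does not document that; append those arguments in whatever form your bmo run invocation normally takes. The helper name is hypothetical.

```python
import shlex
from typing import List, Optional


def bmo_run_argv(agent: Optional[str] = None,
                 prompt_prefix: Optional[str] = None) -> List[str]:
    """Build the leading arguments of a bmo run call with optional overrides."""
    argv = ["bmo", "run"]
    if agent:
        argv += ["--agent", agent]
    if prompt_prefix:
        argv += ["--system-prompt-prefix", prompt_prefix]
    return argv


if __name__ == "__main__":
    # Print a shell-safe rendering for inspection; nothing is executed here.
    print(shlex.join(bmo_run_argv("coder", "Prefer concise code-review style output.")))
```

Building a list instead of a single string avoids shell-quoting mistakes when the prefix contains spaces; the list can be extended and passed to subprocess.run directly.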

Task classes, capture decision reasons, provenance, proposal states, and promoted asset (derived asset) states are defined in code under internal/learning/. In-tree behavior covers capture, replay, scorecards, proposals and decisions, and promoted assets with rollback, using that taxonomy.

Treat the Go source as ground truth for vocabulary; this page summarises the operator-visible behavior built on it.

Shadow evals are a contributor and operator feature in the active interface. There is no dedicated TUI pane. Use the config, CLI, and server APIs to enable capture, inspect scorecards, and export results for offline analysis.