Shadow Evals
Shadow evals let BMO capture eligible real runs, replay them in the background against one or more configured variants, and store the outcomes for later comparison. You can then generate proposals from comparable scorecards, decide (accept, reject, defer), and promote accepted proposals to promoted assets with stable target refs. Rollback is supported and auditable.
These proposals are learning/promotion records. They are separate from Patch Proposals, which review and optionally apply agent-produced code changes.
This is an opt-in learning feature. It does not change or re-rank the live assistant response, and it never auto-promotes a model or prompt. It only records a baseline run, replays configured variants, saves scorecards, and lets you review proposals and record promotion/rollback manually.
Arena auto-select (eval_shadow)
If you use multi-model arena with
options.arena.auto_select_policy = "eval_shadow", BMO can pick a winner after a
standalone arena round using the best shadow scorecard per candidate model
(artifacts must be captured for that session and user message). This is
separate from proposal promotion: it only chooses which arena candidate to
accept. Requires shadow eval capture/replay to have produced scorecards whose
candidate_model matches each arena model_key. See arena-verified.md for eligibility, workflow guardrails, and tie-break rules.
What gets captured
In the current runtime, automatic capture is limited to top-level runs from the normal app/session path. It excludes child-agent subruns and protocol surfaces such as A2A and gateway traffic.
Each captured artifact stores:
- The original prompt snapshot
- Prompt-trace segments from the run
- The baseline response
- Tool-call observations from the baseline run
- Read-file snapshots and latest file snapshots from the workspace
- Validation rules derived from the baseline run
If BMO cannot capture a safe, replayable artifact, it stores the artifact as
non-replayable and does not queue shadow runs. This fail-closed path is used
when prompt trace is missing, read snapshots are unavailable, captured paths
fall outside the configured working directory, or the capture exceeds the size
limit. Each artifact records a structured capture reason (e.g. no_read_snapshots,
capture_over_bytes) for inspection and policy compliance.
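The fail-closed checks above can be sketched as one decision function. This is a minimal illustration under stated assumptions, not BMO's implementation: the `CaptureInput` type, its fields, and the reason strings `no_prompt_trace` and `path_outside_workdir` are hypothetical. Only `no_read_snapshots` and `capture_over_bytes` are named on this page. The size check counts snapshot bytes only, which is a simplification.

```python
import posixpath
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CaptureInput:
    """Hypothetical view of one run at capture time (not a BMO type)."""
    workdir: str  # assumed to be an absolute POSIX path
    prompt_trace: List[str] = field(default_factory=list)
    read_snapshots: Dict[str, str] = field(default_factory=dict)  # path -> content
    max_capture_bytes: int = 1_048_576


def _inside(root: str, path: str) -> bool:
    """True when path, resolved lexically against root, stays under root."""
    root = posixpath.normpath(root)
    resolved = posixpath.normpath(posixpath.join(root, path))
    return posixpath.commonpath([root, resolved]) == root


def capture_reason(run: CaptureInput) -> Optional[str]:
    """Return None when the run looks replayable, else a structured reason."""
    if not run.prompt_trace:
        return "no_prompt_trace"  # hypothetical reason name
    if not run.read_snapshots:
        return "no_read_snapshots"
    if any(not _inside(run.workdir, p) for p in run.read_snapshots):
        return "path_outside_workdir"  # hypothetical reason name
    size = sum(len(body.encode("utf-8")) for body in run.read_snapshots.values())
    if size > run.max_capture_bytes:
        return "capture_over_bytes"
    return None
```

Any non-`None` result corresponds to storing the artifact as non-replayable and queuing no shadow runs. Returning the first failing reason keeps the recorded value deterministic for a given run.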
Capture can be constrained by scope policy: allow/deny lists for repo path
(working directory), session ID, and task class. When an allow list is set,
only matching runs are captured; when a deny list is set, matching runs are
excluded. See the configuration reference for allow_repos, deny_repos,
allow_sessions, deny_sessions, allow_task_classes, and deny_task_classes.
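The allow/deny behavior described above can be sketched as follows. The `ScopeLists` type and `in_scope` function are hypothetical, and the sketch assumes exact string matching, that a deny match wins when a value appears in both lists, and that all three dimensions must permit the run. The configuration reference is the authority on the real matching rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScopeLists:
    """Hypothetical allow/deny pair for one dimension (repo, session, task class)."""
    allow: List[str] = field(default_factory=list)
    deny: List[str] = field(default_factory=list)

    def permits(self, value: str) -> bool:
        if value in self.deny:
            return False  # assumption: deny takes precedence over allow
        if self.allow:
            return value in self.allow  # a non-empty allow list is exclusive
        return True  # no lists configured: everything is in scope


def in_scope(repo: str, session: str, task_class: str,
             repos: ScopeLists, sessions: ScopeLists,
             task_classes: ScopeLists) -> bool:
    """A run is captured only if every dimension permits it (assumption)."""
    return (repos.permits(repo)
            and sessions.permits(session)
            and task_classes.permits(task_class))
```

For example, with `allow_repos = ["/work/app"]` and `deny_task_classes = ["exploration"]`, a run in `/work/app` with task class `exploration` would be excluded under these assumptions.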
How replay works
When an artifact is replayable, BMO enqueues one replay run per configured variant.
Each replay:
- Copies the source working tree into an isolated temp directory
- Rehydrates tracked file snapshots into that temp workdir
- Runs bmo run with the variant’s model, prompt prefix, and optional agent override
- Reuses the same validation-rule semantics as eval scenarios
- Writes a scorecard with file diffs, tool diffs, replay summary, and pass/fail status
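The first steps of a replay can be sketched as below. This is an illustration, not the shadow worker's code: `prepare_replay_workdir` and `replay_args` are hypothetical helpers, snapshot paths are assumed to be trusted and relative to the tree root, and the mapping of a variant's prompt prefix onto `--system-prompt-prefix` is an assumption. Model selection and prompt input are left out because this page does not document how they are passed to `bmo run`.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def prepare_replay_workdir(source_tree: str, snapshots: Dict[str, str]) -> Path:
    """Copy the source tree into a temp dir, then restore captured file contents."""
    workdir = Path(tempfile.mkdtemp(prefix="shadow-replay-")) / "tree"
    shutil.copytree(source_tree, workdir)
    for rel_path, content in snapshots.items():
        target = workdir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        # Captured content wins over whatever the live tree holds now.
        target.write_text(content, encoding="utf-8")
    return workdir


def replay_args(agent: Optional[str], prompt_prefix: Optional[str]) -> List[str]:
    """Build the documented override flags for one replay run."""
    args = ["bmo", "run"]
    if agent:
        args += ["--agent", agent]
    if prompt_prefix:
        args += ["--system-prompt-prefix", prompt_prefix]
    return args
```

Working on a copy means a replay never writes to the source working tree, and rehydrating snapshots after the copy lets the replay see the files as they were at capture time even if the tree has drifted since.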
Replay scorecards use one of three comparability states:
- comparable: the replay completed cleanly and the captured evidence was sufficient
- degraded: the replay completed, but evidence or workspace drift reduced confidence
- non_comparable: the replay could not be compared reliably
Enable shadow evals
Add a learning.shadow section to your config:
[learning.shadow]
enabled = true
capture_top_level_runs = true
sample_rate = 1.0
retention_days = 30
max_capture_bytes = 1048576
# Optional: restrict capture by repo path, session ID, or task class
# allow_repos = ["/path/to/allowed/repo"]
# deny_repos = ["/tmp"]
# allow_sessions = ["sess-1"]
# deny_task_classes = ["exploration"]
[[learning.shadow.variants]]
name = "candidate-sonnet"
model = "anthropic/claude-sonnet-4-20250514"
prompt_prefix = "Prefer concise code-review style output."
agent = "coder"
[[learning.shadow.variants]]
name = "candidate-task"
model = "openai/gpt-5"
agent = "task"
Inspect results from the CLI
Use bmo eval posture first when you need a bounded maintainer readout across
deterministic eval gates, shadow summary health, proposal state, and arena
eval_shadow readiness:
# Print deterministic-first eval posture plus bounded shadow-eval health
bmo eval posture
# Emit machine-readable posture for automation
bmo eval posture --json
The posture command succeeds when learning.shadow is disabled. That is the
safe opt-in state; the command reports that shadow-only evidence is unavailable
and points maintainers back to deterministic before/after reports. When
learning is enabled, the readout distinguishes empty, non-comparable, failing,
usable, and stale shadow evidence without dumping captured prompts or artifacts.
Use the bmo eval shadow commands when the posture readout says deeper audit is
needed:
# Print the bounded quality summary over recent runs, scorecards, and proposals
bmo eval shadow summary
# List recent artifacts and replay runs
bmo eval shadow list
# Filter by agent type or route source
bmo eval shadow list --agent-type coder --route-source app_run
# Show one captured artifact with linked runs and scorecards
bmo eval shadow show <artifact-id>
# Export runs and scorecards as JSONL
bmo eval shadow export --format jsonl
# Export only comparable runs for one model as CSV
bmo eval shadow export --format csv --candidate-model anthropic/claude-sonnet-4-20250514 --comparability comparable
# Delete an artifact (and its runs and scorecards) or a single run
bmo eval shadow delete artifact <artifact-id>
bmo eval shadow delete run <run-id>
# Purge artifacts older than retention (TTL)
bmo eval shadow purge --retention-days 30
bmo eval shadow summary prints the bounded quality snapshot used by the
eval_shadow_summary tool, bmo eval posture, and HTTP summary route. Use it
before list, show, or export unless you already know which artifact or run
needs inspection. bmo eval shadow show prints one JSON document containing the
artifact, its linked runs and scorecards. bmo eval shadow export defaults to
JSONL and also supports CSV. Use bmo eval shadow delete to remove specific
artifacts or runs; use bmo eval shadow purge to remove artifacts older than a
retention window (manual TTL purge; the learning worker also purges by
configured retention automatically).
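For offline review, a JSONL export can be tallied with a few lines of Python. This sketch assumes the export has been saved to a file and that scorecard rows carry `candidate_model` and `comparability` keys. Those names match the documented filters, but the actual row layout is not specified on this page, so treat the key names as assumptions.

```python
import json
from collections import Counter
from typing import Iterable


def tally_comparability(lines: Iterable[str]) -> Counter:
    """Count rows per (candidate_model, comparability) pair."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        row = json.loads(line)
        model = row.get("candidate_model")
        state = row.get("comparability")
        if model is None or state is None:
            continue  # not a scorecard row in this assumed layout
        counts[(model, state)] += 1
    return counts
```

Passing an open file handle works because the function only iterates over lines, for example `tally_comparability(open("shadow.jsonl", encoding="utf-8"))` with a hypothetical file name.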
Quality gate recipes
For prompt, routing, arena, provider, or scenario changes, collect deterministic evidence before leaning on shadow learning:
bmo eval run eval/scenarios/prompt-stack/ --output before.json
bmo eval run eval/scenarios/prompt-stack/ --output after.json
bmo eval compare before.json after.json
bmo eval posture
Treat bmo eval compare regressions as blocking local evidence. Treat shadow
summary state as advisory unless it has comparable passing scorecards from the
same class of work and is fresh enough for the decision at hand. Live-provider
evals and shadow replay remain opt-in because they depend on local credentials,
provider behavior, and intentional spend.
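The weighting described above (deterministic regressions block, shadow evidence is advisory unless comparable, passing, same-class, and fresh) can be written down as a small decision rule. Everything here is hypothetical: the `Evidence` fields, the result labels, and the 7-day freshness default are illustrative choices, not values BMO defines.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical summary of local evidence for one change."""
    compare_regressions: int        # regressions reported by the deterministic compare
    shadow_comparable_passing: int  # comparable passing shadow scorecards
    shadow_same_task_class: bool    # scorecards come from the same class of work
    shadow_age_days: float          # age of the newest relevant scorecard


def gate(e: Evidence, max_age_days: float = 7.0) -> str:
    """Deterministic regressions always block; shadow data can only add support."""
    if e.compare_regressions > 0:
        return "block"
    shadow_usable = (
        e.shadow_comparable_passing > 0
        and e.shadow_same_task_class
        and e.shadow_age_days <= max_age_days
    )
    return "pass_with_shadow_support" if shadow_usable else "pass_shadow_advisory"
```

Note that shadow evidence never overrides a deterministic regression in this sketch; it only changes how much supporting weight a passing result carries.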
API surfaces
If you run BMO as an HTTP server, the same data is available over JSON endpoints.
Artifacts and runs
- GET /v1/eval/shadow/artifacts: list artifacts
- GET /v1/eval/shadow/artifacts/{artifact_id}: get one artifact with runs and scorecards
- DELETE /v1/eval/shadow/artifacts/{artifact_id}: delete an artifact and its runs/scorecards (204)
- GET /v1/eval/shadow/runs: list runs (with scorecards)
- DELETE /v1/eval/shadow/runs/{run_id}: delete a run and its scorecard (204)
- GET /v1/eval/shadow/export?format=jsonl: stream artifacts, scorecards, proposals, and promoted assets as JSONL for offline review or archival
- GET /v1/eval/shadow/export?format=csv: export comparable shadow-eval rows as CSV for spreadsheet review and reporting
Proposals and promoted assets are documented in the Proposals and Promoted assets sections below.
Common filters include limit, artifact_id, session_id, agent_type,
route_source, run_id, status, candidate_model, and comparability.
Delete endpoints return 404 when the resource is missing or the learning
service is unavailable.
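A small client for the runs endpoints might look like the sketch below, using only the Python standard library. The base URL is a placeholder, the helper names are hypothetical, and authentication is not covered because this page does not describe it.

```python
import urllib.error
import urllib.request
from urllib.parse import quote, urlencode


def runs_url(base_url: str, **filters) -> str:
    """Build the list-runs URL, skipping filters that are unset (None)."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = base_url.rstrip("/") + "/v1/eval/shadow/runs"
    return f"{url}?{query}" if query else url


def delete_run(base_url: str, run_id: str) -> bool:
    """Return True on 204 and False on 404; re-raise any other HTTP error."""
    url = base_url.rstrip("/") + "/v1/eval/shadow/runs/" + quote(run_id, safe="")
    request = urllib.request.Request(url, method="DELETE")
    try:
        with urllib.request.urlopen(request) as response:
            return response.status == 204
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # Per the docs, 404 means either a missing run or an
            # unavailable learning service; the two are not distinguishable here.
            return False
        raise
```

For example, `runs_url("http://localhost:8080", candidate_model="openai/gpt-5", comparability="comparable")` yields a URL with both filters percent-encoded in the query string; the host and port are assumptions.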
Proposals (review and decide)
After you have comparable passing scorecards, you can generate proposals grouped by task class and target surface. Each proposal is a candidate for turning repeated wins into a reusable asset (e.g. recipe, operating pack, or change contract). Proposals stay in a pending state until a human records a decision.
BMO never auto-promotes. You must:
- Generate proposals from comparable scorecards (optional; you can also create proposals manually via the API).
- Review proposal summary, evidence refs, and target surface.
- Record a decision: accepted, rejected, deferred, or rolled_back.
Proposal records are append-only audit artifacts. There is no delete/archive API or agent tool for proposals; use decisions, promotion, and rollback to advance their lifecycle while preserving history.
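The grouping step behind proposal generation can be sketched as follows. The scorecard keys (`comparability`, `passed`, `task_class`, `target_surface`, `run_id`) are hypothetical field names chosen for illustration; the default threshold of 2 mirrors the documented default for proposal generation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def proposal_groups(scorecards: List[dict],
                    min_comparable_runs: int = 2) -> Dict[Tuple[str, str], List[str]]:
    """Group comparable passing runs by (task_class, target_surface)."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for card in scorecards:
        # Degraded, non-comparable, and failing scorecards never count as wins.
        if card["comparability"] == "comparable" and card["passed"]:
            key = (card["task_class"], card["target_surface"])
            groups[key].append(card["run_id"])
    # Keep only groups with enough repeated wins to justify a proposal.
    return {k: v for k, v in groups.items() if len(v) >= min_comparable_runs}
```

Each surviving group would correspond to one pending proposal whose evidence refs are the listed run IDs. Nothing in this step records a decision; that remains a human action.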
CLI: proposals
# List proposals (filter by state, task class, target surface)
bmo eval shadow proposals list
bmo eval shadow proposals list --state pending --limit 20
# Get one proposal
bmo eval shadow proposals get <proposal-id>
# Generate proposals from comparable passing scorecards (min 2 runs per group by default)
bmo eval shadow proposals generate
bmo eval shadow proposals generate --min-comparable-runs 3 --target-surface recipe
# Record a decision (accepted, rejected, deferred, rolled_back)
bmo eval shadow proposals decide <proposal-id> --outcome accepted --reviewer "alice" --rationale "LGTM"
API: proposals
- GET /v1/eval/shadow/proposals: list (query: limit, proposal_id, state, task_class, target_surface)
- GET /v1/eval/shadow/proposals/{proposal_id}: get one
- POST /v1/eval/shadow/proposals/generate: generate from scorecards (body: optional min_comparable_runs, target_surface, task_class)
- POST /v1/eval/shadow/proposals/{proposal_id}/decide: record decision (body: outcome, reviewer, rationale)
Promoted assets (activate and roll back)
When a proposal is accepted, you can promote it: record a stable target ref (e.g. recipe path, pack id, contract id) and who activated it. That creates a promoted asset (derived asset) in state active. Later you can roll back an active asset; the rollback preserves the rationale and updates the proposal’s audit trail.
Promotion in BMO only records the target ref and state. It does not create or mutate recipe files, operating packs, or change contracts itself; you use other workflows to create those assets and then record the ref when promoting.
CLI: promoted assets
# List promoted assets (filter by state, proposal, target type)
bmo eval shadow promoted list
bmo eval shadow promoted list --state active
# Get one promoted asset
bmo eval shadow promoted get <asset-id>
# Promote an accepted proposal (record target ref)
bmo eval shadow promoted promote <proposal-id> --target-type recipe --target-id "recipes/learned.yaml" --activated-by "alice"
# Roll back an active promoted asset
bmo eval shadow promoted rollback <asset-id> --rationale "Overfit to one run"
API: promoted assets
- GET /v1/eval/shadow/derived-assets: list (query: limit, asset_id, proposal_id, state, target_type)
- GET /v1/eval/shadow/derived-assets/{asset_id}: get one
- POST /v1/eval/shadow/proposals/{proposal_id}/promote: create promoted asset from accepted proposal (body: target_type, target_id, version_ref, activated_by)
- POST /v1/eval/shadow/derived-assets/{asset_id}/rollback: mark asset rolled back (body: rationale)
Rollback records a decision on the underlying proposal so the full chain (proposal → promote → rollback) remains auditable.
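The audit chain can be modelled in memory to show how the pieces relate. This is a toy model, not the service's data model: the class and function names are hypothetical, the asset state label `rolled_back` is assumed (only `active` is named on this page), and the `reviewer` argument on rollback is an illustrative addition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Proposal:
    proposal_id: str
    state: str = "pending"
    decisions: List[dict] = field(default_factory=list)  # append-only history

    def decide(self, outcome: str, reviewer: str, rationale: str) -> None:
        """Append a decision; earlier entries are never edited or removed."""
        self.decisions.append(
            {"outcome": outcome, "reviewer": reviewer, "rationale": rationale}
        )
        self.state = outcome


@dataclass
class PromotedAsset:
    proposal: Proposal
    target_type: str
    target_id: str
    activated_by: str
    state: str = "active"


def promote(proposal: Proposal, target_type: str, target_id: str,
            activated_by: str) -> PromotedAsset:
    """Record a target ref for an accepted proposal; touches no asset files."""
    if proposal.state != "accepted":
        raise ValueError("only accepted proposals can be promoted")
    return PromotedAsset(proposal, target_type, target_id, activated_by)


def rollback(asset: PromotedAsset, rationale: str, reviewer: str) -> None:
    """Mark the asset rolled back and log the decision on its proposal."""
    if asset.state != "active":
        raise ValueError("only active assets can be rolled back")
    asset.state = "rolled_back"  # assumed state label
    asset.proposal.decide("rolled_back", reviewer, rationale)
```

After an accept, promote, and rollback sequence, the proposal's decision list holds both the acceptance and the rollback with their rationales, which is the property that keeps the chain auditable.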
Useful replay-related run flags
Shadow replay variants are built on top of bmo run, which supports:
- --agent to override the agent used for that one run
- --system-prompt-prefix to prepend additional system guidance for that run
These flags are also useful for ad hoc local experiments outside the shadow worker.
Contracts and taxonomy
Task classes, capture decision reasons, provenance, proposal states, and promoted asset (derived asset) states are defined in code under internal/learning/. In-tree behavior covers capture, replay, scorecards, proposals and decisions, and promoted assets with rollback, using that taxonomy.
- Type contracts: internal/learning/contracts.go (task classes, capture reasons, provenance, proposal and derived-asset types).
- Service entry points and state transitions: internal/learning/service.go.
- Rollout, promotion, and rollback edges: internal/learning/rollout.go.
Treat the Go source as ground truth for vocabulary; this page summarises the operator-visible behavior built on it.
Current scope
Shadow evals are a contributor and operator feature in the active interface. There is no dedicated TUI pane. Use the config, CLI, and server APIs to enable capture, inspect scorecards, and export results for offline analysis.