Eval shadow summary

Eval shadow summary gives you one bounded quality snapshot over the existing eval-shadow store. It is designed for quick regression checks and operator inspection, not for replaying every captured run. Start from Shadow Evals for the full lifecycle; this page is the contract reference for the summary surfaces.

Surfaces

HTTP: GET /v1/eval/shadow/summary
In-agent tool: eval_shadow_summary
CLI summary: bmo eval shadow summary
CLI posture: bmo eval posture
Runtime features: eval_shadow_summary_api, eval_shadow_summary_tool

All of these surfaces use the same bounded eval-shadow summary and shared learning posture contract. The summary stays aggregate. The additive posture block reports whether learning is disabled, unavailable, empty, ready, stale, proposal-backed-up, or degraded. The additive evidence block answers the narrower measurement question: whether recent shadow evidence is disabled, empty, stale, non_comparable, failing, or usable. The degraded state is service-owned: the worker flips to degraded after three consecutive purge, claim, replay, or persistence failures, and clears the state at the next successful worker boundary.
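The degraded rule can be read as a small streak counter. The following is a minimal sketch of that reading, not the service's implementation: the class name, method names, and the choice to count all four failure kinds in one shared streak are assumptions made for illustration.

```python
from dataclasses import dataclass

# Threshold taken from the "three consecutive failures" rule described above.
DEGRADED_AFTER_FAILURES = 3


@dataclass
class DegradedTracker:
    """Hypothetical model of the service-owned degraded flag."""

    consecutive_failures: int = 0
    degraded: bool = False

    def record_failure(self) -> None:
        # Assumption: purge, claim, replay, and persistence failures
        # all feed the same streak.
        self.consecutive_failures += 1
        if self.consecutive_failures >= DEGRADED_AFTER_FAILURES:
            self.degraded = True

    def record_success(self) -> None:
        # A successful worker boundary resets the streak and clears the flag.
        self.consecutive_failures = 0
        self.degraded = False


if __name__ == "__main__":
    tracker = DegradedTracker()
    for _ in range(3):
        tracker.record_failure()
    print(tracker.degraded)  # True
    tracker.record_success()
    print(tracker.degraded)  # False
```

Under this reading, two failures followed by a success never surface as degraded, because the streak resets before it reaches the threshold.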

bmo eval posture layers deterministic eval-gate guidance, stale-evidence checks, and arena eval_shadow guidance on top of that same bounded summary. It succeeds even when learning.shadow is disabled so maintainers can verify that shadow learning is intentionally unavailable.

Parameters

limit: optional recent-window cap. Default 100, max 500.

The limit applies to:

recent runs considered
recent scorecards considered
recent proposals considered
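A client may want to keep limit inside the documented bounds before calling the HTTP surface. The sketch below builds the request URL with the standard library. The base URL and the helper name are hypothetical, and both the client-side clamp and the rejection of values below 1 are assumptions, since this page does not say how the server treats out-of-range values.

```python
from typing import Optional
from urllib.parse import urlencode

MAX_LIMIT = 500  # documented maximum for the recent-window cap


def summary_url(base_url: str, limit: Optional[int] = None) -> str:
    """Build the summary URL, keeping limit inside the documented bounds."""
    path = base_url.rstrip("/") + "/v1/eval/shadow/summary"
    if limit is None:
        # Omit the parameter so the server applies its default of 100.
        return path
    if limit < 1:
        # Assumption: a non-positive window is treated as a caller error.
        raise ValueError("limit must be a positive integer")
    # Client-side clamp to the documented maximum.
    return path + "?" + urlencode({"limit": min(limit, MAX_LIMIT)})


if __name__ == "__main__":
    # "http://localhost:8080" is a placeholder host, not a documented default.
    print(summary_url("http://localhost:8080", 250))
```

Because the same cap applies to runs, scorecards, and proposals, a single limit value bounds all three windows in one request.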

Returned fields

The summary includes:

posture
evidence
status_counts
comparability_counts
proposal_state_counts
pass_count
fail_count
comparable_runs
comparable_pass_count
comparable_pass_rate
latest_run_at_ms
latest_scorecard_at_ms
latest_proposal_at_ms
candidate_models
failing_rule_types

This makes it possible to answer questions like:

Are comparable eval runs still passing?
Which candidate model is regressing?
Which validation rule type is failing most often?
Are proposals accumulating without review?
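As an illustration of the first question, the sketch below recomputes a pass rate from comparable_pass_count and comparable_runs in an already-decoded summary. It assumes both fields are integers and that the rate is passes divided by comparable runs; the returned comparable_pass_rate field remains the authoritative value, and the sample numbers are invented for the example.

```python
from typing import Any, Mapping, Optional


def comparable_pass_rate(summary: Mapping[str, Any]) -> Optional[float]:
    """Return passes / comparable runs, or None when the window has no comparable runs."""
    runs = int(summary.get("comparable_runs") or 0)
    passes = int(summary.get("comparable_pass_count") or 0)
    if runs == 0:
        # An empty or non-comparable window has no meaningful rate.
        return None
    return passes / runs


if __name__ == "__main__":
    # Illustrative payload fragment, not real output.
    sample = {"comparable_runs": 8, "comparable_pass_count": 6}
    print(comparable_pass_rate(sample))  # 0.75
```

Returning None instead of 0.0 for an empty window keeps "no comparable evidence" distinct from "every comparable run failed".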

Relationship to existing eval-shadow APIs

Use this summary when you want a fast health check.

Use the existing eval-shadow routes and tools when you need full detail:

artifacts
runs
proposals
derived assets
exports

The summary is intentionally bounded and aggregate; it is not a replacement for the detailed replay records.

Use bmo eval posture when deciding whether a change has enough local evidence to proceed. It separates deterministic before/after scenario reports from shadow-eval evidence, calls out empty or non-comparable shadow windows, and warns when the newest shadow evidence is older than the configured --stale-after window. For automation or review tooling, prefer the evidence verdict over parsing the human-readable text output.
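The stale-evidence warning amounts to comparing the newest evidence timestamp against a window. The following is a rough sketch of that comparison, not the logic bmo eval posture actually runs: the function name is hypothetical, the window is passed in milliseconds rather than in whatever format --stale-after accepts, and treating latest_run_at_ms and latest_scorecard_at_ms as the evidence timestamps is an assumption.

```python
from typing import Any, Mapping, Optional

# Assumption: these two summary fields mark the newest shadow evidence.
EVIDENCE_TIMESTAMP_FIELDS = ("latest_run_at_ms", "latest_scorecard_at_ms")


def is_stale(summary: Mapping[str, Any], stale_after_ms: int, now_ms: int) -> Optional[bool]:
    """Return True when the newest evidence is older than the window, None when there is none."""
    stamps = [summary.get(name) for name in EVIDENCE_TIMESTAMP_FIELDS]
    stamps = [s for s in stamps if isinstance(s, int)]
    if not stamps:
        # Empty window: there is no evidence whose age could be judged.
        return None
    return now_ms - max(stamps) > stale_after_ms


if __name__ == "__main__":
    # Illustrative values only.
    sample = {"latest_run_at_ms": 1_000, "latest_scorecard_at_ms": 4_000}
    print(is_stale(sample, stale_after_ms=5_000, now_ms=10_000))  # True
```

Automation should still prefer the evidence verdict in the summary over a local recomputation like this one; the sketch only shows what the staleness check is measuring.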

Relationship to runtime features

When the HTTP server is running, runtime_features distinguishes the two summary access paths:

eval_shadow_summary_api
eval_shadow_summary_tool

Those records let operators see whether each summary path is merely available or has actually been exercised, either over HTTP or through the in-agent tool.