Eval shadow summary
Eval shadow summary gives you one bounded quality snapshot over the existing eval-shadow store. It is designed for quick regression checks and operator inspection, not for replaying every captured run.
Surfaces
- HTTP: GET /v1/eval/shadow/summary
- In-agent tool: eval_shadow_summary
- CLI summary: bmo eval shadow summary
- CLI posture: bmo eval posture
- Runtime features: eval_shadow_summary_api, eval_shadow_summary_tool
The HTTP route and the in-agent tool use the same summary builder in internal/learning/summary.go.
bmo eval posture layers deterministic eval-gate guidance, stale-evidence
checks, and arena eval_shadow guidance on top of that same bounded summary.
It succeeds even when learning.shadow is disabled so maintainers can verify
that shadow learning is intentionally unavailable.
Parameters
limit: optional recent-window cap. Default 100, max 500.
The limit applies to:
- recent runs considered
- recent scorecards considered
- recent proposals considered
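As a rough client-side sketch of the limit rules above: the default of 100 and the maximum of 500 come from this page, while clamping out-of-range values, the lower bound of 1, passing limit as a query parameter, and the base URL are assumptions made for illustration. The server may handle invalid values differently, for example by rejecting them.

```python
from urllib.parse import urlencode

DEFAULT_LIMIT = 100  # documented default
MAX_LIMIT = 500      # documented maximum


def effective_limit(requested=None):
    """Return the window cap a client would send.

    Assumption: out-of-range values are clamped on the client, with a
    floor of 1. The real server behavior is not documented here.
    """
    if requested is None:
        return DEFAULT_LIMIT
    return max(1, min(int(requested), MAX_LIMIT))


def summary_url(base_url, requested=None):
    """Build the summary URL.

    Assumption: limit is passed as a query parameter; base_url is a
    hypothetical server address.
    """
    query = urlencode({"limit": effective_limit(requested)})
    return f"{base_url.rstrip('/')}/v1/eval/shadow/summary?{query}"
```

The same effective limit would bound the runs, scorecards, and proposals windows alike, since one value governs all three.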
Returned fields
The summary includes:
- status_counts
- comparability_counts
- proposal_state_counts
- pass_count
- fail_count
- comparable_runs
- comparable_pass_count
- comparable_pass_rate
- latest_run_at_ms
- latest_scorecard_at_ms
- latest_proposal_at_ms
- candidate_models
- failing_rule_types
This makes it possible to answer questions like:
- Are comparable eval runs still passing?
- Which candidate model is regressing?
- Which validation rule type is failing most often?
- Are proposals accumulating without review?
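A minimal sketch of how a caller might answer the first and third questions from an already-decoded summary. The field names come from the list above; the shape of failing_rule_types (a mapping from rule type to count), the definition of the pass rate as comparable_pass_count divided by comparable_runs, and the sample rule type names are assumptions, not confirmed by this page.

```python
def inspect_summary(summary):
    """Derive two quick health signals from a decoded summary dict."""
    comparable = summary.get("comparable_runs", 0)
    passed = summary.get("comparable_pass_count", 0)
    # Assumed definition of the rate; None when there is nothing comparable.
    rate = passed / comparable if comparable else None

    # Assumed shape: {rule_type: failure_count}.
    rule_counts = summary.get("failing_rule_types") or {}
    top_rule = max(rule_counts, key=rule_counts.get) if rule_counts else None

    return {"comparable_pass_rate": rate, "top_failing_rule_type": top_rule}


# Hypothetical sample values, for illustration only.
sample = {
    "comparable_runs": 8,
    "comparable_pass_count": 6,
    "failing_rule_types": {"schema": 3, "regex": 1},
}
result = inspect_summary(sample)
```

Returning None for an empty window keeps "no comparable runs" distinct from "zero percent passing", which matters when the summary is used as a regression check.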
Relationship to existing eval-shadow APIs
Use this summary when you want a fast health check.
Use the existing eval-shadow routes and tools when you need full detail:
- artifacts
- runs
- proposals
- derived assets
- exports
The summary is intentionally bounded and aggregate; it is not a replacement for the detailed replay records.
Use bmo eval posture when deciding whether a change has enough local evidence
to proceed. It separates deterministic before/after scenario reports from
shadow-eval evidence, calls out empty or non-comparable shadow windows, and
warns when the newest shadow evidence is older than the configured
--stale-after window.
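The staleness and empty-window checks described above could look roughly like the following. This is an illustrative sketch, not the logic in bmo eval posture: which timestamps count as the newest shadow evidence, the millisecond units of the window, and the status labels are all assumptions.

```python
TIMESTAMP_FIELDS = (
    "latest_run_at_ms",
    "latest_scorecard_at_ms",
    "latest_proposal_at_ms",
)


def evidence_status(summary, now_ms, stale_after_ms):
    """Classify shadow evidence as 'empty', 'stale', or 'fresh'.

    Assumption: the newest of the three latest_*_at_ms fields is the
    reference point; stale_after_ms is a hypothetical stand-in for the
    --stale-after window converted to milliseconds.
    """
    stamps = [summary[k] for k in TIMESTAMP_FIELDS if summary.get(k)]
    if not stamps:
        return "empty"
    age_ms = now_ms - max(stamps)
    return "stale" if age_ms > stale_after_ms else "fresh"
```

Keeping "empty" separate from "stale" mirrors the distinction in the text: a window with no evidence is called out on its own rather than being reported as old evidence.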
Relationship to runtime features
When the HTTP server is running, runtime_features distinguishes the two
summary access paths:
- eval_shadow_summary_api
- eval_shadow_summary_tool
Those records let operators see whether the summary surface is only available or has actually been exercised over HTTP or from the in-agent tool path.