Eval shadow summary

Eval shadow summary gives you one bounded quality snapshot over the existing eval-shadow store. It is designed for quick regression checks and operator inspection, not for replaying every captured run.

  • HTTP: GET /v1/eval/shadow/summary
  • In-agent tool: eval_shadow_summary
  • CLI summary: bmo eval shadow summary
  • CLI posture: bmo eval posture
  • Runtime features: eval_shadow_summary_api, eval_shadow_summary_tool
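
As a rough illustration of the HTTP surface, the sketch below issues the documented GET request and prints one field from the response. Only the route and the summary field names come from this page. The base URL, the BMO_BASE_URL environment variable, and passing limit as a query parameter are assumptions made for the example, and authentication is left out entirely.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
	"strconv"
	"time"
)

// summaryURL builds the request URL for the documented route.
// Passing limit as a query parameter is an assumption of this sketch.
func summaryURL(baseURL string, limit int) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = "/v1/eval/shadow/summary"
	if limit > 0 {
		q := u.Query()
		q.Set("limit", strconv.Itoa(limit))
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

// fetchSummary issues the GET and decodes the body as generic JSON,
// so the sketch does not depend on exact field types.
func fetchSummary(client *http.Client, baseURL string, limit int) (map[string]any, error) {
	target, err := summaryURL(baseURL, limit)
	if err != nil {
		return nil, err
	}
	resp, err := client.Get(target)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var out map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// BMO_BASE_URL and the localhost default are hypothetical.
	base := os.Getenv("BMO_BASE_URL")
	if base == "" {
		base = "http://localhost:8080"
	}
	client := &http.Client{Timeout: 5 * time.Second}
	summary, err := fetchSummary(client, base, 50)
	if err != nil {
		fmt.Fprintln(os.Stderr, "summary request failed:", err)
		return
	}
	fmt.Println("comparable_pass_rate:", summary["comparable_pass_rate"])
}
```

Decoding into a generic map keeps the example independent of the exact response types, which this page does not specify.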

The HTTP route and the in-agent tool use the same summary builder in internal/learning/summary.go. bmo eval posture layers deterministic eval-gate guidance, stale-evidence checks, and arena eval_shadow guidance on top of that same bounded summary. The posture command succeeds even when learning.shadow is disabled, so maintainers can verify that shadow learning is intentionally unavailable.

  • limit: optional cap on the recent window. Defaults to 100; the maximum is 500.

The limit applies to:

  • recent runs considered
  • recent scorecards considered
  • recent proposals considered
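
The default and maximum above can be sketched as a small normalization step applied before any records are read. The values 100 and 500 come from this page. Everything else is an assumption: the function names are hypothetical, the sketch clamps oversized requests (the real builder might reject them instead), and it assumes records are ordered oldest first.

```go
package main

import "fmt"

const (
	defaultLimit = 100 // documented default
	maxLimit     = 500 // documented maximum
)

// effectiveLimit maps a requested limit onto the documented bounds.
// Clamping values above the maximum is an assumption of this sketch.
func effectiveLimit(requested int) int {
	switch {
	case requested <= 0:
		return defaultLimit
	case requested > maxLimit:
		return maxLimit
	default:
		return requested
	}
}

// lastN returns at most the n most recent items,
// assuming items are ordered oldest first.
func lastN[T any](items []T, n int) []T {
	if n <= 0 {
		return nil
	}
	if n >= len(items) {
		return items
	}
	return items[len(items)-n:]
}

func main() {
	// Stand-in for stored runs; the same window would apply
	// to scorecards and proposals.
	runs := make([]int, 750)
	for i := range runs {
		runs[i] = i
	}
	for _, requested := range []int{0, 25, 9999} {
		limit := effectiveLimit(requested)
		window := lastN(runs, limit)
		fmt.Printf("requested=%d effective=%d considered=%d\n", requested, limit, len(window))
	}
}
```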

The summary includes:

  • status_counts
  • comparability_counts
  • proposal_state_counts
  • pass_count
  • fail_count
  • comparable_runs
  • comparable_pass_count
  • comparable_pass_rate
  • latest_run_at_ms
  • latest_scorecard_at_ms
  • latest_proposal_at_ms
  • candidate_models
  • failing_rule_types
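
For client code, the fields listed above could be decoded into a struct along these lines. The JSON keys are taken from the list; the Go types are assumptions, since this page does not document them. The shapes of candidate_models and failing_rule_types are left as raw JSON for the same reason. The sample payload is invented for illustration, and recomputing the rate as comparable_pass_count divided by comparable_runs is a guess at how comparable_pass_rate is derived.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Summary mirrors the documented field names. All Go types are assumptions.
type Summary struct {
	StatusCounts        map[string]int  `json:"status_counts"`
	ComparabilityCounts map[string]int  `json:"comparability_counts"`
	ProposalStateCounts map[string]int  `json:"proposal_state_counts"`
	PassCount           int             `json:"pass_count"`
	FailCount           int             `json:"fail_count"`
	ComparableRuns      int             `json:"comparable_runs"`
	ComparablePassCount int             `json:"comparable_pass_count"`
	ComparablePassRate  float64         `json:"comparable_pass_rate"`
	LatestRunAtMs       int64           `json:"latest_run_at_ms"`
	LatestScorecardAtMs int64           `json:"latest_scorecard_at_ms"`
	LatestProposalAtMs  int64           `json:"latest_proposal_at_ms"`
	CandidateModels     json.RawMessage `json:"candidate_models"`   // shape not documented here
	FailingRuleTypes    json.RawMessage `json:"failing_rule_types"` // shape not documented here
}

// sampleJSON is a made-up payload; keys such as "pass" and "fail" are hypothetical.
const sampleJSON = `{
  "status_counts": {"pass": 42, "fail": 8},
  "pass_count": 42,
  "fail_count": 8,
  "comparable_runs": 40,
  "comparable_pass_count": 30,
  "comparable_pass_rate": 0.75,
  "latest_run_at_ms": 1700000000000
}`

// decodeSummary parses a summary payload into the struct above.
func decodeSummary(data []byte) (Summary, error) {
	var s Summary
	err := json.Unmarshal(data, &s)
	return s, err
}

// passRate recomputes the comparable pass rate as a fraction,
// returning 0 when there are no comparable runs.
func passRate(passes, runs int) float64 {
	if runs == 0 {
		return 0
	}
	return float64(passes) / float64(runs)
}

func main() {
	s, err := decodeSummary([]byte(sampleJSON))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("comparable runs:", s.ComparableRuns)
	fmt.Println("recomputed pass rate:", passRate(s.ComparablePassCount, s.ComparableRuns))
	fmt.Println("latest run:", time.UnixMilli(s.LatestRunAtMs).UTC().Format(time.RFC3339))
}
```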

This makes it possible to answer questions like:

  • Are comparable eval runs still passing?
  • Which candidate model is regressing?
  • Which validation rule type is failing most often?
  • Are proposals accumulating without review?
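
Questions such as "which rule type fails most often" reduce to ranking a count map. The sketch below shows one way to do that, assuming the relevant summary fields can be read as name-to-count maps. That shape is not confirmed by this page, and every key in the example data (json_schema, pending, and so on) is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// entry pairs a name with its count.
type entry struct {
	Name  string
	Count int
}

// rankCounts orders entries by descending count; ties are broken
// alphabetically so the output is deterministic.
func rankCounts(counts map[string]int) []entry {
	out := make([]entry, 0, len(counts))
	for name, n := range counts {
		out = append(out, entry{Name: name, Count: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Name < out[j].Name
	})
	return out
}

func main() {
	// Hypothetical data; real keys depend on the deployment.
	failingRules := map[string]int{"json_schema": 7, "regex": 2, "length": 4}
	proposalStates := map[string]int{"pending": 12, "approved": 3}

	for _, e := range rankCounts(failingRules) {
		fmt.Printf("rule type %s failed %d times\n", e.Name, e.Count)
	}
	for _, e := range rankCounts(proposalStates) {
		fmt.Printf("proposal state %s: %d\n", e.Name, e.Count)
	}
}
```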

Use this summary when you want a fast health check.

Use the existing eval-shadow routes and tools when you need full detail:

  • artifacts
  • runs
  • proposals
  • derived assets
  • exports

The summary is intentionally bounded and aggregate; it is not a replacement for the detailed replay records.

Use bmo eval posture when deciding whether a change has enough local evidence to proceed. It separates deterministic before/after scenario reports from shadow-eval evidence, calls out empty or non-comparable shadow windows, and warns when the newest shadow evidence is older than the configured --stale-after window.
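
The stale-evidence warning described above can be approximated with a timestamp comparison. This is a minimal sketch, not the posture command's actual logic: it assumes the newest of the three latest_*_at_ms values counts as the "newest shadow evidence", treats a missing timestamp as an empty window, and uses hypothetical function names and status labels.

```go
package main

import (
	"fmt"
	"time"
)

// newestMs returns the largest of the given millisecond timestamps,
// or 0 when none are provided.
func newestMs(timestamps ...int64) int64 {
	var newest int64
	for _, ts := range timestamps {
		if ts > newest {
			newest = ts
		}
	}
	return newest
}

// evidenceStatus classifies the newest evidence relative to nowMs.
// The labels "empty", "stale", and "fresh" are made up for this sketch.
func evidenceStatus(latestMs, nowMs int64, staleAfter time.Duration) string {
	if latestMs <= 0 {
		return "empty"
	}
	age := time.Duration(nowMs-latestMs) * time.Millisecond
	if age > staleAfter {
		return "stale"
	}
	return "fresh"
}

func main() {
	now := time.Now().UnixMilli()
	hourMs := int64(time.Hour / time.Millisecond)

	// Hypothetical timestamps: a run 3 hours ago, a scorecard 30 hours ago,
	// and no proposals yet.
	latest := newestMs(now-3*hourMs, now-30*hourMs, 0)

	fmt.Println("24h window:", evidenceStatus(latest, now, 24*time.Hour))
	fmt.Println("1h window:", evidenceStatus(latest, now, time.Hour))
}
```

Taking the current time as a parameter rather than calling time.Now inside evidenceStatus keeps the check deterministic and easy to test.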

When the HTTP server is running, runtime_features distinguishes the two summary access paths:

  • eval_shadow_summary_api
  • eval_shadow_summary_tool

Those records let operators see whether each summary surface is merely available or has actually been exercised, either over HTTP or through the in-agent tool path.