Skip to content

Session Mode Autoselect — Strategic Value Evaluation

Maturity: Eval-only surface. It records shadow decisions and reports for measurement; default live routing behavior is unchanged.

BMO can propose a session mode (code / review / debug / plan) from your prompt. Separately, the evaluation apparatus records shadow decisions and outcomes so you can measure strategic value (did the pick matter?) and not just classifier accuracy.

  • Shadow capture is on by default (config nil-guard): redacted decision rows and offline reports are available without changing which mode actually runs. Set [options.session_mode_autoselect_eval] enabled = false to turn off capture entirely.
  • Live routing is unchanged: users are not placed on an experiment arm by default.
  • Live experiment settings are default off. Routing returns the control decision unless an operator configures an active treatment policy.

Offline accuracy (“did we match a human label?”) is necessary but not enough. This apparatus adds:

  • outcome-linked rows (did the run finish, how long, contradictions),
  • contradiction rate when users override with /mode,
  • segment rollups with minimum-n floors so thin slices do not drive conclusions.

Defaults favor privacy (digest-first). Adjust in TOML:

[options.session_mode_autoselect_eval]
# enabled = true # default when omitted; set false to disable all capture
store_raw_prompt = false
retention_days = 30
prompt_retention_days = 30
outcome_window_hours = 6
Terminal window
bmo eval autoselect aggregate --since-days 30
bmo eval autoselect report --since-days 30
bmo eval autoselect score docs/evals/session-mode-autoselect/corpus-v1/
  • aggregate — raw per-axis counts.
  • report — reportability floors (segments with very small n are flagged).
  • score — offline corpus vs builtin classifier; no DB required.

All commands are read-only for the live system and support --json.

  • Decision rows store an allowlist-only redacted digest by default (length bucket, presence markers, keyword families, one-way hash).
  • Raw prompt text is stored only if store_raw_prompt = true.
  • Startup retention sweeps enforce prompt_retention_days and retention_days.

live_experiment contains assignment-hash and kill-switch settings. With the default configuration, it does not change routing outcomes.