Session Mode Autoselect — Strategic Value Evaluation
Current behavior
Maturity: Eval-only surface. It records shadow decisions and reports for measurement; default live routing behavior is unchanged.
BMO can propose a session mode (code / review / debug / plan) from
your prompt. Separately, the evaluation apparatus records shadow
decisions and outcomes so you can measure strategic value (did the pick
matter?) and not just classifier accuracy.
- Shadow capture is on by default (config nil-guard): redacted decision rows and offline reports are available without changing which mode actually runs. Set enabled = false under [options.session_mode_autoselect_eval] to turn off capture entirely.
- Live routing is unchanged: users are not placed on an experiment arm by default.
- Live experiment settings are default off. Routing returns the control decision unless an operator configures an active treatment policy.
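The "on by default" rule above is a nil-guard: an omitted enabled key counts as true, and only an explicit false disables capture. A minimal Python sketch of that rule, assuming the config table has already been parsed into a dict. The name capture_enabled is hypothetical and is not taken from BMO's source:

```python
from typing import Any, Mapping, Optional


def capture_enabled(options: Optional[Mapping[str, Any]]) -> bool:
    """Return True unless capture is explicitly disabled.

    Hypothetical helper: `options` stands in for the parsed
    [options.session_mode_autoselect_eval] table, or None if the
    table is missing from the config file.
    """
    if options is None:
        return True  # whole table omitted: default on
    enabled = options.get("enabled")
    if enabled is None:
        return True  # key omitted: default on (the nil-guard)
    return bool(enabled)
```

Under this sketch, an empty table and a missing table behave the same way, so adding other keys to the table never switches capture off by accident.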
Why measure “strategic value”?
Offline accuracy (“did we match a human label?”) is necessary but not enough. This apparatus adds:
- outcome-linked rows (did the run finish, how long, contradictions),
- contradiction rate when users override with /mode,
- segment rollups with minimum-n floors so thin slices do not drive conclusions.
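To make the last two items concrete, here is a rough Python sketch of a per-segment contradiction rate with a minimum-n floor. The row shape, the rollup function, and the floor of 30 are all assumptions for illustration; this page does not state the actual schema or threshold.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class SegmentStats:
    n: int                     # rows observed in the segment
    contradiction_rate: float  # share of rows where the user overrode the pick
    reportable: bool           # False when n is below the floor


def rollup(rows: Iterable[Tuple[str, bool]], min_n: int = 30) -> Dict[str, SegmentStats]:
    """Aggregate (segment, was_overridden) pairs; min_n is an assumed floor."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [total, overrides]
    for segment, overridden in rows:
        counts[segment][0] += 1
        counts[segment][1] += int(overridden)
    return {
        seg: SegmentStats(
            n=total,
            contradiction_rate=overrides / total,
            reportable=total >= min_n,
        )
        for seg, (total, overrides) in counts.items()
    }
```

A segment with a single overridden row would show a 100% contradiction rate; the reportable flag is what keeps a slice like that from driving conclusions.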
Tune shadow capture (optional)
Defaults favor privacy (digest-first). Adjust in TOML:
    [options.session_mode_autoselect_eval]
    # enabled = true  # default when omitted; set false to disable all capture
    store_raw_prompt = false
    retention_days = 30
    prompt_retention_days = 30
    outcome_window_hours = 6

Read the rollup

    bmo eval autoselect aggregate --since-days 30
    bmo eval autoselect report --since-days 30
    bmo eval autoselect score docs/evals/session-mode-autoselect/corpus-v1/

- aggregate: raw per-axis counts.
- report: reportability floors (segments with very small n are flagged).
- score: offline corpus vs builtin classifier; no DB required.
All commands are read-only for the live system and support --json.
Privacy
- Decision rows store an allowlist-only redacted digest by default (length bucket, presence markers, keyword families, one-way hash).
- Raw prompt text is stored only if store_raw_prompt = true.
- Startup retention sweeps enforce prompt_retention_days and retention_days.
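As an illustration of what an allowlist-only digest could look like, the Python sketch below derives the four kinds of fields named above and nothing else. The bucket boundaries, marker names, keyword families, and the choice of SHA-256 are all assumptions; the real allowlist and hash are not given on this page.

```python
import hashlib
import re

# Hypothetical keyword families; the real allowlist is not documented here.
KEYWORD_FAMILIES = {
    "debug": {"error", "traceback", "crash", "fails"},
    "review": {"review", "diff", "pr"},
    "plan": {"plan", "design", "roadmap"},
}


def length_bucket(n: int) -> str:
    """Coarse size bucket (assumed boundaries) instead of the exact length."""
    if n < 80:
        return "short"
    if n < 400:
        return "medium"
    return "long"


def redacted_digest(prompt: str) -> dict:
    """Build a digest holding only allowlisted, non-reversible fields."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return {
        "length_bucket": length_bucket(len(prompt)),
        # Presence markers: booleans only, never the matched text.
        "has_path": bool(re.search(r"\S+/\S+", prompt)),
        "has_question": "?" in prompt,
        "keyword_families": sorted(
            family for family, kws in KEYWORD_FAMILIES.items() if words & kws
        ),
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
```

One design consideration that this page does not specify: an unkeyed hash of a very short prompt can be guessed by brute force, so an implementation might prefer a keyed hash such as HMAC.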
Live experiment settings
live_experiment contains assignment-hash and kill-switch settings. With the
default configuration, it does not change routing outcomes.
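One way assignment-hash and kill-switch settings can combine so that defaults leave routing untouched is sketched below in Python. Every field name, the salt, and the percent-bucket scheme are hypothetical; they are not the actual live_experiment keys.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class LiveExperiment:
    # All field names are hypothetical; defaults mirror "default off".
    kill_switch: bool = False
    treatment_active: bool = False
    treatment_percent: int = 0
    salt: str = "autoselect-v1"


def assign_arm(session_id: str, cfg: LiveExperiment) -> str:
    """Return "control" or "treatment" deterministically for a session."""
    # The kill switch and the inactive default both short-circuit to control.
    if cfg.kill_switch or not cfg.treatment_active:
        return "control"
    digest = hashlib.sha256(f"{cfg.salt}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "treatment" if bucket < cfg.treatment_percent else "control"
```

Hashing the session identifier with a salt keeps a given session on the same arm across calls, and the early return means a default-constructed config always yields the control decision.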
Related
- Session modes: interactive /mode switching.
- Implementation detail: session-mode-autoselect-eval.md in the repository.