Benchmark Results
Polyglot Benchmark Results
BMO’s edit quality is tested against the Aider polyglot benchmark, a corpus of 133+ Exercism exercises across 19 programming languages.
How to run
    # Fetch the corpus
    polyglot-harness fetch

    # Run against a model (Go exercises by default)
    polyglot-harness run --model anthropic/claude-sonnet-4-20250514 --output results.json

    # Run with a different language track
    polyglot-harness run --model openai/gpt-4.1 --language python --output results.json

    # Run a subset for quick testing
    polyglot-harness run --model anthropic/claude-sonnet-4-20250514 --limit 5 --output results.json

Methodology
Each exercise is run as a single-turn prompt: BMO receives the exercise instructions
and must implement a solution that passes the exercise's tests. Success is measured by
running go test for Go exercises, or the language’s standard test runner for other tracks.
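The pass/fail check described above can be sketched roughly as follows. This is an illustration only, not the harness's actual implementation: the `TEST_COMMANDS` table and the `command_for` and `passes_tests` helpers are hypothetical names, and the commands for tracks other than Go are assumptions about what each language's standard test runner would be.

```python
import subprocess
import sys

# Hypothetical mapping from language track to test command.
# Only the Go entry comes from the text above; the others are assumptions.
TEST_COMMANDS = {
    "go": ["go", "test"],
    "python": [sys.executable, "-m", "pytest"],
    "rust": ["cargo", "test"],
}


def command_for(language: str) -> list[str]:
    """Return the test command for a language track, or raise if unknown."""
    try:
        return TEST_COMMANDS[language]
    except KeyError:
        raise ValueError(f"no test command configured for {language!r}") from None


def passes_tests(command: list[str], exercise_dir: str, timeout_s: float = 120.0) -> bool:
    """Run the test command in the exercise directory; exit code 0 means pass."""
    try:
        result = subprocess.run(
            command, cwd=exercise_dir, capture_output=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A hung test run or a missing toolchain is treated as a failure here.
        return False
    return result.returncode == 0
```

Treating a timeout or a missing toolchain as a failure is a design choice in this sketch; a real harness might report those cases separately so they do not distort the pass rate.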
Metrics tracked:
- First-try pass rate: Percentage of exercises where BMO’s solution passes all tests on the first attempt
- Cost per 1k tasks: Estimated API cost in USD extrapolated from per-exercise token usage
- Tool breakdown: Which edit tools (edit, write, apply_patch) were used
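As a rough sketch of how these three metrics could be derived from per-exercise records: the `ExerciseResult` fields below are hypothetical and do not reflect the actual layout of results.json, and the per-million-token prices are parameters the caller must supply, not real provider rates.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExerciseResult:
    """Hypothetical per-exercise record; not the harness's actual schema."""
    name: str
    passed_first_try: bool
    input_tokens: int
    output_tokens: int
    tools_used: list[str] = field(default_factory=list)


def first_try_pass_rate(results: list[ExerciseResult]) -> float:
    """Percentage of exercises that passed all tests on the first attempt."""
    if not results:
        return 0.0
    passed = sum(r.passed_first_try for r in results)
    return 100.0 * passed / len(results)


def cost_per_1k_tasks(
    results: list[ExerciseResult],
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> float:
    """Mean per-exercise cost in USD, extrapolated to 1,000 tasks."""
    if not results:
        return 0.0
    total_usd = sum(
        r.input_tokens * usd_per_mtok_in + r.output_tokens * usd_per_mtok_out
        for r in results
    ) / 1_000_000
    return 1000.0 * total_usd / len(results)


def tool_breakdown(results: list[ExerciseResult]) -> Counter:
    """Count how often each edit tool (edit, write, apply_patch) was used."""
    return Counter(tool for r in results for tool in r.tools_used)
```

Because the cost figure is extrapolated from a mean, a small `--limit` run can give a noisy estimate; the sketch makes no attempt to correct for that.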
Results
Results are generated by running polyglot-harness run and are published below when available.
Note: Run your own benchmarks with polyglot-harness run to get results for your configured models and providers.
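For comparing several models and language tracks, one option is to generate the command lines programmatically. The sketch below only builds argument lists from the flags shown earlier (`--model`, `--language`, `--limit`, `--output`) and does not execute anything. The helper names and the per-run output file naming are hypothetical, and it assumes these flags can be combined freely and that `go` is accepted as a `--language` value, neither of which is confirmed here.

```python
from itertools import product
from typing import Iterator, Optional


def build_run_command(
    model: str,
    language: Optional[str] = None,
    limit: Optional[int] = None,
    output: str = "results.json",
) -> list[str]:
    """Build a `polyglot-harness run` argument list from the documented flags."""
    cmd = ["polyglot-harness", "run", "--model", model]
    if language is not None:
        cmd += ["--language", language]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    cmd += ["--output", output]
    return cmd


def plan_runs(
    models: list[str], languages: list[str], limit: Optional[int] = None
) -> Iterator[list[str]]:
    """Yield one command per (model, language) pair."""
    for model, language in product(models, languages):
        # Hypothetical naming scheme so runs do not overwrite each other.
        output = f"results-{model.replace('/', '_')}-{language}.json"
        yield build_run_command(model, language=language, limit=limit, output=output)
```

Each yielded list can be passed to `subprocess.run`, or joined with `shlex.join` and pasted into a shell.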