Benchmark Results

BMO’s edit quality is tested against the Aider polyglot benchmark, a corpus of 133+ Exercism exercises across 19 programming languages.

# Fetch the corpus
polyglot-harness fetch
# Run against a model (Go exercises by default)
polyglot-harness run --model anthropic/claude-sonnet-4-20250514 --output results.json
# Run with a different language track
polyglot-harness run --model openai/gpt-4.1 --language python --output results.json
# Run a subset for quick testing
polyglot-harness run --model anthropic/claude-sonnet-4-20250514 --limit 5 --output results.json

Each exercise is run as a single-turn prompt: BMO receives the exercise instructions and must produce a solution that passes the exercise’s tests. A solution counts as passing when the test run succeeds: go test for Go exercises, or the language’s standard test runner for other tracks.
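One way to picture the pass/fail check is to run the track's test command inside the exercise directory and treat a zero exit status as success. The Python sketch below is illustrative only. The function name solution_passes, the TEST_COMMANDS table, and the timeout value are assumptions made for this example and are not taken from polyglot-harness; only go test for Go is stated on this page.

```python
import subprocess
import sys

# Hypothetical mapping from language track to test command.
# Only "go test" is named on this page; the other entries are assumptions.
TEST_COMMANDS = {
    "go": ["go", "test", "./..."],
    "python": [sys.executable, "-m", "pytest", "-q"],
    "rust": ["cargo", "test"],
}


def solution_passes(command, exercise_dir, timeout_s=120):
    """Return True if the test command exits with status 0 in exercise_dir."""
    try:
        completed = subprocess.run(
            command,
            cwd=exercise_dir,
            capture_output=True,  # keep test output out of the console
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A hung test run or a missing test runner is counted as a failure.
        return False
    return completed.returncode == 0
```

Under these assumptions, a Go exercise would be checked with solution_passes(TEST_COMMANDS["go"], exercise_dir), where exercise_dir is the directory holding the generated solution and its tests.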

Metrics tracked:

  • First-try pass rate: Percentage of exercises where BMO’s solution passes all tests on the first attempt
  • Cost per 1k tasks: Estimated API cost in USD extrapolated from per-exercise token usage
  • Tool breakdown: Which edit tools (edit, write, apply_patch) were used
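The three metrics above can be derived from per-exercise records roughly as follows. This is a minimal Python sketch, not the harness's implementation: the ExerciseResult record, its field names, and the summarize function are hypothetical, and the layout of results.json is not documented here. The cost figure is extrapolated by scaling the mean per-exercise cost to 1,000 tasks.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExerciseResult:
    # Hypothetical per-exercise record; field names are illustrative only.
    name: str
    passed_first_try: bool
    cost_usd: float  # estimated from token usage for this exercise
    tools_used: list = field(default_factory=list)  # e.g. ["edit", "write"]


def summarize(results):
    """Compute pass rate, extrapolated cost, and tool usage counts."""
    if not results:
        raise ValueError("no results to summarize")
    total = len(results)
    passed = sum(1 for r in results if r.passed_first_try)
    mean_cost = sum(r.cost_usd for r in results) / total
    tool_counts = Counter(tool for r in results for tool in r.tools_used)
    return {
        "first_try_pass_rate_pct": 100.0 * passed / total,
        # Extrapolate: mean cost of one task scaled to 1,000 tasks.
        "cost_per_1k_tasks_usd": 1000.0 * mean_cost,
        "tool_breakdown": dict(tool_counts),
    }
```

Because the cost is a linear extrapolation from the sampled exercises, a short run (for example one started with --limit 5) gives a noisier estimate than a full run.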

Results are generated by running polyglot-harness run and are published below when available.

Note: Run your own benchmarks with polyglot-harness run to get results for your configured models and providers.