The numbers, with the methodology.

Every benchmark on this page is fully reproducible from the OMEGA repo. The harnesses, task fixtures, scoring metrics, and raw results all live in benchmarks/ — clone the repo, run one command, get the same numbers.

Where we have real numbers today, they're below. Where we're still running, we say so. No cherry-picked subsets, no hidden task mixes.

80–85% cost reduction vs. a monolithic frontier-model call.

A 5-task offline benchmark that counts tokens, applies real published pricing, and compares OMEGA's fractal decomposition to a monolithic prompt. Fully reproducible — no API keys needed to verify.

Published
Routing profile | Setup | Aggregate reduction | Per-task range
Default | Claude Opus vs Haiku + Sonnet | 83.3% | 81.2% – 84.4%
OpenAI | GPT-5 vs GPT-4o-mini + GPT-4o | 80.9% | 78.4% – 82.4%
Hybrid-local | Claude Opus vs Haiku + MLX on Apple Silicon + Sonnet | 85.3% | 83.8% – 86.6%

The honest counter-result

Measured by raw token count, fractal decomposition uses ~73% more tokens than the monolithic baseline on these tasks. Each level adds system-prompt overhead, and the synthesizer has to read the worker summaries. The cost win comes entirely from model routing: fractal sends each level to the smallest viable model, while monolithic has to route the whole task to a frontier model.

We publish both numbers because any comparison that only reports "tokens" is either confused or cherry-picked.
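The arithmetic behind that trade-off fits in a few lines. This sketch uses hypothetical placeholder prices and a hypothetical 60/40 worker/synthesizer token split, not the published rates or measured splits from the harness; it only illustrates how a ~73% token overhead can still produce a large cost reduction once the levels are routed to cheaper tiers.

```python
# Illustrative cost model. Prices and the token split are HYPOTHETICAL
# placeholders, not the published pricing table the benchmark uses.
PRICE_PER_MTOK = {          # USD per million input tokens (hypothetical)
    "frontier": 15.00,      # monolithic baseline tier
    "mid":       3.00,      # synthesizer-class tier
    "small":     0.25,      # worker-class tier
}

def cost(tokens_by_tier: dict) -> float:
    """Total USD cost given token counts keyed by model tier."""
    return sum(PRICE_PER_MTOK[t] * n / 1_000_000
               for t, n in tokens_by_tier.items())

mono_tokens = 10_000
mono_cost = cost({"frontier": mono_tokens})

# Fractal: ~73% more tokens overall, routed mostly to the cheap tiers.
frac_tokens = int(mono_tokens * 1.73)
frac_cost = cost({
    "small": int(frac_tokens * 0.6),   # workers (hypothetical split)
    "mid":   int(frac_tokens * 0.4),   # synthesizer
})

reduction = 1 - frac_cost / mono_cost
print(f"monolithic ${mono_cost:.4f}  fractal ${frac_cost:.4f}  "
      f"reduction {reduction:.1%}")
```

Under these placeholder numbers the fractal side spends more tokens but roughly six times less money, which is the shape of the published result.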

Ran: 2026-04-20 · Tokenizer: openai/cl100k_base · Suite size: 5 tasks

Full methodology + raw results: benchmarks/token_efficiency/README.md

SWE-bench and LongBench — running.

Two public benchmarks: SWE-bench for coding-agent capability, LongBench for long-context reasoning. Harnesses are wired and tested; generation + grading is a paid, hours-long run we're scheduling.

SWE-bench Lite

Real GitHub bugs across 12 Python repos

Harness ready · results pending

300 human-validated bugs from major OSS projects (Django, Flask, scikit-learn, etc.). Each bug has test cases that must newly pass after the fix is applied. The industry standard for measuring coding-agent capability. Harness is end-to-end smoke-tested (zero-cost mock dispatcher); real runs require API keys and land soon.

Context

  • Claude Sonnet 4.5 (published) · ~65%
  • GPT-5 (published) · ~55–60%
  • OMEGA · pending
Harness: benchmarks/swe_bench/

LongBench

Long-context QA, summarisation, and retrieval

Harness ready · results pending

21 tasks across 6 categories (single-doc QA, multi-doc QA, summarisation, classification, retrieval, code) with contexts from 8k to 200k tokens. The benchmark OMEGA's fractal decomposition is architecturally best-suited to win — monolithic prompting has to truncate or burn context window, fractal routes each source through a narrow worker. Harness smoke-tested end-to-end.
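The routing idea described above can be sketched in a few lines. The helper names and prompt shapes here are hypothetical, not the harness API: each worker sees exactly one source, and the synthesizer sees only the worker summaries.

```python
# Sketch of fractal routing for long-context tasks (hypothetical names,
# not the actual OMEGA API). Workers never see the full context; the
# synthesizer never sees raw source bodies.

def fractal_answer(sources, question, worker, synthesizer):
    # One narrow worker call per source.
    summaries = [
        worker(f"Source:\n{src}\n\nExtract facts relevant to: {question}")
        for src in sources
    ]
    # The synthesizer reads only the short summaries.
    joined = "\n---\n".join(summaries)
    return synthesizer(f"Summaries:\n{joined}\n\nAnswer: {question}")

# Stub "models" so the sketch runs without any API:
answer = fractal_answer(
    sources=["doc one text...", "doc two text..."],
    question="What do the documents say?",
    worker=lambda prompt: prompt[:40],                    # stand-in model
    synthesizer=lambda prompt: "synthesized: " + prompt,  # stand-in model
)
```

The point of the structure is that no single call ever needs the full 200k-token context; the context budget is spent per source, not per task.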

Context

  • Claude 3.5 Sonnet (published) · ~50% avg
  • GPT-4o 128k (published) · ~48% avg
  • OMEGA · pending
Harness: benchmarks/longbench/

Reproducibility is the point.

Same prompts, both sides

The monolithic baseline gets the identical prompt the fractal side starts from. No special-casing, no advantage baked into the comparison.

Same tokenizer, both sides

The OpenAI side counts with tiktoken's cl100k_base; the Anthropic side counts with the tokenizer bundled in their SDK. We report both so the numbers can be checked against either vendor's billing.
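Verifying the OpenAI-side count takes two lines with tiktoken. The fallback branch below is only so this sketch runs without the library installed (or offline); it is a crude stand-in, not billing-accurate.

```python
# Count tokens the way the OpenAI-side harness does: cl100k_base via tiktoken.
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        return len(_enc.encode(text))
except Exception:
    # Fallback so the sketch runs without tiktoken -- NOT billing-accurate.
    def count_tokens(text: str) -> int:
        return len(text.split())

n = count_tokens("Summarise the following bug report in one sentence.")
```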

Published pricing, explicitly

The pricing table is a single file, three profiles. If the numbers move, anyone can rerun with the new table.
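A single-file pricing table makes "rerun with the new table" a pure function of that file. This is a hypothetical shape for such a file, not the harness's actual schema; the dollar figures are illustrative.

```python
# Hypothetical single-file pricing table (illustrative values and schema,
# not the harness's actual file). Cost is a pure function of this table.
import json

PRICING_JSON = """
{
  "default": {
    "frontier": {"usd_per_mtok_in": 15.0,  "usd_per_mtok_out": 75.0},
    "worker":   {"usd_per_mtok_in": 0.25,  "usd_per_mtok_out": 1.25}
  }
}
"""

PRICING = json.loads(PRICING_JSON)

def call_cost(profile: str, tier: str,
              tokens_in: int, tokens_out: int) -> float:
    """USD cost of one call, read straight from the pricing table."""
    p = PRICING[profile][tier]
    return (tokens_in * p["usd_per_mtok_in"]
            + tokens_out * p["usd_per_mtok_out"]) / 1_000_000
```

If a vendor changes its rates, the only diff is in the JSON; every reported number can be regenerated from it.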

Tests lock the methodology

62 tests in the harness — if someone tweaks a synthesizer to "look better," CI fails. Workers can't leak foreign context; synth can't re-read source bodies; monolithic can't use a non-frontier model.
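The flavor of those invariants can be shown in miniature. These checks are illustrative, with hypothetical record fields; the real suite is the 62 tests in the repo.

```python
# Illustrative methodology invariants (hypothetical field names -- the
# real checks are the 62 tests in the harness). Each run record says
# which model a role used and which contexts it was shown.
FRONTIER_MODELS = {"claude-opus", "gpt-5"}

def check_run(run: dict) -> None:
    # Monolithic baseline must use a frontier model -- no sandbagging.
    assert run["monolithic"]["model"] in FRONTIER_MODELS
    # Each worker may only see its own assigned source, never a sibling's.
    for w in run["workers"]:
        assert w["context_sources"] == [w["assigned_source"]]
    # The synthesizer reads worker summaries only, never raw source bodies.
    assert not run["synthesizer"]["saw_source_bodies"]

check_run({
    "monolithic":  {"model": "claude-opus"},
    "workers":     [{"assigned_source": "a.txt",
                     "context_sources": ["a.txt"]}],
    "synthesizer": {"saw_source_bodies": False},
})
```

A "tweak the synthesizer to look better" change would flip one of these fields, and the corresponding check fails in CI.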

All three benchmarks live in the OMEGA repo under benchmarks/. The harnesses are Apache-2.0 licensed — fork, run, or adapt them for your own comparisons. We're interested in feedback, disagreements, and better task fixtures.

Questions? benchmarks@myomega.ai