The numbers, with the methodology.
Every benchmark on this page is fully reproducible from the OMEGA repo. The harnesses, task fixtures, scoring metrics, and raw results all live in benchmarks/ — clone the repo, run one command, get the same numbers.
Where we have real numbers today, they're below. Where we're still running, we say so. No cherry-picked subsets, no hidden task mixes.
80–85% cost reduction vs. a monolithic frontier-model call.
A 5-task offline benchmark that counts tokens, applies real published pricing, and compares OMEGA's fractal decomposition to a monolithic prompt. Fully reproducible — no API keys needed to verify.
| Routing profile | Setup | Aggregate reduction | Per-task range |
|---|---|---|---|
| Default | Claude Opus vs Haiku + Sonnet | 83.3% | 81.2% – 84.4% |
| OpenAI | GPT-5 vs GPT-4o-mini + GPT-4o | 80.9% | 78.4% – 82.4% |
| Hybrid-local | Claude Opus vs Haiku + MLX on Apple Silicon + Sonnet | 85.3% | 83.8% – 86.6% |
The honest counter-result
Measured on raw token count, fractal decomposition uses ~73% more tokens than monolithic on these tasks. Each level has system-prompt overhead, and the synthesizer has to read worker summaries. The cost win is entirely from model routing — fractal sends each level to the smallest-viable model, while monolithic has to route the whole thing to a frontier model.
We publish both numbers because any comparison that only reports "tokens" is either confused or cherry-picked.
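The interaction between the two results comes down to a few lines of arithmetic. The sketch below is illustrative only: the per-million-token rates and token splits are invented for the example and are not the repo's pricing table or benchmark fixtures. The point is that routing, not token count, drives the cost.

```python
# Sketch of the cost comparison: more tokens, but cheaper models.
# All rates (USD per million tokens) and token counts are illustrative,
# NOT the published pricing table or the benchmark fixtures.
RATES = {
    "frontier": {"in": 15.00, "out": 75.00},  # monolithic route
    "worker":   {"in": 0.80,  "out": 4.00},   # small model for leaf tasks
    "synth":    {"in": 3.00,  "out": 15.00},  # mid-tier synthesizer
}

def cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one call under the rate table above."""
    r = RATES[model]
    return (tokens_in * r["in"] + tokens_out * r["out"]) / 1_000_000

# Monolithic: one frontier call over the whole prompt.
mono = cost("frontier", 8_000, 2_000)

# Fractal: ~73% more tokens overall (17,300 vs 10,000),
# but split across much cheaper models.
fractal = cost("worker", 12_000, 3_000) + cost("synth", 2_000, 300)

reduction = 1 - fractal / mono
print(f"monolithic ${mono:.4f}, fractal ${fractal:.4f}, reduction {reduction:.0%}")
```

With these invented numbers the fractal side spends 73% more tokens yet comes out far cheaper, because the per-token price gap between the frontier model and the small models dwarfs the token overhead. The published 80–85% figures come from the real fixtures and pricing table in benchmarks/token_efficiency/.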
Ran: 2026-04-20 · Tokenizer: openai/cl100k_base · Suite size: 5 tasks
Full methodology + raw results: benchmarks/token_efficiency/README.md
SWE-bench and LongBench — in progress.
Two public benchmarks: SWE-bench for coding-agent capability, LongBench for long-context reasoning. Harnesses are wired and tested; generation + grading is a paid, hours-long run we're scheduling.
SWE-bench Lite
Real GitHub bugs across 12 Python repos
300 human-validated bugs from major OSS projects (Django, Flask, scikit-learn, etc.). Each bug has test cases that must newly pass after the fix is applied. The industry standard for measuring coding-agent capability. Harness is end-to-end smoke-tested (zero-cost mock dispatcher); real runs require API keys and land soon.
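The zero-cost smoke test described above can be pictured as a dispatcher interface with a canned backend. Everything below is a hypothetical sketch of the pattern, not the repo's actual harness code; the task IDs follow SWE-bench's naming convention but are placeholders.

```python
from typing import Protocol

class Dispatcher(Protocol):
    """Anything that turns a bug report into a candidate patch."""
    def dispatch(self, task_id: str, problem: str) -> str: ...

class MockDispatcher:
    """Zero-cost stand-in: returns a canned diff so the harness plumbing
    (task loading, patch application, scoring) can be exercised without
    API keys or spend."""
    def dispatch(self, task_id: str, problem: str) -> str:
        return f"--- a/fix.py\n+++ b/fix.py\n# canned patch for {task_id}\n"

def smoke_test(dispatcher: Dispatcher, task_ids: list[str]) -> dict[str, str]:
    # Run every task end-to-end; a real grading step would apply the
    # patch and re-run the bug's test cases.
    return {tid: dispatcher.dispatch(tid, f"problem for {tid}") for tid in task_ids}

results = smoke_test(MockDispatcher(), ["django__django-0000", "flask__flask-0000"])
```

Swapping the mock for a real API-backed dispatcher is the only change between the smoke test and a paid run, which is what keeps the harness verifiable at zero cost.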
Context
- Claude Sonnet 4.5 (published): ~65%
- GPT-5 (published): ~55–60%
- OMEGA: pending
benchmarks/swe_bench/
LongBench
Long-context QA, summarisation, and retrieval
21 tasks across 6 categories (single-doc QA, multi-doc QA, summarisation, classification, retrieval, code) with contexts from 8k to 200k tokens. This is the benchmark OMEGA's fractal decomposition is architecturally best placed to win: monolithic prompting has to truncate or burn context window, while fractal routing sends each source through a narrow worker. Harness smoke-tested end-to-end.
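The architectural claim above can be sketched as a two-level pipeline: split the long context, extract from each chunk with a narrow worker, then answer from the summaries. `call_worker` and `call_synth` are placeholder stubs standing in for model calls; they are assumptions for illustration, not OMEGA's API.

```python
def chunk(text: str, size: int) -> list[str]:
    """Split a long document into worker-sized pieces."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def call_worker(piece: str, question: str) -> str:
    # Placeholder for a small-model call that extracts only what the
    # question needs from one chunk. A real worker would hit an API.
    return piece[:40]

def call_synth(summaries: list[str], question: str) -> str:
    # Placeholder for the mid-tier synthesizer, which reads worker
    # summaries rather than the source bodies.
    return " | ".join(summaries)

def fractal_answer(document: str, question: str, size: int = 4_000) -> str:
    # Monolithic prompting needs the whole document in one context
    # window; here each worker only ever sees one chunk.
    summaries = [call_worker(p, question) for p in chunk(document, size)]
    return call_synth(summaries, question)
```

No single call ever holds the full document, which is why context length stops being the binding constraint.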
Context
- Claude 3.5 Sonnet (published): ~50% avg
- GPT-4o 128k (published): ~48% avg
- OMEGA: pending
benchmarks/longbench/
Reproducibility is the point.
Same prompts, both sides
The monolithic baseline gets the identical prompt the fractal side starts from. No special-casing, no advantage baked into the comparison.
Same tokenizer, both sides
OpenAI-side counts use tiktoken's cl100k_base; Anthropic-side counts use the SDK's bundled tokenizer. We report both so the numbers can be checked against either vendor's billing.
Published pricing, explicitly
The pricing table is a single file covering all three routing profiles. If published prices change, anyone can rerun against the updated table.
Tests lock the methodology
62 tests in the harness — if someone tweaks a synthesizer to "look better," CI fails. Workers can't leak foreign context; synth can't re-read source bodies; monolithic can't use a non-frontier model.
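One of the invariants above, that workers can't leak foreign context, can be pictured as a simple assertion over the prompts the harness builds. The function and fixture names here are hypothetical illustrations, not the repo's actual test suite.

```python
def build_worker_prompt(source_id: str, sources: dict[str, str]) -> str:
    # Each worker prompt is built from exactly one source body;
    # nothing else is interpolated in.
    return f"Summarise the following:\n\n{sources[source_id]}"

def test_worker_cannot_leak_foreign_context():
    sources = {
        "doc_a": "alpha body text",
        "doc_b": "beta body text",
    }
    prompt = build_worker_prompt("doc_a", sources)
    # The worker for doc_a must see its own source and nothing else.
    assert sources["doc_a"] in prompt
    for other_id, body in sources.items():
        if other_id != "doc_a":
            assert body not in prompt

test_worker_cannot_leak_foreign_context()
```

Because the invariant is a test rather than a convention, any change that smuggles extra context into a worker prompt fails CI instead of quietly inflating the scores.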
All three benchmarks live in the OMEGA repo under benchmarks/. The harnesses are Apache-2.0 licensed — fork, run, or adapt them for your own comparisons. We're interested in feedback, disagreements, and better task fixtures.
Questions? benchmarks@myomega.ai