The numbers, with the methodology.

Every benchmark on this page is fully reproducible from the OMEGA repo. The harnesses, task fixtures, scoring metrics, and raw results all live in benchmarks/ — clone the repo, run one command, get the same numbers.

Where we have real numbers today, they're below. Where we're still running, we say so. No cherry-picked subsets, no hidden task mixes.

80–85% cost reduction vs. a monolithic frontier-model call.

A 5-task offline benchmark that counts tokens, applies real published pricing, and compares OMEGA's fractal decomposition to a monolithic prompt. Fully reproducible — no API keys needed to verify.

Published
Routing profile | Setup | Aggregate reduction | Per-task range
Default | Claude Opus vs Haiku + Sonnet | 83.3% | 81.2% – 84.4%
OpenAI | GPT-5 vs GPT-4o-mini + GPT-4o | 80.9% | 78.4% – 82.4%
Hybrid-local | Claude Opus vs Haiku + MLX on Apple Silicon + Sonnet | 85.3% | 83.8% – 86.6%

The honest counter-result

Measured by raw token count, fractal decomposition uses ~73% more tokens than the monolithic baseline on these tasks. Each level adds system-prompt overhead, and the synthesizer has to read the worker summaries. The cost win comes entirely from model routing: fractal sends each level to the smallest viable model, while monolithic has to route the whole task to a frontier model.

We publish both numbers because any comparison that only reports "tokens" is either confused or cherry-picked.
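The arithmetic behind that trade-off fits in a few lines. This sketch uses hypothetical placeholder prices and a hypothetical 60/40 worker/synthesizer token split, not the published rates or measured splits from the harness; it only illustrates how a ~73% token overhead can still produce a large cost reduction once the levels are routed to cheaper tiers.

```python
# Illustrative cost model. Prices and the token split are HYPOTHETICAL
# placeholders, not the published pricing table the benchmark uses.
PRICE_PER_MTOK = {          # USD per million input tokens (hypothetical)
    "frontier": 15.00,      # monolithic baseline tier
    "mid":       3.00,      # synthesizer-class tier
    "small":     0.25,      # worker-class tier
}

def cost(tokens_by_tier: dict) -> float:
    """Total USD cost given token counts keyed by model tier."""
    return sum(PRICE_PER_MTOK[t] * n / 1_000_000
               for t, n in tokens_by_tier.items())

mono_tokens = 10_000
mono_cost = cost({"frontier": mono_tokens})

# Fractal: ~73% more tokens overall, routed mostly to the cheap tiers.
frac_tokens = int(mono_tokens * 1.73)
frac_cost = cost({
    "small": int(frac_tokens * 0.6),   # workers (hypothetical split)
    "mid":   int(frac_tokens * 0.4),   # synthesizer
})

reduction = 1 - frac_cost / mono_cost
print(f"monolithic ${mono_cost:.4f}  fractal ${frac_cost:.4f}  "
      f"reduction {reduction:.1%}")
```

Under these placeholder numbers the fractal side spends more tokens but roughly six times less money, which is the shape of the published result.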

Ran: 2026-04-20 · Tokenizer: openai/cl100k_base · Suite size: 5 tasks

Full methodology + raw results: benchmarks/token_efficiency/README.md

SWE-bench and LongBench — running.

Two public benchmarks: SWE-bench for coding-agent capability, LongBench for long-context reasoning. Harnesses are wired and tested; generation + grading is a paid, hours-long run we're scheduling.

SWE-bench Lite

Real GitHub bugs across 12 Python repos

Harness ready · results pending

300 human-validated bugs from major OSS projects (Django, Flask, scikit-learn, etc.). Each bug has test cases that must newly pass after the fix is applied. The industry standard for measuring coding-agent capability. Harness is end-to-end smoke-tested (zero-cost mock dispatcher); real runs require API keys and land soon.

Context

  • Claude Sonnet 4.5 (published) · ~65%
  • GPT-5 (published) · ~55–60%
  • OMEGA · pending
Harness: benchmarks/swe_bench/

LongBench

Long-context QA, summarisation, and retrieval

Harness ready · results pending

21 tasks across 6 categories (single-doc QA, multi-doc QA, summarisation, classification, retrieval, code) with contexts from 8k to 200k tokens. The benchmark OMEGA's fractal decomposition is architecturally best-suited to win — monolithic prompting has to truncate or burn context window, fractal routes each source through a narrow worker. Harness smoke-tested end-to-end.
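The routing idea described above can be sketched in a few lines. The helper names and prompt shapes here are hypothetical, not the harness API: each worker sees exactly one source, and the synthesizer sees only the worker summaries.

```python
# Sketch of fractal routing for long-context tasks (hypothetical names,
# not the actual OMEGA API). Workers never see the full context; the
# synthesizer never sees raw source bodies.

def fractal_answer(sources, question, worker, synthesizer):
    # One narrow worker call per source.
    summaries = [
        worker(f"Source:\n{src}\n\nExtract facts relevant to: {question}")
        for src in sources
    ]
    # The synthesizer reads only the short summaries.
    joined = "\n---\n".join(summaries)
    return synthesizer(f"Summaries:\n{joined}\n\nAnswer: {question}")

# Stub "models" so the sketch runs without any API:
answer = fractal_answer(
    sources=["doc one text...", "doc two text..."],
    question="What do the documents say?",
    worker=lambda prompt: prompt[:40],                    # stand-in model
    synthesizer=lambda prompt: "synthesized: " + prompt,  # stand-in model
)
```

The point of the structure is that no single call ever needs the full 200k-token context; the context budget is spent per source, not per task.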

Context

  • Claude 3.5 Sonnet (published) · ~50% avg
  • GPT-4o 128k (published) · ~48% avg
  • OMEGA · pending
Harness: benchmarks/longbench/

Reproducibility is the point.

Same prompts, both sides

The monolithic baseline gets the identical prompt the fractal side starts from. No special-casing, no advantage baked into the comparison.

Same tokenizer, both sides

The OpenAI side counts with tiktoken's cl100k_base; the Anthropic side counts with the tokenizer bundled in their SDK. We report both so the numbers can be checked against either vendor's billing.
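Verifying the OpenAI-side count takes two lines with tiktoken. The fallback branch below is only so this sketch runs without the library installed (or offline); it is a crude stand-in, not billing-accurate.

```python
# Count tokens the way the OpenAI-side harness does: cl100k_base via tiktoken.
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        return len(_enc.encode(text))
except Exception:
    # Fallback so the sketch runs without tiktoken -- NOT billing-accurate.
    def count_tokens(text: str) -> int:
        return len(text.split())

n = count_tokens("Summarise the following bug report in one sentence.")
```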

Published pricing, explicitly

The pricing table is a single file, three profiles. If the numbers move, anyone can rerun with the new table.
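A single-file pricing table makes "rerun with the new table" a pure function of that file. This is a hypothetical shape for such a file, not the harness's actual schema; the dollar figures are illustrative.

```python
# Hypothetical single-file pricing table (illustrative values and schema,
# not the harness's actual file). Cost is a pure function of this table.
import json

PRICING_JSON = """
{
  "default": {
    "frontier": {"usd_per_mtok_in": 15.0,  "usd_per_mtok_out": 75.0},
    "worker":   {"usd_per_mtok_in": 0.25,  "usd_per_mtok_out": 1.25}
  }
}
"""

PRICING = json.loads(PRICING_JSON)

def call_cost(profile: str, tier: str,
              tokens_in: int, tokens_out: int) -> float:
    """USD cost of one call, read straight from the pricing table."""
    p = PRICING[profile][tier]
    return (tokens_in * p["usd_per_mtok_in"]
            + tokens_out * p["usd_per_mtok_out"]) / 1_000_000
```

If a vendor changes its rates, the only diff is in the JSON; every reported number can be regenerated from it.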

Tests lock the methodology

62 tests in the harness — if someone tweaks a synthesizer to "look better," CI fails. Workers can't leak foreign context; synth can't re-read source bodies; monolithic can't use a non-frontier model.
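The flavor of those invariants can be shown in miniature. These checks are illustrative, with hypothetical record fields; the real suite is the 62 tests in the repo.

```python
# Illustrative methodology invariants (hypothetical field names -- the
# real checks are the 62 tests in the harness). Each run record says
# which model a role used and which contexts it was shown.
FRONTIER_MODELS = {"claude-opus", "gpt-5"}

def check_run(run: dict) -> None:
    # Monolithic baseline must use a frontier model -- no sandbagging.
    assert run["monolithic"]["model"] in FRONTIER_MODELS
    # Each worker may only see its own assigned source, never a sibling's.
    for w in run["workers"]:
        assert w["context_sources"] == [w["assigned_source"]]
    # The synthesizer reads worker summaries only, never raw source bodies.
    assert not run["synthesizer"]["saw_source_bodies"]

check_run({
    "monolithic":  {"model": "claude-opus"},
    "workers":     [{"assigned_source": "a.txt",
                     "context_sources": ["a.txt"]}],
    "synthesizer": {"saw_source_bodies": False},
})
```

A "tweak the synthesizer to look better" change would flip one of these fields, and the corresponding check fails in CI.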

All three benchmarks live in the OMEGA repo under benchmarks/. The harnesses are Apache-2.0 licensed — fork, run, or adapt them for your own comparisons. We're interested in feedback, disagreements, and better task fixtures.

Questions? benchmarks@myomega.ai