
Benchmarks

1,700+ LLM evaluations across 10+ models, 3 providers, and 60+ independent test runs.

No model has ever been trained on GCF, yet every model tested reads it at least as well as the formats it was trained on.

| Format | Generic Profile | Graph Profile |
|---|---|---|
| GCF | 100% (6 frontier models) | 90.7% (10 models) |
| TOON | 92.3% | 68.5% |
| JSON | 91.2% | 53.6% |

| Metric | GCF | TOON | JSON |
|---|---|---|---|
| Token efficiency (15 datasets) | wins 13/15 | wins 2/15 | wins 0/15 |
| Generation (28 runs, 9 models) | 5/5 | 1.0/5 | 5.0/5 |

33,000,000,000+ round-trips, 0 failures.

Five benchmark suites, three providers (Anthropic, OpenAI, Google), zero training:

  1. Generic comprehension: 500-order nested data, 7 models. GCF 100% on every frontier model.
  2. Graph comprehension: 500-symbol code graphs, 10 models. GCF 90.7% where JSON drops to 53.6%.
  3. Scale test: At 1000 records, JSON doesn't fit in a 200K context. GCF is the only format that works on every 200K context model tested.
  4. Token efficiency: 15 real-world datasets. GCF wins 13/15 vs TOON, 25.5% fewer overall.
  5. Generation: Every frontier model produces valid GCF. TOON's decoder rejects output from 7/9 models.

All results reproducible.


Generic Profile: Standard Workloads

500 orders with nested customer objects and line items. 13 structured extraction questions. Zero format instructions. Deterministic answers, no LLM judge.

This is what most MCP tool responses look like: arrays of objects with nested metadata. The "normal" workload.

Generic Comprehension Accuracy

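| Model | Provider | GCF | JSON | TOON |
|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 100% | 100% | 100% |
| Claude Sonnet 4.6 | Anthropic | 100% | 100% | 100% |
| Claude Haiku 4.5 | Anthropic | 100% | 100% | 100% |
| GPT-5.5 | OpenAI | 100% | 100% | 92.3% |
| GPT-4o-mini | OpenAI | 69.2% | 61.5% | 69.2% |
| Gemini 2.5 Flash | Google | 100% | 76.9% | 84.6% |
| Gemini 3.5 Flash | Google | 100% | 100% | 100% |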

GCF achieves 100% on all six frontier models. It is the only format with no failures among them.

TOON fails on GPT-5.5 (count_premium_customers: got 250, expected 200). JSON fails on Gemini 2.5 Flash (3 counting questions wrong). On the weakest model (GPT-4o-mini), all three formats struggle on aggregation questions: GCF and TOON score 69.2%, JSON 61.5%.


Graph Profile: Under Structural Stress

500 symbols, 200 edges, zero format instructions. Code intelligence data with cross-references, distance groups, and provenance chains. This is the hard case: structurally complex data that tests whether a format scales.

Comprehension Accuracy by Model

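| Model | Runs | GCF avg | TOON avg | JSON avg |
|---|---|---|---|---|
| Claude Opus 4.6 | 2 | 96.2% | 84.6% | 73.1% |
| Claude Sonnet 4.6 | 2 | 100% | 73.1% | 53.8% |
| Claude Haiku 4.5 | 2 | 96.2% | 69.2% | 57.7% |
| GPT-5.5 | 5 | 84.1% | 67.7% | 45.8% |
| GPT-5.4 | 4 | 78.0% | 56.0% | 44.1% |
| GPT-5.4-mini | 2 | 71.8% | 64.1% | 54.2% |
| Gemini 2.5 Pro | 1 | 100% | 76.9% | 58.3% |
| Gemini 3.1 Pro | 1 | 100% | 76.9% | 46.2% |
| Gemini 3.5 Flash | 1 | 100% | 61.5% | 46.2% |
| Gemini 2.5 Flash | 3 | 80.6% | 54.6% | 57.0% |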

23 runs, 10 models, 3 providers. GCF wins 22, ties 1, loses 0.

When an agent receives data in JSON at this scale, it gets the wrong answer 46% of the time. With TOON, 32% of the time. With GCF, 9%.

Why GCF wins on complex data

GCF encodes answers structurally. "How many related symbols?" is answered by the section header ## related [167]. TOON and JSON force the model to scan 500 rows and count. The result: GCF errors are off by 1-2 (precision), TOON/JSON errors are off by 50-140 (comprehension failure).
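To make the difference concrete, here is a minimal sketch contrasting the two ways of answering "how many related symbols?". It assumes section headers have the shape `## name [count]`, inferred only from the single `## related [167]` example above. The row lines and the `callers` section in the sample are placeholders, not real GCF syntax, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// headerRe matches a section header shaped like "## related [167]".
// Assumption: inferred from the one example in the text, not from the GCF spec.
var headerRe = regexp.MustCompile(`^## (\S+) \[(\d+)\]$`)

// sample is a made-up payload. Only the header shape follows the text;
// the rows beneath each header are placeholders.
const sample = `## related [3]
row-a
row-b
row-c
## callers [1]
row-d`

// countFromHeader returns the count declared in the named section's header.
func countFromHeader(doc, section string) (int, bool) {
	for _, line := range strings.Split(doc, "\n") {
		m := headerRe.FindStringSubmatch(strings.TrimSpace(line))
		if m == nil || m[1] != section {
			continue
		}
		n, err := strconv.Atoi(m[2])
		return n, err == nil
	}
	return 0, false
}

// countByScanning counts the rows under the named section one by one,
// which is the work a reader must do when no count is declared.
func countByScanning(doc, section string) int {
	count, inSection := 0, false
	for _, line := range strings.Split(doc, "\n") {
		line = strings.TrimSpace(line)
		if m := headerRe.FindStringSubmatch(line); m != nil {
			inSection = m[1] == section
			continue
		}
		if inSection && line != "" {
			count++
		}
	}
	return count
}

func main() {
	declared, ok := countFromHeader(sample, "related")
	fmt.Println(declared, ok)
	fmt.Println(countByScanning(sample, "related"))
}
```

The header lookup stops at the first matching line, while the scan has to visit every row. For a language model, that scan is where counting errors of the kind described above would tend to creep in.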

See the full failure taxonomy for the complete analysis.


Scale Test: 1000 Orders

At production scale, format choice determines whether the task is possible at all.

Scale Test

| Model | Context | GCF (47K) | TOON (84K) | JSON (161K) |
|---|---|---|---|---|
| Claude Haiku 4.5 | 200K | 100% (13/13) | 100% (13/13) | IMPOSSIBLE |
| Claude Sonnet 4.6 | 200K | 92.3% (12/13) | IMPOSSIBLE | IMPOSSIBLE |
| Claude Opus 4.6 | 1M | 100% (13/13) | 100% (13/13) | 100% (13/13) |
| GPT-5.5 | - | 100% (6/6) | 100% (5/5) | 100% (6/6) |

JSON at 1000 records consumes 161K tokens. On 200K context models, this exceeds usable context and the task becomes impossible. TOON at 84K also exceeds the effective limit on Sonnet.

GCF encodes the same data in 47K tokens (71% smaller than JSON). This means:

  • On 200K models: GCF is the only format that reliably fits
  • On 1M models: all formats work, but GCF costs 71% less per API call
  • In agent loops: GCF leaves 150K+ tokens for conversation history, tool schemas, and reasoning
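The 71% and 150K+ figures follow from simple arithmetic on the payload sizes in the table. The sketch below uses the rounded token counts from the column headers, and the function names are illustrative. It models only raw subtraction from a nominal 200K window, so it does not capture whatever overhead (system prompt, tool schemas, output budget) makes the larger payloads unusable in practice.

```go
package main

import "fmt"

// Approximate token counts for the 1000-order payload, from the table headers.
const (
	gcfTokens  = 47_000
	toonTokens = 84_000
	jsonTokens = 161_000
)

// savingsPct returns how much smaller a is than b, as a percentage of b.
func savingsPct(a, b int) float64 {
	return 100 * float64(b-a) / float64(b)
}

// remaining returns the tokens left in a context window after the payload.
func remaining(window, payload int) int {
	return window - payload
}

func main() {
	fmt.Printf("GCF vs JSON: %.0f%% smaller\n", savingsPct(gcfTokens, jsonTokens))

	const window = 200_000
	payloads := []struct {
		name   string
		tokens int
	}{
		{"GCF", gcfTokens}, {"TOON", toonTokens}, {"JSON", jsonTokens},
	}
	for _, p := range payloads {
		fmt.Printf("%-4s leaves %d tokens of a %d-token window\n",
			p.name, remaining(window, p.tokens), window)
	}
}
```

With these inputs GCF leaves 153,000 tokens of headroom, TOON 116,000, and JSON 39,000.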

Token Efficiency: 15 Datasets

15 real-world datasets representing actual LLM tool response payloads. Same tokenizer (o200k_base), deterministic data, spec-compliant encoders.

Token Efficiency

| # | Dataset | GCF | TOON | GCF vs TOON |
|---|---|---|---|---|
| 1 | Employee records (flat) | 49,061 | 49,966 | -1.8% |
| 2 | E-commerce orders (nested) | 51,334 | 73,246 | -29.9% |
| 3 | Analytics time-series | 8,404 | 9,127 | -7.9% |
| 4 | GitHub repositories | 8,582 | 8,744 | -1.9% |
| 5 | Event logs (semi-uniform) | 95,635 | 154,032 | -37.9% |
| 6 | Nested config | 645 | 618 | +4.4% |
| 7 | LSP symbol search | 5,442 | 5,365 | +1.4% |
| 8 | PR file changes | 2,623 | 2,657 | -1.3% |
| 9 | Distributed trace | 4,318 | 4,959 | -12.9% |
| 10 | Database query results | 17,716 | 17,969 | -1.4% |
| 11 | File tree + diagnostics | 6,018 | 6,894 | -12.7% |
| 12 | Multi-tool composite | 3,131 | 3,192 | -1.9% |
| 13 | Order history (shared schemas) | 13,295 | 16,454 | -19.2% |
| 14 | Blast radius response | 6,561 | 7,831 | -16.2% |
| 15 | Comprehension eval payload | 41,213 | 60,603 | -32.0% |
| | TOTAL | 313,978 | 421,657 | -25.5% |

GCF wins 13/15 vs TOON. Two TOON wins: nested config (27 tokens, pure key-value tree) and LSP symbols (77 tokens, tokenizer artifact).
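The "GCF vs TOON" column is the relative token difference, `(GCF - TOON) / TOON`. A short sketch recomputing it for a few rows and the total, with token counts copied from the table (the `row` type and `deltaPct` name are illustrative):

```go
package main

import "fmt"

// row holds one dataset's token counts from the table above.
type row struct {
	name      string
	gcf, toon int
}

// rows is a subset of the 15 datasets plus the reported totals.
var rows = []row{
	{"E-commerce orders (nested)", 51334, 73246},
	{"Event logs (semi-uniform)", 95635, 154032},
	{"Nested config", 645, 618},
	{"LSP symbol search", 5442, 5365},
	{"TOTAL", 313978, 421657},
}

// deltaPct returns GCF's token count relative to TOON's, as a signed
// percentage: negative means GCF is smaller.
func deltaPct(gcf, toon int) float64 {
	return 100 * float64(gcf-toon) / float64(toon)
}

func main() {
	for _, r := range rows {
		fmt.Printf("%-28s %+.1f%% (%+d tokens)\n",
			r.name, deltaPct(r.gcf, r.toon), r.gcf-r.toon)
	}
}
```

The two positive rows are the TOON wins, with absolute gaps of 27 and 77 tokens. Both are small next to the 107,679-token gap in the totals.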

Dataset 15 is the exact payload used in the comprehension eval. The format that achieves 100% accuracy uses 32% fewer tokens.


Generation: Can LLMs Write It?

The model is given a natural-language description and a 3-line format primer. It must produce valid, decoder-parseable output. Tested at 5, 10, 20, 50, and 100 symbols.

Generation Validity by Model

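| Model | GCF | TOON | JSON |
|---|---|---|---|
| Claude Opus 4.6 | 5/5 | 0/5 | 5/5 |
| Claude Sonnet 4.6 | 5/5 | 2-3/5 | 5/5 |
| Claude Haiku 4.5 | 5/5 | 1-3/5 | 5/5 |
| GPT-5.5 | 4-5/5 | 1-2/5 | 5/5 |
| GPT-5.4 | 5/5 | 0/5 | 5/5 |
| GPT-5.4-mini | 5/5 | 0/5 | 5/5 |
| Gemini 2.5 Pro | 5/5 | 1/5 | 5/5 |
| Gemini 3.1 Pro | 5/5 | 0/5 | 5/5 |
| Gemini 3.5 Flash | 3/5 | 1/5 | 3/5 |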

Every model tested produces valid GCF, matching JSON on all but GPT-5.5 (4-5/5 versus 5/5), despite never having been trained on the format. TOON's official decoder rejects output on 7 of 9 models. The format's flat tabular design encodes semantic categories as integers, forcing a mapping no model performs unprompted.

GCF output is 63% smaller than JSON and 33% smaller than TOON at 100 symbols. Every output token costs money.

Output Size at Scale


Reproduce

All evals are in gcf-go/eval. All raw logs are in eval/results.

```bash
git clone https://github.com/blackwell-systems/gcf-go
cd gcf-go/eval

# Generic profile comprehension
GOWORK=off EVAL_FORMATS=gcf,json,toon EVAL_BACKEND=cli EVAL_MODEL=haiku EVAL_NUM_ORDERS=500 go test -run TestGenericComprehension -v -timeout 0

# Graph profile comprehension
GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=openai OPENAI_API_KEY=... EVAL_MODEL=gpt-5.5 GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=google GOOGLE_API_KEY=... EVAL_MODEL=gemini-2.5-flash GOWORK=off go test -run TestComprehension -v -timeout 0

# Generation
GOWORK=off go test -run "TestGeneration$|TestGenerationTOON|TestGenerationJSON" -v -timeout 0

# Token efficiency (15 datasets)
git clone https://github.com/blackwell-systems/toon-benchmark
cd toon-benchmark
node --experimental-strip-types benchmarks/scripts/token-efficiency-benchmark.ts
```

For detailed failure analysis, error taxonomy, and per-run data, see the full eval results.

100% comprehension. 71% fewer tokens. 1,700+ LLM evaluations.