Benchmarks (Full Data)

Complete data from all eval runs. For the summary, see Benchmarks.


Comprehension: All 23 Runs

500 symbols, 200 edges, 13 structured extraction questions, zero format instructions. Each run generates a fresh random payload.

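| Model | Run | GCF | TOON | JSON | GCF wins? |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 1 | 100% | 92.3% | 76.9% | yes |
| Claude Opus 4.6 | 2 | 92.3% | 76.9% | 69.2% | yes |
| Claude Sonnet 4.6 | 1 | 100% | 76.9% | 53.8% | yes |
| Claude Sonnet 4.6 | 2 | 100% | 69.2% | 53.8% | yes |
| Claude Haiku 4.5 | 1 | 92.3% | 69.2% | 61.5% | yes |
| Claude Haiku 4.5 | 2 | 100% | 69.2% | 53.8% | yes |
| GPT-5.5 | 1 | 91.7% | 66.7% | 50.0% | yes |
| GPT-5.5 | 2 | 76.9% | 69.2% | 46.2% | yes |
| GPT-5.5 | 3 | 76.9% | 69.2% | 46.2% | yes |
| GPT-5.5 | 4 | 91.7% | 66.7% | 50.0% | yes |
| GPT-5.5 | 5 | 83.3% | 66.7% | 36.4% | yes |
| GPT-5.4 | 1 | 75.0% | 58.3% | 41.7% | yes |
| GPT-5.4 | 2 | 76.9% | 53.8% | 46.2% | yes |
| GPT-5.4 | 3 | 76.9% | 53.8% | 38.5% | yes |
| GPT-5.4 | 4 | 76.9% | 58.3% | 50.0% | yes |
| GPT-5.4-mini | 1 | 76.9% | 61.5% | 58.3% | yes |
| GPT-5.4-mini | 2 | 66.7% | 66.7% | 50.0% | tied |
| Gemini 2.5 Flash | 1 | 76.9% | 58.3% | 53.8% | yes |
| Gemini 2.5 Flash | 2 | 75.0% | 50.0% | 57.1% | yes |
| Gemini 2.5 Flash | 3 | 90.0% | 55.6% | 60.0% | yes |
| Gemini 3.5 Flash | 1 | 100% | 61.5% | 46.2% | yes |
| Gemini 2.5 Pro | 1 | 100% | 76.9% | 58.3% | yes |
| Gemini 3.1 Pro | 1 | 100% | 76.9% | 46.2% | yes |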

23 runs, 10 models, 3 providers. GCF wins 22, ties 1, loses 0. Four models score 100% with GCF on every run: Sonnet, Gemini 2.5 Pro, Gemini 3.1 Pro, Gemini 3.5 Flash.

Score variance

Models with 2+ runs show how consistent each format is.

[Figure: Comprehension Variance]
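For readers without the chart, the run-to-run spread can be recomputed from the comprehension table. This sketch covers GCF scores only and uses min, max, and range as the consistency measure; that choice and the `spread` helper are assumptions for illustration, not necessarily what the chart plots.

```go
package main

import (
	"fmt"
	"sort"
)

// spread returns the minimum, maximum, and range of a non-empty score slice.
func spread(scores []float64) (lo, hi, rng float64) {
	lo, hi = scores[0], scores[0]
	for _, s := range scores[1:] {
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	return lo, hi, hi - lo
}

// gcfScores lists GCF comprehension scores (percent) for models with 2+ runs,
// transcribed from the comprehension table.
var gcfScores = map[string][]float64{
	"Claude Opus 4.6":   {100, 92.3},
	"Claude Sonnet 4.6": {100, 100},
	"Claude Haiku 4.5":  {92.3, 100},
	"GPT-5.5":           {91.7, 76.9, 76.9, 91.7, 83.3},
	"GPT-5.4":           {75.0, 76.9, 76.9, 76.9},
	"GPT-5.4-mini":      {76.9, 66.7},
	"Gemini 2.5 Flash":  {76.9, 75.0, 90.0},
}

func main() {
	// Sort model names so the output order is stable.
	models := make([]string, 0, len(gcfScores))
	for m := range gcfScores {
		models = append(models, m)
	}
	sort.Strings(models)
	for _, m := range models {
		lo, hi, rng := spread(gcfScores[m])
		fmt.Printf("%-18s min=%.1f max=%.1f range=%.1f\n", m, lo, hi, rng)
	}
}
```

The same helper applies unchanged to the TOON and JSON columns.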

GCF advantage by model tier

The advantage grows on weaker models. Frontier models can brute-force flat data. Smaller models cannot.

[Figure: GCF Advantage by Tier]

Token cost vs accuracy

GCF is in the top-left: fewer tokens, higher accuracy.

[Figure: Token Cost vs Accuracy]


Failure Taxonomy

Classified from all FAIL lines across 23 runs (13 questions in each of 3 formats, 39 graded answers per run).

[Figure: Error Magnitude]

GCF median error: 4. TOON median error: 53. JSON median error: 56. GCF encodes answers structurally (## related [167]). TOON/JSON force the model to compute them from raw data.
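A median error of this kind is the median of |answer - expected| over the failed numeric answers. The sketch below shows that calculation under stated assumptions: `answer` and `medianAbsError` are illustrative names, and the sample is only the handful of TOON distance-grouping answers quoted in the per-model table later on this page (expected 166), not the full set of FAIL lines, so its result is not expected to match the 53 reported above.

```go
package main

import (
	"fmt"
	"sort"
)

// answer pairs a model's numeric reply with the ground truth.
type answer struct{ expected, got int }

// medianAbsError returns the median of |got - expected|, or 0 for empty input.
func medianAbsError(answers []answer) float64 {
	if len(answers) == 0 {
		return 0
	}
	errs := make([]int, len(answers))
	for i, a := range answers {
		d := a.got - a.expected
		if d < 0 {
			d = -d
		}
		errs[i] = d
	}
	sort.Ints(errs)
	mid := len(errs) / 2
	if len(errs)%2 == 1 {
		return float64(errs[mid])
	}
	// Even count: average the two middle values.
	return float64(errs[mid-1]+errs[mid]) / 2
}

// toonSample is an illustrative subset, not the full eval data.
var toonSample = []answer{
	{166, 100}, {166, 200}, {166, 214}, // Haiku 4.5
	{166, 169}, {166, 229}, {166, 200}, // GPT-5.4
	{166, 26}, {166, 28}, // GPT-5.4-mini
}

func main() {
	fmt.Println(medianAbsError(toonSample)) // prints 55.5
}
```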

GCF failures: precision errors

GCF fails on precision (off by 1-2). The structure is understood; the count is slightly misread.

| Type | Count | Models | Cause |
|---|---|---|---|
| Off-by-1-2 header misread | 5 | Haiku (1), GPT-5.4 (3), mini (1) | Header says [167], model reads 166. Tokenization artifact. |
| Column scan miscount | 10 | GPT-5.4 (7), mini (3) | Must scan fn kind across rows. function_count=84 deterministically. |
| Field confusion | 2 | GPT-5.4 (1), mini (1) | Read symbol count instead of edge count. |
| Empty response | 10 | GPT-5.5 (10) | Context overwhelm at 53k+ input tokens. |

TOON failures: comprehension errors

TOON fails on comprehension (wrong by 50-140). The model cannot filter a flat list by column value at scale.

| Type | Count | Models | Cause |
|---|---|---|---|
| Distance grouping failure | 25 | Opus/Sonnet (3), Haiku (6), GPT-5.4 (11), mini (5) | Must scan 500 rows and filter by distance column. Wildly inconsistent answers. |
| Round-number guessing | 7 | Haiku (1), mini (6) | Model gives up counting and guesses "100". |
| Attention decay (last row) | 5 | Opus/Sonnet (1), Haiku (1), GPT-5.4 (3) | last_symbol_kind wrong. Loses track at row 500. |
| Empty response | 20 | GPT-5.5 (20) | Context overwhelm. Same as JSON. |

JSON failures: structural overwhelm

JSON fails on structure (empty responses, massive undercounts, chain-of-thought enumeration). The format itself prevents comprehension at scale.

| Type | Count | Models | Cause |
|---|---|---|---|
| Empty string response | 33 | GPT-5.5 (33) | 53k tokens of repeated {"qualifiedName":...} overwhelms attention. |
| Massive undercount | 9 | Opus/Sonnet (3), Haiku (1), GPT-5.4 (4), mini (1) | Field-name repetition dilutes signal. |
| Distance filter failure | 29 | Opus/Sonnet (7), Haiku (6), GPT-5.4 (11), mini (5) | Must parse JSON objects AND filter by field value. |
| Field confusion | 3 | GPT-5.4 (3) | last_symbol_kind reads edge type instead of kind. |
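Some of the categories in the three tables above can be approximated mechanically from a failed answer and its ground truth. The sketch below is a hypothetical first-pass triage, not the classifier used to build these tables: the thresholds (off by at most 2 counts as a precision error, a multiple of 100 counts as a round-number guess) are assumptions, and categories such as field confusion or attention decay would still need manual review.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// classify buckets one answer to a numeric question using simple heuristics.
func classify(reply string, expected int) string {
	reply = strings.TrimSpace(reply)
	if reply == "" {
		return "empty response"
	}
	got, err := strconv.Atoi(reply)
	if err != nil {
		return "non-numeric"
	}
	diff := got - expected
	if diff < 0 {
		diff = -diff
	}
	switch {
	case diff == 0:
		return "correct"
	case diff <= 2:
		return "precision error" // e.g. header says 167, model answers 166
	case got%100 == 0:
		return "round-number guess" // e.g. model answers 100 or 200
	default:
		return "comprehension error"
	}
}

func main() {
	fmt.Println(classify("166", 167)) // prints precision error
	fmt.Println(classify("", 167))    // prints empty response
	fmt.Println(classify("100", 166)) // prints round-number guess
	fmt.Println(classify("143", 167)) // prints comprehension error
}
```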

Failure distribution by format

[Figure: Failure Types (Pie)]

Failures by model tier

[Figure: Failure Types by Model]

| Model | GCF failure mode | TOON failure mode | JSON failure mode |
|---|---|---|---|
| Opus/Sonnet | None | Off-by-2 extended_count; last_symbol_kind wrong (attention decay at row 500) | Undercounts (356 vs 500); 143-line chain-of-thought enumeration, still wrong answer |
| Haiku 4.5 | Off-by-1 (1 of 2 runs) | Distance grouping (100, 200, 214 vs 166); last_symbol_kind wrong | Undercounts; distance filter failures |
| GPT-5.5 | Empty strings (context overwhelm at 53k input tokens) | Empty strings; distance grouping failures | Returns nothing on most questions (53k tokens of repeated field names overwhelms attention) |
| GPT-5.4 | Deterministic: edge_count=198, function_count=84 every run | Distance grouping wildly inconsistent (169, 229, 200 vs 166); round-number guessing | symbol_count 326-404; massive undercounts everywhere |
| GPT-5.4-mini | Same as 5.4 (198, 84) plus larger misses (250, 100) | Worst distance grouping (26, 28 vs 166); defaults to round-number guessing | 300 vs 500 symbol_count; consistent failure across all question types |

GCF failures on Claude are near-zero. GCF failures on OpenAI are deterministic and repeatable (same wrong number every run), suggesting a tokenizer-level parsing difference rather than a comprehension issue.

Artifacts: What JSON failure looks like in practice

When asked "how many symbols have distance 1 (related)?", the answer is 167. GCF encodes it in the section header: ## related [167]. The model reads it directly.

JSON has no structural grouping. The model must scan 500 JSON objects, filter by "Distance": 1, and count. On two separate runs, Claude Opus 4.6 (the top-tier Claude model in these runs) responded by manually enumerating every symbol:

Run 1 (full artifact):

"Let me count precisely by going through the list:
1. handler.Response.Notify
2. model.SubscribeConfig
3. service.PublishOptions
...
143. store.DispatchConfig

So: 143."

143 lines of output tokens. Wrong answer (expected 167).

Run 2 (full artifact):

"Let me count systematically. The symbols list transitions from Distance 0 to Distance 1 at handler.Response.Notify...
1. handler.Response.Notify
2. model.SubscribeConfig
...
119. store.DispatchConfig
120. cache.ExecuteOptions -- wait, this is Distance 2.

So: 119."

119 lines. Wrong again (expected 167). Different random payload, same failure mode. The model even caught itself mid-count ("wait, this is Distance 2") and still got it wrong.

This is JSON's structural problem: it forces LLMs to perform manual enumeration at scale, burning output tokens on a task the format should have answered structurally. GCF answers the same question from a 3-character header lookup.
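The difference between the two access patterns can be sketched in code. This is an illustration only: the sole GCF syntax assumed is the `## related [167]` header quoted above (the rest of the grammar is not shown on this page), the JSON field names `qualifiedName` and `Distance` are taken from the excerpts above, and `countFromHeader` and `countByScan` are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
)

// headerRe matches a section header of the form "## related [167]".
var headerRe = regexp.MustCompile(`^## (\w+) \[(\d+)\]$`)

// countFromHeader reads the group name and count directly from one header line.
func countFromHeader(line string) (string, int, bool) {
	m := headerRe.FindStringSubmatch(line)
	if m == nil {
		return "", 0, false
	}
	n, err := strconv.Atoi(m[2])
	if err != nil {
		return "", 0, false
	}
	return m[1], n, true
}

// symbol mirrors the two JSON fields quoted in the text.
type symbol struct {
	QualifiedName string `json:"qualifiedName"`
	Distance      int    `json:"Distance"`
}

// countByScan decodes every object and counts those at the given distance.
func countByScan(data []byte, distance int) (int, error) {
	var syms []symbol
	if err := json.Unmarshal(data, &syms); err != nil {
		return 0, err
	}
	n := 0
	for _, s := range syms {
		if s.Distance == distance {
			n++
		}
	}
	return n, nil
}

func main() {
	name, n, _ := countFromHeader("## related [167]")
	fmt.Println(name, n) // prints related 167

	payload := []byte(`[
		{"qualifiedName": "handler.Response.Notify", "Distance": 1},
		{"qualifiedName": "model.SubscribeConfig", "Distance": 1},
		{"qualifiedName": "cache.ExecuteOptions", "Distance": 2}
	]`)
	c, err := countByScan(payload, 1)
	fmt.Println(c, err) // prints 2 <nil>
}
```

A program can do either reliably. The header form reduces the question to a single read, while the flat list requires touching every record, which is the step the models above get wrong at 500 objects.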


Generation: All Runs

GCF validity across all models

[Figure: Generation Validity]

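| Model | 5 sym | 10 sym | 20 sym | 50 sym | 100 sym | Score | Runs |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | YES | YES | YES | YES | YES | 5/5 | 2 (zero variance) |
| Claude Sonnet 4.6 | YES | YES | YES | YES | YES | 5/5 | 2 |
| Claude Haiku 4.5 | YES | YES | YES | YES | YES | 5/5 | 2 |
| GPT-5.5 | YES | YES | YES | YES | 4-5/5 | 4-5/5 | 2 |
| GPT-5.4 | YES | YES | YES | YES | YES | 5/5 | 1 |
| GPT-5.4-mini | YES | YES | YES | YES | YES | 5/5 | 2 (zero variance) |
| Gemini 2.5 Pro | YES | YES | YES | YES | YES | 5/5 | 2 (zero variance) |
| Gemini 3.1 Pro | YES | YES | YES | YES | YES | 5/5 | 1 |
| Gemini 3.1 Flash Lite | YES | YES | YES | YES | 4-5/5 | 4-5/5 | 3 |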

Three-way comparison

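| Model | GCF | TOON (natural) | JSON | Runs |
|---|---|---|---|---|
| Claude Opus 4.6 | 5/5 | 0/5 | 5/5 | 2 (zero variance) |
| Claude Sonnet 4.6 | 5/5 | 2-3/5 | 5/5 | 2 |
| Claude Haiku 4.5 | 5/5 | 1-3/5 | 5/5 | 2 |
| GPT-5.5 | 4-5/5 | 1-2/5 | 5/5 | 2 |
| GPT-5.4 | 5/5 | 0/5 | 5/5 | 1 |
| GPT-5.4-mini | 5/5 | 0/5 | 5/5 | 2 (zero variance) |
| Gemini 2.5 Pro | 5/5 | 1/5 | 5/5 | 2 (zero variance) |
| Gemini 3.1 Pro | 5/5 | 0/5 | 5/5 | 1 |
| Gemini 3.1 Flash Lite | 4-5/5 | 0/5 | 4-5/5 | 3 |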

TOON generation heatmap

[Figure: TOON Heatmap]

TOON is a fundamentally fragile format

TOON requires special handling by the caller to produce valid results. When given the same natural-language description that GCF and JSON handle without issue, TOON's official decoder rejects the output on 7 of 9 models. The format's flat tabular design encodes semantic categories as integers, forcing an encoding step that no model performs unprompted. This isn't a prompt engineering problem; it's a structural design flaw.
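The encoding step in question is the caller-side mapping from semantic labels to integer codes. A minimal sketch of such a mapping follows. Only `related` = 1 is confirmed by the text above; the labels `target` and `extended` and their codes are hypothetical placeholders, as are the names `distanceCode` and `encodeDistances`.

```go
package main

import "fmt"

// distanceCode maps semantic labels to integer codes.
// Only "related" = 1 comes from the text; the other entries are hypothetical.
var distanceCode = map[string]int{
	"target":   0,
	"related":  1,
	"extended": 2,
}

// encodeDistances converts labels to integer codes, failing on any unknown label.
func encodeDistances(labels []string) ([]int, error) {
	out := make([]int, len(labels))
	for i, l := range labels {
		code, ok := distanceCode[l]
		if !ok {
			return nil, fmt.Errorf("unknown distance label %q", l)
		}
		out[i] = code
	}
	return out, nil
}

func main() {
	codes, err := encodeDistances([]string{"related", "related", "extended"})
	fmt.Println(codes, err) // prints [1 1 2] <nil>
}
```

When this mapping lives only in the caller's head and the prompt uses the labels, the model has to invent the codes itself, which is the failure described above.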

When we explicitly pre-encode distances as integers in the prompt (hand-holding the model through TOON's internal mapping), performance improves on some models but remains inconsistent. Even in the best case, TOON output is 39% larger than GCF at 100 symbols (8,336 B vs 5,984 B); equivalently, GCF output is 28% smaller than TOON's.

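| Format | Prompt | Valid | 100 sym output | vs JSON |
|---|---|---|---|---|
| GCF | natural labels | 5/5 | 5,984 B | 78% fewer |
| TOON | hand-held (integers) | 5/5 | 8,336 B | 69% fewer |
| TOON | natural labels | 0/5 | - | - |
| JSON | natural labels | 5/5 | 16,121 B | baseline |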
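The relative sizes follow from the byte counts in the table. The sketch below is plain arithmetic on those three numbers; `reduction` is an illustrative helper. On bytes alone it gives roughly 63% (GCF) and 48% (TOON) savings against JSON, so the "vs JSON" column is presumably measured on a different basis, such as tokens. That reading is an assumption, not something this page states.

```go
package main

import "fmt"

// reduction returns how much smaller size is than baseline, in percent.
func reduction(size, baseline float64) float64 {
	return (1 - size/baseline) * 100
}

func main() {
	// Output sizes in bytes at 100 symbols, from the table above.
	const gcf, toon, jsonBytes = 5984.0, 8336.0, 16121.0

	fmt.Printf("GCF vs TOON:  %.1f%% smaller\n", reduction(gcf, toon))
	fmt.Printf("TOON vs GCF:  %.1f%% larger\n", (toon/gcf-1)*100)
	fmt.Printf("GCF vs JSON:  %.1f%% smaller (bytes)\n", reduction(gcf, jsonBytes))
	fmt.Printf("TOON vs JSON: %.1f%% smaller (bytes)\n", reduction(toon, jsonBytes))
}
```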

GCF is robust. It works with natural-language descriptions, pre-encoded values, and everything in between. The format aligns with how models naturally express grouped data. TOON requires the caller to know its internal encoding and pre-process every categorical field before the model can write valid output. Any time a column encodes a semantic category as an integer, TOON is one prompt change away from producing invalid data.

Output size at scale

[Figure: Output Cost at Scale]


Methodology

  • 500 symbols, 200 edges for comprehension; 5-100 symbols for generation
  • 13 extraction questions with deterministic ground truth (no LLM judge)
  • OpenAI runs used default temperature (non-zero); EVAL_TEMPERATURE=0 available for deterministic runs
  • Each run generates a fresh random payload with different symbol names and edge distributions
  • Claude evals via claude -p CLI with --model flag
  • OpenAI evals via chat completions API with exponential backoff on 429s
  • Google evals via generativelanguage API with retry logic (free tier: 5 RPM)
  • TOON validation uses the official toon-go library
  • All raw logs in eval/results
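The "exponential backoff on 429s" item can be sketched as follows. This is a generic illustration, not the code in the eval harness: `withBackoff` is a hypothetical helper, the request is abstracted as a function returning an HTTP status code, and the base delay and attempt count are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// withBackoff calls do until it returns a status other than 429 or the
// attempts run out, doubling the wait after each rate-limited attempt.
func withBackoff(attempts int, base time.Duration, do func() (int, error)) (int, error) {
	delay := base
	for i := 0; i < attempts; i++ {
		status, err := do()
		if err != nil {
			return status, err // transport errors are not retried in this sketch
		}
		if status != http.StatusTooManyRequests {
			return status, nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return http.StatusTooManyRequests, errors.New("rate limited: retries exhausted")
}

func main() {
	calls := 0
	// Fake request: rate limited twice, then succeeds.
	status, err := withBackoff(5, 10*time.Millisecond, func() (int, error) {
		calls++
		if calls <= 2 {
			return http.StatusTooManyRequests, nil
		}
		return http.StatusOK, nil
	})
	fmt.Println(status, err, calls) // prints 200 <nil> 3
}
```

For a fixed quota such as the 5 RPM free tier mentioned above, a base delay near the quota window (about 12 seconds per request) is a reasonable starting point, though the right value depends on the provider's limits.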

Reproduce

```bash
git clone https://github.com/blackwell-systems/gcf-go
cd gcf-go/eval

# Comprehension
GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=openai OPENAI_API_KEY=... EVAL_MODEL=gpt-5.5 GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=google GOOGLE_API_KEY=... EVAL_MODEL=gemini-2.5-flash GOWORK=off go test -run TestComprehension -v -timeout 0

# Generation (all three formats)
GOWORK=off go test -run "TestGeneration$|TestGenerationTOON|TestGenerationJSON" -v -timeout 0

# Token efficiency (TOON's benchmark)
git clone https://github.com/blackwell-systems/toon.git
cd toon && git checkout gcf-comparison
cd benchmarks && pnpm install && pnpm benchmark:tokens
```