Skip to content

Tokenizer Analysis

GCF token savings are consistent across every model tokenizer tested. This page presents the evidence.

Summary

MetricResult
Tokenizers tested8
Providers covered6 (Anthropic, OpenAI, Meta, Google, Alibaba, Mistral)
GCF savings range50-59% across all tokenizers
Savings std deviation< 3 percentage points
GCF delimiter consistency100% (pipe, @, <, ## are single tokens on every tokenizer)

Tokenizers Tested

TokenizerProviderModel Family
Claude tokenizerAnthropicClaude 3.5, 4.x
cl100k_baseOpenAIGPT-4
o200k_baseOpenAIGPT-4o, GPT-5.x
LLaMA 3.1 tokenizerMetaLLaMA 3.x
Qwen 2.5 tokenizerAlibabaQwen 2.5
DeepSeek V3 tokenizerDeepSeekDeepSeek V3
Gemma 2 tokenizerGoogleGemma 2
Mistral Nemo tokenizerMistralMistral/Ministral

Savings by Tokenizer (Generic Profile, 500 Orders)

TokenizerGCF TokensJSON TokensSavings
Claude (Anthropic)44,09996,61954.4%
GPT-4 (OpenAI)40,38397,84858.7%
GPT-4o (OpenAI)41,38298,34857.9%
LLaMA 3.1 (Meta)40,38497,84958.7%
Qwen 2.5 (Alibaba)52,595109,16851.8%
DeepSeek V344,234101,84856.6%
Gemma 2 (Google)55,301124,62055.6%
Mistral Nemo55,998112,56950.3%

Every tokenizer produces 50%+ savings. The worst case (Mistral Nemo, 50.3%) still halves the token count.

Savings Consistency Across Scales

PayloadMin SavingsMax SavingsMean SavingsSpread
10 orders51.0%57.6%55.0%6.6pp
50 orders51.5%59.2%56.3%7.7pp
100 orders51.4%59.3%56.2%7.9pp
500 orders50.3%58.7%55.5%8.5pp

Savings are stable from 10 to 500 records. The spread stays under 9 percentage points across all 8 tokenizers at every scale.

Graph Profile Token Variance

PayloadMin TokensMax TokensMeanCV
10 symbols, 5 edges2122402224.3%
50 symbols, 25 edges1,0001,2361,0868.3%
100 symbols, 50 edges1,9822,4852,1669.0%
500 symbols, 200 edges9,45013,01310,75413.7%

At small scale, GCF graph payloads tokenize nearly identically (4.3% CV). Variance grows at scale but remains moderate.

Syntactic Analysis: Why It Works

GCF uses only basic ASCII delimiters. Here is how each tokenizer handles the core GCF syntax:

Edge declaration: @0<@2|implements

TokenizerTokensSplit
Claude7@ 0 < @ 2 `
GPT-47@ 0 < @ 2 `
GPT-4o7@ 0 < @ 2 `
LLaMA 3.17@ 0 < @ 2 `
Qwen 2.57@ 0 < @ 2 `
DeepSeek V38@ 0 < @ 2 `
Gemma 27@ 0 < @ 2 `
Mistral Nemo8@ 0 < @ 2 `

All 8 tokenizers split GCF structural characters identically: @, <, | are always single tokens. The only variance is in the value implements (1 vs 2 tokens), which doesn't affect parsing.

Symbol row: @0|function|auth.validateToken|0.95|definition

TokenizerTokensKey Differences
GPT-414Merges .validate (1 tok), 95 (1 tok)
Qwen 2.515Splits 959 + 5
Gemma 216Splits . + validate, splits 9 + 5

The pipe delimiters are always single tokens. Variance comes from how tokenizers handle:

  • Dot-prefixed words: .validate (1 tok on GPT-4) vs . + validate (2 tok on Gemma)
  • Two-digit numbers: 95 (1 tok on GPT-4) vs 9 + 5 (2 tok on Qwen/Gemma)
  • Comma-prefixed chars: ,q (1 tok on GPT-4) vs , + q (2 tok on Gemma)

Section header: ## symbols [3]{id,kind,qname,score,provenance}

All tokenizers produce 17-18 tokens. ##, [, ], {, } are consistently handled. Variance is only in how field names merge with adjacent punctuation.

Delimiter Character Universality

Every character in GCF's grammar is a single token on every tokenizer tested:

CharacterPurpose in GCFSingle token on all 8?
| (pipe)Field delimiterYes
@Symbol ID prefixYes
<Edge directionYes
#Section headerYes
{Schema openYes
}Schema closeYes
[Count openYes
]Count closeYes
,Schema field separatorYes
\nRow separatorYes

10 characters, 8 tokenizers, 80 checks, zero exceptions. GCF's grammar characters are universally unambiguous at the token level.

JSON Tokenization Comparison

The same data encoded as JSON shows identical variance patterns. This proves the variance is content-driven, not format-driven.

Example: Full symbol object

JSON: {"id": 0, "kind": "function", "qualified_name": "auth.validateToken", "score": 0.95, "provenance": "definition"}

TokenizerTokensKey variance
Claude37Splits _ + name, . + validate
GPT-437Merges _name, .validate
GPT-4o37Same as GPT-4
LLaMA 3.137Same as GPT-4
Qwen 2.538Splits 9 + 5
DeepSeek V337Same as GPT-4
Gemma 239Splits _ + name, . + validate, 9 + 5
Mistral Nemo39Splits qual + ified, 9 + 5

JSON variance: 37-39 tokens (5.4% spread).

GCF equivalent: @0|function|auth.validateToken|0.95|definition

TokenizerTokens
GPT-414
Qwen 2.515
Gemma 216

GCF variance: 14-16 tokens (14% spread proportionally, but 2 tokens absolute).

The variance sources are identical: dot-splitting, number-splitting, word boundaries. JSON has the same patterns. GCF just has fewer total tokens to vary.

Example: Edge array (2 edges)

JSON: [{"target": 0, "source": 2, "type": "implements"}, {"target": 1, "source": 2, "type": "implements"}]

TokenizerJSON tokens
Claude33
GPT-438
DeepSeek V340
Mistral Nemo40

JSON variance: 33-40 tokens (21% spread).

GCF equivalent: Two lines: @0<@2|implements and @1<@2|implements

TokenizerGCF tokens
GPT-414
DeepSeek V316
Mistral Nemo16

GCF variance: 14-16 tokens (14% spread), but only 2 tokens absolute difference.

JSON's variance is larger in both absolute and proportional terms on edge data, because JSON repeats "target":, "source":, "type": for each edge, and each of those key-value pairs can split differently.

ASCII Delimiter Space Analysis

We tested every printable ASCII character (codes 33-126) against all 8 tokenizers on two criteria:

  1. Single token in isolation: does the character encode as exactly 1 token?
  2. Never merges with adjacent text: does Xfoo tokenize as X + foo (separate) or Xfoo (merged)?

A "perfect" delimiter satisfies both: it's always 1 token and never gets absorbed into neighboring content.

Results: 74 of 94 printable ASCII characters are perfect delimiters. 20 characters merge with adjacent text on at least one tokenizer.

Characters that merge (avoid as delimiters)

CharacterWhy it merges
(Merges into (foo on some tokenizers
-Merges into -token, -based, -style
.Merges into .validate, .com, .json
/Merges into /api, /path
_Merges into _name, _id, _count
a-f, i, l, o, p, r, t, uCommon lowercase letters that form subword prefixes

These are exactly the characters tokenizers aggressively merge into subwords. Any format using dots, dashes, or underscores as structural delimiters will have tokenizer-dependent behavior.

GCF's delimiters: all perfect

CharacterSingle token (8/8)Never merges (8/8)Role in GCF
|YesYesField delimiter
@YesYesSymbol ID prefix
<YesYesEdge direction
#YesYesSection header
{YesYesSchema open
}YesYesSchema close
[YesYesCount open
]YesYesCount close
,YesYesSchema field separator

Every GCF grammar character is in the perfect category. This was a deliberate design choice.

What this means for competing formats

JSON uses . (dot), : (colon), and " (quote) as structural characters. Dots merge with adjacent text (.validate becomes 1 token on GPT-4 but 2 on Gemma). Quotes create "key": sequences that tokenize inconsistently. The format's grammar characters are in the merging category.

TOON inherits YAML's indentation-sensitive structure. Whitespace-based delimiting is inherently tokenizer-dependent because tokenizers handle leading spaces, tabs, and indentation levels differently.

GCF uses only characters from the non-merging set. The format's token efficiency is not an accident of one tokenizer's vocabulary; it's a structural property of the delimiter choices.

Design Rationale: Why These Specific Delimiters

From the 74 perfect candidates, GCF chose based on readability and semantic meaning:

CharacterWhy chosenAlternative consideredWhy not
| (pipe)Rare in natural text. Visually distinct column separator.Tab (\t)Invisible, harder to debug
@Establishes "this is an ID" semantically.$Also perfect, but less intuitive for IDs
##Two-char sequence tokenizers merge into one token. Markdown-familiar.===3 chars, less efficient
<Reads as "points to" for edges.~Also perfect, but less semantic
\nUniversal row separator. Zero overhead.;Less readable
,Schema field separator. Familiar from CSV.:Conflicts with potential value content

Grammar Swap Experiment

To prove GCF's savings are structural (positional fields, keys declared once) and not an artifact of specific delimiter choices, we swapped the entire grammar and re-measured.

Method

5 delimiter sets, all using characters from the "perfect" category:

SetFieldIDEdgeSectionSchema
GCF (actual)|@<##{,}
Alt A~$>%%(;)
Alt B^!=&&{:}
Alt C`#~!![|]
Alt D;%^$${+}

Each set tested against 5 payload types (employees, orders, logs, code symbols, mixed nested) at 4 sizes (10, 50, 100, 500 records) across all 8 tokenizers. 800 total measurements.

Results

Delimiter set10 records50 records100 records500 recordsOverall
GCF (actual)59.1%60.9%61.0%60.5%60.6%
Alt set A58.7%60.6%60.7%60.2%60.3%
Alt set B59.1%61.0%61.1%60.6%60.7%
Alt set C58.7%60.6%60.7%60.2%60.3%
Alt set D58.8%60.7%60.8%60.3%60.4%

Spread: 0.4 percentage points across all delimiter sets. The choice of delimiter character has negligible effect on savings.

Per-tokenizer consistency

TokenizerGCFAlt AAlt BAlt CAlt D
Claude60.8%60.4%60.8%60.4%60.4%
GPT-463.2%62.8%63.2%62.8%62.8%
GPT-4o63.1%63.1%63.5%63.1%63.1%
LLaMA 3.163.2%62.8%63.2%62.8%62.8%
Qwen 2.556.3%55.9%56.3%55.9%55.9%
DeepSeek V363.4%62.9%62.9%62.9%63.7%
Gemma 260.2%59.9%60.2%59.9%59.9%
Mistral Nemo56.0%56.0%56.5%56.0%56.0%

Variation across delimiter sets: 0.0-0.8pp per tokenizer.

What this proves

GCF's token savings come from the encoding structure (keys declared once, values positional), not from any specific delimiter character. You could replace every delimiter in the grammar and get the same compression. The format's efficiency is a mathematical property of eliminating key repetition.

Summary

GCF's grammar is tokenizer-invariant. The delimiters were chosen from the perfect-category character set (single token, never merges) as a robustness guarantee, but the savings themselves are structural. The grammar swap experiment proves this: 0.4pp spread across 5 completely different delimiter sets, 800 measurements.

The comprehension eval data (24 stress-scale runs, 27 generic runs across Claude, GPT, Gemini, and Mistral) confirms this at the behavioral level: no provider-specific anomalies in how models read GCF.

Reproduce

All experiments are reproducible:

bash
cd gcf

# Tokenizer variance analysis
node eval/tokenizer-variance.mjs

# Grammar swap experiment
node eval/grammar-swap-experiment.mjs

```bash
cd gcf
npm install @blackwell-systems/gcf @lenml/tokenizers \
  @lenml/tokenizer-claude @lenml/tokenizer-gpt4 @lenml/tokenizer-gpt4o \
  @lenml/tokenizer-llama3_1 @lenml/tokenizer-qwen2_5 \
  @lenml/tokenizer-deepseek_v3 @lenml/tokenizer-gemma2 \
  @lenml/tokenizer-mistral_nemo

node eval/tokenizer-variance.mjs

100% comprehension. 71% fewer tokens. 2,400+ LLM evaluations.