Tokenizer Analysis
GCF token savings are consistent across every model tokenizer tested. This page presents the evidence.
Summary
| Metric | Result |
|---|---|
| Tokenizers tested | 8 |
| Providers covered | 6 (Anthropic, OpenAI, Meta, Google, Alibaba, Mistral) |
| GCF savings range | 50-59% across all tokenizers |
| Savings std deviation | < 3 percentage points |
| GCF delimiter consistency | 100% (pipe, @, <, ## are single tokens on every tokenizer) |
Tokenizers Tested
| Tokenizer | Provider | Model Family |
|---|---|---|
| Claude tokenizer | Anthropic | Claude 3.5, 4.x |
| cl100k_base | OpenAI | GPT-4 |
| o200k_base | OpenAI | GPT-4o, GPT-5.x |
| LLaMA 3.1 tokenizer | Meta | LLaMA 3.x |
| Qwen 2.5 tokenizer | Alibaba | Qwen 2.5 |
| DeepSeek V3 tokenizer | DeepSeek | DeepSeek V3 |
| Gemma 2 tokenizer | Gemma 2 | |
| Mistral Nemo tokenizer | Mistral | Mistral/Ministral |
Savings by Tokenizer (Generic Profile, 500 Orders)
| Tokenizer | GCF Tokens | JSON Tokens | Savings |
|---|---|---|---|
| Claude (Anthropic) | 44,099 | 96,619 | 54.4% |
| GPT-4 (OpenAI) | 40,383 | 97,848 | 58.7% |
| GPT-4o (OpenAI) | 41,382 | 98,348 | 57.9% |
| LLaMA 3.1 (Meta) | 40,384 | 97,849 | 58.7% |
| Qwen 2.5 (Alibaba) | 52,595 | 109,168 | 51.8% |
| DeepSeek V3 | 44,234 | 101,848 | 56.6% |
| Gemma 2 (Google) | 55,301 | 124,620 | 55.6% |
| Mistral Nemo | 55,998 | 112,569 | 50.3% |
Every tokenizer produces 50%+ savings. The worst case (Mistral Nemo, 50.3%) still halves the token count.
Savings Consistency Across Scales
| Payload | Min Savings | Max Savings | Mean Savings | Spread |
|---|---|---|---|---|
| 10 orders | 51.0% | 57.6% | 55.0% | 6.6pp |
| 50 orders | 51.5% | 59.2% | 56.3% | 7.7pp |
| 100 orders | 51.4% | 59.3% | 56.2% | 7.9pp |
| 500 orders | 50.3% | 58.7% | 55.5% | 8.5pp |
Savings are stable from 10 to 500 records. The spread stays under 9 percentage points across all 8 tokenizers at every scale.
Graph Profile Token Variance
| Payload | Min Tokens | Max Tokens | Mean | CV |
|---|---|---|---|---|
| 10 symbols, 5 edges | 212 | 240 | 222 | 4.3% |
| 50 symbols, 25 edges | 1,000 | 1,236 | 1,086 | 8.3% |
| 100 symbols, 50 edges | 1,982 | 2,485 | 2,166 | 9.0% |
| 500 symbols, 200 edges | 9,450 | 13,013 | 10,754 | 13.7% |
At small scale, GCF graph payloads tokenize nearly identically (4.3% CV). Variance grows at scale but remains moderate.
Syntactic Analysis: Why It Works
GCF uses only basic ASCII delimiters. Here is how each tokenizer handles the core GCF syntax:
Edge declaration: @0<@2|implements
| Tokenizer | Tokens | Split |
|---|---|---|
| Claude | 7 | @ 0 < @ 2 ` |
| GPT-4 | 7 | @ 0 < @ 2 ` |
| GPT-4o | 7 | @ 0 < @ 2 ` |
| LLaMA 3.1 | 7 | @ 0 < @ 2 ` |
| Qwen 2.5 | 7 | @ 0 < @ 2 ` |
| DeepSeek V3 | 8 | @ 0 < @ 2 ` |
| Gemma 2 | 7 | @ 0 < @ 2 ` |
| Mistral Nemo | 8 | @ 0 < @ 2 ` |
All 8 tokenizers split GCF structural characters identically: @, <, | are always single tokens. The only variance is in the value implements (1 vs 2 tokens), which doesn't affect parsing.
Symbol row: @0|function|auth.validateToken|0.95|definition
| Tokenizer | Tokens | Key Differences |
|---|---|---|
| GPT-4 | 14 | Merges .validate (1 tok), 95 (1 tok) |
| Qwen 2.5 | 15 | Splits 95 → 9 + 5 |
| Gemma 2 | 16 | Splits . + validate, splits 9 + 5 |
The pipe delimiters are always single tokens. Variance comes from how tokenizers handle:
- Dot-prefixed words:
.validate(1 tok on GPT-4) vs.+validate(2 tok on Gemma) - Two-digit numbers:
95(1 tok on GPT-4) vs9+5(2 tok on Qwen/Gemma) - Comma-prefixed chars:
,q(1 tok on GPT-4) vs,+q(2 tok on Gemma)
Section header: ## symbols [3]{id,kind,qname,score,provenance}
All tokenizers produce 17-18 tokens. ##, [, ], {, } are consistently handled. Variance is only in how field names merge with adjacent punctuation.
Delimiter Character Universality
Every character in GCF's grammar is a single token on every tokenizer tested:
| Character | Purpose in GCF | Single token on all 8? |
|---|---|---|
| (pipe) | Field delimiter | Yes |
@ | Symbol ID prefix | Yes |
< | Edge direction | Yes |
# | Section header | Yes |
{ | Schema open | Yes |
} | Schema close | Yes |
[ | Count open | Yes |
] | Count close | Yes |
, | Schema field separator | Yes |
\n | Row separator | Yes |
10 characters, 8 tokenizers, 80 checks, zero exceptions. GCF's grammar characters are universally unambiguous at the token level.
JSON Tokenization Comparison
The same data encoded as JSON shows identical variance patterns. This proves the variance is content-driven, not format-driven.
Example: Full symbol object
JSON: {"id": 0, "kind": "function", "qualified_name": "auth.validateToken", "score": 0.95, "provenance": "definition"}
| Tokenizer | Tokens | Key variance |
|---|---|---|
| Claude | 37 | Splits _ + name, . + validate |
| GPT-4 | 37 | Merges _name, .validate |
| GPT-4o | 37 | Same as GPT-4 |
| LLaMA 3.1 | 37 | Same as GPT-4 |
| Qwen 2.5 | 38 | Splits 9 + 5 |
| DeepSeek V3 | 37 | Same as GPT-4 |
| Gemma 2 | 39 | Splits _ + name, . + validate, 9 + 5 |
| Mistral Nemo | 39 | Splits qual + ified, 9 + 5 |
JSON variance: 37-39 tokens (5.4% spread).
GCF equivalent: @0|function|auth.validateToken|0.95|definition
| Tokenizer | Tokens |
|---|---|
| GPT-4 | 14 |
| Qwen 2.5 | 15 |
| Gemma 2 | 16 |
GCF variance: 14-16 tokens (14% spread proportionally, but 2 tokens absolute).
The variance sources are identical: dot-splitting, number-splitting, word boundaries. JSON has the same patterns. GCF just has fewer total tokens to vary.
Example: Edge array (2 edges)
JSON: [{"target": 0, "source": 2, "type": "implements"}, {"target": 1, "source": 2, "type": "implements"}]
| Tokenizer | JSON tokens |
|---|---|
| Claude | 33 |
| GPT-4 | 38 |
| DeepSeek V3 | 40 |
| Mistral Nemo | 40 |
JSON variance: 33-40 tokens (21% spread).
GCF equivalent: Two lines: @0<@2|implements and @1<@2|implements
| Tokenizer | GCF tokens |
|---|---|
| GPT-4 | 14 |
| DeepSeek V3 | 16 |
| Mistral Nemo | 16 |
GCF variance: 14-16 tokens (14% spread), but only 2 tokens absolute difference.
JSON's variance is larger in both absolute and proportional terms on edge data, because JSON repeats "target":, "source":, "type": for each edge, and each of those key-value pairs can split differently.
ASCII Delimiter Space Analysis
We tested every printable ASCII character (codes 33-126) against all 8 tokenizers on two criteria:
- Single token in isolation: does the character encode as exactly 1 token?
- Never merges with adjacent text: does
Xfootokenize asX+foo(separate) orXfoo(merged)?
A "perfect" delimiter satisfies both: it's always 1 token and never gets absorbed into neighboring content.
Results: 74 of 94 printable ASCII characters are perfect delimiters. 20 characters merge with adjacent text on at least one tokenizer.
Characters that merge (avoid as delimiters)
| Character | Why it merges |
|---|---|
( | Merges into (foo on some tokenizers |
- | Merges into -token, -based, -style |
. | Merges into .validate, .com, .json |
/ | Merges into /api, /path |
_ | Merges into _name, _id, _count |
a-f, i, l, o, p, r, t, u | Common lowercase letters that form subword prefixes |
These are exactly the characters tokenizers aggressively merge into subwords. Any format using dots, dashes, or underscores as structural delimiters will have tokenizer-dependent behavior.
GCF's delimiters: all perfect
| Character | Single token (8/8) | Never merges (8/8) | Role in GCF |
|---|---|---|---|
| | Yes | Yes | Field delimiter |
@ | Yes | Yes | Symbol ID prefix |
< | Yes | Yes | Edge direction |
# | Yes | Yes | Section header |
{ | Yes | Yes | Schema open |
} | Yes | Yes | Schema close |
[ | Yes | Yes | Count open |
] | Yes | Yes | Count close |
, | Yes | Yes | Schema field separator |
Every GCF grammar character is in the perfect category. This was a deliberate design choice.
What this means for competing formats
JSON uses . (dot), : (colon), and " (quote) as structural characters. Dots merge with adjacent text (.validate becomes 1 token on GPT-4 but 2 on Gemma). Quotes create "key": sequences that tokenize inconsistently. The format's grammar characters are in the merging category.
TOON inherits YAML's indentation-sensitive structure. Whitespace-based delimiting is inherently tokenizer-dependent because tokenizers handle leading spaces, tabs, and indentation levels differently.
GCF uses only characters from the non-merging set. The format's token efficiency is not an accident of one tokenizer's vocabulary; it's a structural property of the delimiter choices.
Design Rationale: Why These Specific Delimiters
From the 74 perfect candidates, GCF chose based on readability and semantic meaning:
| Character | Why chosen | Alternative considered | Why not |
|---|---|---|---|
| (pipe) | Rare in natural text. Visually distinct column separator. | Tab (\t) | Invisible, harder to debug |
@ | Establishes "this is an ID" semantically. | $ | Also perfect, but less intuitive for IDs |
## | Two-char sequence tokenizers merge into one token. Markdown-familiar. | === | 3 chars, less efficient |
< | Reads as "points to" for edges. | ~ | Also perfect, but less semantic |
\n | Universal row separator. Zero overhead. | ; | Less readable |
, | Schema field separator. Familiar from CSV. | : | Conflicts with potential value content |
Grammar Swap Experiment
To prove GCF's savings are structural (positional fields, keys declared once) and not an artifact of specific delimiter choices, we swapped the entire grammar and re-measured.
Method
5 delimiter sets, all using characters from the "perfect" category:
| Set | Field | ID | Edge | Section | Schema |
|---|---|---|---|---|---|
| GCF (actual) | | | @ | < | ## | {,} |
| Alt A | ~ | $ | > | %% | (;) |
| Alt B | ^ | ! | = | && | {:} |
| Alt C | ` | # | ~ | !! | [|] |
| Alt D | ; | % | ^ | $$ | {+} |
Each set tested against 5 payload types (employees, orders, logs, code symbols, mixed nested) at 4 sizes (10, 50, 100, 500 records) across all 8 tokenizers. 800 total measurements.
Results
| Delimiter set | 10 records | 50 records | 100 records | 500 records | Overall |
|---|---|---|---|---|---|
| GCF (actual) | 59.1% | 60.9% | 61.0% | 60.5% | 60.6% |
| Alt set A | 58.7% | 60.6% | 60.7% | 60.2% | 60.3% |
| Alt set B | 59.1% | 61.0% | 61.1% | 60.6% | 60.7% |
| Alt set C | 58.7% | 60.6% | 60.7% | 60.2% | 60.3% |
| Alt set D | 58.8% | 60.7% | 60.8% | 60.3% | 60.4% |
Spread: 0.4 percentage points across all delimiter sets. The choice of delimiter character has negligible effect on savings.
Per-tokenizer consistency
| Tokenizer | GCF | Alt A | Alt B | Alt C | Alt D |
|---|---|---|---|---|---|
| Claude | 60.8% | 60.4% | 60.8% | 60.4% | 60.4% |
| GPT-4 | 63.2% | 62.8% | 63.2% | 62.8% | 62.8% |
| GPT-4o | 63.1% | 63.1% | 63.5% | 63.1% | 63.1% |
| LLaMA 3.1 | 63.2% | 62.8% | 63.2% | 62.8% | 62.8% |
| Qwen 2.5 | 56.3% | 55.9% | 56.3% | 55.9% | 55.9% |
| DeepSeek V3 | 63.4% | 62.9% | 62.9% | 62.9% | 63.7% |
| Gemma 2 | 60.2% | 59.9% | 60.2% | 59.9% | 59.9% |
| Mistral Nemo | 56.0% | 56.0% | 56.5% | 56.0% | 56.0% |
Variation across delimiter sets: 0.0-0.8pp per tokenizer.
What this proves
GCF's token savings come from the encoding structure (keys declared once, values positional), not from any specific delimiter character. You could replace every delimiter in the grammar and get the same compression. The format's efficiency is a mathematical property of eliminating key repetition.
Summary
GCF's grammar is tokenizer-invariant. The delimiters were chosen from the perfect-category character set (single token, never merges) as a robustness guarantee, but the savings themselves are structural. The grammar swap experiment proves this: 0.4pp spread across 5 completely different delimiter sets, 800 measurements.
The comprehension eval data (24 stress-scale runs, 27 generic runs across Claude, GPT, Gemini, and Mistral) confirms this at the behavioral level: no provider-specific anomalies in how models read GCF.
Reproduce
All experiments are reproducible:
cd gcf
# Tokenizer variance analysis
node eval/tokenizer-variance.mjs
# Grammar swap experiment
node eval/grammar-swap-experiment.mjs
```bash
cd gcf
npm install @blackwell-systems/gcf @lenml/tokenizers \
@lenml/tokenizer-claude @lenml/tokenizer-gpt4 @lenml/tokenizer-gpt4o \
@lenml/tokenizer-llama3_1 @lenml/tokenizer-qwen2_5 \
@lenml/tokenizer-deepseek_v3 @lenml/tokenizer-gemma2 \
@lenml/tokenizer-mistral_nemo
node eval/tokenizer-variance.mjs