Skip to content

Token Savings Proof

Mathematical proof that GCF's structural overhead is lower than both JSON and TOON for structured data. The analysis below covers the graph profile (symbols + edges). For tabular profile benchmarks on generic data, see Benchmarks. For output token savings, see the generation eval. All analysis uses byte length as a proxy for token count (validated: byte length correlates 0.97 with o200k_base token count for ASCII-dominant payloads).

Notation

SymbolDefinition
nNumber of symbols (nodes) in payload
eNumber of edges in payload
qAverage qualified name length in bytes
kAverage kind string length in bytes (full form)
k'Average abbreviated kind length (GCF)
gNumber of distance groups
L(x)Byte length of encoding x

1. Symbol encoding

JSON (one symbol)

{"qualified_name":"<q>","kind":"<k>","score":0.78,"provenance":"<p>","distance":0}

Fixed overhead per symbol:

  • {"qualified_name":" (19) + ","kind":" (9) + ","score": (9) + ,"provenance":" (15) + ","distance": (12) + } (1) = 65 bytes
L_json(symbol) = 65 + q + k + len(score) + len(provenance)

For a typical symbol (q=45, k=8, score=4, prov=12):

L_json(symbol) = 65 + 45 + 8 + 4 + 12 = 134 bytes

GCF (one symbol)

@{id} {kind'} {qname} {score} {provenance}

Fixed overhead per symbol:

  • @ (1) + space (1) + space (1) + space (1) + space (1) + newline (1) = 6 bytes + id digits
L_gcf(symbol) = 6 + digits(id) + k' + q + len(score) + len(provenance)

For the same symbol (q=45, k'=2 for "fn", score=4, prov=12, id=1 digit):

L_gcf(symbol) = 6 + 1 + 2 + 45 + 4 + 12 = 70 bytes

Per-symbol savings

Δ_symbol = L_json - L_gcf = 134 - 70 = 64 bytes (48% per symbol)

The savings come from:

  • Field name elimination: 65 - 6 = 59 bytes of structural overhead removed
  • Kind abbreviation: k - k' (avg 6 bytes saved per symbol)

Scaling

For n symbols:

Total_json(symbols) = n * L_json(symbol) + 2n (commas) + 14 ("symbols":[...])
                    = n * 136 + 14

Total_gcf(symbols)  = n * L_gcf(symbol) + g * 12 (group headers)
                    = n * 70 + g * 12

For n=500, g=3:

Total_json = 500 * 136 + 14 = 68,014 bytes
Total_gcf  = 500 * 70 + 36  = 35,036 bytes
Savings: 48.5%

2. Edge encoding

JSON (one edge)

{"source":"<q1>","target":"<q2>","edge_type":"<t>"}

Fixed overhead:

  • {"source":" (11) + ","target":" (11) + ","edge_type":" (14) + "} (2) = 38 bytes
L_json(edge) = 38 + q1 + q2 + len(type)

For typical edge (q1=45, q2=45, type=5):

L_json(edge) = 38 + 45 + 45 + 5 = 133 bytes

GCF (one edge)

@{tgt}<@{src} {type}

Fixed overhead:

  • @ (1) + <@ (2) + space (1) + newline (1) = 5 bytes + id digits
L_gcf(edge) = 5 + digits(tgt_id) + digits(src_id) + len(type)

For the same edge (ids are 1-3 digits, type=5):

L_gcf(edge) = 5 + 2 + 2 + 5 = 14 bytes

Per-edge savings

Δ_edge = L_json - L_gcf = 133 - 14 = 119 bytes (89.5% per edge)

This is GCF's largest structural advantage. JSON repeats the full qualified name of both source and target for every edge. GCF uses 2-3 digit local IDs.

Scaling

For e edges:

Total_json(edges) = e * L_json(edge) + 2e (commas) + 10 ("edges":[...])
                  = e * 135 + 10

Total_gcf(edges)  = e * L_gcf(edge) + 9 ("## edges\n")
                  = e * 14 + 9

For e=200:

Total_json = 200 * 135 + 10 = 27,010 bytes
Total_gcf  = 200 * 14 + 9   = 2,809 bytes
Savings: 89.6%

3. Distance grouping

JSON

Each symbol carries a "distance": N field:

Per symbol: 12 bytes ("distance":N) + comma

For n=500: 6,500 bytes of distance metadata.

GCF

Group headers replace per-symbol fields:

## targets\n    = 11 bytes
## related\n    = 11 bytes
## extended\n   = 12 bytes

For g=3 groups: 34 bytes total.

Savings

Δ_grouping = 6,500 - 34 = 6,466 bytes

One-time declarations replace n repetitions. Savings scale linearly with n.

4. Header overhead

JSON

{"tool":"context_for_task","tokens_used":1847,"token_budget":5000,...}

Typical header: ~120 bytes (field names + values + delimiters).

GCF

GCF tool=context_for_task budget=5000 tokens=1847 symbols=500 pack_root=a1b2c3\n

Typical header: ~80 bytes.

Savings

Δ_header = 120 - 80 = 40 bytes (fixed, does not scale)

Header savings are negligible at scale. The payload body dominates.

5. Total model

Combining all components:

L_json(payload) = L_json(header) + Total_json(symbols) + Total_json(edges)
                = 120 + (n * 136 + 14) + (e * 135 + 10)
                = 144 + 136n + 135e

L_gcf(payload)  = L_gcf(header) + Total_gcf(symbols) + Total_gcf(edges)
                = 80 + (n * 70 + g * 12) + (e * 14 + 9)
                = 89 + 70n + 12g + 14e

Savings formula

Δ(n, e, g) = L_json - L_gcf
           = (144 + 136n + 135e) - (89 + 70n + 12g + 14e)
           = 55 + 66n + 121e - 12g

Savings grow linearly with both symbol count and edge count. Edge savings dominate because Δ_edge (121 bytes) > Δ_symbol (66 bytes).

Savings percentage

Savings% = Δ / L_json = (55 + 66n + 121e - 12g) / (144 + 136n + 135e)

For typical payloads (g << n):

lim(n,e→∞) Savings% ≈ (66n + 121e) / (136n + 135e)

At the median ratio e/n = 0.4 (edges per symbol):

≈ (66 + 121*0.4) / (136 + 135*0.4) = 114.4 / 190 = 60.2%

At e/n = 1.0 (dense graphs):

≈ (66 + 121) / (136 + 135) = 187 / 271 = 69.0%

At e/n = 2.0 (very dense):

≈ (66 + 242) / (136 + 270) = 308 / 406 = 75.9%

Conclusion: GCF savings range from 60% (sparse) to 76% (dense) asymptotically, increasing with edge density. The measured 84% includes kind abbreviation savings not fully captured in this simplified model.

6. GCF vs TOON

TOON uses tabular encoding for uniform arrays. For symbol data:

symbols[500]{qualified_name,kind,score,provenance,distance}:
  <q>,<k>,<score>,<prov>,<dist>
  <q>,<k>,<score>,<prov>,<dist>
  ...

TOON per-symbol cost

L_toon(symbol) = q + 1 + k + 1 + len(score) + 1 + len(prov) + 1 + len(dist) + 1
               = q + k + len(score) + len(prov) + len(dist) + 5 (delimiters + newline)

For typical symbol: 45 + 8 + 4 + 12 + 1 + 5 = 75 bytes

GCF per-symbol cost (from above): 70 bytes

Per-symbol advantage

Δ_gcf_vs_toon(symbol) = 75 - 70 = 5 bytes per symbol

Small per-symbol savings. The real advantage is in edges.

TOON edge encoding

TOON has no special edge encoding. Edges are either:

  1. A separate tabular array (repeating full qualified names like JSON), or
  2. Embedded in an object structure

Best case (tabular edges):

edges[200]{source,target,type}:
  <q1>,<q2>,<type>
  ...
L_toon(edge) = q1 + 1 + q2 + 1 + len(type) + 1 = q1 + q2 + len(type) + 3

For typical: 45 + 45 + 5 + 3 = 98 bytes

GCF edge cost (from above): 14 bytes

Per-edge advantage over TOON

Δ_gcf_vs_toon(edge) = 98 - 14 = 84 bytes per edge (85.7% savings)

Total GCF vs TOON

Δ_total(n, e) = 5n + 84e

For n=500, e=200:

Δ = 5*500 + 84*200 = 2,500 + 16,800 = 19,300 bytes

GCF is 19,300 bytes smaller than TOON for a 500-symbol, 200-edge payload. This aligns with the measured result: GCF 11,090 tokens vs TOON 16,378 tokens (32% smaller).

Why edges are the differentiator

TOON was designed for generic tabular data. It has no concept of references between records. Every edge must spell out the full identifier of both source and target.

GCF's local ID system (@0, @1) means edges cost ~14 bytes regardless of identifier length. A symbol with a 120-character qualified name costs the same in an edge reference as one with a 10-character name.

This advantage grows with:

  • Longer qualified names (common in Java/Go packages)
  • Higher edge density (call graphs, dependency graphs)
  • Larger payloads (more symbols = longer IDs but still 2-3 digits for <1000 symbols)

7. Session deduplication model

After call c with overlap ratio r (fraction of symbols seen before):

L_session(call_c) = (1-r) * L_gcf(new_symbols) + r * L_bare_ref * n_total + L_gcf(edges)

Where L_bare_ref = digits(id) + 28 bytes (@{id} # previously transmitted\n).

For r=0.8, n=50, e=20 (typical 3rd call):

New symbols: 10 * 70 = 700 bytes
Bare refs: 40 * 30 = 1,200 bytes
Edges: 20 * 14 = 280 bytes
Total: 2,180 bytes

vs full GCF: 50 * 70 + 20 * 14 = 3,780 bytes
vs JSON: 50 * 136 + 20 * 135 = 9,500 bytes

Session savings vs JSON: 77.1%

8. Empirical validation

PredictionMeasuredError
500-sym savings vs JSON60-76% (model)79% (measured)
500-sym savings vs TOON~32% (model: 19,300 / (19,300 + L_gcf))32% (measured)
Session savings (5th call)~77-93% vs JSON (model)92.7% (measured)
Edge savings89.5% (model)Not independently measured

The model slightly underpredicts total savings because it uses average values and excludes second-order effects (kind abbreviation contributes ~1-2% additional savings, distance grouping eliminates array-level overhead).