Getting Started

GCF is an AI-native wire format for structured data, built for the agentic loop. It sits at the boundary between your data and the LLM context window, using 71% fewer tokens than JSON with 100% comprehension on every frontier model tested.

One zero-dep library encodes any structured shape, no schema. It is lossless (verified across 43 billion+ round-trips with JSON, YAML, TOML, CSV, and MessagePack), token-compact, and read natively by every frontier model. No other single format is all four at once.

Two profiles, one grammar. The generic profile encodes any structured data. The graph profile is a superset that adds local IDs, typed edges, and session deduplication for relationship-heavy data. Delta encoding, streaming, and content-addressed identity work in both profiles; local IDs, typed edges, and session dedup are graph-only. You never read or write GCF by hand: call encode(), let the LLM read the result natively, and call decode() when a human needs the data back.

100% comprehension accuracy on every frontier model tested (Claude, Gemini, GPT). The only format that never fails.
91.2% under structural stress (500-symbol code graphs), where JSON drops to 54.1% and TOON to 68.8%.
71% fewer tokens than JSON. At 1000 records, JSON's 161K tokens overwhelm the usable context on a 200K model; GCF fits in 47K.
29% fewer tokens than TOON across 16 real-world datasets (15/16 wins).
43,000,000,000+ lossless round-trips across 5 formats and 6 language implementations. Zero data corruption.
Zero training. No model has ever seen GCF in training data. Every frontier model reads it natively.
Backed by mechanistic research. The grammar was reverse-engineered from attention-level experimentation and validated by three companion papers: controlled training experiments across two architectures and two scales showing clean delimiter boundaries produce 3-738x lower perplexity on structured data, with zero natural-language cost. See the research for details.

Why not just use JSON?

JSON works at small scale. At 8 records, every format scores near 100%. The problems start when payloads grow.

At 500 records, JSON scores 54.1% comprehension accuracy across 10 models on code graph data. GPT-5.5 returns empty strings. Claude Opus spends 143 lines manually enumerating symbols and still gets the wrong answer. The repeated field names ("qualified_name":, "kind":, "score": on every record) consume 53,341 tokens of structural noise that overwhelms the model's attention. At scale, JSON's grammar attention collapses from 30% to 8.6%: the model stops tracking structure entirely.

At 1000 records, JSON consumes 161K tokens, leaving no usable room in a 200K context window once the prompt, task, and response are accounted for. The task becomes impossible regardless of model capability.
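The arithmetic behind that claim is easy to check. The sketch below uses the figures quoted in this guide (161K tokens for JSON, a 71% reduction, a 200K window). The 40K reserve for prompt, task, and response is an illustrative assumption, not a measured value.

```python
# Context-budget arithmetic using the figures quoted in this guide.
CONTEXT_WINDOW = 200_000   # tokens in the model's window
JSON_TOKENS = 161_000      # JSON payload at 1000 records
REDUCTION = 0.71           # GCF's quoted saving versus JSON
RESERVE = 40_000           # ASSUMPTION: prompt + task + response overhead


def remaining(payload_tokens: int) -> int:
    """Tokens left after the payload and the reserved overhead."""
    return CONTEXT_WINDOW - payload_tokens - RESERVE


gcf_tokens = round(JSON_TOKENS * (1 - REDUCTION))

print(f"GCF payload: {gcf_tokens} tokens")
print(f"JSON leaves {remaining(JSON_TOKENS)} tokens")
print(f"GCF leaves  {remaining(gcf_tokens)} tokens")
```

With any realistic reserve, the JSON payload goes negative while the GCF payload leaves most of the window free. Adjust `RESERVE` to match your own prompt and response sizes.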

The problem goes deeper than repetition. We tested 45 common field names across 43 tokenizers from 20 providers. JSON's quote-colon patterns ("fieldName":) don't just waste tokens; they tokenize inconsistently across models. When GPT-4o sees "value":"pending", the opening quote merges with the field name into one token. Claude keeps them separate. The structural boundary (where the field name starts) is at a different token position depending on which model reads it. The most common field names in computing ("id":, "name":, "type":, "title":, "time":) merge on 30% of all tokenizers tested. The format designed for human readability doesn't even have consistent structure at the token level. The cost is not only inconsistency: controlled training experiments show that merging delimiters into content permanently constrains the model's structural attention, leaving every attention head with roughly 4x more structural capacity than the tokenizer lets it use.

GCF declares field names once in a header. Rows are positional values. GCF's pipe separator (|) was reverse-engineered from that analysis: it has a 0% field-name merge rate across all 43 tokenizers, so every model sees the same token boundaries. The same 500-record payload uses 11,090 tokens and scores 91.2% accuracy on code graphs. On nested order data, GCF achieves 100% accuracy on every frontier model tested: Claude Opus, Sonnet, Haiku, GPT-5.5, Gemini 2.5 Flash, and Gemini 3.5 Flash.
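To see why declaring field names once matters, here is a rough illustration. `toy_table` is a hypothetical helper written for this page, not the GCF encoder: it only mimics the header-plus-positional-rows idea for flat records, does no escaping, and compares character counts rather than tokens.

```python
import json


def toy_table(name: str, records: list[dict]) -> str:
    """Header-once layout for flat records. NOT the real GCF encoder:
    no escaping, no nesting, no type handling."""
    fields = list(records[0])
    header = f"## {name} [{len(records)}]{{{','.join(fields)}}}"
    rows = ["|".join(str(r[f]) for f in fields) for r in records]
    return "\n".join([header, *rows])


# Synthetic records, invented purely for the size comparison.
records = [
    {"id": i, "name": f"user{i}", "kind": "function", "score": 0.5}
    for i in range(500)
]

as_json = json.dumps(records)
as_table = toy_table("symbols", records)

# Field names appear once per record in JSON, once in total in the table.
print(as_json.count('"kind"'), as_table.count("kind"))
print(len(as_json), len(as_table))
```

Character counts are only a proxy. Real savings depend on the tokenizer, which is why the figures in this guide are reported in tokens.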

The format designed for human readability is incomprehensible to the systems actually reading it. See the full benchmark data and the tokenizer analysis.

When to use GCF

Generic profile: any structured data

Use encodeGeneric() for any structured data. JSON is the most common source, but the same function handles data originating from YAML, TOML, CSV, MessagePack, or any format that deserializes to objects and arrays:

MCP tool responses (arrays of objects, nested data)
API responses, database query results, search results
Log entries, telemetry, event streams
RAG retrieval chunks
Agent-to-agent communication
Any structured output where you want fewer tokens

71% fewer tokens than JSON. Lossless. decode(encode(value)) == value for every structured value, regardless of source format.
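The lossless property is straightforward to assert in your own test suite. The helper below is a sketch: `assert_round_trip` is a hypothetical name, and `json.dumps`/`json.loads` stand in for the codec so the example runs without GCF installed. In practice you would pass your library's generic encode and decode functions instead.

```python
import json
from typing import Any, Callable


def assert_round_trip(value: Any,
                      encode: Callable[[Any], str],
                      decode: Callable[[str], Any]) -> None:
    """Fail loudly if decode(encode(value)) differs from value."""
    restored = decode(encode(value))
    if restored != value:
        raise AssertionError(f"round-trip mismatch: {restored!r} != {value!r}")


samples = [
    {"employees": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]},
    {"nested": {"empty_list": [], "none": None, "flag": True}},
    [1, 2.5, "pipe|inside", ""],
]

# Stand-in codec (JSON); swap in the GCF encode/decode pair in real tests.
for sample in samples:
    assert_round_trip(sample, json.dumps, json.loads)
print("all samples round-trip")
```

Edge cases such as empty strings, empty lists, nulls, and values containing the separator character are worth including in the sample set, since those are where lossy formats usually fail.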

Graph profile: relationship-heavy data

Use encode() when your data has entities with relationships:

Code intelligence (symbols, call graphs, dependencies)
Knowledge graphs and ontologies
Neo4j / Memgraph / Dgraph query results
Agent memory with entity relationships
Network topologies, dependency trees
Any domain where nodes have edges

Graph-shaped data is the fastest-growing data shape in AI. Knowledge systems, ontologies, GraphRAG, code intelligence, agent memory. No other token-efficient format treats graphs as first-class. GCF is the only format with native graph syntax: local IDs, typed edges, distance grouping, and session deduplication that compounds to 84% savings across a multi-turn session.

When GCF is most effective

Payloads contain repeated structures (arrays of similar objects)
Records have relationships between them (edges, references)
You're operating under a token budget (context windows are finite)
You make multiple calls in a session (graph session dedup and generic delta both compound savings across turns)
You want cheaper output (63% fewer output tokens than JSON)

When NOT to use GCF

Single scalar values (just return the value)
Systems that require JSON schema validation: call decodeGeneric() first, then validate the decoded object with any JSON Schema validator. The schema validates the data structure, not the wire encoding.
Non-LLM consumers that expect JSON (call decode() or use the MCP proxy to bridge)

"But I need human-readable output"

Use GCF for the wire format and JSON for the display format. The agent reads GCF (cheap: 50-69% fewer tokens than JSON in the context window), does its work, then calls decode() at the end to render JSON for a human. The context window savings are already banked. Readability is a last-mile rendering concern, not a wire format property. See the tokenizer analysis for why JSON's structural grammar causes comprehension failures at scale.

Install

bash

pip install gcf-python

bash

npm install @blackwell-systems/gcf

bash

go get github.com/blackwell-systems/gcf-go

bash

cargo add gcf

swift

// Package.swift
.package(url: "https://github.com/blackwell-systems/gcf-swift", from: "2.4.0")

kotlin

// build.gradle.kts
implementation("com.github.blackwell-systems:gcf-kotlin:v2.4.0")

Encode your first payload

python

from gcf import encode_generic

output = encode_generic({
    "employees": [
        {"id": 1, "name": "Alice", "department": "Engineering", "salary": 95000},
        {"id": 2, "name": "Bob", "department": "Sales", "salary": 72000},
    ],
})
print(output)

typescript

import { encodeGeneric } from '@blackwell-systems/gcf';

const output = encodeGeneric({
  employees: [
    { id: 1, name: 'Alice', department: 'Engineering', salary: 95000 },
    { id: 2, name: 'Bob', department: 'Sales', salary: 72000 },
  ],
});
console.log(output);

go

output := gcf.EncodeGeneric(map[string]any{
    "employees": []map[string]any{
        {"id": 1, "name": "Alice", "department": "Engineering", "salary": 95000},
        {"id": 2, "name": "Bob", "department": "Sales", "salary": 72000},
    },
})
fmt.Println(output)

rust

use gcf::encode_generic;
use serde_json::json;

let output = encode_generic(&json!({
    "employees": [
        {"id": 1, "name": "Alice", "department": "Engineering", "salary": 95000},
        {"id": 2, "name": "Bob", "department": "Sales", "salary": 72000},
    ]
}));
println!("{}", output);

swift

import GCF

let output = GCF.encodeGeneric([
    "employees": [
        ["id": 1, "name": "Alice", "department": "Engineering", "salary": 95000],
        ["id": 2, "name": "Bob", "department": "Sales", "salary": 72000],
    ]
])
print(output)

kotlin

import com.blackwellsystems.gcf.encodeGeneric

val output = encodeGeneric(mapOf(
    "employees" to listOf(
        mapOf("id" to 1, "name" to "Alice", "department" to "Engineering", "salary" to 95000),
        mapOf("id" to 2, "name" to "Bob", "department" to "Sales", "salary" to 72000),
    )
))
println(output)

Output:

gcf

GCF profile=generic
## employees [2]{id,name,department,salary}
1|Alice|Engineering|95000
2|Bob|Sales|72000

One header declares field names. Rows are positional values only. No field names repeated per record. Works on any structured data, whether it originated as JSON, YAML, or any other format.
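To make the layout concrete, the sketch below reads the exact output shown above back into records and renders JSON for display. `parse_toy_table` is a hypothetical helper, not the library's decoder: it handles only this flat single-table shape, and its type guessing (digit-only cells become integers) is an assumption made for the example.

```python
import json
import re

GCF_TEXT = """GCF profile=generic
## employees [2]{id,name,department,salary}
1|Alice|Engineering|95000
2|Bob|Sales|72000"""

# Matches a header such as: ## employees [2]{id,name,department,salary}
HEADER = re.compile(r"^## (\w+) \[(\d+)\]\{([^}]*)\}$")


def parse_toy_table(text: str) -> dict:
    """Parse one flat table. ASSUMPTION: digit-only cells are integers."""
    lines = text.splitlines()
    match = HEADER.match(lines[1])
    if match is None:
        raise ValueError("expected a table header on line 2")
    name = match.group(1)
    count = int(match.group(2))
    fields = match.group(3).split(",")
    rows = []
    for line in lines[2:2 + count]:
        cells = [int(c) if c.isdigit() else c for c in line.split("|")]
        rows.append(dict(zip(fields, cells)))
    return {name: rows}


data = parse_toy_table(GCF_TEXT)
print(json.dumps(data, indent=2))
```

For real payloads, use the library's decode functions. This toy version ignores escaping, nesting, and every value type other than integers and strings.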

Graph profile (first-class graph data)

For data with entities and relationships, the graph profile adds local IDs, typed edges, distance grouping, and session dedup. Everything the generic profile does, plus the grammar to express graphs natively:

python

from gcf import encode, Payload, Symbol, Edge

output = encode(Payload(
    tool="context_for_task",
    token_budget=5000,
    tokens_used=1847,
    symbols=[
        Symbol(qualified_name="pkg.Auth", kind="function", score=0.78, provenance="lsp", distance=0),
        Symbol(qualified_name="pkg.Server", kind="function", score=0.54, provenance="lsp", distance=1),
    ],
    edges=[Edge(source="pkg.Server", target="pkg.Auth", edge_type="calls")],
))

typescript

import { encode, type Payload } from '@blackwell-systems/gcf';

const output = encode({
  tool: 'context_for_task',
  tokenBudget: 5000,
  tokensUsed: 1847,
  symbols: [
    { qualifiedName: 'pkg.Auth', kind: 'function', score: 0.78, provenance: 'lsp', distance: 0 },
    { qualifiedName: 'pkg.Server', kind: 'function', score: 0.54, provenance: 'lsp', distance: 1 },
  ],
  edges: [{ source: 'pkg.Server', target: 'pkg.Auth', edgeType: 'calls' }],
});

go

output := gcf.Encode(&gcf.Payload{
    Tool: "context_for_task", TokenBudget: 5000, TokensUsed: 1847,
    Symbols: []gcf.Symbol{
        {QualifiedName: "pkg.Auth", Kind: "function", Score: 0.78, Provenance: "lsp", Distance: 0},
        {QualifiedName: "pkg.Server", Kind: "function", Score: 0.54, Provenance: "lsp", Distance: 1},
    },
    Edges: []gcf.Edge{{Source: "pkg.Server", Target: "pkg.Auth", EdgeType: "calls"}},
})

rust

use gcf::{encode, Payload, Symbol, Edge};

let output = encode(&Payload {
    tool: "context_for_task".into(),
    token_budget: 5000,
    tokens_used: 1847,
    symbols: vec![
        Symbol { qualified_name: "pkg.Auth".into(), kind: "function".into(), score: 0.78, provenance: "lsp".into(), distance: 0, ..Default::default() },
        Symbol { qualified_name: "pkg.Server".into(), kind: "function".into(), score: 0.54, provenance: "lsp".into(), distance: 1, ..Default::default() },
    ],
    edges: vec![Edge { source: "pkg.Server".into(), target: "pkg.Auth".into(), edge_type: "calls".into(), ..Default::default() }],
    ..Default::default()
});

swift

import GCF

let output = GCF.encode(Payload(
    tool: "context_for_task", tokenBudget: 5000, tokensUsed: 1847,
    symbols: [
        Symbol(qualifiedName: "pkg.Auth", kind: "function", score: 0.78, provenance: "lsp", distance: 0),
        Symbol(qualifiedName: "pkg.Server", kind: "function", score: 0.54, provenance: "lsp", distance: 1),
    ],
    edges: [Edge(source: "pkg.Server", target: "pkg.Auth", edgeType: "calls")]
))

kotlin

import com.blackwellsystems.gcf.*

val output = encode(Payload(
    tool = "context_for_task", tokenBudget = 5000, tokensUsed = 1847,
    symbols = listOf(
        Symbol(qualifiedName = "pkg.Auth", kind = "function", score = 0.78, provenance = "lsp", distance = 0),
        Symbol(qualifiedName = "pkg.Server", kind = "function", score = 0.54, provenance = "lsp", distance = 1),
    ),
    edges = listOf(Edge(source = "pkg.Server", target = "pkg.Auth", edgeType = "calls"))
))

Output:

gcf

GCF profile=graph tool=context_for_task budget=5000 tokens=1847 symbols=2 edges=1
## targets
@0 fn pkg.Auth 0.78 lsp
## related
@1 fn pkg.Server 0.54 lsp
## edges [1]
@0<@1 calls

65 tokens instead of 163 for the JSON equivalent. Local IDs (@0, @1) replace full qualified names in edges. Distance groups (## targets, ## related) replace per-record "distance": N fields.
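Resolving local IDs back to names is mechanical. The sketch below expands the edge line from the output above. `resolve_edges` is a hypothetical helper, and the reading of `@0<@1 calls` as target-then-source is inferred from this one example (the input edge ran from pkg.Server to pkg.Auth), not taken from the format specification.

```python
import re

GRAPH_TEXT = """GCF profile=graph tool=context_for_task budget=5000 tokens=1847 symbols=2 edges=1
## targets
@0 fn pkg.Auth 0.78 lsp
## related
@1 fn pkg.Server 0.54 lsp
## edges [1]
@0<@1 calls"""

# Symbol line: local ID, kind abbreviation, qualified name, then other columns.
SYMBOL = re.compile(r"^@(\d+) \S+ (\S+) ")
# Edge line: ASSUMED to read as target<source followed by the edge type.
EDGE = re.compile(r"^@(\d+)<@(\d+) (\S+)$")


def resolve_edges(text: str) -> list[dict]:
    """Replace local IDs in edge lines with the names declared earlier."""
    names: dict[str, str] = {}
    edges = []
    for line in text.splitlines():
        if (m := SYMBOL.match(line)):
            names[m.group(1)] = m.group(2)
        elif (m := EDGE.match(line)):
            target, source, edge_type = m.groups()
            edges.append({
                "source": names[source],
                "target": names[target],
                "edge_type": edge_type,
            })
    return edges


print(resolve_edges(GRAPH_TEXT))
```

Each qualified name is written once on its symbol line, so every additional edge costs only a pair of short IDs rather than two full names.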

Decode

python

from gcf import decode

p = decode(gcf_text)
print(p.tool)           # "context_for_task"
print(len(p.symbols))   # 2
print(p.edges[0].source)  # "pkg.Server"

typescript

import { decode } from '@blackwell-systems/gcf';

const p = decode(gcfText);
console.log(p.tool);           // "context_for_task"
console.log(p.symbols.length); // 2
console.log(p.edges[0].source);  // "pkg.Server"

go

p, err := gcf.Decode(gcfText)
if err != nil {
    log.Fatal(err)
}
fmt.Println(p.Tool)           // "context_for_task"
fmt.Println(len(p.Symbols))   // 2
fmt.Println(p.Edges[0].Source)  // "pkg.Server"

rust

use gcf::decode;

let p = decode(gcf_text)?;
println!("{}", p.tool);           // "context_for_task"
println!("{}", p.symbols.len());  // 2
println!("{}", p.edges[0].source); // "pkg.Server"

swift

import GCF

let p = try GCF.decode(gcfText)
print(p.tool)           // "context_for_task"
print(p.symbols.count)  // 2
print(p.edges[0].source) // "pkg.Server"

kotlin

import com.blackwellsystems.gcf.decode

val p = decode(gcfText)
println(p.tool)           // "context_for_task"
println(p.symbols.size)   // 2
println(p.edges[0].source) // "pkg.Server"

What's next

Format Overview to understand both encoding profiles
Using GCF with LLMs for comprehension and generation results
Sessions for multi-turn deduplication (84% session savings)
Delta Encoding for incremental updates (81.2% savings)
Streaming Encoding for zero-buffering incremental encode
GCF vs TOON for the full competitive comparison
Benchmarks for input and output token data
Playground to try it live in the browser
