TL;DR β The One-Minute Version
- What serialization actually is β turning live objects in RAM into bytes that survive a network hop or disk write
- The two big families: text-based formats (JSON, XML, YAML) vs binary formats (Protobuf, Avro, FlatBuffers)
- The three-way trade-off every format makes: wire size vs CPU cost vs schema flexibility
- Why picking the wrong format is an architectural mistake that compounds for years, not just a performance nuisance
Serialization is converting a live, in-memory object β a Python dict, a Java object, a Go struct β into a flat sequence of bytes that can travel over a network or sit on disk. Deserialization is the reverse: turning those bytes back into a live object on the other side. Every distributed system does this billions of times a day. The format you choose determines bandwidth cost, CPU overhead, and how painful future schema changes will be.
Imagine your Python service has a User object sitting in RAM: id=42, name="Alice". Your Go service on the other machine has never heard of Python objects. The only thing that can travel between them is bytes β raw numbers. Serialization is the agreed-upon rule for turning that Python object into those bytes, and deserialization is the Go service reading those bytes back into its own User struct. The format is the shared contract. Without it, the bytes are meaningless noise.
The core idea: A live object in RAM is just a web of pointers and memory addresses β it only makes sense to the process that created it. Serialization flattens it into a sequence of bytes that has a machine-independent meaning. Deserialization rebuilds the object from those bytes. Without this process, two programs β even on the same machine β can't share complex data.
Two big families: Text-based formats (JSON, XML, YAML) store data as human-readable characters. They're easy to debug, universally supported, and self-describing β the field names are right there in the bytes. Binary formats (Protobuf, Avro, FlatBuffers) encode data as compact numbers. Smaller, faster, but you need a schema to decode them β raw bytes look like gibberish.
The three-way trade-off: Every format trades off: wire size (bytes on the network), CPU cost (time to encode/decode), and schema flexibility (how easily the format handles added or removed fields over time). No format wins on all three. JSON wins on ease; Protobuf wins on size + speed; Avro wins on schema evolution. Knowing the trade-off lets you choose deliberately.
Why You Need This β The Real Problem
Let's make this concrete. Imagine you're building a microservice architecture: Service A is a Python user service that stores user data. Service B is a Go order service that needs user details to validate purchases. They run on separate machines, possibly in separate data centers.
Service A has a Python User object in its RAM. Service B needs a Go User struct in its RAM. The only thing connecting them is a TCP socket β a pipe that carries raw bytes. Service A needs to turn its Python object into bytes. Service B needs to turn those bytes back into a Go struct. Both sides must agree on exactly how that translation works β otherwise Service B reads garbage.
That agreement is the serialization format. It's not just a technical choice β it's a contract. Like a legal contract, changing it later requires coordination. Every client that ever wrote data in format V1 will at some point need to be read by code expecting format V2. If your format can't handle that gracefully, you're looking at painful downtime or big-bang migrations.
Your new service handles 1 million requests per second. Each JSON payload averages 5 KB. Your team wants to switch to Protobuf, which reduces the same payload to about 1.5 KB. Before reading on: how much bandwidth does that save per day?
Hint: multiply request rate Γ payload size Γ seconds in a day, then compare.
Show the math
JSON: 1,000,000 req/s Γ 5,000 bytes = 5 GB/s = 5 Γ 86,400 s = ~432 TB/day
Protobuf: 1,000,000 req/s Γ 1,500 bytes = 1.5 GB/s = 1.5 Γ 86,400 s = ~130 TB/day
Savings: ~302 TB/day β roughly 70% less bandwidth. At $0.08/GB (AWS data transfer), that's ~$24,000/day saved. Format choice is a business decision, not just a tech preference.
The flip side: Protobuf isn't free. It requires a schema file (a .proto file) shared between both services. Debugging a failed call means you can't just paste the wire bytes into a text editor and read them. Your team needs tooling. A junior developer can't curl your API and instantly understand the response. These are real costs β we'll quantify them in later sections.
Mental Model β Three Concerns of Any Format
Before diving into specific formats, you need a mental framework for evaluating any format β including ones invented next year. Every serialization format, without exception, makes a three-way trade-off. Understand the triangle and you can instantly reason about any format you encounter.
Think of it like choosing a vehicle. A bicycle is cheap and maneuverable but slow for long distances. A cargo truck carries enormous loads but won't fit in a parking garage. A sports car is fast but hauls nothing. No vehicle is best at everything, and neither is any serialization format. The question is always "best for my use case."
The three axes are:
- Wire size β how many bytes does a serialized object take? Smaller = less network bandwidth, less storage cost, faster transmission over slow links (mobile, IoT devices). JSON is verbose because it stores field names as strings in every message. Protobuf uses field numbers (1, 2, 3β¦) instead β much shorter.
- CPU cost β how long does encoding and decoding take? Text parsing is surprisingly slow: you have to scan characters, handle escape sequences, validate UTF-8. Binary formats with fixed-width fields can be decoded with a single memory read. This matters when you're serializing millions of objects per second.
- Schema flexibility β when you add a new field to your
Userobject next quarter, what happens to the old data that was serialized without that field? Can old code still read new data (forward compatibility)? Can new code still read old data (backward compatibility)? Some formats handle this beautifully; others break silently.
A useful mental shortcut: the closer a format sits to a corner of the triangle, the better it is at that one thing β but the worse it is at the other two. FlatBuffers sits near the "compact + fast" corner (amazing for games and embedded systems) but has the most rigid schema evolution story. JSON sits near the center (decent at everything, best at none) which is exactly why it's so broadly applicable. Avro sits near the "schema flexibility + compact" edge β perfect for data pipelines where schemas evolve constantly.
Serialize / Marshal / Encode all mean the same thing: object β bytes. Different communities prefer different words (Java says "marshal", Python says "serialize", networking folks say "encode"). They are interchangeable. Deserialize / Unmarshal / Decode mean bytes β object.
Schema is just a formal description of the data structure: what fields exist, what types they are, which ones are optional. Think of it as a blueprint that tells the decoder what each byte sequence means.
Core Concepts β The Six Terms That Unlock Everything
You'll see the same six concepts appear in every discussion about serialization β in interviews, in API documentation, in migration post-mortems. Master these six and every format description you read from here on will make immediate sense. We'll build each one from plain English first, then add the precise definition.
Schema
Imagine you're sending a box of Lego pieces with no instructions. The recipient sees pieces but doesn't know what they're supposed to build. A schema is the instruction manual β a formal description of what fields exist, what type each field is (string, integer, listβ¦), and which fields are optional vs required.
JSON has no mandatory schema β the field names are baked into every message ("self-describing"). Protobuf requires a .proto schema file shared between sender and receiver. Avro requires a .avsc schema, usually stored in a Schema Registry service.
Wire Format
The wire format is the actual byte layout that travels over the network (or sits on disk). Two formats can describe the same data but use completely different wire formats. JSON's wire format is UTF-8 text β human-readable characters. Protobuf's wire format uses variable-length integers and length-prefixed strings β compact binary that looks like random bytes without a schema.
Why care? The wire format determines how fast you can parse it and how efficiently it compresses. Text formats compress well with gzip. Binary formats are often already near-optimal.
Schema Evolution
Real systems change. You'll add a phone_number field to User next quarter. The problem: old messages on Kafka don't have that field. Old clients still sending data don't know about it. Schema evolution is the set of rules that determine what changes are "safe" (won't break anything) vs "breaking" (will crash consumers).
Safe: adding an optional field. Dangerous: renaming a field. Breaking: changing a field's type. Each format has different rules β Avro is the strictest and most explicit about this, which is why it's used in data pipelines where correctness over years matters.
Forward Compatibility
Scenario: you deploy Service B (new code) before you finish updating Service A. Service A is still sending messages in format V1. Service B expects V2. Forward compatibility means new code can read old data. The new code must be able to handle messages that are missing the new fields it added β usually by filling in defaults.
This is how rolling deployments work safely. You update receivers first, then senders. Without forward compatibility, you must update every consumer simultaneously β the dreaded "big bang" deployment.
Backward Compatibility
The reverse situation: you deployed Service B with new code, but there are still old Service B instances running (you're doing a gradual rollout). Those old instances see messages from Service A in the new format V2 β with fields the old code doesn't understand. Backward compatibility means old code can read new data β it ignores fields it doesn't recognize instead of crashing.
JSON handles this naturally (parsers typically ignore unknown keys). Protobuf handles it by using field numbers β unknown field numbers are preserved but ignored. XML is brittle about it without careful schema design.
Self-Describing vs Schema-Required
A self-describing format includes the field names inside every message. JSON is the prime example: {"name":"Alice"} tells you the field is called "name" right there in the bytes. Anyone can decode it without a separate document.
A schema-required format omits field names to save space. Protobuf stores only field numbers: byte 12 means field 2, a string. Without the .proto file, you can't decode "field 2" into "name". Self-describing formats are easier to debug but wasteful. Schema-required formats are compact but need tooling infrastructure.
A Brief History β XML β JSON β Binary β Zero-Copy
The dominant serialization format at any point in history reflects what the software industry valued at that moment. Understanding the evolution isn't just trivia β it explains why each format was designed the way it was and what problem it solved that its predecessor failed to. Formats don't get replaced because someone found a bug. They get replaced because the industry's priorities shift.
Here's the story arc in plain English. Each era solved the previous era's biggest pain point β and introduced the next era's biggest pain point.
XML Era (1996β2008) β "Machines Must Understand Each Other"
The late 1990s had a real problem: how do you make business software from different vendors talk to each other? A bank's mainframe, a supplier's Oracle database, a retailer's custom app β all written in different languages, different operating systems, different data models. The answer the industry settled on was XML: a verbose, human-readable, hierarchical text format with a powerful schema language (XSD) that could precisely describe any data structure.
SOAP (Simple Object Access Protocol) built on XML to create a standard for web services. It was incredibly thorough β it specified encoding, error handling, security, transaction management. It was also incredibly painful to work with. A simple "get user by ID" call required 50+ lines of XML boilerplate. Parsing was slow. Every language needed a heavy framework. But for its time, it worked. It proved that heterogeneous systems could interoperate.
What XML solved: vendor interoperability, formal contracts. What it failed at: developer ergonomics, bandwidth efficiency.
JSON Era (2002βpresent) β "Developers Must Be Happy"
Douglas Crockford documented JSON in 2002 β he didn't invent it exactly, but he recognized and named a subset of JavaScript object literal syntax that was already being used for data exchange. The genius of JSON was radical minimalism: no schema, no envelope, no namespace declarations. Just {"key": "value"}. A developer could understand it in five minutes.
Web 2.0 and REST made JSON the default. JavaScript applications in browsers could parse JSON natively with one function call. No libraries needed, no WSDL files, no code generation. REST + JSON became the default API style for the entire web. Today, roughly 95% of public APIs use JSON. It's not because it's the most efficient β it isn't. It's because it's the easiest to work with, and for public APIs where you don't control the client, ease wins.
What JSON solved: developer experience, language-agnostic simplicity. What it failed at: efficiency at scale, enforced contracts.
Binary Era (2008βpresent) β "Internal Services Must Be Fast"
As companies like Google, Facebook, and LinkedIn scaled to billions of users, they hit the ceiling of JSON's efficiency. Google's internal services exchange trillions of RPCs (remote procedure calls) per day. At that scale, JSON's verbosity and parsing overhead were measurable budget items β bandwidth costs real money, and slow serialization adds tail latency to every request.
Google open-sourced Protocol Buffers (Protobuf) in 2008. Facebook had already built Thrift (2007). LinkedIn built Avro (2009) with particularly good schema evolution for Kafka streaming. These formats share a design philosophy: define the schema separately, store only the data on the wire, use compact binary encoding. The trade-off: you need shared schema files, tooling for code generation, and you lose human readability. For internal service-to-service communication where you control both sides, this is an excellent trade-off.
Real usage today: Protobuf powers virtually all of Google's internal traffic. Avro is the default Kafka message format at companies like LinkedIn, Confluent, and Uber. Thrift is still used at Facebook (Meta) internally.
Zero-Copy Era (2013βpresent) β "Parsing Itself Is Too Slow"
Even Protobuf has a parsing step β you read bytes and build in-memory objects. For most applications this is negligible. But for some workloads β game engines rendering 120 frames per second, embedded systems with no garbage collector, high-frequency trading systems where a microsecond matters β even parsing is a bottleneck.
The answer: zero-copy formats like FlatBuffers (Google, 2014) and Cap'n Proto (2013). These formats store data in memory in the exact same layout used for the wire format. There is no "deserialization step" β you literally cast a pointer to the byte buffer and read fields directly. No allocation, no parsing, no copying. You get the wire bytes and immediately start reading fields from them.
Where this matters: games (FlatBuffers is widely used in Unity and Cocos2d-x projects), embedded systems, inter-process communication on the same machine (shared memory). Most web applications don't need this β Protobuf is already fast enough. But knowing this tier exists helps you understand the full performance spectrum.
There's no single "winner" format β different layers of the same system use different formats, and that's correct engineering:
- Frontend β Backend API β JSON (developers need to read it, clients are browsers and mobile apps you don't control)
- Internal microservice β microservice β Protobuf via gRPC (both sides under your control, performance matters)
- Event streaming to data lake (Kafka) β Avro or Parquet (schema evolution over years, columnar compression for analytics)
- Config files checked in to git β YAML or TOML (humans edit these, comments matter)
- Game engine / embedded β FlatBuffers or Cap'n Proto (zero-copy critical path)
A senior engineer picks the right format per layer, not one format for everything.
When to Use Which Format β A Decision Framework
Real engineers don't memorize a lookup table β they internalize a decision process. The right format depends on three questions: who controls both sides? (If you control both sides, binary is viable. If you expose a public API, it's not.) How often will the schema change? (If it changes frequently over years, schema evolution matters more than raw speed.) Do humans need to read the data? (Config files, API responses debugged in a terminal, data loaded in a browser devtool β humans need readable formats.)
Public APIs β JSON
When external developers integrate with your API β mobile apps, third-party dashboards, partner integrations β they need to read and debug responses without any special tooling. JSON is readable in a browser devtool, paste-able into Slack, and parseable by every language without a library. This is why roughly 95% of public APIs use JSON. The bandwidth overhead is real but acceptable because the integration simplicity is priceless.
Consider: JSON:API or OpenAPI/Swagger to add structure without moving away from JSON.
Internal Microservices β Protobuf / gRPC
When you control both sides of the communication β your Go service calling your Python service β you can agree on a schema up front. Protobuf + gRPC gives you: compact binary messages (~5β10x smaller than JSON for the same data), fast parsing, auto-generated client/server code in any language, and strongly-typed APIs that catch integration bugs at compile time. Google uses this for virtually all internal traffic.
Cost: you need a Schema Registry or shared repo for .proto files, and debugging raw gRPC traffic requires tooling (like grpcurl).
Event Streaming β Avro / Parquet
Kafka topics collect events for months or years. A Kafka topic created today will be read by code that doesn't exist yet. Avro's strongest feature is its explicit schema evolution rules β you register schemas in Confluent Schema Registry, and the registry enforces compatibility (you can't accidentally break consumers). Parquet extends this with columnar storage for analytical queries β reading only the user_id column from 1 billion events without reading the rest.
Real usage: Uber, LinkedIn, Netflix all use Avro for Kafka.
Mobile / Games β FlatBuffers
Mobile devices have limited CPUs and batteries. Game engines run on a tight frame budget (8ms per frame at 120fps). FlatBuffers' zero-copy design means there is literally no deserialization step β you access fields by computing an offset into the raw byte buffer. No allocation, no garbage collection pressure, no CPU cycles spent parsing. FlatBuffers is widely adopted in games and game tooling β Unity studios use it for asset bundles and network messages, and it ships in Cocos2d-x β precisely because the zero-copy access pattern fits a per-frame loop.
Trade-off: schema evolution is limited β mutating FlatBuffers is complex and often avoided.
Human-Edited Config β YAML / TOML
Kubernetes manifests, GitHub Actions workflows, Docker Compose files, Helm charts β all YAML. Why? Because humans write and read config files. YAML supports comments (# this field disables TLS), multi-line strings, and readable indentation. TOML (used in Rust's Cargo, Python's pyproject.toml) adds clearer syntax for tables. Neither is efficient for machine-to-machine communication β but that's not the use case. Humans are the audience.
Warning: YAML's implicit type coercion is a famous footgun β no in YAML is the boolean false, not the string "no".
When Format Doesn't Matter
Many teams over-engineer format selection. For a startup handling 1,000 requests per second with 1 KB payloads, the total bandwidth is 1 MB/s β negligible. The CPU cost of JSON parsing is a rounding error. In this range, pick JSON and move on. The switching cost of migrating to Protobuf later is higher than the bandwidth savings at low scale.
Rule of thumb: if your total serialization overhead is under 5% of your compute budget, format is not the bottleneck. Profile first, optimize second.
Teams switch to Protobuf because they heard "it's faster" β without measuring their actual bottleneck. They then spend months managing schema files, training new developers on gRPC tooling, setting up a Schema Registry, and writing debugging scripts for binary payloads. Three months later, they've saved 2ms per request but spent 200 engineer-hours in the migration. Optimization without measurement is gambling. Benchmark JSON performance first. Add compression (gzip, zstd) to JSON second β this often cuts payload size by 80% with zero schema management overhead. Only then consider Protobuf if the bottleneck persists.
The Same User Object β Three Formats Side by Side
Before we go deep into each format, here's a concrete preview: the same User { id: 42, name: "Alice" } expressed in JSON, Protobuf, and Avro. The code tabs below show what you actually write and what actually travels over the wire.
JSON is self-describing β the field names travel with every message. No schema file needed on either end. Any JSON parser in any language can decode this without prior knowledge of the structure.
// What travels over the wire β 100% readable as-is:
{"id":42,"name":"Alice"}
// Formatted for readability (same data, more bytes):
{
"id": 42,
"name": "Alice"
}
// Wire size: ~24 bytes (minified)
// Parsing: text scanner must read char-by-char, handle escapes
// Schema: none required β field names ARE the schema
// Debug: paste into browser console β works immediately
// Compression: gzip reduces to ~20 bytes (not much gain on tiny payloads,
// but 60-80% savings on payloads with many repeated field names)
JSON was designed for one-off data exchange where you can't guarantee the receiver has a schema. The field names are the schema β they travel with every message. This wastes bandwidth but makes the format universally decodable by anyone, in any language, forever.
Protobuf requires a .proto schema shared between sender and receiver. The schema is compiled into code (Go, Python, Java, etc.) that handles encoding/decoding. Field names are NOT in the wire bytes β only field numbers (1, 2, 3β¦). This is why it's ~62% smaller than JSON for this example.
// user.proto β shared schema file (NOT on the wire)
syntax = "proto3";
message User {
int32 id = 1; // field number 1 β this is what travels, not "id"
string name = 2; // field number 2 β not "name"
}
// Generated Go code (proto-compiler creates this):
// user.pb.go β User struct with Marshal() / Unmarshal() methods
// Wire bytes (hex) for User{id:42, name:"Alice"}:
08 2A 12 05 41 6C 69 63 65
// Decoded:
// 08 = field 1 (id), wire type 0 (varint)
// 2A = 42 in decimal (the value of id)
// 12 = field 2 (name), wire type 2 (length-delimited string)
// 05 = string is 5 bytes long
// 41 6C 69 63 65 = "Alice" in ASCII
// Wire size: 9 bytes (vs 24 for JSON = 62% smaller)
// Parsing: single-pass binary scan, no string processing
// Schema: .proto file REQUIRED to decode β without it, bytes are opaque
Because field numbers β not names β identify fields on the wire, you can safely rename a field in the .proto file as long as you keep the same number. Old senders still send field number 2; the new schema calls it full_name instead of name. Old data decodes correctly. This is backward-compatible renaming β something JSON can't do at all.
Avro uses a JSON-formatted schema definition (.avsc) but writes compact binary wire bytes β no field names, no field numbers. Instead, it relies on the writer's schema and reader's schema being present at decode time. The schema is typically stored in a Schema Registry (like Confluent's), and the wire bytes include only a schema ID (4 bytes) plus the data.
// user.avsc β Avro schema (JSON format, NOT binary)
{
"type": "record",
"name": "User",
"namespace": "com.example",
"fields": [
{ "name": "id", "type": "int", "default": 0 },
{ "name": "name", "type": "string", "default": "" }
]
}
// The "default" values are critical β they enable schema evolution.
// New fields MUST have defaults so old data (without those fields)
// can still be decoded by new code.
// Wire bytes for User{id:42, name:"Alice"} WITH schema registry:
00 00 00 00 01 54 0A 41 6C 69 63 65
// Decoded:
// 00 = magic byte (Confluent convention)
// 00 00 00 01 = schema ID 1 (reader fetches schema from registry)
// 54 = zigzag-encoded 42 (Avro uses zigzag for integers)
// 0A = 5 (string length, varint)
// 41 6C 69 63 65 = "Alice"
// Wire size: 12 bytes (vs 24 for JSON = 50% smaller)
// Schema evolution: BEST in class β registry enforces compatibility rules
// Kafka use case: producer writes schema once; millions of messages reference it
Kafka topics accumulate messages for months. Over time, the producer's schema and the consumer's schema will diverge β new fields are added, old ones deprecated. Avro + Confluent Schema Registry enforces compatibility at publish time: you cannot publish a schema change that would break existing consumers. This is the only format that bakes this guarantee into the infrastructure layer rather than relying on team discipline.
JSON β The Web's Universal Language
JSON has exactly six data types: object, array, string, number, boolean, and null. That's the entire grammar. There's nothing else. And that radical simplicity is precisely why JSON crushed every competitor for public APIs β it's so small that any developer can hold the whole spec in their head in an afternoon.
The web runs on JSON today. About 95% of all public APIs speak it natively. When you open your phone's weather app, a JSON payload just arrived. When you tweet something and the analytics dashboard updates, JSON is the courier. It became the lingua franca of the web not because of a committee decision, but because JavaScript was everywhere and JSON.parse() was already built in β no libraries needed.
What JSON gets right
JSON's wins are not accidental β each one comes from a deliberate constraint in the design.
Human-readable
You can open a JSON payload in any text editor and understand it instantly. No hex dumps, no decoder rings. This cuts debugging time dramatically β a senior engineer reading logs doesn't need a tool to tell them what the payload contains. That's worth a lot of engineering hours.
Native to JavaScript
JSON.parse() and JSON.stringify() are built into every browser and Node.js runtime. No external library to install, no version conflicts. When the web ran on JavaScript, this was the killer advantage β the data format and the language were designed to match each other.
Self-describing
Every field carries its own name inside the message. When you receive {"created_at":"2024-01-15"}, you know the field is called created_at just from the bytes β no separate schema file needed. This makes JSON ideal for exploratory work and public APIs where consumers don't want to track down a .proto file to read a response.
Universal tooling
Every programming language has a JSON library. Every database can store and query it. Every API testing tool (Postman, curl, Insomnia) speaks it natively. The ecosystem depth means you'll never hit a wall where a tool can't understand your data β a practical advantage that compounds across a team of 50 engineers using 10 different technologies.
Where JSON hurts
transaction_reference_id costs 26 bytes of name in every single record. Across a billion records, that's 26 GB of field names β for data that carries zero information because the name is already known by convention. Binary formats solve this by encoding field names as 1-2 byte integers.
No native types for common real-world data. JSON's number type is IEEE-754 double-precision float β which can't accurately represent integers larger than 253. This is why Twitter's API returns tweet IDs as both a number AND a string: JavaScript clients that parse the number lose precision on large IDs, so the string version is the safe fallback. Dates, byte arrays, and decimals all get serialized as strings β and then you need out-of-band conventions to know what those strings mean.
No comments. You can't write // this is the user's external ID inside JSON. Configuration files (like package.json) work around this with "_comment" keys, or switch to JSON5/YAML. It's a small annoyance that accumulates when you're maintaining infrastructure configuration.
Pretty-printed JSON is what you'd see in API docs or a curl response with jq. Whitespace is free to add β it changes nothing about the data, only readability. Most APIs send minified JSON over the wire to save bandwidth.
{
"id": 42,
"name": "Alice",
"email": "alice@example.com",
"age": 29,
"verified": true,
"address": {
"city": "New York",
"country": "US"
},
"tags": ["admin", "beta-user"],
"last_login": "2024-01-15T10:30:00Z",
"balance": null
}
Notice: last_login is a string, not a native datetime. balance is null (the user has no balance recorded, distinct from zero). The structure is immediately readable β no decoder needed.
Minified JSON strips all whitespace. Same data, fewer bytes. Gzip compression on top typically reduces JSON a further 70β80% for real payloads because field names repeat predictably across records.
{"id":42,"name":"Alice","email":"alice@example.com","age":29,"verified":true,"address":{"city":"New York","country":"US"},"tags":["admin","beta-user"],"last_login":"2024-01-15T10:30:00Z","balance":null}
Pretty version: ~215 bytes. Minified: ~185 bytes. With gzip: ~120 bytes. The compressor collapses the repeated field names across thousands of similar records β which is why columnar formats (Parquet) are even smaller for analytics workloads.
Every mainstream language has JSON parsing in its standard library. The API is always the same idea: give it bytes, get back a native object.
import json
payload = '{"id":42,"name":"Alice","verified":true}'
# Deserialize: bytes β Python dict
user = json.loads(payload)
print(user["name"]) # "Alice"
print(type(user["id"])) # <class 'int'>
# Serialize: Python dict β bytes
data = {"id": 42, "name": "Alice"}
text = json.dumps(data) # '{"id": 42, "name": "Alice"}'
package main
import (
"encoding/json"
"fmt"
)
type User struct {
ID int `json:"id"`
Name string `json:"name"`
Verified bool `json:"verified"`
}
func main() {
payload := []byte(`{"id":42,"name":"Alice","verified":true}`)
var user User
json.Unmarshal(payload, &user) // deserialize
fmt.Println(user.Name) // "Alice"
bytes, _ := json.Marshal(user) // serialize
fmt.Println(string(bytes))
}
XML β The Format That Won't Die
Before JSON, there was XML. The year was 1998 β JavaScript barely existed, the web was exploding, and enterprises needed a way to exchange structured data between systems. XML stepped in and dominated for over a decade. SOAP web services, RSS and Atom feeds, Maven build files, Spring configuration, Microsoft Office documents, SVG graphics, XHTML β all XML. It's verbose, sure, but XML solved real problems that JSON still can't handle cleanly.
The core idea is simple: you wrap data in matching open and close tags, like <name>Alice</name>. Tags nest arbitrarily. Each element can have attributes (metadata on the tag itself) and text content. That combination β element + text + nested elements β is what lets XML represent documents with mixed content in a way JSON fundamentally cannot. An HTML paragraph with a bold word inside it is trivial in XML. In JSON it would require a custom schema to model the inline structure.
Why XML is still alive in 2026
JSON never replaced XML in document-centric domains. When you download a Word .docx file and unzip it, you find XML files inside β word/document.xml is the entire document text with formatting. Excel is the same. This isn't legacy inertia; XML's mixed-content model (text with inline tags) is genuinely better for documents than JSON's clean data-only model.
Office documents
.docx, .xlsx, .pptx are ZIP archives of XML files. The Open XML standard (OOXML) is an ISO specification, which means every tool in the world can read Office files without needing Microsoft's proprietary code. XML's ability to mix text and markup makes it the right tool here β a paragraph like "Hello world" is natural in XML, awkward in JSON.
Regulated industries
XBRL (eXtensible Business Reporting Language) is mandatory for financial filings with the SEC. HL7 v2 and FHIR power healthcare data exchange. FIX protocol (financial trading) has XML variants. These industries standardized on XML years ago, and their standards don't change on startup timescales β the compliance burden of switching is enormous.
Java enterprise ecosystems
Spring configuration (applicationContext.xml), Maven pom.xml, web.xml, Hibernate mappings β Java's enterprise stack was architected in the XML era. Even though modern Spring favors Java annotations and YAML configs, billions of lines of XML config exist in production Java applications today.
SVG and XHTML
SVG β the format used for the diagrams on this very page β is XML. That's why you can open an .svg file in a text editor and read it. XHTML was the W3C's attempt to make HTML XML-strict. HTML5 won instead, but SVG kept the XML syntax because it genuinely benefits from namespace-based extension and XPath tooling for programmatic manipulation.
XML's real costs
The verbosity tax is severe. Every element needs a closing tag, so the tags themselves can easily double the byte count. Namespaces β the mechanism for avoiding naming conflicts across schema owners β are notoriously confusing (a namespace declaration like xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" is just a string prefix, but the rules for when it applies are subtle). XML Schema Definition (XSD) is a complete validation language that takes weeks to master properly. XML parsers come in three incompatible flavors: DOM (load everything into memory), SAX (event-driven streaming), and StAX (cursor-based streaming) β picking the wrong one causes out-of-memory crashes on large files.
Still, about 30% of B2B integrations use XML today, and SOAP web services remain alive in banking, insurance, and government. If your career touches those industries, you'll read XML.
XML for a simple user record. Count the bytes spent on closing tags alone β each element name is written twice.
<?xml version="1.0" encoding="UTF-8"?>
<user id="42">
<name>Alice</name>
<email>alice@example.com</email>
<age>29</age>
<verified>true</verified>
<address>
<city>New York</city>
<country>US</country>
</address>
<tags>
<tag>admin</tag>
<tag>beta-user</tag>
</tags>
</user>
<!-- ~350 bytes -->
Exactly the same data in JSON. The XML version is about 70% larger for identical information β and that ratio gets worse as field names get longer.
{
"id": 42,
"name": "Alice",
"email": "alice@example.com",
"age": 29,
"verified": true,
"address": { "city": "New York", "country": "US" },
"tags": ["admin", "beta-user"]
}
// ~185 bytes β roughly half the XML size
Protocol Buffers β Google's Internal Standard
Picture every service at Google β Search, Maps, YouTube, Gmail β sending data to every other service millions of times per second. Each of those messages needs to be as small as possible (bandwidth costs money), as fast to encode/decode as possible (CPU costs money), and stable enough that when an engineer adds a field to the User object, the old services don't break. JSON's self-describing verbosity and XML's tag doubling are both unacceptable at that scale. So in 2008, Google open-sourced their solution: Protocol Buffers, usually called Protobuf.
The core idea: instead of describing the shape of data inside every message (like JSON does with its field names), you describe it once in a .proto schema file. A compiler reads that file and generates typed code in Go, Java, Python, C++, or a dozen other languages. That generated code knows exactly what bytes mean what β so the wire format contains just numbers and values, no names at all. The result is shockingly compact.
Why the wire format is so compact
The secret is that field names never travel over the wire. Instead, each field in your .proto file gets a small integer tag β id = 1, name = 2. The wire format encodes each field as a compact key-value pair: the key is the tag number + a 3-bit "wire type" code that tells the decoder whether what follows is a varint, a fixed 64-bit value, a length-delimited blob, etc. A small integer like 42 encodes as a single byte with varint (variable-length integer) encoding. The string "Alice" encodes as a length byte (5) followed by 5 bytes of UTF-8. No quotes. No field names. No braces.
Schema evolution rules
The tricky part of binary formats is what happens when your schema changes. JSON's self-describing nature means old code ignores unknown fields and new code handles missing fields with defaults β it's loose and forgiving. Protobuf's wire format is compact precisely because it doesn't carry names, which means you need explicit rules for what changes are safe.
Safe changes
Adding a new field is always safe. Old services that don't know about field 4 will simply skip its bytes β the field tag tells them how many bytes to skip. New services receive the field normally. This is the most common evolution pattern.
Renaming a field is safe at the wire level. The wire carries only the numeric tag, not the name. But generated code uses the name β so all codebases need to be recompiled after a rename. Wire-compatible, code-breaking.
Widening a number type β int32 β int64 β is usually safe for values that fit, though not formally guaranteed by the spec.
Dangerous changes
Removing a field is risky. Old services that still send the removed field will populate bytes that new services don't expect at that tag number. If you later add a new field with the same tag, the old services will accidentally populate the new field with the wrong type of data β silent corruption.
Changing a field's type β especially int32 β string β is almost always breaking. The wire type code changes, causing parse errors.
Reusing a deleted field's tag number is the worst mistake. Always mark old tags as reserved 5 so the compiler prevents accidental reuse.
reserved 5; and optionally reserved "old_field_name"; in your .proto file. The Protobuf compiler will then reject any future attempt to define a field with that number, protecting you from silent data corruption if old services are still sending bytes tagged as field 5.
Real-world scale: Protobuf is the most commonly used data format at Google and powers nearly all inter-service RPC there β at that traffic level, internal Protobuf flow runs into the petabytes per day. Serialization speed is ~5β100x faster than equivalent JSON depending on payload complexity. Wire size is typically 3β10x smaller. These aren't theoretical gains β at Google's traffic, even a 5% improvement in wire efficiency saves tens of millions of dollars in bandwidth annually.
The .proto file is the source of truth. Field numbers are the permanent contract β names can change, types have limited change rules, but the number is locked forever once you've shipped data.
syntax = "proto3";
package users;
message User {
int32 id = 1; // tag 1 β NEVER reuse if deleted
string name = 2; // tag 2
string email = 3; // tag 3
bool verified = 4; // tag 4
// Safe to add new fields with new tag numbers:
// string phone_number = 5;
// If you delete a field, protect the tag:
// reserved 6;
// reserved "old_legacy_field";
}
message UserList {
repeated User users = 1; // "repeated" = array/list
}
After running protoc --go_out=. user.proto, you get a generated file with a typed Go struct. You never write this file by hand β the compiler maintains it.
// Generated code (user.pb.go) β don't edit by hand
// type User struct { Id int32; Name string; Email string; Verified bool; ... }
// Your application code:
package main
import (
"fmt"
"google.golang.org/protobuf/proto"
pb "your-module/users"
)
func main() {
user := &pb.User{
Id: 42,
Name: "Alice",
Email: "alice@example.com",
Verified: true,
}
// Serialize to wire bytes
bytes, err := proto.Marshal(user)
if err != nil {
panic(err)
}
fmt.Printf("Wire bytes: %d bytes\n", len(bytes)) // ~35 bytes
// Deserialize from wire bytes
var decoded pb.User
proto.Unmarshal(bytes, &decoded)
fmt.Println(decoded.Name) // "Alice"
}
The raw bytes for User { id: 42, name: "Alice", email: "alice@example.com", verified: true }. Notice: no quotes, no field names, no braces β just tag bytes and value bytes.
08 2A -- field 1 (id), varint, value = 42
12 05 41 6C 69 63 65 -- field 2 (name), length=5, "Alice"
1A 11 61 6C 69 63 65 40 65 78 -- field 3 (email), length=17
61 6D 70 6C 65 2E 63 6F 6D -- "alice@example.com" continued
20 01 -- field 4 (verified), varint, value = 1 (true)
Total: ~35 bytes
JSON equivalent: ~95 bytes (63% larger)
XML equivalent: ~215 bytes (514% larger)
Apache Thrift β Facebook's Contemporary
One year before Google open-sourced Protobuf, Facebook was solving the exact same problem: hundreds of internal services needed to talk to each other efficiently, with typed schemas and generated code. Their answer, built around 2007, was Apache Thrift. The philosophy is identical to Protobuf β define your data types and services in an IDL (Interface Definition Language) file, run a compiler, get generated code in your language. But Thrift came with one thing Protobuf originally didn't: a complete, built-in RPC framework with transport and server skeletons.
Think of it like two competing kitchen knife sets. Both cut food (binary serialization, generated types). But Thrift also comes with the cutting board and the storage block (RPC transport, server boilerplate), while Protobuf originally came as just the knives β you were expected to bring your own serving infrastructure (which eventually became gRPC).
Where Thrift shines and where it doesn't
Thrift's standout feature is its breadth β around 25 language targets including Erlang, Haskell, OCaml, and D, which Protobuf's official compiler never formally supported. If you're running a polyglot system with some unusual language in the stack (maybe old Erlang services handling telephony), Thrift was often the only schema tool with a generator for it.
The pluggable transport and protocol stack is also genuinely useful. You can switch a service from TBinary (fast, opaque) to TCompact (varint encoding, ~30% smaller than TBinary) or TJson (for debugging) without changing your application code β just swap the protocol object at startup. gRPC doesn't offer this level of runtime protocol flexibility.
Where Thrift lost ground is ecosystem momentum. Facebook's internal investment slowed after ~2014 (they maintain a private fork called fbthrift today). The Apache project moves slowly. Meanwhile gRPC β Protobuf plus HTTP/2 β landed with Google's full backing, rich tooling, browser support via grpc-web, and a booming community. For new services in 2025, almost every team picks gRPC. Thrift's core user base is companies that built on it a decade ago and have too much invested to switch.
Still used by: Meta, Twitter (legacy), Cloudera, Evernote, and any company that built SOA infrastructure in the 2008β2014 era.
Thrift uses its own IDL syntax. Notable differences: types are i32/i64 not int32/int64; services and exceptions are first-class IDL concepts; no syntax = "proto3" preamble.
namespace java com.example.users
namespace go users
struct User {
1: required i32 id
2: required string name
3: optional string email
4: optional bool verified = false
}
exception UserNotFound {
1: i32 id
2: string message
}
service UserService {
User getUser(1: i32 id) throws (1: UserNotFound e),
list<User> listUsers()
}
The same User + service in Protobuf. Key structural differences: Protobuf uses int32/string/bool, services are called rpc inside a service block, and required/optional distinctions were removed in proto3 (all fields optional by default).
syntax = "proto3";
package users;
message User {
int32 id = 1;
string name = 2;
string email = 3;
bool verified = 4;
}
// gRPC service definition
service UserService {
rpc GetUser(GetUserRequest) returns (User);
rpc ListUsers(ListUsersRequest) returns (UserList);
}
message GetUserRequest { int32 id = 1; }
message ListUsersRequest {}
message UserList { repeated User users = 1; }
Apache Avro β Schema-First with a Registry
Imagine you're running a data pipeline. A producer service writes a million records per second to Kafka. Those records get stored in a data lake on S3. Six months from now, an analyst wants to read those records. The schema has changed three times since they were written. How does the analyst's code know what the old records look like?
This is the exact problem Apache Avro was designed to solve. Avro is Hadoop's native serialization format (built in 2009 as part of the Hadoop ecosystem) and it became the default choice for Kafka payloads when Confluent built their Schema Registry. The core insight: instead of baking schema knowledge into generated code (like Protobuf does), Avro either embeds the schema directly in the data file, or stores schemas in a central registry and references them by a small integer ID. Every record is self-describing at the batch level.
Why Avro wins for streaming and data lakes
The magic byte + schema ID trick in the wire format is elegantly simple. Every Avro message in Kafka starts with a literal 0x00 byte (the Confluent magic byte), followed by 4 bytes encoding the schema ID, followed by the Avro binary payload. A consumer that receives the message reads those first 5 bytes, looks up the schema in the registry (caching it after the first lookup), and then knows exactly how to interpret the remaining bytes. The schema doesn't travel with every single message β just its integer ID does. This keeps messages compact while ensuring that messages written months apart with different schema versions can all be decoded correctly.
The Schema Registry enforces compatibility rules explicitly. You declare BACKWARD compatibility (new consumers can read old data), FORWARD (old consumers can read new data), or FULL (both directions). The registry rejects schema registrations that would break the declared compatibility β before the broken schema ever touches production. This is stronger than Protobuf's informal "add fields are safe" convention because it's mechanically enforced at registration time.
Avro's schema format is JSON (.avsc files), which makes schemas easy to read, diff in git, and review in pull requests. Contrast with XSD (XML Schemas), which is famously arcane.
Avro schemas are JSON documents. This one defines a User record with field defaults β defaults are critical for backward compatibility (new fields must have defaults so old records without them can still deserialize).
{
"type": "record",
"name": "User",
"namespace": "com.example.users",
"fields": [
{ "name": "id", "type": "int" },
{ "name": "name", "type": "string" },
{ "name": "email", "type": ["null", "string"], "default": null },
{ "name": "verified", "type": "boolean", "default": false },
{
"name": "created_at",
"type": {
"type": "long",
"logicalType": "timestamp-millis"
},
"default": 0
}
]
}
Notice ["null", "string"] β that's an Avro union type. The default null means old records without an email field will deserialize cleanly to null. This is the backward compatibility pattern.
Avro binary encoding is similar in concept to Protobuf but uses the schema's field order (not tag numbers) to decode. Field names are NOT in the wire β just values in schema-declared order.
Confluent wire framing for Kafka:
00 -- magic byte
00 00 00 02 -- schema_id = 2 (big-endian int)
Avro binary payload (field order matches schema):
54 -- id = 42 (zigzag varint: 42*2 = 84 = 0x54)
0A 41 6C 69 63 65 -- name = "Alice" (length=5 as varint, then bytes)
00 -- email = null (union index 0 = null)
01 -- verified = true (boolean)
00 -- created_at = 0 (default, encoded as zigzag)
Total payload: ~12 bytes
JSON equivalent: ~120 bytes (10x larger)
Avro also supports a JSON encoding mode β same schema, but the bytes are a JSON document instead of binary. You'd use this for debugging or when you need human-readable output but still want schema validation. Not for production throughput.
{
"id": 42,
"name": "Alice",
"email": { "string": "alice@example.com" },
"verified": true,
"created_at": 1705312200000
}
// Note: union fields need an explicit type wrapper:
// "email": { "string": "alice@example.com" } β union with "string" selected
// "email": null β union with null selected
// This is verbose but unambiguous β useful for debugging schema evolution issues.
MessagePack, BSON, CBOR β JSON's Binary Cousins
Sometimes you want JSON's flexibility β no upfront schema definition, no compiler step, easy to add fields whenever you want β but you're hitting real pain with JSON's verbosity or parsing overhead. Binary JSON formats fill that exact gap. They keep JSON's data model (object, array, string, number, bool, null) but encode it in binary, saving 30β60% in wire size and 2β5x in parse time.
Three formats dominate this space, each optimized for a different environment:
MessagePack β Binary JSON for Services
MessagePack is the simplest of the three: it's JSON's exact data model, encoded as binary. An object is encoded as a count of key-value pairs followed by alternating keys and values. A string is a length byte (or multi-byte for long strings) followed by UTF-8 bytes. No field names indexed separately β just the same key-string bytes as JSON, but without quotes or commas.
The result is typically 30β50% smaller than JSON and 2β5x faster to encode/decode because the parser never needs to handle string escaping, whitespace, or number-as-text parsing. MessagePack also adds one type JSON lacks: a native binary/bytes type, letting you embed raw binary blobs without base64 encoding them as strings first.
Used by: Redis pub/sub, Pinterest (internal services), Twitter (legacy), Fluentd (log shipper). Widely supported β 50+ language implementations. No schema registry, no compiler. Drop-in replacement wherever JSON flexibility is needed with less overhead.
BSON β MongoDB's Native Format
MongoDB designed BSON ("Binary JSON") for one purpose: storing documents efficiently in a database that also needs to scan through documents quickly. JSON's text format is terrible for this β finding field 15 in a JSON object means you have to scan every preceding character. BSON prepends the size of each element, so a database can jump to any field in O(1) by reading ahead.
BSON also adds native types that MongoDB needs for application data: ObjectId (MongoDB's 12-byte primary key type), datetime (stored as UTC milliseconds since epoch, not a string), decimal128 (precise decimals for money), byte arrays (binary data without base64), and regular expressions. These additions make BSON ideal if you're already in the MongoDB world β you rarely need to convert types.
The trade-off: BSON is actually slightly larger than JSON for small documents because the size prefixes add overhead. This surprises people β they expect binary to always be smaller. MessagePack beats BSON on wire size; BSON beats MessagePack on random-access speed inside a database engine.
CBOR β The IETF Standard for IoT
CBOR (Concise Binary Object Representation, RFC 7049) is what the IETF (the internet standards body) standardized as the binary-JSON-for-constrained-environments format. The design targets IoT devices β sensors, microcontrollers β where RAM is kilobytes, not gigabytes, and every byte of message overhead matters.
CBOR has roughly the same wire efficiency as MessagePack but adds more type variety: native dates and times, big integers, floating-point half-precision (16-bit floats for sensor readings), tagged types (user-defined extensions). The tagging system is particularly clever β you can add application-specific meaning to any value without breaking the base format.
Where you'll see CBOR in production: WebAuthn/FIDO2 (the passwordless authentication standard β browser-to-authenticator messages are CBOR), COSE (CBOR Object Signing and Encryption β signed/encrypted JSON-like payloads), Matter (smart home IoT protocol), and various IETF OAuth and certificate extensions.
When to use each
MessagePack: when you want a JSON-like workflow but with less bandwidth and faster parsing. No schema, no compiler, just swap your JSON serializer for a MessagePack one. Internal service communication where you control both sides.
BSON: almost never β unless you're building a MongoDB driver or working directly with MongoDB's wire protocol. Your application code sees BSON through the MongoDB driver; you don't usually choose it directly.
CBOR: when an IETF standard requires it (WebAuthn, COSE), or when you're building for genuinely constrained IoT devices and need a standardized binary format with maximum library support across embedded toolchains.
Real numbers summary: MessagePack encodes/decodes roughly 2β5x faster than JSON (CPU cycles saved on escape-parsing and number-to-text conversion). BSON is ~5β15% larger than MessagePack for typical small documents but supports random-access field lookups internally β that's why MongoDB uses it for storage even though network payloads go through the driver. CBOR is roughly size-equivalent to MessagePack but with a richer type system and an IETF-standard number that lets you cite it in RFCs. All three lack the 3β10x size advantage that Protobuf/Avro achieve by eliminating field names from the wire entirely.
FlatBuffers & Cap'n Proto β Zero-Copy Serialization
Every format we've looked at so far has a hidden cost that's easy to miss: parsing. When your code calls Protobuf.parse(bytes), the library walks every byte in the buffer, interprets the wire-type tags, allocates new heap objects, and copies field values into them. For a tiny 100-byte message, that's a handful of microseconds β totally invisible. For a 100 MB game asset or a 50 MB ML model weight file loaded thousands of times, it's a painful bottleneck.
Zero-copy formats solve this by laying out bytes in memory exactly the way the CPU expects to see them. Instead of "parse the buffer β build heap objects β read fields from heap," you do "cast a pointer at the buffer β read fields directly from the buffer." There's nothing to parse because the binary layout is the object. Fields you never access are never touched at all.
The Two Zero-Copy Formats
FlatBuffers (Google, 2014)
Created by Google for game development β the first user was Cocos2d-x and Call of Duty: Black Ops III, where loading a 200MB asset file in 80ms vs 0.5ms is a tangible player-experience difference. The format is also used in TensorFlow Lite to ship ML model weights to mobile devices, where you want to memory-map the model file directly and never copy bytes into heap. Schema required (a .fbs file), and the generated accessor code does the offset arithmetic for you.
Cap'n Proto (Kenton Varda, 2013)
Written by the original Protobuf engineer at Google after he left for Sandstorm. The pitch: "Protobuf was a good idea; let's do it right." Cap'n Proto adds zero-copy to Protobuf's ideas AND bundles a built-in RPC layer (Cap'n Proto RPC) so you skip the serialize-network-deserialize round-trip entirely for in-process calls. Used in Cloudflare Workers for sandbox boundary communication, where the overhead of copying every request/response between the isolate and the runtime adds up at billions-of-requests scale.
When Zero-Copy WINS
- Read-heavy, write-once data β game assets, ML model weights, config blobs. Written once, loaded thousands of times.
- Huge payloads β 10MB+ where a 80ms parse vs 0.2ms pointer cast is user-visible.
- Memory-mapped files β you can
mmap()a FlatBuffers file and access it without ever reading the whole thing into RAM. - Cache-friendly hot paths β the CPU's L1/L2 cache benefits because you never copy data; you read it directly from the original buffer.
When Zero-Copy Hurts
- Write-heavy or mutation-heavy data β FlatBuffers buffers are immutable once built. Mutations require rebuilding the whole buffer. Encode speed is ~3x slower than Protobuf.
- Deeply nested schemas β the vtable indirection per object adds up for very deep structures.
- Developer ergonomics β generated accessor code is less natural to write than a plain Protobuf struct. Debugging raw offsets is painful.
- Typical request/response APIs β JSON or Protobuf is fast enough for 99% of web APIs. Don't add FlatBuffers complexity for a 1ms savings.
Parquet & ORC β Columnar Storage Formats
Every format we've discussed stores data row by row: all of User #1's fields together, then all of User #2's fields, and so on. This is natural β it's how you think about a single record β and it's exactly right when you need to fetch a single user by ID. But imagine you're an analyst who wants to answer: "What was the average order amount last month?" You don't need names, emails, or phone numbers. You need ONE column out of hundreds. With a row-oriented format, you must read every single byte of every row to get to that one number per row. You're paying for 100 fields when you need 1.
Columnar formats flip the layout: all user IDs are stored together, all names together, all order amounts together. A query that touches only the amount column reads only the amount column's bytes β skipping everything else entirely. For analytics workloads that scan billions of rows but touch 3-5 columns, this is a 10-50x I/O reduction.
Four Big Wins of Columnar Storage
Compression (5β20x better)
In a row layout, consecutive bytes look like: 1, "Alice", 99.5, 42, "Bob", 12.0β¦ β a mix of integers, strings, and floats. Compressors need variety to encode context. In a columnar layout, the age column is 42, 31, 28, 45β¦ β all integers, mostly similar range. Run-length encoding, delta encoding, and dictionary encoding all work far better on homogeneous data. Parquet's timestamp columns in real Snowflake tables compress 15β20x routinely.
Column Pruning (skip columns entirely)
When you query SELECT amount FROM orders WHERE date > '2024-01-01', Parquet reads only two columns β amount and date β from disk. The other 998 columns never leave storage. At cloud storage prices ($0.023/GB on S3), scanning 1 GB instead of 1 TB saves ~$0.023 per query, but at Athena pricing ($5 per TB scanned) it saves $4.99 per query. This adds up to millions per year at data-platform scale.
SIMD Vectorization (modern CPU speedup)
Modern x86/ARM CPUs have SIMD instructions (SSE4, AVX-512) that process 8, 16, or 32 values at once. A SUM over an int64 column becomes a tight loop of VPADDQ (vector add 4 int64s at once). This only works when the values are contiguous in memory β which columnar layout guarantees. Row-oriented layouts interleave types, preventing vectorization. Parquet + Arrow queries routinely hit 10β50 GB/s throughput on a single core this way.
Predicate Pushdown (filter before deserialize)
Parquet stores per-column min/max statistics in each row group's footer. If you query WHERE date > '2024-06-01' and a row group's date column has max='2024-01-31', Parquet skips that row group entirely β never reads its bytes from disk. This is called a skip-scan and can eliminate 90%+ of I/O for time-range queries on time-sorted data. Spark and Trino push predicates into Parquet readers automatically.
The Two Dominant Columnar File Formats
Parquet (Twitter + Cloudera, 2013)
The dominant format for modern data lakes. It's the default for Apache Spark, AWS Athena, Google BigQuery external tables, Trino/Presto, Delta Lake, Apache Iceberg, and Apache Hudi. Parquet uses a nested columnar model inspired by Dremel (Google's internal query system) that handles nested structures (arrays, maps, structs) via Dremel's repetition/definition levels β not just flat tables. This is why Spark can write a DataFrame of complex JSON structures as Parquet and read it back correctly.
ORC (Hortonworks, 2013)
Created at the same time as Parquet in the Hadoop/Hive ecosystem. ORC has a slightly different internal structure (stripe-based with bloom filters per stripe vs Parquet's row-group model) and historically outperformed Parquet on some Hive queries. Today, Parquet has largely won the ecosystem battle β it has more tooling, more cloud-native integrations, and the Apache Arrow community uses it as the canonical file format. ORC is still used in Hive, Impala, and some older Hadoop deployments. If you're starting fresh, pick Parquet.
Performance Comparison & Benchmarks
The only correct way to pick a format is to benchmark with your actual data β deeply nested vs flat, many small strings vs a few large ones, high field-count vs sparse objects. That said, canonical benchmark numbers let you reason about the order-of-magnitude differences before you write a single line of code. The numbers below are compiled from the jvm-serializers benchmark suite, Google's own Protobuf benchmarks, and community reports from Netflix, LinkedIn, and Uber engineering blogs.
Full Comparison Table
| Format | Wire Size (vs JSON) | Encode Speed | Decode Speed | Schema Required? | Self-Describing? |
|---|---|---|---|---|---|
| JSON | 100% (baseline) | ~200 MB/s | ~150 MB/s | No (optional JSON Schema) | Yes β field names in bytes |
| XML | ~150% (larger) | ~50 MB/s | ~30 MB/s | Optional (XSD/DTD) | Yes β tags in bytes |
| Protobuf | ~30% | ~500 MB/s | ~600 MB/s | YES (.proto file) | No β numeric tags only |
| Thrift | ~30% | ~400 MB/s | ~500 MB/s | YES (.thrift file) | No |
| Avro | ~30% | ~400 MB/s | ~500 MB/s | YES (or embedded) | Embeds schema in file header |
| MessagePack | ~70% | ~600 MB/s | ~700 MB/s | No | Yes β type tags in bytes |
| BSON | ~75% | ~400 MB/s | ~500 MB/s | No | Yes β field names in bytes |
| FlatBuffers | ~30% | ~150 MB/s (slow) | ~β (zero parse) | YES (.fbs file) | No β vtable offsets only |
json module; simdjson can parse JSON at memory-bandwidth speed), JIT warmup state, and buffer sizes. Always measure with your actual payload shapes.
Schema Evolution Patterns
Code changes faster than data. The Protobuf message you defined in 2020 is still sitting in your S3 bucket in millions of files; the Java class that reads it has been changed 47 times since. Your Kafka topic has been consuming messages for 3 years; the original producer team was acquired and no longer exists. Schema evolution is the set of rules and practices that let you change your data format over time without corrupting existing data or breaking existing code.
The key insight: every schema change has three separate concerns. What happens to old data (already written) when read by new code? What happens to new data (written with new schema) when read by old code? And what happens to data in transit right now, while you're in the middle of a rolling deployment?
Four Rules of Safe Schema Evolution
1. Adding fields is always safe
Old code that receives new data with an unknown field will simply ignore it β this is how Protobuf's unknown field preservation works. New code that receives old data missing the new field will get a default value (0, "", empty list). As long as your code handles defaults sensibly (and you don't assume "field present = X happened"), adding fields is a no-drama operation.
2. Removing fields is risky
When you remove a field, old code still parses it and old data still has it. The risk isn't a crash β it's silent behavioral change. If old code has logic like if (score > threshold) award_badge() and the new schema stops sending score, old code silently defaults score to 0 and stops awarding badges. No error, no alarm, just wrong behavior. Always deprecate before removing; reserve the field tag in Protobuf to prevent future reuse of that tag number.
3. Renaming fields is wire-safe but code-breaking
This is Protobuf's key design win: fields are identified by their number tag on the wire, not their name. Renaming user_name β username in the .proto file changes zero bytes on the wire β existing readers and writers still work. But your generated code now has a different accessor name (.username() vs .user_name()), which breaks every Java/Go/Python file that called it. Wire-safe, code-breaking. JSON renames break both wire and code.
4. Type changes are usually breaking
Changing zip_code: int32 to zip_code: string means the bytes at that field position have completely different encoding. Old code reading new bytes tries to decode a UTF-8 string as a 4-byte varint β corrupt value or crash. The safe exception: widening a numeric type (int32 β int64) where the extra bits are just zero-extended β most Protobuf implementations handle this transparently.
Four Patterns for Evolving Safely in Practice
Always add, never remove
Dead fields rot in your schema forever. This is ugly and it's the right choice. A field named legacy_score that nobody uses is far less dangerous than a removed field that unknown consumers still depend on. Treat unused fields as documentation: "we once cared about this; be careful before removing." Schedule removal only after you've confirmed zero consumers in your service mesh.
Reserve removed field tags (Protobuf)
Protobuf has a reserved keyword for exactly this. If you remove field 5 today, a future developer might add a new field 5 with a completely different type. Old data still in S3 has the old bytes at position 5 β they'd now be decoded as the new field. reserved 5; causes protoc to reject any future schema that tries to reuse tag 5. It's a one-line safeguard against a subtle, years-later bug.
Version your schemas
For breaking changes you can't avoid (a full table reshape, a type migration), keep both versions in production simultaneously. Route old producers to v1, new producers to v2, and migrate consumers one by one. Only decommission v1 after all consumers have moved. In Protobuf this is sometimes as simple as optional email_v2 string = 15; alongside the deprecated original β two fields, gradual migration, no big-bang cutover.
Use a Schema Registry
A Schema Registry is a central service that stores schema versions and enforces evolution rules at write time β before bad schemas ever reach production. When a producer tries to register a new schema version, the registry checks: "is this BACKWARD compatible with v1?" If no, it rejects the registration with an error message explaining the violation. Confluent reports rejecting ~5% of new Kafka schema submissions for compatibility violations β each rejection is a production bug that never happened.
// v1 β shipped 2022
message User {
int32 id = 1;
string name = 2;
}
// v2 β shipped 2024, BACKWARD compatible
message User {
int32 id = 1;
string name = 2;
string email = 3; // NEW: old code ignores this field entirely
// new code gets "" if reading old data β handle that!
}
email is an unknown field β silently ignored. New consumers reading v1 data: email missing β default empty string. Both directions work. This is the ideal evolution.
// v1 β shipped 2022
message User {
int32 id = 1;
string user_name = 2; // field tag 2, name "user_name"
}
// v2 β renamed field (same wire tag = 2, different name)
message User {
int32 id = 1;
string username = 2; // still tag 2 on the wire β IDENTICAL bytes
// generated accessor is now .username() not .user_name()
}
// Result:
// β
Wire-safe: bytes are identical for both schemas
// β Code-breaking: every file that calls .user_name() or user.user_name fails
// Fix: update all call sites before shipping v2, or keep both with an alias
// v1 β shipped 2022
message User {
int32 id = 1;
string name = 2;
int32 score = 3; // used in badge-awarding logic
}
// DANGEROUS v2 β score removed without coordinating consumers
message User {
int32 id = 1;
string name = 2;
// score deleted β but old code still has: if (user.score > 100) award_badge()
// Old code reading v2 data gets score = 0 (default)
// badge-awarding stops silently. No error thrown.
}
// SAFER v2 β deprecate instead of remove
message User {
int32 id = 1;
string name = 2;
int32 score = 3 [deprecated = true]; // still on wire, just flagged
reserved 4, 5; // block future reuse of tags you've freed elsewhere
}
reserved when you eventually DO remove the field. This prevents a future developer from accidentally reusing tag 3 with a different type, which would silently corrupt old data still in S3.
Schema Registries & Tooling
Imagine ten microservices all reading from the same Kafka topic, and twenty different teams across the company each running their own producer. Every team's version of "the User schema" lives in their own git repo. Who's the authority? What happens when Team A changes the schema in a way that breaks Team B's consumer? Without a central service to mediate, the answer is: you find out in production, at 2am.
A Schema Registry is a central service that stores every schema ever used, versions them, and enforces evolution compatibility rules at the moment a producer tries to write. It's the difference between "we hope everyone evolves schemas safely" and "the system enforces safety programmatically."
The Three Main Schema Registries
Confluent Schema Registry
The dominant registry in the Kafka world β so widely adopted that "Schema Registry" often implicitly means Confluent's. Supports Avro, Protobuf, and JSON Schema. Runs as a standalone HTTP service with a simple REST API (POST /subjects/user-value/versions). Producers and consumers use the Confluent Kafka Serializer libraries, which automatically register schemas on first send and cache schema lookups to avoid per-message registry calls. Available as managed cloud service in Confluent Cloud, or self-hosted alongside your Kafka cluster.
AWS Glue Schema Registry
The AWS-native equivalent, fully managed and integrated with Amazon Kinesis Data Streams and Amazon MSK (managed Kafka). Uses the same semantic model as Confluent (schema registration, compatibility modes, schema IDs) with AWS IAM for access control instead of Confluent's RBAC. If your data pipeline is already AWS-native (Kinesis β Lambda β S3 via Glue), Glue Schema Registry fits with zero additional infrastructure to run. The main trade-off: it's tightly coupled to AWS β harder to use in hybrid or multi-cloud setups.
buf.build / Buf Schema Registry
Protobuf-focused, sometimes described as "git for Protobuf schemas." The Buf Schema Registry hosts versioned .proto files and the buf CLI enforces evolution rules in CI β running buf breaking against a base ref will detect any field removal, type change, or tag reuse that would break wire compatibility. This brings schema governance into the PR review process rather than waiting for a runtime failure. If your company uses Protobuf heavily (gRPC services, internal APIs), adding buf breaking --against .git#branch=main to CI is a low-effort, high-value safety net.
Tooling Ecosystem
buf CLI
buf lint enforces Protobuf style rules (field names, file structure). buf breaking detects breaking changes against a baseline (a git branch, a tag, or a registered version). buf generate replaces protoc with a simpler buf.gen.yaml config β no more maintaining protoc plugin chains. The standard tool for any serious Protobuf shop in 2024+.
protoc
The original Protobuf compiler from Google. Still the underlying engine for code generation, but its plugin ecosystem is complex to manage (each language needs a separate plugin binary on PATH). buf generate wraps it with a cleaner interface. Use protoc directly only if you have a good reason; for everything else, use buf.
avro-tools CLI
The official Apache Avro toolkit. Lets you convert between Avro binary and JSON representation (avro-tools tojson data.avro), inspect schema fingerprints (useful for schema registry lookups), and validate schemas. Invaluable for debugging Avro data in Kafka topics where you can't just read the bytes with a text editor.
JSON Schema validators (ajv, jsonschema)
ajv (Node.js) and jsonschema (Python) validate JSON documents against a JSON Schema definition at runtime β typically at API boundaries or during integration tests. They don't do the binary encoding/evolution management that Avro/Protobuf registries handle, but they're the right tool for ensuring incoming JSON payloads match your expected shape before they reach your business logic.
.proto or avsc file commit is a schema version. The only difference is that git doesn't enforce compatibility rules β adding buf breaking or a custom CI check is the upgrade from "informal registry" to "enforced governance." It's a 30-minute setup with a week-one ROI.
# .github/workflows/proto-check.yml
name: Protobuf breaking change detection
on: [pull_request]
jobs:
buf-breaking:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: bufbuild/buf-action@v1
with:
# Detect breaking changes against the main branch
breaking_against: "https://github.com/myorg/myrepo.git#branch=main"
# Fails CI if any field is removed, type changed, or tag reused
# Output: "Field 3 "score" on message "User" was deleted."
# Check if a new Avro schema is BACKWARD compatible before registering
# Uses Confluent Schema Registry REST API
NEW_SCHEMA=$(cat user_v2.avsc | python3 -c "import sys,json; print(json.dumps({'schema': sys.stdin.read()}))")
# Ask the registry: is this schema compatible with the current version?
curl -X POST \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
--data "$NEW_SCHEMA" \
http://schema-registry:8081/compatibility/subjects/user-value/versions/latest
# Response if compatible:
# {"is_compatible": true}
# Response if NOT compatible (e.g., you removed a field):
# {"is_compatible": false}
# β Do NOT push this schema. Find the breaking change first.
Compression Pairings
Serialization and compression are two different tools that solve different parts of the same problem, and they work together beautifully. Serialization shrinks data through efficient encoding β binary field tags instead of text names, varints instead of string-encoded numbers, shared schemas instead of embedded field names. Compression shrinks data through redundancy elimination β finding repeated byte sequences and replacing them with shorter references (LZ77 and its descendants) or statistical models (DEFLATE combines LZ77 + Huffman coding).
The two techniques compose: Protobuf already removed field names from the wire, making the bytes denser. Then zstd looks at those dense bytes, finds further patterns, and compresses again. You almost always get meaningful additional savings even after an already-compact binary format. Always apply both.
Three Compression Algorithms to Know
gzip / DEFLATE β the universal baseline
DEFLATE (the algorithm inside gzip, zip, and zlib) combines LZ77 β which replaces repeated byte sequences with back-references β with Huffman coding β which uses shorter bit strings for more common bytes. It's been the web standard since 1996 (Content-Encoding: gzip) and is implemented in every HTTP client, server, and CDN on earth. Typical reduction: 60β80% on JSON/XML. Fast decompression (~400 MB/s); moderate compression speed. If you need one algorithm that works everywhere without any dependencies: gzip.
brotli (br) β better than gzip on text
Google released brotli in 2015 for web compression. It uses a similar LZ77+Huffman core but adds a pre-built static dictionary of ~122KB of common web strings (HTML tags, HTTP headers, common JS patterns). That dictionary gives it a head start on text payloads β 10β15% smaller than gzip at the cost of 2β5x slower compression. Decompression speed is comparable to gzip. Default in Cloudflare Workers, Nginx 1.11+, modern browsers. Use it for static assets and responses that are compressed once and served many times. Not ideal for streaming or real-time compression (slow encoder).
zstd β the modern all-around winner
Facebook released zstd (Zstandard) in 2016 with a key insight: offer a wide compression level range (1β22) so you can tune the speed/ratio trade-off without changing format. At level 1: faster than gzip AND better compression ratio. At level 22: best-in-class ratio, comparable to brotli on general data. At level 3 (default): the sweet spot for most server workloads. Adopted by Linux kernel 5.16 (2022), MongoDB 4.2, Apache Iceberg, Apache Parquet writers, Meta internally, and Slack. If you're designing a new internal protocol from scratch, default to zstd level 3.
Pairing Matrix β What Actually Gets Combined in Production
JSON + gzip
The universal default for REST APIs. HTTP/1.1 clients negotiate Accept-Encoding: gzip automatically; nginx and AWS CloudFront compress responses transparently. You get ~70% bandwidth savings with zero application code changes. This is "good enough" for 95% of API traffic. Most REST frameworks (Express, Django, Spring MVC) enable this with a one-line config.
Protobuf + zstd
The go-to pairing for high-throughput internal services where every byte matters. Protobuf removes field-name overhead first; zstd then finds structural redundancy in the binary data. Slack engineering reported a 75% bandwidth reduction on internal RPC after adopting this combination. At large scale (millions of RPC calls/sec), bandwidth cost is a real budget line item β this pairing pays for its schema management complexity.
Avro + Snappy
The default in Apache Kafka and the Hadoop ecosystem. Snappy (Google, 2011) doesn't achieve the best compression ratio β it targets fast CPU over maximum compression. The reasoning: in a Kafka pipeline where messages are produced and consumed millions of times per second, compression latency matters more than ratio. Snappy compresses and decompresses at ~500 MB/s with essentially no tuning knobs. The trade-off: roughly half the compression ratio of zstd at similar speeds.
Parquet + zstd or brotli
Modern data lake standard. Parquet applies compression per-column-chunk (not per-file), so each column can have its own compression algorithm tuned for its data type β string columns get dictionary encoding + zstd; timestamp columns get delta encoding + zstd. Apache Spark defaults shifted from Snappy to zstd in Spark 3.0. Reading a single column from a 1TB Parquet table on S3 with zstd can touch as little as 5GB of bytes instead of 1TB β two orders of magnitude less I/O.
Diagnostic & Conversion Tools β See the Actual Bytes
- Six command-line tools that let you inspect, query, and convert any serialization format
- A repeatable debug workflow for identifying an unknown binary blob
- How to filter JSON on the command line as if it were SQL, and how to decode raw Protobuf bytes without a schema
You're staring at a binary blob. Is it Protobuf? Avro? A corrupt file? These six tools let you answer that question in minutes β no guessing, no "let me check the docs," just raw byte inspection that tells you exactly what you're looking at.
When something goes wrong in a serialization pipeline β a Kafka consumer getting garbage, a gRPC call returning zeros, a DB record that won't deserialize β the fastest path to diagnosis is to look at the actual bytes, not the code that wrote them. Think of these tools as forensic instruments: each one strips away one layer of abstraction until you can see the raw data underneath.
jq β the CLI SQL for JSON
Think of jq as a command-line query language for JSON, the way SQL is a query language for tables. You pipe JSON in, write a filter expression, and get transformed JSON out. jq '.users[] | select(.active)' reads: "for every object in the users array, keep only those where the active field is truthy." It runs entirely in memory with no index, so it's fast for files under a few hundred MB β beyond that, reach for a proper database.
Why it matters: JSON APIs return nested structures that are painful to inspect with plain text tools. jq lets you drill into any depth, reshape the output, and pipe the result into another tool β all without writing a single Python script.
xmllint β XML validation and XPath
xmllint ships with libxml2 (pre-installed on most Linux distros). Three modes you'll use regularly: --format pretty-prints a minified XML document so you can read it; --schema validates against an XSD schema and prints every error; --xpath runs an XPath expression and returns the matching nodes. XPath is like a CSS selector for XML β //Order[Status='PENDING']/@id extracts every pending order ID in seconds.
Why it matters: XML is verbose and deeply nested. A payment or SOAP API error buried 8 levels deep is invisible in a raw terminal dump; xmllint --xpath finds it in one command.
protoc --decode_raw β Protobuf without a schema
The normal Protobuf decode path requires a .proto file to map field numbers to names. --decode_raw skips that β it parses the binary wire format purely by structure, printing field numbers and raw values. You'll see output like 1: 42 (field 1, varint 42) and 2: "Alice" (field 2, length-delimited string). Without a schema you don't know the field names, but you can confirm the blob is valid Protobuf and see what values were written.
Why it matters: in a production incident where a .proto file was changed without warning, this is the fastest way to see exactly what bytes were put on the wire β no recompilation needed.
avro-tools tojson / fromjson β Avro β JSON
Avro binary files (.avro) are unreadable in a terminal. avro-tools tojson input.avro converts the entire file to newline-delimited JSON, one JSON object per Avro record. avro-tools fromjson --schema schema.avsc input.json does the reverse β useful for crafting test fixtures. The tool also prints the embedded schema header, so you can confirm which schema version wrote the file.
Why it matters: Avro carries its schema inside the file header. avro-tools extracts that schema so you can compare it against the current registry version β essential when debugging schema-mismatch deserialization failures.
bsondump β MongoDB BSON to human-readable JSON
bsondump ships with MongoDB tools. It reads a raw BSON file (the binary format MongoDB uses on disk and in replication oplog) and prints it as JSON-like text with type annotations β {"_id": {"$oid": "64a2..."}}. You'll use this when inspecting mongodump backups, replication oplog files, or wire captures from a MongoDB proxy.
Why it matters: BSON extends JSON with types like ObjectId, Date, Decimal128, and Binary β types that have no JSON equivalent. Without bsondump, you'd need to write code to inspect them. With it, it's one command.
hexdump -C β the last resort
hexdump -C file.bin prints every byte as a two-character hex value alongside the ASCII representation in the right column. When no format-specific tool recognises your blob, you fall back to hexdump and read the format spec manually β match the magic bytes at offset 0 (Parquet starts with PAR1, PNG with 89 50 4E 47, Gzip with 1f 8b), count the header fields, and interpret byte-by-byte.
Why it matters: every binary format has a magic number at the start. Matching the first 4β8 bytes against known magic numbers tells you the format in under 30 seconds, even if every higher-level tool fails.
# Filter: keep only active users and select just their email field
curl -s https://api.example.com/users \
| jq '[.users[] | select(.active == true) | {id, email}]'
# Transform: rename keys and add a computed field
jq '[.orders[] | {
order_id: .id,
total_cents: (.items | map(.price_cents) | add),
item_count: (.items | length)
}]' orders.json
# Find the first element where a nested field matches
jq '.events[] | select(.type == "CHECKOUT") | .session_id' events.json \
| head -1
# Flatten nested array then count unique values
jq '[.records[].tags[]] | unique | length' data.json
# Decode a Protobuf blob without a .proto file β see field numbers and raw values
cat message.bin | protoc --decode_raw
# Output example (User message with id=42, name="Alice", role=ADMIN):
# 1: 42 β field number 1, varint value 42
# 2: "Alice" β field number 2, length-delimited (string or bytes)
# 3: 2 β field number 3, varint (enum value 2 = ADMIN)
# Decode WITH a schema if you have the .proto:
cat message.bin | protoc --decode=mypackage.User --proto_path=./proto user.proto
# Decode a Kafka message body saved to file:
kafka-console-consumer --topic my-topic --max-messages 1 \
--property print.value=true \
| base64 -d > message.bin
cat message.bin | protoc --decode_raw
# Validate XML against an XSD schema
xmllint --schema order.xsd order.xml --noout
# β order.xml validates
# β order.xml:14: element Item: Schemas validity error: value '...'
# Pretty-print a minified XML document
xmllint --format minified.xml > readable.xml
# XPath query: get all pending order IDs
xmllint --xpath "//Order[Status='PENDING']/@id" orders.xml
# β id="ORD-1001" id="ORD-1042" id="ORD-1099"
# Count: how many line items have quantity > 5?
xmllint --xpath "count(//LineItem[Quantity > 5])" report.xml
# β 17
# Namespace-aware XPath (SOAP response)
xmllint --xpath "//soap:Body/ns:GetOrderResponse/ns:Total/text()" \
--xpath-namespace soap=http://schemas.xmlsoap.org/soap/envelope/ \
--xpath-namespace ns=http://orders.example.com/v1 \
response.xml
Common Misconceptions β Things Engineers Get Wrong
- Why "Protobuf is always smaller" has important caveats for tiny messages
- Why binary formats are not always faster than text, and when library quality matters more than format
- The difference between Avro and Protobuf by design intent β not just implementation
Every format has a fan club that oversells its virtues. The truth is always more nuanced: the right choice depends on message size, team size, tooling maturity, and where the bottleneck actually is. Let's kill six persistent myths.
"Protobuf is always smaller than JSON."
Almost always true β but not unconditionally. Protobuf encodes field tags as varint-encoded field numbers. For a message with just 1β2 fields and short string values, the tag overhead can bring Protobuf within 10β20% of JSON's size. Example: {"ok":1} is 8 bytes as JSON; as Protobuf it's also 4 bytes for field tag + varint β a 2Γ win, but far less dramatic than the 5Γ difference you see on larger, field-heavy messages.
The second wrinkle: once you apply gzip or zstd compression, text formats compress very well because they're repetitive (field names repeat across thousands of records). The gap between compressed JSON and compressed Protobuf is often under 20%. If you're already paying for TLS + gzip on your HTTP responses β which most APIs do β the wire-size argument for Protobuf is weaker than people claim. Protobuf's real wins are CPU cost (binary parse is faster) and schema enforcement (not size alone).
"Binary formats are always faster to parse than text."
Not on every workload. The fastest JSON parsers β simdjson (used in Node.js's V8), RapidJSON, and Go's encoding/json optimised builds β use SIMD instructions to scan 32β64 bytes per CPU clock cycle. simdjson benchmarks in the multi-GB/sec range on modern hardware (parsing reaches a few GB/s; UTF-8 validation and minify are even faster). A naive Protobuf parser that allocates a new struct per field will often be slower than simdjson on large JSON payloads.
The correct mental model: format choice sets the ceiling for how fast parsing can theoretically go. Library quality determines how close you get to that ceiling. A well-written JSON parser in a high-performance language can beat a poorly-written Protobuf parser. For most services processing under 50k requests per second, serialization CPU is not the bottleneck β database I/O and network latency dominate. Benchmark your actual workload before making format choices on performance grounds alone.
"JSON has no schema."
JSON's grammar (the syntax rules for what makes valid JSON) has no schema built in β any combination of objects, arrays, strings, and numbers is syntactically valid JSON. But JSON Schema (a separate specification, published as an IETF draft) lets you define exactly which fields are required, what their types must be, and what validation constraints apply. Tools like ajv (Node.js), jsonschema (Python), and networknt/json-schema-validator (Java) validate JSON documents against a schema at parse time or as middleware.
The distinction matters: calling an API "schema-less" because it uses JSON is usually wrong. Most production APIs define their shape in OpenAPI (which uses JSON Schema internally). "Schema-less" is a runtime property of how strictly you enforce the contract β not a property of the format itself.
"Avro is just Protobuf with extra steps."
They look similar on the surface β both binary, both schema-based β but they were designed for different problems. Protobuf assumes that both sender and receiver share the same schema out-of-band (via generated code from the same .proto file). There is no schema embedded in the message. This makes Protobuf compact and fast for service-to-service calls, but it means schema management is an out-of-band problem you solve with code generation pipelines.
Avro was designed for data lakes and Hadoop-style batch processing, where a file might be read years after it was written by a system that no longer exists. Every Avro container file embeds the writer's schema in its header. When Avro deserializes a record, it uses both the writer's schema (from the file) and the reader's schema (from your code) to perform schema resolution β automatically mapping old fields to new names, filling in defaults for added fields, and ignoring removed fields. This writer+reader schema resolution is something Protobuf does not do natively. The "extra steps" in Avro (Schema Registry, schema IDs in Kafka messages) exist precisely because the format was designed to outlive the code that wrote it.
"I should use FlatBuffers for my REST API."
FlatBuffers is a format designed for a specific use case: data that is read many times but written rarely, where the consumer needs zero-copy random access to fields without deserializing the entire message. Classic examples: game asset bundles (load a level file, read only the objects in view), ML model weights (random access to individual tensors), and embedded device firmware manifests.
For a request/response REST API β where every message is small, written once, and read once before being discarded β FlatBuffers' zero-copy property is irrelevant. What you lose is significant: FlatBuffers has a much steeper learning curve than JSON or Protobuf, browser tooling is immature, most API clients won't speak it natively, and debugging a FlatBuffer from a production log is genuinely painful. JSON wins on developer experience. Protobuf wins on performance + schema safety. FlatBuffers wins only on the specific "read many random fields from a large static blob" workload.
"We should benchmark all formats and pick the fastest."
This is the classic premature optimization trap applied to serialization. Most of the time, serialization CPU is not the bottleneck β a typical service spends <5% of its latency budget on serialization. The bottleneck is almost always database I/O, upstream service calls, or network round trips. Running a multi-week format migration to save 3ms of serialization time on a service with 120ms of database latency is negative-ROI engineering.
The right heuristic: pick the format that minimizes integration friction first (JSON for public APIs, Protobuf for internal services, Avro for streaming pipelines). Ship. Measure. If and only if a profiler shows serialization is actually the bottleneck, optimize. The "benchmark everything first" instinct wastes weeks that should be spent on the actual slow parts of your system.
Real-World Disasters β When Serialization Goes Wrong
- Why JSON numbers above 2^53 silently corrupt IDs in JavaScript clients
- How a single Protobuf field tag reuse caused a data-corruption incident
- Why the Confluent Schema Registry is critical infrastructure, not a nice-to-have
- How the XML Billion Laughs attack can crash a parser in milliseconds
The best way to learn serialization rules is to see what happens when they're broken in production. These five incidents span major companies and common patterns β every one of them is a rule you can apply today.
A major financial API serialized 64-bit user account IDs as JSON number literals. JavaScript clients β mobile apps and browser-based dashboards β parsed those numbers as IEEE-754 doubles. Any ID above 2^53 (9,007,199,254,740,991) was silently rounded to the nearest representable float. Two different user IDs mapped to the same JavaScript number; records from one user were displayed in another user's account. The incident was in production for several weeks before the data mismatch triggered alerts.
Lesson: Any integer that could exceed 2^53 β user IDs, order IDs, timestamps in microseconds, 64-bit counters β must be serialized as a JSON string, not a JSON number, when any JavaScript client will consume it. Twitter solved this in 2010 (when it rolled out 64-bit Snowflake IDs) by returning every tweet ID twice:
"id": 12345 (for non-JS clients) and "id_str": "12345" (for JS clients). Always ship both or drop the number entirely.
A backend team deprecated field tag 5 in their
.proto definition (it had been the user_role enum) and reused the same tag number for a new payment_tier integer field in the next release. Old clients β still running on 20% of servers during the phased rollout β continued sending bytes encoded as user_role for field tag 5. New servers read those same bytes as payment_tier. The encoding is identical: a varint with tag number 5. Every request from an old client set a payment tier derived from a user role enum value β nonsensical data that was silently written to the database.Lesson: NEVER reuse a Protobuf field tag number. When you remove a field, immediately add
reserved 5; and reserved "user_role"; to the message definition. The compiler will then reject any future attempt to reuse that tag or name, turning a silent runtime corruption into a build-time error.
The Confluent Schema Registry hit a database deadlock under high write load. Kafka producers using Avro serialization needed to register new schema versions before publishing messages. With the registry unavailable, producers could only serialize messages using schemas that were already cached locally. Services that had recently deployed (and thus had empty local caches) failed to produce any messages at all. The queue backed up, and downstream consumers fell hours behind.
Lesson: The Schema Registry is not a convenience feature β it's critical infrastructure that sits in the hot path of every Kafka producer. Treat it with the same operational maturity as your primary database: monitor SR health (latency, error rate, replica lag), cache schemas aggressively on the client side (Confluent clients cache by default, but check the TTL), and have a runbook for SR unavailability that lets producers fall back to a cached schema rather than failing hard.
An XML parser was deployed with entity resolution enabled (the default in many parsers). An attacker sent a request with a crafted XML document that defined a chain of recursive entity references:
&lol; expands to &lol2;&lol2;&lol2;... ten times, and each &lol2; expands similarly. The final expansion produces a document with 10^9 repetitions of the string "lol" β roughly 3 GB of text from an 800-byte input. The parser allocated memory until the server OOM-killed the process. Response time went to zero in about 200 milliseconds.Lesson: Disable XML external entities (XXE) by default in any XML parser that accepts user-supplied input. In Java:
factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true). In Python: use defusedxml instead of the standard library. Set explicit limits on entity expansion depth, document size, and parse time. These are not optional hardening steps β they are security-critical defaults.
Multiple widely-used JSON libraries published CVEs in the same year from a cluster of edge cases: a deeply-nested JSON object (512+ levels) caused stack overflow via recursive descent; a JSON number with 10,000 decimal digits caused a DoS via BigDecimal construction; malformed unicode surrogate pairs caused buffer over-reads in C libraries. Affected frameworks included several versions of Spring, Jackson, and a popular Go JSON library. All were exploitable from unauthenticated HTTP endpoints.
Lesson: Use battle-tested JSON libraries (simdjson, RapidJSON, Jackson with security defaults) and keep them updated. Enforce input limits before parsing: reject payloads above a maximum size (e.g., 1 MB), set a maximum nesting depth (e.g., 32 levels), and parse with a timeout. These limits are cheap to add and prevent the entire class of complexity-based DoS attacks.
Performance & Best Practices Recap
- Eight concrete rules that apply to every serialization decision
- Which format wins in each context β public API, internal service, streaming pipeline, analytics table
- Why compression + serialization is nearly always a free win
After walking through six formats, their trade-offs, schema evolution strategies, and five production incidents, this section distills the practical takeaways into eight actionable rules. Print this section. Put it in your team's engineering handbook. Follow it before every API design decision.
Rule 1 β Default to JSON for public APIs
JSON is readable in a browser, supported by every HTTP client library in every language, and testable with curl. These ergonomics translate to faster SDK adoption, fewer integration support tickets, and easier debugging for both you and your users. The performance cost is real but almost never the bottleneck at the scale of a public API. Default to JSON; reach for alternatives only when you have measured evidence that JSON is the problem.
Rule 2 β Use Protobuf or gRPC for internal microservices
Inside your own infrastructure, you control both the sender and receiver β which means you can require a shared schema. That's exactly Protobuf's sweet spot. Protobuf serialization is ~3-10Γ faster than JSON, the schema enforces a contract at compile time (not runtime), and gRPC adds bidirectional streaming, deadlines, and first-class load balancing on top. The learning curve (toolchain, .proto files, code generation) is worth it once you're running more than a few dozen services.
Rule 3 β Use Avro for streaming data lakes
Kafka + Avro + Confluent Schema Registry is the canonical data-streaming stack for a reason. Avro's embedded-schema design means a Spark job reading a Kafka topic can always decode old messages, even after the schema has evolved. Parquet files written from Avro via Iceberg or Delta preserve the column structure for efficient analytics queries. The Schema Registry provides the governance layer β every schema version is immutable, every change is audited, and consumers can subscribe to schema-change events.
Rule 4 β Use Parquet for analytics tables
Parquet is a columnar format: all values for a single column sit adjacent on disk. A query like SELECT SUM(revenue) FROM orders WHERE year = 2024 reads only the revenue and year columns β skipping the other 40 columns entirely. That column skipping, combined with per-column compression (dictionary encoding for low-cardinality columns, RLE for sorted columns), gives Parquet 5β20Γ better I/O performance than row-based formats (JSON, CSV, Avro) for analytical queries. Use Parquet anywhere you run GROUP BY, SUM, or aggregations over large datasets.
Rule 5 β Always pair serialization with compression
Compression is almost free in 2024. gzip ships in every HTTP library (enable with Content-Encoding: gzip and your web server handles the rest). brotli compresses ~15β25% better than gzip at the same speed for text. zstd compresses as well as gzip at 5β10Γ the speed β ideal for Kafka payloads and database WAL compression. On a typical JSON API response, compression cuts wire bytes by 60β80%. For binary formats (Protobuf, Avro), the gain is smaller (20β40%) but still worth enabling. The CPU cost of zstd at level 1 is under 1ms per MB β negligible against network latency.
Rule 6 β Reserve removed Protobuf field tags
This rule is non-negotiable. After the 2017 field-reuse incident described in Section 21, every team using Protobuf should have a strict policy: when you remove a field, immediately add both a numeric and named reservation. Example: reserved 5, 12; reserved "user_role", "legacy_status";. This prevents the compiler from ever assigning those numbers or names to a new field. The cost is two lines of code. The cost of skipping it is silent data corruption in production. Add this check to your code review checklist and your .proto linter.
Rule 7 β Set parser limits on all inputs
For every endpoint that accepts serialized input from the network, configure explicit limits before parsing begins: maximum body size (e.g., 1 MB for JSON API, 10 MB for file uploads), maximum nesting depth (32 levels is generous for any real-world payload), maximum array/map length, and a parse timeout. Most frameworks have these as config flags you're simply not setting. Jackson: StreamReadConstraints.maxDepth(32). nginx: client_max_body_size 1m. The alternative β parsing unlimited input β is the root cause of nearly every complexity-based DoS CVE in JSON and XML parsers.
Rule 8 β Benchmark with YOUR data
Generic benchmarks measure synthetic payloads β small structs, uniformly distributed values, no schema evolution overhead. Your production messages have different field distributions, different sizes, different hot paths. A format that wins in a generic benchmark may lose on your specific payload shape. Before any format migration, run a benchmark using a representative sample of real production messages β ideally 10,000+ messages from your busiest Kafka topic or API endpoint. Then measure the end-to-end latency including serialization + network + deserialization, not just the serialization step in isolation.
FAQ β Questions Engineers Ask Most
- When to use JSON vs Protobuf for a new API β the one-sentence decision rule
- The real difference between proto2 and proto3, and why it matters when reading old code
- Why YAML is fine for config but fragile for APIs
These are the questions that come up in every architecture review, code review, and late-night debugging session. Direct answers with the reasoning included β not just "it depends" but exactly what it depends on.
Should I use JSON or Protobuf for my new API?
One-sentence rule: JSON if it's a public API; Protobuf if it's an internal API between services you control on both sides.
The reasoning: public APIs need to be consumable by any client in any language without tooling setup β JSON is universal, readable, and supported everywhere. Internal APIs can require a build step (running protoc to generate client code), and the payoff is schema enforcement at compile time, smaller wire format, and faster parsing. The moment you can't guarantee that all consumers run the same .proto file, JSON's flexibility wins over Protobuf's constraints.
Grey area: if your "internal" API is consumed by 50 teams you don't fully control, treat it as public from a compatibility standpoint and use JSON + OpenAPI, even if it runs inside your firewall.
Can I send Protobuf over a public REST API?
Yes β technically nothing stops you. Set Content-Type: application/protobuf (or application/x-protobuf) on the response. Some clients support this natively: Stripe's API accepts both JSON and Protobuf via content negotiation. The browser JavaScript ecosystem is less mature for Protobuf than for JSON, but protobufjs and the official @bufbuild/protobuf package handle it.
The practical problem: debugging is much harder. When an integration breaks, your API consumer can't curl | jq . the response in a terminal. They need to know your exact schema version to decode anything. Support burden rises. For most public APIs the developer-experience cost outweighs the wire-size benefit unless your API clients are themselves services (not humans) and you ship official generated SDKs that hide the Protobuf layer entirely.
What's the actual difference between proto2 and proto3?
proto2 (released 2008) gives you explicit optional and required field labels. required means the field must always be present; a message without it is invalid. In practice, required turned out to be a compatibility nightmare: once you mark a field required, you can never remove it, and every client and server must always populate it β forever. Google eventually banned required in new schemas because it made schema evolution nearly impossible.
proto3 (released 2016) removed required entirely and made all scalar fields implicitly optional. The tradeoff: you lose the ability to distinguish "field not set" from "field set to its default value" (0, empty string, false) at the wire level. For most APIs this distinction doesn't matter, but for use cases where zero is a meaningful value and "absent" is semantically different from zero, you need to wrap the field in a google.protobuf.Int32Value wrapper or use oneof tricks. Most new code uses proto3; you'll encounter proto2 when reading old Google or legacy codebases.
Should I use YAML for my API payloads?
No. YAML is the right format for human-edited configuration files β Kubernetes manifests, CI/CD pipelines, Ansible playbooks β because humans find it more readable than JSON (no brackets, no quotes on simple strings). But YAML has properties that make it fragile for machine-to-machine API payloads:
Indentation is semantic. A two-space vs four-space error, a tab vs space, or a trailing space produces either a parse error or silently different structure. In a config file a human edits once a week, this is tolerable. In a payload generated by a service thousands of times per second, it's a latent bug.
Type coercion surprises. YAML's implicit typing silently converts yes, no, on, off, true, false to booleans; turns 1.0 into a float; and converts octal-looking numbers like 010 to 8 in YAML 1.1 parsers. These conversions differ between YAML 1.1 and 1.2 specs, and between parser implementations. JSON's type system is simpler and unambiguous.
Use YAML for config. Use JSON, Protobuf, or Avro for wire data.
What about gRPC-Web for browser clients?
Standard gRPC runs over HTTP/2 and uses HTTP/2 trailers to signal the final status code and metadata. Browsers expose the fetch API, which does not support reading HTTP/2 trailers. So gRPC literally cannot work from a browser without a translation layer.
gRPC-Web solves this by encoding trailers in the response body (framed as a special message type) so fetch can read them. You deploy an Envoy proxy (or the gRPC-Web Go library) that translates between gRPC-Web requests from the browser and standard gRPC requests to your backend. This works, and it's used in production (e.g., some Google internal tools), but it adds proxy infrastructure, increases deployment complexity, and means browser debugging is still harder than JSON+REST. The honest assessment: for most browser-facing APIs, JSON+REST is simpler and fast enough. Use gRPC-Web only when you have a strong reason β you already run Envoy everywhere, or the generated TypeScript client types from Protobuf are genuinely valuable for your frontend team.
How do I version my JSON API without breaking clients?
Two practical approaches that are widely used: URL path versioning (/v1/orders, /v2/orders) and Accept-header versioning (Accept: application/vnd.myapi.v2+json). URL versioning is simpler to implement and debug (you can see the version in logs and browser dev tools). Header versioning is more "RESTful" but harder to test manually.
The technique matters less than the practice: never remove a field from an existing version, never change the meaning of an existing field, and maintain old versions for at least 12 months with a clearly communicated sunset date. Add new fields freely (consumers who don't know about a field will simply ignore it β that's forward compatibility). The most common mistake is removing a field from v1 the day v2 ships; clients that haven't migrated immediately break.
Why does JSON have null but JavaScript also has undefined?
JSON's null is defined by the JSON specification (RFC 8259). It's a literal value that means "this field is explicitly absent or empty." JSON has exactly four value types: strings, numbers, booleans (true/false), and null. That's it β no undefined, no NaN, no Infinity.
JavaScript's undefined is a runtime language concept meaning "this variable or property has never been assigned a value." It's not part of JSON. When you call JSON.stringify({a: 1, b: undefined}), JavaScript silently drops the b key from the output β it does not serialize it as "b": undefined or "b": null. This is a frequent source of bugs: you add a field to a JS object, forget to assign it, and it silently disappears from the serialized output. Use a linter rule or explicit null assignments to guard against this.
What is MessagePack actually used for in production?
MessagePack is the format you reach for when you want JSON's type model (objects, arrays, strings, numbers, booleans, null) but smaller wire bytes, without the learning curve of a schema language. It maps directly to JSON types so existing JSON-based code can usually be migrated by swapping the serializer with no structural changes.
Production uses: Redis uses MessagePack (or custom binary) for storing complex payloads as cache values β you get JSON semantics with 20β40% smaller bytes and faster parse time at cache hit rate. Pinterest's mobile API shipped MessagePack for bandwidth-constrained mobile clients on 3G in emerging markets. Fluentd (the log aggregator) uses MessagePack natively for its internal log event format. The common thread: situations where you want JSON flexibility but you've measured that JSON byte size or parse time is a real cost β and you don't want to maintain .proto files or a Schema Registry.