Serialization formats
Compare JSON, Protobuf, Avro, and MessagePack on encoding size, schema evolution, and parse speed, so you can pick the right format for each layer of your system.
The problem
A profiler runs on a microservice handling 50 MB/sec of data. The report shows 40% of CPU time inside JSON.parse() and JSON.stringify(). The service serializes a simple event object on every message. The JSON is human-readable and easy to debug. It is also destroying the latency profile.
Engineers on the team switch to Avro. Payload sizes drop by 70%. CPU usage falls immediately. The system performs well for months. Then the Avro schema registry goes down during a deploy. New consumer instances spin up, attempt to fetch the writer schema, and fail. The Kafka consumers that were already running continue to work. Any new pod that starts cannot deserialize a single message until the registry recovers.
These are the two failure modes that serialization format choice forces you to navigate: JSON is too slow at scale, and schema-dependent binary formats introduce operational dependencies that can take down consumers silently.
What serialization is
Serialization converts an in-memory object into a flat sequence of bytes suitable for network transmission or disk storage. Deserialization reverses it: the bytes come in and an object comes out. Every time your service sends a request, writes to a queue, or persists a record, serialization happens.
Analogy: Think of a moving company packing the contents of an apartment. One approach is to write the name of every item on the box. Another is to number each item in a catalogue and only write the number on the box. The catalogue approach produces smaller boxes and faster packing, but the recipient needs the catalogue to unpack. That catalogue is the schema.
How it works
The same object encoded in three formats shows the tradeoff immediately:
Input object: { userId: 12345, name: "alice", active: true }
JSON (text, 45 bytes):
{"userId":12345,"name":"alice","active":true}
Field names are spelled out in full as strings in every message.
MessagePack (binary, 30 bytes):
83 a6 75 73 65 72 49 64 cd 30 39 a4 6e 61 6d 65 a5 61 6c 69 63 65 a6 61 63 74 69 76 65 c3
Field names are still included but binary-encoded. No schema needed.
Protobuf (binary, 12 bytes):
08 b9 60 12 05 61 6c 69 63 65 18 01
Field names are gone. Field 1 (varint 12345), field 2 (string "alice"), field 3 (bool true).
A .proto schema file maps field numbers back to names.
Protobuf encodes the integer 12345 as a variable-length integer (b9 60 in little-endian base-128). The boolean true is 01. The entire key-value structure is encoded as field-number tags followed by values, with zero bytes spent on field names.
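The encoding above can be reproduced by hand. Here is a minimal sketch (not a real Protobuf library, just the tag and varint rules from the wire format) that emits the exact bytes shown:

```javascript
// Hand-rolled Protobuf encoding of { userId: 12345, name: "alice", active: true }.
// Illustrative only; production code uses generated classes from the .proto file.

// Encode an unsigned integer as a base-128 varint (7 bits per byte, LSB first).
function encodeVarint(n) {
  const out = [];
  while (n > 0x7f) {
    out.push((n & 0x7f) | 0x80); // set the continuation bit
    n >>>= 7;
  }
  out.push(n);
  return out;
}

// A field tag is (fieldNumber << 3) | wireType.
// Wire type 0 = varint, wire type 2 = length-delimited.
function tag(fieldNumber, wireType) {
  return encodeVarint((fieldNumber << 3) | wireType);
}

const bytes = [
  ...tag(1, 0), ...encodeVarint(12345),        // field 1: user_id
  ...tag(2, 2), ...encodeVarint(5),            // field 2: name, length prefix...
  ...[..."alice"].map((c) => c.charCodeAt(0)), //   ...then the UTF-8 bytes
  ...tag(3, 0), 0x01,                          // field 3: active = true
];

console.log(bytes.map((b) => b.toString(16).padStart(2, "0")).join(" "));
// → 08 b9 60 12 05 61 6c 69 63 65 18 01
```

Twelve bytes, and not one of them spells out a field name.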
The format you choose determines where the schema information lives: encoded inside every message (JSON, MessagePack), in a shared .proto file compiled into the binary (Protobuf), or in an external registry looked up at runtime (Avro).
The wire-format size difference is not abstract. Encoding the same object in each format produces measurably different byte counts, and that difference multiplies across every message the system sends.
Protobuf's 3x size advantage over JSON comes almost entirely from replacing string field names with 1-2 byte integers. For a message with 5 fields averaging 10 characters each, that is 50 bytes of field-name overhead eliminated from every single message.
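That overhead is easy to measure directly. A quick sketch that counts the bytes JSON spends on keys plus their quote-and-colon syntax, using the example object from above:

```javascript
// Rough sketch: how many bytes of a JSON message are field names alone.
// Each key costs its own length plus 3 bytes of syntax: two quotes and a colon.
function fieldNameOverhead(obj) {
  return Object.keys(obj).reduce((sum, key) => sum + key.length + 3, 0);
}

const event = { userId: 12345, name: "alice", active: true };
const total = JSON.stringify(event).length; // 45 bytes on the wire
const overhead = fieldNameOverhead(event);  // 25 of them are keys + syntax

console.log(`${overhead} of ${total} bytes are field-name overhead`);
// → 25 of 45 bytes are field-name overhead
```

More than half the payload is key names; Protobuf replaces all of it with three one-byte tags.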
Schema evolution
Schema evolution is how the format handles the fact that your data model will change. Adding fields. Removing fields. Renaming fields. These happen in every long-lived system.
JSON: No schema is required. By convention, receivers ignore fields they do not recognize. You can add new fields today and old clients will silently skip them. You can remove fields and hope all consumers have been updated. There is no enforcement, no versioning, and no protection against an engineer who renames a field and breaks every existing consumer without knowing it.
Protobuf: Field numbers are permanent. When you define a field, you assign it a number. That number is what gets encoded in the binary. You can add new fields with new numbers, and old clients will skip any field number they do not recognize. You can remove fields, but you must then reserve their numbers so no future field reuses the slot. Reusing a number for a different type causes silent data corruption on old readers.
```protobuf
message UserEvent {
  int32 user_id = 1;
  string name = 2;
  bool active = 3;

  // Field 4 was "email", now removed
  reserved 4;
  reserved "email";

  // New fields are added with a fresh number
  string region = 5;
}
```
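The skip-unknown-fields behavior is what makes adding fields safe, and it can be seen in a toy decoder. The sketch below is hand-rolled for illustration (real code uses generated Protobuf classes); it decodes a message written with the new 5-field schema using a reader that only knows the original three fields:

```javascript
// Toy illustration of forward compatibility: a reader built against the old
// schema consumes the unknown "region" field (number 5) by wire type and drops it.
function decode(bytes, knownFields) {
  const out = {};
  let i = 0;
  const readVarint = () => {
    let shift = 0, value = 0;
    for (;;) {
      const b = bytes[i++];
      value |= (b & 0x7f) << shift;
      if ((b & 0x80) === 0) return value;
      shift += 7;
    }
  };
  while (i < bytes.length) {
    const tagValue = readVarint();
    const field = tagValue >>> 3;
    const wireType = tagValue & 7;
    let value;
    if (wireType === 0) value = readVarint();  // varint
    else if (wireType === 2) {                 // length-delimited
      const len = readVarint();
      value = String.fromCharCode(...bytes.slice(i, i + len));
      i += len;
    }
    if (field in knownFields) out[knownFields[field]] = value;
    // Unknown field numbers are consumed and dropped -- never an error.
  }
  return out;
}

// Message from a NEW producer: user_id=12345, name="alice", active=true,
// region="eu" (field 5, unknown to the old reader).
const wire = [
  0x08, 0xb9, 0x60,                          // field 1, varint 12345
  0x12, 0x05, 0x61, 0x6c, 0x69, 0x63, 0x65, // field 2, "alice"
  0x18, 0x01,                                // field 3, true
  0x2a, 0x02, 0x65, 0x75,                    // field 5, "eu" -- skipped
];
const oldReader = { 1: "user_id", 2: "name", 3: "active" };
console.log(decode(wire, oldReader));
// → { user_id: 12345, name: 'alice', active: 1 }
```

The old reader never errors on the new field; the wire type alone tells it how many bytes to skip.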
Avro: The schema is required on both the write side and the read side. A schema registry stores every schema version. When a consumer deserializes a message, it fetches the writer's schema from the registry and compares it to its own reader schema. The registry performs compatibility checks: can the reader schema safely read data written by the writer schema? This is the most rigorous evolution model, and it makes the registry a hard dependency.
Avro fails asymmetrically when the schema registry goes down
Existing consumers that have already fetched and cached the writer schema keep running when the registry becomes unreachable. New consumer pods starting during the outage fail immediately: they cannot fetch the writer schema and cannot deserialize a single message. If you scale out during an incident (Kafka consumer lag is growing) and your schema registry is simultaneously degraded, every new pod you add silently fails to process messages. Monitor the schema registry as a tier-zero dependency, separate from Kafka broker health.
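The asymmetry falls directly out of caching behavior. A hypothetical sketch of a registry client (the `CachingSchemaClient` name and `fetchSchema` callback are stand-ins for a real registry client's HTTP lookup):

```javascript
// Why the failure is asymmetric: a consumer caches writer schemas by ID,
// so a registry outage only hurts schema IDs it has never seen before.
class CachingSchemaClient {
  constructor(fetchSchema) {
    this.fetchSchema = fetchSchema; // stand-in for a registry HTTP call
    this.cache = new Map();         // schemaId -> schema
  }

  async getSchema(schemaId) {
    // Warm cache: the registry is not contacted at all.
    if (this.cache.has(schemaId)) return this.cache.get(schemaId);
    // Cold cache: this throws if the registry is unreachable.
    const schema = await this.fetchSchema(schemaId);
    this.cache.set(schemaId, schema);
    return schema;
  }
}
```

A long-running consumer has a warm cache and rides out the outage; a freshly started pod has an empty cache and fails on its first message.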
| Format | Schema required | Add field | Remove field | Rename field | Fails if registry down |
|---|---|---|---|---|---|
| JSON | No | Safe (ignored) | Risky (no enforcement) | Breaking | N/A |
| MessagePack | No | Safe (ignored) | Risky (no enforcement) | Breaking | N/A |
| Protobuf | Yes (.proto file) | Safe (new field number) | Safe (reserve the number) | Safe (aliases supported) | No (schema is compiled in) |
| Avro | Yes (registry) | Safe (with default) | Safe (with compatibility check) | Via alias | Yes (new instances fail) |
Performance comparison
The size and speed differences between formats are not marginal. At 50 MB/sec of JSON, a 3x size reduction cuts the stream to roughly 17 MB/sec out of your network interface. At 5,000 messages/sec, a 10x parse speedup means 100 microseconds of CPU per message instead of 1 millisecond.
| Format | Payload size vs JSON | Parse speed vs JSON | Schema | Human readable |
|---|---|---|---|---|
| JSON | 1x (baseline) | 1x (baseline) | No | Yes |
| MessagePack | ~0.6x (smaller) | ~3x faster | No | No |
| Protobuf | ~0.3x (smaller) | ~5-10x faster | Yes (.proto file) | No |
| Avro | ~0.3x (smaller) | ~3-5x faster | Yes (registry) | No |
These numbers vary by message shape. JSON's overhead is worst for objects with long field names. Protobuf's advantage shrinks for payloads that are mostly large binary blobs or strings, since those transmit similarly in both formats. For small messages with many named fields, Protobuf's advantage is near the top of this range.
JSON becomes a CPU bottleneck earlier than most teams expect
The crossover from "JSON is fine" to "JSON is the bottleneck" typically appears around 5-10 MB/sec of serialization throughput on a single service instance. Below that threshold, JSON parse time is a rounding error. Above it, profilers start showing JSON.parse or JSON.stringify prominently. The service in the opening scenario (50 MB/sec) had 40% of CPU inside the JSON parser. Switching to Protobuf at that scale is a near-zero-risk CPU reduction: 1-2 days to write .proto files and integrate codegen, with immediate payback in freed capacity.
Protobuf field number rules
Protobuf's safety comes from a discipline around field numbers. Two rules matter in production.
Rule 1: Field numbers 1-15 fit in one byte. Use them for your most frequent fields.
Protobuf encodes each field as a tag-value pair. The tag is the field number combined with the wire type. Field numbers 1-15 encode in a single byte. Field numbers 16-2047 require two bytes. For a message that is read millions of times per second, putting the high-frequency fields at numbers 1-15 reduces payload size.
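The one-byte/two-byte boundary falls out of the tag encoding: the field number is shifted left by 3 to make room for the wire type, then varint-encoded. A small sketch of that arithmetic:

```javascript
// How many bytes a field's tag occupies on the wire.
// tag = (fieldNumber << 3) | wireType, varint-encoded, so the field number
// shares its first byte with the 3 wire-type bits -- only 4 number bits remain.
function tagSize(fieldNumber) {
  let tag = fieldNumber << 3; // the wire-type bits don't change the length
  let bytes = 1;
  while (tag > 0x7f) {
    tag >>>= 7;
    bytes += 1;
  }
  return bytes;
}

console.log(tagSize(15));   // 1 byte  (15 << 3 = 120, fits in 7 bits)
console.log(tagSize(16));   // 2 bytes (16 << 3 = 128, needs a second byte)
console.log(tagSize(2047)); // 2 bytes
console.log(tagSize(2048)); // 3 bytes
```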
Rule 2: A removed field's number must be reserved forever.
If you remove field 3 from your schema today, and a new engineer adds a different field at number 3 nine months from now, any old clients that still parse field 3 as the original type will read the new field's bytes as the wrong type. There is no error. There is only corrupted data. The .proto file protects against this with reserved:
```protobuf
message OrderEvent {
  int64 order_id = 1;
  int64 customer_id = 2;

  // Field 3 was "coupon_code" (string), removed in v2
  // Field 4 was "discount_percent" (float), removed in v2
  reserved 3, 4;
  reserved "coupon_code", "discount_percent";

  // Safe to add new fields starting at 5
  string payment_method = 5;
  int64 amount_cents = 6;
}
```
This is not only a code practice. It is an organizational practice. Code review for .proto files must check that removed fields are reserved, not just deleted.
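The corruption mode is easy to demonstrate with a toy varint-only decoder. The field names and values below are hypothetical (chosen so both fields are varints, which is exactly the case where nothing on the wire flags the mistake):

```javascript
// Hypothetical demo of silent field-number reuse. A minimal decoder over
// varint fields only -- not a real Protobuf library, just the tag rules.
function decodeVarintFields(bytes, schema) {
  const out = {};
  let i = 0;
  const readVarint = () => {
    let shift = 0, value = 0;
    for (;;) {
      const b = bytes[i++];
      value |= (b & 0x7f) << shift;
      if ((b & 0x80) === 0) return value;
      shift += 7;
    }
  };
  while (i < bytes.length) {
    const tag = readVarint();
    const fieldNumber = tag >>> 3; // wire type is tag & 7; all varints here
    const value = readVarint();
    out[schema[fieldNumber] ?? `unknown_${fieldNumber}`] = value;
  }
  return out;
}

// The old reader still believes field 3 is discount_percent. A new writer
// reused number 3 for loyalty_points. Same wire type, so decoding "succeeds".
const oldSchema = { 1: "order_id", 3: "discount_percent" };
const wire = [0x08, 0x2a, 0x18, 0xfa, 0x01]; // field 1 = 42, field 3 = 250

console.log(decodeVarintFields(wire, oldSchema));
// → { order_id: 42, discount_percent: 250 }
```

No exception, no warning: 250 loyalty points comes back as a 250% discount. `reserved` exists to make this impossible.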
Production usage
| System | Format | Why |
|---|---|---|
| gRPC | Protobuf | Schema enforced at compile time, compact binary encoding, built-in code generation for all major languages, schema doubles as API contract |
| Kafka (typical) | Avro with schema registry | Strict schema evolution across independently deployed producers and consumers; Confluent Schema Registry is the de facto standard |
| Redis pub/sub | MessagePack or JSON | MessagePack reduces bandwidth with no schema overhead; JSON used when messages are consumed by browser clients directly |
| Browser REST APIs | JSON | Universal language support, human-debuggable via browser DevTools and curl, no client-side schema tooling required |
| Apache Spark / Parquet | Avro for streaming, Parquet for storage | Avro is row-oriented for streaming ingestion; Parquet is columnar for analytics scans; Spark reads both natively |
The pattern here: JSON at the external boundary (browsers, third-party integrations), Protobuf for internal synchronous RPC (gRPC), and Avro for event-driven async pipelines (Kafka) where schema governance matters enough to warrant a registry.
Limitations and when NOT to use it
- Binary formats (Protobuf, Avro, MessagePack) produce opaque bytes. Running `curl | less` or reading a Kafka topic with `kafkacat` shows garbage. Debugging requires schema-aware tooling. Teams that are not disciplined about maintaining `.proto` files or the schema registry lose the ability to inspect production traffic at the worst possible time.
- Avro ties consumer startup to schema registry availability. Consumers that are already running can continue deserializing from their cached schema. Any new consumer pod that starts when the registry is unreachable cannot deserialize a single message. At high traffic, this means a registry outage during a deploy can silently stall the entire consumer fleet.
- Protobuf enforces field number discipline on humans, not computers. The `reserved` keyword protects against accidental reuse only if engineers remember to use it. Organizations that do not have code review discipline for `.proto` files will eventually produce silent data corruption from field number reuse.
- JSON is 3-10x slower to parse than Protobuf for equivalent data. At low throughput (under 1 MB/sec), this is invisible. At 50 MB/sec, JSON parsing can account for 40% of CPU time on the service (this is the scenario that opened this article). JSON becomes a bottleneck before most teams expect it to.
- MessagePack is a middle ground that has never achieved widespread adoption. Most libraries that support Protobuf or Avro do not support MessagePack. If you pick MessagePack, you own a less common dependency with a smaller community.
Which format to use
Interview cheat sheet
- When asked about microservice serialization, immediately ask about schema evolution requirements. That single question determines whether you need Protobuf, Avro, or can stay with JSON. If schemas are stable and human readability matters, JSON is fine. If schemas evolve and consumers are independently deployed, you need something stricter.
- When someone says "let's just use JSON everywhere", say: JSON is the right default for external APIs and low-throughput internal services. At 50 MB/sec of parse throughput, it becomes a CPU bottleneck. Name the scale at which you would reach for Protobuf, because it signals you understand that format choice is a tradeoff, not a style preference.
- When asked about Protobuf field numbers, say: Field numbers are permanent. You cannot reuse a field number after removing a field. If you do, old clients silently misparse the new field's bytes as the old type. Reserve removed numbers explicitly in the `.proto` file. This is enforced by code review discipline, not by the compiler.
- When asked about Avro and schema registries, say: Avro gives you rigorous schema evolution with reader/writer schema comparison, but it makes the schema registry a hard operational dependency. If the registry is down, new consumer instances cannot start. Existing running consumers continue to work from cached schemas. This asymmetry trips teams during incidents.
- When comparing to JSON, say: Protobuf payloads are roughly 3x smaller and 5-10x faster to parse. The tradeoff is binary opacity (no human-readable debugging) and schema management overhead (`.proto` files, codegen, field number discipline). The crossover point is typically somewhere around 5-10 MB/sec of sustained throughput.
- When asked about Kafka serialization, say: Avro with schema registry is the standard for Kafka because it enforces compatibility between producers and consumers that deploy independently. A new producer schema is checked against all existing consumer schemas before the producer is allowed to publish. This prevents silent deserialization failures at the consumer.
- When asked about gRPC, say: gRPC uses Protobuf by default. The `.proto` file is simultaneously the schema, the code generator input, and the API contract. Both sides compile from the same file, so client and server are always in structural agreement. This is fundamentally different from REST/JSON, where the API shape is documented separately and drift is possible.
- When someone suggests MessagePack, say: MessagePack eliminates JSON's field-name verbosity without requiring a schema. It is the right choice when you want smaller and faster JSON without the schema management overhead of Protobuf. The downside is ecosystem support: it is less widely adopted than JSON or Protobuf, and fewer observability tools understand it natively.
Quick recap
- Serialization formats trade human readability, payload size, parse speed, and schema evolution capabilities against each other.
- JSON encodes field names as strings inside every message; binary formats like Protobuf use field numbers instead, producing payloads roughly 3x smaller.
- Schema evolution is the key differentiator: Protobuf handles add/remove via permanent field numbers, Avro via schema registry comparisons, and JSON via convention only.
- Protobuf field numbers are permanent; reusing a removed field's number causes silent data corruption on old clients that read the new bytes as the wrong type.
- Avro's schema registry is a hard dependency; consumers can deserialize from their cached schema but new instances cannot start without registry access.
- Choose JSON for human-debuggable external APIs, Protobuf for internal high-throughput RPC, and Avro for Kafka pipelines where strict schema governance across independently deployed services matters most.
Related concepts
- Message queues — Kafka's standard choice of Avro with schema registry is a direct consequence of needing schema evolution enforcement across independently deployed producers and consumers. Without the registry, a producer rename silently corrupts all consumer reads.
- Microservices — gRPC with Protobuf is the dominant serialization choice for synchronous microservice RPC because the `.proto` file doubles as the API contract and forces both sides into structural agreement at compile time.
- API Gateway — External APIs almost universally require JSON regardless of what internal services use, which means gateways must translate between Protobuf or Avro internally and JSON externally. That translation is a real CPU cost and an additional failure mode.