Compression algorithms
How LZ77, Huffman coding, and their descendants (gzip, zstd, Snappy) compress data, the entropy and redundancy concepts, speed vs ratio tradeoffs, and which algorithm to use for databases, networking, and storage.
The problem
Your analytics pipeline ingests 2 TB of JSON event data per day. Each event is 1-4 KB of mostly repeated field names, similar string values, and sequential timestamps. Storage costs are growing linearly with traffic, and network transfers between services are bottlenecked by bandwidth, not CPU.
You run the numbers: the 2 TB of raw JSON contains roughly 400 GB of actual information. The other 1.6 TB is structural redundancy (repeated keys, whitespace, predictable patterns). You are paying to store and transmit data that could be represented in a fraction of the space.
This is not just a cost problem. Uncompressed data means slower replication between database nodes, slower backups, and slower network transfers between datacenters. Every system that reads from disk or network benefits when the data is smaller. Compression algorithms solve this by exploiting redundancy to represent data in fewer bytes.
What it is
Compression is the process of encoding data using fewer bits than the original representation by identifying and eliminating redundancy. Lossless compression (which this article focuses on) preserves the original data exactly: decompress and you get back every byte.
Think of it like shorthand notes. Instead of writing "the quarterly earnings report for the third quarter of fiscal year 2024" in every meeting note, you write "Q3 FY24 report" and keep a legend that lets anyone reconstruct the full phrase. The legend is the dictionary, the abbreviation is the compressed form, and the ability to recover the original text exactly is what makes it lossless.
All practical lossless compression algorithms combine two fundamental ideas: finding repeated patterns (dictionary coding) and assigning shorter codes to frequent symbols (entropy coding). The differences between gzip, zstd, Snappy, and LZ4 come down to how aggressively they search for patterns and how they trade CPU time for compression ratio.
How it works
Most compression algorithms run in two stages. First, dictionary coding (LZ77 family) scans the input for repeated byte sequences and replaces duplicates with compact back-references. Second, entropy coding (Huffman or arithmetic) assigns shorter bit patterns to symbols that appear frequently.
LZ77: dictionary-based compression
LZ77 (Lempel-Ziv 1977) is the foundation of gzip, Deflate, zstd, Snappy, and LZ4. The core idea: maintain a sliding window of recently seen bytes and replace repeated sequences with (offset, length) references.
from collections import namedtuple

Reference = namedtuple("Reference", ["offset", "length"])
Literal = namedtuple("Literal", ["byte"])

def lz77_compress(data, window_size=32768):
    output = []
    pos = 0
    while pos < len(data):
        best_offset = 0
        best_length = 0
        # Search the sliding window for the longest match
        search_start = max(0, pos - window_size)
        for i in range(search_start, pos):
            match_len = 0
            while (pos + match_len < len(data)
                   and data[i + match_len] == data[pos + match_len]):
                match_len += 1
            if match_len > best_length:
                best_offset = pos - i
                best_length = match_len
        if best_length >= 3:  # minimum match length (3 for Deflate)
            output.append(Reference(offset=best_offset, length=best_length))
            pos += best_length
        else:
            output.append(Literal(data[pos]))
            pos += 1
    return output
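Decompression is the mirror image, and far cheaper: replay literals and copy back-references. A minimal sketch, using plain `(offset, length)` tuples for references and ints for literal bytes (an illustrative token convention, not Deflate's actual bit format):

```python
def lz77_decompress(tokens):
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):        # (offset, length) back-reference
            offset, length = token
            start = len(out) - offset
            for i in range(length):         # byte-by-byte copy handles overlapping
                out.append(out[start + i])  # matches where length > offset
        else:                               # literal byte
            out.append(token)
    return bytes(out)

# "abcabcabc": three literals, then one overlapping back-reference
tokens = [ord("a"), ord("b"), ord("c"), (3, 6)]
print(lz77_decompress(tokens))  # b'abcabcabc'
```

Note the overlapping reference: `(offset=3, length=6)` copies bytes that are being written as it runs, which is how LZ77 encodes runs longer than the match distance.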
The sliding window size determines how far back the algorithm looks for matches. Larger windows find more matches (better ratio) but require more memory and CPU time. gzip uses a 32 KB window. zstd supports windows up to 128 MB at higher compression levels.
The minimum match length matters too. A reference like (offset=500, length=2) takes more bytes to encode than the 2 literal bytes it replaces. Deflate requires matches of at least 3 bytes. Snappy and LZ4 require 4 bytes.
I find that understanding LZ77's sliding window is the single most useful thing for reasoning about why different compressors behave differently. A compressor with a larger window and more aggressive match searching (zstd level 19) compresses better but runs slower. One with a tiny window and simple hash-based matching (Snappy) runs fast but misses long-range patterns.
Huffman coding: entropy-based compression
Huffman coding assigns variable-length bit codes to symbols based on frequency. Frequent symbols get short codes; rare symbols get long codes. This is optimal for symbol-by-symbol coding (no code is a prefix of another).
English letter frequencies → Huffman codes:
'e' (13.0%) → 100 (3 bits)
't' (9.1%) → 000 (3 bits)
'a' (8.2%) → 0110 (4 bits)
'z' (0.07%) → 110111010 (9 bits)
Fixed ASCII encoding: every character = 8 bits
Huffman encoding: average ≈ 4.5 bits per character
Savings: ~44% smaller for English text
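That average has a hard floor: Shannon entropy, the minimum average bits per symbol achievable by any symbol-by-symbol code for a given frequency distribution. It can be computed directly; a quick sketch:

```python
import math
from collections import Counter

def shannon_entropy(data):
    # Minimum average bits per symbol for any symbol-by-symbol code
    # over this frequency distribution
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# 4 distinct symbols would need 2 bits each in a fixed-width code,
# but the skewed frequencies push the entropy below that
print(round(shannon_entropy("mississippi"), 2))  # 1.82
```

Huffman gets close to this floor but not exactly on it, because code lengths must be whole numbers of bits.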
Huffman builds a binary tree bottom-up: the two least frequent symbols are merged into a subtree, then the two least frequent remaining items are merged, and so on until one root. Left branches assign a "0" bit, right branches assign a "1" bit. The path from root to leaf is the code.
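The bottom-up merge fits in a few lines of Python. This sketch tracks each symbol's code directly instead of building an explicit tree node structure:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries: (frequency, tiebreaker, {symbol: code_so_far})
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; prepend "0" on the
        # left branch and "1" on the right branch
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("mississippi")
print(codes)  # frequent 's'/'i' get short codes, the lone 'm' gets the longest
```

The integer tiebreaker keeps heap comparisons away from the dicts; real implementations use it (or an equivalent rule) to make the tree deterministic.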
Arithmetic coding is the theoretical successor: it encodes the entire message as a single number and achieves slightly better ratios than Huffman (closer to the entropy limit). In practice, most systems still use Huffman because it is faster to decode and the ratio difference is small (typically under 5%).
Dictionary coding does the heavy lifting
In most real-world compressors, LZ77 dictionary coding contributes 60-80% of the compression ratio while Huffman/entropy coding contributes the remaining 20-40%. When engineers say "gzip compresses JSON 5x," most of that reduction comes from LZ77 finding repeated field names and values, not from Huffman optimizing individual byte frequencies. This is why Snappy (LZ only, no entropy stage) still achieves 1.5-2.5x ratios.
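You can see the split with the stdlib `zlib` module, which exposes a `Z_HUFFMAN_ONLY` strategy that disables the LZ77 match search and keeps only the entropy stage. A sketch on synthetic JSON-like events:

```python
import zlib

# 100 similar JSON events: heavy repetition for LZ77 to find
events = b"".join(
    b'{"user_id": %d, "event": "page_view", "ts": %d}' % (1000 + i, 1700000000 + i)
    for i in range(100)
)

full = len(zlib.compress(events, 6))  # LZ77 + Huffman (normal Deflate)

co = zlib.compressobj(level=6, strategy=zlib.Z_HUFFMAN_ONLY)
huffman_only = len(co.compress(events) + co.flush())  # entropy stage only

print(len(events), huffman_only, full)
# raw size > Huffman-only > full Deflate, with LZ77 providing most of the drop
```

Exact sizes depend on the zlib build, but the ordering holds whenever the input has repeated structure.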
zstd vs gzip vs Snappy vs LZ4
The four compressors you will encounter in production systems differ primarily in how they balance compression ratio against CPU cost.
gzip (Deflate = LZ77 + Huffman)
import gzip
data_compressed = gzip.compress(data, compresslevel=6)
# Speed: 100-300 MB/s compression, 400-600 MB/s decompression
# Ratio: 2-4x for JSON, 5-10x for log text
The internet's default compressor. Every browser supports Content-Encoding: gzip. Levels 1-9 trade speed for ratio: level 1 is fast with moderate compression, level 9 squeezes out every byte but runs 5-10x slower. Level 6 (the default) is the sweet spot for most use cases. My recommendation: if you are choosing gzip levels for a new system, just use the default.
zstd (Zstandard)
import zstandard as zstd
cctx = zstd.ZstdCompressor(level=3)
data_compressed = cctx.compress(data)
# Speed: 300-500 MB/s compression at level 3, 2000+ MB/s decompression
# Ratio: matches gzip at level 3, surpasses it at level 10+
Zstd is the modern default for new systems. It achieves similar ratios to gzip at level 3 while compressing 2-3x faster and decompressing 3-5x faster. The decompression speed (2+ GB/s) is what makes it transformative for databases: reading compressed pages adds negligible CPU overhead. Facebook developed zstd and uses it across their entire storage stack.
Interview tip: name zstd as the default
When asked about compression in a system design interview, say zstd by default. It dominates gzip on both speed and ratio. The only exceptions are: HTTP responses (browsers may not support zstd yet, so fall back to gzip), and latency-critical hot paths where Snappy/LZ4's simpler decode loop matters.
Snappy
import snappy
compressed = snappy.compress(data)
# Speed: 500-1500 MB/s compression, 1500-3000 MB/s decompression
# Ratio: 1.5-2.5x (significantly less than gzip/zstd)
Snappy deliberately sacrifices compression ratio for speed. It skips the entropy coding stage entirely, using only a simplified LZ77 pass. Google designed it for internal use where CPU cycles are more expensive than storage and every microsecond of decompression latency matters. RocksDB and LevelDB default to Snappy for SSTable compression, and Cassandra used it as its default before switching to LZ4.
LZ4
Speed: 500-800 MB/s compression, 4+ GB/s decompression
Ratio: similar to Snappy, slightly better
LZ4 is faster than Snappy with marginally better ratios. For streaming data pipelines where throughput is everything, LZ4 is the right choice. Kafka producers commonly enable LZ4 for message compression (compression is off by default). ClickHouse uses LZ4 for column compression in its hot path.
Compression in practice: databases, Kafka, and HTTP
Different systems apply compression at different layers, and the choices reveal what each system optimizes for.
Column stores: type-aware encoding + generic compression
Columnar databases (Parquet, ClickHouse, DuckDB) apply type-aware encoding before generic compression:
Column of integers: [1000001, 1000002, 1000003, ...]
Step 1: Delta encoding → [1000001, +1, +1, +1, ...]
Step 2: Run-length encode → [1000001, run of +1 x 999999]
Step 3: Apply LZ4 on top → ~100x total compression ratio
Column-level encoding exploits semantic knowledge of the data type. A column of timestamps with millisecond resolution benefits enormously from delta encoding because consecutive values differ by small, predictable amounts. Generic compression (gzip, zstd) treats data as an opaque byte stream and misses these regularities.
This is why column stores achieve 10-100x compression while row stores typically see 2-4x. The column layout groups identical data types together, creating far more redundancy for both stages to exploit.
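A toy version of the first two stages makes the idea concrete (illustrative only, not Parquet's or ClickHouse's actual encoders):

```python
def delta_encode(values):
    # Keep the first value, then store successive differences
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def rle_encode(values):
    # Collapse runs of identical values into [value, run_length] pairs
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

timestamps = [1000001, 1000002, 1000003, 1000004, 1000005]
deltas = delta_encode(timestamps)   # [1000001, 1, 1, 1, 1]
print(rle_encode(deltas))           # [[1000001, 1], [1, 4]]
```

Five seven-digit values collapse to two pairs; on a million-row timestamp column the same transform is what enables the ~100x ratios above.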
Kafka: per-batch compression
Kafka compresses at the message batch level, not individual messages. The producer accumulates messages into a batch, compresses the entire batch, and sends it as one compressed payload. The broker stores it compressed. The consumer decompresses the batch after receiving it.
Batch compression is more effective than per-message compression because repeated field names and similar values across messages in the same batch create redundancy that LZ77 can exploit. A batch of 100 similar JSON events compresses much better than 100 individually compressed events.
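The effect is easy to demonstrate with stdlib zlib standing in for Kafka's codecs:

```python
import json
import zlib

events = [
    {"user_id": 1000 + i, "event": "page_view", "ts": 1700000000 + i}
    for i in range(100)
]

# Per-message: each event compressed alone; no cross-message redundancy,
# and each tiny payload pays its own compression overhead
per_message = sum(len(zlib.compress(json.dumps(e).encode())) for e in events)

# Per-batch: one compress call sees the repeated keys across all 100 events
batched = len(zlib.compress(json.dumps(events).encode()))

print(per_message, batched)  # batched is several times smaller
```

The exact ratio depends on the payload, but for similar structured messages the batched size is reliably a fraction of the per-message total.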
The broker never decompresses the data (in most configurations), so compression reduces both network bandwidth and disk usage without adding CPU cost at the broker. This is a critical design choice: the CPU cost is shifted to producers and consumers, which are typically easier to scale horizontally.
HTTP: content negotiation
HTTP compression works through content negotiation. The client sends Accept-Encoding: gzip, br in the request header. The server compresses the response body and sets Content-Encoding: gzip. The client decompresses transparently.
For static assets (JS, CSS, HTML), compress at build time and serve pre-compressed files. For dynamic API responses, compress on the fly with a middleware. Most reverse proxies (nginx, Cloudflare) handle this automatically.
Brotli (br) achieves 15-20% better ratios than gzip for text content but compresses slower. Use Brotli for static assets (compress once, serve many times) and gzip for dynamic responses where compression latency matters.
Algorithm selection guide
| Scenario | Algorithm | Reasoning |
|---|---|---|
| HTTP response bodies | gzip or Brotli | Universal browser support; ratios matter for user experience |
| Static assets on CDN | Brotli (pre-compressed) | Compress once at build, serve millions of times |
| Kafka messages | LZ4 or zstd | LZ4 for throughput-critical topics, zstd for storage-constrained topics |
| Database pages (OLAP) | zstd level 3 | Balance of ratio and decompression speed for analytical queries |
| Database pages (OLTP) | Snappy or LZ4 | Decompression speed dominates in the hot read path |
| WAL/redo logs | zstd | High ratio for write-once, rarely-read data |
| Machine-to-machine RPC | zstd or Snappy | Depends on whether network or CPU is the bottleneck |
| Cold archival storage | zstd level 19 | Maximum ratio; decompression speed is irrelevant for cold data |
Decompression latency in the read path
Compression saves space and bandwidth, but the critical metric for real-time systems is decompression speed. Every page read, every message consumed, every API response decompressed adds CPU time to the request path.
A storage engine with zstd page compression:
Read 8KB compressed page from SSD: < 0.1ms I/O
Decompress 8KB → 32KB at 2 GB/s: 0.016ms
Net overhead per page: ~16 microseconds
At 1,000 pages/sec (light OLTP): 16ms total decompression overhead
At 100,000 pages/sec (heavy scan): 1.6 seconds total overhead
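You can measure the same per-page overhead yourself with stdlib zlib standing in for zstd (zlib decompresses several times slower, so treat the result as an upper bound on what a modern codec costs):

```python
import time
import zlib

# A 32 KB "page" of repetitive JSON-like content, compressed once up front
page = (b'{"user_id": 12345, "event": "page_view", "ts": 1700000000} ' * 600)[:32768]
compressed = zlib.compress(page, 6)

n = 2000
start = time.perf_counter()
for _ in range(n):
    zlib.decompress(compressed)
elapsed = time.perf_counter() - start

print(f"{elapsed / n * 1e6:.1f} us per page, "
      f"{len(page) * n / elapsed / 1e9:.2f} GB/s decompression")
```

Run it on your own hardware; the microseconds-per-page figure is what you multiply by your page read rate.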
Snappy and LZ4 decompress at 3-4+ GB/s, cutting that overhead by a third to a half compared to zstd. For OLTP databases doing point lookups (reading a few pages), the difference is negligible. For analytical queries scanning millions of pages, the decompression algorithm choice directly affects query latency.
The rule of thumb: if your system reads data more than it writes (which is most systems), optimize for decompression speed. If storage cost or network bandwidth is the primary constraint, optimize for ratio.
Production usage
| System | Compression usage | Notable behavior |
|---|---|---|
| Kafka | Per-batch compression with gzip, Snappy, LZ4, or zstd (off by default) | Broker stores data compressed without decompressing; compression ratio improves with larger batch sizes; zstd supported since Kafka 2.1 |
| PostgreSQL | TOAST compression for large values; WAL compression with LZ4 or zstd (PG 15+) | TOAST compresses values > 2KB using pglz (custom LZ variant) by default, with LZ4 available since PG 14 |
| RocksDB / LevelDB | Per-SSTable block compression with Snappy (default), zstd, LZ4, or zlib | Different compression per LSM level: Snappy for L0-L1 (hot), zstd for L2+ (cold); configurable per level |
| ClickHouse | Per-column compression with LZ4 (default) or zstd | Applies type-aware encoding (delta, dictionary, double-delta) before generic compression; codec can be specified per column |
| Cassandra | Per-SSTable compression with LZ4 (default), Snappy, zstd, or Deflate | Changed default from Snappy to LZ4 in Cassandra 3.0; compression chunk size (default 64KB) affects random read performance |
| nginx / CDN | HTTP response compression with gzip, Brotli | Pre-compressed static files served with gzip_static; dynamic compression at levels 1-3 for API responses to minimize latency |
Limitations and when NOT to use it
- Already-compressed data does not compress further. Applying gzip to JPEG images, MP4 video, or encrypted data typically increases file size by a few bytes (the compression header) while wasting CPU. Encrypted data is indistinguishable from random noise by design, so LZ77 finds no patterns.
- Compression breaks random access to data. You cannot seek to byte offset N in a compressed file without decompressing everything before it. Databases solve this by compressing in fixed-size blocks (pages or chunks) so only the relevant block needs decompression. But this means smaller blocks compress worse.
- High compression levels can bottleneck the write path. zstd level 19 achieves excellent ratios but compresses at 10-30 MB/s, roughly 10-30x slower than level 3. For write-heavy workloads (ingestion pipelines, WAL writes), high compression levels create backpressure.
- Compression amplifies write amplification in LSM-tree databases. Every compaction cycle decompresses, merges, and recompresses data. Higher compression ratios save disk space but increase the CPU cost of compaction. RocksDB allows different compression per LSM level to mitigate this.
- Small payloads compress poorly. LZ77 needs enough data to find repeated patterns. A 50-byte JSON message has almost no redundancy to exploit, and the compression header may be larger than the savings. Kafka's batch compression works precisely because it amortizes the overhead across hundreds of messages.
- Compression leaks information in encrypted channels (CRIME/BREACH attacks). If an attacker can control part of the plaintext and observe the compressed size, they can infer secret values byte-by-byte. This is why TLS-level compression is disabled in modern browsers.
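The fixed-size-block workaround for random access can be sketched with stdlib zlib (an illustrative layout, not any real database's page format):

```python
import zlib

BLOCK_SIZE = 4096

def compress_blocked(data, block_size=BLOCK_SIZE):
    # Compress each fixed-size block independently and record where each
    # compressed block starts, so any block can be located later
    blocks, index, pos = [], [], 0
    for i in range(0, len(data), block_size):
        blk = zlib.compress(data[i:i + block_size])
        index.append(pos)
        blocks.append(blk)
        pos += len(blk)
    return b"".join(blocks), index

def read_byte(compressed, index, offset, block_size=BLOCK_SIZE):
    # Random access: decompress only the one block containing `offset`
    b = offset // block_size
    start = index[b]
    end = index[b + 1] if b + 1 < len(index) else len(compressed)
    block = zlib.decompress(compressed[start:end])
    return block[offset % block_size]

data = b"0123456789" * 2000                      # 20 KB sample
comp, idx = compress_blocked(data)
assert read_byte(comp, idx, 12345) == data[12345]
```

The tradeoff from the list above is visible here: shrinking `block_size` makes random reads cheaper but gives LZ77 less data per block to find matches in, hurting the ratio.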
Interview cheat sheet
- When asked about compression in a system design, start with the bottleneck: if the bottleneck is network or storage, optimize for compression ratio (zstd, gzip). If the bottleneck is CPU in the read path, optimize for decompression speed (LZ4, Snappy).
- Name zstd as the modern default for new systems. It matches gzip's ratio at roughly 3x the speed and decompresses at 2+ GB/s. The only exceptions are HTTP (browser support) and ultra-low-latency paths (LZ4).
- When discussing databases, mention that column stores achieve 10-100x compression by applying type-aware encoding (delta, run-length, dictionary) before generic compression. Row stores typically see 2-4x because data types are interleaved.
- When a design involves Kafka, state that compression is per-batch, the broker never decompresses, and LZ4 or zstd are the usual choices (compression is off by default). Mention that batch size affects compression ratio (larger batches compress better).
- When asked about the tradeoff between compression and latency, quantify it: zstd decompresses at ~2 GB/s, adding ~16 microseconds per 32KB page. For point lookups this is negligible. For full table scans reading millions of pages, it adds up.
- When discussing storage tiers, recommend different algorithms per tier: Snappy/LZ4 for hot data (fast decompression), zstd for warm data (balance), zstd level 19 for cold archival (maximum ratio, speed irrelevant).
- State that all lossless compression combines two techniques: dictionary coding (LZ77, replacing repeated sequences with references) contributes 60-80% of the ratio, and entropy coding (Huffman, assigning short codes to frequent symbols) contributes the rest.
- When a peer mentions compressing encrypted or already-compressed data, note that it is pointless. Encrypted data has maximum entropy by design, and re-compressing compressed data yields no savings.
Quick recap
- All lossless compression combines dictionary coding (LZ77, replacing repeated sequences with back-references) and entropy coding (Huffman, assigning shorter codes to frequent symbols). Dictionary coding contributes 60-80% of the ratio.
- gzip (Deflate = LZ77 + Huffman) is the internet standard with good ratios but moderate speed. zstd matches gzip's ratio while compressing 2-3x faster and decompressing 3-5x faster, making it the modern default for new systems.
- Snappy and LZ4 sacrifice compression ratio for speed (1.5-2.5x ratio vs. 3-5x for gzip). Use them in hot read paths where decompression latency per page or per message must be minimized.
- Column stores achieve 10-100x compression by applying type-aware encoding (delta, run-length, dictionary) before generic compression, exploiting the semantic structure of same-type data stored together.
- Match algorithm to bottleneck: network/storage-constrained systems need ratio (zstd, gzip). CPU-constrained read paths need decompression speed (LZ4, Snappy). Write-heavy pipelines need fast compression (LZ4, zstd level 1-3).
- Never compress already-encrypted data (it has maximum entropy) or individually compress messages that will be batch-compressed later (you destroy cross-message redundancy).
Related concepts
- Serialization formats - Serialization (Protobuf, Avro, JSON) determines the byte layout that compression operates on. Compact binary formats like Protobuf leave less redundancy for LZ77, while verbose formats like JSON compress dramatically.
- Column-oriented storage - Columnar layout groups same-type values together, creating the redundancy patterns that make type-aware encoding (delta, RLE) and generic compression so effective.
- Kafka internals - Kafka's batch compression (producer compresses, broker stores compressed, consumer decompresses) is a core example of compression applied at the right aggregation level.
- Zero-copy I/O - Zero-copy techniques (sendfile, mmap) avoid copying data between kernel and user space. Compression requires processing data in user space, so compressed data cannot use zero-copy for the decompression step.
- LSM trees - LSM trees use per-level compression with different algorithms for hot vs. cold levels, directly applying the bottleneck-matching principle described in this article.