Compression algorithms
How LZ77, Huffman coding, and their descendants (gzip, zstd, Snappy) compress data, the entropy and redundancy concepts, speed vs ratio tradeoffs, and which algorithm to use for databases, networking, and storage.
The problem
Your analytics pipeline ingests 2 TB of JSON event data per day. Each event is 1-4 KB of mostly repeated field names, similar string values, and sequential timestamps. Storage costs are growing linearly with traffic, and network transfers between services are bottlenecked by bandwidth, not CPU.
You run the numbers: the 2 TB of raw JSON contains roughly 400 GB of actual information. The other 1.6 TB is structural redundancy (repeated keys, whitespace, predictable patterns). You are paying to store and transmit data that could be represented in a fraction of the space.
This is not just a cost problem. Uncompressed data means slower replication between database nodes, slower backups, and slower network transfers between datacenters. Every system that reads from disk or network benefits when the data is smaller. Compression algorithms solve this by exploiting redundancy to represent data in fewer bytes.
What it is
Compression is the process of encoding data using fewer bits than the original representation by identifying and eliminating redundancy. Lossless compression (which this article focuses on) preserves the original data exactly: decompress and you get back every byte.
Think of it like shorthand notes. Instead of writing "the quarterly earnings report for the third quarter of fiscal year 2024" in every meeting note, you write "Q3 FY24 report" and keep a legend that lets anyone reconstruct the full phrase. The legend is the dictionary, the abbreviation is the compressed form, and the ability to recover the original text exactly is what makes it lossless.
All practical lossless compression algorithms combine two fundamental ideas: finding repeated patterns (dictionary coding) and assigning shorter codes to frequent symbols (entropy coding). The differences between gzip, zstd, Snappy, and LZ4 come down to how aggressively they search for patterns and how they trade CPU time for compression ratio.
How it works
Most compression algorithms run in two stages. First, dictionary coding (LZ77 family) scans the input for repeated byte sequences and replaces duplicates with compact back-references. Second, entropy coding (Huffman or arithmetic) assigns shorter bit patterns to symbols that appear frequently.
LZ77: dictionary-based compression
LZ77 (Lempel-Ziv 1977) is the foundation of gzip, Deflate, zstd, Snappy, and LZ4. The core idea: maintain a sliding window of recently seen bytes and replace repeated sequences with (offset, length) references.
function lz77_compress(data, window_size=32768):
output = []
pos = 0
while pos < len(data):
best_offset = 0
best_length = 0
// Search the sliding window for the longest match
search_start = max(0, pos - window_size)
for i in range(search_start, pos):
match_len = 0
while pos + match_len < len(data)
and data[i + match_len] == data[pos + match_len]:
match_len += 1
if match_len > best_length:
best_offset = pos - i
best_length = match_len
if best_length >= 3: // minimum match length (3 for Deflate)
output.append(Reference(offset=best_offset, length=best_length))
pos += best_length
else:
output.append(Literal(data[pos]))
pos += 1
return output
The sliding window size determines how far back the algorithm looks for matches. Larger windows find more matches (better ratio) but require more memory and CPU time. gzip uses a 32 KB window. zstd supports windows up to 128 MB at higher compression levels.
The minimum match length matters too. A reference like (offset=500, length=2) takes more bytes to encode than the 2 literal bytes it replaces. Deflate requires matches of at least 3 bytes. Snappy and LZ4 require 4 bytes.
I find that understanding LZ77's sliding window is the single most useful thing for reasoning about why different compressors behave differently. A compressor with a larger window and more aggressive match searching (zstd level 19) compresses better but runs slower. One with a tiny window and simple hash-based matching (Snappy) runs fast but misses long-range patterns.
Huffman coding: entropy-based compression
Huffman coding assigns variable-length bit codes to symbols based on frequency. Frequent symbols get short codes; rare symbols get long codes. This is optimal for symbol-by-symbol coding (no code is a prefix of another).
English letter frequencies β Huffman codes:
'e' (13.0%) β 100 (3 bits)
't' (9.1%) β 000 (3 bits)
'a' (8.2%) β 0110 (4 bits)
'z' (0.07%) β 110111010 (9 bits)
Fixed ASCII encoding: every character = 8 bits
Huffman encoding: average β 4.5 bits per character
Savings: ~44% smaller for English text
Huffman builds a binary tree bottom-up: the two least frequent symbols are merged into a subtree, then the two least frequent remaining items are merged, and so on until one root. Left branches assign a "0" bit, right branches assign a "1" bit. The path from root to leaf is the code.
Arithmetic coding is the theoretical successor: it encodes the entire message as a single number and achieves slightly better ratios than Huffman (closer to the entropy limit). In practice, most systems still use Huffman because it is faster to decode and the ratio difference is small (typically under 5%).
Dictionary coding does the heavy lifting
In most real-world compressors, LZ77 dictionary coding contributes 60-80% of the compression ratio while Huffman/entropy coding contributes the remaining 20-40%. When engineers say "gzip compresses JSON 5x," most of that reduction comes from LZ77 finding repeated field names and values, not from Huffman optimizing individual byte frequencies. This is why Snappy (LZ only, no entropy stage) still achieves 1.5-2.5x ratios.
zstd vs gzip vs Snappy vs LZ4
The four compressors you will encounter in production systems differ primarily in how they balance compression ratio against CPU cost.
gzip (Deflate = LZ77 + Huffman)
import gzip
data_compressed = gzip.compress(data, compresslevel=6)
# Speed: 100-300 MB/s compression, 400-600 MB/s decompression
# Ratio: 2-4x for JSON, 5-10x for log text
The internet's default compressor. Every browser supports Content-Encoding: gzip. Levels 1-9 trade speed for ratio: level 1 is fast with moderate compression, level 9 squeezes out every byte but runs 5-10x slower. Level 6 (the default) is the sweet spot for most use cases. My recommendation: if you are choosing gzip levels for a new system, just use the default.
zstd (Zstandard)
import zstandard as zstd
cctx = zstd.ZstdCompressor(level=3)
data_compressed = cctx.compress(data)
# Speed: 300-500 MB/s compression at level 3, 2000+ MB/s decompression
# Ratio: matches gzip at level 3, surpasses it at level 10+
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.