Compression algorithms
How LZ77, Huffman coding, and their descendants (gzip, zstd, Snappy) compress data, the entropy and redundancy concepts, speed vs ratio tradeoffs, and which algorithm to use for databases, networking, and storage.
The problem
Your analytics pipeline ingests 2 TB of JSON event data per day. Each event is 1-4 KB of mostly repeated field names, similar string values, and sequential timestamps. Storage costs are growing linearly with traffic, and network transfers between services are bottlenecked by bandwidth, not CPU.
You run the numbers: the 2 TB of raw JSON contains roughly 400 GB of actual information. The other 1.6 TB is structural redundancy (repeated keys, whitespace, predictable patterns). You are paying to store and transmit data that could be represented in a fraction of the space.
This is not just a cost problem. Uncompressed data means slower replication between database nodes, slower backups, and slower network transfers between datacenters. Every system that reads from disk or network benefits when the data is smaller. Compression algorithms solve this by exploiting redundancy to represent data in fewer bytes.
What it is
Compression is the process of encoding data using fewer bits than the original representation by identifying and eliminating redundancy. Lossless compression (which this article focuses on) preserves the original data exactly: decompress and you get back every byte.
Think of it like shorthand notes. Instead of writing "the quarterly earnings report for the third quarter of fiscal year 2024" in every meeting note, you write "Q3 FY24 report" and keep a legend that lets anyone reconstruct the full phrase. The legend is the dictionary, the abbreviation is the compressed form, and the ability to recover the original text exactly is what makes it lossless.
All practical lossless compression algorithms combine two fundamental ideas: finding repeated patterns (dictionary coding) and assigning shorter codes to frequent symbols (entropy coding). The differences between gzip, zstd, Snappy, and LZ4 come down to how aggressively they search for patterns and how they trade CPU time for compression ratio.
How it works
Most compression algorithms run in two stages. First, dictionary coding (LZ77 family) scans the input for repeated byte sequences and replaces duplicates with compact back-references. Second, entropy coding (Huffman or arithmetic) assigns shorter bit patterns to symbols that appear frequently.
LZ77: dictionary-based compression
LZ77 (Lempel-Ziv 1977) is the foundation of gzip, Deflate, zstd, Snappy, and LZ4. The core idea: maintain a sliding window of recently seen bytes and replace repeated sequences with (offset, length) references.
from collections import namedtuple

Reference = namedtuple("Reference", ["offset", "length"])
Literal = namedtuple("Literal", ["byte"])

def lz77_compress(data, window_size=32768):
    output = []
    pos = 0
    while pos < len(data):
        best_offset = 0
        best_length = 0
        # Search the sliding window for the longest match
        search_start = max(0, pos - window_size)
        for i in range(search_start, pos):
            match_len = 0
            while (pos + match_len < len(data)
                   and data[i + match_len] == data[pos + match_len]):
                match_len += 1
            if match_len > best_length:
                best_offset = pos - i
                best_length = match_len
        if best_length >= 3:  # minimum match length (3 for Deflate)
            output.append(Reference(offset=best_offset, length=best_length))
            pos += best_length
        else:
            output.append(Literal(data[pos]))
            pos += 1
    return output
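Decompression is the mirror image, and far cheaper: replay literals and copy back-references. A minimal sketch, using plain `(offset, length)` tuples for references and ints for literal bytes (an illustrative token convention, not Deflate's actual bit format):

```python
def lz77_decompress(tokens):
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):        # (offset, length) back-reference
            offset, length = token
            start = len(out) - offset
            for i in range(length):         # byte-by-byte copy handles overlapping
                out.append(out[start + i])  # matches where length > offset
        else:                               # literal byte
            out.append(token)
    return bytes(out)

# "abcabcabc": three literals, then one overlapping back-reference
tokens = [ord("a"), ord("b"), ord("c"), (3, 6)]
print(lz77_decompress(tokens))  # b'abcabcabc'
```

Note the overlapping reference: `(offset=3, length=6)` copies bytes that are being written as it runs, which is how LZ77 encodes runs longer than the match distance.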
The sliding window size determines how far back the algorithm looks for matches. Larger windows find more matches (better ratio) but require more memory and CPU time. gzip uses a 32 KB window. zstd supports windows up to 128 MB at higher compression levels.
The minimum match length matters too. A reference like (offset=500, length=2) takes more bytes to encode than the 2 literal bytes it replaces. Deflate requires matches of at least 3 bytes. Snappy and LZ4 require 4 bytes.
I find that understanding LZ77's sliding window is the single most useful thing for reasoning about why different compressors behave differently. A compressor with a larger window and more aggressive match searching (zstd level 19) compresses better but runs slower. One with a tiny window and simple hash-based matching (Snappy) runs fast but misses long-range patterns.
Huffman coding: entropy-based compression
Huffman coding assigns variable-length bit codes to symbols based on frequency. Frequent symbols get short codes; rare symbols get long codes. This is optimal for symbol-by-symbol coding (no code is a prefix of another).
English letter frequencies → Huffman codes:
'e' (13.0%) → 100 (3 bits)
't' (9.1%) → 000 (3 bits)
'a' (8.2%) → 0110 (4 bits)
'z' (0.07%) → 110111010 (9 bits)
Fixed ASCII encoding: every character = 8 bits
Huffman encoding: average ≈ 4.5 bits per character
Savings: ~44% smaller for English text
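That average has a hard floor: Shannon entropy, the minimum average bits per symbol achievable by any symbol-by-symbol code for a given frequency distribution. It can be computed directly; a quick sketch:

```python
import math
from collections import Counter

def shannon_entropy(data):
    # Minimum average bits per symbol for any symbol-by-symbol code
    # over this frequency distribution
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# 4 distinct symbols would need 2 bits each in a fixed-width code,
# but the skewed frequencies push the entropy below that
print(round(shannon_entropy("mississippi"), 2))  # 1.82
```

Huffman gets close to this floor but not exactly on it, because code lengths must be whole numbers of bits.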
Huffman builds a binary tree bottom-up: the two least frequent symbols are merged into a subtree, then the two least frequent remaining items are merged, and so on until one root. Left branches assign a "0" bit, right branches assign a "1" bit. The path from root to leaf is the code.
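The bottom-up merge fits in a few lines of Python. This sketch tracks each symbol's code directly instead of building an explicit tree node structure:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries: (frequency, tiebreaker, {symbol: code_so_far})
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; prepend "0" on the
        # left branch and "1" on the right branch
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("mississippi")
print(codes)  # frequent 's'/'i' get short codes, the lone 'm' gets the longest
```

The integer tiebreaker keeps heap comparisons away from the dicts; real implementations use it (or an equivalent rule) to make the tree deterministic.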
Arithmetic coding is the theoretical successor: it encodes the entire message as a single number and achieves slightly better ratios than Huffman (closer to the entropy limit). In practice, most systems still use Huffman because it is faster to decode and the ratio difference is small (typically under 5%).
Dictionary coding does the heavy lifting
In most real-world compressors, LZ77 dictionary coding contributes 60-80% of the compression ratio while Huffman/entropy coding contributes the remaining 20-40%. When engineers say "gzip compresses JSON 5x," most of that reduction comes from LZ77 finding repeated field names and values, not from Huffman optimizing individual byte frequencies. This is why Snappy (LZ only, no entropy stage) still achieves 1.5-2.5x ratios.
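You can see the split with the stdlib `zlib` module, which exposes a `Z_HUFFMAN_ONLY` strategy that disables the LZ77 match search and keeps only the entropy stage. A sketch on synthetic JSON-like events:

```python
import zlib

# 100 similar JSON events: heavy repetition for LZ77 to find
events = b"".join(
    b'{"user_id": %d, "event": "page_view", "ts": %d}' % (1000 + i, 1700000000 + i)
    for i in range(100)
)

full = len(zlib.compress(events, 6))  # LZ77 + Huffman (normal Deflate)

co = zlib.compressobj(level=6, strategy=zlib.Z_HUFFMAN_ONLY)
huffman_only = len(co.compress(events) + co.flush())  # entropy stage only

print(len(events), huffman_only, full)
# raw size > Huffman-only > full Deflate, with LZ77 providing most of the drop
```

Exact sizes depend on the zlib build, but the ordering holds whenever the input has repeated structure.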
zstd vs gzip vs Snappy vs LZ4
The four compressors you will encounter in production systems differ primarily in how they balance compression ratio against CPU cost.
gzip (Deflate = LZ77 + Huffman)
import gzip
data_compressed = gzip.compress(data, compresslevel=6)
# Speed: 100-300 MB/s compression, 400-600 MB/s decompression
# Ratio: 2-4x for JSON, 5-10x for log text
The internet's default compressor. Every browser supports Content-Encoding: gzip. Levels 1-9 trade speed for ratio: level 1 is fast with moderate compression, level 9 squeezes out every byte but runs 5-10x slower. Level 6 (the default) is the sweet spot for most use cases. My recommendation: if you are choosing gzip levels for a new system, just use the default.
zstd (Zstandard)
import zstandard as zstd
cctx = zstd.ZstdCompressor(level=3)
data_compressed = cctx.compress(data)
# Speed: 300-500 MB/s compression at level 3, 2000+ MB/s decompression
# Ratio: matches gzip at level 3, surpasses it at level 10+
Zstd is the modern default for new systems. It achieves similar ratios to gzip at level 3 while compressing 2-3x faster and decompressing 3-5x faster. The decompression speed (2+ GB/s) is what makes it transformative for databases: reading compressed pages adds negligible CPU overhead. Facebook developed zstd and uses it across their entire storage stack.
Interview tip: name zstd as the default
When asked about compression in a system design interview, say zstd by default. It dominates gzip on both speed and ratio. The only exceptions are: HTTP responses (browsers may not support zstd yet, so fall back to gzip), and latency-critical hot paths where Snappy/LZ4's simpler decode loop matters.
Snappy
import snappy
compressed = snappy.compress(data)
# Speed: 500-1500 MB/s compression, 1500-3000 MB/s decompression
# Ratio: 1.5-2.5x (significantly less than gzip/zstd)
Snappy deliberately sacrifices compression ratio for speed. It skips the entropy coding stage entirely, using only a simplified LZ77 pass. Google designed it for internal use where CPU cycles are more expensive than storage and every microsecond of decompression latency matters. RocksDB and LevelDB default to Snappy for SSTable compression, and Cassandra used it as its default before switching to LZ4.
LZ4
Speed: 500-800 MB/s compression, 4+ GB/s decompression
Ratio: similar to Snappy, slightly better
LZ4 is faster than Snappy with marginally better ratios. For streaming data pipelines where throughput is everything, LZ4 is the right choice. Kafka producers commonly enable LZ4 for message compression (compression is off by default). ClickHouse uses LZ4 for column compression in its hot path.
Compression in practice: databases, Kafka, and HTTP
Different systems apply compression at different layers, and the choices reveal what each system optimizes for.
Column stores: type-aware encoding + generic compression
Columnar databases (Parquet, ClickHouse, DuckDB) apply type-aware encoding before generic compression:
Column of integers: [1000001, 1000002, 1000003, ...]
Step 1: Delta encoding → [1000001, +1, +1, +1, ...]
Step 2: Run-length encode → [1000001, run of +1 x 999999]
Step 3: Apply LZ4 on top → ~100x total compression ratio
Column-level encoding exploits semantic knowledge of the data type. A column of timestamps with millisecond resolution benefits enormously from delta encoding because consecutive values differ by small, predictable amounts. Generic compression (gzip, zstd) treats data as an opaque byte stream and misses these regularities.
This is why column stores achieve 10-100x compression while row stores typically see 2-4x. The column layout groups identical data types together, creating far more redundancy for both stages to exploit.
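A toy version of the first two stages makes the idea concrete (illustrative only, not Parquet's or ClickHouse's actual encoders):

```python
def delta_encode(values):
    # Keep the first value, then store successive differences
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def rle_encode(values):
    # Collapse runs of identical values into [value, run_length] pairs
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

timestamps = [1000001, 1000002, 1000003, 1000004, 1000005]
deltas = delta_encode(timestamps)   # [1000001, 1, 1, 1, 1]
print(rle_encode(deltas))           # [[1000001, 1], [1, 4]]
```

Five seven-digit values collapse to two pairs; on a million-row timestamp column the same transform is what enables the ~100x ratios above.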
Kafka: per-batch compression
Kafka compresses at the message batch level, not individual messages. The producer accumulates messages into a batch, compresses the entire batch, and sends it as one compressed payload. The broker stores it compressed. The consumer decompresses the batch after receiving it.
Batch compression is more effective than per-message compression because repeated field names and similar values across messages in the same batch create redundancy that LZ77 can exploit. A batch of 100 similar JSON events compresses much better than 100 individually compressed events.
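The effect is easy to demonstrate with stdlib zlib standing in for Kafka's codecs:

```python
import json
import zlib

events = [
    {"user_id": 1000 + i, "event": "page_view", "ts": 1700000000 + i}
    for i in range(100)
]

# Per-message: each event compressed alone; no cross-message redundancy,
# and each tiny payload pays its own compression overhead
per_message = sum(len(zlib.compress(json.dumps(e).encode())) for e in events)

# Per-batch: one compress call sees the repeated keys across all 100 events
batched = len(zlib.compress(json.dumps(events).encode()))

print(per_message, batched)  # batched is several times smaller
```

The exact ratio depends on the payload, but for similar structured messages the batched size is reliably a fraction of the per-message total.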
The broker never decompresses the data (in most configurations), so compression reduces both network bandwidth and disk usage without adding CPU cost at the broker. This is a critical design choice: the CPU cost is shifted to producers and consumers, which are typically easier to scale horizontally.
HTTP: content negotiation
HTTP compression works through content negotiation. The client sends Accept-Encoding: gzip, br in the request header. The server compresses the response body and sets Content-Encoding: gzip. The client decompresses transparently.
For static assets (JS, CSS, HTML), compress at build time and serve pre-compressed files. For dynamic API responses, compress on the fly with a middleware. Most reverse proxies (nginx, Cloudflare) handle this automatically.
Brotli (br) achieves 15-20% better ratios than gzip for text content but compresses slower. Use Brotli for static assets (compress once, serve many times) and gzip for dynamic responses where compression latency matters.
Algorithm selection guide
| Scenario | Algorithm | Reasoning |
|---|---|---|
| HTTP response bodies | gzip or Brotli | Universal browser support; ratios matter for user experience |
| Static assets on CDN | Brotli (pre-compressed) | Compress once at build, serve millions of times |
| Kafka messages | LZ4 or zstd | LZ4 for throughput-critical topics, zstd for storage-constrained topics |
| Database pages (OLAP) | zstd level 3 | Balance of ratio and decompression speed for analytical queries |
| Database pages (OLTP) | Snappy or LZ4 | Decompression speed dominates in the hot read path |
| WAL/redo logs | zstd | High ratio for write-once, rarely-read data |
| Machine-to-machine RPC | zstd or Snappy | Depends on whether network or CPU is the bottleneck |
| Cold archival storage | zstd level 19 | Maximum ratio; decompression speed is irrelevant for cold data |
Decompression latency in the read path
Compression saves space and bandwidth, but the critical metric for real-time systems is decompression speed. Every page read, every message consumed, every API response decompressed adds CPU time to the request path.
A storage engine with zstd page compression:
Read 8KB compressed page from SSD: < 0.1ms I/O
Decompress 8KB → 32KB at 2 GB/s: 0.016ms
Net overhead per page: ~16 microseconds
At 1,000 pages/sec (light OLTP): 16ms total decompression overhead
At 100,000 pages/sec (heavy scan): 1.6 seconds total overhead
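You can measure the same per-page overhead yourself with stdlib zlib standing in for zstd (zlib decompresses several times slower, so treat the result as an upper bound on what a modern codec costs):

```python
import time
import zlib

# A 32 KB "page" of repetitive JSON-like content, compressed once up front
page = (b'{"user_id": 12345, "event": "page_view", "ts": 1700000000} ' * 600)[:32768]
compressed = zlib.compress(page, 6)

n = 2000
start = time.perf_counter()
for _ in range(n):
    zlib.decompress(compressed)
elapsed = time.perf_counter() - start

print(f"{elapsed / n * 1e6:.1f} us per page, "
      f"{len(page) * n / elapsed / 1e9:.2f} GB/s decompression")
```

Run it on your own hardware; the microseconds-per-page figure is what you multiply by your page read rate.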
Snappy and LZ4 decompress at 3-4+ GB/s, cutting that overhead by a third to a half compared to zstd. For OLTP databases doing point lookups (reading a few pages), the difference is negligible. For analytical queries scanning millions of pages, the decompression algorithm choice directly affects query latency.
The rule of thumb: if your system reads data more than it writes (which is most systems), optimize for decompression speed. If storage cost or network bandwidth is the primary constraint, optimize for ratio.
Production usage
| System | Compression usage | Notable behavior |
|---|---|---|
| Kafka | Per-batch compression with gzip, Snappy, LZ4, or zstd (off by default) | Broker stores data compressed without decompressing; compression ratio improves with larger batch sizes; zstd supported since Kafka 2.1 |
| PostgreSQL | TOAST compression for large values; WAL compression with LZ4 or zstd (PG 15+) | TOAST compresses values > 2KB using pglz (custom LZ variant) by default, with LZ4 available since PG 14 |
| RocksDB / LevelDB | Per-SSTable block compression with Snappy (default), zstd, LZ4, or zlib | Different compression per LSM level: Snappy for L0-L1 (hot), zstd for L2+ (cold); configurable per level |
| ClickHouse | Per-column compression with LZ4 (default) or zstd | Applies type-aware encoding (delta, dictionary, double-delta) before generic compression; codec can be specified per column |
| Cassandra | Per-SSTable compression with LZ4 (default), Snappy, zstd, or Deflate | Changed default from Snappy to LZ4 in Cassandra 3.0; compression chunk size (default 64KB) affects random read performance |
| nginx / CDN | HTTP response compression with gzip, Brotli | Pre-compressed static files served with gzip_static; dynamic compression at levels 1-3 for API responses to minimize latency |
Limitations and when NOT to use it
- Already-compressed data does not compress further. Applying gzip to JPEG images, MP4 video, or encrypted data typically increases file size by a few bytes (the compression header) while wasting CPU. Encrypted data is indistinguishable from random noise by design, so LZ77 finds no patterns.
- Compression breaks random access to data. You cannot seek to byte offset N in a compressed file without decompressing everything before it. Databases solve this by compressing in fixed-size blocks (pages or chunks) so only the relevant block needs decompression. But this means smaller blocks compress worse.
- High compression levels can bottleneck the write path. zstd level 19 achieves excellent ratios but compresses at 10-30 MB/s, roughly 10-30x slower than level 3. For write-heavy workloads (ingestion pipelines, WAL writes), high compression levels create backpressure.
- Compression amplifies write amplification in LSM-tree databases. Every compaction cycle decompresses, merges, and recompresses data. Higher compression ratios save disk space but increase the CPU cost of compaction. RocksDB allows different compression per LSM level to mitigate this.
- Small payloads compress poorly. LZ77 needs enough data to find repeated patterns. A 50-byte JSON message has almost no redundancy to exploit, and the compression header may be larger than the savings. Kafka's batch compression works precisely because it amortizes the overhead across hundreds of messages.
- Compression leaks information in encrypted channels (CRIME/BREACH attacks). If an attacker can control part of the plaintext and observe the compressed size, they can infer secret values byte-by-byte. This is why TLS-level compression is disabled in modern browsers.
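The fixed-size-block workaround for random access can be sketched with stdlib zlib (an illustrative layout, not any real database's page format):

```python
import zlib

BLOCK_SIZE = 4096

def compress_blocked(data, block_size=BLOCK_SIZE):
    # Compress each fixed-size block independently and record where each
    # compressed block starts, so any block can be located later
    blocks, index, pos = [], [], 0
    for i in range(0, len(data), block_size):
        blk = zlib.compress(data[i:i + block_size])
        index.append(pos)
        blocks.append(blk)
        pos += len(blk)
    return b"".join(blocks), index

def read_byte(compressed, index, offset, block_size=BLOCK_SIZE):
    # Random access: decompress only the one block containing `offset`
    b = offset // block_size
    start = index[b]
    end = index[b + 1] if b + 1 < len(index) else len(compressed)
    block = zlib.decompress(compressed[start:end])
    return block[offset % block_size]

data = b"0123456789" * 2000                      # 20 KB sample
comp, idx = compress_blocked(data)
assert read_byte(comp, idx, 12345) == data[12345]
```

The tradeoff from the list above is visible here: shrinking `block_size` makes random reads cheaper but gives LZ77 less data per block to find matches in, hurting the ratio.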
Interview cheat sheet
- When asked about compression in a system design, start with the bottleneck: if the bottleneck is network or storage, optimize for compression ratio (zstd, gzip). If the bottleneck is CPU in the read path, optimize for decompression speed (LZ4, Snappy).
- Name zstd as the modern default for new systems. It matches gzip's ratio at roughly 3x the speed and decompresses at 2+ GB/s. The only exceptions are HTTP (browser support) and ultra-low-latency paths (LZ4).
- When discussing databases, mention that column stores achieve 10-100x compression by applying type-aware encoding (delta, run-length, dictionary) before generic compression. Row stores typically see 2-4x because data types are interleaved.
- When a design involves Kafka, state that compression is per-batch, the broker never decompresses, and LZ4 or zstd are the usual choices (compression is off by default). Mention that batch size affects compression ratio (larger batches compress better).
- When asked about the tradeoff between compression and latency, quantify it: zstd decompresses at ~2 GB/s, adding ~16 microseconds per 32KB page. For point lookups this is negligible. For full table scans reading millions of pages, it adds up.
- When discussing storage tiers, recommend different algorithms per tier: Snappy/LZ4 for hot data (fast decompression), zstd for warm data (balance), zstd level 19 for cold archival (maximum ratio, speed irrelevant).
- State that all lossless compression combines two techniques: dictionary coding (LZ77, replacing repeated sequences with references) contributes 60-80% of the ratio, and entropy coding (Huffman, assigning short codes to frequent symbols) contributes the rest.
- When a peer mentions compressing encrypted or already-compressed data, note that it is pointless. Encrypted data has maximum entropy by design, and re-compressing compressed data yields no savings.
Quick recap
- All lossless compression combines dictionary coding (LZ77, replacing repeated sequences with back-references) and entropy coding (Huffman, assigning shorter codes to frequent symbols). Dictionary coding contributes 60-80% of the ratio.
- gzip (Deflate = LZ77 + Huffman) is the internet standard with good ratios but moderate speed. zstd matches gzip's ratio while compressing 2-3x faster and decompressing 3-5x faster, making it the modern default for new systems.
- Snappy and LZ4 sacrifice compression ratio for speed (1.5-2.5x ratio vs. 3-5x for gzip). Use them in hot read paths where decompression latency per page or per message must be minimized.
- Column stores achieve 10-100x compression by applying type-aware encoding (delta, run-length, dictionary) before generic compression, exploiting the semantic structure of same-type data stored together.
- Match algorithm to bottleneck: network/storage-constrained systems need ratio (zstd, gzip). CPU-constrained read paths need decompression speed (LZ4, Snappy). Write-heavy pipelines need fast compression (LZ4, zstd level 1-3).
- Never compress already-encrypted data (it has maximum entropy) or individually compress messages that will be batch-compressed later (you destroy cross-message redundancy).
Related concepts
- Serialization formats - Serialization (Protobuf, Avro, JSON) determines the byte layout that compression operates on. Compact binary formats like Protobuf leave less redundancy for LZ77, while verbose formats like JSON compress dramatically.
- Column-oriented storage - Columnar layout groups same-type values together, creating the redundancy patterns that make type-aware encoding (delta, RLE) and generic compression so effective.
- Kafka internals - Kafka's batch compression (producer compresses, broker stores compressed, consumer decompresses) is a core example of compression applied at the right aggregation level.
- Zero-copy I/O - Zero-copy techniques (sendfile, mmap) avoid copying data between kernel and user space. Compression requires processing data in user space, so compressed data cannot use zero-copy for the decompression step.
- LSM trees - LSM trees use per-level compression with different algorithms for hot vs. cold levels, directly applying the bottleneck-matching principle described in this article.