How data compression works
Understand the algorithms behind gzip, zstd, and LZ4 that compress files and network payloads, the entropy theory, dictionary-based compression, and why system designers must understand when to compress, when not to, and what the tradeoffs are.
The Problem Statement
Interviewer: "Your API responses are averaging 500KB of JSON, and you are paying a fortune in bandwidth. You also have 2TB of daily log data sitting in S3. Walk me through how compression works, which algorithms you would pick for each case, and what tradeoffs you would consider."
This question tests three things. First, whether you actually understand how compression algorithms work (not just "use gzip"). Second, whether you can match the right algorithm to the right use case. Third, whether you know the situations where compression hurts rather than helps.
I love this question because it separates candidates who treat compression as a black box ("just enable gzip") from those who understand the mechanics well enough to make informed tradeoffs. The difference matters when you are choosing between zstd and LZ4 for a Kafka pipeline, or deciding whether to compress data that is already in Parquet format.
Clarifying the Scenario
You: "Before I dive in, let me confirm scope. When you say compression, are we focused on lossless compression for structured data (JSON, logs, database pages), or should I also cover lossy compression for media (images, video, audio)?"
Interviewer: "Focus on lossless. Assume we are dealing with API payloads, log files, and database storage."
You: "Got it. And should I go deep on the algorithm internals, or keep it at the level of 'how it works and when to use what?'"
Interviewer: "I want you to explain how LZ77 and Huffman coding work at a conceptual level, then get practical about algorithm selection."
You: "Perfect. I will structure my answer in three parts: how dictionary-based compression works, how entropy coding works, and then a practical framework for choosing the right algorithm for different scenarios in a production system."
My Approach
I break this into four parts:
- Entropy and the compression limit: Why some data compresses well and other data does not, rooted in Shannon's information theory.
- Dictionary-based compression (LZ77): The core algorithm behind gzip, zstd, and LZ4, using sliding windows to find and replace repeated byte patterns.
- Entropy coding (Huffman and ANS): The second stage that squeezes out remaining redundancy by encoding frequent symbols with fewer bits.
- Practical algorithm selection: A decision framework for choosing gzip vs zstd vs LZ4 vs Brotli vs Snappy based on the specific system design scenario.
The Architecture
Where does compression fit in a typical system? It is not a single point. Compression appears at multiple layers, each with different requirements.
Each layer has different latency and throughput requirements, which is why different algorithms win at different points. The CDN uses Brotli (slow compression, best ratio) because assets are compressed once at build time and served millions of times. Kafka uses Snappy or LZ4 (fastest compression) because messages flow through in real time. S3 archives use zstd (best ratio at reasonable speed) because the data is written once and read rarely.
For your interview: mention that compression is not a single decision but a per-layer decision, and name the algorithm you would pick at each layer. This shows systems thinking, not just algorithm knowledge.
Dictionary-Based Compression: LZ77 and Sliding Windows
This is the core idea behind nearly every modern compression algorithm. The concept is elegant: instead of storing repeated data, store a reference to where that data appeared before.
How LZ77 works
LZ77 uses a "sliding window" that looks backward through the data you have already processed. When it finds a sequence of bytes that matches something in the window, it replaces the sequence with a (distance, length) pair.
Input text: "the cat sat on the mat on the flat"
Processing:
"the cat sat on " -> stored literally (no prior matches)
"the " -> (16, 4) meaning "go back 16 bytes, copy 4 bytes"
"mat on " -> "m" literal + (10, 6) for "at on "
"the flat" -> (8, 4) for "the " + "flat" literal
The repeated phrases "the ", "on the", "at on" are all replaced with
short back-references instead of storing the bytes again.
The sliding window size determines how far back the algorithm can look for matches. A larger window finds more matches (better compression) but requires more memory and CPU time to search. gzip uses a 32KB window. zstd can use up to 128MB.
The key insight: compression ratio depends entirely on how much repetition exists in the data. JSON with repeated field names ("userId", "timestamp", "status") compresses extremely well because those strings appear thousands of times. Random binary data (encrypted content, already-compressed images) has no repetition, so LZ77 finds no matches and the "compressed" output is actually larger than the input due to encoding overhead.
zstd (developed by Facebook/Meta) has largely replaced gzip as the default choice for new systems. It compresses better and faster at the same CPU budget. The main reason gzip persists is backward compatibility: every HTTP client and server supports Content-Encoding: gzip, while zstd support (Content-Encoding: zstd) is still rolling out. For internal services where you control both ends, use zstd.
Choosing the Right Algorithm for Your Use Case
This is the section that matters most in an interview. Knowing how LZ77 works is impressive, but knowing which algorithm to pick for a specific scenario is what gets you the offer.
The algorithm landscape
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.