Column-oriented storage
How columnar databases store data by column instead of row, why this is 10-100x faster for analytical queries, and how compression and vectorized execution amplify the advantage, with tradeoffs vs row-oriented storage.
The problem
Your analytics team runs a dashboard query against a 500-million-row orders table with 50 columns. The query reads only three columns: country, order_total, and created_at. But the row-oriented database stores each row contiguously on disk, so reading those three columns forces the engine to load all 50 for every matching row.
The I/O math is brutal. 500M rows at 50 columns averaging 20 bytes each means ~500 GB of disk reads. The query actually needs ~30 GB (three columns). The storage layout forces the engine to read the other 470 GB as waste.
The dashboard that should refresh in seconds takes 45 minutes. The data warehouse bill climbs to $50,000/month in compute because every analytical query reads 10-50x more data than necessary. The storage format was designed for transactional workloads where you read entire rows, not for analytics where you read a few columns across billions of rows. This is the problem column-oriented storage solves.
What it is
Column-oriented storage (columnar storage) writes data by column instead of by row on disk. Instead of storing [id=1, name="Alice", age=30, city="NYC"] as one contiguous block per row, it stores all id values together, all name values together, all age values together, and so on.
Think of a filing cabinet. Row storage puts each employee's complete file in one folder, stacked in order. To compute the average salary across 10,000 employees, you open every folder and flip to the salary page. Columnar storage instead has one drawer per attribute: one drawer for salaries, one for names, one for hire dates. Computing the average salary means opening exactly one drawer.
This layout change has three cascading effects: queries read dramatically less data from disk, compression ratios improve by 5-10x because values in the same column share a type, and CPUs can process column vectors using SIMD instructions. I find it easy to underestimate how large each effect is individually, and they multiply together.
How it works
The fundamental difference is the on-disk layout:
Row storage (PostgreSQL, MySQL):
Page 1: [id=1, name="Alice", age=30, city="NYC"]
[id=2, name="Bob", age=25, city="LA"]
Page 2: [id=3, name="Carol", age=35, city="NYC"]
[id=4, name="Dave", age=28, city="SF"]
Column storage (ClickHouse, Redshift, Parquet):
id file: [1, 2, 3, 4, 5, ...]
name file: ["Alice", "Bob", "Carol", "Dave", ...]
age file: [30, 25, 35, 28, ...]
city file: ["NYC", "LA", "NYC", "SF", ...]
When a query touches only a subset of columns, the columnar engine skips every column file it does not need:
The column scan operates on contiguous arrays of same-typed values. Here is simplified pseudocode for a columnar scan with filtering:
function columnar_scan(query):
// Step 1: determine which column files to open
needed_cols = query.referenced_columns() // [city, age]
// Step 2: scan the filter column and build a selection bitmap
city_data = load_column_file("city") // sequential I/O
bitmap = new Bitmap(city_data.length)
for i in range(city_data.length):
if city_data[i] == query.filter_value: // "NYC"
bitmap.set(i)
// Step 3: load the aggregate column at selected positions only
age_data = load_column_file("age")
total = 0
count = 0
for i in bitmap.set_positions():
total += age_data[i]
count += 1
return total / count
For SELECT AVG(age) WHERE city = 'NYC' over a table with 50 columns, row storage reads all 50 columns for every row. Columnar storage reads exactly 2 column files. On 500M rows at 20 bytes/column, that is 500 GB vs 20 GB of I/O: a 25x reduction before any compression.
Interview tip: quantify the I/O savings
When discussing columnar storage in an interview, lead with the concrete I/O ratio. "Analytical queries typically reference 3-5 of 50 columns, so column storage reduces disk reads by 10-20x before compression." Interviewers want to see you quantify, not just describe.
Encoding schemes
Columnar layout unlocks compression ratios that row stores cannot match. All values in a column share the same data type, often have low cardinality, and frequently exhibit sort order or temporal patterns. Three encoding schemes dominate.
Run-length encoding (RLE)
RLE replaces consecutive repeated values with a (value, count) pair. It works best on sorted or clustered columns:
Raw column: [NYC, NYC, NYC, NYC, LA, LA, LA, NYC, NYC]
RLE encoded: [(NYC, 4), (LA, 3), (NYC, 2)]
If the table is sorted by city, a column of 100 million rows with 200 distinct cities compresses to roughly 200 entries. The reduction can be 5 to 6 orders of magnitude for well-sorted low-cardinality data.
Dictionary encoding
Dictionary encoding maps repeated string values to small integer codes:
Dictionary: {NYC β 0, LA β 1, Chicago β 2}
Raw column: [NYC, NYC, NYC, LA, LA, NYC]
Encoded column: [0, 0, 0, 1, 1, 0] (1 byte per value instead of 3-7 bytes)
Low-cardinality string columns (country, status, category) compress dramatically. The dictionary itself is small and fits in CPU cache, so filtering can operate on integer comparison instead of string comparison.
Delta encoding
Delta encoding stores the difference between adjacent values instead of the values themselves:
Timestamps: [1704067200, 1704067260, 1704067320, 1704067380]
Deltas: [1704067200, +60, +60, +60]
For monotonically increasing columns (timestamps, auto-increment IDs), deltas are small integers requiring far fewer bits. Combined with bit-packing (storing each delta in the minimum number of bits), delta encoding achieves 8-16x compression on timestamp columns.
| Encoding | Best for | Typical compression | Example |
|---|---|---|---|
| Run-length (RLE) | Sorted, low-cardinality columns | 100x-1000x on sorted data | city column sorted by city |
| Dictionary | Low-cardinality strings | 3-10x | country, status, category |
| Delta | Monotonic numeric columns | 8-16x | timestamps, auto-increment IDs |
| Bit-packing | Small integer ranges | 2-8x | age, quantity, boolean flags |
Compression depends on sort order
RLE performance collapses if the column is not sorted or clustered by that column's values. A city column in random row order shows no consecutive runs and compresses poorly with RLE. In practice, columnar databases sort data by the most-queried column (often a timestamp or partition key) to maximize compression. I've seen compression ratios drop from 100x to 2x just from changing the sort key.
Analytical databases typically combine all three: dictionary-encode strings, delta-encode timestamps, RLE where consecutive duplicates exist, then apply a general-purpose compressor (LZ4, Zstd) on top. The net result is 5-10x overall compression over row storage.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.