Column-oriented storage
How columnar databases store data by column instead of by row, why this is 10-100x faster for analytical queries, and how compression and vectorized execution amplify the advantage, plus the tradeoffs versus row-oriented storage.
The problem
Your analytics team runs a dashboard query against a 500-million-row orders table with 50 columns. The query reads only three columns: country, order_total, and created_at. But the row-oriented database stores each row contiguously on disk, so reading those three columns forces the engine to load all 50 for every matching row.
The I/O math is brutal. 500M rows at 50 columns averaging 20 bytes each means ~500 GB of disk reads. The query actually needs ~30 GB (three columns). The storage layout forces the engine to read the other 470 GB as waste.
The dashboard that should refresh in seconds takes 45 minutes. The data warehouse bill climbs to $50,000/month in compute because every analytical query reads 10-50x more data than necessary. The storage format was designed for transactional workloads where you read entire rows, not for analytics where you read a few columns across billions of rows. This is the problem column-oriented storage solves.
What it is
Column-oriented storage (columnar storage) writes data by column instead of by row on disk. Instead of storing [id=1, name="Alice", age=30, city="NYC"] as one contiguous block per row, it stores all id values together, all name values together, all age values together, and so on.
Think of a filing cabinet. Row storage puts each employee's complete file in one folder, stacked in order. To compute the average salary across 10,000 employees, you open every folder and flip to the salary page. Columnar storage instead has one drawer per attribute: one drawer for salaries, one for names, one for hire dates. Computing the average salary means opening exactly one drawer.
This layout change has three cascading effects: queries read dramatically less data from disk, compression ratios improve by 5-10x because values in the same column share a type, and CPUs can process column vectors using SIMD instructions. I find it easy to underestimate how large each effect is individually, and they multiply together.
How it works
The fundamental difference is the on-disk layout:
Row storage (PostgreSQL, MySQL):
Page 1: [id=1, name="Alice", age=30, city="NYC"]
        [id=2, name="Bob", age=25, city="LA"]
Page 2: [id=3, name="Carol", age=35, city="NYC"]
        [id=4, name="Dave", age=28, city="SF"]
Column storage (ClickHouse, Redshift, Parquet):
id file: [1, 2, 3, 4, 5, ...]
name file: ["Alice", "Bob", "Carol", "Dave", ...]
age file: [30, 25, 35, 28, ...]
city file: ["NYC", "LA", "NYC", "SF", ...]
When a query touches only a subset of columns, the columnar engine skips every column file it does not need.
The column scan operates on contiguous arrays of same-typed values. Here is simplified pseudocode for a columnar scan with filtering:
function columnar_scan(query):
    // Step 1: determine which column files to open
    needed_cols = query.referenced_columns()    // [city, age]

    // Step 2: scan the filter column and build a selection bitmap
    city_data = load_column_file("city")        // sequential I/O
    bitmap = new Bitmap(city_data.length)
    for i in range(city_data.length):
        if city_data[i] == query.filter_value:  // "NYC"
            bitmap.set(i)

    // Step 3: load the aggregate column at selected positions only
    age_data = load_column_file("age")
    total = 0
    count = 0
    for i in bitmap.set_positions():
        total += age_data[i]
        count += 1
    return total / count
For SELECT AVG(age) WHERE city = 'NYC' over a table with 50 columns, row storage reads all 50 columns for every row. Columnar storage reads exactly 2 column files. On 500M rows at 20 bytes/column, that is 500 GB vs 20 GB of I/O: a 25x reduction before any compression.
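The same scan can be made concrete in a few lines of runnable Python. This is a minimal sketch, not any engine's API: in-memory lists stand in for per-column files, and the `columns` dict and `avg_age_for_city` helper are illustrative names.

```python
# In-memory stand-in for per-column files: each column is a separate list.
columns = {
    "id":   [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "age":  [30, 25, 35, 28],
    "city": ["NYC", "LA", "NYC", "SF"],
}

def avg_age_for_city(columns, city):
    # Touch only the two column "files" the query references;
    # the other columns are never read at all.
    city_col = columns["city"]
    age_col = columns["age"]
    selected = [i for i, c in enumerate(city_col) if c == city]
    if not selected:
        return None
    return sum(age_col[i] for i in selected) / len(selected)

avg_age_for_city(columns, "NYC")  # 32.5
```

The point of the sketch is the access pattern: `name` and `id` are never touched, which is exactly the I/O a row store cannot avoid.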
Interview tip: quantify the I/O savings
When discussing columnar storage in an interview, lead with the concrete I/O ratio. "Analytical queries typically reference 3-5 of 50 columns, so column storage reduces disk reads by 10-20x before compression." Interviewers want to see you quantify, not just describe.
Encoding schemes
Columnar layout unlocks compression ratios that row stores cannot match. All values in a column share the same data type, often have low cardinality, and frequently exhibit sort order or temporal patterns. Three encoding schemes dominate.
Run-length encoding (RLE)
RLE replaces consecutive repeated values with a (value, count) pair. It works best on sorted or clustered columns:
Raw column: [NYC, NYC, NYC, NYC, LA, LA, LA, NYC, NYC]
RLE encoded: [(NYC, 4), (LA, 3), (NYC, 2)]
If the table is sorted by city, a column of 100 million rows with 200 distinct cities compresses to roughly 200 entries. The reduction can be 5 to 6 orders of magnitude for well-sorted low-cardinality data.
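A minimal RLE encoder and decoder in Python shows how clustered data collapses into a handful of pairs (`rle_encode` and `rle_decode` are illustrative helpers, not a library API):

```python
def rle_encode(values):
    # Collapse each run of consecutive equal values into (value, run_length).
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

def rle_decode(pairs):
    # Expand (value, run_length) pairs back into the original column.
    return [v for v, n in pairs for _ in range(n)]

col = ["NYC"] * 4 + ["LA"] * 3 + ["NYC"] * 2
encoded = rle_encode(col)   # [("NYC", 4), ("LA", 3), ("NYC", 2)]
assert rle_decode(encoded) == col
```

Nine values become three pairs here; on the sorted 100-million-row column above, the same logic yields roughly one pair per city.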
Dictionary encoding
Dictionary encoding maps repeated string values to small integer codes:
Dictionary: {NYC → 0, LA → 1, Chicago → 2}
Raw column: [NYC, NYC, NYC, LA, LA, NYC]
Encoded column: [0, 0, 0, 1, 1, 0] (1 byte per value instead of 3-7 bytes)
Low-cardinality string columns (country, status, category) compress dramatically. The dictionary itself is small and fits in CPU cache, so filtering can operate on integer comparison instead of string comparison.
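A sketch of dictionary encoding in Python, including the integer-comparison filter it enables (`dict_encode` is an illustrative helper):

```python
def dict_encode(values):
    # Map each distinct value to a small integer code, in first-seen order.
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return codes, encoded

col = ["NYC", "NYC", "NYC", "LA", "LA", "NYC"]
codes, encoded = dict_encode(col)
# codes == {"NYC": 0, "LA": 1}; encoded == [0, 0, 0, 1, 1, 0]

# Filtering now compares integers instead of strings:
matches = [i for i, c in enumerate(encoded) if c == codes["NYC"]]  # [0, 1, 2, 5]
```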
Delta encoding
Delta encoding stores the difference between adjacent values instead of the values themselves:
Timestamps: [1704067200, 1704067260, 1704067320, 1704067380]
Deltas: [1704067200, +60, +60, +60]
For monotonically increasing columns (timestamps, auto-increment IDs), deltas are small integers requiring far fewer bits. Combined with bit-packing (storing each delta in the minimum number of bits), delta encoding achieves 8-16x compression on timestamp columns.
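Delta encoding fits in a few lines of Python (illustrative `delta_encode`/`delta_decode` helpers; assumes a non-empty column):

```python
def delta_encode(values):
    # Store the first value verbatim, then successive differences.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    # Running sum reconstructs the original values exactly.
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

ts = [1704067200, 1704067260, 1704067320, 1704067380]
deltas = delta_encode(ts)   # [1704067200, 60, 60, 60]
assert delta_decode(deltas) == ts
```

The deltas fit in one byte each where the raw timestamps need four or eight; bit-packing then stores each delta in the minimum width.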
| Encoding | Best for | Typical compression | Example |
|---|---|---|---|
| Run-length (RLE) | Sorted, low-cardinality columns | 100x-1000x on sorted data | city column sorted by city |
| Dictionary | Low-cardinality strings | 3-10x | country, status, category |
| Delta | Monotonic numeric columns | 8-16x | timestamps, auto-increment IDs |
| Bit-packing | Small integer ranges | 2-8x | age, quantity, boolean flags |
Compression depends on sort order
RLE performance collapses if the column is not sorted or clustered by that column's values. A city column in random row order shows no consecutive runs and compresses poorly with RLE. In practice, columnar databases sort data by the most-queried column (often a timestamp or partition key) to maximize compression. I've seen compression ratios drop from 100x to 2x just from changing the sort key.
Analytical databases typically combine all three: dictionary-encode strings, delta-encode timestamps, RLE where consecutive duplicates exist, then apply a general-purpose compressor (LZ4, Zstd) on top. The net result is 5-10x overall compression over row storage.
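The stacking effect is easy to see with the standard library, using zlib as a stand-in for LZ4/Zstd (the column contents and sizes here are illustrative, not a benchmark):

```python
import zlib

# Sorted, low-cardinality column: ideal for dictionary encoding with a
# general-purpose compressor collapsing the long runs on top.
col = ["Chicago"] * 40_000 + ["LA"] * 30_000 + ["NYC"] * 30_000

raw = "".join(col).encode()                       # naive string bytes: 430,000
dictionary = {v: i for i, v in enumerate(dict.fromkeys(col))}
codes = bytes(dictionary[v] for v in col)         # dictionary: 1 byte/value = 100,000
compressed = zlib.compress(codes)                 # runs collapse to almost nothing

print(len(raw), len(codes), len(compressed))
```

Dictionary encoding alone gives 4.3x here; the compressor on top turns the three long runs into a few hundred bytes, because encoded columnar data is exactly the kind of repetitive byte stream general-purpose compressors excel at.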
Vectorized execution
Modern columnar engines do not process one row at a time. They process vectors: arrays of 1,000 to 10,000 values from a single column. This unlocks SIMD (Single Instruction, Multiple Data) CPU instructions that operate on multiple values per clock cycle.
The code difference is stark:
# Row-at-a-time execution (traditional):
for row in table:
    if row.city == 'NYC':
        total += row.age
        count += 1

# Vectorized execution (columnar engine):
city_vec = load_vector(columns['city'], offset, 1024)
age_vec = load_vector(columns['age'], offset, 1024)
mask = simd_eq(city_vec, 'NYC')    # compare all 1024 in parallel
total += simd_sum(age_vec, mask)   # sum only matching values
count += popcount(mask)            # count set bits in mask
SIMD instructions process 8-16 values per cycle on modern CPUs (AVX-512 handles 16 32-bit integers simultaneously). Combined with columnar storage that keeps data in cache-friendly contiguous arrays, vectorized execution turns multi-hour analytical queries into queries that complete in minutes. I think of vectorized execution as the "second multiplier" after I/O reduction: column layout reduces data read from disk, and vectorized execution reduces CPU cycles per value.
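In plain Python, the vector-at-a-time structure looks like the sketch below; a real engine evaluates each batch with SIMD instructions rather than an interpreted loop, but the batching, mask, and popcount steps are the same (`vectorized_avg` and `VECTOR_SIZE` are illustrative names):

```python
VECTOR_SIZE = 1024  # values per batch; engines typically use 1K-10K

def vectorized_avg(city_col, age_col, filter_city):
    # Walk the columns one fixed-size vector at a time.
    total = 0
    count = 0
    for offset in range(0, len(city_col), VECTOR_SIZE):
        city_vec = city_col[offset:offset + VECTOR_SIZE]
        age_vec = age_col[offset:offset + VECTOR_SIZE]
        mask = [c == filter_city for c in city_vec]        # selection mask
        total += sum(a for a, m in zip(age_vec, mask) if m)  # masked sum
        count += sum(mask)                                  # popcount
    return total / count if count else None

cities = ["NYC", "LA", "NYC", "SF"]
ages = [30, 25, 35, 28]
vectorized_avg(cities, ages, "NYC")  # 32.5
```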
Zone maps and min-max indexes
Columnar engines maintain lightweight metadata for each data block (typically 64K to 1M rows): the minimum and maximum value in that block. These are called zone maps (also known as min-max indexes or data skipping indexes).
When a query has a WHERE clause on a column, the engine checks zone maps first. If a block's min is greater than the query's max (or the block's max is less than the query's min), the entire block is skipped without reading any data from disk.
Zone maps are most effective when data is sorted or clustered by the filter column. A created_at column sorted chronologically has tight min/max ranges per block, so time-range queries skip nearly all blocks. A randomly ordered column has overlapping ranges across every block, making zone maps useless.
| Zone map effectiveness | Sorted column | Random column |
|---|---|---|
| Blocks skipped for range query | 95-99% | 0-10% |
| Memory overhead per block | ~16 bytes (min + max) | Same |
| Query speedup | 10-100x | Negligible |
This is why sort key selection is one of the most important configuration decisions in a columnar table. In Redshift, it is the SORTKEY clause. In ClickHouse, it is the ORDER BY in the table definition. Get it wrong and your zone maps are decorative.
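The pruning check is simple enough to sketch end to end. This is an illustrative model with tiny blocks, not any engine's implementation:

```python
BLOCK_ROWS = 4  # real engines use 64K-1M rows per block

def build_zone_map(column):
    # One (min, max) pair per block of BLOCK_ROWS values.
    return [(min(column[i:i + BLOCK_ROWS]), max(column[i:i + BLOCK_ROWS]))
            for i in range(0, len(column), BLOCK_ROWS)]

def blocks_to_scan(zone_map, lo, hi):
    # Keep only blocks whose [min, max] range can overlap [lo, hi];
    # every other block is skipped without touching disk.
    return [b for b, (bmin, bmax) in enumerate(zone_map)
            if not (bmax < lo or bmin > hi)]

# Sorted column: tight per-block ranges, so a range query prunes hard.
created_at = list(range(100, 132))       # 32 rows -> 8 blocks
zm = build_zone_map(created_at)
blocks_to_scan(zm, 110, 113)             # [2, 3]: 6 of 8 blocks skipped
```

Shuffle `created_at` and every block's (min, max) range widens to span nearly the whole domain, so `blocks_to_scan` returns every block and the zone map saves nothing.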
The write path tradeoff
The read performance of columnar storage comes at a cost: writes are fundamentally more expensive than in row stores.
In row storage, inserting one row writes one contiguous block. In columnar storage, inserting one row requires writing to N separate column files. This is why columnar engines batch aggressively: ClickHouse buffers inserts into "parts" (sorted column chunks), each containing thousands of rows before flushing to disk.
My rule of thumb: if your insert pattern is fewer than 1,000 rows per batch, you almost certainly want a row store or a write buffer (Kafka, a staging table) in front of the columnar engine. The engines are not designed for row-at-a-time writes, and fighting that design always ends badly.
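A staging buffer can be sketched as follows; `BatchedColumnarWriter` is a hypothetical class showing the batching pattern, not a real client library:

```python
class BatchedColumnarWriter:
    # Accumulate rows in memory and flush them as one columnar "part"
    # once the batch is large enough, amortizing the per-column writes.
    def __init__(self, column_names, batch_size=1000):
        self.batch_size = batch_size
        self.buffer = {name: [] for name in column_names}
        self.parts = []  # stands in for flushed on-disk column chunks

    def insert(self, row):
        for name, value in row.items():
            self.buffer[name].append(value)
        if len(next(iter(self.buffer.values()))) >= self.batch_size:
            self.flush()

    def flush(self):
        if any(self.buffer.values()):
            # One contiguous write per column for the whole batch,
            # instead of one tiny write per column per row.
            self.parts.append({k: v[:] for k, v in self.buffer.items()})
            for v in self.buffer.values():
                v.clear()

w = BatchedColumnarWriter(["city", "age"], batch_size=2)
w.insert({"city": "NYC", "age": 30})
w.insert({"city": "LA", "age": 25})  # reaches batch_size -> one flush
len(w.parts)  # 1
```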
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| Apache Parquet | Columnar file format for data lakes (Hadoop, Spark, Trino) | Row groups (~128 MB each) with per-column chunks. Dictionary and RLE encoding per column. Min/max statistics per row group enable predicate pushdown. The de facto standard for analytical data at rest. |
| ClickHouse | MergeTree engine stores data in columnar format | Sorted by primary key (defines sort order and zone maps). Aggressive compression (LZ4 + delta + dictionary). Vectorized execution by default. Handles millions of inserts/sec via batched parts. |
| Amazon Redshift | Columnar MPP data warehouse | Per-column encoding chosen at COPY time. Zone maps on every block. SORTKEY columns determine data ordering. Interleaved sort keys for multi-column predicates at the cost of higher VACUUM overhead. |
| Apache Arrow | In-memory columnar format for analytical processing | Zero-copy sharing between processes. Used as the in-memory representation by DuckDB, DataFusion, and Polars. SIMD-friendly memory layout designed for vectorized execution. |
| Google BigQuery | Serverless columnar data warehouse | Capacitor columnar format with automatic encoding selection. Dremel-style nested column shredding for complex types (arrays, structs). Transparent zone map pruning on partitioned tables. |
Limitations and when NOT to use it
- Single-row lookups are slow. Reading one full row requires N column file reads (one per column). A `SELECT * WHERE id = 5` that takes 1 disk seek in a row store takes 50 seeks in a 50-column columnar store.
- Single-row inserts are expensive. Inserting one row means appending to N column files, each requiring its own write. Columnar engines batch inserts into large blocks (thousands to millions of rows) to amortize this cost, adding write latency.
- Updates and deletes are costly. Most columnar stores are append-only. Updates are implemented as "delete old + insert new," and deletes use soft-delete markers merged out during compaction. Real-time OLTP update patterns are fundamentally incompatible.
- Wide SELECT * queries lose the advantage. If a query reads all 50 columns, columnar storage reads the same total data as row storage but with higher overhead from merging columns back into rows.
- Small datasets see no benefit. If the entire table fits in memory (under a few GB), the I/O reduction from columnar layout is irrelevant. Row storage is simpler with lower overhead for small tables.
- Sort key choice is binding. Data sorted by `created_at` has excellent zone map pruning for time queries but poor pruning for `country` filters. You cannot optimize zone maps for every column simultaneously without materialized views or interleaved keys.
Interview cheat sheet
- When the interviewer mentions analytical queries, dashboards, or OLAP, immediately name column-oriented storage. Say "columnar layout reads only the columns the query references, typically 5-10% of total data, reducing I/O by 10-20x."
- When asked why columnar storage compresses better, explain type homogeneity: same-type values in a contiguous array enable RLE, dictionary, and delta encoding that row stores cannot apply because adjacent bytes are different types.
- When asked about the write penalty, state that inserting one row touches N column files instead of one row block. Columnar engines mitigate this by batching thousands of rows per insert operation.
- When comparing OLTP vs OLAP storage, draw the 2x2: row storage is fast for point lookups and single-row writes; columnar is fast for full-column scans and aggregations. Neither is universally better.
- When zone maps come up, explain that sort key selection determines whether range queries skip 99% of blocks or 0%. This is the single most impactful configuration decision in a columnar warehouse.
- When vectorized execution is mentioned, say "columnar engines process data in vectors of 1,000+ values using SIMD instructions, achieving 8-16x CPU throughput over row-at-a-time processing."
- When asked about hybrid systems, name TiDB (row + columnar replicas), SingleStore (rowstore + columnstore), and DuckDB (columnar in-process analytics). These trade storage cost for query flexibility.
- When asked about file formats, distinguish Parquet (at-rest columnar for data lakes), Arrow (in-memory columnar for computation), and ORC (Hive's columnar format). Parquet + Arrow is the most common pairing today.
Quick recap
- Column-oriented storage writes each column contiguously on disk, allowing analytical queries to read only the 3-5 columns they need instead of all 50, reducing I/O by 10-20x.
- Three encoding schemes exploit same-type contiguous data: RLE compresses sorted columns by 100-1000x, dictionary encoding maps strings to integers for 3-10x compression, and delta encoding stores differences for 8-16x compression on ordered numerics.
- Vectorized execution applies SIMD instructions to entire column vectors (1,000+ values at a time), achieving 8-16x CPU throughput over row-at-a-time processing.
- Zone maps (min-max indexes per block) let the engine skip 95-99% of data blocks for range queries on sorted columns, but provide zero benefit on unsorted columns.
- Columnar storage is the wrong choice for OLTP: single-row inserts touch N column files, point lookups require N column reads, and updates need delete-plus-insert. Use row storage for transactional workloads.
- Sort key selection is the single most impactful decision in a columnar database, determining compression ratios, zone map effectiveness, and query performance for the dominant access pattern.
Related concepts
- Databases - Column-oriented storage is one of two fundamental storage layouts for databases. Understanding when to choose OLAP vs OLTP storage is the first decision in any data architecture.
- LSM trees - ClickHouse's MergeTree engine uses LSM-inspired compaction to merge columnar data blocks, combining the write path of LSM trees with columnar read performance.
- B-tree indexes - B-trees power row-oriented point lookups. Understanding B-trees clarifies why row stores win at OLTP while columnar stores win at OLAP.