Column-oriented storage
How columnar databases store data by column instead of by row, why this is 10-100x faster for analytical queries, and how compression and vectorized execution amplify the advantage, plus the tradeoffs versus row-oriented storage.
The problem
Your analytics team runs a dashboard query against a 500-million-row orders table with 50 columns. The query reads only three columns: country, order_total, and created_at. But the row-oriented database stores each row contiguously on disk, so reading those three columns forces the engine to load all 50 for every matching row.
The I/O math is brutal. 500M rows at 50 columns averaging 20 bytes each means ~500 GB of disk reads. The query actually needs ~30 GB (three columns). The storage layout forces the engine to read the other 470 GB as waste.
The dashboard that should refresh in seconds takes 45 minutes. The data warehouse bill climbs to $50,000/month in compute because every analytical query reads 10-50x more data than necessary. The storage format was designed for transactional workloads where you read entire rows, not for analytics where you read a few columns across billions of rows. This is the problem column-oriented storage solves.
What it is
Column-oriented storage (columnar storage) writes data by column instead of by row on disk. Instead of storing [id=1, name="Alice", age=30, city="NYC"] as one contiguous block per row, it stores all id values together, all name values together, all age values together, and so on.
Think of a filing cabinet. Row storage puts each employee's complete file in one folder, stacked in order. To compute the average salary across 10,000 employees, you open every folder and flip to the salary page. Columnar storage instead has one drawer per attribute: one drawer for salaries, one for names, one for hire dates. Computing the average salary means opening exactly one drawer.
This layout change has three cascading effects: queries read dramatically less data from disk, compression ratios improve by 5-10x because values in the same column share a type, and CPUs can process column vectors using SIMD instructions. I find it easy to underestimate how large each effect is individually, and they multiply together.
How it works
The fundamental difference is the on-disk layout:
Row storage (PostgreSQL, MySQL):
Page 1: [id=1, name="Alice", age=30, city="NYC"]
        [id=2, name="Bob", age=25, city="LA"]
Page 2: [id=3, name="Carol", age=35, city="NYC"]
        [id=4, name="Dave", age=28, city="SF"]
Column storage (ClickHouse, Redshift, Parquet):
id file: [1, 2, 3, 4, 5, ...]
name file: ["Alice", "Bob", "Carol", "Dave", ...]
age file: [30, 25, 35, 28, ...]
city file: ["NYC", "LA", "NYC", "SF", ...]
When a query touches only a subset of columns, the columnar engine skips every column file it does not need.
The column scan operates on contiguous arrays of same-typed values. Here is simplified pseudocode for a columnar scan with filtering:
function columnar_scan(query):
    // Step 1: determine which column files to open
    needed_cols = query.referenced_columns()    // [city, age]

    // Step 2: scan the filter column and build a selection bitmap
    city_data = load_column_file("city")        // sequential I/O
    bitmap = new Bitmap(city_data.length)
    for i in range(city_data.length):
        if city_data[i] == query.filter_value:  // "NYC"
            bitmap.set(i)

    // Step 3: load the aggregate column at selected positions only
    age_data = load_column_file("age")
    total = 0
    count = 0
    for i in bitmap.set_positions():
        total += age_data[i]
        count += 1
    return total / count
For SELECT AVG(age) WHERE city = 'NYC' over a table with 50 columns, row storage reads all 50 columns for every row. Columnar storage reads exactly 2 column files. On 500M rows at 20 bytes/column, that is 500 GB vs 20 GB of I/O: a 25x reduction before any compression.
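The same scan can be made concrete in a few lines of runnable Python. This is a minimal sketch, not any engine's API: in-memory lists stand in for per-column files, and the `columns` dict and `avg_age_for_city` helper are illustrative names.

```python
# In-memory stand-in for per-column files: each column is a separate list.
columns = {
    "id":   [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "age":  [30, 25, 35, 28],
    "city": ["NYC", "LA", "NYC", "SF"],
}

def avg_age_for_city(columns, city):
    # Touch only the two column "files" the query references;
    # the other columns are never read at all.
    city_col = columns["city"]
    age_col = columns["age"]
    selected = [i for i, c in enumerate(city_col) if c == city]
    if not selected:
        return None
    return sum(age_col[i] for i in selected) / len(selected)

avg_age_for_city(columns, "NYC")  # 32.5
```

The point of the sketch is the access pattern: `name` and `id` are never touched, which is exactly the I/O a row store cannot avoid.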
Interview tip: quantify the I/O savings
When discussing columnar storage in an interview, lead with the concrete I/O ratio. "Analytical queries typically reference 3-5 of 50 columns, so column storage reduces disk reads by 10-20x before compression." Interviewers want to see you quantify, not just describe.
Encoding schemes
Columnar layout unlocks compression ratios that row stores cannot match. All values in a column share the same data type, often have low cardinality, and frequently exhibit sort order or temporal patterns. Three encoding schemes dominate.
Run-length encoding (RLE)
RLE replaces consecutive repeated values with a (value, count) pair. It works best on sorted or clustered columns:
Raw column: [NYC, NYC, NYC, NYC, LA, LA, LA, NYC, NYC]
RLE encoded: [(NYC, 4), (LA, 3), (NYC, 2)]
If the table is sorted by city, a column of 100 million rows with 200 distinct cities compresses to roughly 200 entries. The reduction can be 5 to 6 orders of magnitude for well-sorted low-cardinality data.
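A minimal RLE encoder and decoder in Python shows how clustered data collapses into a handful of pairs (`rle_encode` and `rle_decode` are illustrative helpers, not a library API):

```python
def rle_encode(values):
    # Collapse each run of consecutive equal values into (value, run_length).
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

def rle_decode(pairs):
    # Expand (value, run_length) pairs back into the original column.
    return [v for v, n in pairs for _ in range(n)]

col = ["NYC"] * 4 + ["LA"] * 3 + ["NYC"] * 2
encoded = rle_encode(col)   # [("NYC", 4), ("LA", 3), ("NYC", 2)]
assert rle_decode(encoded) == col
```

Nine values become three pairs here; on the sorted 100-million-row column above, the same logic yields roughly one pair per city.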
Dictionary encoding
Dictionary encoding maps repeated string values to small integer codes:
Dictionary: {NYC → 0, LA → 1, Chicago → 2}
Raw column: [NYC, NYC, NYC, LA, LA, NYC]
Encoded column: [0, 0, 0, 1, 1, 0] (1 byte per value instead of 3-7 bytes)
Low-cardinality string columns (country, status, category) compress dramatically. The dictionary itself is small and fits in CPU cache, so filtering can operate on integer comparison instead of string comparison.
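A sketch of dictionary encoding in Python, including the integer-comparison filter it enables (`dict_encode` is an illustrative helper):

```python
def dict_encode(values):
    # Map each distinct value to a small integer code, in first-seen order.
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return codes, encoded

col = ["NYC", "NYC", "NYC", "LA", "LA", "NYC"]
codes, encoded = dict_encode(col)
# codes == {"NYC": 0, "LA": 1}; encoded == [0, 0, 0, 1, 1, 0]

# Filtering now compares integers instead of strings:
matches = [i for i, c in enumerate(encoded) if c == codes["NYC"]]  # [0, 1, 2, 5]
```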
Delta encoding
Delta encoding stores the difference between adjacent values instead of the values themselves:
Timestamps: [1704067200, 1704067260, 1704067320, 1704067380]
Deltas: [1704067200, +60, +60, +60]
For monotonically increasing columns (timestamps, auto-increment IDs), deltas are small integers requiring far fewer bits. Combined with bit-packing (storing each delta in the minimum number of bits), delta encoding achieves 8-16x compression on timestamp columns.
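Delta encoding fits in a few lines of Python (illustrative `delta_encode`/`delta_decode` helpers; assumes a non-empty column):

```python
def delta_encode(values):
    # Store the first value verbatim, then successive differences.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    # Running sum reconstructs the original values exactly.
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

ts = [1704067200, 1704067260, 1704067320, 1704067380]
deltas = delta_encode(ts)   # [1704067200, 60, 60, 60]
assert delta_decode(deltas) == ts
```

The deltas fit in one byte each where the raw timestamps need four or eight; bit-packing then stores each delta in the minimum width.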
| Encoding | Best for | Typical compression | Example |
|---|---|---|---|
| Run-length (RLE) | Sorted, low-cardinality columns | 100x-1000x on sorted data | city column sorted by city |
| Dictionary | Low-cardinality strings | 3-10x | country, status, category |
| Delta | Monotonic numeric columns | 8-16x | timestamps, auto-increment IDs |
| Bit-packing | Small integer ranges | 2-8x | age, quantity, boolean flags |
Compression depends on sort order
RLE performance collapses if the column is not sorted or clustered by that column's values. A city column in random row order shows no consecutive runs and compresses poorly with RLE. In practice, columnar databases sort data by the most-queried column (often a timestamp or partition key) to maximize compression. I've seen compression ratios drop from 100x to 2x just from changing the sort key.
Analytical databases typically combine all three: dictionary-encode strings, delta-encode timestamps, RLE where consecutive duplicates exist, then apply a general-purpose compressor (LZ4, Zstd) on top. The net result is 5-10x overall compression over row storage.
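The stacking effect is easy to see with the standard library, using zlib as a stand-in for LZ4/Zstd (the column contents and sizes here are illustrative, not a benchmark):

```python
import zlib

# Sorted, low-cardinality column: ideal for dictionary encoding with a
# general-purpose compressor collapsing the long runs on top.
col = ["Chicago"] * 40_000 + ["LA"] * 30_000 + ["NYC"] * 30_000

raw = "".join(col).encode()                       # naive string bytes: 430,000
dictionary = {v: i for i, v in enumerate(dict.fromkeys(col))}
codes = bytes(dictionary[v] for v in col)         # dictionary: 1 byte/value = 100,000
compressed = zlib.compress(codes)                 # runs collapse to almost nothing

print(len(raw), len(codes), len(compressed))
```

Dictionary encoding alone gives 4.3x here; the compressor on top turns the three long runs into a few hundred bytes, because encoded columnar data is exactly the kind of repetitive byte stream general-purpose compressors excel at.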
Vectorized execution
Modern columnar engines do not process one row at a time. They process vectors: arrays of 1,000 to 10,000 values from a single column. This unlocks SIMD (Single Instruction, Multiple Data) CPU instructions that operate on multiple values per clock cycle.
The code difference is stark:
# Row-at-a-time execution (traditional):
for row in table:
    if row.city == 'NYC':
        total += row.age
        count += 1

# Vectorized execution (columnar engine):
city_vec = load_vector(columns['city'], offset, 1024)
age_vec = load_vector(columns['age'], offset, 1024)
mask = simd_eq(city_vec, 'NYC')    # compare all 1024 in parallel
total += simd_sum(age_vec, mask)   # sum only matching values
count += popcount(mask)            # count set bits in mask
SIMD instructions process 8-16 values per cycle on modern CPUs (AVX-512 handles 16 32-bit integers simultaneously). Combined with columnar storage that keeps data in cache-friendly contiguous arrays, vectorized execution turns multi-hour analytical queries into queries that complete in minutes. I think of vectorized execution as the "second multiplier" after I/O reduction: column layout reduces data read from disk, and vectorized execution reduces CPU cycles per value.
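In plain Python, the vector-at-a-time structure looks like the sketch below; a real engine evaluates each batch with SIMD instructions rather than an interpreted loop, but the batching, mask, and popcount steps are the same (`vectorized_avg` and `VECTOR_SIZE` are illustrative names):

```python
VECTOR_SIZE = 1024  # values per batch; engines typically use 1K-10K

def vectorized_avg(city_col, age_col, filter_city):
    # Walk the columns one fixed-size vector at a time.
    total = 0
    count = 0
    for offset in range(0, len(city_col), VECTOR_SIZE):
        city_vec = city_col[offset:offset + VECTOR_SIZE]
        age_vec = age_col[offset:offset + VECTOR_SIZE]
        mask = [c == filter_city for c in city_vec]        # selection mask
        total += sum(a for a, m in zip(age_vec, mask) if m)  # masked sum
        count += sum(mask)                                  # popcount
    return total / count if count else None

cities = ["NYC", "LA", "NYC", "SF"]
ages = [30, 25, 35, 28]
vectorized_avg(cities, ages, "NYC")  # 32.5
```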
Zone maps and min-max indexes
Columnar engines maintain lightweight metadata for each data block (typically 64K to 1M rows): the minimum and maximum value in that block. These are called zone maps (also known as min-max indexes or data skipping indexes).
When a query has a WHERE clause on a column, the engine checks zone maps first. If a block's min is greater than the query's max (or the block's max is less than the query's min), the entire block is skipped without reading any data from disk.
Zone maps are most effective when data is sorted or clustered by the filter column. A created_at column sorted chronologically has tight min/max ranges per block, so time-range queries skip nearly all blocks. A randomly ordered column has overlapping ranges across every block, making zone maps useless.
| Zone map effectiveness | Sorted column | Random column |
|---|---|---|
| Blocks skipped for range query | 95-99% | 0-10% |
| Memory overhead per block | ~16 bytes (min + max) | Same |
| Query speedup | 10-100x | Negligible |
This is why sort key selection is one of the most important configuration decisions in a columnar table. In Redshift, it is the SORTKEY clause. In ClickHouse, it is the ORDER BY in the table definition. Get it wrong and your zone maps are decorative.
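The pruning check is simple enough to sketch end to end. This is an illustrative model with tiny blocks, not any engine's implementation:

```python
BLOCK_ROWS = 4  # real engines use 64K-1M rows per block

def build_zone_map(column):
    # One (min, max) pair per block of BLOCK_ROWS values.
    return [(min(column[i:i + BLOCK_ROWS]), max(column[i:i + BLOCK_ROWS]))
            for i in range(0, len(column), BLOCK_ROWS)]

def blocks_to_scan(zone_map, lo, hi):
    # Keep only blocks whose [min, max] range can overlap [lo, hi];
    # every other block is skipped without touching disk.
    return [b for b, (bmin, bmax) in enumerate(zone_map)
            if not (bmax < lo or bmin > hi)]

# Sorted column: tight per-block ranges, so a range query prunes hard.
created_at = list(range(100, 132))       # 32 rows -> 8 blocks
zm = build_zone_map(created_at)
blocks_to_scan(zm, 110, 113)             # [2, 3]: 6 of 8 blocks skipped
```

Shuffle `created_at` and every block's (min, max) range widens to span nearly the whole domain, so `blocks_to_scan` returns every block and the zone map saves nothing.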
The write path tradeoff
The read performance of columnar storage comes at a cost: writes are fundamentally more expensive than in row stores.
In row storage, inserting one row writes one contiguous block. In columnar storage, inserting one row requires writing to N separate column files. This is why columnar engines batch aggressively: ClickHouse buffers inserts into "parts" (sorted column chunks), each containing thousands of rows before flushing to disk.
My rule of thumb: if your insert pattern is fewer than 1,000 rows per batch, you almost certainly want a row store or a write buffer (Kafka, a staging table) in front of the columnar engine. The engines are not designed for row-at-a-time writes, and fighting that design always ends badly.
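A staging buffer can be sketched as follows; `BatchedColumnarWriter` is a hypothetical class showing the batching pattern, not a real client library:

```python
class BatchedColumnarWriter:
    # Accumulate rows in memory and flush them as one columnar "part"
    # once the batch is large enough, amortizing the per-column writes.
    def __init__(self, column_names, batch_size=1000):
        self.batch_size = batch_size
        self.buffer = {name: [] for name in column_names}
        self.parts = []  # stands in for flushed on-disk column chunks

    def insert(self, row):
        for name, value in row.items():
            self.buffer[name].append(value)
        if len(next(iter(self.buffer.values()))) >= self.batch_size:
            self.flush()

    def flush(self):
        if any(self.buffer.values()):
            # One contiguous write per column for the whole batch,
            # instead of one tiny write per column per row.
            self.parts.append({k: v[:] for k, v in self.buffer.items()})
            for v in self.buffer.values():
                v.clear()

w = BatchedColumnarWriter(["city", "age"], batch_size=2)
w.insert({"city": "NYC", "age": 30})
w.insert({"city": "LA", "age": 25})  # reaches batch_size -> one flush
len(w.parts)  # 1
```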
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| Apache Parquet | Columnar file format for data lakes (Hadoop, Spark, Trino) | Row groups (~128 MB each) with per-column chunks. Dictionary and RLE encoding per column. Min/max statistics per row group enable predicate pushdown. The de facto standard for analytical data at rest. |
| ClickHouse | MergeTree engine stores data in columnar format | Sorted by primary key (defines sort order and zone maps). Aggressive compression (LZ4 + delta + dictionary). Vectorized execution by default. Handles millions of inserts/sec via batched parts. |
| Amazon Redshift | Columnar MPP data warehouse | Per-column encoding chosen at COPY time. Zone maps on every block. SORTKEY columns determine data ordering. Interleaved sort keys for multi-column predicates at the cost of higher VACUUM overhead. |
| Apache Arrow | In-memory columnar format for analytical processing | Zero-copy sharing between processes. Used as the in-memory representation by DuckDB, DataFusion, and Polars. SIMD-friendly memory layout designed for vectorized execution. |
| Google BigQuery | Serverless columnar data warehouse | Capacitor columnar format with automatic encoding selection. Dremel-style nested column shredding for complex types (arrays, structs). Transparent zone map pruning on partitioned tables. |
Limitations and when NOT to use it
- Single-row lookups are slow. Reading one full row requires N column file reads (one per column). A `SELECT * WHERE id = 5` that takes 1 disk seek in a row store takes 50 seeks in a 50-column columnar store.
- Single-row inserts are expensive. Inserting one row means appending to N column files, each requiring its own write. Columnar engines batch inserts into large blocks (thousands to millions of rows) to amortize this cost, adding write latency.
- Updates and deletes are costly. Most columnar stores are append-only. Updates are implemented as "delete old + insert new," and deletes use soft-delete markers merged out during compaction. Real-time OLTP update patterns are fundamentally incompatible.
- Wide SELECT * queries lose the advantage. If a query reads all 50 columns, columnar storage reads the same total data as row storage but with higher overhead from merging columns back into rows.
- Small datasets see no benefit. If the entire table fits in memory (under a few GB), the I/O reduction from columnar layout is irrelevant. Row storage is simpler with lower overhead for small tables.
- Sort key choice is binding. Data sorted by `created_at` has excellent zone map pruning for time queries but poor pruning for `country` filters. You cannot optimize zone maps for every column simultaneously without materialized views or interleaved keys.
Interview cheat sheet
- When the interviewer mentions analytical queries, dashboards, or OLAP, immediately name column-oriented storage. Say "columnar layout reads only the columns the query references, typically 5-10% of total data, reducing I/O by 10-20x."
- When asked why columnar storage compresses better, explain type homogeneity: same-type values in a contiguous array enable RLE, dictionary, and delta encoding that row stores cannot apply because adjacent bytes are different types.
- When asked about the write penalty, state that inserting one row touches N column files instead of one row block. Columnar engines mitigate this by batching thousands of rows per insert operation.
- When comparing OLTP vs OLAP storage, draw the 2x2: row storage is fast for point lookups and single-row writes; columnar is fast for full-column scans and aggregations. Neither is universally better.
- When zone maps come up, explain that sort key selection determines whether range queries skip 99% of blocks or 0%. This is the single most impactful configuration decision in a columnar warehouse.
- When vectorized execution is mentioned, say "columnar engines process data in vectors of 1,000+ values using SIMD instructions, achieving 8-16x CPU throughput over row-at-a-time processing."
- When asked about hybrid systems, name TiDB (row + columnar replicas), SingleStore (rowstore + columnstore), and DuckDB (columnar in-process analytics). These trade storage cost for query flexibility.
- When asked about file formats, distinguish Parquet (at-rest columnar for data lakes), Arrow (in-memory columnar for computation), and ORC (Hive's columnar format). Parquet + Arrow is the most common pairing today.
Quick recap
- Column-oriented storage writes each column contiguously on disk, allowing analytical queries to read only the 3-5 columns they need instead of all 50, reducing I/O by 10-20x.
- Three encoding schemes exploit same-type contiguous data: RLE compresses sorted columns by 100-1000x, dictionary encoding maps strings to integers for 3-10x compression, and delta encoding stores differences for 8-16x compression on ordered numerics.
- Vectorized execution applies SIMD instructions to entire column vectors (1,000+ values at a time), achieving 8-16x CPU throughput over row-at-a-time processing.
- Zone maps (min-max indexes per block) let the engine skip 95-99% of data blocks for range queries on sorted columns, but provide zero benefit on unsorted columns.
- Columnar storage is the wrong choice for OLTP: single-row inserts touch N column files, point lookups require N column reads, and updates need delete-plus-insert. Use row storage for transactional workloads.
- Sort key selection is the single most impactful decision in a columnar database, determining compression ratios, zone map effectiveness, and query performance for the dominant access pattern.
Related concepts
- Databases - Column-oriented storage is one of two fundamental storage layouts for databases. Understanding when to choose OLAP vs OLTP storage is the first decision in any data architecture.
- LSM trees - ClickHouse's MergeTree engine uses LSM-inspired compaction to merge columnar data blocks, combining the write path of LSM trees with columnar read performance.
- B-tree indexes - B-trees power row-oriented point lookups. Understanding B-trees clarifies why row stores win at OLTP while columnar stores win at OLAP.