Time series storage internals
How time series databases store, compress, and query high-rate timestamped data, including delta-of-delta encoding, Gorilla compression, downsampling, and retention policies that power monitoring and IoT systems.
The problem
Your monitoring system ingests 1 million sensor readings per second from an IoT fleet. You store them in PostgreSQL: INSERT INTO readings (sensor_id, ts, value) VALUES (101, now(), 23.5). At 1M inserts/sec, the B-tree index on (sensor_id, ts) generates massive write amplification, random I/O saturates the disk, and insert throughput collapses to a fraction of what the sensors produce.
The read side is equally painful. An engineer queries 30 days of data for one sensor: SELECT AVG(value) FROM readings WHERE sensor_id = 101 AND ts > now() - interval '30 days'. At one reading per second, that is roughly 2.6 million rows to scan. Even with a proper index, the query takes minutes because the rows are spread across thousands of B-tree pages scattered on disk.
PostgreSQL was not designed for this access pattern. It optimizes for random reads and point updates, not for append-only, high-throughput timestamped writes with range scan reads. The timestamps are monotonically increasing, the writes are always at the current time, and reads are almost always "give me everything between time A and time B." A storage engine built around these three properties can be orders of magnitude faster. This is what time series databases solve.
What it is
A time series database (TSDB) is a storage engine optimized for timestamped, append-mostly data with temporal query patterns. Instead of general-purpose B-trees, it uses specialized compression (delta-of-delta for timestamps, XOR for float values), time-partitioned storage (chunks of hours or days), and purpose-built indexes that exploit the monotonic nature of time.
Think of a library that only collects daily newspapers. A general library files every book by author and title, with a card catalog for random lookups. A newspaper archive instead stacks papers chronologically by date. To find all articles from March 2024, you pull the March shelf. No catalog lookup needed, no random seeking. The temporal ordering is the index.
The three properties that TSDBs exploit are: writes are append-only (always at the current timestamp), write volume is high (thousands to millions of points per second), and reads are almost always time-bounded ranges. Every design decision in a TSDB follows from these three assumptions.
How it works
The core architecture of a TSDB has three layers: an in-memory write buffer for absorbing burst traffic, immutable on-disk chunks for persistence, and a time-based index for fast range lookups.
Pseudocode for the write and read paths:
```
// Write path
function ingest(metric_name, labels, timestamp, value):
    series_id = series_index.get_or_create(metric_name, labels)
    wal.append(series_id, timestamp, value)            // crash safety
    head_chunk[series_id].append(timestamp, value)     // in-memory
    if head_chunk[series_id].age() > CHUNK_DURATION:   // e.g., 2 hours
        flush_to_disk(head_chunk[series_id])           // compress + write
        head_chunk[series_id] = new Chunk()

// Read path: range query
function query_range(metric_name, labels, start, end):
    series_id = series_index.lookup(metric_name, labels)
    chunks = chunk_index.overlapping(series_id, start, end)
    results = []
    for chunk in chunks:
        data = decompress(chunk)
        results.append(data.filter(start, end))
    return merge_sorted(results)
```
Writes always go to the head chunk in memory, which is periodically flushed to an immutable compressed chunk on disk. Reads identify the relevant chunks by time range, decompress only those chunks, and merge the results. This is fundamentally different from B-tree storage where reads and writes operate on the same mutable pages.
Compression: the Gorilla algorithm
Facebook's Gorilla paper (2015) introduced the compression scheme that most TSDBs now use. It exploits two properties of sensor data: timestamps arrive at regular intervals, and consecutive values change slowly.
Delta-of-delta timestamp encoding
Instead of storing raw timestamps, Gorilla stores the difference between consecutive deltas. For regular-interval data, the delta-of-delta is zero, which encodes in a single bit:
```
Raw timestamps:   [1704067200, 1704067260, 1704067320, 1704067380]
Deltas:           [-, +60, +60, +60]   (regular 60s interval)
Delta-of-deltas:  [-,   -,   0,   0]   (pure zeros for regular data!)

Encoding: 0        → same delta as last time (1 bit)
          non-zero → store the actual delta-of-delta (variable bits)
```
In the Gorilla paper's production workload, about 96% of timestamps compressed down to that single bit.
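A minimal Python sketch of the encoder side. This is simplified: the real Gorilla format packs non-zero delta-of-deltas into variable-width bit buckets, and the function name here is illustrative.

```python
def delta_of_delta_encode(timestamps):
    """Split timestamps into a header (first value, first delta) and
    the delta-of-delta stream. Simplified: Gorilla emits a single '0'
    bit when the dod is zero and variable-width buckets otherwise."""
    if len(timestamps) < 2:
        return list(timestamps), []
    header = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = header[1]
    dods = []
    for prev, curr in zip(timestamps[1:], timestamps[2:]):
        delta = curr - prev
        dods.append(delta - prev_delta)  # 0 for a perfectly regular interval
        prev_delta = delta
    return header, dods

# Regular 60-second interval: every delta-of-delta is zero
print(delta_of_delta_encode([1704067200, 1704067260, 1704067320, 1704067380]))
# → ([1704067200, 60], [0, 0])
```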
XOR value encoding
For float values, Gorilla XORs each value with the previous one. Sensor values that change slowly produce XOR results with many leading and trailing zeros:
```
Raw values: [23.5, 23.6, 23.5, 23.7]

XOR with previous:
    bits(23.5) XOR bits(23.6) → mostly zeros, only a few mantissa bits differ
    Store: number of leading zeros + the meaningful bits only
```
The combined result: Gorilla achieves approximately 1.37 bytes per data point compared to 16 bytes raw (8-byte timestamp + 8-byte float64). That is a 12x compression ratio before any general-purpose compression (LZ4, Zstd) on top.
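The XOR step can be demonstrated directly on IEEE 754 bit patterns. A sketch with helper names of my own choosing, not a real library API:

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a float64 as its 64-bit IEEE 754 bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_deltas(values):
    """XOR each value's bit pattern with its predecessor's. Slowly
    changing values share sign, exponent, and high mantissa bits, so
    the result has long leading-zero runs that Gorilla stores as
    (leading-zero count, meaningful bits)."""
    bits = [float_to_bits(v) for v in values]
    return [prev ^ curr for prev, curr in zip(bits, bits[1:])]

for d in xor_deltas([23.5, 23.6, 23.5, 23.7]):
    print(f"{d:064b}")  # note the long runs of leading zeros
```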
Interview tip: name the compression ratio
When time series storage comes up, say "Gorilla-style compression stores regular metrics at ~1.37 bytes per point versus 16 bytes raw, a 12x reduction, by exploiting the regularity of timestamps and the slow-changing nature of sensor values." This shows you understand the mechanism, not just the existence of compression.
Prometheus uses Gorilla-inspired encoding for its chunks. InfluxDB uses a similar delta + XOR scheme in its TSM engine. VictoriaMetrics further optimizes with custom varint encoding for even better compression on high-cardinality data.
Chunking and the write path
TSDBs partition data into time-bounded chunks (also called blocks or shards). Each chunk covers a fixed time window (typically 2 hours in Prometheus, configurable in InfluxDB). This design has three advantages: writes are always appended to the current chunk, old chunks are immutable and compressible, and retention is implemented by deleting entire chunk files.
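Mapping a timestamp to its chunk is plain integer alignment to the window start. A sketch assuming fixed 2-hour windows and second-resolution timestamps:

```python
CHUNK_DURATION = 2 * 60 * 60  # 2-hour windows, timestamps in seconds

def chunk_start(ts: int) -> int:
    """Align a timestamp down to the start of its chunk window."""
    return ts - ts % CHUNK_DURATION

def chunks_for_range(start: int, end: int) -> list[int]:
    """Start times of every chunk overlapping [start, end]."""
    return list(range(chunk_start(start), chunk_start(end) + 1, CHUNK_DURATION))

# A 01:30-03:30 query touches the 00:00 and 02:00 chunks and nothing else
print(chunks_for_range(90 * 60, 210 * 60))  # → [0, 7200]
```

Retention falls out of the same alignment: any chunk whose start time is older than the retention threshold is deleted as a whole file.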
InfluxDB's TSM (Time-Structured Merge Tree) engine follows this pattern with an LSM-inspired twist: flushed chunks undergo background compaction, merging small chunks into larger, better-compressed ones. Each TSM file ends with a block index that maps each series ID to the byte offset and time range of its data blocks. Queries consult this index to skip irrelevant blocks entirely.
WAL size can surprise you
The WAL holds all writes since the last chunk flush. If your chunk duration is 2 hours and your write rate is 1M points/sec, the WAL accumulates ~7.2 billion uncompressed data points before flushing. Size the WAL disk accordingly, and monitor WAL replay time after crashes. I've seen teams discover their WAL holds 30 GB of uncompressed data only after a crash takes 20 minutes to replay.
Downsampling and retention policies
Raw data at 1-second resolution for one year equals 31.5 million points per sensor. Multiply that across 100,000 sensors and you have 3.15 trillion data points. Keeping all historical data at full resolution is both expensive and unnecessary: nobody needs per-second granularity when looking at a 6-month trend.
Downsampling reduces data density progressively as data ages:
| Age of data | Resolution | Reduction factor | Points per sensor per day |
|---|---|---|---|
| 0-24 hours | Raw (1s) | 1x | 86,400 |
| 1-7 days | 1-minute averages | 60x | 1,440 |
| 7 days - 1 year | 5-minute averages | 300x | 288 |
| Over 1 year | 1-hour averages | 3,600x | 24 |
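Under the hood, downsampling is bucketed aggregation over the raw points. A minimal sketch of the average case (production systems also keep min/max/count so aggregates compose correctly across levels):

```python
from collections import defaultdict

def downsample_avg(points, bucket_seconds):
    """Average raw (timestamp, value) points into fixed time buckets --
    the core of what a recording rule or continuous query precomputes."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return sorted((start, sum(vs) / len(vs)) for start, vs in buckets.items())

raw = [(0, 20.0), (30, 22.0), (60, 24.0), (90, 26.0)]
print(downsample_avg(raw, 60))  # → [(0, 21.0), (60, 25.0)]
```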
Continuous queries (InfluxDB) or recording rules (Prometheus) run on a schedule to precompute downsampled data:
```yaml
# Prometheus recording rules: precompute 5-minute aggregates
groups:
  - name: downsampling
    interval: 5m
    rules:
      - record: sensor:temperature:avg5m
        expr: avg_over_time(sensor_temperature[5m])
      - record: sensor:temperature:max5m
        expr: max_over_time(sensor_temperature[5m])
```
Retention policies automatically delete data older than a threshold. In InfluxDB, this is a drop of entire shard files covering the expired time range, which is nearly instantaneous because shards are self-contained. No row-level deletion needed.
My recommendation for interviews: always mention that downsampling is not just about saving storage. Precomputed aggregates make dashboards over long time ranges load in milliseconds instead of scanning billions of raw points. The storage saving is a bonus; the query speed improvement is the primary value.
Labels, series cardinality, and the inverted index
Prometheus and most modern TSDBs identify each time series by a metric name plus a set of key-value labels:
```
temperature{sensor="101", room="kitchen", building="HQ"}  23.5  @1704067200
temperature{sensor="102", room="lobby"}                   21.0  @1704067200
```
The unique combination of metric name + label set = one series. The TSDB maintains an inverted index mapping each label value to the set of series IDs that contain it. A query like temperature{building="HQ"} looks up "building=HQ" in the inverted index, retrieves matching series IDs, then reads only those series' chunks.
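A toy version of such an inverted index, showing how label matchers become set intersections (simplified: real implementations keep sorted posting lists, intersect them lazily, and deduplicate existing series, which this sketch omits):

```python
from collections import defaultdict

class SeriesIndex:
    """Toy inverted label index: (label, value) -> set of series IDs."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.next_id = 0

    def get_or_create(self, metric, labels):
        """Register a new series and index it under every label pair.
        (Dedup of already-known series is omitted for brevity.)"""
        sid = self.next_id
        self.next_id += 1
        self.postings[("__name__", metric)].add(sid)
        for key, value in labels.items():
            self.postings[(key, value)].add(sid)
        return sid

    def lookup(self, metric, **matchers):
        """Intersect the posting sets of the metric name and each matcher."""
        result = set(self.postings[("__name__", metric)])
        for key, value in matchers.items():
            result &= self.postings[(key, value)]
        return result

idx = SeriesIndex()
idx.get_or_create("temperature", {"sensor": "101", "building": "HQ"})
idx.get_or_create("temperature", {"sensor": "102", "building": "Annex"})
print(idx.lookup("temperature", building="HQ"))  # → {0}
```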
This is fast when cardinality is bounded. The danger is cardinality explosion: using high-cardinality values as labels.
```
# Safe: bounded cardinality (thousands of series)
temperature{sensor="101"}            # one series per physical sensor

# Dangerous: unbounded cardinality (millions of series)
http_requests{request_id="abc123"}   # unique label per request →
                                     # millions of new series per second
```
High cardinality violates TSDB assumptions at every layer. The inverted index grows linearly with series count and must fit in memory. Gorilla compression resets at series boundaries, so millions of short-lived series compress poorly. Memory usage scales with active series count because each series has a head chunk in memory.
Prometheus has a practical ceiling around 5-10 million active series before memory usage and compaction times become unmanageable. I've seen teams blow through this limit by adding a pod_name label in Kubernetes environments where pods are ephemeral, creating millions of short-lived series that never get enough data points to compress well.
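The explosion is multiplicative: the total series count is roughly the product of the distinct-value counts of each label. A back-of-the-envelope calculation with hypothetical fleet numbers (not figures from this text):

```python
# Hypothetical counts of distinct values per label, for illustration only
distinct_values = {"cluster": 5, "namespace": 40, "pod": 500, "metric": 200}

total_series = 1
for count in distinct_values.values():
    total_series *= count  # every label multiplies the series count

print(f"{total_series:,} active series")  # → 20,000,000
```

Adding one more ephemeral label (a pod restart, a request ID) multiplies this again, which is why a single bad label can push an instance past its ceiling.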
Time-based indexing
TSDBs use the monotonic nature of time as a primary index dimension. Instead of a general-purpose B-tree, the index structure is simpler and more efficient.
For a query spanning 01:30 to 03:30, the index identifies that chunks 00:00-02:00 and 02:00-04:00 overlap the range. Only those two chunks are decompressed and scanned. The 04:00-06:00 chunk is skipped entirely based on its time range metadata.
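The chunk-skipping step reduces to an interval-overlap test against each chunk's time-range metadata. A sketch of the 01:30-03:30 query above, with timestamps in seconds and a dict standing in for real chunk metadata:

```python
def overlapping(chunks, start, end):
    """Keep chunks whose [min_time, max_time] intersects [start, end]."""
    return [c for c in chunks if c["min_time"] <= end and c["max_time"] >= start]

chunks = [
    {"min_time": 0,     "max_time": 7200},    # 00:00-02:00
    {"min_time": 7200,  "max_time": 14400},   # 02:00-04:00
    {"min_time": 14400, "max_time": 21600},   # 04:00-06:00
]
hits = overlapping(chunks, 90 * 60, 210 * 60)  # the 01:30-03:30 query
print(len(hits))  # → 2: the 04:00-06:00 chunk is never decompressed
```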
This is fundamentally simpler than a B-tree because the key space (time) is ordered and append-only. There are no random insertions into the middle of the index, no page splits, and no rebalancing. The entire index update on write is "append to head chunk." The per-series chunk list grows linearly and is sorted by time implicitly.
Prometheus's TSDB implementation adds tombstones for deleted ranges. When you call the delete API, Prometheus writes a tombstone record that marks a time range as deleted for a series. The actual data is only removed during the next compaction cycle. This means deletes are fast (just write a marker) but disk space is not reclaimed immediately.
The combination of time-partitioned chunks and inverted label indexes is what makes TSDBs fundamentally faster than a general-purpose database for time-range queries. A query over 24 hours of data for 10 series touches exactly 12 chunk files (24h / 2h per chunk) and reads only those 10 series within each chunk. In PostgreSQL, the same query would walk a B-tree index, follow pointers to heap pages scattered across disk, and reassemble rows from pages shared with unrelated data.
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| Prometheus | Pull-based monitoring for Kubernetes and cloud-native systems | Gorilla-inspired chunk encoding. Local disk storage with 2-hour chunk blocks. Inverted index for label queries. Practical limit ~10M active series per instance. |
| InfluxDB | IoT and infrastructure monitoring | TSM engine (LSM-inspired with columnar compression). Tag-based series model. Continuous queries for downsampling. Flux query language. |
| TimescaleDB | PostgreSQL extension for time series | Hypertables partition data by time into chunks. Leverages PostgreSQL's B-tree indexes within each chunk. Full SQL support including JOINs. Best choice when you need relational features with time series performance. |
| VictoriaMetrics | Long-term storage for Prometheus metrics | Custom compression (better than Gorilla for high cardinality). Supports 100M+ active series. Global deduplication. Often used as remote storage backend for Prometheus. |
| Apache Druid | Real-time analytics on event data | Column-oriented segments partitioned by time. Roll-up (pre-aggregation) at ingestion. Sub-second queries on billions of rows. Hybrid between TSDB and OLAP. |
Limitations and when NOT to use it
- Ad-hoc joins across time series and relational data are painful. TSDBs optimize for single-series or label-filtered queries, not for joining sensor readings with a users table. If you need relational joins, use TimescaleDB (PostgreSQL extension) or export to an analytical database.
- Retroactive writes (backfill) are slow or unsupported. Most TSDBs assume writes arrive in time order. Writing data with timestamps hours or days in the past may land in already-compacted chunks, requiring re-opening and recompression. Prometheus explicitly rejects out-of-order samples by default.
- Cardinality explosion can bring the system down. Using unbounded label values (request IDs, user IDs, pod names in ephemeral environments) creates millions of short-lived series that blow through memory limits and crash the TSDB.
- Complex queries beyond time-range aggregation perform poorly. Window functions, subqueries, and multi-step transformations are limited in PromQL and Flux. If your query workload is more analytical than monitoring, a columnar OLAP database is a better fit.
- Small-scale use cases gain nothing. If you have fewer than 10,000 metrics at 1-minute intervals, PostgreSQL with a timestamp index handles the load fine. TSDBs add operational complexity (retention policies, downsampling rules, cardinality monitoring) that is not justified at low volume.
Interview cheat sheet
- When the interviewer mentions metrics, monitoring, IoT, or sensor data, immediately name time series storage. Say "TSDBs exploit three properties: append-only writes, high throughput, and temporal read patterns."
- When asked about compression, explain Gorilla's delta-of-delta encoding for timestamps and XOR encoding for values. State the concrete ratio: "~1.37 bytes per point versus 16 bytes raw, roughly 12x compression."
- When asked about the write path, describe the WAL + head chunk + flush pattern. "Writes append to an in-memory head chunk backed by a WAL. When the chunk window expires (typically 2 hours), it flushes to an immutable compressed file on disk."
- When cardinality comes up, flag it as the number one operational risk. "Every unique label combination creates a new series. Unbounded labels like request IDs create cardinality explosions that consume all available memory."
- When asked about querying old data, explain downsampling. "Retention policies keep raw data for hours or days, then progressively downsample to 1-minute, 5-minute, and 1-hour averages. This makes year-scale dashboard queries return in milliseconds."
- When comparing TSDBs to general-purpose databases, state the tradeoff directly. "TSDBs trade query flexibility (no joins, limited ad-hoc queries) for orders-of-magnitude better write throughput and compression on timestamped data."
- When Prometheus vs InfluxDB comes up, name the key difference: Prometheus uses a pull model (scrapes targets) with PromQL; InfluxDB uses a push model (agents push data) with Flux/SQL. Both use Gorilla-style compression internally.
- When asked about scaling beyond a single node, mention that most TSDBs shard by series (consistent hashing on series ID). Thanos and Cortex add multi-tenant, horizontally-scalable query layers on top of Prometheus.
Quick recap
- Time series databases exploit three data properties: append-only writes, high write throughput, and temporal read patterns, enabling specialized compression and indexing that general-purpose databases cannot match.
- Gorilla compression (delta-of-delta for timestamps, XOR for values) achieves ~1.37 bytes per data point versus 16 bytes raw, a 12x compression ratio, by exploiting the regularity of timestamps and slow-changing nature of sensor values.
- Time-partitioned chunks (2-hour blocks written from an in-memory head chunk backed by a WAL) give TSDBs both high write throughput (sequential appends only) and efficient reads (skip irrelevant chunks by time range).
- Downsampling with retention policies keeps recent data at full resolution and progressively coarser resolution for older data, making year-scale queries return in milliseconds instead of scanning billions of raw points.
- Cardinality explosion is the primary operational failure mode: unbounded label values (request IDs, ephemeral pod names) create millions of short-lived series that exhaust memory and destroy index performance.
- TSDBs trade query flexibility for write performance: no joins, limited ad-hoc queries, append-only writes. If you need relational features with time series performance, use TimescaleDB; if you need analytics over event data, consider a columnar OLAP engine.
Related concepts
- Databases - Time series databases are a specialized category within the broader database landscape. Understanding when to choose a TSDB over a general-purpose database is a core system design decision.
- LSM trees - InfluxDB's TSM engine and Prometheus's chunk storage are both inspired by LSM-tree architecture: WAL-backed memory buffer, immutable flushed files, and background compaction.
- Column-oriented storage - Apache Druid and some TSDB backends use columnar storage within time-partitioned chunks, combining time-series chunking with columnar compression for analytical queries over event data.