Skip lists
Learn how skip lists achieve O(log N) search, insert, and delete without rebalancing by layering probabilistic shortcut lanes over a linked list, and why Redis uses them for sorted sets.
The problem
You need a data structure that stores scores and names together, supports O(log N) insert and delete, and answers "give me all names with scores between 100 and 500" in O(log N + k) time. A sorted array gives you O(log N) search but O(N) insert. A linked list gives you O(1) insert at a known position but O(N) search. A balanced binary search tree (AVL or red-black) gives you O(log N) for all three, but it requires rotations on every insert and delete to maintain balance. Those rotations are an implementation challenge: dozens of cases, pointer surgery, easy to get wrong.
Redis needed a sorted set structure when building ZADD/ZRANGE. The balanced BST option works, but the rotation logic is complex enough that a bug means silent data corruption. Rotations also touch multiple nodes, hurting cache locality under concurrent access.
This is the problem skip lists solve: logarithmic performance without rebalancing, and a dramatically simpler implementation compared to any balanced tree.
What a skip list is
A skip list is a probabilistic data structure built by layering multiple linked lists on top of each other. The bottom layer (level 0) is a complete sorted linked list containing every element. Each element is promoted to level 1 with probability 0.5, to level 2 with probability 0.25, and so on. Higher levels contain progressively fewer nodes and act as "express lanes" that let searches skip over many elements at once.
Analogy: Imagine a highway system with multiple parallel lanes. The rightmost lane is a local road with every exit. The lane to its left skips some exits. The leftmost lane is an express highway with only major exits. To reach a destination, you start in the express lane and move to slower lanes as you get close. A skip list is exactly this structure, built from coin flips.
How it works
The structure
Search
To find key 60:
- Start at the top level (level 2). Head points to 1, then 50, then 100. 50 is less than 60 but 100 is greater. Drop to level 1.
- At level 1, starting from 50. Next is 75. 75 is greater than 60. Drop to level 0.
- At level 0, starting from 50. Next is 60. Found it.
Total comparisons: 6 instead of 10 (a full sequential scan of all ten elements between head and 60).
function search(list, target):
node = list.head
for level from MAX_LEVEL down to 0:
while node.forward[level] != null
and node.forward[level].key < target:
node = node.forward[level]
node = node.forward[0]
if node != null and node.key == target:
return node.value
return NOT_FOUND
Insert
To insert key 55:
- Run the search procedure, recording the last node visited at each level before each drop. These become the "update" nodes.
- Flip coins to determine the height of the new node. Each flip is heads with probability 0.5.
- Splice the new node in at every level up to its height, updating forward pointers.
function insert(list, key, value):
update = array of size MAX_LEVEL
node = list.head
for level from MAX_LEVEL down to 0:
while node.forward[level] != null
and node.forward[level].key < key:
node = node.forward[level]
update[level] = node
new_height = random_level() # flip coins until tails
new_node = Node(key, value, new_height)
for level from 0 to new_height:
new_node.forward[level] = update[level].forward[level]
update[level].forward[level] = new_node
function random_level():
level = 0
while random() < 0.5 and level < MAX_LEVEL:
level += 1
return level
Delete follows the same pattern: collect the update array via search, then unlink the target node from every level it appears in by forwarding the update nodes to the target's forward pointers.
Why Redis uses skip lists for sorted sets
Redis sorted sets (ZSET) need to support five operations efficiently: ZADD (insert), ZREM (delete), ZSCORE (point lookup), ZRANGEBYSCORE (range scan), and ZRANK (rank query: "what position is this element?").
A balanced BST supports the first four in O(log N). Rank queries require augmenting every node with its subtree size and updating that count on every rotation -- doable in theory but adds significant complexity to every structural change. A skip list supports rank queries natively by storing a "span" value at every level that records how many level-0 nodes each forward pointer skips over. ZRANK is then: sum the spans along the search path from head to the target.
Redis chose the skip list because:
- The forward-span trick makes ZRANK O(log N) with no extra overhead on insert or delete.
- No rotations means the implementation is roughly 200 lines of C versus roughly 800 for a red-black tree.
- Range scans at level 0 are simple pointer following with no tree traversal or stack required.
No rotations means no multi-node locking on structural changes
Red-black tree rotations touch two to three nodes simultaneously. In a concurrent setting, that requires locking multiple nodes at once or a complex lock-free rotation algorithm. Skip list inserts and deletes update only the new node and its update-array predecessors, each with a single pointer swap. This makes fine-grained locking far simpler. Java's ConcurrentSkipListMap uses exactly this property to implement a non-blocking sorted map. Redis, being single-threaded, benefits from simpler code and fewer bugs rather than lock performance.
Redis actually pairs a skip list with a hash table for ZSET. The hash table handles O(1) ZSCORE lookups; the skip list handles ordering and ranges.
The span values stored on each forward pointer are what make ZRANK a single-pass operation. Each pointer records how many level-0 nodes it skips over. ZRANK sums those spans along the search path from head to the target node, with no separate data structure needed.
spawnSync d2 ENOENT
ZRANK for node 75: at L1, head-to-20 span=2, 20-to-50 span=2, 50-to-75 span=2. Total rank = 6 (0-indexed). One search path, one pass, no extra traversal.
Memory layout and level 0 scan
Level 0 is a complete sorted linked list. Once you have located the start of a range via the O(log N) search, a forward scan is nothing more than following .next pointers from that position. There is no tree traversal, no stack state, and no backtracking.
For small ranges (50-100 elements), this is reasonably cache-friendly. Each node is typically 64-128 bytes. Accessing 50 nodes sequentially from level 0 touches a manageable number of cache lines.
For large result sets (thousands of elements), linked list pointer chasing produces one cache miss per node because nodes are heap-allocated at scattered addresses. A B-tree's leaf pages are contiguous 8KB or 16KB blocks; skip lists have no equivalent. This is why databases with large range scan requirements prefer B-trees, and skip lists are used for in-memory structures (Redis, LevelDB MemTable) rather than on-disk indexes.
Worst-case and expected performance
| Operation | Skip list (expected) | Skip list (worst case) | AVL / Red-Black | B-tree |
|---|---|---|---|---|
| Search | O(log N) | O(N) | O(log N) | O(log N) |
| Insert | O(log N) | O(N) | O(log N) + rotation | O(log N) |
| Delete | O(log N) | O(N) | O(log N) + rotation | O(log N) |
| Range scan | O(log N + k) | O(N + k) | O(log N + k) | O(log N + k) |
| Rank query | O(log N) | O(N) | O(log N) with augmentation | Not native |
| Implementation complexity | Low | Low | High | High |
The O(N) worst case occurs if all coin flips are heads (every node reaches max level and the express lanes are useless) or all tails (every node stays at level 0 only). In practice, with N = 1,000,000 and p = 0.5, the probability of significant degeneration is vanishingly small. Redis caps the maximum level at 32, which is sufficient for 2^32 elements.
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| Redis ZSET | Skip list plus hash table backing sorted sets | ZRANK is O(log N) via span tracking; range scans via level-0 pointer following |
| LevelDB / RocksDB | MemTable stores recent writes in a skip list | On flush to disk, the skip list iterates level-0 in order to produce a sorted SSTable |
| Apache Lucene | In-memory posting list structures during index merges | Skip lists used for fast intersection and merging of sorted document ID lists |
| CockroachDB (Pebble) | In-memory write buffer | Pebble (RocksDB-derived) uses a skip list-based MemTable for the write buffer |
Skip list worst case is O(N), not O(log N) guaranteed
Unlike red-black trees, skip lists have no rebalancing guarantee. O(log N) is probabilistic. With MAX_LEVEL set too high relative to the actual data size, searches spend many iterations at empty express lanes before reaching meaningful nodes. With a biased random generator, most nodes stay at level 0 and searches degrade to a linear scan. Always verify the empirical distribution of your random_level() function: roughly N/2 nodes at level 0, N/4 at level 1, and so on. Redis caps at level 32, sufficient for up to 2^32 elements with a fair coin.
Limitations and when NOT to use it
- Not cache-friendly for large range scans. Linked list traversal at level 0 produces one cache miss per node hop. B-tree leaf pages are contiguous in memory; skip list nodes are heap-allocated and scattered. For result sets of thousands of elements, B-trees are meaningfully faster.
- Higher memory per element. Each node stores between 1 and MAX_LEVEL forward pointers plus span values. At p = 0.5, the expected number of pointers per node is 2. This is comparable to a BST node's 2 child pointers, but with larger allocation overhead per node in many allocators.
- O(N) worst case. Unlikely in practice, but it exists and is theoretically unbounded. A balanced BST guarantees O(log N) worst case. For hard real-time systems where predictability matters more than average performance, use a red-black tree.
- Not suitable for persistent on-disk structures. Skip lists must be rebuilt from scratch on startup (LevelDB flushes its MemTable to an on-disk SSTable). Random pointer traversal on disk is expensive. B-trees are designed around disk block alignment; skip lists are not.
- Non-deterministic structure. Two skip lists built from identical insertions in different orders may have different shapes. This makes debugging harder than a deterministic BST and complicates reproducibility in tests.
Skip list vs. alternatives
Interview cheat sheet
- When asked why Redis uses skip lists for ZADD/ZRANGE, say: no rotations needed, rank queries are O(log N) via forward span tracking, range scans are O(log N + k) via level-0 pointer following, and the implementation is far simpler than a red-black tree.
- When asked about time complexity, say: O(log N) expected for search, insert, delete, and rank. O(log N + k) for range queries. O(N) worst case is theoretically possible but vanishingly unlikely with large N and a fair random number generator.
- When asked how search works, say: start at the highest level, move right while the next key is less than the target, then drop one level and repeat. The same traversal is used for insert and delete to collect the update array.
- When asked about the coin flip, say: each node is promoted to level L+1 with probability p (typically 0.5 or 0.25). With p = 0.5, the expected number of forward pointers per node is 2. With p = 0.25, expected pointers per node is 1.33, using less memory at the cost of slightly slower search.
- When asked how rank queries work, say: each forward pointer at every level stores a span value equal to the number of level-0 nodes it skips. ZRANK sums the spans along the search path from the head to the target node. No extra data structure is required.
- When asked to compare skip lists vs B-trees, say: skip lists are better for in-memory sorted data with rank queries; B-trees are better for on-disk storage because they align to disk block boundaries and provide better cache locality for large range scans.
- When asked about the worst case, say: O(N) is possible if the random number generator is degenerate, but Redis caps levels at 32, making degeneration of any dataset with fewer than 2^32 elements essentially impossible with a fair coin.
- When asked what a delete looks like, say: identical to insert. Run the search, collect the update array (last node at each level before the target), then unlink the target by pointing the update nodes to the target's forward pointers at each level. O(log N) expected.
Quick recap
- A skip list is a sorted linked list with probabilistic "express lanes" that allow O(log N) average-case search, insert, and delete without any rebalancing.
- Each node is promoted to higher levels by a coin flip with probability p (typically 0.5). Higher levels skip over more nodes and reduce the number of comparisons in a search.
- Search traverses from the highest level down: move right while the next key is smaller than the target, then drop one level and repeat until reaching level 0.
- Redis uses skip lists for sorted sets because forward-span tracking gives O(log N) ZRANK natively, range scans are simple level-0 pointer follows, and there are no rotations to implement or maintain.
- The worst case is O(N) if the random generator degenerates, but Redis caps levels at 32, and degeneration of any dataset with fewer than 2^32 elements is vanishingly rare with a fair coin.
- Skip lists excel for in-memory sorted data with rank queries and range scans; B-trees are better for on-disk storage because they align to disk block boundaries and provide better cache locality for large sequential reads.
Related concepts
- B-tree indexes covers the on-disk sorted index alternative, explaining why B-trees are preferred for persistent storage and how they handle page splits and fragmentation.
- Redis data structures covers how Redis pairs skip lists with hash tables to implement sorted sets, and where other data structures (hashes, lists, sets) fit.
- Databases provides the broader context for when ordered indexes matter for query planning and range scan performance.