Hashing and hash functions
Learn what properties make a hash function suitable for databases, consistent hashing rings, and checksums, and why choosing the wrong hash function is a frequent source of unexpected collisions and performance cliffs.
The problem
You inherit a session-storage service. The engineer who built it chose a simple hash function: h(id) = id % 16. Sixteen buckets, each expected to hold an even share of sessions. In staging, with random IDs, the distribution looked fine.
In production, user IDs come from an auto-increment column that was seeded at 16 and increments in steps of 16. Every production ID is 16, 32, 48, 64... Every ID is a multiple of 16. Every ID maps to bucket 0.
At 1 million sessions, every lookup scans all 1 million entries in bucket 0. The hash table has degenerated to a linked list. The function chose a distribution that perfectly matched the data's periodicity, which means it distributed nothing at all.
What a hash function is
A hash function takes an arbitrary input (a string, a number, a byte sequence) and maps it to a fixed-size integer. The output size is constant regardless of input size: fnv1a("hello") and fnv1a(entire_book) both produce the same 32-bit number.
Analogy: A post office sorting machine. Every letter (any size, any sender) gets assigned to one bin (0 through N-1). A good machine spreads letters evenly. A broken machine puts everything in bin zero.
Four properties define a useful hash function:
- Deterministic: The same input always produces the same output.
- Uniform distribution: Outputs spread evenly across the output space. Modulo on periodic inputs is maximally non-uniform.
- Avalanche effect: A one-bit change in input flips roughly half the output bits. Without this, similar keys cluster together.
- Speed: A few nanoseconds per key for non-cryptographic uses.
How it works
The core operation: apply the hash function to the key, then take modulo the bucket count.
bucket = hash(key) % bucket_count
A good hash function distributes keys uniformly so each bucket holds roughly total_keys / bucket_count entries.
FNV-1a (Fowler-Noll-Vo) is one of the simplest correct non-cryptographic hash functions. It processes the input one byte at a time:
# FNV-1a (32-bit)
FNV_PRIME = 16777619
FNV_OFFSET_BASIS = 2166136261
function fnv1a(data: bytes) -> int:
hash = FNV_OFFSET_BASIS
for byte in data:
hash = hash XOR byte # mix the byte into the state
hash = hash * FNV_PRIME # spread the state via multiplication
hash = hash AND 0xFFFFFFFF # keep it 32-bit
return hash
The XOR mixes the byte into the state. The multiplication by a prime spreads the result across all bit positions: one changed byte propagates into every subsequent bit. That is the avalanche effect. I find the simplicity here deceptive. People look at this and think modulo is just as good. Modulo has no avalanche effect at all.
The diagram below shows a four-bucket hash table after inserting four keys, including one collision that forms a chain in bucket 0:
spawnSync d2 ENOENT
Why multiplication beats division for hashing
Division-based hashing (h = key % M) maps clustered or periodic keys into clustered buckets: all multiples of 16 map to bucket 0 with 16 buckets. Multiplication-based hashing (h = floor(M * frac(key * A)), where A is an irrational constant like the golden ratio 0.618...) destroys periodicity because multiplying by an irrational constant distributes even evenly-spaced inputs uniformly across buckets. FNV-1a and MurmurHash both use this principle internally: the multiply by prime step is the multiplication phase that breaks up any periodicity in the input bytes.
Non-cryptographic vs cryptographic hash functions
Not all hash functions serve the same purpose. The key distinction is speed versus security properties.
| Name | Speed | Cryptographic | Primary use case |
|---|---|---|---|
| FNV-1a | ~1 ns/key | No | Simple hash tables, checksums |
| MurmurHash3 | ~1 ns/key | No | Hash tables, Bloom filters, distributed routing |
| xxHash3 | ~0.3 ns/key | No | High-throughput hashing, in-memory tables |
| CityHash | ~0.5 ns/key | No | Hash tables, string maps |
| SHA-256 | ~50 ns/key | Yes | Digital signatures, integrity verification |
| Blake3 | ~5 ns/key | Yes | File integrity, key derivation |
| bcrypt / Argon2 | ~100 ms/attempt (design goal) | Yes (KDF) | Password storage only |
The mistake I see most often is reaching for SHA-256 in a hash table "because it avoids collisions better." SHA-256 is 50x slower than MurmurHash3 and provides no collision-resistance benefit for internal hash tables. The inverse mistake is worse: using MurmurHash3 for password storage. MurmurHash3 is designed to be fast, which means an attacker can test billions of MurmurHash3 candidates per second against a stolen hash database.
The rule: non-crypto hash for performance-sensitive internals (hash tables, Bloom filters, partition routing), crypto hash for integrity checks, and a purpose-built key-derivation function (Argon2 or bcrypt) for anything involving user passwords.
Hash collisions and the birthday paradox
Two different inputs that produce the same hash are a collision. Hash tables handle collisions with chaining (a linked list at the bucket) or open addressing (probe the next empty slot). Either approach degrades lookup performance as collisions accumulate.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.