Hashing and hash functions
Learn what properties make a hash function suitable for databases, consistent hashing rings, and checksums, and why choosing the wrong hash function is a frequent source of unexpected collisions and performance cliffs.
The problem
You inherit a session-storage service. The engineer who built it chose a simple hash function: h(id) = id % 16. Sixteen buckets, each expected to hold an even share of sessions. In staging, with random IDs, the distribution looked fine.
In production, user IDs come from an auto-increment column that was seeded at 16 and increments in steps of 16. Every production ID is 16, 32, 48, 64... Every ID is a multiple of 16. Every ID maps to bucket 0.
At 1 million sessions, every lookup scans all 1 million entries in bucket 0. The hash table has degenerated to a linked list. The function chose a distribution that perfectly matched the data's periodicity, which means it distributed nothing at all.
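The failure is easy to reproduce. This sketch (hypothetical IDs, 16 buckets) counts how many buckets actually receive keys under h(id) = id % 16:

```python
from collections import Counter

def bucket_counts(ids, num_buckets):
    """Count how many IDs land in each bucket under h(id) = id % num_buckets."""
    return Counter(i % num_buckets for i in ids)

# Staging: arbitrary non-periodic IDs spread across all 16 buckets.
staging_ids = [i * 7919 + 13 for i in range(1000)]
# Production: auto-increment seeded at 16, step 16 -- every ID is a multiple of 16.
production_ids = [16 * i for i in range(1, 1001)]

print(len(bucket_counts(staging_ids, 16)))      # 16 buckets in use
print(len(bucket_counts(production_ids, 16)))   # 1 bucket in use: all keys in bucket 0
```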
What a hash function is
A hash function takes an arbitrary input (a string, a number, a byte sequence) and maps it to a fixed-size integer. The output size is constant regardless of input size: fnv1a("hello") and fnv1a(entire_book) each produce a 32-bit number.
Analogy: A post office sorting machine. Every letter (any size, any sender) gets assigned to one bin (0 through N-1). A good machine spreads letters evenly. A broken machine puts everything in bin zero.
Four properties define a useful hash function:
- Deterministic: The same input always produces the same output.
- Uniform distribution: Outputs spread evenly across the output space. Modulo on periodic inputs is maximally non-uniform.
- Avalanche effect: A one-bit change in input flips roughly half the output bits. Without this, similar keys cluster together.
- Speed: A few nanoseconds per key for non-cryptographic uses.
How it works
The core operation: apply the hash function to the key, then take modulo the bucket count.
bucket = hash(key) % bucket_count
A good hash function distributes keys uniformly so each bucket holds roughly total_keys / bucket_count entries.
FNV-1a (Fowler-Noll-Vo) is one of the simplest correct non-cryptographic hash functions. It processes the input one byte at a time:
```python
# FNV-1a (32-bit)
FNV_PRIME = 16777619
FNV_OFFSET_BASIS = 2166136261

def fnv1a(data: bytes) -> int:
    h = FNV_OFFSET_BASIS
    for byte in data:
        h = h ^ byte                      # mix the byte into the state
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # spread via multiplication, keep it 32-bit
    return h
```
The XOR mixes the byte into the state. The multiplication by a prime spreads the result across all bit positions: one changed byte propagates into every subsequent bit. That is the avalanche effect. I find the simplicity here deceptive. People look at this and think modulo is just as good. Modulo has no avalanche effect at all.
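A quick way to observe avalanche behavior is to hash two inputs that differ in a single bit and count how many output bits differ. This sketch is a direct Python port of the function above (the exact bit count varies by input):

```python
def fnv1a(data: bytes) -> int:
    """32-bit FNV-1a, ported directly from the pseudocode above."""
    h = 2166136261                       # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h

a = fnv1a(b"hello")
b = fnv1a(b"helln")                      # last byte differs by a single bit
diff_bits = bin(a ^ b).count("1")        # number of output bits that changed
print(f"{a:08x} vs {b:08x}: {diff_bits} of 32 bits differ")
```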
Picture a four-bucket hash table after inserting four keys: two keys collide and form a two-entry chain in bucket 0, while the other two keys occupy buckets of their own.
Why multiplication beats division for hashing
Division-based hashing (h = key % M) maps clustered or periodic keys into clustered buckets: with 16 buckets, all multiples of 16 map to bucket 0. Multiplication-based hashing (h = floor(M * frac(key * A)), where A is an irrational constant such as the golden-ratio conjugate 0.6180...) destroys periodicity: successive multiples of any step land at well-spread fractional positions, so even evenly spaced inputs fill the buckets roughly uniformly. FNV-1a and MurmurHash apply the same principle internally: their multiply steps are what break up periodicity in the input bytes.
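A sketch of the contrast on the periodic production IDs from earlier, using the golden-ratio constant for the multiplication method (bucket counts here are illustrative, not a benchmark):

```python
import math

GOLDEN = (math.sqrt(5) - 1) / 2   # ~0.6180339887, the golden-ratio conjugate

def mult_bucket(key: int, m: int) -> int:
    """Multiplication method: floor(m * frac(key * A))."""
    frac = (key * GOLDEN) % 1.0
    return int(m * frac)

keys = [16 * i for i in range(1, 101)]          # periodic IDs: 16, 32, 48, ...
div_buckets = {k % 16 for k in keys}            # division method
mul_buckets = {mult_bucket(k, 16) for k in keys}

print(sorted(div_buckets))   # [0] -- every key collides
print(len(mul_buckets))      # most of the 16 buckets get used
```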
Non-cryptographic vs cryptographic hash functions
Not all hash functions serve the same purpose. The key distinction is speed versus security properties.
| Name | Speed | Cryptographic | Primary use case |
|---|---|---|---|
| FNV-1a | ~1 ns/key | No | Simple hash tables, checksums |
| MurmurHash3 | ~1 ns/key | No | Hash tables, Bloom filters, distributed routing |
| xxHash3 | ~0.3 ns/key | No | High-throughput hashing, in-memory tables |
| CityHash | ~0.5 ns/key | No | Hash tables, string maps |
| SHA-256 | ~50 ns/key | Yes | Digital signatures, integrity verification |
| Blake3 | ~5 ns/key | Yes | File integrity, key derivation |
| bcrypt / Argon2 | ~100 ms/attempt (design goal) | Yes (KDF) | Password storage only |
The mistake I see most often is reaching for SHA-256 in a hash table "because it avoids collisions better." SHA-256 is 50x slower than MurmurHash3 and provides no collision-resistance benefit for internal hash tables. The inverse mistake is worse: using MurmurHash3 for password storage. MurmurHash3 is designed to be fast, which means an attacker can test billions of MurmurHash3 candidates per second against a stolen hash database.
The rule: non-crypto hash for performance-sensitive internals (hash tables, Bloom filters, partition routing), crypto hash for integrity checks, and a purpose-built key-derivation function (Argon2 or bcrypt) for anything involving user passwords.
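For the integrity side of that rule, a minimal example with Python's standard hashlib (the password comment assumes the third-party bcrypt package, shown but not run):

```python
import hashlib

# Integrity check: a cryptographic hash detects any modification to the data.
original = b"deploy artifact v1.4.2"
checksum = hashlib.sha256(original).hexdigest()

tampered = b"deploy artifact v1.4.3"
assert hashlib.sha256(tampered).hexdigest() != checksum  # tampering detected

# Passwords are different: use a KDF, never a bare hash. With the third-party
# bcrypt package this would look like (not run here):
#   import bcrypt
#   stored = bcrypt.hashpw(password, bcrypt.gensalt())
```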
Hash collisions and the birthday paradox
Two different inputs that produce the same hash are a collision. Hash tables handle collisions with chaining (a linked list at the bucket) or open addressing (probe the next empty slot). Either approach degrades lookup performance as collisions accumulate.
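A minimal chaining implementation makes the collision-handling mechanics concrete (illustrative only; `hash()` here is Python's built-in, which is seeded per process):

```python
class ChainedHashTable:
    """Minimal hash table with separate chaining, for illustration only."""

    def __init__(self, num_buckets: int = 4):
        self.buckets = [[] for _ in range(num_buckets)]

    def put(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))       # collision grows the chain

    def get(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for k, v in chain:               # O(chain length) scan
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("alice", 1)
t.put("bob", 2)
print(t.get("alice"))  # 1
```

Every lookup scans one chain; with a good hash function chains stay short, and with a degenerate one a single chain holds everything.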
The birthday paradox shows how quickly collisions appear. For a hash function with N possible output values, there is roughly a 50% chance of a collision after sqrt(N) inputs:
32-bit hash: sqrt(2^32) ≈ 65,536 inputs for 50% collision probability
64-bit hash: sqrt(2^64) ≈ 4,294,967,296 inputs for 50% collision probability
At 10 million entries in a 32-bit hash table, you are far past the birthday threshold: roughly n^2 / 2N, or over 11,000 expected collisions. For large key spaces (user IDs, URLs, event log entries), always use 64-bit or 128-bit hashes. A 32-bit hash for a table above 100,000 entries starts accumulating collisions at a rate that matters.
32-bit hashes collide faster than expected
A 32-bit hash will produce a 50% collision probability at around 65,000 inputs. If your hash table holds 500,000 entries, collisions are not rare events. Most production systems should default to 64-bit hashes. The memory cost is 4 bytes extra per entry, which is almost never the binding constraint.
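The approximation behind these numbers is p ≈ 1 - e^(-n²/2N). A short sketch (note the exact 50% point is closer to 1.18·√N than √N, which is why 77,163 rather than 65,536 appears below):

```python
import math

def collision_probability(n: int, space_bits: int) -> float:
    """Birthday approximation: p ~ 1 - exp(-n^2 / 2N) for N = 2**space_bits."""
    N = 2 ** space_bits
    return 1.0 - math.exp(-n * n / (2.0 * N))

print(collision_probability(77_163, 32))    # ~0.50, near sqrt(2^32)
print(collision_probability(500_000, 32))   # effectively 1.0
print(collision_probability(500_000, 64))   # negligible
```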
HashDoS and seeded hashing
Without a random seed, an adversary who knows your hash function can craft HTTP POST parameters that all map to the same bucket. They submit 10,000 such keys. Your O(1) hash table becomes O(N). The server pegs one CPU for minutes per request.
This attack happened at scale in 2011 against PHP, Python, Ruby, Java, and ASP.NET simultaneously. Attackers sent crafted HTTP form bodies whose parameter names all collided in the request-parsing hash table. A single POST request could consume 100% of a CPU core for several minutes.
The fix is seeded hashing: initialize the hash function with a random value at process startup. The seed changes which inputs collide without changing the output distribution. An adversary without the seed cannot predict which keys will land in the same bucket.
```python
# Python 3.3+ uses a random PYTHONHASHSEED at startup by default.
# hash("abc") returns a different value every process launch.
print(hash("abc"))  # e.g., 7284352934, changes per run

# To reproduce deterministically in tests only:
#   PYTHONHASHSEED=0 python test_suite.py
```
Python added random seeding in 3.3. Java fixed HashMap in Java 7u6. Ruby added it in 1.9.3. If your framework does not seed its internal hash function, treat it as a security issue for any service that accepts external input as hash keys.
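One way to seed a hash function is to replace its fixed offset basis with a per-process random value. A hypothetical seeded variant of FNV-1a, sketched under that assumption:

```python
import os

def seeded_fnv1a(data: bytes, seed: int) -> int:
    """FNV-1a variant with a random basis (hypothetical seeding scheme)."""
    h = seed & 0xFFFFFFFF                # seed replaces the fixed offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

SEED = int.from_bytes(os.urandom(4), "big")   # drawn once at process startup

# Which keys collide now depends on SEED, so an attacker who knows only the
# algorithm cannot precompute a set of colliding keys.
bucket = seeded_fnv1a(b"session:42", SEED) % 16
```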
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| PostgreSQL | Hash indexes, hash joins | Custom hash per data type; hash indexes support only equality (no range queries); hash joins require the smaller relation to fit in work_mem |
| Java HashMap | Seeded key hashing | Added seed randomization in Java 7u6 after the 2011 HashDoS attack; uses hashCode() ^ (hashCode() >>> 16) to spread low-order bits and reduce clustering |
| Kafka | Partition routing via murmur2 | murmur2(key) % num_partitions routes each message; changing partition count after topic creation breaks the mapping, sending a key's messages to a different partition |
| nginx upstream | Load balancing via CRC32 | hash $request_uri consistent applies CRC32 on a consistent ring; plain hash (without consistent) reassigns all keys on any upstream change |
Limitations
- A hash function that matches your key distribution degenerates your hash table to O(N). Always test with actual production key distributions, not random ones.
- Non-cryptographic hash functions (MurmurHash3, FNV-1a, xxHash) must never be used for passwords, MACs, or anything requiring collision resistance against a determined adversary.
- Hash tables degrade noticeably when load factor exceeds 70-80%. Most standard library implementations resize at this point, but custom implementations often do not. A hash table at 95% capacity has significantly worse lookup performance even with a good hash function.
- Hash output is not stable across languages, library versions, or compiler flags unless you pin the implementation explicitly. Kafka's murmur2 differs from standard murmur2 implementations in Python and Go. Verify cross-language partition key consistency explicitly.
- Hash functions are one-way. You cannot reverse a hash to recover the input. Do not use hash output as an encoding or compression scheme.
- Modular hashing is the wrong tool for distributed node routing. Adding one backend node changes the bucket assignment of nearly every key. For routing across nodes that can be added or removed, use consistent hashing instead.
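That last difference is easy to measure. The sketch below compares modular assignment against a minimal consistent-hash ring (a virtual-node scheme; md5 is used only because it is stable across processes, and node names like n0 are hypothetical):

```python
import hashlib
from bisect import bisect

def stable_hash(s: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is seeded per process)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def modular_assign(keys, nodes):
    return {k: nodes[stable_hash(k) % len(nodes)] for k in keys}

def ring_assign(keys, nodes, vnodes=100):
    # Each node gets `vnodes` points on the ring; a key is owned by the next
    # point clockwise from the key's own hash.
    ring = sorted((stable_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    points = [p for p, _ in ring]
    return {k: ring[bisect(points, stable_hash(k)) % len(ring)][1] for k in keys}

keys = [f"key{i}" for i in range(10_000)]
old_nodes, new_nodes = ["n0", "n1", "n2", "n3"], ["n0", "n1", "n2", "n3", "n4"]

for assign in (modular_assign, ring_assign):
    before, after = assign(keys, old_nodes), assign(keys, new_nodes)
    moved = sum(before[k] != after[k] for k in keys)
    print(f"{assign.__name__}: {moved / len(keys):.0%} of keys moved")
```

Going from 4 to 5 nodes, modular assignment moves around 80% of keys, while the ring moves roughly the expected 1/5.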
When to choose a hash function
Interview cheat sheet
- When a hash table lookup is slow, ask about key distribution and load factor before anything else. A hash function that matches the data's periodicity (like `id % 16` on multiples of 16) degenerates to O(N) regardless of everything else.
- When someone uses SHA-256 for a hash table key, flag it as a performance issue. It is 50x slower than MurmurHash3 and provides no collision-resistance benefit for internal tables.
- When asked about passwords, state clearly that hash functions are never used directly. Use Argon2 or bcrypt: they are intentionally slow and memory-hard, which makes brute-force impractical.
- When HashDoS comes up, explain that the attack exploits predictable collision patterns. The fix is seeded hashing (randomizing the hash basis at process startup). Python, Java, and Ruby all added this after the 2011 attack.
- When choosing a hash function for a high-throughput system (Bloom filters, Kafka, in-memory hash tables), recommend xxHash3 or MurmurHash3. For checksums and integrity verification, SHA-256 or Blake3.
- When asked about 32-bit versus 64-bit hashes, state the threshold: 50% collision probability appears at 65,000 inputs for a 32-bit hash. For any table above 100,000 entries, use 64-bit.
- When discussing Kafka partition routing, note that `murmur2(key) % num_partitions` is stable only while partition count is fixed. Resizing a topic breaks the mapping and disrupts any consumer relying on per-key ordering.
- When someone proposes modular hashing for distributed cache routing, redirect to consistent hashing. Modular hashing reassigns almost all keys when a node is added. Consistent hashing moves only `keys / (N+1)` keys on average.
Quick recap
- A hash function maps arbitrary input to a fixed-size number. Core properties are determinism, uniform distribution, avalanche effect, and speed.
- A hash function that aligns with your key distribution (like modulo on periodic values) degenerates the hash table to a linked list with O(N) lookups.
- Non-cryptographic hashes (MurmurHash3, FNV-1a, xxHash3) are correct for hash tables and Bloom filters. Use SHA-256 or Blake3 for integrity checks, and Argon2 or bcrypt for passwords.
- The birthday paradox means a 32-bit hash has 50% collision probability at about 65,000 inputs. For tables above 100,000 entries, always use 64-bit hashes.
- Seeded hashing (randomizing the hash basis at startup) prevents HashDoS attacks, where an adversary crafts inputs that all collide into one bucket. Python, Java, and Ruby added this after the 2011 attack.
- Modular hashing reassigns nearly every key when bucket count changes. For routing across nodes that can be added or removed, use consistent hashing instead.
Related concepts
- Consistent hashing — A ring-based hash variant designed for distributed node routing. Adding one node moves approximately `keys / (N+1)` entries, not all keys. Built on top of a standard hash function but designed for membership changes.
- Bloom filters — Use k independent hash functions per element. The uniformity of those hash functions directly determines the false positive rate. Weak hash functions produce clustering that inflates FP rates above the theoretical bound.
- Databases — PostgreSQL hash indexes use the same bucket-hash mechanics. They support only equality lookups and do not support range queries, because hash output has no ordering relationship to the original keys.