Hashing and hash functions
Learn what properties make a hash function suitable for databases, consistent hashing rings, and checksums, and why choosing the wrong hash function is a frequent source of unexpected collisions and performance cliffs.
The problem
You inherit a session-storage service. The engineer who built it chose a simple hash function: h(id) = id % 16. Sixteen buckets, each expected to hold an even share of sessions. In staging, with random IDs, the distribution looked fine.
In production, user IDs come from an auto-increment column that was seeded at 16 and increments in steps of 16. Every production ID is 16, 32, 48, 64... Every ID is a multiple of 16. Every ID maps to bucket 0.
At 1 million sessions, every lookup scans all 1 million entries in bucket 0. The hash table has degenerated to a linked list. The function chose a distribution that perfectly matched the data's periodicity, which means it distributed nothing at all.
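The failure is easy to reproduce. This sketch (hypothetical IDs, 16 buckets) counts how many buckets actually receive keys under h(id) = id % 16:

```python
from collections import Counter

def bucket_counts(ids, num_buckets):
    """Count how many IDs land in each bucket under h(id) = id % num_buckets."""
    return Counter(i % num_buckets for i in ids)

# Staging: arbitrary non-periodic IDs spread across all 16 buckets.
staging_ids = [i * 7919 + 13 for i in range(1000)]
# Production: auto-increment seeded at 16, step 16 -- every ID is a multiple of 16.
production_ids = [16 * i for i in range(1, 1001)]

print(len(bucket_counts(staging_ids, 16)))      # 16 buckets in use
print(len(bucket_counts(production_ids, 16)))   # 1 bucket in use: all keys in bucket 0
```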
What a hash function is
A hash function takes an arbitrary input (a string, a number, a byte sequence) and maps it to a fixed-size integer. The output size is constant regardless of input size: fnv1a("hello") and fnv1a(entire_book) each produce a 32-bit number.
Analogy: A post office sorting machine. Every letter (any size, any sender) gets assigned to one bin (0 through N-1). A good machine spreads letters evenly. A broken machine puts everything in bin zero.
Four properties define a useful hash function:
- Deterministic: The same input always produces the same output.
- Uniform distribution: Outputs spread evenly across the output space. Modulo on periodic inputs is maximally non-uniform.
- Avalanche effect: A one-bit change in input flips roughly half the output bits. Without this, similar keys cluster together.
- Speed: A few nanoseconds per key for non-cryptographic uses.
How it works
The core operation: apply the hash function to the key, then take modulo the bucket count.
bucket = hash(key) % bucket_count
A good hash function distributes keys uniformly so each bucket holds roughly total_keys / bucket_count entries.
FNV-1a (Fowler-Noll-Vo) is one of the simplest correct non-cryptographic hash functions. It processes the input one byte at a time:
```python
# FNV-1a (32-bit)
FNV_PRIME = 16777619
FNV_OFFSET_BASIS = 2166136261

def fnv1a(data: bytes) -> int:
    h = FNV_OFFSET_BASIS
    for byte in data:
        h = h ^ byte                      # mix the byte into the state
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # spread via multiplication, keep it 32-bit
    return h
```
The XOR mixes the byte into the state. The multiplication by a prime spreads the result across all bit positions: one changed byte propagates into every subsequent bit. That is the avalanche effect. I find the simplicity here deceptive. People look at this and think modulo is just as good. Modulo has no avalanche effect at all.
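A quick way to observe avalanche behavior is to hash two inputs that differ in a single bit and count how many output bits differ. This sketch is a direct Python port of the function above (the exact bit count varies by input):

```python
def fnv1a(data: bytes) -> int:
    """32-bit FNV-1a, ported directly from the pseudocode above."""
    h = 2166136261                       # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h

a = fnv1a(b"hello")
b = fnv1a(b"helln")                      # last byte differs by a single bit
diff_bits = bin(a ^ b).count("1")        # number of output bits that changed
print(f"{a:08x} vs {b:08x}: {diff_bits} of 32 bits differ")
```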
Picture a four-bucket hash table after inserting four keys: two keys collide and form a two-entry chain in bucket 0, while the other two keys occupy buckets of their own.
Why multiplication beats division for hashing
Division-based hashing (h = key % M) maps clustered or periodic keys into clustered buckets: with 16 buckets, all multiples of 16 map to bucket 0. Multiplication-based hashing (h = floor(M * frac(key * A)), where A is an irrational constant such as the golden-ratio conjugate 0.6180...) destroys periodicity: successive multiples of any step land at well-spread fractional positions, so even evenly spaced inputs fill the buckets roughly uniformly. FNV-1a and MurmurHash apply the same principle internally: their multiply steps are what break up periodicity in the input bytes.
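A sketch of the contrast on the periodic production IDs from earlier, using the golden-ratio constant for the multiplication method (bucket counts here are illustrative, not a benchmark):

```python
import math

GOLDEN = (math.sqrt(5) - 1) / 2   # ~0.6180339887, the golden-ratio conjugate

def mult_bucket(key: int, m: int) -> int:
    """Multiplication method: floor(m * frac(key * A))."""
    frac = (key * GOLDEN) % 1.0
    return int(m * frac)

keys = [16 * i for i in range(1, 101)]          # periodic IDs: 16, 32, 48, ...
div_buckets = {k % 16 for k in keys}            # division method
mul_buckets = {mult_bucket(k, 16) for k in keys}

print(sorted(div_buckets))   # [0] -- every key collides
print(len(mul_buckets))      # most of the 16 buckets get used
```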
Non-cryptographic vs cryptographic hash functions
Not all hash functions serve the same purpose. The key distinction is speed versus security properties.
| Name | Speed | Cryptographic | Primary use case |
|---|---|---|---|
| FNV-1a | ~1 ns/key | No | Simple hash tables, checksums |
| MurmurHash3 | ~1 ns/key | No | Hash tables, Bloom filters, distributed routing |
| xxHash3 | ~0.3 ns/key | No | High-throughput hashing, in-memory tables |
| CityHash | ~0.5 ns/key | No | Hash tables, string maps |
| SHA-256 | ~50 ns/key | Yes | Digital signatures, integrity verification |
| Blake3 | ~5 ns/key | Yes | File integrity, key derivation |
| bcrypt / Argon2 | ~100 ms/attempt (design goal) | Yes (KDF) | Password storage only |
The mistake I see most often is reaching for SHA-256 in a hash table "because it avoids collisions better." SHA-256 is 50x slower than MurmurHash3 and provides no collision-resistance benefit for internal hash tables. The inverse mistake is worse: using MurmurHash3 for password storage. MurmurHash3 is designed to be fast, which means an attacker can test billions of MurmurHash3 candidates per second against a stolen hash database.
The rule: non-crypto hash for performance-sensitive internals (hash tables, Bloom filters, partition routing), crypto hash for integrity checks, and a purpose-built key-derivation function (Argon2 or bcrypt) for anything involving user passwords.
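For the integrity side of that rule, a minimal example with Python's standard hashlib (the password comment assumes the third-party bcrypt package, shown but not run):

```python
import hashlib

# Integrity check: a cryptographic hash detects any modification to the data.
original = b"deploy artifact v1.4.2"
checksum = hashlib.sha256(original).hexdigest()

tampered = b"deploy artifact v1.4.3"
assert hashlib.sha256(tampered).hexdigest() != checksum  # tampering detected

# Passwords are different: use a KDF, never a bare hash. With the third-party
# bcrypt package this would look like (not run here):
#   import bcrypt
#   stored = bcrypt.hashpw(password, bcrypt.gensalt())
```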
Hash collisions and the birthday paradox
Two different inputs that produce the same hash are a collision. Hash tables handle collisions with chaining (a linked list at the bucket) or open addressing (probe the next empty slot). Either approach degrades lookup performance as collisions accumulate.
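A minimal chaining implementation makes the collision-handling mechanics concrete (illustrative only; `hash()` here is Python's built-in, which is seeded per process):

```python
class ChainedHashTable:
    """Minimal hash table with separate chaining, for illustration only."""

    def __init__(self, num_buckets: int = 4):
        self.buckets = [[] for _ in range(num_buckets)]

    def put(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))       # collision grows the chain

    def get(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for k, v in chain:               # O(chain length) scan
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("alice", 1)
t.put("bob", 2)
print(t.get("alice"))  # 1
```

Every lookup scans one chain; with a good hash function chains stay short, and with a degenerate one a single chain holds everything.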
The birthday paradox shows how quickly collisions appear. For a hash function with N possible output values, there is roughly a 50% chance of a collision after sqrt(N) inputs:
32-bit hash: sqrt(2^32) ≈ 65,536 inputs for 50% collision probability
64-bit hash: sqrt(2^64) ≈ 4,294,967,296 inputs for 50% collision probability
At 10 million entries in a 32-bit hash table, you are far past the birthday threshold: roughly n^2 / 2N, or over 11,000 expected collisions. For large key spaces (user IDs, URLs, event log entries), always use 64-bit or 128-bit hashes. A 32-bit hash for a table above 100,000 entries starts accumulating collisions at a rate that matters.
32-bit hashes collide faster than expected
A 32-bit hash will produce a 50% collision probability at around 65,000 inputs. If your hash table holds 500,000 entries, collisions are not rare events. Most production systems should default to 64-bit hashes. The memory cost is 4 bytes extra per entry, which is almost never the binding constraint.
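The approximation behind these numbers is p ≈ 1 - e^(-n²/2N). A short sketch (note the exact 50% point is closer to 1.18·√N than √N, which is why 77,163 rather than 65,536 appears below):

```python
import math

def collision_probability(n: int, space_bits: int) -> float:
    """Birthday approximation: p ~ 1 - exp(-n^2 / 2N) for N = 2**space_bits."""
    N = 2 ** space_bits
    return 1.0 - math.exp(-n * n / (2.0 * N))

print(collision_probability(77_163, 32))    # ~0.50, near sqrt(2^32)
print(collision_probability(500_000, 32))   # effectively 1.0
print(collision_probability(500_000, 64))   # negligible
```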
HashDoS and seeded hashing
Without a random seed, an adversary who knows your hash function can craft HTTP POST parameters that all map to the same bucket. They submit 10,000 such keys. Your O(1) hash table becomes O(N). The server pegs one CPU for minutes per request.
This attack happened at scale in 2011 against PHP, Python, Ruby, Java, and ASP.NET simultaneously. Attackers sent crafted HTTP form bodies whose parameter names all collided in the request-parsing hash table. A single POST request could consume 100% of a CPU core for several minutes.
The fix is seeded hashing: initialize the hash function with a random value at process startup. The seed changes which inputs collide without changing the output distribution. An adversary without the seed cannot predict which keys will land in the same bucket.
```python
# Python 3.3+ uses a random PYTHONHASHSEED at startup by default.
# hash("abc") returns a different value every process launch.
print(hash("abc"))  # e.g., 7284352934, changes per run

# To reproduce deterministically in tests only:
#   PYTHONHASHSEED=0 python test_suite.py
```
Python added random seeding in 3.3. Java fixed HashMap in Java 7u6. Ruby added it in 1.9.3. If your framework does not seed its internal hash function, treat it as a security issue for any service that accepts external input as hash keys.
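One way to seed a hash function is to replace its fixed offset basis with a per-process random value. A hypothetical seeded variant of FNV-1a, sketched under that assumption:

```python
import os

def seeded_fnv1a(data: bytes, seed: int) -> int:
    """FNV-1a variant with a random basis (hypothetical seeding scheme)."""
    h = seed & 0xFFFFFFFF                # seed replaces the fixed offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

SEED = int.from_bytes(os.urandom(4), "big")   # drawn once at process startup

# Which keys collide now depends on SEED, so an attacker who knows only the
# algorithm cannot precompute a set of colliding keys.
bucket = seeded_fnv1a(b"session:42", SEED) % 16
```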
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| PostgreSQL | Hash indexes, hash joins | Custom hash per data type; hash indexes support only equality (no range queries); hash joins require the smaller relation to fit in work_mem |
| Java HashMap | Seeded key hashing | Added seed randomization in Java 7u6 after the 2011 HashDoS attack; uses hashCode() ^ (hashCode() >>> 16) to spread low-order bits and reduce clustering |
| Kafka | Partition routing via murmur2 | murmur2(key) % num_partitions routes each message; changing partition count after topic creation breaks the mapping, sending a key's messages to a different partition |
| nginx upstream | Load balancing via CRC32 | hash $request_uri consistent applies CRC32 on a consistent ring; plain hash (without consistent) reassigns all keys on any upstream change |
Limitations
- A hash function that matches your key distribution degenerates your hash table to O(N). Always test with actual production key distributions, not random ones.
- Non-cryptographic hash functions (MurmurHash3, FNV-1a, xxHash) must never be used for passwords, MACs, or anything requiring collision resistance against a determined adversary.
- Hash tables degrade noticeably when load factor exceeds 70-80%. Most standard library implementations resize at this point, but custom implementations often do not. A hash table at 95% capacity has significantly worse lookup performance even with a good hash function.
- Hash output is not stable across languages, library versions, or compiler flags unless you pin the implementation explicitly. Kafka's murmur2 differs from standard murmur2 implementations in Python and Go. Verify cross-language partition key consistency explicitly.
- Hash functions are one-way. You cannot reverse a hash to recover the input. Do not use hash output as an encoding or compression scheme.
- Modular hashing is the wrong tool for distributed node routing. Adding one backend node changes the bucket assignment of nearly every key. For routing across nodes that can be added or removed, use consistent hashing instead.
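That last difference is easy to measure. The sketch below compares modular assignment against a minimal consistent-hash ring (a virtual-node scheme; md5 is used only because it is stable across processes, and node names like n0 are hypothetical):

```python
import hashlib
from bisect import bisect

def stable_hash(s: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is seeded per process)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def modular_assign(keys, nodes):
    return {k: nodes[stable_hash(k) % len(nodes)] for k in keys}

def ring_assign(keys, nodes, vnodes=100):
    # Each node gets `vnodes` points on the ring; a key is owned by the next
    # point clockwise from the key's own hash.
    ring = sorted((stable_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    points = [p for p, _ in ring]
    return {k: ring[bisect(points, stable_hash(k)) % len(ring)][1] for k in keys}

keys = [f"key{i}" for i in range(10_000)]
old_nodes, new_nodes = ["n0", "n1", "n2", "n3"], ["n0", "n1", "n2", "n3", "n4"]

for assign in (modular_assign, ring_assign):
    before, after = assign(keys, old_nodes), assign(keys, new_nodes)
    moved = sum(before[k] != after[k] for k in keys)
    print(f"{assign.__name__}: {moved / len(keys):.0%} of keys moved")
```

Going from 4 to 5 nodes, modular assignment moves around 80% of keys, while the ring moves roughly the expected 1/5.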
When to choose a hash function
Interview cheat sheet
- When a hash table lookup is slow, ask about key distribution and load factor before anything else. A hash function that matches the data's periodicity (like `id % 16` on multiples of 16) degenerates to O(N) regardless of everything else.
- When someone uses SHA-256 for a hash table key, flag it as a performance issue. It is 50x slower than MurmurHash3 and provides no collision-resistance benefit for internal tables.
- When asked about passwords, state clearly that hash functions are never used directly. Use Argon2 or bcrypt: they are intentionally slow and memory-hard, which makes brute-force impractical.
- When HashDoS comes up, explain that the attack exploits predictable collision patterns. The fix is seeded hashing (randomizing the hash basis at process startup). Python, Java, and Ruby all added this after the 2011 attack.
- When choosing a hash function for a high-throughput system (Bloom filters, Kafka, in-memory hash tables), recommend xxHash3 or MurmurHash3. For checksums and integrity verification, SHA-256 or Blake3.
- When asked about 32-bit versus 64-bit hashes, state the threshold: 50% collision probability appears at 65,000 inputs for a 32-bit hash. For any table above 100,000 entries, use 64-bit.
- When discussing Kafka partition routing, note that `murmur2(key) % num_partitions` is stable only while partition count is fixed. Resizing a topic breaks the mapping and disrupts any consumer relying on per-key ordering.
- When someone proposes modular hashing for distributed cache routing, redirect to consistent hashing. Modular hashing reassigns almost all keys when a node is added. Consistent hashing moves only `keys / (N+1)` keys on average.
Quick recap
- A hash function maps arbitrary input to a fixed-size number. Core properties are determinism, uniform distribution, avalanche effect, and speed.
- A hash function that aligns with your key distribution (like modulo on periodic values) degenerates the hash table to a linked list with O(N) lookups.
- Non-cryptographic hashes (MurmurHash3, FNV-1a, xxHash3) are correct for hash tables and Bloom filters. Use SHA-256 or Blake3 for integrity checks, and Argon2 or bcrypt for passwords.
- The birthday paradox means a 32-bit hash has 50% collision probability at about 65,000 inputs. For tables above 100,000 entries, always use 64-bit hashes.
- Seeded hashing (randomizing the hash basis at startup) prevents HashDoS attacks, where an adversary crafts inputs that all collide into one bucket. Python, Java, and Ruby added this after the 2011 attack.
- Modular hashing reassigns nearly every key when bucket count changes. For routing across nodes that can be added or removed, use consistent hashing instead.
Related concepts
- Consistent hashing — A ring-based hash variant designed for distributed node routing. Adding one node moves approximately `keys / (N+1)` entries, not all keys. Built on top of a standard hash function but designed for membership changes.
- Bloom filters — Use k independent hash functions per element. The uniformity of those hash functions directly determines the false positive rate. Weak hash functions produce clustering that inflates FP rates above the theoretical bound.
- Databases — PostgreSQL hash indexes use the same bucket-hash mechanics. They support only equality lookups and do not support range queries, because hash output has no ordering relationship to the original keys.