Files in database anti-pattern
Learn why storing binary files as BLOBs in a relational database bloats row sizes, explodes backup times, and creates replication lag, and why object storage is almost always the right answer.
TL;DR
- Storing binary files (images, PDFs, videos, audio) as BLOBs in a relational database uses a storage engine designed for structured, indexed, queryable data to store bytes it cannot query, index, or compress.
- The costs: row bloat slows all queries (not just file queries), backups take forever, replication lag grows, sequential table scans read giant BLOBs, VACUUM is slow, and you cannot CDN-cache database content.
- The right answer in nearly every case: store files in an object store (S3, GCS, Azure Blob Storage). Store the file path or key in the database. Serve files via CDN.
- The only exception: very small, frequently updated binary data (thumbnails under 100KB in a system that never serves them to end users) where the overhead of S3 round-trips dominates.
The Problem
A developer building a document management system stores uploaded PDFs directly in the database. The schema looks reasonable:
```sql
CREATE TABLE documents (
  id          UUID PRIMARY KEY,
  filename    VARCHAR(255),
  content     BYTEA,  -- stores the PDF binary
  uploaded_at TIMESTAMP,
  user_id     UUID
);
```
Upload 50,000 documents averaging 2MB each. Your database is now 100GB larger from content columns alone. Your nightly backup takes 6 hours instead of 20 minutes. A sequential scan of the documents table for any reason reads 100GB of binary data. VACUUM must process every BYTEA value on every dead tuple. The replication log (WAL) includes all 100GB, so your replica runs 30 minutes behind.
Your CDN cannot cache the PDFs because the browser downloads them through an API endpoint that hits your database. Every document download consumes a database connection for the duration of the transfer, and at 2MB per file over typical client links, that duration is measured in seconds, not milliseconds.
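Little's law turns this into a quick back-of-envelope estimate: connections held ≈ arrival rate × transfer duration. The sketch below assumes a 5 Mbps client link, an illustrative figure not taken from the measurements above.

```typescript
// Little's law: average concurrency = arrival rate × time each item is held.
// Here, each in-flight download holds one database connection.

// Seconds to push `sizeMB` megabytes over an `mbps` megabit/s client link.
function transferSeconds(sizeMB: number, mbps: number): number {
  return (sizeMB * 8) / mbps;
}

// Connections held concurrently, given downloads/sec and per-download duration.
function concurrentConnections(downloadsPerSec: number, seconds: number): number {
  return downloadsPerSec * seconds;
}

const secs = transferSeconds(2, 5); // 2MB file over a 5 Mbps link → 3.2s per download
const held = concurrentConnections(100, secs); // 100 downloads/sec → 320 connections held
```

At those assumed numbers, a modest 100 downloads/sec already pins more connections than most pools allow.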
I inherited a system exactly like this in 2021. The team had 50,000 user-uploaded documents in PostgreSQL. A simple SELECT filename FROM documents WHERE user_id = ? took 800ms because the table's pages were bloated by the BYTEA data, so the scan did far more I/O than the metadata warranted. Moving files to S3 dropped that query to 3ms.
Why It Happens
Teams store files in the database because it feels simpler. One system, one backup, one transaction. Each piece of the logic is individually reasonable:
- "One system is easier to operate." True for development. Catastrophic at scale when that one system does two fundamentally different jobs.
- "We need transactions around the file." You need the metadata to be transactional. The file itself is immutable after upload.
- "S3 is another dependency." So is your database. And S3 has 99.999999999% durability, better than any self-managed database.
- "We only have a few files." Every system that stores files in the database started with "just a few files."
The mismatch is fundamental. Databases are optimized for narrow, indexed, queryable rows. Files are the opposite:
| Characteristic | Files (BLOBs) | Relational Data |
|---|---|---|
| Access pattern | Full read or write, no partial | Column-level, row-level, range queries |
| Indexability | Not indexable | Fully indexable |
| Compression | Already compressed (JPEG, PDF) | Benefits from row compression |
| Query predicates | Never queried as a WHERE clause | The whole point |
| Row width | Makes rows huge | Rows should be narrow and uniform |
| CDN compatibility | No (database is not HTTP) | Not applicable |
| Concurrency | One connection per download | Row-level locking, MVCC |
How to Detect It
| Symptom | What It Means | How to Check |
|---|---|---|
| BYTEA or BLOB columns in schema | Binary data stored in relational tables | SELECT table_name, column_name FROM information_schema.columns WHERE data_type IN ('bytea', 'blob', 'longblob', 'oid') |
| Database size >> sum of row data | BLOBs inflating total size | SELECT pg_size_pretty(pg_total_relation_size('documents')) vs expected row count * avg metadata size |
| Backup duration growing linearly with uploads | pg_dump/mysqldump serializing BLOBs | Monitor backup duration trend over time |
| Replication lag correlated with upload activity | WAL includes full binary content | Check pg_stat_replication.replay_lag during upload spikes |
| API download endpoint hitting database connection pool | Every file download consumes a DB connection | Monitor connection pool usage during download traffic |
| VACUUM taking hours on a table with few rows | Processing TOAST pages for large BYTEA | SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables WHERE relname = 'documents' |
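The first check in the table is easy to automate. This is an illustrative sketch: the `ColumnRow` shape mirrors the fields of `information_schema.columns` used above, and the sample rows are made up.

```typescript
// Shape of a row from information_schema.columns (only the fields we need).
interface ColumnRow {
  table_name: string;
  column_name: string;
  data_type: string;
}

// Column types that indicate binary payloads living in relational tables.
const BINARY_TYPES = new Set(["bytea", "blob", "longblob", "oid"]);

// Filter a column listing down to the suspicious binary columns.
function findBinaryColumns(rows: ColumnRow[]): ColumnRow[] {
  return rows.filter((r) => BINARY_TYPES.has(r.data_type.toLowerCase()));
}

// Example against made-up rows:
const sample: ColumnRow[] = [
  { table_name: "documents", column_name: "content", data_type: "bytea" },
  { table_name: "documents", column_name: "filename", data_type: "character varying" },
];
const hits = findBinaryColumns(sample); // flags only the `content` column
```

Run the information_schema query from the table, feed the rows through this filter, and alert on any hit in a schema review.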
The Fix
Fix 1: Object store + metadata-only database
Upload flow:
Client → uploadFile(pdf) → S3.putObject(key="docs/user-123/filename.pdf") → DB.insert(file_metadata)
Database stores:
{ id, user_id, filename, s3_key: "docs/user-123/filename.pdf", size_bytes, mime_type, created_at }
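The upload flow above can be sketched against a minimal `ObjectStore` interface. This is an assumption standing in for the real S3 client, and the in-memory implementation exists only to make the example self-contained; the point is that only the key and metadata ever reach the database.

```typescript
// Minimal stand-in for an object store client (e.g. S3's putObject).
interface ObjectStore {
  putObject(key: string, bytes: Uint8Array): void;
}

// The metadata row the database stores instead of the file bytes.
interface FileMetadata {
  user_id: string;
  filename: string;
  s3_key: string;
  size_bytes: number;
}

// Build the object key the database will store.
function buildKey(userId: string, filename: string): string {
  return `docs/user-${userId}/${filename}`;
}

// Upload: bytes go to the object store, only metadata goes to the database.
function uploadFile(
  store: ObjectStore,
  userId: string,
  filename: string,
  bytes: Uint8Array
): FileMetadata {
  const key = buildKey(userId, filename);
  store.putObject(key, bytes); // the only place the bytes travel
  return { user_id: userId, filename, s3_key: key, size_bytes: bytes.length };
}

// In-memory store for the example.
const objects = new Map<string, Uint8Array>();
const store: ObjectStore = { putObject: (k, b) => void objects.set(k, b) };

const meta = uploadFile(store, "123", "report.pdf", new Uint8Array(2048));
// meta.s3_key === "docs/user-123/report.pdf"; the DB row stores that string, not 2KB of bytes.
```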
Download flow (option 1, pre-signed URL):
Client → GET /files/{id}/url
Server → S3.generatePresignedUrl(key, expiresIn=3600) → return { url }
Client → GET https://bucket.s3.amazonaws.com/docs/user-123/filename.pdf (direct to S3)
Download flow (option 2, CDN):
Client → GET https://cdn.example.com/docs/user-123/filename.pdf
CDN → S3 origin on cache miss → cached at edge for subsequent requests
The database now stores a 100-byte s3_key string instead of a 2MB BYTEA. Backups are tiny. VACUUM is fast. Files are served at CDN speed without touching the database.
Fix 2: Pre-signed URL pattern (detailed)
Pre-signed URLs are the key mechanism. The server generates a time-limited URL that grants direct access to S3 without proxying through the application:
```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3Client = new S3Client({}); // region/credentials from the environment

async function getDownloadUrl(fileId: string): Promise<string> {
  // Look up only the key -- the database never touches the file bytes
  const { rows } = await db.query(
    "SELECT s3_key FROM documents WHERE id = $1", [fileId]
  );
  const command = new GetObjectCommand({
    Bucket: "my-documents-bucket",
    Key: rows[0].s3_key,
  });
  // URL expires in 1 hour; client downloads directly from S3
  return getSignedUrl(s3Client, command, { expiresIn: 3600 });
}
```
Trade-off: Pre-signed URLs can be shared within their TTL. If your files are sensitive, use short TTLs (5 minutes) and enforce authorization checks before generating the URL.
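One way to pair the short TTL with an authorization check is to verify ownership before minting the URL. This is a sketch: `Signer` is a stand-in for the presigner, the row shape is assumed, and the 300-second TTL is the 5-minute figure above.

```typescript
// Stand-in for a URL signer (e.g. a wrapper around getSignedUrl).
type Signer = (key: string, expiresInSeconds: number) => string;

interface DocumentRow {
  id: string;
  owner_id: string;
  s3_key: string;
}

// Authorize first, then sign: a URL is only minted for the file's owner.
function getDownloadUrl(
  doc: DocumentRow,
  requesterId: string,
  sign: Signer
): string {
  if (doc.owner_id !== requesterId) {
    throw new Error("forbidden: requester does not own this document");
  }
  return sign(doc.s3_key, 300); // 5-minute TTL for sensitive files
}

// Example with a fake signer:
const fakeSign: Signer = (key, ttl) => `https://bucket.example/${key}?expires=${ttl}`;
const row: DocumentRow = { id: "d1", owner_id: "u1", s3_key: "docs/user-u1/a.pdf" };
const url = getDownloadUrl(row, "u1", fakeSign);
```

Because the check happens before signing, a leaked URL is the only exposure window, and the short TTL bounds it.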
The cost difference at scale
At 100 document downloads per second with 2MB files:
- Database-served: 200 MB/s flowing out of database connections, and every in-flight transfer holds a connection, so a 200-connection pool is exhausted at just 200 concurrent downloads.
- S3-served via CDN: 0 bytes from the database. All from CDN edge nodes globally. S3 handles tens of millions of concurrent requests.
Severity and Blast Radius
This anti-pattern affects everything in the database, not just file-related queries.
- Query performance: Even queries that never touch the file column are slowed because sequential scans, VACUUM, and index operations must navigate around TOAST pages storing BLOBs.
- Backup and recovery: pg_dump serializes BLOBs inline. A 100GB database of mostly files takes 6+ hours to back up, so your RTO (recovery time objective) is measured in hours.
- Replication: WAL includes full binary content. Large file uploads cause replication lag spikes, which can trigger stale reads from replicas.
- Connection pool exhaustion: Each file download holds a database connection for the duration of the transfer. 100 concurrent downloads = 100 connections consumed.
Recovery difficulty: medium. The migration path is straightforward (copy files to S3, update rows with s3_key, drop BYTEA column), but it requires downtime or a careful dual-write migration for large datasets.
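The backfill step of that migration can be sketched per row. This is illustrative only: the row shape and in-memory store are assumptions, and a production run would batch updates, checkpoint progress, and verify checksums before dropping the BYTEA column.

```typescript
interface DocRow {
  id: string;
  user_id: string;
  filename: string;
  content: Uint8Array | null; // legacy BYTEA column
  s3_key: string | null;      // new column, initially NULL
}

// Stand-in for the object store client.
interface ObjectStore {
  putObject(key: string, bytes: Uint8Array): void;
}

// Backfill one row: copy the bytes to the object store, record the key,
// then clear the BYTEA payload. Idempotent: migrated rows are skipped.
function migrateRow(row: DocRow, store: ObjectStore): void {
  if (row.s3_key !== null || row.content === null) return; // already migrated
  const key = `docs/user-${row.user_id}/${row.filename}`;
  store.putObject(key, row.content);
  row.s3_key = key;   // in SQL: UPDATE documents SET s3_key = $1 WHERE id = $2
  row.content = null; // in SQL: eventually ALTER TABLE documents DROP COLUMN content
}

// Example against an in-memory store:
const migratedObjects = new Map<string, Uint8Array>();
const s3: ObjectStore = { putObject: (k, b) => void migratedObjects.set(k, b) };
const legacy: DocRow = {
  id: "1", user_id: "42", filename: "a.pdf",
  content: new Uint8Array(10), s3_key: null,
};
migrateRow(legacy, s3);
// legacy.s3_key is now "docs/user-42/a.pdf" and legacy.content is null
```

The idempotence check is what makes the dual-write window safe: re-running the backfill never re-uploads or clobbers a row that already carries its key.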
When It's Actually OK
- Thumbnails under 100KB in an internal admin tool where CDN caching is unnecessary and the total number of files is under 10,000. The operational simplicity of one system may outweigh the performance cost.
- Cryptographic keys or certificates (a few KB each) that must be stored atomically with their metadata in the same transaction and never served to end users.
- SQLite in embedded/mobile apps where there is no network and no object store. SQLite BLOB storage is fine for local-only applications.
- Temporary processing artifacts that are written once, read once within seconds, and deleted. If the file never persists long enough to affect backups or replication, in-database storage is acceptable.
How This Shows Up in Interviews
When a design involves file uploads (profile photos, documents, videos), describe the S3 pattern immediately: "Files go to S3. The database stores only the S3 key, size, and metadata. Downloads use pre-signed URLs with a CDN in front of S3 for frequently accessed content. The database never serves binary data."
The three-tier storage rule
Hot files: CDN edge. Warm files: S3 standard tier. Cold files: S3 Glacier. The database is not a tier in this hierarchy. Mentioning this framework in an interview shows you think about cost optimization and access patterns.
Quick Recap
- BLOBs in a database bloat rows, slow VACUUM and backups, inflate replication logs, and prevent CDN caching.
- Object stores (S3, GCS) are designed for files: durable, cheap at scale, and served directly to clients.
- The database stores only the file key/path, a single VARCHAR column per file regardless of file size.
- Pre-signed URLs grant time-limited direct S3 access without proxying through your application.
- Put a CDN in front of S3 for any files that serve end users. This eliminates database connection pressure and provides global edge caching.
- The three-tier rule: hot files on CDN, warm files on S3, cold files on Glacier. The database is never a storage tier.