Files in database anti-pattern
Learn why storing binary files as BLOBs in a relational database bloats row sizes, explodes backup times, and creates replication lag, and when object storage is always the right answer.
TL;DR
- Storing binary files (images, PDFs, videos, audio) as BLOBs in a relational database uses a storage engine designed for structured, indexed, queryable data to store bytes it cannot query, index, or compress.
- The costs: row bloat slows all queries (not just file queries), backups take forever, replication lag grows, sequential table scans read giant BLOBs, VACUUM is slow, and you cannot CDN-cache database content.
- The right answer in nearly every case: store files in an object store (S3, GCS, Azure Blob Storage). Store the file path or key in the database. Serve files via CDN.
- The only exception: very small, frequently updated binary data (thumbnails under 100KB in a system that never serves them to end users) where the overhead of S3 round-trips dominates.
The Problem
A developer building a document management system stores uploaded PDFs directly in the database. The schema looks reasonable:
CREATE TABLE documents (
id UUID PRIMARY KEY,
filename VARCHAR(255),
content BYTEA, -- stores the PDF binary
uploaded_at TIMESTAMP,
user_id UUID
);
Upload 50,000 documents averaging 2MB each. Your database is now 100GB larger just from content columns. Your nightly backup takes 6 hours instead of 20 minutes. A sequential scan of the documents table for any reason reads 100GB of binary data. VACUUM must process every BYTEA column on every dead tuple. Replication log includes all 100GB, so your replica has 30 minutes of lag.
Your CDN cannot cache the PDFs because the browser downloads them through an API endpoint that hits your database. Every document download consumes a database connection for the duration of the transfer. At 2MB per file and 100ms network time, that is one connection tied up per download.
I inherited a system exactly like this in 2021. The team had 50,000 user-uploaded documents in PostgreSQL. A simple SELECT filename FROM documents WHERE user_id = ? took 800ms because the planner had to skip over BYTEA toast pages. Moving files to S3 dropped that query to 3ms.
Why It Happens
Teams store files in the database because it feels simpler. One system, one backup, one transaction. The individually-reasonable logic:
- "One system is easier to operate." True for development. Catastrophic at scale when that one system does two fundamentally different jobs.
- "We need transactions around the file." You need the metadata to be transactional. The file itself is immutable after upload.
- "S3 is another dependency." So is your database. And S3 has 99.999999999% durability, better than any self-managed database.
- "We only have a few files." Every system that stores files in the database started with "just a few files."
The mismatch is fundamental. Databases are optimized for narrow, indexed, queryable rows. Files are the opposite:
| Characteristic | Files (BLOBs) | Relational Data |
|---|---|---|
| Access pattern | Full read or write, no partial | Column-level, row-level, range queries |
| Indexability | Not indexable | Fully indexable |
| Compression | Already compressed (JPEG, PDF) | Benefits from row compression |
| Query predicates | Never queried as a WHERE clause | The whole point |
| Row width | Makes rows huge | Rows should be narrow and uniform |
| CDN compatibility | No (database is not HTTP) | Not applicable |
| Concurrency | One connection per download | Row-level locking, MVCC |
How to Detect It
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.