Schema design approach
How to design a data schema in a system design interview: identify the entities, map the access patterns, pick storage, and make deliberate normalization decisions.
TL;DR
- Schema comes after API, never before. Your API endpoints define the access patterns, and access patterns drive every schema decision.
- Use the 4-step process: identify entities (nouns from your API), map access patterns (trace the query behind each endpoint), choose storage technology, and decide normalization level.
- Normalization vs denormalization is not a religious debate. Read-heavy paths get denormalized data for speed. Write-heavy paths stay normalized for consistency. Name the tradeoff explicitly.
- Index every field that appears in a WHERE clause or ORDER BY. If you can't explain why a field is indexed, you probably don't need the index.
- The schema is not one database. Different entities often belong in different storage systems (SQL for transactions, Redis for hot reads, Elasticsearch for search). Say this out loud in your interview.
The Schema That Didn't Serve Any Query
Here's a scenario I see constantly in mock interviews. A candidate is designing a social media feed. They jump straight to schema design: "I'll have a users table, a posts table, a follows table, a likes table, all nicely normalized, third normal form." Beautiful relational design. Then the interviewer asks: "Show me how you'd query the home feed."
The candidate writes: SELECT * FROM posts JOIN follows ON posts.author_id = follows.followee_id WHERE follows.follower_id = ? ORDER BY posts.created_at DESC LIMIT 20. The interviewer nods and asks: "How fast is this query when a user follows 500 people and each has posted 10,000 tweets?"
The answer: it's a disaster. The JOIN scans millions of rows. There's no index that makes this efficient. The normalized schema is technically correct but operationally useless for the actual query the system needs to serve.
The root cause: they designed the schema before defining the API. They had no idea what queries they needed to support, so they defaulted to "normalize everything." The schema looked clean on the whiteboard but couldn't answer the one question that mattered.
My recommendation: never start with the schema. Start with the API. Let the endpoints tell you what queries you need. Then design a schema that serves those queries efficiently.
Schema before API is backward
A normalized schema is a default, not a goal. If your schema can't efficiently serve your hottest query path, it's wrong regardless of how clean it looks. Always derive schema from access patterns, and access patterns come from the API.
The 4-Step Schema Process
After you've defined your API (covered in the previous article), you have everything you need to design the schema. Here's the process I walk through every time.
Let me walk through each step using the Twitter example from the API article.
Step 1: Identify Core Entities
Pull the nouns from your API responses. If you defined the Twitter API properly, the entities fall out immediately:
From POST /tweets request + response:
→ Tweet entity: tweet_id, author_id, content, media_ids[], created_at
From GET /timeline/home response:
→ User entity: user_id, username, display_name, avatar_url
→ Tweet entity (confirmed): includes like_count, author_name (denormalized?)
From POST /follows:
→ Follow entity: follower_id, followee_id, created_at
From POST /tweets/{id}/likes:
→ Like entity: user_id, tweet_id, created_at
From GET /users/{id}/profile:
→ User entity (confirmed): includes follower_count, following_count (derived)
That gives you four core entities:
| Entity | Key Fields | Notes |
|---|---|---|
| User | user_id (PK), username (unique), display_name, bio, avatar_url | follower_count and following_count are derived |
| Tweet | tweet_id (PK), author_id (FK), content, created_at | like_count is derived |
| Follow | follower_id + followee_id (composite PK), created_at | Represents the social graph |
| Like | user_id + tweet_id (composite PK), created_at | Used for liked_by_me in timeline |
For your interview: say "These are my four core entities, derived directly from the API nouns. Let me now trace how each API endpoint queries them."
What about the timeline?
The timeline isn't a core entity in the traditional sense. It's a derived, precomputed read model. This distinction matters: entities represent your source of truth, while read models represent your optimized query paths. I'll come back to timelines when we talk about denormalization.
Step 2: Map Access Patterns
This is where most candidates skip ahead, and it's where the best candidates shine. Take each API endpoint and trace the exact query it needs.
The critical query: GET /timeline/home
GET /timeline/home?cursor={token}&limit=20
Query requirement:
"Give me the 20 most recent tweets by people I follow,
starting after the cursor position."
Option A (normalized JOIN):
SELECT t.*, u.display_name, u.avatar_url
FROM tweets t
JOIN follows f ON t.author_id = f.followee_id
JOIN users u ON t.author_id = u.user_id
WHERE f.follower_id = :me
AND t.created_at < :cursor_timestamp
ORDER BY t.created_at DESC
LIMIT 20
Problem: this JOIN merges and sorts tweets from every followed
account on each read. With hundreds of followees posting
thousands of tweets each, it touches millions of rows.
Catastrophically slow at scale.
Option B (precomputed timeline):
SELECT tweet_id FROM timeline
WHERE user_id = :me
AND created_at < :cursor_timestamp
ORDER BY created_at DESC
LIMIT 20
Then hydrate: SELECT * FROM tweets WHERE tweet_id IN (...)
This is a cheap index range scan with a composite index on (user_id, created_at).
The cost moves to write time (fan-out on write).
Option B is the answer at Twitter scale. The access pattern analysis just told you that you need a separate timeline table (or cache), which is a precomputed denormalized read model.
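Option B's write side, fan-out on write, can be sketched in a few lines. This is an in-memory illustration only; in production the per-user timelines would live in Redis and you'd cap their length. All names here are made up for the example:

```python
from collections import defaultdict

# In-memory stand-ins for the follows table and per-user timelines.
followers_of = defaultdict(set)   # author_id -> set of follower ids
timelines = defaultdict(list)     # user_id -> list of (created_at, tweet_id)

def follow(follower_id, followee_id):
    followers_of[followee_id].add(follower_id)

def post_tweet(author_id, tweet_id, created_at):
    """Fan-out on write: push the new tweet into every follower's timeline."""
    for follower_id in followers_of[author_id]:
        timelines[follower_id].append((created_at, tweet_id))

def read_timeline(user_id, limit=20):
    """The read is now a cheap slice of a precomputed list -- no JOIN."""
    return [tid for _, tid in sorted(timelines[user_id], reverse=True)[:limit]]

follow("alice", "bob")
follow("carol", "bob")
post_tweet("bob", "t1", 100)
post_tweet("bob", "t2", 200)
print(read_timeline("alice"))  # newest first: ['t2', 't1']
```

Note where the cost went: posting now does O(followers) writes, which is exactly the celebrity-user problem you'd raise next in the interview.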
Other access patterns
| API Endpoint | Query Pattern | Index Needed |
|---|---|---|
| POST /tweets | INSERT into tweets | None (write) |
| GET /timeline/home | Range scan on timeline by (user_id, created_at) | Composite: (user_id, created_at DESC) |
| POST /follows | INSERT into follows, check uniqueness | Unique: (follower_id, followee_id) |
| POST /likes | INSERT into likes, check uniqueness, increment counter | Unique: (user_id, tweet_id) |
| GET /users/{id}/profile | Point lookup by user_id | PK: user_id |
| liked_by_me field | Point lookup: does (me, tweet_id) exist? | Covered by likes PK |
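The liked_by_me row deserves a concrete sketch: rather than issuing 20 point lookups per page, batch the check into one IN query against the likes primary key. A minimal illustration using Python's sqlite3 as a stand-in store (table and column names mirror the article's schema):

```python
import sqlite3

# Stand-in likes table with the composite PK from the article.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE likes (user_id TEXT, tweet_id TEXT,
                        PRIMARY KEY (user_id, tweet_id));
    INSERT INTO likes VALUES ('me','t1'), ('me','t3'), ('other','t2');
""")

def liked_by_me(user_id, page_tweet_ids):
    """One batched existence check for all tweets on the page (no N+1)."""
    placeholders = ",".join("?" * len(page_tweet_ids))
    rows = db.execute(
        f"SELECT tweet_id FROM likes "
        f"WHERE user_id = ? AND tweet_id IN ({placeholders})",
        (user_id, *page_tweet_ids))
    liked = {r[0] for r in rows}
    return {tid: tid in liked for tid in page_tweet_ids}
```

Each lookup is a prefix match on the (user_id, tweet_id) primary key, so the whole page resolves in one indexed query.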
I always tell candidates: "Your index strategy writes itself once you've mapped the access patterns. Every WHERE clause and ORDER BY column needs an index. Everything else doesn't."
The access pattern table is your interview cheat code
Drawing a table like the one above takes 60 seconds and immediately shows the interviewer that you understand how queries hit storage. It also makes your indexing decisions self-evident. No guessing, no cargo-culting indexes on every column.
Step 3: Choose Storage Technology
Different entities have different access patterns, and different access patterns need different storage systems. This is where you stop pretending everything goes in one PostgreSQL database.
The storage selection framework
| Question | If Yes | If No |
|---|---|---|
| Need ACID transactions? | SQL (PostgreSQL, MySQL) | Consider NoSQL |
| Need JOINs across entities? | SQL, or denormalize | NoSQL is fine |
| Write volume over 50K/sec? | Consider NoSQL or partitioned SQL | SQL handles it |
| Need sub-millisecond reads? | Redis or in-memory cache | SQL with index is fine |
| Need full-text search? | Elasticsearch | SQL LIKE is fine for small data |
| Need time-series queries? | TimescaleDB or Cassandra | Standard SQL |
Twitter storage map
Users:
Access: Point lookups, infrequent writes
Choice: PostgreSQL
Why: ACID for profile updates, simple point lookups by PK
Tweets:
Access: High write volume (~5K/sec), point lookups by tweet_id
Choice: PostgreSQL (or Cassandra at extreme scale)
Why: Need durability, tweets are immutable after write
Follows (social graph):
Access: "Who does X follow?" and "Who follows X?"
Choice: PostgreSQL with proper indexes
Why: Two directional queries on the same table;
the composite PK (follower_id, followee_id) covers one direction,
a (followee_id) index covers the reverse
Likes:
Access: High write volume (~50K/sec), point lookups
Choice: PostgreSQL (or Redis for counters)
Why: Uniqueness constraint needed, counters can be in Redis
Timeline (precomputed read model):
Access: Range scans by (user_id, created_at), sub-200ms P99
Choice: Redis sorted set
Why: In-memory, O(log N) range queries on sorted sets,
ZRANGEBYSCORE makes cursor pagination trivial
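To make the Redis timeline read concrete, here's a small sketch of cursor pagination over a score-ordered timeline. Plain Python with bisect stands in for the sorted set; in Redis this would be a single reverse range-by-score call. Timestamps and tweet ids are illustrative:

```python
import bisect

# A user's timeline as Redis would hold it: entries scored by created_at,
# emulated here with an ascending Python list of (created_at, tweet_id).
timeline = [(100, "t1"), (200, "t2"), (300, "t3"), (400, "t4"), (500, "t5")]

def read_page(cursor_ts=None, limit=2):
    """Return (tweet_ids newest-first, next_cursor), strictly older than cursor_ts."""
    # index of the first entry at or after the cursor timestamp
    end = len(timeline) if cursor_ts is None else bisect.bisect_left(timeline, (cursor_ts,))
    page = timeline[max(0, end - limit):end][::-1]  # newest first
    next_cursor = page[-1][0] if page else None
    return [tid for _, tid in page], next_cursor

ids, cursor = read_page()        # first page: ['t5', 't4'], cursor 400
ids, cursor = read_page(cursor)  # second page: ['t3', 't2'], cursor 200
```

The cursor is just the timestamp of the oldest entry returned, which is exactly why cursor pagination stays correct even when new tweets land at the top between page fetches.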
For your interview: say "I'm using PostgreSQL as my primary store for Users, Tweets, Follows, and Likes because I need ACID guarantees and JOIN capability. For the timeline read path, I'm using Redis sorted sets because the P99 requirement is under 200ms and this is a pre-computed view, not source of truth." That's it. Clear reasoning, clear separation.
The key insight: storage selection is per-entity, not per-system. Saying "I'll use PostgreSQL" for everything is a junior answer. Saying "I'll use PostgreSQL for transactional data and Redis for the hot read path" shows you understand that different data has different access needs.
Step 4: Normalization vs Denormalization
This is the decision that separates mechanical schema design from thoughtful schema design. The honest answer: it depends on the access pattern. But "it depends" is not what you say in an interview. You name the specific tradeoff.
The decision framework
Normalize when:
- The data changes frequently (user profiles, settings)
- Consistency matters more than read speed (financial records, inventory)
- Multiple write paths update the same data (collaborative editing)
Denormalize when:
- The read path is orders of magnitude hotter than the write path (timelines, feeds)
- The data changes infrequently (author name on a tweet)
- Staleness is acceptable (display name can lag by minutes)
The Twitter denormalization decisions
Let me walk through three specific denormalization decisions and show how to present them:
Decision 1: Author name in timeline tweets.
The timeline response includes author_name and author_avatar. The normalized approach would be: store only author_id in the tweet, JOIN to the users table at read time. The denormalized approach: inline author_name and author_avatar into the cached timeline entry.
"I'm denormalizing author data into each timeline entry. If a user changes their display name, existing timeline entries will show the old name until the cache refreshes or the entry falls off the timeline (typically within hours). This is an acceptable staleness window for display names, and it eliminates a per-tweet JOIN on the hot read path."
Decision 2: Like count on tweets.
The normalized approach: SELECT COUNT(*) FROM likes WHERE tweet_id = ?. The denormalized approach: maintain a like_count column on the tweets table, incremented on each like.
"I'm pre-computing like_count as a counter on the tweet row. This avoids a COUNT aggregation on every timeline render. The counter may lag by a few seconds during high-concurrency bursts, but exact real-time counts are not a product requirement for likes."
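This counter pattern is worth one small sketch, using Python's sqlite3 in place of PostgreSQL (table names mirror the article; columns are trimmed to what the example needs). The like insert and the counter bump happen in one transaction, and a duplicate like leaves the counter untouched:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tweets (tweet_id TEXT PRIMARY KEY, like_count INT DEFAULT 0);
    CREATE TABLE likes (
        user_id  TEXT,
        tweet_id TEXT,
        PRIMARY KEY (user_id, tweet_id)   -- enforces one like per user
    );
    INSERT INTO tweets (tweet_id) VALUES ('t1');
""")

def like(user_id, tweet_id):
    with db:  # one transaction: both writes commit together or not at all
        cur = db.execute(
            "INSERT OR IGNORE INTO likes (user_id, tweet_id) VALUES (?, ?)",
            (user_id, tweet_id))
        if cur.rowcount:  # only bump the counter for a brand-new like
            db.execute(
                "UPDATE tweets SET like_count = like_count + 1 "
                "WHERE tweet_id = ?", (tweet_id,))

like("alice", "t1")
like("alice", "t1")  # duplicate: ignored, counter stays put
like("bob", "t1")
count = db.execute(
    "SELECT like_count FROM tweets WHERE tweet_id = 't1'").fetchone()[0]
```

At Twitter's write volume you'd move the increment to Redis or batch it, but the invariant is the same: the counter only moves when the uniqueness check passes.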
Decision 3: Follower and following counts on user profiles.
Same pattern: pre-computed counters instead of SELECT COUNT(*) FROM follows WHERE follower_id = ? on every profile view.
"Both follower_count and following_count are pre-computed counters on the user row. Updated on follow/unfollow. The alternative (counting rows in the follows table on each profile view) doesn't scale when a user has millions of followers."
Notice the pattern: every denormalization decision names what's duplicated, what the staleness risk is, and why the performance benefit justifies it. That's what the interviewer wants to hear.
The denormalization sentence template
"I'm denormalizing [X data] into [Y location] for [Z read performance benefit]. The tradeoff is [staleness/update complexity]. This is acceptable because [product reasoning]." Use this template every time you make a denormalization decision in an interview.
Indexing Strategy
Indexes are the most overlooked part of schema design in interviews. Candidates define tables, skip indexes entirely, and then can't explain why their queries would be fast.
The rule is simple: every field that appears in a WHERE clause or ORDER BY needs an index. Everything else does not.
Twitter indexes
-- Users table
CREATE TABLE users (
user_id UUID PRIMARY KEY,
username VARCHAR(30) UNIQUE, -- index: lookup by username
display_name VARCHAR(100),
bio TEXT,
avatar_url TEXT,
follower_count INT DEFAULT 0,
following_count INT DEFAULT 0
);
-- Tweets table
CREATE TABLE tweets (
tweet_id UUID PRIMARY KEY,
author_id UUID REFERENCES users(user_id),
content VARCHAR(280),
created_at TIMESTAMP DEFAULT NOW(),
like_count INT DEFAULT 0
);
CREATE INDEX idx_tweets_author_time
ON tweets (author_id, created_at DESC);
-- Supports: "all tweets by user X, newest first"
-- Follows table
CREATE TABLE follows (
follower_id UUID REFERENCES users(user_id),
followee_id UUID REFERENCES users(user_id),
created_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (follower_id, followee_id)
);
CREATE INDEX idx_follows_followee
ON follows (followee_id);
-- Supports: "who follows user X?" (reverse lookup)
-- Likes table
CREATE TABLE likes (
user_id UUID REFERENCES users(user_id),
tweet_id UUID REFERENCES tweets(tweet_id),
created_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (user_id, tweet_id)
);
CREATE INDEX idx_likes_tweet
ON likes (tweet_id);
-- Supports: "all likes on tweet X" and counting
Every index has a comment explaining what query it supports. No guessing, no "index everything just in case."
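To see the two-directional follows queries in action, here is a hedged sketch in Python's sqlite3 (standing in for PostgreSQL purely for illustration): the composite primary key serves "who does X follow?" and idx_follows_followee serves the reverse lookup.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE follows (
        follower_id TEXT,
        followee_id TEXT,
        PRIMARY KEY (follower_id, followee_id)
    );
    CREATE INDEX idx_follows_followee ON follows (followee_id);
    INSERT INTO follows VALUES ('alice','bob'), ('alice','carol'), ('dave','bob');
""")

def following(user_id):
    """Forward lookup -- served by the composite primary key."""
    rows = db.execute(
        "SELECT followee_id FROM follows WHERE follower_id = ? "
        "ORDER BY followee_id", (user_id,))
    return [r[0] for r in rows]

def followers(user_id):
    """Reverse lookup -- served by idx_follows_followee."""
    rows = db.execute(
        "SELECT follower_id FROM follows WHERE followee_id = ? "
        "ORDER BY follower_id", (user_id,))
    return [r[0] for r in rows]
```

Running EXPLAIN QUERY PLAN (or EXPLAIN in Postgres) on each query is a quick way to confirm both directions hit an index rather than a table scan.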
Common Mistakes
Designing schema before API. The number one mistake. You end up with a normalized schema that can't serve your hottest query. Always trace from API to access pattern to schema.
One database for everything. "I'll use PostgreSQL" is not a storage strategy. Different entities have different access patterns. The timeline is a hot read path that belongs in Redis. The social graph might outgrow a single Postgres instance. Name the storage per entity.
Over-normalizing. Third normal form is a starting point, not a finish line. If your read path requires three JOINs to assemble a timeline entry, you've normalized past the point of usefulness. Denormalize the hot read path and keep the source of truth normalized.
No indexes mentioned. Defining a table without indexes is like designing a highway without on-ramps. Say which columns are indexed and why. The "why" is always: "because this column appears in the WHERE clause of my hottest query."
Ignoring derived data. Counts (followers, likes, views) that are displayed on every page load should be pre-computed counters, not COUNT(*) queries. If you're aggregating on every read, you're doing work that should have been done on write.
Skipping the timeline problem. For any feed-based system (Twitter, Instagram, LinkedIn), the home timeline query is the critical design challenge. A naive JOIN across follows and posts doesn't scale. You need to address fan-out (write-time or read-time) and pre-computed timelines.
How This Shows Up in Interviews
At junior/mid level: The interviewer expects you to define 3 to 4 entities with their columns and primary keys. Showing one or two indexes and explaining why they exist is a strong signal. You don't need to discuss storage selection in depth.
At senior level: The interviewer expects you to map access patterns explicitly, choose storage per entity, and make deliberate denormalization decisions with stated tradeoffs. They'll probe: "Why did you put the timeline in Redis instead of PostgreSQL?" You need a concrete answer rooted in access patterns and latency requirements.
At staff+ level: The interviewer expects you to discuss data lifecycle: "This counter is eventually consistent, refreshed every 5 seconds. The source of truth is in PostgreSQL, but the hot read path is in Redis with a 1-minute TTL. On cache miss, we fall through to the database." You're expected to reason about consistency windows, cache invalidation, and what happens when the denormalized data diverges from the source of truth.
The common mistake at all levels: spending too long on the schema. Your schema section should take 3 to 5 minutes. Define entities, name the access patterns, pick storage, state one or two denormalization decisions, and move on to the architecture discussion.
Quick Recap
- Schema comes after API, always. Your API endpoints define the access patterns. Design the schema to serve those patterns efficiently, not to satisfy normalization theory.
- Use the 4-step process: extract entities from API nouns, map each endpoint to a query pattern, choose storage technology per entity, and make deliberate normalization decisions.
- Different entities need different storage. SQL for transactions, Redis for hot reads, Elasticsearch for search. Saying "I'll use one PostgreSQL database" is a junior answer.
- Normalization is the default for write paths. Denormalization is the answer for read-heavy hot paths. Name the tradeoff explicitly: what's duplicated, what's the staleness risk, and why the benefit justifies the cost.
- Index every column that appears in a WHERE or ORDER BY clause. No more, no less. Comment each index with the query it supports.
- Pre-compute derived data (counters, aggregates, timelines) instead of computing them on every read. Move the work from the read path to the write path.
- State your schema decisions as tradeoffs, not as absolutes. "I chose X because of Y, accepting Z" is the formula that signals senior-level thinking.