Schema design approach
How to design a data schema in a system design interview: identify the entities, map the access patterns, pick storage, and make deliberate normalization decisions.
TL;DR
- Schema comes after API, never before. Your API endpoints define the access patterns, and access patterns drive every schema decision.
- Use the 4-step process: identify entities (nouns from your API), map access patterns (trace the query behind each endpoint), choose storage technology, and decide normalization level.
- Normalization vs denormalization is not a religious debate. Read-heavy paths get denormalized data for speed. Write-heavy paths stay normalized for consistency. Name the tradeoff explicitly.
- Index every field that appears in a WHERE clause or ORDER BY. If you can't explain why a field is indexed, you probably don't need the index.
- The schema is not one database. Different entities often belong in different storage systems (SQL for transactions, Redis for hot reads, Elasticsearch for search). Say this out loud in your interview.
The Schema That Didn't Serve Any Query
Here's a scenario I see constantly in mock interviews. A candidate is designing a social media feed. They jump straight to schema design: "I'll have a users table, a posts table, a follows table, a likes table, all nicely normalized, third normal form." Beautiful relational design. Then the interviewer asks: "Show me how you'd query the home feed."
The candidate writes: SELECT * FROM posts JOIN follows ON posts.author_id = follows.followee_id WHERE follows.follower_id = ? ORDER BY posts.created_at DESC LIMIT 20. The interviewer nods and asks: "How fast is this query when a user follows 500 people and each has posted 10,000 tweets?"
The answer: it's a disaster. The JOIN scans millions of rows. There's no index that makes this efficient. The normalized schema is technically correct but operationally useless for the actual query the system needs to serve.
The root cause: they designed the schema before defining the API. They had no idea what queries they needed to support, so they defaulted to "normalize everything." The schema looked clean on the whiteboard but couldn't answer the one question that mattered.
My recommendation: never start with the schema. Start with the API. Let the endpoints tell you what queries you need. Then design a schema that serves those queries efficiently.
Schema before API is backward
A normalized schema is a default, not a goal. If your schema can't efficiently serve your hottest query path, it's wrong regardless of how clean it looks. Always derive schema from access patterns, and access patterns come from the API.
The 4-Step Schema Process
After you've defined your API (covered in the previous article), you have everything you need to design the schema. Here's the process I walk through every time.
Let me walk through each step using the Twitter example from the API article.
Step 1: Identify Core Entities
Pull the nouns from your API responses. If you defined the Twitter API properly, the entities fall out immediately:
From POST /tweets request + response:
→ Tweet entity: tweet_id, author_id, content, media_ids[], created_at
From GET /timeline/home response:
→ User entity: user_id, username, display_name, avatar_url
→ Tweet entity (confirmed): includes like_count, author_name (denormalized?)
From POST /follows:
→ Follow entity: follower_id, followee_id, created_at
From POST /tweets/{id}/likes:
→ Like entity: user_id, tweet_id, created_at
From GET /users/{id}/profile:
→ User entity (confirmed): includes follower_count, following_count (derived)
That gives you four core entities:
| Entity | Key Fields | Notes |
|---|---|---|
| User | user_id (PK), username (unique), display_name, bio, avatar_url | follower_count and following_count are derived |
| Tweet | tweet_id (PK), author_id (FK), content, created_at | like_count is derived |
| Follow | follower_id + followee_id (composite PK), created_at | Represents the social graph |
| Like | user_id + tweet_id (composite PK), created_at | Used for liked_by_me in timeline |
For your interview: say "These are my four core entities, derived directly from the API nouns. Let me now trace how each API endpoint queries them."
What about the timeline?
The timeline isn't a core entity in the traditional sense. It's a derived, precomputed read model. This distinction matters: entities represent your source of truth, while read models represent your optimized query paths. I'll come back to timelines when we talk about denormalization.
Step 2: Map Access Patterns
This is where most candidates skip ahead, and it's where the best candidates shine. Take each API endpoint and trace the exact query it needs.
The critical query: GET /timeline/home
GET /timeline/home?cursor={token}&limit=20
Query requirement:
"Give me the 20 most recent tweets by people I follow,
starting after the cursor position."
Option A (normalized JOIN):
SELECT t.*, u.display_name, u.avatar_url
FROM tweets t
JOIN follows f ON t.author_id = f.followee_id
JOIN users u ON t.author_id = u.user_id
WHERE f.follower_id = :me
AND t.created_at < :cursor_timestamp
ORDER BY t.created_at DESC
LIMIT 20
Problem: this JOIN merges and sorts tweets from every followed
account on each read. With hundreds of followees posting
thousands of tweets each, it touches millions of rows.
Catastrophically slow at scale.
Option B (precomputed timeline):
SELECT tweet_id FROM timeline
WHERE user_id = :me
AND created_at < :cursor_timestamp
ORDER BY created_at DESC
LIMIT 20
Then hydrate: SELECT * FROM tweets WHERE tweet_id IN (...)
This is a cheap index range scan with a composite index on (user_id, created_at).
The cost moves to write time (fan-out on write).
Option B is the answer at Twitter scale. The access pattern analysis just told you that you need a separate timeline table (or cache), which is a precomputed denormalized read model.
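Option B's write side, fan-out on write, can be sketched in a few lines. This is an in-memory illustration only; in production the per-user timelines would live in Redis and you'd cap their length. All names here are made up for the example:

```python
from collections import defaultdict

# In-memory stand-ins for the follows table and per-user timelines.
followers_of = defaultdict(set)   # author_id -> set of follower ids
timelines = defaultdict(list)     # user_id -> list of (created_at, tweet_id)

def follow(follower_id, followee_id):
    followers_of[followee_id].add(follower_id)

def post_tweet(author_id, tweet_id, created_at):
    """Fan-out on write: push the new tweet into every follower's timeline."""
    for follower_id in followers_of[author_id]:
        timelines[follower_id].append((created_at, tweet_id))

def read_timeline(user_id, limit=20):
    """The read is now a cheap slice of a precomputed list -- no JOIN."""
    return [tid for _, tid in sorted(timelines[user_id], reverse=True)[:limit]]

follow("alice", "bob")
follow("carol", "bob")
post_tweet("bob", "t1", 100)
post_tweet("bob", "t2", 200)
print(read_timeline("alice"))  # newest first: ['t2', 't1']
```

Note where the cost went: posting now does O(followers) writes, which is exactly the celebrity-user problem you'd raise next in the interview.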
Other access patterns
| API Endpoint | Query Pattern | Index Needed |
|---|---|---|
| POST /tweets | INSERT into tweets | None (write) |
| GET /timeline/home | Range scan on timeline by (user_id, created_at) | Composite: (user_id, created_at DESC) |
| POST /follows | INSERT into follows, check uniqueness | Unique: (follower_id, followee_id) |
| POST /likes | INSERT into likes, check uniqueness, increment counter | Unique: (user_id, tweet_id) |
| GET /users/{id}/profile | Point lookup by user_id | PK: user_id |
| liked_by_me field | Point lookup: does (me, tweet_id) exist? | Covered by likes PK |
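The liked_by_me row deserves a concrete sketch: rather than issuing 20 point lookups per page, batch the check into one IN query against the likes primary key. A minimal illustration using Python's sqlite3 as a stand-in store (table and column names mirror the article's schema):

```python
import sqlite3

# Stand-in likes table with the composite PK from the article.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE likes (user_id TEXT, tweet_id TEXT,
                        PRIMARY KEY (user_id, tweet_id));
    INSERT INTO likes VALUES ('me','t1'), ('me','t3'), ('other','t2');
""")

def liked_by_me(user_id, page_tweet_ids):
    """One batched existence check for all tweets on the page (no N+1)."""
    placeholders = ",".join("?" * len(page_tweet_ids))
    rows = db.execute(
        f"SELECT tweet_id FROM likes "
        f"WHERE user_id = ? AND tweet_id IN ({placeholders})",
        (user_id, *page_tweet_ids))
    liked = {r[0] for r in rows}
    return {tid: tid in liked for tid in page_tweet_ids}
```

Each lookup is a prefix match on the (user_id, tweet_id) primary key, so the whole page resolves in one indexed query.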
I always tell candidates: "Your index strategy writes itself once you've mapped the access patterns. Every WHERE clause and ORDER BY column needs an index. Everything else doesn't."
The access pattern table is your interview cheat code
Drawing a table like the one above takes 60 seconds and immediately shows the interviewer that you understand how queries hit storage. It also makes your indexing decisions self-evident. No guessing, no cargo-culting indexes on every column.
Step 3: Choose Storage Technology
Different entities have different access patterns, and different access patterns need different storage systems. This is where you stop pretending everything goes in one PostgreSQL database.
The storage selection framework
| Question | If Yes | If No |
|---|---|---|
| Need ACID transactions? | SQL (PostgreSQL, MySQL) | Consider NoSQL |
| Need JOINs across entities? | SQL, or denormalize | NoSQL is fine |
| Write volume over 50K/sec? | Consider NoSQL or partitioned SQL | SQL handles it |
| Need sub-millisecond reads? | Redis or in-memory cache | SQL with index is fine |
| Need full-text search? | Elasticsearch | SQL LIKE is fine for small data |
| Need time-series queries? | TimescaleDB or Cassandra | Standard SQL |
Twitter storage map
Users:
Access: Point lookups, infrequent writes
Choice: PostgreSQL
Why: ACID for profile updates, simple point lookups by PK
Tweets:
Access: High write volume (~5K/sec), point lookups by tweet_id
Choice: PostgreSQL (or Cassandra at extreme scale)
Why: Need durability, tweets are immutable after write
Follows (social graph):
Access: "Who does X follow?" and "Who follows X?"
Choice: PostgreSQL with proper indexes
Why: Two directional queries on the same table;
the composite PK (follower_id, followee_id) covers one direction,
a (followee_id) index covers the reverse
Likes:
Access: High write volume (~50K/sec), point lookups
Choice: PostgreSQL (or Redis for counters)
Why: Uniqueness constraint needed, counters can be in Redis
Timeline (precomputed read model):
Access: Range scans by (user_id, created_at), sub-200ms P99
Choice: Redis sorted set
Why: In-memory, O(log N) range queries on sorted sets,
ZRANGEBYSCORE makes cursor pagination trivial
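To make the Redis timeline read concrete, here's a small sketch of cursor pagination over a score-ordered timeline. Plain Python with bisect stands in for the sorted set; in Redis this would be a single reverse range-by-score call. Timestamps and tweet ids are illustrative:

```python
import bisect

# A user's timeline as Redis would hold it: entries scored by created_at,
# emulated here with an ascending Python list of (created_at, tweet_id).
timeline = [(100, "t1"), (200, "t2"), (300, "t3"), (400, "t4"), (500, "t5")]

def read_page(cursor_ts=None, limit=2):
    """Return (tweet_ids newest-first, next_cursor), strictly older than cursor_ts."""
    # index of the first entry at or after the cursor timestamp
    end = len(timeline) if cursor_ts is None else bisect.bisect_left(timeline, (cursor_ts,))
    page = timeline[max(0, end - limit):end][::-1]  # newest first
    next_cursor = page[-1][0] if page else None
    return [tid for _, tid in page], next_cursor

ids, cursor = read_page()        # first page: ['t5', 't4'], cursor 400
ids, cursor = read_page(cursor)  # second page: ['t3', 't2'], cursor 200
```

The cursor is just the timestamp of the oldest entry returned, which is exactly why cursor pagination stays correct even when new tweets land at the top between page fetches.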
For your interview: say "I'm using PostgreSQL as my primary store for Users, Tweets, Follows, and Likes because I need ACID guarantees and JOIN capability. For the timeline read path, I'm using Redis sorted sets because the P99 requirement is under 200ms and this is a pre-computed view, not source of truth." That's it. Clear reasoning, clear separation.
The key insight: storage selection is per-entity, not per-system. Saying "I'll use PostgreSQL" for everything is a junior answer. Saying "I'll use PostgreSQL for transactional data and Redis for the hot read path" shows you understand that different data has different access needs.
Step 4: Normalization vs Denormalization
This is the decision that separates mechanical schema design from thoughtful schema design. The honest answer: it depends on the access pattern. But "it depends" is not what you say in an interview. You name the specific tradeoff.
The decision framework
Normalize when:
- The data changes frequently (user profiles, settings)
- Consistency matters more than read speed (financial records, inventory)
- Multiple write paths update the same data (collaborative editing)
Denormalize when:
- The read path is orders of magnitude hotter than the write path (timelines, feeds)
- The data changes infrequently (author name on a tweet)
- Staleness is acceptable (display name can lag by minutes)
The Twitter denormalization decisions
Let me walk through three specific denormalization decisions and show how to present them:
Decision 1: Author name in timeline tweets.
The timeline response includes author_name and author_avatar. The normalized approach would be: store only author_id in the tweet, JOIN to the users table at read time. The denormalized approach: inline author_name and author_avatar into the cached timeline entry.
"I'm denormalizing author data into each timeline entry. If a user changes their display name, existing timeline entries will show the old name until the cache refreshes or the entry falls off the timeline (typically within hours). This is an acceptable staleness window for display names, and it eliminates a per-tweet JOIN on the hot read path."
Decision 2: Like count on tweets.
The normalized approach: SELECT COUNT(*) FROM likes WHERE tweet_id = ?. The denormalized approach: maintain a like_count column on the tweets table, incremented on each like.
"I'm pre-computing like_count as a counter on the tweet row. This avoids a COUNT aggregation on every timeline render. The counter may lag by a few seconds during high-concurrency bursts, but exact real-time counts are not a product requirement for likes."
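This counter pattern is worth one small sketch, using Python's sqlite3 in place of PostgreSQL (table names mirror the article; columns are trimmed to what the example needs). The like insert and the counter bump happen in one transaction, and a duplicate like leaves the counter untouched:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tweets (tweet_id TEXT PRIMARY KEY, like_count INT DEFAULT 0);
    CREATE TABLE likes (
        user_id  TEXT,
        tweet_id TEXT,
        PRIMARY KEY (user_id, tweet_id)   -- enforces one like per user
    );
    INSERT INTO tweets (tweet_id) VALUES ('t1');
""")

def like(user_id, tweet_id):
    with db:  # one transaction: both writes commit together or not at all
        cur = db.execute(
            "INSERT OR IGNORE INTO likes (user_id, tweet_id) VALUES (?, ?)",
            (user_id, tweet_id))
        if cur.rowcount:  # only bump the counter for a brand-new like
            db.execute(
                "UPDATE tweets SET like_count = like_count + 1 "
                "WHERE tweet_id = ?", (tweet_id,))

like("alice", "t1")
like("alice", "t1")  # duplicate: ignored, counter stays put
like("bob", "t1")
count = db.execute(
    "SELECT like_count FROM tweets WHERE tweet_id = 't1'").fetchone()[0]
```

At Twitter's write volume you'd move the increment to Redis or batch it, but the invariant is the same: the counter only moves when the uniqueness check passes.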
Decision 3: Follower and following counts on user profiles.
Same pattern: pre-computed counters instead of SELECT COUNT(*) FROM follows WHERE follower_id = ? on every profile view.
"Both follower_count and following_count are pre-computed counters on the user row. Updated on follow/unfollow. The alternative (counting rows in the follows table on each profile view) doesn't scale when a user has millions of followers."
Notice the pattern: every denormalization decision names what's duplicated, what the staleness risk is, and why the performance benefit justifies it. That's what the interviewer wants to hear.
The denormalization sentence template
"I'm denormalizing [X data] into [Y location] for [Z read performance benefit]. The tradeoff is [staleness/update complexity]. This is acceptable because [product reasoning]." Use this template every time you make a denormalization decision in an interview.
Indexing Strategy
Indexes are the most overlooked part of schema design in interviews. Candidates define tables, skip indexes entirely, and then can't explain why their queries would be fast.
The rule is simple: every field that appears in a WHERE clause or ORDER BY needs an index. Everything else does not.
Twitter indexes
-- Users table
CREATE TABLE users (
user_id UUID PRIMARY KEY,
username VARCHAR(30) UNIQUE, -- index: lookup by username
display_name VARCHAR(100),
bio TEXT,
avatar_url TEXT,
follower_count INT DEFAULT 0,
following_count INT DEFAULT 0
);
-- Tweets table
CREATE TABLE tweets (
tweet_id UUID PRIMARY KEY,
author_id UUID REFERENCES users(user_id),
content VARCHAR(280),
created_at TIMESTAMP DEFAULT NOW(),
like_count INT DEFAULT 0
);
CREATE INDEX idx_tweets_author_time
ON tweets (author_id, created_at DESC);
-- Supports: "all tweets by user X, newest first"
-- Follows table
CREATE TABLE follows (
follower_id UUID REFERENCES users(user_id),
followee_id UUID REFERENCES users(user_id),
created_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (follower_id, followee_id)
);
CREATE INDEX idx_follows_followee
ON follows (followee_id);
-- Supports: "who follows user X?" (reverse lookup)
-- Likes table
CREATE TABLE likes (
user_id UUID REFERENCES users(user_id),
tweet_id UUID REFERENCES tweets(tweet_id),
created_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (user_id, tweet_id)
);
CREATE INDEX idx_likes_tweet
ON likes (tweet_id);
-- Supports: "all likes on tweet X" and counting
Every index has a comment explaining what query it supports. No guessing, no "index everything just in case."
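To see the two-directional follows queries in action, here is a hedged sketch in Python's sqlite3 (standing in for PostgreSQL purely for illustration): the composite primary key serves "who does X follow?" and idx_follows_followee serves the reverse lookup.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE follows (
        follower_id TEXT,
        followee_id TEXT,
        PRIMARY KEY (follower_id, followee_id)
    );
    CREATE INDEX idx_follows_followee ON follows (followee_id);
    INSERT INTO follows VALUES ('alice','bob'), ('alice','carol'), ('dave','bob');
""")

def following(user_id):
    """Forward lookup -- served by the composite primary key."""
    rows = db.execute(
        "SELECT followee_id FROM follows WHERE follower_id = ? "
        "ORDER BY followee_id", (user_id,))
    return [r[0] for r in rows]

def followers(user_id):
    """Reverse lookup -- served by idx_follows_followee."""
    rows = db.execute(
        "SELECT follower_id FROM follows WHERE followee_id = ? "
        "ORDER BY follower_id", (user_id,))
    return [r[0] for r in rows]
```

Running EXPLAIN QUERY PLAN (or EXPLAIN in Postgres) on each query is a quick way to confirm both directions hit an index rather than a table scan.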
Common Mistakes
Designing schema before API. The number one mistake. You end up with a normalized schema that can't serve your hottest query. Always trace from API to access pattern to schema.
One database for everything. "I'll use PostgreSQL" is not a storage strategy. Different entities have different access patterns. The timeline is a hot read path that belongs in Redis. The social graph might outgrow a single Postgres instance. Name the storage per entity.
Over-normalizing. Third normal form is a starting point, not a finish line. If your read path requires three JOINs to assemble a timeline entry, you've normalized past the point of usefulness. Denormalize the hot read path and keep the source of truth normalized.
No indexes mentioned. Defining a table without indexes is like designing a highway without on-ramps. Say which columns are indexed and why. The "why" is always: "because this column appears in the WHERE clause of my hottest query."
Ignoring derived data. Counts (followers, likes, views) that are displayed on every page load should be pre-computed counters, not COUNT(*) queries. If you're aggregating on every read, you're doing work that should have been done on write.
Skipping the timeline problem. For any feed-based system (Twitter, Instagram, LinkedIn), the home timeline query is the critical design challenge. A naive JOIN across follows and posts doesn't scale. You need to address fan-out (write-time or read-time) and pre-computed timelines.
How This Shows Up in Interviews
At junior/mid level: The interviewer expects you to define 3 to 4 entities with their columns and primary keys. Showing one or two indexes and explaining why they exist is a strong signal. You don't need to discuss storage selection in depth.
At senior level: The interviewer expects you to map access patterns explicitly, choose storage per entity, and make deliberate denormalization decisions with stated tradeoffs. They'll probe: "Why did you put the timeline in Redis instead of PostgreSQL?" You need a concrete answer rooted in access patterns and latency requirements.
At staff+ level: The interviewer expects you to discuss data lifecycle: "This counter is eventually consistent, refreshed every 5 seconds. The source of truth is in PostgreSQL, but the hot read path is in Redis with a 1-minute TTL. On cache miss, we fall through to the database." You're expected to reason about consistency windows, cache invalidation, and what happens when the denormalized data diverges from the source of truth.
The common mistake at all levels: spending too long on the schema. Your schema section should take 3 to 5 minutes. Define entities, name the access patterns, pick storage, state one or two denormalization decisions, and move on to the architecture discussion.
Quick Recap
- Schema comes after API, always. Your API endpoints define the access patterns. Design the schema to serve those patterns efficiently, not to satisfy normalization theory.
- Use the 4-step process: extract entities from API nouns, map each endpoint to a query pattern, choose storage technology per entity, and make deliberate normalization decisions.
- Different entities need different storage. SQL for transactions, Redis for hot reads, Elasticsearch for search. Saying "I'll use one PostgreSQL database" is a junior answer.
- Normalization is the default for write paths. Denormalization is the answer for read-heavy hot paths. Name the tradeoff explicitly: what's duplicated, what's the staleness risk, and why the benefit justifies the cost.
- Index every column that appears in a WHERE or ORDER BY clause. No more, no less. Comment each index with the query it supports.
- Pre-compute derived data (counters, aggregates, timelines) instead of computing them on every read. Move the work from the read path to the write path.
- State your schema decisions as tradeoffs, not as absolutes. "I chose X because of Y, accepting Z" is the formula that signals senior-level thinking.