Transformer architecture
Understand how the transformer's encoder-decoder structure, positional encoding, and residual connections work together, and why this architecture has dominated AI since 2017.
TL;DR
- The transformer replaced RNNs in 2017 by using attention instead of recurrence, enabling full parallelism during training.
- A single transformer block is: multi-head attention, Add and LayerNorm, feed-forward network, Add and LayerNorm. Stack N of these (12 for BERT-base, 96 for GPT-3, 80 for Llama 3 70B).
- Residual connections (x + sublayer(x)) solve the vanishing gradient problem and are why 96-layer models train at all.
- The FFN hidden dimension is 4x the model dimension. Roughly two-thirds of all parameters in GPT-3 live in FFN layers, and research suggests factual knowledge is stored there.
- Decoder-only (GPT, Claude, Llama) is the dominant paradigm for LLMs in 2026. Know all three variants (encoder-only, decoder-only, encoder-decoder) and when each fits.
- Knowing the transformer block structure cold is the single most useful architecture fact for AI system design interviews.
The problem it solves
Before 2017, sequence models meant recurrent neural networks. An LSTM processes a sentence token by token, updating a hidden state at each step. Token 47 cannot be processed until token 46 finishes, which means GPU parallelism barely helps.
The second problem is long-range dependencies. An LSTM might capture relationships across 100 tokens with difficulty, but at 1,000 tokens the hidden state has been overwritten so many times that early context is effectively lost. Researchers stacked bidirectional passes and attention on top of LSTMs, but the result was complex and still fundamentally sequential.
The third problem is training speed. A 1,000-token sequence requires 1,000 sequential forward steps through an RNN. A transformer processes all 1,000 tokens in one parallel pass, as a handful of large matrix multiplications per layer. On an A100 GPU, this difference means hours versus days for the same dataset.
I've seen teams in 2018 spend weeks tuning LSTM-based models that a transformer replaced in a single afternoon of training. The parallelism advantage is not theoretical; it's a 10-50x wall-clock speedup on modern GPUs.
The 2017 paper "Attention Is All You Need" (Vaswani et al., Google Brain) removed recurrence entirely. The architecture it introduced is the foundation of every major LLM today.
What is it?
The transformer is a neural network architecture that processes sequences using attention alone, with no recurrence or convolution.
Think of it as an assembly line in a factory. In the old factory (RNN), one worker handles each item sequentially, passing a note to the next worker. In the new factory (transformer), every worker sees every item at the same time, coordinates instantly through attention, and the entire batch moves through each station together. The stations are identical transformer blocks, stacked one after another.
The two key properties are full parallelism (every token processed simultaneously during training) and direct long-range dependency modeling (any token can attend to any other token in a single layer). A full model is: input embeddings, positional encoding, N identical transformer blocks, and a final projection to vocabulary logits.
Every major language model since 2018, from BERT to GPT-4 to Claude to Llama, is built on this architecture. The differences between models are in size, training data, and which variant of the transformer they use (encoder-only, decoder-only, or encoder-decoder), not in the fundamental block structure.
The scale is worth internalizing. BERT-base has 12 blocks and 110M parameters. GPT-3 has 96 blocks and 175B parameters. Llama 3 70B has 80 blocks and 70B parameters. Same block, different repetition counts and dimensions. The architecture is deceptively simple once you understand the single block.
How it works
The transformer block
This is the structure you need to memorize. Every modern LLM is built by stacking this block repeatedly.
The diagram above shows the Pre-LN variant used by modern models (Llama, Mistral, GPT-3). The original 2017 paper placed LayerNorm after the residual add (Post-LN), but Pre-LN has become the standard because it trains more stably at depth.
A complete model stacks N of these blocks. BERT-base uses 12, GPT-3 uses 96, and Llama 3 8B uses 32. Then a final linear layer projects the output to vocabulary size and softmax produces token probabilities.
Residual connections
Every sublayer adds its input directly to its output: output = x + sublayer(x). This single design choice is why deep transformers train at all.
Without residual connections, gradients must flow through every layer during backpropagation. In a 96-layer model, the gradient signal is multiplied by 96 Jacobian matrices. If those matrices consistently shrink the signal (singular values below 1), the gradient decays exponentially and early layers receive near-zero updates. This is the vanishing gradient problem, and it kills training in deep networks.
Residuals create a "shortcut highway" where gradients flow backward through the addition operation unchanged. Even if a layer is poorly initialized, the identity path (x) passes through unmodified, and the layer only needs to learn a small correction (the delta sublayer(x)). In my experience, this is the single most important architectural decision in transformers, yet candidates rarely mention it in interviews.
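The gradient argument above can be sketched with a toy scalar model. This is an illustration, not a real network: it assumes each sublayer has a small local derivative (magnitude 0.05, alternating sign), and the function names are made up for this example. Without residuals the end-to-end gradient is the product of those derivatives; with residuals each factor becomes 1 + derivative, because of the identity path.

```python
# Toy scalar model: each sublayer contributes a local derivative s_i.
# Assumption for illustration only: |s_i| = 0.05, alternating sign, 96 layers.
sublayer_derivs = [0.05 if i % 2 == 0 else -0.05 for i in range(96)]


def grad_plain(derivs):
    """Gradient through x -> f_n(...f_1(x)): product of local derivatives."""
    g = 1.0
    for s in derivs:
        g *= s
    return g


def grad_residual(derivs):
    """Gradient through x -> x + f(x), repeated: product of (1 + s_i)."""
    g = 1.0
    for s in derivs:
        g *= 1.0 + s
    return g


plain = grad_plain(sublayer_derivs)
residual = grad_residual(sublayer_derivs)
print(f"plain stack: {plain:.3e}   residual stack: {residual:.3f}")
```

In this toy setup the plain stack's gradient collapses toward zero while the residual stack's gradient stays close to 1. Real Jacobians are matrices, so treat this as intuition only.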
Layer normalization
Layer normalization stabilizes activations by normalizing across the feature dimension within each token. Two variants exist:
Post-LN (original 2017 paper): normalize after the residual add. This places the normalization on the main residual path, which can interfere with gradient flow at depth. Models trained with Post-LN often require careful learning rate warmup and sometimes fail to converge past 48 layers.
Pre-LN (GPT-3, Llama, Mistral): normalize before each sublayer, keeping the residual path clean. The residual connection carries unnormalized values, which preserves gradient magnitude. Pre-LN trains more stably, especially beyond 32 layers, and is the default for all modern architectures.
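A minimal sketch of the two orderings, using a single token vector and omitting the learned gain and bias that a real LayerNorm has. The function names and the zero-output sublayer are assumptions made for illustration: with a sublayer that contributes nothing, Pre-LN passes the input through untouched, while Post-LN still rewrites it.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector across its features (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln(x, sublayer):
    """Original 2017 ordering: LayerNorm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln(x, sublayer):
    """Modern ordering: x + sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def zero_sublayer(x):
    """Hypothetical sublayer that contributes nothing."""
    return [0.0] * len(x)


x = [1.0, 2.0, 4.0, 8.0]
pre_out = pre_ln(x, zero_sublayer)    # identical to x: the residual path is clean
post_out = post_ln(x, zero_sublayer)  # normalized: the residual path was rewritten
print(pre_out, post_out)
```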
Interview tip: Pre-LN vs Post-LN
If asked why modern transformers are deeper than the original, mention Pre-LN. "The original paper used Post-LN, which struggles past 48 layers. Modern models use Pre-LN, which keeps the residual path clean and allows stable training at 96+ layers." This signals you understand the architecture beyond surface level.
The feed-forward network
After each attention layer, a two-layer MLP processes each token independently. The hidden dimension is typically 4x the model dimension: a 4096-dim model has a 16,384-dim FFN hidden layer. The total parameter count is dominated by these FFN layers.
In GPT-3, approximately two-thirds of all 175 billion parameters are in FFN layers. Research by Geva et al. (2021) suggests FFN layers act as "key-value memories" where individual neurons activate for specific factual patterns (like "capital of France" triggering neurons associated with "Paris"). Whether this fully explains factual storage is debated, but the empirical pattern is consistent.
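Modern models like Llama 3 and Mistral replace the standard GELU activation with SwiGLU, a gated linear unit that empirically produces slightly better results. The tradeoff: SwiGLU adds a third weight matrix to the FFN, so the original formulation shrinks the inner dimension from 4x to roughly 8/3 x model_dim to keep parameter count equivalent. Models are free to pick a different ratio: Llama 3 8B and Mistral 7B both use 14,336, which is 3.5x their 4,096 model dimension. This is a detail that comes up in staff-level interviews.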
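The parameter arithmetic in this section can be checked with a back-of-the-envelope sketch. It assumes standard multi-head attention (four d x d projection matrices), ignores biases, LayerNorm, and embeddings, and uses hypothetical helper names.

```python
def attention_params(d_model):
    """Four d_model x d_model projections: W_q, W_k, W_v, W_o (standard MHA)."""
    return 4 * d_model * d_model


def ffn_params(d_model, d_ff, n_matrices=2):
    """n_matrices weight matrices of shape d_model x d_ff (biases ignored)."""
    return n_matrices * d_model * d_ff


# GPT-3 dimensions quoted in this article: 96 layers, d_model = 12288, 4x FFN.
d = 12288
attn = attention_params(d)
ffn = ffn_params(d, 4 * d)
ffn_share = ffn / (attn + ffn)    # fraction of block parameters in the FFN
approx_total = 96 * (attn + ffn)  # excludes embeddings, biases, norms

# SwiGLU: three matrices at roughly 8/3 x d_model vs two matrices at 4 x d_model.
standard = ffn_params(4096, 4 * 4096, n_matrices=2)
swiglu = ffn_params(4096, 4096 * 8 // 3, n_matrices=3)

print(f"FFN share per block: {ffn_share:.3f}")
print(f"approx GPT-3 block parameters: {approx_total / 1e9:.1f}B")
print(f"SwiGLU / standard FFN parameters: {swiglu / standard:.4f}")
```

The same 12 x d_model^2 per-block estimate is a quick way to sanity-check any quoted parameter count from a layer count and hidden dimension.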
Positional encoding
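Transformers process all tokens simultaneously and have no built-in notion of order. Without positional encoding (and without a causal mask), self-attention is permutation-equivariant: "the cat sat on the mat" and "mat the on sat cat the" yield the same per-token representations, just in a different order, so word order carries no information. Three approaches exist: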
Sinusoidal (original 2017): deterministic sine and cosine waves at different frequencies for each position. No learned parameters. Generalizes poorly beyond training length.
Learned absolute (BERT): a separate learned embedding for each position index, added to the token embedding. Simple but completely fails at positions unseen during training.
Rotary Position Embedding (RoPE): encodes relative position directly into query and key vectors by rotating them before computing attention. Used by Llama, Mistral, and most modern open-weight models. Generalizes significantly better to longer sequences.
The trend is clear: nearly every major open-weight model released since 2023 uses RoPE. If you are choosing a base model for fine-tuning, prefer architectures with RoPE for maximum flexibility on context length.
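To make the "relative position" property of RoPE concrete, here is a minimal sketch that rotates consecutive pairs of features by a position-dependent angle. The pairing scheme, the base of 10000, and the function names are assumptions for illustration; real implementations differ in layout details. The point to observe is that the query-key score depends only on the offset between positions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of vec by angles proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at two very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(near, far)
```

Both scores agree up to floating-point error, which is why the attention pattern is a function of relative rather than absolute position.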
RoPE and context extension
A model trained with RoPE up to 4,096 tokens can be extended to longer contexts using techniques like YaRN or positional interpolation. But performance typically degrades beyond the training length. Don't assume a model handles 128K tokens well just because it claims to. Always check benchmarks like RULER or LongBench.
I've seen production systems fail silently because teams assumed a model with a 128K context window performed equally well at position 100K as at position 4K. Always benchmark your actual use case at the actual input lengths you expect.
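As a rough sketch of the idea behind positional interpolation: position indices are scaled down linearly so that the extended range maps back into the range seen during training, before the RoPE rotation is applied. The function name is hypothetical, and YaRN uses a more elaborate, frequency-dependent scaling that is not shown here.

```python
def interpolate_position(pos, train_len, target_len):
    """Linearly squeeze positions from [0, target_len) into [0, train_len)."""
    if target_len <= train_len:
        return float(pos)  # no extension needed
    return pos * train_len / target_len


# Extending a 4,096-token model to 8,192 tokens: the last position lands
# inside the trained range, at the cost of halving the spacing between tokens.
last = interpolate_position(8191, train_len=4096, target_len=8192)
print(last)
```

The compressed spacing is one reason quality can degrade at extended lengths, which is why benchmarking at real input lengths matters.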
Forward pass walkthrough
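Here is what happens when a single input passes through one Pre-LN transformer block, in order. The input is normalized, multi-head attention mixes information across tokens, and the result is added back to the input (first residual). That sum is normalized again, the FFN transforms each token independently, and the result is added back (second residual).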
Each block repeats this exact flow. Llama 3 8B stacks 32 of these blocks. GPT-3 stacks 96. The block structure never changes; only the number of layers and the dimensions within each block vary.
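The flow of one block can be sketched in plain Python. This is a simplified illustration under stated assumptions: a single attention head instead of multi-head, a ReLU FFN, no learned LayerNorm gain or bias, random toy weights, and hypothetical names throughout. It is not how any production model is implemented.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def layer_norm(rows, eps=1e-5):
    out = []
    for r in rows:
        mean = sum(r) / len(r)
        var = sum((v - mean) ** 2 for v in r) / len(r)
        out.append([(v - mean) / math.sqrt(var + eps) for v in r])
    return out


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(x, p):
    """Single-head self-attention; token i only sees tokens 0..i."""
    q, k, v = matmul(x, p["wq"]), matmul(x, p["wk"]), matmul(x, p["wv"])
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = []
    for i, qi in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(i + 1)]
        weights.append(softmax(scores) + [0.0] * (len(x) - i - 1))
    return matmul(matmul(weights, v), p["wo"])


def ffn(x, p):
    """Position-wise MLP: expand to d_ff, ReLU, project back to d_model."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, p["w1"])]
    return matmul(hidden, p["w2"])


def pre_ln_block(x, p):
    x = add(x, causal_attention(layer_norm(x), p))  # norm -> attention -> residual
    x = add(x, ffn(layer_norm(x), p))               # norm -> FFN -> residual
    return x


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_ff, seq_len = 4, 16, 3  # toy sizes, FFN at 4x width
params = {name: random_matrix(rng, d_model, d_model) for name in ("wq", "wk", "wv", "wo")}
params["w1"] = random_matrix(rng, d_model, d_ff)
params["w2"] = random_matrix(rng, d_ff, d_model)
tokens = random_matrix(rng, seq_len, d_model)
out = pre_ln_block(tokens, params)
print(len(out), len(out[0]))
```

The output has the same shape as the input, which is what allows N identical blocks to be stacked.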
Encoder-only, decoder-only, and encoder-decoder
The original 2017 paper described an encoder-decoder model for machine translation. Modern variants specialize the architecture for different tasks.
Encoder-only (BERT, RoBERTa, E5): every token attends to every other token bidirectionally. Best for classification, search ranking, and embedding generation. Not generative. Google began applying BERT to query understanding in Search in 2019 and reported that it affected about 1 in 10 English-language queries in the US.
Decoder-only (GPT, Claude, Llama, Mistral): a causal attention mask prevents tokens from attending to future positions. Text is generated one token at a time, left to right. This is the dominant paradigm for LLMs because the training objective (next-token prediction) is simple, scalable, and produces emergent capabilities at sufficient scale.
Encoder-decoder (T5, BART, NLLB): the encoder reads input bidirectionally, and the decoder generates output with cross-attention into the encoder's representations. The decoder uses two attention mechanisms per block: self-attention over its own generated tokens (causal) and cross-attention over the encoder's output. Useful for tasks with a clear input-output structure like translation. More complex to serve and less common for general-purpose LLMs.
In my experience, the most common confusion in interviews is candidates treating these as three unrelated architectures. They are not. All three use the same transformer block internally. The differences are: which attention mask is applied (bidirectional vs. causal), whether cross-attention exists, and which training objective is used (masked LM, next-token prediction, or span corruption).
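The mask difference between the variants is small enough to sketch directly. The helper names are hypothetical; `True` means "this query position may attend to this key position".

```python
def self_attention_mask(n_tokens, causal):
    """Encoder-style (causal=False) or decoder-style (causal=True) self-attention."""
    return [
        [(not causal) or key <= query for key in range(n_tokens)]
        for query in range(n_tokens)
    ]


def cross_attention_mask(n_target, n_source):
    """Encoder-decoder cross-attention: every target token sees every source token."""
    return [[True] * n_source for _ in range(n_target)]


encoder_mask = self_attention_mask(3, causal=False)
decoder_mask = self_attention_mask(3, causal=True)
cross_mask = cross_attention_mask(n_target=2, n_source=3)

for row in decoder_mask:
    print(["x" if allowed else "." for allowed in row])
```

An encoder-only model uses only the first mask, a decoder-only model only the second, and an encoder-decoder model uses all three.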
For production LLMs in 2026, decoder-only dominates. The simplicity of autoregressive generation and the proven scaling laws make it the default choice unless you have a specific reason to use encoder-only or encoder-decoder.
Key variants and types
| Model | Architecture | Layers | Hidden Dim | FFN Dim | Attention | Positional | Key Innovation |
|---|---|---|---|---|---|---|---|
| BERT-base | Encoder-only | 12 | 768 | 3072 | 12 heads, full bidirectional | Learned absolute | Masked language modeling pretraining |
| GPT-3 | Decoder-only | 96 | 12288 | 49152 | 96 heads, causal | Learned absolute | Scaling laws, few-shot emergence |
| T5-large | Encoder-decoder | 24+24 | 1024 | 4096 | 16 heads, full + causal | Relative bias | Text-to-text framing for all tasks |
| Llama 3 8B | Decoder-only | 32 | 4096 | 14336 | 32 heads, GQA (8 KV heads) | RoPE | Open-weight, SwiGLU, GQA |
| Mistral 7B | Decoder-only | 32 | 4096 | 14336 | 32 heads, GQA (8 KV heads) | RoPE | Sliding window attention |
Grouped Query Attention (GQA) deserves special mention. Standard multi-head attention gives each head its own Q, K, and V projections. GQA shares K and V heads across groups of Q heads (Llama 3 uses 32 Q heads with only 8 KV heads). This reduces the KV cache memory during inference by 4x, which directly cuts serving cost for long-context generation.
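The KV cache saving can be worked out with simple arithmetic. This sketch assumes 16-bit values (2 bytes), a head dimension of 128 (4096 model dimension divided by 32 heads), an 8,192-token sequence, and a single request; the function name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys and values (factor of 2) cached for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Llama 3 8B shape from the table: 32 layers, 32 Q heads, 8 KV heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)

print(f"GQA: {gqa / 2**30:.1f} GiB, full MHA: {mha / 2**30:.1f} GiB per sequence")
```

Because the cache grows linearly with both sequence length and the number of concurrent requests, the 4x factor applies to every sequence being served.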
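SwiGLU activation is the other modern innovation in the table. The original transformer used ReLU in the FFN, later replaced by GELU. Llama and Mistral use SwiGLU, a gated linear unit: one linear projection of the input is passed through the Swish (SiLU) activation and multiplied elementwise by a second linear projection of the same input, and the product is projected back to the model dimension. The empirical result is slightly better loss at the same parameter count. The architectural tradeoff: SwiGLU requires three weight matrices in the FFN instead of two, so the original formulation reduces the inner dimension from 4x to roughly 8/3 x model_dim to keep the total parameter count equivalent. The 14336 FFN dimension in the table is 3.5x, a choice those models made on top of that baseline.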
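A minimal sketch of a SwiGLU feed-forward layer for a single token, in the commonly used three-matrix form: down(SiLU(gate(x)) * up(x)). The weight names `w_gate`, `w_up`, and `w_down` and the tiny hand-picked matrices are assumptions for illustration, and biases are omitted.

```python
import math


def silu(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, w):
    """Multiply a vector of length d_in by a d_in x d_out matrix."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(x, w_gate)]  # gating branch
    up = matvec(x, w_up)                         # linear branch
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise product
    return matvec(hidden, w_down)                # project back to d_model


# Toy shapes: d_model = 2, d_ff = 3.
w_gate = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
w_up = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

out = swiglu_ffn([1.0, -1.0], w_gate, w_up, w_down)
print(out)
```

The three matrices here are what drive the smaller inner dimension compared with a two-matrix GELU FFN.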
When to use / when to avoid
When to use each variant
- Use encoder-only (BERT, E5) when your task is classification, named entity recognition, semantic search, or embedding generation. Cheaper to run, often better at these tasks than decoder-only models.
- Use decoder-only (Llama, GPT-4, Claude) when you need text generation, chat, code completion, summarization, or any task that produces variable-length output.
- Use encoder-decoder (T5, NLLB) for translation, structured summarization, or tasks with a defined input-output mapping where bidirectional encoding improves quality.
- Use GQA-based models (Llama 3, Mistral) when serving cost matters. The KV cache reduction compounds across concurrent requests and long contexts.
I've seen teams default to "just use GPT-4 for everything" without considering that a fine-tuned BERT can cost 10-100x less per inference for classification and often achieves better accuracy. The right architecture depends on the task, not on which model has the most hype.
When to avoid
- Avoid encoder-only for any generative task. BERT cannot generate fluent text. It was not trained with a generative objective and has no causal mask to support autoregressive output.
- Avoid decoder-only for pure classification unless you are already serving a large LLM and want to consolidate. A fine-tuned BERT is 10-100x cheaper to serve for classification than prompting GPT-4.
- Avoid encoder-decoder for general-purpose chat. The extra cross-attention complexity adds latency without benefit when the "source" and "target" are the same conversational context.
- Avoid any transformer for tasks with fewer than 1,000 training examples and strict latency requirements. Simpler models (logistic regression, gradient-boosted trees) often win here.
Real-world examples
GPT-3 (OpenAI, 2020): 96-layer decoder-only, 175B parameters, 12288 model dimension. First demonstrated that scaling decoder-only transformers produces emergent capabilities (few-shot reasoning, code generation) without task-specific fine-tuning. Training cost estimated at $4.6M on V100 GPUs. OpenAI has not disclosed the size of the follow-up GPT-4; unofficial estimates put it at over 1 trillion parameters across a mixture-of-experts architecture.
BERT (Google, 2018): 12/24-layer encoder-only. Still the backbone of search ranking systems globally. Google's production search integrated BERT in 2019 for query understanding, reportedly affecting about 1 in 10 English-language queries in the US. Fine-tuned BERT variants run at sub-millisecond latency per query, making them ideal for high-throughput production systems where cost matters more than generative flexibility.
Llama 3 70B (Meta, 2024): 80-layer decoder-only, 8192 model dimension, GQA with 64 Q heads and 8 KV heads, RoPE positional encoding, SwiGLU activation. The open-weight release made state-of-the-art architecture accessible for self-hosting. The 8B variant runs on a single consumer GPU with 4-bit quantization.
Mistral 7B (Mistral AI, 2023): 32-layer decoder-only, 4096 model dimension, GQA with sliding window attention. Each layer attends only to the previous 4,096 tokens. Because layers are stacked, information still propagates beyond the window: after k layers, a token can be influenced by tokens up to roughly k x 4,096 positions back. This reduces attention cost and KV cache size for long sequences, at the expense of only indirect access to distant tokens. Outperformed Llama 2 13B on the benchmarks Mistral reported, despite being nearly half the size.
T5-XXL (Google, 2020): 24+24 layer encoder-decoder, 11B parameters. Framed every NLP task as text-to-text (classification becomes "classify: input" producing "positive" or "negative"). Demonstrated that encoder-decoder architectures can be competitive at scale, though the paradigm has been overtaken by decoder-only for general-purpose use.
Limitations and tradeoffs
- Quadratic attention cost: self-attention is O(n^2) in sequence length. Doubling context length from 4K to 8K quadruples the attention computation. Techniques like FlashAttention cut memory traffic and avoid materializing the full n x n score matrix, but the quadratic compute remains fundamental.
- Training cost scales with depth and width: 96-layer models require careful engineering (Pre-LN, gradient clipping, learning rate warmup, mixed-precision training). Training GPT-3 cost an estimated $4.6M; GPT-4 is estimated at over $100M.
- No persistent memory: the transformer has no state across inference calls. Everything it "knows" about a conversation must fit in the context window per request. This is a fundamental architectural limitation, not a fixable bug. Retrieval augmented generation and external memory systems are workarounds, not solutions.
- Architecture monoculture: nearly every major LLM uses the same core components. This simplifies tooling and transfer learning but means the field has concentrated risk on a single architectural family. State-space and linear-recurrent models (Mamba, RWKV) offer linear-time alternatives but remain unproven at GPT-4 scale.
The fundamental tension is expressiveness versus cost. Deeper and wider models are more capable, but quadratic attention and massive parameter counts make them expensive to train and serve. Every production deployment is a tradeoff between model quality and inference budget.
The attention bottleneck is real
At 128K context length, the attention computation alone can dominate inference latency. For a model shaped like Llama 3 70B, a 128K-token forward pass requires computing attention scores across 128K x 128K = 16.4 billion position pairs per head per layer (about half that with a causal mask), across 80 layers. This is why KV caching, FlashAttention, and context window management are critical production concerns, not academic curiosities.
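The scaling numbers above can be reproduced with a few lines of arithmetic. This sketch treats 128K as 128,000 tokens to match the 16.4 billion figure, counts query-key pairs rather than FLOPs, and uses a hypothetical helper name.

```python
def attention_pairs(seq_len, causal=False):
    """Number of query-key pairs scored by one attention head in one layer."""
    if causal:
        return seq_len * (seq_len + 1) // 2  # each token sees itself and the past
    return seq_len * seq_len


full = attention_pairs(128_000)
causal = attention_pairs(128_000, causal=True)
across_layers = full * 80  # 80 layers, still per head

print(f"per head per layer: {full / 1e9:.1f}B pairs (causal: {causal / 1e9:.1f}B)")
print(f"4K -> 8K growth: {attention_pairs(8192) / attention_pairs(4096):.0f}x")
print(f"across 80 layers: {across_layers / 1e12:.2f}T pairs per head")
```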
How this shows up in interviews
When to bring it up
Mention the transformer block structure in any AI system design question. It is relevant when discussing model selection (encoder-only vs. decoder-only), inference cost (attention is quadratic, FFN is the parameter bottleneck), or training decisions (Pre-LN for stability, GQA for serving efficiency).
This concept also comes up when discussing scaling strategies. If the interviewer asks "how would you make this system handle higher throughput," knowing that GQA reduces KV cache memory 4-8x is directly actionable. If they ask about model customization, knowing the difference between encoder-only fine-tuning (cheaper, task-specific) and decoder-only prompting (flexible, expensive) shapes your recommendation.
Depth calibration
- Junior: know the block structure (attention, residual, norm, FFN), know the three variants (encoder-only, decoder-only, encoder-decoder), and know that decoder-only is dominant.
- Senior: explain why residual connections matter for deep models, explain Pre-LN vs Post-LN, compare sinusoidal vs RoPE positional encoding, explain GQA and its memory savings.
- Staff: discuss SwiGLU vs GELU tradeoffs, explain KV cache memory math for GQA, reason about scaling laws in terms of parameter allocation between attention and FFN, discuss architectural alternatives like state-space models (Mamba) and their tradeoffs. Be able to calculate approximate parameter counts given layer count, hidden dim, and FFN dim. Know the cost implications of quadratic attention at long context lengths.
Q&A table
| Interviewer asks | Strong answer |
|---|---|
| "Walk me through a transformer block." | "Pre-LN normalize, multi-head attention, residual add. Pre-LN normalize again, FFN with 4x expansion, residual add. Stack N times." |
| "Why do residual connections matter?" | "They create a gradient highway. Without them, signals degrade exponentially through 96 Jacobians and early layers never learn." |
| "Why is the FFN 4x wider?" | "Empirical choice from the original paper. The expansion gives each token a richer nonlinear transform. Two-thirds of GPT-3's parameters are in FFN." |
| "What is GQA and why use it?" | "Share KV heads across groups of Q heads. Llama 3 uses 32 Q heads with 8 KV heads, cutting KV cache memory 4x for cheaper serving." |
| "Pre-LN or Post-LN?" | "Pre-LN. It normalizes before the sublayer, keeping the residual path clean. Post-LN struggles past 48 layers." |
| "When would you pick BERT over GPT?" | "Classification, search ranking, embeddings. Bidirectional attention gives richer representations and BERT is 10-100x cheaper to serve." |
Common interview mistakes
| Mistake | Why it's wrong | Say this instead |
|---|---|---|
| "The transformer uses attention instead of layers" | Attention IS a layer. The transformer uses attention layers instead of recurrent layers. The block has both attention and FFN layers. | "The transformer replaces recurrence with self-attention layers, stacked with FFN layers and residual connections." |
| "BERT and GPT are completely different architectures" | They share the same transformer block. The difference is the attention mask: bidirectional vs. causal. | "BERT and GPT use the same transformer block. The key difference is the attention mask and training objective." |
| "Positional encoding tells the model where words are" | Vague. There are three major approaches with different tradeoffs for length generalization. | "Sinusoidal, learned, and RoPE each encode position differently. RoPE is the modern default because it generalizes to longer sequences." |
| "More layers always means a better model" | Ignores training stability, diminishing returns, and the critical role of Pre-LN. Models can degrade with depth if not carefully trained. | "More layers help, but only with Pre-LN, proper warmup, and gradient clipping. Post-LN models often fail past 48 layers." |
| "The FFN is just a simple neural network" | The FFN contains two-thirds of all parameters and is where factual knowledge appears to be stored. It is the largest component by parameter count. | "The FFN at 4x width holds most of the model's parameters. Research suggests it stores factual knowledge as key-value memories." |
Quick recap
- A transformer block is: Pre-LN, multi-head attention, residual add, Pre-LN, FFN (4x width), residual add. Stack N of these blocks to build any modern LLM.
- Residual connections create gradient highways that make 96-layer models trainable. Without them, vanishing gradients kill deep networks.
- Pre-LN (normalize before sublayer) replaced Post-LN in all modern architectures because it trains stably past 48 layers.
- The FFN holds roughly two-thirds of all parameters and appears to store factual knowledge as key-value memories in individual neurons.
- Encoder-only (BERT) for classification and embeddings, decoder-only (GPT, Llama) for generation, encoder-decoder (T5) for translation. Decoder-only dominates in 2026.
- RoPE positional encoding is the modern standard, enabling better length generalization than sinusoidal or learned absolute approaches.
- For interviews, memorize the block structure and know why each component exists: residuals for gradients, Pre-LN for stability, FFN for knowledge, GQA for serving cost.
Related concepts
- Attention mechanism - The core computation inside each transformer block. Understanding attention is prerequisite to understanding transformers.
- Large language models - LLMs are built by scaling the transformer architecture. This article covers what happens when you stack many blocks with billions of parameters.
- Self-supervised learning - The training paradigm (masked LM, next-token prediction) that makes transformer pretraining possible without labeled data.
- Embeddings - The input representation that feeds into the transformer. Token and positional embeddings are the first layer of every transformer model.