Large language models
Learn how LLMs predict tokens at scale, why the training pipeline has three distinct stages, and how to choose the right model for your system.
TL;DR
- An LLM is a transformer neural network trained to predict the next token in a sequence. At sufficient scale (tens to hundreds of billions of parameters, trillions of tokens), general-purpose reasoning abilities emerge without being explicitly programmed.
- The Chinchilla scaling law (2022) showed that, for a fixed compute budget, parameters and training tokens should scale in roughly equal proportion, and that most large models of the time were undertrained. Training data, not raw model size, became the bottleneck.
- Every LLM goes through three training stages: pre-training (raw text), supervised fine-tuning (instruction pairs), and RLHF (human preference alignment). You almost always deploy the third.
- The context window (measured in tokens, not words) is the single most important engineering constraint. Everything outside it is invisible.
- At $10/million output tokens, a chatbot serving 100K daily users (averaging 500 output tokens each) generates roughly $15K/month in LLM costs alone. Model selection is an infrastructure decision.
The Problem It Solves
Before LLMs, understanding natural language required a different system for every task. Sentiment analysis needed one model. Question answering needed another. Translation needed a third. Each required hand-labeled training data, custom architectures, and domain experts.
If the user phrased a question differently than your training examples, the system broke. Every new domain meant starting over from scratch.
The deeper problem was that language understanding requires world knowledge. You cannot classify "that's sick" as positive or negative without understanding slang, context, and speaker intent. Encoding that breadth of knowledge explicitly, for every possible task, was intractable.
I've worked with teams that maintained 15+ specialized NLP models before switching to a single LLM endpoint. The operational overhead alone was crushing.
LLMs collapsed the "one model per task" paradigm by learning the statistical structure of language at internet scale. Because so much of human knowledge is written down, a model trained to predict text ends up learning a surprising amount about the world.
What Is It?
A large language model is a transformer neural network, typically with tens of billions of parameters, trained on trillions of tokens of text with a single objective: given the preceding tokens, predict the next one.
Think of it like a colleague who has read every book, every Stack Overflow thread, every Wikipedia article, and every code repository on the internet. They cannot perfectly recall any single document, but they have absorbed the patterns. When you ask a question, they don't look it up; they reconstruct an answer from everything they've absorbed. Sometimes they reconstruct something that never existed (hallucination), but the pattern-matching is remarkably powerful.
What makes LLMs "large" is not just parameter count. It is the combination of model size, training data volume, and compute that crosses a threshold where emergent capabilities appear: in-context learning, chain-of-thought reasoning, and zero-shot performance on tasks the model was never explicitly trained for.
The key insight: LLMs do not "understand" in the way humans do. They learn statistical relationships between tokens at massive scale, and that statistical knowledge is useful enough to solve real engineering problems.
How It Works
The inference loop: tokenize, embed, transform, sample, repeat
Every time you send a prompt to an LLM, the same five-step loop executes. Understanding this loop is the foundation for every engineering decision you will make with LLMs.
- Tokenize: Your text is split into tokens (subword pieces). For example, "unhappiness" might become ["un", "happiness"]; the exact split depends on the tokenizer. GPT-4's tokenizer has a vocabulary of roughly 100K tokens.
- Embed: Each token is mapped to a high-dimensional vector (typically 4096-12288 dimensions). These vectors encode semantic meaning.
- Transform: The embedded vectors pass through dozens of transformer layers (96 in the 175B-parameter GPT-3; OpenAI has not published GPT-4's architecture). Each layer applies self-attention (what should I focus on?) and a feed-forward network (what should I output?).
- Sample: The final layer produces a logit vector with one score per vocabulary token. Softmax converts these to probabilities, and a sampling strategy (temperature, top-p, top-k) selects the next token.
- Repeat: The selected token is appended to the input, and the model runs another forward pass to produce the next token. A 500-token response requires 500 sequential forward passes.
This autoregressive loop is why LLM latency scales linearly with output length. It is also why techniques like KV-cache (storing intermediate computations from previous tokens) are critical for production performance.
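The five steps above can be sketched in a few lines of Python. This is a toy under loud assumptions: the six-word vocabulary, the whitespace `tokenize` function, and the hand-written `BIGRAM_LOGITS` table are all hypothetical stand-ins for a real tokenizer and transformer stack, and the sampler is plain greedy argmax. The only point is the shape of the loop: one forward pass per generated token.

```python
import math

# Toy vocabulary; a real tokenizer has on the order of 100K subword entries.
VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Hypothetical stand-in for "embed + transformer layers": a hand-written table
# that maps the last token to logits over the next token.
BIGRAM_LOGITS = {
    "the": [0.0, 0.0, 2.0, 0.0, 0.0, 3.0],
    "cat": [0.0, 0.0, 0.0, 4.0, 0.0, 0.0],
    "sat": [0.0, 0.0, 0.0, 0.0, 4.0, 0.0],
    "on":  [0.0, 4.0, 0.0, 0.0, 0.0, 0.0],
    "mat": [4.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}


def tokenize(text):
    """Whitespace tokenizer; real tokenizers split into subword pieces."""
    return [TOKEN_ID[word] for word in text.split()]


def forward(token_ids):
    """Return next-token logits. A real model attends over every prior token."""
    return BIGRAM_LOGITS[VOCAB[token_ids[-1]]]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding; returns (text, forward_pass_count)."""
    ids = tokenize(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        probs = softmax(forward(ids))  # one forward pass per new token
        passes += 1
        next_id = max(range(len(probs)), key=lambda i: probs[i])
        if VOCAB[next_id] == "<eos>":
            break
        ids.append(next_id)  # feed the chosen token back in and repeat
    return " ".join(VOCAB[i] for i in ids), passes


print(generate("the cat sat on"))
```

Because each new token depends on the one before it, the passes cannot run in parallel, which is the source of the linear latency described above.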
Temperature, top-p, and top-k: controlling randomness
The sampler is where you, the engineer, have the most direct control over model behavior. Three parameters matter:
Temperature divides the logit values before softmax. Temperature = 0 always picks the highest-probability token (greedy decoding, deterministic in principle). Temperature = 1.0 samples from the unmodified distribution. Temperature < 1.0 sharpens the distribution toward the most likely tokens; temperature > 1.0 flattens it, making unlikely tokens more probable. For factual tasks, use 0.0-0.3. For creative tasks, use 0.7-1.0.
Top-p (nucleus sampling) keeps only tokens whose cumulative probability reaches p. With top-p = 0.9, the model samples from the smallest set of tokens that together have 90% probability. This dynamically adjusts how many tokens are considered: for confident predictions, that might be 2-3 tokens. For uncertain predictions, it might be hundreds.
Top-k simply keeps the k most probable tokens and discards the rest. Top-k = 50 means only the 50 highest-probability tokens are candidates. This is cruder than top-p because it does not adapt to the shape of the distribution.
My recommendation: use temperature + top-p together. I've seen teams waste weeks debugging "random" model behavior that was just temperature set too high for a structured extraction task.
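To make the three knobs concrete, here is a minimal pure-Python sketch of a sampler. The function names (`top_k_filter`, `top_p_filter`, `sample`) are illustrative rather than any provider's API, and production inference stacks implement the same ideas on tensors; treat it as a reading aid, not a reference implementation.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (max subtracted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero every index not in `keep`, then rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep the k most probable tokens, regardless of the distribution's shape."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, ranked[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, kept)


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index from logits using temperature, then top-k / top-p."""
    if temperature == 0:
        # Temperature 0 is treated as greedy argmax (dividing by 0 is undefined).
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax([x / temperature for x in logits])
    if top_k is not None:
        probs = top_k_filter(probs, top_k)
    if top_p is not None:
        probs = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs)[0]


logits = [2.0, 1.0, 0.0, -1.0]
print(sample(logits, temperature=0))               # greedy pick
print(top_p_filter(softmax(logits), 0.8))          # nucleus keeps the top 2 here
print(sample(logits, temperature=0.7, top_p=0.9))  # random draw
```

Note how `top_p_filter` keeps a different number of tokens depending on how peaked the distribution is, while `top_k_filter` always keeps exactly k.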
Context window: the engineering constraint that rules everything
The context window defines how many tokens the model can process in a single forward pass. This includes both your input (system prompt + user message + retrieved context) and the model's output. For GPT-4o, it is 128K tokens. For Claude 3.5/3.7 Sonnet, it is 200K tokens.
This is THE constraint that shapes every LLM-powered architecture. If your task requires more context than the window allows, you must choose: truncate, summarize, use RAG to selectively retrieve, or switch to a model with a larger window.
Three things most engineers underestimate about context windows:
- Tokens are not words. One English word averages 1.3 tokens. Code tokenizes less efficiently (more tokens per line). Non-English languages can be 2-4x more expensive in tokens.
- Longer context degrades performance. The "lost in the middle" phenomenon (Liu et al., 2023) showed that LLMs struggle to use information placed in the middle of long contexts. Putting critical information at the start or end of the prompt produces measurably better results.
- Context costs money on every call. A 100K token context at $2.50/million input tokens costs $0.25 per request. At 100K requests/day, that is $25K/day in input tokens alone.
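One simple way to act on the "lost in the middle" point is to reorder retrieved documents so the strongest ones sit at the edges of the prompt. The sketch below assumes the documents arrive already sorted from most to least relevant; the function name `order_for_long_context` is hypothetical, and whether this ordering helps for a given model is something to confirm with your own evaluation.

```python
def order_for_long_context(docs_by_relevance):
    """Place the most relevant documents at the start and end of the context.

    Input is sorted most-relevant first. Even-ranked items fill the front,
    odd-ranked items fill the back in reverse, so the least relevant
    documents end up in the middle.
    """
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        if rank % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


# "A" is the most relevant document, "E" the least.
print(order_for_long_context(["A", "B", "C", "D", "E"]))
```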
For your interview: mention the context window early when discussing any LLM architecture. Say "the context window is X tokens, which means we can fit Y documents, and that shapes how we design the retrieval layer." That signals real engineering understanding.
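The "X tokens, Y documents" arithmetic can be sketched as a small budget calculation. The 1.3 tokens-per-word figure is the rough English average quoted above, not a tokenizer guarantee, and the prompt sizes in the example are made-up numbers; in practice you would count tokens with the model's actual tokenizer.

```python
TOKENS_PER_WORD = 1.3  # rough English average; code and other languages run higher


def estimate_tokens(word_count, tokens_per_word=TOKENS_PER_WORD):
    """Approximate token count from a word count."""
    return round(word_count * tokens_per_word)


def documents_that_fit(context_window, system_prompt_tokens, user_tokens,
                       reserved_output_tokens, tokens_per_document):
    """How many retrieved documents fit once fixed costs are subtracted.

    The window covers input and output, so output tokens are reserved up front.
    """
    available = (context_window - system_prompt_tokens
                 - user_tokens - reserved_output_tokens)
    if available <= 0:
        return 0
    return available // tokens_per_document


# Hypothetical budget: 128K window, 600-word documents.
doc_tokens = estimate_tokens(600)
print(doc_tokens)
print(documents_that_fit(128_000, 1_000, 500, 2_000, doc_tokens))
```

Fitting that many documents is an upper bound, not a target: every token you include is billed on each call, and longer contexts can degrade answer quality.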
Scaling laws: why Chinchilla changed everything
In 2022, DeepMind's Chinchilla paper answered a question the field had been debating: for a fixed compute budget, should you train a bigger model or use more data?
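The answer: scale both together, which in practice meant using far more data than the field had been. Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) on a comparable compute budget despite being 4x smaller, because it was trained on 4.7x more data. The paper found that parameters and training tokens should grow in roughly equal proportion, with a compute-optimal ratio of roughly 20 tokens per parameter.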
This result reshaped the industry. It is why Llama 2 (70B) was trained on 2T tokens, and Llama 3 (70B) on 15T tokens. It is why the race shifted from "who can build the biggest model" to "who has the best training data."
The engineering takeaway: when evaluating models, check the token-to-parameter ratio. All else being equal, a 7B model trained on 2T tokens (about 286 tokens/param) will likely outperform a 13B model trained on 500B tokens (about 38 tokens/param).
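The ratios in this section are easy to reproduce. The sketch below uses the parameter and token counts quoted above, plus one assumption not stated in the text: the common rule of thumb that training compute is roughly 6 x parameters x tokens FLOPs. That approximation is enough to show that Chinchilla and Gopher sat on a similar compute budget.

```python
B = 10**9   # one billion
T = 10**12  # one trillion

# (parameters, training tokens) as quoted in this section.
MODELS = {
    "Gopher":     (280 * B, 300 * B),
    "Chinchilla": (70 * B, 1_400 * B),
    "Llama 2 70B": (70 * B, 2 * T),
    "Llama 3 70B": (70 * B, 15 * T),
}


def tokens_per_parameter(params, tokens):
    return tokens / params


def approx_training_flops(params, tokens):
    """Rule-of-thumb estimate (assumption): compute ~= 6 * N * D FLOPs."""
    return 6 * params * tokens


for name, (params, tokens) in MODELS.items():
    ratio = tokens_per_parameter(params, tokens)
    flops = approx_training_flops(params, tokens)
    print(f"{name:12s} {ratio:7.1f} tokens/param  ~{flops:.2e} FLOPs")
```

Under this estimate Chinchilla and Gopher land within about 20% of each other in compute, while their tokens-per-parameter ratios differ by roughly 20x.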
The training pipeline: pre-training, SFT, and RLHF
Every production LLM goes through three distinct training stages. This is the distinction I see most candidates skip over, and it matters when you are deciding which model to deploy.
Stage 1: Pre-training is the expensive stage. The base model is trained on trillions of tokens of internet text, books, code, and academic papers. The objective is pure next-token prediction. No instructions, no conversations, no safety guardrails. Pre-training GPT-4 reportedly cost over $100M in compute. The output is a raw autocomplete engine.
Stage 2: Supervised Fine-Tuning (SFT) transforms the autocomplete engine into an assistant. Human annotators write thousands of (prompt, ideal-response) pairs. The model is fine-tuned on these examples so it learns to interpret input as instructions rather than documents to continue. After SFT, the model can follow directions, but it may still produce harmful or wrong outputs.
Stage 3: RLHF (Reinforcement Learning from Human Feedback) aligns the model with human preferences. Human raters compare multiple model outputs and rank them. A reward model is trained on those rankings. The language model is then fine-tuned with a reinforcement learning algorithm, classically PPO (Proximal Policy Optimization), to maximize the reward signal. This produces models that are more helpful, less harmful, and better calibrated. GPT-4o, Claude 3.5/3.7, and Llama 3 Instruct all use RLHF or a closely related preference-optimization method.
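The "reward model trained on rankings" step is often implemented with a pairwise loss: the reward model should score the preferred response higher than the rejected one. The sketch below shows that idea with a Bradley-Terry style loss, -log(sigmoid(r_chosen - r_rejected)). This particular loss is an assumption for illustration; the text above does not say which objective any given vendor uses, and the reward values are made-up scalars.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably.

    Uses the identity -log(sigmoid(m)) = softplus(-m)
                                       = max(-m, 0) + log1p(exp(-|m|)).
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical reward-model scores for (preferred, rejected) response pairs.
print(pairwise_preference_loss(2.0, 0.0))  # ranking respected: small loss
print(pairwise_preference_loss(0.0, 0.0))  # no preference learned: log(2)
print(pairwise_preference_loss(0.0, 2.0))  # ranking violated: large loss
```

Minimizing this loss over many human-ranked pairs pushes the reward model to agree with the raters; the policy is then optimized against that learned reward.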
Deploying a base model is almost always wrong
A base model will autocomplete your prompt as if it is continuing a document, not answering a question. Unless you specifically need raw generation (creative writing autocomplete, research into model behavior), always deploy an instruction-tuned and RLHF-aligned variant.
The alignment tax is real, though. RLHF models sometimes refuse valid requests because the reward model learned to be overly cautious. "I can't help with that" when you ask about security testing or medical information is a known failure mode. This is not a bug in RLHF; it is a tradeoff between capability and safety, and the calibration is imperfect.
Key Variants / Types
| Variant | Training | Best For | Tradeoff |
|---|---|---|---|
| Base model (GPT-4-base, Llama 3 Base) | Pre-training only. Next-token prediction on raw text. | Continued pre-training, domain adaptation, research. | Unusable for chat or instruction-following without further fine-tuning. |
| Instruction-tuned (e.g., Mistral 7B Instruct) | Base + SFT on instruction/response pairs. | Structured tasks: extraction, classification, summarization. | May produce harmful or uncalibrated outputs. Little or no safety alignment. |
| RLHF-aligned (GPT-4o, Claude 3.5/3.7 Sonnet, Gemini 1.5 Pro) | Base + SFT + RLHF. Human preference optimization. | Production chat, customer-facing products, general-purpose API use. | Alignment tax: occasionally refuses valid requests. Higher training cost. |
| Reasoning models (o1, o3, Claude 3.7 Sonnet with extended thinking) | RLHF + chain-of-thought reinforcement. Trained to "think before answering." | Complex multi-step reasoning, math, code generation, planning. | Much higher latency (10-60s per response). 3-10x token cost due to thinking tokens. |
The fastest way to waste money on AI is to use a reasoning model for a task that instruction-tuned handles equally well. Classification, extraction, and simple Q&A do not need chain-of-thought. Save the reasoning models for tasks where getting the answer wrong is expensive.
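A routing rule for this can be very small. The sketch below is illustrative only: the task labels and tier names (`small`, `general`, `reasoning`) are hypothetical, and a real router would more likely be driven by evaluation results or a lightweight classifier than by a hard-coded set.

```python
# Hypothetical task labels; adjust to whatever your system actually classifies.
SIMPLE_TASKS = {"classification", "extraction", "simple_qa"}
REASONING_TASKS = {"math", "planning", "multi_step_code"}


def choose_tier(task_type, error_is_expensive=False):
    """Pick the cheapest model tier that is plausibly good enough.

    - Simple, well-bounded tasks go to the small tier.
    - Reasoning-heavy tasks go to the reasoning tier only when a wrong
      answer is costly; otherwise the general tier handles them.
    - Everything else defaults to the general tier.
    """
    if task_type in SIMPLE_TASKS:
        return "small"
    if task_type in REASONING_TASKS and error_is_expensive:
        return "reasoning"
    return "general"


print(choose_tier("classification"))
print(choose_tier("planning", error_is_expensive=True))
print(choose_tier("summarization"))
```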
When to Use / When to Avoid
When to use LLMs
- When the task is language-in, language-out and requires understanding semantics, not just pattern matching. Summarization, translation, code generation, conversational interfaces.
- When you need zero-shot or few-shot generalization. If you cannot afford to collect 10K labeled examples for a custom model, an LLM with good prompting often matches or beats a fine-tuned smaller model.
- When the task changes frequently. Updating a prompt takes minutes. Retraining a custom model takes days. If your task evolves weekly, LLMs are the pragmatic choice.
- When you need to combine multiple capabilities in a single call: "summarize this document, extract key entities, and classify the sentiment."
When to avoid LLMs
- When the latency budget is under 50ms. LLM inference takes 200ms-3s for time-to-first-token. If your product requires sub-50ms responses (autocomplete, real-time gaming), use a traditional model or lookup table.
- When the task is pure classification with stable categories. A fine-tuned BERT (110M params) classifies sentiment at 5ms and pennies per million. GPT-4o does it at 500ms and dollars per million. If the categories do not change, use the smaller model.
- When you cannot tolerate hallucination. Medical diagnosis, legal advice, financial calculations. If wrong answers cause harm, LLMs need heavy guardrails (RAG, output validation, human review) or should not be the primary decision-maker.
- When privacy requirements prohibit sending data to external APIs and you cannot self-host. Open-weight models (Llama 3, Mistral) solve this, but self-hosting 70B+ parameter models requires serious GPU infrastructure.
So when does this actually matter in system design? The honest answer: almost every AI-powered product today uses LLMs somewhere. The question is not "should we use an LLM" but "which model, at what layer, with what guardrails."
Real-World Examples
Stripe's fraud detection augmentation. Stripe uses LLMs to analyze merchant descriptions and user-reported dispute narratives, generating structured explanations for fraud review teams. Their key lesson: LLMs are not replacing the fraud model (that is still gradient-boosted trees on transaction features), but they augment human reviewers who previously spent 4-6 minutes per case. The LLM summary cut review time to under 90 seconds.
Notion's AI summarization and Q&A. Notion processes millions of workspace documents daily. They route queries to different model tiers: simple formatting questions go to a small model (fast, cheap), while complex cross-document reasoning goes to GPT-4o. This tiered routing cut their LLM costs by 60% compared to sending everything to the largest model. I've seen this pattern become standard: route by complexity, not by default to the most capable model.
Cursor's AI code editor. Cursor uses Claude 3.5 Sonnet as its backbone for code generation, with a 200K token context window that allows it to ingest entire codebases. Their non-obvious insight: they found that inserting the most relevant files at the beginning and end of the context (not the middle) improved code suggestion accuracy by 15-20%, directly exploiting the "lost in the middle" phenomenon.
Uber's internal knowledge base. Uber built an LLM-powered system to answer employee questions about internal tools, policies, and codebases. They use Llama 3 70B self-hosted within their VPC (data residency requirement) with RAG over 2M+ internal documents. Serving 50K queries/day, their per-query cost is roughly $0.003 with self-hosted inference, compared to $0.02-0.05 with API providers.
Limitations and Tradeoffs
| Advantage | Limitation |
|---|---|
| General-purpose: one model handles dozens of tasks | Hallucination: generates confident, plausible nonsense |
| Zero-shot capable: works without task-specific training data | Knowledge cutoff: no awareness of events after training date |
| Scales via API: no ML infrastructure needed to start | Cost at scale: token pricing compounds quickly with volume |
| Multilingual out of the box | Non-English languages require 2-4x more tokens (higher cost, worse performance) |
| Improving rapidly: 2026 models far surpass 2023 baselines | Vendor lock-in: switching providers requires re-engineering prompts and eval suites |
Hallucination: the production risk that prompt engineering alone does not fix
Hallucination happens because the model is optimizing for plausibility, not truth. The same mechanism that allows it to write creative fiction also allows it to fabricate citations, invent API endpoints, and confidently state wrong facts.
Prompt engineering reduces hallucination but does not eliminate it. "Only use information from the provided context" helps, but the model can still confabulate when the context is ambiguous or incomplete. The real mitigations are architectural:
- RAG (Retrieval-Augmented Generation): ground the model in retrieved documents.
- Structured output validation: parse the response and verify against a schema or database.
- Citation enforcement: require the model to quote source text, then programmatically verify the quotes exist.
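Citation enforcement is the easiest of these to show in code. A minimal sketch, assuming the model has been instructed to wrap every quotation in straight double quotes: extract the quoted spans and check that each appears verbatim (ignoring case and whitespace) in one of the retrieved sources. The helper names are hypothetical, and real systems usually need fuzzier matching than a substring test.

```python
import re


def _normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def extract_quotes(answer):
    """Return every span the model wrapped in straight double quotes."""
    return re.findall(r'"([^"]+)"', answer)


def unverified_quotes(answer, sources):
    """Quotes in the answer that do not appear in any source document."""
    haystacks = [_normalize(source) for source in sources]
    return [
        quote for quote in extract_quotes(answer)
        if not any(_normalize(quote) in haystack for haystack in haystacks)
    ]


sources = ["Refunds are processed within 5 business days."]
answer = ('The policy says "refunds are processed within 5 business days" '
          'and that "shipping is always free".')
print(unverified_quotes(answer, sources))
```

Any non-empty result is a signal to reject the response, retry, or route it to human review rather than show it to the user.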
Token economics: the math you need for production
LLM pricing follows a simple model: you pay per token, separately for input and output. Output tokens are typically 3-5x more expensive than input tokens.
Let's do the math for a customer support chatbot:
- 100K daily users, average 3 messages each = 300K requests/day
- Average request: 500 input tokens (system prompt + context + user message) + 200 output tokens
- Model: GPT-4o at $2.50/M input, $10/M output
- Daily cost: (300K x 500 x $2.50/1M) + (300K x 200 x $10/1M) = $375 + $600 = $975/day ($29K/month)
Switching to GPT-4o-mini ($0.15/M input, $0.60/M output) for the same traffic: $22.50 + $36 = $58.50/day ($1.8K/month). That is a 16x cost reduction. If GPT-4o-mini passes your evaluation for this task, using GPT-4o is literally burning money.
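The same arithmetic as a reusable function, so you can swap in your own traffic and prices. The prices below are the per-million-token figures quoted in this section and may not match current provider pricing; the 30-day month is a simplifying assumption.

```python
def daily_cost(requests_per_day, input_tokens, output_tokens,
               input_price_per_m, output_price_per_m):
    """Daily spend in dollars, with input and output tokens priced separately."""
    input_cost = requests_per_day * input_tokens * input_price_per_m / 1_000_000
    output_cost = requests_per_day * output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


def monthly_cost(daily, days=30):
    return daily * days


# Traffic from the worked example: 300K requests/day, 500 in + 200 out tokens.
large = daily_cost(300_000, 500, 200, input_price_per_m=2.50, output_price_per_m=10.00)
small = daily_cost(300_000, 500, 200, input_price_per_m=0.15, output_price_per_m=0.60)

print(f"larger model:  ${large:,.2f}/day, ${monthly_cost(large):,.0f}/month")
print(f"smaller model: ${small:,.2f}/day, ${monthly_cost(small):,.0f}/month")
print(f"ratio: {large / small:.1f}x")
```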
Interview tip: always do the cost math
When an interviewer asks about model selection, run the numbers out loud. Show input cost, output cost, daily volume, and monthly total. Then compare two model tiers. This signals you think about LLMs as infrastructure, not magic.
The fundamental tension with LLMs is capability vs. cost vs. latency. You rarely get all three. The largest models are the most capable but the slowest and most expensive. The smallest models are fast and cheap but less capable. Your job as an engineer is to find the sweet spot for each use case in your system.
How This Shows Up in Interviews
When to bring it up
Mention LLMs when the system design question involves: natural language input, content generation, search relevance (reranking), recommendation explanations, or any "AI-powered" feature. Even if the interviewer does not ask about AI specifically, naming the LLM layer and its tradeoffs signals you think about real-world architectures.
Depth expected by level
- Junior: Know what an LLM is, the difference between base and instruction-tuned, and that hallucination is a risk. Be able to say "I'd use an LLM API here for text generation."
- Senior: Explain the pre-training/SFT/RLHF pipeline, discuss context window constraints, make cost-aware model selection decisions, and design RAG architectures to mitigate hallucination.
- Staff: Discuss scaling laws (Chinchilla), evaluate build-vs-buy for self-hosted inference, design evaluation frameworks for model quality, and reason about multi-model architectures (routing by task complexity). Articulate alignment tax and its product implications.
Q&A table
| Interviewer Asks | Strong Answer |
|---|---|
| "How would you add AI to this product?" | "I'd identify the specific language task, pick the smallest model that passes evaluation, wrap it in RAG if factual accuracy matters, and design for graceful degradation when the model is wrong." |
| "Why not just use GPT-4 for everything?" | "Cost and latency. GPT-4o-mini handles most tasks equally well at 1/16th the cost. I'd evaluate both and route by task complexity." |
| "How do you prevent hallucination?" | "Architectural mitigations: RAG for grounding, structured output schemas for validation, citation enforcement for verifiability. Prompt engineering reduces it but does not eliminate it." |
| "What's the context window and why does it matter?" | "It is the total tokens the model sees per call: input plus output. It caps how much context you can provide and directly affects cost. I'd design the retrieval layer to fit within the window budget." |
| "Should we fine-tune or use prompting?" | "Start with prompting and few-shot examples. If that does not meet quality bars after thorough evaluation, fine-tune on task-specific data. Fine-tuning is a commitment: you own training, evaluation, and model serving." |
Common Interview Mistakes
| Mistake | Why It's Wrong | Say This Instead |
|---|---|---|
| "LLMs understand language like humans do" | LLMs learn statistical correlations between tokens. They have no grounded understanding, no world model in the human sense. This claim signals superficial knowledge. | "LLMs learn statistical patterns across trillions of tokens. The patterns are useful enough to solve real problems, but the model has no grounded understanding." |
| "Just use the biggest model available" | Bigger models have higher latency, higher cost, and often no better task-specific performance. Chinchilla showed smaller models trained on more data outperform larger undertrained ones. | "I'd evaluate multiple model sizes against our specific task and pick the smallest one that passes the quality bar at acceptable latency and cost." |
| "Prompt engineering fixes hallucination" | Prompting reduces hallucination but cannot eliminate it. The model's optimization target is plausibility, not truth. Without architectural guardrails, you will ship wrong answers. | "Prompt engineering helps, but the real fix is architectural: RAG for grounding, output validation, and citation enforcement." |
| "The context window is just a limit you work around" | The context window shapes your entire architecture: retrieval strategy, chunking approach, cost model, and what information the model can reason over. It is a design constraint, not a workaround. | "The context window is the primary design constraint. It determines how much context I retrieve, how I structure prompts, and directly drives per-request cost." |
| "RLHF makes the model smarter" | RLHF does not improve the model's knowledge or reasoning. It aligns the model's outputs with human preferences: more helpful, less harmful, better formatted. It is a behavioral adjustment, not a capability upgrade. | "RLHF aligns the model with human preferences for helpfulness and safety. It does not add new knowledge; it shapes how the model presents what it already knows." |
Quick Recap
- An LLM is a transformer trained on trillions of tokens with a single objective (next-token prediction) that, at scale, produces emergent reasoning, in-context learning, and instruction following.
- Chinchilla scaling laws (2022) showed that, for a fixed compute budget, most large models were undertrained: training data should scale in step with model size. This reshaped the industry's approach from "bigger models" to "better data."
- Every production model goes through three stages: pre-training (raw text), SFT (instruction pairs), and RLHF (human preference alignment). Always deploy the third.
- The context window is measured in tokens (not words) and determines how much information the model can process per call, directly shaping retrieval strategy, prompt design, and per-request cost.
- Hallucination requires architectural mitigations (RAG, output validation, citation enforcement), not just better prompts. It is the single biggest production risk.
- Token economics are straightforward: model selection (GPT-4o vs. GPT-4o-mini) can produce 10-16x cost differences for equivalent task performance. Evaluate before defaulting to the largest model.
- For interviews, demonstrate cost awareness, explain the three-stage training pipeline, and frame the context window as a design constraint that shapes the entire architecture.
Related Concepts
- Tokenization - Understanding how text becomes tokens is the prerequisite for context window budgeting, cost estimation, and debugging why your prompt is truncated.
- Transformer architecture - The transformer is the engine inside every LLM. Understanding attention and feed-forward layers explains why LLMs can process context in parallel and why scaling works.
- Embeddings - Embeddings are how LLMs represent meaning as vectors. This concept is essential for building RAG systems and understanding why similar prompts produce similar outputs.
- Attention mechanism - Attention is the core operation that allows LLMs to weigh which tokens matter most for each prediction, and it explains both the power and the limitations (lost in the middle) of large context windows.