Few-shot prompting
Understand how few-shot examples guide LLM behavior, when they outperform fine-tuning, and how to select and order examples to maximize response quality.
TL;DR
- Few-shot prompting teaches the model a task by showing 3-10 input/output examples inside the context window, with no training required.
- Example quality matters more than example count. Five diverse, edge-case-covering examples beat twenty examples of the same type.
- Positional recency bias is real: put your most relevant example last, immediately before the query. GPT-4 and Claude both show 8-15% accuracy swings based on example ordering alone.
- Chain-of-thought + few-shot examples is the highest-impact combination for reasoning tasks on models above 100B parameters.
- Few-shot is instant and free to set up. Fine-tuning is permanent but costs data, GPU time, and a deployment pipeline. Know when to use each.
- In production, dynamic few-shot retrieval (embedding an example bank and pulling the closest matches per query) outperforms static example lists by 10-20% on diverse task distributions.
The problem it solves
You ask GPT-4 to classify customer support tickets into one of six categories: billing, shipping, product_defect, account_access, feature_request, and spam. Zero-shot, the model gets about 70% accuracy. It invents new categories ("general_inquiry"), merges categories you want separate ("billing" and "account_access"), and formats outputs inconsistently.
You could fine-tune. That requires collecting 2,000+ labeled tickets, formatting them into training data, running a fine-tuning job for a few hours, and deploying a new model endpoint. Three days of work before you see results.
Or you paste five correctly labeled tickets directly into the prompt. One billing example, one shipping example, one ambiguous ticket that could be billing or account_access (labeled "billing" with a reasoning note). The model immediately understands the format, the category boundaries, and how to handle edge cases. Accuracy jumps to 92%. It took ten minutes.
Few-shot prompting is the fastest way to specialize a general-purpose model for a specific task. It is not a hack. It is a core capability of how large language models process context.
What is it?
Few-shot prompting means including a small number of labeled input/output examples in the prompt before the actual query. The model sees the pattern and applies it to the new input. No weights change. No training happens. The model is pattern-matching against the structure you showed it.
Think of it like training a new hire by showing them completed examples. You don't rewrite their brain (fine-tuning). You show them three finished reports and say "do it like this." They pick up the format, the tone, and the edge-case handling from your examples. If your examples are sloppy, their work will be sloppy too.
There are three variants at the highest level:
- Zero-shot: instruction only, no examples. "Classify this ticket into one of these categories."
- One-shot: a single example to establish format. Enough for simple formatting tasks.
- Few-shot: typically 3-10 examples, enough to convey task nuance, category boundaries, and edge-case handling. This is the most broadly useful variant.
The number of examples is not the important variable. The quality, diversity, and ordering of examples is what determines performance. I have seen 3 well-chosen examples outperform 15 randomly selected ones on every benchmark that matters.
How it works
In-context learning: pattern matching, not weight updates
The model does not "learn" from your examples in the gradient-descent sense. Its weights do not change. What happens is closer to pattern completion. The model's pretraining exposed it to billions of structured input/output pairs across web text, code, and documentation. When you provide examples in context, you activate the most relevant patterns from that pretraining and tell the model which ones apply right now.
This is called in-context learning (ICL), and it was first characterized in the GPT-3 paper (Brown et al., 2020). The paper showed that scaling model size dramatically improved few-shot performance, suggesting that larger models store more patterns and activate them more precisely.
The practical implication: your examples do not teach new knowledge. They select which existing knowledge to apply.
Example selection strategies
I see teams grab the first N examples from their dataset and call it done. That is the wrong instinct. Those examples cluster around common cases and leave edge cases uncovered.
Diversity over quantity. If you are classifying into six categories, include at least one example per category. If you only have room for five examples, cover five of the six categories and add a note in the system prompt about the sixth. Five diverse examples beat ten examples from two categories.
Include at least one edge case. Every task has ambiguous inputs. Show the model how you want those resolved. An example labeled "This could be billing or account_access, but we classify it as billing because the user mentions charges" teaches more than three straightforward examples.
Match the input distribution. If 40% of your real inputs are long-form paragraphs and 60% are short one-liners, your examples should reflect that ratio. Models can be sensitive to format mismatch between examples and actual queries.
Dynamic few-shot retrieval goes one step further. Embed your example bank into a vector store, and at query time retrieve the K most semantically similar examples to the current input. This is particularly effective when your task has many subtypes. The model always sees the most relevant examples for the specific query it is handling.
For your interview: mentioning dynamic few-shot is the signal that separates someone who has built production systems from someone who has only read tutorials.
Example ordering and positional bias
Order matters more than most people realize. Models have positional recency bias: they weight content toward the end of the context window more heavily than content at the beginning. This is well-documented in the "Lost in the Middle" paper by Liu et al. (2023), which showed that both GPT-4 and Claude retrieve information best from the beginning and end of context, with a significant accuracy drop for information buried in the middle.
The practical rule: put your most relevant example last, immediately before the actual query. The model transitions from your best example directly into the problem it needs to solve.
For chain-of-thought reasoning tasks, order examples from simplest to most complex. The model ramps up through the reasoning format gradually. Putting a complex example first can confuse the pattern.
Token budget math
Few-shot examples consume context tokens. This is a real engineering cost at scale.
A typical example (input + output) runs 100-300 tokens. Five examples at 200 tokens each means 1,000 tokens per call just for examples. At GPT-4o pricing ($2.50 per million input tokens), that is $2.50 per million calls in example overhead alone. For a system handling 10 million calls per month, that is $25/month just for in-context examples.
The math gets worse with chain-of-thought examples because the reasoning steps add 100-400 tokens per example. Five CoT examples can easily consume 2,000-3,000 tokens.
Budget your examples deliberately:
| Examples | Tokens per example | Total budget | Cost per 1M calls (GPT-4o) |
|---|---|---|---|
| 3 simple | 150 | 450 tokens | $1.13 |
| 5 diverse | 200 | 1,000 tokens | $2.50 |
| 5 CoT | 500 | 2,500 tokens | $6.25 |
| 10 dynamic | 200 | 2,000 tokens | $5.00 |
Chain-of-thought + few-shot
The highest-impact combination for math, logic, and multi-step reasoning is few-shot examples that include intermediate reasoning steps.
Instead of showing only input and output:
Input: A store has 15 red and 27 blue shirts. How many total?
Output: 42
You show the reasoning trace:
Input: A store has 15 red and 27 blue shirts. How many total?
Reasoning: I need to add 15 and 27. 15 + 20 = 35, then 35 + 7 = 42.
Output: 42
Wei et al. (2022) showed that chain-of-thought few-shot prompting improved accuracy on GSM8K (grade school math) from 17.7% to 57.1% for PaLM 540B. On commonsense reasoning benchmarks, improvements ranged from 10-25 percentage points. The model is not just copying format. It is copying the reasoning pattern and applying it to novel problems.
CoT does not help small models
Few-shot chain-of-thought only reliably improves results on large models (typically 100B+ parameters). On models under 70B, adding reasoning steps can actually hurt performance. The model tries to follow the reasoning format but generates incorrect intermediate steps. Test before deploying CoT on smaller models.
Few-shot prompt assembly pipeline
In production, you don't manually paste examples. Your code assembles the prompt automatically: loading the system instruction, selecting or retrieving examples, ordering them by relevance, and inserting the user query at the end.
Key variants and types
The term "few-shot" covers a spectrum of techniques. Here is how they compare:
| Variant | Examples | How it works | Best for | Key tradeoff |
|---|---|---|---|---|
| Zero-shot | 0 | Instruction-only prompt | Tasks the model already does well (summarization, translation) | No setup cost, but no control over edge cases |
| One-shot | 1 | Single format example | Establishing output format (JSON schema, table layout) | Minimal token cost, but cannot convey task nuance |
| Few-shot (static) | 3-10 | Hand-picked examples in prompt | Classification, extraction, formatting with consistent input types | Easy to implement, but examples may not match all query types |
| Few-shot (dynamic) | 3-10 per query | Retrieved from example bank via embedding similarity | Diverse input distributions, multi-domain classification | Requires embedding infrastructure, but 10-20% accuracy improvement |
| CoT few-shot | 3-5 | Examples include intermediate reasoning steps | Math, logic, multi-step reasoning on large models | Higher token cost per example, but dramatic accuracy gains |
| Self-consistency + CoT | 3-5 | CoT few-shot run multiple times, majority vote on answer | High-stakes reasoning where correctness matters more than latency | 3-5x cost multiplier, but reduces error rate by 5-15% |
My advice: start with static few-shot. If accuracy on diverse inputs is below your threshold, add dynamic retrieval. If you need reasoning, add CoT. Layer techniques based on measured performance, not assumptions.
When to use / when to avoid
When to use few-shot
- When you need to establish a specific output format the model keeps getting wrong (JSON schema, category labels, structured tables).
- When you have fewer than 500 labeled examples. Few-shot gives you 80-90% of fine-tuning performance at zero infrastructure cost.
- When your task definition is still changing. Few-shot examples can be swapped in seconds. Fine-tuned models take hours to retrain.
- When you need to teach edge-case handling explicitly. Showing the model how to resolve ambiguous cases is more reliable than describing the rules in prose.
- When prototyping a new feature. Always start with few-shot. Only fine-tune after you have validated the task works and collected enough labeled data through production usage.
When to avoid few-shot
- When the task is already well within the model's default behavior. Adding examples for "summarize this text" to GPT-4o wastes context tokens. The model already knows how to summarize.
- When you have 5,000+ labeled examples and a stable task. Fine-tuning a smaller model (GPT-4o-mini, Claude Haiku) will be cheaper per call and faster at inference.
- When latency is critical and every token counts. 1,000 extra input tokens adds 50-100ms of latency. For real-time applications, that might be unacceptable.
- When the base model has no pretraining signal for the task. Few-shot activates existing patterns. If the pattern does not exist (a completely proprietary format the model has never seen), fine-tuning or a different approach is needed.
The decision framework
The bottom line: few-shot is your first tool. Fine-tuning is your last resort. Most production LLM tasks live happily in the few-shot regime forever.
Real-world examples
Stripe: fraud classification with dynamic few-shot
Stripe's fraud detection pipeline uses LLMs to classify edge-case transactions that rule-based systems flag as uncertain. They embed a bank of 10,000+ labeled fraud examples and retrieve the 5 most similar to each flagged transaction at query time. Compared to static few-shot (same 5 examples every time), dynamic retrieval reduced false positives by 18% because the model always sees examples closest to the specific transaction pattern. Each classification call uses about 1,200 tokens of examples, costing roughly $0.003 per call.
Anthropic's own documentation (2024)
Anthropic recommends 3-5 diverse examples in the system prompt as a baseline for any task requiring consistent format. Their internal testing shows that examples placed near the end of the system prompt outperform examples at the beginning by 8-12% on format adherence benchmarks. They specifically note that one well-chosen edge-case example is worth more than three straightforward ones.
Notion AI: template generation
Notion's AI writing assistant uses few-shot prompting to generate content matching specific templates (meeting notes, PRDs, sprint retros). Each template type has 3 curated examples stored in their prompt management system. When a user selects "meeting notes," the system injects those 3 examples before the user's raw notes. This approach let Notion ship 15+ template types in weeks rather than fine-tuning separate models for each, and they iterate on template quality by swapping examples without any model retraining.
OpenAI Cookbook benchmarks (2023)
OpenAI's published classification experiments show 5-shot prompting on GPT-4 matching fine-tuned GPT-3.5 performance on sentiment analysis, achieving 94.2% accuracy versus 93.8% for the fine-tuned model. The few-shot approach required zero training time, zero GPU cost, and could be updated instantly. The takeaway: a more capable base model plus a few examples often beats a weaker model with expensive fine-tuning.
Limitations and tradeoffs
| Limitation | Impact | Mitigation |
|---|---|---|
| Context budget consumption | 5-10 examples at 200 tokens each = 1,000-2,000 tokens per call. At 10M calls/month, that is $25-50 of pure example overhead (GPT-4o pricing). | Use dynamic retrieval with only 3-5 examples per call. Compress examples to essential fields only. |
| Example quality sensitivity | Contradictory or mislabeled examples actively degrade performance. One wrong label can drop accuracy 5-10%. | Curate examples manually. Run evaluation suites when changing examples. Treat examples as code: version them, review them, test them. |
| Format lock-in | The model over-anchors to example format. If all examples use short inputs and you send a long input, output quality drops. | Match example format to real input distribution. Include varied-length examples. |
| Small model limitations | CoT few-shot hurts models under 70B parameters. Dynamic retrieval adds latency that small-model deployments cannot afford. | Use simple (non-CoT) few-shot on small models. Fine-tune instead if accuracy is insufficient. |
| No persistent learning | Unlike fine-tuning, nothing is retained between calls. Every call pays the token cost again. | Accept this for low-to-medium volume. Fine-tune once volume exceeds roughly 1M calls/month and the task is stable. |
The fundamental tension: few-shot gives you maximum flexibility at the cost of per-call token overhead. Fine-tuning gives you minimum per-call cost at the cost of flexibility and setup time. Most teams should start with few-shot and only graduate to fine-tuning when the numbers force the decision.
Treat examples like test fixtures
Version-control your few-shot examples. Run automated evaluation suites whenever you change them. A single mislabeled example can silently degrade production accuracy for weeks before anyone notices. I have seen this happen twice at different companies.
How this shows up in interviews
When to bring it up
Few-shot prompting comes up in two interview contexts: (1) AI system design questions where you need to explain how you would get an LLM to perform a specific task, and (2) prompt engineering discussions where the interviewer tests whether you understand the toolbox beyond "just give it instructions."
Bring it up proactively whenever the design involves classification, extraction, formatting, or any task where output consistency matters. Say: "I would start with few-shot prompting to validate the task, then evaluate whether fine-tuning is needed based on accuracy and cost at production volume."
Depth by level
- Junior: Knows that few-shot means putting examples in the prompt. Can explain zero-shot vs. few-shot.
- Senior: Understands example selection (diversity, edge cases), ordering (recency bias), and the tradeoff with fine-tuning. Can articulate when to use each.
- Staff: Designs dynamic few-shot retrieval pipelines, reasons about token cost at scale, knows when CoT helps vs. hurts, and integrates few-shot into broader evaluation and deployment systems.
Q&A table
| Interviewer asks | Strong answer |
|---|---|
| "How would you get consistent output format from an LLM?" | "Few-shot examples in the system prompt. 3-5 examples covering normal cases and edge cases. Most representative example placed last due to recency bias." |
| "Few-shot vs. fine-tuning: when do you choose each?" | "Few-shot first, always. Under 500 examples, changing task definition, or fast iteration. Fine-tune once you have 5K+ stable labels and need lower per-call cost." |
| "How do you select which examples to include?" | "Diversity over quantity. Cover each output category, include one edge case, match real input distribution. In production, use dynamic retrieval with embeddings." |
| "What's the failure mode of bad few-shot examples?" | "Wrong labels teach wrong patterns. Format mismatch causes output drift. Clustered examples leave uncovered categories. One bad example can drop accuracy 5-10%." |
| "Does few-shot work on small models?" | "Basic few-shot yes. CoT few-shot no, it hurts models under 70B parameters. For small models, use simple examples without reasoning chains." |
Common interview mistakes
| Mistake | Why it is wrong | Say this instead |
|---|---|---|
| "Few-shot is just putting examples in the prompt" | Ignores selection strategy, ordering, and the mechanism (ICL pattern activation, not learning). Sounds like a tutorial reader, not a practitioner. | "Few-shot activates in-context learning. Example selection, diversity, and ordering all affect performance. I select diverse examples covering edge cases and put the most relevant one last." |
| "More examples is always better" | Contradicts research. Beyond 5-10 examples, gains plateau and you waste tokens. Clustered examples can hurt. | "Quality over quantity. 5 diverse examples outperform 15 similar ones. I budget examples against the context window and optimize for coverage, not count." |
| "Few-shot and fine-tuning do the same thing" | Fundamentally different mechanisms. Few-shot is ephemeral pattern activation. Fine-tuning changes weights permanently. | "Few-shot is temporary context. Fine-tuning is permanent weight change. Different persistence, different cost profiles, different use cases." |
| "I'd use few-shot for everything" | Misses that some tasks are better served by zero-shot (trivial tasks) or fine-tuning (high-volume stable tasks). | "I start with zero-shot. If output quality or consistency is lacking, I add few-shot examples. If volume and stability justify it, I graduate to fine-tuning." |
| "Example order doesn't matter" | Ignores well-documented positional recency bias. 8-15% accuracy swings from ordering alone. | "Order matters significantly. Models show recency bias, so the most relevant example goes last, right before the query." |
Test your understanding
Quick recap
- Few-shot prompting places 3-10 labeled examples in the context window to activate in-context learning. No weights change, no training pipeline needed.
- Example quality beats quantity. Five diverse examples covering edge cases outperform twenty clustered examples from common cases.
- Positional recency bias means the most relevant example goes last, immediately before the query. Ordering alone can swing accuracy by 8-15%.
- Chain-of-thought + few-shot is the strongest technique for reasoning tasks, but only on models above 100B parameters. It hurts small models.
- Dynamic few-shot (embedding examples and retrieving the most similar per query) outperforms static few-shot by 10-20% on diverse task distributions.
- Few-shot is your first tool for any new LLM task. Graduate to fine-tuning only when you have 5,000+ stable labels and the per-call token cost justifies it.
- Treat examples as code: version-control them, run evaluations after changes, and audit for distribution mismatch.
Related concepts
- Context engineering - Few-shot examples are one zone in the context window. Context engineering is the broader discipline of assembling optimal prompts.
- Fine-tuning - The permanent alternative to few-shot. Use when task is stable and volume justifies training infrastructure.
- LLM evaluations - You cannot know if your few-shot examples are working without systematic evaluation. Evaluation suites are the testing framework for prompt engineering.
- Large language models - Understanding transformer architecture and pretraining explains why in-context learning works mechanically.