Embeddings
Learn how embeddings encode meaning as vectors, why they power semantic search and RAG, and how to choose the right model for production.
TL;DR
- An embedding is a dense float vector (commonly 384 to 3072 dimensions) where semantically similar texts sit close together in vector space.
- Created by passing text through a transformer encoder, then mean-pooling all token outputs into a single fixed-size vector.
- Cosine similarity measures the angle between two vectors. It is nearly always the correct distance function for text because it ignores magnitude.
- Sentence embeddings (SBERT, E5, OpenAI text-embedding-3, Cohere Embed v3) are the production standard. Word-level embeddings (Word2Vec, GloVe) are historical artifacts.
- Use the MTEB benchmark to shortlist embedding models, then test on your own data. Don't default to text-embedding-ada-002. Smaller, cheaper models often outperform it on a specific task.
The problem it solves
A user types "how do I cancel my subscription" into your support search. Your FAQ says "steps to terminate your account." Keyword search returns nothing. The words don't match, even though the meaning is identical.
This is the synonym problem, and it's only half the story. The other half is the paraphrase problem: "my order never arrived" and "package lost in shipping" describe the same situation with zero word overlap. Traditional search systems treat text as bags of words. They match tokens and have no concept of meaning.
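To make the zero-overlap point concrete, here is a minimal sketch that measures word overlap (Jaccard similarity) for the two example pairs. The helper names `tokens` and `jaccard` are illustrative, and the tokenizer is a plain regex rather than anything a real search engine uses.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

pairs = [
    ("how do I cancel my subscription", "steps to terminate your account"),
    ("my order never arrived", "package lost in shipping"),
]
overlaps = [jaccard(query, doc) for query, doc in pairs]
print(overlaps)  # [0.0, 0.0]
```

Any scoring function built purely on shared tokens starts from that zero, which is why synonym dictionaries end up carrying the whole load.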
I've seen teams spend months tuning Elasticsearch with synonym dictionaries and query expansion rules, only to hit a wall once users start phrasing things creatively. The fundamental issue is that keyword search operates on surface form, not semantics.
Embeddings fix this by mapping text into a geometric space where meaning determines position. Similar meaning produces nearby vectors. You search by proximity, not by keyword match.
What is it?
An embedding is a fixed-size vector of floating point numbers that captures the semantic meaning of its input. For any text input, the model outputs one vector. Similar meanings produce nearby vectors. Different meanings produce distant vectors.
Think of it like a spice rack organized by flavor profile. Cinnamon and nutmeg sit close together (both warm, sweet baking spices). Cayenne and habanero cluster nearby (both hot peppers). Cinnamon and cayenne are far apart. Nobody told the spice rack "cinnamon is warm." The position emerges from the flavor properties. Embeddings work the same way: position in vector space emerges from patterns in training data, not from explicit rules.
The key property is that geometry encodes semantics. "Dog" and "puppy" are close. "Bank" (financial) and "bank" (river) end up in different regions because the encoder reads the surrounding words, and those differ between the two senses.
For your interview: say "an embedding maps text into a vector space where geometric proximity equals semantic similarity" and you've nailed the definition.
How it works
Transformer encoder and contrastive learning
Embedding models are transformer encoders fine-tuned with a contrastive objective. During training, the model sees pairs of similar sentences (paraphrases, question-answer pairs). It learns to pull similar pairs close together in vector space while pushing dissimilar pairs apart.
The most influential approach is SBERT (Sentence-BERT), which applied a Siamese network structure to BERT for sentence embeddings. The same transformer, with shared weights, processes two sentences independently. A contrastive loss function penalizes the model when similar pairs are far apart or dissimilar pairs are close together. This family of techniques is called metric learning.
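One common form of contrastive objective treats every other example in the batch as a negative (often called InfoNCE or in-batch negatives). The sketch below is a toy version of that idea, not necessarily the exact loss any particular model was trained with. The 2-D vectors stand in for encoder outputs, and the `temperature` value of 0.05 is an illustrative assumption.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and every other positive in the batch acts as a negative."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, pos) / temperature for pos in p]
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(a)

# Toy "encoder outputs": each anchor has one matching positive.
anchors = [[1.0, 0.1], [0.0, 1.0]]
aligned = [[0.9, 0.2], [0.1, 0.8]]   # positives sit near their anchors
swapped = [aligned[1], aligned[0]]   # positives sit near the wrong anchor

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
print(loss_aligned < loss_swapped)  # True
```

The loss is near zero when each anchor is closest to its own positive and large when it is closer to someone else's, which is the gradient signal that shapes the vector space.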
I've seen teams try to skip contrastive fine-tuning and just use a base BERT model's [CLS] token as an embedding. The results are terrible. Without contrastive training, the [CLS] output isn't optimized for semantic similarity at all.
The embedding pipeline: text to searchable vector
At inference time, the pipeline is straightforward. Raw text enters, a normalized vector exits. Each step transforms the data in a specific way.
- Tokenize: split text into subword token IDs using the model's vocabulary (see Tokenization for details on BPE).
- Encode: pass token IDs through 12 to 24 transformer layers. Each layer applies self-attention and feed-forward networks, producing contextualized hidden states.
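- Pool: collapse per-token hidden states into a single vector. Mean pooling (averaging all non-padding token vectors) usually outperforms [CLS]-only extraction for retrieval tasks, but some models are trained with [CLS] pooling, so use the method the model was trained with.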
- Normalize: scale the vector to unit length (L2 norm = 1). This ensures cosine similarity equals dot product, making search faster.
- Store: index the normalized vector in a vector database (Pinecone, Weaviate, pgvector) for approximate nearest neighbor search.
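The pool and normalize steps above can be sketched in a few lines. This is a minimal illustration in plain Python with hand-written hidden states standing in for real encoder output; the function names `mean_pool` and `l2_normalize` are illustrative, and production code would typically do the same arithmetic on tensors.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask is 1, skipping padding positions."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

def l2_normalize(vec):
    """Scale the vector so its L2 norm is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

# Toy encoder output: 4 token positions, 3 dimensions each.
hidden_states = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [2.0, 0.0, 2.0],
    [9.0, 9.0, 9.0],  # padding position, must not affect the result
]
attention_mask = [1, 1, 1, 0]

pooled = mean_pool(hidden_states, attention_mask)
embedding = l2_normalize(pooled)
print(pooled)  # [2.0, 0.0, 2.0]
```

Forgetting the attention mask is an easy mistake to make here: padding vectors get averaged in and short texts drift toward whatever the padding looks like.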
Distance functions: cosine vs. euclidean vs. dot product
Once you have vectors, you need a way to measure "nearness." Three functions dominate.
Cosine similarity measures the angle between two vectors, ignoring their magnitudes. If two texts point in roughly the same direction in embedding space, they're semantically similar. This is almost always the right choice for text.
Euclidean (L2) distance measures straight-line distance between two points. Use this when magnitude matters (coordinate data, numeric features). For text embeddings, it typically performs worse than cosine because it conflates direction and magnitude.
Dot product is cosine similarity multiplied by both magnitudes. On normalized vectors (unit length), dot product equals cosine similarity exactly. My recommendation: normalize your embeddings once at indexing time, then use dot product for search. It is mathematically equivalent to cosine but cheaper to compute.
The rule of thumb: normalize at write time, dot product at query time. Every major vector database supports this.
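A small worked example, using toy 2-D vectors, shows how the three functions relate. It is a sketch in plain Python rather than a vector-database API.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def unit(v):
    n = norm(v)
    return [x / n for x in v]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length
c = [4.0, 3.0]  # different direction, same length as a

# Same direction: cosine sees a perfect match, Euclidean sees a gap.
print(cosine(a, b), euclidean(a, b))  # 1.0 5.0

# After normalization, dot product and cosine agree.
ua, uc = unit(a), unit(c)
print(math.isclose(dot(ua, uc), cosine(a, c)))  # True

# On unit vectors, squared Euclidean distance is 2 - 2 * cosine,
# so Euclidean and cosine produce the same ranking.
print(math.isclose(euclidean(ua, uc) ** 2, 2 - 2 * cosine(a, c)))  # True
```

The last identity is why the choice between cosine and Euclidean stops mattering for ranking once vectors are normalized; the dot product is simply the cheapest of the three to compute.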
Matryoshka embeddings: dimension truncation
Matryoshka Representation Learning (MRL) trains models so that the first N dimensions of the vector are still useful on their own. You can truncate a 3072-dim vector to 256 dims and retain most of the retrieval quality.
This matters for cost. Storing 3072-dim float32 vectors for 10 million documents costs ~120 GB. Truncating to 256 dims drops that to ~10 GB with only 2 to 5% quality degradation on most benchmarks. OpenAI's text-embedding-3 models and Nomic Embed both support this.
The trick is that training uses a multi-loss objective: the model is simultaneously optimized for the full vector and for a fixed set of prefix lengths (for example 64, 128, 256, 512, 1024). Early dimensions learn the most important semantic features. Later dimensions capture finer distinctions.
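The storage arithmetic and the truncation step can be sketched as follows. The helper names are illustrative, the figures cover raw vectors only (no index overhead), and truncation like this is only meaningful for models trained with a Matryoshka-style objective.

```python
import math

def storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal gigabytes (index overhead not included)."""
    return num_vectors * dims * bytes_per_value / 1e9

def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` values, then rescale back to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

print(storage_gb(10_000_000, 3072))  # 122.88
print(storage_gb(10_000_000, 256))   # 10.24

# A unit-length toy vector loses its unit norm when cut, so renormalize.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_and_renormalize(full, 2)
print(math.isclose(sum(x * x for x in short), 1.0))  # True
```

The renormalization matters if you search with dot product: a truncated prefix is no longer unit length, so skipping it breaks the equivalence with cosine similarity.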
Key variants and types
| Model Type | Examples | How It Works | Best For | Key Tradeoff |
|---|---|---|---|---|
| Static word vectors | Word2Vec, GloVe | One vector per word, no context | Historical reference, simple baselines | "Bank" always has the same vector regardless of context |
| Contextual encoder | BERT, RoBERTa | One vector per token, context-dependent | Token classification, NER | Not optimized for sentence similarity without fine-tuning |
| Sentence transformers | SBERT, E5-large | Contrastive-trained on sentence pairs | Semantic search, retrieval | Need to choose the right training data distribution |
| API embedding models | OpenAI text-embedding-3, Cohere Embed v3 | Proprietary encoder, API access only | Production RAG, fast prototyping | Vendor lock-in, no self-hosting |
| Open multilingual | BGE-M3, Nomic Embed | Open weights, multilingual training | Self-hosted search, multilingual apps | Requires GPU infrastructure to serve |
| Multi-modal | CLIP, SigLIP | Images and text in the same vector space | Image search by text query | Lower text-only quality than dedicated text models |
For a 2026 production system, my recommendation is: start with OpenAI text-embedding-3-small for prototyping. Benchmark against BGE-M3 or Nomic Embed on your actual data before committing. Self-hosted models cost less at scale and avoid vendor lock-in.
When to use / when to avoid
When to use
- Semantic search where keyword overlap is insufficient. Synonyms, paraphrases, cross-lingual queries all require embeddings.
- RAG retrieval step: embed the user query, find nearest document chunks, inject them into the LLM context window.
- Content deduplication: cluster near-identical documents by high cosine similarity (threshold > 0.95 for near-duplicates).
- Recommendation systems: embed users and items into the same space, retrieve by nearest neighbor.
- Anomaly detection: inputs with low similarity to every cluster centroid are outliers worth investigating.
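Most of the use cases above reduce to the same primitive: score a query vector against stored vectors and keep the best matches. Here is a brute-force sketch of that step. The document IDs and 3-D vectors are made up for illustration, and a real system would delegate this to a vector database's approximate nearest neighbor index rather than scanning every vector.

```python
import heapq

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def top_k(query_vec, index, k=2):
    """Return the k (score, doc_id) pairs with the highest dot product.
    Assumes every vector is already L2-normalized."""
    scored = ((dot(query_vec, vec), doc_id) for doc_id, vec in index.items())
    return heapq.nlargest(k, scored)

# Hypothetical pre-normalized document vectors.
index = {
    "faq-cancel": [0.6, 0.8, 0.0],
    "faq-refund": [0.8, 0.6, 0.0],
    "faq-shipping": [0.0, 0.0, 1.0],
}
query = [0.6, 0.8, 0.0]  # pretend embedding of the user's question

results = top_k(query, index, k=2)
print([doc_id for _, doc_id in results])  # ['faq-cancel', 'faq-refund']
```

In a RAG pipeline, the text behind the returned IDs is what gets injected into the LLM context window.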
When to avoid
- Exact match requirements. If you need to find "error code XJ-4821" precisely, use traditional search. Embeddings will return semantically similar but wrong results.
- Highly structured data. Tabular data with numeric columns, dates, and IDs belongs in SQL with proper indexes, not in vector space.
- Code search with generic text models. Code-specific models (CodeBERT, StarEncoder, Voyage Code) capture syntax and semantics that general text models miss entirely.
- When you can't afford reindexing. Switching models means re-embedding every document. If your corpus has 100M+ docs and no reindexing pipeline, that is a serious operational constraint.
If you're unsure whether embeddings will help, run a quick A/B test: keyword search vs. embedding search on 1,000 real user queries. The data will tell you within a day.
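A comparison like that needs one metric applied to both systems. The sketch below computes hit rate at k (the share of queries with at least one relevant document in the top k). The query IDs, document IDs, and rankings are toy data invented for illustration, not measured results.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose top-k results contain a relevant document."""
    hits = 0
    for query_id, ranked_docs in retrieved.items():
        if set(ranked_docs[:k]) & relevant[query_id]:
            hits += 1
    return hits / len(retrieved)

# Toy labels: which documents actually answer each query.
relevant = {"q1": {"d1"}, "q2": {"d7"}, "q3": {"d4"}}

# Toy rankings from two hypothetical retrievers.
keyword_results = {"q1": ["d1", "d2"], "q2": ["d3", "d5"], "q3": ["d9", "d8"]}
embedding_results = {"q1": ["d2", "d1"], "q2": ["d7", "d3"], "q3": ["d8", "d9"]}

keyword_score = hit_rate_at_k(keyword_results, relevant, k=2)
embedding_score = hit_rate_at_k(embedding_results, relevant, k=2)
print(f"{keyword_score:.2f} {embedding_score:.2f}")  # 0.33 0.67
```

With real queries the expensive part is the `relevant` labels, so it is worth sampling queries from logs and labeling a few hundred before scaling up.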
Real-world examples
Notion AI search uses embeddings to let users query their workspace in natural language. A user searching "meeting notes about the Q3 launch delay" retrieves relevant notes even when the original text says "discussion about timeline slip for summer release." Notion reported a 40% improvement in search satisfaction after switching from keyword to semantic search.
Shopify product discovery embeds product titles and descriptions, then embeds buyer search queries. When a customer types "comfy chair for home office" and the listing says "ergonomic desk chair with lumbar support," cosine similarity is 0.87, high enough to surface the result. Shopify measured a 12% revenue lift from semantic search versus keyword-only search for long-tail queries.
Spotify podcast recommendations use audio-transcript embeddings to match listeners with new podcasts. By embedding episode transcripts and comparing them to a user's listening history, Spotify surfaces episodes on related topics even when the titles give no clue. This drove a 15% increase in podcast discovery engagement.
OpenAI's training data deduplication uses embeddings to detect near-duplicate examples before fine-tuning. Pairs with cosine similarity above 0.97 are flagged as duplicates. High-similarity pairs in training data lead to memorization rather than generalization, so filtering them out improves model quality.
Limitations and tradeoffs
- Semantic similarity is not factual correctness. "The earth is flat" and "the earth is round" are semantically similar (both about earth's shape) but one is false. Embeddings do not encode truth.
- Domain shift. A model trained on general web text may perform poorly on legal, medical, or financial documents. I've seen retrieval recall drop 30%+ when deploying a general model on specialized corpora. Always evaluate on your domain.
- Dimensionality cost. Storing 3072-dim float32 vectors for 10 million documents requires ~120 GB. Quantization can reduce this by 4x (int8) or 32x (binary), with quality loss that grows as precision drops. Matryoshka truncation is another option.
- Embedding staleness. Documents change. Embeddings don't update themselves. You need a pipeline to detect changed content and re-embed it, which adds operational complexity.
- Embedding drift. This is the sneaky one. You index documents with model v1, then upgrade to model v2 for queries. The two models produce vectors in different geometric spaces. Similarity scores become meaningless. Reindexing when you change models is non-negotiable.
Embedding model mismatch is silent corruption
Query embeddings and document embeddings must come from the same model. If you switch models, you must re-embed all documents. If you don't, results degrade silently rather than failing loudly. This is the #1 production embedding bug.
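One way to turn silent corruption into a loud failure is to store a model identifier next to every vector and refuse to compare across identifiers. The sketch below shows the idea; `StoredVector`, `ModelMismatchError`, and the `model_id` strings are hypothetical names for illustration, not part of any vector database's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    model_id: str   # hypothetical label: model name plus version
    values: tuple

class ModelMismatchError(ValueError):
    """Raised when two vectors come from different embedding models."""

def score(query: StoredVector, doc: StoredVector) -> float:
    """Dot product, but only for vectors produced by the same model."""
    if query.model_id != doc.model_id:
        raise ModelMismatchError(
            f"query uses {query.model_id!r}, document uses {doc.model_id!r}"
        )
    return sum(q * d for q, d in zip(query.values, doc.values))

doc = StoredVector("faq-1", "model-v1", (0.6, 0.8))
query_ok = StoredVector("query", "model-v1", (0.6, 0.8))
query_bad = StoredVector("query", "model-v2", (0.6, 0.8))

same_model_score = score(query_ok, doc)

mismatch_caught = False
try:
    score(query_bad, doc)
except ModelMismatchError:
    mismatch_caught = True
print(mismatch_caught)  # True
```

In practice the same check usually lives at the index level (one index or namespace per model version), which also makes a dual-index migration straightforward.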
The fundamental tension is quality versus cost. Better models produce larger vectors that need more storage and compute. Matryoshka truncation and quantization help, but there is always a tradeoff frontier.
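To see where the quantization savings come from, here is a minimal sketch of scalar int8 and binary quantization. It assumes input values already lie in [-1, 1], as they do for unit-length vectors. Real libraries use calibrated ranges and packed bit arrays, so treat the function names and the fixed scale of 127 as illustrative.

```python
def quantize_int8(vec):
    """Map floats in [-1, 1] to integers in [-127, 127] (1 byte each)."""
    return [max(-127, min(127, round(x * 127))) for x in vec]

def dequantize_int8(quantized):
    """Approximate reconstruction of the original floats."""
    return [q / 127 for q in quantized]

def quantize_binary(vec):
    """Keep only the sign of each value (1 bit each)."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Number of differing bits; the distance used on binary codes."""
    return sum(x != y for x, y in zip(a, b))

def bytes_per_vector(dims):
    return {"float32": dims * 4, "int8": dims, "binary": dims // 8}

vec = [0.6, -0.8, 0.0, 0.01]
print(quantize_int8(vec))      # [76, -102, 0, 1]
print(quantize_binary(vec))    # [1, 0, 0, 1]
print(bytes_per_vector(3072))  # {'float32': 12288, 'int8': 3072, 'binary': 384}
```

A common pattern is to use the binary codes with Hamming distance for a fast first pass, then rescore the surviving candidates with higher-precision vectors.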
How this shows up in interviews
When to bring it up
Mention embeddings proactively in any system design question involving search, recommendations, content matching, or RAG. If the interviewer describes a search feature, say "I'd use embedding-based retrieval" and briefly explain why keyword search falls short.
Depth expected by level
- Junior: knows what embeddings are, can explain cosine similarity, understands they power semantic search.
- Senior: can compare embedding models, explain contrastive training at a high level, discuss distance functions and normalization. Knows about reindexing cost.
- Staff: can discuss Matryoshka representations, MTEB benchmark selection, embedding drift, quantization tradeoffs, and multi-modal embeddings. Can design the full embedding pipeline including indexing, versioning, and serving.
Interview Q&A
| Interviewer Asks | Strong Answer |
|---|---|
| "What is an embedding?" | "A dense vector where geometric proximity encodes semantic similarity, created by a contrastive-trained transformer encoder." |
| "Why cosine similarity over Euclidean?" | "Cosine measures direction (meaning) and ignores magnitude, which carries little semantic signal for text. On normalized vectors, dot product is equivalent and faster." |
| "How do you pick an embedding model?" | "Benchmark on my actual data using MTEB metrics. Don't default to ada-002. Smaller open models often win on specific domains." |
| "What happens when you switch models?" | "Every document must be re-embedded. Mixing vectors from different models produces garbage similarity scores." |
| "How do you handle storage cost at scale?" | "Matryoshka truncation to reduce dimensions, int8 quantization, and binary embeddings for the first-pass ANN filter." |
| "What are multi-modal embeddings?" | "Models like CLIP embed images and text in the same vector space, enabling cross-modal search like text queries finding images." |
Common interview mistakes
| Mistake | Why It's Wrong | Say This Instead |
|---|---|---|
| "I'd just use OpenAI ada-002 for embeddings" | ada-002 is outdated and often outperformed by smaller, cheaper models. Shows you haven't evaluated alternatives. | "I'd benchmark text-embedding-3-small against BGE-M3 on my domain using MTEB retrieval metrics." |
| "Cosine similarity and dot product are different things" | On normalized vectors (which you should always use), they are mathematically identical. | "After L2-normalizing, dot product equals cosine similarity. I normalize at index time and use dot product for speed." |
| "Word2Vec is fine for semantic search" | Word2Vec produces static, context-free word vectors. It cannot distinguish "bank" (money) from "bank" (river). | "Word2Vec is a historical baseline. Production systems use sentence-level embeddings like SBERT or text-embedding-3 for full-context representations." |
| "We can just swap in the new embedding model" | Mixing vectors from different models corrupts your entire index silently. | "Switching models requires a full reindex. I'd run dual indexes during migration and cut over atomically." |
| "Higher dimensions are always better" | Diminishing returns past a point, and storage/latency costs scale linearly with dimension count. | "I'd start with 256 or 512 dims using Matryoshka truncation and increase only if retrieval quality demands it." |
Quick recap
- Embeddings are dense vectors where semantic proximity equals geometric proximity in high-dimensional space.
- Created by contrastive-trained transformer encoders: input text goes through tokenization, attention layers, mean pooling, and normalization to produce one fixed-size vector.
- Cosine similarity (or dot product on normalized vectors) is the standard distance function. Normalize at index time, dot product at query time.
- Use the MTEB benchmark to shortlist models, then evaluate on your actual domain data. Never pick a model based on leaderboard rank alone.
- Matryoshka representations let you truncate dimensions for cheaper storage with minimal quality loss. Start small, scale up only if needed.
- Switching embedding models requires a full reindex. Budget for this operationally and design dual-index migration patterns.
- Multi-modal models (CLIP, SigLIP) embed images and text in the same space, enabling cross-modal search at the cost of per-modality quality.
Related concepts
- Large Language Models - LLMs turn each input token into an embedding before any attention layer runs. Understanding embeddings clarifies how LLMs process meaning.
- Tokenization - Tokenization is the first step in the embedding pipeline. Token boundaries directly affect embedding quality.
- Vector Databases for AI - Vector databases are where embeddings live in production. Store, index, and query patterns depend on embedding properties.
- Retrieval Augmented Generation - RAG uses embedding-based retrieval as its core mechanism. Embedding quality is the ceiling on RAG answer quality.