Fine-tuned embeddings for RAG
Learn how fine-tuning embedding models on domain-specific data improves RAG retrieval accuracy by 5-15%, when it's worth the investment, and how to build training datasets from existing user queries.
TL;DR
- General-purpose embedding models define "semantic similarity" based on internet-scale text. In specialized domains, that similarity function is often wrong for your data, and retrieval fails in ways that no amount of prompt engineering can fix.
- Fine-tuned embeddings retrain the similarity function itself: after fine-tuning, "myocardial infarction" and "heart attack" cluster together, while "sepsis" and "infection" stay apart, because your fine-tuning data taught the model YOUR definition of similarity.
- A fine-tuned BGE-base model (110M parameters) can outperform text-embedding-3-large on your specific domain while being faster and cheaper to serve.
- Expect a 5-15% improvement in NDCG@10 on domain-specific retrieval. Generating 100K training pairs with GPT-4o-mini costs roughly $30-50; fine-tuning compute on an A100 for 4 hours costs roughly $5-10.
- The main cost is operational: once you own the model, you host it. You also need to re-index every document any time you update the model. Start with reranking and contextual retrieval first; fine-tune when those plateau.
The Problem It Solves
Your RAG system serves a customer support desk for an e-commerce platform. A user asks, "why do items keep going out of stock before I can buy them?" The vector database retrieves the top-5 chunks. Three of them are from the shipping FAQ about delivery timelines. Two are from the order tracking help page. The actual inventory and demand-management documentation, which would explain fulfillment constraints perfectly, lands at position 12.
This is not a bug. The embedding model is working exactly as designed. "My order was late" and "shipping was fast" encode near each other because they share delivery vocabulary, even though one is a complaint and the other is praise. "Items always out of stock" encodes further away from "my order was late" even though both express dissatisfaction with your fulfillment operation, because the surface tokens are different. The model learned "semantic similarity" from 800 billion tokens of web text where "shipping" and "order" are much stronger signals than frustrated customer sentiment.
In specialized domains, the general definition of semantic similarity is frequently wrong. In medical RAG, "sepsis" and "infection" should NOT be embedded close together: sepsis is a systemic inflammatory response requiring immediate ICU intervention, while a localized infection may need only topical treatment. A model trained on web text learns that both appear in medical contexts and clusters them. In legal RAG, "consideration" (the exchange of value that makes a contract binding) should NOT embed near "consideration" in the everyday sense. The models embed word senses based on frequency distributions that your domain violates constantly.
The result is retrieval that fails in ways that look random but are actually systematic. Your best documents are consistently ranked below documents that share domain vocabulary but answer different questions. No chunking strategy, no prompt template, no larger context window fixes this. The similarity function itself is wrong.
What Is It?
Fine-tuning an embedding model updates its weights on domain-specific (query, relevant_document) pairs, teaching the model your definition of semantic similarity rather than the default learned from internet-scale text.
Think of it like calibrating a scale. A kitchen scale works fine for general weights. But a jeweler calibrates their scale to accurately measure in milligrams rather than grams, because the jeweler's work requires precision that the general-purpose scale cannot provide. Fine-tuning is calibration: the base model already understands language, but you recalibrate its sensitivity to the distinctions that matter in your domain.
How It Works
The Training Data: (Query, Positive, Negative) Triplets
The core input to fine-tuning is a dataset of training pairs (at minimum) or triplets (preferred). Each example teaches the model about one instance of your domain's similarity:
- query: a natural language question or search query (what a user would type)
- positive document: the chunk that genuinely answers or matches the query
- hard negative document: a chunk that looks similar but does NOT actually answer the query
The hard negative is what separates a good fine-tuning dataset from a mediocre one. Easy negatives (completely unrelated documents) do not teach the model anything useful because the general-purpose model already ranks them low. Hard negatives are documents that the base model ranks high but should be deprioritized, such as the page about "late payment fees" when the user is asking about "late delivery penalties."
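As a rough illustration of the triplet structure described above, one training example could be represented like this. The `Triplet` class, the `to_pair` helper, and the document texts are hypothetical and exist only for this sketch; they are not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One training example: a query, the chunk that answers it, and a near-miss."""
    query: str
    positive: str
    hard_negative: str


# Made-up texts for illustration only. The hard negative shares the word
# "late" and a penalty theme with the query but answers a different question.
example = Triplet(
    query="what is the penalty if my delivery arrives late",
    positive="Late delivery penalties: orders that arrive after the guaranteed date qualify for a shipping refund.",
    hard_negative="Late payment fees: invoices unpaid after the due date incur a monthly fee.",
)


def to_pair(triplet: Triplet) -> tuple[str, str]:
    """Drop the negative when a loss only needs (query, positive) pairs."""
    return (triplet.query, triplet.positive)
```

Keeping the negative optional at the loss level means the same dataset can feed both pair-based and triplet-based objectives.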
Training data sources, in order of quality:
| Source | Quality | Cost | Availability |
|---|---|---|---|
| Real user query logs + click data | Highest | Free (if in production) | Only if deployed |
| Expert annotation | Very high | High (human hours) | Always |
| LLM-generated synthetic pairs | High | Low ($30-50 for 100K pairs) | Always |
| Existing QA/FAQ documents | Medium | Free | If you have docs |
Generating Synthetic Training Pairs with an LLM
For most teams not yet in production, LLM-generated pairs are the practical starting point. For each document chunk, you ask an LLM to generate 3-5 queries that a real user would ask when they need this information.
QUERY_GEN_PROMPT = """
Given the following document excerpt, generate {n} diverse search queries that
a user would ask when looking for this information. The queries should represent
different ways a user might phrase the same information need.
Document excerpt:
{chunk}
Return as a JSON list of {n} strings. Queries must be natural, conversational,
and varied in phrasing. Do not use the same words from the chunk verbatim.
"""
# Example output for a medical protocol chunk:
# [
# "what should i do if a patient has a fever above 103",
# "high fever treatment protocol for adults",
# "when is acetaminophen not enough for fever",
# "fever management in ICU patients"
# ]
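One way the prompt above might be wired into a pair-generation step is sketched below. The `llm` argument is a hypothetical callable (prompt string in, raw response string out) standing in for whatever client you use, `PROMPT` is a shortened stand-in for the full prompt, and `make_pairs` and `fake_llm` are names invented for this sketch.

```python
import json
from typing import Callable

# Shortened stand-in for QUERY_GEN_PROMPT; only {n} and {chunk} are placeholders.
PROMPT = (
    "Given the following document excerpt, generate {n} diverse search queries.\n\n"
    "Document excerpt:\n{chunk}\n\n"
    "Return as a JSON list of {n} strings."
)


def make_pairs(chunk: str, llm: Callable[[str], str], n: int = 3) -> list[tuple[str, str]]:
    """Ask the LLM for queries about one chunk and return (query, chunk) pairs."""
    raw = llm(PROMPT.format(n=n, chunk=chunk))
    try:
        queries = json.loads(raw)
    except json.JSONDecodeError:
        return []  # Skip chunks whose response is not valid JSON.
    if not isinstance(queries, list):
        return []
    pairs, seen = [], set()
    for item in queries:
        if not isinstance(item, str):
            continue
        query = item.strip()
        key = query.lower()
        if query and key not in seen:  # Drop empty and case-insensitive duplicate queries.
            seen.add(key)
            pairs.append((query, chunk))
    return pairs


def fake_llm(prompt: str) -> str:
    """Hypothetical stub that returns a canned response, including one duplicate."""
    return json.dumps([
        "high fever treatment protocol for adults",
        "when is acetaminophen not enough for fever",
        "High fever treatment protocol for adults",
    ])
```

Passing the model call in as a function keeps the parsing and deduplication logic testable without network access.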
GPT-4o-mini generates 100K pairs for approximately $30-50 at current rates (April 2026). Each pair becomes one (query, positive_doc) training example. To generate hard negatives automatically, retrieve the top-5 documents for each generated query using the base model, then exclude the ground-truth document from the result. Those near-misses are your hard negatives. Some of them may in fact be relevant to the query, so spot-check a sample before training on them.
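The mining step can be sketched in plain Python as follows. This is a toy version under stated assumptions: the embeddings are hand-written three-dimensional vectors rather than real model output, and `cosine` and `mine_hard_negatives` are hypothetical helper names. In practice the ranking would come from your vector database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_hard_negatives(
    query_vec: list[float],
    doc_vecs: dict[str, list[float]],
    positive_id: str,
    top_k: int = 5,
) -> list[str]:
    """Rank documents with the base model's vectors; keep the top-k minus the positive."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return [doc_id for doc_id in ranked[:top_k] if doc_id != positive_id]


# Toy vectors, invented for illustration.
docs = {
    "late_delivery": [0.9, 0.1, 0.0],
    "late_payment": [0.8, 0.3, 0.0],
    "returns": [0.1, 0.9, 0.1],
    "careers": [0.0, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]
hard_negatives = mine_hard_negatives(query, docs, positive_id="late_delivery", top_k=3)
```

Here the "late_payment" document ranks just behind the true positive, which is exactly the kind of near-miss the fine-tuned model should learn to push down.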
The Loss Function: Multiple Negatives Ranking Loss
The standard training objective for embedding models is Multiple Negatives Ranking Loss (MNRL). MNRL trains the model to rank the true positive higher than every other document in the training batch.
In a batch of B examples, for query q_i with positive p_i, every other positive document p_j (where j != i) becomes an in-batch negative. The loss is a softmax cross-entropy over the query's scaled cosine similarities to all B positives in the batch: it raises the similarity between q_i and p_i while lowering the similarity to every other positive. Because the softmax denominator sums over all B positives, a batch of 64 gives each query 63 negatives. Larger batches supply more in-batch negatives and a stronger training signal, so use batch sizes of 64 or higher when memory allows.
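To make the in-batch mechanics concrete, here is a minimal pure-Python sketch of the loss computed from a precomputed B x B similarity matrix. The function name `mnrl_loss` is hypothetical, and the `scale` of 20.0 is an assumed temperature-style multiplier, not a value taken from this article. Real training frameworks compute the same quantity on tensors with gradients.

```python
import math


def mnrl_loss(sim: list[list[float]], scale: float = 20.0) -> float:
    """Mean in-batch ranking loss.

    sim[i][j] is the cosine similarity between query i and positive j,
    so the diagonal holds the true (query, positive) pairs.
    Per query: loss_i = -log( exp(scale * sim[i][i]) / sum_j exp(scale * sim[i][j]) )
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(logit - m) for logit in logits))
        total += log_denominator - logits[i]
    return total / len(sim)


# A batch where each query is closest to its own positive scores a lower loss
# than one where each query is closest to the other query's positive.
well_ranked = [[0.9, 0.1], [0.2, 0.8]]
badly_ranked = [[0.1, 0.9], [0.8, 0.2]]
```

When every similarity in a row is equal, the per-query loss reduces to log(B), which is a handy sanity check for an untrained or collapsed model.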
For triplet data (with explicit hard negatives), Triplet Loss is the alternative. The per-example loss is max(0, sim(query, negative) - sim(query, positive) + margin), so the model is penalized whenever the negative scores above the positive or within the margin of it. In other words, the loss pushes the negative's score below the positive's by at least the margin. Triplet loss gives more precise control over the training signal but requires an explicit negative for every training example.
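The penalty described above can be written as a one-line function. This is a sketch of the similarity-based form used in this section. The `margin` default of 0.2 is an arbitrary illustrative value, and some libraries define triplet loss over distances rather than similarities.

```python
def triplet_loss(sim_pos: float, sim_neg: float, margin: float = 0.2) -> float:
    """Hinge penalty: zero once the positive beats the negative by at least `margin`."""
    return max(0.0, sim_neg - sim_pos + margin)


# Positive is far ahead of the negative: no penalty.
no_penalty = triplet_loss(sim_pos=0.9, sim_neg=0.3)

# Negative outranks the positive: penalty is the gap plus the margin.
wrong_order = triplet_loss(sim_pos=0.5, sim_neg=0.6)

# Correct order, but the gap is smaller than the margin: still penalized.
too_close = triplet_loss(sim_pos=0.6, sim_neg=0.5)
```

The third case is the one that matters for hard negatives: ranking them below the positive is not enough, they must fall below it by the full margin.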
The Training Process End-to-End
The Architecture: Starting From a Pre-Trained Encoder
Never train an embedding model from scratch. Always start from a pre-trained checkpoint and fine-tune its weights. The base model already understands syntax, morphology, and general language semantics. Fine-tuning adjusts the higher-level representations to reflect your domain's similarity structure.
The recommended base models for fine-tuning (as of April 2026):
| Model | Params | Dim | Max Tokens | License | Best For |
|---|---|---|---|---|---|
| BGE-base-en-v1.5 | 110M | 768 | 512 | MIT | General starting point |
| E5-base-v2 | 110M | 768 | 512 | MIT | Strong BEIR baseline |
| mxbai-embed-large | 335M | 1024 | 512 | Apache 2.0 | Higher quality, more compute |
| BGE-large-en-v1.5 | 335M | 1024 | 512 | MIT | When size budget allows |
You cannot fine-tune OpenAI's text-embedding-3-small or text-embedding-3-large (closed weights). If you need a hosted solution, Cohere offers a custom embedding model training service through their enterprise API.
Matryoshka Representation Learning (MRL)