Fine-tuned embeddings for RAG

TL;DR

General-purpose embedding models define "semantic similarity" based on internet-scale text. In specialized domains, that similarity function is often wrong, and retrieval fails in ways that no amount of prompt engineering can fix.
Fine-tuned embeddings retrain the similarity function itself: after fine-tuning, "myocardial infarction" and "heart attack" cluster together, while "sepsis" and "infection" stay apart, because your fine-tuning data taught the model YOUR definition of similarity.
A fine-tuned BGE-base model (110M parameters) can outperform text-embedding-3-large on your specific domain while being faster and cheaper to serve.
Expect a 5-15% improvement in NDCG@10 on domain-specific retrieval. Generating 100K synthetic training pairs with GPT-4o-mini costs roughly $30-50; fine-tuning compute on an A100 for 4 hours costs roughly $5-10.
The main cost is operational: once you own the model, you host it. You also need to re-index every document any time you update the model. Start with reranking and contextual retrieval first; fine-tune when those plateau.

The Problem It Solves

Your RAG system serves a customer support desk for an e-commerce platform. A user asks, "why do items keep going out of stock before I can buy them?" The vector database retrieves the top-5 chunks. Three of them are from the shipping FAQ about delivery timelines. Two are from the order tracking help page. The actual inventory and demand-management documentation, which would explain fulfillment constraints perfectly, lands at position 12.

This is not a bug. The embedding model is working exactly as designed. "My order was late" and "shipping was fast" encode near each other because they share order-delivery vocabulary. "Items always out of stock" encodes further from "my order was late", even though both express dissatisfaction with your fulfillment operation, because the surface tokens differ. The model learned "semantic similarity" from 800 billion tokens of web text, where "shipping" and "order" are much stronger signals than frustrated customer sentiment.

In specialized domains, the general definition of semantic similarity is frequently wrong. In medical RAG, "sepsis" and "infection" should NOT be embedded close together: sepsis is a systemic inflammatory response requiring immediate ICU intervention, while a localized infection may need only topical treatment. A model trained on web text learns that both appear in medical contexts and clusters them. In legal RAG, "consideration" (the exchange of value that makes a contract binding) should NOT embed near "consideration" in the everyday sense. The models embed word senses based on frequency distributions that your domain violates constantly.

The result is retrieval that fails in ways that look random but are actually systematic. Your best documents are consistently ranked below documents that share domain vocabulary but answer different questions. No chunking strategy, no prompt template, no larger context window fixes this. The similarity function itself is wrong.

What Is It?

Fine-tuning an embedding model updates its weights on domain-specific (query, relevant_document) pairs, teaching the model your definition of semantic similarity rather than the default learned from internet-scale text.

Think of it like calibrating a scale. A kitchen scale works fine for general weights. But a jeweler calibrates their scale to accurately measure in milligrams rather than grams, because the jeweler's work requires precision that the general-purpose scale cannot provide. Fine-tuning is calibration: the base model already understands language, but you recalibrate its sensitivity to the distinctions that matter in your domain.

How It Works

The Training Data: (Query, Positive, Negative) Triplets

The core input to fine-tuning is a dataset of training pairs (at minimum) or triplets (preferred). Each example teaches the model about one instance of your domain's similarity:

query: a natural language question or search query (what a user would type)
positive document: the chunk that genuinely answers or matches the query
hard negative document: a chunk that looks similar but does NOT actually answer the query

The hard negative is what separates a good fine-tuning dataset from a mediocre one. Easy negatives (completely unrelated documents) do not teach the model anything useful because the general-purpose model already ranks them low. Hard negatives are documents that the base model ranks high but should be deprioritized, such as the page about "late payment fees" when the user is asking about "late delivery penalties."
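As a rough illustration of the record shape described above, the sketch below models one training example as a small dataclass. The names `TrainingExample` and `validate` and the document snippets are hypothetical, chosen only to mirror the "late delivery penalties" versus "late payment fees" case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingExample:
    query: str
    positive: str
    # None means this is a plain (query, positive) pair rather than a triplet
    hard_negative: Optional[str] = None

    def is_triplet(self) -> bool:
        return self.hard_negative is not None


def validate(example: TrainingExample) -> None:
    """Reject examples that cannot teach the model anything."""
    if not example.query.strip() or not example.positive.strip():
        raise ValueError("query and positive must be non-empty")
    if example.hard_negative is not None and example.hard_negative == example.positive:
        raise ValueError("hard negative must differ from the positive")


# Hypothetical snippets: the negative shares vocabulary but answers a different question
example = TrainingExample(
    query="what is the penalty if my delivery arrives late",
    positive="Late delivery penalties: orders delivered after the promised date qualify for a shipping refund.",
    hard_negative="Late payment fees: invoices unpaid after the due date incur a monthly fee.",
)
validate(example)
```

Keeping the hard negative optional lets the same structure hold both pairs and triplets, so a dataset can start as pairs and gain negatives later.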

Training data sources, in order of quality:

Source	Quality	Cost	Availability
Real user query logs + click data	Highest	Free (if in production)	Only if deployed
Expert annotation	Very high	High (human hours)	Always
LLM-generated synthetic pairs	High	Low ($30-50 for 100K pairs)	Always
Existing QA/FAQ documents	Medium	Free	If you have docs

Generating Synthetic Training Pairs with an LLM

For most teams not yet in production, LLM-generated pairs are the practical starting point. For each document chunk, you ask an LLM to generate 3-5 queries that a real user would ask when they need this information.

QUERY_GEN_PROMPT = """
Given the following document excerpt, generate {n} diverse search queries that
a user would ask when looking for this information. The queries should represent
different ways a user might phrase the same information need.

Document excerpt:
{chunk}

Return as a JSON list of {n} strings. Queries must be natural, conversational,
and varied in phrasing. Do not use the same words from the chunk verbatim.
"""

# Example output for a medical protocol chunk:
# [
#   "what should i do if a patient has a fever above 103",
#   "high fever treatment protocol for adults",
#   "when is acetaminophen not enough for fever",
#   "fever management in ICU patients",
# ]
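
One way to wire this prompt into a pipeline is sketched below. It assumes a hypothetical `llm` callable that takes a prompt string and returns the model's raw text reply; `generate_pairs`, `fake_llm`, the shortened `template`, and the placeholder chunk text are illustrative and not part of any library. The full QUERY_GEN_PROMPT can be passed as `prompt_template` unchanged, since it uses the same {n} and {chunk} placeholders.

```python
import json
from typing import Callable, List, Tuple


def generate_pairs(
    chunk: str,
    llm: Callable[[str], str],
    prompt_template: str,
    n: int = 4,
) -> List[Tuple[str, str]]:
    """Return (query, positive_doc) pairs for one document chunk."""
    prompt = prompt_template.format(n=n, chunk=chunk)
    reply = llm(prompt)
    queries = json.loads(reply)  # raises json.JSONDecodeError on malformed output
    if not isinstance(queries, list) or not all(isinstance(q, str) for q in queries):
        raise ValueError("expected a JSON list of strings")

    # Drop blank and duplicate queries while preserving order
    seen = set()
    pairs = []
    for q in queries:
        q = q.strip()
        if q and q.lower() not in seen:
            seen.add(q.lower())
            pairs.append((q, chunk))
    return pairs


def fake_llm(prompt: str) -> str:
    # Stand-in for a real API call, used only to make the sketch runnable
    return json.dumps([
        "high fever treatment protocol for adults",
        "fever management in ICU patients",
    ])


template = "Generate {n} search queries for:\n{chunk}\nReturn a JSON list of {n} strings."
pairs = generate_pairs("Fever management protocol (placeholder text)", fake_llm, template, n=2)
```

Injecting the LLM as a callable keeps the parsing and validation logic testable without network access, and makes it easy to swap providers.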

GPT-4o-mini generates 100K pairs for approximately $30-50 at current rates (April 2026). Each pair becomes one (query, positive_doc) training example. To generate hard negatives automatically, retrieve the top-5 documents for each generated query using the base model, then exclude the ground-truth document from the result. Those near-misses are your hard negatives. Spot-check a sample before training: some near-misses may genuinely answer the query, and treating them as negatives teaches the model the wrong distinction.
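
The mining step can be sketched in plain Python as follows. It assumes the query and document embeddings have already been computed with the base model and are available as lists of floats; `mine_hard_negatives`, `cosine`, and the toy two-dimensional vectors are illustrative names and data, not a library API.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    positive_idx: int,
    top_k: int = 5,
) -> List[int]:
    """Retrieve the top_k documents by cosine similarity, then drop the positive."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:top_k] if i != positive_idx]


# Toy embeddings: document 0 is the ground truth, 1 and 3 are near-misses
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query_vec = [1.0, 0.05]
hard_negatives = mine_hard_negatives(query_vec, doc_vecs, positive_idx=0, top_k=3)
# hard_negatives == [1, 3]
```

At corpus scale the brute-force scoring loop would normally be replaced by a vector index, but the selection logic stays the same.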

The Loss Function: Multiple Negatives Ranking Loss

The standard training objective for embedding models is Multiple Negatives Ranking Loss (MNRL). MNRL trains the model to rank the true positive higher than every other document in the training batch.

In a batch of B examples, for query q_i with positive p_i, every other positive document p_j (where j != i) becomes an in-batch negative. The loss is a softmax cross-entropy over similarity scores: it maximizes the cosine similarity between q_i and p_i while minimizing similarity to every other positive in the batch. The denominator of the softmax sums over all B positives, so a batch of 64 creates 63 negatives per query. Larger batch sizes create more in-batch negatives and produce a stronger training signal. Use batch sizes of 64 or higher.
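
To make the in-batch mechanics concrete, here is a minimal pure-Python sketch of the loss for one batch: for each query i, the loss is the negative log of exp(s * cos(q_i, p_i)) divided by the sum over j of exp(s * cos(q_i, p_j)). The function name `mnrl_loss` and the scale value s = 20 are assumptions made for illustration; a real training run would use a library implementation operating on tensors.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mnrl_loss(
    query_vecs: Sequence[Sequence[float]],
    positive_vecs: Sequence[Sequence[float]],
    scale: float = 20.0,  # assumed temperature-like multiplier
) -> float:
    """Mean cross-entropy where positive i is the correct 'class' for query i."""
    batch = len(query_vecs)
    total = 0.0
    for i in range(batch):
        # One logit per positive in the batch; all j != i act as negatives
        logits = [scale * cosine(query_vecs[i], p) for p in positive_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        total += log_denominator - logits[i]
    return total / batch


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]     # each query matches its own positive
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # each query matches the other's positive
low = mnrl_loss(queries, aligned)
high = mnrl_loss(queries, misaligned)
```

The aligned batch yields a loss near zero and the misaligned batch a loss near the scale value, which is the gradient signal that pulls each query toward its own positive.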

For triplet data (with explicit hard negatives), Triplet Loss is the alternative. The model is penalized whenever sim(query, negative) - sim(query, positive) + margin is above zero. In other words: the loss pushes the negative's score far enough below the positive that a margin is maintained. Triplet loss gives more precise control over the training signal but requires an explicit negative for every training example.
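
The margin condition can be written as a one-line hinge, sketched below on precomputed similarity scores. The function name `triplet_loss` and the margin of 0.2 are assumptions for illustration; library implementations may be defined over distances rather than similarities.

```python
def triplet_loss(sim_positive: float, sim_negative: float, margin: float = 0.2) -> float:
    """Zero once the positive outscores the negative by at least `margin`."""
    return max(0.0, sim_negative - sim_positive + margin)


# Positive already wins by more than the margin: no penalty
satisfied = triplet_loss(sim_positive=0.9, sim_negative=0.3)

# Negative outscores the positive: penalty equals the gap plus the margin
violated = triplet_loss(sim_positive=0.5, sim_negative=0.6)
```

Once a triplet satisfies the margin it contributes no gradient, which is why hard negatives matter: easy negatives are satisfied from the start.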

The Training Process End-to-End

Fine-tuned embedding training pipeline: from raw documents to a domain-calibrated model

Raw Corpus (unprocessed domain documents) -> Chunk + LLM Query Gen -> Hard Negative Mining -> SentenceTransformers Training -> NDCG@10 Eval -> Fine-Tuned Model
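
For the NDCG@10 evaluation stage, a minimal sketch with binary relevance is shown below. The function names and document IDs are hypothetical; the intent is to run the same computation over a held-out query set for both the base and the fine-tuned model and compare the averages.

```python
import math
from typing import List, Sequence, Set


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    # Rank 0 is the top result; the discount is log2(rank + 2)
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance: 1.0 if a document is relevant, else 0.0."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_doc_ids]
    ideal = [1.0] * min(len(relevant_ids), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# The relevant document at position 12 contributes nothing to NDCG@10
ranking = [f"d{i}" for i in range(1, 21)]
score_at_12 = ndcg_at_k(ranking, {"d12"})

# The same document at position 1 gives a perfect score
score_at_1 = ndcg_at_k(["d12"] + ranking, {"d12"})
```

This mirrors the opening example: a correct document ranked twelfth is invisible to a top-10 metric, so moving it into the top results is exactly what the metric rewards.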

The Architecture: Starting From a Pre-Trained Encoder

Never train an embedding model from scratch. Always start from a pre-trained checkpoint and fine-tune its weights. The base model already understands syntax, morphology, and general language semantics. Fine-tuning adjusts the higher-level representations to reflect your domain's similarity structure.
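
A fine-tuning run starting from a pre-trained checkpoint might look roughly like the sketch below. It assumes the `sentence-transformers` (version 3 or later) and `datasets` packages are installed, and that the checkpoint ID `BAAI/bge-base-en-v1.5` is available on the Hugging Face Hub. The helper names `to_rows` and `fine_tune`, the output directory, and every hyperparameter except the batch size of 64 are assumptions chosen for illustration, not tuned recommendations.

```python
from typing import Dict, List, Tuple


def to_rows(triplets: List[Tuple[str, str, str]]) -> List[Dict[str, str]]:
    # Column ORDER matters to the loss: anchor, then positive, then negative
    return [{"anchor": q, "positive": p, "negative": n} for q, p, n in triplets]


def fine_tune(
    triplets: List[Tuple[str, str, str]],
    base_model: str = "BAAI/bge-base-en-v1.5",  # assumed Hub ID
    output_dir: str = "bge-base-domain",        # hypothetical path
):
    # Heavy imports are local so to_rows() works without these packages
    from datasets import Dataset
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import MultipleNegativesRankingLoss
    from sentence_transformers.training_args import BatchSamplers

    model = SentenceTransformer(base_model)  # start from pre-trained weights
    train_dataset = Dataset.from_list(to_rows(triplets))
    loss = MultipleNegativesRankingLoss(model)

    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,               # assumption
        per_device_train_batch_size=64,   # more in-batch negatives per query
        learning_rate=2e-5,               # assumption
        # Avoid duplicate texts in a batch, which would act as false negatives
        batch_sampler=BatchSamplers.NO_DUPLICATES,
    )
    trainer = SentenceTransformerTrainer(
        model=model, args=args, train_dataset=train_dataset, loss=loss
    )
    trainer.train()
    model.save(output_dir)
    return model
```

Calling `fine_tune(triplets)` downloads the checkpoint and trains on whatever GPU is available, so it is best tried first on a few hundred examples to confirm the data format before a full run.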

The recommended base models for fine-tuning (as of April 2026):

Model	Params	Dim	Max Tokens	License	Best For
BGE-base-en-v1.5	110M	768	512	MIT	General starting point
E5-base-v2	110M	768	512	MIT	Strong BEIR baseline
mxbai-embed-large	335M	1024	512	Apache 2.0	Higher quality, more compute
BGE-large-en-v1.5	335M	1024	512	MIT	When size budget allows

You cannot fine-tune OpenAI's text-embedding-3-small or text-embedding-3-large (closed weights). If you need a hosted solution, Cohere offers a custom embedding model training service through their enterprise API.

Matryoshka Representation Learning (MRL)
