LLM caching

TL;DR

Exact-match caching works for deterministic inputs (same SQL template, same form fill). It misses almost all reuse in conversational AI.
Semantic caching embeds the query, finds similar cached queries by cosine similarity, and returns the cached response if the similarity exceeds a threshold. Works for FAQ-style chatbots and search.
Prefix caching (Anthropic, OpenAI) caches the KV state of shared context prefixes at the GPU level. If 100 requests share the same 2K-token system prompt, only the first computes it. Up to 90% cost reduction on the shared portion.
The threshold calibration problem: too high (0.99) means almost no cache hits; too low (0.85) means wrong cached answers. Calibrate against your actual query distribution.
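Cache invalidation is harder than with traditional caches because there are no exact keys to delete. When underlying knowledge changes, you have to find the affected entries by similarity: embed a description of what changed, compare it against the cached queries, and expire the ones that fall close to it.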

The problem it solves

LLM inference costs money on every call. Users asking "what's your return policy?" five hundred times a day generate 500 identical LLM calls, each costing roughly the same as the first. At $0.003 per 1K tokens with a 500-token response, that's $0.75/day for a single FAQ question at moderate traffic. With 50 such questions across 10,000 daily active users, that's roughly $37.50/day in output tokens alone, and more once prompt tokens are counted.
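The arithmetic above can be checked with a short calculation. This is a back-of-envelope sketch using only the figures quoted in the paragraph; it counts response tokens only, and the constant and function names are illustrative.

```python
# Back-of-envelope cost of answering repeated FAQ questions without a cache.
# Figures come from the paragraph above; only response (output) tokens are counted.
PRICE_PER_1K_TOKENS = 0.003        # dollars per 1K tokens
RESPONSE_TOKENS = 500              # tokens per response
CALLS_PER_QUESTION_PER_DAY = 500   # how often one FAQ question is asked
COMMON_QUESTIONS = 50              # number of distinct FAQ questions


def daily_cost(questions: int, calls_per_question: int) -> float:
    """Dollars per day spent on responses that a cache could have served."""
    cost_per_call = PRICE_PER_1K_TOKENS * RESPONSE_TOKENS / 1000
    return questions * calls_per_question * cost_per_call


one_question = daily_cost(1, CALLS_PER_QUESTION_PER_DAY)
all_questions = daily_cost(COMMON_QUESTIONS, CALLS_PER_QUESTION_PER_DAY)
print(f"one question: ${one_question:.2f}/day")
print(f"{COMMON_QUESTIONS} questions: ${all_questions:.2f}/day")
```

Prompt tokens (system prompt, retrieved context) are billed on top of this, so the real figure depends heavily on how long each request is.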

Beyond cost, there's latency. A cached response returns in under 10ms. A fresh LLM inference takes 1-5 seconds depending on model size and response length. For predictable, high-frequency queries, caching isn't just an optimization; it's a UX requirement.

Caching LLM responses correctly is harder than caching REST API responses, though, because LLM inputs are unstructured text. "What's the return policy?" and "Can I return something?" have the same intent but different strings. Exact-match caching fails almost completely for open-ended inputs. You need a smarter strategy.

What is it?

LLM caching is the practice of storing LLM responses and returning stored results (instead of computing new ones) when an identical or sufficiently similar request arrives. There are three distinct caching strategies, and they operate at different layers: exact-match caching (application layer), semantic caching (application layer with vector search), and prefix caching (provider infrastructure layer).

The strategies aren't mutually exclusive. A production system might use prefix caching for shared system prompts, semantic caching for FAQ-style queries, and exact-match caching for deterministic template fills.

How it works

Exact-match caching

The cache key is the exact prompt string. Lookup is O(1) and completely deterministic. It's the right choice when you can guarantee identical inputs: the same code snippet always produces the same docstring request, or the same SQL-generation template filled with the same parameters always produces the same prompt.
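A minimal sketch of the idea, assuming an in-memory dictionary as the store. The class name, the `fake_llm` stand-in, and the choice to hash the model name together with the prompt are illustrative assumptions, not something the article prescribes.

```python
import hashlib
import json
from typing import Callable, Dict


class ExactMatchCache:
    """Maps a hash of (model, prompt) to a stored response."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Include the model so responses from different models never collide.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(
        self, model: str, prompt: str, call_llm: Callable[[str, str], str]
    ) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_llm(model, prompt)
        self._store[key] = response
        return response


def fake_llm(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real provider call.
    return f"[{model}] answer to: {prompt}"


exact_cache = ExactMatchCache()
first = exact_cache.get_or_compute("demo-model", "Write a docstring for add(a, b)", fake_llm)
second = exact_cache.get_or_compute("demo-model", "Write a docstring for add(a, b)", fake_llm)
# A one-character difference (capital W vs lowercase w) is already a miss.
third = exact_cache.get_or_compute("demo-model", "write a docstring for add(a, b)", fake_llm)
print(exact_cache.hits, exact_cache.misses)
```

The third call shows why this strategy breaks down for free-form input: any change in casing, spacing, or wording produces a different key.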

For anything conversational or open-ended, exact-match cache hit rates are usually below 1%. Not worth the complexity.

Semantic caching

Embed the incoming query with the same embedding model you use for retrieval. Search a vector database of previously answered (query, response) pairs for queries above a cosine similarity threshold. If a match is found, return the cached response. If not, call the LLM, store the (query_embedding, response) pair, and return the fresh response.
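The lookup-or-store loop described above can be sketched in a few lines. This is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, a linear scan replaces the vector database, and `fake_llm` replaces the provider call. A bag-of-words vector only matches queries that share words, so it will not catch true paraphrases the way a learned embedding would.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []  # (query_embedding, response)

    def lookup(self, query: str) -> Optional[str]:
        """Return the best cached response if it clears the threshold."""
        q = toy_embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:  # linear scan instead of a vector index
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def answer(self, query: str, call_llm: Callable[[str], str]) -> str:
        cached = self.lookup(query)
        if cached is not None:
            return cached
        response = call_llm(query)
        self.entries.append((toy_embed(query), response))
        return response


calls: List[str] = []


def fake_llm(query: str) -> str:
    # Hypothetical stand-in for a real LLM call; records each invocation.
    calls.append(query)
    return f"answer #{len(calls)}"


sem_cache = SemanticCache(threshold=0.8)
a = sem_cache.answer("What is your return policy?", fake_llm)   # miss, calls the LLM
b = sem_cache.answer("what is your return policy", fake_llm)    # hit, same words
c = sem_cache.answer("How do I reset my password?", fake_llm)   # miss, unrelated
```

In a real deployment the scan would be replaced by an approximate nearest-neighbor index, and the threshold would be tuned against logged queries rather than fixed up front.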

The GPTCache library provides a ready-built semantic cache layer. You configure the similarity threshold, TTL, and storage backend.

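from gptcache import cache, Config
from gptcache.adapter import openai  # wrapper around the legacy openai.ChatCompletion API

# Initialize the cache with a similarity threshold of 0.9.
# The threshold is passed through Config rather than as a direct keyword.
# similarity_evaluation is left elided here; pass one of GPTCache's evaluators.
# A semantic setup also needs embedding_func and data_manager arguments.
cache.init(similarity_evaluation=..., config=Config(similarity_threshold=0.9))

# All openai.ChatCompletion.create() calls made through the adapter now go through the cache
response = openai.ChatCompletion.create(
    model="gpt-4o-2024-11-20",
    messages=[{"role": "user", "content": "What is your return policy?"}]
)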

Semantic caching works well for: customer support chatbots (limited FAQ domain), classification pipelines (same text classified repeatedly), search summarization (same documents summarized multiple times).

Figure: Semantic cache flow. Steps: Incoming Query, Embed Query, Vector Search, then either Cache Hit or LLM Call + Store. Embed the query, search for similar cached responses, return on hit or call the LLM on miss.

Provider prefix caching

Anthropic and OpenAI both offer prefix caching at the infrastructure level. When you send a request, the provider checks whether the opening portion of your prompt was already computed in a recent request. If yes, it reuses the cached KV state and only computes the new tokens.

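OpenAI applies this automatically to prompts of 1,024 tokens or more. Anthropic requires you to opt in by marking the cacheable part of the prompt with a cache_control field. The cost reduction applies only to the cached prefix tokens, typically at a 50-90% discount depending on provider and model, and Anthropic charges a premium on the request that writes the cache.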

The practical impact is largest when many requests share a long system prompt. A 2,000-token system prompt shared across 100 requests means only the first request pays full price for those 2,000 tokens. The other 99 pay the cached rate.
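To put a number on that example, here is a rough calculation. It is a sketch under assumptions: the $0.003 per 1K token price is borrowed from earlier in the article, the 90% discount is the top of the quoted range, every request after the first is assumed to arrive while the cache entry is still alive, and any cache-write premium is ignored.

```python
def prefix_token_cost(
    requests: int, prefix_tokens: int, price_per_1k: float, cached_discount: float
) -> float:
    """Dollars spent on the shared prefix when all but the first request hit the cache."""
    if requests <= 0:
        return 0.0
    full = prefix_tokens / 1000 * price_per_1k   # first request pays full price
    cached = full * (1 - cached_discount)        # later requests pay the cached rate
    return full + (requests - 1) * cached


REQUESTS = 100
PREFIX_TOKENS = 2000
PRICE_PER_1K = 0.003      # assumed price, taken from the cost example earlier
CACHED_DISCOUNT = 0.9     # assumed discount, top of the quoted 50-90% range

without_cache = REQUESTS * PREFIX_TOKENS / 1000 * PRICE_PER_1K
with_cache = prefix_token_cost(REQUESTS, PREFIX_TOKENS, PRICE_PER_1K, CACHED_DISCOUNT)
savings = 1 - with_cache / without_cache

print(f"without cache: ${without_cache:.4f}")
print(f"with cache:    ${with_cache:.4f}")
print(f"savings on the shared prefix: {savings:.1%}")
```

The savings approach the discount rate as the number of requests grows, because the one full-price request is amortized over more cached ones.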

Structure your prompts to front-load the stable content (system instructions, document context) and put the variable content (user message, dynamic context) at the end. Provider prefix caching rewards this structure.
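One way to lay out such a request is sketched below. The payload follows the general shape of Anthropic's Messages API, where a `cache_control` marker on a system block flags the end of the cacheable prefix; the model name, the prompt text, and the `build_request` helper are placeholders, and the exact fields should be checked against current provider documentation. No network call is made.

```python
from typing import Any, Dict

# Stable content: identical across requests, so it belongs at the front.
SYSTEM_INSTRUCTIONS = "You are a support assistant for an online store."
POLICY_DOCUMENT = "Returns are accepted within 30 days with a receipt."  # placeholder text


def build_request(user_message: str, model: str = "example-model") -> Dict[str, Any]:
    """Build a request body with stable content first and variable content last."""
    return {
        "model": model,  # placeholder model name
        "max_tokens": 512,
        "system": [
            {"type": "text", "text": SYSTEM_INSTRUCTIONS},
            {
                "type": "text",
                "text": POLICY_DOCUMENT,
                # Marks the end of the prefix the provider may cache.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Variable content goes last so it never invalidates the shared prefix.
        "messages": [{"role": "user", "content": user_message}],
    }


req_a = build_request("What is your return policy?")
req_b = build_request("Can I return something without a receipt?")
print(req_a["system"] == req_b["system"])      # shared prefix is identical
print(req_a["messages"] == req_b["messages"])  # only the tail differs
```

If per-user or per-request data were interpolated into the system text instead, every request would have a different prefix and none of them would hit the cache.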
