Vector databases
How vector databases store and search high-dimensional embeddings for semantic search, recommendation, and AI applications, including ANN algorithms, similarity functions, and when to use them.
TL;DR
- Vector databases store embeddings (dense numeric arrays that encode semantic meaning) and support nearest-neighbor search over them.
- Traditional databases find exact matches. Vector databases find "similar" items using distance functions like cosine similarity, Euclidean distance, and dot product.
- Approximate Nearest Neighbor (ANN) algorithms like HNSW trade a small amount of recall for query speed, making millisecond-scale searches over millions to billions of vectors practical.
- Use vector databases for semantic search, recommendation systems, anomaly detection, and RAG (retrieval-augmented generation).
- They complement traditional databases rather than replacing them. The vector layer handles similarity search (often with simple metadata filters), while the relational layer remains the source of truth for full records.
The Problem It Solves
Your e-commerce search bar gets a query: "comfortable shoes for standing all day." Your Elasticsearch cluster dutifully tokenizes those words, checks its inverted index, and returns every product listing that contains the exact terms "comfortable," "shoes," "standing," or "all day." The user scrolls past 200 results of standing desks, shoe racks, and "all-day comfort mattresses" before giving up.
Meanwhile, the product they actually want, labeled "ergonomic footwear for long shifts," never appears. It doesn't contain the word "comfortable" or "standing." Keyword search is structurally incapable of bridging that gap. It matches tokens, not meaning.
I see this pattern constantly: teams spend months tuning synonyms, stemming rules, and boosting heuristics in Elasticsearch, trying to brute-force semantic understanding into a system designed for lexical matching. It works for 80% of queries and fails spectacularly on the rest.
The fundamental issue: keyword search scores documents by the tokens they share with the query. It has no concept of meaning. Two sentences can be semantically identical while sharing zero words, and keyword search will score them as completely unrelated.
Vector databases solve this by operating on meaning directly. Instead of comparing strings, they compare numerical representations of meaning (embeddings) using distance functions. Two documents that mean similar things will have similar embeddings, regardless of the words they use.
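To make the lexical gap concrete, here is a minimal sketch that measures token overlap between the query and the product label from the example above. The `tokenize` and `jaccard` helpers are hypothetical names, and Jaccard overlap is only a crude stand-in for what a real search engine scores, but it shows how little the two phrases share at the word level.

```typescript
// Lowercase the text and split it into unique word tokens.
function tokenize(text: string): string[] {
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
  return words.filter((w, i) => words.indexOf(w) === i); // de-duplicate
}

// Jaccard overlap: shared tokens divided by total distinct tokens.
function jaccard(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  const shared = ta.filter((t) => tb.indexOf(t) !== -1);
  const unionSize = ta.length + tb.length - shared.length;
  return unionSize === 0 ? 0 : shared.length / unionSize;
}

const queryText = "comfortable shoes for standing all day";
const productText = "ergonomic footwear for long shifts";

// The only shared token is the stop word "for": 1 shared / 10 distinct.
console.log(jaccard(queryText, productText)); // 0.1
```

The one overlapping token carries no meaning, so a lexical scorer has nothing useful to rank on, even though the two phrases describe the same product need.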
What Is It?
A vector database is a specialized storage system designed to index, store, and query high-dimensional vectors (embeddings). Where a relational database answers "give me the row where id = 42," a vector database answers "give me the 10 items most similar to this input."
Think of it like a library. A traditional database is the card catalog: you look up a book by its exact title, author, or ISBN. A vector database is the librarian who has read every book and can say, "Oh, you liked that one? You'd probably love these three, they explore similar themes." The librarian doesn't match titles; they match meaning.
The core workflow has three stages: content goes in as raw data, gets transformed into vectors by an embedding model, and gets indexed for fast retrieval. At query time, the query itself gets embedded, and the database finds the stored vectors closest to it.
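The workflow can be sketched as a tiny in-memory store. This is an illustration only: `TinyVectorStore` is a hypothetical class, the three-dimensional vectors are hand-made stand-ins for real embedding model output, and the query does a full scan where a real database would use an index.

```typescript
type StoredItem = { id: string; vector: number[] };
type SearchHit = { id: string; score: number };

// Cosine similarity between two equal-length, non-zero vectors.
function cosine(a: number[], b: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

class TinyVectorStore {
  private items: StoredItem[] = [];

  // Ingest: store an already-embedded item, replacing any item with the same ID.
  upsert(id: string, vector: number[]): void {
    this.items = this.items.filter((item) => item.id !== id);
    this.items.push({ id, vector });
  }

  // Query: score every stored vector against the query and keep the best K.
  query(vector: number[], topK: number): SearchHit[] {
    return this.items
      .map((item) => ({ id: item.id, score: cosine(vector, item.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topK);
  }
}

// Hand-made vectors standing in for embeddings of three documents.
const store = new TinyVectorStore();
store.upsert("ergonomic-footwear", [0.9, 0.1, 0.0]);
store.upsert("standing-desk", [0.2, 0.9, 0.1]);
store.upsert("cheese-pizza", [0.0, 0.1, 0.9]);

// Stand-in for the embedded query "comfortable shoes for standing all day".
const hits = store.query([0.8, 0.2, 0.1], 2);
console.log(hits.map((h) => h.id)); // [ 'ergonomic-footwear', 'standing-desk' ]
```

The shape of the API (upsert vectors, query by vector, get back IDs and scores) is the part that carries over to real systems; the scan inside `query` is what indexing replaces.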
For your interview: say "vector databases store embeddings and support approximate nearest-neighbor search, so we can do semantic similarity instead of keyword matching." That one sentence covers what interviewers need to hear.
How It Works
Step 1: Generate embeddings
An embedding is a model's compressed representation of meaning. The model (text encoder, image encoder, etc.) maps an input to a fixed-length float array. Inputs that are semantically similar produce vectors that are close together in the high-dimensional space.
```typescript
// Generating embeddings with OpenAI's API
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const response = await openai.embeddings.create({
  model: "text-embedding-ada-002",
  input: "comfortable shoes for standing all day",
});
const vector = response.data[0].embedding;
// [0.23, -0.41, 0.87, 0.14, ..., 0.62] (1536 dimensions)

// "ergonomic footwear for long shifts" produces a vector
// nearly identical in direction, cosine similarity ~0.94
// "cheese pizza recipe" produces a vector pointing in a
// completely different direction, cosine similarity ~0.12
// (scores are illustrative; actual values depend on the model)
```
The embedding model is a black box to the vector database. It doesn't care how the vectors were produced, only that semantically similar inputs yield nearby vectors. You can use OpenAI, Sentence-BERT, CLIP (for images), or any model that outputs fixed-length float arrays.
Step 2: Choose a similarity function
The database needs a way to measure "closeness" between two vectors. Three functions dominate:
| Function | Formula | Range | Best for |
|---|---|---|---|
| Cosine similarity | cos(A, B) = (A . B) / (‖A‖ x ‖B‖) | -1 to 1 (1 = same direction) | Text embeddings where direction matters more than magnitude |
| Euclidean distance (L2) | sqrt(sum((ai - bi)^2)) | 0 to infinity (0 = identical) | Image embeddings, spatial data, normalized vectors |
| Dot product | sum(ai x bi) | unbounded | MIPS (maximum inner product search), recommendation scoring |
My recommendation: start with cosine similarity for text workloads. It's the most forgiving because it ignores vector magnitude, so embeddings compare reasonably whether or not they were normalized at ingestion. (Vectors from different embedding models are not comparable under any function, so always embed queries and documents with the same model.) Switch to dot product only if you're doing recommendation scoring where magnitude encodes confidence.
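The three functions in the table are short enough to write out. The sketch below uses plain arrays and hypothetical helper names (`dot`, `norm`, `cosineSimilarity`, `euclideanDistance`, `normalize`); it assumes equal-length, non-zero vectors and skips input validation.

```typescript
// Dot product: sum of element-wise products.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Vector length (L2 norm).
function norm(a: number[]): number {
  return Math.sqrt(dot(a, a));
}

// Cosine similarity: dot product divided by the product of the lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (norm(a) * norm(b));
}

// Euclidean (L2) distance: length of the difference vector.
function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
  return Math.sqrt(sum);
}

// Scale a vector to unit length.
function normalize(a: number[]): number[] {
  const len = norm(a);
  return a.map((x) => x / len);
}

const shortVec = [3, 4];
const longVec = [6, 8]; // same direction, twice the magnitude

console.log(cosineSimilarity(shortVec, longVec)); // 1
console.log(dot(shortVec, longVec)); // 50
console.log(euclideanDistance(shortVec, longVec)); // 5
```

Cosine similarity reports these two vectors as a perfect match, while dot product and L2 both react to the difference in magnitude. Once vectors are scaled to unit length with `normalize`, the three functions produce the same ranking, because for unit vectors the squared L2 distance equals 2 minus twice the dot product.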
Step 3: Index the vectors
A naive nearest-neighbor search over N vectors of dimension d requires computing the distance to all N vectors, which is O(N · d) work per query. At 100 million vectors with 1536 dimensions each, that's over 150 billion multiply-add operations per query. Not viable for interactive latency.
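A minimal sketch of that naive scan, with a counter that makes the cost visible. `bruteForceTopK` and `multiplyAddsPerQuery` are hypothetical helpers, and dot product is used as the score purely to keep the inner loop short.

```typescript
type Scored = { index: number; score: number };

// Exact search: score every stored vector against the query, then keep the top K.
function bruteForceTopK(
  query: number[],
  vectors: number[][],
  k: number
): { top: Scored[]; multiplyAdds: number } {
  let multiplyAdds = 0;
  const scored: Scored[] = vectors.map((v, index) => {
    let score = 0;
    for (let i = 0; i < query.length; i++) {
      score += query[i] * v[i]; // one multiply-add per dimension
      multiplyAdds++;
    }
    return { index, score };
  });
  scored.sort((x, y) => y.score - x.score);
  return { top: scored.slice(0, k), multiplyAdds };
}

// Cost of one exact query: every vector times every dimension.
function multiplyAddsPerQuery(numVectors: number, dimensions: number): number {
  return numVectors * dimensions;
}

const scan = bruteForceTopK([1, 0.5], [[1, 0], [0, 1], [1, 1]], 2);
console.log(scan.top.map((s) => s.index), scan.multiplyAdds); // [ 2, 0 ] 6

console.log(multiplyAddsPerQuery(100e6, 1536)); // 153600000000
```

The work grows linearly with both collection size and dimensionality, and the sort on top of it adds more. Results are exact, which makes this scan useful as a ground-truth baseline even when it is too slow to serve traffic.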
This is where Approximate Nearest Neighbor (ANN) algorithms come in. They build index structures that avoid scanning every vector, trading a small loss in recall for query times in the low milliseconds.
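"Recall" here has a concrete definition: the fraction of the true K nearest neighbors that the approximate search actually returned. A small sketch, assuming you have the exact top-K IDs from a full scan to compare against (`recallAtK` is a hypothetical helper, and both ID lists are assumed to contain no duplicates):

```typescript
// recall@K = (approximate results that are true neighbors) / (true neighbors)
function recallAtK(exactIds: string[], approxIds: string[]): number {
  if (exactIds.length === 0) return 1; // nothing to find, so nothing was missed
  const found = approxIds.filter((id) => exactIds.indexOf(id) !== -1).length;
  return found / exactIds.length;
}

const exactTop5 = ["a", "b", "c", "d", "e"]; // ground truth from a full scan
const approxTop5 = ["a", "b", "c", "e", "x"]; // what the ANN index returned

// The index missed "d" and returned "x" instead: 4 of 5 true neighbors found.
console.log(recallAtK(exactTop5, approxTop5)); // 0.8
```

Measuring this on a sample of real queries is how you decide whether a faster index configuration is still accurate enough for your use case.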
Step 4: Query
At query time, embed the user's input with the same model, then search the index for the K nearest vectors. The database returns vector IDs and similarity scores (plus any metadata stored alongside the vectors), which you can enrich with full records from a relational store.
```typescript
// Querying Pinecone
import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone(); // reads PINECONE_API_KEY from the environment
const index = pc.index("products"); // hypothetical index name

const queryResponse = await index.query({
  vector: queryEmbedding, // from Step 1; same model used for ingestion
  topK: 10, // return 10 nearest neighbors
  includeMetadata: true,
  filter: { // metadata pre-filter
    category: { $eq: "shoes" },
    price: { $lte: 100 },
    in_stock: { $eq: true },
  },
});
// queryResponse.matches:
// [{ id: "prod_4821", score: 0.94, metadata: { name: "..." } },
//  { id: "prod_1173", score: 0.91, metadata: { name: "..." } },
//  ...]
```
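The enrichment step is a lookup by ID that preserves the ranking from the vector search. Here is a sketch under stated assumptions: `enrich` is a hypothetical helper, the `ProductRow` shape and its values are made-up placeholders, and the rows are assumed to have been fetched from the relational store already (for example with a single `WHERE id IN (...)` query).

```typescript
type Match = { id: string; score: number };
type ProductRow = { id: string; title: string; price: number };
type EnrichedResult = ProductRow & { score: number };

// Join vector search matches with relational rows, keeping the match order.
function enrich(
  matches: Match[],
  rowsById: { [id: string]: ProductRow }
): EnrichedResult[] {
  const enriched: EnrichedResult[] = [];
  for (const match of matches) {
    const row = rowsById[match.id];
    // Skip IDs missing from the relational store (e.g. a product deleted
    // after its vector was indexed).
    if (row) enriched.push({ ...row, score: match.score });
  }
  return enriched;
}

const sampleMatches: Match[] = [
  { id: "prod_4821", score: 0.94 },
  { id: "prod_1173", score: 0.91 },
  { id: "prod_0000", score: 0.9 }, // stale vector with no backing row
];

// Placeholder rows; titles and prices are invented for the example.
const sampleRows: { [id: string]: ProductRow } = {
  prod_1173: { id: "prod_1173", title: "Placeholder product B", price: 64 },
  prod_4821: { id: "prod_4821", title: "Placeholder product A", price: 89 },
};

console.log(enrich(sampleMatches, sampleRows).map((r) => r.id));
// [ 'prod_4821', 'prod_1173' ]
```

Treating the relational store as the source of truth means stale vectors drop out of results instead of surfacing records that no longer exist.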
Key Components
| Component | Role |
|---|---|
| Embedding model | Converts raw content (text, images, code) into fixed-length float vectors. External to the database. |
| Vector index | Data structure (HNSW graph, IVF clusters, PQ codebook) that enables fast approximate search. |
| Distance function | Measures similarity between vectors (cosine, L2, dot product). Configured per index. |
| Metadata store | Stores non-vector attributes (price, category, timestamps) for filtering and enrichment. |
| Ingestion pipeline | Batches raw content through the embedding model and writes vectors + metadata to the database. |
| Query engine | Embeds the query, searches the index, applies metadata filters, returns ranked results. |
| Quantization layer | Compresses vectors (e.g., float32 to int8) to reduce memory and storage costs at a small recall penalty. |
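To show what the quantization layer does, here is a sketch of one simple scheme: symmetric scalar quantization, which maps each float onto an 8-bit integer using a per-vector scale. This is an assumption chosen for illustration, not the scheme any particular database uses, and `quantize` and `dequantize` are hypothetical helpers.

```typescript
// Map each value from [-maxAbs, maxAbs] onto the int8 range [-127, 127].
function quantize(vector: number[]): { codes: Int8Array; scale: number } {
  let maxAbs = 0;
  for (const x of vector) maxAbs = Math.max(maxAbs, Math.abs(x));
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const codes = new Int8Array(vector.length);
  for (let i = 0; i < vector.length; i++) {
    codes[i] = Math.round(vector[i] / scale);
  }
  return { codes, scale };
}

// Recover approximate floats from the stored codes and scale.
function dequantize(codes: Int8Array, scale: number): number[] {
  const restored: number[] = [];
  for (let i = 0; i < codes.length; i++) restored.push(codes[i] * scale);
  return restored;
}

const original = [0.23, -0.41, 0.87, 0.14];
const quantized = quantize(original);
const restoredVec = dequantize(quantized.codes, quantized.scale);

// Each value now takes 1 byte instead of 4; the rounding error per
// dimension is at most half of one quantization step (scale / 2).
const maxError = Math.max(
  ...original.map((x, i) => Math.abs(x - restoredVec[i]))
);
console.log(maxError <= quantized.scale / 2 + 1e-12); // true
```

The 4x memory saving comes at the price of that bounded rounding error, which slightly perturbs distances and is the source of the small recall penalty.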
Types / Variations
ANN Algorithms
The indexing algorithm is the most consequential choice you'll make with a vector database. Each algorithm makes a different trade-off between build time, memory usage, query latency, and recall.
HNSW (Hierarchical Navigable Small World)