Vector databases
How vector databases store and search high-dimensional embeddings for semantic search, recommendation, and AI applications, including ANN algorithms, similarity functions, and when to use them.
TL;DR
- Vector databases store embeddings (dense numeric arrays that encode semantic meaning) and support nearest-neighbor search over them.
- Traditional databases find exact matches. Vector databases find "similar" items using distance functions like cosine similarity, Euclidean distance, and dot product.
- Approximate Nearest Neighbor (ANN) algorithms like HNSW trade perfect recall for query speed, making sub-millisecond searches over billions of vectors practical.
- Use vector databases for semantic search, recommendation systems, anomaly detection, and RAG (retrieval-augmented generation).
- They complement traditional databases rather than replacing them. Metadata filtering happens in a relational layer, vector similarity in the vector layer.
The Problem It Solves
Your e-commerce search bar gets a query: "comfortable shoes for standing all day." Your Elasticsearch cluster dutifully tokenizes those words, checks its inverted index, and returns every product listing that contains the exact terms "comfortable," "shoes," "standing," or "all day." The user scrolls past 200 results of standing desks, shoe racks, and "all-day comfort mattresses" before giving up.
Meanwhile, the product they actually want, labeled "ergonomic footwear for long shifts," never appears. It doesn't contain the word "comfortable" or "standing." Keyword search is structurally incapable of bridging that gap. It matches tokens, not meaning.
I see this pattern constantly: teams spend months tuning synonyms, stemming rules, and boosting heuristics in Elasticsearch, trying to brute-force semantic understanding into a system designed for lexical matching. It works for 80% of queries and fails spectacularly on the rest.
The fundamental issue: keyword search operates on string equality. It has no concept of meaning. Two sentences can be semantically identical while sharing zero words, and keyword search will score them as completely unrelated.
Vector databases solve this by operating on meaning directly. Instead of comparing strings, they compare numerical representations of meaning (embeddings) using distance functions. Two documents that mean similar things will have similar embeddings, regardless of the words they use.
What Is It?
A vector database is a specialized storage system designed to index, store, and query high-dimensional vectors (embeddings). Where a relational database answers "give me the row where id = 42," a vector database answers "give me the 10 items most similar to this input."
Think of it like a library. A traditional database is the card catalog: you look up a book by its exact title, author, or ISBN. A vector database is the librarian who has read every book and can say, "Oh, you liked that one? You'd probably love these three, they explore similar themes." The librarian doesn't match titles; they match meaning.
The core workflow has three stages: content goes in as raw data, gets transformed into vectors by an embedding model, and gets indexed for fast retrieval. At query time, the query itself gets embedded, and the database finds the stored vectors closest to it.
For your interview: say "vector databases store embeddings and support approximate nearest-neighbor search, so we can do semantic similarity instead of keyword matching." That one sentence covers what interviewers need to hear.
How It Works
Step 1: Generate embeddings
An embedding is a model's compressed representation of meaning. The model (text encoder, image encoder, etc.) maps an input to a fixed-length float array. Inputs that are semantically similar produce vectors that are close together in the high-dimensional space.
// Generating embeddings with OpenAI's API
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const response = await openai.embeddings.create({
model: "text-embedding-ada-002",
input: "comfortable shoes for standing all day"
});
const vector = response.data[0].embedding;
// [0.23, -0.41, 0.87, 0.14, ..., 0.62] (1536 dimensions)
// "ergonomic footwear for long shifts" produces a vector
// nearly identical in direction, cosine similarity ~0.94
// "cheese pizza recipe" produces a vector pointing in a
// completely different direction, cosine similarity ~0.12
The embedding model is a black box to the vector database. It doesn't care how the vectors were produced, only that semantically similar inputs yield nearby vectors. You can use OpenAI, Sentence-BERT, CLIP (for images), or any model that outputs fixed-length float arrays.
Step 2: Choose a similarity function
The database needs a way to measure "closeness" between two vectors. Three functions dominate:
| Function | Formula | Range | Best for |
|---|---|---|---|
| Cosine similarity | cos(A,B) = (A · B) / (‖A‖ ‖B‖) | -1 to 1 (1 = identical) | Text embeddings where direction matters more than magnitude |
| Euclidean distance (L2) | sqrt(sum((a_i - b_i)^2)) | 0 to infinity (0 = identical) | Image embeddings, spatial data, normalized vectors |
| Dot product | sum(a_i × b_i) | unbounded | MIPS (maximum inner product search), recommendation scoring |
My recommendation: start with cosine similarity for text workloads. It's the most forgiving because it ignores vector magnitude, so embeddings from different models or normalization schemes still compare reasonably. Switch to dot product only if you're doing recommendation scoring where magnitude encodes confidence.
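The three functions reduce to a few lines of arithmetic. A minimal sketch in plain TypeScript (no vector DB involved) that makes the cosine-vs-magnitude distinction concrete:

```typescript
// Three standard similarity/distance functions over number[] vectors.

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function norm(a: number[]): number {
  return Math.sqrt(dot(a, a));
}

// Cosine similarity: direction only, magnitude ignored. Range [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (norm(a) * norm(b));
}

// Euclidean (L2) distance: 0 means identical, grows without bound.
function euclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

// Scaling a vector changes dot product and L2, but not cosine.
const a = [0.5, 0.5];
const scaled = [1.0, 1.0]; // same direction, double the magnitude
console.log(cosineSimilarity(a, scaled)); // 1 (identical direction)
console.log(dot(a, scaled));              // 1
console.log(euclideanDistance(a, scaled)); // ~0.707 (magnitude differs)
```

This is why cosine is forgiving across normalization schemes: the `scaled` vector is "identical" under cosine but not under L2 or dot product.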
Step 3: Index the vectors
A naive nearest-neighbor search over N vectors requires computing distance to all N vectors, which is O(N) per query. At 100 million vectors with 1536 dimensions each, that's over 150 billion floating-point operations per query. Not viable for interactive latency.
This is where Approximate Nearest Neighbor (ANN) algorithms come in. They build index structures that trade perfect recall for sub-millisecond query times.
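For reference, the O(N) brute-force baseline that ANN indexes avoid is easy to sketch: score every vector against the query, sort, take the top K. Exact, but linear in corpus size:

```typescript
// Exact (brute-force) top-K search: O(N) distance computations per query.
type ScoredId = { id: string; score: number };

function bruteForceTopK(
  query: number[],
  corpus: Map<string, number[]>,
  k: number
): ScoredId[] {
  const scored: ScoredId[] = [];
  for (const [id, vec] of corpus) {
    // inline cosine similarity
    let dp = 0, qn = 0, vn = 0;
    for (let i = 0; i < query.length; i++) {
      dp += query[i] * vec[i];
      qn += query[i] ** 2;
      vn += vec[i] ** 2;
    }
    scored.push({ id, score: dp / Math.sqrt(qn * vn) });
  }
  return scored.sort((x, y) => y.score - x.score).slice(0, k);
}

const corpus = new Map<string, number[]>([
  ["a", [1, 0]],
  ["b", [0.9, 0.1]],
  ["c", [0, 1]],
]);
console.log(bruteForceTopK([1, 0], corpus, 2).map((r) => r.id)); // ["a", "b"]
```

Under ~100K vectors this is often fast enough on its own, which is exactly the "you may not need an index" threshold discussed later.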
Step 4: Query
At query time, embed the user's input with the same model, then search the index for the K nearest vectors. The database returns vector IDs and similarity scores, which you enrich with metadata from a relational store.
// Querying Pinecone
const queryResponse = await index.query({
vector: queryEmbedding, // same model used for ingestion
topK: 10, // return 10 nearest neighbors
includeMetadata: true,
filter: { // metadata pre-filter
category: { $eq: "shoes" },
price: { $lte: 100 },
in_stock: { $eq: true }
}
});
// queryResponse.matches:
// [{ id: "prod_4821", score: 0.94, metadata: { name: "..." } },
// { id: "prod_1173", score: 0.91, metadata: { name: "..." } },
// ...]
Key Components
| Component | Role |
|---|---|
| Embedding model | Converts raw content (text, images, code) into fixed-length float vectors. External to the database. |
| Vector index | Data structure (HNSW graph, IVF clusters, PQ codebook) that enables fast approximate search. |
| Distance function | Measures similarity between vectors (cosine, L2, dot product). Configured per index. |
| Metadata store | Stores non-vector attributes (price, category, timestamps) for filtering and enrichment. |
| Ingestion pipeline | Batches raw content through the embedding model and writes vectors + metadata to the database. |
| Query engine | Embeds the query, searches the index, applies metadata filters, returns ranked results. |
| Quantization layer | Compresses vectors (e.g., float32 to int8) to reduce memory and storage costs at a small recall penalty. |
Types / Variations
ANN Algorithms
The indexing algorithm is the most consequential choice you'll make with a vector database. Each algorithm makes a different trade-off between build time, memory usage, query latency, and recall.
HNSW (Hierarchical Navigable Small World)
HNSW builds a layered proximity graph at index time. Each layer is a subset of the full graph, with the top layers sparse (long-range jumps) and the bottom layer containing all vectors (short-range, precise search). At query time, you enter at the top and navigate downward.
HNSW achieves sub-millisecond queries on millions of vectors with 95%+ recall. The trade-off: it stores the full graph in memory alongside the vectors, so memory usage is 1.5-2x the raw vector size.
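The core HNSW move, stripped of the layer hierarchy, is a greedy walk on a proximity graph: from the current node, hop to any neighbor closer to the query, and stop at a local minimum. A toy single-layer sketch (the graph and vectors here are hand-built for illustration; real HNSW constructs the graph incrementally and searches multiple layers with a beam, not a single walker):

```typescript
// Greedy search on one layer of a proximity graph: hop to a neighbor
// that is closer to the query until no neighbor improves.
type Graph = Map<number, number[]>; // node id -> neighbor ids

const vectors: number[][] = [
  [0, 0], [1, 0], [2, 0], [3, 0], [3, 1],
];
const graph: Graph = new Map([
  [0, [1]], [1, [0, 2]], [2, [1, 3]], [3, [2, 4]], [4, [3]],
]);

const dist = (a: number[], b: number[]) =>
  Math.hypot(a[0] - b[0], a[1] - b[1]);

function greedySearch(query: number[], entry: number): number {
  let current = entry;
  while (true) {
    const better = (graph.get(current) ?? []).find(
      (n) => dist(vectors[n], query) < dist(vectors[current], query)
    );
    if (better === undefined) return current; // local minimum = result
    current = better;
  }
}

console.log(greedySearch([3.2, 0.9], 0)); // walks 0 -> 1 -> 2 -> 3 -> 4, returns 4
```

The sparse upper layers in real HNSW exist to make the first few hops long-range, so the walk reaches the right neighborhood in O(log N) steps instead of crawling edge by edge.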
IVF (Inverted File Index)
IVF clusters vectors using k-means at build time, assigning each vector to its nearest centroid. At query time, it searches only the nearest nprobe clusters instead of the full dataset.
Build: cluster N vectors into K buckets (K = sqrt(N) is typical)
Query: find nearest nprobe buckets → search only those
Trade-off: faster build than HNSW, ~10-20% lower recall at same latency
Best for: indexes rebuilt frequently (recommendation models retrained daily)
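The bucket-then-probe idea fits in a short sketch. Centroids are hard-coded here for illustration; a real IVF index learns them with k-means at build time:

```typescript
// IVF sketch: assign vectors to their nearest centroid at build time;
// at query time, scan only the nprobe closest buckets.
const centroids = [[0, 0], [10, 0], [0, 10]]; // toy stand-in for k-means output
const data: { id: string; vec: number[] }[] = [
  { id: "a", vec: [1, 1] }, { id: "b", vec: [9, 1] },
  { id: "c", vec: [1, 9] }, { id: "d", vec: [11, -1] },
];

const l2 = (a: number[], b: number[]) =>
  Math.hypot(a[0] - b[0], a[1] - b[1]);

const nearestCentroid = (v: number[]) =>
  centroids.reduce((best, _, i) =>
    l2(centroids[i], v) < l2(centroids[best], v) ? i : best, 0);

// Build: bucket each vector under its nearest centroid.
const buckets = new Map<number, { id: string; vec: number[] }[]>();
for (const item of data) {
  const c = nearestCentroid(item.vec);
  buckets.set(c, [...(buckets.get(c) ?? []), item]);
}

// Query: rank centroids by distance, scan only the top nprobe buckets.
function ivfSearch(query: number[], nprobe: number): string {
  const probed = centroids
    .map((c, i) => ({ i, d: l2(c, query) }))
    .sort((x, y) => x.d - y.d)
    .slice(0, nprobe)
    .flatMap(({ i }) => buckets.get(i) ?? []);
  return probed.sort((x, y) => l2(x.vec, query) - l2(y.vec, query))[0].id;
}

console.log(ivfSearch([8, 0], 1)); // "b" — only the [10,0] bucket is scanned
```

Raising nprobe scans more buckets, trading latency back for recall; nprobe = K degenerates to brute force.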
Product Quantization (PQ)
PQ compresses vectors by splitting each into sub-vectors and quantizing each sub-vector to its nearest centroid in a small codebook. This reduces memory by 4-32x at a recall cost of 5-15%.
PQ is often combined with IVF (IVF-PQ) for large-scale systems where HNSW's memory requirements are prohibitive. Pinecone and Milvus both use IVF-PQ as their default index for collections over 10 million vectors.
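The compression mechanic can be shown with a toy codebook (hand-picked here; real PQ learns each sub-codebook with k-means, typically 256 entries so each code fits in one byte):

```typescript
// Product Quantization sketch: split each vector into sub-vectors and
// replace each sub-vector with the index of its nearest codebook entry.
// Here a 4-dim float vector (16 bytes as float32) becomes 2 small codes.
const codebook: number[][][] = [
  [[0, 0], [1, 1], [5, 5]], // sub-codebook for dims 0-1
  [[0, 0], [2, 2], [9, 9]], // sub-codebook for dims 2-3
];

const l2 = (a: number[], b: number[]) =>
  Math.sqrt(a.reduce((s, ai, i) => s + (ai - b[i]) ** 2, 0));

function encode(vec: number[]): number[] {
  return codebook.map((subCodes, m) => {
    const sub = vec.slice(m * 2, m * 2 + 2); // m-th sub-vector
    let best = 0;
    subCodes.forEach((c, i) => {
      if (l2(c, sub) < l2(subCodes[best], sub)) best = i;
    });
    return best;
  });
}

function decode(codes: number[]): number[] {
  return codes.flatMap((c, m) => codebook[m][c]);
}

const original = [0.9, 1.2, 8.5, 9.3];
const codes = encode(original); // [1, 2] — two small integers
const approx = decode(codes);   // [1, 1, 9, 9] — lossy reconstruction
console.log(codes, approx);
```

The recall cost comes directly from that lossy reconstruction: distances are computed against codebook entries, not the original floats.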
ScaNN (Google)
ScaNN uses anisotropic vector quantization, which preserves the angular relationships between vectors better than standard PQ. It's the algorithm behind Google's production embedding search and is available as an open-source library. Benchmark-leading recall at matched latencies, but less widely supported outside Google's ecosystem.
ANN Algorithm Comparison
| Algorithm | Query latency | Recall@10 | Memory overhead | Build time | Best for |
|---|---|---|---|---|---|
| HNSW | < 1ms | 95-99% | 1.5-2x vectors | Medium | General purpose, highest recall required |
| IVF | 1-5ms | 85-95% | 1.1x vectors | Fast | Frequently rebuilt indexes |
| IVF-PQ | 2-10ms | 80-90% | 0.1-0.3x vectors | Fast | Billions of vectors, memory-constrained |
| ScaNN | < 1ms | 95-98% | 0.5-1x vectors | Medium | Google ecosystem, high-throughput |
Recall is not accuracy
When a vector database reports "95% recall," it means the approximate search returns 95% of the vectors that a brute-force exact search would find. The other 5% are valid results that the algorithm missed due to its approximation shortcuts. For search and recommendations this is usually acceptable. For compliance or audit workloads where you must find every match, you may need exact search or very high recall settings, both of which cost significantly more in latency and compute.
Dedicated Vector DBs vs. pgvector
This is the "build vs. buy" question I get asked most often.
| Dimension | pgvector (PostgreSQL extension) | Dedicated vector DB (Pinecone, Weaviate, Qdrant) |
|---|---|---|
| Setup complexity | Add extension to existing PG | New infrastructure, new operational burden |
| ACID transactions | Full PostgreSQL transactions | Limited or none |
| Metadata filtering | Native SQL WHERE clauses, JOINs | Custom filter syntax, limited JOIN support |
| Scale ceiling | ~5-10M vectors per instance | Billions of vectors, distributed by design |
| ANN performance | Good (IVF via ivfflat index, HNSW via hnsw index) | Optimized (purpose-built indexes, GPU-accelerated) |
| Ecosystem | Any PostgreSQL client, ORMs, tooling | Vendor-specific SDKs, cloud-only for some |
My rule of thumb: if you have fewer than 5 million vectors and already run PostgreSQL, start with pgvector. You get transactional consistency, familiar SQL, and zero new infrastructure. Once you hit 10M+ vectors or need sub-5ms p99 latency at scale, evaluate dedicated vector databases.
Metadata Filtering
Vector search alone returns semantic neighbors. Real applications need to combine vector similarity with business constraints: "show me semantically similar shoes that are in-stock, under $100, in size 10."
This is trickier than it sounds. The ANN index is optimized for vector distance, not attribute filtering. Three approaches:
Pre-filter applies metadata constraints first, then runs ANN search on the filtered subset. Works well when the filter is highly selective (e.g., "shoes in size 10" cuts the dataset by 90%). Fails when the filtered subset is still millions of vectors, because you're rebuilding the search scope dynamically.
Post-filter runs ANN search on the full index and discards non-matching results. Fast vector retrieval, but if your top-10 results all fail the metadata filter, you return empty. Common with tight constraints on popular categories.
Hybrid (what most production systems use) over-fetches by a configurable factor. Request top 100 candidates, apply metadata filters, return the best 10 that pass. Tunable, predictable, and the default in Pinecone, Weaviate, and Qdrant.
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Semantic search: finds results by meaning, not keywords | Index build time: HNSW indexing millions of vectors takes minutes to hours |
| Sub-millisecond query latency with ANN at scale | Memory-intensive: HNSW stores vectors + graph in RAM (1536-dim x 1M vectors = ~6GB before graph overhead) |
| Multi-modal: same architecture works for text, images, audio, code | Recall is approximate: ANN misses 1-5% of true nearest neighbors |
| Natural fit for LLM/RAG pipelines | No ACID transactions: most vector DBs lack transactional guarantees |
| Scales to billions of vectors with sharding | Embedding model dependency: query quality is bounded by embedding quality, not the database |
| Metadata filtering combines structured and unstructured search | Operational complexity: new infrastructure to monitor, tune, and maintain |
The fundamental tension is recall vs. latency. Perfect recall (finding every true neighbor) requires brute-force search, which is too slow. Sub-millisecond latency requires approximation, which misses some results. Every tuning decision, from HNSW's M and efConstruction parameters to quantization bit-width, is a dial on this spectrum.
When to Use It / When to Avoid It
Use a vector database when:
- You need semantic search ("find things similar to X" rather than "find exact match for X")
- You're building RAG pipelines and need to retrieve relevant context for an LLM
- Recommendations require "more like this" functionality over a large content library
- You're doing cross-modal search (search images with text queries, or vice versa)
- Deduplication or plagiarism detection across millions of documents
- Anomaly detection by finding vectors far from all cluster centers
Avoid a vector database when:
- Exact lookups by ID or key will do. Use a relational or key-value store.
- Full-text keyword search with boolean queries. Use Elasticsearch or similar.
- Transactional workloads. Most vector databases lack ACID guarantees.
- Your dataset is under 100K items. Brute-force cosine similarity in-memory (even in NumPy) is fast enough that you don't need an index.
- You haven't validated that embeddings improve your task. If keyword search works for your users, a vector database adds complexity for no gain.
If you can do it with a SQL LIKE clause or Elasticsearch, you probably should. Vector databases shine precisely where those tools fail: when the user's intent and the document's language don't share vocabulary.
Real-World Examples
OpenAI / ChatGPT Plugins
OpenAI's plugin system used Pinecone as the retrieval layer for ChatGPT plugins. When a user asked a question, the system embedded the query, searched Pinecone for relevant plugin documentation chunks, and injected those chunks into the LLM context. This is the canonical RAG pattern: embed, retrieve, generate. Pinecone handles billions of vectors across its platform with p99 query latency under 50ms.
Spotify
Spotify represents every song, podcast, and user as an embedding vector. When you listen to a track, Spotify embeds it, finds the nearest neighbors in its vector space, and surfaces those as "Discover Weekly" recommendations. Their system handles over 500 million users and 100+ million tracks. The shift from collaborative filtering to embedding-based recommendations improved discovery metrics by 30%+ according to their engineering blog.
Notion AI
Notion's AI search feature embeds every page, database row, and comment in a user's workspace. When a user searches "our Q4 pricing decision," the system finds semantically relevant documents even if they're titled "Revenue Strategy Update 2025" with no mention of "pricing" or "Q4." They use a vector database alongside their existing PostgreSQL store, with metadata filtering for workspace permissions and access control.
How This Shows Up in Interviews
When to bring it up
Mention vector databases when the problem involves:
- Search that needs to understand intent, not just keywords (e-commerce, knowledge base, support tickets)
- Recommendation systems ("show similar items")
- Any LLM-powered feature (RAG, semantic caching, document Q&A)
- Content moderation (finding similar-to-known-bad content)
You don't need to go deep on ANN algorithms unless the interviewer asks. Say "we'd use a vector database with HNSW indexing for sub-millisecond similarity search" and move on to the broader architecture.
Depth expected at senior/staff level
- Understand the embedding pipeline: raw content enters, fixed-dimension float vectors leave. The model choice bounds the quality.
- Know the three similarity functions and when each applies (cosine for text, L2 for images, dot product for MIPS).
- Explain the recall vs. latency trade-off: HNSW gives high recall but uses more memory; IVF-PQ trades recall for memory efficiency at billion-vector scale.
- Articulate when pgvector is sufficient vs. when you need a dedicated vector DB.
- Describe metadata filtering strategies (pre/post/hybrid) and why naive post-filtering fails with tight constraints.
- Know quantization as a cost lever: float32 to int8 cuts memory 4x at 2-5% recall loss.
Interview shortcut: the RAG diagram
For any LLM-powered feature, sketch the RAG pipeline: user query enters, embedding model converts to vector, vector DB returns top-K context chunks, chunks get injected into the LLM prompt, LLM generates a grounded response. This five-box diagram answers "how does your AI feature avoid hallucination?" in 30 seconds.
Follow-up Q&A
| Interviewer asks | Strong answer |
|---|---|
| Why not just use Elasticsearch for semantic search? | ES does BM25 (keyword matching). It can be extended with vector search via kNN, but it's not optimized for it. Dedicated vector DBs have purpose-built ANN indexes, quantization, and distributed vector sharding. For hybrid keyword + semantic, ES kNN is a reasonable starting point. |
| How do you handle stale embeddings when content updates? | Re-embed on content change events. Batch re-embedding for bulk updates. Use a content hash to detect changes. For RAG, chunk-level re-embedding avoids re-processing entire documents. |
| What happens when your embedding model changes? | All vectors must be re-embedded with the new model. You can't mix vectors from different models in the same index. Run a migration pipeline: re-embed all content, build a new index, swap atomically. This is the main operational cost of model upgrades. |
| How do you scale a vector database horizontally? | Shard vectors by a hash of the document ID across nodes. Each shard holds a subset of the index. Queries fan out to all shards, each returns its local top-K, and a coordinator merges results. Same scatter-gather pattern as Elasticsearch. |
| What's the cost model for vector databases? | Dominated by memory (vectors + index in RAM), then compute (embedding generation), then storage (vectors on disk for persistence). Quantization (float32 to int8) cuts memory 4x. Tiered storage (hot vectors in RAM, cold on SSD) helps for infrequent queries. |
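The scatter-gather answer above has a simple coordinator-side core: each shard returns its local top-K, and the coordinator merges by score. A minimal sketch:

```typescript
// Scatter-gather merge: combine per-shard top-K lists into a global top-K.
type Match = { id: string; score: number };

function mergeShardResults(shardResults: Match[][], k: number): Match[] {
  return shardResults
    .flat()                              // gather all local results
    .sort((a, b) => b.score - a.score)   // rank globally by score
    .slice(0, k);                        // global top-K
}

const shard1: Match[] = [{ id: "a", score: 0.95 }, { id: "b", score: 0.70 }];
const shard2: Match[] = [{ id: "c", score: 0.90 }, { id: "d", score: 0.85 }];
console.log(mergeShardResults([shard1, shard2], 3).map((m) => m.id));
// ["a", "c", "d"]
```

Note that each shard must return a full K results (not K divided by shard count), because in the worst case all global top-K neighbors live on one shard.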
Test Your Understanding
Quick Recap
- Vector databases store embeddings (dense float arrays) and find nearest neighbors using distance functions, enabling search by meaning rather than keywords.
- The three similarity functions are cosine (text, direction-only), Euclidean L2 (images, spatial), and dot product (recommendation scoring). Cosine is the safe default.
- HNSW is the most common ANN index: sub-millisecond queries at 95%+ recall, at the cost of storing the full graph in memory (1.5-2x raw vector size).
- Metadata filtering is how you combine "semantically similar" with "in-stock, under $100, size 10." Hybrid over-fetch is the production default.
- pgvector is sufficient under 5-10 million vectors with existing PostgreSQL. Dedicated vector databases (Pinecone, Weaviate, Qdrant) are justified at billion-vector scale or when you need sub-5ms p99.
- Vector databases complement relational databases. They handle similarity search; PostgreSQL handles transactions, joins, access control, and exact lookups.
- Model changes require full re-embedding and atomic index swaps. You cannot mix vectors from different models in the same index.
Related Concepts
- Databases: Vector databases complement relational stores. Understanding when to use which is essential for a complete data architecture.
- Sharding: At billion-vector scale, vector databases shard indexes across nodes using the same scatter-gather pattern as distributed databases.
- Caching: Semantic caching (caching LLM responses by embedding similarity of the query) is an emerging pattern that combines caching and vector search.
- Message Queues: Embedding pipelines often use Kafka or SQS to decouple content ingestion from vector indexing, especially for high-throughput or batch re-embedding.