Vector databases for AI
Learn how vector databases power RAG and semantic search, how HNSW and IVF indexes work, why metadata filtering is the most common production failure point, and how to choose between pgvector and dedicated solutions.
TL;DR
- Vector databases answer "find me the K most similar vectors to this query" in milliseconds by building approximate nearest neighbor (ANN) index structures that skip the vast majority of stored vectors.
- Exact KNN costs O(n*d) per query (n vectors, d dimensions), which is too slow for interactive use past a few hundred thousand vectors. ANN trades 1-5% recall for a speedup of 100x or more.
- HNSW offers O(log N) query time with 95%+ recall but is memory-hungry. IVF uses less memory with tunable precision via the nprobe parameter.
- Use pgvector when you have under 1M vectors and an existing PostgreSQL setup. Switch to Qdrant, Pinecone, or Weaviate for scale, dedicated performance, or filter-aware multi-tenancy.
- Metadata filtering is the most common production failure point: post-filter applied to narrow metadata conditions silently returns far fewer results than requested, destroying RAG quality.
The problem it solves
You are building a RAG system with 500,000 embedded document chunks. Each chunk is a vector of 1,536 floating-point numbers. A user submits a query. You embed it the same way. Now you need the 10 most semantically similar chunks in under 100ms.
There is no SQL WHERE clause for this. Cosine similarity between two 1,536-dimensional vectors requires 1,536 multiplications. Doing this for 500,000 vectors is 768 million multiplications per query. Feasible in a cron job, not in a user-facing request.
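To make the cost concrete, here is a minimal sketch of the exact scan described above, using only the Python standard library. The function names (`cosine_similarity`, `exact_top_k`) and the toy corpus are illustrative assumptions, not part of any particular database API.

```python
import heapq
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between a and b.
    The dot product alone costs one multiplication per dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k):
    """Score every stored vector against the query and keep the k best.
    Returns (similarity, index) pairs, best first."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)

# Dot-product cost of one query at the scale described above.
n_vectors, dims = 500_000, 1_536
multiplications_per_query = n_vectors * dims  # 768,000,000

# Toy corpus (hypothetical data), small enough to run instantly.
rng = random.Random(0)
corpus = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1_000)]
top = exact_top_k(corpus[42], corpus, k=3)
```

The scan touches every stored vector on every query, which is exactly the work an ANN index is designed to avoid.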
Vector databases build ANN index structures that let queries skip the vast majority of stored vectors and still return the nearest neighbors with high probability. You trade 1-5% recall for queries that stay under 20ms even at hundreds of millions of vectors.
What is it?
A vector database stores high-dimensional float vectors alongside metadata, and provides an API for approximate nearest neighbor search. You insert vectors during ingestion with their metadata. At query time, you send a query vector, specify K, and receive the K most similar stored vectors with their associated metadata.
Think of it like a library organized by meaning rather than by catalog code. In a traditional library, you find books by exact index lookup. In a meaning-organized library, you describe what you are looking for in plain language, and the system surfaces books that are semantically most similar to your description, even if none use your exact words.
Vector databases are not a replacement for your relational database. They are a specialized read-optimized index for one query type: approximate nearest neighbor search in high-dimensional space. Production systems use them alongside PostgreSQL or DynamoDB, not instead of them.
How it works
Similarity metrics
Before finding nearest neighbors, you need a definition of "similar." Three common metrics:
Cosine similarity measures the angle between two vectors, ignoring magnitude. Range is -1 to 1. Best for text embeddings where the direction encodes meaning. Use this for OpenAI and sentence-transformer embeddings.
Dot product (inner product) measures direction and magnitude. Equivalent to cosine similarity when vectors are normalized. Slightly faster to compute. Used internally by many retrieval systems.
Euclidean distance (L2) measures absolute distance in vector space. Less common for text but used in image and multimodal embeddings.
For RAG with OpenAI or sentence-transformer embeddings, normalize your vectors and use cosine similarity. This is the standard and generally the best choice; on normalized vectors, dot product produces the same ranking.
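The three metrics can be compared on a pair of small vectors. This is a sketch with hand-picked 2-dimensional inputs (an assumption for readability; real embeddings have hundreds or thousands of dimensions), and the helper names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the magnitude
c = [4.0, -3.0]  # orthogonal to a

cos_ab = cosine_similarity(a, b)    # direction only: a and b look identical
cos_ac = cosine_similarity(a, c)    # orthogonal vectors score zero
dot_ab = dot(a, b)                  # grows with magnitude
l2_ab = euclidean_distance(a, b)    # nonzero even though directions match
dot_unit = dot(normalize(a), normalize(b))  # matches cos_ab after normalization
```

Cosine treats `a` and `b` as identical because they point the same way, while Euclidean distance separates them. After normalization the dot product and cosine similarity agree, which is why many systems normalize at ingestion and use the dot product internally.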
ANN index structures
The two dominant ANN index types in production are HNSW and IVF.
HNSW (Hierarchical Navigable Small World) builds a multi-layer graph of vectors. The top layers are sparse and hold long-range connections used for coarse navigation. The base layer contains all vectors in a dense graph. A search enters at the top layer, greedily moves toward the query vector, drops down a layer each time it can get no closer, and finishes at the base layer with the approximate nearest neighbors.
HNSW characteristics: O(log N) query time, 95-99% recall at typical settings, high memory usage (roughly 100-200 bytes per vector beyond raw float storage).
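The layer-by-layer descent can be sketched in a few lines. This is a deliberately simplified illustration, not the HNSW algorithm itself: it assumes two layers, builds each layer's links by brute force, picks the upper layer deterministically (every 20th point), and tracks a single current node instead of a candidate list. Real implementations build links incrementally, assign layers randomly, and search with a wider beam.

```python
import math
import random

def build_knn_graph(points, ids, m):
    """Link each id to its m nearest ids. Brute force, for illustration only."""
    graph = {}
    for i in ids:
        others = sorted((j for j in ids if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        graph[i] = others[:m]
    return graph

def greedy_search(points, graph, entry, query):
    """Move to the neighbor closest to the query until no neighbor improves."""
    current = entry
    while True:
        best = min(graph[current],
                   key=lambda j: math.dist(points[j], query),
                   default=current)
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current  # local minimum on this layer
        current = best

rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]
all_ids = list(range(len(points)))
top_ids = all_ids[::20]  # sparse upper layer (hypothetical selection rule)

layers = [
    build_knn_graph(points, top_ids, m=3),  # coarse navigation
    build_knn_graph(points, all_ids, m=8),  # dense base layer
]

query = (0.3, 0.7)
entry = top_ids[0]
for graph in layers:  # the result of one layer seeds the search in the next
    entry = greedy_search(points, graph, entry, query)
approx_nearest = entry
```

The search only evaluates distances along the path it walks, not against all 200 points. The price is that it can stop at a local minimum that is not the true nearest neighbor, which is where the recall loss in ANN comes from.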
IVF (Inverted File Index) clusters vectors using k-means at build time. At query time, only the nearest M clusters are searched (M is the tunable nprobe parameter). Lower memory than HNSW, tunable precision, but lower recall at equivalent speed.
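A minimal IVF sketch under stated assumptions: a few iterations of plain k-means, L2 distance, and random 4-dimensional toy vectors. The names `build_ivf` and `ivf_search` are hypothetical helpers for illustration and do not correspond to any library's API.

```python
import math
import random

def nearest_centroid(v, centroids):
    return min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))

def kmeans(vectors, n_clusters, iterations, rng):
    centroids = rng.sample(vectors, n_clusters)
    for _ in range(iterations):
        groups = [[] for _ in range(n_clusters)]
        for v in vectors:
            groups[nearest_centroid(v, centroids)].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids

def build_ivf(vectors, n_clusters, rng):
    """Cluster at build time and record which vector ids fall in each cluster."""
    centroids = kmeans(vectors, n_clusters, iterations=5, rng=rng)
    lists = [[] for _ in range(n_clusters)]
    for i, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(i)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are nearest the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]

rng = random.Random(7)
vectors = [tuple(rng.random() for _ in range(4)) for _ in range(2_000)]
centroids, lists = build_ivf(vectors, n_clusters=16, rng=rng)

query = tuple(rng.random() for _ in range(4))
fast = ivf_search(query, vectors, centroids, lists, k=10, nprobe=2)
full = ivf_search(query, vectors, centroids, lists, k=10, nprobe=16)
exact = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))[:10]
recall = len(set(fast) & set(exact)) / len(exact)
```

Raising `nprobe` scans more clusters and pushes recall toward the exact result; with `nprobe` equal to the number of clusters the search degenerates into a full scan.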
Metadata filtering
Every production RAG system filters by metadata alongside vector search. You do not retrieve from all 500,000 chunks. You retrieve "the top-10 chunks from documents uploaded by tenant_id=42 in the last 30 days."
Metadata filtering has three implementations with very different tradeoffs:
Post-filter: Run full ANN search first, get top-K candidates, then filter by metadata. Simple to implement but silently breaks with narrow filters. If a tenant has 200 documents in a 500,000-document index, most ANN results belong to other tenants and get discarded.
Pre-filter: Filter the metadata index first (get matching IDs), then run ANN only within that set. Controls the search corpus but degrades ANN recall for small filtered sets, because the graph structure was built for the full index.
Filter-aware ANN: The ANN search integrates metadata conditions during traversal. Qdrant's filterable HNSW is the most mature implementation. Recall and result count stay consistent even with narrow filters. This is the correct approach for multi-tenant systems.
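The difference between post-filter and pre-filter can be reproduced on a toy index. In this sketch an exact scan stands in for the ANN index (an assumption made so the example stays self-contained), and the corpus, tenant IDs, and function names are hypothetical. Tenant 42 owns 20 of 5,000 chunks.

```python
import math
import random

def top_k(query, items, k):
    """Exact nearest neighbors by L2 distance, standing in for an ANN index."""
    return sorted(items, key=lambda item: math.dist(query, item["vector"]))[:k]

def post_filter_search(query, items, k, predicate):
    candidates = top_k(query, items, k)                       # search everything first
    return [item for item in candidates if predicate(item)]   # then discard non-matches

def pre_filter_search(query, items, k, predicate):
    allowed = [item for item in items if predicate(item)]     # narrow the corpus first
    return top_k(query, allowed, k)                           # then search within it

def is_tenant_42(item):
    return item["tenant_id"] == 42

rng = random.Random(3)
items = [
    {"id": i,
     "tenant_id": 42 if i % 250 == 0 else 7,
     "vector": tuple(rng.random() for _ in range(8))}
    for i in range(5_000)
]

query = tuple(rng.random() for _ in range(8))
post = post_filter_search(query, items, k=10, predicate=is_tenant_42)
pre = pre_filter_search(query, items, k=10, predicate=is_tenant_42)
```

Pre-filter returns a full set of 10 because it searches only tenant 42's chunks, while post-filter keeps only whatever tenant-42 chunks happened to land in the global top 10. Because the stand-in search is exact, this sketch does not show the recall loss that pre-filtering can cause on a real graph index.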
Post-filter silently degrades quality
Post-filter does not raise an error. It can silently return 2-3 results when you requested 10, because most of the ANN top-K candidates belong to other tenants. Monitor result count per query and alert when it drops below K.
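One way to monitor result count is a thin check around every retrieval call. The sketch below uses only the standard `logging` module; `check_result_count` and `ShortfallTracker` are hypothetical names, and a production setup would more likely emit a metric to an existing monitoring system.

```python
import logging

logger = logging.getLogger("retrieval")

def check_result_count(results, k, query_id):
    """Return True if the query came back full; log a warning otherwise."""
    if len(results) < k:
        logger.warning(
            "query %s returned %d of %d requested results",
            query_id, len(results), k,
        )
        return False
    return True

class ShortfallTracker:
    """Tracks the fraction of queries that returned fewer than k results."""

    def __init__(self):
        self.total = 0
        self.short = 0

    def record(self, results, k, query_id):
        self.total += 1
        if not check_result_count(results, k, query_id):
            self.short += 1

    @property
    def shortfall_rate(self):
        return self.short / self.total if self.total else 0.0

tracker = ShortfallTracker()
tracker.record(["a", "b", "c"], k=10, query_id="q1")  # short: 3 of 10
tracker.record(list(range(10)), k=10, query_id="q2")  # full: 10 of 10
```

Alerting on the shortfall rate rather than on individual queries avoids noise from tenants that legitimately have fewer than K matching chunks.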
Full ingestion and query pipeline