Vector databases
How vector databases store and search high-dimensional embeddings for semantic search, recommendation, and AI applications, including ANN algorithms, similarity functions, and when to use them.
TL;DR
- Vector databases store embeddings (dense numeric arrays that encode semantic meaning) and support nearest-neighbor search over them.
- Traditional databases find exact matches. Vector databases find "similar" items using distance functions like cosine similarity, Euclidean distance, and dot product.
- Approximate Nearest Neighbor (ANN) algorithms like HNSW trade perfect recall for query speed, making sub-millisecond searches over billions of vectors practical.
- Use vector databases for semantic search, recommendation systems, anomaly detection, and RAG (retrieval-augmented generation).
- They complement traditional databases rather than replacing them. Metadata filtering happens in a relational layer, vector similarity in the vector layer.
The Problem It Solves
Your e-commerce search bar gets a query: "comfortable shoes for standing all day." Your Elasticsearch cluster dutifully tokenizes those words, checks its inverted index, and returns every product listing that contains the exact terms "comfortable," "shoes," "standing," or "all day." The user scrolls past 200 results of standing desks, shoe racks, and "all-day comfort mattresses" before giving up.
Meanwhile, the product they actually want, labeled "ergonomic footwear for long shifts," never appears. It doesn't contain the word "comfortable" or "standing." Keyword search is structurally incapable of bridging that gap. It matches tokens, not meaning.
I see this pattern constantly: teams spend months tuning synonyms, stemming rules, and boosting heuristics in Elasticsearch, trying to brute-force semantic understanding into a system designed for lexical matching. It works for 80% of queries and fails spectacularly on the rest.
The fundamental issue: keyword search operates on string equality. It has no concept of meaning. Two sentences can be semantically identical while sharing zero words, and keyword search will score them as completely unrelated.
Vector databases solve this by operating on meaning directly. Instead of comparing strings, they compare numerical representations of meaning (embeddings) using distance functions. Two documents that mean similar things will have similar embeddings, regardless of the words they use.
What Is It?
A vector database is a specialized storage system designed to index, store, and query high-dimensional vectors (embeddings). Where a relational database answers "give me the row where id = 42," a vector database answers "give me the 10 items most similar to this input."
Think of it like a library. A traditional database is the card catalog: you look up a book by its exact title, author, or ISBN. A vector database is the librarian who has read every book and can say, "Oh, you liked that one? You'd probably love these three, they explore similar themes." The librarian doesn't match titles; they match meaning.
The core workflow has three stages: content goes in as raw data, gets transformed into vectors by an embedding model, and gets indexed for fast retrieval. At query time, the query itself gets embedded, and the database finds the stored vectors closest to it.
For your interview: say "vector databases store embeddings and support approximate nearest-neighbor search, so we can do semantic similarity instead of keyword matching." That one sentence covers what interviewers need to hear.
How It Works
Step 1: Generate embeddings
An embedding is a model's compressed representation of meaning. The model (text encoder, image encoder, etc.) maps an input to a fixed-length float array. Inputs that are semantically similar produce vectors that are close together in the high-dimensional space.
// Generating embeddings with OpenAI's API
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const response = await openai.embeddings.create({
model: "text-embedding-ada-002",
input: "comfortable shoes for standing all day"
});
const vector = response.data[0].embedding;
// [0.23, -0.41, 0.87, 0.14, ..., 0.62] (1536 dimensions)
// "ergonomic footwear for long shifts" produces a vector
// nearly identical in direction, cosine similarity ~0.94
// "cheese pizza recipe" produces a vector pointing in a
// completely different direction, cosine similarity ~0.12
The embedding model is a black box to the vector database. It doesn't care how the vectors were produced, only that semantically similar inputs yield nearby vectors. You can use OpenAI, Sentence-BERT, CLIP (for images), or any model that outputs fixed-length float arrays.
Step 2: Choose a similarity function
The database needs a way to measure "closeness" between two vectors. Three functions dominate:
| Function | Formula | Range | Best for |
|---|---|---|---|
| Cosine similarity | cos(A,B) = (A · B) / (‖A‖ ‖B‖) | -1 to 1 (1 = identical) | Text embeddings where direction matters more than magnitude |
| Euclidean distance (L2) | sqrt(sum((a_i - b_i)^2)) | 0 to infinity (0 = identical) | Image embeddings, spatial data, normalized vectors |
| Dot product | sum(a_i × b_i) | unbounded | MIPS (maximum inner product search), recommendation scoring |
My recommendation: start with cosine similarity for text workloads. It's the most forgiving because it ignores vector magnitude, so embeddings from different models or normalization schemes still compare reasonably. Switch to dot product only if you're doing recommendation scoring where magnitude encodes confidence.
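The three functions reduce to a few lines of arithmetic. A minimal sketch in plain TypeScript (no vector DB involved) that makes the cosine-vs-magnitude distinction concrete:

```typescript
// Three standard similarity/distance functions over number[] vectors.

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function norm(a: number[]): number {
  return Math.sqrt(dot(a, a));
}

// Cosine similarity: direction only, magnitude ignored. Range [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (norm(a) * norm(b));
}

// Euclidean (L2) distance: 0 means identical, grows without bound.
function euclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

// Scaling a vector changes dot product and L2, but not cosine.
const a = [0.5, 0.5];
const scaled = [1.0, 1.0]; // same direction, double the magnitude
console.log(cosineSimilarity(a, scaled)); // 1 (identical direction)
console.log(dot(a, scaled));              // 1
console.log(euclideanDistance(a, scaled)); // ~0.707 (magnitude differs)
```

This is why cosine is forgiving across normalization schemes: the `scaled` vector is "identical" under cosine but not under L2 or dot product.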
Step 3: Index the vectors
A naive nearest-neighbor search over N vectors requires computing distance to all N vectors, which is O(N) per query. At 100 million vectors with 1536 dimensions each, that's over 150 billion floating-point operations per query. Not viable for interactive latency.
This is where Approximate Nearest Neighbor (ANN) algorithms come in. They build index structures that trade perfect recall for sub-millisecond query times.
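For reference, the O(N) brute-force baseline that ANN indexes avoid is easy to sketch: score every vector against the query, sort, take the top K. Exact, but linear in corpus size:

```typescript
// Exact (brute-force) top-K search: O(N) distance computations per query.
type ScoredId = { id: string; score: number };

function bruteForceTopK(
  query: number[],
  corpus: Map<string, number[]>,
  k: number
): ScoredId[] {
  const scored: ScoredId[] = [];
  for (const [id, vec] of corpus) {
    // inline cosine similarity
    let dp = 0, qn = 0, vn = 0;
    for (let i = 0; i < query.length; i++) {
      dp += query[i] * vec[i];
      qn += query[i] ** 2;
      vn += vec[i] ** 2;
    }
    scored.push({ id, score: dp / Math.sqrt(qn * vn) });
  }
  return scored.sort((x, y) => y.score - x.score).slice(0, k);
}

const corpus = new Map<string, number[]>([
  ["a", [1, 0]],
  ["b", [0.9, 0.1]],
  ["c", [0, 1]],
]);
console.log(bruteForceTopK([1, 0], corpus, 2).map((r) => r.id)); // ["a", "b"]
```

Under ~100K vectors this is often fast enough on its own, which is exactly the "you may not need an index" threshold discussed later.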
Step 4: Query
At query time, embed the user's input with the same model, then search the index for the K nearest vectors. The database returns vector IDs and similarity scores, which you enrich with metadata from a relational store.
// Querying Pinecone
const queryResponse = await index.query({
vector: queryEmbedding, // same model used for ingestion
topK: 10, // return 10 nearest neighbors
includeMetadata: true,
filter: { // metadata pre-filter
category: { $eq: "shoes" },
price: { $lte: 100 },
in_stock: { $eq: true }
}
});
// queryResponse.matches:
// [{ id: "prod_4821", score: 0.94, metadata: { name: "..." } },
// { id: "prod_1173", score: 0.91, metadata: { name: "..." } },
// ...]
Key Components
| Component | Role |
|---|---|
| Embedding model | Converts raw content (text, images, code) into fixed-length float vectors. External to the database. |
| Vector index | Data structure (HNSW graph, IVF clusters, PQ codebook) that enables fast approximate search. |
| Distance function | Measures similarity between vectors (cosine, L2, dot product). Configured per index. |
| Metadata store | Stores non-vector attributes (price, category, timestamps) for filtering and enrichment. |
| Ingestion pipeline | Batches raw content through the embedding model and writes vectors + metadata to the database. |
| Query engine | Embeds the query, searches the index, applies metadata filters, returns ranked results. |
| Quantization layer | Compresses vectors (e.g., float32 to int8) to reduce memory and storage costs at a small recall penalty. |
Types / Variations
ANN Algorithms
The indexing algorithm is the most consequential choice you'll make with a vector database. Each algorithm makes a different trade-off between build time, memory usage, query latency, and recall.
HNSW (Hierarchical Navigable Small World)
HNSW builds a layered proximity graph at index time. Each layer is a subset of the full graph, with the top layers sparse (long-range jumps) and the bottom layer containing all vectors (short-range, precise search). At query time, you enter at the top and navigate downward.
HNSW achieves sub-millisecond queries on millions of vectors with 95%+ recall. The trade-off: it stores the full graph in memory alongside the vectors, so memory usage is 1.5-2x the raw vector size.
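The core HNSW move, stripped of the layer hierarchy, is a greedy walk on a proximity graph: from the current node, hop to any neighbor closer to the query, and stop at a local minimum. A toy single-layer sketch (the graph and vectors here are hand-built for illustration; real HNSW constructs the graph incrementally and searches multiple layers with a beam, not a single walker):

```typescript
// Greedy search on one layer of a proximity graph: hop to a neighbor
// that is closer to the query until no neighbor improves.
type Graph = Map<number, number[]>; // node id -> neighbor ids

const vectors: number[][] = [
  [0, 0], [1, 0], [2, 0], [3, 0], [3, 1],
];
const graph: Graph = new Map([
  [0, [1]], [1, [0, 2]], [2, [1, 3]], [3, [2, 4]], [4, [3]],
]);

const dist = (a: number[], b: number[]) =>
  Math.hypot(a[0] - b[0], a[1] - b[1]);

function greedySearch(query: number[], entry: number): number {
  let current = entry;
  while (true) {
    const better = (graph.get(current) ?? []).find(
      (n) => dist(vectors[n], query) < dist(vectors[current], query)
    );
    if (better === undefined) return current; // local minimum = result
    current = better;
  }
}

console.log(greedySearch([3.2, 0.9], 0)); // walks 0 -> 1 -> 2 -> 3 -> 4, returns 4
```

The sparse upper layers in real HNSW exist to make the first few hops long-range, so the walk reaches the right neighborhood in O(log N) steps instead of crawling edge by edge.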
IVF (Inverted File Index)
IVF clusters vectors using k-means at build time, assigning each vector to its nearest centroid. At query time, it searches only the nearest nprobe clusters instead of the full dataset.
Build: cluster N vectors into K buckets (K = sqrt(N) is typical)
Query: find nearest nprobe buckets → search only those
Trade-off: faster build than HNSW, ~10-20% lower recall at same latency
Best for: indexes rebuilt frequently (recommendation models retrained daily)
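The bucket-then-probe idea fits in a short sketch. Centroids are hard-coded here for illustration; a real IVF index learns them with k-means at build time:

```typescript
// IVF sketch: assign vectors to their nearest centroid at build time;
// at query time, scan only the nprobe closest buckets.
const centroids = [[0, 0], [10, 0], [0, 10]]; // toy stand-in for k-means output
const data: { id: string; vec: number[] }[] = [
  { id: "a", vec: [1, 1] }, { id: "b", vec: [9, 1] },
  { id: "c", vec: [1, 9] }, { id: "d", vec: [11, -1] },
];

const l2 = (a: number[], b: number[]) =>
  Math.hypot(a[0] - b[0], a[1] - b[1]);

const nearestCentroid = (v: number[]) =>
  centroids.reduce((best, _, i) =>
    l2(centroids[i], v) < l2(centroids[best], v) ? i : best, 0);

// Build: bucket each vector under its nearest centroid.
const buckets = new Map<number, { id: string; vec: number[] }[]>();
for (const item of data) {
  const c = nearestCentroid(item.vec);
  buckets.set(c, [...(buckets.get(c) ?? []), item]);
}

// Query: rank centroids by distance, scan only the top nprobe buckets.
function ivfSearch(query: number[], nprobe: number): string {
  const probed = centroids
    .map((c, i) => ({ i, d: l2(c, query) }))
    .sort((x, y) => x.d - y.d)
    .slice(0, nprobe)
    .flatMap(({ i }) => buckets.get(i) ?? []);
  return probed.sort((x, y) => l2(x.vec, query) - l2(y.vec, query))[0].id;
}

console.log(ivfSearch([8, 0], 1)); // "b" — only the [10,0] bucket is scanned
```

Raising nprobe scans more buckets, trading latency back for recall; nprobe = K degenerates to brute force.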
Product Quantization (PQ)
PQ compresses vectors by splitting each into sub-vectors and quantizing each sub-vector to its nearest centroid in a small codebook. This reduces memory by 4-32x at a recall cost of 5-15%.
PQ is often combined with IVF (IVF-PQ) for large-scale systems where HNSW's memory requirements are prohibitive. Pinecone and Milvus both use IVF-PQ as their default index for collections over 10 million vectors.
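The compression mechanic can be shown with a toy codebook (hand-picked here; real PQ learns each sub-codebook with k-means, typically 256 entries so each code fits in one byte):

```typescript
// Product Quantization sketch: split each vector into sub-vectors and
// replace each sub-vector with the index of its nearest codebook entry.
// Here a 4-dim float vector (16 bytes as float32) becomes 2 small codes.
const codebook: number[][][] = [
  [[0, 0], [1, 1], [5, 5]], // sub-codebook for dims 0-1
  [[0, 0], [2, 2], [9, 9]], // sub-codebook for dims 2-3
];

const l2 = (a: number[], b: number[]) =>
  Math.sqrt(a.reduce((s, ai, i) => s + (ai - b[i]) ** 2, 0));

function encode(vec: number[]): number[] {
  return codebook.map((subCodes, m) => {
    const sub = vec.slice(m * 2, m * 2 + 2); // m-th sub-vector
    let best = 0;
    subCodes.forEach((c, i) => {
      if (l2(c, sub) < l2(subCodes[best], sub)) best = i;
    });
    return best;
  });
}

function decode(codes: number[]): number[] {
  return codes.flatMap((c, m) => codebook[m][c]);
}

const original = [0.9, 1.2, 8.5, 9.3];
const codes = encode(original); // [1, 2] — two small integers
const approx = decode(codes);   // [1, 1, 9, 9] — lossy reconstruction
console.log(codes, approx);
```

The recall cost comes directly from that lossy reconstruction: distances are computed against codebook entries, not the original floats.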
ScaNN (Google)
ScaNN uses anisotropic vector quantization, which preserves the angular relationships between vectors better than standard PQ. It's the algorithm behind Google's production embedding search and is available as an open-source library. Benchmark-leading recall at matched latencies, but less widely supported outside Google's ecosystem.
ANN Algorithm Comparison
| Algorithm | Query latency | Recall@10 | Memory overhead | Build time | Best for |
|---|---|---|---|---|---|
| HNSW | < 1ms | 95-99% | 1.5-2x vectors | Medium | General purpose, highest recall required |
| IVF | 1-5ms | 85-95% | 1.1x vectors | Fast | Frequently rebuilt indexes |
| IVF-PQ | 2-10ms | 80-90% | 0.1-0.3x vectors | Fast | Billions of vectors, memory-constrained |
| ScaNN | < 1ms | 95-98% | 0.5-1x vectors | Medium | Google ecosystem, high-throughput |
Recall is not accuracy
When a vector database reports "95% recall," it means the approximate search returns 95% of the vectors that a brute-force exact search would find. The other 5% are valid results that the algorithm missed due to its approximation shortcuts. For search and recommendations this is usually acceptable. For compliance or audit workloads where you must find every match, you may need exact search or very high recall settings, both of which cost significantly more in latency and compute.
Dedicated Vector DBs vs. pgvector
This is the "build vs. buy" question I get asked most often.
| Dimension | pgvector (PostgreSQL extension) | Dedicated vector DB (Pinecone, Weaviate, Qdrant) |
|---|---|---|
| Setup complexity | Add extension to existing PG | New infrastructure, new operational burden |
| ACID transactions | Full PostgreSQL transactions | Limited or none |
| Metadata filtering | Native SQL WHERE clauses, JOINs | Custom filter syntax, limited JOIN support |
| Scale ceiling | ~5-10M vectors per instance | Billions of vectors, distributed by design |
| ANN performance | Good (IVF via ivfflat index, HNSW via hnsw index) | Optimized (purpose-built indexes, GPU-accelerated) |
| Ecosystem | Any PostgreSQL client, ORMs, tooling | Vendor-specific SDKs, cloud-only for some |
My rule of thumb: if you have fewer than 5 million vectors and already run PostgreSQL, start with pgvector. You get transactional consistency, familiar SQL, and zero new infrastructure. Once you hit 10M+ vectors or need sub-5ms p99 latency at scale, evaluate dedicated vector databases.
Metadata Filtering
Vector search alone returns semantic neighbors. Real applications need to combine vector similarity with business constraints: "show me semantically similar shoes that are in-stock, under $100, in size 10."
This is trickier than it sounds. The ANN index is optimized for vector distance, not attribute filtering. Three approaches:
Pre-filter applies metadata constraints first, then runs ANN search on the filtered subset. Works well when the filter is highly selective (e.g., "shoes in size 10" cuts the dataset by 90%). Fails when the filtered subset is still millions of vectors, because you're rebuilding the search scope dynamically.
Post-filter runs ANN search on the full index and discards non-matching results. Fast vector retrieval, but if your top-10 results all fail the metadata filter, you return empty. Common with tight constraints on popular categories.
Hybrid (what most production systems use) over-fetches by a configurable factor. Request top 100 candidates, apply metadata filters, return the best 10 that pass. Tunable, predictable, and the default in Pinecone, Weaviate, and Qdrant.
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Semantic search: finds results by meaning, not keywords | Index build time: HNSW indexing millions of vectors takes minutes to hours |
| Sub-millisecond query latency with ANN at scale | Memory-intensive: HNSW stores vectors + graph in RAM (1536-dim x 1M vectors = ~6GB before graph overhead) |
| Multi-modal: same architecture works for text, images, audio, code | Recall is approximate: ANN misses 1-5% of true nearest neighbors |
| Natural fit for LLM/RAG pipelines | No ACID transactions: most vector DBs lack transactional guarantees |
| Scales to billions of vectors with sharding | Embedding model dependency: query quality is bounded by embedding quality, not the database |
| Metadata filtering combines structured and unstructured search | Operational complexity: new infrastructure to monitor, tune, and maintain |
The fundamental tension is recall vs. latency. Perfect recall (finding every true neighbor) requires brute-force search, which is too slow. Sub-millisecond latency requires approximation, which misses some results. Every tuning decision, from HNSW's M and efConstruction parameters to quantization bit-width, is a dial on this spectrum.
When to Use It / When to Avoid It
Use a vector database when:
- You need semantic search ("find things similar to X" rather than "find exact match for X")
- You're building RAG pipelines and need to retrieve relevant context for an LLM
- Recommendations require "more like this" functionality over a large content library
- You're doing cross-modal search (search images with text queries, or vice versa)
- Deduplication or plagiarism detection across millions of documents
- Anomaly detection by finding vectors far from all cluster centers
Avoid a vector database when:
- Exact lookups by ID or key will do. Use a relational or key-value store.
- Full-text keyword search with boolean queries. Use Elasticsearch or similar.
- Transactional workloads. Most vector databases lack ACID guarantees.
- Your dataset is under 100K items. Brute-force cosine similarity in-memory (even in NumPy) is fast enough that you don't need an index.
- You haven't validated that embeddings improve your task. If keyword search works for your users, a vector database adds complexity for no gain.
If you can do it with a SQL LIKE clause or Elasticsearch, you probably should. Vector databases shine precisely where those tools fail: when the user's intent and the document's language don't share vocabulary.
Real-World Examples
OpenAI / ChatGPT Plugins
OpenAI's plugin system used Pinecone as the retrieval layer for ChatGPT plugins. When a user asked a question, the system embedded the query, searched Pinecone for relevant plugin documentation chunks, and injected those chunks into the LLM context. This is the canonical RAG pattern: embed, retrieve, generate. Pinecone handles billions of vectors across its platform with p99 query latency under 50ms.
Spotify
Spotify represents every song, podcast, and user as an embedding vector. When you listen to a track, Spotify embeds it, finds the nearest neighbors in its vector space, and surfaces those as "Discover Weekly" recommendations. Their system handles over 500 million users and 100+ million tracks. The shift from collaborative filtering to embedding-based recommendations improved discovery metrics by 30%+ according to their engineering blog.
Notion AI
Notion's AI search feature embeds every page, database row, and comment in a user's workspace. When a user searches "our Q4 pricing decision," the system finds semantically relevant documents even if they're titled "Revenue Strategy Update 2025" with no mention of "pricing" or "Q4." They use a vector database alongside their existing PostgreSQL store, with metadata filtering for workspace permissions and access control.
How This Shows Up in Interviews
When to bring it up
Mention vector databases when the problem involves:
- Search that needs to understand intent, not just keywords (e-commerce, knowledge base, support tickets)
- Recommendation systems ("show similar items")
- Any LLM-powered feature (RAG, semantic caching, document Q&A)
- Content moderation (finding similar-to-known-bad content)
You don't need to go deep on ANN algorithms unless the interviewer asks. Say "we'd use a vector database with HNSW indexing for sub-millisecond similarity search" and move on to the broader architecture.
Depth expected at senior/staff level
- Understand the embedding pipeline: raw content enters, fixed-dimension float vectors leave. The model choice bounds the quality.
- Know the three similarity functions and when each applies (cosine for text, L2 for images, dot product for MIPS).
- Explain the recall vs. latency trade-off: HNSW gives high recall but uses more memory; IVF-PQ trades recall for memory efficiency at billion-vector scale.
- Articulate when pgvector is sufficient vs. when you need a dedicated vector DB.
- Describe metadata filtering strategies (pre/post/hybrid) and why naive post-filtering fails with tight constraints.
- Know quantization as a cost lever: float32 to int8 cuts memory 4x at 2-5% recall loss.
Interview shortcut: the RAG diagram
For any LLM-powered feature, sketch the RAG pipeline: user query enters, embedding model converts to vector, vector DB returns top-K context chunks, chunks get injected into the LLM prompt, LLM generates a grounded response. This five-box diagram answers "how does your AI feature avoid hallucination?" in 30 seconds.
Follow-up Q&A
| Interviewer asks | Strong answer |
|---|---|
| Why not just use Elasticsearch for semantic search? | ES does BM25 (keyword matching). It can be extended with vector search via kNN, but it's not optimized for it. Dedicated vector DBs have purpose-built ANN indexes, quantization, and distributed vector sharding. For hybrid keyword + semantic, ES kNN is a reasonable starting point. |
| How do you handle stale embeddings when content updates? | Re-embed on content change events. Batch re-embedding for bulk updates. Use a content hash to detect changes. For RAG, chunk-level re-embedding avoids re-processing entire documents. |
| What happens when your embedding model changes? | All vectors must be re-embedded with the new model. You can't mix vectors from different models in the same index. Run a migration pipeline: re-embed all content, build a new index, swap atomically. This is the main operational cost of model upgrades. |
| How do you scale a vector database horizontally? | Shard vectors by a hash of the document ID across nodes. Each shard holds a subset of the index. Queries fan out to all shards, each returns its local top-K, and a coordinator merges results. Same scatter-gather pattern as Elasticsearch. |
| What's the cost model for vector databases? | Dominated by memory (vectors + index in RAM), then compute (embedding generation), then storage (vectors on disk for persistence). Quantization (float32 to int8) cuts memory 4x. Tiered storage (hot vectors in RAM, cold on SSD) helps for infrequent queries. |
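The scatter-gather answer above has a simple coordinator-side core: each shard returns its local top-K, and the coordinator merges by score. A minimal sketch:

```typescript
// Scatter-gather merge: combine per-shard top-K lists into a global top-K.
type Match = { id: string; score: number };

function mergeShardResults(shardResults: Match[][], k: number): Match[] {
  return shardResults
    .flat()                              // gather all local results
    .sort((a, b) => b.score - a.score)   // rank globally by score
    .slice(0, k);                        // global top-K
}

const shard1: Match[] = [{ id: "a", score: 0.95 }, { id: "b", score: 0.70 }];
const shard2: Match[] = [{ id: "c", score: 0.90 }, { id: "d", score: 0.85 }];
console.log(mergeShardResults([shard1, shard2], 3).map((m) => m.id));
// ["a", "c", "d"]
```

Note that each shard must return a full K results (not K divided by shard count), because in the worst case all global top-K neighbors live on one shard.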
Test Your Understanding
Quick Recap
- Vector databases store embeddings (dense float arrays) and find nearest neighbors using distance functions, enabling search by meaning rather than keywords.
- The three similarity functions are cosine (text, direction-only), Euclidean L2 (images, spatial), and dot product (recommendation scoring). Cosine is the safe default.
- HNSW is the most common ANN index: sub-millisecond queries at 95%+ recall, at the cost of storing the full graph in memory (1.5-2x raw vector size).
- Metadata filtering is how you combine "semantically similar" with "in-stock, under $100, size 10." Hybrid over-fetch is the production default.
- pgvector is sufficient under 5-10 million vectors with existing PostgreSQL. Dedicated vector databases (Pinecone, Weaviate, Qdrant) are justified at billion-vector scale or when you need sub-5ms p99.
- Vector databases complement relational databases. They handle similarity search; PostgreSQL handles transactions, joins, access control, and exact lookups.
- Model changes require full re-embedding and atomic index swaps. You cannot mix vectors from different models in the same index.
Related Concepts
- Databases: Vector databases complement relational stores. Understanding when to use which is essential for a complete data architecture.
- Sharding: At billion-vector scale, vector databases shard indexes across nodes using the same scatter-gather pattern as distributed databases.
- Caching: Semantic caching (caching LLM responses by embedding similarity of the query) is an emerging pattern that combines caching and vector search.
- Message Queues: Embedding pipelines often use Kafka or SQS to decouple content ingestion from vector indexing, especially for high-throughput or batch re-embedding.