Design an AI chatbot with memory

TL;DR

Three memory tiers handle different recall horizons: short-term (current session turns in the context window), long-term (summarized session history in a database), and semantic (extracted user facts as embeddings for retrieval).
A 20-turn conversation consumes around 4K-8K tokens. Naively stuffing all prior sessions into the prompt blows past a 128K-token context window well before 50 sessions: at 4K tokens per session, 32 sessions already fill it.
Semantic memory retrieval via HNSW index over 10K stored facts takes under 20ms and adds roughly 500-1,000 tokens of highly relevant context per turn.
The memory extraction step after each conversation turn is the secret sauce: an LLM call extracts structured facts ("user prefers Python", "user works at a fintech startup") and upserts them to the memory store.
Conflicting memories (user said Python last month, TypeScript today) require a recency-weighted resolution strategy, not blind overwrites.

Requirements

Functional requirements

Users can send messages and receive contextually relevant responses that reference earlier turns in the same conversation.
The system remembers facts and preferences from previous sessions. If a user said "I work at a fintech startup" three weeks ago, the chatbot recalls it without re-asking.
Users can view, edit, and delete their stored memories. Memory is not a black box.
The chatbot handles multi-turn dialogue with correct coreference resolution ("it", "that API", "the bug from earlier" all resolve correctly).
The system gracefully degrades when memory retrieval fails or is slow. An answer without memory is better than no answer.

Non-functional requirements

Latency: P95 end-to-end response time under 2 seconds, including memory retrieval.
Scale: 100K concurrent users, 500 messages per second at peak.
Token budget: average prompt stays under 8K tokens per turn, which keeps input cost at or below $0.02 per message on GPT-4o at $2.50/1M input tokens (8,000 × $2.50 / 1,000,000 = $0.02). Output tokens are billed on top of that.
Memory retrieval: under 50ms P99 for semantic search over up to 100K stored facts per user.
Availability: 99.9% uptime. Memory store failures must not block the chat path.
Memory accuracy: retrieved facts should be relevant to the current query at least 85% of the time (measured by human eval on a sample).
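
The token and cost figures above are easy to sanity-check. The sketch below is only illustrative arithmetic: the function names are hypothetical, the price is the $2.50/1M input figure quoted above, and output tokens are deliberately left out.

```python
def input_cost_usd(prompt_tokens: int, usd_per_million_input: float = 2.50) -> float:
    """Input-side cost of one prompt at a flat per-million-token price."""
    return prompt_tokens * usd_per_million_input / 1_000_000


def naive_history_tokens(sessions: int, tokens_per_session: int) -> int:
    """Prompt size if every prior session were pasted into the prompt verbatim."""
    return sessions * tokens_per_session


# An 8K-token prompt at the quoted input price.
print(f"8K-token prompt: ${input_cost_usd(8_000):.4f}")

# 50 past sessions at the low end of the 4K-8K range.
print(f"50 sessions at 4K tokens each: {naive_history_tokens(50, 4_000):,} tokens")
```

Even at the low end of the per-session range, 50 sessions come to 200,000 tokens, which is why the design summarizes and retrieves instead of replaying raw history.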

The hardest engineering problem here: keeping the context window small enough to be fast and cheap while injecting enough memory to make the chatbot feel like it genuinely remembers you. This tension between token budget and recall quality drives every architectural decision.

The core entities

Conversation

conversation_id, user_id, title, created_at, last_message_at, message_count, summary

Message

message_id, conversation_id, role (user/assistant/system), content, token_count, created_at

MemoryFact

fact_id, user_id, content ("prefers Python for backend work"), category (preference/biographical/project/instruction), source_conversation_id, embedding (1536-dim float vector), confidence, created_at, updated_at, superseded_by
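
To make the role of confidence, updated_at, and superseded_by concrete, here is one possible recency-weighted resolution for two conflicting facts. It is a sketch under assumptions, not the article's prescribed method: the exponential half-life decay, the 30-day default, and the function names are all illustrative choices, and the dataclass carries only a subset of the fields listed above (the embedding and source fields are omitted for brevity).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class MemoryFact:
    fact_id: str
    user_id: str
    content: str
    category: str
    confidence: float
    updated_at: datetime
    superseded_by: Optional[str] = None


def recency_weighted_score(fact: MemoryFact, now: datetime,
                           half_life_days: float = 30.0) -> float:
    """Confidence decayed by age: the score halves every half_life_days (assumed policy)."""
    age_days = max((now - fact.updated_at).total_seconds() / 86_400, 0.0)
    return fact.confidence * 0.5 ** (age_days / half_life_days)


def resolve_conflict(facts: List[MemoryFact], now: datetime,
                     half_life_days: float = 30.0) -> MemoryFact:
    """Pick the highest-scoring fact and mark the others as superseded, not deleted."""
    winner = max(facts, key=lambda f: recency_weighted_score(f, now, half_life_days))
    for fact in facts:
        if fact is not winner:
            fact.superseded_by = winner.fact_id
    return winner


now = datetime(2026, 4, 11, tzinfo=timezone.utc)
python_fact = MemoryFact("f_1", "u_123", "Prefers Python for backend",
                         "preference", 0.92, now - timedelta(days=30))
ts_fact = MemoryFact("f_2", "u_123", "Prefers TypeScript for backend",
                     "preference", 0.70, now)
winner = resolve_conflict([python_fact, ts_fact], now)
print(winner.content, "| superseded:", python_fact.superseded_by)
```

Keeping the losing fact with a superseded_by pointer, instead of overwriting it, preserves an audit trail for the memory view and lets a later turn reverse the decision.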

SessionSummary

summary_id, conversation_id, user_id, summary_text, key_topics, created_at, token_count

EmbeddingIndex

index_id, user_id, vector_count, index_type (HNSW), last_rebuilt_at

API design

POST /api/chat - Send a message and get a response (streaming)

This is the core endpoint. It handles message receipt, memory retrieval, context assembly, and LLM inference in a single streaming response.

Request:  { "conversation_id": "conv_abc", "message": "Help me set up a FastAPI server" }
Response (SSE stream):
  data: { "type": "chunk", "content": "Since you prefer Python and " }
  data: { "type": "chunk", "content": "work at a fintech startup, " }
  data: { "type": "chunk", "content": "here's a FastAPI setup..." }
  data: { "type": "done", "message_id": "msg_456", "tokens_used": 1847 }
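
The stream above can be produced by a small generator that wraps each LLM chunk in a server-sent event. This is a framework-agnostic sketch: the function name sse_events is hypothetical, and it assumes the caller already knows the message ID and token count when the stream ends.

```python
import json
from typing import Iterable, Iterator


def sse_events(chunks: Iterable[str], message_id: str, tokens_used: int) -> Iterator[str]:
    """Yield one SSE frame per chunk, then a final 'done' frame.

    Each frame is a 'data:' line followed by a blank line, which is how
    server-sent events delimit messages.
    """
    for text in chunks:
        yield "data: " + json.dumps({"type": "chunk", "content": text}) + "\n\n"
    done = {"type": "done", "message_id": message_id, "tokens_used": tokens_used}
    yield "data: " + json.dumps(done) + "\n\n"


for frame in sse_events(["Since you prefer Python and ", "work at a fintech startup, "],
                        message_id="msg_456", tokens_used=1847):
    print(frame, end="")
```

A web framework's streaming response type can consume this generator directly, served with the text/event-stream content type.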

POST /api/conversations - Start a new conversation

Request:  { "user_id": "u_123" }
Response: { "conversation_id": "conv_abc", "created_at": "2026-04-11T10:00:00Z" }

GET /api/conversations/:id/messages - Retrieve conversation history

Response: { "messages": [{ "message_id": "msg_1", "role": "user", "content": "...", "created_at": "..." }, ...], "has_more": true, "cursor": "msg_50" }
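
The has_more and cursor fields suggest cursor-based pagination. The sketch below assumes the cursor is the message_id of the last message returned and that the next page starts immediately after it; the article does not spell out these semantics, and page_messages is a hypothetical helper working on an in-memory list.

```python
from typing import Dict, List, Optional


def page_messages(messages: List[Dict], cursor: Optional[str] = None,
                  limit: int = 50) -> Dict:
    """Return one page of messages in stored order, starting after `cursor`."""
    start = 0
    if cursor is not None:
        ids = [m["message_id"] for m in messages]
        start = ids.index(cursor) + 1  # raises ValueError for an unknown cursor
    page = messages[start:start + limit]
    return {
        "messages": page,
        "has_more": start + limit < len(messages),
        "cursor": page[-1]["message_id"] if page else None,
    }


history = [{"message_id": f"msg_{i}", "role": "user", "content": "..."}
           for i in range(1, 121)]
first = page_messages(history)
print(first["cursor"], first["has_more"])
```

In a real store the lookup would be an indexed query on (conversation_id, created_at) or the message ID, not a list scan.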

GET /api/users/:id/memories - View stored memory facts

Response: { "memories": [{ "fact_id": "f_1", "content": "Prefers Python for backend", "category": "preference", "confidence": 0.92, "updated_at": "..." }, ...] }

DELETE /api/users/:id/memories/:fact_id - Delete a specific memory

Response: { "deleted": true, "fact_id": "f_1" }

PUT /api/users/:id/memories/:fact_id - Edit a memory fact

Request:  { "content": "Prefers TypeScript for backend" }
Response: { "fact_id": "f_1", "content": "Prefers TypeScript for backend", "updated_at": "..." }

High-level design

The system has two pipelines: the online path (user sends message, gets a response) and the offline path (memory extraction and summarization after each turn). The online path must be fast. The offline path can be async.
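
The offline path can be sketched as an extract-then-upsert step. Everything named here is an assumption for illustration: keyword_extractor is a trivial stand-in for the LLM extraction call, InMemoryFactStore stands in for the memory store, and deduplicating on normalized content is just one possible upsert key.

```python
from typing import Callable, Dict, List, Tuple

Fact = Dict[str, object]
Extractor = Callable[[str, str], List[Fact]]


class InMemoryFactStore:
    """Toy fact store keyed by (user_id, normalized content)."""

    def __init__(self) -> None:
        self._facts: Dict[Tuple[str, str], Fact] = {}

    def upsert(self, user_id: str, fact: Fact) -> bool:
        """Insert or update a fact. Returns True if the fact was new."""
        key = (user_id, str(fact["content"]).strip().lower())
        is_new = key not in self._facts
        self._facts[key] = {**self._facts.get(key, {}), **fact}
        return is_new

    def facts_for(self, user_id: str) -> List[Fact]:
        return [f for (uid, _), f in self._facts.items() if uid == user_id]


def keyword_extractor(user_message: str, assistant_message: str) -> List[Fact]:
    """Stand-in for the LLM extraction call; a real system would prompt a model."""
    facts: List[Fact] = []
    if "python" in user_message.lower():
        facts.append({"content": "prefers Python", "category": "preference",
                      "confidence": 0.9})
    return facts


def process_turn(user_id: str, user_message: str, assistant_message: str,
                 extract: Extractor, store: InMemoryFactStore) -> int:
    """Run after the response has been sent. Returns the number of new facts stored."""
    return sum(store.upsert(user_id, fact)
               for fact in extract(user_message, assistant_message))


store = InMemoryFactStore()
added = process_turn("u_123", "I mostly write Python", "Noted.", keyword_extractor, store)
print(added, store.facts_for("u_123"))
```

Because this step runs off the request path, it can be retried or batched without affecting chat latency.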

When a user sends a message, the chat service loads the recent conversation history (short-term memory), queries the semantic memory store for relevant past facts, assembles a context window under the token budget, calls the LLM, and streams the response back. After the response is sent, an async worker extracts new facts from the conversation turn and upserts them into the memory store.
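
The context-assembly step might look like the following. This is a minimal sketch under stated assumptions: estimate_tokens is a crude four-characters-per-token heuristic (a real system would use the model's tokenizer), memory facts are assumed to arrive sorted by relevance, and the 1,000-token memory cap mirrors the 500-1,000 token figure from the TL;DR.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token. Not a real tokenizer."""
    return max(1, len(text) // 4)


def assemble_context(system_prompt: str, memory_facts: List[str],
                     history: List[Message], user_message: str,
                     budget: int = 8_000,
                     memory_budget: int = 1_000) -> Tuple[List[Message], int]:
    """Build the prompt: system + memory facts, newest history that fits, new message."""
    used = estimate_tokens(system_prompt) + estimate_tokens(user_message)

    # 1. Memory facts, most relevant first, capped by their own sub-budget.
    kept_facts: List[str] = []
    memory_used = 0
    for fact in memory_facts:
        cost = estimate_tokens(fact)
        if memory_used + cost > memory_budget or used + cost > budget:
            break
        kept_facts.append(fact)
        memory_used += cost
        used += cost

    # 2. Short-term history, walking backwards so the newest turns survive.
    kept_history: List[Message] = []
    for msg in reversed(history):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept_history.append(msg)
        used += cost
    kept_history.reverse()  # restore chronological order

    system_content = system_prompt
    if kept_facts:
        system_content += "\n\nKnown facts about the user:\n" + "\n".join(
            f"- {fact}" for fact in kept_facts)

    messages = [{"role": "system", "content": system_content}]
    messages += kept_history
    messages.append({"role": "user", "content": user_message})
    return messages, used
```

Dropping the oldest turns first is the simplest policy. Replacing them with the stored conversation summary is a natural refinement once the history no longer fits.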

I have seen candidates design the memory retrieval as a synchronous blocking call in the main inference path. That works, but only if your vector search is genuinely fast (under 50ms). If it is not, put it behind a timeout with graceful degradation: if memory retrieval takes longer than 100ms, skip it and respond without long-term context. A slightly less personalized answer is always better than a 5-second hang.
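
The timeout-with-degradation idea maps directly onto asyncio.wait_for. In this sketch, retrieve_with_timeout and the retrieve callable are hypothetical names, and the 100ms default comes from the threshold mentioned above.

```python
import asyncio
from typing import Awaitable, Callable, List


async def retrieve_with_timeout(retrieve: Callable[[str], Awaitable[List[str]]],
                                query: str, timeout_s: float = 0.1) -> List[str]:
    """Return retrieved facts, or an empty list if the store is slow or failing."""
    try:
        return await asyncio.wait_for(retrieve(query), timeout=timeout_s)
    except asyncio.TimeoutError:
        return []  # too slow: answer without long-term context
    except Exception:
        return []  # store error: memory failures must not block the chat path


async def slow_store(query: str) -> List[str]:
    await asyncio.sleep(0.5)  # simulates a degraded vector store
    return ["prefers Python"]


print(asyncio.run(retrieve_with_timeout(slow_store, "FastAPI setup")))
```

In production both fallback branches should also emit a metric, so a silently degrading memory store shows up on a dashboard and not only as less personalized answers.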

The three memory tiers map cleanly to different storage systems. Short-term memory (last 10-20 messages) lives in the conversation store and is loaded directly. Long-term memory (session summaries) lives in a relational database. Semantic memory (extracted facts as embeddings) lives in a vector database with HNSW indexing.
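
To show what the semantic tier computes, here is an exact brute-force top-k search by cosine similarity. It is deliberately not HNSW: an HNSW index answers the same nearest-neighbor query approximately and much faster at scale. The toy 3-dimensional vectors and the function names are assumptions for illustration; real embeddings would be the 1536-dimensional vectors from the MemoryFact entity.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float],
          facts: List[Tuple[str, Sequence[float]]],
          k: int = 5) -> List[Tuple[float, str]]:
    """Score every stored fact against the query and return the k best as (score, content)."""
    scored = [(cosine(query_vec, vec), content) for content, vec in facts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


facts = [
    ("prefers Python", [1.0, 0.0, 0.0]),
    ("works at a fintech startup", [0.0, 1.0, 0.0]),
    ("has used FastAPI before", [0.9, 0.1, 0.0]),
]
for score, content in top_k([1.0, 0.0, 0.0], facts, k=2):
    print(f"{score:.3f}  {content}")
```

A linear scan like this is a useful correctness baseline when evaluating the recall of an approximate index.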

Here is the turn-by-turn processing pipeline animated step by step. This is the online path that executes on every user message.
