Design an AI chatbot with memory
Walk through designing a multi-turn chatbot that remembers context across sessions using short-term, long-term, and semantic memory tiers while staying within token budgets.
TL;DR
- Three memory tiers handle different recall horizons: short-term (current session turns in the context window), long-term (summarized session history in a database), and semantic (extracted user facts as embeddings for retrieval).
- A 20-turn conversation consumes around 4K-8K tokens. At that rate, naively stuffing all prior sessions into the prompt exceeds a 128K-token context window after roughly 16-32 sessions, and any user with more than 50 sessions is far past it.
- Semantic memory retrieval via an HNSW index over 10K stored facts takes under 20ms and adds roughly 500-1,000 tokens of highly relevant context per turn.
- The memory extraction step after each conversation turn is the secret sauce: an LLM call extracts structured facts ("user prefers Python", "user works at a fintech startup") and upserts them to the memory store.
- Conflicting memories (user said Python last month, TypeScript today) require a recency-weighted resolution strategy, not blind overwrites.
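The recency-weighted resolution in the last bullet could be sketched as follows. This is a minimal illustration, not the article's implementation: the 30-day half-life, the `Fact` class, and the `recency_weight` and `resolve_conflict` helpers are all assumptions introduced here. The idea is that a fact's confidence decays with age, and the losing fact is marked as superseded rather than overwritten.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

HALF_LIFE_DAYS = 30.0  # assumption: a fact's weight halves every 30 days


@dataclass
class Fact:
    fact_id: str
    content: str
    confidence: float              # extractor's confidence in [0, 1]
    updated_at: datetime
    superseded_by: Optional[str] = None


def recency_weight(fact: Fact, now: datetime) -> float:
    """Confidence decayed exponentially by the fact's age."""
    age_days = (now - fact.updated_at).total_seconds() / 86400
    return fact.confidence * 0.5 ** (age_days / HALF_LIFE_DAYS)


def resolve_conflict(old: Fact, new: Fact, now: datetime) -> Fact:
    """Pick the higher-weighted fact and mark the other as superseded (not deleted)."""
    if recency_weight(new, now) >= recency_weight(old, now):
        winner, loser = new, old
    else:
        winner, loser = old, new
    loser.superseded_by = winner.fact_id
    return winner


now = datetime(2026, 4, 11, tzinfo=timezone.utc)
python_fact = Fact("f_1", "prefers Python", 0.92, now - timedelta(days=30))
ts_fact = Fact("f_2", "prefers TypeScript", 0.80, now)
winner = resolve_conflict(python_fact, ts_fact, now)
print(winner.content, "| superseded:", python_fact.superseded_by)
```

Keeping the superseded fact around, instead of overwriting it, leaves an audit trail and lets a user restore an older preference from the memory management view.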
Requirements
Functional requirements
- Users can send messages and receive contextually relevant responses that reference earlier turns in the same conversation.
- The system remembers facts and preferences from previous sessions. If a user said "I work at a fintech startup" three weeks ago, the chatbot recalls it without re-asking.
- Users can view, edit, and delete their stored memories. Memory is not a black box.
- The chatbot handles multi-turn dialogue with coreference resolution: "it", "that API", and "the bug from earlier" all resolve to the right referent.
- The system gracefully degrades when memory retrieval fails or is slow. An answer without memory is better than no answer.
Non-functional requirements
- Latency: P95 end-to-end response time under 2 seconds, including memory retrieval.
- Scale: 100K concurrent users, 500 messages per second at peak.
- Token budget: average prompt stays under 8K tokens per turn, which keeps input cost at or under $0.02 per message on GPT-4o at $2.50/1M input tokens (8K × $2.50/1M = $0.02; output tokens are billed separately).
- Memory retrieval: under 50ms P99 for semantic search over up to 100K stored facts per user.
- Availability: 99.9% uptime. Memory store failures must not block the chat path.
- Memory accuracy: retrieved facts should be relevant to the current query at least 85% of the time (measured by human eval on a sample).
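The token-budget arithmetic can be checked with a few lines. This is a back-of-the-envelope sketch that uses only the numbers stated in the requirements; it counts input tokens only, and the `input_cost` helper is a hypothetical name introduced for illustration.

```python
INPUT_PRICE_PER_MILLION = 2.50    # USD per 1M input tokens, from the token budget requirement
PROMPT_TOKENS = 8_000             # per-turn prompt budget
PEAK_MESSAGES_PER_SECOND = 500    # peak load from the scale requirement


def input_cost(tokens: int, price_per_million: float = INPUT_PRICE_PER_MILLION) -> float:
    """Input-side cost in USD for one prompt; output tokens are not included."""
    return tokens * price_per_million / 1_000_000


per_message = input_cost(PROMPT_TOKENS)
per_hour_at_peak = per_message * PEAK_MESSAGES_PER_SECOND * 3600
print(f"${per_message:.4f} per message, ${per_hour_at_peak:,.0f} per hour at sustained peak")
```

At a sustained 500 messages per second, the input side alone comes to roughly $36,000 per hour, which is why every extra 1K tokens of injected memory has to earn its place in the prompt.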
The hardest engineering problem here: keeping the context window small enough to be fast and cheap while injecting enough memory to make the chatbot feel like it genuinely remembers you. This tension between token budget and recall quality drives every architectural decision.
The core entities
Conversation
conversation_id, user_id, title, created_at, last_message_at, message_count, summary
Message
message_id, conversation_id, role (user/assistant/system), content, token_count, created_at
MemoryFact
fact_id, user_id, content ("prefers Python for backend work"), category (preference/biographical/project/instruction), source_conversation_id, embedding (1536-dim float vector), confidence, created_at, updated_at, superseded_by
SessionSummary
summary_id, conversation_id, user_id, summary_text, key_topics, created_at, token_count
EmbeddingIndex
index_id, user_id, vector_count, index_type (HNSW), last_rebuilt_at
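As one way to make these entities concrete, two of them could be written as Python dataclasses. The field names follow the lists above; the enum classes, the `_utcnow` default, and the `is_active` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Role(str, Enum):
    USER = "user"
    ASSISTANT = "assistant"
    SYSTEM = "system"


class FactCategory(str, Enum):
    PREFERENCE = "preference"
    BIOGRAPHICAL = "biographical"
    PROJECT = "project"
    INSTRUCTION = "instruction"


def _utcnow() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Message:
    message_id: str
    conversation_id: str
    role: Role
    content: str
    token_count: int
    created_at: datetime = field(default_factory=_utcnow)


@dataclass
class MemoryFact:
    fact_id: str
    user_id: str
    content: str
    category: FactCategory
    source_conversation_id: str
    embedding: List[float]            # 1536 floats in the design above
    confidence: float
    created_at: datetime = field(default_factory=_utcnow)
    updated_at: datetime = field(default_factory=_utcnow)
    superseded_by: Optional[str] = None

    def is_active(self) -> bool:
        """Hypothetical helper: a fact is retrievable until something supersedes it."""
        return self.superseded_by is None


fact = MemoryFact(
    fact_id="f_1",
    user_id="u_123",
    content="prefers Python for backend work",
    category=FactCategory.PREFERENCE,
    source_conversation_id="conv_abc",
    embedding=[0.0] * 1536,
    confidence=0.92,
)
print(fact.category.value, fact.is_active())
```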
API design
POST /api/chat - Send a message and get a response (streaming)
This is the core endpoint. It handles message receipt, memory retrieval, context assembly, and LLM inference in a single streaming response.
Request: { "conversation_id": "conv_abc", "message": "Help me set up a FastAPI server" }
Response (SSE stream):
data: { "type": "chunk", "content": "Since you prefer Python and " }
data: { "type": "chunk", "content": "work at a fintech startup, " }
data: { "type": "chunk", "content": "here's a FastAPI setup..." }
data: { "type": "done", "message_id": "msg_456", "tokens_used": 1847 }
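The event stream above can be produced and consumed with plain string handling. The sketch below assumes the standard SSE framing of a `data: ` prefix followed by a blank line; `sse_events` and `parse_sse` are hypothetical helper names, and in a real service the generator would be handed to the web framework's streaming response.

```python
import json
from typing import Iterable, Iterator


def sse_events(chunks: Iterable[str], message_id: str, tokens_used: int) -> Iterator[str]:
    """Server side: wrap each LLM output chunk as an SSE event, then send a final 'done' event."""
    for chunk in chunks:
        yield "data: " + json.dumps({"type": "chunk", "content": chunk}) + "\n\n"
    done = {"type": "done", "message_id": message_id, "tokens_used": tokens_used}
    yield "data: " + json.dumps(done) + "\n\n"


def parse_sse(stream: Iterable[str]) -> str:
    """Client side: rebuild the full reply from the chunk events."""
    parts = []
    for event in stream:
        for line in event.splitlines():
            if line.startswith("data: "):
                payload = json.loads(line[len("data: "):])
                if payload["type"] == "chunk":
                    parts.append(payload["content"])
    return "".join(parts)


llm_chunks = ["Since you prefer Python and ", "work at a fintech startup, ", "here's a FastAPI setup..."]
events = list(sse_events(llm_chunks, "msg_456", 1847))
reply = parse_sse(events)
print(reply)
```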
POST /api/conversations - Start a new conversation
Request: { "user_id": "u_123" }
Response: { "conversation_id": "conv_abc", "created_at": "2026-04-11T10:00:00Z" }
GET /api/conversations/:id/messages - Retrieve conversation history
Response: { "messages": [{ "message_id": "msg_1", "role": "user", "content": "...", "created_at": "..." }, ...], "has_more": true, "cursor": "msg_50" }
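The `has_more` and `cursor` fields suggest cursor-based pagination. A minimal sketch follows, assuming the cursor is the `message_id` of the last item on the previous page and a page size of 50; neither is stated in the response shape above, and `page_messages` is a hypothetical helper operating on an in-memory list rather than a database.

```python
from typing import Dict, List, Optional


def page_messages(messages: List[Dict], cursor: Optional[str] = None, limit: int = 50) -> Dict:
    """Return one page of messages (oldest first), starting after the cursor."""
    start = 0
    if cursor is not None:
        ids = [m["message_id"] for m in messages]
        start = ids.index(cursor) + 1   # raises ValueError for an unknown cursor
    page = messages[start:start + limit]
    return {
        "messages": page,
        "has_more": start + limit < len(messages),
        "cursor": page[-1]["message_id"] if page else None,
    }


history = [{"message_id": f"msg_{i}", "role": "user", "content": "..."} for i in range(1, 121)]
first = page_messages(history)
second = page_messages(history, cursor=first["cursor"])
print(first["cursor"], first["has_more"], second["cursor"])
```

In a database-backed version, the same contract would typically map to a keyset query on `created_at` or `message_id` rather than a list scan.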
GET /api/users/:id/memories - View stored memory facts
Response: { "memories": [{ "fact_id": "f_1", "content": "Prefers Python for backend", "category": "preference", "confidence": 0.92, "updated_at": "..." }, ...] }
DELETE /api/users/:id/memories/:fact_id - Delete a specific memory
Response: { "deleted": true, "fact_id": "f_1" }
PUT /api/users/:id/memories/:fact_id - Edit a memory fact
Request: { "content": "Prefers TypeScript for backend" }
Response: { "fact_id": "f_1", "content": "Prefers TypeScript for backend", "updated_at": "..." }
High-level design
The system has two pipelines: the online path (user sends message, gets a response) and the offline path (memory extraction and summarization after each turn). The online path must be fast. The offline path can be async.
When a user sends a message, the chat service loads the recent conversation history (short-term memory), queries the semantic memory store for relevant past facts, assembles a context window under the token budget, calls the LLM, and streams the response back. After the response is sent, an async worker extracts new facts from the conversation turn and upserts them into the memory store.
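The context assembly step might look like the following sketch. It assumes a crude 4-characters-per-token estimate in place of a real tokenizer and a 1,000-token cap on injected facts; `estimate_tokens`, `assemble_context`, and the cap are illustrative choices, not the article's implementation. The system prompt and the new user message are always kept, facts are added in relevance order, and older history is dropped first.

```python
from typing import Dict, List

TOKEN_BUDGET = 8_000      # per-turn prompt budget from the requirements
FACT_TOKEN_CAP = 1_000    # assumed cap on injected memory


def estimate_tokens(text: str) -> int:
    """Crude heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def assemble_context(
    system_prompt: str,
    facts: List[str],
    recent_messages: List[Dict[str, str]],
    user_message: str,
    budget: int = TOKEN_BUDGET,
) -> List[Dict[str, str]]:
    # Reserve room for the parts that can never be dropped.
    remaining = budget - estimate_tokens(system_prompt) - estimate_tokens(user_message)

    # Add retrieved facts in relevance order until the memory cap is hit.
    kept_facts: List[str] = []
    fact_budget = min(FACT_TOKEN_CAP, remaining)
    for fact in facts:
        cost = estimate_tokens(fact)
        if cost > fact_budget:
            break
        kept_facts.append(fact)
        fact_budget -= cost
        remaining -= cost

    # Fill what is left with history, newest first, then restore chronological order.
    kept_history: List[Dict[str, str]] = []
    for msg in reversed(recent_messages):
        cost = estimate_tokens(msg["content"])
        if cost > remaining:
            break
        kept_history.append(msg)
        remaining -= cost
    kept_history.reverse()

    # The short header and bullet prefixes are not counted, so the estimate runs slightly low.
    system_content = system_prompt
    if kept_facts:
        system_content += "\nKnown facts about the user:\n" + "\n".join(f"- {f}" for f in kept_facts)
    return (
        [{"role": "system", "content": system_content}]
        + kept_history
        + [{"role": "user", "content": user_message}]
    )


context = assemble_context(
    system_prompt="You are a helpful assistant.",
    facts=["prefers Python for backend work", "works at a fintech startup"],
    recent_messages=[
        {"role": "user", "content": "x" * 400},
        {"role": "assistant", "content": "y" * 400},
        {"role": "user", "content": "z" * 40},
    ],
    user_message="Help me set up a FastAPI server",
    budget=150,  # deliberately tiny so the oldest message gets dropped
)
print([m["role"] for m in context])
```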
I have seen candidates design the memory retrieval as a synchronous blocking call in the main inference path. That works, but only if your vector search is genuinely fast (under 50ms). If it is not, put it behind a timeout with graceful degradation: if memory retrieval takes longer than 100ms, skip it and respond without long-term context. A slightly less personalized answer is always better than a 5-second hang.
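The timeout-and-degrade pattern can be sketched with `asyncio.wait_for`. Here `search_memory` is a hypothetical stand-in for the vector store call, with a `delay` parameter that simulates latency; only the 100ms cutoff comes from the text above.

```python
import asyncio
from typing import List

MEMORY_TIMEOUT_SECONDS = 0.1  # the 100ms cutoff described above


async def search_memory(user_id: str, query: str, delay: float) -> List[str]:
    """Hypothetical stand-in for the vector store call; `delay` simulates its latency."""
    await asyncio.sleep(delay)
    return ["prefers Python for backend work"]


async def retrieve_with_fallback(user_id: str, query: str, delay: float) -> List[str]:
    """Return retrieved facts, or an empty list if the store is slow or failing."""
    try:
        return await asyncio.wait_for(
            search_memory(user_id, query, delay), timeout=MEMORY_TIMEOUT_SECONDS
        )
    except asyncio.TimeoutError:
        return []   # too slow: answer without long-term context
    except Exception:
        return []   # store outage must not block the chat path


fast = asyncio.run(retrieve_with_fallback("u_123", "FastAPI setup", delay=0.0))
slow = asyncio.run(retrieve_with_fallback("u_123", "FastAPI setup", delay=0.5))
print(fast, slow)
```

`wait_for` cancels the pending search when the timeout fires, so a slow query does not keep consuming resources after the response has moved on.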
The three memory tiers map cleanly to different storage systems. Short-term memory (last 10-20 messages) lives in the conversation store and is loaded directly. Long-term memory (session summaries) lives in a relational database. Semantic memory (extracted facts as embeddings) lives in a vector database with HNSW indexing.
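To show what the semantic tier returns, here is a brute-force cosine similarity search in pure Python. This is explicitly not HNSW: it scans every fact, which is fine for a toy example but is the cost that an HNSW index avoids at scale. The 3-dimensional vectors and the `top_k_facts` helper are illustrative assumptions; real embeddings in this design are 1536-dimensional.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_facts(
    query_vec: Sequence[float],
    facts: List[Tuple[str, Sequence[float]]],
    k: int = 3,
    min_score: float = 0.0,
) -> List[Tuple[float, str]]:
    """Exhaustive scan: score every fact, drop weak matches, return the k best."""
    scored = [(cosine(query_vec, vec), text) for text, vec in facts]
    scored = [item for item in scored if item[0] >= min_score]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy 3-dim embeddings; the third fact is made up purely as a distractor.
stored = [
    ("prefers Python for backend work", [1.0, 0.0, 0.0]),
    ("works at a fintech startup", [0.0, 1.0, 0.0]),
    ("unrelated distractor fact", [0.0, 0.0, 1.0]),
]
results = top_k_facts([0.9, 0.1, 0.0], stored, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```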
[Animation: the turn-by-turn processing pipeline, shown step by step. This is the online path that executes on every user message.]