Design a prompt management platform
Walk through designing a prompt management system with version control, A/B testing, automated evaluation, and rollback, serving 500 prompts across 20 product teams.
TL;DR
- The core architectural insight is decoupling prompts from application code. Prompts live in a versioned registry, and applications resolve them at runtime via a sub-5ms API call through a Redis cache.
- Use a git-like versioning model with "staging" and "production" pointers per prompt. Every prompt change goes through an automated eval gate (200 test cases in under 2 minutes) before promotion, so bad prompts never reach users.
- A/B testing with Thompson Sampling (multi-armed bandit) converges to the winning variant in 2-3 days instead of the 14 days required by fixed traffic splits, reducing exposure to underperforming prompts by 80%.
- Prompt-model coupling is the sneaky production killer. A prompt optimized for GPT-4o may score 15-20% lower on Claude 3.5 Sonnet. Bind every prompt version to a specific model identifier.
- The production lesson that separates juniors from seniors: prompt management is not a developer tool problem, it is a release engineering problem. Treat prompt changes like code deploys with staging, evaluation, canary rollout, and instant rollback.
Requirements
Functional requirements
- Teams can create, edit, and version prompts through a web UI or API, with each edit producing an immutable version.
- The system resolves a prompt_id to the correct prompt text and parameters (model, temperature, max_tokens) at runtime, supporting per-environment overrides (staging vs production).
- Teams can run A/B tests between prompt versions with configurable traffic splits and automatic metric collection.
- Every prompt change triggers an automated eval suite against a test set before the prompt can be promoted to production.
- Teams can instantly roll back any prompt to a previous version without a code deploy.
- The system tracks per-version metrics (latency, cost, user satisfaction, eval scores) and surfaces them in a dashboard.
Non-functional requirements
- Prompt resolution P99 latency under 5ms (cached) and under 50ms (cold).
- Support 500 active prompts across 20 product teams with 10,000 resolution requests per second at peak.
- Eval suite execution completes 200 test cases in under 2 minutes.
- Rollback takes effect within 5 seconds across all application instances.
- 99.99% availability for the resolution API (this is on the critical path for every LLM call in the company).
- Prompt version history retained for 12 months for audit and compliance.
The hardest engineering problem here: A/B testing prompts is fundamentally different from A/B testing UI changes. Prompt outputs are non-deterministic, evaluation is subjective, and the feedback signal (is this response "good"?) arrives asynchronously or not at all. You need both automated eval metrics and human judgment, and you need a traffic allocation strategy that minimizes exposure to bad variants.
The core entities
Prompt
prompt_id, name, team_id, description, created_at, updated_at, production_version_id, staging_version_id
PromptVersion
version_id, prompt_id, version_number, template_text, model_id (e.g., gpt-4o, claude-3.5-sonnet), temperature, max_tokens, system_prompt, few_shot_examples[], created_by, created_at, eval_status (pending/passed/failed), eval_score
EvalRun
eval_run_id, version_id, test_set_id, total_cases, passed_cases, avg_score, p50_latency_ms, total_cost_usd, started_at, completed_at, gate_result (pass/fail/manual_review)
Experiment
experiment_id, prompt_id, variants[] (each with version_id, traffic_weight), allocation_strategy (fixed_split/thompson_sampling), start_date, end_date, status (active/completed/stopped), winner_version_id
MetricEvent
event_id, version_id, experiment_id, request_id, latency_ms, cost_usd, user_rating (1-5, optional), eval_score, timestamp
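One way to picture the split between immutable versions and mutable pointers is the sketch below. It models only a subset of the fields listed above, and the default values are placeholders I chose for illustration, not part of the design. The point is that a PromptVersion never changes once written, while a Prompt only moves its staging and production pointers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    """Immutable snapshot: an edit creates a new version, never mutates one."""
    version_id: str
    prompt_id: str
    version_number: int
    template_text: str
    model_id: str                # the version is bound to one model identifier
    temperature: float = 0.0     # placeholder default (assumption)
    max_tokens: int = 256        # placeholder default (assumption)


@dataclass
class Prompt:
    """Mutable record that only holds the per-environment pointers."""
    prompt_id: str
    name: str
    team_id: str
    production_version_id: Optional[str] = None
    staging_version_id: Optional[str] = None
```

Promotion and rollback then become pointer updates on Prompt, which is what makes them cheap and instant.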
API design
POST /api/prompts - create a new prompt
Request: {
"name": "customer-support-classifier",
"team_id": "team_support",
"description": "Classifies inbound tickets into categories"
}
Response: {
"prompt_id": "prm_abc123",
"name": "customer-support-classifier",
"created_at": "2026-04-11T10:00:00Z"
}
POST /api/prompts/:prompt_id/versions - create a new version of an existing prompt
Request: {
"template_text": "Classify the following customer message into one of: {{categories}}.\n\nMessage: {{message}}\n\nCategory:",
"model_id": "gpt-4o-mini",
"temperature": 0.0,
"max_tokens": 50,
"system_prompt": "You are a precise classification assistant.",
"few_shot_examples": [
{ "input": "My order hasn't arrived", "output": "shipping" },
{ "input": "I want a refund", "output": "billing" }
]
}
Response: {
"version_id": "ver_def456",
"version_number": 12,
"eval_status": "pending"
}
GET /api/resolve/:prompt_id - runtime resolution endpoint (the hot path)
Query params: ?env=production&variables={"categories":"shipping,billing,technical","message":"My order is late"}
Response: {
"version_id": "ver_def456",
"resolved_text": "Classify the following customer message into one of: shipping,billing,technical.\n\nMessage: My order is late\n\nCategory:",
"model_id": "gpt-4o-mini",
"temperature": 0.0,
"max_tokens": 50,
"system_prompt": "You are a precise classification assistant.",
"experiment_id": "exp_789",
"variant": "B"
}
This is the most critical endpoint. Every LLM call in the company hits this first. It must be fast (sub-5ms from Redis), reliable (99.99%), and experiment-aware (return the right variant for A/B tests).
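The variable substitution behind resolved_text can be sketched in a few lines. This assumes the {{name}} placeholder syntax shown in the request example; the function name render_template and the choice to fail loudly on a missing variable are my assumptions, not something the API above specifies.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template_text: str, variables: dict) -> str:
    """Substitute {{name}} placeholders; raise if a variable is missing."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    # A function replacement avoids backslash-escape surprises in user text.
    return _PLACEHOLDER.sub(substitute, template_text)
```

Failing on a missing variable is a deliberate choice in this sketch: sending a prompt with a literal {{message}} in it to the model is usually worse than returning an error to the caller.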
POST /api/prompts/:prompt_id/experiments - start an A/B test
Request: {
"variants": [
{ "version_id": "ver_def456", "traffic_weight": 0.5 },
{ "version_id": "ver_ghi789", "traffic_weight": 0.5 }
],
"allocation_strategy": "thompson_sampling",
"primary_metric": "user_rating",
"min_samples": 500
}
Response: {
"experiment_id": "exp_xyz",
"status": "active"
}
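To make the thompson_sampling strategy concrete, here is a minimal Beta-Bernoulli sketch. It assumes each response is reduced to a binary success signal (for example, a user_rating of 4 or 5 counts as a success); that reduction, the class name, and the uniform Beta(1, 1) prior are all assumptions for illustration rather than the platform's actual implementation.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling over prompt versions (a sketch)."""

    def __init__(self, version_ids, seed=None):
        self._rng = random.Random(seed)
        # Beta(1, 1) prior per variant, stored as [alpha, beta].
        self._params = {v: [1, 1] for v in version_ids}

    def choose(self):
        """Sample a plausible success rate per variant and pick the highest."""
        draws = {
            v: self._rng.betavariate(alpha, beta)
            for v, (alpha, beta) in self._params.items()
        }
        return max(draws, key=draws.get)

    def record(self, version_id, success):
        """Update the posterior: alpha counts successes, beta counts failures."""
        self._params[version_id][0 if success else 1] += 1
```

A caller would invoke choose() to pick the variant for a request and record() once feedback arrives. Because feedback is asynchronous, updates lag behind traffic; the sampler tolerates this, but it slows convergence.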
POST /api/prompts/:prompt_id/rollback - roll back to a specific version
Request: {
"target_version_id": "ver_abc123",
"environment": "production"
}
Response: {
"prompt_id": "prm_abc123",
"production_version_id": "ver_abc123",
"rollback_propagated_at": "2026-04-11T10:05:00Z",
"cache_invalidation_ms": 230
}
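Rollback is a pointer swap followed by a cache invalidation, which the in-memory sketch below tries to show. Plain dicts stand in for Postgres and Redis, and the class and attribute names are hypothetical. A real deployment would also need to notify application instances that hold local copies, which this sketch does not model.

```python
import time


class PromptRegistry:
    """In-memory stand-in for the registry; dicts replace Postgres and Redis."""

    def __init__(self):
        self.versions = {}   # version_id -> prompt_id
        self.pointers = {}   # (prompt_id, environment) -> version_id
        self.cache = {}      # stand-in for the Redis cache
        self.audit_log = []  # (prompt_id, environment, previous, target)

    def rollback(self, prompt_id, target_version_id, environment="production"):
        # Refuse to point a prompt at a version that belongs to another prompt.
        if self.versions.get(target_version_id) != prompt_id:
            raise ValueError("target version does not belong to this prompt")
        started = time.monotonic()
        key = (prompt_id, environment)
        previous = self.pointers.get(key)
        self.pointers[key] = target_version_id  # the pointer swap
        self.cache.pop(key, None)               # force the next read to reload
        self.audit_log.append((prompt_id, environment, previous, target_version_id))
        elapsed_ms = (time.monotonic() - started) * 1000
        return {
            "prompt_id": prompt_id,
            f"{environment}_version_id": target_version_id,
            "cache_invalidation_ms": round(elapsed_ms, 3),
        }
```

No version data is copied or rewritten during a rollback, which is why it can be instant and why it needs no code deploy.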
High-level design
The platform has two distinct paths: the authoring path (low-traffic, high-latency-tolerance) where teams create and test prompts, and the resolution path (high-traffic, ultra-low-latency) where applications fetch the active prompt at runtime. These paths share a database but operate at completely different scales.
The authoring path flows through a web UI and API to a Prompt Service that stores versions in Postgres. When a new version is created, the eval pipeline kicks off automatically, running 200 test cases against the new prompt and scoring results. If the eval gate passes, the version becomes eligible for promotion. Teams can then promote to staging, run an A/B test, and eventually promote to production.
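The eval gate decision itself can be very small. The sketch below assumes each test case has already been scored between 0 and 1. The thresholds (0.85 to pass, 0.75 for manual review, 0.5 for a single case to count as passed) are illustrative values I picked, not numbers from the design; in practice each team would tune them per prompt.

```python
from statistics import mean


def gate_result(scores, pass_threshold=0.85, review_threshold=0.75,
                case_pass_score=0.5):
    """Summarize per-case scores into an EvalRun-style gate decision."""
    if not scores:
        raise ValueError("eval run produced no scores")
    avg = mean(scores)
    passed = sum(1 for s in scores if s >= case_pass_score)
    if avg >= pass_threshold:
        result = "pass"
    elif avg >= review_threshold:
        result = "manual_review"   # borderline: a human decides
    else:
        result = "fail"
    return {
        "total_cases": len(scores),
        "passed_cases": passed,
        "avg_score": avg,
        "gate_result": result,
    }
```

The manual_review band matters because prompt quality is partly subjective: a borderline score is a signal to look at the outputs, not an automatic rejection.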
The resolution path is the critical path. Every LLM call in your company starts with a prompt resolution. The application sends prompt_id + environment to the Resolution API, which checks Redis first (99%+ cache hit rate). On a cache miss, it reads from Postgres, populates the cache, and returns the resolved prompt. When an A/B test is running, the resolver uses a deterministic hash of user_id to assign a variant, so the same user always sees the same variant.
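The two mechanics in this paragraph, cache-aside lookup and deterministic variant assignment, can be sketched as follows. A dict stands in for Redis, load_from_db is a hypothetical callback in place of the Postgres read, and the key format is an assumption. The assignment shown is simple hash bucketing against the traffic weights, not a full consistent-hashing ring.

```python
import hashlib


def assign_variant(experiment_id, user_id, variants):
    """Map a user to a variant; variants is [(version_id, traffic_weight), ...]."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for version_id, weight in variants:
        cumulative += weight
        if bucket < cumulative:
            return version_id
    return variants[-1][0]  # guard against float rounding in the weights


def resolve_record(prompt_id, env, cache, load_from_db):
    """Cache-aside read: serve from cache, fall back to the database on a miss."""
    key = f"prompt:{prompt_id}:{env}"
    record = cache.get(key)
    if record is None:
        record = load_from_db(prompt_id, env)
        cache[key] = record
    return record
```

Including experiment_id in the hash input keeps a user's bucket independent across experiments. With a 99% hit rate at 10,000 requests per second, roughly 100 reads per second reach Postgres, which is why the miss path can afford the 50ms budget.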
I have seen candidates design the authoring UI in detail and forget that the resolution API is the hard part. The interviewer cares about the hot path: how does every LLM call in production get its prompt in under 5ms?
Here is the prompt deployment pipeline step by step, from authoring to live traffic: