Design an AI code assistant

TL;DR

The 50ms constraint is the entire design. Standard LLM inference takes 500ms-2s, so you run a small quantized model (6B params, INT4) via a regional API close to the developer, not a general-purpose LLM.
Context collection is the quality lever: cursor position, file prefix/suffix, open tabs, recent edits, and language detection all feed the completion model. Budget allocation across these sources determines acceptance rate.
The task is fill-in-the-middle (FIM): given the prefix before the cursor and the suffix after, predict what goes in the middle. This differs from plain left-to-right completion, which conditions only on the prefix.
Implicit acceptance signals (Tab pressed, key pressed without Tab, partial accept) power the quality flywheel without requiring any extra action from the developer.
Speculative decoding with a small draft model plus a large verifier achieves 2-3x inference speedup while maintaining suggestion quality.
Multi-file refactoring uses a plan-then-execute agentic architecture with AST-based indexing, topological edit ordering, and per-diff syntax validation.

Requirements

Functional requirements

The IDE extension offers inline, single-line or multi-line code completion as the developer types.
Completions stream token-by-token; the first token appears in under 50ms (time to first token, TTFT).
The system supports all major languages (Python, TypeScript, Go, Java, Rust) with language-aware context.
Developer accept/reject signals are collected and used to improve suggestion quality over time.
Privacy-sensitive repositories can opt out of sending code to remote servers.

Non-functional requirements

10M daily active developers, each generating roughly 500 completion requests per day (one session).
TTFT under 50ms at P90. Full completion (50-200 tokens) under 300ms.
Suggestion acceptance rate (SAR) target at or above 30% across all file types.
System degrades gracefully: if the model API is unreachable, the IDE continues working without suggestions rather than blocking the editor.
Training signal latency: accept/reject events must be ingested and usable for model fine-tuning within 24 hours.
Cost per completion under $0.001 at steady state. At 5B daily requests, inference cost must stay below $5M per day.
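
The scale numbers above can be sanity-checked with a few lines of arithmetic. The sketch below assumes one session per developer per day, which is what makes 10M developers at 500 requests each match the 5B daily figure. It derives only the average request rate; peak traffic would be higher and is not modeled here.

```python
# Back-of-envelope check of the scale figures in the requirements.
DAILY_ACTIVE_DEVELOPERS = 10_000_000
REQUESTS_PER_DEVELOPER_PER_DAY = 500      # assumption: one session per day
COST_CEILING_PER_COMPLETION_USD = 0.001
SECONDS_PER_DAY = 24 * 60 * 60

daily_requests = DAILY_ACTIVE_DEVELOPERS * REQUESTS_PER_DEVELOPER_PER_DAY
average_qps = daily_requests / SECONDS_PER_DAY
daily_cost_ceiling_usd = daily_requests * COST_CEILING_PER_COMPLETION_USD

print(f"daily requests: {daily_requests:,}")
print(f"average QPS:    {average_qps:,.0f}")
print(f"daily cost cap: ${daily_cost_ceiling_usd:,.0f}")
```

The average works out to roughly 58,000 requests per second, which is the load the regional clusters must absorb in aggregate before any peak-hour multiplier.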

The hardest engineering problem here is not the model itself. It is assembling the right context from a sprawling codebase into a 4K-8K token window in under 10ms, while the developer is still typing. Context quality is the difference between a 20% and a 40% acceptance rate.

The core entities

CompletionRequest

request_id, user_id, repo_id, language, prefix (tokens before cursor), suffix (tokens after cursor), open_files[], cursor_line, cursor_column, timestamp
The prefix and suffix are the FIM inputs. The open_files array provides cross-file context candidates.
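
As a rough illustration of how the prefix and suffix might be arranged into a FIM prompt, the sketch below uses a prefix, suffix, middle layout over pre-tokenized input. The sentinel strings and the function name are placeholders chosen for this example; a real model defines its own special tokens. The default caps mirror the 2,000 and 500 token limits given in the high-level design.

```python
# Placeholder sentinels (assumption): real FIM models define their own tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(
    prefix_tokens: list[str],
    suffix_tokens: list[str],
    max_prefix: int = 2000,
    max_suffix: int = 500,
) -> list[str]:
    """Arrange prefix and suffix so the model generates the missing middle."""
    # Keep the prefix tokens closest to the cursor (the tail of the prefix).
    kept_prefix = prefix_tokens[-max_prefix:] if max_prefix > 0 else []
    # Keep the suffix tokens closest to the cursor (the head of the suffix).
    kept_suffix = suffix_tokens[:max_suffix] if max_suffix > 0 else []
    return [FIM_PREFIX, *kept_prefix, FIM_SUFFIX, *kept_suffix, FIM_MIDDLE]
```

Truncating from the far ends keeps the code nearest the cursor, which is usually the most relevant part of each side.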

CompletionResponse

request_id, suggestions[] (text, token count, confidence), model_version, cache_hit, ttft_ms, total_ms, context_sources[]
The context_sources field records which files contributed to the prompt, enabling downstream quality analysis.

AcceptanceEvent (the training signal)

event_id, request_id, action (accept/reject/partial_accept/backspace_after_accept), accepted_tokens, time_to_action_ms, timestamp
time_to_action_ms measures how long the developer spent reading the suggestion before acting. Short times with rejection suggest the suggestion was obviously wrong.

ModelVersion

model_id, base_model, fine_tune_checkpoint, training_cutoff, sar_by_language{}, deployed_at, speculative_decoding_config

ContextBudget

total_tokens, prefix_allocation, type_defs_allocation, recent_edits_allocation, repo_search_allocation, adaptation_weights{}
Tracks how the token budget was split for each request. Used in post-hoc analysis to correlate context allocation with acceptance rate.
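
A minimal sketch of how such a budget could be computed is shown below. It splits total_tokens across the four sources in proportion to a set of weights. The allocate_budget function and the example weights are assumptions made for illustration, and adaptation_weights is passed in as a plain dictionary rather than stored on the record.

```python
from dataclasses import dataclass

SOURCES = ("prefix", "type_defs", "recent_edits", "repo_search")


@dataclass(frozen=True)
class ContextBudget:
    total_tokens: int
    prefix_allocation: int
    type_defs_allocation: int
    recent_edits_allocation: int
    repo_search_allocation: int


def allocate_budget(total_tokens: int, weights: dict[str, float]) -> ContextBudget:
    """Split total_tokens across context sources in proportion to weights."""
    weight_sum = sum(weights[source] for source in SOURCES)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    shares = {
        source: int(total_tokens * weights[source] / weight_sum)
        for source in SOURCES
    }
    # Integer truncation can leave a few tokens unassigned; give them to the prefix.
    shares["prefix"] += total_tokens - sum(shares.values())
    return ContextBudget(
        total_tokens=total_tokens,
        prefix_allocation=shares["prefix"],
        type_defs_allocation=shares["type_defs"],
        recent_edits_allocation=shares["recent_edits"],
        repo_search_allocation=shares["repo_search"],
    )


# Hypothetical weights for illustration only.
budget = allocate_budget(
    4096, {"prefix": 5, "type_defs": 2, "recent_edits": 2, "repo_search": 1}
)
```

Because the record stores the final split per request, the same structure can be logged unchanged for the post-hoc analysis.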

API design

Three endpoints serve the entire system. The completion endpoint is the hot path, called on every idle keystroke. The signals endpoint collects implicit feedback in batches. The status endpoint provides observability for both the IDE extension and internal dashboards.

POST /api/complete (hot path, called on every idle keystroke)

Request:  {
  "prefix": "def calculate_total(items):\n    total = 0\n    for item in items:\n        total += ",
  "suffix": "\n    return total",
  "language": "python",
  "context_files": ["cart.py", "models.py"],
  "cursor_line": 4,
  "cursor_column": 18
}
Response: {
  "suggestions": [{"text": "item.price * item.quantity", "tokens": 6, "confidence": 0.87}],
  "ttft_ms": 38,
  "cache_hit": false,
  "context_sources": ["cart.py:CartItem", "models.py:Price"]
}

The context_sources field in the response tells the IDE which files contributed to the suggestion. This is used for quality analytics: correlate context source types with acceptance rates to tune the budget allocator.
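
The correlation described here could start as a simple per-source acceptance rate. The sketch below assumes the completion responses and acceptance events have already been joined on request_id into (context_sources, action) pairs. The function name is hypothetical, and only a full "accept" is counted as accepted, which is a simplification.

```python
from collections import defaultdict


def acceptance_rate_by_source(
    records: list[tuple[list[str], str]],
) -> dict[str, float]:
    """Return the fraction of shown suggestions accepted, per context source."""
    shown: dict[str, int] = defaultdict(int)
    accepted: dict[str, int] = defaultdict(int)
    for context_sources, action in records:
        # De-duplicate so a source listed twice counts once per request.
        for source in set(context_sources):
            shown[source] += 1
            if action == "accept":
                accepted[source] += 1
    return {source: accepted[source] / shown[source] for source in shown}


rates = acceptance_rate_by_source([
    (["cart.py:CartItem", "models.py:Price"], "accept"),
    (["cart.py:CartItem"], "reject"),
])
```

In practice the sources would be grouped by type (type definitions, recent edits, repository search) before comparing rates, since individual file names are too sparse to tune against.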

POST /api/signals (batched acceptance events, sent every 30s)

Request:  {
  "events": [
    { "request_id": "r_abc", "action": "accept", "accepted_tokens": 6, "time_to_action_ms": 1200 },
    { "request_id": "r_def", "action": "reject", "accepted_tokens": 0, "time_to_action_ms": 400 }
  ]
}
Response: { "received": 2 }

Signals are batched client-side and sent every 30 seconds to avoid adding latency to the completion hot path. The time_to_action_ms field measures how long the developer spent reading the suggestion before acting, which helps distinguish "glanced and dismissed" from "read carefully and rejected."
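
A client-side batcher along these lines might look like the sketch below. The SignalBatcher class, its send callback, and the injectable clock are assumptions made for illustration; a real extension would call maybe_flush from an editor timer and post the payload to /api/signals off the UI thread.

```python
import time
from typing import Callable


class SignalBatcher:
    """Buffer acceptance events and flush them at most once per interval."""

    def __init__(
        self,
        send: Callable[[dict], None],
        flush_interval_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._interval = flush_interval_s
        self._clock = clock
        self._buffer: list[dict] = []
        self._last_flush = clock()

    def record(self, event: dict) -> None:
        # Recording only appends, so it never delays the completion hot path.
        self._buffer.append(event)

    def maybe_flush(self) -> bool:
        """Send the buffered events if the interval has elapsed."""
        if not self._buffer:
            return False
        if self._clock() - self._last_flush < self._interval:
            return False
        batch, self._buffer = self._buffer, []
        self._last_flush = self._clock()
        try:
            self._send({"events": batch})
        except Exception:
            # Keep the events so the next flush can retry them.
            self._buffer = batch + self._buffer
            return False
        return True
```

Retrying on the next interval, rather than immediately, keeps a failing signals endpoint from generating extra traffic.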

GET /api/model/status

Response: {
  "model_version": "codestral-22b-v3",
  "sar_7d": 0.33,
  "p90_ttft_ms": 44,
  "cache_hit_rate": 0.21,
  "status": "healthy"
}

The status endpoint is used by both the IDE extension (to show model health in the status bar) and by internal monitoring dashboards. If the model version changes, the IDE can display a "model updated" notification.

High-level design

The 50ms constraint is not negotiable. It eliminates general-purpose frontier models (GPT-4o, Claude Sonnet) that take 500ms-2s even with fast inference. The solution is a dedicated code completion model of 6-7 billion parameters, INT4-quantized, served from the regional API cluster nearest the developer. I have seen candidates propose using GPT-4o with aggressive caching to meet the latency target. It does not work: cache hit rates on code are 15-25%, so 75-85% of requests still take the slow path and the P90 stays far above 50ms.
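
The caching argument can be checked numerically. The sketch below takes the best-case 25% hit rate and the 500ms lower bound for the slow path from the text. The 5ms cache-hit latency is an assumption picked for illustration, and both function names are hypothetical.

```python
def p90_served_from_cache(cache_hit_rate: float) -> bool:
    """P90 lands on the fast path only if at least 90% of requests are hits."""
    return cache_hit_rate >= 0.90


def mean_latency_ms(cache_hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Expected latency when hits and misses have fixed latencies."""
    return cache_hit_rate * hit_ms + (1.0 - cache_hit_rate) * miss_ms


best_case_hit_rate = 0.25   # upper end of the 15-25% range
assumed_hit_ms = 5.0        # assumption: latency of a cache hit
slow_path_ms = 500.0        # lower bound of the 500ms-2s range

print(p90_served_from_cache(best_case_hit_rate))
print(mean_latency_ms(best_case_hit_rate, assumed_hit_ms, slow_path_ms))
```

Even with the most favorable numbers, the 90th-percentile request is a cache miss, so its latency is whatever the slow model takes.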

The IDE extension collects context with a 100-200ms debounce on idle keystrokes to avoid flooding the API on every character. Context includes the prefix (code above the cursor, up to 2,000 tokens), the suffix (code below the cursor, up to 500 tokens), open files in the editor, and a language/framework fingerprint. The model uses this for the fill-in-the-middle (FIM) task: predict what belongs between the prefix and suffix.
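
The debounce step can be sketched as a small state machine. The Debouncer class below is a hypothetical illustration driven by explicit timestamps so that it is easy to test; the 150ms default is simply a value inside the 100-200ms range given above.

```python
from typing import Optional


class Debouncer:
    """Allow one completion request once typing has been idle for delay_ms."""

    def __init__(self, delay_ms: int = 150) -> None:
        self.delay_ms = delay_ms
        self._last_key_ms: Optional[int] = None
        self._fired = True  # nothing is pending until a key is pressed

    def on_keystroke(self, now_ms: int) -> None:
        # Every keystroke restarts the idle timer and re-arms the debouncer.
        self._last_key_ms = now_ms
        self._fired = False

    def should_request(self, now_ms: int) -> bool:
        """Return True exactly once per idle period."""
        if self._fired or self._last_key_ms is None:
            return False
        if now_ms - self._last_key_ms < self.delay_ms:
            return False
        self._fired = True
        return True
```

A burst of keystrokes at 0ms, 50ms, and 100ms therefore produces a single request, issued once 150ms have passed after the last key.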

Completions stream back token by token. The IDE renders each token as it arrives so the developer sees text materialising rather than waiting for a full block. If an earlier token makes the suggestion obviously wrong, the developer can press any key to dismiss immediately. This is the UX that makes 50ms feel fast even when the full completion takes 250ms.
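
The render-as-you-go behavior might look like the sketch below. The stream_ghost_text function, its render callback, and its dismissed check are assumptions standing in for whatever the editor API actually provides.

```python
from typing import Callable, Iterable


def stream_ghost_text(
    tokens: Iterable[str],
    render: Callable[[str], None],
    dismissed: Callable[[], bool],
) -> str:
    """Render tokens as they arrive; clear the ghost text if dismissed."""
    ghost_text = ""
    for token in tokens:
        if dismissed():
            # The developer pressed a key: clear the suggestion and stop.
            render("")
            return ""
        ghost_text += token
        render(ghost_text)
    return ghost_text


frames: list[str] = []
final_text = stream_ghost_text(["item", ".price"], frames.append, lambda: False)
```

Checking for dismissal before each token means a rejected suggestion stops consuming render time as soon as the developer reacts.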

The system has two distinct paths: the hot path (a completion request, under 300ms end to end) and the cold path (signal ingestion and model training, on a daily to weekly cadence). Every architectural decision aims to keep the hot path as fast as possible while the cold path runs in the background.

At the component level, the IDE extension is the entry point: it debounces keystrokes, collects context, and manages the streaming display of ghost text. The regional API cluster handles routing, caching, and model serving. The signal pipeline runs asynchronously and feeds back into model training on a weekly cadence.
