Log Aggregator
Design the observability backbone of a large distributed system: ingest, index, and query millions of log events and time-series metrics per second across thousands of servers in near real time.
What is a distributed logging and metrics system?
A distributed logging system collects, stores, and makes searchable every log line emitted by every server in a fleet. At 10,000 application servers the challenge is not writing logs; it is collecting them without losing data during traffic spikes, indexing 1TB per day so queries return in under 5 seconds, and keeping 30 days of hot data queryable while archiving a full year cheaply. I find this is one of the best interview questions for senior candidates because the naive version is trivial, but each scale constraint eliminates a different shortcut. This interview question tests pipeline architecture, write-heavy system design, inverted-index fundamentals, and the tradeoff between storage cost and query latency.
Functional Requirements
Core Requirements
- Collect logs and metrics from thousands of servers in near real time.
- Store logs such that they can be searched and queried by time range, service name, and log level.
- Export aggregated metrics for dashboarding and threshold-based alerting.
Below the Line (out of scope)
- Distributed tracing and APM (Application Performance Monitoring)
- Log-based security intrusion detection (SIEM)
- Log-based billing or audit trails with tamper-proof guarantees
The hardest part in scope: Indexing logs fast enough to query. Writing 1TB/day at 12 MB/sec is manageable. The trap is that naive storage (one log line = one document) makes full-text search across billions of rows take minutes, not seconds. Time-partitioned columnar segments with an inverted index are the correct answer, and explaining why is the heart of this design.
Distributed tracing is below the line because it requires correlating spans across services using a trace context header (W3C TraceContext or Zipkin format). It is architecturally distinct from log aggregation: traces need a causal graph store, not a full-text index. To add it, I would build a Trace Ingestion Service that accepts OTLP spans from instrumented services, fans them into a separate Kafka topic, and writes to a columnar store (Jaeger with Cassandra backing) keyed by trace_id. The log pipeline described here is unchanged.
SIEM is below the line because it requires real-time pattern matching against threat signatures, which demands a streaming analytics engine (Flink or Spark Streaming) on top of the log pipeline. This is a consumer of logs, not a change to the pipeline. To add it: attach a Flink job to the Kafka log topic that evaluates each log event against a rule set and emits alerts to a Security Incident topic.
Log-based billing with tamper-proof guarantees is below the line because it requires append-only immutable storage with cryptographic chaining. The log pipeline described here does not guarantee immutability. To add tamper evidence, attach a Write-Once Object Store (S3 Object Lock with Compliance mode) and write signed log batches there in parallel with the primary index.
Non-Functional Requirements
Core Requirements
- Durability: 99.9% of log messages must be delivered. Some loss is acceptable (a handful of debug logs dropped during a Kafka restart is tolerable; error logs must not be lost).
- Write throughput: 10,000 servers, each averaging roughly one log line per second (with per-host bursts up to 1,000 lines per second during deploys or incidents), means roughly 10,000 events/second sustained cluster-wide. At an average log line size of 1.2 KB, that is about 12 MB/sec of sustained write traffic, or 1 TB/day.
- Query latency: Time-range search across all logs (e.g., all `ERROR` events for service `payments` in the last 10 minutes) must return in under 5 seconds at p95.
- Retention: 30 days hot (indexed, fast query via Elasticsearch or equivalent). 365 days cold (archival in S3/object storage, query via Athena or batch scan).
- Scale: 1 TB of log data ingested per day. After 30 days, roughly 30 TB of hot storage. After 365 days, roughly 365 TB raw, or about 120 TB of cold archival storage assuming 3x compression.
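A quick back-of-envelope check of these numbers (a sketch using the stated 1.2 KB average event size and 3x cold-compression assumptions):

```python
# Back-of-envelope check of the stated capacity numbers.
SERVERS = 10_000
AVG_EVENT_BYTES = 1.2 * 1024      # 1.2 KB average log line (stated above)
DAILY_INGEST_TB = 1.0             # 1 TB/day of raw logs (stated above)

# Sustained throughput implied by 1 TB/day.
bytes_per_sec = DAILY_INGEST_TB * 1024**4 / 86_400
events_per_sec = bytes_per_sec / AVG_EVENT_BYTES
events_per_server = events_per_sec / SERVERS

hot_storage_tb = DAILY_INGEST_TB * 30        # 30 days hot, uncompressed
cold_storage_tb = DAILY_INGEST_TB * 365 / 3  # 365 days cold at 3x compression

print(f"sustained: {bytes_per_sec / 1024**2:.1f} MB/s "
      f"({events_per_sec:,.0f} events/s, ~{events_per_server:.0f}/server)")
print(f"hot: {hot_storage_tb:.0f} TB, cold: {cold_storage_tb:.0f} TB")
```

Running the arithmetic gives roughly 12 MB/s and about 10,000 events/second sustained (about one line per second per server on average), 30 TB hot, and roughly 120 TB of compressed cold storage.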
Below the Line
- Sub-second query latency (requires pre-aggregated materialized views, changes the indexing model)
- Multi-tenant log isolation with per-customer encryption-at-rest
- Log anomaly detection using ML (consumer of logs, not a pipeline change)
Read/write ratio: This is among the most write-heavy systems you will design in an interview. Writes dominate: thousands of servers emit logs continuously while queries are sporadic bursts from on-call engineers or dashboards. A rough ratio is 100:1 writes to reads during normal operations, inverted briefly during incidents when many engineers are querying simultaneously. This asymmetry is the lens for every architecture decision in this article. The write path must be cheap and lossy-tolerant; the read path must be fast on demand.
I treat the 5-second query SLA as the key number. It is strict enough to require a proper inverted index (not a full table scan) but lenient enough that we do not need pre-materialized per-service aggregates on every query. I would say this number out loud in the interview and then pause; it is the single constraint that drives the entire storage tier decision.
Core Entities
- LogEvent: A single log line with timestamp, service name, host, log level, message body, and optional structured fields (request ID, user ID, error code).
- MetricPoint: A single numeric measurement at a point in time: metric name, value, tags (host, service, region), and timestamp. Stored separately from logs.
- Index Segment: A time-bounded, immutable chunk of the log index. One segment covers a fixed time window (e.g., one hour of data). The unit of querying.
- Alert Rule: A threshold condition on a metric (e.g., `error_rate > 1% for 5min`). Evaluated periodically against the metrics pipeline.
- Dashboard: A saved collection of metric queries rendered as charts. Backed by the metrics query service.
Schema design and partition strategies are deferred to the deep dives. The five entities above are sufficient to drive the API and High-Level Design.
API Design
FR 1 - Ingest logs from a server:
POST /ingest/logs
Body: { events: [{ timestamp, service, host, level, message, fields? }] }
Response: 202 Accepted
202 (not 200) because the log pipeline is asynchronous. The request is accepted and queued; it is not yet durable. Clients that need durability acknowledgment should use the Kafka SDK directly.
Batching in the request body (the events array) is mandatory: single-event ingestion from hosts bursting to 1,000 events/sec each would create catastrophic per-request overhead on the HTTP layer.
FR 2 - Query logs by time range, service, and level:
Naive shape:
GET /logs/search?service=payments&level=ERROR&from=2026-03-29T10:00Z&to=2026-03-29T10:10Z
Response: { logs: [...], next_cursor }
This naive shape breaks at scale: returning all matching logs in one shot at 1TB/day means a 10-minute window can contain millions of matching rows. The evolved shape adds cursor-based pagination and a result limit.
Evolved shape:
GET /logs/search?service=payments&level=ERROR&from=2026-03-29T10:00Z&to=2026-03-29T10:10Z&limit=100&cursor={opaque_cursor}
Response: { logs: [...], next_cursor, total_matched }
Cursor-based pagination is required here. Offset-based pagination (skip N) requires the query engine to scan and discard N rows on every page request, which is prohibitively expensive against a time-series index. The cursor encodes the last-seen segment ID and offset, allowing the query engine to resume without re-scanning.
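One way such an opaque cursor could be implemented (a sketch; the segment-ID/offset layout is an illustrative assumption, not a fixed wire format):

```python
import base64
import json

def encode_cursor(segment_id: str, offset: int) -> str:
    """Pack the last-seen position into an opaque, URL-safe token."""
    raw = json.dumps({"seg": segment_id, "off": offset}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> tuple[str, int]:
    """Recover the resume position; the query engine starts scanning here
    instead of re-scanning everything before it."""
    state = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return state["seg"], state["off"]

cursor = encode_cursor("logs-2026-03-29-10", 4200)
assert decode_cursor(cursor) == ("logs-2026-03-29-10", 4200)
```

Because the client treats the token as opaque, the server is free to change the internal layout without breaking pagination.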
FR 3 - Export metrics for dashboarding and alerting:
GET /metrics/query?metric=http_error_rate&from=2026-03-29T09:00Z&to=2026-03-29T10:00Z&step=60s&tags=service:payments
Response: { datapoints: [{ timestamp, value }] }
POST /alerts
Body: { metric, condition, threshold, duration_s, notification_channel }
Response: { alert_id }
The metrics query API is modelled on Prometheus's range query API (/query_range). The step parameter controls downsampling resolution. Dashboards use this endpoint to render time-series charts.
FR 4 - Stream live logs (tail -f equivalent):
GET /logs/stream?service=payments&level=ERROR
Response: text/event-stream (Server-Sent Events)
Server-Sent Events over HTTP is the right choice here. WebSockets are bidirectional and add unnecessary complexity when the client only reads. SSE is a standard HTTP connection that the server pushes events on; proxies and load balancers handle it well. The client receives `data:` events as new log lines match the filter.
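A minimal sketch of the frames the streaming endpoint would emit (the `data:`-plus-blank-line framing is the SSE wire format; the JSON payload shape is an assumption):

```python
import json

def sse_event(log_line: dict) -> str:
    """Format one matching log event as a Server-Sent Events frame.
    Each frame is a `data:` line terminated by a blank line; the browser's
    EventSource API surfaces it to the client as a message event."""
    return f"data: {json.dumps(log_line)}\n\n"

frame = sse_event({"service": "payments", "level": "ERROR", "message": "timeout"})
assert frame.startswith("data: ") and frame.endswith("\n\n")
```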
High-Level Design
1. Collect logs from thousands of EC2 servers
The collection path: a lightweight agent on each server buffers log lines locally and ships batches to Kafka. The agent absorbs local write spikes without creating backpressure on downstream consumers.
The naive approach is to have each server POST logs directly to an ingestion API. That breaks immediately: 10,000 servers hammering a single ingestion service creates an enormous connection fan-in and eliminates any buffering. One slow ingest service stalls the entire fleet. I have seen this exact architecture in production at a startup that hit 500 servers and then watched their ingest service fall over during every deploy.
The key insight is that collection and ingestion must be decoupled by a durable buffer (Kafka). The agent on each server is responsible for one thing: getting bytes off disk and into Kafka reliably. Everything downstream can fail and restart without losing a log line.
Components:
- Log Agent (Fluent Bit): A lightweight sidecar process on every EC2 instance. Tails log files or consumes from `journald`. Batches events into 1-second windows and ships to Kafka. Writes a local disk buffer if Kafka is unreachable.
- Kafka (Log Topic): Central durable buffer. Partitioned by `service_name` so that all logs for a given service go to the same partition set. Replication factor 3 for durability.
- Kafka (Metrics Topic): Separate topic for numeric metric points. Partitioned by `metric_name`.
Request walkthrough:
- Application writes a log line to stdout or a log file on the EC2 instance.
- Fluent Bit agent tails the file (or reads from stdout pipe), parses the log line, and enriches it with `host`, `service`, and `region` metadata.
- Fluent Bit batches events in a 1-second window and writes the batch to the Kafka `logs` topic, partitioned by `service_name`.
- If Kafka is unreachable, Fluent Bit writes to a local disk buffer (up to 512 MB) and retries with exponential backoff. This is what gives us the 99.9% durability guarantee: the agent survives transient Kafka outages without dropping events.
- Kafka brokers replicate the batch to 2 additional brokers before acknowledging. `acks=all` is set on the producer.
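The agent's ship loop can be sketched as follows (an in-memory stand-in for Fluent Bit's behavior; `send_to_kafka` represents a real producer configured with `acks=all`, and the deque stands in for the on-disk buffer):

```python
import collections

class AgentShipper:
    """Sketch of the agent ship loop: try Kafka, fall back to a bounded
    local buffer on failure, and drain buffered batches in order once
    Kafka is reachable again."""

    MAX_BUFFER_BYTES = 512 * 1024 * 1024  # 512 MB local disk buffer

    def __init__(self, send_to_kafka):
        self.send_to_kafka = send_to_kafka
        self.buffer = collections.deque()  # stands in for the on-disk buffer
        self.buffered_bytes = 0
        self.dropped = 0

    def ship(self, batch: list[bytes]) -> None:
        # Drain previously buffered batches first, preserving order.
        while self.buffer:
            head = self.buffer[0]
            if not self._try_send(head):
                break
            self.buffer.popleft()
            self.buffered_bytes -= sum(len(e) for e in head)
        if not self._try_send(batch):
            self._buffer_locally(batch)

    def _try_send(self, batch) -> bool:
        try:
            self.send_to_kafka(batch)
            return True
        except ConnectionError:
            return False

    def _buffer_locally(self, batch) -> None:
        size = sum(len(e) for e in batch)
        if self.buffered_bytes + size > self.MAX_BUFFER_BYTES:
            self.dropped += len(batch)  # loss begins only when the buffer fills
        else:
            self.buffer.append(batch)
            self.buffered_bytes += size
```

Loss begins only when the 512 MB buffer fills, which is where the roughly-seven-minute outage-coverage figure later in the article comes from.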
The agent-plus-Kafka pattern is the collection backbone. Every other component in the system is a consumer of this durable buffer. A single ingestion pipeline failure does not lose logs; the Kafka retention window (48 hours in this design) gives consumers time to recover and replay.
2. Store and index logs for search
The index path: Kafka consumers read log batches and write them into the Elasticsearch index. The indexer builds time-partitioned inverted indexes on each batch before committing the Kafka offset.
I treat the indexing layer as a black box in this section and explain exactly how Elasticsearch builds inverted indexes in Deep Dive 2. For now, the important structure is: Kafka consumer reads, indexer writes, query API reads.
Components:
- Log Indexer Service: A pool of Kafka consumer workers. Each worker reads a batch from the `logs` topic, parses structured fields, and does a bulk write to Elasticsearch. The Kafka offset is committed only after the Elasticsearch write succeeds, ensuring at-least-once semantics.
- Elasticsearch Cluster: Stores and indexes log documents. Each index covers a 1-hour time window (time-based index rotation). Documents are indexed on `timestamp`, `service`, `level`, and a full-text inverted index on `message`.
- Query API: Stateless service that accepts search requests, translates them to Elasticsearch DSL queries, executes scatter-gather across relevant shards, and returns paginated results.
- S3 Cold Archive: Log Indexer also writes raw log batches (Parquet-compressed) to S3 in parallel. After 30 days, hot indexes are deleted. Cold data stays in S3 for 365 days and is queryable via AWS Athena.
Request walkthrough (write path):
- Log Indexer consumer reads a batch of log events from Kafka.
- Log Indexer bulk-writes the batch to the current active Elasticsearch index (e.g., `logs-2026-03-29-10`).
- Elasticsearch tokenizes and indexes each `message` field into an inverted posting list.
- Log Indexer also writes the same batch as compressed Parquet to `s3://logs-archive/{date}/{service}/batch-{uuid}.parquet`.
- Log Indexer commits the Kafka consumer offset after both writes succeed.
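The commit-after-write discipline from this walkthrough, sketched with stand-in callables (`bulk_index`, `archive_to_s3`, and `commit_offset` are hypothetical names, not a real client API):

```python
def index_batch(batch, bulk_index, archive_to_s3, commit_offset):
    """At-least-once step: commit the Kafka offset only after BOTH the hot
    (Elasticsearch) and cold (S3) writes succeed. A crash between write and
    commit re-delivers the batch on restart -- duplicates, not loss."""
    bulk_index(batch)       # hot path: bulk write to the hourly ES index
    archive_to_s3(batch)    # cold path: compressed Parquet to S3
    commit_offset(batch)    # commit last, never first

calls: list[str] = []
index_batch(["evt-1"],
            bulk_index=lambda b: calls.append("es"),
            archive_to_s3=lambda b: calls.append("s3"),
            commit_offset=lambda b: calls.append("commit"))
assert calls == ["es", "s3", "commit"]  # commit is strictly last
```

If either write raises, the function never reaches the commit, so the batch is redelivered to another worker.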
Request walkthrough (search query):
- Engineer sends `GET /logs/search?service=payments&level=ERROR&from=...&to=...`.
- Query API determines which Elasticsearch time-window indexes overlap the requested range.
- Query API fans the query out to all shards of those indexes in parallel (scatter phase).
- Each shard evaluates the inverted index for `level:ERROR` and `service:payments` and sorts hits by timestamp.
- Query API merges and re-ranks results from all shards (gather phase) and returns the first page with a cursor.
The dual-write to Elasticsearch (hot) and S3 (cold) happens in the same indexer commit cycle. This keeps the cold archive in sync with the hot index without a separate archival job. I would draw this dual-write on the whiteboard as a fork after the Kafka consumer, because interviewers often ask "what about long-term storage?" and having the answer already in the diagram shows forethought.
3. Export metrics for dashboarding and alerting
The metrics path is a separate pipeline from logs. Numeric metric points flow through their own Kafka topic into a time-series store (Prometheus or InfluxDB). The alerting engine evaluates rules against the time-series store on a polling cadence.
Mixing logs and metrics into the same storage system is a common design mistake. Logs are unstructured text with variable field schemas. Metrics are typed numeric time series with predictable access patterns. I have personally watched a team try to store metrics in Elasticsearch and then wonder why their Grafana dashboards took 15 seconds to load; the inverted index is the wrong data structure for numeric range aggregations.
Time-series databases (Prometheus, InfluxDB, TimescaleDB) are optimized for range queries on numeric data at cardinality that would kill Elasticsearch. Keep them separate.
Components:
- Metrics Consumer: Reads from the Kafka `metrics` topic. Aggregates raw data points into 15-second resolution windows (to reduce storage volume) and writes to the Prometheus remote write endpoint.
- Prometheus: Time-series database. Stores metric name, value, timestamp, and label set (tags). Optimized for range queries and aggregations. At roughly 16 bytes per uncompressed sample, 1 year of 15-second samples for 100,000 metrics is roughly 3 TB (Prometheus's TSDB compression brings the on-disk footprint well below that).
- Alerting: Prometheus evaluates alert rules (defined as PromQL expressions) on a 30-second cadence and forwards firing alerts to Alertmanager, which deduplicates, groups, and routes them to PagerDuty, Slack, or email.
- Grafana: Dashboard UI. Executes PromQL queries against Prometheus via the `/metrics/query` API endpoint and renders time-series charts.
Request walkthrough (metrics dashboard):
- Grafana sends `GET /metrics/query?metric=http_error_rate&from=...&to=...&step=60s`.
- Query API translates to a PromQL range query: `rate(http_errors_total[5m])`.
- Prometheus evaluates the query across the requested time range and returns downsampled data points.
- Grafana renders the time series chart.
Request walkthrough (alerting):
- Prometheus evaluates each alert rule every 30 seconds.
- If `http_error_rate > 1%` is true for 5 consecutive evaluations (2.5 minutes), the alert transitions to the `FIRING` state and is sent to Alertmanager.
- Alertmanager routes the alert to the configured receiver (PagerDuty for `severity=critical`, Slack for `severity=warning`).
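The "true for N consecutive evaluations" debounce can be sketched as a tiny state machine (an illustration of the idea; real rule engines track this state per label set):

```python
class ThresholdAlert:
    """Debounced threshold alert: fire only after the condition holds for
    `required` consecutive evaluations (e.g. 5 evaluations at a 30-second
    cadence = 2.5 minutes), so one noisy sample cannot page anyone."""

    def __init__(self, threshold: float, required: int):
        self.threshold = threshold
        self.required = required
        self.streak = 0        # consecutive breaching evaluations so far
        self.state = "OK"

    def evaluate(self, value: float) -> str:
        self.streak = self.streak + 1 if value > self.threshold else 0
        self.state = "FIRING" if self.streak >= self.required else "OK"
        return self.state

alert = ThresholdAlert(threshold=0.01, required=5)
values = [0.02, 0.02, 0.005, 0.02, 0.02, 0.02, 0.02, 0.02]
states = [alert.evaluate(v) for v in values]
# The dip at 0.005 resets the streak; only the final run of 5 breaches fires.
```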
The metrics pipeline is completely independent of the log pipeline. A failure in Elasticsearch does not degrade alerting. A spike in metrics write volume does not back-pressure the log indexer.
Potential Deep Dives
1. How do you collect logs efficiently from 10,000 servers without losing data?
At 10,000 servers generating up to 1,000 log lines per second each, a badly designed collection layer is the first bottleneck. The concerns are: resource usage on the server (the agent cannot consume meaningful CPU or memory from the application), network reliability (Kafka partitions and brokers fail), and spike absorption (a deployment burst generates 10x normal log volume for 30 seconds).
2. How do you store and index 1 TB/day of logs for sub-5-second queries?
The naive approach (one Elasticsearch document per log line, no time-partitioning) turns into a full cluster scan for every query within days of operation. The correct approach is time-partitioned indexes with inverted posting lists that make level:ERROR AND service:payments queries purely index lookups with zero document scanning.
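The posting-list idea at the heart of this answer, in miniature (a toy illustration, not Elasticsearch's actual on-disk structures):

```python
from collections import defaultdict

# Miniature inverted index: term -> list of matching doc IDs (posting list).
# A query like level:ERROR AND service:payments becomes a posting-list
# intersection -- no document body is ever scanned.
postings: dict[str, list[int]] = defaultdict(list)

def index_doc(doc_id: int, doc: dict) -> None:
    for field, value in doc.items():
        postings[f"{field}:{value}"].append(doc_id)

def search(*terms: str) -> set[int]:
    return set.intersection(*(set(postings[t]) for t in terms))

index_doc(1, {"level": "ERROR", "service": "payments"})
index_doc(2, {"level": "INFO", "service": "payments"})
index_doc(3, {"level": "ERROR", "service": "checkout"})
assert search("level:ERROR", "service:payments") == {1}
```

Time-partitioning adds the second dimension: the query planner only consults the posting lists of segments whose time window overlaps the requested range.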
3. How does a log search query fan out and aggregate across the cluster?
A query for service:payments AND level:ERROR across a 10-minute window can hit 30+ Elasticsearch shards simultaneously in a cluster that holds 30 days of hot data. The scatter-gather mechanism must coordinate all shard results into a single ranked, paginated response in under 5 seconds.
4. How do you support live log streaming (tail -f equivalent) at scale?
An engineer debugging a live production incident needs to see new log lines from service=payments appear in near real time, just like tail -f on a local file. At tens of thousands of events per second cluster-wide (far more during fleet-wide bursts), a naive broadcast model melts the query tier.
Final Architecture
The complete system combines all components: the agent-based collection backbone, the Kafka dual-topic buffer, the Elasticsearch hot index with S3 cold archive, the Prometheus metrics pipeline, and the SSE-based streaming service.
The architecture has two completely independent read paths: the Query API (batch search via Elasticsearch) and the Streaming Service (live tail via Kafka). A failure in Elasticsearch does not degrade live streaming, and vice versa. This independence is the kind of detail I would highlight at the end of a whiteboard session because it shows you built the system with failure isolation in mind, not just the happy path. The Kafka buffer is the single coupling point between all producers and consumers, intentionally: if any consumer falls behind, the 48-hour Kafka retention means it can catch up without losing data.
Interview Cheat Sheet
- Use a local on-host agent (Fluent Bit) to collect logs. The agent must have a local disk buffer (512 MB per host) to absorb Kafka outages without dropping events and without stalling the application.
- Kafka is the backbone, not the storage. It is the durable buffer between collection and indexing. Use `acks=all` with replication factor 3. Partition by `service_name` so logs from the same service land on the same partition set.
- Two separate Kafka topics: one for logs (unstructured text events), one for metrics (typed numeric points). Mixing them into one topic complicates consumer group management and ties the metrics SLA to the log ingestion rate.
- Elasticsearch with hourly time-partitioned indexes is the right hot storage. Use Index Lifecycle Management (ILM) to automatically roll over, shrink, and delete indexes. After 30 days, dropping an hourly index is a single `DELETE` API call, not an expensive delete-by-query.
- Field mapping matters: `service`, `level`, and `host` as `keyword` (exact-match); `message` as `text` (analyzed, full-text). Setting `dynamic: false` prevents field explosion in clusters receiving inconsistent log formats.
- Log search is scatter-gather: the Query API fans requests out to all shards of the relevant hourly indexes in parallel, applies a per-shard result budget (200-500 docs), and k-way merges sorted results at the coordinator. A shard timeout (4 seconds) returns partial results rather than failing the query.
- Cold archive in S3 at 1TB/day accumulates roughly 365 TB/year raw, or about 120 TB assuming 3x compression. Write Parquet-compressed batches from the Log Indexer in the same commit cycle as the Elasticsearch write. Query cold data with Athena when incident retrospectives require it.
- A key number to memorize: a single host bursting at 1,000 lines/sec of 1.2 KB events writes about 1.2 MB/sec, so a 512 MB local disk buffer covers roughly 7 minutes of Kafka unavailability before data loss begins.
- Live log streaming (tail -f) should bypass Elasticsearch entirely. A dedicated Streaming Service reads from Kafka at the latest offset, applies a filter predicate server-side, and pushes matching events via Server-Sent Events. End-to-end latency is under 2 seconds from application write to browser.
- Metrics and logs use different storage systems: Prometheus for numeric time-series (16 bytes/sample, PromQL-native), Elasticsearch for log events (inverted index, full-text search). Mixing them forces both into a worse trade-off than either dedicated store.
- Alerting runs on Prometheus, not Elasticsearch. Prometheus evaluates alerting rules (PromQL expressions) every 30 seconds and hands firing alerts to Alertmanager for routing. Threshold alerts (error rate, latency p99) hit Prometheus where aggregation is fast. Log-content alerts (keyword matches) are a separate path and are out of scope.
- At 10,000 servers, the numbers to say out loud: roughly 10,000 log events per second sustained cluster-wide (about one line per second per server on average, with per-host bursts to 1,000 lines/sec). At 1.2 KB average event size, that is 12 MB/sec sustained write throughput through Kafka and into the index, or 1 TB/day.
- 99.9% durability is achievable with the agent disk buffer plus Kafka RF=3 plus at-least-once indexer semantics (commit offset after write). The 0.1% loss budget covers cases like spot instance terminations within the 60-second notice window. Note that the agent's retry path produces duplicates rather than loss; if duplicates matter, deduplicate at the indexer using a client-generated event ID.
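The buffer-coverage figure above can be checked directly (assuming the per-host burst of 1,000 lines/sec at 1.2 KB each):

```python
# Per-host worst case: 1,000 lines/sec at 1.2 KB each.
burst_bytes_per_sec = 1_000 * 1.2 * 1024      # ~1.2 MB/s per host
buffer_bytes = 512 * 1024 * 1024              # 512 MB agent disk buffer
coverage_min = buffer_bytes / burst_bytes_per_sec / 60
print(f"{coverage_min:.1f} minutes of Kafka outage before loss")  # ~7 minutes
```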