Log Aggregator
Design the observability backbone of a large distributed system: ingest, index, and query millions of log events and time-series metrics per second across thousands of servers in near real time.
What is a distributed logging and metrics system?
A distributed logging system collects, stores, and makes searchable every log line emitted by every server in a fleet. At 10,000 application servers the challenge is not writing logs; it is collecting them without losing data during traffic spikes, indexing 1TB per day so queries return in under 5 seconds, and keeping 30 days of hot data queryable while archiving a full year cheaply. I find this is one of the best interview questions for senior candidates because the naive version is trivial, but each scale constraint eliminates a different shortcut. This interview question tests pipeline architecture, write-heavy system design, inverted-index fundamentals, and the tradeoff between storage cost and query latency.
Functional Requirements
Core Requirements
- Collect logs and metrics from thousands of servers in near real time.
- Store logs such that they can be searched and queried by time range, service name, and log level.
- Export aggregated metrics for dashboarding and threshold-based alerting.
Below the Line (out of scope)
- Distributed tracing and APM (Application Performance Monitoring)
- Log-based security intrusion detection (SIEM)
- Log-based billing or audit trails with tamper-proof guarantees
The hardest part in scope: Indexing logs fast enough to query. Writing 1TB/day at 12 MB/sec is manageable. The trap is that naive storage (one log line = one document) makes full-text search across billions of rows take minutes, not seconds. Time-partitioned columnar segments with an inverted index are the correct answer, and explaining why is the heart of this design.
Distributed tracing is below the line because it requires correlating spans across services using a trace context header (W3C TraceContext or Zipkin format). It is architecturally distinct from log aggregation: traces need a causal graph store, not a full-text index. To add it, I would build a Trace Ingestion Service that accepts OTLP spans from instrumented services, fans them into a separate Kafka topic, and writes to a columnar store (Jaeger with Cassandra backing) keyed by trace_id. The log pipeline described here is unchanged.
SIEM is below the line because it requires real-time pattern matching against threat signatures, which demands a streaming analytics engine (Flink or Spark Streaming) on top of the log pipeline. This is a consumer of logs, not a change to the pipeline. To add it: attach a Flink job to the Kafka log topic that evaluates each log event against a rule set and emits alerts to a Security Incident topic.
Log-based billing with tamper-proof guarantees is below the line because it requires append-only immutable storage with cryptographic chaining. The log pipeline described here does not guarantee immutability. To add tamper evidence, attach a Write-Once Object Store (S3 Object Lock with Compliance mode) and write signed log batches there in parallel with the primary index.
Non-Functional Requirements
Core Requirements
- Durability: 99.9% of log messages must be delivered. Some loss is acceptable (a handful of debug logs dropped during a Kafka restart is tolerable; error logs must not be lost).
- Write throughput: 10,000 servers, each averaging roughly one log line per second (with per-host bursts up to 1,000 lines per second during deploys or incidents), means roughly 10,000 events/second sustained cluster-wide. At an average log line size of 1.2 KB, that is about 12 MB/sec of sustained write traffic, or 1 TB/day.
- Query latency: Time-range search across all logs (e.g., all `ERROR` events for service `payments` in the last 10 minutes) must return in under 5 seconds at p95.
- Retention: 30 days hot (indexed, fast query via Elasticsearch or equivalent). 365 days cold (archival in S3/object storage, query via Athena or batch scan).
- Scale: 1 TB of log data ingested per day. After 30 days, roughly 30 TB of hot storage. After 365 days, roughly 365 TB raw, or about 120 TB of cold archival storage assuming 3x compression.
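A quick back-of-envelope check of these numbers (a sketch using the stated 1.2 KB average event size and 3x cold-compression assumptions):

```python
# Back-of-envelope check of the stated capacity numbers.
SERVERS = 10_000
AVG_EVENT_BYTES = 1.2 * 1024      # 1.2 KB average log line (stated above)
DAILY_INGEST_TB = 1.0             # 1 TB/day of raw logs (stated above)

# Sustained throughput implied by 1 TB/day.
bytes_per_sec = DAILY_INGEST_TB * 1024**4 / 86_400
events_per_sec = bytes_per_sec / AVG_EVENT_BYTES
events_per_server = events_per_sec / SERVERS

hot_storage_tb = DAILY_INGEST_TB * 30        # 30 days hot, uncompressed
cold_storage_tb = DAILY_INGEST_TB * 365 / 3  # 365 days cold at 3x compression

print(f"sustained: {bytes_per_sec / 1024**2:.1f} MB/s "
      f"({events_per_sec:,.0f} events/s, ~{events_per_server:.0f}/server)")
print(f"hot: {hot_storage_tb:.0f} TB, cold: {cold_storage_tb:.0f} TB")
```

Running the arithmetic gives roughly 12 MB/s and about 10,000 events/second sustained (about one line per second per server on average), 30 TB hot, and roughly 120 TB of compressed cold storage.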
Below the Line
- Sub-second query latency (requires pre-aggregated materialized views, changes the indexing model)
- Multi-tenant log isolation with per-customer encryption-at-rest
- Log anomaly detection using ML (consumer of logs, not a pipeline change)
Read/write ratio: This is among the most write-heavy systems you will design in an interview. Writes dominate: thousands of servers emit logs continuously while queries are sporadic bursts from on-call engineers or dashboards. A rough ratio is 100:1 writes to reads during normal operations, inverted briefly during incidents when many engineers are querying simultaneously. This asymmetry is the lens for every architecture decision in this article. The write path must be cheap and lossy-tolerant; the read path must be fast on demand.
I treat the 5-second query SLA as the key number. It is strict enough to require a proper inverted index (not a full table scan) but lenient enough that we do not need pre-materialized per-service aggregates on every query. I would say this number out loud in the interview and then pause; it is the single constraint that drives the entire storage tier decision.
Core Entities
- LogEvent: A single log line with timestamp, service name, host, log level, message body, and optional structured fields (request ID, user ID, error code).
- MetricPoint: A single numeric measurement at a point in time: metric name, value, tags (host, service, region), and timestamp. Stored separately from logs.
- Index Segment: A time-bounded, immutable chunk of the log index. One segment covers a fixed time window (e.g., one hour of data). The unit of querying.
- Alert Rule: A threshold condition on a metric (e.g., `error_rate > 1% for 5min`). Evaluated periodically against the metrics pipeline.
- Dashboard: A saved collection of metric queries rendered as charts. Backed by the metrics query service.
Schema design and partition strategies are deferred to the deep dives. The five entities above are sufficient to drive the API and High-Level Design.
API Design
FR 1 - Ingest logs from a server:
POST /ingest/logs
Body: { events: [{ timestamp, service, host, level, message, fields? }] }
Response: 202 Accepted
202 (not 200) because the log pipeline is asynchronous. The request is accepted and queued; it is not yet durable. Clients that need durability acknowledgment should use the Kafka SDK directly.
Batching in the request body (the events array) is mandatory: single-event ingestion from hosts bursting to 1,000 events/sec each would create catastrophic per-request overhead on the HTTP layer.
FR 2 - Query logs by time range, service, and level:
Naive shape:
GET /logs/search?service=payments&level=ERROR&from=2026-03-29T10:00Z&to=2026-03-29T10:10Z
Response: { logs: [...], next_cursor }
This naive shape breaks at scale: returning all matching logs in one shot at 1TB/day means a 10-minute window can contain millions of matching rows. The evolved shape adds cursor-based pagination and a result limit.
Evolved shape:
GET /logs/search?service=payments&level=ERROR&from=2026-03-29T10:00Z&to=2026-03-29T10:10Z&limit=100&cursor={opaque_cursor}
Response: { logs: [...], next_cursor, total_matched }
Cursor-based pagination is required here. Offset-based pagination (skip N) requires the query engine to scan and discard N rows on every page request, which is prohibitively expensive against a time-series index. The cursor encodes the last-seen segment ID and offset, allowing the query engine to resume without re-scanning.
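One way such an opaque cursor could be implemented (a sketch; the segment-ID/offset layout is an illustrative assumption, not a fixed wire format):

```python
import base64
import json

def encode_cursor(segment_id: str, offset: int) -> str:
    """Pack the last-seen position into an opaque, URL-safe token."""
    raw = json.dumps({"seg": segment_id, "off": offset}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> tuple[str, int]:
    """Recover the resume position; the query engine starts scanning here
    instead of re-scanning everything before it."""
    state = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return state["seg"], state["off"]

cursor = encode_cursor("logs-2026-03-29-10", 4200)
assert decode_cursor(cursor) == ("logs-2026-03-29-10", 4200)
```

Because the client treats the token as opaque, the server is free to change the internal layout without breaking pagination.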
FR 3 - Export metrics for dashboarding and alerting:
GET /metrics/query?metric=http_error_rate&from=2026-03-29T09:00Z&to=2026-03-29T10:00Z&step=60s&tags=service:payments
Response: { datapoints: [{ timestamp, value }] }
POST /alerts
Body: { metric, condition, threshold, duration_s, notification_channel }
Response: { alert_id }
The metrics query API is modelled on Prometheus's range query API (/query_range). The step parameter controls downsampling resolution. Dashboards use this endpoint to render time-series charts.
FR 4 - Stream live logs (tail -f equivalent):
GET /logs/stream?service=payments&level=ERROR
Response: text/event-stream (Server-Sent Events)
Server-Sent Events over HTTP is the right choice here. WebSockets are bidirectional and add unnecessary complexity when the client only reads. SSE is a standard HTTP connection that the server pushes events on; proxies and load balancers handle it well. The client receives `data:` events as new log lines match the filter.
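A minimal sketch of the frames the streaming endpoint would emit (the `data:`-plus-blank-line framing is the SSE wire format; the JSON payload shape is an assumption):

```python
import json

def sse_event(log_line: dict) -> str:
    """Format one matching log event as a Server-Sent Events frame.
    Each frame is a `data:` line terminated by a blank line; the browser's
    EventSource API surfaces it to the client as a message event."""
    return f"data: {json.dumps(log_line)}\n\n"

frame = sse_event({"service": "payments", "level": "ERROR", "message": "timeout"})
assert frame.startswith("data: ") and frame.endswith("\n\n")
```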
High-Level Design
1. Collect logs from thousands of EC2 servers
The collection path: a lightweight agent on each server buffers log lines locally and ships batches to Kafka. The agent absorbs local write spikes without creating backpressure on downstream consumers.
The naive approach is to have each server POST logs directly to an ingestion API. That breaks immediately: 10,000 servers hammering a single ingestion service creates an enormous connection fan-in and eliminates any buffering. One slow ingest service stalls the entire fleet. I have seen this exact architecture in production at a startup that hit 500 servers and then watched their ingest service fall over during every deploy.
The key insight is that collection and ingestion must be decoupled by a durable buffer (Kafka). The agent on each server is responsible for one thing: getting bytes off disk and into Kafka reliably. Everything downstream can fail and restart without losing a log line.
Components:
- Log Agent (Fluent Bit): A lightweight sidecar process on every EC2 instance. Tails log files or consumes from `journald`. Batches events into 1-second windows and ships to Kafka. Writes a local disk buffer if Kafka is unreachable.
- Kafka (Log Topic): Central durable buffer. Partitioned by `service_name` so that all logs for a given service go to the same partition set. Replication factor 3 for durability.
- Kafka (Metrics Topic): Separate topic for numeric metric points. Partitioned by `metric_name`.
Request walkthrough:
- Application writes a log line to stdout or a log file on the EC2 instance.
- Fluent Bit agent tails the file (or reads from stdout pipe), parses the log line, and enriches it with `host`, `service`, and `region` metadata.
- Fluent Bit batches events in a 1-second window and writes the batch to the Kafka `logs` topic, partitioned by `service_name`.
- If Kafka is unreachable, Fluent Bit writes to a local disk buffer (up to 512 MB) and retries with exponential backoff. This is what gives us the 99.9% durability guarantee: the agent survives transient Kafka outages without dropping events.
- Kafka brokers replicate the batch to 2 additional brokers before acknowledging. `acks=all` is set on the producer.
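The agent's ship loop can be sketched as follows (an in-memory stand-in for Fluent Bit's behavior; `send_to_kafka` represents a real producer configured with `acks=all`, and the deque stands in for the on-disk buffer):

```python
import collections

class AgentShipper:
    """Sketch of the agent ship loop: try Kafka, fall back to a bounded
    local buffer on failure, and drain buffered batches in order once
    Kafka is reachable again."""

    MAX_BUFFER_BYTES = 512 * 1024 * 1024  # 512 MB local disk buffer

    def __init__(self, send_to_kafka):
        self.send_to_kafka = send_to_kafka
        self.buffer = collections.deque()  # stands in for the on-disk buffer
        self.buffered_bytes = 0
        self.dropped = 0

    def ship(self, batch: list[bytes]) -> None:
        # Drain previously buffered batches first, preserving order.
        while self.buffer:
            head = self.buffer[0]
            if not self._try_send(head):
                break
            self.buffer.popleft()
            self.buffered_bytes -= sum(len(e) for e in head)
        if not self._try_send(batch):
            self._buffer_locally(batch)

    def _try_send(self, batch) -> bool:
        try:
            self.send_to_kafka(batch)
            return True
        except ConnectionError:
            return False

    def _buffer_locally(self, batch) -> None:
        size = sum(len(e) for e in batch)
        if self.buffered_bytes + size > self.MAX_BUFFER_BYTES:
            self.dropped += len(batch)  # loss begins only when the buffer fills
        else:
            self.buffer.append(batch)
            self.buffered_bytes += size
```

Loss begins only when the 512 MB buffer fills, which is where the roughly-seven-minute outage-coverage figure later in the article comes from.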
The agent-plus-Kafka pattern is the collection backbone. Every other component in the system is a consumer of this durable buffer. A single ingestion pipeline failure does not lose logs; the Kafka retention window (48 hours in this design) gives consumers time to recover and replay.
2. Store and index logs for search
The index path: Kafka consumers read log batches and write them into the Elasticsearch index. The indexer builds time-partitioned inverted indexes on each batch before committing the Kafka offset.
I treat the indexing layer as a black box in this section and explain exactly how Elasticsearch builds inverted indexes in Deep Dive 2. For now, the important structure is: Kafka consumer reads, indexer writes, query API reads.
Components:
- Log Indexer Service: A pool of Kafka consumer workers. Each worker reads a batch from the `logs` topic, parses structured fields, and does a bulk write to Elasticsearch. The Kafka offset is committed only after the Elasticsearch write succeeds, ensuring at-least-once semantics.
- Elasticsearch Cluster: Stores and indexes log documents. Each index covers a 1-hour time window (time-based index rotation). Documents are indexed on `timestamp`, `service`, `level`, and a full-text inverted index on `message`.
- Query API: Stateless service that accepts search requests, translates them to Elasticsearch DSL queries, executes scatter-gather across relevant shards, and returns paginated results.
- S3 Cold Archive: Log Indexer also writes raw log batches (Parquet-compressed) to S3 in parallel. After 30 days, hot indexes are deleted. Cold data stays in S3 for 365 days and is queryable via AWS Athena.
Request walkthrough (write path):
- Log Indexer consumer reads a batch of log events from Kafka.
- Log Indexer bulk-writes the batch to the current active Elasticsearch index (e.g., `logs-2026-03-29-10`).
- Elasticsearch tokenizes and indexes each `message` field into an inverted posting list.
- Log Indexer also writes the same batch as compressed Parquet to `s3://logs-archive/{date}/{service}/batch-{uuid}.parquet`.
- Log Indexer commits the Kafka consumer offset after both writes succeed.
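The commit-after-write discipline from this walkthrough, sketched with stand-in callables (`bulk_index`, `archive_to_s3`, and `commit_offset` are hypothetical names, not a real client API):

```python
def index_batch(batch, bulk_index, archive_to_s3, commit_offset):
    """At-least-once step: commit the Kafka offset only after BOTH the hot
    (Elasticsearch) and cold (S3) writes succeed. A crash between write and
    commit re-delivers the batch on restart -- duplicates, not loss."""
    bulk_index(batch)       # hot path: bulk write to the hourly ES index
    archive_to_s3(batch)    # cold path: compressed Parquet to S3
    commit_offset(batch)    # commit last, never first

calls: list[str] = []
index_batch(["evt-1"],
            bulk_index=lambda b: calls.append("es"),
            archive_to_s3=lambda b: calls.append("s3"),
            commit_offset=lambda b: calls.append("commit"))
assert calls == ["es", "s3", "commit"]  # commit is strictly last
```

If either write raises, the function never reaches the commit, so the batch is redelivered to another worker.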
Request walkthrough (search query):
- Engineer sends `GET /logs/search?service=payments&level=ERROR&from=...&to=...`.
- Query API determines which Elasticsearch time-window indexes overlap the requested range.
- Query API fans the query out to all shards of those indexes in parallel (scatter phase).
- Each shard evaluates the inverted index for `level:ERROR` and `service:payments` and sorts hits by timestamp.
- Query API merges and re-ranks results from all shards (gather phase) and returns the first page with a cursor.
The dual-write to Elasticsearch (hot) and S3 (cold) happens in the same indexer commit cycle. This keeps the cold archive in sync with the hot index without a separate archival job. I would draw this dual-write on the whiteboard as a fork after the Kafka consumer, because interviewers often ask "what about long-term storage?" and having the answer already in the diagram shows forethought.
3. Export metrics for dashboarding and alerting
The metrics path is a separate pipeline from logs. Numeric metric points flow through their own Kafka topic into a time-series store (Prometheus or InfluxDB). The alerting engine evaluates rules against the time-series store on a polling cadence.
Mixing logs and metrics into the same storage system is a common design mistake. Logs are unstructured text with variable field schemas. Metrics are typed numeric time series with predictable access patterns. I have personally watched a team try to store metrics in Elasticsearch and then wonder why their Grafana dashboards took 15 seconds to load; the inverted index is the wrong data structure for numeric range aggregations.
Time-series databases (Prometheus, InfluxDB, TimescaleDB) are optimized for range queries on numeric data at cardinality that would kill Elasticsearch. Keep them separate.
Components:
- Metrics Consumer: Reads from the Kafka `metrics` topic. Aggregates raw data points into 15-second resolution windows (to reduce storage volume) and writes to the Prometheus remote write endpoint.
- Prometheus: Time-series database. Stores metric name, value, timestamp, and label set (tags). Optimized for range queries and aggregations. At roughly 16 bytes per uncompressed sample, 1 year of 15-second samples for 100,000 metrics is roughly 3 TB (Prometheus's TSDB compression brings the on-disk footprint well below that).
- Alerting: Prometheus evaluates alert rules (defined as PromQL expressions) on a 30-second cadence and forwards firing alerts to Alertmanager, which deduplicates, groups, and routes them to PagerDuty, Slack, or email.
- Grafana: Dashboard UI. Executes PromQL queries against Prometheus via the `/metrics/query` API endpoint and renders time-series charts.
Request walkthrough (metrics dashboard):
- Grafana sends `GET /metrics/query?metric=http_error_rate&from=...&to=...&step=60s`.
- Query API translates to a PromQL range query: `rate(http_errors_total[5m])`.
- Prometheus evaluates the query across the requested time range and returns downsampled data points.
- Grafana renders the time series chart.
Request walkthrough (alerting):
- Prometheus evaluates each alert rule every 30 seconds.
- If `http_error_rate > 1%` is true for 5 consecutive evaluations (2.5 minutes), the alert transitions to the `FIRING` state and is sent to Alertmanager.
- Alertmanager routes the alert to the configured receiver (PagerDuty for `severity=critical`, Slack for `severity=warning`).
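The "true for N consecutive evaluations" debounce can be sketched as a tiny state machine (an illustration of the idea; real rule engines track this state per label set):

```python
class ThresholdAlert:
    """Debounced threshold alert: fire only after the condition holds for
    `required` consecutive evaluations (e.g. 5 evaluations at a 30-second
    cadence = 2.5 minutes), so one noisy sample cannot page anyone."""

    def __init__(self, threshold: float, required: int):
        self.threshold = threshold
        self.required = required
        self.streak = 0        # consecutive breaching evaluations so far
        self.state = "OK"

    def evaluate(self, value: float) -> str:
        self.streak = self.streak + 1 if value > self.threshold else 0
        self.state = "FIRING" if self.streak >= self.required else "OK"
        return self.state

alert = ThresholdAlert(threshold=0.01, required=5)
values = [0.02, 0.02, 0.005, 0.02, 0.02, 0.02, 0.02, 0.02]
states = [alert.evaluate(v) for v in values]
# The dip at 0.005 resets the streak; only the final run of 5 breaches fires.
```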
The metrics pipeline is completely independent of the log pipeline. A failure in Elasticsearch does not degrade alerting. A spike in metrics write volume does not back-pressure the log indexer.
Potential Deep Dives
1. How do you collect logs efficiently from 10,000 servers without losing data?
At 10,000 servers generating up to 1,000 log lines per second each, a badly designed collection layer is the first bottleneck. The concerns are: resource usage on the server (the agent cannot consume meaningful CPU or memory from the application), network reliability (Kafka partitions and brokers fail), and spike absorption (a deployment burst generates 10x normal log volume for 30 seconds).
2. How do you store and index 1 TB/day of logs for sub-5-second queries?
The naive approach (one Elasticsearch document per log line, no time-partitioning) turns into a full cluster scan for every query within days of operation. The correct approach is time-partitioned indexes with inverted posting lists that make level:ERROR AND service:payments queries purely index lookups with zero document scanning.
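The posting-list idea at the heart of this answer, in miniature (a toy illustration, not Elasticsearch's actual on-disk structures):

```python
from collections import defaultdict

# Miniature inverted index: term -> list of matching doc IDs (posting list).
# A query like level:ERROR AND service:payments becomes a posting-list
# intersection -- no document body is ever scanned.
postings: dict[str, list[int]] = defaultdict(list)

def index_doc(doc_id: int, doc: dict) -> None:
    for field, value in doc.items():
        postings[f"{field}:{value}"].append(doc_id)

def search(*terms: str) -> set[int]:
    return set.intersection(*(set(postings[t]) for t in terms))

index_doc(1, {"level": "ERROR", "service": "payments"})
index_doc(2, {"level": "INFO", "service": "payments"})
index_doc(3, {"level": "ERROR", "service": "checkout"})
assert search("level:ERROR", "service:payments") == {1}
```

Time-partitioning adds the second dimension: the query planner only consults the posting lists of segments whose time window overlaps the requested range.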
3. How does a log search query fan out and aggregate across the cluster?
A query for service:payments AND level:ERROR across a 10-minute window can hit 30+ Elasticsearch shards simultaneously in a cluster that holds 30 days of hot data. The scatter-gather mechanism must coordinate all shard results into a single ranked, paginated response in under 5 seconds.
4. How do you support live log streaming (tail -f equivalent) at scale?
An engineer debugging a live production incident needs to see new log lines from service=payments appear in near real time, just like tail -f on a local file. At tens of thousands of events per second cluster-wide (far more during fleet-wide bursts), a naive broadcast model melts the query tier.
Final Architecture
The complete system combines all components: the agent-based collection backbone, the Kafka dual-topic buffer, the Elasticsearch hot index with S3 cold archive, the Prometheus metrics pipeline, and the SSE-based streaming service.
The architecture has two completely independent read paths: the Query API (batch search via Elasticsearch) and the Streaming Service (live tail via Kafka). A failure in Elasticsearch does not degrade live streaming, and vice versa. This independence is the kind of detail I would highlight at the end of a whiteboard session because it shows you built the system with failure isolation in mind, not just the happy path. The Kafka buffer is the single coupling point between all producers and consumers, intentionally: if any consumer falls behind, the 48-hour Kafka retention means it can catch up without losing data.
Interview Cheat Sheet
- Use a local on-host agent (Fluent Bit) to collect logs. The agent must have a local disk buffer (512 MB per host) to absorb Kafka outages without dropping events and without stalling the application.
- Kafka is the backbone, not the storage. It is the durable buffer between collection and indexing. Use `acks=all` with replication factor 3. Partition by `service_name` so logs from the same service land on the same partition set.
- Two separate Kafka topics: one for logs (unstructured text events), one for metrics (typed numeric points). Mixing them into one topic complicates consumer group management and ties the metrics SLA to the log ingestion rate.
- Elasticsearch with hourly time-partitioned indexes is the right hot storage. Use Index Lifecycle Management (ILM) to automatically roll over, shrink, and delete indexes. After 30 days, dropping an hourly index is a single `DELETE` API call, not an expensive delete-by-query.
- Field mapping matters: `service`, `level`, and `host` as `keyword` (exact-match); `message` as `text` (analyzed, full-text). Setting `dynamic: false` prevents field explosion in clusters receiving inconsistent log formats.
- Log search is scatter-gather: the Query API fans requests out to all shards of the relevant hourly indexes in parallel, applies a per-shard result budget (200-500 docs), and k-way merges sorted results at the coordinator. A shard timeout (4 seconds) returns partial results rather than failing the query.
- Cold archive in S3 at 1TB/day accumulates roughly 365 TB/year raw, or about 120 TB assuming 3x compression. Write Parquet-compressed batches from the Log Indexer in the same commit cycle as the Elasticsearch write. Query cold data with Athena when incident retrospectives require it.
- A key number to memorize: a single host bursting at 1,000 lines/sec of 1.2 KB events writes about 1.2 MB/sec, so a 512 MB local disk buffer covers roughly 7 minutes of Kafka unavailability before data loss begins.
- Live log streaming (tail -f) should bypass Elasticsearch entirely. A dedicated Streaming Service reads from Kafka at the latest offset, applies a filter predicate server-side, and pushes matching events via Server-Sent Events. End-to-end latency is under 2 seconds from application write to browser.
- Metrics and logs use different storage systems: Prometheus for numeric time-series (16 bytes/sample, PromQL-native), Elasticsearch for log events (inverted index, full-text search). Mixing them forces both into a worse trade-off than either dedicated store.
- Alerting runs on Prometheus, not Elasticsearch. Prometheus evaluates alerting rules (PromQL expressions) every 30 seconds and hands firing alerts to Alertmanager for routing. Threshold alerts (error rate, latency p99) hit Prometheus where aggregation is fast. Log-content alerts (keyword matches) are a separate path and are out of scope.
- At 10,000 servers, the numbers to say out loud: roughly 10,000 log events per second sustained cluster-wide (about one line per second per server on average, with per-host bursts to 1,000 lines/sec). At 1.2 KB average event size, that is 12 MB/sec sustained write throughput through Kafka and into the index, or 1 TB/day.
- 99.9% durability is achievable with the agent disk buffer plus Kafka RF=3 plus at-least-once indexer semantics (commit offset after write). The 0.1% loss budget covers cases like spot instance terminations within the 60-second notice window. Note that the agent's retry path produces duplicates rather than loss; if duplicates matter, deduplicate at the indexer using a client-generated event ID.
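The buffer-coverage figure above can be checked directly (assuming the per-host burst of 1,000 lines/sec at 1.2 KB each):

```python
# Per-host worst case: 1,000 lines/sec at 1.2 KB each.
burst_bytes_per_sec = 1_000 * 1.2 * 1024      # ~1.2 MB/s per host
buffer_bytes = 512 * 1024 * 1024              # 512 MB agent disk buffer
coverage_min = buffer_bytes / burst_bytes_per_sec / 60
print(f"{coverage_min:.1f} minutes of Kafka outage before loss")  # ~7 minutes
```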