Data pipelines
How data pipelines move and transform data at scale: batch vs. stream processing, Lambda vs. Kappa architecture, ETL vs. ELT, medallion architecture, windowing strategies, and failure handling.
TL;DR
- A data pipeline moves data from sources to destinations, transforming it along the way. The hard part is doing this reliably at scale without losing records or violating latency guarantees.
- Batch processing trades latency for simplicity (hourly/nightly runs). Stream processing trades complexity for freshness (sub-second). Most production systems use both.
- ELT (load raw data first, transform in-warehouse) has replaced ETL as the default for cloud data warehouses. The medallion architecture (bronze/silver/gold) organizes the transformation layers.
- Pipeline failure handling requires three capabilities: dead-letter queues for poison records, checkpointing for crash recovery, and replay from source for fixing transformation bugs.
- Stream processing adds windowing (tumbling, sliding, session) and exactly-once semantics as complexity dimensions that batch pipelines avoid entirely.
The Problem It Solves
Your e-commerce company starts with a simple setup: the product team queries the production database directly for analytics. `SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '24 hours'` runs fine when you have 10,000 orders per day.
At 500,000 orders per day, that query takes 45 seconds and locks rows that the checkout service needs. Your DBA adds a read replica for analytics queries. That buys you 6 months.
At 2 million orders per day, the analytics team needs joins across orders, customers, inventory, and clickstream data. These joins require denormalized tables that don't exist in the OLTP schema. Someone writes a Python script that runs on a cron job at 3 AM, pulls data from four tables, transforms it, and loads it into a separate analytics database. It works until the script crashes silently one night and nobody notices for three days. Three days of revenue dashboards show zeros, and the CFO thinks the company lost all its revenue.
At 10 million events per day (orders plus page views, clicks, searches, ad impressions), the 3 AM cron job takes 6 hours to run. It fails halfway through on out-of-memory errors. The analytics database has stale data. The fraud detection team needs real-time signals, not yesterday's data. The recommendation engine needs feature vectors computed from the last hour of clickstream data, not last night's.
Every growing company hits this wall. The cron script that "works fine for now" becomes the single most fragile piece of infrastructure. I've watched teams spend months debugging silent data pipeline failures that nobody noticed because the dashboard showed cached data from the last successful run.
The answer is not a better cron script. It's a data pipeline: a system designed from the ground up for reliable, scalable, monitored data movement and transformation.
What Is It?
A data pipeline is a system that moves data from sources to destinations through a series of processing stages, each of which extracts, transforms, enriches, or aggregates the data. The "pipeline" metaphor is literal: data flows in one end, gets processed at each stage, and arrives at the destination in a different shape than it started.
Think of it like a water treatment plant. Raw water (data) flows in from rivers and reservoirs (sources). It passes through filtration, chemical treatment, and quality testing stages (transformations). Clean water (processed data) flows out to homes and businesses (destinations). The plant runs 24/7, monitors pressure and quality at every stage, and has bypass systems for when one stage fails. A data pipeline does the same thing for information.
The key insight: a data pipeline is not just a script that moves data. It's a system with monitoring, failure handling, replay capabilities, and schema management. The difference between "a cron job that runs SQL" and "a data pipeline" is the difference between a garden hose and a water treatment plant.
How It Works
Let's trace a single record through a production pipeline: an order placed on an e-commerce platform that needs to reach the analytics warehouse, the search index, and the fraud detection system.
1. Event production. The checkout service publishes an `order.created` event to the Kafka topic `orders`. The event includes order ID, customer ID, items, total, payment method, and timestamp.
2. Ingestion. A Flink consumer reads from the `orders` topic. It validates the schema (all required fields present, types correct) and drops malformed records to a dead-letter topic for investigation.
3. Enrichment. The Flink job looks up the customer's profile from a Redis cache (country, account age, lifetime spend). It joins this context onto the order event, producing an enriched record.
4. Transformation. Business rules are applied: currency conversion to USD, tax calculation, fraud risk score from a sidecar ML model. The enriched, transformed record is written to a downstream Kafka topic, `orders.enriched`.
5. Fan-out to destinations. Three independent consumers read from `orders.enriched`: one writes to Snowflake (analytics), one updates Elasticsearch (search), one feeds the real-time fraud dashboard (alerting). Each consumer checkpoints its Kafka offset independently.
6. Quality verification. An hourly batch job compares record counts between source (Kafka) and destination (Snowflake). If the counts diverge by more than 0.1%, it triggers an alert.
```typescript
// Simplified Flink-style pipeline stage (enrichment + transformation)
async function processOrderEvent(event: OrderEvent): Promise<EnrichedOrder | null> {
  // Step 1: Validate schema; route malformed records to the dead-letter queue
  if (!event.orderId || !event.customerId || !event.total) {
    await deadLetterQueue.send(event, "missing required fields");
    return null;
  }

  // Step 2: Enrich with customer context (cache-aside: fall back to the DB on a miss)
  let customer = await redis.get(`customer:${event.customerId}`);
  if (!customer) {
    customer = await customerDB.findById(event.customerId);
    await redis.set(`customer:${event.customerId}`, customer, { ttl: 3600 });
  }

  // Step 3: Transform (business rules + fraud scoring)
  const enrichedOrder: EnrichedOrder = {
    ...event,
    customerCountry: customer.country,
    totalUSD: convertToUSD(event.total, event.currency),
    fraudScore: await fraudModel.score(event, customer),
    processedAt: new Date().toISOString(),
  };

  // Step 4: Emit to the downstream topic
  await kafka.produce("orders.enriched", enrichedOrder);
  return enrichedOrder;
}
```
Every stage is independently monitorable. If enrichment latency spikes, you see it in the stage metrics. If the fraud model slows down, that stage's processing time increases while others are unaffected. This isolation is what separates a pipeline from a monolithic script.
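The hourly quality-verification step reduces to a simple count reconciliation. Here is a minimal sketch: `countsDiverge` is a hypothetical helper, and in production the inputs would come from Kafka offsets on one side and a warehouse `COUNT(*)` on the other. The 0.1% tolerance matches the threshold described above.

```typescript
// Sketch of source-vs-destination count reconciliation for one time window.
// If the relative divergence exceeds the tolerance, the batch job raises an alert.
function countsDiverge(
  sourceCount: number,
  destCount: number,
  tolerance = 0.001, // 0.1% divergence threshold
): boolean {
  if (sourceCount === 0) return destCount !== 0;
  return Math.abs(sourceCount - destCount) / sourceCount > tolerance;
}
```

If this returns true for an hour's window, the job triggers an alert and the window is investigated (and typically replayed from the source).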
For your interview: trace a specific record through the pipeline. "An order event hits Kafka, gets enriched with customer data from Redis, transformed with business rules, then fanned out to Snowflake, Elasticsearch, and the fraud dashboard." Specific > abstract.
Key Components
| Component | Role |
|---|---|
| Source connector | Extracts data from the origin system: CDC (Change Data Capture) for databases, API polling for SaaS tools, file watchers for S3/SFTP drops. Debezium is the standard for CDC. |
| Message broker | Decouples stages and buffers records between them. Kafka is the default for high-throughput pipelines. Kinesis for AWS-native. Provides durability and replay. |
| Stream processor | Stateful computation on records in flight: enrichment, aggregation, windowing. Apache Flink (true streaming), Spark Structured Streaming (micro-batch), or Kafka Streams (library, no separate cluster). |
| Batch processor | Processes large historical datasets on a schedule. Apache Spark, dbt (SQL transforms), or Airflow-orchestrated SQL jobs. Cheaper per record than streaming for bulk historical data. |
| Schema registry | Stores and enforces schema versions for events. Confluent Schema Registry (Avro/Protobuf/JSON Schema). Prevents producers from publishing records that break downstream consumers. |
| Orchestrator | Schedules and monitors batch pipeline DAGs. Apache Airflow, Dagster, or Prefect. Handles retries, dependency ordering, backfill runs. |
| Dead-letter queue | Captures records that fail processing after exhausting retries. Operators investigate and choose to fix-and-replay or discard. Without a DLQ, one bad record stalls an entire partition. |
| Data warehouse | The analytical destination: Snowflake, BigQuery, Redshift, ClickHouse. Columnar storage optimized for aggregate queries over large datasets. |
Types / Variations
Batch vs. Stream Processing
| Dimension | Batch | Stream |
|---|---|---|
| Latency | Minutes to hours (as stale as the batch interval) | Milliseconds to seconds |
| Throughput | Very high (bulk I/O, columnar compression) | Lower per-record (overhead of per-event processing) |
| Complexity | Low (read, transform, write, done) | High (state management, windowing, exactly-once) |
| Failure recovery | Re-run the job from scratch | Checkpoint + resume from offset |
| Cost | Cheap per record (amortized over millions) | More expensive per record (always-on compute) |
| Use case | Historical analytics, ML training, nightly reports | Fraud detection, real-time dashboards, alerting |
Most production systems use both. Batch for historical aggregation and ML training, stream for real-time features and operational alerts. The question is not "batch or stream" but "which workloads justify the complexity of streaming?"
Lambda vs. Kappa Architecture
Lambda runs batch and stream in parallel: the batch layer produces accurate historical results (recomputed hourly), while the speed layer provides low-latency approximations. A serving layer merges both. The problem: two codebases implementing the same business logic. Bugs appear in one but not the other. Operational cost doubles.
Kappa uses only a streaming path. Reprocessing historical data means replaying Kafka from an earlier offset with a new consumer version. One codebase, simpler operations. The trade-off: reprocessing years of history through a streaming system is slow and the storage cost of retaining Kafka logs indefinitely is significant.
Kappa is preferred for new systems. Lambda persists where organizations have existing batch infrastructure they can't migrate away from.
ETL vs. ELT
ETL (Extract, Transform, Load): Transform data before loading. Made sense when the destination (on-premise data warehouse) had limited compute. Transform outside, load only clean data.
ELT (Extract, Load, Transform): Load raw data first, transform in-warehouse. Made practical by powerful columnar warehouses (BigQuery, Redshift, Snowflake) that can run transforms on petabytes efficiently. Preserves raw data for re-transformation when business rules change.
ELT won. Tools like dbt (SQL-based transforms that run inside the warehouse) and Fivetran (managed ELT connectors) dominate the modern data stack. The advantage: when your CFO changes the revenue recognition formula, you re-run the dbt model against the raw bronze data instead of re-extracting from the source.
Medallion Architecture (Bronze / Silver / Gold)
The medallion architecture organizes ELT into three quality tiers:
- Bronze: Raw data exactly as ingested from the source. No transformations. Append-only. This is your insurance policy: if any downstream transform has a bug, you replay from bronze.
- Silver: Cleaned and validated. Deduplication applied, schema enforced, nulls handled, data types cast. Most joins happen here.
- Gold: Business-ready aggregates and denormalized tables. Dashboard queries hit gold tables directly. These are the tables your analyst team uses.
My recommendation: always land data in bronze first. Teams that transform on ingest (skipping bronze) lose the ability to fix bugs without re-extracting from the source. I've seen this cause multi-week data recovery projects.
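To make the silver layer's deduplication step concrete, here is a minimal sketch. The record shape and field names are illustrative assumptions, not a specific table schema; the rule is the common one of keeping the latest version of each key by ingestion time.

```typescript
// Sketch: bronze-to-silver deduplication, keeping the most recently
// ingested version of each order. In a real pipeline this would be a
// windowed SQL transform (e.g., ROW_NUMBER() over order_id); the Map
// keeps the example self-contained.
interface BronzeRecord { orderId: string; ingestedAt: number; payload: unknown }

function dedupeToSilver(bronze: BronzeRecord[]): BronzeRecord[] {
  const latest = new Map<string, BronzeRecord>();
  for (const rec of bronze) {
    const prev = latest.get(rec.orderId);
    if (!prev || rec.ingestedAt > prev.ingestedAt) {
      latest.set(rec.orderId, rec);
    }
  }
  return [...latest.values()];
}
```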
Windowing Strategies (Stream Processing)
Stream processing aggregations need to answer: "aggregate over what time range?" This is windowing.
| Window Type | Description | Use Case |
|---|---|---|
| Tumbling | Fixed-size, non-overlapping (every 5 minutes exactly) | Hourly revenue totals, periodic metric summaries |
| Sliding | Fixed-size, overlapping (5-minute window, slides every 1 minute) | Moving averages, trend detection |
| Session | Dynamic size, closed by inactivity gap (30 seconds of no events) | User session analytics, clickstream grouping |
Late-arriving events are the biggest headache. An event with a timestamp of 14:59:58 arrives at 15:01:02. The 15:00 tumbling window has already closed and emitted results. Stream processors handle this with watermarks: a threshold that defines how late an event can be and still be included. Events arriving after the watermark are either dropped or sent to a side output for separate handling.
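The mechanics of tumbling windows and watermarks can be sketched in a few lines. This is a toy model, not any framework's API: the event shape, the 5-minute window, and the 1-minute allowed lateness are assumptions for illustration.

```typescript
// Sketch: assign events to tumbling windows; events behind the watermark
// (max event time seen, minus allowed lateness) go to a side output.
interface Event { timestamp: number; value: number }

const WINDOW_MS = 5 * 60 * 1000;        // 5-minute tumbling windows
const ALLOWED_LATENESS_MS = 60 * 1000;  // events up to 1 minute late still count

function windowStart(ts: number): number {
  return Math.floor(ts / WINDOW_MS) * WINDOW_MS;
}

let maxEventTime = 0;
const windows = new Map<number, number>(); // window start -> running sum
const lateEvents: Event[] = [];            // side output for late arrivals

function process(event: Event): void {
  maxEventTime = Math.max(maxEventTime, event.timestamp);
  const watermark = maxEventTime - ALLOWED_LATENESS_MS;
  const start = windowStart(event.timestamp);
  if (start + WINDOW_MS <= watermark) {
    lateEvents.push(event); // window already closed: route to side output
    return;
  }
  windows.set(start, (windows.get(start) ?? 0) + event.value);
}
```

Note that the watermark advances with event time, not wall-clock time: a burst of fresh events is what closes old windows.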
Exactly-once is harder than it sounds
"Exactly-once processing" in stream pipelines typically means exactly-once semantics within the stream processor (Flink achieves this via checkpointing + two-phase commit to Kafka). It does not mean exactly-once delivery to external systems. Writing to a database, calling an API, or sending an email requires idempotent sinks on the receiving end. The pipeline guarantees it processes each record exactly once internally; external side effects require idempotent design.
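What an idempotent sink means in practice: writes are keyed on a unique record ID, so a replayed (at-least-once) delivery overwrites rather than accumulates. A minimal sketch, with an in-memory Map standing in for a warehouse MERGE/upsert target (the record shape is an illustrative assumption):

```typescript
// Sketch: idempotent sink keyed on orderId. Duplicate deliveries collapse
// into one row -- the same effect a warehouse MERGE/upsert achieves.
interface SinkRecord { orderId: string; totalUSD: number; processedAt: string }

const sink = new Map<string, SinkRecord>();

function writeIdempotent(record: SinkRecord): void {
  sink.set(record.orderId, record); // keyed write: replay overwrites, never duplicates
}
```

With this design, the pipeline can safely retry or replay deliveries without double-counting revenue downstream.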
Trade-offs
| Pros | Cons |
|---|---|
| Decouples data producers from consumers (schema changes don't break downstream immediately) | Operational complexity: monitoring lag, managing consumer offsets, debugging silent failures |
| Enables real-time analytics, fraud detection, and operational alerting that batch cannot provide | Stream processing adds windowing, watermarks, and state management as new complexity dimensions |
| Preserves raw data for re-transformation (medallion architecture) so business rule changes don't require re-extraction | Storage costs grow linearly with retention (Kafka logs, bronze layer, warehouse history) |
| Fan-out pattern lets one event feed multiple destinations without source changes | Schema evolution must be carefully managed or one producer change breaks all consumers |
| Dead-letter queues isolate bad records without stalling the entire pipeline | Debugging distributed pipeline failures across 5+ stages requires mature observability |
The fundamental tension: freshness vs. complexity. Batch pipelines are simple but stale. Stream pipelines are fresh but operationally demanding. The right answer is usually both, with streaming reserved for workloads that genuinely need sub-minute latency.
When to Use It / When to Avoid
Build a data pipeline when:
- Multiple teams need the same data in different shapes (analytics, search, ML features)
- Your analytics queries are hitting the production database and causing contention
- You need real-time signals (fraud detection, alerting, operational dashboards) that batch queries can't provide
- Data needs to flow between systems that don't share a database (microservices, third-party integrations)
- You need to preserve raw data for re-transformation when business rules change
Avoid building a custom pipeline when:
- A direct database query or materialized view solves the problem (don't build infrastructure for something `CREATE MATERIALIZED VIEW` handles)
- You have fewer than 100K events per day and a single analytics user (a cron job is genuinely fine here)
- You're adding streaming for a workload that doesn't need sub-minute latency (real-time dashboards that refresh every 5 minutes don't need Flink)
- A managed ELT tool (Fivetran, Airbyte) can handle the ingestion and you only need dbt for transforms
The honest answer: most companies over-engineer their data pipelines. If your data volume is under 1M events/day and your latency requirements are measured in hours, a dbt project running on a cron schedule with Airflow is the right answer. Flink and Kafka are for when the cron job breaks.
Real-World Examples
Netflix processes over 1.5 PB of data per day through their pipeline infrastructure. Their real-time pipeline uses Kafka for event streaming and Flink for stream processing (personalization signals, A/B test metrics, viewing activity). Their batch pipeline uses Spark on Iceberg tables for historical analytics and ML model training. They operate both Lambda-style (batch + stream) because their recommendation models need exact historical accuracy while their operational dashboards need real-time freshness.
Uber runs a real-time surge pricing pipeline that processes millions of ride request events per second. The pipeline enriches each request with geospatial data, driver supply, and historical demand patterns, then computes a surge multiplier in under 200ms. They use Kafka for event transport, Flink for real-time computation, and a custom in-memory feature store for sub-millisecond lookups. A 500ms delay in the surge pipeline means prices don't reflect current demand, directly impacting driver supply allocation.
Spotify combines batch and stream pipelines for music recommendations. Batch pipelines (Spark, running nightly) compute collaborative filtering models over the entire listening history. Stream pipelines process real-time listening events to update session-based recommendations ("because you just listened to X"). Their "Discover Weekly" playlist is batch-computed from full listening history; their "Daily Mix" incorporates same-day streaming signals. The batch pipeline processes ~600TB of data nightly.
How This Shows Up in Interviews
When to bring it up
Data pipelines come up in any system design that involves analytics, search indexing, ML features, or real-time dashboards. When the interviewer asks "how does the data get to the analytics team?" or "how would you build the recommendation engine's feature pipeline?", that's your cue. Also relevant when discussing event-driven architecture, CQRS, or any system where the read model differs from the write model.
Depth expected at senior / staff level
- Articulate when to use batch vs. stream (and why most systems need both)
- Know the medallion architecture (bronze/silver/gold) and why bronze exists
- Explain exactly-once semantics: what Flink guarantees internally vs. what requires idempotent sinks
- Describe windowing strategies and when each type applies
- Name concrete tools: Kafka, Flink, Spark, dbt, Airflow, and know their trade-offs
- Explain how to handle late-arriving events, schema evolution, and backfill
Interview shortcut: name the layers
When designing a data pipeline in an interview, say: "I'd use a medallion architecture. Bronze layer is raw CDC events from Kafka, silver is cleaned and deduplicated, gold is the business-ready aggregates that dashboards query. If we discover a transformation bug, we replay from bronze without re-extracting from the source." This shows architectural maturity in three sentences.
Common follow-up questions
| Interviewer asks | Strong answer |
|---|---|
| "How do you handle schema changes in a pipeline?" | "Use a schema registry (Confluent Schema Registry with Avro or Protobuf). Enforce backward compatibility: new fields get defaults, removed fields trigger a deprecation period. Consumers that can't handle the new schema see the old version. Breaking changes require a new topic." |
| "How do you guarantee no data loss?" | "At-least-once delivery with idempotent sinks. Kafka provides durable, replicated storage. Flink checkpoints offsets + state. If a consumer crashes, it resumes from the last checkpoint. For the warehouse, use MERGE/upsert semantics so duplicate records overwrite rather than duplicate." |
| "Batch or stream for this use case?" | "If the business decision can wait an hour, batch. If it can't (fraud, surge pricing, alerting), stream. Most systems need both: stream for operational paths, batch for historical recomputation and ML training." |
| "How do you monitor a pipeline?" | "Three metrics: consumer lag (how far behind real-time), throughput (records/second per stage), and error rate (DLQ volume). Alert on lag exceeding your SLO (e.g., lag > 5 minutes for a real-time pipeline), throughput drops (source is still producing but consumer stopped), and DLQ growth." |
| "How do you handle backfill?" | "Design pipelines to accept a time range parameter. For Kafka, reset consumer offsets to the desired timestamp. For batch, re-run the dbt model with a date filter. Bronze layer makes this possible because the raw data is always preserved." |
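The lag metric from the monitoring answer is simple to compute: per partition, latest (end) offset minus the consumer's committed offset. A minimal sketch, with the offset arrays as illustrative stand-ins for what a Kafka admin client would report:

```typescript
// Sketch: worst-case consumer lag across partitions, in records.
// Converting record lag to time lag (for a "> 5 minutes" SLO) requires
// looking up the timestamp of the record at the committed offset.
function maxConsumerLag(endOffsets: number[], committedOffsets: number[]): number {
  let worst = 0;
  for (let i = 0; i < endOffsets.length; i++) {
    worst = Math.max(worst, endOffsets[i] - committedOffsets[i]);
  }
  return worst;
}
```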
Test Your Understanding
Quick Recap
- A data pipeline moves and transforms data from sources to destinations. The difference between "a cron job that runs SQL" and "a data pipeline" is monitoring, failure handling, replay capabilities, and schema management.
- Batch pipelines trade latency for simplicity. Stream pipelines trade complexity for freshness. Most production systems use both, with streaming reserved for workloads that genuinely need sub-minute latency.
- ELT replaced ETL as the default for cloud warehouses. Load raw data first (bronze), clean and validate (silver), aggregate for business use (gold). The medallion architecture preserves raw data for when business rules change.
- Kappa architecture (streaming only) is preferred for new systems. Lambda (batch + stream) persists where historical reprocessing volume makes streaming-only replay impractical.
- Pipeline failure handling requires three capabilities: dead-letter queues for poison records, checkpointing for crash recovery, and replay from source for fixing transformation bugs.
- Stream processing adds windowing (tumbling, sliding, session) and watermarks for late events as complexity dimensions that batch avoids entirely. Choose the window type based on the aggregation semantics, not the data.
- Monitor three metrics for pipeline health: consumer lag (freshness), throughput (records per second per stage), and DLQ volume (error rate). Alert on lag exceeding your SLO.
Related Concepts
- Message queues: Kafka and other message brokers are the transport layer that connects pipeline stages. Understanding partitioning, consumer groups, and offset management is essential for pipeline design.
- Event-driven architecture: Data pipelines are often the implementation of event-driven patterns. Fan-out, enrichment, and aggregation are pipeline implementations of event notification and event-carried state transfer.
- Consistency models: Stream processing introduces eventual consistency between source and destination. Understanding the consistency guarantees of your pipeline determines what business decisions you can safely base on pipeline output.
- Observability: Pipeline monitoring (lag, throughput, error rate) is a specialized application of observability principles. Without pipeline-specific metrics, failures go undetected for days.