How data pipelines move and transform data at scale: batch vs. stream processing, Lambda vs. Kappa architecture, ETL vs. ELT, medallion architecture, windowing strategies, and failure handling.
30 min read2026-04-04mediumdata-pipelinesstreamingbatch-processinginfrastructure
A data pipeline moves data from sources to destinations, transforming it along the way. The hard part is doing this reliably at scale without losing records or violating latency guarantees.
Batch processing trades latency for simplicity (hourly/nightly runs). Stream processing trades complexity for freshness (sub-second). Most production systems use both.
ELT (load raw data first, transform in-warehouse) has replaced ETL as the default for cloud data warehouses. The medallion architecture (bronze/silver/gold) organizes the transformation layers.
Pipeline failure handling requires three capabilities: dead-letter queues for poison records, checkpointing for crash recovery, and replay from source for fixing transformation bugs.
Stream processing adds windowing (tumbling, sliding, session) and exactly-once semantics as complexity dimensions that batch pipelines avoid entirely.
Your e-commerce company starts with a simple setup: the product team queries the production database directly for analytics. SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '24 hours' runs fine when you have 10,000 orders per day.
At 500,000 orders per day, that query takes 45 seconds and locks rows that the checkout service needs. Your DBA adds a read replica for analytics queries. That buys you 6 months.
At 2 million orders per day, the analytics team needs joins across orders, customers, inventory, and clickstream data. These joins require denormalized tables that don't exist in the OLTP schema. Someone writes a Python script that runs on a cron job at 3 AM, pulls data from four tables, transforms it, and loads it into a separate analytics database. It works until the script crashes silently one night and nobody notices for three days. Three days of revenue dashboards show zeros, and the CFO thinks the company lost all its revenue.
At 10 million events per day (orders plus page views, clicks, searches, ad impressions), the 3 AM cron job takes 6 hours to run. It fails halfway through on out-of-memory errors. The analytics database has stale data. The fraud detection team needs real-time signals, not yesterday's data. The recommendation engine needs feature vectors computed from the last hour of clickstream data, not last night's.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.
Every growing company hits this wall. The cron script that "works fine for now" becomes the single most fragile piece of infrastructure. I've watched teams spend months debugging silent data pipeline failures that nobody noticed because the dashboard showed cached data from the last successful run.
The answer is not a better cron script. It's a data pipeline: a system designed from the ground up for reliable, scalable, monitored data movement and transformation.
A data pipeline is a system that moves data from sources to destinations through a series of processing stages, each of which extracts, transforms, enriches, or aggregates the data. The "pipeline" metaphor is literal: data flows in one end, gets processed at each stage, and arrives at the destination in a different shape than it started.
Think of it like a water treatment plant. Raw water (data) flows in from rivers and reservoirs (sources). It passes through filtration, chemical treatment, and quality testing stages (transformations). Clean water (processed data) flows out to homes and businesses (destinations). The plant runs 24/7, monitors pressure and quality at every stage, and has bypass systems for when one stage fails. A data pipeline does the same thing for information.
The key insight: a data pipeline is not just a script that moves data. It's a system with monitoring, failure handling, replay capabilities, and schema management. The difference between "a cron job that runs SQL" and "a data pipeline" is the difference between a garden hose and a water treatment plant.
Let's trace a single record through a production pipeline: an order placed on an e-commerce platform that needs to reach the analytics warehouse, the search index, and the fraud detection system.
Event production. The checkout service publishes an order.created event to Kafka topic orders. The event includes order ID, customer ID, items, total, payment method, and timestamp.
Ingestion. A Flink consumer reads from the orders topic. It validates the schema (all required fields present, types correct) and drops malformed records to a dead-letter topic for investigation.
Enrichment. The Flink job looks up the customer's profile from a Redis cache (country, account age, lifetime spend). It joins this context onto the order event, producing an enriched record.
Transformation. Business rules are applied: currency conversion to USD, tax calculation, fraud risk score from a sidecar ML model. The enriched, transformed record is written to a downstream Kafka topic orders.enriched.
Fan-out to destinations. Three independent consumers read from orders.enriched: one writes to Snowflake (analytics), one updates Elasticsearch (search), one feeds the real-time fraud dashboard (alerting). Each consumer checkpoints its Kafka offset independently.
Quality verification. An hourly batch job compares record counts between source (Kafka) and destination (Snowflake). If the counts diverge by more than 0.1%, it triggers an alert.
Every stage is independently monitorable. If enrichment latency spikes, you see it in the stage metrics. If the fraud model slows down, that stage's processing time increases while others are unaffected. This isolation is what separates a pipeline from a monolithic script.
For your interview: trace a specific record through the pipeline. "An order event hits Kafka, gets enriched with customer data from Redis, transformed with business rules, then fanned out to Snowflake, Elasticsearch, and the fraud dashboard." Specific > abstract.
Extracts data from the origin system: CDC (Change Data Capture) for databases, API polling for SaaS tools, file watchers for S3/SFTP drops. Debezium is the standard for CDC.
Message broker
Decouples stages and buffers records between them. Kafka is the default for high-throughput pipelines. Kinesis for AWS-native. Provides durability and replay.
Stream processor
Stateful computation on records in flight: enrichment, aggregation, windowing. Apache Flink (true streaming), Spark Structured Streaming (micro-batch), or Kafka Streams (library, no separate cluster).
Batch processor
Processes large historical datasets on a schedule. Apache Spark, dbt (SQL transforms), or Airflow-orchestrated SQL jobs. Cheaper per record than streaming for bulk historical data.
Schema registry
Stores and enforces schema versions for events. Confluent Schema Registry (Avro/Protobuf/JSON Schema). Prevents producers from publishing records that break downstream consumers.
Orchestrator
Schedules and monitors batch pipeline DAGs. Apache Airflow, Dagster, or Prefect. Handles retries, dependency ordering, backfill runs.
Dead-letter queue
Captures records that fail processing after exhausting retries. Operators investigate and choose to fix-and-replay or discard. Without a DLQ, one bad record stalls an entire partition.
Data warehouse
The analytical destination: Snowflake, BigQuery, Redshift, ClickHouse. Columnar storage optimized for aggregate queries over large datasets.