Data pipelines
How data pipelines move and transform data at scale: batch vs. stream processing, Lambda vs. Kappa architecture, ETL vs. ELT, medallion architecture, windowing strategies, and failure handling.
TL;DR
- A data pipeline moves data from sources to destinations, transforming it along the way. The hard part is doing this reliably at scale without losing records or violating latency guarantees.
- Batch processing trades latency for simplicity (hourly/nightly runs). Stream processing trades complexity for freshness (sub-second). Most production systems use both.
- ELT (load raw data first, transform in-warehouse) has replaced ETL as the default for cloud data warehouses. The medallion architecture (bronze/silver/gold) organizes the transformation layers.
- Pipeline failure handling requires three capabilities: dead-letter queues for poison records, checkpointing for crash recovery, and replay from source for fixing transformation bugs.
- Stream processing adds windowing (tumbling, sliding, session) and exactly-once semantics as complexity dimensions that batch pipelines avoid entirely.
The Problem It Solves
Your e-commerce company starts with a simple setup: the product team queries the production database directly for analytics. `SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '24 hours'` runs fine when you have 10,000 orders per day.
At 500,000 orders per day, that query takes 45 seconds and locks rows that the checkout service needs. Your DBA adds a read replica for analytics queries. That buys you 6 months.
At 2 million orders per day, the analytics team needs joins across orders, customers, inventory, and clickstream data. These joins require denormalized tables that don't exist in the OLTP schema. Someone writes a Python script that runs on a cron job at 3 AM, pulls data from four tables, transforms it, and loads it into a separate analytics database. It works until the script crashes silently one night and nobody notices for three days. Three days of revenue dashboards show zeros, and the CFO thinks the company lost all its revenue.
At 10 million events per day (orders plus page views, clicks, searches, ad impressions), the 3 AM cron job takes 6 hours to run. It fails halfway through on out-of-memory errors. The analytics database has stale data. The fraud detection team needs real-time signals, not yesterday's data. The recommendation engine needs feature vectors computed from the last hour of clickstream data, not last night's.
Every growing company hits this wall. The cron script that "works fine for now" becomes the single most fragile piece of infrastructure. I've watched teams spend months debugging silent data pipeline failures that nobody noticed because the dashboard showed cached data from the last successful run.
The answer is not a better cron script. It's a data pipeline: a system designed from the ground up for reliable, scalable, monitored data movement and transformation.
What Is It?
A data pipeline is a system that moves data from sources to destinations through a series of processing stages, each of which extracts, transforms, enriches, or aggregates the data. The "pipeline" metaphor is literal: data flows in one end, gets processed at each stage, and arrives at the destination in a different shape than it started.
Think of it like a water treatment plant. Raw water (data) flows in from rivers and reservoirs (sources). It passes through filtration, chemical treatment, and quality testing stages (transformations). Clean water (processed data) flows out to homes and businesses (destinations). The plant runs 24/7, monitors pressure and quality at every stage, and has bypass systems for when one stage fails. A data pipeline does the same thing for information.
The key insight: a data pipeline is not just a script that moves data. It's a system with monitoring, failure handling, replay capabilities, and schema management. The difference between "a cron job that runs SQL" and "a data pipeline" is the difference between a garden hose and a water treatment plant.
How It Works
Let's trace a single record through a production pipeline: an order placed on an e-commerce platform that needs to reach the analytics warehouse, the search index, and the fraud detection system.
1. Event production. The checkout service publishes an `order.created` event to the Kafka topic `orders`. The event includes order ID, customer ID, items, total, payment method, and timestamp.
2. Ingestion. A Flink consumer reads from the `orders` topic. It validates the schema (all required fields present, types correct) and drops malformed records to a dead-letter topic for investigation.
3. Enrichment. The Flink job looks up the customer's profile from a Redis cache (country, account age, lifetime spend). It joins this context onto the order event, producing an enriched record.
4. Transformation. Business rules are applied: currency conversion to USD, tax calculation, fraud risk score from a sidecar ML model. The enriched, transformed record is written to a downstream Kafka topic, `orders.enriched`.
5. Fan-out to destinations. Three independent consumers read from `orders.enriched`: one writes to Snowflake (analytics), one updates Elasticsearch (search), one feeds the real-time fraud dashboard (alerting). Each consumer checkpoints its Kafka offset independently.
6. Quality verification. An hourly batch job compares record counts between source (Kafka) and destination (Snowflake). If the counts diverge by more than 0.1%, it triggers an alert.
```typescript
// Simplified Flink-style pipeline stage (enrichment + transformation)
async function processOrderEvent(event: OrderEvent): Promise<EnrichedOrder | null> {
  // Step 1: Validate schema; route malformed records to the dead-letter queue
  if (!event.orderId || !event.customerId || !event.total) {
    await deadLetterQueue.send(event, "missing required fields");
    return null;
  }

  // Step 2: Enrich with customer context (cache-aside: fall back to the DB on a miss)
  let customer = await redis.get(`customer:${event.customerId}`);
  if (!customer) {
    customer = await customerDB.findById(event.customerId);
    await redis.set(`customer:${event.customerId}`, customer, { ttl: 3600 });
  }

  // Step 3: Transform (business rules + fraud scoring)
  const enrichedOrder: EnrichedOrder = {
    ...event,
    customerCountry: customer.country,
    totalUSD: convertToUSD(event.total, event.currency),
    fraudScore: await fraudModel.score(event, customer),
    processedAt: new Date().toISOString(),
  };

  // Step 4: Emit to the downstream topic
  await kafka.produce("orders.enriched", enrichedOrder);
  return enrichedOrder;
}
```
Every stage is independently monitorable. If enrichment latency spikes, you see it in the stage metrics. If the fraud model slows down, that stage's processing time increases while others are unaffected. This isolation is what separates a pipeline from a monolithic script.
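The hourly quality-verification step reduces to a simple count reconciliation. Here is a minimal sketch: `countsDiverge` is a hypothetical helper, and in production the inputs would come from Kafka offsets on one side and a warehouse `COUNT(*)` on the other. The 0.1% tolerance matches the threshold described above.

```typescript
// Sketch of source-vs-destination count reconciliation for one time window.
// If the relative divergence exceeds the tolerance, the batch job raises an alert.
function countsDiverge(
  sourceCount: number,
  destCount: number,
  tolerance = 0.001, // 0.1% divergence threshold
): boolean {
  if (sourceCount === 0) return destCount !== 0;
  return Math.abs(sourceCount - destCount) / sourceCount > tolerance;
}
```

If this returns true for an hour's window, the job triggers an alert and the window is investigated (and typically replayed from the source).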
For your interview: trace a specific record through the pipeline. "An order event hits Kafka, gets enriched with customer data from Redis, transformed with business rules, then fanned out to Snowflake, Elasticsearch, and the fraud dashboard." Specific > abstract.
Key Components
| Component | Role |
|---|---|
| Source connector | Extracts data from the origin system: CDC (Change Data Capture) for databases, API polling for SaaS tools, file watchers for S3/SFTP drops. Debezium is the standard for CDC. |
| Message broker | Decouples stages and buffers records between them. Kafka is the default for high-throughput pipelines. Kinesis for AWS-native. Provides durability and replay. |
| Stream processor | Stateful computation on records in flight: enrichment, aggregation, windowing. Apache Flink (true streaming), Spark Structured Streaming (micro-batch), or Kafka Streams (library, no separate cluster). |
| Batch processor | Processes large historical datasets on a schedule. Apache Spark, dbt (SQL transforms), or Airflow-orchestrated SQL jobs. Cheaper per record than streaming for bulk historical data. |
| Schema registry | Stores and enforces schema versions for events. Confluent Schema Registry (Avro/Protobuf/JSON Schema). Prevents producers from publishing records that break downstream consumers. |
| Orchestrator | Schedules and monitors batch pipeline DAGs. Apache Airflow, Dagster, or Prefect. Handles retries, dependency ordering, backfill runs. |
| Dead-letter queue | Captures records that fail processing after exhausting retries. Operators investigate and choose to fix-and-replay or discard. Without a DLQ, one bad record stalls an entire partition. |
| Data warehouse | The analytical destination: Snowflake, BigQuery, Redshift, ClickHouse. Columnar storage optimized for aggregate queries over large datasets. |
Types / Variations
Batch vs. Stream Processing
| Dimension | Batch | Stream |
|---|---|---|
| Latency | Minutes to hours (as stale as the batch interval) | Milliseconds to seconds |
| Throughput | Very high (bulk I/O, columnar compression) | Lower per-record (overhead of per-event processing) |
| Complexity | Low (read, transform, write, done) | High (state management, windowing, exactly-once) |
| Failure recovery | Re-run the job from scratch | Checkpoint + resume from offset |
| Cost | Cheap per record (amortized over millions) | More expensive per record (always-on compute) |
| Use case | Historical analytics, ML training, nightly reports | Fraud detection, real-time dashboards, alerting |
Most production systems use both. Batch for historical aggregation and ML training, stream for real-time features and operational alerts. The question is not "batch or stream" but "which workloads justify the complexity of streaming?"
Lambda vs. Kappa Architecture
Lambda runs batch and stream in parallel: the batch layer produces accurate historical results (recomputed hourly), while the speed layer provides low-latency approximations. A serving layer merges both. The problem: two codebases implementing the same business logic. Bugs appear in one but not the other. Operational cost doubles.
Kappa uses only a streaming path. Reprocessing historical data means replaying Kafka from an earlier offset with a new consumer version. One codebase, simpler operations. The trade-off: reprocessing years of history through a streaming system is slow and the storage cost of retaining Kafka logs indefinitely is significant.
Kappa is preferred for new systems. Lambda persists where organizations have existing batch infrastructure they can't migrate away from.
ETL vs. ELT
ETL (Extract, Transform, Load): Transform data before loading. Made sense when the destination (on-premise data warehouse) had limited compute. Transform outside, load only clean data.
ELT (Extract, Load, Transform): Load raw data first, transform in-warehouse. Made practical by powerful columnar warehouses (BigQuery, Redshift, Snowflake) that can run transforms on petabytes efficiently. Preserves raw data for re-transformation when business rules change.
ELT won. Tools like dbt (SQL-based transforms that run inside the warehouse) and Fivetran (managed ELT connectors) dominate the modern data stack. The advantage: when your CFO changes the revenue recognition formula, you re-run the dbt model against the raw bronze data instead of re-extracting from the source.
Medallion Architecture (Bronze / Silver / Gold)
The medallion architecture organizes ELT into three quality tiers:
- Bronze: Raw data exactly as ingested from the source. No transformations. Append-only. This is your insurance policy: if any downstream transform has a bug, you replay from bronze.
- Silver: Cleaned and validated. Deduplication applied, schema enforced, nulls handled, data types cast. Most joins happen here.
- Gold: Business-ready aggregates and denormalized tables. Dashboard queries hit gold tables directly. These are the tables your analyst team uses.
My recommendation: always land data in bronze first. Teams that transform on ingest (skipping bronze) lose the ability to fix bugs without re-extracting from the source. I've seen this cause multi-week data recovery projects.
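To make the silver layer's deduplication step concrete, here is a minimal sketch. The record shape and field names are illustrative assumptions, not a specific table schema; the rule is the common one of keeping the latest version of each key by ingestion time.

```typescript
// Sketch: bronze-to-silver deduplication, keeping the most recently
// ingested version of each order. In a real pipeline this would be a
// windowed SQL transform (e.g., ROW_NUMBER() over order_id); the Map
// keeps the example self-contained.
interface BronzeRecord { orderId: string; ingestedAt: number; payload: unknown }

function dedupeToSilver(bronze: BronzeRecord[]): BronzeRecord[] {
  const latest = new Map<string, BronzeRecord>();
  for (const rec of bronze) {
    const prev = latest.get(rec.orderId);
    if (!prev || rec.ingestedAt > prev.ingestedAt) {
      latest.set(rec.orderId, rec);
    }
  }
  return [...latest.values()];
}
```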
Windowing Strategies (Stream Processing)
Stream processing aggregations need to answer: "aggregate over what time range?" This is windowing.
| Window Type | Description | Use Case |
|---|---|---|
| Tumbling | Fixed-size, non-overlapping (every 5 minutes exactly) | Hourly revenue totals, periodic metric summaries |
| Sliding | Fixed-size, overlapping (5-minute window, slides every 1 minute) | Moving averages, trend detection |
| Session | Dynamic size, closed by inactivity gap (30 seconds of no events) | User session analytics, clickstream grouping |
Late-arriving events are the biggest headache. An event with a timestamp of 14:59:58 arrives at 15:01:02. The 15:00 tumbling window has already closed and emitted results. Stream processors handle this with watermarks: a threshold that defines how late an event can be and still be included. Events arriving after the watermark are either dropped or sent to a side output for separate handling.
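The mechanics of tumbling windows and watermarks can be sketched in a few lines. This is a toy model, not any framework's API: the event shape, the 5-minute window, and the 1-minute allowed lateness are assumptions for illustration.

```typescript
// Sketch: assign events to tumbling windows; events behind the watermark
// (max event time seen, minus allowed lateness) go to a side output.
interface Event { timestamp: number; value: number }

const WINDOW_MS = 5 * 60 * 1000;        // 5-minute tumbling windows
const ALLOWED_LATENESS_MS = 60 * 1000;  // events up to 1 minute late still count

function windowStart(ts: number): number {
  return Math.floor(ts / WINDOW_MS) * WINDOW_MS;
}

let maxEventTime = 0;
const windows = new Map<number, number>(); // window start -> running sum
const lateEvents: Event[] = [];            // side output for late arrivals

function process(event: Event): void {
  maxEventTime = Math.max(maxEventTime, event.timestamp);
  const watermark = maxEventTime - ALLOWED_LATENESS_MS;
  const start = windowStart(event.timestamp);
  if (start + WINDOW_MS <= watermark) {
    lateEvents.push(event); // window already closed: route to side output
    return;
  }
  windows.set(start, (windows.get(start) ?? 0) + event.value);
}
```

Note that the watermark advances with event time, not wall-clock time: a burst of fresh events is what closes old windows.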
Exactly-once is harder than it sounds
"Exactly-once processing" in stream pipelines typically means exactly-once semantics within the stream processor (Flink achieves this via checkpointing + two-phase commit to Kafka). It does not mean exactly-once delivery to external systems. Writing to a database, calling an API, or sending an email requires idempotent sinks on the receiving end. The pipeline guarantees it processes each record exactly once internally; external side effects require idempotent design.
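What an idempotent sink means in practice: writes are keyed on a unique record ID, so a replayed (at-least-once) delivery overwrites rather than accumulates. A minimal sketch, with an in-memory Map standing in for a warehouse MERGE/upsert target (the record shape is an illustrative assumption):

```typescript
// Sketch: idempotent sink keyed on orderId. Duplicate deliveries collapse
// into one row -- the same effect a warehouse MERGE/upsert achieves.
interface SinkRecord { orderId: string; totalUSD: number; processedAt: string }

const sink = new Map<string, SinkRecord>();

function writeIdempotent(record: SinkRecord): void {
  sink.set(record.orderId, record); // keyed write: replay overwrites, never duplicates
}
```

With this design, the pipeline can safely retry or replay deliveries without double-counting revenue downstream.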
Trade-offs
| Pros | Cons |
|---|---|
| Decouples data producers from consumers (schema changes don't break downstream immediately) | Operational complexity: monitoring lag, managing consumer offsets, debugging silent failures |
| Enables real-time analytics, fraud detection, and operational alerting that batch cannot provide | Stream processing adds windowing, watermarks, and state management as new complexity dimensions |
| Preserves raw data for re-transformation (medallion architecture) so business rule changes don't require re-extraction | Storage costs grow linearly with retention (Kafka logs, bronze layer, warehouse history) |
| Fan-out pattern lets one event feed multiple destinations without source changes | Schema evolution must be carefully managed or one producer change breaks all consumers |
| Dead-letter queues isolate bad records without stalling the entire pipeline | Debugging distributed pipeline failures across 5+ stages requires mature observability |
The fundamental tension: freshness vs. complexity. Batch pipelines are simple but stale. Stream pipelines are fresh but operationally demanding. The right answer is usually both, with streaming reserved for workloads that genuinely need sub-minute latency.
When to Use It / When to Avoid
Build a data pipeline when:
- Multiple teams need the same data in different shapes (analytics, search, ML features)
- Your analytics queries are hitting the production database and causing contention
- You need real-time signals (fraud detection, alerting, operational dashboards) that batch queries can't provide
- Data needs to flow between systems that don't share a database (microservices, third-party integrations)
- You need to preserve raw data for re-transformation when business rules change
Avoid building a custom pipeline when:
- A direct database query or materialized view solves the problem (don't build infrastructure for something `CREATE MATERIALIZED VIEW` handles)
- You have fewer than 100K events per day and a single analytics user (a cron job is genuinely fine here)
- You're adding streaming for a workload that doesn't need sub-minute latency (real-time dashboards that refresh every 5 minutes don't need Flink)
- A managed ELT tool (Fivetran, Airbyte) can handle the ingestion and you only need dbt for transforms
The honest answer: most companies over-engineer their data pipelines. If your data volume is under 1M events/day and your latency requirements are measured in hours, a dbt project running on a cron schedule with Airflow is the right answer. Flink and Kafka are for when the cron job breaks.
Real-World Examples
Netflix processes over 1.5 PB of data per day through their pipeline infrastructure. Their real-time pipeline uses Kafka for event streaming and Flink for stream processing (personalization signals, A/B test metrics, viewing activity). Their batch pipeline uses Spark on Iceberg tables for historical analytics and ML model training. They operate both Lambda-style (batch + stream) because their recommendation models need exact historical accuracy while their operational dashboards need real-time freshness.
Uber runs a real-time surge pricing pipeline that processes millions of ride request events per second. The pipeline enriches each request with geospatial data, driver supply, and historical demand patterns, then computes a surge multiplier in under 200ms. They use Kafka for event transport, Flink for real-time computation, and a custom in-memory feature store for sub-millisecond lookups. A 500ms delay in the surge pipeline means prices don't reflect current demand, directly impacting driver supply allocation.
Spotify combines batch and stream pipelines for music recommendations. Batch pipelines (Spark, running nightly) compute collaborative filtering models over the entire listening history. Stream pipelines process real-time listening events to update session-based recommendations ("because you just listened to X"). Their "Discover Weekly" playlist is batch-computed from full listening history; their "Daily Mix" incorporates same-day streaming signals. The batch pipeline processes ~600TB of data nightly.
How This Shows Up in Interviews
When to bring it up
Data pipelines come up in any system design that involves analytics, search indexing, ML features, or real-time dashboards. When the interviewer asks "how does the data get to the analytics team?" or "how would you build the recommendation engine's feature pipeline?", that's your cue. Also relevant when discussing event-driven architecture, CQRS, or any system where the read model differs from the write model.
Depth expected at senior / staff level
- Articulate when to use batch vs. stream (and why most systems need both)
- Know the medallion architecture (bronze/silver/gold) and why bronze exists
- Explain exactly-once semantics: what Flink guarantees internally vs. what requires idempotent sinks
- Describe windowing strategies and when each type applies
- Name concrete tools: Kafka, Flink, Spark, dbt, Airflow, and know their trade-offs
- Explain how to handle late-arriving events, schema evolution, and backfill
Interview shortcut: name the layers
When designing a data pipeline in an interview, say: "I'd use a medallion architecture. Bronze layer is raw CDC events from Kafka, silver is cleaned and deduplicated, gold is the business-ready aggregates that dashboards query. If we discover a transformation bug, we replay from bronze without re-extracting from the source." This shows architectural maturity in three sentences.
Common follow-up questions
| Interviewer asks | Strong answer |
|---|---|
| "How do you handle schema changes in a pipeline?" | "Use a schema registry (Confluent Schema Registry with Avro or Protobuf). Enforce backward compatibility: new fields get defaults, removed fields trigger a deprecation period. Consumers that can't handle the new schema see the old version. Breaking changes require a new topic." |
| "How do you guarantee no data loss?" | "At-least-once delivery with idempotent sinks. Kafka provides durable, replicated storage. Flink checkpoints offsets + state. If a consumer crashes, it resumes from the last checkpoint. For the warehouse, use MERGE/upsert semantics so duplicate records overwrite rather than duplicate." |
| "Batch or stream for this use case?" | "If the business decision can wait an hour, batch. If it can't (fraud, surge pricing, alerting), stream. Most systems need both: stream for operational paths, batch for historical recomputation and ML training." |
| "How do you monitor a pipeline?" | "Three metrics: consumer lag (how far behind real-time), throughput (records/second per stage), and error rate (DLQ volume). Alert on lag exceeding your SLO (e.g., lag > 5 minutes for a real-time pipeline), throughput drops (source is still producing but consumer stopped), and DLQ growth." |
| "How do you handle backfill?" | "Design pipelines to accept a time range parameter. For Kafka, reset consumer offsets to the desired timestamp. For batch, re-run the dbt model with a date filter. Bronze layer makes this possible because the raw data is always preserved." |
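The lag metric from the monitoring answer is simple to compute: per partition, latest (end) offset minus the consumer's committed offset. A minimal sketch, with the offset arrays as illustrative stand-ins for what a Kafka admin client would report:

```typescript
// Sketch: worst-case consumer lag across partitions, in records.
// Converting record lag to time lag (for a "> 5 minutes" SLO) requires
// looking up the timestamp of the record at the committed offset.
function maxConsumerLag(endOffsets: number[], committedOffsets: number[]): number {
  let worst = 0;
  for (let i = 0; i < endOffsets.length; i++) {
    worst = Math.max(worst, endOffsets[i] - committedOffsets[i]);
  }
  return worst;
}
```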
Test Your Understanding
Quick Recap
- A data pipeline moves and transforms data from sources to destinations. The difference between "a cron job that runs SQL" and "a data pipeline" is monitoring, failure handling, replay capabilities, and schema management.
- Batch pipelines trade latency for simplicity. Stream pipelines trade complexity for freshness. Most production systems use both, with streaming reserved for workloads that genuinely need sub-minute latency.
- ELT replaced ETL as the default for cloud warehouses. Load raw data first (bronze), clean and validate (silver), aggregate for business use (gold). The medallion architecture preserves raw data for when business rules change.
- Kappa architecture (streaming only) is preferred for new systems. Lambda (batch + stream) persists where historical reprocessing volume makes streaming-only replay impractical.
- Pipeline failure handling requires three capabilities: dead-letter queues for poison records, checkpointing for crash recovery, and replay from source for fixing transformation bugs.
- Stream processing adds windowing (tumbling, sliding, session) and watermarks for late events as complexity dimensions that batch avoids entirely. Choose the window type based on the aggregation semantics, not the data.
- Monitor three metrics for pipeline health: consumer lag (freshness), throughput (records per second per stage), and DLQ volume (error rate). Alert on lag exceeding your SLO.
Related Concepts
- Message queues: Kafka and other message brokers are the transport layer that connects pipeline stages. Understanding partitioning, consumer groups, and offset management is essential for pipeline design.
- Event-driven architecture: Data pipelines are often the implementation of event-driven patterns. Fan-out, enrichment, and aggregation are pipeline implementations of event notification and event-carried state transfer.
- Consistency models: Stream processing introduces eventual consistency between source and destination. Understanding the consistency guarantees of your pipeline determines what business decisions you can safely base on pipeline output.
- Observability: Pipeline monitoring (lag, throughput, error rate) is a specialized application of observability principles. Without pipeline-specific metrics, failures go undetected for days.