Batch vs. stream processing

TL;DR

Scenario	Batch processing	Stream processing
Daily sales report	Run overnight, full dataset, exact numbers	Overkill; latency does not matter
Fraud detection	Too slow; fraud must be caught in milliseconds	Process every transaction in real time
Recommendation model training	Retrain weekly/daily on historical data	Not needed; model does not change per-event
Live dashboard (orders/min)	Minutes-to-hours stale	Real-time counters, sub-second freshness
Data warehouse loading	Standard ETL/ELT, scheduled loads	CDC stream for near-real-time warehouse

Default instinct: start with batch. Batch is simpler to build, test, debug, and recover from failure. Move to streaming only when latency requirements demand it. The majority of data pipelines in the world are batch, and that is fine.

The Framing

Your analytics team runs a nightly batch job. At midnight, a Spark job reads the day's events from S3, aggregates revenue by product category, and writes the result to the data warehouse. The marketing team sees yesterday's numbers at 8 AM. This worked for years.

Then the business launches a flash sale. The CEO asks: "How are flash sale numbers looking?" The answer: "We will know tomorrow morning." This is not acceptable. The business wants to see revenue accumulating in real time, not 8-18 hours later.

A streaming pipeline would solve this: Kafka ingests every purchase event, Flink computes a running sum, and the dashboard updates every second. But now you need to handle late-arriving events, out-of-order data, exactly-once semantics, stateful processing, and checkpointing. The batch job was 200 lines of Spark. The streaming pipeline is 2,000 lines of Flink with a separate state management layer.

I have seen multiple teams rewrite a working batch pipeline as a streaming pipeline only to discover the operational complexity was not worth the freshness improvement. The question is not "can we stream this?" but "does the business need sub-minute freshness for this data?"

The honest answer: most data pipelines should be batch. Streaming is for when freshness is a hard business or operational requirement, not a nice-to-have.

How Each Works

Batch: process a bounded dataset

Batch processing operates on a fixed, bounded dataset. You know the start and end of the data before processing begins. The job reads all the data, transforms it, and writes the result.

Typical batch pipeline:
  1. Schedule trigger (cron, Airflow DAG) fires at midnight
  2. Read input: SELECT * FROM events WHERE date = '2024-03-15'
     (or scan S3 partition s3://lake/events/2024/03/15/)
  3. Transform: filter, aggregate, join, enrich
  4. Write output: INSERT INTO warehouse.daily_revenue ...
  5. Job completes. Success or failure is binary.

Tools: Apache Spark, Hadoop MapReduce, dbt, Airflow, AWS Glue

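Fault tolerance is simple: if the job fails, re-run it on the same input. The input is immutable (it is yesterday's data sitting in S3), so a deterministic job produces the same output on every run. The re-run is idempotent as long as the write step overwrites the target partition instead of appending to it; a plain INSERT would duplicate rows on a retry.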
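A minimal sketch of that re-run property, in plain Python rather than Spark. The event fields, the `warehouse` dict, and `run_daily_revenue_job` are hypothetical stand-ins for the S3 partition, the warehouse table, and the scheduled job. Amounts are integer cents to keep the arithmetic exact.

```python
from collections import defaultdict

def run_daily_revenue_job(events, warehouse, day):
    """Aggregate revenue by category for one day and overwrite that day's partition."""
    totals = defaultdict(int)
    for event in events:
        if event["date"] == day:                          # read only the bounded input
            totals[event["category"]] += event["cents"]   # transform: aggregate
    # Overwrite instead of append, so a retry replaces the earlier result.
    warehouse[day] = dict(totals)
    return warehouse[day]

# Hypothetical input: three events for the target day, one for the next day.
events = [
    {"date": "2024-03-15", "category": "books", "cents": 20},
    {"date": "2024-03-15", "category": "toys", "cents": 5},
    {"date": "2024-03-15", "category": "books", "cents": 10},
    {"date": "2024-03-16", "category": "toys", "cents": 7},
]

warehouse = {}
first = run_daily_revenue_job(events, warehouse, "2024-03-15")
second = run_daily_revenue_job(events, warehouse, "2024-03-15")  # simulated retry
print(first == second)  # True
print(warehouse)        # {'2024-03-15': {'books': 30, 'toys': 5}}
```

If the last step appended rows instead of replacing the partition, the retry would double every total. That one design choice is what makes "just re-run it" safe.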

Resource usage is bursty. The cluster spins up, processes the data, and shuts down. You pay for compute only during the job window. For a job that runs an hour or two a day, an on-demand Spark cluster typically costs a fraction of always-on streaming infrastructure.

The limitation is freshness. The output is only as fresh as the last completed batch. If you run hourly batches, the data is up to 1 hour stale. If daily, up to 24 hours stale.

Stream: process an unbounded dataset

Stream processing operates on a continuous, unbounded flow of events. There is no "end" to the data. Events arrive indefinitely, and the processor maintains running state.

Typical stream pipeline:
  1. Consumer reads from Kafka topic (continuous)
  2. For each event: deserialize, validate, transform
  3. Update running state (counters, windows, aggregations)
  4. Emit result downstream (another Kafka topic, database, dashboard)
  5. Checkpoint state periodically for fault tolerance

Tools: Apache Flink, Kafka Streams, Spark Structured Streaming, Amazon Kinesis

Fault tolerance is complex. The processor maintains in-memory state (running counts, window aggregations). If it crashes, that state is lost unless checkpointed. Flink uses periodic snapshots to a durable store (S3, HDFS). Kafka Streams uses a local RocksDB state store backed by a changelog topic in Kafka. Recovery means restoring from the last checkpoint and replaying events since then.
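The checkpoint-and-replay idea can be sketched in a few lines. This is a toy model, not Flink or Kafka Streams code: the `RunningSum` class, the in-memory `log` list standing in for a Kafka partition, and the method names are all assumptions made for illustration.

```python
import copy

class RunningSum:
    """Toy stateful processor: keeps a running total plus the next offset to read."""

    def __init__(self):
        self.state = {"total": 0, "offset": 0}
        self.checkpoint = copy.deepcopy(self.state)

    def process(self, log, upto):
        # Consume events from the current offset up to (not including) `upto`.
        while self.state["offset"] < upto:
            self.state["total"] += log[self.state["offset"]]
            self.state["offset"] += 1

    def save_checkpoint(self):
        # State and offset are snapshotted together, as one consistent unit.
        self.checkpoint = copy.deepcopy(self.state)

    def crash_and_restore(self):
        # In-memory state is lost; only the last snapshot survives.
        self.state = copy.deepcopy(self.checkpoint)

log = [10, 20, 30, 40, 50]   # stand-in for a replayable event log
job = RunningSum()
job.process(log, 2)          # total=30, offset=2
job.save_checkpoint()
job.process(log, 4)          # total=100, offset=4, not yet checkpointed
job.crash_and_restore()      # back to total=30, offset=2
job.process(log, len(log))   # replays 30 and 40, then reads 50
print(job.state)             # {'total': 150, 'offset': 5}
```

The important detail is that the offset is stored with the state. If the two were saved separately, a crash between the two writes could replay events that the restored total already includes.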

Resource usage is continuous. The streaming job runs 24/7. You pay for always-on compute. For low-volume streams, this is wasteful. For high-volume streams (100K+ events/sec), the continuous processing model can be more efficient than repeated batch startups.

Windowing: how streams group time

Streams process infinite data, but aggregations need boundaries. Windowing defines those boundaries.

Tumbling window: fixed-size, non-overlapping intervals.
  [00:00-01:00]: 142 events → emit count=142
  [01:00-02:00]: 189 events → emit count=189
  Use for: per-minute metrics, hourly aggregates
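A tumbling window assignment is just integer arithmetic on the event timestamp. The sketch below is framework-free Python; `tumbling_counts` and the sample timestamps (seconds) are hypothetical.

```python
from collections import Counter

def tumbling_counts(timestamps, size):
    """Count events per fixed, non-overlapping window of `size` time units."""
    counts = Counter()
    for ts in timestamps:
        window_start = ts - (ts % size)   # each event lands in exactly one window
        counts[window_start] += 1
    return dict(sorted(counts.items()))

# Windows of 60 seconds, keyed by window start.
print(tumbling_counts([5, 42, 59, 60, 61, 119, 120], 60))
# {0: 3, 60: 3, 120: 1}
```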

Sliding window: fixed-size, overlapping intervals.
  [00:00-01:00]: 142 events
  [00:30-01:30]: 167 events (overlaps by 30 min)
  Use for: moving averages, rate limiting (100 req per 60s sliding)
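With sliding windows, one event belongs to several windows at once (size divided by slide of them). A rough sketch, assuming window starts are aligned to multiples of the slide and that windows starting before time 0 are ignored; `sliding_counts` is a hypothetical helper, not a library function.

```python
from collections import Counter

def sliding_counts(timestamps, size, slide):
    """Count events per overlapping window [start, start + size), one window per slide."""
    counts = Counter()
    for ts in timestamps:
        start = ts - (ts % slide)     # latest aligned window start at or before ts
        while start > ts - size:      # walk back through every window that covers ts
            if start >= 0:            # simplification: skip windows before time 0
                counts[start] += 1
            start -= slide
    return dict(sorted(counts.items()))

# 60-minute windows sliding every 30 minutes; timestamps in minutes.
print(sliding_counts([10, 40, 70], size=60, slide=30))
# {0: 2, 30: 2, 60: 1}
```

The event at minute 40 is counted in both the [0, 60) and the [30, 90) window, which is exactly the overlap shown in the example above.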

Session window: dynamic size, gap-based.
  User clicks at 10:05, 10:06, 10:08, then silence.
  Session = [10:05-10:08] (30-min inactivity gap closes it)
  Use for: user session analytics, engagement tracking
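Session windows can be sketched as a single pass over sorted timestamps, starting a new session whenever the gap since the previous event exceeds the inactivity threshold. `sessionize` is a hypothetical helper, and treating a gap of exactly `gap` as still belonging to the same session is a choice made here, not a universal rule.

```python
def sessionize(timestamps, gap):
    """Group event times into sessions separated by more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1][1] = ts          # extend the current session
        else:
            sessions.append([ts, ts])     # inactivity gap exceeded: open a new session
    return [tuple(s) for s in sessions]

# Minutes since midnight: clicks at 10:05, 10:06, 10:08, then one more at 11:40.
print(sessionize([605, 606, 608, 700], gap=30))
# [(605, 608), (700, 700)]
```

This batch-style version sees all events up front. A streaming engine has to decide when a session is closed without knowing whether another click is still on its way, which is where watermarks come in.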

Watermarks solve the late-arrival problem. A watermark is a timestamp assertion: "I believe all events with timestamp before W have arrived." Once the watermark passes the end of a window, the window fires. An event that arrives after that point is late: it is either dropped, folded into an updated result if it falls within a configured allowed lateness, or routed to a side output for late processing. Flink allows configurable watermark strategies (bounded out-of-orderness is the most common: "events can arrive up to 5 seconds out of order").
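A simplified model of a bounded out-of-orderness watermark, for a single window and with no allowed lateness. This is not Flink's API: `split_late`, its parameters, and the sample event times are assumptions for illustration. The watermark here is the largest event time seen so far minus the bound.

```python
def split_late(event_times, window_end, bound):
    """Classify events for the window [0, window_end) as on time or late.

    Events are given in arrival order. The watermark trails the highest
    event time seen so far by `bound`.
    """
    max_seen = float("-inf")
    on_time, late = [], []
    for t in event_times:
        watermark = max_seen - bound          # watermark just before this event arrives
        if t < window_end:                    # event belongs to the window of interest
            if watermark >= window_end:
                late.append(t)                # window already fired: late event
            else:
                on_time.append(t)
        max_seen = max(max_seen, t)           # events for later windows still advance time
    return on_time, late

# Event times in seconds, in arrival order; window ends at 60, bound is 5 seconds.
print(split_late([50, 58, 63, 57, 66, 59], window_end=60, bound=5))
# ([50, 58, 57], [59])
```

The event at 57 arrives out of order but is still accepted, because the watermark (63 - 5 = 58) has not yet reached 60. The event at 59 arrives after an event at 66 pushed the watermark to 61, so the window has already fired and 59 counts as late.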

Head-to-Head Comparison

Dimension	Batch	Stream	Verdict
Freshness	Minutes to hours old	Seconds to sub-second	Stream
Fault tolerance	Re-run the job; input is immutable	Checkpoint/restore; state recovery is complex	Batch (simpler)
State management	Stateless (each run is independent)	Stateful (running aggregates in RocksDB, etc.)	Batch (simpler)
Debugging	Deterministic (same input, same output)	Non-deterministic (timing, ordering varies)	Batch
Resource cost	Pay per job (bursty, can use spot instances)	Pay 24/7 (always-on compute)	Batch (usually cheaper)
Throughput	Very high (scan entire partitions)	High but limited by per-event overhead	Batch
Late data handling	Not a concern (dataset is complete)	Watermarks, allowed lateness, side outputs	Batch
Exactly-once	Natural (idempotent re-run)	Requires transactions or idempotent sinks	Batch
Operational complexity	Scheduler + job monitoring	Stateful service + checkpointing + offset management	Batch
Reprocessing	Re-run job on historical partition	Reset Kafka offsets and replay (only as far back as topic retention allows)	Batch (simpler)

The fundamental tension: data freshness vs. operational simplicity. Streaming gives you real-time results at the cost of managing state, checkpoints, watermarks, and always-on infrastructure.
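To make the exactly-once row in the table concrete, here is a sketch of the idempotent-sink approach: the stream delivers some events twice after a restart, and the sink absorbs the duplicates by upserting on a unique event id. The `IdempotentSink` class and the `id` and `amount` fields are hypothetical.

```python
class IdempotentSink:
    """Toy sink that upserts by event id, so replayed events do not double count."""

    def __init__(self):
        self.rows = {}

    def write(self, event):
        self.rows[event["id"]] = event["amount"]   # same id overwrites the same row

    def total(self):
        return sum(self.rows.values())

events = [
    {"id": "a", "amount": 10},
    {"id": "b", "amount": 20},
    {"id": "c", "amount": 5},
]

sink = IdempotentSink()
for event in events[:2]:      # first attempt: two events written, then a crash
    sink.write(event)
for event in events:          # restart replays from the beginning (at-least-once)
    sink.write(event)

print(sink.total())           # 35
```

An append-only sink would report 65 here. With batch you get this property from overwriting a partition; with streaming you have to design it into every sink.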

When Batch Wins

Choose batch when freshness is not the bottleneck and simplicity matters. The majority of data workloads fall into this category.

Data warehouse loading (ETL/ELT). Nightly or hourly data loads into Snowflake, BigQuery, or Redshift. The analysts run queries on yesterday's data. Streaming the data in real time adds complexity without meaningful value because the analysts' queries are not real-time anyway.

ML model training. Retraining a recommendation model on the last 90 days of user interactions. The model does not change per-event. A weekly Spark job on S3 data is the right tool.

