Batch vs. stream processing
When to process data in large batches vs. a continuous stream, covering latency requirements, state management, fault tolerance, and the tradeoffs between Lambda and stream-only approaches.
TL;DR
| Scenario | Batch processing | Stream processing |
|---|---|---|
| Daily sales report | Run overnight, full dataset, exact numbers | Overkill; latency does not matter |
| Fraud detection | Too slow; fraud must be caught in milliseconds | Process every transaction in real time |
| Recommendation model training | Retrain weekly/daily on historical data | Not needed; model does not change per-event |
| Live dashboard (orders/min) | Minutes-to-hours stale | Real-time counters, sub-second freshness |
| Data warehouse loading | Standard ETL/ELT, scheduled loads | CDC stream for near-real-time warehouse |
Default instinct: start with batch. Batch is simpler to build, test, debug, and recover from failure. Move to streaming only when latency requirements demand it. The majority of data pipelines in the world are batch, and that is fine.
The Framing
Your analytics team runs a nightly batch job. At midnight, a Spark job reads the day's events from S3, aggregates revenue by product category, and writes the result to the data warehouse. The marketing team sees yesterday's numbers at 8 AM. This worked for years.
Then the business launches a flash sale. The CEO asks: "How are flash sale numbers looking?" The answer: "We will know tomorrow morning." This is not acceptable. The business wants to see revenue accumulating in real time, not 8 to 32 hours later (a purchase made just after midnight is not reported until 8 AM the following day).
A streaming pipeline would solve this: Kafka ingests every purchase event, Flink computes a running sum, and the dashboard updates every second. But now you need to handle late-arriving events, out-of-order data, exactly-once semantics, stateful processing, and checkpointing. The batch job was 200 lines of Spark. The streaming pipeline is 2,000 lines of Flink with a separate state management layer.
I have seen multiple teams rewrite a working batch pipeline as a streaming pipeline only to discover the operational complexity was not worth the freshness improvement. The question is not "can we stream this?" but "does the business need sub-minute freshness for this data?"
The honest answer: most data pipelines should be batch. Streaming is for when freshness is a hard business or operational requirement, not a nice-to-have.
How Each Works
Batch: process a bounded dataset
Batch processing operates on a fixed, bounded dataset. You know the start and end of the data before processing begins. The job reads all the data, transforms it, and writes the result.
Typical batch pipeline:
1. Schedule trigger (cron, Airflow DAG) fires at midnight
2. Read input: SELECT * FROM events WHERE date = '2024-03-15'
(or scan S3 partition s3://lake/events/2024/03/15/)
3. Transform: filter, aggregate, join, enrich
4. Write output: INSERT INTO warehouse.daily_revenue ...
5. Job completes. Success or failure is binary.
Tools: Apache Spark, Hadoop MapReduce, dbt, Airflow, AWS Glue
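A minimal sketch of steps 2 to 4 in plain Python, to make the "bounded dataset" idea concrete. The `EVENTS` list is a hypothetical stand-in for one S3 partition, and `run_daily_revenue` is an illustrative name, not part of Spark or any other tool listed above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical stand-in for the event store; a real job would scan an S3
# partition or query a table instead.
EVENTS = [
    {"date": date(2024, 3, 15), "category": "shoes", "amount": 40.0},
    {"date": date(2024, 3, 15), "category": "shoes", "amount": 60.0},
    {"date": date(2024, 3, 15), "category": "hats", "amount": 15.0},
    {"date": date(2024, 3, 16), "category": "hats", "amount": 99.0},
]

def run_daily_revenue(events, run_date):
    """Read the bounded input for run_date, aggregate it, and return the rows to write."""
    # Step 2: read input. The dataset is fixed before processing starts.
    partition = [e for e in events if e["date"] == run_date]
    # Step 3: transform. Here: total revenue per product category.
    totals = defaultdict(float)
    for event in partition:
        totals[event["category"]] += event["amount"]
    # Step 4: the caller writes these rows to the warehouse.
    return dict(totals)

daily_revenue = run_daily_revenue(EVENTS, date(2024, 3, 15))
print(daily_revenue)  # {'shoes': 100.0, 'hats': 15.0}
```

Because the function depends only on its input partition, calling it again for the same date returns the same rows.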
Fault tolerance is simple: if the job fails, re-run it on the same input. The input is immutable (it is yesterday's data sitting in S3), so a re-run computes the same result. The re-run is idempotent as long as the write step replaces the target partition instead of appending to it.
Resource usage is bursty. The cluster spins up, processes the data, and shuts down. You pay for compute only during the job window. A Spark cluster that exists only for the duration of the job typically costs far less than always-on streaming infrastructure.
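One caveat worth seeing in code: a re-run only leaves the warehouse unchanged if the write replaces the partition. The sketch below compares the two write styles using plain dictionaries as a stand-in for warehouse tables. `write_overwrite` and `write_append` are hypothetical helpers, not a warehouse API.

```python
def write_overwrite(table, partition_key, rows):
    """Replace the whole partition, so a re-run leaves the same final state."""
    table[partition_key] = list(rows)

def write_append(table, partition_key, rows):
    """Append to the partition, so a re-run duplicates rows."""
    table.setdefault(partition_key, []).extend(rows)

rows = [("shoes", 100.0), ("hats", 15.0)]

overwrite_table, append_table = {}, {}
for _ in range(2):  # the job runs twice: the first attempt plus a retry
    write_overwrite(overwrite_table, "2024-03-15", rows)
    write_append(append_table, "2024-03-15", rows)

print(len(overwrite_table["2024-03-15"]))  # 2
print(len(append_table["2024-03-15"]))     # 4
```

The append variant double-counts revenue after a retry, which is why batch jobs usually overwrite a whole partition or delete it before inserting.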
The limitation is freshness. The output is only as fresh as the last completed batch. If you run hourly batches, the data is up to 1 hour stale. If daily, up to 24 hours stale.
Stream: process an unbounded dataset
Stream processing operates on a continuous, unbounded flow of events. There is no "end" to the data. Events arrive indefinitely, and the processor maintains running state.
Typical stream pipeline:
1. Consumer reads from Kafka topic (continuous)
2. For each event: deserialize, validate, transform
3. Update running state (counters, windows, aggregations)
4. Emit result downstream (another Kafka topic, database, dashboard)
5. Checkpoint state periodically for fault tolerance
Tools: Apache Flink, Kafka Streams, Spark Structured Streaming, Amazon Kinesis
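The core loop (steps 1 to 4) can be sketched without any streaming framework. Here `event_source` is a hypothetical generator standing in for a Kafka consumer; a real source would never finish. Checkpointing (step 5) is left out on purpose.

```python
from collections import Counter

def event_source():
    """Hypothetical stand-in for a Kafka consumer."""
    yield {"category": "shoes", "amount": 40.0}
    yield {"category": "hats", "amount": 15.0}
    yield {"category": "shoes"}  # malformed: missing amount
    yield {"category": "shoes", "amount": 60.0}

def process_stream(source):
    """Yield the updated running revenue for a category after each valid event."""
    state = Counter()  # running state that outlives any single event
    for event in source:
        # Step 2: validate, and skip events that cannot be processed.
        if "category" not in event or "amount" not in event:
            continue
        # Step 3: update running state.
        state[event["category"]] += event["amount"]
        # Step 4: emit the latest value downstream.
        yield event["category"], state[event["category"]]

emitted = list(process_stream(event_source()))
print(emitted)  # [('shoes', 40.0), ('hats', 15.0), ('shoes', 100.0)]
```

Unlike the batch job, the result is never "done": every event produces a new intermediate answer, and `state` has to survive for as long as the job runs.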
Fault tolerance is complex. The processor maintains in-memory state (running counts, window aggregations). If it crashes, that state is lost unless checkpointed. Flink uses periodic snapshots to a durable store (S3, HDFS). Kafka Streams uses a local RocksDB state store backed by a changelog topic in Kafka. Recovery means restoring from the last checkpoint and replaying events since then.
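To illustrate "restore from the last checkpoint and replay", here is a toy processor in plain Python. It is a sketch of the general idea only, not how Flink or Kafka Streams implement it: `CountingProcessor` is a hypothetical class, the `log` list stands in for a Kafka partition, and the `checkpoint` attribute stands in for a durable store.

```python
import copy

class CountingProcessor:
    """Counts events per key and snapshots (offset, state) every few events."""

    def __init__(self, checkpoint_every=3):
        self.checkpoint_every = checkpoint_every
        self.state = {}
        self.offset = 0            # index of the next event to read
        self.checkpoint = (0, {})  # stand-in for a durable store

    def run(self, log, crash_at=None):
        while self.offset < len(log):
            if crash_at is not None and self.offset == crash_at:
                raise RuntimeError("simulated crash")
            key = log[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                # Snapshot the read position and the state together.
                self.checkpoint = (self.offset, copy.deepcopy(self.state))

    def restore(self):
        """Discard in-memory state and resume from the last checkpoint."""
        self.offset, saved = self.checkpoint
        self.state = copy.deepcopy(saved)

log = ["a", "b", "a", "c", "a", "b", "c"]
proc = CountingProcessor(checkpoint_every=3)
try:
    proc.run(log, crash_at=5)
except RuntimeError:
    proc.restore()  # back to offset 3; the events at offsets 3 and 4 are replayed
proc.run(log)
print(proc.state)  # {'a': 3, 'b': 2, 'c': 2}
```

The final counts are correct only because the offset and the state are saved together. Restoring one without the other would either skip events or count the replayed ones twice.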
Resource usage is continuous. The streaming job runs 24/7. You pay for always-on compute. For low-volume streams, this is wasteful. For high-volume streams (100K+ events/sec), the continuous processing model is more efficient than repeated batch startups.
Windowing: how streams group time
Streams process infinite data, but aggregations need boundaries. Windowing defines those boundaries.
Tumbling window: fixed-size, non-overlapping intervals.
[00:00-01:00]: 142 events → emit count=142
[01:00-02:00]: 189 events → emit count=189
Use for: per-minute metrics, hourly aggregates
Sliding window: fixed-size, overlapping intervals.
[00:00-01:00]: 142 events
[00:30-01:30]: 167 events (overlaps by 30 min)
Use for: moving averages, rate limiting (100 req per 60s sliding)
Session window: dynamic size, gap-based.
User clicks at 10:05, 10:06, 10:08, then silence.
Session = [10:05-10:08] (30-min inactivity gap closes it)
Use for: user session analytics, engagement tracking
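The three window types differ only in how an event timestamp maps to window boundaries. A rough sketch in plain Python, assuming timestamps are integer minutes since midnight and windows are half-open `[start, end)` intervals. The function names are illustrative and do not come from Flink or Kafka Streams.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window that contains ts."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """Return every [start, end) window of length size, stepping by slide, that contains ts."""
    windows = []
    start = ts - ts % slide  # latest window start at or before ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

def session_windows(timestamps, gap):
    """Group timestamps into sessions; a quiet period of at least gap closes a session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1][1] = ts       # extend the current session
        else:
            sessions.append([ts, ts])  # start a new session
    return [tuple(s) for s in sessions]

# An event at 01:15 (minute 75) with 60-minute windows:
print(tumbling_window(75, 60))      # (60, 120)
print(sliding_windows(75, 60, 30))  # [(30, 90), (60, 120)]

# Clicks at 10:05, 10:06, 10:08, then one at 11:00, with a 30-minute gap:
print(session_windows([605, 606, 608, 660], 30))  # [(605, 608), (660, 660)]
```

A tumbling window assigns each event to exactly one window, a sliding window assigns it to `size / slide` windows, and a session window cannot be computed from a single timestamp at all because its boundaries depend on neighboring events.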
Watermarks address the late-arrival problem. A watermark is a timestamp assertion: "I believe all events with timestamp before W have arrived." An event that arrives with a timestamp already behind the current watermark is late, and is either dropped or routed to a side output for late processing. Flink allows configurable watermark strategies; bounded out-of-orderness is the most common ("events can be up to 5 seconds late").
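A simplified sketch of the bounded out-of-orderness idea in plain Python: the watermark trails the highest timestamp seen so far by a fixed delay. This is an illustration, not Flink's implementation (Flink evaluates lateness against window boundaries rather than per event), and `BoundedOutOfOrdernessWatermark` here is a hypothetical class written for this example.

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark = highest event timestamp seen so far minus the allowed delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_seen = None

    def observe(self, ts):
        """Return True if the event is on time, False if it is behind the watermark."""
        on_time = self.max_seen is None or ts >= self.max_seen - self.max_delay
        if self.max_seen is None or ts > self.max_seen:
            self.max_seen = ts
        return on_time

wm = BoundedOutOfOrdernessWatermark(max_delay=5)
arrivals = [100, 103, 101, 110, 106, 104]  # event timestamps, in arrival order

late = []
for ts in arrivals:
    if not wm.observe(ts):
        late.append(ts)  # candidate for dropping or a side output

print(late)  # [104]
```

Events 101 and 106 arrive out of order but within the 5-second bound, so they are still accepted. Event 104 arrives after 110 has pushed the watermark to 105, so it is late.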
Head-to-Head Comparison
| Dimension | Batch | Stream | Verdict |
|---|---|---|---|
| Freshness | Minutes to hours old | Seconds to sub-second | Stream |
| Fault tolerance | Re-run the job; input is immutable | Checkpoint/restore; state recovery is complex | Batch (simpler) |
| State management | Stateless (each run is independent) | Stateful (running aggregates in RocksDB, etc.) | Batch (simpler) |
| Debugging | Deterministic (same input, same output) | Non-deterministic (timing and ordering vary between runs) | Batch |
| Resource cost | Pay per job (bursty, can use spot instances) | Pay 24/7 (always-on compute) | Batch (usually cheaper) |
| Throughput | Very high (scan entire partitions) | High but limited by per-event overhead | Batch |
| Late data handling | Not a concern (dataset is complete) | Watermarks, allowed lateness, side outputs | Batch |
| Exactly-once | Natural (idempotent re-run) | Requires transactions or idempotent sinks | Batch |
| Operational complexity | Scheduler + job monitoring | Stateful service + checkpointing + offset management | Batch |
| Reprocessing | Re-run job on historical partition | Reset Kafka offsets, replay from beginning | Batch (simpler) |
The fundamental tension: data freshness vs. operational simplicity. Streaming gives you real-time results at the cost of managing state, checkpoints, watermarks, and always-on infrastructure.
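The "idempotent sinks" entry in the Exactly-once row deserves a concrete picture. After a restore, replayed events reach the sink a second time, and the sink has to recognize them. A minimal sketch, assuming every event carries a unique ID; `IdempotentSink` is a hypothetical class, and a production sink would keep the seen IDs in durable storage rather than in memory.

```python
class IdempotentSink:
    """Applies each event at most once by remembering processed event IDs."""

    def __init__(self):
        self.total = 0.0
        self.seen_ids = set()

    def write(self, event_id, amount):
        if event_id in self.seen_ids:
            return False  # duplicate delivery after a replay; ignore it
        self.seen_ids.add(event_id)
        self.total += amount
        return True

sink = IdempotentSink()
# "e2" is delivered twice, as it would be after a checkpoint restore.
deliveries = [("e1", 40.0), ("e2", 15.0), ("e2", 15.0), ("e3", 60.0)]
applied = [sink.write(event_id, amount) for event_id, amount in deliveries]

print(applied)     # [True, True, False, True]
print(sink.total)  # 115.0
```

Delivery is still at-least-once. The result looks exactly-once only because the duplicate write has no effect.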
When Batch Wins
Choose batch when freshness is not the bottleneck and simplicity matters. The majority of data workloads fall into this category.
Data warehouse loading (ETL/ELT). Nightly or hourly data loads into Snowflake, BigQuery, or Redshift. The analysts run queries on yesterday's data. Streaming the data in real time adds complexity without meaningful value because the analysts' queries are not real-time anyway.
ML model training. Retraining a recommendation model on the last 90 days of user interactions. The model does not change per-event. A weekly Spark job on S3 data is the right tool.