Batch vs. stream processing
When to process data in large batches vs. a continuous stream, covering latency requirements, state management, fault tolerance, and the tradeoffs between Lambda and stream-only approaches.
TL;DR
| Scenario | Batch processing | Stream processing |
|---|---|---|
| Daily sales report | Run overnight, full dataset, exact numbers | Overkill; latency does not matter |
| Fraud detection | Too slow; fraud must be caught in milliseconds | Process every transaction in real time |
| Recommendation model training | Retrain weekly/daily on historical data | Not needed; model does not change per-event |
| Live dashboard (orders/min) | Minutes-to-hours stale | Real-time counters, sub-second freshness |
| Data warehouse loading | Standard ETL/ELT, scheduled loads | CDC stream for near-real-time warehouse |
Default instinct: start with batch. Batch is simpler to build, test, debug, and recover from failure. Move to streaming only when latency requirements demand it. The majority of data pipelines in the world are batch, and that is fine.
The Framing
Your analytics team runs a nightly batch job. At midnight, a Spark job reads the day's events from S3, aggregates revenue by product category, and writes the result to the data warehouse. The marketing team sees yesterday's numbers at 8 AM. This worked for years.
Then the business launches a flash sale. The CEO asks: "How are flash sale numbers looking?" The answer: "We will know tomorrow morning." This is not acceptable. The business wants to see revenue accumulating in real time, not 8 to 32 hours later (an event from just after midnight is not visible until 8 AM the following day).
A streaming pipeline would solve this: Kafka ingests every purchase event, Flink computes a running sum, and the dashboard updates every second. But now you need to handle late-arriving events, out-of-order data, exactly-once semantics, stateful processing, and checkpointing. The batch job was 200 lines of Spark. The streaming pipeline is 2,000 lines of Flink with a separate state management layer.
I have seen multiple teams rewrite a working batch pipeline as a streaming pipeline only to discover the operational complexity was not worth the freshness improvement. The question is not "can we stream this?" but "does the business need sub-minute freshness for this data?"
The honest answer: most data pipelines should be batch. Streaming is for when freshness is a hard business or operational requirement, not a nice-to-have.
How Each Works
Batch: process a bounded dataset
Batch processing operates on a fixed, bounded dataset. You know the start and end of the data before processing begins. The job reads all the data, transforms it, and writes the result.
Typical batch pipeline:
1. Schedule trigger (cron, Airflow DAG) fires at midnight
2. Read input: SELECT * FROM events WHERE date = '2024-03-15'
(or scan S3 partition s3://lake/events/2024/03/15/)
3. Transform: filter, aggregate, join, enrich
4. Write output: INSERT INTO warehouse.daily_revenue ...
5. Job completes. Success or failure is binary.
Tools: Apache Spark, Hadoop MapReduce, dbt, Airflow, AWS Glue
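The steps above can be sketched in plain Python. This is a minimal illustration, not a real Spark job: the hypothetical `events` list stands in for the day's S3 partition, and the aggregation logic is what a `GROUP BY category` would compute.

```python
from collections import defaultdict

# Hypothetical day's events -- in production this would be a scan of a
# date partition in S3 or a SQL query over the events table.
events = [
    {"category": "shoes", "revenue": 120.0},
    {"category": "shoes", "revenue": 80.0},
    {"category": "bags", "revenue": 50.0},
]

def daily_revenue(day_events):
    """Aggregate revenue by category over a bounded, complete dataset."""
    totals = defaultdict(float)
    for e in day_events:
        totals[e["category"]] += e["revenue"]
    return dict(totals)

# The input is immutable, so re-running on the same partition produces
# the same output: the job is idempotent by construction.
result = daily_revenue(events)
```

The key property to notice is that the function is pure over its bounded input: failure recovery is just "run it again."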
Fault tolerance is simple: if the job fails, re-run it on the same input. The input is immutable (it is yesterday's data sitting in S3). Re-running produces the same output. This is idempotent by nature.
Resource usage is bursty. The cluster spins up, processes the data, and shuts down. You pay for compute only during the job window. Spark on-demand clusters cost a fraction of always-on streaming infrastructure.
The limitation is freshness. The output is only as fresh as the last completed batch. If you run hourly batches, the data is up to 1 hour stale, plus the job's runtime. If daily, up to 24 hours stale.
Stream: process an unbounded dataset
Stream processing operates on a continuous, unbounded flow of events. There is no "end" to the data. Events arrive indefinitely, and the processor maintains running state.
Typical stream pipeline:
1. Consumer reads from Kafka topic (continuous)
2. For each event: deserialize, validate, transform
3. Update running state (counters, windows, aggregations)
4. Emit result downstream (another Kafka topic, database, dashboard)
5. Checkpoint state periodically for fault tolerance
Tools: Apache Flink, Kafka Streams, Spark Structured Streaming, AWS Kinesis
Fault tolerance is complex. The processor maintains in-memory state (running counts, window aggregations). If it crashes, that state is lost unless checkpointed. Flink uses periodic snapshots to a durable store (S3, HDFS). Kafka Streams uses a local RocksDB state store backed by a changelog topic in Kafka. Recovery means restoring from the last checkpoint and replaying events since then.
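The checkpoint-and-replay recovery cycle can be sketched in plain Python. This is an illustration of the mechanism, not a real Flink or Kafka Streams API: `StreamCounter` and its method names are invented, the `log` list stands in for a Kafka topic, and the in-memory `_checkpoint` string stands in for a snapshot written to S3 or HDFS.

```python
import json

class StreamCounter:
    """Sketch of checkpointed streaming state (illustrative names only)."""

    def __init__(self):
        self.counts = {}          # running in-memory state, lost on crash
        self.offset = 0           # position in the event log
        self._checkpoint = None   # durable snapshot (S3/HDFS in real systems)

    def process(self, event):
        self.counts[event] = self.counts.get(event, 0) + 1
        self.offset += 1

    def checkpoint(self):
        # Snapshot the state together with the offset it corresponds to.
        self._checkpoint = json.dumps(
            {"counts": self.counts, "offset": self.offset})

    def recover(self, log):
        # Restore the last snapshot, then replay events after its offset.
        snap = json.loads(self._checkpoint)
        self.counts, self.offset = snap["counts"], snap["offset"]
        for event in log[self.offset:]:
            self.process(event)

# Demo: checkpoint mid-stream, crash, then recover and replay.
log = ["a", "b", "a", "b"]
c = StreamCounter()
c.process(log[0]); c.process(log[1])
c.checkpoint()
c.process(log[2]); c.process(log[3])
c.counts, c.offset = {}, 0   # simulated crash: in-memory state lost
c.recover(log)
```

The important detail is that the checkpoint records both the state and the log position it reflects; replaying from that offset makes recovery deterministic.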
Resource usage is continuous. The streaming job runs 24/7. You pay for always-on compute. For low-volume streams, this is wasteful. For high-volume streams (100K+ events/sec), the continuous processing model is more efficient than repeated batch startups.
Windowing: how streams group time
Streams process infinite data, but aggregations need boundaries. Windowing defines those boundaries.
Tumbling window: fixed-size, non-overlapping intervals.
[00:00-01:00]: 142 events → emit count=142
[01:00-02:00]: 189 events → emit count=189
Use for: per-minute metrics, hourly aggregates
Sliding window: fixed-size, overlapping intervals.
[00:00-01:00]: 142 events
[00:30-01:30]: 167 events (overlaps by 30 min)
Use for: moving averages, rate limiting (100 req per 60s sliding)
Session window: dynamic size, gap-based.
User clicks at 10:05, 10:06, 10:08, then silence.
Session = [10:05-10:08] (30-min inactivity gap closes it)
Use for: user session analytics, engagement tracking
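The window assignment logic behind these three strategies is simple arithmetic. Here is a stdlib-only sketch of tumbling and session windows (timestamps are plain integers, e.g. epoch seconds; real frameworks do the same math internally):

```python
def tumbling_window(ts, size):
    """Assign a timestamp to its fixed, non-overlapping window [start, end)."""
    start = (ts // size) * size
    return (start, start + size)

def session_windows(timestamps, gap):
    """Group sorted timestamps into sessions closed by an inactivity gap."""
    sessions, current = [], [timestamps[0]]
    for ts in timestamps[1:]:
        if ts - current[-1] <= gap:
            current.append(ts)      # still within the gap: same session
        else:
            sessions.append((current[0], current[-1]))  # gap exceeded: close
            current = [ts]
    sessions.append((current[0], current[-1]))
    return sessions
```

A sliding window is the same assignment done once per overlapping interval, so each event lands in multiple windows instead of exactly one.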
Watermarks solve the late-arrival problem. A watermark is a timestamp assertion: "I believe all events with timestamp before W have arrived." Events arriving after their watermark are either dropped or routed to a side output for late processing. Flink allows configurable watermark strategies (bounded out-of-orderness is the most common: "events can be up to 5 seconds late").
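The bounded out-of-orderness strategy reduces to one line of arithmetic: the watermark trails the maximum event time seen by the allowed lateness. A minimal sketch, using integer event-time timestamps and a list as a stand-in side output:

```python
def process_with_watermark(event_times, max_out_of_orderness):
    """Route events on time or late under a bounded out-of-orderness watermark.

    The watermark asserts: all events with timestamp < watermark have arrived.
    It advances to (max event time seen) - (allowed lateness); events older
    than the current watermark go to the side output.
    """
    watermark = float("-inf")
    on_time, side_output = [], []
    for ts in event_times:
        watermark = max(watermark, ts - max_out_of_orderness)
        if ts < watermark:
            side_output.append(ts)   # late: dropped or reprocessed separately
        else:
            on_time.append(ts)
    return on_time, side_output

# With 5 units of allowed lateness, the event at t=2 arrives after the
# watermark has advanced past it and is routed to the side output.
on_time, late = process_with_watermark([10, 12, 7, 13, 2], 5)
```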
Head-to-Head Comparison
| Dimension | Batch | Stream | Verdict |
|---|---|---|---|
| Freshness | Minutes to hours old | Seconds to sub-second | Stream |
| Fault tolerance | Re-run the job; input is immutable | Checkpoint/restore; state recovery is complex | Batch (simpler) |
| State management | Stateless (each run is independent) | Stateful (running aggregates in RocksDB, etc.) | Batch (simpler) |
| Debugging | Deterministic (same input, same output) | Non-deterministic (timing and ordering vary) | Batch |
| Resource cost | Pay per job (bursty, can use spot instances) | Pay 24/7 (always-on compute) | Batch (usually cheaper) |
| Throughput | Very high (scan entire partitions) | High but limited by per-event overhead | Batch |
| Late data handling | Not a concern (dataset is complete) | Watermarks, allowed lateness, side outputs | Batch |
| Exactly-once | Natural (idempotent re-run) | Requires transactions or idempotent sinks | Batch |
| Operational complexity | Scheduler + job monitoring | Stateful service + checkpointing + offset management | Batch |
| Reprocessing | Re-run job on historical partition | Reset Kafka offsets, replay from beginning | Batch (simpler) |
The fundamental tension: data freshness vs. operational simplicity. Streaming gives you real-time results at the cost of managing state, checkpoints, watermarks, and always-on infrastructure.
When Batch Wins
Choose batch when freshness is not the bottleneck and simplicity matters. The majority of data workloads fall into this category.
Data warehouse loading (ETL/ELT). Nightly or hourly data loads into Snowflake, BigQuery, or Redshift. The analysts run queries on yesterday's data. Streaming the data in real time adds complexity without meaningful value because the analysts' queries are not real-time anyway.
ML model training. Retraining a recommendation model on the last 90 days of user interactions. The model does not change per-event. A weekly Spark job on S3 data is the right tool.
Historical backfills and migrations. Re-processing a year of data to populate a new table, fix a data quality issue, or build a new index. Batch jobs can process terabytes of historical data efficiently. Streaming pipelines are designed for the present, not the past.
Financial reconciliation. End-of-day settlement, monthly revenue rollups, quarterly reports. These require processing the complete dataset for a period, not a running approximation. Batch ensures you have every transaction before reporting.
When Stream Wins
Choose streaming when the business logic requires acting on data within seconds of its arrival. These are the workloads where "process it tomorrow" is not an option.
Fraud detection. A credit card transaction must be evaluated within 100ms. If the system waits for a batch job, the fraudulent charge has already been approved. Stream processing evaluates each transaction against rules and ML models in real time, blocking suspicious transactions before they complete.
Real-time dashboards and monitoring. Operational metrics (requests/sec, error rates, latency percentiles), live event tracking during a product launch, or CEO-facing revenue dashboards that update every second. The value of the data decays rapidly; yesterday's metrics are for post-mortems, not decisions.
Alerting and anomaly detection. If a microservice's error rate spikes from 0.1% to 15%, you need to know within seconds, not after the next batch run. Streaming processors compute sliding-window error rates and fire alerts immediately.
Event-driven architectures. Systems where downstream services react to events (order placed, payment confirmed, item shipped). The event bus (Kafka) is already streaming. Processing those events with a streaming framework is a natural fit.
The Nuance
The "batch vs. stream" framing is often a false dichotomy. Most production data platforms use both, and two architectural patterns formalize how they combine.
Lambda Architecture: batch + stream in parallel
The Lambda Architecture runs a batch layer and a speed layer in parallel. The batch layer reprocesses all historical data periodically (nightly) and produces exact results. The speed layer processes real-time events and produces approximate results. A serving layer merges the two: historical data from the batch layer, plus the delta from the speed layer since the last batch run.
The problem with Lambda: you maintain two codebases (batch + stream) that must produce the same results. Every schema change, business rule update, or bug fix must be applied to both. In practice, the batch and speed layers drift out of sync, creating subtle data discrepancies that are painful to debug.
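The serving layer's merge step is conceptually simple, which makes the dual-codebase cost easy to underestimate. A sketch of the merge, with hypothetical batch and speed views keyed by product category:

```python
def serve(batch_view, speed_view):
    """Lambda serving layer: exact historical totals from the batch layer,
    plus the real-time delta accumulated since the last batch run."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Nightly batch computed exact totals; the speed layer has accumulated
# deltas since midnight, including a category the batch has not seen yet.
view = serve({"shoes": 200, "bags": 50}, {"shoes": 10, "hats": 5})
```

The merge itself is trivial; the hard part is that the logic producing `batch_view` and `speed_view` lives in two separate codebases that must stay semantically identical.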
Kappa Architecture: stream-only with replay
The Kappa Architecture simplifies Lambda by eliminating the batch layer entirely. All processing is streaming. To reprocess historical data, you reset the Kafka consumer offset to the beginning and replay all events through the same streaming pipeline.
The advantage: one codebase for both real-time and historical processing. No divergence between batch and speed layers.
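The single-codebase property can be sketched in a few lines. `KappaPipeline` is an invented name, and the `log` list stands in for a Kafka topic with long retention; the point is that reprocessing runs the exact same code path as live processing.

```python
class KappaPipeline:
    """Sketch of a Kappa-style pipeline: one code path for live and
    historical processing (illustrative, not a real framework API)."""

    def __init__(self, log):
        self.log = log       # stands in for a Kafka topic with long retention
        self.offset = 0      # consumer position
        self.state = {}      # running aggregation

    def process_one(self, event):
        self.state[event] = self.state.get(event, 0) + 1

    def run(self):
        while self.offset < len(self.log):
            self.process_one(self.log[self.offset])
            self.offset += 1

    def reprocess(self):
        # Reprocessing = reset the offset to the beginning and replay
        # through the same logic. No separate batch layer to keep in sync.
        self.offset, self.state = 0, {}
        self.run()

p = KappaPipeline(["a", "a", "b"])
p.run()
p.reprocess()   # produces identical state from the same replayed log
```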
The limitation: reprocessing 90 days of events through a streaming pipeline takes much longer than a batch scan. Kafka retention has a storage cost (keeping 90 days of events at 100K events/sec, at roughly 1 KB per event, is roughly 800TB). And some computations (like training an ML model on a year of data) are genuinely better as batch jobs.
My recommendation: use Kappa when your streaming pipeline can handle reprocessing within an acceptable time window (hours, not days). Use Lambda (or just batch) when reprocessing must scan years of historical data that is too large for Kafka retention.
For your interview: when asked about data architecture, mention both Lambda and Kappa by name. Say that Lambda is the safer default but has the dual-codebase problem. Kappa is simpler but requires long Kafka retention and fast replay.
Real-World Examples
Netflix (batch + stream, Lambda-style). Netflix processes 1.5 PB of data daily. Their batch pipeline (Spark on S3) runs nightly to compute recommendation models, content popularity rankings, and A/B test results. Their streaming pipeline (Kafka + Flink) handles real-time personalization, delivery telemetry, and error monitoring. They maintain both because recommendations need full historical context (batch) while operational metrics need sub-second freshness (stream). Netflix's data platform team has explicitly stated that they use the right tool for each use case rather than forcing everything into one paradigm.
Uber (streaming-first, Kappa-style). Uber's pricing and surge computation requires real-time data: current ride demand, driver supply, traffic conditions, and event-driven triggers. Their system processes millions of events per second through Apache Flink, computing surge multipliers within seconds of demand changes. For historical analysis and model retraining, they replay Kafka topics through the same Flink pipelines (Kappa approach). Uber found that maintaining a single streaming codebase reduced bugs caused by batch/stream divergence by an estimated 40%.
Spotify (batch for recommendations, micro-batch for dashboards). Spotify's recommendation engine (Discover Weekly, Daily Mix) trains on 600TB of listening history using nightly Spark batch jobs. Real-time listening counts and trending charts use Spark Structured Streaming with micro-batches (10-30 second intervals). They chose micro-batch over true streaming because the 30-second freshness is sufficient for trending charts, and Structured Streaming integrates natively with their existing Spark infrastructure. True event-at-a-time streaming (Flink) was considered but rejected as unnecessary complexity for their freshness requirements.
How This Shows Up in Interviews
This trade-off appears when designing any data-intensive system: analytics pipelines, dashboards, recommendation engines, fraud detection, or event processing. The interviewer expects you to match the processing model to the freshness requirement.
What they are testing: Do you default to streaming because it sounds impressive, or do you choose batch when it fits? Can you articulate the operational cost of streaming? Do you know Lambda and Kappa architectures and when each applies?
Depth expected at senior level:
- Name specific tools for each paradigm (Spark/Airflow for batch, Flink/Kafka Streams for stream)
- Explain windowing strategies (tumbling, sliding, session) and when each applies
- Know that exactly-once in streaming means "exactly-once effect," not "processed only once"
- Understand watermarks and the late-data problem
- Articulate the cost difference between batch and streaming infrastructure
| Interviewer asks | Strong answer |
|---|---|
| "Should this pipeline be batch or streaming?" | "What's the freshness requirement? If the business needs data within seconds, streaming. If minutes-to-hours is fine, batch is simpler and 10-14x cheaper to operate. Most pipelines should be batch." |
| "How would you handle late-arriving events?" | "In streaming, I would use watermarks with bounded out-of-orderness (e.g., events can be up to 30 seconds late). Late events beyond the watermark go to a side output for reprocessing. In batch, this is not a problem because the full dataset is available." |
| "What's the Lambda architecture?" | "Batch and speed layers running in parallel. Batch reprocesses all historical data for exact results. Speed layer processes real-time events for approximate recent results. A serving layer merges both. The drawback: two codebases that must produce identical results." |
| "Why not just stream everything?" | "Cost and complexity. A Flink cluster running 24/7 costs 10-14x more than a Spark job running 2 hours nightly for the same data volume. Plus, stateful streaming requires checkpointing, watermarks, and exactly-once semantics. For pipelines where hourly freshness is acceptable, this complexity is unjustified." |
Interview tip: name the freshness requirement explicitly
When designing a data pipeline, always ask: "How fresh does the output need to be?" Then map the answer: sub-second = true streaming (Flink), seconds-to-minutes = micro-batch (Spark Structured Streaming), minutes-to-hours = batch (Spark/dbt). This shows you choose tools based on requirements, not hype.
Quick Recap
- Batch processing operates on bounded datasets at scheduled intervals. It is simpler to build, test, debug, and recover from failure. Choose it when freshness of minutes-to-hours is acceptable.
- Stream processing operates on unbounded, continuous data with sub-second freshness. It requires managing state, checkpoints, watermarks, and always-on infrastructure. Choose it when the business cannot wait for the next batch run.
- Lambda Architecture runs batch and stream in parallel, merging results in a serving layer. It gives both accuracy (batch) and freshness (stream) at the cost of maintaining two codebases that must produce identical results.
- Kappa Architecture uses streaming only, replaying historical events from Kafka for reprocessing. It eliminates the dual-codebase problem but requires long Kafka retention and fast replay capability.
- Streaming infrastructure costs 10-14x more than equivalent batch infrastructure for the same data volume. The cost is justified only when sub-minute freshness is a hard business requirement.
- In interviews, always ask "how fresh does the output need to be?" before choosing batch or stream. Defaulting to streaming without justification signals hype-driven architecture, not pragmatic engineering.
Related Trade-offs
- Sync vs. async communication for the request-level version of this trade-off
- Latency vs. throughput for how batching trades individual latency for aggregate throughput
- Event-driven architecture for the event processing patterns that streaming pipelines implement
- Message queues for the Kafka and messaging infrastructure that powers streaming pipelines
- Data pipelines for the broader architecture that batch and stream processing operate within