Batch vs Streaming — Learn Apache Beam

1. Introduction

Data processing is generally split into two categories: batch and streaming. Batch processing handles a fixed size of data that has already been gathered and stored (bounded data). Streaming processing continuously ingest and processes data as it is generated in real-time (unbounded data).

Apache Beam bridges this divide by using a unified programming model where the exact same processing logic can handle both bounded and unbounded data sources.

2. Why This Concept Exists

Before Apache Beam, developers had to choose between two isolated architectures. For batch jobs (e.g., generating daily or weekly financial summaries), they would build Spark or MapReduce pipelines. For streaming jobs (e.g., live fraud detection or system monitoring logs), they would write Storm or Flink pipelines. This duplication required managing separate deployment mechanisms, libraries, and languages.

Beam unifies these worlds. By treating batch processing as a special subset of stream processing (i.e., streaming over a stream that has a definite end), Beam permits writing a single pipeline that handles both.

3. Key Terminology

Bounded Data: Data of a finite, static size (e.g., a database export, a CSV file, a fixed list).
Unbounded Data: An infinite stream of data that constantly updates (e.g., sensor readings, clickstreams, transaction streams).
Latency: The time taken for a single data record to traverse the pipeline.
Throughput: The volume of data processed by the pipeline in a given unit of time.

4. How It Works

Under the Hood: Beam represents data streams as PCollection objects. A PCollection can be either bounded or unbounded.
Logical Execution: In a batch pipeline, the pipeline reads all elements, processes them, and outputs the result.
Streaming Execution: In a streaming pipeline, because the stream never ends, Beam uses Windows (time intervals) to divide the infinite collection into finite chunks before performing aggregations.

5. Visual Diagram

Batch (Bounded):

Database Export
Static Data

➔

Data Block
Finite records

➔

Process
Batch run (Ends)

➔

Report File
Static output

Streaming (Unbounded):

IoT / Clickstream
Continuous Stream

➔

Events [1, 2, 3...]
Infinite elements

➔

Process (Windowed)
Aggregates (Infinite)

➔

Real-time DB
Dashboard updates

6. Code Example

The following code compares a batch execution mapping function to a streaming-ready windowed pipeline execution:

python

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows

# Business logic is decoupled from execution style
def clean_record(row):
    return row.strip().upper()

# 1. Batch Execution Pipeline
with beam.Pipeline() as batch_pipeline:
    (batch_pipeline
     | "ReadStaticData" >> beam.Create(["user_a", "user_b", "user_c"])
     | "BatchClean" >> beam.Map(clean_record)
     | "WriteBatch" >> beam.Map(print))

# 2. Streaming Execution Pipeline (Logical Concept with Windows)
with beam.Pipeline() as streaming_pipeline:
    (streaming_pipeline
     | "ReadMockStream" >> beam.Create([
            ("user_a", 1688169280), # (element, timestamp)
            ("user_b", 1688169340),
            ("user_c", 1688169400)
       ])
     | "ApplyTimestamps" >> beam.Map(lambda x: beam.window.TimestampedValue(x[0], x[1]))
     | "WindowIntoMinutes" >> beam.WindowInto(FixedWindows(60))
     | "StreamClean" >> beam.Map(clean_record)
     | "WriteStream" >> beam.Map(print))

7. Code Explanation

The business transformation method clean_record is completely agnostic of whether it is processing batch files or stream elements.
In the batch pipeline, the elements are read directly, cleaned, and terminated.
In the streaming pipeline, because the source is continuous, we use TimestampedValue to associate event times, apply FixedWindows(60) to divide data into 1-minute blocks, and clean the elements safely within those windows.

8. Real Production Example

In modern e-commerce architectures, purchase transactions are processed in real-time. A unified Beam pipeline processes incoming sales streams (Pub/Sub) to update a live warehouse stock dashboard (Streaming), while the exact same code runs every night over the day's total transaction database logs (Batch) to perform financial reconciliations.

9. Common Mistakes

Grouping Unbounded Data Without Windows: Attempting to run a global aggregation transform (like Combine or GroupByKey) on an unbounded stream without setting up windowing boundaries. This will cause the runner to run out of memory or stall forever because it waits for the stream to end before calculating the result.
Ignoring Late-Arriving Data: Assuming that streaming data always arrives in order. Network lags can delay messages, making them arrive after a window closes. You must set allowed lateness rules to handle these scenarios.

10. Interview Perspective

Question: What is the primary difference in how Flink or Spark runners treat bounded and unbounded PCollections?
Answer: For bounded collections, the runner reads the entire input and runs optimizations based on size. For unbounded collections, the runner keeps workers active permanently, executing operations continuously and tracking progress via watermarks.
Question: Can you reuse a batch transform in a streaming pipeline?
Answer: Yes, Beam’s primary design philosophy is that your custom transforms (PTransform or functions passed to Map/Filter/ParDo) remain identical regardless of the boundary limits of the collection.

11. Best Practices

Separate your core data transformations from the pipeline IO sources so that they are easily reusable in both batch and streaming pipelines.
Always test streaming transforms with custom watermarks using TestStream before deploying to clusters.

12. Summary

Batch handles static datasets; streaming processes continuous data streams.
Beam uses PCollection to represent both bounded and unbounded datasets.
Aggregations on unbounded streams require a windowing definition.
The unified model guarantees code reuse across batch and stream pipelines.

13. Interactive Challenges

Challenge 1: Identify Collection Boundedness (Beginner)

Explain how Apache Beam represents bounded and unbounded collections in memory using the core API classes.

Challenge 2: Apply a Fixed Time Window (Intermediate)

Write a pipeline transformation segment that applies a 5-minute fixed window to an incoming streaming PCollection of log lines.

Challenge 3: Combine Streaming with Allowed Lateness (Advanced)

Configure a streaming window transformation that groups sensor readings into 10-minute fixed windows but allows late data to arrive up to 2 minutes after the window closes.

14. Related Content

Previous LessonWhy Apache Beam?

Next LessonBeam Programming Model