Bounded vs Unbounded — Learn Apache Beam

1. Introduction

In data processing, datasets are classified into two primary categories: Bounded (finite, static, completed) and Unbounded (infinite, dynamic, continuously growing). Apache Beam uses a unified model to handle both types of data, allowing you to use the same programming model regardless of the source.

2. Why This Concept Exists

Historically, developers used separate engines for batch processing (e.g., MapReduce, Hadoop) and stream processing (e.g., Storm, early Flink). Maintaining two separate codebases for batch and streaming was complex and prone to bugs. Apache Beam was designed to unify these concepts, representing both bounded and unbounded data as a PCollection.

3. Key Terminology

Bounded Dataset: A finite dataset of fixed size that has a defined beginning and end (e.g., a file on disk or a snapshot of a database table).
Unbounded Dataset: An infinite dataset that grows continuously without limit (e.g., a message queue, log stream, or live sensor feed).
Unified Model: A programming framework that allows writing code that can execute in either batch or streaming modes.
Source: A connector that reads data into a Beam pipeline (represented as BoundedSource or UnboundedSource internally).

4. How It Works

Bounded Processing: The runner reads all data from the source, processes it, computes aggregations globally, and shuts down when finished.
Unbounded Processing: The runner starts reading and never stops. Because the data is infinite, the runner cannot wait for the "end" of the dataset. Instead, it processes data as it arrives, grouping it into temporal windows to perform aggregations, and continuously writing output to external sinks.

5. Visual Diagram

Bounded Pipeline:

File Data

➔

Read All

➔

Global Sum

➔

Write & Terminate

Unbounded Pipeline:

Message Queue

➔

Read Continuous

➔

Windowed Sums

➔

Write Continuously

6. Code Example

The following example shows how a single pipeline can contain either a bounded source (reading a text file) or an unbounded source (reading from Pub/Sub):

python

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# 1. Bounded Batch Pipeline
batch_options = PipelineOptions(streaming=False)
with beam.Pipeline(options=batch_options) as p_batch:
    (
        p_batch
        | "ReadFiles" >> beam.io.ReadFromText("gs://my-bucket/historical-logs.csv")
        | "ExtractUser" >> beam.Map(lambda line: line.split(",")[0])
        | "WriteBatch" >> beam.io.WriteToText("gs://my-bucket/output/historical-users")
    )

# 2. Unbounded Streaming Pipeline
stream_options = PipelineOptions(streaming=True)
with beam.Pipeline(options=stream_options) as p_stream:
    (
        p_stream
        | "ReadLive" >> beam.io.ReadFromPubSub(subscription="projects/my-proj/subscriptions/live-logs")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "ExtractUserLive" >> beam.Map(lambda line: line.split(",")[0])
        | "WriteLive" >> beam.io.WriteToPubSub(topic="projects/my-proj/topics/live-users")
    )

7. Code Explanation

The first pipeline is bounded (streaming=False). It reads a static set of historical log files, processes them, writes the output, and terminates.
The second pipeline is unbounded (streaming=True). It consumes messages from Google Cloud Pub/Sub, processes them immediately, and streams them back to another topic, running indefinitely.
Notice how the core transformation logic (ExtractUser vs ExtractUserLive) is identical. This highlights Beam's unified programming model.

8. Real Production Example

A financial institution uses a bounded pipeline every night at midnight to audit all transactions processed during the previous 24 hours. They also run an unbounded pipeline 24/7 to monitor live transaction streams for real-time fraud detection. Both pipelines use the same core parsing and verification functions.

9. Common Mistakes

Applying Batch Sinks to Streaming Pipelines: Using a batch-only sink (like writing to a single static file) in an unbounded streaming pipeline will cause issues because the file is never closed or results are never flushed. You must use streaming-compatible sinks or configure file partitioning.
Assuming Instant Data Availability in Batch: Trying to use real-time triggers and watermarks on bounded sources is usually unnecessary because bounded sources have a simple, complete timeline that is instantly available.

10. Interview Perspective

Question: How does a Beam runner determine if a PCollection is bounded or unbounded?
Answer: Every PCollection has an internal property indicating its boundedness: IsBounded.BOUNDED or IsBounded.UNBOUNDED. This is inherited from the source transform that created it.
Question: Can you join a bounded PCollection with an unbounded PCollection?
Answer: Yes. This is a common pattern, such as using a bounded side input (e.g., user profiles from a database) to enrich an unbounded stream of log events.

11. Best Practices

Keep your business logic transforms agnostic of the source's boundedness so they can be easily reused in both batch and streaming contexts.
When executing streaming pipelines, ensure that monitoring systems are configured to alert on pipeline downtime, since streaming pipelines should never terminate.

12. Summary

Bounded data is static and finite; unbounded data is dynamic and infinite.
Apache Beam unifies both paradigms under the PCollection interface.
The execution mode (streaming=True vs streaming=False) determines how the runner schedules execution and manages resources.

13. Interactive Challenges

Challenge 1: Read a Bounded Text Source (Beginner)

Write a pipeline segment that reads a text file located at /data/input.txt using a bounded source transform.

Challenge 2: Ingest from Unbounded Topic (Intermediate)

Write a streaming pipeline segment that reads raw byte messages from a Pub/Sub topic named projects/prod-proj/topics/device-telemetry.

Challenge 3: Stream Ingestion with File Partitioning (Advanced)

Configure a pipeline step that writes an unbounded PCollection of strings named stream to text files with a prefix /output/stream-file. Configure it to write files continuously using windowed shards so that it doesn't wait for the stream to end.

14. Related Content

Previous LessonStreaming Basics

Next LessonPub/Sub Streaming