Streaming Basics — Learn Apache Beam

1. Introduction

Streaming processing is a data processing paradigm designed for continuous, infinite streams of data. Unlike batch processing, which works on complete, static datasets, streaming processes events as they happen, enabling real-time insights and immediate actions.

2. Why This Concept Exists

Modern applications generate data continuously—clicks on websites, financial transactions, IoT sensor readings, and system logs. Waiting hours or days for batch jobs to complete is no longer sufficient for use cases like fraud detection, real-time alerts, or live analytics dashboards. Streaming engines like Apache Beam provide the tools to ingest, analyze, and output continuous data with minimal latency.

3. Key Terminology

Unbounded Dataset: A dataset of infinite size that grows continuously over time.
Latency: The time elapsed between a data event occurring and that event being processed by the system.
Throughput: The volume of data processed by the pipeline per unit of time.
Event Time: The timestamp indicating when the data event actually occurred at the source.
Processing Time: The timestamp indicating when the data event is processed by a specific step in the pipeline.

4. How It Works

Streaming pipelines run continuously, often for weeks, months, or years without stopping. The system ingests elements from unbounded sources (like Pub/Sub or Kafka) and immediately passes them to workers. Because the dataset is infinite, global aggregations are impossible without partitioning. The stream must be sliced into finite segments using Windowing (e.g., fixed time windows) or stateful operations before running aggregations.

5. Visual Diagram

Unbounded Source
Continuous Broker

➔

Windowing (5-min)
Divide stream in time

➔

Transform / Aggregate
Compute sum / stats

➔

Unbounded Sink
PubSub / BigQuery Stream

6. Code Example

Here is a basic Apache Beam streaming pipeline in Python that reads from a streaming source, window-groups the incoming values, and writes them:

python

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Create pipeline options with streaming enabled
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p 
        | "ReadStream" >> beam.io.ReadFromPubSub(subscription="projects/my-project/subscriptions/my-sub")
        | "Decode" >> beam.Map(lambda x: x.decode("utf-8"))
        | "Window" >> beam.WindowInto(FixedWindows(60))  # 60-second windows
        | "Count" >> beam.combiners.Count.Globally()
        | "Format" >> beam.Map(lambda count: f"Events in last minute: {count}")
        | "WriteLog" >> beam.Map(print)
    )

7. Code Explanation

PipelineOptions(streaming=True) is critical; it tells the runner to execute the pipeline in streaming mode rather than batch.
beam.io.ReadFromPubSub is an unbounded source reader that listens continuously for new messages.
beam.WindowInto(FixedWindows(60)) divides the continuous flow into distinct 60-second intervals based on event time.
Count.Globally() calculates the count of elements that fall into each specific window.
print displays the results of each window as soon as the window is triggered and closed.

8. Real Production Example

In a ride-hailing application, GPS coordinates from thousands of drivers are streamed continuously. The pipeline ingests the coordinates, groups them into 1-minute fixed windows, calculates the density of drivers per area, and updates the heat map on the customer app in real-time.

9. Common Mistakes

Forgetting streaming=True: If this option is omitted, the runner may attempt to execute in batch mode, causing the pipeline to wait indefinitely for the unbounded source to finish (which it never does).
Aggregating without Windows: Applying global aggregations (like Count.Globally() or CombineGlobally(sum)) without a windowing strategy on an unbounded dataset will cause memory accumulation and eventual pipeline crashes because the runner tries to hold all historical data in memory.

10. Interview Perspective

Question: What is the main difference between event time and processing time in a streaming pipeline?
Answer: Event time is when the event occurred on the device (e.g., when a user clicked a button), whereas processing time is the clock time of the worker machine running the code. Network delays can cause event time to lag behind processing time.
Question: Why can't we sort an unbounded PCollection?
Answer: Sorting requires seeing all elements in the dataset to determine their relative order. Since an unbounded dataset is infinite and never ends, you cannot sort the collection globally.

11. Best Practices

Always define appropriate windowing before running any aggregation on streaming data.
Filter out corrupt or irrelevant data early in the pipeline to save computation and network bandwidth.
Ensure that external source clients (like Kafka or Pub/Sub) are scaled to handle peak traffic loads.

12. Summary

Streaming processes data continuously with low latency.
Unbounded datasets require windowing to perform aggregations.
streaming=True must be configured in Apache Beam options to run in streaming mode.

13. Interactive Challenges

Challenge 1: Streaming Pipeline Options Setup (Beginner)

Configure the standard PipelineOptions to enable streaming, and instantiate a Beam pipeline using those options.

Challenge 2: Apply Fixed Windowing (Intermediate)

Write a transform step using beam.WindowInto that partitions a streaming PCollection named stream into 5-minute fixed windows.

Challenge 3: Count Windows Globally (Advanced)

Given a streaming PCollection named clicks, write a pipeline segment that applies a 10-second fixed window and counts the number of elements in each window, yielding the count per window.

14. Related Content

Previous LessonAccumulation Modes

Next LessonBounded vs Unbounded