beginner

Introduction to Apache Beam

5 min readLast updated: 2026-06-30

1. Introduction

Apache Beam is an open-source, unified programming model for defining batch and streaming data parallel processing pipelines.

2. Why This Concept Exists

Previously, developers had to write separate code for batch processing (e.g. MapReduce) and stream processing (e.g. Storm). Beam solves this by providing a unified API where the same pipeline code executes on both batch and streaming engines.

3. Key Terminology

  • Pipeline: Encapsulates the entire data processing task.
  • PCollection: A distributed, immutable dataset that the pipeline processes.
  • PTransform: Represents a data processing operation or step in your pipeline.
  • Runner: The execution engine that translates and runs the pipeline (e.g., Flink, Spark, Google Cloud Dataflow).

4. How It Works

  1. Define a Pipeline instance.
  2. Read data from an external source into an initial PCollection.
  3. Apply PTransforms sequentially to filter, map, group, or aggregate elements.
  4. Write the final output PCollection to an external sink.
  5. Execute the pipeline on a selected Runner.

5. Visual Diagram

Input Source
Kafka / GCS File

PCollection 1
Raw Input Bytes

PTransform (Filter)
Keep valid records

PCollection 2
Filtered Records

PTransform (Map)
Extract key metrics

Output Sink
BigQuery / Database

6. Code Example

Here is a basic Apache Beam pipeline that reads a list, squares the values, and writes them:

python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    squares = (
        pipeline
        | "CreateData" >> beam.Create([1, 2, 3, 4, 5])
        | "SquareValues" >> beam.Map(lambda x: x * x)
        | "Print" >> beam.Map(print)
    )

7. Code Explanation

  • beam.Pipeline() initializes a pipeline context.
  • beam.Create acts as a data source loading static list items.
  • beam.Map(lambda x: x * x) applies the squaring logic to each element in the PCollection.
  • beam.Map(print) prints each result.

8. Real Production Example

In a real-world scenario, you might read click events from a messaging queue (GCP Pub/Sub), filter out bot requests, and load aggregated analytics into Google BigQuery.

9. Common Mistakes

  • Forgetting to execute the pipeline: Using with beam.Pipeline() automatically executes it at the end of the context, but creating it without a context block requires calling pipeline.run().
  • Assuming in-place modification: You cannot edit a PCollection. Every transform returns a new PCollection.

10. Interview Perspective

  • Question: Why is it called "Beam"?
  • Answer: It is a portmanteau of Batch and Stream, highlighting its unified processing capabilities.
  • Question: What is the Direct Runner?
  • Answer: It is a local runner used for local development and testing without cluster overhead.

11. Best Practices

  • Keep user-defined functions (UDFs) small and unit-testable.
  • Always provide unique names for transforms (e.g., "CreateData" >> ...) to aid in debugging logs.

12. Summary

  • Unified batch and stream API.
  • Decoupled specification (SDK) and execution (Runner).
  • Data flows sequentially through PTransforms.

13. Interactive Challenges

Challenge 1: Filter Large Values (Beginner)

Write an Apache Beam segment that filters out elements less than 10 (retaining only values greater than or equal to 10).

Challenge 2: Square and Filter Evens (Intermediate)

Create a pipeline transformation segment that first takes an input PCollection of integers, squares each number, and then filters the squared values to keep only even numbers.

Challenge 3: Word Count Mapping (Advanced)

Write a pipeline segment that takes a PCollection of strings representing words (e.g. ["apple", "banana", "apple"]), maps each word to a tuple of (word, 1), and then uses beam.CombinePerKey(sum) to count word occurrences.

14. Related Content

Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)