Introduction to Apache Beam — Learn Apache Beam

1. Introduction

Apache Beam is an open-source, unified programming model for defining batch and streaming data parallel processing pipelines.

2. Why This Concept Exists

Previously, developers had to write separate code for batch processing (e.g. MapReduce) and stream processing (e.g. Storm). Beam solves this by providing a unified API where the same pipeline code executes on both batch and streaming engines.

3. Key Terminology

Pipeline: Encapsulates the entire data processing task.
PCollection: A distributed, immutable dataset that the pipeline processes.
PTransform: Represents a data processing operation or step in your pipeline.
Runner: The execution engine that translates and runs the pipeline (e.g., Flink, Spark, Google Cloud Dataflow).

4. How It Works

Define a Pipeline instance.
Read data from an external source into an initial PCollection.
Apply PTransforms sequentially to filter, map, group, or aggregate elements.
Write the final output PCollection to an external sink.
Execute the pipeline on a selected Runner.

5. Visual Diagram

Input Source
Kafka / GCS File

➔

PCollection 1
Raw Input Bytes

➔

PTransform (Filter)
Keep valid records

➔

PCollection 2
Filtered Records

➔

PTransform (Map)
Extract key metrics

➔

Output Sink
BigQuery / Database

6. Code Example

Here is a basic Apache Beam pipeline that reads a list, squares the values, and writes them:

python

import apache_beam as beam

with beam.Pipeline() as pipeline:
    squares = (
        pipeline
        | "CreateData" >> beam.Create([1, 2, 3, 4, 5])
        | "SquareValues" >> beam.Map(lambda x: x * x)
        | "Print" >> beam.Map(print)
    )

7. Code Explanation

beam.Pipeline() initializes a pipeline context.
beam.Create acts as a data source loading static list items.
beam.Map(lambda x: x * x) applies the squaring logic to each element in the PCollection.
beam.Map(print) prints each result.

8. Real Production Example

In a real-world scenario, you might read click events from a messaging queue (GCP Pub/Sub), filter out bot requests, and load aggregated analytics into Google BigQuery.

9. Common Mistakes

Forgetting to execute the pipeline: Using with beam.Pipeline() automatically executes it at the end of the context, but creating it without a context block requires calling pipeline.run().
Assuming in-place modification: You cannot edit a PCollection. Every transform returns a new PCollection.

10. Interview Perspective

Question: Why is it called "Beam"?
Answer: It is a portmanteau of Batch and Stream, highlighting its unified processing capabilities.
Question: What is the Direct Runner?
Answer: It is a local runner used for local development and testing without cluster overhead.

11. Best Practices

Keep user-defined functions (UDFs) small and unit-testable.
Always provide unique names for transforms (e.g., "CreateData" >> ...) to aid in debugging logs.

12. Summary

Unified batch and stream API.
Decoupled specification (SDK) and execution (Runner).
Data flows sequentially through PTransforms.

13. Interactive Challenges

Challenge 1: Filter Large Values (Beginner)

Write an Apache Beam segment that filters out elements less than 10 (retaining only values greater than or equal to 10).

Challenge 2: Square and Filter Evens (Intermediate)

Create a pipeline transformation segment that first takes an input PCollection of integers, squares each number, and then filters the squared values to keep only even numbers.

Challenge 3: Word Count Mapping (Advanced)

Write a pipeline segment that takes a PCollection of strings representing words (e.g. ["apple", "banana", "apple"]), maps each word to a tuple of (word, 1), and then uses beam.CombinePerKey(sum) to count word occurrences.

14. Related Content

Next LessonWhy Apache Beam?