Apache Beam Architecture — Learn Apache Beam

1. Introduction

The architecture of Apache Beam is designed to solve the problem of portability across different programming languages and execution backends. Unlike monolithic execution frameworks, Beam splits the pipeline lifecycle into two major layers: Pipeline Construction (using language-specific SDKs) and Pipeline Execution (handled by runners). This division is enabled by a standardized intermediate representation and the Fn API.

2. Why This Concept Exists

Before Apache Beam's modern architecture, developers faced a major limitation: runners were engine-specific. If you wrote a Spark job, it ran only on a JVM environment because Spark was built in Scala/Java. Running Python code required complex wrapper processes, which degraded performance.

Beam's architecture introduces the Fn API and Runner API, which leverage standard communication protocols (gRPC over Protocol Buffers) to run pipelines written in any language on any engine.

3. Key Terminology

Runner API: A language-independent definition of a pipeline graph, serialized using Protocol Buffers.
Fn API: The gRPC interface that allows runners (written in Java/Scala) to communicate with user-defined code running in language-specific containers (Workers).
Worker Harness: A lightweight execution container spun up by the runner to run your Python/Go code during pipeline execution.
Logical vs. Physical DAG: Logical DAG is the language-agnostic representation built by the SDK. Physical DAG is the execution graph optimized for Flink, Spark, or Dataflow.

4. How It Works

Logical Compilation: The Python SDK runs your code to construct a pipeline graph, checking syntax and types. It outputs a serialized Protocol Buffer representation (Runner API proto).
Job Submission: The proto is sent to the runner's Job Service.
Physical Compilation: The runner translates the proto into a physical execution graph optimized for its environment (e.g., Flink Task Managers).
Fn API Execution: When executing steps, the runner launches a containerized Worker Harness (via Docker). The runner streams data elements over gRPC to the worker harness, which executes your custom Python functions (like DoFn or lambdas) and returns the output elements.

5. Visual Diagram

1. SDK Construction

User defines pipeline in Python, Go, or Java.

Pipeline Object

2. Serialization

Translates definition into language-neutral protobuf.

Runner API Proto

3. Submission

Transmits graph via gRPC to the Runner's Job Service.

Job Service API

4. Execution Phase (Fn API Portability)

Runner VM Host (JVM)

Manages lifecycle, scheduling, windowing, and data streaming.

Flink / Spark / Dataflow

Docker SDK Harness

Spun up by runner to execute user's Python/Go DoFn code.

Python / Go Worker

Fn API (gRPC Communication Channel)Streams Control, Data, State, and Logs bidirectionally between JVM and Worker.

6. Code Example

The following example defines a custom DoFn that initializes external parameters, showing how worker harnesses manage state during execution:

python

import apache_beam as beam

class ConfiguredFilterDoFn(beam.DoFn):
    def __init__(self, prefix):
        # Definition-time constructor: Runs locally on the client machine
        self.prefix = prefix
        self.counter = None

    def setup(self):
        # Execution-time setup: Runs on the remote worker machine inside the harness
        self.counter = 0

    def process(self, element):
        # Execution-time processing: Called for every element in the harness
        if element.startswith(self.prefix):
            self.counter += 1
            yield f"MATCH #{self.counter}: {element}"

# Run locally using DirectRunner
with beam.Pipeline() as p:
    (p 
     | "CreateLogs" >> beam.Create(["ERR_auth_failed", "INFO_connected", "ERR_db_timeout"])
     | "FilterErrors" >> beam.ParDo(ConfiguredFilterDoFn("ERR"))
     | "Print" >> beam.Map(print))

7. Code Explanation

__init__(self, prefix) is executed during pipeline definition time on the local developer machine. The prefix variable is serialized and bundled into the Pipeline Proto.
setup(self) is executed on the remote worker node once the container starts up. This is where you would instantiate database connections or load models.
process(self, element) is executed for every element sent over the Fn API gRPC socket from the runner process to the worker harness.

8. Real Production Example

When running a Python Beam pipeline on a Google Cloud Dataflow cluster, the Dataflow service acts as the Java runner. Dataflow provisions virtual machines that run a Java runner agent. This agent spins up a Docker container containing the Beam Python SDK worker harness. The Java runner reads data from sources, streams elements via gRPC to the Python container, and collects the processed results to write to BigQuery.

9. Common Mistakes

Instantiating Heavy Objects in __init__: Initializing non-serializable objects (like socket connections, databases, or large ML models) inside the __init__ constructor. This causes serialization errors (like PicklingError) during logical DAG packaging. Heavy objects must always be initialized in the setup() method.
Assuming Worker Local Persistence: Writing to local files inside a DoFn and assuming they will persist. Workers are ephemeral and run in isolated Docker containers.

10. Interview Perspective

Question: What is the difference between pipeline definition time and pipeline execution time?
Answer: Definition time is when the SDK executes the code locally to build the Directed Acyclic Graph (DAG) and serializes it. Execution time is when the runner spins up nodes on a cluster, runs the compiled DAG, and streams actual data elements.
Question: What problem does the Fn API solve?
Answer: It solves language barrier issues. Historically, runners were written in Java, forcing pipelines to be in Java. The Fn API defines standard gRPC services for data transmission and state management, enabling Java runners to delegate execution to Python or Go worker processes.

11. Best Practices

Use setup() or start_bundle() for one-time initialization task blocks (like opening database sessions).
Avoid using global state variables in Python modules as they will not be shared across multi-threaded or distributed workers.

12. Summary

Beam splits compilation (SDK) from execution (Runner).
The Runner API serializes the logical graph using Protocol Buffers.
The Fn API standardizes runtime communication via gRPC.
User code executes inside isolated worker harness containers.

13. Interactive Challenges

Challenge 1: Safe Worker Initialization (Beginner)

Complete the code segment below by choosing the correct location to initialize an database connection helper inside a custom DoFn.

python

class DBReadDoFn(beam.DoFn):
    def __init__(self):
        pass
    # TODO: Add the setup method to initialize self.db

Challenge 2: Multi-Language Side Input Design (Intermediate)

Explain how Beam’s Fn API transfers a PCollection used as a "side input" from a Java runner to a Python worker harness process.

Challenge 3: Parameterized Filter Transformation (Advanced)

Write a complete pipeline segment with a custom DoFn class that takes a threshold parameter in __init__, filters values greater than that threshold, and keeps track of count.

14. Related Content

Previous LessonBeam Programming Model

Next LessonInstallation & Environment Setup