Apache Beam Architecture
1. Introduction
The architecture of Apache Beam is designed to solve the problem of portability across different programming languages and execution backends. Unlike monolithic execution frameworks, Beam splits the pipeline lifecycle into two major layers: Pipeline Construction (using language-specific SDKs) and Pipeline Execution (handled by runners). This division is enabled by a standardized intermediate representation and the Fn API.
2. Why This Concept Exists
Before Apache Beam's modern architecture, developers faced a major limitation: runners were engine-specific. If you wrote a Spark job, it ran only on a JVM environment because Spark was built in Scala/Java. Running Python code required complex wrapper processes, which degraded performance.
Beam's architecture introduces the Fn API and Runner API, which leverage standard communication protocols (gRPC over Protocol Buffers) to run pipelines written in any language on any engine.
3. Key Terminology
- Runner API: A language-independent definition of a pipeline graph, serialized using Protocol Buffers.
- Fn API: The gRPC interface that allows runners (written in Java/Scala) to communicate with user-defined code running in language-specific containers (Workers).
- Worker Harness: A lightweight execution container spun up by the runner to run your Python/Go code during pipeline execution.
- Logical vs. Physical DAG: Logical DAG is the language-agnostic representation built by the SDK. Physical DAG is the execution graph optimized for Flink, Spark, or Dataflow.
4. How It Works
- Logical Compilation: The Python SDK runs your code to construct a pipeline graph, checking syntax and types. It outputs a serialized Protocol Buffer representation (Runner API proto).
- Job Submission: The proto is sent to the runner's Job Service.
- Physical Compilation: The runner translates the proto into a physical execution graph optimized for its environment (e.g., Flink Task Managers).
- Fn API Execution: When executing steps, the runner launches a containerized Worker Harness (via Docker). The runner streams data elements over gRPC to the worker harness, which executes your custom Python functions (like
DoFnor lambdas) and returns the output elements.
5. Visual Diagram
6. Code Example
The following example defines a custom DoFn that initializes external parameters, showing how worker harnesses manage state during execution:
import apache_beam as beam
class ConfiguredFilterDoFn(beam.DoFn):
def __init__(self, prefix):
# Definition-time constructor: Runs locally on the client machine
self.prefix = prefix
self.counter = None
def setup(self):
# Execution-time setup: Runs on the remote worker machine inside the harness
self.counter = 0
def process(self, element):
# Execution-time processing: Called for every element in the harness
if element.startswith(self.prefix):
self.counter += 1
yield f"MATCH #{self.counter}: {element}"
# Run locally using DirectRunner
with beam.Pipeline() as p:
(p
| "CreateLogs" >> beam.Create(["ERR_auth_failed", "INFO_connected", "ERR_db_timeout"])
| "FilterErrors" >> beam.ParDo(ConfiguredFilterDoFn("ERR"))
| "Print" >> beam.Map(print))
7. Code Explanation
__init__(self, prefix)is executed during pipeline definition time on the local developer machine. Theprefixvariable is serialized and bundled into the Pipeline Proto.setup(self)is executed on the remote worker node once the container starts up. This is where you would instantiate database connections or load models.process(self, element)is executed for every element sent over the Fn API gRPC socket from the runner process to the worker harness.
8. Real Production Example
When running a Python Beam pipeline on a Google Cloud Dataflow cluster, the Dataflow service acts as the Java runner. Dataflow provisions virtual machines that run a Java runner agent. This agent spins up a Docker container containing the Beam Python SDK worker harness. The Java runner reads data from sources, streams elements via gRPC to the Python container, and collects the processed results to write to BigQuery.
9. Common Mistakes
- Instantiating Heavy Objects in
__init__: Initializing non-serializable objects (like socket connections, databases, or large ML models) inside the__init__constructor. This causes serialization errors (likePicklingError) during logical DAG packaging. Heavy objects must always be initialized in thesetup()method. - Assuming Worker Local Persistence: Writing to local files inside a
DoFnand assuming they will persist. Workers are ephemeral and run in isolated Docker containers.
10. Interview Perspective
- Question: What is the difference between pipeline definition time and pipeline execution time?
- Answer: Definition time is when the SDK executes the code locally to build the Directed Acyclic Graph (DAG) and serializes it. Execution time is when the runner spins up nodes on a cluster, runs the compiled DAG, and streams actual data elements.
- Question: What problem does the Fn API solve?
- Answer: It solves language barrier issues. Historically, runners were written in Java, forcing pipelines to be in Java. The Fn API defines standard gRPC services for data transmission and state management, enabling Java runners to delegate execution to Python or Go worker processes.
11. Best Practices
- Use
setup()orstart_bundle()for one-time initialization task blocks (like opening database sessions). - Avoid using global state variables in Python modules as they will not be shared across multi-threaded or distributed workers.
12. Summary
- Beam splits compilation (SDK) from execution (Runner).
- The Runner API serializes the logical graph using Protocol Buffers.
- The Fn API standardizes runtime communication via gRPC.
- User code executes inside isolated worker harness containers.