Pipeline Basics
1. Introduction
A Pipeline in Apache Beam is like a blueprint or an assembly line for processing data. It holds all the instructions about where your data comes from, how it is modified, and where it goes.
2. Why This Concept Exists
When processing millions of records, you cannot just load them all into a standard Python loop on one computer. You need a way to describe your steps so that a cluster of computers can run them at the same time. By wrapping your code in a Pipeline object, Beam builds a Directed Acyclic Graph (DAG)—a step-by-step layout of operations with no loops—allowing runners to split and execute the workload.
3. Key Terminology
- Pipeline Object: The main container that manages all the steps and connections of your data flow.
- DAG (Directed Acyclic Graph): A map of your pipeline steps showing the direction data moves. Acyclic means it flows forward and never loops backward.
- PipelineOptions: A configuration object containing settings like which runner to use (e.g. Flink or Dataflow) and how many worker machines to start.
4. How It Works
- Configure: Create a
PipelineOptionsinstance to set runtime options. - Initialize: Create a pipeline container:
with beam.Pipeline(options=options) as p:. - Apply transforms: Use the pipe operator
|to chain operations. - Label: Label each step using the
>>syntax (e.g.,"FilterLogs" >> beam.Filter(...)). - Execute: When you exit the
withblock, the pipeline automatically executes.
5. Visual Diagram
Here is how data flows sequentially inside a basic pipeline:
Data Source
File / Event Broker
PCollection A
Raw read records
PCollection B
Transformed outputs
Data Sink
Storage / Database
6. Code Example
Here is a complete, runnable script demonstrating a basic pipeline setup:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
# 1. Define configuration settings
options = PipelineOptions()
# 2. Open the pipeline context
with beam.Pipeline(options=options) as p:
# 3. Read, transform, and write data
(p
| "LoadNames" >> beam.Create(["alice", "bob", "charlie"])
| "Capitalize" >> beam.Map(lambda name: name.capitalize())
| "Print" >> beam.Map(print))
7. Code Explanation
PipelineOptions()initializes default configuration parameters.with beam.Pipeline(options=options) as p:handles starting and stopping execution hooks.p | "LoadNames" >> beam.Create(...)inputs static values.| "Capitalize" >> beam.Map(...)applies capitalization.| "Print" >> beam.Map(print)prints the results.
8. Real Production Example
In production, a pipeline is configured to run on Google Cloud Dataflow:
from apache_beam.options.pipeline_options import PipelineOptions
options = PipelineOptions(
runner="DataflowRunner",
project="my-gcp-project",
region="us-central1",
temp_location="gs://my-bucket/temp"
)
This instructs the pipeline to run on Google Cloud workers instead of your local machine.
9. Common Mistakes
- Re-using the same name label: Every label string (e.g.
"LoadNames") must be unique. Duplicate labels cause execution crashes. - Running without context blocks: If you construct a pipeline without a
withstatement, it will not run unless you explicitly callp.run()at the end of your script.
10. Interview Perspective
- Question: What is a runner in Apache Beam?
- Answer: A runner is the execution engine that runs your pipeline. Beam defines the pipeline instructions, while runners (like Spark, Flink, or Dataflow) execute them.
- Question: Can you edit a pipeline after it starts running?
- Answer: No. Once the pipeline graph is submitted to a runner, it is static and cannot be modified.
11. Best Practices
- Always provide unique names for transforms using the
>>operator. - Keep pipeline construction code separated from your business transformation logic functions for easier unit testing.
12. Summary
- A pipeline is a Directed Acyclic Graph (DAG).
- Chained using the
|operator and labeled using>>. - Executes automatically upon exiting a
with beam.Pipeline()block.