Pipeline Basics — Learn Apache Beam

1. Introduction

A Pipeline in Apache Beam is like a blueprint or an assembly line for processing data. It holds all the instructions about where your data comes from, how it is modified, and where it goes.

2. Why This Concept Exists

When processing millions of records, you cannot just load them all into a standard Python loop on one computer. You need a way to describe your steps so that a cluster of computers can run them at the same time. By wrapping your code in a Pipeline object, Beam builds a Directed Acyclic Graph (DAG)—a step-by-step layout of operations with no loops—allowing runners to split and execute the workload.

3. Key Terminology

Pipeline Object: The main container that manages all the steps and connections of your data flow.
DAG (Directed Acyclic Graph): A map of your pipeline steps showing the direction data moves. Acyclic means it flows forward and never loops backward.
PipelineOptions: A configuration object containing settings like which runner to use (e.g. Flink or Dataflow) and how many worker machines to start.

4. How It Works

Configure: Create a PipelineOptions instance to set runtime options.
Initialize: Create a pipeline container: with beam.Pipeline(options=options) as p:.
Apply transforms: Use the pipe operator | to chain operations.
Label: Label each step using the >> syntax (e.g., "FilterLogs" >> beam.Filter(...)).
Execute: When you exit the with block, the pipeline automatically executes.

5. Visual Diagram

Here is how data flows sequentially inside a basic pipeline:

Data Source
File / Event Broker

➔ (Read)

PCollection A
Raw read records

➔ (Transform)

PCollection B
Transformed outputs

➔ (Write)

Data Sink
Storage / Database

6. Code Example

Here is a complete, runnable script demonstrating a basic pipeline setup:

python

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# 1. Define configuration settings
options = PipelineOptions()

# 2. Open the pipeline context
with beam.Pipeline(options=options) as p:
    # 3. Read, transform, and write data
    (p
     | "LoadNames" >> beam.Create(["alice", "bob", "charlie"])
     | "Capitalize" >> beam.Map(lambda name: name.capitalize())
     | "Print" >> beam.Map(print))

7. Code Explanation

PipelineOptions() initializes default configuration parameters.
with beam.Pipeline(options=options) as p: handles starting and stopping execution hooks.
p | "LoadNames" >> beam.Create(...) inputs static values.
| "Capitalize" >> beam.Map(...) applies capitalization.
| "Print" >> beam.Map(print) prints the results.

8. Real Production Example

In production, a pipeline is configured to run on Google Cloud Dataflow:

python

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp"
)

This instructs the pipeline to run on Google Cloud workers instead of your local machine.

9. Common Mistakes

Re-using the same name label: Every label string (e.g. "LoadNames") must be unique. Duplicate labels cause execution crashes.
Running without context blocks: If you construct a pipeline without a with statement, it will not run unless you explicitly call p.run() at the end of your script.

10. Interview Perspective

Question: What is a runner in Apache Beam?
Answer: A runner is the execution engine that runs your pipeline. Beam defines the pipeline instructions, while runners (like Spark, Flink, or Dataflow) execute them.
Question: Can you edit a pipeline after it starts running?
Answer: No. Once the pipeline graph is submitted to a runner, it is static and cannot be modified.

11. Best Practices

Always provide unique names for transforms using the >> operator.
Keep pipeline construction code separated from your business transformation logic functions for easier unit testing.

12. Summary

A pipeline is a Directed Acyclic Graph (DAG).
Chained using the | operator and labeled using >>.
Executes automatically upon exiting a with beam.Pipeline() block.

13. Interactive Challenges

Challenge 1: Basic Word Uppercasing (Beginner)

Write a complete Apache Beam pipeline script segment inside a context block that reads a list containing the strings "data", "engineering", and "beam", uppercase converts them, and prints the result.

Challenge 2: Chained Multi-Step Calculator (Intermediate)

Write a pipeline segment that takes a list of integers [1, 2, 3, 4, 5], multiplies each number by 10, then adds 5 to each, and prints the final values.

Challenge 3: Split and Filter Sentence (Advanced)

Write a pipeline segment that takes a single text sentence "Learning Apache Beam is fun", splits it into individual words, filters out words that are less than 4 letters long, and prints the remaining words.

14. Related Content

Previous LessonCreating Your First Beam Pipeline

Next LessonProject Structure