Introduction to Apache Beam
1. Introduction
Apache Beam is an open-source, unified programming model for defining batch and streaming data parallel processing pipelines.
2. Why This Concept Exists
Previously, developers had to write separate code for batch processing (e.g. MapReduce) and stream processing (e.g. Storm). Beam solves this by providing a unified API where the same pipeline code executes on both batch and streaming engines.
3. Key Terminology
- Pipeline: Encapsulates the entire data processing task.
- PCollection: A distributed, immutable dataset that the pipeline processes.
- PTransform: Represents a data processing operation or step in your pipeline.
- Runner: The execution engine that translates and runs the pipeline (e.g., Flink, Spark, Google Cloud Dataflow).
4. How It Works
- Define a
Pipelineinstance. - Read data from an external source into an initial
PCollection. - Apply
PTransformssequentially to filter, map, group, or aggregate elements. - Write the final output
PCollectionto an external sink. - Execute the pipeline on a selected
Runner.
5. Visual Diagram
Input Source
Kafka / GCS File
PCollection 1
Raw Input Bytes
PTransform (Filter)
Keep valid records
PCollection 2
Filtered Records
PTransform (Map)
Extract key metrics
Output Sink
BigQuery / Database
6. Code Example
Here is a basic Apache Beam pipeline that reads a list, squares the values, and writes them:
import apache_beam as beam
with beam.Pipeline() as pipeline:
squares = (
pipeline
| "CreateData" >> beam.Create([1, 2, 3, 4, 5])
| "SquareValues" >> beam.Map(lambda x: x * x)
| "Print" >> beam.Map(print)
)
7. Code Explanation
beam.Pipeline()initializes a pipeline context.beam.Createacts as a data source loading static list items.beam.Map(lambda x: x * x)applies the squaring logic to each element in the PCollection.beam.Map(print)prints each result.
8. Real Production Example
In a real-world scenario, you might read click events from a messaging queue (GCP Pub/Sub), filter out bot requests, and load aggregated analytics into Google BigQuery.
9. Common Mistakes
- Forgetting to execute the pipeline: Using
with beam.Pipeline()automatically executes it at the end of the context, but creating it without a context block requires callingpipeline.run(). - Assuming in-place modification: You cannot edit a PCollection. Every transform returns a new PCollection.
10. Interview Perspective
- Question: Why is it called "Beam"?
- Answer: It is a portmanteau of Batch and Stream, highlighting its unified processing capabilities.
- Question: What is the Direct Runner?
- Answer: It is a local runner used for local development and testing without cluster overhead.
11. Best Practices
- Keep user-defined functions (UDFs) small and unit-testable.
- Always provide unique names for transforms (e.g.,
"CreateData" >> ...) to aid in debugging logs.
12. Summary
- Unified batch and stream API.
- Decoupled specification (SDK) and execution (Runner).
- Data flows sequentially through PTransforms.