PCollection Basics
1. Introduction
A PCollection is a specialized database or table in Apache Beam. It stands for Pipeline Collection, and it represents the actual data running through your pipeline.
2. Why This Concept Exists
In traditional programs, data is stored in standard lists or arrays inside the computer's memory. When working with huge datasets, a single computer will run out of memory. A PCollection represents data that is split (partitioned) across many machines. You write code as if it is one single list, and Beam handles distributing it across the cluster.
3. Key Terminology
- Bounded Dataset: A dataset of a fixed, unchanging size. Example: reading a static log file from storage.
- Unbounded Dataset: An infinite dataset that grows continuously. Example: reading messages from a live streaming queue.
- Immutability: Once a PCollection is created, you cannot add, remove, or modify items inside it. You can only apply a transform to create a new PCollection.
4. How It Works
- Create: Ingest data from files or messaging queues to create an initial PCollection.
- Transform: Apply operations. A PCollection is passed to a transform, which returns a new PCollection.
- No Random Access: You cannot access elements by index (e.g.
pcollection[2]is invalid). You must perform operations on all elements as a collection.
5. Visual Diagram
Here is how transforms consume and output PCollections:
Input File
Raw Data Source
PCollection A
Raw read lines
PCollection B
JSON objects (Output)
6. Code Example
Here is a script demonstrating how PCollections are declared and manipulated:
import apache_beam as beam
with beam.Pipeline() as p:
# 1. Ingesting list elements creates a PCollection
numbers = p | "CreateNumbers" >> beam.Create([10, 20, 30])
# 2. Transforming the PCollection outputs a new PCollection
doubled = numbers | "DoubleValues" >> beam.Map(lambda x: x * 2)
7. Code Explanation
numbersis a bounded PCollection containing three integers:10,20, and30.numbers | ...feeds the collection into the mapping transform.- The original
numberscollection is untouched (immutability). - A new collection
doubledis returned containing[20, 40, 60].
8. Real Production Example
In a streaming analytics pipeline, a source read operation produces an unbounded PCollection:
messages = p | "ReadFromQueue" >> beam.io.ReadFromPubSub(subscription="projects/my-gcp-project/subscriptions/my-sub")
This PCollection does not have a fixed size; it updates continuously as new messages are published to the queue.
9. Common Mistakes
- Trying to write loops: You cannot write
for item in pcollection:because the dataset is distributed across workers and not stored locally in memory. Use transforms likebeam.Mapinstead. - Assuming element ordering: Elements in a PCollection are processed in parallel across worker nodes. Their order is not guaranteed.
10. Interview Perspective
- Question: Why are PCollections immutable?
- Answer: Immutability prevents state conflicts. If a worker machine crashes mid-operation, the runner can safely retry the step on another worker using the original PCollection state without risking duplicate edits.
- Question: Can a PCollection contain mixed types?
- Answer: Yes, but it is a bad practice because downstream transforms will fail if they encounter unexpected types.
11. Best Practices
- Always associate schema definitions or class types with PCollection elements.
- Keep records relatively small to avoid overloading worker memory during groupings.
12. Summary
- PCollections represent distributed datasets.
- They are immutable to guarantee parallel execution safety.
- Represented in two shapes: Bounded (batch) and Unbounded (streaming).