PCollection — Learn Apache Beam

1. Introduction

A PCollection is Apache Beam's abstraction for a distributed dataset; it is not a database or a table. The "P" stands for "parallel" (a PCollection is a parallel collection), and it represents the data flowing through your pipeline.

2. Why This Concept Exists

In traditional programs, data is stored in standard lists or arrays inside the computer's memory. When working with huge datasets, a single computer will run out of memory. A PCollection represents data that is split (partitioned) across many machines. You write code as if it is one single list, and Beam handles distributing it across the cluster.

3. Key Terminology

Bounded Dataset: A dataset of a fixed, unchanging size. Example: reading a static log file from storage.
Unbounded Dataset: A dataset with no fixed end that keeps growing as new data arrives. Example: reading messages from a live streaming queue.
Immutability: Once a PCollection is created, you cannot add, remove, or modify items inside it. You can only apply a transform to create a new PCollection.

4. How It Works

Create: Ingest data from files, messaging queues, or an in-memory list (beam.Create) to create an initial PCollection.
Transform: Apply operations. A PCollection is passed to a transform, which returns a new PCollection.
No Random Access: You cannot access elements by index (e.g. pcollection[2] is invalid). You must perform operations on all elements as a collection.

5. Visual Diagram

Here is how transforms consume and output PCollections:

Input File (raw data source) ➔ PCollection A (raw read lines) ➔ Parse ➔ PCollection B (JSON objects, output)

6. Code Example

Here is a script demonstrating how PCollections are created and transformed:

python

import apache_beam as beam

with beam.Pipeline() as p:
    # 1. Ingesting list elements creates a PCollection
    numbers = p | "CreateNumbers" >> beam.Create([10, 20, 30])
    
    # 2. Transforming the PCollection outputs a new PCollection
    doubled = numbers | "DoubleValues" >> beam.Map(lambda x: x * 2)

7. Code Explanation

numbers is a bounded PCollection containing three integers: 10, 20, and 30.
numbers | ... feeds the collection into the mapping transform.
The original numbers collection is untouched (immutability).
A new collection, doubled, is returned containing the elements 20, 40, and 60 (in no guaranteed order).

8. Real Production Example

In a streaming analytics pipeline, a source read operation produces an unbounded PCollection:

python

messages = p | "ReadFromQueue" >> beam.io.ReadFromPubSub(subscription="projects/my-gcp-project/subscriptions/my-sub")

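This PCollection does not have a fixed size: new elements keep arriving as messages are published to the queue. A read like this expects the pipeline to run in streaming mode, and the subscription path shown is a placeholder.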

9. Common Mistakes

Trying to write loops: You cannot write for item in pcollection: because the dataset is distributed across workers and not stored locally in memory. Use transforms like beam.Map instead.
Assuming element ordering: Elements in a PCollection are processed in parallel across worker nodes. Their order is not guaranteed.

10. Interview Perspective

Question: Why are PCollections immutable?
Answer: Immutability prevents state conflicts. If a worker machine crashes mid-operation, the runner can safely retry the step on another worker using the original PCollection state without risking duplicate edits.
Question: Can a PCollection contain mixed types?
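Answer: Beam's model expects all elements of a PCollection to be of the same type, because the runner needs a coder to serialize each element. The Python SDK does not always enforce this at runtime, so mixed types can slip through, but it is bad practice: downstream transforms will fail if they encounter unexpected types.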

11. Best Practices

Declare a schema or element type for each PCollection so that Beam can type-check the pipeline and pick a suitable coder.
Keep records relatively small to avoid overloading worker memory during groupings.

12. Summary

PCollections represent distributed datasets.
They are immutable to guarantee parallel execution safety.
They come in two forms: bounded (batch) and unbounded (streaming).

13. Interactive Challenges

Challenge 1: Key Extraction Mapping (Beginner)

Write an Apache Beam map transform statement that takes a PCollection of string records in "id,name" format (e.g. "101,Alice") and returns a PCollection of key-value tuples: (id, name).

Challenge 2: Ingest and Filter Records (Intermediate)

Write a complete pipeline segment inside a context block that ingests a local list of transaction values [150.0, -20.0, 300.5, 0.0, -5.0], filters out negative transactions, and prints the remaining valid transactions.

Challenge 3: Parse CSV Schema Objects (Advanced)

Write a pipeline segment that parses a PCollection of CSV lines representing items ("sku,price,quantity", e.g., "item-A,19.99,3") and transforms them into a PCollection of dictionary records with float price and integer quantity conversions.
