ParDo — Learn Apache Beam

1. Introduction

ParDo is the core parallel processing transform in Apache Beam. It takes a custom DoFn and applies it in parallel to every element in a PCollection.

2. Why This Concept Exists

While Map and Filter are handy shorthand functions, they are limited. They cannot easily output multiple values, route elements to different destinations, or access auxiliary databases. ParDo is the swiss-army knife of Apache Beam. Under the hood, both Map and Filter are translated into ParDo operations.

3. Key Terminology

ParDo: The pipeline transform class (beam.ParDo).
DoFn: The class containing the code logic passed inside ParDo.
Side Input: Auxiliary read-only datasets passed into the transform.
Side Output: The ability to output elements into separate, tagged PCollections.

4. How It Works

You instantiate a DoFn class.
You apply the transform inside your pipeline: pcoll | beam.ParDo(MyDoFn()).
Beam distributes the input PCollection elements across workers.
Each worker runs the DoFn's process method on its partitioned elements.

5. Visual Diagram

PCollection Input
[ Element A, B, C ]

➔ (ParDo)

Worker 1 (A)
Yields [10, 20]

Worker 2 (B)
Yields [30]

Worker 3 (C)
Filtered Out (Yields None)

6. Code Example

Applying ParDo to split sentences into individual words:

python

import apache_beam as beam

class SplitWordsDoFn(beam.DoFn):
    def process(self, element):
        for word in element.split(" "):
            yield word

with beam.Pipeline() as p:
    sentences = p | beam.Create(["Hello world", "Apache Beam is great"])
    words = sentences | "Split" >> beam.ParDo(SplitWordsDoFn())

7. Code Explanation

SplitWordsDoFn splits strings into words.
It yields each word using a loop inside process.
beam.ParDo(SplitWordsDoFn()) runs the splitting logic in parallel.

8. Real Production Example: Side Inputs

Passing a lookup map as a Side Input to translate country codes:

python

class TranslateCountryDoFn(beam.DoFn):
    # Pass lookup_map as side input parameter
    def process(self, element, lookup_map):
        country_code = element["country_code"]
        element["country_name"] = lookup_map.get(country_code, "Unknown")
        yield element

# In your pipeline:
# lookup = p | beam.Create({"US": "United States", "FR": "France"})
# translated = users | beam.ParDo(TranslateCountryDoFn(), lookup_map=beam.pvalue.AsDict(lookup))

9. Common Mistakes

Modifying Side Inputs: Side inputs are read-only. Never modify a side input dictionary inside process.
Calling ParDo without instantiating the class: Write beam.ParDo(MyDoFn()) (with parentheses), not beam.ParDo(MyDoFn).

10. Interview Perspective

Question: How does ParDo support multiple outputs?
Answer: Use Tag variables inside process calling yield beam.pvalue.TaggedOutput(tag, element). Downstream, reference the tags from the returned collection object.
Question: What is the relationship between Map and ParDo?
Answer: Map is a wrapper around ParDo. Writing beam.Map(fn) gets compiled into a ParDo running a wrapper DoFn.

11. Best Practices

Use side inputs sparingly. If the side input dataset is too large to fit in memory, use co-grouping operations instead.
Document side output tags clearly for downstream connectors.

12. Summary

ParDo is the primary transform that executes a DoFn.
Supports 1-to-many mappings.
Supports Side Inputs (auxiliary data) and Side Outputs (routing).

13. Interactive Challenges

Challenge 1: Map Transform via ParDo (Beginner)

Define a DoFn class named SquareDoFn and apply it to a PCollection using beam.ParDo to square every integer.

Challenge 2: ParDo with Side Input (Intermediate)

Write a ParDo segment that takes a PCollection of transaction integers and filters out any values that are below a threshold parameter min_val passed as a side input.

Challenge 3: Tagged Multiple Outputs (Advanced)

Define a DoFn class that routes even numbers to a tagged output named "even" and odd numbers to a tagged output named "odd".

14. Related Content

Previous LessonDoFn

Next LessonCoGroupByKey

ParDo Processing