intermediate
ParDo Processing
7 min read | Last updated: 2026-06-30
1. Introduction
ParDo is the core parallel processing transform in Apache Beam. It takes a custom DoFn and applies it in parallel to every element in a PCollection.
2. Why This Concept Exists
While Map and Filter are handy shorthand transforms, they are limited. A Map emits exactly one output per input and a Filter emits at most one, and neither can route elements to different tagged outputs. ParDo is the Swiss Army knife of Apache Beam. Under the hood, both Map and Filter are translated into ParDo operations.
3. Key Terminology
- ParDo: The pipeline transform class (`beam.ParDo`).
- DoFn: The class containing the code logic passed inside `ParDo`.
- Side Input: Auxiliary read-only datasets passed into the transform.
- Side Output: The ability to output elements into separate, tagged PCollections.
4. How It Works
- You define a `DoFn` subclass and instantiate it.
- You apply the transform inside your pipeline: `pcoll | beam.ParDo(MyDoFn())`.
- Beam distributes the input PCollection elements across workers.
- Each worker runs the `DoFn`'s `process` method on the elements assigned to it.
5. Visual Diagram
PCollection Input: [ Element A, B, C ]
- Worker 1 (A): yields [10, 20]
- Worker 2 (B): yields [30]
- Worker 3 (C): filtered out (yields nothing)
6. Code Example
Applying ParDo to split sentences into individual words:
```python
import apache_beam as beam

class SplitWordsDoFn(beam.DoFn):
    def process(self, element):
        # Emit one output element per word in the input sentence.
        for word in element.split(" "):
            yield word

with beam.Pipeline() as p:
    sentences = p | beam.Create(["Hello world", "Apache Beam is great"])
    # "Split" is the step label shown in the pipeline graph.
    words = sentences | "Split" >> beam.ParDo(SplitWordsDoFn())
```
7. Code Explanation
- `SplitWordsDoFn` splits strings into words.
- It yields each word using a loop inside `process`.
- `beam.ParDo(SplitWordsDoFn())` runs the splitting logic in parallel.
8. Real Production Example: Side Inputs
Passing a lookup map as a Side Input to translate country codes:
```python
import apache_beam as beam

class TranslateCountryDoFn(beam.DoFn):
    # lookup_map arrives as a side input: a read-only dict of code -> name
    def process(self, element, lookup_map):
        country_code = element["country_code"]
        country_name = lookup_map.get(country_code, "Unknown")
        # Emit a new dict; a DoFn should not mutate its input element.
        yield {**element, "country_name": country_name}

# In your pipeline:
# lookup = p | beam.Create({"US": "United States", "FR": "France"})
# translated = users | beam.ParDo(TranslateCountryDoFn(), lookup_map=beam.pvalue.AsDict(lookup))
```
9. Common Mistakes
- Modifying Side Inputs: Side inputs are read-only. Never modify a side input dictionary inside `process`.
- Calling ParDo without instantiating the class: Write `beam.ParDo(MyDoFn())` (with parentheses), not `beam.ParDo(MyDoFn)`.
10. Interview Perspective
- Question: How does ParDo support multiple outputs?
- Answer: Inside `process`, yield `beam.pvalue.TaggedOutput(tag, element)` for each element that belongs to a tagged output, and declare the tags by calling `.with_outputs(...)` on the `ParDo`. Downstream, reference the tags on the returned collection object.
- Question: What is the relationship between Map and ParDo?
- Answer: `Map` is a wrapper around `ParDo`. Writing `beam.Map(fn)` gets compiled into a `ParDo` running a wrapper DoFn.
11. Best Practices
- Use side inputs sparingly. If the side input dataset is too large to fit in worker memory, use a co-grouping operation such as `CoGroupByKey` instead.
- Document side output tags clearly for downstream connectors.
12. Summary
- `ParDo` is the primary transform that executes a `DoFn`.
- Supports 1-to-many mappings.
- Supports Side Inputs (auxiliary data) and Side Outputs (routing).