intermediate
Side Input Patterns
5 min readLast updated: 2026-07-02
1. Introduction
Side Inputs are additional datasets (e.g., config maps, lookup tables, or small databases) passed as auxiliary parameters to a primary transform to enrich incoming elements.
2. Why This Concept Exists
While a standard transform processes elements one-by-one from a single PCollection, you frequently need auxiliary lookup values (such as currency conversion rates) that change periodically. Side Inputs provide a memory-efficient way to distribute lookup tables to all cluster workers.
3. Code Example
Passing a side-input dictionary to a ParDo:
python
import apache_beam as beam
# Main streaming elements
events = [("item-A", 10), ("item-B", 20)]
with beam.Pipeline() as p:
# Small lookup table to pass as Side Input
lookups = (p
| "CreateLookups" >> beam.Create([("item-A", "Category-X"), ("item-B", "Category-Y")])
| "ToDict" >> beam.CombineGlobally(dict))
# Enrich main streaming elements
(p
| "CreateEvents" >> beam.Create(events)
| "EnrichCategory" >> beam.ParDo(
lambda elem, lookup_dict: (elem[0], elem[1], lookup_dict.get(elem[0], "Unknown")),
beam.pvalue.AsSingleton(lookups)
)
| beam.Map(print))
4. Key Takeaways
- Use
beam.pvalue.AsSingletonorbeam.pvalue.AsDictto convert side PCollections into python datatypes. - Side inputs should remain small (< 100MB) since they are loaded into the JVM/Python memory space of the worker nodes.
Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)