intermediate

Side Input Patterns

5 min read · Last updated: 2026-07-02

1. Introduction

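A side input is an additional dataset (for example a config map, a lookup table, or a small reference dataset) that is passed to a transform as an auxiliary parameter. The transform can read it in full while processing each element of its main input, which makes it a natural way to enrich incoming elements.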

2. Why This Concept Exists

A standard transform processes elements one at a time from a single PCollection, but you frequently need auxiliary lookup values (such as currency conversion rates) that are produced elsewhere in the pipeline and may change periodically. Side inputs let the runner distribute such a lookup table to every worker that executes the transform, so you do not have to hard-code it into your function.

3. Code Example

Passing a dictionary as a side input to an element-wise transform:

python
import apache_beam as beam

# Main input elements (a small in-memory stand-in for a stream of events)
events = [("item-A", 10), ("item-B", 20)]

with beam.Pipeline() as p:
    # Small lookup table to pass as Side Input
    lookups = (p 
               | "CreateLookups" >> beam.Create([("item-A", "Category-X"), ("item-B", "Category-Y")])
               # ToDict combines the (key, value) pairs into a single dict element
               | "ToDict" >> beam.combiners.ToDict())

    # Enrich each event with its category (Map emits the returned tuple as a single element)
    (p
     | "CreateEvents" >> beam.Create(events)
     | "EnrichCategory" >> beam.Map(
         lambda elem, lookup_dict: (elem[0], elem[1], lookup_dict.get(elem[0], "Unknown")),
         beam.pvalue.AsSingleton(lookups)
     )
     | beam.Map(print))

4. Key Takeaways

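  • Use beam.pvalue.AsSingleton for a side PCollection that holds exactly one element, or beam.pvalue.AsDict for one that holds (key, value) pairs. Both expose the side input to your function as an ordinary Python value.
  • Keep side inputs small (under roughly 100 MB as a rule of thumb), since each worker loads them into the memory of its JVM or Python process.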