Cheatsheet: Side Inputs

Revision time: 3 mins

Topic Overview

Pass supplementary lookup tables and configuration data into transforms.

Syntax Snapshot

python
import apache_beam as beam

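# Assumes an existing Pipeline `p` and a PCollection `events` of dicts
# with "id" and "user" keys.

# 1. Create a side input (lookup dict) from key-value pairs
metadata = p | "ReadMetadata" >> beam.Create([("user1", "Admin"), ("user2", "User")])
metadata_side = beam.pvalue.AsDict(metadata)

# 2. Consume the side input. Map emits each returned tuple as one element;
#    ParDo with a lambda would iterate the tuple and emit its items separately.
enriched = events | "Join" >> beam.Map(
    lambda ev, meta: (ev["id"], meta.get(ev["user"], "Unknown")),
    meta=metadata_side
)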
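
To try the pattern end to end, here is a minimal sketch that assumes the local DirectRunner and made-up sample data. The names `enrich` and `run` are illustrative helpers, not part of Beam. The join uses `beam.Map` so that each returned tuple is emitted as a single element.

```python
def enrich(ev, meta):
    """Pair an event id with the user's role, defaulting to "Unknown"."""
    return (ev["id"], meta.get(ev["user"], "Unknown"))


def run():
    # Imported here so enrich() stays unit-testable without the Beam SDK.
    import apache_beam as beam

    with beam.Pipeline() as p:
        # Hypothetical sample data standing in for real sources.
        events = p | "ReadEvents" >> beam.Create([
            {"id": "e1", "user": "user1"},
            {"id": "e2", "user": "user3"},
        ])
        metadata = p | "ReadMetadata" >> beam.Create(
            [("user1", "Admin"), ("user2", "User")]
        )

        (
            events
            # AsDict turns the (key, value) PCollection into a dict per call.
            | "Join" >> beam.Map(enrich, meta=beam.pvalue.AsDict(metadata))
            | "Print" >> beam.Map(print)
        )
```

Calling `run()` executes the pipeline locally and prints one enriched tuple per event; the order of the printed elements is not guaranteed.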

Key Points

  • Side Inputs broadcast auxiliary datasets to all parallel worker nodes.
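  • Wrap the PCollection in a view from beam.pvalue to choose how it is materialized: AsDict, AsList, AsSingleton, or AsIter.
  • Supports slowly updating side inputs when windowed; each main-input window is matched to the corresponding side-input window automatically.
  • Avoid when the side-input dataset grows too large (roughly >100MB) to fit in a single worker's memory; a keyed join such as CoGroupByKey scales better in that case.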
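
The view types listed above can be combined in one transform. The sketch below is illustrative only: the sample data, the `keep_event` helper, and the parameter names `min_score` and `blocked_users` are assumptions made for the example. It passes a single threshold value with `AsSingleton` and a small blocklist with `AsList`.

```python
def keep_event(ev, min_score, blocked_users):
    """Keep events at or above the threshold whose user is not blocked."""
    return ev["score"] >= min_score and ev["user"] not in blocked_users


def run():
    # Imported here so keep_event() stays unit-testable without the Beam SDK.
    import apache_beam as beam

    with beam.Pipeline() as p:
        # Hypothetical sample data.
        events = p | "Events" >> beam.Create([
            {"user": "user1", "score": 7},
            {"user": "user2", "score": 9},
            {"user": "user3", "score": 2},
        ])
        # A single-element PCollection is handed over as one plain value.
        threshold = p | "Threshold" >> beam.Create([5])
        # A multi-element PCollection is handed over as an in-memory list.
        blocked = p | "Blocked" >> beam.Create(["user2"])

        (
            events
            | "Filter" >> beam.Filter(
                keep_event,
                min_score=beam.pvalue.AsSingleton(threshold),
                blocked_users=beam.pvalue.AsList(blocked),
            )
            | "Print" >> beam.Map(print)
        )
```

With this sample data only the `user1` event passes both checks. `AsSingleton` expects exactly one element in the PCollection unless a default is supplied.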

Production Recommendations

Developer Checklist
Always ensure the side input PCollection is small enough to fit in the memory of each worker VM.
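
One rough way to apply this check before launching a job is to estimate the size of the lookup data locally. The sketch below is a heuristic, not a Beam feature: `estimate_bytes` and `fits_as_side_input` are hypothetical helpers, the 100MB ceiling is the rule of thumb from this cheatsheet, and pickled size is only a proxy, since Python objects usually occupy more memory than their serialized form.

```python
import pickle

# Rule-of-thumb ceiling from this cheatsheet; tune it to your worker memory.
MAX_SIDE_INPUT_BYTES = 100 * 1024 * 1024


def estimate_bytes(rows):
    """Roughly estimate the footprint of rows as their total pickled size."""
    return sum(len(pickle.dumps(row)) for row in rows)


def fits_as_side_input(rows, limit=MAX_SIDE_INPUT_BYTES):
    """Return True when the estimate stays at or under the limit."""
    return estimate_bytes(rows) <= limit
```

For example, `fits_as_side_input([("user1", "Admin"), ("user2", "User")])` passes easily, while a table that fails the check is a candidate for a keyed join instead of a side input.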