Side Outputs
1. Introduction
A Side Output allows a single transformation (specifically ParDo) to split its processing and emit elements into multiple distinct output PCollections, rather than sending all results down a single main path.
2. Why This Concept Exists
Data processing pipelines must often handle unexpected formats, errors, or alternative routes. For example, if a pipeline parses strings into JSON, some rows will fail validation. Instead of crashing the entire pipeline or dropping those bad rows, side outputs allow you to route successfully parsed records to the main processing path, while routing malformed rows to a "Dead Letter Queue" path for debugging.
3. Key Terminology
- Main Output: The primary PCollection returned by a ParDo.
- Side Output: Additional PCollections produced by the same ParDo, accessed via tags.
- Dead Letter Queue (DLQ): A pattern where errors are routed to a separate location (like an error log file) for manual inspection.
4. How It Works
- In your custom
DoFn, you yield elements destined for the side outputs using the wrapperbeam.pvalue.TaggedOutput("tag_name", element). - Standard elements are yielded normally without wrappers, going directly to the main output.
- When applying the
ParDo, you specify all expected side output tags using the.with_outputs()method. - This returns a
DoOutputsTuplecontaining all the separated PCollections.
5. Visual Diagram
Main Output
Valid records
Tagged Output ('errors')
Malformed records
6. Code Example
Separating integers into even (main output) and odd (side output 'odd'):
import apache_beam as beam
class SplitNumbers(beam.DoFn):
def process(self, element):
if element % 2 == 0:
yield element
else:
yield beam.pvalue.TaggedOutput("odd", element)
with beam.Pipeline() as p:
results = (p
| "Nums" >> beam.Create([1, 2, 3, 4, 5])
| "Split" >> beam.ParDo(SplitNumbers()).with_outputs("odd", main="even"))
# Extract results
even_pcoll = results.even
odd_pcoll = results.odd
even_pcoll | "PrintEven" >> beam.Map(lambda x: print(f"Even: {x}"))
odd_pcoll | "PrintOdd" >> beam.Map(lambda x: print(f"Odd: {x}"))
7. Code Explanation
yield beam.pvalue.TaggedOutput("odd", element)routes odd numbers to the"odd"output stream.with_outputs("odd", main="even")instructs Beam to expect a side output named"odd", renaming the default main output to"even".- We unpack and process
results.evenandresults.oddseparately.
8. Real Production Example
Using the Dead Letter Queue (DLQ) pattern to catch JSON parsing failures:
import apache_beam as beam
import json
class ParseJsonDoFn(beam.DoFn):
def process(self, element):
try:
parsed = json.loads(element)
yield parsed # Sent to main output
except Exception as e:
# Sent to dead letter queue
yield beam.pvalue.TaggedOutput("dlq", {"raw_data": element, "error": str(e)})
with beam.Pipeline() as p:
raw_strings = p | "RawData" >> beam.Create(['{"name": "Alice"}', 'invalid-json'])
parsed_results = (raw_strings
| "Parse" >> beam.ParDo(ParseJsonDoFn()).with_outputs("dlq", main="parsed"))
# Process parsed logs
parsed_results.parsed | "Process" >> beam.Map(print)
# Save errors to separate file
parsed_results.dlq | "WriteErrors" >> beam.io.WriteToText("errors-log")
9. Common Mistakes
- Forgetting to call
.with_outputs(): If you yieldTaggedOutputbut forget to chain.with_outputs()on theParDo, the execution will fail because Beam will not know how to handle the tagged values. - Using
returninstead ofyield: Trying to return values instead of yielding them.TaggedOutputrelies on generator yields to emit multiple streams dynamically.
10. Interview Perspective
- Question: Can you have multiple side outputs on a single
ParDo? - Answer: Yes, you can specify as many tag strings as needed in
.with_outputs("tag1", "tag2", "tag3"). - Question: What is a Dead Letter Queue?
- Answer: It is an architectural pattern in data engineering where elements that fail processing validation or parsing are written to a separate error queue or file, allowing the main pipeline to continue running without interruptions.
11. Best Practices
- Always use descriptive names for your tags (e.g.
'dlq','invalid_records') rather than generic names like'side1'. - Set up alerts or monitoring dashboards on the outputs of the Dead Letter Queue files in production to watch error rates.
12. Summary
- Side outputs enable multi-path routing from a single
ParDo. - Emitted using
beam.pvalue.TaggedOutput("tag", element). - Declared using
.with_outputs()which returns a dictionary-like object of PCollections.