intermediate

Side Outputs

8 min readLast updated: 2026-07-01

1. Introduction

A Side Output allows a single transformation (specifically ParDo) to split its processing and emit elements into multiple distinct output PCollections, rather than sending all results down a single main path.

2. Why This Concept Exists

Data processing pipelines must often handle unexpected formats, errors, or alternative routes. For example, if a pipeline parses strings into JSON, some rows will fail validation. Instead of crashing the entire pipeline or dropping those bad rows, side outputs allow you to route successfully parsed records to the main processing path, while routing malformed rows to a "Dead Letter Queue" path for debugging.

3. Key Terminology

  • Main Output: The primary PCollection returned by a ParDo.
  • Side Output: Additional PCollections produced by the same ParDo, accessed via tags.
  • Dead Letter Queue (DLQ): A pattern where errors are routed to a separate location (like an error log file) for manual inspection.

4. How It Works

  • In your custom DoFn, you yield elements destined for the side outputs using the wrapper beam.pvalue.TaggedOutput("tag_name", element).
  • Standard elements are yielded normally without wrappers, going directly to the main output.
  • When applying the ParDo, you specify all expected side output tags using the .with_outputs() method.
  • This returns a DoOutputsTuple containing all the separated PCollections.

5. Visual Diagram

Input PCollection

Main Output
Valid records

Tagged Output ('errors')
Malformed records

6. Code Example

Separating integers into even (main output) and odd (side output 'odd'):

python
import apache_beam as beam

class SplitNumbers(beam.DoFn):
    def process(self, element):
        if element % 2 == 0:
            yield element
        else:
            yield beam.pvalue.TaggedOutput("odd", element)

with beam.Pipeline() as p:
    results = (p 
               | "Nums" >> beam.Create([1, 2, 3, 4, 5])
               | "Split" >> beam.ParDo(SplitNumbers()).with_outputs("odd", main="even"))
    
    # Extract results
    even_pcoll = results.even
    odd_pcoll = results.odd

    even_pcoll | "PrintEven" >> beam.Map(lambda x: print(f"Even: {x}"))
    odd_pcoll | "PrintOdd" >> beam.Map(lambda x: print(f"Odd: {x}"))

7. Code Explanation

  • yield beam.pvalue.TaggedOutput("odd", element) routes odd numbers to the "odd" output stream.
  • with_outputs("odd", main="even") instructs Beam to expect a side output named "odd", renaming the default main output to "even".
  • We unpack and process results.even and results.odd separately.

8. Real Production Example

Using the Dead Letter Queue (DLQ) pattern to catch JSON parsing failures:

python
import apache_beam as beam
import json

class ParseJsonDoFn(beam.DoFn):
    def process(self, element):
        try:
            parsed = json.loads(element)
            yield parsed # Sent to main output
        except Exception as e:
            # Sent to dead letter queue
            yield beam.pvalue.TaggedOutput("dlq", {"raw_data": element, "error": str(e)})

with beam.Pipeline() as p:
    raw_strings = p | "RawData" >> beam.Create(['{"name": "Alice"}', 'invalid-json'])
    
    parsed_results = (raw_strings 
                      | "Parse" >> beam.ParDo(ParseJsonDoFn()).with_outputs("dlq", main="parsed"))
    
    # Process parsed logs
    parsed_results.parsed | "Process" >> beam.Map(print)
    
    # Save errors to separate file
    parsed_results.dlq | "WriteErrors" >> beam.io.WriteToText("errors-log")

9. Common Mistakes

  • Forgetting to call .with_outputs(): If you yield TaggedOutput but forget to chain .with_outputs() on the ParDo, the execution will fail because Beam will not know how to handle the tagged values.
  • Using return instead of yield: Trying to return values instead of yielding them. TaggedOutput relies on generator yields to emit multiple streams dynamically.

10. Interview Perspective

  • Question: Can you have multiple side outputs on a single ParDo?
  • Answer: Yes, you can specify as many tag strings as needed in .with_outputs("tag1", "tag2", "tag3").
  • Question: What is a Dead Letter Queue?
  • Answer: It is an architectural pattern in data engineering where elements that fail processing validation or parsing are written to a separate error queue or file, allowing the main pipeline to continue running without interruptions.

11. Best Practices

  • Always use descriptive names for your tags (e.g. 'dlq', 'invalid_records') rather than generic names like 'side1'.
  • Set up alerts or monitoring dashboards on the outputs of the Dead Letter Queue files in production to watch error rates.

12. Summary

  • Side outputs enable multi-path routing from a single ParDo.
  • Emitted using beam.pvalue.TaggedOutput("tag", element).
  • Declared using .with_outputs() which returns a dictionary-like object of PCollections.

13. Interactive Challenges

Challenge 1: Route Exceptions (Beginner)

Write a DoFn that processes strings, yielding them to the main output, unless they start with the character "!" in which case it routes them to a side output tagged "alerts".

Challenge 2: Under-Threshold Outliers (Intermediate)

Write a complete pipeline block where numbers are parsed: those greater than or equal to 50 go to the main output, and those below 50 are routed to a side output tagged "low_scores". Unpack and print both outputs.

Challenge 3: Dynamic Category Router (Advanced)

Write a pipeline segment that processes log rows: rows containing "DEBUG" go to tag "debug", "INFO" go to the main output, and "ERROR" go to tag "error".

14. Related Content

Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)