advanced

Fault Tolerance

8 min readLast updated: 2026-07-01

1. Introduction

Fault tolerance is the ability of a stream processing pipeline to continue running correctly and preserving data consistency even when hardware, network, or software failures occur. In Apache Beam, fault tolerance is co-managed by the programming model and the underlying execution runner (e.g., Flink, Spark, or Dataflow).

2. Why This Concept Exists

Streaming pipelines are designed to run indefinitely—for weeks, months, or years. In any long-running distributed system, worker node crashes, network disconnects, disk failures, and software exceptions are inevitable. A robust streaming architecture must detect these failures, provision replacements, restore previous state, and resume processing without losing data or producing duplicate reports.

3. Key Terminology

  • Checkpointing: The process of periodically saving the internal state of the pipeline (such as window accumulators or user state) to durable storage.
  • Worker Failure: A hardware or OS crash on a virtual machine running a portion of the pipeline's transforms.
  • Dead-Letter Queue (DLQ): A secondary storage location (like a Pub/Sub topic or database table) where malformed or unprocessable elements are sent to prevent them from crashing the pipeline.
  • Dynamic Work Rebalancing: The process of moving tasks from slow or failing workers to healthy workers during execution.

4. How It Works

  • State Checkpoints: As data flows through stateful transforms, the runner periodically writes snapshots of the state and read offsets to a persistent storage system (like Google Cloud Storage or HDFS).
  • Failure Detection: The orchestrator monitors worker health via heartbeats. If a worker goes silent, the orchestrator marks it as failed.
  • State Recovery: The runner spins up a new worker node, assigns it the partitions of the failed worker, and restores its state from the latest healthy checkpoint.
  • Replay: The source connector (e.g. Kafka or Pub/Sub) replays messages from the offset recorded at the time of the checkpoint, allowing the new worker to catch up.

5. Visual Diagram

Durable Storage (GCS/HDFS)
Saves state checkpoints every 30s

Worker A (Healthy)
Continues stream read
Worker B (CRASHED!)
New instance starts, pulls checkpoint store to restore state

6. Code Example

The following example demonstrates a Dead-Letter Queue (DLQ) pattern inside a custom DoFn to handle data parsing errors gracefully, ensuring that corrupted elements do not crash the pipeline:

python
import json
import apache_beam as beam

class SafeParseJsonDoFn(beam.DoFn):
    # Declare tagged outputs for clean and dirty (failed) data
    OUTPUT_TAG_CLEAN = "clean"
    OUTPUT_TAG_DIRTY = "dirty"

    def process(self, element):
        try:
            # Attempt to parse json payload
            parsed = json.loads(element.decode("utf-8"))
            yield beam.pvalue.TaggedOutput(self.OUTPUT_TAG_CLEAN, parsed)
        except Exception as e:
            # In case of failure, output raw payload and error message to DLQ
            failed_record = {
                "raw_payload": str(element),
                "error": str(e)
            }
            yield beam.pvalue.TaggedOutput(self.OUTPUT_TAG_DIRTY, failed_record)

# Usage in pipeline:
# parse_result = raw_bytes | beam.ParDo(SafeParseJsonDoFn()).with_outputs(
#     SafeParseJsonDoFn.OUTPUT_TAG_CLEAN, 
#     SafeParseJsonDoFn.OUTPUT_TAG_DIRTY
# )
# clean_data = parse_result.clean
# dirty_data = parse_result.dirty (Write this to a DLQ Pub/Sub topic)

7. Code Explanation

  • SafeParseJsonDoFn wraps the JSON parsing logic in a try-except block.
  • If parsing succeeds, the record is directed to the "clean" output tag.
  • If parsing fails (due to a malformed payload), the exception is caught, and the error details are sent to the "dirty" output tag.
  • This pattern ensures that a single corrupted record ("poison pill") does not cause the worker to crash repeatedly, which would trigger a pipeline-wide failure loop.

8. Real Production Example

A telemetry pipeline ingests vehicle location metrics. When a server rack hosting a Beam worker loses power, the runner detects the offline node. It spins up a new worker VM, retrieves the state of the vehicle routing tracker from the last Google Cloud Storage checkpoint, and continues updating route metrics, losing zero location records.

9. Common Mistakes

  • Not Separating Code Bugs from Infrastructure Failures: If your DoFn contains a division-by-zero error, the runner will attempt to retry the failing element. If not handled, this causes the worker to crash repeatedly, exhausting retry quotas and shutting down the pipeline. Code errors must be handled via try-except blocks, while infrastructure errors (VM crashes) are handled by the runner.
  • Ignoring Sink Idempotency: If a worker fails halfway through writing a batch to an external database, the replacement worker may replay the data. If the database does not support deduplication or transactions, duplicate writes will occur.

10. Interview Perspective

  • Question: What is a "poison pill" in a streaming pipeline?
  • Answer: A poison pill is an input message that causes a processing transform to throw an unhandled exception or crash. Because the message is never successfully processed, the runner keeps retrying it, creating an infinite crash loop.
  • Question: How does the runner optimize the trade-off between checkpointing frequency and recovery time?
  • Answer: Frequent checkpoints reduce recovery time because the new worker has less data to replay. However, checkpoints consume CPU and network bandwidth. Runners typically balance this by snapshotting state every 30 to 60 seconds.

11. Best Practices

  • Always use the Dead-Letter Queue (DLQ) pattern for inputs originating from unvalidated external sources (e.g. web forms, public APIs).
  • Test pipeline fault tolerance by running a stress test and manually killing a worker machine to observe the runner's self-healing capabilities.
  • Keep your user-defined state structures as small as possible to minimize the write overhead during checkpoint operations.

12. Summary

  • Fault tolerance keeps streaming pipelines running through hardware crashes.
  • Checkpoints save runner state to durable storage periodically.
  • Poison pills must be handled inside code using Dead-Letter Queues (DLQ).

13. Interactive Challenges

Challenge 1: Implement Try-Except DLQ (Beginner)

Complete a DoFn named SafeDividerDoFn that takes tuples of (key, value). Attempt to return (key, 100 / value). If a ZeroDivisionError occurs, emit the error tuple (key, "ZERO_ERROR") to a side output tagged "errors".

Challenge 2: Multi-Output Pipeline Routing (Intermediate)

Given a PCollection of strings named raw_csv, write a pipeline segment that applies SafeParseJsonDoFn (from Section 6) and routes clean elements to a database writer and dirty elements to a backup error file.

Challenge 3: Parse Integer with DLQ Rescue (Advanced)

Create a DoFn named IntegerParserDoFn that parses string elements into integers. If parsing fails, output the original string along with the string "PARSE_FAILURE" as a tuple to a side output tagged "malformed".

14. Related Content

Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)