intermediate

Logging & Monitoring

5 min read | Last updated: 2026-07-02

1. Introduction

Logging and monitoring in Apache Beam pipelines means instrumenting pipeline code so that diagnostics and telemetry are written to the distributed workers' logs at runtime.

2. Why This Concept Exists

Because pipeline code runs distributed across many worker machines (potentially hundreds of virtual machines in a cluster), traditional debugging with a local debugger or a single stack trace is impractical. Logging and monitoring provide near-real-time visibility into pipeline throughput, lag, and errors.

3. Code Example

Integrating Python's standard logging library inside a DoFn:

python
import apache_beam as beam
import logging

class LogProcessDoFn(beam.DoFn):
    def process(self, element):
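        # Log each element at INFO; lazy %-style arguments defer string
        # formatting until the record is actually emitted.
        logging.info("Processing transaction: %s", element)
        # Flag transactions above the 10,000 threshold at WARNING level.
        if float(element.get("amount", 0)) > 10000:
            logging.warning("High-value transaction flagged: %s", element)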
        yield element

4. Key Takeaways

  • Worker logs are surfaced through each runner's own logging backend (e.g. Cloud Logging, formerly Stackdriver, for Dataflow; Log4j configuration for Flink).
  • Avoid logging every record at the INFO level in high-throughput workloads (millions of records/sec); the resulting log volume can saturate worker disks.
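
One way to follow the second takeaway is to sample log output rather than emit a line per record. The sketch below uses only Python's standard `logging` module; `SampledLogger`, its `every_n` parameter, and the logger name `pipeline.sampling` are hypothetical names introduced for illustration and are not part of Beam.

```python
import logging

logger = logging.getLogger("pipeline.sampling")  # hypothetical logger name


class SampledLogger:
    """Hypothetical helper that emits only every Nth INFO message."""

    def __init__(self, every_n=1000):
        if every_n < 1:
            raise ValueError("every_n must be >= 1")
        self.every_n = every_n
        self.seen = 0

    def info(self, msg, *args):
        # Count every call, but forward only each Nth one to the real logger.
        self.seen += 1
        if self.seen % self.every_n == 0:
            logger.info(msg, *args)
            return True
        return False
```

Inside a DoFn, an instance could be created in `setup()` and called from `process()`, so each DoFn instance keeps its own count. The sampling is therefore per instance, not global across the pipeline.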