intermediate
Logging & Monitoring
5 min read | Last updated: 2026-07-02
1. Introduction
Logging and monitoring in Apache Beam pipelines means instrumenting pipeline code so that it emits diagnostic log messages and telemetry, which the runner collects from the distributed workers at runtime.
2. Why This Concept Exists
Because pipeline code runs distributed across many worker machines (potentially hundreds of virtual machines in a cluster), attaching a debugger or following a single local stack trace is rarely practical. Logging and monitoring provide near-real-time visibility into pipeline throughput, lag, and errors.
3. Code Example
Integrating Python's standard logging library inside a DoFn:
python
import logging

import apache_beam as beam


class LogProcessDoFn(beam.DoFn):
    def process(self, element):
        # Log each record, and warn on transactions above 10,000.
        # Assumes element is a dict with an optional "amount" field.
        logging.info("Processing transaction: %s", element)
        if float(element.get("amount", 0)) > 10000:
            logging.warning("High-value transaction flagged: %s", element)
        yield element
4. Key Takeaways
- Worker logs are streamed to the runner's logging backend and console (e.g. Cloud Logging, formerly Stackdriver, for Dataflow; Log4j configuration for Flink).
- Avoid logging every single record at the INFO level under high-throughput workloads (millions of records per second) to prevent worker disk saturation.
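One way to act on the last takeaway is to log only a random sample of records. This is a minimal sketch using only the standard library: the helper name `log_sampled` and the `SAMPLE_RATE` value are hypothetical and would need tuning for a real workload.

```python
import logging
import random

# Hypothetical rate: log roughly 1 in 1,000 records.
SAMPLE_RATE = 0.001


def log_sampled(element, sample_rate=SAMPLE_RATE, rng=random):
    """Log the element at INFO with probability sample_rate.

    Returns True if the element was logged, False otherwise.
    """
    # rng.random() returns a float in [0.0, 1.0), so a rate of 1.0
    # always logs and a rate of 0.0 never does.
    if rng.random() < sample_rate:
        logging.info("Sampled record: %s", element)
        return True
    return False
```

Inside a DoFn's `process` method, `log_sampled(element)` could stand in for an unconditional `logging.info` call. Random sampling keeps no state between calls, so each worker can apply it independently.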