Logs — Learn Apache Beam

1. Introduction

In a distributed environment, you cannot access standard outputs or debuggers directly on the physical execution machines. Google Cloud Dataflow addresses this by capturing all worker stdout, stderr, and logging streams, converting them into structured records, and forwarding them to Google Cloud Logging for search and analysis.

2. Why This Concept Exists

Distributed architectures execute pipeline code across dozens or hundreds of virtual machines in parallel.

Without Centralized Logs: If worker VM #47 encounters a corrupted database record and crashes, diagnosing the root cause would require manually SSH-ing into the instance and hunting through system directories.
With Centralized Logs: Dataflow captures runtime messages in real-time, associating each log statement with its specific job ID, step identifier, worker VM name, and timestamp. This centralized repository is searchable from a single user interface.

3. Key Terminology

Worker Logs: Log messages emitted by the code running inside the worker VM containers (user code, SDK harness, boot systems).
Job Logs: System-level logs generated by the Dataflow service detailing VM provisioning, autoscaling actions, and scheduling status.
Cloud Logging: The GCP service that stores, indexes, and queries logs generated by GCP services.
Log Level / Severity: Categories indicating log importance (DEBUG, INFO, WARNING, ERROR, CRITICAL).

4. How It Works

Log Interception: Inside a custom DoFn or transform, the developer uses the standard Python logging module.
Harness Buffering: The containerized SDK harness intercepts these calls, formatting them into JSON envelopes containing the log text, severity, step name, and thread ID.
Local Transport: The harness writes the formatted logs to a local log file on the worker VM.
Logging Agent: A pre-installed logging agent running on the Compute Engine VM monitors this file, sending batch payloads to the Cloud Logging API.
Console Display: The Cloud Console queries these records, updating the logging tab in the Dataflow monitoring UI.

5. Visual Diagram

logging.info() / logging.error()
Called inside python DoFn

▼

GCP Cloud Logging Agent
Collects local system files

▼ (HTTPS Batch Sync)

Dataflow Console UI
Stream logs in real-time

6. Code Example

The following code demonstrates how to implement structured logging in a custom DoFn, utilizing different severity levels to categorize pipeline events:

python

import logging
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ValidateAndLogRecord(beam.DoFn):
    def setup(self):
        # Retrieve the logger named after the module
        self.logger = logging.getLogger(__name__)
        self.logger.info("DoFn setup completed successfully.")

    def process(self, element):
        try:
            # Assume elements are dictionary payloads
            user_id = element.get("user_id")
            amount = element.get("amount")
            
            if not user_id:
                # Issue warning, but don't fail the pipeline
                self.logger.warning("Record missing user_id: %s", str(element))
                return
            
            if amount < 0:
                # Log error, but proceed
                self.logger.error("Negative transaction amount found for User %s: %f", user_id, amount)
                return

            yield element
            
        except Exception as e:
            # Fatal step error
            self.logger.critical("Unexpected crash during record validation: %s", str(e), exc_info=True)
            raise e

def run():
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | "CreateMockData" >> beam.Create([
             {"user_id": "usr-1", "amount": 100.5},
             {"amount": 50.0},  # Missing user_id -> triggers Warning
             {"user_id": "usr-2", "amount": -10.0},  # Negative amount -> triggers Error
         ])
         | "Validate" >> beam.ParDo(ValidateAndLogRecord())
        )

if __name__ == "__main__":
    # Configure logging level locally (DirectRunner)
    logging.basicConfig(level=logging.INFO)
    run()

7. Code Explanation

logging.getLogger(__name__) retrieves a standard logger instance. Using a module-level logger allows you to filter logs by package name.
self.logger.warning(...) and self.logger.error(...) emit logs to the standard logging stream. The Cloud Logging agent automatically translates these to WARNING and ERROR severity levels in GCP.
exc_info=True automatically parses the stack trace of an exception, publishing the full debug dump to Cloud Logging under the critical log record.

8. Real Production Example

An e-commerce payment pipeline logs invalid token codes. A security analyst creates a Log-Based Alert in Cloud Logging that monitors the Dataflow job logs. If more than 5 logs with severity ERROR matching the phrase "Token Authentication Failure" occur within 1 minute, the alert system sends a notification to the security team's Slack channel.

9. Common Mistakes

Over-Logging in Hot Loops: Placing self.logger.info() statements inside high-throughput process() methods where workers process millions of items. This causes severe disk I/O bottlenecks, high network serialization costs, and massive GCP Cloud Logging bills.
Using print() for Logging: Relying on Python print() statements. While Dataflow forwards stdout to Cloud Logging, it logs them all at the INFO level, making it impossible to separate normal outputs from system errors.

10. Interview Perspective

Question: How can logging impact the performance of a high-throughput Dataflow streaming pipeline?
Answer: Logging is a blocking disk and network operation. If you write a log line for every element in a high-throughput pipeline (e.g., 50k records/sec), the time spent formatting strings and writing them to the disk buffers will create a bottleneck. Workers will spend more time writing logs than executing transformations, leading to pipeline lag.
Question: How do you change the log level of a running Dataflow job without changing code?
Answer: You can override default log levels by configuring the --default_sdk_harness_log_level_overrides pipeline option at job launch (e.g. {"apache_beam.runners.dataflow": "WARNING"}).

11. Best Practices

Never log raw user personal data (PII) like passwords or credit card numbers to comply with security regulations.
For high-volume datasets, log debug/info summaries in the finish_bundle() method instead of logging inside the individual process() calls.
Use python formatting syntax (e.g. logger.info("Val: %s", val)) instead of string interpolation (logger.info(f"Val: {val}")). String interpolation executes immediately even if the log level is disabled, wasting CPU cycles.

12. Summary

Dataflow forwards VM stdout, stderr, and log streams to Cloud Logging.
Log statements inside DoFns should use the standard logging library.
Use correct log levels (INFO, WARNING, ERROR) to organize alerts.
Avoid logging inside hot execution loops to maintain high processing speeds.

13. Interactive Challenges

Challenge 1: Basic Logged DoFn (Beginner)

Write a DoFn subclass LogLength that takes a string input, logs its value and length at the INFO level, and yields the length.

Challenge 2: Log Warnings for Outliers (Intermediate)

Write a DoFn subclass FilterOutliers that accepts float values. If a value is less than 0.0 or greater than 100.0, log a warning saying "Outlier detected: [value]" and discard it. Otherwise, yield the value.

Challenge 3: Throttled Custom Logger (Advanced)

Write a custom DoFn that processes items in a high-volume pipeline. It should only emit a log once every 1,000 items (a progress heartbeat) to prevent log flooding.

14. Related Content

Previous LessonMonitoring Jobs

Next LessonAutoscaling