Beam Best PracticesEvergreen Article

Unified Batch & Stream Processing in Production

Published: July 02, 20268 min read

One of the central value propositions of Apache Beam is unification: the promise that you can write a single pipeline code and run it as either a batch job (processing historical data) or a streaming job (processing real-time events) with no structural changes.

While this unified model is incredibly powerful, moving a unified pipeline from a test environment into enterprise production reveals several real-world operational challenges.


1. The Core Philosophy

The Beam programming model separates the pipeline logic into four key dimensions (The "4 W's"):

  1. What is computed? (Transforms)
  2. Where in event-time? (Windowing)
  3. When in processing-time are results emitted? (Triggers)
  4. How do results relate? (Accumulation Modes)

For batch processing, the watermark moves instantly to infinity, so window bounds and triggers evaluate instantly. In streaming, the watermark advances continuously, evaluating triggers dynamically.

Because of this mathematical symmetry, the SDK compiles both batch and stream pipelines.


2. Production Challenges to Expect

I/O Connector Divergence

While your processing logic (e.g. mapping, filtering) is unified, your source and sink connectors rarely are.

  • In Streaming: You read from message queues like Google Cloud Pub/Sub or Apache Kafka and write to real-time analytics tables.
  • In Batch: You read from cloud storage file paths (gs://...) or query databases directly.

You must design your pipeline initialization code to dynamically swap I/O stages based on runtime arguments:

python
import apache_beam as beam

class DynamicPipelineOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--is_streaming', type=bool, default=False)

options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    if options.view_as(DynamicPipelineOptions).is_streaming:
        raw_data = p | "ReadStream" >> beam.io.ReadFromPubSub(subscription="clicks-sub")
    else:
        raw_data = p | "ReadBatch" >> beam.io.ReadFromText("gs://bucket/history.csv")
        
    # Unified logic applies below this step
    processed = raw_data | "UnifiedTransform" >> MyEnrichmentFn()

State Storage and Memory Accumulation

In batch mode, states are discarded once a partition completes. In streaming, window states must be persisted in memory or external state backends for allowed lateness windows. A pipeline that runs cleanly in batch mode can trigger worker OutOfMemory (OOM) errors in stream mode if allowed lateness or triggers accumulate excessive states.


3. Best Practices Checklist

  • [ ] Swap write configurations: For batch writes, use overwrites (WRITE_TRUNCATE). For stream writes, use streaming append options (WRITE_APPEND) and streaming storage write APIs.
  • [ ] Define separate monitoring alerts: Streaming pipelines require latency lag alerts. Batch jobs require duration threshold warnings.
  • [ ] Always test backfills locally: Run historical batch backfills using the Direct Runner or local cluster scripts to verify database key boundaries before launching active streams.