One of the central value propositions of Apache Beam is unification: the promise that you can write a single pipeline code and run it as either a batch job (processing historical data) or a streaming job (processing real-time events) with no structural changes.
While this unified model is incredibly powerful, moving a unified pipeline from a test environment into enterprise production reveals several real-world operational challenges.
The Beam programming model separates the pipeline logic into four key dimensions (The "4 W's"):
For batch processing, the watermark moves instantly to infinity, so window bounds and triggers evaluate instantly. In streaming, the watermark advances continuously, evaluating triggers dynamically.
Because of this mathematical symmetry, the SDK compiles both batch and stream pipelines.
While your processing logic (e.g. mapping, filtering) is unified, your source and sink connectors rarely are.
gs://...) or query databases directly.You must design your pipeline initialization code to dynamically swap I/O stages based on runtime arguments:
import apache_beam as beam
class DynamicPipelineOptions(PipelineOptions):
@classmethod
def _add_argparse_args(cls, parser):
parser.add_value_provider_argument('--is_streaming', type=bool, default=False)
options = PipelineOptions()
with beam.Pipeline(options=options) as p:
if options.view_as(DynamicPipelineOptions).is_streaming:
raw_data = p | "ReadStream" >> beam.io.ReadFromPubSub(subscription="clicks-sub")
else:
raw_data = p | "ReadBatch" >> beam.io.ReadFromText("gs://bucket/history.csv")
# Unified logic applies below this step
processed = raw_data | "UnifiedTransform" >> MyEnrichmentFn()
In batch mode, states are discarded once a partition completes. In streaming, window states must be persisted in memory or external state backends for allowed lateness windows. A pipeline that runs cleanly in batch mode can trigger worker OutOfMemory (OOM) errors in stream mode if allowed lateness or triggers accumulate excessive states.
WRITE_TRUNCATE). For stream writes, use streaming append options (WRITE_APPEND) and streaming storage write APIs.