Runtime Parameters
1. Introduction
Runtime Parameters are configuration options passed to a pipeline at execution launch, enabling users to alter the pipeline's behavior (such as changing file sources, databases, or scaling boundaries) without editing or recompiling the pipeline code. In Apache Beam, this is achieved using the ValueProvider class.
2. Why This Concept Exists
By default, Apache Beam evaluates parameters at pipeline construction time (sometimes loosely called compile time), when the pipeline DAG is built.
- Construction-time evaluation: If you write beam.io.ReadFromText(my_path) where my_path is a plain string, that value is locked into the graph structure.
- The Issue: If you build a reusable template, you cannot change the input path without rebuilding and redeploying the template.
Runtime Parameters solve this limitation. By delaying parameter evaluation, templates can be built once and run repeatedly with different options, allowing orchestrators to trigger the pipeline dynamically.
3. Key Terminology
- ValueProvider: A container class in the Apache Beam SDK that wraps a variable, delaying its resolution until the pipeline actually starts executing on workers.
- Static Value: A standard Python variable resolved immediately during pipeline construction.
- Runtime Value: A value supplied only when a job is launched, mapped to a ValueProvider object.
- NestedValueProvider: A helper class that transforms the output of another ValueProvider during execution.
4. How It Works
When you define a pipeline using ValueProvider parameters:
- Graph Construction: The SDK creates placeholders for these variables. Since their values are unknown, the code must treat them as wrapper objects rather than raw strings or integers.
- Job Staging: The job graph containing these placeholders is submitted.
- Job Launch: The user launches the template and supplies values (e.g. --input_file=gs://bucket/data.csv).
- Worker Resolution: When worker VMs boot, the Dataflow runner injects the values into the ValueProvider placeholders. Within custom transforms (like DoFn processes), you retrieve the raw value by calling .get() on the wrapper.
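The four stages above can be imitated in plain Python. The sketch below is a toy model, not Beam's implementation: DeferredValue and its resolve method are hypothetical names invented for illustration, standing in for a ValueProvider placeholder and for the runner's injection step.

```python
class DeferredValue:
    """Toy stand-in for a ValueProvider placeholder (hypothetical, not a Beam class)."""

    _UNSET = object()

    def __init__(self, name, value_type):
        self.name = name
        self.value_type = value_type
        self._value = self._UNSET

    def resolve(self, raw):
        # Stands in for the runner injecting the launch-time value.
        self._value = self.value_type(raw)

    def get(self):
        if self._value is self._UNSET:
            raise RuntimeError(f"{self.name} is not available before launch")
        return self._value


# 1. Graph construction: only a placeholder exists.
input_file = DeferredValue("input_file", str)


def read_step():
    # Worker-side code: the value is looked up only when the step runs.
    return f"reading {input_file.get()}"


# 2. Job staging: the step can be stored without knowing the value.
staged_graph = [read_step]

# 3. Job launch: the user supplies the value.
input_file.resolve("gs://bucket/data.csv")

# 4. Worker resolution: the step runs and calls .get().
results = [step() for step in staged_graph]
print(results)
```

The point of the model is the ordering: read_step is defined and staged before any value exists, and it only succeeds because .get() is deferred until after the launch step.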
5. Visual Diagram
6. Code Example
The following script demonstrates registering runtime parameters and accessing them inside a custom DoFn:
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


# 1. Define custom options using ValueProvider
class PipelineRuntimeOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            "--cutoff_limit",
            type=int,
            default=50,
            help="Cutoff integer threshold for filtering"
        )


# 2. Access the value inside DoFn
class FilterByThreshold(beam.DoFn):
    def __init__(self, threshold_provider):
        # Store the provider reference (do not call .get() here!)
        self.threshold_provider = threshold_provider

    def process(self, element):
        # Retrieve the value at runtime on the workers
        limit = self.threshold_provider.get()
        if element > limit:
            yield element


def run():
    options = PipelineOptions()
    runtime_opts = options.view_as(PipelineRuntimeOptions)
    with beam.Pipeline(options=options) as p:
        (p
         | "CreateNumbers" >> beam.Create([10, 45, 60, 80, 20])
         | "FilterNumbers" >> beam.ParDo(FilterByThreshold(runtime_opts.cutoff_limit))
         | "Log" >> beam.Map(print)
        )


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run()
7. Code Explanation
- parser.add_value_provider_argument registers the parameter. The option returned (runtime_opts.cutoff_limit) is an instance of ValueProvider, not an int.
- Inside FilterByThreshold.__init__, we accept the ValueProvider object and save it as an instance attribute.
- CRITICAL: We must not call self.threshold_provider.get() inside __init__. The __init__ constructor runs during graph construction on the launch machine, where the value is still unset, so the call would raise an error.
- Inside process, which runs on the worker VMs during record processing, calling self.threshold_provider.get() safely retrieves the runtime value.
8. Real Production Example
A company processes multi-tenant customer records. They write a single Dataflow job that loads raw database dumps, transforms them, and exports to client-specific GCS folder paths. Instead of deploying distinct pipelines, they trigger a single template passing parameters: --tenant_id=client-abc and --export_bucket=gs://client-abc-reporting. The pipeline reads these values at launch to direct data flow dynamically.
9. Common Mistakes
- Calling .get() during DAG construction: Attempting to use .get() directly in the pipeline setup code, e.g. cutoff = runtime_opts.cutoff_limit.get(). This will crash with a RuntimeError stating that the value is not available.
- Using Non-Supported Transforms: Passing ValueProvider objects to transforms that do not explicitly accept them. For example, some custom third-party SDK connectors do not accept ValueProviders for configuration fields.
10. Interview Perspective
- Question: What happens if you try to evaluate a ValueProvider value inside the __init__ method of a DoFn?
- Answer: The execution fails with a runtime exception. This is because __init__ is evaluated on the client machine during pipeline construction. At this stage, the job has not been submitted or launched on GCP, meaning the runtime parameters are completely undefined.
- Question: How do you perform modifications on a ValueProvider value before it reaches a transform?
- Answer: You can use NestedValueProvider. It accepts a parent ValueProvider and a translation function, applying the logic once the value is resolved.
11. Best Practices
- Always implement fallback values in add_value_provider_argument using the default parameter.
- Wrap complex transformations on runtime values using NestedValueProvider to avoid messy worker-side conversions.
- Verify that your sinks and sources support ValueProvider arguments in the version of Apache Beam you are using.
12. Summary
- Runtime parameters decouple pipeline code from dynamic variables.
- ValueProvider objects defer variable resolution until execution.
- Never call .get() during graph initialization; only call it inside worker-side execution logic (e.g. process).
- Templates use runtime parameters to run dynamic workloads.