intermediate

Runtime Parameters

7 min read · Last updated: 2026-07-01

1. Introduction

Runtime Parameters are configuration options passed to a pipeline at execution launch, enabling users to alter the pipeline's behavior (such as changing file sources, databases, or scaling boundaries) without editing or recompiling the pipeline code. In Apache Beam, this is achieved using the ValueProvider class.

2. Why This Concept Exists

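By default, Apache Beam evaluates parameters at pipeline construction time, when the pipeline DAG is built on the launching machine. Python has no separate compile step, so "compile time" in this article refers to that graph-construction phase.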

  • Compile-time evaluation: If you write beam.io.ReadFromText(my_path) where my_path is a simple string, that value is locked into the graph structure.
  • The Issue: If you build a reusable template, you cannot change the input path without recompiling and redeploying the template.

Runtime Parameters solve this limitation. By delaying parameter evaluation, templates can be built once and run repeatedly with different options, allowing orchestrators to trigger the pipeline dynamically.

3. Key Terminology

  • ValueProvider: A container class in the Apache Beam SDK that wraps a variable, delaying its resolution until the pipeline actually starts executing on workers.
  • Static Value: A standard Python value resolved immediately, while the pipeline graph is being constructed.
  • Runtime Value: A value supplied only when a job is launched, mapped to a ValueProvider object.
  • NestedValueProvider: A helper class that wraps another ValueProvider together with a translator function and applies that function once the wrapped value is resolved during execution.

4. How It Works

When you define a pipeline using ValueProvider parameters:

  1. Graph Construction: The SDK creates placeholders for these variables. Since their values are unknown, the code must treat them as wrapper objects rather than raw strings or integers.
  2. Job Staging: The job graph containing these placeholders is submitted.
  3. Job Launch: The user launches the template and supplies values (e.g. --input_file=gs://bucket/data.csv).
  4. Worker Resolution: When the worker VMs start, the Dataflow runner injects the supplied values into the ValueProvider placeholders. Within custom transforms (for example, in a DoFn's process method), you retrieve the raw value by calling .get() on the wrapper.
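The four steps above can be sketched with a toy model in plain Python. This is only an illustration of deferred resolution, not Beam's implementation: the DeferredValue class, its inject method, and the staged_graph dictionary are hypothetical names invented for this sketch.

```python
class DeferredValue:
    """Toy stand-in for a runtime parameter. This is NOT Beam's ValueProvider."""

    # Hypothetical global store: stays None until the "job" is launched.
    _launch_values = None

    def __init__(self, name, default=None):
        self.name = name
        self.default = default

    @classmethod
    def inject(cls, values):
        """Simulates the runner handing launch-time values to the workers."""
        cls._launch_values = dict(values)

    def get(self):
        if DeferredValue._launch_values is None:
            raise RuntimeError(f"{self.name}: no value before the job is launched")
        return DeferredValue._launch_values.get(self.name, self.default)


# 1. Graph construction: only a placeholder exists, so get() must fail.
placeholder = DeferredValue("input_file")
early_error = None
try:
    placeholder.get()
except RuntimeError as exc:
    early_error = str(exc)

# 2. Job staging: the placeholder (not a value) travels with the graph.
staged_graph = {"read_step": placeholder}

# 3. Job launch: the user supplies the concrete value.
DeferredValue.inject({"input_file": "gs://bucket/data.csv"})

# 4. Worker resolution: the same placeholder now resolves.
resolved = staged_graph["read_step"].get()
print(early_error)
print(resolved)
```

The point of the model is that the graph only ever holds the wrapper, so any code that needs the raw value has to wait until after the launch step.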

5. Visual Diagram

[Diagram: two stages. Compile-Time (Local): the DAG is built and staged without values, holding a ValueProvider placeholder. Runtime (Execution): the user injects parameters on job submit, and workers execute using the values returned by .get().]

6. Code Example

The following script demonstrates registering runtime parameters and accessing them inside a custom DoFn:

python
import logging
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# 1. Define custom options using ValueProvider
class PipelineRuntimeOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            "--cutoff_limit",
            type=int,
            default=50,
            help="Cutoff integer threshold for filtering"
        )

# 2. Access the value inside DoFn
class FilterByThreshold(beam.DoFn):
    def __init__(self, threshold_provider):
        # Store the provider reference (do not call .get() here!)
        self.threshold_provider = threshold_provider

    def process(self, element):
        # Retrieve the value at runtime on the workers
        limit = self.threshold_provider.get()
        if element > limit:
            yield element

def run():
    options = PipelineOptions()
    runtime_opts = options.view_as(PipelineRuntimeOptions)
    
    with beam.Pipeline(options=options) as p:
        (p
         | "CreateNumbers" >> beam.Create([10, 45, 60, 80, 20])
         | "FilterNumbers" >> beam.ParDo(FilterByThreshold(runtime_opts.cutoff_limit))
         | "Log" >> beam.Map(print)
        )

if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run()

7. Code Explanation

  • parser.add_value_provider_argument registers the parameter. The option returned (runtime_opts.cutoff_limit) is an instance of ValueProvider, not an int.
  • Inside FilterByThreshold.__init__, we accept the ValueProvider object and save it as an instance attribute.
  • CRITICAL: We must not call self.threshold_provider.get() inside __init__. The constructor runs during graph construction on the launch machine, where the value is still unset, so the call would raise an error.
  • Inside process, which runs on the worker VMs during record processing, calling self.threshold_provider.get() safely retrieves the runtime value.

8. Real Production Example

A company processes multi-tenant customer records. They write a single Dataflow job that loads raw database dumps, transforms them, and exports to client-specific GCS folder paths. Instead of deploying distinct pipelines, they trigger a single template passing parameters: --tenant_id=client-abc and --export_bucket=gs://client-abc-reporting. The pipeline reads these values at launch to direct data flow dynamically.

9. Common Mistakes

  • Calling .get() during DAG construction: Using .get() directly in the pipeline setup code, e.g. cutoff = runtime_opts.cutoff_limit.get(). When the value has not been supplied yet, this raises a RuntimeError because the value is not available outside a runtime context.
  • Using Unsupported Transforms: Passing ValueProvider objects to transforms that do not explicitly accept them. For example, some custom third-party SDK connectors do not accept ValueProviders for configuration fields.

10. Interview Perspective

  • Question: What happens if you try to evaluate a ValueProvider value inside the __init__ method of a DoFn?
  • Answer: The execution fails with a runtime exception. This is because __init__ is evaluated on the client machine during pipeline compilation. At this stage, the job has not been submitted or launched on GCP, meaning the runtime parameters are completely undefined.
  • Question: How do you perform modifications on a ValueProvider value before it reaches a transform?
  • Answer: You can use NestedValueProvider. It accepts a parent ValueProvider and a translation function, applying the logic once the value is resolved.

11. Best Practices

  • Provide fallback values through the default parameter of add_value_provider_argument wherever a sensible default exists.
  • Wrap complex transformations on runtime values using NestedValueProvider to avoid messy worker-side conversions.
  • Verify that your sinks and sources support ValueProvider arguments in the version of Apache Beam you are using.

12. Summary

  • Runtime parameters decouple pipeline code from dynamic variables.
  • ValueProvider objects defer variable resolution until execution.
  • Never call .get() during graph initialization; only call it inside worker-side execution logic (e.g. process).
  • Templates use runtime parameters to run dynamic workloads.

13. Interactive Challenges

Challenge 1: Register Runtime Option (Beginner)

Write an options class registering a single string ValueProvider option named gs_input_path without a default value.

Challenge 2: Construct Runtime Filtering DoFn (Intermediate)

Write a DoFn subclass FilterSuffix that receives a string ValueProvider called suffix_provider and filters out strings that do not end with that suffix.

Challenge 3: Value Transformation with NestedValueProvider (Advanced)

Write a code snippet that takes a string ValueProvider object folder_provider and constructs a new ValueProvider that appends the string "/manifest.json" to the folder path using NestedValueProvider.
