One of Google Cloud Dataflow's primary benefits is autoscaling: the ability of the runner to dynamically allocate worker VMs based on the pipeline's workload, reducing resource costs while meeting performance guarantees.
However, how does Dataflow decide when to scale up or scale down, and what metrics does it evaluate?
Dataflow's autoscaling algorithm operates differently depending on the pipeline execution mode:
For batch pipelines, Dataflow evaluates the total work remaining (in bytes) and the throughput rate of the current workers.
Streaming pipelines do not have a finite dataset size, so they scale based on latency and backlog:
You configure autoscaling boundaries using execution options. Here is a command example showing how to set maximum worker VM boundaries when running a Python pipeline on Google Cloud Dataflow:
python pipeline.py \
--runner=DataflowRunner \
--project=my-gcp-project \
--region=us-central1 \
--autoscaling_algorithm=THROUGHPUT_BASED \
--num_workers=2 \
--max_num_workers=20 \
--worker_machine_type=n1-standard-4 \
--temp_location=gs://my-bucket/temp
DoFn contains blocking network requests (e.g. calling an external API sequentially), the worker will spend CPU cycles waiting on network responses. Since the CPU usage remains low, Dataflow will not scale up even though System Lag is increasing! Always use connection pools and batch requests.