Cost Optimization — Learn Apache Beam

1. Introduction

Cost Optimization in Google Cloud Dataflow focuses on reducing the financial expenses associated with running data processing pipelines on GCP. Because Dataflow charges are calculated based on resource usage (vCPU, memory, storage, shuffle data processed, and streaming engine usage), optimizing cost involves selecting cheaper scheduling models, utilizing discounted hardware (Spot VMs), and managing network traffic.
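
To make the billing model concrete, the sketch below adds up the resource dimensions listed above for a single batch job. The unit rates are placeholders chosen for illustration, not published prices, and the names JobUsage and estimate_job_cost are hypothetical; substitute current per-region figures before using it for planning.

```python
from dataclasses import dataclass

# Placeholder unit rates in USD. These are NOT real prices.
RATES = {
    "vcpu_hour": 0.06,
    "memory_gb_hour": 0.004,
    "disk_gb_hour": 0.0001,
    "shuffle_gb": 0.01,
}


@dataclass
class JobUsage:
    workers: int
    vcpus_per_worker: int
    memory_gb_per_worker: float
    disk_gb_per_worker: float
    hours: float
    shuffle_gb: float = 0.0


def estimate_job_cost(usage: JobUsage, rates: dict = RATES) -> float:
    """Sum each billed dimension: rate x quantity x duration."""
    worker_hours = usage.workers * usage.hours
    cost = (
        worker_hours * usage.vcpus_per_worker * rates["vcpu_hour"]
        + worker_hours * usage.memory_gb_per_worker * rates["memory_gb_hour"]
        + worker_hours * usage.disk_gb_per_worker * rates["disk_gb_hour"]
        + usage.shuffle_gb * rates["shuffle_gb"]
    )
    return round(cost, 2)


usage = JobUsage(workers=10, vcpus_per_worker=4, memory_gb_per_worker=15,
                 disk_gb_per_worker=250, hours=2, shuffle_gb=500)
print(estimate_job_cost(usage))
```

Because every worker-side term scales with worker count and duration, halving either one roughly halves that part of the bill, which is why the techniques in this lesson target VM price, run time, and worker caps.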

2. Why This Concept Exists

While the serverless model eliminates idle cluster waste, running complex pipelines at scale can quickly become expensive if left unmanaged:

Idle Resource Costs: Leaving streaming pipelines running with high worker counts during periods of low activity.
High Compute Rates: Paying full price for standard Compute Engine virtual machines when execution times are flexible.
Network Egress Fees: Copying gigabytes of data across different regions, which incurs significant bandwidth fees.

By applying cost optimization techniques, teams can reduce their Dataflow bill by roughly 60-80% in favorable cases while still meeting their data processing SLAs.

3. Key Terminology

Spot VM / Preemptible VM: Compute Engine instances available at a steep discount (60-91% off regular pricing) that GCP can reclaim (preempt) at any time if it needs the capacity back.
FlexRS (Flexible Resource Scheduling): A Dataflow scheduling model for batch jobs that runs on a mix of preemptible and regular VMs. Submitted jobs are queued and start within a 6-hour window, when capacity is available, in exchange for a lower price.
Network Egress: The cost of moving data out of a Google Cloud datacenter or across regional boundaries.
Dataflow Prime: A version of Dataflow that optimizes resource utilization dynamically using features like vertical autoscaling and right-sized VMs.

4. How It Works

Several strategies can be applied to optimize Dataflow execution costs:

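Leveraging Spot VMs: Requesting Spot VMs (in the Beam SDKs, through the Dataflow service option --dataflow_service_options=enable_preemption) runs workers on discounted VM instances. If GCP reclaims a Spot worker, Dataflow provisions a replacement VM and reassigns the lost worker's pending work, so the job continues without manual intervention.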
Deferred Batching (FlexRS): Using FlexRS allows Dataflow to delay job start times to when GCP has excess compute capacity. The service automatically mixes cheap Spot VMs and stable regular VMs to execute the job cost-effectively.
Regional Alignment: Placing the GCS staging/temp buckets, input sources, and Dataflow workers in the same region (e.g. us-central1). This keeps network traffic inside the region, avoiding cross-region egress charges.
Streaming Engine Offloading: Enabling Streaming Engine moves state storage and shuffle processing off the worker VMs and into the Dataflow service backend, so workers can use smaller machine types and smaller attached disks, which lowers VM cost.
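
As a rough sketch, the strategies above map onto Dataflow launch flags as shown below. The helper build_cost_flags and its strategy names are hypothetical. The flag spellings follow the Beam Python SDK's pipeline options as commonly documented (Spot VMs through the enable_preemption service option, FlexRS through --flexrs_goal, Streaming Engine through --enable_streaming_engine) and should be verified against the SDK version in use.

```python
from typing import List

# One entry per strategy; each is applied on its own in this sketch.
STRATEGY_FLAGS = {
    "spot": ["--dataflow_service_options=enable_preemption"],
    "flexrs": ["--flexrs_goal=COST_OPTIMIZED"],          # batch jobs only
    "streaming_engine": ["--streaming", "--enable_streaming_engine"],
}


def build_cost_flags(strategy: str, region: str, bucket: str) -> List[str]:
    """Return command-line flags for a cost-conscious Dataflow launch.

    The bucket is assumed to live in the same region as the workers so
    that staging and temp traffic stays in-region.
    """
    if strategy not in STRATEGY_FLAGS:
        raise ValueError(f"unknown strategy: {strategy!r}")
    flags = [
        "--runner=DataflowRunner",
        f"--region={region}",
        f"--staging_location=gs://{bucket}/staging",
        f"--temp_location=gs://{bucket}/temp",
    ]
    return flags + STRATEGY_FLAGS[strategy]


flags = build_cost_flags("flexrs", "us-central1", "my-us-central-bucket")
for flag in flags:
    print(flag)
# With apache_beam installed, the list could be passed to PipelineOptions(flags).
```

Keeping the flags in a plain list makes the configuration easy to log and review before a job is submitted.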

5. Visual Diagram

Standard Run (High Cost): Standard Compute Engine VMs are active for 100% of the job duration. Standard workers ($$$) + full egress traffic.
FlexRS Scheduler (Low Cost): Draws on Spot VM pools dynamically. 90% Spot workers ($) + in-region local storage.

6. Code Example

The following script demonstrates how to configure pipeline options to utilize Spot VMs, FlexRS scheduling, and regional alignment to minimize execution costs:

python

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions

def run_cost_optimized_batch_job():
    options = PipelineOptions()
    
    # 1. Access Typed Options
    std_opts = options.view_as(StandardOptions)
    gcp_opts = options.view_as(GoogleCloudOptions)
    worker_opts = options.view_as(WorkerOptions)
    
    # 2. Configure Runner and Region
    std_opts.runner = "DataflowRunner"
    gcp_opts.project = "my-gcp-project"
    
    # Align region: Workers run in us-central1
    gcp_opts.region = "us-central1"
    
    # Staging/Temp buckets must reside in the same region (us-central1)
    gcp_opts.staging_location = "gs://my-us-central-bucket/staging"
    gcp_opts.temp_location = "gs://my-us-central-bucket/temp"
    
    # 3. Enable FlexRS Cost Optimized scheduling
    gcp_opts.flexrs_goal = "COST_OPTIMIZED"
    
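    # 4. Request Spot (preemptible) VMs through the Dataflow service option.
    #    WorkerOptions does not define a `preemptible` option; verify this
    #    service option against the SDK version you run.
    gcp_opts.dataflow_service_options = ["enable_preemption"]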
    
    # 5. Cap workers to control budget ceilings
    worker_opts.max_num_workers = 10
    
    # 6. Execute Pipeline
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadData" >> beam.io.ReadFromText("gs://my-us-central-bucket/inputs/*.json")
         | "Transform" >> beam.Map(lambda x: x.upper())
         | "WriteData" >> beam.io.WriteToText("gs://my-us-central-bucket/outputs/processed")
        )

if __name__ == "__main__":
    run_cost_optimized_batch_job()

7. Code Explanation

gcp_opts.region = "us-central1" paired with GCS buckets on gs://my-us-central-bucket/... ensures all traffic remains within the regional boundary, eliminating inter-region network egress costs.
gcp_opts.flexrs_goal = "COST_OPTIMIZED" tells the Dataflow service to queue the job and execute it using a cheaper combination of Spot and standard VMs.
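gcp_opts.dataflow_service_options = ["enable_preemption"] asks the Dataflow service to provision Spot VMs, saving up to 80% on compute billing. Beam's WorkerOptions does not define a preemptible option, so Spot VMs are requested through this service option rather than through worker_opts.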
worker_opts.max_num_workers = 10 limits the scale of the worker pool, preventing unexpected cost overruns.

8. Real Production Example

A travel booking platform runs a nightly pipeline to aggregate hotel availability logs. Because the reports only need to be ready by 8:00 AM, the job is not time-sensitive. The team configures the pipeline to use FlexRS and Spot VMs. The job launches at midnight, waits in the queue for 30 minutes until cheaper capacity is available, and completes by 2:00 AM. This optimization cuts their computing bill from $800 a month to $180.
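
The scheduling risk in a case like this can be checked with simple date arithmetic. The sketch below assumes the 6-hour FlexRS window from the terminology section as the worst-case delay and infers a run time of about 1.5 hours from the timeline above (30 minutes queued, finished by 2:00 AM). The calendar date is arbitrary and the variable names are illustrative.

```python
from datetime import datetime, timedelta

submitted = datetime(2024, 1, 15, 0, 0)      # job submitted at midnight
deadline = datetime(2024, 1, 15, 8, 0)       # reports due at 8:00 AM
max_flexrs_delay = timedelta(hours=6)        # assumed worst-case queue time
run_time = timedelta(hours=1, minutes=30)    # inferred from the example

# Latest possible finish if the job waits the full window before starting.
worst_case_finish = submitted + max_flexrs_delay + run_time
slack = deadline - worst_case_finish

# Savings implied by the quoted monthly bills.
monthly_before, monthly_after = 800, 180
savings_pct = (monthly_before - monthly_after) / monthly_before * 100

print(worst_case_finish.strftime("%H:%M"), slack, f"{savings_pct:.1f}%")
```

Under these assumptions the job would still finish at 7:30 AM in the worst case, leaving 30 minutes of slack, and the quoted bill falls by 77.5%. A job whose run time is longer or less predictable would need to be submitted earlier.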

9. Common Mistakes

Using Spot VMs for Latency-Sensitive Streaming: Enabling Spot VMs for a streaming pipeline that requires sub-second event latency. Because GCP can reclaim Spot VMs at any time, a mass-preemption event causes pipeline lag while replacement workers boot, violating strict SLAs.
Mismatching Resource Regions: Specifying --region=us-east1 for Dataflow workers while reading from a GCS bucket located in europe-west1. Every byte read crosses the Atlantic, so a pipeline that moves terabytes of data incurs heavy network egress charges.
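
Both mistakes can be caught before launch with a small check. The following is a minimal sketch: lint_job_config and its parameters are hypothetical, and the bucket locations are supplied by the caller rather than looked up from Cloud Storage. It only compares single-region names, so multi-region buckets would need extra handling.

```python
from typing import Dict, List


def lint_job_config(streaming: bool, use_spot: bool, worker_region: str,
                    bucket_regions: Dict[str, str]) -> List[str]:
    """Return warnings for the two mistakes described above.

    bucket_regions maps bucket name -> location, provided by the caller.
    """
    warnings = []
    if streaming and use_spot:
        warnings.append(
            "Spot VMs on a streaming job: preemptions can add latency.")
    for bucket, location in sorted(bucket_regions.items()):
        # Compare case-insensitively, since locations may be reported
        # in upper case (e.g. "US-CENTRAL1").
        if location.lower() != worker_region.lower():
            warnings.append(
                f"Bucket {bucket} is in {location}, workers in "
                f"{worker_region}: cross-region egress likely.")
    return warnings


problems = lint_job_config(
    streaming=True, use_spot=True, worker_region="us-east1",
    bucket_regions={"raw-logs": "europe-west1", "tmp": "us-east1"})
for line in problems:
    print(line)
```

A check like this fits naturally into a deployment script, where it can block a launch or simply log the warnings.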

10. Interview Perspective

Question: What is the difference between classic Preemptible VMs and Dataflow FlexRS?
Answer: Standard Preemptible VMs can be requested on any Dataflow job, but they are subject to sudden preemption, which can disrupt execution if too many VMs are lost at once. FlexRS is a dedicated scheduling service that manages this risk. It delays execution until capacity is available and provisions a mix of standard and preemptible VMs, ensuring the job completes successfully despite preemptions.
Question: Why is setting use_public_ips = False considered a cost optimization technique?
Answer: Google Cloud charges a small hourly fee for every active public IPv4 address assigned to a VM instance. For pipelines scaling to 100+ workers, disabling public IPs saves significant money, in addition to improving network security.
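
The public-IP argument is easy to quantify. The sketch below multiplies worker count, hours, and a per-address hourly rate; the rate is a placeholder rather than a quoted price, and public_ip_cost is a hypothetical helper. Workers without public IPs generally need Private Google Access enabled on their subnet to reach Google APIs.

```python
def public_ip_cost(workers: int, hours: float, rate_per_ip_hour: float) -> float:
    """Cost of keeping one public IPv4 address attached to each worker."""
    return workers * hours * rate_per_ip_hour


# 100 workers running around the clock for a 30-day month.
# The rate below is a placeholder, not a published price.
monthly = public_ip_cost(workers=100, hours=24 * 30, rate_per_ip_hour=0.005)
print(f"{monthly:.2f}")
```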

11. Best Practices

Use FlexRS (COST_OPTIMIZED) for batch pipelines that have flexible completion deadlines (e.g. daily or weekly audits).
Enable Spot VMs for batch processing to save compute costs; in the Beam SDKs this is requested with the enable_preemption Dataflow service option.
Always store input, output, staging, and temp data in the same geographic region where your Dataflow workers execute.
Set a conservative max_num_workers to prevent run-away costs from buggy code or unexpected data spikes.
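
One way to pick a conservative max_num_workers is to work backwards from a per-run budget. The sketch below is illustrative only: max_workers_for_budget is a hypothetical helper, and the hourly worker cost and expected duration are inputs you would estimate yourself.

```python
import math


def max_workers_for_budget(budget_usd: float, worker_hourly_usd: float,
                           expected_hours: float) -> int:
    """Largest worker count whose worst-case spend stays within budget.

    Returns at least 1, because a job cannot run with zero workers; in
    that case the budget itself is too small for the run.
    """
    if worker_hourly_usd <= 0 or expected_hours <= 0:
        raise ValueError("rate and duration must be positive")
    affordable = math.floor(budget_usd / (worker_hourly_usd * expected_hours))
    return max(1, affordable)


cap = max_workers_for_budget(budget_usd=50, worker_hourly_usd=0.25,
                             expected_hours=4)
print(cap)
```

The resulting value would then be assigned to max_num_workers, so that even a runaway autoscaling event cannot exceed the budgeted spend for the expected duration.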

12. Summary

Dataflow billing is based on the compute (vCPU and memory), storage, shuffle, and Streaming Engine resources a job consumes; cross-region network traffic adds egress fees on top.
FlexRS defers execution to run batch jobs when compute capacity is cheaper.
Spot/Preemptible VMs offer up to 80% cost savings for batch workloads.
Aligning regions prevents expensive cross-region network egress fees.

13. Interactive Challenges

Challenge 1: Configure Preemptible Workers (Beginner)

Write a code snippet that configures PipelineOptions to run a job on preemptible (Spot) VM instances.

Challenge 2: Apply FlexRS Scheduling Goal (Intermediate)

Configure PipelineOptions programmatically to run a Dataflow job using FlexRS under the "COST_OPTIMIZED" goal.

Challenge 3: Cost-Optimized Options Builder (Advanced)

Write a Python function get_cheapest_batch_options(project, region, bucket) that returns a PipelineOptions object. It must configure:

Runner: "DataflowRunner"
Staging and Temp paths using the provided bucket (formatted as "gs://{bucket}/staging", etc.)
Preemptible workers enabled
FlexRS goal set to "COST_OPTIMIZED"
Maximum worker count limited to 6
Public IPs disabled
