Worker Configuration — Learn Apache Beam

1. Introduction

Worker Configuration involves setting options that govern the virtual machines provisioned by the Dataflow service. Developers can customize the VM hardware specifications (machine types, disk sizes), network topologies (VPCs, subnetworks, public/private IP configurations), and security contexts (IAM service accounts) under which the workers run.

2. Why This Concept Exists

By default, Dataflow runs workers using standard VM specifications (e.g. n1-standard-1 machine type) and attempts to deploy them in your default VPC network using public IP addresses. However, default settings are often unacceptable for enterprise pipelines:

Security Restrictions: Corporate security policies typically forbid VMs from having public IP addresses to protect against exposure.
Resource Demands: Aggregating huge windows or parsing complex file types can consume more memory than a standard VM has, leading to Out Of Memory (OOM) crashes.
Least Privilege: The default Compute Engine service account has broad editor access to the entire GCP project, which violates security best practices.

Configuring worker options ensures your pipelines run within secure network boundaries, leverage appropriate hardware, and respect security compliance policies.

3. Key Terminology

Machine Type: The Compute Engine hardware profile (CPU, RAM) allocated to workers (e.g., e2-standard-4, n2-highmem-8).
Service Account Email: The identity associated with the worker VMs. The worker uses this account's permissions to access sources and sinks (e.g. GCS, BigQuery).
VPC Subnetwork: The specific virtual network subnet in which workers are spawned.
Use Public IPs: A setting that determines whether worker VMs are assigned public IPv4 addresses.

4. How It Works

When you submit a job configuration containing worker options:

API Parsing: The Dataflow API parses the custom properties (e.g., network tags, subnet path, VM size).
Compute Engine Call: The Dataflow resource manager calls the Compute Engine API to spin up the worker pool.
Network Placement: The VMs are attached to the specified VPC and Subnetwork.
Service Identity: The Compute Engine metadata server assigns the specified custom service account to the VMs.
Execution Isolation: If public IPs are disabled, workers communicate with Google Cloud APIs using private routes rather than the public internet.

5. Visual Diagram

🔒 GCP Private Subnet Architecture

• Machine Type: n2-highmem-4 (32GB RAM) for large keys

• Network: Private IPs only (No Public internet access)

• Service Account: df-runner@project.iam.gserviceaccount.com

6. Code Example

The following code demonstrates configuring secure enterprise worker options programmatically:

python

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions

def run_enterprise_worker_pipeline():
    options = PipelineOptions()
    
    # 1. Standard/GCP Configurations
    std_opts = options.view_as(StandardOptions)
    std_opts.runner = "DataflowRunner"
    
    gcp_opts = options.view_as(GoogleCloudOptions)
    gcp_opts.project = "my-secure-project"
    gcp_opts.region = "us-east1"
    
    # 2. Configure Worker Options
    worker_opts = options.view_as(WorkerOptions)
    
    # Use memory-optimized machine type for state-heavy jobs
    worker_opts.machine_type = "n2-highmem-4"
    
    # Set worker disk sizes to 100GB to handle local buffer writes
    worker_opts.disk_size_gb = 100
    
    # Turn off public IP addresses for security compliance
    worker_opts.use_public_ips = False
    
    # Deploy workers into a specific private subnetwork
    # Format: regions/<region>/subnetworks/<subnet-name>
    worker_opts.subnetwork = "regions/us-east1/subnetworks/private-secure-subnet"
    
    # Assign custom service account with minimal IAM roles
    gcp_opts.service_account_email = "dataflow-execution-sa@my-secure-project.iam.gserviceaccount.com"
    
    # 3. Submit Pipeline
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.Create(["Data Securely Staged"])
         | "Log" >> beam.Map(print)
        )

if __name__ == "__main__":
    run_enterprise_worker_pipeline()

7. Code Explanation

worker_opts.machine_type = "n2-highmem-4" overrides the default VM, provisioning 4 virtual CPUs and 32 GB of RAM per worker, which is ideal for memory-heavy aggregations.
worker_opts.disk_size_gb = 100 increases root disk space to buffer local libraries and shuffle states.
worker_opts.use_public_ips = False removes external IP addresses from the worker nodes.
worker_opts.subnetwork places the VM instances inside a specific private VPC subnetwork.
gcp_opts.service_account_email specifies the identity of the VM. Dataflow uses this account to download files and write outputs, replacing the broad default Compute Engine account.

8. Real Production Example

A healthcare company processes patient record updates. Because HIPAA rules forbid exposing health data systems to the public internet, they configure their Dataflow job with use_public_ips = False and specify their private subnet. To access Google services, they enable Private Google Access on their subnet, allowing workers to write anonymized records to BigQuery using secure internal Google network paths.

9. Common Mistakes

Forgetting Private Google Access: Disabling public IPs (use_public_ips = False) on a subnet that does not have Private Google Access enabled. Without public IPs or Private Google Access, workers cannot communicate with the Dataflow API or download pipeline dependencies, causing the job startup to time out and fail.
Incorrect Subnetwork Format: Specifying only the subnet name (e.g. my-subnet) instead of the fully qualified path (regions/us-central1/subnetworks/my-subnet). This causes resource provisioning errors during launch.

10. Interview Perspective

Question: If worker public IPs are disabled, how do they download Python packages listed in requirements.txt?
Answer: The workers cannot reach the public PyPI repository over the internet. You have two options: 1) configure a Cloud NAT gateway on the VPC subnet to allow outgoing internet traffic, or 2) pre-package all dependencies into a custom container using Flex Templates, eliminating the need to download packages at runtime.
Question: Which GCP service account role is required for custom Dataflow service accounts?
Answer: The service account needs the Dataflow Worker role (roles/dataflow.worker), which grants permissions to communicate with the Dataflow service, alongside standard read/write roles for GCS (roles/storage.objectAdmin) or BigQuery (roles/bigquery.dataEditor).

11. Best Practices

Never use the default Compute Engine service account in production; create a dedicated service account and apply the principle of least privilege.
Set use_public_ips = False for all production pipelines to eliminate external entry points.
If your pipeline performs memory-heavy operations (e.g., custom joins or large state windows), use high-memory machine types (like n2-highmem-4) to prevent OOM errors.

12. Summary

Worker configurations specify machine resources, networking, and security.
Production jobs should use use_public_ips = False for security compliance.
Subnetworks must have Private Google Access enabled if public IPs are disabled.
Assign custom service accounts with narrow IAM roles instead of defaults.

13. Interactive Challenges

Challenge 1: Disable Worker Public IPs (Beginner)

Write a code snippet that configures PipelineOptions to disable the use of public IPs on worker instances.

Challenge 2: Apply Custom Service Account (Intermediate)

Configure PipelineOptions to use a custom service account email "pipelines-sa@company.iam.gserviceaccount.com".

Challenge 3: Enterprise Options Configurator (Advanced)

Write a Python function build_enterprise_worker_options(region, subnet_name) that returns a configured PipelineOptions object. It must configure:

runner set to "DataflowRunner"
use_public_ips set to False
subnetwork matching the provided region and subnet name format
machine_type set to "e2-highmem-8"
disk_size_gb set to 250

14. Related Content

Previous LessonAutoscaling

Next LessonBest Practices