Introduction to Dataflow — Learn Apache Beam

1. Introduction

Google Cloud Dataflow is a fully managed, serverless execution service for running Apache Beam pipelines. By offloading cluster management, provisioning, scaling, and execution optimization to Google Cloud, Dataflow allows developers to focus purely on writing data processing logic using the Apache Beam SDK.

2. Why This Concept Exists

Before serverless runner options existed, executing distributed data processing pipelines (such as Apache Spark or Apache Flink) required teams to manually build, configure, scale, and maintain compute clusters. This involved:

Infrastructure Overhead: Spending time managing VM OS patches, networking, and dependencies.
Inefficient Scaling: Over-provisioning clusters to handle peak loads, which resulted in significant waste, or under-provisioning, which led to delayed pipelines.
Complex Tuning: Manually adjusting cluster partitions and sharding strategies.

Dataflow exists to automate these operational tasks. It acts as a managed execution engine that instantiates, coordinates, and scales worker VMs dynamically to run Apache Beam jobs, shutting down resources once batch jobs finish.

3. Key Terminology

Dataflow Service: The managed cloud service that accepts Apache Beam job graphs, optimizes them, and orchestrates execution.
Dataflow Worker: A Compute Engine virtual machine instance provisioned by the Dataflow service to execute pipeline steps.
Job Graph: A directed acyclic graph (DAG) representation of the pipeline transforms that is sent to the Dataflow service for execution.
Staging Location: A Google Cloud Storage (GCS) path where the SDK uploads files, dependencies, and packages required by workers.
Temp Location: A GCS path used for holding temporary artifacts generated during pipeline execution.

4. How It Works

When you execute an Apache Beam pipeline using the Dataflow Runner:

Local Compilation: The Apache Beam SDK constructs a JSON/Proto representation of the pipeline's execution graph.
Staging: The SDK packages dependencies and uploads them to the designated GCS staging bucket.
Submission: The SDK makes an API call to the Dataflow service, transmitting the job graph and configuration parameters.
Worker Provisioning: The Dataflow service reads the graph, requests VM workers from Compute Engine in the specified region, and deploys the runner code on them.
Execution: Workers pull dependencies from GCS, read input data from sources, run the transforms in parallel, and write outputs to sinks.
Tear Down: The VMs are deleted automatically when execution completes.

5. Visual Diagram

Local Python SDK
Submits Job Graph DAG

GCS Staging Bucket
Stages libraries & files

▼ (Triggers Dataflow Service API)

Compute Engine VMs
Spin up stateless worker nodes

▼ (Executes graph pipeline)

BigQuery / Cloud Storage
Target output sinks

6. Code Example

The following code demonstrates a basic pipeline configured to run on Google Cloud Dataflow:

python

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions

def run_basic_dataflow():
    # 1. Instantiate Pipeline Options
    options = PipelineOptions()
    
    # 2. Set GCP Specific Settings
    gcp_options = options.view_as(GoogleCloudOptions)
    gcp_options.project = "my-gcp-project"
    gcp_options.region = "us-central1"
    gcp_options.job_name = "intro-dataflow-job"
    gcp_options.staging_location = "gs://my-dataflow-bucket/staging"
    gcp_options.temp_location = "gs://my-dataflow-bucket/temp"
    
    # 3. Configure the standard runner to use Dataflow
    options.view_as(StandardOptions).runner = "DataflowRunner"
    
    # 4. Define pipeline steps
    with beam.Pipeline(options=options) as p:
        (p
         | "CreateData" >> beam.Create(["Apache Beam", "Google Cloud Dataflow", "Serverless"])
         | "ComputeLength" >> beam.Map(lambda x: (x, len(x)))
         | "WriteToGCS" >> beam.io.WriteToText("gs://my-dataflow-bucket/output/lengths", file_name_suffix=".txt")
        )

if __name__ == "__main__":
    run_basic_dataflow()

7. Code Explanation

PipelineOptions aggregates all configurations needed by the SDK and the runner.
options.view_as(GoogleCloudOptions) enables access to GCP-specific settings. project and region are mandatory parameters specifying where the job runs.
staging_location and temp_location specify GCS paths where files are temporarily saved.
options.view_as(StandardOptions).runner = "DataflowRunner" requests execution on GCP rather than locally.
WriteToGCS writes output files directly to GCS. The worker VMs execute the write transform in parallel.

8. Real Production Example

A financial services firm receives end-of-day transaction logs in an S3-compatible cloud storage bucket. They use a daily Dataflow batch pipeline to parse the transactions, filter out fraudulent attempts, and output the clean transactions to a data warehouse. By using the Dataflow Runner, they do not pay for idle servers during the day; the cluster launches, processes millions of transactions within 15 minutes, and shuts down instantly.

9. Common Mistakes

Forgetting to Set a Region: Omitting the --region flag. Dataflow requires you to specify a region (e.g., us-central1) to determine where to allocate worker VMs.
Using Local File Paths: Specifying local file paths (like options.staging_location = "/tmp/staging") when using DataflowRunner. Workers cannot access your local file system; you must use Cloud Storage paths (gs://...).
Insufficient Permissions: Running with a service account that lacks standard Dataflow Admin and GCS object writer roles, causing staging failures.

10. Interview Perspective

Question: Why is Dataflow described as "serverless" when it clearly spins up Compute Engine VMs?
Answer: It is serverless from the developer's operational perspective. The user does not manage, configure, or patch the server operating systems, nor do they manually trigger scaling actions. The service takes care of VM provisioning, orchestration, updates, scaling, and deletion automatically.
Question: Can I run a pipeline on Dataflow using local data sources?
Answer: No. Because Dataflow workers run on remote VM instances inside GCP, they cannot access local files. Data sources must be hosted in cloud-accessible services, such as Google Cloud Storage, Pub/Sub, or BigQuery.

11. Best Practices

Use Cloud Storage buckets in the same geographical region as your Dataflow job to minimize data transfer latency and cost.
Set unique, descriptive job names using alphanumeric characters and hyphens to identify jobs easily in the Cloud Console.
Clean up old files in your GCS temporary and staging locations by setting bucket lifecycle rules.

12. Summary

Dataflow is GCP's fully managed, serverless execution engine for Apache Beam.
It eliminates cluster setup, configuration, and scaling tasks.
Jobs require standard pipeline configurations including project, region, and GCS staging/temp paths.
Workers are automatically provisioned and torn down.

13. Interactive Challenges

Challenge 1: Declare Pipeline Options (Beginner)

Configure the minimal GCP pipeline options programmatically to run a job named "my-first-df-job" on the DataflowRunner inside project "my-gcp-project" and region "us-east1".

Challenge 2: Custom Machine Type Configuration (Intermediate)

Modify the pipeline options setup to configure the worker VMs to use the e2-standard-4 machine type on Dataflow.

Challenge 3: Multi-Environment Options Setup (Advanced)

Write a Python helper function get_options(env, job_name) that accepts an environment string ("prod" or "dev") and a job name. It should return a PipelineOptions object. For "prod", it must use the DataflowRunner, us-central1 region, and custom worker pool of n2-standard-8 machines. For "dev", it must use the DirectRunner.

14. Related Content

Previous LessonCustom IO

Next LessonDataflow Architecture