Google Cloud Storage — Learn Apache Beam

1. Introduction

Google Cloud Storage (GCS) is a RESTful object storage service. Apache Beam provides native support for GCS paths (gs://), enabling seamless reading and writing for pipelines running on Google Cloud Dataflow.

2. Why This Concept Exists

GCS is the default data lake storage layer for Google Cloud architectures. Pipelines running on Dataflow need secure, high-throughput access to GCS files. The GCS filesystem plugin integrates with GCP security (IAM), allowing workers to ingest and egress TBs of data efficiently.

3. Key Terminology

gs:// URI: The scheme prefix denoting Google Cloud Storage paths.
Dataflow Runner: Google's managed runner that executes Beam pipelines and provisions VM workers.
IAM Roles: Identity and Access Management policies that control read/write permissions.

4. How It Works

The Python environment requires apache-beam[gcp] installed.
During pipeline execution, Beam processes paths prefixing gs:// using GCS API libraries.
Workers query GCS endpoints, stream blocks of data in parallel, and write sharded files directly back to the target GCS bucket.

5. Visual Diagram

Dataflow Worker VM

➔ (Parallel gs:// API)

GCS Buckets
input-data/ & output-shards/

6. Code Example

Ingesting data from GCS and writing processed results back to GCS:

python

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "ReadGCS" >> beam.io.ReadFromText("gs://my-app-bucket/raw-logs/*.json")
     | "FilterStatus" >> beam.Filter(lambda x: "ERROR" in x)
     | "WriteGCS" >> beam.io.WriteToText("gs://my-app-bucket/errors/error-log", file_name_suffix=".txt")
    )

7. Code Explanation

gs://my-app-bucket/raw-logs/*.json resolves all JSON blobs inside the path.
The worker reads shards in parallel, filters the lines, and writes sharded files back to GCS.

8. Real Production Example

Using GCS options to define temporary and staging directories in Dataflow:

python

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions

options = PipelineOptions()
gcp_options = options.view_as(GoogleCloudOptions)
gcp_options.project = "my-gcp-project"
gcp_options.region = "us-central1"
gcp_options.job_name = "gcs-ingestion-job"
gcp_options.staging_location = "gs://my-app-bucket/staging"
gcp_options.temp_location = "gs://my-app-bucket/temp"

p = beam.Pipeline(options=options)
# Define pipeline steps here...

9. Common Mistakes

Staging directory omissions: Running a Google Cloud Dataflow pipeline without setting staging_location and temp_location to valid GCS paths.
IAM configuration errors: The runner VM service account missing Storage Object Viewer or Storage Object Creator roles.

10. Interview Perspective

Question: What happens to staging files stored on GCS when a Dataflow job finishes?
Answer: Staging jars and metadata files remain in GCS to speed up future pipeline submissions. You should set up bucket lifecycle policies to clean up temporary storage.
Question: How does Beam divide a single large GCS file?
Answer: Beam queries GCS for the object metadata (HTTP GET Range) to download and read chunks sequentially or in parallel.

11. Best Practices

Define a GCS lifecycle policy on the temp folder to prevent ongoing storage charges.
Keep files in regional storage buckets close to the Dataflow runner region.

12. Summary

GCS paths use the gs:// protocol.
Requires the apache-beam[gcp] installation package.
Staging and temp locations are mandatory for Dataflow execution.

13. Interactive Challenges

Challenge 1: GCS Path Wildcard (Beginner)

Write an Apache Beam reader statement that matches all text files inside the GCS bucket "company-metrics" under the folder "2026".

Challenge 2: Staging Location Config (Intermediate)

Configure GoogleCloudOptions properties for a project named "my-gcp" and set staging path to "gs://my-bucket/stage".

Challenge 3: GCS Custom Writer (Advanced)

Write a write step that outputs string elements to "gs://output-bucket/results/report" ensuring the file extension is ".csv".

14. Related Content

Previous LessonCloud Storage

Next LessonAmazon S3