beginner
Google Cloud Storage
6 min readLast updated: 2026-07-01
1. Introduction
Google Cloud Storage (GCS) is a RESTful object storage service. Apache Beam provides native support for GCS paths (gs://), enabling seamless reading and writing for pipelines running on Google Cloud Dataflow.
2. Why This Concept Exists
GCS is the default data lake storage layer for Google Cloud architectures. Pipelines running on Dataflow need secure, high-throughput access to GCS files. The GCS filesystem plugin integrates with GCP security (IAM), allowing workers to ingest and egress TBs of data efficiently.
3. Key Terminology
- gs:// URI: The scheme prefix denoting Google Cloud Storage paths.
- Dataflow Runner: Google's managed runner that executes Beam pipelines and provisions VM workers.
- IAM Roles: Identity and Access Management policies that control read/write permissions.
4. How It Works
- The Python environment requires
apache-beam[gcp]installed. - During pipeline execution, Beam processes paths prefixing
gs://using GCS API libraries. - Workers query GCS endpoints, stream blocks of data in parallel, and write sharded files directly back to the target GCS bucket.
5. Visual Diagram
Dataflow Worker VM
➔ (Parallel gs:// API)GCS Buckets
input-data/ & output-shards/
6. Code Example
Ingesting data from GCS and writing processed results back to GCS:
python
import apache_beam as beam
with beam.Pipeline() as p:
(p
| "ReadGCS" >> beam.io.ReadFromText("gs://my-app-bucket/raw-logs/*.json")
| "FilterStatus" >> beam.Filter(lambda x: "ERROR" in x)
| "WriteGCS" >> beam.io.WriteToText("gs://my-app-bucket/errors/error-log", file_name_suffix=".txt")
)
7. Code Explanation
gs://my-app-bucket/raw-logs/*.jsonresolves all JSON blobs inside the path.- The worker reads shards in parallel, filters the lines, and writes sharded files back to GCS.
8. Real Production Example
Using GCS options to define temporary and staging directories in Dataflow:
python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions
options = PipelineOptions()
gcp_options = options.view_as(GoogleCloudOptions)
gcp_options.project = "my-gcp-project"
gcp_options.region = "us-central1"
gcp_options.job_name = "gcs-ingestion-job"
gcp_options.staging_location = "gs://my-app-bucket/staging"
gcp_options.temp_location = "gs://my-app-bucket/temp"
p = beam.Pipeline(options=options)
# Define pipeline steps here...
9. Common Mistakes
- Staging directory omissions: Running a Google Cloud Dataflow pipeline without setting
staging_locationandtemp_locationto valid GCS paths. - IAM configuration errors: The runner VM service account missing
Storage Object ViewerorStorage Object Creatorroles.
10. Interview Perspective
- Question: What happens to staging files stored on GCS when a Dataflow job finishes?
- Answer: Staging jars and metadata files remain in GCS to speed up future pipeline submissions. You should set up bucket lifecycle policies to clean up temporary storage.
- Question: How does Beam divide a single large GCS file?
- Answer: Beam queries GCS for the object metadata (HTTP GET Range) to download and read chunks sequentially or in parallel.
11. Best Practices
- Define a GCS lifecycle policy on the temp folder to prevent ongoing storage charges.
- Keep files in regional storage buckets close to the Dataflow runner region.
12. Summary
- GCS paths use the
gs://protocol. - Requires the
apache-beam[gcp]installation package. - Staging and temp locations are mandatory for Dataflow execution.
13. Interactive Challenges
14. Related Content
Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)