beginner
Read Transform
7 min readLast updated: 2026-07-01
1. Introduction
A Read Transform is the gateway to loading external files, database tables, or stream events into an Apache Beam pipeline. Rather than loading data into memory entirely at once, read transforms stream or partition massive datasets in parallel.
2. Why This Concept Exists
To process terabytes of data, you cannot use a basic Python open('file.txt') which is limited to a single CPU thread. Beam's read transforms are designed to understand file partitioning and database indexing, enabling a distributed cluster of worker nodes to read independent portions (shards) of the source data concurrently.
3. Key Terminology
- Source Adapter / I/O Connector: A class that handles communication with a specific external storage provider (like
beam.io.ReadFromText). - Splittable Source: A data source that can be divided into smaller chunks (offsets, ranges) to be read independently.
- File Pattern: A string with wildcards (e.g.
gs://bucket/data-*.csv) indicating multiple files.
4. How It Works
- The driver node analyzes the data source (e.g., checks files matching a pattern).
- It splits the workload into independent tasks (splits).
- Workers execute these splits in parallel, parsing records into a distributed
PCollection.
5. Visual Diagram
File.txt (Source)
➔ (Split)Worker A: Lines 0 - 1000
Worker B: Lines 1001 - 2000
Worker C: Lines 2001 - 3000
Consolidated PCollection
6. Code Example
Reading a text file from a local path:
python
import apache_beam as beam
with beam.Pipeline() as p:
lines = p | "ReadLogs" >> beam.io.ReadFromText("sample.log")
lines | "Print" >> beam.Map(print)
7. Code Explanation
beam.io.ReadFromText("sample.log")opens the specified file.- It automatically splits the file by line breaks.
- Each line becomes a distinct string element in the resulting
PCollection.
8. Real Production Example
Reading compressed logs from a Google Cloud Storage bucket path:
python
import apache_beam as beam
from apache_beam.io import ReadFromText
with beam.Pipeline() as p:
(p
| "ReadGCSLogs" >> ReadFromText("gs://my-bucket/logs/*.gz", compression_type="gzip")
| "FilterErrors" >> beam.Filter(lambda line: "ERROR" in line)
| "WriteOutput" >> beam.io.WriteToText("gs://my-bucket/alerts/errors"))
9. Common Mistakes
- Hardcoding Local Paths for Cloud Runners: If you run on a cloud runner, pointing to a local path (like
C:\data.csv) will fail because the worker nodes in the cloud cannot access your local drive. - Incompatible File Encoding: Assuming all files are UTF-8. Non-UTF-8 characters can trigger parsing crashes unless alternative encoding settings are specified.
10. Interview Perspective
- Question: How does Beam divide a single large file among multiple workers?
- Answer: Beam uses file system offsets. For block-compressed formats or raw text files, workers can seek to specific byte offsets (e.g., from byte 1,000,000 to 2,000,000) and identify the start of the next full line.
- Question: Does
ReadFromTextread files sequentially? - Answer: No. If there are multiple files or the file is large, Beam processes separate blocks concurrently across all available workers.
11. Best Practices
- Use standard glob patterns (e.g.
*.csv) to process multiple files in parallel. - Ensure that file types are split-friendly (like CSV, JSON Lines, Parquet, Avro) rather than large un-splittable formats (like standard zip files).
12. Summary
- Read transforms stream external data sources into
PCollections. - They support splitting workloads for parallel speed.
- Common I/O connectors include file, database, and message brokers.
13. Interactive Challenges
14. Related Content
Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)