Read Transform — Learn Apache Beam

1. Introduction

A Read Transform is the gateway to loading external files, database tables, or stream events into an Apache Beam pipeline. Rather than loading data into memory entirely at once, read transforms stream or partition massive datasets in parallel.

2. Why This Concept Exists

To process terabytes of data, you cannot use a basic Python open('file.txt') which is limited to a single CPU thread. Beam's read transforms are designed to understand file partitioning and database indexing, enabling a distributed cluster of worker nodes to read independent portions (shards) of the source data concurrently.

3. Key Terminology

Source Adapter / I/O Connector: A class that handles communication with a specific external storage provider (like beam.io.ReadFromText).
Splittable Source: A data source that can be divided into smaller chunks (offsets, ranges) to be read independently.
File Pattern: A string with wildcards (e.g. gs://bucket/data-*.csv) indicating multiple files.

4. How It Works

The driver node analyzes the data source (e.g., checks files matching a pattern).
It splits the workload into independent tasks (splits).
Workers execute these splits in parallel, parsing records into a distributed PCollection.

5. Visual Diagram

File.txt (Source)

➔ (Split)

Worker A: Lines 0 - 1000

Worker B: Lines 1001 - 2000

Worker C: Lines 2001 - 3000

➔

Consolidated PCollection

6. Code Example

Reading a text file from a local path:

python

import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | "ReadLogs" >> beam.io.ReadFromText("sample.log")
    lines | "Print" >> beam.Map(print)

7. Code Explanation

beam.io.ReadFromText("sample.log") opens the specified file.
It automatically splits the file by line breaks.
Each line becomes a distinct string element in the resulting PCollection.

8. Real Production Example

Reading compressed logs from a Google Cloud Storage bucket path:

python

import apache_beam as beam
from apache_beam.io import ReadFromText

with beam.Pipeline() as p:
    (p
     | "ReadGCSLogs" >> ReadFromText("gs://my-bucket/logs/*.gz", compression_type="gzip")
     | "FilterErrors" >> beam.Filter(lambda line: "ERROR" in line)
     | "WriteOutput" >> beam.io.WriteToText("gs://my-bucket/alerts/errors"))

9. Common Mistakes

Hardcoding Local Paths for Cloud Runners: If you run on a cloud runner, pointing to a local path (like C:\data.csv) will fail because the worker nodes in the cloud cannot access your local drive.
Incompatible File Encoding: Assuming all files are UTF-8. Non-UTF-8 characters can trigger parsing crashes unless alternative encoding settings are specified.

10. Interview Perspective

Question: How does Beam divide a single large file among multiple workers?
Answer: Beam uses file system offsets. For block-compressed formats or raw text files, workers can seek to specific byte offsets (e.g., from byte 1,000,000 to 2,000,000) and identify the start of the next full line.
Question: Does ReadFromText read files sequentially?
Answer: No. If there are multiple files or the file is large, Beam processes separate blocks concurrently across all available workers.

11. Best Practices

Use standard glob patterns (e.g. *.csv) to process multiple files in parallel.
Ensure that file types are split-friendly (like CSV, JSON Lines, Parquet, Avro) rather than large un-splittable formats (like standard zip files).

12. Summary

Read transforms stream external data sources into PCollections.
They support splitting workloads for parallel speed.
Common I/O connectors include file, database, and message brokers.

13. Interactive Challenges

Challenge 1: Read Multiple CSV Files (Beginner)

Write a pipeline segment that reads data from all files matching the pattern sales-*.csv in the current directory and prints their contents.

Challenge 2: Custom Encoding (Intermediate)

Write a pipeline segment that reads a text file "legacy_data.txt" which is encoded using latin-1 rather than the default utf-8.

Challenge 3: Skip Header Lines (Advanced)

Write a pipeline segment that reads a CSV file "data.csv" and skips the first header row, producing a PCollection containing only data rows.

14. Related Content

Previous LessonCreate Transform

Next LessonWrite Transform