intermediate

TextIO

6 min readLast updated: 2026-06-30

1. Introduction

TextIO is the Apache Beam built-in connector used to read from and write to plain text files (like .txt, .csv, or .json logs) stored on local disks or cloud object stores (like Google Cloud Storage, AWS S3, or Azure Blob).

2. Why This Concept Exists

Data engineering pipelines must interface with external storage systems. Text files are the most common legacy and exchange format. TextIO provides parallelized, fault-tolerant readers and writers that partition massive files across worker nodes automatically.

3. Key Terminology

  • Source: A connector that reads data into the pipeline (e.g., ReadFromText).
  • Sink: A connector that writes data out of the pipeline (e.g. WriteToText).
  • Sharding: Splitting output data into multiple file chunks (shards) to allow workers to write in parallel.

4. How It Works

  • Read: beam.io.ReadFromText("path/*.txt") reads files line-by-line. Each line becomes a string element in the resulting PCollection. It supports glob patterns (wildcards) to read thousands of files at once.
  • Write: beam.io.WriteToText("output/path") takes string elements and writes them to disk.
  • By default, the writer automatically splits outputs into multiple files (e.g. path-00000-of-00003) to optimize throughput.

5. Visual Diagram

GCS Source Files
log-2026-01.txt | log-2026-02.txt

Parallel Workers
Process text lines

Output Shards
shard-000-of-002 | shard-001-of-002

6. Code Example

Reading transaction text files, stripping whitespace, and writing output:

python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "ReadLogs" >> beam.io.ReadFromText("data/input.txt", skip_header_lines=1)
     | "CleanLines" >> beam.Map(lambda line: line.strip())
     | "WriteClean" >> beam.io.WriteToText("data/output_clean", file_name_suffix=".txt"))

7. Code Explanation

  • beam.io.ReadFromText("data/input.txt", skip_header_lines=1) reads lines while ignoring the first line (useful for skipping CSV headers).
  • beam.io.WriteToText("data/output_clean", file_name_suffix=".txt") writes the cleaned lines. The resulting files will be named like data/output_clean-00000-of-00001.txt.

8. Real Production Example

When working on Google Cloud, you pass Cloud Storage paths directly:

python
lines = p | beam.io.ReadFromText("gs://my-company-bucket/raw-logs/*.csv")

Beam automatically handles credentials, streaming connections, and parallel worker splits under the hood.

9. Common Mistakes

  • Assuming single file outputs: Running WriteToText("output.txt") does not create a single file named output.txt. It generates sharded files like output.txt-00000-of-00005. To force a single file, set num_shards=1 (highly discouraged for large datasets as it creates worker bottlenecks).
  • Writing non-string elements: WriteToText expects a PCollection of strings. If you pass dictionaries or integers, it will throw a serialization crash. Convert elements to strings first using beam.Map(str).

10. Interview Perspective

  • Question: How does Beam divide a single large text file among multiple workers?
  • Answer: The runner splits files by byte ranges. Workers read separate byte offset chunks from the storage provider, parsing line breaks dynamically to ensure no records are split or read twice.
  • Question: How do you read compressed files?
  • Answer: TextIO handles compression automatically. If files end in .gz or .bz2, set compression_type="gzip" or let Beam auto-detect it.

11. Best Practices

  • Avoid setting num_shards=1 on production writers, as this forces all parallel workers to route their records to a single machine, crashing throughput.
  • Use glob patterns (*) to batch-read files from buckets.

12. Summary

  • ReadFromText and WriteToText handle flat files.
  • Supported on local disk and cloud stores (GCS/S3).
  • Outputs are sharded automatically to optimize write speeds.

13. Interactive Challenges

Challenge 1: Basic Text Reader (Beginner)

Write a pipeline segment that reads all text files inside the directory "data/events/" using wildcards.

Challenge 2: Single-Shard JSON Writer (Intermediate)

Write a transform segment that takes a PCollection of string records and writes them to a single output file path "output/json/summary" with a suffix of ".json".

Challenge 3: CSV Header Skipper (Advanced)

Write a complete pipeline segment inside a context block that reads a CSV file at "data/users.csv", skips the first 2 header lines, maps the remaining lines to strip whitespace, and writes the output back to "data/clean_users".

14. Related Content

Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)