TextIO
1. Introduction
TextIO is the Apache Beam built-in connector used to read from and write to plain text files (like .txt, .csv, or .json logs) stored on local disks or cloud object stores (like Google Cloud Storage, AWS S3, or Azure Blob).
2. Why This Concept Exists
Data engineering pipelines must interface with external storage systems. Text files are the most common legacy and exchange format. TextIO provides parallelized, fault-tolerant readers and writers that partition massive files across worker nodes automatically.
3. Key Terminology
- Source: A connector that reads data into the pipeline (e.g.,
ReadFromText). - Sink: A connector that writes data out of the pipeline (e.g.
WriteToText). - Sharding: Splitting output data into multiple file chunks (shards) to allow workers to write in parallel.
4. How It Works
- Read:
beam.io.ReadFromText("path/*.txt")reads files line-by-line. Each line becomes a string element in the resulting PCollection. It supports glob patterns (wildcards) to read thousands of files at once. - Write:
beam.io.WriteToText("output/path")takes string elements and writes them to disk. - By default, the writer automatically splits outputs into multiple files (e.g.
path-00000-of-00003) to optimize throughput.
5. Visual Diagram
GCS Source Files
log-2026-01.txt | log-2026-02.txt
Parallel Workers
Process text lines
Output Shards
shard-000-of-002 | shard-001-of-002
6. Code Example
Reading transaction text files, stripping whitespace, and writing output:
import apache_beam as beam
with beam.Pipeline() as p:
(p
| "ReadLogs" >> beam.io.ReadFromText("data/input.txt", skip_header_lines=1)
| "CleanLines" >> beam.Map(lambda line: line.strip())
| "WriteClean" >> beam.io.WriteToText("data/output_clean", file_name_suffix=".txt"))
7. Code Explanation
beam.io.ReadFromText("data/input.txt", skip_header_lines=1)reads lines while ignoring the first line (useful for skipping CSV headers).beam.io.WriteToText("data/output_clean", file_name_suffix=".txt")writes the cleaned lines. The resulting files will be named likedata/output_clean-00000-of-00001.txt.
8. Real Production Example
When working on Google Cloud, you pass Cloud Storage paths directly:
lines = p | beam.io.ReadFromText("gs://my-company-bucket/raw-logs/*.csv")
Beam automatically handles credentials, streaming connections, and parallel worker splits under the hood.
9. Common Mistakes
- Assuming single file outputs: Running
WriteToText("output.txt")does not create a single file namedoutput.txt. It generates sharded files likeoutput.txt-00000-of-00005. To force a single file, setnum_shards=1(highly discouraged for large datasets as it creates worker bottlenecks). - Writing non-string elements:
WriteToTextexpects a PCollection of strings. If you pass dictionaries or integers, it will throw a serialization crash. Convert elements to strings first usingbeam.Map(str).
10. Interview Perspective
- Question: How does Beam divide a single large text file among multiple workers?
- Answer: The runner splits files by byte ranges. Workers read separate byte offset chunks from the storage provider, parsing line breaks dynamically to ensure no records are split or read twice.
- Question: How do you read compressed files?
- Answer: TextIO handles compression automatically. If files end in
.gzor.bz2, setcompression_type="gzip"or let Beam auto-detect it.
11. Best Practices
- Avoid setting
num_shards=1on production writers, as this forces all parallel workers to route their records to a single machine, crashing throughput. - Use glob patterns (
*) to batch-read files from buckets.
12. Summary
ReadFromTextandWriteToTexthandle flat files.- Supported on local disk and cloud stores (GCS/S3).
- Outputs are sharded automatically to optimize write speeds.