Write Transform — Learn Apache Beam

1. Introduction

A write transform is typically the final step in a pipeline. It exports the processed elements of a PCollection to external storage, such as text files, databases, or cloud warehouses.

2. Why This Concept Exists

Just as reading needs parallelization, writing must be distributed. If a single worker gathered all records to write them sequentially to a single file, it would create an execution bottleneck. Beam's write transforms write chunks of data concurrently from multiple worker nodes, producing sharded files or batched database inserts.

3. Key Terminology

Sink: The destination storage system (e.g. database, object storage).
Sharding: Splitting the output into multiple physical files (e.g. part-00000-of-00005.txt).
Write Disposition: Rules for writing to existing targets (e.g. overwrite, append).
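To make the shard naming concrete, here is a minimal sketch in plain Python that builds names in the same shape as the example above. The helper shard_names is hypothetical and written only for illustration; it is not part of the Beam API.

```python
def shard_names(prefix: str, num_shards: int, suffix: str = "") -> list[str]:
    """Return names shaped like <prefix>-SSSSS-of-NNNNN<suffix>."""
    return [
        # Zero-pad the shard index and the total shard count to five digits.
        f"{prefix}-{index:05d}-of-{num_shards:05d}{suffix}"
        for index in range(num_shards)
    ]


names = shard_names("part", 5, ".txt")
print(names[0])   # part-00000-of-00005.txt
print(names[-1])  # part-00004-of-00005.txt
```

Every name carries both the shard index and the total count, so a reader of the output can tell whether any shard is missing.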

4. How It Works

The pipeline routes records to writing tasks on worker nodes.
Each worker node writes its current batch of elements to a temporary file shard in parallel.
Once all workers complete, the runner executes a final commit, renaming the temp shards to the final output file pattern.
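The steps above can be imitated with the standard library alone. The sketch below is an illustration of the write-then-rename idea, not Beam's actual implementation: threads stand in for worker nodes, and the names write_temp_shard and parallel_write are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_temp_shard(temp_dir, shard_index, records):
    """One 'worker' writes its batch to its own temporary file."""
    temp_path = os.path.join(temp_dir, f"temp-shard-{shard_index}")
    with open(temp_path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(record + "\n")
    return temp_path


def parallel_write(records, output_dir, prefix, num_shards):
    """Write temp shards concurrently, then rename them to final names."""
    # Split the records round-robin into one batch per shard.
    batches = [records[i::num_shards] for i in range(num_shards)]
    temp_dir = tempfile.mkdtemp(dir=output_dir, prefix="tmp-")

    # Leaving the with-block waits for every worker thread to finish.
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        temp_paths = list(pool.map(
            lambda item: write_temp_shard(temp_dir, item[0], item[1]),
            enumerate(batches)))

    # Commit step: runs only after all temporary shards are complete.
    final_paths = []
    for index, temp_path in enumerate(temp_paths):
        final_path = os.path.join(
            output_dir, f"{prefix}-{index:05d}-of-{num_shards:05d}")
        os.replace(temp_path, final_path)
        final_paths.append(final_path)
    os.rmdir(temp_dir)  # The temporary directory is empty after the renames.
    return final_paths


output_dir = tempfile.mkdtemp()
shards = parallel_write(["Alice", "Bob", "Charlie"], output_dir, "names", 3)
for path in shards:
    print(os.path.basename(path))
```

Because the renames happen only after all writes succeed, a reader scanning the output directory for the final names never sees a half-written shard.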

5. Visual Diagram

PCollection Input
  ➔ (Parallel Write)
      Worker A: Temp file 1
      Worker B: Temp file 2
      Worker C: Temp file 3
  ➔ (Commit/Rename)
Sharded Output Files

6. Code Example

Writing names to text files in the local directory:

python

import apache_beam as beam

with beam.Pipeline() as p:
    names = p | "CreateNames" >> beam.Create(["Alice", "Bob", "Charlie"])
    names | "WriteToFile" >> beam.io.WriteToText("output-names")

7. Code Explanation

beam.io.WriteToText("output-names") creates files starting with the prefix output-names.
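Beam writes the elements to one or more sharded text files and appends a suffix that encodes the shard index and the total shard count, for example -00000-of-00001. The number of shards is chosen by the runner unless you set num_shards.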

8. Real Production Example

Writing logs to a GCS bucket with a custom file suffix and a single, unnumbered output file (suitable only for small outputs):

python

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "CreateData" >> beam.Create(["Log 1: Info", "Log 2: Error"])
     | "WriteToGCS" >> beam.io.WriteToText(
         "gs://my-bucket/logs/run",
         file_name_suffix=".log",
         num_shards=1,
         shard_name_template=""  # Removes the default part numbering
     ))

9. Common Mistakes

Expecting a Single Output File by Default: Running WriteToText("output.csv") produces files like output.csv-00000-of-00003, not a file named output.csv. If you require exactly one file with that exact name, specify num_shards=1 and set shard_name_template to an empty string.
Assuming Output File Order: The lines in the output files will not follow a specific ordered sequence because different workers write parts in parallel.

10. Interview Perspective

Question: Why does Apache Beam output files with -00000-of-XXXXX suffixes by default?
Answer: To maximize write throughput. Multiple workers can write to different file parts simultaneously without resource locking.
Question: How do write transforms handle pipeline failures during execution?
Answer: They write to a temporary location first and rename the temporary files to their final location only after all writes have succeeded. A job that fails before that commit step therefore does not leave partial files under the final output names.

11. Best Practices

Avoid forcing num_shards=1 on large datasets, as it serializes the write stage and slows down the pipeline.
Ensure the target filesystem has sufficient space and write permissions before starting the pipeline.

12. Summary

Write transforms write PCollections back to external storage.
They write to multiple shards in parallel to maintain scalability.
Data is written to temporary paths and committed upon job success.

13. Interactive Challenges

Challenge 1: Output File with Suffix (Beginner)

Write a pipeline segment that writes a list of items to files whose paths start with the prefix "report" and end with the suffix .txt.

Challenge 2: Fixed Shards (Intermediate)

Write a pipeline segment that forces the write transform to write to exactly 5 shards.

Challenge 3: Write with Header (Advanced)

Write a pipeline segment that writes user rows to a file "users" with a header row "id,name,email".

14. Related Content

Previous Lesson: Read Transform

Next Lesson: Type Hints