beginner

Write Transform

7 min readLast updated: 2026-07-01

1. Introduction

A Write Transform is the final step in a pipeline, responsible for exporting the processed elements of a PCollection back into external storage, such as text files, databases, or cloud warehouses.

2. Why This Concept Exists

Just as reading needs parallelization, writing must be distributed. If a single worker gathered all records to write them sequentially to a single file, it would create an execution bottleneck. Beam's write transforms write chunks of data concurrently from multiple worker nodes, producing sharded files or batched database inserts.

3. Key Terminology

  • Sink: The destination storage system (e.g. database, object storage).
  • Sharding: Splitting the output into multiple physical files (e.g. part-00000-of-00005.txt).
  • Write Disposition: Rules for writing to existing targets (e.g. overwrite, append).

4. How It Works

  • The pipeline routes records to writing tasks on worker nodes.
  • Each worker node writes its current batch of elements to a temporary file shard in parallel.
  • Once all workers complete, the runner executes a final commit, renaming the temp shards to the final output file pattern.

5. Visual Diagram

PCollection Input

Worker A: Temp file 1

Worker B: Temp file 2

Worker C: Temp file 3

Sharded Output Files

6. Code Example

Writing names to text files in the local directory:

python
import apache_beam as beam

with beam.Pipeline() as p:
    names = p | "CreateNames" >> beam.Create(["Alice", "Bob", "Charlie"])
    names | "WriteToFile" >> beam.io.WriteToText("output-names")

7. Code Explanation

  • beam.io.WriteToText("output-names") creates files starting with the prefix output-names.
  • Beam writes elements to sharded text files, automatically adding suffixes like -00000-of-00001 or similar.

8. Real Production Example

Writing logs to a GCS bucket with custom compression and a single shard for small files:

python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "CreateData" >> beam.Create(["Log 1: Info", "Log 2: Error"])
     | "WriteToGCS" >> beam.io.WriteToText(
         "gs://my-bucket/logs/run",
         file_name_suffix=".log",
         num_shards=1,
         shard_name_template=""  # Removes the default part numbering
     ))

9. Common Mistakes

  • Expecting a Single Output File by Default: Running WriteToText("output.csv") will result in files like output.csv-00000-of-00003. If you require exactly one file, you must specify num_shards=1 and clear the shard_name_template.
  • Assuming Output File Order: The lines in the output files will not follow a specific ordered sequence because different workers write parts in parallel.

10. Interview Perspective

  • Question: Why does Apache Beam output files with -00000-of-XXXXX suffixes by default?
  • Answer: To maximize write throughput. Multiple workers can write to different file parts simultaneously without resource locking.
  • Question: How do write transforms handle pipeline failures during execution?
  • Answer: They write to a temporary location first. If the job fails, temporary files are discarded. Temporary files are renamed to their final location only upon successful job completion.

11. Best Practices

  • Avoid forcing num_shards=1 on large datasets, as it serializes the write stage and slows down the pipeline.
  • Ensure the target filesystem has sufficient space and write permissions before starting the pipeline.

12. Summary

  • Write transforms write PCollections back to external storage.
  • They write to multiple shards in parallel to maintain scalability.
  • Data is written to temporary paths and committed upon job success.

13. Interactive Challenges

Challenge 1: Output file with Suffix (Beginner)

Write a pipeline segment that writes a list of items to files prefixing the path "report" and ending with a .txt suffix.

Challenge 2: Fixed Shards (Intermediate)

Write a pipeline segment that forces the write transform to write to exactly 5 shards.

Challenge 3: Write with Header (Advanced)

Write a pipeline segment that writes user rows to a file "users" with a header row "id,name,email".

14. Related Content

Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)