beginner
Write Transform
7 min readLast updated: 2026-07-01
1. Introduction
A Write Transform is the final step in a pipeline, responsible for exporting the processed elements of a PCollection back into external storage, such as text files, databases, or cloud warehouses.
2. Why This Concept Exists
Just as reading needs parallelization, writing must be distributed. If a single worker gathered all records to write them sequentially to a single file, it would create an execution bottleneck. Beam's write transforms write chunks of data concurrently from multiple worker nodes, producing sharded files or batched database inserts.
3. Key Terminology
- Sink: The destination storage system (e.g. database, object storage).
- Sharding: Splitting the output into multiple physical files (e.g.
part-00000-of-00005.txt). - Write Disposition: Rules for writing to existing targets (e.g. overwrite, append).
4. How It Works
- The pipeline routes records to writing tasks on worker nodes.
- Each worker node writes its current batch of elements to a temporary file shard in parallel.
- Once all workers complete, the runner executes a final commit, renaming the temp shards to the final output file pattern.
5. Visual Diagram
PCollection Input
➔ (Parallel Write)Worker A: Temp file 1
Worker B: Temp file 2
Worker C: Temp file 3
Sharded Output Files
6. Code Example
Writing names to text files in the local directory:
python
import apache_beam as beam
with beam.Pipeline() as p:
names = p | "CreateNames" >> beam.Create(["Alice", "Bob", "Charlie"])
names | "WriteToFile" >> beam.io.WriteToText("output-names")
7. Code Explanation
beam.io.WriteToText("output-names")creates files starting with the prefixoutput-names.- Beam writes elements to sharded text files, automatically adding suffixes like
-00000-of-00001or similar.
8. Real Production Example
Writing logs to a GCS bucket with custom compression and a single shard for small files:
python
import apache_beam as beam
with beam.Pipeline() as p:
(p
| "CreateData" >> beam.Create(["Log 1: Info", "Log 2: Error"])
| "WriteToGCS" >> beam.io.WriteToText(
"gs://my-bucket/logs/run",
file_name_suffix=".log",
num_shards=1,
shard_name_template="" # Removes the default part numbering
))
9. Common Mistakes
- Expecting a Single Output File by Default: Running
WriteToText("output.csv")will result in files likeoutput.csv-00000-of-00003. If you require exactly one file, you must specifynum_shards=1and clear theshard_name_template. - Assuming Output File Order: The lines in the output files will not follow a specific ordered sequence because different workers write parts in parallel.
10. Interview Perspective
- Question: Why does Apache Beam output files with
-00000-of-XXXXXsuffixes by default? - Answer: To maximize write throughput. Multiple workers can write to different file parts simultaneously without resource locking.
- Question: How do write transforms handle pipeline failures during execution?
- Answer: They write to a temporary location first. If the job fails, temporary files are discarded. Temporary files are renamed to their final location only upon successful job completion.
11. Best Practices
- Avoid forcing
num_shards=1on large datasets, as it serializes the write stage and slows down the pipeline. - Ensure the target filesystem has sufficient space and write permissions before starting the pipeline.
12. Summary
- Write transforms write PCollections back to external storage.
- They write to multiple shards in parallel to maintain scalability.
- Data is written to temporary paths and committed upon job success.
13. Interactive Challenges
14. Related Content
Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)