intermediate
Partition (Splitting)
5 min readLast updated: 2026-07-01
1. Introduction
The Partition transform splits a single PCollection into a fixed number of smaller PCollections based on a partitioning function you provide.
2. Why This Concept Exists
While Filter removes records you do not want, you frequently need to route records to different destinations without losing any. For example, you want to send success logs to an archival bucket and error logs to an alert slack channel. Partition lets you split the stream into multiple branches in a single parallel operation.
3. Key Terminology
- Partition Function: A function that takes an element and returns an integer index from
0tonum_partitions - 1, specifying which branch the element belongs to. - Partition List: The output object representing the list of resulting PCollections.
4. How It Works
- Define partitions: Specify how many branches you want (e.g.
num_partitions = 3). - Define index mapper: Provide a function that calculates the branch index.
- Split: Apply
beam.Partition(...). It returns a list of PCollections. You access each branch by index:partitions[index].
5. Visual Diagram
Input PCollection
[1, 2, 3, 4]
Branch 0 (Even)
[2, 4]
Branch 1 (Odd)
[1, 3]
6. Code Example
Splitting transactions into High-Value and Low-Value buckets:
python
import apache_beam as beam
def split_value_fn(element, num_partitions):
# Return index 0 for high-value (>= 100), index 1 for low-value (< 100)
if element >= 100.0:
return 0
return 1
with beam.Pipeline() as p:
transactions = p | beam.Create([50.0, 150.0, 30.0, 200.0])
# 1. Apply Partitioning into 2 branches
split_tx = transactions | "SplitTx" >> beam.Partition(split_value_fn, 2)
# 2. Extract separate PCollections
high_value = split_tx[0]
low_value = split_tx[1]
7. Code Explanation
split_value_fnreceiveselementandnum_partitions. It returns0or1.beam.Partition(split_value_fn, 2)splits the input.split_tx[0]holds elements[150.0, 200.0].split_tx[1]holds elements[50.0, 30.0].
8. Real Production Example
In server monitoring pipelines, you partition log streams based on HTTP status ranges:
- Index
0: Success logs (2xxstatus codes) -> load directly to BI dashboard tables. - Index
1: Client errors (4xxstatus codes) -> write to metric alerts cache. - Index
2: Server crashes (5xxstatus codes) -> route immediately to pager duty messaging hooks.
9. Common Mistakes
- Returning indexes out of bounds: If you configure
num_partitions = 3, your partition function must only return0,1, or2. Returning3or any negative number will raise anIndexErrorcrash. - Assuming dynamic partition counts: The number of partitions must be a static integer defined at construction time. You cannot change the number of partitions dynamically based on runtime record details.
10. Interview Perspective
- Question: What is the difference between Filter and Partition?
- Answer:
Filtertakes a boolean predicate and outputs a single collection discarding elements that returnFalse.Partitionsplits the collection into a fixed list of multiple separate output collections without discarding any data. - Question: Can you achieve partitioning using multiple Filter steps?
- Answer: Yes, but it is highly inefficient. Running three filter steps requires evaluating the entire PCollection three separate times.
Partitionevaluates each element once, routing it directly.
11. Best Practices
- Validate index calculations (e.g. using
try/exceptor modulo boundaries) inside your partitioning function to prevent out-of-bounds index exceptions. - Use Partition to keep your downstream processing stages modular and isolated.
12. Summary
Partitionsplits a PCollection into a static list of smaller collections.- Evaluated by an index-mapping function returning
0toN-1. - More efficient than applying multiple filters in parallel.
13. Interactive Challenges
14. Related Content
Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)