In distributed data processing, Data Skew occurs when a disproportionately large amount of data is associated with a single key. In Apache Beam and Google Cloud Dataflow, all elements sharing the same key are routed to the same worker node for grouping or combining.
If one key (the Hot Key) has millions of events while others have only a few, the worker assigned to that hot key will become overloaded, causing CPU throttling, memory exhaustion, and pipeline lag across the entire cluster.
When your Dataflow pipeline encounters a hot key, you will typically observe the symptoms described above concentrated in GroupByKey stages.
Common hot keys include null values in database joins, default keys (like country code "US"), or heavy telemetry indicators.
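One way to confirm a suspected hot key is to measure key frequencies on a sample of the input. The following is a minimal sketch in plain Python (not Beam); the function name `find_hot_keys` and the 20% threshold are hypothetical choices for illustration, not part of any library.

```python
from collections import Counter


def find_hot_keys(pairs, threshold=0.2):
    """Return {key: share} for keys holding more than `threshold` of all elements.

    `pairs` is an iterable of (key, value) tuples; `threshold` is an
    illustrative cutoff and should be tuned for the workload.
    """
    counts = Counter(key for key, _ in pairs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: c / total for k, c in counts.items() if c / total > threshold}


# A small sample where "US" dominates and a few records have a null key.
sample = [("US", 1)] * 90 + [("FR", 1)] * 5 + [(None, 1)] * 5
hot = find_hot_keys(sample)
print(hot)
```

Running a check like this on a representative sample before choosing the number of salts helps target only the keys that actually need spreading.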
Key Salting is a technique that spreads a skewed key across workers in three steps: append a random suffix (the "salt") to the key before grouping, compute partial aggregates on the salted keys, then strip the salt and combine the partials into the final aggregate value.
Modify your key mapping step to append a random integer to your key:
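import random

import apache_beam as beam


def add_salt(element, num_salts=10):
    """Append a random salt in [0, num_salts - 1] to the element's key."""
    key, val = element
    salt = random.randint(0, num_salts - 1)
    # Salted element: ("<key>_<salt>", value). The key is formatted as a
    # string, so non-string keys come back as strings after stripping.
    return (f"{key}_{salt}", val)


# Apply in pipeline; raw_kv is the upstream PCollection of (key, value) pairs
salted = raw_kv | "AddSalt" >> beam.Map(add_salt)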
Aggregate data using the salted keys. This spreads the load across multiple worker VMs:
# Compute partial sums on salted keys
partial_sums = salted | "PartialSum" >> beam.CombinePerKey(sum)
Remove the salt suffix and combine the partial aggregates to find the final total:
def strip_salt(element):
    salted_key, val = element
    # Split on the last underscore only, so keys containing "_" survive
    original_key = salted_key.rsplit("_", 1)[0]
    return (original_key, val)


final_sums = (
    partial_sums
    | "StripSalt" >> beam.Map(strip_salt)
    | "FinalSum" >> beam.CombinePerKey(sum)
)
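To check that the two-phase approach preserves results, the steps above can be simulated in plain Python without a Beam runner. This is a sketch under stated assumptions: `sum_per_key` and `salted_sum` are hypothetical helpers, the seed is fixed only to make the run repeatable, and the salt is stored as a tuple `(key, salt)` rather than a string suffix, a variation that avoids string parsing and preserves the original key type (including null keys).

```python
import random
from collections import defaultdict


def sum_per_key(pairs):
    """Plain-Python stand-in for a per-key sum."""
    totals = defaultdict(int)
    for key, val in pairs:
        totals[key] += val
    return dict(totals)


def salted_sum(pairs, num_salts=10, seed=0):
    """Two-phase sum: partial sums on (key, salt), then final sums on key."""
    rng = random.Random(seed)  # fixed seed for a repeatable illustration
    # Phase 1: attach a salt and aggregate on the salted key.
    salted = [((key, rng.randrange(num_salts)), val) for key, val in pairs]
    partial = sum_per_key(salted)
    # Phase 2: drop the salt and combine the partial sums.
    stripped = [(key, val) for (key, _salt), val in partial.items()]
    return sum_per_key(stripped), partial


events = [("US", 1)] * 1000 + [("FR", 1)] * 10
final, partial = salted_sum(events)

# The final totals match a direct per-key sum.
print(final == sum_per_key(events))
# The hot key's load is split across several salted keys.
print(max(v for (k, _s), v in partial.items() if k == "US"))
```

The largest partial sum for the hot key approximates the heaviest single-key load after salting; with ten salts it should be roughly a tenth of the original.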
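CombinePerKey benefits from combiner lifting: values are partially aggregated on each worker before the shuffle, so far fewer elements per key cross the network. For associative and commutative operations such as sum, this mitigates much of the skew without extra code. Salting can remain useful for GroupByKey, which has no such pre-aggregation, and for combines where a single key is still too hot.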
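The effect of combiner lifting can be illustrated with a small plain-Python simulation. This is only a sketch of the idea, not how a runner implements it: each inner list stands in for the elements one worker processes, and `combine_bundle` and `lifted_sum` are hypothetical names.

```python
from collections import defaultdict


def combine_bundle(bundle):
    """Pre-aggregate one worker's elements into one (key, partial) per key."""
    acc = defaultdict(int)
    for key, val in bundle:
        acc[key] += val
    return list(acc.items())


def lifted_sum(bundles):
    """Sum per key after local pre-aggregation; also report shuffled elements."""
    pre_combined = [pair for bundle in bundles for pair in combine_bundle(bundle)]
    shuffled_count = len(pre_combined)  # elements that would cross the shuffle
    final = defaultdict(int)
    for key, partial in pre_combined:
        final[key] += partial
    return dict(final), shuffled_count


# Four workers, each seeing 500 "US" events and 2 "FR" events.
bundles = [[("US", 1)] * 500 + [("FR", 1)] * 2 for _ in range(4)]
totals, shuffled = lifted_sum(bundles)
print(totals)    # {'US': 2000, 'FR': 8}
print(shuffled)  # 8
```

Here only 8 pre-combined elements reach the final merge instead of 2008 raw ones. In the Beam Python SDK, `beam.CombinePerKey(sum).with_hot_key_fanout(n)` adds a comparable intermediate fan-out step for hot keys inside the combine itself, which may remove the need for manual salting in combine-only pipelines.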