Mean
1. Introduction
The Mean transform in Apache Beam computes the arithmetic mean (average) of elements in a PCollection. Beam provides two main variations: computing the mean of all elements globally, or computing the mean of values grouped per key.
2. Why This Concept Exists
Averaging values is fundamental to statistical profiling and telemetry processing—such as measuring the average response time of server calls, average transaction spending, or average ambient sensor readings. Since calculating the mean requires tracking both the running sum and the running count of elements across distributed workers, Apache Beam provides built-in Mean combiners to handle the accumulation and division logic efficiently.
3. Key Terminology
- Mean.Globally: Computes the average of all elements in the entire PCollection.
- Mean.PerKey: Computes the average of values associated with each unique key in a key-value PCollection.
4. How It Works
- The
Meancombiner uses a two-part accumulator containing the(sum, count)of elements processed. - Local Accumulation: Individual worker nodes accumulate the elements in their partition, calculating their local sum and element count.
- Shuffle & Merge: The local accumulators are transferred and merged at the reducer node:
(sum1 + sum2, count1 + count2). - Extraction: The final step divides the total sum by the total count to return the overall average.
5. Visual Diagram
State: sum=30, count=2
State: sum=70, count=2
Global Mean Result
sum=100, count=4 ➔ Mean = 25.0
6. Code Example
Computing global average and average score per student:
import apache_beam as beam
with beam.Pipeline() as p:
grades = p | "Grades" >> beam.Create([
("Alice", 80.0),
("Bob", 90.0),
("Alice", 100.0),
("Bob", 80.0)
])
# 1. Mean Per Key
student_averages = grades | "AvgPerStudent" >> beam.combiners.Mean.PerKey()
# 2. Mean Globally (only values)
just_scores = grades | "GetScores" >> beam.Map(lambda x: x[1])
global_average = just_scores | "GlobalAvg" >> beam.combiners.Mean.Globally()
7. Code Explanation
student_averageswill yield:[("Alice", 90.0), ("Bob", 85.0)].global_averageignores keys, extracts the score values80.0, 90.0, 100.0, 80.0, and outputs a single value:87.5.
8. Real Production Example
When processing smart meter grid telemetry, you use Mean.PerKey to calculate the average hourly power consumption (in kWh) grouped by customer meter ID to detect anomalies or billing tiers.
9. Common Mistakes
- Applying Mean.Globally on non-numeric types: Trying to average strings or custom objects without extracting numerical values first will cause a
TypeError. - Dividing by zero on empty inputs: If the input PCollection is empty,
Mean.Globallyreturns a PCollection containingNaN(Not a Number) orNone. Your downstream code must handle these cases.
10. Interview Perspective
- Question: How does the Mean transform avoid floating-point rounding errors during distributed merges?
- Answer: It aggregates counts and sums as exact numerical values, delaying division until the final output extraction stage (
extract_output). - Question: Is Mean commutative and associative?
- Answer: While the mathematical average function is not directly associative, the underlying
(sum, count)accumulator addition is both commutative and associative, satisfying Beam's combiner requirements.
11. Best Practices
- Clean the data before calculating the mean by filtering out anomalous sensor spikes or negative numbers that could skew the averages.
- Prefer the built-in
Meancombiners rather than creating custom accumulators to minimize serialization size.
12. Summary
Meancalculates the average of numerical PCollections.- Uses a
(sum, count)accumulator under the hood. - Supports global (
Mean.Globally) and key-grouped (Mean.PerKey) modes.