beginner

Mean

5 min readLast updated: 2026-07-01

1. Introduction

The Mean transform in Apache Beam computes the arithmetic mean (average) of elements in a PCollection. Beam provides two main variations: computing the mean of all elements globally, or computing the mean of values grouped per key.

2. Why This Concept Exists

Averaging values is fundamental to statistical profiling and telemetry processing—such as measuring the average response time of server calls, average transaction spending, or average ambient sensor readings. Since calculating the mean requires tracking both the running sum and the running count of elements across distributed workers, Apache Beam provides built-in Mean combiners to handle the accumulation and division logic efficiently.

3. Key Terminology

  • Mean.Globally: Computes the average of all elements in the entire PCollection.
  • Mean.PerKey: Computes the average of values associated with each unique key in a key-value PCollection.

4. How It Works

  • The Mean combiner uses a two-part accumulator containing the (sum, count) of elements processed.
  • Local Accumulation: Individual worker nodes accumulate the elements in their partition, calculating their local sum and element count.
  • Shuffle & Merge: The local accumulators are transferred and merged at the reducer node: (sum1 + sum2, count1 + count2).
  • Extraction: The final step divides the total sum by the total count to return the overall average.

5. Visual Diagram

Worker 1 & 2 Accumulator
Local inputs: 10, 20
State: sum=30, count=2
Worker 3 & 4 Accumulator
Local inputs: 30, 40
State: sum=70, count=2
▼ (Final Merge & Divide)

Global Mean Result
sum=100, count=4 ➔ Mean = 25.0

6. Code Example

Computing global average and average score per student:

python
import apache_beam as beam

with beam.Pipeline() as p:
    grades = p | "Grades" >> beam.Create([
        ("Alice", 80.0),
        ("Bob", 90.0),
        ("Alice", 100.0),
        ("Bob", 80.0)
    ])
    
    # 1. Mean Per Key
    student_averages = grades | "AvgPerStudent" >> beam.combiners.Mean.PerKey()
    
    # 2. Mean Globally (only values)
    just_scores = grades | "GetScores" >> beam.Map(lambda x: x[1])
    global_average = just_scores | "GlobalAvg" >> beam.combiners.Mean.Globally()

7. Code Explanation

  • student_averages will yield: [("Alice", 90.0), ("Bob", 85.0)].
  • global_average ignores keys, extracts the score values 80.0, 90.0, 100.0, 80.0, and outputs a single value: 87.5.

8. Real Production Example

When processing smart meter grid telemetry, you use Mean.PerKey to calculate the average hourly power consumption (in kWh) grouped by customer meter ID to detect anomalies or billing tiers.

9. Common Mistakes

  • Applying Mean.Globally on non-numeric types: Trying to average strings or custom objects without extracting numerical values first will cause a TypeError.
  • Dividing by zero on empty inputs: If the input PCollection is empty, Mean.Globally returns a PCollection containing NaN (Not a Number) or None. Your downstream code must handle these cases.

10. Interview Perspective

  • Question: How does the Mean transform avoid floating-point rounding errors during distributed merges?
  • Answer: It aggregates counts and sums as exact numerical values, delaying division until the final output extraction stage (extract_output).
  • Question: Is Mean commutative and associative?
  • Answer: While the mathematical average function is not directly associative, the underlying (sum, count) accumulator addition is both commutative and associative, satisfying Beam's combiner requirements.

11. Best Practices

  • Clean the data before calculating the mean by filtering out anomalous sensor spikes or negative numbers that could skew the averages.
  • Prefer the built-in Mean combiners rather than creating custom accumulators to minimize serialization size.

12. Summary

  • Mean calculates the average of numerical PCollections.
  • Uses a (sum, count) accumulator under the hood.
  • Supports global (Mean.Globally) and key-grouped (Mean.PerKey) modes.

13. Interactive Challenges

Challenge 1: Global Temperature Average (Beginner)

Write an Apache Beam statement that takes a PCollection of float temperature readings and computes the global average temperature.

Challenge 2: Average Price Per Brand (Intermediate)

Write a pipeline segment that takes a PCollection of tuples containing (brand_name, product_price) and computes the average product price for each brand.

Challenge 3: Average Reading from Dict Data (Advanced)

You have a PCollection of telemetry dictionaries: {"sensor_id": "Room-A", "value": 24.5}. Write a pipeline segment that maps these dictionaries into (sensor_id, value) tuples, and then calculates the average value per sensor ID.

14. Related Content

Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)