intermediate

Sample

6 min readLast updated: 2026-07-01

1. Introduction

The Sample transform in Apache Beam extracts a random, representative subset of elements from a PCollection. It supports selecting a fixed number of items globally from the entire collection or taking a fixed number of items per key.

2. Why This Concept Exists

Modern pipelines process billions of records daily. Running complex machine learning training cycles, reporting analysis, or exploratory data quality checks on the entire dataset is often slow and unnecessary. Sampling lets you extract a smaller, statistically sound subset of data for validation, profiling, or testing, minimizing execution costs.

3. Key Terminology

  • Reservoir Sampling: An algorithm for choosing a random sample of $N$ items from a list containing a large, unknown number of items.
  • Sample.FixedSizeGlobally(N): A combiner that returns a list of $N$ randomly selected elements from the entire PCollection.
  • Sample.FixedSizePerKey(N): A combiner that returns a list of $N$ randomly selected values for each unique key.

4. How It Works

Distributed sampling uses Reservoir Sampling:

  • Each worker node reads its subset of elements and keeps a local "reservoir" list of size $N$.
  • For each new element, the worker uses a probability formula to decide whether to swap it into the reservoir or discard it.
  • Only the final reservoirs of size $N$ from each worker are shuffled and sent to the combiner node.
  • The combiner node merges the reservoirs to output a final list of $N$ random items.

5. Visual Diagram

Worker 1 Reservoir
Inputs: [A, B, C, D, E]
Local sample: [B, E]
Worker 2 Reservoir
Inputs: [F, G, H, I, J]
Local sample: [G, I]
▼ (Merge & Final Sample)

Final Sample Result
[E, G]

6. Code Example

Sampling 3 elements globally and 1 element per category key:

python
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | "CreateWords" >> beam.Create(["cat", "dog", "elephant", "bear", "fox", "ant"])
    
    # 1. Sample 3 words globally
    global_sample = words | "SampleThree" >> beam.combiners.Sample.FixedSizeGlobally(3)
    
    # 2. Sample 1 word per category key
    animals = p | "CreateAnimals" >> beam.Create([
        ("mammal", "dog"),
        ("mammal", "cat"),
        ("insect", "ant"),
        ("insect", "bee")
    ])
    category_sample = animals | "SamplePerCategory" >> beam.combiners.Sample.FixedSizePerKey(1)

7. Code Explanation

  • global_sample outputs a PCollection with a single list containing three randomly selected items, e.g., [["bear", "dog", "ant"]].
  • category_sample yields random elements per key: [("mammal", ["cat"]), ("insect", ["bee"])].

8. Real Production Example

In a system auditing pipeline, you can sample 100 successful API transaction responses daily to inspect payloads for schema validation or security audits, avoiding database bloat:

python
auditable_sample = successful_responses | "SampleAudits" >> beam.combiners.Sample.FixedSizeGlobally(100)

9. Common Mistakes

  • Expecting flat outputs: Similar to the Top transform, Sample combiners output a single list containing the sampled elements (e.g. [[item1, item2]]). To process the items individually, you must flatten the list using beam.FlatMap(lambda x: x).
  • Expecting deterministic results: Because the sampling algorithm relies on random probability, running the pipeline multiple times will yield different output sets.

10. Interview Perspective

  • Question: How does reservoir sampling work in a distributed environment?
  • Answer: Workers assign a random uniform number to each incoming element and maintain a local priority queue of size $N$ for the largest random values. Merging reservoirs across workers is done by taking the top $N$ elements with the largest random keys globally.
  • Question: What happens if the input PCollection has fewer than $N$ elements?
  • Answer: The transform returns all elements in the collection, wrapped in a list.

11. Best Practices

  • Never use global sorting to sample elements (e.g., sorting and taking the limit), as it requires a full shuffle of the entire dataset.
  • Use Sample early in pipelines when building diagnostic dashboards to speed up debugging cycles.

12. Summary

  • Sample extracts random subsets of elements from PCollections.
  • Uses reservoir sampling to keep memory usage minimal.
  • The output is returned as a list inside a single-element PCollection.

13. Interactive Challenges

Challenge 1: Sample 2 User IDs (Beginner)

Write an Apache Beam statement that takes a PCollection of user IDs and samples exactly 2 unique IDs globally.

Challenge 2: Sample 1 Log per Severity (Intermediate)

Write a pipeline segment that takes a key-value PCollection of logs in the format (severity, log_message) and samples exactly 1 log message for each severity key.

Challenge 3: Ingest and Sample Numbers (Advanced)

Write a complete pipeline segment that loads numbers [10, 20, 30, 40, 50], samples 3 numbers globally, flattens the resulting list into individual values, and prints them.

14. Related Content

Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)