Sample
1. Introduction
The Sample transform in Apache Beam extracts a random, representative subset of elements from a PCollection. It supports selecting a fixed number of items globally from the entire collection or taking a fixed number of items per key.
2. Why This Concept Exists
Modern pipelines process billions of records daily. Running complex machine learning training cycles, reporting analysis, or exploratory data quality checks on the entire dataset is often slow and unnecessary. Sampling lets you extract a smaller, statistically sound subset of data for validation, profiling, or testing, minimizing execution costs.
3. Key Terminology
- Reservoir Sampling: An algorithm for choosing a random sample of $N$ items from a list containing a large, unknown number of items.
- Sample.FixedSizeGlobally(N): A combiner that returns a list of $N$ randomly selected elements from the entire PCollection.
- Sample.FixedSizePerKey(N): A combiner that returns a list of $N$ randomly selected values for each unique key.
4. How It Works
Distributed sampling uses Reservoir Sampling:
- Each worker node reads its subset of elements and keeps a local "reservoir" list of size $N$.
- For each new element, the worker uses a probability formula to decide whether to swap it into the reservoir or discard it.
- Only the final reservoirs of size $N$ from each worker are shuffled and sent to the combiner node.
- The combiner node merges the reservoirs to output a final list of $N$ random items.
5. Visual Diagram
Local sample: [B, E]
Local sample: [G, I]
Final Sample Result
[E, G]
6. Code Example
Sampling 3 elements globally and 1 element per category key:
import apache_beam as beam
with beam.Pipeline() as p:
words = p | "CreateWords" >> beam.Create(["cat", "dog", "elephant", "bear", "fox", "ant"])
# 1. Sample 3 words globally
global_sample = words | "SampleThree" >> beam.combiners.Sample.FixedSizeGlobally(3)
# 2. Sample 1 word per category key
animals = p | "CreateAnimals" >> beam.Create([
("mammal", "dog"),
("mammal", "cat"),
("insect", "ant"),
("insect", "bee")
])
category_sample = animals | "SamplePerCategory" >> beam.combiners.Sample.FixedSizePerKey(1)
7. Code Explanation
global_sampleoutputs a PCollection with a single list containing three randomly selected items, e.g.,[["bear", "dog", "ant"]].category_sampleyields random elements per key:[("mammal", ["cat"]), ("insect", ["bee"])].
8. Real Production Example
In a system auditing pipeline, you can sample 100 successful API transaction responses daily to inspect payloads for schema validation or security audits, avoiding database bloat:
auditable_sample = successful_responses | "SampleAudits" >> beam.combiners.Sample.FixedSizeGlobally(100)
9. Common Mistakes
- Expecting flat outputs: Similar to the
Toptransform,Samplecombiners output a single list containing the sampled elements (e.g.[[item1, item2]]). To process the items individually, you must flatten the list usingbeam.FlatMap(lambda x: x). - Expecting deterministic results: Because the sampling algorithm relies on random probability, running the pipeline multiple times will yield different output sets.
10. Interview Perspective
- Question: How does reservoir sampling work in a distributed environment?
- Answer: Workers assign a random uniform number to each incoming element and maintain a local priority queue of size $N$ for the largest random values. Merging reservoirs across workers is done by taking the top $N$ elements with the largest random keys globally.
- Question: What happens if the input PCollection has fewer than $N$ elements?
- Answer: The transform returns all elements in the collection, wrapped in a list.
11. Best Practices
- Never use global sorting to sample elements (e.g., sorting and taking the limit), as it requires a full shuffle of the entire dataset.
- Use
Sampleearly in pipelines when building diagnostic dashboards to speed up debugging cycles.
12. Summary
Sampleextracts random subsets of elements from PCollections.- Uses reservoir sampling to keep memory usage minimal.
- The output is returned as a list inside a single-element PCollection.