Global Windows
1. Introduction
A Global Window is the default windowing strategy in Apache Beam. It groups all elements of a PCollection into a single, global window that spans the entire lifetime of the pipeline.
2. Why This Concept Exists
In batch processing, you are working with a finite dataset (e.g. a static CSV file). You don't need to segment data into hourly blocks—you just want to compute metrics over the entire dataset at once. Apache Beam assigns all elements to a single GlobalWindow by default. This allows standard batch pipelines to execute without needing any explicit windowing code.
3. Key Terminology
- Global Window: The single, default window that holds all elements in a PCollection.
- Default Strategy: If you do not apply
beam.WindowIntoin your pipeline, your data runs inside the Global Window. - End of Stream: The timestamp representing the end of time, marking the close of the global window.
4. How It Works
- All elements are assigned to
GlobalWindowwith a start of-infinityand end of+infinity. - For Batch: Once all data is read, the window closes, and aggregations (like sum or count) execute.
- For Streaming: Because the stream is infinite, the global window never ends. Therefore, standard aggregations (like
CombineGlobally) will never fire because the watermark never reaches the end of the window. To aggregate data in a global window inside a streaming pipeline, you must configure a custom trigger (like firing every 10 elements).
5. Visual Diagram
6. Code Example
Configuring a global window in a streaming pipeline that fires a pane every time 5 new elements arrive:
import apache_beam as beam
from apache_beam.transforms.window import GlobalWindows
from apache_beam.transforms.trigger import Repeatedly, AfterCount, AccumulationMode
with beam.Pipeline() as p:
(p
| "ReadStream" >> beam.io.ReadFromPubSub(subscription="projects/my-project/subs/my-sub")
| "ApplyGlobalWindow" >> beam.WindowInto(
GlobalWindows(),
trigger=Repeatedly(AfterCount(5)),
accumulation_mode=AccumulationMode.DISCARDING
)
| "CountFive" >> beam.CombineGlobally(sum).without_defaults()
| "Print" >> beam.Map(print))
7. Code Explanation
GlobalWindows()assigns all elements to the same single global bucket.trigger=Repeatedly(AfterCount(5))tells the runner: "Even though the window never ends, emit the current sum every time 5 elements arrive."AccumulationMode.DISCARDINGclears the buffer after each firing so the next pane starts counting from zero.
8. Real Production Example
In inventory systems, you maintain a running counter of total items sold. You run a streaming pipeline in a global window. As sales logs arrive, you update a stateful counter key using ValueState inside a global window, updating the master database in real-time.
9. Common Mistakes
- Aggregating streaming data in a global window without triggers: Running
stream | beam.CombineGlobally(sum)without applying a trigger will result in no output ever being emitted, as the global window never reaches its end boundary. - Memory leaks: Since the global window never ends, worker state buffers will accumulate data forever unless you configure discarding modes or clear state values.
10. Interview Perspective
- Question: Why does a streaming pipeline in a global window require triggers?
- Answer: Standard aggregations are triggered when the watermark passes the end of a window. Since the global window's end is
+infinity, the watermark will never pass it. Triggers are required to override this behavior, forcing emissions based on element counts or processing durations. - Question: Can you use stateful processing in a global window?
- Answer: Yes. In fact, stateful processing (like
ValueStateandBagState) is commonly used in global windows to maintain long-lived user states, shopping carts, or history tables.
11. Best Practices
- Use global windows for all batch pipelines that do not require time-series aggregations.
- Always define count or processing-time triggers when utilizing global windows in streaming pipelines.
12. Summary
- Global Windows is the default strategy for all PCollections.
- Gathers all data points into a single infinite timeline bucket.
- Batch runs execute once; streaming runs require custom triggers to emit outputs.