Streaming Basics
1. Introduction
Stream processing is a data processing paradigm designed for continuous, unbounded streams of data. Unlike batch processing, which works on complete, static datasets, stream processing handles events as they arrive, enabling near-real-time insights and immediate action.
2. Why This Concept Exists
Modern applications generate data continuously: clicks on websites, financial transactions, IoT sensor readings, and system logs. Waiting hours or days for batch jobs to complete is no longer sufficient for use cases like fraud detection, real-time alerts, or live analytics dashboards. Stream processing frameworks like Apache Beam, together with the runners that execute their pipelines, provide the tools to ingest, analyze, and output continuous data with minimal latency.
3. Key Terminology
- Unbounded Dataset: A dataset of infinite size that grows continuously over time.
- Latency: The time elapsed between a data event occurring and that event being processed by the system.
- Throughput: The volume of data processed by the pipeline per unit of time.
- Event Time: The timestamp indicating when the data event actually occurred at the source.
- Processing Time: The timestamp indicating when the data event is processed by a specific step in the pipeline.
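To make the relationship between event time, processing time, and latency concrete, here is a minimal sketch in plain Python (not Beam code). The Event class and the sample timestamps are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event carrying both timestamps, in seconds."""
    name: str
    event_time: float       # when the event occurred at the source
    processing_time: float  # when a pipeline step handled it


def latency_seconds(event: Event) -> float:
    """Latency is the gap between occurrence and processing."""
    return event.processing_time - event.event_time


events = [
    Event("click", event_time=100.0, processing_time=100.5),
    Event("purchase", event_time=101.0, processing_time=104.0),
]

latencies = [latency_seconds(e) for e in events]
max_latency = max(latencies)
```

Here the purchase event has the larger gap between the two timestamps, so it determines the worst-case latency for this small sample.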
4. How It Works
Streaming pipelines run continuously, often for weeks, months, or years without stopping. The system ingests elements from unbounded sources (like Pub/Sub or Kafka) and passes them to workers as they arrive. Because the dataset never ends, an aggregation over the whole stream would never complete. The stream must first be sliced into finite segments using windowing (e.g., fixed time windows) or handled with stateful operations before running aggregations.
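The idea of slicing a stream into fixed windows can be sketched in plain Python. This is a conceptual illustration, not Beam's implementation: the helper names and the sample timestamps are hypothetical, and the input is a finite list standing in for a stream.

```python
from collections import Counter

WINDOW_SIZE = 60  # seconds; hypothetical window length


def window_start(timestamp: int, size: int = WINDOW_SIZE) -> int:
    """Return the start of the fixed window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def count_per_window(timestamps, size: int = WINDOW_SIZE) -> dict:
    """Count events per fixed window, keyed by window start."""
    return dict(Counter(window_start(ts, size) for ts in timestamps))


# Event-time timestamps in seconds (sample data for illustration).
event_times = [5, 42, 59, 60, 61, 125]
counts = count_per_window(event_times)
```

Each window covers a half-open interval, so a timestamp of 59 lands in the window starting at 0, while 60 opens the next one. In a real streaming pipeline the runner decides when a window is complete and emits its result at that point, rather than waiting for the input to end.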
5. Visual Diagram
Unbounded Source (continuous broker)
  -> Windowing, 5-min (divide stream in time)
  -> Transform / Aggregate (compute sum / stats)
  -> Unbounded Sink (Pub/Sub / BigQuery stream)
6. Code Example
Here is a basic Apache Beam streaming pipeline in Python that reads from Pub/Sub, assigns the incoming messages to 60-second windows, counts the messages in each window, and prints the result:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Create pipeline options with streaming enabled
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadStream" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")
        | "Decode" >> beam.Map(lambda x: x.decode("utf-8"))  # bytes -> str
        | "Window" >> beam.WindowInto(FixedWindows(60))  # 60-second windows
        # without_defaults() is required for a global combine on windowed input
        | "Count" >> beam.combiners.Count.Globally().without_defaults()
        | "Format" >> beam.Map(lambda count: f"Events in last minute: {count}")
        | "WriteLog" >> beam.Map(print)
    )
7. Code Explanation
- PipelineOptions(streaming=True) is critical; it tells the runner to execute the pipeline in streaming mode rather than batch.
- beam.io.ReadFromPubSub is an unbounded source reader that listens continuously for new messages.
- beam.WindowInto(FixedWindows(60)) divides the continuous flow into distinct 60-second intervals based on each element's event-time timestamp.
- Count.Globally() calculates the count of elements that fall into each window. In the Python SDK, a global combine on input that is not in the global window needs .without_defaults(), because no default value can be produced for an empty window.
- print displays each window's result once the window fires, which by default happens when the watermark passes the end of the window.
8. Real Production Example
In a ride-hailing application, GPS coordinates from thousands of drivers are streamed continuously. The pipeline ingests the coordinates, groups them into 1-minute fixed windows, calculates the density of drivers per area, and updates the heat map on the customer app in real time.
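The density calculation described above can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not the production pipeline: the grid resolution, the tuple layout of a ping, and the sample coordinates are all hypothetical.

```python
import math
from collections import defaultdict

WINDOW_SECONDS = 60   # 1-minute fixed windows, as in the description above
CELL_DEGREES = 0.01   # hypothetical grid resolution for an "area"


def grid_cell(lat: float, lon: float, cell: float = CELL_DEGREES) -> tuple:
    """Snap a coordinate to the index of the grid cell containing it."""
    return (math.floor(lat / cell), math.floor(lon / cell))


def driver_density(pings) -> dict:
    """Count distinct drivers per (window start, grid cell).

    `pings` is an iterable of (driver_id, timestamp_seconds, lat, lon).
    """
    drivers = defaultdict(set)
    for driver_id, ts, lat, lon in pings:
        key = (ts - ts % WINDOW_SECONDS, grid_cell(lat, lon))
        drivers[key].add(driver_id)  # a set avoids double-counting a driver
    return {key: len(ids) for key, ids in drivers.items()}


pings = [
    ("d1", 10, 37.7749, -122.4194),
    ("d2", 20, 37.7751, -122.4199),
    ("d1", 50, 37.7752, -122.4191),  # same driver, same cell, same window
    ("d3", 70, 37.7849, -122.4094),  # falls into the next window
]
density = driver_density(pings)
```

Counting distinct driver IDs rather than raw pings is a design choice in this sketch: a driver who reports several positions within one minute should contribute once to the heat map cell.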
9. Common Mistakes
- Forgetting streaming=True: If this option is omitted, the runner may attempt to execute in batch mode, causing the pipeline to wait indefinitely for the unbounded source to finish (which it never does).
- Aggregating without windows: Applying global aggregations (like Count.Globally() or CombineGlobally(sum)) to an unbounded dataset without a windowing strategy is a mistake. The default global window never closes, so the aggregation never emits a final result while the runner keeps accumulating state, which can lead to memory growth and eventual pipeline crashes.
10. Interview Perspective
- Question: What is the main difference between event time and processing time in a streaming pipeline?
- Answer: Event time is when the event occurred on the device (e.g., when a user clicked a button), whereas processing time is the clock time of the worker machine when it handles the event. Network delays and buffering mean an event is often processed well after it occurred, so processing time lags behind event time and events can arrive out of order.
- Question: Why can't we sort an unbounded PCollection?
- Answer: Sorting requires seeing all elements in the dataset to determine their relative order. Since an unbounded dataset is infinite and never ends, you cannot sort the collection globally.
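Although a global sort is not possible, ordering becomes feasible once the stream is cut into finite windows. The plain-Python sketch below illustrates this with out-of-order arrivals; the function name and sample data are hypothetical, and it is not a Beam API.

```python
from collections import defaultdict


def sort_within_windows(events, size: int = 60) -> dict:
    """Group (event_time, payload) pairs into fixed windows, then sort each.

    Returns a dict mapping window start to that window's events,
    ordered by event time.
    """
    windows = defaultdict(list)
    for event_time, payload in events:
        windows[event_time - event_time % size].append((event_time, payload))
    return {start: sorted(items) for start, items in sorted(windows.items())}


# Events listed in arrival order, which differs from event-time order.
arrivals = [(61, "c"), (5, "a"), (130, "e"), (42, "b"), (90, "d")]
ordered = sort_within_windows(arrivals)
```

Each window holds a bounded number of elements, so sorting inside it terminates, whereas the stream as a whole never provides a "last" element to sort against.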
11. Best Practices
- Always define appropriate windowing before running any aggregation on streaming data.
- Filter out corrupt or irrelevant data early in the pipeline to save computation and network bandwidth.
- Ensure that external source clients (like Kafka or Pub/Sub) are scaled to handle peak traffic loads.
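As an illustration of filtering early, the sketch below parses raw messages and drops malformed ones before any further work is done. It is plain Python; the JSON message format and the sensor_id field are assumptions made for this example.

```python
import json


def parse_reading(raw: bytes):
    """Decode one raw message; return a dict, or None if it is unusable."""
    try:
        record = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None  # corrupt payload
    if not isinstance(record, dict) or "sensor_id" not in record:
        return None  # valid JSON, but missing the (hypothetical) required field
    return record


raw_messages = [
    b'{"sensor_id": "s1", "value": 21.5}',
    b"not json",
    b'{"value": 3}',
    b"\xff\xfe",
]

# Keep only the messages that parsed cleanly.
valid = [r for r in map(parse_reading, raw_messages) if r is not None]
```

In a Beam pipeline, a function like this could be applied with beam.Map followed by beam.Filter right after the read step, so that later transforms only see well-formed records.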
12. Summary
- Streaming processes data continuously with low latency.
- Unbounded datasets require windowing to perform aggregations.
- streaming=True must be configured in Apache Beam options to run in streaming mode.