intermediate

Late Data

7 min read | Last updated: 2026-07-01

1. Introduction

In a streaming system, Late Data refers to any data element that arrives at a processing worker after the watermark has passed the element's event timestamp.

Because the watermark represents the system's estimate of how far event time has progressed, any element whose timestamp falls behind the watermark is classified as late. Depending on your pipeline configuration, a late element is either processed as an update to its window or discarded.

2. Why This Concept Exists

In distributed networks, data delays are inevitable. Network outages, mobile device disconnections, and consumer backlogs cause events to arrive out of order.

Understanding and handling Late Data is critical because:

  • Preventing Data Loss: Without a late-data handling strategy, late elements are dropped by default, resulting in missing logs and inaccurate aggregates.
  • Report Accuracy: Business decisions require correct metrics. If late-arriving sales data is ignored, financial dashboards will show incorrect revenue.
  • Operational Monitoring: Spikes in late data indicate network bottlenecks or failures in upstream message brokers.

3. Key Terminology

  • Late Element: An element with an event timestamp $T$ such that $T < W$, where $W$ is the current watermark.
  • On-Time Element: An element that arrives before the watermark passes its timestamp ($T \geq W$).
  • Dead-Letter Queue (DLQ): A storage destination (such as a database table or file folder) where invalid or late elements are sent for debugging or manual reprocessing.
  • Dropped Data: Elements that arrive after the window's allowed lateness has expired and are discarded by the runner.

4. How It Works

  1. Watermark Calculation: The runner estimates the watermark, the event time up to which it expects all data to have arrived, from the timestamps of the events flowing through the pipeline.
  2. Comparison: A new element arrives with event time 12:01:30. The current watermark is 12:03:00.
  3. Classification: Since the timestamp (12:01:30) is less than the watermark (12:03:00), the runner flags the element as Late.
  4. Routing Policy:
    • If the window 12:00:00 - 12:05:00 is still within its Allowed Lateness period, the element is processed, updating the aggregate.
    • If the allowed lateness period has expired, the element is discarded.
    • If the pipeline includes a routing step, the late element is redirected to a side output (DLQ) before reaching the windowing step.
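
The classification and routing steps above can be sketched in plain Python. This is a simplified model of the decision, not Beam's API: the function name `classify_element`, the returned labels, the calendar date, and the 2-minute allowed lateness are assumptions made for illustration, while the clock times and the 12:00:00 - 12:05:00 window come from the steps above. Following this lesson's definitions, an element is late when its timestamp is behind the watermark, and a late element is dropped once the watermark has moved past the window end plus the allowed lateness.

```python
from datetime import datetime, timedelta


def classify_element(event_time, watermark, window_end, allowed_lateness):
    """Return 'on_time', 'late_processed', or 'dropped' for one element."""
    if event_time >= watermark:
        return "on_time"
    # The element is late. It still updates the aggregate while the
    # watermark has not passed window_end + allowed_lateness.
    if watermark <= window_end + allowed_lateness:
        return "late_processed"
    return "dropped"


# Arbitrary date; only the clock times come from the walkthrough.
event_time = datetime(2024, 1, 1, 12, 1, 30)
watermark = datetime(2024, 1, 1, 12, 3, 0)
window_end = datetime(2024, 1, 1, 12, 5, 0)
allowed_lateness = timedelta(minutes=2)  # assumed value

status = classify_element(event_time, watermark, window_end, allowed_lateness)
print(status)
```

With the watermark at 12:03:00 the element is late but still updates its window. If the watermark had already reached 12:08:00, the same element would fall outside the assumed 2-minute allowance and be dropped.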

5. Visual Diagram

On-Time Data
e1: event time 12:04, watermark = 12:03 (12:04 >= 12:03)

Late Data
e2: event time 12:01, watermark = 12:03 (12:01 < 12:03)

6. Code Example

The following pipeline implements a custom DoFn that inspects the latency of incoming elements relative to the system clock. Note that this measures delay against wall-clock time as an approximation of lateness; it does not compare the element against the watermark. If an element is delayed by more than 5 minutes (300 seconds), it is routed to a "dead-letter" side output for inspection; otherwise, it passes to the main pipeline:

python
import time
import apache_beam as beam

class ProactiveLateDataRouter(beam.DoFn):
    LATE_OUTPUT_TAG = "late_dlq"
    
    def process(self, element):
        event_time = float(element.get("timestamp_epoch", 0))
        current_time = time.time()
        
        # Check if the element is late by more than 5 minutes (300 seconds)
        if (current_time - event_time) > 300:
            # Route to dead-letter side output
            yield beam.pvalue.TaggedOutput(self.LATE_OUTPUT_TAG, element)
        else:
            # Yield to main collection
            yield element

with beam.Pipeline() as p:
    # Set up main and side outputs
    results = (p
               | "CreateMockData" >> beam.Create([
                   {"id": "msg_1", "timestamp_epoch": time.time()},             # On-Time
                   {"id": "msg_2", "timestamp_epoch": time.time() - 600}        # Late (10 mins old)
               ])
               | "RouteLateData" >> beam.ParDo(ProactiveLateDataRouter()).with_outputs(
                   ProactiveLateDataRouter.LATE_OUTPUT_TAG, main="on_time_data"
               )
              )
    
    # Process on-time data
    on_time = results.on_time_data | "LogOnTime" >> beam.Map(lambda x: print(f"Processing On-Time: {x}"))
    
    # Send late data to dead letter storage
    late_data = results[ProactiveLateDataRouter.LATE_OUTPUT_TAG] | "WriteToDLQ" >> beam.Map(lambda x: print(f"DLQ WARNING: {x} sent to backup storage!"))

7. Code Explanation

  • ProactiveLateDataRouter inherits from beam.DoFn and defines a side output tag LATE_OUTPUT_TAG.
  • current_time - event_time measures the real-world delay of the incoming message.
  • beam.pvalue.TaggedOutput(...) routes late events to the separate side output tag.
  • with_outputs(...) on the ParDo splits the single input stream into two output PCollections: the main output (results.on_time_data) and the late stream (results[LATE_OUTPUT_TAG]).
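
Because the DoFn reads `time.time()` directly, its result depends on when it runs. One way to make the check repeatable is to move the comparison into a pure function that takes the current time as an argument. The sketch below is plain Python with no Beam dependency; `is_stale`, `route`, and the `now` parameter are hypothetical names, while the `timestamp_epoch` field and the 300-second threshold mirror the example above.

```python
LATE_THRESHOLD_SECONDS = 300  # same 5-minute cutoff as the DoFn above


def is_stale(element, now, threshold=LATE_THRESHOLD_SECONDS):
    """True when the event time is more than `threshold` seconds behind `now`."""
    event_time = float(element.get("timestamp_epoch", 0))
    return (now - event_time) > threshold


def route(elements, now):
    """Split elements into (on_time, late) lists using one fixed clock reading."""
    on_time, late = [], []
    for element in elements:
        (late if is_stale(element, now) else on_time).append(element)
    return on_time, late


now = 1_000_000.0  # made-up clock reading so the result is repeatable
on_time, late = route(
    [
        {"id": "msg_1", "timestamp_epoch": now},        # fresh
        {"id": "msg_2", "timestamp_epoch": now - 600},  # 10 minutes old
    ],
    now,
)
print([e["id"] for e in on_time], [e["id"] for e in late])
```

Inside `process`, the DoFn could then call `is_stale(element, time.time())`, which keeps the routing rule testable without running a pipeline.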

8. Real Production Example

In streaming analytics for delivery fleets, trucks report GPS coordinates. When a truck drives through a cellular dead zone, GPS points are buffered. Upon reconnecting, the vehicle uploads several hours of old logs. The streaming pipeline detects these logs as late data. It routes them to a BigQuery "late_logs" table for auditing, keeping the real-time mapping system free of latency spikes.

9. Common Mistakes

  • Assuming Late Data is Automatically Routed: Expecting the runner to handle late data without configuration. In standard windowing, late data beyond the allowed lateness is discarded silently.
  • Failing to Monitor Dropped Metrics: Not tracking how many elements are dropped due to lateness. Without monitoring, you may lose data without realizing it.
  • Reprocessing Late Data on the Same Queue: Writing late data back into the main stream unchanged. The replayed elements are classified as late again and return to the DLQ, creating an infinite reprocessing loop.
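
One possible guard against the reprocessing loop in the last item is to count how often a record has been replayed and stop after a limit. This is a plain-Python sketch under stated assumptions, not a Beam feature: the `replay_count` field, the `prepare_for_replay` helper, and the limit of 3 are all hypothetical.

```python
MAX_REPLAYS = 3  # assumed limit; tune for your pipeline


def prepare_for_replay(element, max_replays=MAX_REPLAYS):
    """Return a copy with an incremented replay counter, or None at the limit."""
    replays = element.get("replay_count", 0)
    if replays >= max_replays:
        return None  # leave the record in the DLQ for manual inspection
    return {**element, "replay_count": replays + 1}


# Simulate a record that keeps coming back as late after every replay.
record = {"id": "msg_2"}
attempts = 0
while (record := prepare_for_replay(record)) is not None:
    attempts += 1
print(attempts)
```

The loop ends after the assumed limit instead of running forever, and the record stays in the DLQ for someone to look at.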

10. Interview Perspective

  • Question: How does Apache Beam decide if a record is late?
  • Answer: It compares the event timestamp of the element with the current watermark of the transform. If the timestamp is strictly less than the watermark, the record is late.
  • Question: What is the purpose of a Dead-Letter Queue (DLQ) in streaming?
  • Answer: A DLQ stores elements that cannot be processed in real-time, such as malformed payloads or highly late events. It prevents pipeline crashes and allows developers to inspect or replay the records later.

11. Best Practices

  • Implement side outputs (DLQs) for late-arriving data so that late elements are captured instead of being silently dropped.
  • Log and alert if the volume of late data spikes, as this often points to a problem in network infrastructure or upstream ingestion layers.
  • Store late data in low-cost storage (e.g., GCS) for eventual batch backfilling.
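
The alerting practice above can be sketched as a small counter that tracks the share of late elements and flags a spike. This is a plain-Python illustration, not a Beam metrics API; the class name `LateDataMonitor`, the 10% threshold, and the minimum sample size are assumptions to adjust for your own pipeline.

```python
class LateDataMonitor:
    """Tracks the fraction of late elements and flags unusually high ratios."""

    def __init__(self, max_late_ratio=0.10, min_samples=100):
        self.max_late_ratio = max_late_ratio  # assumed alert threshold
        self.min_samples = min_samples        # avoid alerting on tiny samples
        self.total = 0
        self.late = 0

    def record(self, is_late):
        self.total += 1
        if is_late:
            self.late += 1

    def late_ratio(self):
        return self.late / self.total if self.total else 0.0

    def should_alert(self):
        enough_data = self.total >= self.min_samples
        return enough_data and self.late_ratio() > self.max_late_ratio


monitor = LateDataMonitor()
for i in range(200):
    monitor.record(is_late=(i % 4 == 0))  # synthetic traffic: every 4th element late
print(monitor.late_ratio(), monitor.should_alert())
```

In a real pipeline the `record` call would sit wherever elements are classified, for example next to the side-output routing step.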

12. Summary

  • Late data consists of elements whose event time is older than the current watermark.
  • By default, late data is dropped unless allowed lateness is configured.
  • Proactive routing uses side outputs to capture late data before windowing.
  • Late data can be written to a Dead-Letter Queue for analysis.
  • Monitoring late data is crucial for tracking overall pipeline health.

13. Interactive Challenges

Challenge 1: Route Late Elements (Beginner)

Complete the DoFn class below to route any record with the key "status" equal to "late" to the side output tag "late_records".

Challenge 2: Late Data Inspector (Intermediate)

Write a DoFn class FilterLaggingData that routes records to the side output "late_dlq" if their event timestamp is older than the current watermark by more than 60 seconds.

Challenge 3: Split Late and On-Time Pipelines (Advanced)

Write a pipeline segment that applies a ParDo(FilterLaggingData()) to split an input stream, then saves on-time records to "d:/Builds/beam-acadamy/beam-academy-hub/content/lessons/on_time.txt" and late records to "d:/Builds/beam-acadamy/beam-academy-hub/content/lessons/late_dlq.txt".
