Pub/Sub Streaming — Learn Apache Beam

1. Introduction

Google Cloud Pub/Sub is a fully-managed real-time messaging service. Apache Beam provides native support for Pub/Sub, making it one of the most common unbounded sources and sinks used for building production-grade streaming ingestion pipelines on Google Cloud Dataflow.

2. Why This Concept Exists

Decoupling message producers (like web servers or mobile applications) from message consumers (like data pipelines) is critical for system reliability. If your processing pipeline is down for maintenance, Pub/Sub buffers the incoming messages. When the pipeline restarts, it catches up without losing data. Apache Beam's Pub/Sub connector manages message ingestion, acknowledgement, watermarking, and scaling automatically.

3. Key Terminology

Topic: A named resource to which producers send messages.
Subscription: A named resource representing the stream of messages from a single, specific topic, to be delivered to the consuming pipeline.
Attributes: Key-value metadata pairs attached to a Pub/Sub message.
Ack (Acknowledgement): A signal sent back to Pub/Sub indicating that a message has been successfully processed and can be deleted.

4. How It Works

When running on Google Cloud Dataflow, the runner uses a highly optimized streaming connector.

Reading: Beam reads messages from a Pub/Sub subscription. It can pull raw bytes, or pull both payload and attributes.
Watermark Tracking: By default, Beam uses the message's publish timestamp (assigned by Google Cloud Pub/Sub) as the event time to advance the pipeline's watermark.
Checkpointing & Acknowledging: As data passes through downstream transforms, the runner checkpoints progress and automatically sends acknowledgements back to Pub/Sub to prevent message redelivery.

5. Visual Diagram

Pub/Sub Topic
Message ingress broker

▼ (Subscription pull)

Worker VM 1
Parallel processing

Worker VM 2
Parallel processing

▼ (WriteToPubSub)

Output Topic
Downstream consumers

6. Code Example

The following pipeline reads JSON messages with their attributes from a subscription, extracts a custom timestamp, and writes processed results back to a target topic:

python

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # 1. Read messages, including metadata attributes
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/iot-sub",
            with_attributes=True
        )
        # 2. Extract payload and attribute info
        | "ParseMessage" >> beam.Map(lambda msg: {
            "data": json.loads(msg.data.decode("utf-8")),
            "attributes": msg.attributes
        })
        # 3. Filter active devices
        | "FilterActive" >> beam.Filter(lambda x: x["attributes"].get("status") == "active")
        # 4. Convert back to Pub/Sub format and write
        | "FormatOutput" >> beam.Map(lambda x: json.dumps(x["data"]).encode("utf-8"))
        | "WriteToPubSub" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/filtered-iot-events"
        )
    )

7. Code Explanation

ReadFromPubSub(subscription=..., with_attributes=True) tells Beam to fetch full PubsubMessage objects instead of raw byte strings. This allows access to .data and .attributes.
We parse the JSON payload from bytes (msg.data.decode("utf-8")) and read the custom key-value attributes (msg.attributes).
Filter checks the attributes dictionary to only pass messages where the key "status" is "active".
Finally, the pipeline encodes the result and publishes it to a target Pub/Sub topic using WriteToPubSub.

8. Real Production Example

A smart-grid company streams electrical meter metrics to a Pub/Sub topic. The Apache Beam pipeline reads from the subscription, extracts the voltage and power factors, runs 1-minute windowed averages, and raises alerts on another Pub/Sub topic if voltages fluctuate beyond safe limits.

9. Common Mistakes

Reading Directly from a Topic in Production: While Beam allows reading directly from a topic using ReadFromPubSub(topic=...), this causes Beam to create a temporary subscription. In production, this can lead to data loss or duplicate processing if the pipeline restarts. Always create a static, dedicated subscription beforehand and read from that.
Missing IAM Permissions: The pipeline runner service account must have Pub/Sub Subscriber and Viewer permissions on the input subscription, and Pub/Sub Publisher permissions on the output topic.

10. Interview Perspective

Question: How does Beam determine event time when reading from Pub/Sub?
Answer: By default, Beam uses the message's publish timestamp. However, you can configure ReadFromPubSub(timestamp_attribute="my_timestamp_key") to tell Beam to extract the event time from a specific metadata attribute containing a RFC 3339 or milli-epoch value.
Question: Does Beam guarantee exactly-once processing when reading from Pub/Sub?
Answer: When combined with Google Cloud Dataflow, yes. The runner deduplicates incoming messages using Pub/Sub's unique message_id metadata and ensures side-effects are processed exactly-once before acknowledging the messages.

11. Best Practices

Configure a Dead-Letter Queue (DLQ) on your Pub/Sub subscription to capture poison pills (malformed messages) that cause processing loops.
Use timestamp_attribute if your client generates event times that differ significantly from publish times (e.g., when devices queue messages offline).
Avoid using heavy database lookups per Pub/Sub message. Use side inputs or stateful caching instead.

12. Summary

Pub/Sub is Google Cloud's distributed messaging broker.
Use ReadFromPubSub(subscription=...) for production reliability.
By default, event timestamps are determined by the message's publish time.

13. Interactive Challenges

Challenge 1: Read Byte Payloads (Beginner)

Write a pipeline transform that reads raw byte payloads from the Pub/Sub subscription projects/prod/subscriptions/input-sub.

Challenge 2: Custom Timestamp Attribute (Intermediate)

Configure ReadFromPubSub to read from the subscription projects/prod/subscriptions/input-sub using a custom event timestamp attribute named "event_timestamp".

Challenge 3: Write to Pub/Sub Topic (Advanced)

Given a PCollection of dictionaries representing events, convert them to UTF-8 encoded JSON strings, and write them to the Pub/Sub topic projects/prod/topics/output-topic.

14. Related Content

Previous LessonBounded vs Unbounded

Next LessonKafka Streaming