Pub/Sub Streaming
1. Introduction
Google Cloud Pub/Sub is a fully-managed real-time messaging service. Apache Beam provides native support for Pub/Sub, making it one of the most common unbounded sources and sinks used for building production-grade streaming ingestion pipelines on Google Cloud Dataflow.
2. Why This Concept Exists
Decoupling message producers (like web servers or mobile applications) from message consumers (like data pipelines) is critical for system reliability. If your processing pipeline is down for maintenance, Pub/Sub buffers the incoming messages. When the pipeline restarts, it catches up without losing data. Apache Beam's Pub/Sub connector manages message ingestion, acknowledgement, watermarking, and scaling automatically.
3. Key Terminology
- Topic: A named resource to which producers send messages.
- Subscription: A named resource representing the stream of messages from a single, specific topic, to be delivered to the consuming pipeline.
- Attributes: Key-value metadata pairs attached to a Pub/Sub message.
- Ack (Acknowledgement): A signal sent back to Pub/Sub indicating that a message has been successfully processed and can be deleted.
4. How It Works
When running on Google Cloud Dataflow, the runner replaces Beam's generic Pub/Sub connector with its own optimized streaming implementation.
- Reading: Beam reads messages from a Pub/Sub subscription. It can pull raw bytes, or pull both payload and attributes.
- Watermark Tracking: By default, Beam uses the message's publish timestamp (assigned by Google Cloud Pub/Sub) as the event time to advance the pipeline's watermark.
- Checkpointing & Acknowledging: As data passes through downstream transforms, the runner checkpoints progress and automatically sends acknowledgements back to Pub/Sub to prevent message redelivery.
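As a rough illustration of the two read modes described above, the sketch below normalizes both element shapes into one dict. The helper names to_record and read_records are assumptions made for this example and are not part of Beam. Beam is imported inside the function so the parsing helper can be tried without it installed.

```python
import json
from typing import Any, Dict, Mapping, Optional


def to_record(data: bytes, attributes: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Turn a JSON payload (and optional attributes) into a plain dict."""
    return {
        "data": json.loads(data.decode("utf-8")),
        "attributes": dict(attributes or {}),
    }


def read_records(pipeline, subscription: str, with_attributes: bool):
    """Hypothetical helper: read from a subscription in either mode."""
    import apache_beam as beam  # lazy import; needs the apache-beam[gcp] package

    messages = pipeline | "Read" >> beam.io.ReadFromPubSub(
        subscription=subscription, with_attributes=with_attributes
    )
    if with_attributes:
        # Elements are PubsubMessage objects exposing .data and .attributes.
        return messages | "ToRecord" >> beam.Map(
            lambda m: to_record(m.data, m.attributes)
        )
    # Elements are raw bytes; attributes are not available in this mode.
    return messages | "ToRecord" >> beam.Map(to_record)
```

Either mode yields records of the same shape, so downstream transforms do not need to know which one was used.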
5. Visual Diagram
[Diagram: Pub/Sub Topic (message ingress broker) -> Worker VM 1 and Worker VM 2 (parallel processing) -> Output Topic (downstream consumers)]
6. Code Example
The following pipeline reads JSON messages with their attributes from a subscription, keeps only messages whose status attribute is "active", and writes the processed payloads to a target topic:
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # 1. Read messages, including metadata attributes
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/iot-sub",
            with_attributes=True
        )
        # 2. Extract payload and attribute info
        | "ParseMessage" >> beam.Map(lambda msg: {
            "data": json.loads(msg.data.decode("utf-8")),
            "attributes": msg.attributes
        })
        # 3. Filter active devices
        | "FilterActive" >> beam.Filter(lambda x: x["attributes"].get("status") == "active")
        # 4. Convert back to Pub/Sub format and write
        | "FormatOutput" >> beam.Map(lambda x: json.dumps(x["data"]).encode("utf-8"))
        | "WriteToPubSub" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/filtered-iot-events"
        )
    )
7. Code Explanation
- ReadFromPubSub(subscription=..., with_attributes=True) tells Beam to fetch full PubsubMessage objects instead of raw byte strings. This allows access to .data and .attributes.
- We parse the JSON payload from bytes (msg.data.decode("utf-8")) and read the custom key-value attributes (msg.attributes).
- Filter checks the attributes dictionary to only pass messages where the key "status" is "active".
- Finally, the pipeline encodes the result and publishes it to a target Pub/Sub topic using WriteToPubSub.
8. Real Production Example
A smart-grid company streams electrical meter metrics to a Pub/Sub topic. The Apache Beam pipeline reads from the subscription, extracts the voltage and power factors, runs 1-minute windowed averages, and raises alerts on another Pub/Sub topic if voltages fluctuate beyond safe limits.
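A minimal sketch of the windowed-average step in this scenario might look like the following. The field names meter_id and voltage, the helper names, and the 216 to 253 volt limits are all assumptions chosen for illustration; the source does not specify them. Beam is imported lazily so the pure helpers can be exercised on their own.

```python
import json
from typing import Any, Dict, Tuple


def to_keyed_voltage(record: Dict[str, Any]) -> Tuple[str, float]:
    """Key a reading by meter so averages are computed per meter."""
    return record["meter_id"], float(record["voltage"])


def is_unsafe(mean_voltage: float, low: float, high: float) -> bool:
    """True when the windowed mean falls outside the [low, high] band."""
    return mean_voltage < low or mean_voltage > high


def encode_alert(kv: Tuple[str, float]) -> bytes:
    """Serialize a (meter_id, mean_voltage) pair as a Pub/Sub payload."""
    return json.dumps({"meter_id": kv[0], "mean_voltage": kv[1]}).encode("utf-8")


def build_alerts(readings, low: float = 216.0, high: float = 253.0):
    """Hypothetical helper: 1-minute mean voltage per meter, unsafe values only."""
    import apache_beam as beam  # lazy import; needs the apache-beam package
    from apache_beam.transforms import window

    return (
        readings
        | "KeyByMeter" >> beam.Map(to_keyed_voltage)
        | "OneMinuteWindows" >> beam.WindowInto(window.FixedWindows(60))
        | "MeanPerMeter" >> beam.CombinePerKey(beam.combiners.MeanCombineFn())
        | "KeepUnsafe" >> beam.Filter(lambda kv: is_unsafe(kv[1], low, high))
        | "EncodeAlert" >> beam.Map(encode_alert)
    )
```

The returned collection of bytes could then be passed to WriteToPubSub with the alert topic.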
9. Common Mistakes
- Reading Directly from a Topic in Production: While Beam allows reading directly from a topic using ReadFromPubSub(topic=...), this causes Beam to create a temporary subscription. In production, this can lead to data loss or duplicate processing if the pipeline restarts. Always create a static, dedicated subscription beforehand and read from that.
- Missing IAM Permissions: The pipeline runner service account must have Pub/Sub Subscriber and Viewer permissions on the input subscription, and Pub/Sub Publisher permissions on the output topic.
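One way to guard against the first mistake is to check the resource path before building the read step. The sketch below is an assumption-laden example: require_subscription_path and read_from_subscription are hypothetical helpers, and the regular expression only checks the general projects/.../subscriptions/... shape, not Pub/Sub's full naming rules.

```python
import re

# Loose shape check only; it does not enforce Pub/Sub's naming rules.
_SUBSCRIPTION_PATH = re.compile(r"^projects/[^/]+/subscriptions/[^/]+$")


def require_subscription_path(path: str) -> str:
    """Return the path unchanged, or raise if it is not a subscription path."""
    if not _SUBSCRIPTION_PATH.match(path):
        raise ValueError(
            f"Expected projects/<project>/subscriptions/<name>, got: {path!r}"
        )
    return path


def read_from_subscription(pipeline, path: str):
    """Hypothetical helper: refuse topic paths, then read with attributes."""
    import apache_beam as beam  # lazy import; needs the apache-beam[gcp] package

    return pipeline | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
        subscription=require_subscription_path(path),
        with_attributes=True,
    )
```

Passing a topic path such as projects/my-project/topics/iot fails fast with a ValueError, before any temporary subscription is created.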
10. Interview Perspective
- Question: How does Beam determine event time when reading from Pub/Sub?
- Answer: By default, Beam uses the message's publish timestamp. However, you can configure ReadFromPubSub(timestamp_attribute="my_timestamp_key") to tell Beam to extract the event time from a specific metadata attribute containing an RFC 3339 or milli-epoch value.
- Question: Does Beam guarantee exactly-once processing when reading from Pub/Sub?
- Answer: When combined with Google Cloud Dataflow, yes, for processing inside the pipeline. The runner deduplicates incoming messages using Pub/Sub's unique message_id metadata and acknowledges a message only after its results have been durably checkpointed. Side effects outside the pipeline, such as calls to external services, may still run more than once on retries.
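To make the two timestamp formats concrete, the sketch below parses both into a UTC datetime and shows where timestamp_attribute fits in the read step. Beam does this parsing itself when timestamp_attribute is set; parse_event_time is only an illustration and covers the common forms. The attribute names event_ts and event_id are hypothetical, and id_label is shown as an optional way to deduplicate on a publisher-supplied ID.

```python
from datetime import datetime, timezone


def parse_event_time(value: str) -> datetime:
    """Parse milliseconds since the Unix epoch, or an RFC 3339 string, as UTC."""
    if value.isdigit():
        return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so it is rewritten as an explicit UTC offset first.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc)


def read_with_event_time(pipeline, subscription: str):
    """Hypothetical helper: take event time and record IDs from attributes."""
    import apache_beam as beam  # lazy import; needs the apache-beam[gcp] package

    return pipeline | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
        subscription=subscription,
        with_attributes=True,
        timestamp_attribute="event_ts",  # assumed attribute name
        id_label="event_id",  # assumed attribute name
    )
```

Both "1700000000000" and "2023-11-14T22:13:20Z" describe the same instant, which is why either form works as an attribute value.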
11. Best Practices
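- Capture poison pills (malformed messages) that cause processing loops in a Dead-Letter Queue (DLQ). On Dataflow, implement the dead-letter path inside the pipeline (for example, by routing failed elements to a separate output topic), because Dataflow does not fully support dead-letter topics configured on the Pub/Sub subscription.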
- Use timestamp_attribute if your client generates event times that differ significantly from publish times (e.g., when devices queue messages offline).
- Avoid using heavy database lookups per Pub/Sub message. Use side inputs or stateful caching instead.
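Malformed messages can also be caught inside the pipeline and routed to a separate output, which is one way to implement a dead-letter path. The following is a minimal sketch under that assumption; the tag names and the helpers classify and split_dead_letters are made up for this example. It expects raw byte payloads, as produced by reading without attributes.

```python
import json
from typing import Any, Tuple

PARSED = "parsed"
DEAD_LETTER = "dead_letter"


def classify(raw: bytes) -> Tuple[str, Any]:
    """Return (PARSED, object) for valid JSON, else (DEAD_LETTER, raw bytes)."""
    try:
        return PARSED, json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return DEAD_LETTER, raw


def split_dead_letters(raw_messages):
    """Hypothetical helper: split a PCollection of bytes into (parsed, dead)."""
    import apache_beam as beam  # lazy import; needs the apache-beam package

    def route(raw):
        tag, value = classify(raw)
        if tag == DEAD_LETTER:
            # Send the untouched payload to the side output for inspection.
            yield beam.pvalue.TaggedOutput(DEAD_LETTER, value)
        else:
            yield value

    results = raw_messages | "ParseOrDeadLetter" >> beam.FlatMap(route).with_outputs(
        DEAD_LETTER, main=PARSED
    )
    return results[PARSED], results[DEAD_LETTER]
```

The dead-letter collection can be written to its own topic with WriteToPubSub, so bad payloads are kept for inspection instead of being retried indefinitely.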
12. Summary
- Pub/Sub is Google Cloud's distributed messaging broker.
- Use ReadFromPubSub(subscription=...) for production reliability.
- By default, event timestamps are determined by the message's publish time.