Google Cloud Pub/Sub
1. Introduction
Google Cloud Pub/Sub is a serverless, real-time messaging service. Apache Beam provides a native, highly optimized connector to read from and write to Pub/Sub, acting as the primary source/sink for streaming pipelines on Google Cloud.
2. Why This Concept Exists
Streaming pipelines require a durable message queue to buffer incoming events. As sensors, mobile apps, or web servers publish millions of events per second, Pub/Sub stores them safely. Beam's Pub/Sub connector reads these messages continuously, creating an unbounded PCollection.
3. Key Terminology
- Topic: The channel where publishers send messages.
- Subscription: The queue where subscribers (like our Beam pipeline) pull messages.
- Acknowledge (ACK): Notifying Pub/Sub that a message has been processed successfully so it can be deleted from the queue.
4. How It Works
- Read: You can read from a subscription (recommended for pipelines) or directly from a topic (which creates a temporary subscription behind the scenes).
- Format: Messages are read as raw bytes, strings, or as structured
PubsubMessageobjects containing attributes (metadata). - Write: Elements are published to a Pub/Sub topic.
5. Visual Diagram
6. Code Example
Reading messages from a subscription and writing them to an archiving topic:
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
with beam.Pipeline() as p:
(p
| "ReadStream" >> ReadFromPubSub(subscription="projects/my-gcp-project/subscriptions/my-sub")
| "DecodeMessage" >> beam.Map(lambda b: b.decode("utf-8"))
| "FilterAlerts" >> beam.Filter(lambda msg: "ALERT" in msg)
| "EncodeMessage" >> beam.Map(lambda s: s.encode("utf-8"))
| "WriteStream" >> WriteToPubSub(topic="projects/my-gcp-project/topics/alerts-topic"))
7. Code Explanation
ReadFromPubSub(...)reads messages as raw bytes.b.decode("utf-8")decodes bytes into readable string elements.WriteToPubSub(...)publishes the matching alert byte strings back to Pub/Sub.
8. Real Production Example: Timestamp Attribute
In production, you want to set event timestamps based on when the event occurred, rather than when it was published. You can instruct Pub/Sub to read timestamps from metadata attributes:
messages = p | ReadFromPubSub(
subscription="projects/my-project/subscriptions/my-sub",
timestamp_attribute="event_timestamp"
)
Beam will automatically parse the event_timestamp attribute from the Pub/Sub message metadata and use it as the element's event time for windowing calculations.
9. Common Mistakes
- Reading from topics in multiple pipelines: If two pipelines read from the same topic directly, they will compete for messages. Always create separate, dedicated subscriptions for each pipeline.
- Writing non-byte data: Pub/Sub messages are strictly byte payloads. You must encode your string data (e.g.
.encode("utf-8")) before passing it toWriteToPubSub.
10. Interview Perspective
- Question: Why is reading from a subscription preferred over reading from a topic?
- Answer: Reading from a subscription allows the runner to track message backlog (how many messages are waiting in queue) and report metrics to GCP, enabling autoscalers to spin up more worker VMs when load spikes.
- Question: How does Beam handle ACK deadlines in Pub/Sub?
- Answer: The runner automatically manages acknowledgments. Once a bundle of elements is safely processed and written to downstream sinks, the runner sends ACK signals to Pub/Sub to remove the messages.
11. Best Practices
- Use subscriptions for pipeline inputs to ensure accurate autoscaling metrics.
- Store event times inside message attributes and utilize
timestamp_attributeto bypass ingestion delay offsets.
12. Summary
- Pub/Sub connects streaming pipelines to message queues.
- Reads return unbounded PCollections.
- Inputs are read from subscriptions; outputs are published to topics.