intermediate

Google Cloud Pub/Sub

7 min read | Last updated: 2026-06-30

1. Introduction

Google Cloud Pub/Sub is a serverless, real-time messaging service. Apache Beam ships a built-in connector for reading from and writing to Pub/Sub, which makes Pub/Sub the usual source and sink for streaming pipelines on Google Cloud.

2. Why This Concept Exists

Streaming pipelines require a durable message queue to buffer incoming events. As sensors, mobile apps, or web servers publish millions of events per second, Pub/Sub stores them safely. Beam's Pub/Sub connector reads these messages continuously, creating an unbounded PCollection.

3. Key Terminology

  • Topic: The channel where publishers send messages.
  • Subscription: The queue where subscribers (like our Beam pipeline) pull messages.
  • Acknowledge (ACK): Notifying Pub/Sub that a message has been processed successfully so it can be deleted from the queue.
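
Topics and subscriptions are addressed by full resource paths of the form used in the code examples in this article. A minimal sketch of building those paths follows; the helper names topic_path and subscription_path are illustrative and defined here, not taken from Beam:

```python
def topic_path(project: str, topic: str) -> str:
    """Full resource path of a topic, e.g. for WriteToPubSub(topic=...)."""
    return f"projects/{project}/topics/{topic}"


def subscription_path(project: str, subscription: str) -> str:
    """Full resource path of a subscription, e.g. for ReadFromPubSub(subscription=...)."""
    return f"projects/{project}/subscriptions/{subscription}"


# Example values only; substitute your own project and resource names.
print(subscription_path("my-gcp-project", "my-sub"))
print(topic_path("my-gcp-project", "alerts-topic"))
```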

4. How It Works

  • Read: You can read from a subscription (recommended for pipelines) or directly from a topic (which creates a temporary subscription behind the scenes).
  • Format: Messages are read as raw bytes, strings, or as structured PubsubMessage objects containing attributes (metadata).
  • Write: Elements are published to a Pub/Sub topic.
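
To illustrate the structured format: in the Python SDK, passing with_attributes=True to ReadFromPubSub yields PubsubMessage objects instead of raw bytes. The sketch below flattens such a message into a dict. The names to_record and read_records are hypothetical helpers, and the stand-in object only mimics the data and attributes fields of a real message:

```python
from types import SimpleNamespace


def to_record(msg) -> dict:
    """Flatten a PubsubMessage-like object into a plain dict.

    Only relies on `.data` (bytes) and `.attributes` (dict or None).
    """
    return {
        "payload": msg.data.decode("utf-8"),
        "attributes": dict(msg.attributes or {}),
    }


def read_records(p, subscription: str):
    """Attach a read-with-attributes step to pipeline `p` (hypothetical helper)."""
    # Imported here so to_record can be tested without Beam installed.
    import apache_beam as beam
    from apache_beam.io.gcp.pubsub import ReadFromPubSub

    return (p
            | "ReadWithAttrs" >> ReadFromPubSub(subscription=subscription,
                                                with_attributes=True)
            | "ToRecord" >> beam.Map(to_record))


# Stand-in for a PubsubMessage, to try to_record locally.
fake = SimpleNamespace(data=b"disk full", attributes={"severity": "ALERT"})
record = to_record(fake)
```

Keeping the conversion in a plain function means it can be checked locally, without a live subscription.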

5. Visual Diagram

App Publishers → Pub/Sub Topic / Subscription → Beam Pipeline

6. Code Example

Reading messages from a subscription and writing them to an archiving topic:

python
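import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Pub/Sub is an unbounded source, so the pipeline must run in streaming mode.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     # Each element arrives as the raw bytes payload of one message.
     | "ReadStream" >> ReadFromPubSub(subscription="projects/my-gcp-project/subscriptions/my-sub")
     | "DecodeMessage" >> beam.Map(lambda b: b.decode("utf-8"))
     | "FilterAlerts" >> beam.Filter(lambda msg: "ALERT" in msg)
     # WriteToPubSub expects bytes, so re-encode before publishing.
     | "EncodeMessage" >> beam.Map(lambda s: s.encode("utf-8"))
     | "WriteStream" >> WriteToPubSub(topic="projects/my-gcp-project/topics/alerts-topic"))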

7. Code Explanation

  • ReadFromPubSub(...) reads messages as raw bytes.
  • b.decode("utf-8") decodes bytes into readable string elements.
  • WriteToPubSub(...) publishes the matching alert byte strings back to Pub/Sub.

8. Real Production Example: Timestamp Attribute

In production, you usually want event timestamps to reflect when the event occurred rather than when it was published. You can instruct the connector to read the timestamp from a message attribute:

python
messages = p | ReadFromPubSub(
    subscription="projects/my-project/subscriptions/my-sub",
    timestamp_attribute="event_timestamp"
)

Beam will automatically parse the event_timestamp attribute from the Pub/Sub message metadata and use it as the element's event time for windowing calculations.
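
The attribute value has to be in a format the connector understands. The Beam Python documentation lists two: milliseconds since the Unix epoch, or an RFC 3339 string in UTC. Below is a minimal publisher-side sketch that produces both forms with the standard library. The helper names are hypothetical, and how the attribute is attached to the message depends on the publishing client you use:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(event_time: datetime) -> str:
    """Event time as integer milliseconds since the Unix epoch, as a string.

    Expects a timezone-aware datetime.
    """
    # Attribute values are strings, so the number is stringified.
    return str((event_time - EPOCH) // timedelta(milliseconds=1))


def to_rfc3339(event_time: datetime) -> str:
    """Event time as an RFC 3339 string in UTC with millisecond precision.

    Expects a timezone-aware datetime.
    """
    utc = event_time.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


# Example: attributes for an event that occurred at midnight UTC on 2024-01-01.
occurred_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
attributes = {"event_timestamp": to_epoch_millis(occurred_at)}
```

The attribute key must match the name passed as timestamp_attribute on the reading side.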

9. Common Mistakes

  • Sharing one subscription between pipelines: If two pipelines pull from the same subscription, they compete for messages and each one sees only part of the stream. Create a separate, dedicated subscription for each pipeline.
  • Writing non-byte data: Pub/Sub messages are strictly byte payloads. You must encode your string data (e.g. .encode("utf-8")) before passing it to WriteToPubSub.
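
As a sketch of the encoding step, the function below converts common element types to bytes before they reach WriteToPubSub. The name to_payload is hypothetical, and serializing non-string elements as JSON is an assumption about your data, not something Pub/Sub requires:

```python
import json


def to_payload(element) -> bytes:
    """Convert a pipeline element into the bytes payload Pub/Sub expects."""
    if isinstance(element, bytes):
        return element  # already a valid payload
    if isinstance(element, str):
        return element.encode("utf-8")
    # Assumption: anything else is JSON-serializable (dict, list, number, ...).
    return json.dumps(element, sort_keys=True).encode("utf-8")
```

In a pipeline this would sit directly before the write, for example records | beam.Map(to_payload) | WriteToPubSub(topic=...).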

10. Interview Perspective

  • Question: Why is reading from a subscription preferred over reading from a topic?
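  • Answer: A named subscription exists independently of the pipeline, so messages published while the pipeline is stopped or being updated are retained and delivered when it resumes. Reading from a topic creates a temporary subscription at launch, which only receives messages published after that point. A stable subscription also gives the runner and GCP monitoring a consistent backlog (how many messages are waiting in the queue) to report, which autoscalers use to add worker VMs when load spikes.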
  • Question: How does Beam handle ACK deadlines in Pub/Sub?
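  • Answer: The runner manages acknowledgments automatically. It sends the ACK only after the bundle that read the messages has been durably committed (checkpointed), which is not necessarily after the whole pipeline has written to its final sinks. If a worker fails before that point, Pub/Sub redelivers the messages instead of losing them.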

11. Best Practices

  • Use subscriptions for pipeline inputs to ensure accurate autoscaling metrics.
  • Store event times in message attributes and set timestamp_attribute so that windowing uses when the event occurred rather than when the message was published.

12. Summary

  • Pub/Sub is the durable message queue between event publishers and streaming Beam pipelines.
  • Reads return unbounded PCollections.
  • Inputs are read from subscriptions; outputs are published to topics.

13. Interactive Challenges

Challenge 1: Read From Subscription (Beginner)

Write an Apache Beam import and read statement that instantiates a streaming PCollection from a Pub/Sub subscription path "projects/my-billing/subscriptions/user-activity".

Challenge 2: Encode and Write to Topic (Intermediate)

Write a pipeline segment that takes a PCollection of string records, encodes them to UTF-8 bytes, and writes them to a Pub/Sub topic path "projects/my-billing/topics/notifications".

Challenge 3: Timestamp Attribute Mapping (Advanced)

Write a streaming read statement that pulls from a subscription, extracting event timestamps from a custom metadata attribute named "sensor_time".
