beginner
Messaging
6 min read | Last updated: 2026-07-01
1. Introduction
Real-time streaming architectures rely on message queues and event brokers to move events between systems. Apache Beam provides I/O connectors that read records from, and write records to, brokers such as Google Cloud Pub/Sub, Apache Kafka, and Amazon Kinesis.
2. Why This Concept Exists
Batch pipelines process static, historical files, but production analytics often require sub-second event ingestion. Message brokers act as a buffer between producers and analytical consumers. Beam's messaging connectors manage subscriptions, offsets, acknowledgements, and streaming watermarks so that pipelines don't drop events.
3. Key Terminology
- Pub/Sub (Publish/Subscribe): A messaging model where sender services publish events to topics, and consumer services subscribe to them.
- Kafka Partition: An ordered, append-only sequence of messages within a Kafka topic. Splitting a topic into partitions lets Kafka spread writes and reads across brokers and consumers.
- Ack (Acknowledgement): A signal sent back to the broker indicating a message was successfully processed.
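To make the ack term concrete, here is a minimal sketch of at-least-once delivery. `ToyBroker` is a hypothetical in-memory class invented for this illustration, not part of Beam or any broker client. It only shows why a message that is never acked gets delivered again.

```python
from collections import deque


class ToyBroker:
    """Hypothetical in-memory broker with at-least-once delivery (illustration only)."""

    def __init__(self, messages):
        self._pending = deque(enumerate(messages))  # (message_id, payload) pairs
        self._in_flight = {}

    def pull(self):
        """Hand out the next message and remember it until it is acked."""
        message_id, payload = self._pending.popleft()
        self._in_flight[message_id] = payload
        return message_id, payload

    def ack(self, message_id):
        """The consumer confirms success, so the broker forgets the message."""
        del self._in_flight[message_id]

    def redeliver_unacked(self):
        """Simulate an ack deadline expiring: unacked messages are queued again."""
        for message_id, payload in sorted(self._in_flight.items()):
            self._pending.append((message_id, payload))
        self._in_flight.clear()

    def has_pending(self):
        return bool(self._pending)


broker = ToyBroker([b"a", b"b"])
first_id, _ = broker.pull()
broker.ack(first_id)             # processed successfully
second_id, _ = broker.pull()     # consumer "crashes" before acking this one
broker.redeliver_unacked()
redelivered_id, redelivered_payload = broker.pull()
```

The second message comes back with the same ID, which is why consumers of at-least-once brokers have to tolerate duplicates.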
4. How It Works
- The streaming pipeline keeps connections open to the message broker.
- Ingest: Workers poll messages from multiple partitions in parallel, extracting payload bytes and properties.
- Publish: Workers serialize processed data and write records back to egress messaging topics.
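The ingest and publish steps above can be sketched without a real broker. This is a toy simulation under stated assumptions: the `PARTITIONS` dictionary, the JSON payload shape, and the `ingest` and `publish` helpers are all hypothetical, and a thread pool stands in for Beam workers.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "topic": partition number -> list of (payload bytes, attributes).
PARTITIONS = {
    0: [(b'{"user": "ann", "clicks": 2}', {"source": "web"})],
    1: [(b'{"user": "bob", "clicks": 5}', {"source": "iot"})],
}


def ingest(partition):
    """Ingest: read one partition, decoding payload bytes and keeping attributes."""
    records = []
    for payload, attributes in PARTITIONS[partition]:
        event = json.loads(payload.decode("utf-8"))
        event["source"] = attributes["source"]
        records.append(event)
    return records


def publish(event):
    """Publish: serialize a processed record back to bytes for an egress topic."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


# One worker thread per partition, mirroring parallel reads.
with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
    per_partition = list(pool.map(ingest, sorted(PARTITIONS)))

egress = [publish(event) for records in per_partition for event in records]
```

In a Beam pipeline these roles are typically filled by a read transform such as `beam.io.ReadFromPubSub`, a `beam.Map` step, and a write transform such as `beam.io.WriteToPubSub`.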
5. Visual Diagram
Producers (IoT / Web Apps) -> Message Broker (Kafka / Pub/Sub) -> Beam Ingestion (Read & Windows)
6. Code Example
Ingesting raw payload bytes from a streaming subscription:
python
import apache_beam as beam


def clean_payload(message_bytes):
    # Decode the binary event data
    return message_bytes.decode("utf-8").strip()


# Inside streaming pipeline:
# raw_events = pipeline | "ReadPubSub" >> beam.io.ReadFromPubSub(subscription="projects/my-gcp/subscriptions/sub")
# clean_events = raw_events | "Clean" >> beam.Map(clean_payload)
7. Code Explanation
- ReadFromPubSub connects to a Google Cloud Pub/Sub subscription in streaming mode.
- By default it outputs each message body as a raw bytes object, which is why clean_payload decodes it as UTF-8 before stripping whitespace.
8. Real Production Example
Reading from a Kafka topic using the Python SDK cross-language transform:
python
from apache_beam.io.kafka import ReadFromKafka

# kafka_stream = pipeline | "ReadKafka" >> ReadFromKafka(
#     consumer_config={"bootstrap.servers": "localhost:9092"},
#     topics=["user-events"],
#     key_deserializer="org.apache.kafka.common.serialization.StringDeserializer",
#     value_deserializer="org.apache.kafka.common.serialization.StringDeserializer",
# )
9. Common Mistakes
- Running in Batch Mode: Reading from an unbounded messaging source with batch (non-streaming) pipeline options can cause the pipeline to stall or hang indefinitely. Enable streaming mode, for example with the --streaming pipeline option.
- Ignoring Event Timestamps: Relying on the broker's ingestion time instead of the event time recorded in the payload, which skews windowing when events arrive late or out of order.
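As an illustration of the timestamp point, the sketch below pulls an event time out of the payload itself. It assumes a JSON payload with an ISO 8601 string in a field named `event_time`. That schema and the `event_time_seconds` helper are hypothetical, chosen only for this example.

```python
import json
from datetime import datetime, timezone


def event_time_seconds(message_bytes, field="event_time"):
    """Return the payload's own timestamp as Unix seconds.

    Assumes a JSON payload with an ISO 8601 string in `field` (hypothetical schema).
    """
    event = json.loads(message_bytes.decode("utf-8"))
    parsed = datetime.fromisoformat(event[field])
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # treat naive values as UTC
    return parsed.timestamp()


sample = b'{"order_id": 7, "event_time": "2024-01-01T00:00:30+00:00"}'
seconds = event_time_seconds(sample)

# Inside a streaming pipeline, the value could be attached as the element timestamp:
# stamped = raw_events | "Stamp" >> beam.Map(
#     lambda m: beam.window.TimestampedValue(m, event_time_seconds(m)))
```

When the producer already puts the event time in a Pub/Sub message attribute, the `timestamp_attribute` parameter of `ReadFromPubSub` may be a simpler option than parsing the payload.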
10. Interview Perspective
- Question: How does Beam manage duplicate records from message queues?
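- Answer: Most messaging brokers guarantee at-least-once delivery, which can introduce duplicates. Beam can remove them when each record carries a unique ID: for example, ReadFromPubSub accepts an id_label attribute that is used for deduplication, and a stateful step keyed by message ID can filter repeats. Without such an ID, deduplication is best effort.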
- Question: What are the benefits of Kafka partitioning for Beam?
- Answer: Partitions map naturally to Beam's parallel execution model. Each partition can be read independently from its own offset, so read throughput can scale with the number of workers, up to the number of partitions.
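Both answers can be illustrated with a small plain-Python sketch. The `deduplicate` and `assign_partitions` helpers are hypothetical and are not Beam APIs. They only mimic stateful deduplication by message ID and a round-robin mapping of partitions to workers.

```python
def deduplicate(messages):
    """Yield each (message_id, payload) pair once, keeping state on seen IDs."""
    seen_ids = set()
    for message_id, payload in messages:
        if message_id in seen_ids:
            continue  # duplicate delivery, drop it
        seen_ids.add(message_id)
        yield message_id, payload


def assign_partitions(partitions, worker_count):
    """Spread partitions across workers round-robin."""
    assignment = {worker: [] for worker in range(worker_count)}
    for index, partition in enumerate(partitions):
        assignment[index % worker_count].append(partition)
    return assignment


delivered = [("m1", b"a"), ("m2", b"b"), ("m1", b"a"), ("m3", b"c"), ("m2", b"b")]
unique = list(deduplicate(delivered))

two_workers = assign_partitions([0, 1, 2, 3], 2)
six_workers = assign_partitions([0, 1, 2, 3], 6)  # workers 4 and 5 get nothing to read
```

The set of seen IDs here grows without bound, so a real streaming job would need to expire that state after some time. The six-worker case shows why adding workers beyond the partition count does not add read parallelism.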
11. Best Practices
- Extract timestamps directly from message payloads rather than relying on ingest time.
- Use dead-letter queues to store unparseable payloads.
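A minimal sketch of the dead-letter idea, assuming JSON payloads: anything that fails to decode or parse is set aside with its raw bytes intact instead of crashing the pipeline. The `route` helper and the tag names are hypothetical.

```python
import json


def route(message_bytes):
    """Return ("parsed", event) on success or ("dead_letter", raw bytes) on failure."""
    try:
        return "parsed", json.loads(message_bytes.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "dead_letter", message_bytes


incoming = [b'{"id": 1}', b"not json", b"\xff\xfe"]
parsed, dead_letters = [], []
for raw in incoming:
    tag, value = route(raw)
    (parsed if tag == "parsed" else dead_letters).append(value)
```

In a Beam pipeline the same split is usually expressed with tagged outputs (`beam.pvalue.TaggedOutput` together with `with_outputs`), with the dead-letter output written to a separate topic or table for later inspection.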
12. Summary
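- Messaging connectors ingest real-time events from brokers and publish results back to them.
- Beam provides connectors for Pub/Sub, Kafka, and Kinesis.
- Reading from these unbounded sources requires a streaming pipeline configuration.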