beginner

Messaging

6 min read. Last updated: 2026-07-01

1. Introduction

Real-time streaming architectures rely on message queues and event brokers to move and process event streams. Apache Beam provides I/O connectors that read records from, and write records to, brokers such as Google Cloud Pub/Sub, Apache Kafka, and Amazon Kinesis.

2. Why This Concept Exists

Batch pipelines process bounded, historical data, but many production analytics workloads need to ingest events with low latency. Message brokers act as a buffer between producers and analytical consumers. Beam's messaging connectors manage subscriptions, offsets, acknowledgements, and streaming watermarks so that pipelines don't drop events.

3. Key Terminology

  • Pub/Sub (Publish/Subscribe): A messaging model where sender services publish events to topics, and consumer services subscribe to them.
  • Kafka Partition: An ordered sequence of messages within a Kafka topic. Splitting a topic into partitions lets Kafka spread writes and reads across brokers and consumers.
  • Ack (Acknowledgement): A signal sent back to the broker indicating a message was successfully processed.
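
To make the ack term concrete, here is a minimal sketch of at-least-once delivery using a toy in-memory broker. The `ToyBroker` class and its method names are hypothetical and exist only for illustration; real brokers such as Pub/Sub or Kafka track acknowledgements and offsets in their own ways.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for a broker with at-least-once delivery."""

    def __init__(self):
        self._pending = deque()  # messages waiting to be delivered
        self._unacked = {}       # delivered but not yet acknowledged

    def publish(self, message_id, payload):
        self._pending.append((message_id, payload))

    def pull(self):
        """Deliver the next message, or None if nothing is pending."""
        if not self._pending:
            return None
        message_id, payload = self._pending.popleft()
        self._unacked[message_id] = payload
        return message_id, payload

    def ack(self, message_id):
        """Consumer confirms success, so the broker forgets the message."""
        self._unacked.pop(message_id, None)

    def redeliver_unacked(self):
        """Simulate an ack deadline expiring: requeue everything unacked."""
        for message_id, payload in self._unacked.items():
            self._pending.append((message_id, payload))
        self._unacked.clear()


broker = ToyBroker()
broker.publish("m1", b"login")
broker.publish("m2", b"logout")

first = broker.pull()
broker.ack(first[0])        # m1 was processed and acknowledged
second = broker.pull()      # m2 is delivered, but the consumer never acks it
broker.redeliver_unacked()
retry = broker.pull()       # m2 is delivered a second time
```

The unacknowledged message comes back, which is why consumers of at-least-once brokers have to tolerate duplicates.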

4. How It Works

  • Connect: The streaming pipeline keeps long-lived connections open to the message broker.
  • Ingest: Workers poll messages from multiple partitions in parallel, extracting payload bytes and message attributes.
  • Publish: Workers serialize processed data and write records back to output topics.
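
The ingest and publish steps above can be sketched in plain Python. The `BrokerMessage` shape, the `source` attribute, and the function names are assumptions made for this illustration, not part of any Beam connector API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class BrokerMessage:
    """Hypothetical message shape: raw payload bytes plus string attributes."""
    data: bytes
    attributes: dict = field(default_factory=dict)


def ingest(message):
    """Ingest step: decode the payload bytes and carry an attribute along."""
    event = json.loads(message.data.decode("utf-8"))
    event["source"] = message.attributes.get("source", "unknown")
    return event


def publish(event):
    """Publish step: serialize the processed record back to bytes."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


incoming = BrokerMessage(data=b'{"event": "login"}', attributes={"source": "web"})
outgoing = publish(ingest(incoming))
```

In a Beam pipeline, functions like these would typically be applied with `beam.Map` between the read and write transforms.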

5. Visual Diagram

Producers (IoT / Web Apps) -> Message Broker (Kafka / Pub/Sub) -> Beam Ingestion (Read & Windows)

6. Code Example

Ingesting raw payload bytes from a streaming subscription:

python
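import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def clean_payload(message_bytes):
    # Decode the binary event data and trim surrounding whitespace
    return message_bytes.decode("utf-8").strip()

# Pub/Sub is an unbounded source, so the pipeline must run in streaming mode.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    raw_events = pipeline | "ReadPubSub" >> beam.io.ReadFromPubSub(
        subscription="projects/my-gcp/subscriptions/sub")
    clean_events = raw_events | "Clean" >> beam.Map(clean_payload)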

7. Code Explanation

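  • ReadFromPubSub connects to a Google Cloud Pub/Sub subscription in streaming mode.
  • By default it outputs each message body as a Python bytes object. Passing with_attributes=True makes it output PubsubMessage objects that also carry the message attributes.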

8. Real Production Example

Reading from a Kafka topic using the Python SDK cross-language transform:

python
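import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

# ReadFromKafka is a cross-language transform: a Java runtime must be
# available so Beam can start the Java expansion service.
with beam.Pipeline(options=PipelineOptions(streaming=True)) as pipeline:
    kafka_stream = pipeline | "ReadKafka" >> ReadFromKafka(
        consumer_config={"bootstrap.servers": "localhost:9092"},
        topics=["user-events"],
        key_deserializer="org.apache.kafka.common.serialization.StringDeserializer",
        value_deserializer="org.apache.kafka.common.serialization.StringDeserializer",
    )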

9. Common Mistakes

  • Running in Batch Mode: Reading from an unbounded message queue with standard batch execution options, without enabling streaming mode, can cause the pipeline to stall or hang indefinitely.
  • Ignoring Event Timestamps: Relying on broker ingestion time rather than extracting the transaction timestamp carried in the payload, which skews windowing when events arrive late or out of order.
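
As a minimal sketch of the second point, the helper below pulls the event time out of a JSON payload and converts it to Unix seconds. The `event_time` field name and the ISO-8601 format with an explicit UTC offset are assumptions about the payload, not something the broker or Beam dictates.

```python
import json
from datetime import datetime, timezone


def event_time_seconds(payload_bytes, field_name="event_time"):
    """Return the payload's own timestamp as Unix seconds (UTC)."""
    record = json.loads(payload_bytes.decode("utf-8"))
    # Assumes an ISO-8601 string with an explicit offset,
    # for example "2026-07-01T12:00:00+00:00".
    parsed = datetime.fromisoformat(record[field_name])
    return parsed.astimezone(timezone.utc).timestamp()


payload = b'{"event": "purchase", "event_time": "2026-07-01T12:00:00+00:00"}'
ts = event_time_seconds(payload)
```

In a Beam pipeline, a value like `ts` can be attached to an element with `beam.window.TimestampedValue(element, ts)` so that windowing uses event time. For Pub/Sub, `ReadFromPubSub` also accepts a `timestamp_attribute` argument when the timestamp is published as a message attribute.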

10. Interview Perspective

  • Question: How does Beam manage duplicate records from message queues?
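  • Answer: Most message brokers guarantee at-least-once delivery, which can introduce duplicates. Beam deduplicates on a unique record ID: some connectors expose one (for example, the id_label argument of ReadFromPubSub, where the runner supports it), and a pipeline can otherwise apply stateful deduplication keyed on an ID carried in the message.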
  • Question: What are the benefits of Kafka partitioning for Beam?
  • Answer: Partitions map naturally to Beam's parallel execution model. Beam reads distinct partitions on different workers and tracks an offset per partition, so read throughput can scale with the number of partitions.
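
The idea behind ID-based deduplication can be illustrated without Beam. This is a simplified sketch, assuming every message arrives as a `(message_id, payload)` pair. It keeps all seen IDs in memory, whereas a real streaming pipeline would keep this state per key and expire it after some time.

```python
def deduplicate(messages):
    """Yield each (message_id, payload) pair once, keeping the first copy."""
    seen_ids = set()  # the "state": IDs that have already been emitted
    for message_id, payload in messages:
        if message_id in seen_ids:
            continue  # redelivered copy, drop it
        seen_ids.add(message_id)
        yield message_id, payload


delivered = [("m1", b"login"), ("m2", b"logout"), ("m1", b"login")]
unique = list(deduplicate(delivered))
```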

11. Best Practices

  • Extract timestamps directly from message payloads rather than relying on ingest time.
  • Use dead-letter queues to store unparseable payloads.
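
A dead-letter path can be sketched as a function that separates parseable payloads from failures. The function name and the shape of the dead-letter entries are assumptions made for this example.

```python
import json


def route_payloads(payloads):
    """Split raw payloads into parsed records and dead-letter entries."""
    parsed, dead_letters = [], []
    for raw in payloads:
        try:
            parsed.append(json.loads(raw.decode("utf-8")))
        except (UnicodeDecodeError, json.JSONDecodeError) as error:
            # Keep the original bytes and the failure reason so the
            # record can be inspected or replayed later.
            dead_letters.append({"raw": raw, "error": type(error).__name__})
    return parsed, dead_letters


good, bad = route_payloads([b'{"event": "login"}', b"not json", b"\xff\xfe"])
```

In a Beam pipeline the same split is usually expressed with a `DoFn` that emits failures as a `beam.pvalue.TaggedOutput`, with the dead-letter output written to its own topic or table.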

12. Summary

  • Messaging connectors ingest real-time events.
  • Beam supports Pub/Sub, Kafka, and Kinesis.
  • Reading from a broker requires the pipeline to run in streaming mode.

13. Interactive Challenges

Challenge 1: String Decoder Map (Beginner)

Write a map transform that takes raw message bytes b"hello" and decodes them to string "hello".

Challenge 2: Kafka Read Options (Intermediate)

Define the consumer config dictionary required to connect to a local Kafka server running on port 9092.

Challenge 3: Egress Payload Mapper (Advanced)

Write a mapping transform that converts a dictionary {"event": "login"} into JSON formatted bytes b'{"event": "login"}' for broker output.
