Kafka IO — Learn Apache Beam

1. Introduction

Apache Kafka is an open-source distributed event streaming platform. Apache Beam provides a robust connector to read from and write to Kafka clusters, making it the primary streaming connector for non-GCP deployments (like on-premise Kubernetes clusters).

2. Why This Concept Exists

Large enterprises often run streaming applications across hybrid environments. Pub/Sub is the usual choice on GCP, while Kafka is a widely used event broker elsewhere. Beam's Kafka connector lets you ingest records from Kafka topics into a pipeline and process them in parallel across workers, for example with distributed aggregations.

3. Key Terminology

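Bootstrap Servers: A comma-separated list of host:port addresses for one or more brokers in your Kafka cluster (e.g. "localhost:9092"). The client uses them to discover the rest of the cluster.
Topic: The named stream to which records are published and from which they are read. Each topic is divided into partitions.
Deserializer: A Java class that converts the raw bytes of a record's key or value back into an object. With the Python SDK you pass its fully qualified class name as a string.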
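
To make the deserializer's role concrete, here is a minimal sketch of the equivalent decoding step done on the Python side. It assumes the connector's default ByteArrayDeserializer is in use, so each element arrives as a (key, value) pair of raw bytes. The helper name decode_record and the UTF-8 encoding are assumptions made for illustration.

```python
from typing import Optional, Tuple


def decode_record(record: Tuple[Optional[bytes], bytes]) -> Tuple[Optional[str], str]:
    """Turn a raw (key, value) byte pair into text, assuming UTF-8 payloads."""
    key, value = record
    # Kafka keys are optional, so guard against a missing key.
    decoded_key = key.decode("utf-8") if key is not None else None
    return decoded_key, value.decode("utf-8")


print(decode_record((b"user-42", b"login")))
```

In a pipeline, a helper like this would typically run as beam.Map(decode_record) directly after the read step.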

4. How It Works

The Python Kafka connector is a cross-language transform: under the hood it runs Beam's Java Kafka connector, so the read and write logic is shared with the Java SDK even when the pipeline is written in Python.
Read: You supply the bootstrap servers (inside consumer_config), the topics to read, and key/value deserializers. ReadFromKafka returns a PCollection of (key, value) tuples, one per Kafka record.
Write: WriteToKafka publishes a PCollection of (key, value) tuples to a designated Kafka topic.
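
The write path can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the helper names to_kafka_kv and build_write_pipeline, the sample records, and whatever broker address and topic you pass in are all hypothetical. The sketch encodes text records to bytes because WriteToKafka's default serializers are the byte-array ones.

```python
from typing import Tuple


def to_kafka_kv(key: str, value: str) -> Tuple[bytes, bytes]:
    """Encode a text record as the (bytes, bytes) pair the byte-array serializers expect."""
    return key.encode("utf-8"), value.encode("utf-8")


def build_write_pipeline(bootstrap_servers: str, topic: str):
    """Construct a pipeline that publishes two sample records; the caller runs it."""
    # Imported here so the encoding helper above stays usable without Beam installed.
    # Applying WriteToKafka triggers cross-language expansion, which needs Java.
    import apache_beam as beam
    from apache_beam.io.kafka import WriteToKafka

    pipeline = beam.Pipeline()
    (
        pipeline
        | "CreateRecords" >> beam.Create([("sensor-1", "21.5"), ("sensor-2", "19.8")])
        | "EncodeBytes" >> beam.MapTuple(to_kafka_kv).with_output_types(Tuple[bytes, bytes])
        | "WriteKafka" >> WriteToKafka(
            producer_config={"bootstrap.servers": bootstrap_servers},
            topic=topic,
            # These are WriteToKafka's defaults, spelled out for clarity.
            key_serializer="org.apache.kafka.common.serialization.ByteArraySerializer",
            value_serializer="org.apache.kafka.common.serialization.ByteArraySerializer",
        )
    )
    return pipeline
```

Calling build_write_pipeline("localhost:9092", "events").run() would publish the sample records, assuming a broker is reachable at that address and a Java runtime is installed.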

5. Visual Diagram

Microservices ➔ Kafka Topic ➔ [ReadFromKafka] ➔ Beam Python Pipeline

6. Code Example

Reading messages with string keys and values from a topic on a local Kafka cluster:

python

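import apache_beam as beam
# In current Beam releases, ReadFromKafka is imported from apache_beam.io.kafka.
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

# Kafka is an unbounded source, so run the pipeline in streaming mode.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadKafka" >> ReadFromKafka(
         consumer_config={"bootstrap.servers": "localhost:9092"},
         topics=["user-logs"],
         # Fully qualified Java class names, passed as strings.
         key_deserializer="org.apache.kafka.common.serialization.StringDeserializer",
         value_deserializer="org.apache.kafka.common.serialization.StringDeserializer"
     )
     # Each element is a (key, value) tuple.
     | "LogPayload" >> beam.Map(print))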

7. Code Explanation

ReadFromKafka(...) needs bootstrap.servers in consumer_config so the consumer can locate the cluster.
topics=["user-logs"] selects the topic or topics to read.
key_deserializer and value_deserializer are the fully qualified names of the Java classes that decode each record's key and value.

8. Real Production Example: Consumer Config

In production environments, you typically add security settings such as SASL over TLS, plus a consumer group ID so that offsets can be committed to Kafka and consumer lag can be monitored:

python

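# Assumes `p` is an existing beam.Pipeline and ReadFromKafka is imported.
kafka_records = p | ReadFromKafka(
    consumer_config={
        "bootstrap.servers": "prod-kafka:9092",
        "group.id": "beam-processing-group",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN"
        # SASL/PLAIN also needs credentials, usually via "sasl.jaas.config".
    },
    topics=["sales"]
)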
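
A SASL/PLAIN client also needs credentials, which Kafka clients usually receive through the sasl.jaas.config property. The sketch below shows one way to assemble such a config without hard-coding secrets. The function name build_sasl_consumer_config, the environment variable names KAFKA_USERNAME and KAFKA_PASSWORD, and the fallback values are hypothetical.

```python
import os
from typing import Dict


def build_sasl_consumer_config(bootstrap_servers: str, group_id: str,
                               username: str, password: str) -> Dict[str, str]:
    """Return a consumer_config dict for SASL/PLAIN over TLS."""
    # JAAS line in the format Kafka's PlainLoginModule expects.
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{username}" password="{password}";'
    )
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.jaas.config": jaas,
    }


config = build_sasl_consumer_config(
    "prod-kafka:9092",
    "beam-processing-group",
    # Hypothetical environment variable names; the fallbacks are placeholders.
    os.environ.get("KAFKA_USERNAME", "example-user"),
    os.environ.get("KAFKA_PASSWORD", "example-password"),
)
# Print only the keys so the password never reaches the logs.
print(sorted(config))
```

The resulting dictionary can be passed as consumer_config to ReadFromKafka. In a real deployment the credentials would more likely come from a secret manager than from plain environment variables.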

9. Common Mistakes

Missing Java environment locally: Because the Python Kafka connector relies on cross-language expansion, the machine that constructs the pipeline needs a Java runtime. Without one, the expansion service cannot start and pipeline construction fails before any data is read.
Forgetting group.id: If you omit group.id, the pipeline cannot commit offsets back to Kafka under a stable consumer group. A newly started pipeline that does not restore runner state then has no committed position to resume from and begins wherever auto.offset.reset points.
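
A quick preflight check for the first mistake can look like the sketch below. It only tests whether a java executable is on PATH, which is a heuristic: the function name java_available is hypothetical, and Beam's own lookup of the Java runtime may differ.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on PATH."""
    return shutil.which("java") is not None


if not java_available():
    print("No `java` executable on PATH; install a Java runtime before using ReadFromKafka.")
```

Running this before building the pipeline gives a clearer message than the failure raised during expansion.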

10. Interview Perspective

Question: How does Apache Beam read from Kafka in parallel?
Answer: The connector splits the read by partition, which is Kafka's unit of parallelism. If a topic has 12 partitions, the runner can distribute them across workers so that up to 12 readers consume concurrently.
Question: How does Beam manage offsets in Kafka?
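Answer: Beam records the offset reached in each partition as part of the runner's checkpoint state, and a running pipeline resumes from those checkpoints after a failure. Offsets are written back to Kafka only if you enable committing, for example with commit_offset_in_finalize=True and a group.id. In that case they are committed when a checkpoint is finalized, not when a window fires.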
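
The partition-level parallelism from the first answer can be pictured with a toy model. The round-robin function below is purely illustrative: the name assign_partitions is hypothetical, and real runners use their own splitting and work-balancing logic.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, num_workers: int) -> Dict[int, List[int]]:
    """Distribute partition ids across workers round-robin (toy model only)."""
    assignment: Dict[int, List[int]] = {worker: [] for worker in range(num_workers)}
    for partition in range(num_partitions):
        assignment[partition % num_workers].append(partition)
    return assignment


for worker, partitions in assign_partitions(12, 4).items():
    print(f"worker {worker} reads partitions {partitions}")
```

With more workers than partitions, some workers in this model receive nothing to read, which is why the partition count caps read parallelism.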

11. Best Practices

Always set group.id explicitly in production so that consumer lag can be monitored and committed offsets are tied to a stable, recognizable name.
Ensure your network settings permit worker VMs to communicate with all brokers in the Kafka cluster.
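
One way to enforce the first practice is to build the read arguments through a small helper that refuses an empty group ID. This is a sketch under assumptions: the helper name build_read_kwargs is hypothetical, the auto.offset.reset value is just one reasonable choice, and commit_offset_in_finalize is assumed to be available in the Beam release you run.

```python
from typing import Any, Dict, List


def build_read_kwargs(bootstrap_servers: str, group_id: str,
                      topics: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for ReadFromKafka with an explicit consumer group."""
    if not group_id:
        raise ValueError("group_id must be set explicitly")
    return {
        "consumer_config": {
            "bootstrap.servers": bootstrap_servers,
            "group.id": group_id,
            # Where to start when the group has no committed offset yet.
            "auto.offset.reset": "earliest",
        },
        "topics": topics,
        # Ask the connector to commit offsets to Kafka when a checkpoint is finalized.
        "commit_offset_in_finalize": True,
    }


read_kwargs = build_read_kwargs("prod-kafka:9092", "beam-processing-group", ["sales"])
print(read_kwargs["consumer_config"]["group.id"])
```

The result would be used as ReadFromKafka(**read_kwargs), so every read in the codebase carries a named consumer group.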

12. Summary

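Kafka IO connects Beam pipelines to Kafka clusters for both reading and writing.
The Python connector is a cross-language transform backed by Java, so it requires a Java runtime.
Reading requires bootstrap servers and topics, plus key/value deserializers that match the data (byte arrays by default).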

13. Interactive Challenges

Challenge 1: Read From Local Broker (Beginner)

Write an Apache Beam import and read statement that pulls string elements from a local Kafka broker "localhost:9092" on topic "sensor-data".

Challenge 2: Write to Kafka Topic (Intermediate)

Write a pipeline segment that takes a PCollection of key-value string records and publishes them to a Kafka cluster "prod-kafka:9092" on topic "alerts".

Challenge 3: Group ID Consumer Config (Advanced)

Write a Kafka read statement that configures a consumer group ID "beam-log-consumer" and reads from topic "system-logs".
