advanced

Apache Kafka

7 min readLast updated: 2026-06-30

1. Introduction

Apache Kafka is an open-source distributed event streaming platform. Apache Beam provides a robust connector to read from and write to Kafka clusters, making it the primary streaming connector for non-GCP deployments (like on-premise Kubernetes clusters).

2. Why This Concept Exists

Large enterprises run streaming applications across hybrid environments. While Pub/Sub is the choice for GCP, Kafka is the industry-standard event broker elsewhere. Apache Beam allows you to ingest streaming data from Kafka topics into your pipeline, performing distributed aggregations in parallel.

3. Key Terminology

  • Bootstrap Servers: The list of host/port coordinates pointing to the brokers in your Kafka cluster (e.g. "localhost:9092").
  • Topic: The category name where records are stored and published.
  • Deserializer: Helper class that converts raw network bytes back into Java/Python objects.

4. How It Works

  • The Kafka connector uses a cross-language transform (under the hood, it executes a Java reader process to achieve maximum performance and durability, even when using the Python SDK).
  • Read: You supply bootstrap servers, target topics, and key/value deserializers. It returns a PCollection of key-value tuples representing messages.
  • Write: Publishes key-value tuples to a designated Kafka topic.

5. Visual Diagram

Microservices
Kafka Topic
Beam Python Pipeline

6. Code Example

Reading byte payload messages from local Kafka cluster topics:

python
import apache_beam as beam
from apache_beam.io.external.kafka import ReadFromKafka

with beam.Pipeline() as p:
    (p
     | "ReadKafka" >> ReadFromKafka(
         consumer_config={"bootstrap.servers": "localhost:9092"},
         topics=["user-logs"],
         key_deserializer="org.apache.kafka.common.serialization.StringDeserializer",
         value_deserializer="org.apache.kafka.common.serialization.StringDeserializer"
     )
     | "LogPayload" >> beam.Map(print))

7. Code Explanation

  • ReadFromKafka(...) requires bootstrap.servers to establish a network connection.
  • topics=["user-logs"] subscribes to the log stream.
  • key_deserializer and value_deserializer specify the Java class paths to parse the serialized data correctly.

8. Real Production Example: Consumer Config

In production environments, you configure SASL security headers and consumer group IDs to enable offset tracking:

python
kafka_records = p | ReadFromKafka(
    consumer_config={
        "bootstrap.servers": "prod-kafka:9092",
        "group.id": "beam-processing-group",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN"
    },
    topics=["sales"]
)

9. Common Mistakes

  • Missing Java environment locally: Because the Python Kafka connector utilizes cross-language expansion, you must have a Java Runtime Environment (JRE) installed on your developer machine. Omitting it will throw a connection expansion error.
  • Forgetting group.id: If you omit group.id, Kafka defaults to a temporary random consumer group, preventing you from resuming partition reads if your pipeline restarts.

10. Interview Perspective

  • Question: How does Apache Beam read from Kafka in parallel?
  • Answer: The connector splits partitions. If a Kafka topic has 12 partitions, the runner can assign individual partitions to separate worker machines, executing concurrent reads.
  • Question: How does Beam manage offsets in Kafka?
  • Answer: In streaming pipelines, Beam tracks offsets using checkpointing. Once a window fires and results are committed, the runner updates the offset coordinates in Kafka.

11. Best Practices

  • Always configure consumer group IDs (group.id) explicitly for production monitoring.
  • Ensure your network settings permit worker VMs to communicate with all brokers in the Kafka cluster.

12. Summary

  • Kafka connects streaming pipelines to enterprise message brokers.
  • Uses cross-language Java translation (requires JRE).
  • Requires bootstrap servers, topics, and deserializer configuration.

13. Interactive Challenges

Challenge 1: Read From Local Broker (Beginner)

Write an Apache Beam import and read statement that pulls string elements from a local Kafka broker "localhost:9092" on topic "sensor-data".

Challenge 2: Write to Kafka Topic (Intermediate)

Write a pipeline segment that takes a PCollection of key-value string records and publishes them to a Kafka cluster "prod-kafka:9092" on topic "alerts".

Challenge 3: Group ID Consumer Config (Advanced)

Write a Kafka read statement that configures a consumer group ID "beam-log-consumer" and reads from topic "system-logs".

14. Related Content

Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)