advanced
Apache Kafka
7 min readLast updated: 2026-06-30
1. Introduction
Apache Kafka is an open-source distributed event streaming platform. Apache Beam provides a robust connector to read from and write to Kafka clusters, making it the primary streaming connector for non-GCP deployments (like on-premise Kubernetes clusters).
2. Why This Concept Exists
Large enterprises run streaming applications across hybrid environments. While Pub/Sub is the choice for GCP, Kafka is the industry-standard event broker elsewhere. Apache Beam allows you to ingest streaming data from Kafka topics into your pipeline, performing distributed aggregations in parallel.
3. Key Terminology
- Bootstrap Servers: The list of host/port coordinates pointing to the brokers in your Kafka cluster (e.g.
"localhost:9092"). - Topic: The category name where records are stored and published.
- Deserializer: Helper class that converts raw network bytes back into Java/Python objects.
4. How It Works
- The Kafka connector uses a cross-language transform (under the hood, it executes a Java reader process to achieve maximum performance and durability, even when using the Python SDK).
- Read: You supply bootstrap servers, target topics, and key/value deserializers. It returns a PCollection of key-value tuples representing messages.
- Write: Publishes key-value tuples to a designated Kafka topic.
5. Visual Diagram
Microservices
➔Kafka Topic
➔ (ReadFromKafka)Beam Python Pipeline
6. Code Example
Reading byte payload messages from local Kafka cluster topics:
python
import apache_beam as beam
from apache_beam.io.external.kafka import ReadFromKafka
with beam.Pipeline() as p:
(p
| "ReadKafka" >> ReadFromKafka(
consumer_config={"bootstrap.servers": "localhost:9092"},
topics=["user-logs"],
key_deserializer="org.apache.kafka.common.serialization.StringDeserializer",
value_deserializer="org.apache.kafka.common.serialization.StringDeserializer"
)
| "LogPayload" >> beam.Map(print))
7. Code Explanation
ReadFromKafka(...)requiresbootstrap.serversto establish a network connection.topics=["user-logs"]subscribes to the log stream.key_deserializerandvalue_deserializerspecify the Java class paths to parse the serialized data correctly.
8. Real Production Example: Consumer Config
In production environments, you configure SASL security headers and consumer group IDs to enable offset tracking:
python
kafka_records = p | ReadFromKafka(
consumer_config={
"bootstrap.servers": "prod-kafka:9092",
"group.id": "beam-processing-group",
"security.protocol": "SASL_SSL",
"sasl.mechanism": "PLAIN"
},
topics=["sales"]
)
9. Common Mistakes
- Missing Java environment locally: Because the Python Kafka connector utilizes cross-language expansion, you must have a Java Runtime Environment (JRE) installed on your developer machine. Omitting it will throw a connection expansion error.
- Forgetting group.id: If you omit
group.id, Kafka defaults to a temporary random consumer group, preventing you from resuming partition reads if your pipeline restarts.
10. Interview Perspective
- Question: How does Apache Beam read from Kafka in parallel?
- Answer: The connector splits partitions. If a Kafka topic has 12 partitions, the runner can assign individual partitions to separate worker machines, executing concurrent reads.
- Question: How does Beam manage offsets in Kafka?
- Answer: In streaming pipelines, Beam tracks offsets using checkpointing. Once a window fires and results are committed, the runner updates the offset coordinates in Kafka.
11. Best Practices
- Always configure consumer group IDs (
group.id) explicitly for production monitoring. - Ensure your network settings permit worker VMs to communicate with all brokers in the Kafka cluster.
12. Summary
- Kafka connects streaming pipelines to enterprise message brokers.
- Uses cross-language Java translation (requires JRE).
- Requires bootstrap servers, topics, and deserializer configuration.
13. Interactive Challenges
14. Related Content
Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)