Beam Best PracticesEvergreen Article

Apache Beam Best Practices for Production

Published: July 02, 20268 min read

Moving an Apache Beam pipeline from a local test environment into a distributed production cluster (like Cloud Dataflow or Flink TaskManagers) often exposes bugs and performance bottlenecks.

By adhering to a set of core production best practices, you can design pipelines that scale cleanly and avoid system crashes.


1. DoFn Lifecycle and Resource Pooling

A common mistake is creating external connection clients (e.g. database connections, HTTP clients) inside the process() method of a DoFn. This will establish a connection for every single incoming record, leading to database connection pools exhausting immediately.

Anti-Pattern (Connection per element)

python
class WriteToDbFn(beam.DoFn):
    def process(self, element):
        client = DatabaseClient(host="localhost")
        client.write(element)
        client.close()

Best Practice (Pool connection per worker lifecycle)

Initialize connections once per worker lifecycle inside the setup() or start_bundle() methods, and clean them up inside teardown() or finish_bundle():

python
class WriteToDbFn(beam.DoFn):
    def setup(self):
        # Called once when the DoFn instance is initialized on the worker
        self.client = DatabaseClient(host="localhost")

    def process(self, element):
        self.client.write(element)

    def teardown(self):
        # Called once when the worker completes execution
        self.client.close()

2. Enforce Strict Serializability

All variables, parameters, and objects passed to PTransforms and DoFns must be serializable (picklable in Python).

  • The Issue: Because elements are processed across distributed machines, the runner serializes functions and data to distribute them.
  • The Rule: Do not hold non-serializable objects (like open file handles or network sockets) as class instance properties of your DoFn. Initialize them dynamically inside setup().

3. Designing for Idempotency

Streaming runners will periodically retry tasks due to worker network drops or cluster node adjustments.

  • Deduplication: Ensure your database writes are idempotent (writing the same event twice results in a single database record entry).
  • BigQuery writes: Use BigQuery inserts with unique event IDs, or execute upsert/merge logic rather than simple appends.

4. Production Checklist

  • [ ] Mitigate Data Skew: Swap GroupByKey with CombinePerKey whenever possible. Pre-aggregations reduce data shuffles over the network.
  • [ ] Limit Side Input Size: Avoid using side inputs for datasets larger than 100 MB. Use CoGroupByKey instead to perform distributed joins.
  • [ ] Configure Dead-Letter Queues (DLQ): Wrap parsing code in try-except blocks and route failed elements to a tagged side-output to prevent poisons from blocking pipelines.