Moving an Apache Beam pipeline from a local test environment into a distributed production cluster (like Cloud Dataflow or Flink TaskManagers) often exposes bugs and performance bottlenecks.
By adhering to a set of core production best practices, you can design pipelines that scale cleanly and avoid system crashes.
A common mistake is creating external clients (e.g. database connections, HTTP clients) inside the process() method of a DoFn. This opens a new connection for every incoming record, which can quickly exhaust the database's connection pool.
import apache_beam as beam

class WriteToDbFn(beam.DoFn):
    def process(self, element):
        # Anti-pattern: a new connection is opened and closed for every element
        client = DatabaseClient(host="localhost")
        client.write(element)
        client.close()
Per-bundle state, such as a buffer of pending writes, belongs in start_bundle() and finish_bundle(). Long-lived resources such as connections should be created once per DoFn instance in setup() and released in teardown():
import apache_beam as beam

class WriteToDbFn(beam.DoFn):
    def setup(self):
        # Called once when the DoFn instance is initialized on the worker
        self.client = DatabaseClient(host="localhost")

    def process(self, element):
        # Reuses the connection created in setup()
        self.client.write(element)

    def teardown(self):
        # Called on a best-effort basis when the runner discards this DoFn instance
        self.client.close()
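The per-bundle hooks mentioned above can be combined with setup() and teardown() to batch writes. The following is a minimal sketch in plain Python, not a definitive implementation: FakeDatabaseClient, its write_batch() method, BatchedWriteFn, and run_bundles() are hypothetical names introduced for illustration, and the class only mirrors the DoFn lifecycle so that it runs without a Beam installation. In a real pipeline it would subclass beam.DoFn and the runner would call the hooks.

```python
class FakeDatabaseClient:
    """Hypothetical stand-in for DatabaseClient; counts connections and stores rows."""
    connections_opened = 0

    def __init__(self, host):
        FakeDatabaseClient.connections_opened += 1
        self.host = host
        self.rows = []
        self.closed = False

    def write_batch(self, batch):
        # Assumed batch API; the original client only shows write(element)
        self.rows.extend(batch)

    def close(self):
        self.closed = True


class BatchedWriteFn:
    """Mirrors the DoFn lifecycle; in a pipeline this would subclass beam.DoFn."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.client = None
        self.buffer = []

    def setup(self):
        # One connection per DoFn instance
        self.client = FakeDatabaseClient(host="localhost")

    def start_bundle(self):
        # Fresh buffer for every bundle
        self.buffer = []

    def process(self, element):
        self.buffer.append(element)
        if len(self.buffer) >= self.batch_size:
            self._flush()

    def finish_bundle(self):
        # Write whatever is left when the bundle ends
        self._flush()

    def teardown(self):
        self.client.close()

    def _flush(self):
        if self.buffer:
            self.client.write_batch(self.buffer)
            self.buffer = []


def run_bundles(fn, bundles):
    """Drive the lifecycle by hand, roughly in the order a runner would."""
    fn.setup()
    for bundle in bundles:
        fn.start_bundle()
        for element in bundle:
            fn.process(element)
        fn.finish_bundle()
    fn.teardown()


fn = BatchedWriteFn(batch_size=3)
run_bundles(fn, [[1, 2, 3, 4], [5, 6]])
print(FakeDatabaseClient.connections_opened, fn.client.rows)
```

Six elements across two bundles are written through a single connection, and no element is left in the buffer once a bundle finishes.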
All variables, parameters, and objects passed to PTransforms and DoFns must be serializable (picklable in Python).
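To make the serialization requirement concrete, here is a rough sketch using the standard library's pickle module on each object's attributes. It is only a proxy: Beam uses its own pickler, and the class names below are hypothetical and do not subclass beam.DoFn. A threading.Lock stands in for any resource that cannot be pickled, such as an open connection.

```python
import pickle
import threading


class HoldsLockInInit:
    """Anti-pattern: a non-picklable resource is created in the constructor."""

    def __init__(self):
        self.lock = threading.Lock()


class CreatesLockInSetup:
    """Stores only picklable configuration; the resource is created in setup()."""

    def __init__(self, host):
        self.host = host
        self.lock = None

    def setup(self):
        # Runs on the worker, after the object has been deserialized
        self.lock = threading.Lock()


def state_is_picklable(obj):
    """Return True if the instance's attributes survive pickle.dumps()."""
    try:
        pickle.dumps(vars(obj))
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


bad = HoldsLockInInit()
good = CreatesLockInSetup(host="localhost")
print(state_is_picklable(bad), state_is_picklable(good))
```

The second class can be shipped to a worker because it carries only a host string until setup() runs there.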
Create non-serializable resources, such as open connections, inside setup().
Streaming runners periodically retry work when a worker drops off the network or the cluster's nodes are adjusted.
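Because retried work is processed again, any external write it performed may happen twice. One common way to keep that harmless, sketched here in plain Python under stated assumptions, is to key each write by a deterministic ID so that a repeat overwrites rather than duplicates. The sinks, the "id" field, and both helper functions are hypothetical and stand in for a real datastore.

```python
def write_append(sink, records):
    """Non-idempotent: every attempt appends, so a retry duplicates rows."""
    for record in records:
        sink.append(record)


def write_upsert(sink, records):
    """Idempotent: rows are keyed by a deterministic ID, so a retry overwrites."""
    for record in records:
        sink[record["id"]] = record


bundle = [{"id": "a", "value": 1}, {"id": "b", "value": 2}]

append_sink = []
upsert_sink = {}
for _attempt in range(2):  # the same bundle is delivered twice, as after a retry
    write_append(append_sink, bundle)
    write_upsert(upsert_sink, bundle)

print(len(append_sink), len(upsert_sink))
```

After the simulated retry the append-only sink holds duplicate rows, while the keyed sink still holds one row per ID.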
Replace GroupByKey with CombinePerKey whenever possible. Pre-aggregation reduces the amount of data shuffled over the network.
Use CoGroupByKey to perform distributed joins.
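The effect of pre-aggregation on shuffle volume can be illustrated with a small plain-Python simulation. This is not Beam code: the two "workers", their inputs, and the helper functions are assumptions made for the sketch, and summation is used as the example combiner.

```python
from collections import defaultdict

# Hypothetical key/value records held by two workers before the shuffle
worker_inputs = [
    [("a", 1), ("b", 2), ("a", 3), ("a", 4)],
    [("b", 5), ("a", 6), ("b", 7)],
]


def shuffle_without_combine(workers):
    """Grouping only: every record crosses the network."""
    return [pair for worker in workers for pair in worker]


def shuffle_with_combine(workers):
    """Pre-aggregation: each worker sends one partial sum per key."""
    shuffled = []
    for worker in workers:
        partial = defaultdict(int)
        for key, value in worker:
            partial[key] += value
        shuffled.extend(partial.items())
    return shuffled


def final_sums(shuffled):
    """Merge whatever arrived after the shuffle into per-key totals."""
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals)


raw = shuffle_without_combine(worker_inputs)
combined = shuffle_with_combine(worker_inputs)
print(len(raw), len(combined))
print(final_sums(raw) == final_sums(combined))
```

Both paths produce the same totals, but the pre-aggregated path shuffles one record per key per worker instead of every input record. In Beam, this corresponds to writing beam.CombinePerKey(sum) rather than beam.GroupByKey() followed by a beam.Map that sums each group, which lets runners that support it apply the combiner before the shuffle.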