Last Minute Revision · Evergreen

Cheatsheet: DoFn

Revision time: 4 mins

Topic Overview

Define custom per-element processing logic in Apache Beam by subclassing DoFn and overriding its lifecycle methods.

Syntax Snapshot

python
import apache_beam as beam

class ParseRecordFn(beam.DoFn):
    def setup(self):
        # Called once per DoFn instance, before any bundle (good for client initialization)
        # initialize_database_client() is a placeholder for your own factory function
        self.client = initialize_database_client()

    def start_bundle(self):
        # Called once before the first element of each bundle
        self.batch = []

    def process(self, element):
        # Main entrypoint: yields zero, one, or multiple outputs
        if element.is_valid:
            yield element

    def finish_bundle(self):
        # Called once after the last element of each bundle (flush buffered state here)
        pass

    def teardown(self):
        # Best-effort call when the DoFn instance is discarded (close connections here)
        self.client.close()

Key Points

  • DoFn is the user-defined logic container executed within ParDo.
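  • process() returns an iterable of outputs (typically via yield or by returning a list); returning None emits nothing.
  • setup() and teardown() run once per DoFn instance, not per element, so they suit expensive resources such as connection pools. teardown() is best-effort and may be skipped if a worker crashes.
  • A bundle is a group of elements chosen by the runner; buffering in process() and flushing in finish_bundle() reduces database/network round trips.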
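
Bundle Batching Sketch

The bundling point above can be illustrated with a minimal sketch. To keep it runnable without Beam installed, BatchWriteFn is a plain Python class that only mirrors the DoFn method names; in a real pipeline it would subclass beam.DoFn. FakeClient, write_many, MAX_BATCH, and run_bundles are all hypothetical names invented for this illustration, and run_bundles is a simplified imitation of the call order a runner follows, not Beam's actual execution engine.

```python
class FakeClient:
    """Hypothetical stand-in for a database client; records each round trip."""

    def __init__(self):
        self.calls = []

    def write_many(self, rows):
        self.calls.append(list(rows))

    def close(self):
        pass


class BatchWriteFn:
    """Mirrors the DoFn method names; would subclass beam.DoFn in a pipeline."""

    MAX_BATCH = 3  # assumed flush threshold, chosen only for the demo

    def setup(self):
        self.client = FakeClient()

    def start_bundle(self):
        self.batch = []

    def process(self, element):
        # Buffer the element and flush once the batch is full
        self.batch.append(element)
        if len(self.batch) >= self.MAX_BATCH:
            self._flush()
        yield element

    def finish_bundle(self):
        # Flush whatever is left so nothing is lost between bundles
        self._flush()

    def teardown(self):
        self.client.close()

    def _flush(self):
        if self.batch:
            self.client.write_many(self.batch)
            self.batch = []


def run_bundles(fn, bundles):
    """Simplified imitation of the lifecycle call order (not Beam itself)."""
    fn.setup()
    outputs = []
    try:
        for bundle in bundles:
            fn.start_bundle()
            for element in bundle:
                outputs.extend(fn.process(element))
            fn.finish_bundle()
    finally:
        fn.teardown()
    return outputs


fn = BatchWriteFn()
out = run_bundles(fn, [[1, 2, 3, 4], [5, 6]])
print(fn.client.calls)  # [[1, 2, 3], [4], [5, 6]]
```

Six elements result in three write calls instead of six. The flush in finish_bundle() matters because a bundle can end before the batch is full.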

Production Recommendations

Developer Checklist
Never initialize database clients, API connections, or heavy objects inside process(). Always place them in setup().
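
The cost difference behind this checklist item can be seen in a small sketch. As a simplification, the classes below are plain Python objects that mirror the DoFn method names so the example runs without Beam; CountingClient and its lookup method are hypothetical and exist only to count how often a client gets built.

```python
class CountingClient:
    """Hypothetical expensive client; counts how many times it is built."""

    created = 0

    def __init__(self):
        CountingClient.created += 1

    def lookup(self, key):
        return key.upper()


class PerElementInitFn:
    """Anti-pattern: builds a new client for every element."""

    def process(self, element):
        client = CountingClient()
        yield client.lookup(element)


class SetupInitFn:
    """Recommended: builds the client once in setup() and reuses it."""

    def setup(self):
        self.client = CountingClient()

    def process(self, element):
        yield self.client.lookup(element)


elements = ["a", "b", "c", "d"]

# Client built inside process(): one construction per element
CountingClient.created = 0
slow = PerElementInitFn()
slow_out = [y for e in elements for y in slow.process(e)]
slow_created = CountingClient.created

# Client built inside setup(): one construction in total
CountingClient.created = 0
fast = SetupInitFn()
fast.setup()
fast_out = [y for e in elements for y in fast.process(e)]
fast_created = CountingClient.created

print(slow_created, fast_created)  # 4 1
```

Both versions produce the same outputs; only the number of client constructions differs, and that gap grows with the number of elements.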