beginner

Create Transform

6 min read | Last updated: 2026-07-01

1. Introduction

The Create transform (beam.Create) is an in-memory source in Apache Beam. It turns a local Python collection (such as a list, tuple, or dictionary) into a PCollection that the rest of the pipeline can process in parallel.

2. Why This Concept Exists

Before you can apply transforms in a pipeline, you need input data. Production pipelines usually read from external systems (such as GCS, databases, or Kafka), but it is often useful to generate test data or initial configuration parameters directly in code. beam.Create covers that case.

3. Key Terminology

  • In-Memory Source: An input step whose elements come from the memory of the program that builds the pipeline, rather than from external files or services.
  • Iterable: A Python structure that can be looped over (e.g. list, dictionary, generator).
  • Initial PCollection: The first dataset generated at the start of a pipeline.

4. How It Works

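  • beam.Create accepts a Python iterable as its argument.
  • In the Python SDK, the elements are copied out of the iterable when the transform is constructed, so they are held by the transform rather than read lazily at run time.
  • When the pipeline runs, these elements become a new PCollection that the runner can distribute across workers.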

5. Visual Diagram

Python List (In-Memory)
[ 10, 20, 30 ]

▼ (beam.Create)

Distributed PCollection
(10) | (20) | (30) (on workers)

6. Code Example

Generating a PCollection from a list of strings:

python
import apache_beam as beam

with beam.Pipeline() as p:
    names = p | "CreateNames" >> beam.Create(["Alice", "Bob", "Charlie"])
    names | "Print" >> beam.Map(print)

7. Code Explanation

  • beam.Create(["Alice", "Bob", "Charlie"]) wraps the Python list.
  • The pipeline turns this list into a PCollection containing three elements.
  • These elements flow down to the Print step.

8. Real Production Example

Using beam.Create to inject configuration settings or seed data at pipeline startup:

python
import apache_beam as beam

config = [
    {"service": "auth", "endpoint": "https://auth.internal"},
    {"service": "payment", "endpoint": "https://pay.internal"}
]

with beam.Pipeline() as p:
    endpoints = p | "SeedConfig" >> beam.Create(config)
    # Further processing utilizing these endpoints...

9. Common Mistakes

  • Passing large datasets: beam.Create holds its entire input in the memory of the program that launches the pipeline, so a massive list (e.g. gigabytes of records) can cause Out-Of-Memory (OOM) crashes. Read large inputs through an I/O connector instead.
  • Expecting order preservation: A PCollection is unordered. Do not assume that elements created with beam.Create will be processed in the order they appeared in the list.

10. Interview Perspective

  • Question: Can you use beam.Create with a dictionary?
  • Answer: Yes. In the Python SDK, beam.Create treats a dict as its items, so each element is a (key, value) tuple. Spelling out the intent is clearer: use beam.Create(list(my_dict.items())) for key-value pairs, or beam.Create(list(my_dict)) for keys only.
  • Question: When should you avoid using beam.Create?
  • Answer: Avoid it for production data workflows that read files or DB rows. Use it primarily for small configuration injection and unit testing.

11. Best Practices

  • Limit beam.Create input sizes to small sets (thousands of elements at most).
  • Use it in unit tests to mock database or file read stages.

12. Summary

  • beam.Create generates a PCollection from in-memory data.
  • It accepts a Python iterable such as a list, tuple, dictionary, or generator.
  • It is commonly used for test inputs and for injecting small configuration data.

13. Interactive Challenges

Challenge 1: Dictionary Key-Value Pairs (Beginner)

Write a pipeline segment that converts a Python dictionary {"A": 1, "B": 2} into a PCollection of key-value tuples and prints them.

Challenge 2: Ingest JSON Dictionaries (Intermediate)

Write a segment using beam.Create that produces a PCollection of three user dictionaries, each of the form {"id": 1, "name": "User1"}.

Challenge 3: Generator Source (Advanced)

Write a pipeline that uses beam.Create to consume elements from a custom generator function yield_nums(limit) which yields numbers up to a specified limit.
