Create Transform
1. Introduction
The Create transform (beam.Create) is an in-memory data source generator in Apache Beam. It allows you to transform local Python collections (like lists, dictionaries, or tuples) directly into a distributed PCollection.
2. Why This Concept Exists
Before you can apply transformations in a pipeline, you need to load data. Production pipelines typically read from external systems such as GCS, databases, or Kafka, but it is also useful to have an easy way to generate test data or initial configuration parameters from values already held in memory by your code. beam.Create satisfies this need.
3. Key Terminology
- In-Memory Source: An input step that reads directly from the driver memory instead of external files or services.
- Iterable: A Python structure that can be looped over (e.g. list, dictionary, generator).
- Initial PCollection: The first dataset generated at the start of a pipeline.
4. How It Works
- beam.Create accepts a Python iterable as an argument.
- When the pipeline starts, Beam reads the elements from local memory.
- These elements are distributed across execution workers as a new parallelized PCollection.
5. Visual Diagram
Python List (In-Memory)
    [ 10, 20, 30 ]
          |
          v
Distributed PCollection
(10) | (20) | (30)   (on workers)
6. Code Example
Generating a PCollection from a list of strings:
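import apache_beam as beam

with beam.Pipeline() as p:
    # Build a three-element PCollection from an in-memory list.
    names = p | "CreateNames" >> beam.Create(["Alice", "Bob", "Charlie"])
    # Print each element; output order is not guaranteed.
    names | "Print" >> beam.Map(print)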
7. Code Explanation
- beam.Create(["Alice", "Bob", "Charlie"]) wraps the Python list.
- The pipeline turns this list into a PCollection containing three elements.
- These elements flow down to the Print step.
8. Real Production Example
Using beam.Create to inject configuration settings or seed data at pipeline startup:
import apache_beam as beam

config = [
    {"service": "auth", "endpoint": "https://auth.internal"},
    {"service": "payment", "endpoint": "https://pay.internal"},
]

with beam.Pipeline() as p:
    endpoints = p | "SeedConfig" >> beam.Create(config)
    # Further processing utilizing these endpoints...
9. Common Mistakes
- Passing large datasets: Since beam.Create loads data from memory on the driver node, passing a massive list (e.g. gigabytes of records) can cause Out-Of-Memory (OOM) crashes.
- Expecting order preservation: Do not assume elements created with beam.Create will be processed in the same order they appeared in the list.
10. Interview Perspective
- Question: Can you use beam.Create with a dictionary?
- Answer: Yes. In the Python SDK, beam.Create converts a dict to its items(), so each element is a (key, value) tuple rather than a bare key. Passing the pairs explicitly, for example beam.Create(list(my_dict.items())), makes that intent obvious to readers.
- Question: When should you avoid using beam.Create?
- Answer: Avoid it for production data workflows that read files or DB rows. Use it primarily for small configuration injection and unit testing.
11. Best Practices
- Limit beam.Create input sizes to small sets (thousands of elements at most).
- Use it in unit tests to mock database or file read stages.
12. Summary
- beam.Create generates a PCollection from memory.
- Takes any Python iterable structure.
- Commonly used for mock testing and configuration pipelines.