MongoDB IO — Learn Apache Beam

1. Introduction

MongoDB is a document-oriented NoSQL database that stores data in JSON-like documents with dynamic schemas. The MongoDB IO connector in Apache Beam allows pipelines to read documents from collections and write processed records back.

2. Why This Concept Exists

Modern applications use MongoDB to model nested structures without strict relational schemas. When ingesting clickstream events or user profiles, data pipelines need to dump or read records directly from collections. Beam manages connections, cursor timeouts, and bulk inserts across distributed worker clusters.

3. Key Terminology

BSON: Binary JSON. The serialization format MongoDB uses to store documents.
Collection: A grouping of MongoDB documents (equivalent to a table).
ObjectId: A 12-byte identifier used by MongoDB to uniquely identify documents.

4. How It Works

Reading: ReadFromMongoDB queries a collection. It uses split keys (like ranges of ObjectIds) to assign portions of the collection to different workers, downloading documents in parallel.
Writing: WriteToMongoDB takes Python dictionaries, translates them to BSON, and executes bulk inserts.

5. Visual Diagram

MongoDB collection splits query:

Partition 1 (ObjectId Range A)

Assigned to Worker 1

Partition 2 (ObjectId Range B)

Assigned to Worker 2

6. Code Example

Reading documents from a MongoDB database and writing matching output:

python

from apache_beam.io.mongodb import ReadFromMongoDB, WriteToMongoDB

# In your pipeline context:
# docs = pipeline | "ReadMongo" >> ReadFromMongoDB(
#     uri="mongodb://localhost:27017",
#     db="app_db",
#     collection="users",
#     filter={"status": "active"}
# )

7. Code Explanation

uri defines connection parameters (protocol, host, port).
filter allows querying only subsets of documents, reducing ingestion volume.

8. Real Production Example

Writing processed analytics dictionaries directly back to a MongoDB logging collection:

python

from apache_beam.io.mongodb import WriteToMongoDB

# data = PCollection of parsed analytics logs (dictionaries)
# data | "WriteMongo" >> WriteToMongoDB(
#     uri="mongodb://admin:pass@mongodb-host:27017/admin",
#     db="analytics",
#     collection="summary_metrics",
#     batch_size=100
# )

9. Common Mistakes

BSON Incompatibilities: Trying to write native Python objects (like custom class objects) that cannot be converted to BSON. Stick to dictionaries, strings, lists, integers, and floats.
Unindexed Filters: Using complex filters in ReadFromMongoDB that do not have matching indexes, which forces MongoDB to run slow full-collection scans.

10. Interview Perspective

Question: How does Beam parallelize reading from a single MongoDB collection?
Answer: The connector splits the collection by matching ObjectId boundaries. It queries the collection size, estimates key ranges, and distributes these ranges as query bounds across parallel worker splits.
Question: Does writing to MongoDB overwrite existing documents?
Answer: Standard write transforms execute bulk insert operations. To perform upserts (update if exists), you would configure target parameters or write a custom DoFn that executes update statements.

11. Best Practices

Always define appropriate index structures on MongoDB collections for query filters.
Set a reasonable batch_size (e.g., 100-500) to optimize bulk insertion performance.

12. Summary

MongoDB IO manages document reads and writes.
Uses ObjectId ranges to parallelize reader queries.
Accepts native Python dictionaries, serializing them to BSON records.

13. Interactive Challenges

Challenge 1: MongoDB URI Definition (Beginner)

Write down the connection URI string to connect to MongoDB running locally on default port 27017.

Challenge 2: Query Filter Setup (Intermediate)

Configure ReadFromMongoDB properties to filter documents where "verified" is True.

Challenge 3: Bulk Writer Setup (Advanced)

Write a write step that outputs dictionaries to database "production", collection "events", setting batch write size to 250.

14. Related Content

Previous LessonElasticsearch IO

Next LessonOthers