intermediate
MongoDB IO
7 min readLast updated: 2026-07-01
1. Introduction
MongoDB is a document-oriented NoSQL database that stores data in JSON-like documents with dynamic schemas. The MongoDB IO connector in Apache Beam allows pipelines to read documents from collections and write processed records back.
2. Why This Concept Exists
Modern applications use MongoDB to model nested structures without strict relational schemas. When ingesting clickstream events or user profiles, data pipelines need to dump or read records directly from collections. Beam manages connections, cursor timeouts, and bulk inserts across distributed worker clusters.
3. Key Terminology
- BSON: Binary JSON. The serialization format MongoDB uses to store documents.
- Collection: A grouping of MongoDB documents (equivalent to a table).
- ObjectId: A 12-byte identifier used by MongoDB to uniquely identify documents.
4. How It Works
- Reading:
ReadFromMongoDBqueries a collection. It uses split keys (like ranges of ObjectIds) to assign portions of the collection to different workers, downloading documents in parallel. - Writing:
WriteToMongoDBtakes Python dictionaries, translates them to BSON, and executes bulk inserts.
5. Visual Diagram
MongoDB collection splits query:
Partition 1 (ObjectId Range A)
Assigned to Worker 1
Partition 2 (ObjectId Range B)
Assigned to Worker 2
6. Code Example
Reading documents from a MongoDB database and writing matching output:
python
from apache_beam.io.mongodb import ReadFromMongoDB, WriteToMongoDB
# In your pipeline context:
# docs = pipeline | "ReadMongo" >> ReadFromMongoDB(
# uri="mongodb://localhost:27017",
# db="app_db",
# collection="users",
# filter={"status": "active"}
# )
7. Code Explanation
uridefines connection parameters (protocol, host, port).filterallows querying only subsets of documents, reducing ingestion volume.
8. Real Production Example
Writing processed analytics dictionaries directly back to a MongoDB logging collection:
python
from apache_beam.io.mongodb import WriteToMongoDB
# data = PCollection of parsed analytics logs (dictionaries)
# data | "WriteMongo" >> WriteToMongoDB(
# uri="mongodb://admin:pass@mongodb-host:27017/admin",
# db="analytics",
# collection="summary_metrics",
# batch_size=100
# )
9. Common Mistakes
- BSON Incompatibilities: Trying to write native Python objects (like custom class objects) that cannot be converted to BSON. Stick to dictionaries, strings, lists, integers, and floats.
- Unindexed Filters: Using complex filters in
ReadFromMongoDBthat do not have matching indexes, which forces MongoDB to run slow full-collection scans.
10. Interview Perspective
- Question: How does Beam parallelize reading from a single MongoDB collection?
- Answer: The connector splits the collection by matching ObjectId boundaries. It queries the collection size, estimates key ranges, and distributes these ranges as query bounds across parallel worker splits.
- Question: Does writing to MongoDB overwrite existing documents?
- Answer: Standard write transforms execute bulk insert operations. To perform upserts (update if exists), you would configure target parameters or write a custom
DoFnthat executes update statements.
11. Best Practices
- Always define appropriate index structures on MongoDB collections for query filters.
- Set a reasonable
batch_size(e.g., 100-500) to optimize bulk insertion performance.
12. Summary
- MongoDB IO manages document reads and writes.
- Uses ObjectId ranges to parallelize reader queries.
- Accepts native Python dictionaries, serializing them to BSON records.
13. Interactive Challenges
14. Related Content
Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)