intermediate
Avro
7 min readLast updated: 2026-07-01
1. Introduction
Apache Avro is a row-oriented, remote procedure call and data serialization framework. It is widely used in big data architectures because it stores rich schemas directly alongside binary records.
2. Why This Concept Exists
Text formats like JSON and CSV are slow to parse and consume significant disk space. Avro stores records in a highly compressed binary layout. Because the schema is stored in the file header, readers can dynamically decode records without worrying about schema drift.
3. Key Terminology
- Avro Schema: A JSON-based definition of record structure, fields, and types.
- Binary Encoding: The compact serialization format used for writing records.
- Fastavro: The underlying library commonly used in Python to read/write Avro metadata.
4. How It Works
- Reading:
ReadFromAvroreads binary blocks, parses the schema header, and outputs dictionaries matching the structure. - Writing:
WriteToAvrotakes dictionaries and encodes them into binary using a schema.
5. Visual Diagram
Avro JSON Schema
➔ (WriteToAvro)Compressed Disk File
➔ (ReadFromAvro)PCollection[Dict]
6. Code Example
Writing and reading Avro files using a schema:
python
import apache_beam as beam
from apache_beam.io.avroio import ReadFromAvro, WriteToAvro
# Define an Avro schema
schema = {
"type": "record",
"name": "User",
"fields": [
{"name": "id", "type": "int"},
{"name": "name", "type": "string"},
{"name": "email", "type": ["null", "string"], "default": None}
]
}
with beam.Pipeline() as p:
users = (p
| "CreateData" >> beam.Create([
{"id": 1, "name": "Alice", "email": "alice@example.com"},
{"id": 2, "name": "Bob", "email": None}
]))
# Write to Avro
users | "WriteAvro" >> WriteToAvro("data/users", schema=schema, file_name_suffix=".avro")
7. Code Explanation
- The schema dictionary defines types, constraints, and defaults.
WriteToAvrocompiles this schema and outputs a binary file ending in.avro.
8. Real Production Example
Reading records from GCS and extracting fields:
python
import apache_beam as beam
from apache_beam.io.avroio import ReadFromAvro
with beam.Pipeline() as p:
(p
| "ReadFromGCS" >> ReadFromAvro("gs://my-bucket/records/*.avro")
| "MapEmails" >> beam.Map(lambda record: record["email"])
| "Log" >> beam.Map(print)
)
9. Common Mistakes
- Schema Mismatches: Passing a dictionary that does not strictly match the Avro schema fields will crash the write step.
- Incorrect Null Handling: If a field can be null, the Avro schema type must be union-declared, e.g.,
["null", "string"].
10. Interview Perspective
- Question: Why is Avro preferred over JSON for stream ingestion like Kafka?
- Answer: Avro doesn't duplicate keys in every message; it relies on the pre-defined schema to deserialize byte arrays, saving significant bandwidth and storage space.
- Question: Can Beam split Avro files for parallel processing?
- Answer: Yes, Avro organizes data in sync markers between byte blocks. Beam workers read blocks independently by aligning to these markers.
11. Best Practices
- Leverage union types to handle optional values.
- Include docstrings and documentation inside schema fields for metadata tracking.
12. Summary
- Avro uses a binary row-oriented format.
- Schemas are embedded directly in file headers.
- Integrates natively with
ReadFromAvroandWriteToAvro.
13. Interactive Challenges
14. Related Content
Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)