Avro — Learn Apache Beam

1. Introduction

Apache Avro is a row-oriented, remote procedure call and data serialization framework. It is widely used in big data architectures because it stores rich schemas directly alongside binary records.

2. Why This Concept Exists

Text formats like JSON and CSV are slow to parse and consume significant disk space. Avro stores records in a highly compressed binary layout. Because the schema is stored in the file header, readers can dynamically decode records without worrying about schema drift.

3. Key Terminology

Avro Schema: A JSON-based definition of record structure, fields, and types.
Binary Encoding: The compact serialization format used for writing records.
Fastavro: The underlying library commonly used in Python to read/write Avro metadata.

4. How It Works

Reading: ReadFromAvro reads binary blocks, parses the schema header, and outputs dictionaries matching the structure.
Writing: WriteToAvro takes dictionaries and encodes them into binary using a schema.

5. Visual Diagram

Avro JSON Schema

➔ (WriteToAvro)

Compressed Disk File

➔ (ReadFromAvro)

PCollection[Dict]

6. Code Example

Writing and reading Avro files using a schema:

python

import apache_beam as beam
from apache_beam.io.avroio import ReadFromAvro, WriteToAvro

# Define an Avro schema
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None}
    ]
}

with beam.Pipeline() as p:
    users = (p
             | "CreateData" >> beam.Create([
                 {"id": 1, "name": "Alice", "email": "alice@example.com"},
                 {"id": 2, "name": "Bob", "email": None}
             ]))
             
    # Write to Avro
    users | "WriteAvro" >> WriteToAvro("data/users", schema=schema, file_name_suffix=".avro")

7. Code Explanation

The schema dictionary defines types, constraints, and defaults.
WriteToAvro compiles this schema and outputs a binary file ending in .avro.

8. Real Production Example

Reading records from GCS and extracting fields:

python

import apache_beam as beam
from apache_beam.io.avroio import ReadFromAvro

with beam.Pipeline() as p:
    (p
     | "ReadFromGCS" >> ReadFromAvro("gs://my-bucket/records/*.avro")
     | "MapEmails" >> beam.Map(lambda record: record["email"])
     | "Log" >> beam.Map(print)
    )

9. Common Mistakes

Schema Mismatches: Passing a dictionary that does not strictly match the Avro schema fields will crash the write step.
Incorrect Null Handling: If a field can be null, the Avro schema type must be union-declared, e.g., ["null", "string"].

10. Interview Perspective

Question: Why is Avro preferred over JSON for stream ingestion like Kafka?
Answer: Avro doesn't duplicate keys in every message; it relies on the pre-defined schema to deserialize byte arrays, saving significant bandwidth and storage space.
Question: Can Beam split Avro files for parallel processing?
Answer: Yes, Avro organizes data in sync markers between byte blocks. Beam workers read blocks independently by aligning to these markers.

11. Best Practices

Leverage union types to handle optional values.
Include docstrings and documentation inside schema fields for metadata tracking.

12. Summary

Avro uses a binary row-oriented format.
Schemas are embedded directly in file headers.
Integrates natively with ReadFromAvro and WriteToAvro.

13. Interactive Challenges

Challenge 1: Read Avro (Beginner)

Write an Apache Beam reader statement that reads all Avro files matching the path "data/*.avro".

Challenge 2: Schema Union Setup (Intermediate)

Define a schema dictionary field configuration for an optional integer named "age" that defaults to None.

Challenge 3: Write Avro with Schema (Advanced)

Write a pipeline write step that saves a PCollection to "output/users" using WriteToAvro with a schema containing "id" (int) and "name" (string).

14. Related Content

Previous LessonJSON Files

Next LessonParquet