Parquet — Learn Apache Beam

1. Introduction

Apache Parquet is a columnar storage format optimized for analytic querying. In Apache Beam, the Parquet connector enables pipelines to read and write high-performance datasets matching specific schemas.

2. Why This Concept Exists

For analytical queries (OLAP), row-oriented databases must read full rows even if only querying 2 columns. Parquet stores values column-by-column, allowing readers to skip irrelevant columns (projection) and read ranges of data efficiently.

3. Key Terminology

Columnar Layout: Organizing values of the same column contiguously on disk.
Projection Pushdown: Only loading specific columns from disk into memory.
PyArrow: The underlying Apache Arrow Python library used to interact with Parquet.

4. How It Works

Reading: ReadFromParquet leverages pyarrow to parse metadata, extract specified column arrays, and construct dictionaries.
Writing: WriteToParquet receives elements, formats them to PyArrow tables based on a PyArrow schema, and writes columnar files.

5. Visual Diagram

Parquet Columnar File
Grouped binary blocks

➔ (Projection Pushdown)

Target Reads
Loads only requested columns

6. Code Example

Reading specific columns from Parquet files:

python

import apache_beam as beam
from apache_beam.io.parquetio import ReadFromParquet

with beam.Pipeline() as p:
    (p
     | "ReadParquet" >> ReadFromParquet("data/*.parquet", columns=["user_id", "amount"])
     | "LogAmounts" >> beam.Map(print)
    )

7. Code Explanation

ReadFromParquet opens Parquet files.
The columns argument specifies columns to load, avoiding the overhead of reading the entire row schema.

8. Real Production Example

Writing a PCollection to Parquet with a PyArrow schema:

python

import pyarrow as pa
import apache_beam as beam
from apache_beam.io.parquetio import WriteToParquet

# Define pyarrow schema
pyarrow_schema = pa.schema([
    ("user_id", pa.int64()),
    ("amount", pa.float64())
])

with beam.Pipeline() as p:
    data = (p | beam.Create([
        {"user_id": 101, "amount": 45.50},
        {"user_id": 102, "amount": 90.00}
    ]))
    
    data | "WriteParquet" >> WriteToParquet(
        "gs://my-bucket/payments", 
        schema=pyarrow_schema, 
        file_name_suffix=".parquet"
    )

9. Common Mistakes

Skipping PyArrow schema: Writing Parquet files requires a defined pyarrow schema. Skipping this will trigger runtime errors.
Reading unused columns: Not passing a projection list when reading columns loads everything into memory, degrading network performance.

10. Interview Perspective

Question: Why is Parquet highly compressed compared to CSV?
Answer: Contiguous columnar data consists of identical types. This allows compilers to use compression techniques like dictionary encoding and Run-Length Encoding (RLE).
Question: How does Parquet's metadata assist split evaluation?
Answer: Parquet contains header/footer metadata describing row count and byte offsets per row group, allowing Beam to split reads cleanly.

11. Best Practices

Always define columns to read via projection.
Use standard PyArrow types to model schemas.

12. Summary

Parquet is a columnar binary file format.
Uses Projection Pushdown to save CPU and network bandwidth.
Requires PyArrow schema setup for writing.

13. Interactive Challenges

Challenge 1: Parquet Reader (Beginner)

Write a transform statement to read all Parquet files from "data/events/".

Challenge 2: Columns Projection (Intermediate)

Configure ReadFromParquet to read only "timestamp" and "status" columns from "data/metrics.parquet".

Challenge 3: PyArrow Schema Definition (Advanced)

Write a PyArrow schema containing a string column "name" and a boolean column "active".

14. Related Content

Avro
File IO

Previous LessonAvro

Next LessonCloud Storage