intermediate

Parquet

7 min readLast updated: 2026-07-01

1. Introduction

Apache Parquet is a columnar storage format optimized for analytic querying. In Apache Beam, the Parquet connector enables pipelines to read and write high-performance datasets matching specific schemas.

2. Why This Concept Exists

For analytical queries (OLAP), row-oriented databases must read full rows even if only querying 2 columns. Parquet stores values column-by-column, allowing readers to skip irrelevant columns (projection) and read ranges of data efficiently.

3. Key Terminology

  • Columnar Layout: Organizing values of the same column contiguously on disk.
  • Projection Pushdown: Only loading specific columns from disk into memory.
  • PyArrow: The underlying Apache Arrow Python library used to interact with Parquet.

4. How It Works

  • Reading: ReadFromParquet leverages pyarrow to parse metadata, extract specified column arrays, and construct dictionaries.
  • Writing: WriteToParquet receives elements, formats them to PyArrow tables based on a PyArrow schema, and writes columnar files.

5. Visual Diagram

Parquet Columnar File
Grouped binary blocks

Target Reads
Loads only requested columns

6. Code Example

Reading specific columns from Parquet files:

python
import apache_beam as beam
from apache_beam.io.parquetio import ReadFromParquet

with beam.Pipeline() as p:
    (p
     | "ReadParquet" >> ReadFromParquet("data/*.parquet", columns=["user_id", "amount"])
     | "LogAmounts" >> beam.Map(print)
    )

7. Code Explanation

  • ReadFromParquet opens Parquet files.
  • The columns argument specifies columns to load, avoiding the overhead of reading the entire row schema.

8. Real Production Example

Writing a PCollection to Parquet with a PyArrow schema:

python
import pyarrow as pa
import apache_beam as beam
from apache_beam.io.parquetio import WriteToParquet

# Define pyarrow schema
pyarrow_schema = pa.schema([
    ("user_id", pa.int64()),
    ("amount", pa.float64())
])

with beam.Pipeline() as p:
    data = (p | beam.Create([
        {"user_id": 101, "amount": 45.50},
        {"user_id": 102, "amount": 90.00}
    ]))
    
    data | "WriteParquet" >> WriteToParquet(
        "gs://my-bucket/payments", 
        schema=pyarrow_schema, 
        file_name_suffix=".parquet"
    )

9. Common Mistakes

  • Skipping PyArrow schema: Writing Parquet files requires a defined pyarrow schema. Skipping this will trigger runtime errors.
  • Reading unused columns: Not passing a projection list when reading columns loads everything into memory, degrading network performance.

10. Interview Perspective

  • Question: Why is Parquet highly compressed compared to CSV?
  • Answer: Contiguous columnar data consists of identical types. This allows compilers to use compression techniques like dictionary encoding and Run-Length Encoding (RLE).
  • Question: How does Parquet's metadata assist split evaluation?
  • Answer: Parquet contains header/footer metadata describing row count and byte offsets per row group, allowing Beam to split reads cleanly.

11. Best Practices

  • Always define columns to read via projection.
  • Use standard PyArrow types to model schemas.

12. Summary

  • Parquet is a columnar binary file format.
  • Uses Projection Pushdown to save CPU and network bandwidth.
  • Requires PyArrow schema setup for writing.

13. Interactive Challenges

Challenge 1: Parquet Reader (Beginner)

Write a transform statement to read all Parquet files from "data/events/".

Challenge 2: Columns Projection (Intermediate)

Configure ReadFromParquet to read only "timestamp" and "status" columns from "data/metrics.parquet".

Challenge 3: PyArrow Schema Definition (Advanced)

Write a PyArrow schema containing a string column "name" and a boolean column "active".

14. Related Content

Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)