intermediate
Parquet
7 min readLast updated: 2026-07-01
1. Introduction
Apache Parquet is a columnar storage format optimized for analytic querying. In Apache Beam, the Parquet connector enables pipelines to read and write high-performance datasets matching specific schemas.
2. Why This Concept Exists
For analytical queries (OLAP), row-oriented databases must read full rows even if only querying 2 columns. Parquet stores values column-by-column, allowing readers to skip irrelevant columns (projection) and read ranges of data efficiently.
3. Key Terminology
- Columnar Layout: Organizing values of the same column contiguously on disk.
- Projection Pushdown: Only loading specific columns from disk into memory.
- PyArrow: The underlying Apache Arrow Python library used to interact with Parquet.
4. How It Works
- Reading:
ReadFromParquetleveragespyarrowto parse metadata, extract specified column arrays, and construct dictionaries. - Writing:
WriteToParquetreceives elements, formats them to PyArrow tables based on a PyArrow schema, and writes columnar files.
5. Visual Diagram
Parquet Columnar File
Grouped binary blocks
Target Reads
Loads only requested columns
6. Code Example
Reading specific columns from Parquet files:
python
import apache_beam as beam
from apache_beam.io.parquetio import ReadFromParquet
with beam.Pipeline() as p:
(p
| "ReadParquet" >> ReadFromParquet("data/*.parquet", columns=["user_id", "amount"])
| "LogAmounts" >> beam.Map(print)
)
7. Code Explanation
ReadFromParquetopens Parquet files.- The
columnsargument specifies columns to load, avoiding the overhead of reading the entire row schema.
8. Real Production Example
Writing a PCollection to Parquet with a PyArrow schema:
python
import pyarrow as pa
import apache_beam as beam
from apache_beam.io.parquetio import WriteToParquet
# Define pyarrow schema
pyarrow_schema = pa.schema([
("user_id", pa.int64()),
("amount", pa.float64())
])
with beam.Pipeline() as p:
data = (p | beam.Create([
{"user_id": 101, "amount": 45.50},
{"user_id": 102, "amount": 90.00}
]))
data | "WriteParquet" >> WriteToParquet(
"gs://my-bucket/payments",
schema=pyarrow_schema,
file_name_suffix=".parquet"
)
9. Common Mistakes
- Skipping PyArrow schema: Writing Parquet files requires a defined
pyarrowschema. Skipping this will trigger runtime errors. - Reading unused columns: Not passing a projection list when reading columns loads everything into memory, degrading network performance.
10. Interview Perspective
- Question: Why is Parquet highly compressed compared to CSV?
- Answer: Contiguous columnar data consists of identical types. This allows compilers to use compression techniques like dictionary encoding and Run-Length Encoding (RLE).
- Question: How does Parquet's metadata assist split evaluation?
- Answer: Parquet contains header/footer metadata describing row count and byte offsets per row group, allowing Beam to split reads cleanly.
11. Best Practices
- Always define columns to read via projection.
- Use standard PyArrow types to model schemas.
12. Summary
- Parquet is a columnar binary file format.
- Uses Projection Pushdown to save CPU and network bandwidth.
- Requires PyArrow schema setup for writing.
13. Interactive Challenges
14. Related Content
Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)