CSV Files — Learn Apache Beam

1. Introduction

CSV (Comma-Separated Values) is one of the most common formats for exchanging tabular data. Apache Beam can read, parse, and write CSV files in parallel across cloud and local filesystems.

2. Why This Concept Exists

While CSV files are simple, parsing them at scale presents challenges. Large CSV files must be split across workers, headers must be skipped, and embedded commas or quotes must be handled correctly. Apache Beam takes care of splitting files and reading them in parallel; parsing each row is left to a CSV library or to the Beam Dataframe reader.

3. Key Terminology

Delimiter: The character separating fields (typically a comma ,).
Header Line: The first line in a CSV containing column names.
Dialect: The formatting rules (e.g., Excel, UNIX) defining how quotes and delimiters behave.
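
The terms above can be illustrated with Python's standard csv module. This is a minimal sketch; the sample lines are made up for illustration and are not taken from a real dataset.

```python
import csv

# One logical record, encoded with two different delimiters (sample data).
comma_line = '1,"Sanap, Anil",91.5'
semicolon_line = '1;Sanap, Anil;91.5'

# Default dialect ("excel"): comma delimiter, double-quote quoting.
comma_row = next(csv.reader([comma_line]))

# Same parser with the delimiter overridden for semicolon-separated data.
semicolon_row = next(csv.reader([semicolon_line], delimiter=";"))

# Names of the dialects registered with the csv module.
known_dialects = csv.list_dialects()

print(comma_row)
print(semicolon_row)
print(known_dialects)
```

Both lines parse to the same three fields, because the quoting in the first line and the different delimiter in the second each keep the embedded comma inside the name field.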

4. How It Works

Reading: ReadFromText reads CSV files as raw string lines. A mapping step parses each string line into fields using Python's standard csv library or a custom parser.
Writing: Dicts or lists are formatted into comma-delimited strings and written using WriteToText.
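
The writing step can be sketched with the standard library alone. The helper names format_csv_line and dict_to_csv_line are hypothetical, chosen for this example; the point is that csv.writer adds quotes where a plain ",".join would produce a broken row.

```python
import csv
import io

def format_csv_line(values):
    """Format one row (a list of values) as a CSV line without a line ending."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    # csv.writer appends a line terminator; strip it because the text sink
    # is expected to add its own newline after each element.
    return buffer.getvalue().rstrip("\r\n")

def dict_to_csv_line(record, columns):
    """Format a dict as a CSV line, emitting fields in the given column order."""
    return format_csv_line([record[column] for column in columns])

plain = dict_to_csv_line({"name": "Anil", "age": 30}, ["name", "age"])
quoted = dict_to_csv_line({"name": "Sanap, Anil", "age": 30}, ["name", "age"])
print(plain)
print(quoted)
```

In a pipeline, a function like this could be applied with beam.Map just before WriteToText, whose header argument can supply the column-name line.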

5. Visual Diagram

CSV File (data.csv, raw text source) ➔ ReadFromText ➔ csv.reader Map (parse columns & keys) ➔ PCollection[Dict] (formatted objects)

6. Code Example

Reading and parsing a CSV file into dictionaries:

python

import csv
import apache_beam as beam

def parse_csv(line):
    # csv.reader expects an iterable of lines, so wrap the single line in a list
    row = next(csv.reader([line]))
    # Field order follows the file's columns: id, name, score
    return {"id": row[0], "name": row[1], "score": float(row[2])}

with beam.Pipeline() as p:
    (p
     | "ReadCSV" >> beam.io.ReadFromText("data/users.csv", skip_header_lines=1)
     | "ParseCSV" >> beam.Map(parse_csv)
     | "PrintRows" >> beam.Map(print)
    )

7. Code Explanation

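skip_header_lines=1 tells ReadFromText to skip the first line of each matched file.
csv.reader([line]) handles quoted fields, embedded commas, and doubled quotes correctly, converting one line into a list of cell values. Because ReadFromText splits records on newlines, a quoted field that itself contains a line break arrives as separate lines and cannot be reassembled by this per-line approach.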
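
Since the header row is skipped, the column names and types have to come from somewhere else. One possible sketch keeps them in a small table and validates the field count; COLUMNS and parse_with_columns are hypothetical names, and the layout (id, name, score) is assumed to match the example file.

```python
import csv

# Assumed column layout for the example file: (name, converter) pairs.
COLUMNS = [("id", str), ("name", str), ("score", float)]

def parse_with_columns(line, columns=COLUMNS):
    """Parse one CSV line into a dict, applying one converter per column."""
    row = next(csv.reader([line]))
    if len(row) != len(columns):
        # Fail loudly on short or long rows instead of silently misaligning fields.
        raise ValueError(f"expected {len(columns)} fields, got {len(row)}: {line!r}")
    return {name: convert(value) for (name, convert), value in zip(columns, row)}

record = parse_with_columns('7,"Sanap, Anil",88.5')
print(record)
```

A function like this could replace parse_csv in a beam.Map step; whether to raise on malformed rows or route them elsewhere is a design choice that depends on the pipeline.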

8. Real Production Example

Using Beam Dataframes for schema-aware CSV handling:

python

import apache_beam as beam
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    # Use pandas-like syntax to read CSV
    df = p | "ReadDF" >> read_csv("gs://my-bucket/transactions.csv")
    pcoll = to_pcollection(df)
    
    # Process the schema-aware PCollection
    pcoll | "LogResults" >> beam.Map(print)

9. Common Mistakes

Splitting by commas manually: Using line.split(",") fails if a field contains an embedded comma (e.g., "Sanap, Anil"). Always use a proper CSV library like csv.reader.
Ignoring worker memory bounds: Loading a massive file as a single string and parsing it in one step can exhaust worker memory. Read it line by line with ReadFromText instead.
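
The first mistake is easy to reproduce. The short comparison below uses a made-up sample line and shows how a manual split miscounts fields where csv.reader does not.

```python
import csv

line = '7,"Sanap, Anil",88.5'  # sample row with a comma inside a quoted field

# Manual splitting cuts the quoted name in two and keeps the quote characters.
naive_fields = line.split(",")

# csv.reader treats the quoted text as a single field and removes the quotes.
parsed_fields = next(csv.reader([line]))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```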

10. Interview Perspective

Question: How do you handle nested quotes or delimiters inside a CSV field?
Answer: By using Python's native csv reader inside a DoFn, which respects CSV dialects, quoting rules, and escape characters.
Question: Can Beam parallelize reading a single massive CSV file?
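Answer: Yes, when the file is uncompressed. ReadFromText splits the file into byte ranges, and each worker starts reading at the first line boundary in its assigned range. Compressed files such as gzip cannot be split this way and are read by a single worker.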
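
The byte-range idea in the last answer can be simulated in plain Python. This is a simplified illustration of the principle, not Beam's actual implementation: read_lines_in_range is a hypothetical helper, the data is made up, and header skipping is left out.

```python
import io

def read_lines_in_range(data, start, end):
    """Return the text lines whose first byte falls in [start, end)."""
    stream = io.BytesIO(data)
    if start > 0:
        # Step back one byte and discard through the next newline. A line that
        # begins exactly at `start` is kept; a line that began earlier belongs
        # to the previous range and is skipped here.
        stream.seek(start - 1)
        stream.readline()
    lines = []
    while stream.tell() < end:
        line = stream.readline()
        if not line:  # end of data
            break
        lines.append(line.decode("utf-8").rstrip("\n"))
    return lines

data = b"1,Anil,91.5\n2,Priya,88.0\n3,Ravi,79.25\n"
split_point = 15  # deliberately lands in the middle of the second line

first_half = read_lines_in_range(data, 0, split_point)
second_half = read_lines_in_range(data, split_point, len(data))
print(first_half)
print(second_half)
```

Each line is owned by the range that contains its first byte, so the two simulated workers together produce every line exactly once even though the split point falls mid-line.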

11. Best Practices

When a file has a header row, skip it with skip_header_lines=1.
Use Beam Dataframes (read_csv) for a high-level, Pandas-like declarative schema.

12. Summary

CSV parsing requires standard libraries to handle complex formatting.
ReadFromText reads raw strings, which must be mapped to dictionaries.
Dataframes simplify CSV ingest by auto-inferring types and schemas.

13. Interactive Challenges

Challenge 1: Basic CSV Parser (Beginner)

Write a parser function simple_parse that takes a CSV line 123,Laptop,899.9 and returns a dictionary with keys item_id, item_name, and price (as float).

Challenge 2: CSV Row Formatter (Intermediate)

Write a Map transform that converts a dictionary {"name": "Anil", "age": 30} into a comma-separated string "Anil,30" for CSV output.

Challenge 3: Robust CSV Reader (Advanced)

Write a complete pipeline block using a context manager to read CSV records from "data/customers.csv", skip the first header line, parse fields using csv.reader, and print them.

14. Related Content

Previous Lesson: TextIO

Next Lesson: JSON Files