BigQuery IO — Learn Apache Beam

1. Introduction

Google Cloud BigQuery is a serverless, highly scalable enterprise data warehouse. Apache Beam features a comprehensive, optimized connector to read from and write to BigQuery tables in both batch and streaming pipelines.

2. Why This Concept Exists

Data processing pipelines often deliver aggregated logs, transaction history, or user metrics to data warehouses for analytics. BigQuery is the primary analytics warehouse on Google Cloud. Beam's BigQuery connector handles massive bulk loads (batch) and real-time streaming insert APIs (streaming) transparently.

3. Key Terminology

Table Reference: The path string pointing to a BigQuery table, structured as: "project:dataset.table".
Write Disposition: Setting that decides what to do if the target table already exists (e.g. append, overwrite, or fail).
Table Schema: A description of columns, types, and modes representing the target database structure.

4. How It Works

Read: You can read directly from an entire table or run a custom SQL query to filter records early. Results are returned as a PCollection of Python dictionaries, where keys represent columns.
Write: Pass PCollections of dictionaries to WriteToBigQuery. Beam handles batch uploading or streaming inserts using Google APIs.

5. Visual Diagram

BigQuery Tables
Read source query

➔

Dataflow Workers
Map & Aggregate

➔

WriteToBigQuery
target:dataset.table

6. Code Example

Writing transaction records matching a table schema to BigQuery:

python

import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery

# 1. Define table schema structure
table_schema = {
    "fields": [
        {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
        {"name": "amount", "type": "FLOAT", "mode": "NULLABLE"},
        {"name": "timestamp", "type": "TIMESTAMP", "mode": "NULLABLE"}
    ]
}

# 2. Apply write transform inside pipeline
# output = parsed_records | WriteToBigQuery(
#     "my-gcp-project:analytics_dataset.user_transactions",
#     schema=table_schema,
#     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
#     create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
# )

7. Code Explanation

table_schema describes column names, database types, and required modes.
WriteToBigQuery(...) takes the table reference "project:dataset.table".
write_disposition=WRITE_APPEND appends new records.
create_disposition=CREATE_IF_NEEDED creates the table if it does not already exist.

8. Real Production Example: Read with Query

In batch runs, running custom queries keeps network data transfers lightweight by filtering rows inside BigQuery before ingestion:

python

from apache_beam.io.gcp.bigquery import ReadFromBigQuery

query = "SELECT user_id, SUM(amount) as total FROM `my-project.dataset.sales` GROUP BY user_id"
user_totals = p | ReadFromBigQuery(query=query, use_standard_sql=True)

9. Common Mistakes

Type mismatches in dictionaries: BigQuery enforces strict schemas. If your dictionary has a string (e.g. "100.5") but the schema expects a float, the write operation will throw a database error. Parse your fields correctly.
Writing to wrong project scopes: Make sure your running service account has both BigQuery Admin permissions and project-level storage permissions.

10. Interview Perspective

Question: What is the difference between Write Dispositions?
Answer:
- WRITE_APPEND: Adds new rows to the existing table.
- WRITE_TRUNCATE: Clears all existing rows and overwrites the table with new data.
- WRITE_EMPTY: Fails the pipeline if the target table contains any records.
Question: How does BigQuery sink handle streaming versus batch?
Answer: Beam detects the runner environment automatically. If the pipeline is running in streaming mode, it uses BigQuery's high-speed Streaming Insert APIs. If batch, it triggers file load jobs using GCS staging files (which is free of load charges).

11. Best Practices

Use custom queries to filter row records early rather than loading entire raw tables.
Pin write settings using CREATE_IF_NEEDED and define schema objects explicitly.

12. Summary

Interacts with BigQuery tables.
Inputs/Outputs are mapped using dictionary structures.
Enforces schema mapping rules, supporting both batch uploads and streaming API inserts.

13. Interactive Challenges

Challenge 1: Read Table Directly (Beginner)

Write a code segment that reads all records from a BigQuery table named "sales_table" inside dataset "store_data" under GCP project "prod-analytics".

Challenge 2: Query Read with Standard SQL (Intermediate)

Write a BigQuery read segment that executes a custom SQL query to select user_id and region from a table path, using standard SQL syntax.

Challenge 3: Write Table with Append (Advanced)

Write a BigQuery write statement that appends dictionary records to table "prod-analytics:store_data.logs", creating it if needed using a schema dictionary log_schema.

14. Related Content

Previous LessonDatabases

Next LessonJDBC IO