Google Cloud BigQuery
1. Introduction
Google Cloud BigQuery is a serverless, highly scalable enterprise data warehouse. Apache Beam features a comprehensive, optimized connector to read from and write to BigQuery tables in both batch and streaming pipelines.
2. Why This Concept Exists
Data processing pipelines often deliver aggregated logs, transaction history, or user metrics to data warehouses for analytics. BigQuery is the primary analytics warehouse on Google Cloud. Beam's BigQuery connector handles massive bulk loads (batch) and real-time streaming insert APIs (streaming) transparently.
3. Key Terminology
- Table Reference: The path string pointing to a BigQuery table, structured as:
"project:dataset.table". - Write Disposition: Setting that decides what to do if the target table already exists (e.g. append, overwrite, or fail).
- Table Schema: A description of columns, types, and modes representing the target database structure.
4. How It Works
- Read: You can read directly from an entire table or run a custom SQL query to filter records early. Results are returned as a PCollection of Python dictionaries, where keys represent columns.
- Write: Pass PCollections of dictionaries to
WriteToBigQuery. Beam handles batch uploading or streaming inserts using Google APIs.
5. Visual Diagram
BigQuery Tables
Read source query
Dataflow Workers
Map & Aggregate
WriteToBigQuery
target:dataset.table
6. Code Example
Writing transaction records matching a table schema to BigQuery:
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery
# 1. Define table schema structure
table_schema = {
"fields": [
{"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
{"name": "amount", "type": "FLOAT", "mode": "NULLABLE"},
{"name": "timestamp", "type": "TIMESTAMP", "mode": "NULLABLE"}
]
}
# 2. Apply write transform inside pipeline
# output = parsed_records | WriteToBigQuery(
# "my-gcp-project:analytics_dataset.user_transactions",
# schema=table_schema,
# write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
# create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
# )
7. Code Explanation
table_schemadescribes column names, database types, and required modes.WriteToBigQuery(...)takes the table reference"project:dataset.table".write_disposition=WRITE_APPENDappends new records.create_disposition=CREATE_IF_NEEDEDcreates the table if it does not already exist.
8. Real Production Example: Read with Query
In batch runs, running custom queries keeps network data transfers lightweight by filtering rows inside BigQuery before ingestion:
from apache_beam.io.gcp.bigquery import ReadFromBigQuery
query = "SELECT user_id, SUM(amount) as total FROM `my-project.dataset.sales` GROUP BY user_id"
user_totals = p | ReadFromBigQuery(query=query, use_standard_sql=True)
9. Common Mistakes
- Type mismatches in dictionaries: BigQuery enforces strict schemas. If your dictionary has a string (e.g.
"100.5") but the schema expects a float, the write operation will throw a database error. Parse your fields correctly. - Writing to wrong project scopes: Make sure your running service account has both BigQuery Admin permissions and project-level storage permissions.
10. Interview Perspective
- Question: What is the difference between Write Dispositions?
- Answer:
WRITE_APPEND: Adds new rows to the existing table.WRITE_TRUNCATE: Clears all existing rows and overwrites the table with new data.WRITE_EMPTY: Fails the pipeline if the target table contains any records.
- Question: How does BigQuery sink handle streaming versus batch?
- Answer: Beam detects the runner environment automatically. If the pipeline is running in streaming mode, it uses BigQuery's high-speed Streaming Insert APIs. If batch, it triggers file load jobs using GCS staging files (which is free of load charges).
11. Best Practices
- Use custom queries to filter row records early rather than loading entire raw tables.
- Pin write settings using
CREATE_IF_NEEDEDand define schema objects explicitly.
12. Summary
- Interacts with BigQuery tables.
- Inputs/Outputs are mapped using dictionary structures.
- Enforces schema mapping rules, supporting both batch uploads and streaming API inserts.