beginner
Amazon S3
6 min readLast updated: 2026-07-01
1. Introduction
Amazon Simple Storage Service (S3) is an industry-standard object store. Apache Beam supports AWS S3 paths natively (s3://), enabling developers to ingest and store datasets in S3.
2. Why This Concept Exists
AWS architectures use S3 as their default storage backend. Whether running Beam on Amazon EMR, Apache Flink, or local runners, pipelines need to directly interface with S3 keys. Beam's filesystem layer integrates with the AWS SDK, resolving paths dynamically.
3. Key Terminology
- s3:// URI: The scheme prefix denoting Amazon S3 bucket paths.
- Boto3: The AWS SDK for Python used underneath to manage connections.
- IAM Instance Profile: Credentials associated with AWS VMs executing the code.
4. How It Works
- The environment requires
apache-beam[aws]installed. - S3 filesystem adapters intercept paths matching
s3://. - Workers execute AWS SDK commands, parsing objects from S3 in parallel by dividing data byte-offsets.
5. Visual Diagram
AWS Runner Worker
➔ (Parallel s3:// API)Amazon S3 Bucket
logs/ & output-shards/
6. Code Example
Reading transaction text files from Amazon S3 and writing output:
python
import apache_beam as beam
with beam.Pipeline() as p:
(p
| "ReadS3" >> beam.io.ReadFromText("s3://company-transactions/2026/*.csv")
| "Clean" >> beam.Map(lambda line: line.strip())
| "WriteS3" >> beam.io.WriteToText("s3://company-transactions/outputs/cleansed")
)
7. Code Explanation
- The path uses
s3://pointing to AWS resources. - Beam uses AWS credential chains to resolve permissions, matching credentials configured in the environment (e.g.
~/.aws/credentials).
8. Real Production Example
Configuring custom AWS options (like Region or custom credentials) using AwsOptions:
python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, AwsOptions
options = PipelineOptions()
aws_options = options.view_as(AwsOptions)
aws_options.aws_region = "us-east-1"
# Optionally use a profile: aws_options.aws_profile_name = "prod-profile"
p = beam.Pipeline(options=options)
# Define S3 operations...
9. Common Mistakes
- Missing package extra: Forgetting to install
apache-beam[aws]package, which results inValueError: No filesystem found for scheme s3. - Configuring credentials on worker VMs improperly: Workers running in isolated subnets without access to AWS credentials, or instance roles missing S3 permissions.
10. Interview Perspective
- Question: How does Beam manage AWS S3 authentication?
- Answer: Beam respects AWS credential lookup chains. It checks system environment variables (
AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY), then~/.aws/credentials, and finally the IAM role of the EC2 instance. - Question: Does writing to S3 support dynamic files names?
- Answer: Yes,
WriteToTextworks the same way on S3, writing output files matching sharded strings (s3://bucket/path-00000-of-00002).
11. Best Practices
- Run workers in the same AWS region as the S3 buckets to eliminate cross-region network egress fees.
- Encrypt buckets using S3 managed keys (SSE-S3) and configure Beam to pass these keys if required.
12. Summary
- S3 integration uses
s3://path syntax. - Requires the
apache-beam[aws]package. - Uses default AWS credential chains for authentication.
13. Interactive Challenges
14. Related Content
Advertisement
AdSense Slot #000001Leaderboard Banner (728x90)