beginner

Amazon S3

6 min read | Last updated: 2026-07-01

1. Introduction

Amazon Simple Storage Service (S3) is an industry-standard object store. Apache Beam supports AWS S3 paths natively (s3://), enabling developers to ingest and store datasets in S3.

2. Why This Concept Exists

Many AWS architectures use S3 as their primary storage layer. Whether Beam runs on Amazon EMR, Apache Flink, or a local runner, pipelines need to read and write S3 objects directly. Beam's filesystem layer picks the S3 implementation from the path's scheme and calls the AWS SDK under the hood.

3. Key Terminology

  • s3:// URI: The scheme prefix denoting Amazon S3 bucket paths.
  • Boto3: The AWS SDK for Python used underneath to manage connections.
  • IAM Instance Profile: The mechanism that attaches an IAM role to an EC2 instance, so code running on that instance receives temporary credentials automatically.
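
To make the s3:// URI term concrete, here is a minimal sketch that splits a URI into its bucket and object key using only the Python standard library. The helper name split_s3_uri is hypothetical and is not part of Beam, which has its own path handling internally.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key). Illustrative helper only."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The leading slash separates bucket from key; it is not part of the key.
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_uri("s3://company-transactions/2026/jan.csv"))
# ('company-transactions', '2026/jan.csv')
```

The bucket is the first path component after the scheme, and everything after it is the object key, including any "folder" prefixes.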

4. How It Works

  • The environment must have the apache-beam[aws] extra installed.
  • The S3 filesystem adapter handles every path whose scheme is s3://.
  • Workers call the AWS SDK to read objects from S3 in parallel, with each worker handling its own byte range of the data.
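
The two ideas in the steps above, choosing a filesystem by scheme and dividing an object into byte ranges, can be sketched in plain Python. This is a simplified illustration, not Beam's implementation: scheme_of and byte_ranges are hypothetical names, and real text sources also adjust split points so that no line is cut in half.

```python
def scheme_of(path: str) -> str:
    """Return the URI scheme ('s3', 'gs', ...) or 'file' when none is given."""
    return path.split("://", 1)[0] if "://" in path else "file"


def byte_ranges(object_size: int, chunk_size: int) -> list[tuple[int, int]]:
    """Divide [0, object_size) into half-open (start, end) byte ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, object_size))
        for start in range(0, object_size, chunk_size)
    ]


print(scheme_of("s3://company-transactions/2026/jan.csv"))  # s3
print(byte_ranges(250, 100))  # [(0, 100), (100, 200), (200, 250)]
```

Each range could then be handed to a different worker, which is what lets a single large object be read in parallel.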

5. Visual Diagram

[Diagram: an AWS runner worker connected to an Amazon S3 bucket containing logs/ and output-shards/.]

6. Code Example

Reading transaction text files from Amazon S3 and writing output:

python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "ReadS3" >> beam.io.ReadFromText("s3://company-transactions/2026/*.csv")
     | "Clean" >> beam.Map(lambda line: line.strip())
     | "WriteS3" >> beam.io.WriteToText("s3://company-transactions/outputs/cleansed")
    )

7. Code Explanation

  • The path uses s3:// pointing to AWS resources.
  • Beam uses AWS credential chains to resolve permissions, matching credentials configured in the environment (e.g. ~/.aws/credentials).
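
When a pipeline fails with a permissions error, it helps to know which step of the credential chain is likely to apply on a given machine. The sketch below is a rough diagnostic under simplifying assumptions: likely_credential_source is a hypothetical helper, it covers only environment variables, the shared credentials file, and an attached role, and it does not validate the credentials it finds.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def likely_credential_source(
    env: Mapping[str, str] = os.environ,
    credentials_file: Optional[Path] = None,
) -> str:
    """Guess which step of the lookup chain would supply credentials."""
    # Step 1: explicit keys in environment variables.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"
    # Step 2: the shared credentials file, by default ~/.aws/credentials.
    if credentials_file is None:
        credentials_file = Path.home() / ".aws" / "credentials"
    if credentials_file.is_file():
        return "shared credentials file"
    # Step 3: nothing local was found, so only an attached role can help.
    return "instance or container role (if one is attached)"


print(likely_credential_source())
```

Running this on a worker and on a laptop often explains why a pipeline that works locally is denied access in production.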

8. Real Production Example

Configuring custom S3 options (such as the region or explicit credentials) using S3Options, the Python SDK's options class for S3:

python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, S3Options

options = PipelineOptions()
s3_options = options.view_as(S3Options)
s3_options.s3_region_name = "us-east-1"
# Named profiles are not an S3Options setting; boto3 normally picks one up
# from the AWS_PROFILE environment variable.

p = beam.Pipeline(options=options)
# Define S3 operations...

9. Common Mistakes

  • Missing package extra: Forgetting to install the apache-beam[aws] extra, which makes the Python SDK raise a ValueError saying it is unable to get a filesystem for the s3:// path.
  • Misconfigured worker credentials: Workers run in isolated subnets without access to AWS credentials, or their instance roles lack the required S3 permissions.
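
The first mistake can be caught before a pipeline is launched. The following preflight check is a minimal sketch that assumes boto3 is the module the S3 filesystem depends on, as noted in the terminology section; missing_modules is a hypothetical helper name.

```python
import importlib.util


def missing_modules(names=("boto3",)) -> list[str]:
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_modules()
if missing:
    print(f"Missing {missing}; try: pip install 'apache-beam[aws]'")
```

A check like this only confirms that the dependency is importable. It says nothing about credentials or bucket permissions, which still have to be verified separately.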

10. Interview Perspective

  • Question: How does Beam manage AWS S3 authentication?
  • Answer: Beam respects AWS credential lookup chains. It checks system environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), then ~/.aws/credentials, and finally the IAM role of the EC2 instance.
  • Question: Does writing to S3 support dynamic file names?
  • Answer: Yes. WriteToText works the same way on S3 and appends a shard suffix to the given prefix (for example, s3://bucket/path-00000-of-00002).
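
To illustrate the shard naming in the last answer, here is a small sketch that reproduces the pattern shown above (prefix, shard index, total shard count). It assumes the default naming scheme; shard_names is a hypothetical helper and does not call Beam.

```python
def shard_names(prefix: str, num_shards: int, suffix: str = "") -> list[str]:
    """List the file names produced for a prefix under the default pattern."""
    return [
        f"{prefix}-{index:05d}-of-{num_shards:05d}{suffix}"
        for index in range(num_shards)
    ]


print(shard_names("s3://bucket/path", 2))
# ['s3://bucket/path-00000-of-00002', 's3://bucket/path-00001-of-00002']
```

Because the total shard count is part of every name, downstream jobs can tell from any single file how many files to expect.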

11. Best Practices

  • Run workers in the same AWS region as the S3 buckets to avoid cross-region data transfer fees and reduce latency.
  • Encrypt buckets using S3-managed keys (SSE-S3). This encryption is applied on the server side, so the pipeline does not need to handle any keys itself.

12. Summary

  • S3 integration uses s3:// path syntax.
  • Requires the apache-beam[aws] package.
  • Uses default AWS credential chains for authentication.

13. Interactive Challenges

Challenge 1: S3 Reader Setup (Beginner)

Write an Apache Beam reader statement that reads files from an S3 bucket named "sensor-archives" and pattern "raw/*.log".

Challenge 2: S3Options Configuration (Intermediate)

Configure S3Options to set the AWS region to "us-west-2".

Challenge 3: Sharded S3 Writer (Advanced)

Write a write step that outputs string elements to "s3://bucket/output/data" forcing the number of shards to be exactly 3.
