Python with AWS S3

Amazon Web Services offers S3 (Simple Storage Service), a scalable object storage service used to store files, datasets, backups, and logs.

Using Python, you can easily interact with Amazon S3 to upload, download, list, and delete files.

What is Amazon S3?

Amazon S3 is an object storage service that stores data as:

  • Buckets → Containers for files
  • Objects → Actual files stored inside buckets

Example structure:

my-bucket/
    data.csv
    reports/sales.xlsx

Why Use S3 in Data Engineering?

  • Store raw data
  • Store processed datasets
  • Backup databases
  • Store logs
  • Build data lakes

S3 is highly scalable and cost-effective.

Step 1: Install Required Library

Python connects to AWS services through boto3, the official AWS SDK for Python.

pip install boto3

Step 2: Configure AWS Credentials

You need:

  • AWS Access Key
  • Secret Access Key
  • Region

Configure using:

aws configure

Or set environment variables.
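
boto3 automatically picks up credentials from environment variables. A minimal sketch, using us-east-1 as an example region:

import boto3

# boto3 reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and
# AWS_DEFAULT_REGION from the environment if they are set
session = boto3.Session(region_name='us-east-1')  # example region
s3 = session.client('s3')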

Step 3: Connect to S3 Using Python

import boto3

s3 = boto3.client('s3')

Upload a File to S3

s3.upload_file('local_file.csv', 'my-bucket', 'data/local_file.csv')

Parameters:

  • Local file path
  • Bucket name
  • S3 object key
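
In real pipelines you will usually want to catch upload failures. A minimal sketch, assuming the same bucket and key as above; the optional ExtraArgs keyword sets object metadata such as the content type:

import boto3
from boto3.exceptions import S3UploadFailedError

s3 = boto3.client('s3')

try:
    s3.upload_file(
        'local_file.csv', 'my-bucket', 'data/local_file.csv',
        ExtraArgs={'ContentType': 'text/csv'},
    )
except S3UploadFailedError as err:
    # upload_file wraps transfer errors in S3UploadFailedError
    print(f'Upload failed: {err}')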

Download a File from S3

s3.download_file('my-bucket', 'data/local_file.csv', 'downloaded.csv')
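
If the object key does not exist, download_file raises a ClientError. A minimal sketch of handling that case, assuming the same bucket and key:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

try:
    s3.download_file('my-bucket', 'data/local_file.csv', 'downloaded.csv')
except ClientError as err:
    # error code 404 means the object key was not found in the bucket
    if err.response['Error']['Code'] == '404':
        print('Object not found')
    else:
        raise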

List Files in a Bucket

response = s3.list_objects_v2(Bucket='my-bucket')

for obj in response.get('Contents', []):
    print(obj['Key'])
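
Note that list_objects_v2 returns at most 1,000 keys per call. For larger buckets, a paginator handles the continuation tokens for you; a minimal sketch, using a data/ prefix as an example:

import boto3

s3 = boto3.client('s3')

# the paginator issues follow-up requests until all keys are returned
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='data/'):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['Size'])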

Delete a File

s3.delete_object(Bucket='my-bucket', Key='data/local_file.csv')
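
To remove several objects in one request, you can use delete_objects, which accepts up to 1,000 keys per call. A minimal sketch with hypothetical keys:

import boto3

s3 = boto3.client('s3')

response = s3.delete_objects(
    Bucket='my-bucket',
    Delete={'Objects': [
        {'Key': 'data/file1.csv'},   # hypothetical keys
        {'Key': 'data/file2.csv'},
    ]},
)
print(response.get('Deleted', []))  # keys that were removed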

Reading CSV Directly from S3 Using Pandas

import pandas as pd

df = pd.read_csv('s3://my-bucket/data/local_file.csv')

For this to work, pandas needs the s3fs library installed (pip install s3fs), since it handles s3:// paths.
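
If you prefer not to install s3fs, you can fetch the object with boto3 and parse it in memory. A minimal sketch, assuming the same bucket and key:

import io

import boto3
import pandas as pd

s3 = boto3.client('s3')

# read the raw object body and hand it to pandas as a file-like buffer
obj = s3.get_object(Bucket='my-bucket', Key='data/local_file.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))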

Uploading DataFrame to S3

df.to_csv('output.csv', index=False)
s3.upload_file('output.csv', 'my-bucket', 'processed/output.csv')
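
You can also skip the temporary file and upload the DataFrame from memory with put_object. A minimal sketch, assuming df already exists:

import io

import boto3

s3 = boto3.client('s3')

# serialize the DataFrame to an in-memory buffer instead of a local file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
s3.put_object(
    Bucket='my-bucket',
    Key='processed/output.csv',
    Body=buffer.getvalue(),
)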

Best Practices

  • Never hardcode credentials
  • Use IAM roles in production
  • Organize bucket folders properly
  • Enable versioning
  • Set proper access permissions
  • Use lifecycle policies for cost control

Real-World Use Case Example

ETL Workflow:

  1. Extract API data
  2. Save raw data to S3
  3. Transform data
  4. Store processed data back to S3
  5. Load into data warehouse
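
A minimal sketch of this workflow, assuming the requests library, a placeholder API endpoint, and hypothetical column names; step 5 typically happens downstream (for example, a warehouse COPY command reading from S3):

import io

import boto3
import pandas as pd
import requests

s3 = boto3.client('s3')
BUCKET = 'my-bucket'  # placeholder bucket name

# 1. Extract: pull records from a placeholder API endpoint
records = requests.get('https://api.example.com/sales', timeout=30).json()

# 2. Save the raw payload to S3
raw = pd.DataFrame(records)
s3.put_object(Bucket=BUCKET, Key='raw/sales.json',
              Body=raw.to_json(orient='records'))

# 3. Transform (hypothetical columns)
raw['total'] = raw['quantity'] * raw['price']

# 4. Store the processed data back to S3
buffer = io.StringIO()
raw.to_csv(buffer, index=False)
s3.put_object(Bucket=BUCKET, Key='processed/sales.csv', Body=buffer.getvalue())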

Interview Answer (Short Version)

Using Python with AWS S3 involves using the boto3 library to upload, download, and manage files in S3 buckets. It is commonly used in data engineering workflows for storing raw and processed datasets.

Final Summary

Python with AWS S3 allows you to:

  • Automate file uploads and downloads
  • Build cloud-based data pipelines
  • Store large datasets
  • Create scalable data lake architectures

It is an essential skill for modern cloud-based Data Engineering projects.
