Amazon Web Services offers S3 (Simple Storage Service), a scalable object storage service used to store files, datasets, backups, and logs.
Using Python, you can easily interact with Amazon S3 to upload, download, list, and delete files.
What is Amazon S3?
Amazon S3 is an object storage service that stores data as:
- Buckets → Containers for files
- Objects → Actual files stored inside buckets
Example structure:
my-bucket/
data.csv
reports/sales.xlsx
Why Use S3 in Data Engineering?
- Store raw data
- Store processed datasets
- Backup databases
- Store logs
- Build data lakes
S3 is highly scalable and cost-effective.
Step 1: Install Required Library
Python connects to AWS services through boto3, the official AWS SDK for Python. Install it with pip:
pip install boto3
Step 2: Configure AWS Credentials
You need:
- AWS Access Key
- Secret Access Key
- Region
Configure using:
aws configure
Or set environment variables.
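boto3 automatically picks up credentials from these standard environment variables. A minimal example (the key values and region below are placeholders):

```shell
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_KEY"
export AWS_DEFAULT_REGION="us-east-1"
```

In production, prefer IAM roles over long-lived keys wherever possible.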
Step 3: Connect to S3 Using Python
import boto3

s3 = boto3.client('s3')
Upload a File to S3
s3.upload_file('local_file.csv', 'my-bucket', 'data/local_file.csv')
Parameters (in order):
- Local file path
- Bucket name
- S3 object key
Download a File from S3
s3.download_file('my-bucket', 'data/local_file.csv', 'downloaded.csv')
List Files in a Bucket
response = s3.list_objects_v2(Bucket='my-bucket')
for obj in response.get('Contents', []):
    print(obj['Key'])
Delete a File
s3.delete_object(Bucket='my-bucket', Key='data/local_file.csv')
Reading CSV Directly from S3 Using Pandas
import pandas as pd

df = pd.read_csv('s3://my-bucket/data/local_file.csv')
For this to work, you need the s3fs library installed, which pandas uses under the hood to read s3:// paths.
Uploading DataFrame to S3
df.to_csv('output.csv', index=False)
s3.upload_file('output.csv', 'my-bucket', 'processed/output.csv')
Best Practices
- Never hardcode credentials
- Use IAM roles in production
- Organize bucket folders properly
- Enable versioning
- Set proper access permissions
- Use lifecycle policies for cost control
Real-World Use Case Example
ETL Workflow:
- Extract API data
- Save raw data to S3
- Transform data
- Store processed data back to S3
- Load into data warehouse
Interview Answer (Short Version)
Using Python with AWS S3 involves using the boto3 library to upload, download, and manage files in S3 buckets. It is commonly used in data engineering workflows for storing raw and processed datasets.
Final Summary
Python with AWS S3 allows you to:
- Automate file uploads and downloads
- Build cloud-based data pipelines
- Store large datasets
- Create scalable data lake architectures
It is an essential skill for modern cloud-based Data Engineering projects.