Data transformation is the process of cleaning, modifying, and preparing raw data for analysis.
Pandas is one of the most powerful Python libraries for data transformation and manipulation.
It is widely used in:
Data Engineering
Data Analysis
Machine Learning
ETL Pipelines
1. Loading Data
Before transforming, we load the data.
import pandas as pd
df = pd.read_csv("sales.csv")
print(df.head())
Now the data is ready for transformation.
2. Handling Missing Values
Check missing values:
print(df.isnull().sum())
Remove every row that contains at least one missing value:
df = df.dropna()
Fill missing values:
df["price"] = df["price"].fillna(0)
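Dropping every incomplete row can discard more data than intended. The following is a minimal sketch of a more selective approach, assuming a small made-up table (the column names and values are hypothetical): rows are dropped only when a key column is missing, and the remaining gaps are filled column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
df = pd.DataFrame({
    "product": ["A", "B", None, "D"],
    "price": [120.0, np.nan, 80.0, 200.0],
    "quantity": [2, 1, 3, np.nan],
})

# Count missing values per column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Drop rows only when the key column "product" is missing.
df_clean = df.dropna(subset=["product"]).copy()

# Fill the remaining gaps with a value that suits each column.
df_clean["price"] = df_clean["price"].fillna(df_clean["price"].median())
df_clean["quantity"] = df_clean["quantity"].fillna(0)
print(df_clean)
```

Whether the median, zero, or some other value is an appropriate fill depends on what the column means, so treat these choices as placeholders.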
3. Removing Duplicates
df = df.drop_duplicates()
This removes rows that are identical across all columns and keeps the first occurrence of each.
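Duplicates are not always exact copies. As a sketch, assuming a hypothetical order_id column that should be unique, the subset and keep parameters of drop_duplicates control which columns define a duplicate and which row survives:

```python
import pandas as pd

# Hypothetical sample data; order_id is assumed to identify an order.
df = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "product": ["A", "A", "B", "C", "C"],
    "price": [100, 100, 150, 200, 210],
})

# Count rows that exactly repeat an earlier row.
exact_duplicate_count = df.duplicated().sum()

# Exact duplicates only: the two identical order_id 1 rows collapse into one.
exact = df.drop_duplicates()

# Duplicates judged by order_id alone, keeping the last row seen for each id.
by_key = df.drop_duplicates(subset=["order_id"], keep="last")

print(exact_duplicate_count)
print(exact)
print(by_key)
```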
4. Filtering Data
Filter rows:
df = df[df["price"] > 100]
Filter multiple conditions:
df = df[(df["price"] > 100) & (df["quantity"] >= 2)]
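Conditions can also be combined with OR and NOT. The sketch below uses a made-up table (values are hypothetical) to show the three boolean operators side by side, plus isin for membership tests:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "product": ["A", "B", "C", "D"],
    "price": [50, 120, 300, 150],
    "quantity": [5, 1, 2, 4],
})

# AND: both conditions must hold. The parentheses are required because
# & binds more tightly than comparison operators such as >.
both = df[(df["price"] > 100) & (df["quantity"] >= 2)]

# OR: at least one condition must hold.
either = df[(df["price"] > 100) | (df["quantity"] >= 2)]

# NOT combined with a membership test: exclude specific products.
excluded = df[~df["product"].isin(["A", "B"])]

print(both)
print(either)
print(excluded)
```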
5. Selecting Columns
df = df[["product", "price", "quantity"]]
Keeping only the columns you need reduces memory usage and can speed up later operations.
6. Creating New Columns
df["total"] = df["price"] * df["quantity"]
This creates a calculated column.
7. Changing Data Types
df["quantity"] = df["quantity"].astype("int32")
df["date"] = pd.to_datetime(df["date"])
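Correct data types reduce memory usage and prevent errors such as numbers being compared as strings. Note that astype("int32") raises an error if the column still contains missing values, so handle those first.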
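Real files often contain values that cannot be converted. A minimal sketch, assuming hypothetical messy input, uses errors="coerce" so that invalid entries become NaN or NaT instead of stopping the conversion:

```python
import pandas as pd

# Hypothetical messy input: everything arrives as text.
df = pd.DataFrame({
    "quantity": ["2", "5", "n/a"],
    "date": ["2024-01-15", "2024-02-01", "not a date"],
})

# Invalid entries become NaN / NaT instead of raising an error.
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# A plain int32 column cannot hold NaN, so fill (or drop) before casting.
df["quantity"] = df["quantity"].fillna(0).astype("int32")

print(df.dtypes)
print(df)
```

Filling with 0 is only an illustration here; dropping the affected rows may be the better choice for your data.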
8. Renaming Columns
df = df.rename(columns={"product_name": "product"})
9. Grouping and Aggregation
Group data and calculate totals:
summary = df.groupby("product")["total"].sum()
print(summary)
Multiple aggregations:
summary = df.groupby("product").agg({
    "total": "sum",
    "quantity": "mean"
})
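With the dictionary form, the result columns keep their original names, which can hide what each number means. As an alternative sketch on made-up data (the output column names are my own choice), named aggregation lets you label each result explicitly:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "product": ["A", "A", "B", "B", "B"],
    "total": [200, 300, 150, 150, 300],
    "quantity": [2, 3, 1, 1, 2],
})

# Each keyword is an output column: name=(source column, aggregation).
summary = (
    df.groupby("product")
    .agg(
        total_revenue=("total", "sum"),
        avg_quantity=("quantity", "mean"),
        order_count=("total", "count"),
    )
    .reset_index()
)

print(summary)
```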
10. Sorting Data
df = df.sort_values(by="total", ascending=False)
11. Merging DataFrames
Combine two datasets:
df1 = pd.read_csv("customers.csv")
df2 = pd.read_csv("orders.csv")
merged = pd.merge(df1, df2, on="customer_id", how="inner")
Join types:
inner
left
right
outer
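The four join types differ in which unmatched rows they keep. The sketch below uses two small made-up tables (names and values are hypothetical) and compares the row counts each join produces. The indicator=True option of pd.merge adds a _merge column that shows where each row came from.

```python
import pandas as pd

# Hypothetical data: customer 1 has two orders, customers 2 and 3 have none,
# and customer 4 appears only in the orders table.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 4], "amount": [50, 70, 20]})

# Number of rows produced by each join type.
row_counts = {
    how: len(pd.merge(customers, orders, on="customer_id", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)

# An outer join with an indicator column helps audit unmatched keys.
audit = pd.merge(customers, orders, on="customer_id", how="outer", indicator=True)
print(audit)
```

Checking row counts before and after a merge is a quick way to catch unexpected duplication or data loss.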
12. Applying Custom Functions
df["discounted_price"] = df["price"].apply(lambda x: x * 0.9)
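apply runs a Python function on each value, which lets you use custom logic on a column. For simple arithmetic like this, the vectorized form df["price"] * 0.9 gives the same result and is faster; reserve apply for logic that cannot be expressed with vectorized operations.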
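To make the comparison concrete, here is a short sketch on made-up prices. The conditional discount rule (10% off only above 100) is a hypothetical example of logic that can stay vectorized with numpy.where.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"price": [50.0, 120.0, 300.0]})

# Element-wise apply: calls the lambda once per value.
via_apply = df["price"].apply(lambda x: x * 0.9)

# Equivalent vectorized arithmetic on the whole column at once.
vectorized = df["price"] * 0.9

# Conditional logic without apply: discount only prices above 100.
df["discounted_price"] = np.where(df["price"] > 100, df["price"] * 0.9, df["price"])

print(df)
```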
13. Pivot Tables
Create summary table:
pivot = df.pivot_table(
    values="total",
    index="product",
    columns="region",
    aggfunc="sum"
)
Useful for reporting and dashboards.
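As a small worked sketch on made-up data, the example below shows the shape of the result. Product/region combinations with no rows would normally appear as NaN; the fill_value parameter replaces them, here with 0.

```python
import pandas as pd

# Hypothetical sample data: product B has no sales in the West region.
df = pd.DataFrame({
    "product": ["A", "A", "B", "B"],
    "region": ["East", "West", "East", "East"],
    "total": [100, 200, 50, 75],
})

# One row per product, one column per region, summed totals in the cells.
pivot = df.pivot_table(
    values="total",
    index="product",
    columns="region",
    aggfunc="sum",
    fill_value=0,
)

print(pivot)
```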
14. Export Transformed Data
df.to_csv("cleaned_sales.csv", index=False)
Or save it to a database for analytics.
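One way to write to a database is DataFrame.to_sql. The sketch below assumes an in-memory SQLite database and a hypothetical table name, cleaned_sales; a production warehouse would typically use a SQLAlchemy engine for its own database instead.

```python
import sqlite3

import pandas as pd

# Hypothetical cleaned data.
df = pd.DataFrame({"product": ["A", "B"], "total": [500, 600]})

# In-memory SQLite database, used here only for illustration.
conn = sqlite3.connect(":memory:")

# Write the DataFrame to a table, replacing it if it already exists.
df.to_sql("cleaned_sales", conn, if_exists="replace", index=False)

# Read it back to confirm the write succeeded.
loaded = pd.read_sql("SELECT product, total FROM cleaned_sales ORDER BY product", conn)
conn.close()

print(loaded)
```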
Real-World ETL Example
Extract:
Load raw sales data
Transform:
Remove duplicates
Handle missing values
Calculate total revenue
Aggregate by product
Load:
Save cleaned data into data warehouse
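The steps above can be sketched as three small functions. This is only an illustration under stated assumptions: the raw data is a made-up CSV string rather than a real file, missing values are filled with 0, and the load step writes CSV text to an in-memory buffer in place of a data warehouse. The function names are my own.

```python
import io

import pandas as pd

# Hypothetical raw data standing in for a sales CSV file.
RAW_CSV = """product,price,quantity
A,100,2
A,100,2
B,50,
C,,1
B,50,3
"""


def extract(source):
    """Extract: load the raw sales data."""
    return pd.read_csv(source)


def transform(raw):
    """Transform: deduplicate, fill gaps, compute revenue, aggregate."""
    df = raw.drop_duplicates().copy()
    df["price"] = df["price"].fillna(0)
    df["quantity"] = df["quantity"].fillna(0)
    df["total"] = df["price"] * df["quantity"]
    return df.groupby("product", as_index=False)["total"].sum()


def load(df, target):
    """Load: write the result (a file path or buffer stands in for the warehouse)."""
    df.to_csv(target, index=False)


summary = transform(extract(io.StringIO(RAW_CSV)))
buffer = io.StringIO()
load(summary, buffer)
print(buffer.getvalue())
```

Keeping each stage in its own function makes it easier to test the transformation logic separately from file or database access.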
Best Practices
Avoid loops; use vectorized operations instead
Handle missing values carefully
Use proper data types
Document transformations
Validate output data
Optimize memory usage
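Validating output can be as simple as a function that runs a few checks before the data is saved. The sketch below is one possible shape for such a check; the function name validate_sales, the required columns, and the rules themselves are assumptions chosen for illustration.

```python
import pandas as pd

# Columns this hypothetical pipeline expects in its output.
REQUIRED_COLUMNS = ["product", "price", "quantity"]


def validate_sales(df):
    """Return a list of problems found; an empty list means all checks passed."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        # The remaining checks need these columns, so stop early.
        return [f"missing columns: {missing}"]

    problems = []
    if df[REQUIRED_COLUMNS].isnull().any().any():
        problems.append("null values present")
    if df.duplicated().any():
        problems.append("duplicate rows present")
    if (df["price"] < 0).any():
        problems.append("negative prices present")
    return problems


# Usage on a small made-up table.
cleaned = pd.DataFrame({"product": ["A", "B"], "price": [10.0, 20.0], "quantity": [1, 2]})
print(validate_sales(cleaned))
```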
Common Mistakes
Modifying original data without backup
Ignoring data types
Using loops instead of vectorization
Not checking missing values
Not validating final dataset
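The first mistake is easy to avoid by working on an explicit copy. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical raw data.
raw = pd.DataFrame({"price": [50, 120, 300]})

# Work on a copy so the raw data stays available for comparison or reruns.
df = raw.copy()
df["price"] = df["price"] * 2

print(raw)
print(df)
```

For large datasets, where a full in-memory copy is expensive, keeping the raw file untouched on disk serves the same purpose.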
Key Takeaway
Transforming data using Pandas involves cleaning, filtering, aggregating, and restructuring datasets to make them analysis-ready.
Pandas provides powerful and efficient tools that are essential for ETL processes and modern data workflows.