Transforming Data Using Pandas

Data transformation is the process of cleaning, modifying, and preparing raw data for analysis.

Pandas is one of the most powerful Python libraries for data transformation and manipulation.

It is widely used in:

Data Engineering
Data Analysis
Machine Learning
ETL Pipelines

1. Loading Data

Before transforming, we load the data.

import pandas as pd
df = pd.read_csv("sales.csv")
print(df.head())

Now the data is ready for transformation.

2. Handling Missing Values

Check missing values:

print(df.isnull().sum())

Remove missing values:

df = df.dropna()

Fill missing values:

df["price"] = df["price"].fillna(0)
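The calls above can be tried on a small in-memory table. The following is a minimal sketch; the `sample` data and its values are made up for illustration and stand in for the real sales file. It also shows two common variations: dropping rows only when a specific column is missing, and filling with the median instead of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for sales.csv
sample = pd.DataFrame({
    "product": ["pen", "book", "lamp"],
    "price": [1.5, np.nan, 20.0],
    "quantity": [10, 2, np.nan],
})

# Count missing values per column
missing_counts = sample.isnull().sum()

# Drop only the rows where "quantity" is missing
dropped = sample.dropna(subset=["quantity"])

# Fill missing prices with the column median instead of 0
filled = sample.assign(price=sample["price"].fillna(sample["price"].median()))

print(missing_counts)
print(dropped)
print(filled)
```

Whether to drop or fill, and with which value, depends on what a missing value means in the dataset. Filling a price with 0 would distort any later revenue calculation.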

3. Removing Duplicates

df = df.drop_duplicates()

By default this drops rows that are identical in every column and keeps the first occurrence.
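Sometimes rows count as duplicates when only a key column repeats. A minimal sketch, assuming a hypothetical `order_id` column that should be unique:

```python
import pandas as pd

# Hypothetical orders table; order_id 1 appears twice
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "product": ["pen", "pen", "book", "lamp"],
    "price": [1.5, 1.5, 12.0, 20.0],
})

# Count fully duplicated rows before dropping anything
n_duplicates = orders.duplicated().sum()

# Treat rows sharing an order_id as duplicates and keep the last one
deduped = orders.drop_duplicates(subset=["order_id"], keep="last")

print(n_duplicates)
print(deduped)
```

Counting duplicates first makes it easier to notice when an upstream source starts sending repeated records.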

4. Filtering Data

Filter rows:

df = df[df["price"] > 100]

Filter multiple conditions:

df = df[(df["price"] > 100) & (df["quantity"] >= 2)]
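The same filters can be written in a few other ways. The sketch below uses made-up data and shows `query`, `isin`, and `between` next to the boolean-mask form used above.

```python
import pandas as pd

# Hypothetical sales data for illustration
sales = pd.DataFrame({
    "product": ["pen", "book", "lamp", "desk"],
    "price": [1.5, 120.0, 200.0, 450.0],
    "quantity": [10, 2, 1, 3],
})

# Boolean mask with two conditions (each condition needs parentheses)
both = sales[(sales["price"] > 100) & (sales["quantity"] >= 2)]

# The same filter expressed as a query string
same_with_query = sales.query("price > 100 and quantity >= 2")

# Keep rows whose product is in a given list
selected = sales[sales["product"].isin(["pen", "desk"])]

# Keep rows whose price is between 100 and 250, both ends included
mid_range = sales[sales["price"].between(100, 250)]

print(both)
print(selected)
print(mid_range)
```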

5. Selecting Columns

df = df[["product", "price", "quantity"]]

Selecting only the required columns reduces memory use and speeds up later operations.

6. Creating New Columns

df["total"] = df["price"] * df["quantity"]

This creates a calculated column.

7. Changing Data Types

df["quantity"] = df["quantity"].astype("int32")
df["date"] = pd.to_datetime(df["date"])

Correct data types prevent errors such as numbers being compared as text, and smaller types reduce memory usage.
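Real files often contain values that do not convert cleanly, and `astype("int32")` raises an error if the column holds missing or non-numeric entries. A minimal sketch of a more forgiving conversion, assuming a hypothetical raw table where one quantity is the text "n/a":

```python
import pandas as pd

# Hypothetical raw data read as text
raw = pd.DataFrame({
    "quantity": ["3", "5", "n/a"],
    "date": ["2024-01-05", "2024-02-10", "2024-03-15"],
})

# Entries that cannot be parsed become NaN instead of raising an error
raw["quantity"] = pd.to_numeric(raw["quantity"], errors="coerce")

# The nullable "Int32" dtype stores integers and still allows missing values
raw["quantity"] = raw["quantity"].astype("Int32")

# An explicit format avoids ambiguous day/month parsing
raw["date"] = pd.to_datetime(raw["date"], format="%Y-%m-%d")

print(raw.dtypes)
print(raw)
```

Note the capital "I" in `"Int32"`: it selects the pandas nullable integer type, which differs from the NumPy `"int32"` type used above.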

8. Renaming Columns

df = df.rename(columns={"product_name": "product"})

9. Grouping and Aggregation

Group data and calculate totals:

summary = df.groupby("product")["total"].sum()
print(summary)

Multiple aggregations:

summary = df.groupby("product").agg({
    "total": "sum",
    "quantity": "mean"
})
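When the result feeds a report, it can help to give the aggregated columns explicit names. The sketch below uses named aggregation on made-up data; the output names `total_revenue` and `avg_quantity` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical sales data for illustration
sales = pd.DataFrame({
    "product": ["pen", "pen", "book"],
    "total": [15.0, 30.0, 24.0],
    "quantity": [10, 20, 2],
})

# Each keyword is an output column: (source column, aggregation)
summary = (
    sales.groupby("product")
    .agg(
        total_revenue=("total", "sum"),
        avg_quantity=("quantity", "mean"),
    )
    .reset_index()  # turn the group key back into a regular column
)

print(summary)
```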

10. Sorting Data

df = df.sort_values(by="total", ascending=False)

11. Merging DataFrames

Combine two datasets:

df1 = pd.read_csv("customers.csv")
df2 = pd.read_csv("orders.csv")
merged = pd.merge(df1, df2, on="customer_id", how="inner")

Join types:

inner
left
right
outer
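The difference between the join types is easiest to see by counting rows. In this sketch the two small tables are made up: customer 1 has no orders, and the order for customer 4 has no matching customer.

```python
import pandas as pd

# Hypothetical tables with partially overlapping customer_id values
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [2, 3, 4], "amount": [50, 75, 20]})

# Number of rows produced by each join type
row_counts = {
    how: len(pd.merge(customers, orders, on="customer_id", how=how))
    for how in ["inner", "left", "right", "outer"]
}

# indicator=True adds a "_merge" column showing where each row came from
outer = pd.merge(customers, orders, on="customer_id", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(row_counts)
print(unmatched)
```

An inner join keeps only keys present in both tables, left and right keep every key from one side, and outer keeps every key from both.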

12. Applying Custom Functions

df["discounted_price"] = df["price"].apply(lambda x: x * 0.9)

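apply runs a Python function on each value. It is useful for custom logic, but it is slower than vectorized operations on large columns.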
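Simple arithmetic like the discount above does not need `apply`. The following sketch, on made-up prices, shows the vectorized equivalent and uses NumPy's `where` for a conditional rule; the 20% tier is an invented example.

```python
import numpy as np
import pandas as pd

# Hypothetical prices for illustration
items = pd.DataFrame({"price": [50.0, 150.0, 300.0]})

# Vectorized equivalent of apply(lambda x: x * 0.9)
items["discounted_price"] = items["price"] * 0.9

# Conditional logic without apply: 20% off above 100, otherwise 10% off
items["tiered_price"] = np.where(
    items["price"] > 100,
    items["price"] * 0.8,
    items["price"] * 0.9,
)

print(items)
```

Keep `apply` for logic that has no column-wise equivalent, such as calling an external parsing function on each value.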

13. Pivot Tables

Create a summary table:

pivot = df.pivot_table(
    values="total",
    index="product",
    columns="region",
    aggfunc="sum"
)

Useful for reporting and dashboards.
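Product and region combinations with no sales show up as NaN in the pivot. A minimal sketch on made-up data, using `fill_value` to show 0 instead:

```python
import pandas as pd

# Hypothetical sales data; "book" has no sales in the south region
sales = pd.DataFrame({
    "product": ["pen", "pen", "book"],
    "region": ["north", "south", "north"],
    "total": [15.0, 30.0, 24.0],
})

pivot = sales.pivot_table(
    values="total",
    index="product",
    columns="region",
    aggfunc="sum",
    fill_value=0,  # replace missing combinations with 0
)

print(pivot)
```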

14. Export Transformed Data

df.to_csv("cleaned_sales.csv", index=False)

Alternatively, write the result to a database table with df.to_sql() for analytics.

Real-World ETL Example

Extract:

Load raw sales data

Transform:

Remove duplicates
Handle missing values
Calculate total revenue
Aggregate by product

Load:

Save cleaned data into data warehouse
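The three stages above can be sketched as three small functions. This is one possible layout, not a definitive implementation: the inline CSV text stands in for the raw sales file, an in-memory SQLite database stands in for the data warehouse, and the function and table names are invented for the example.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical raw data: one duplicate row and one missing price
RAW_CSV = io.StringIO(
    "product,price,quantity\n"
    "pen,1.5,10\n"
    "pen,1.5,10\n"
    "book,12.0,2\n"
    "lamp,,1\n"
)


def extract(source):
    """Extract: load raw sales data."""
    return pd.read_csv(source)


def transform(raw):
    """Transform: deduplicate, drop incomplete rows, compute and aggregate revenue."""
    cleaned = raw.drop_duplicates().dropna(subset=["price", "quantity"])
    cleaned = cleaned.assign(total=cleaned["price"] * cleaned["quantity"])
    return cleaned.groupby("product", as_index=False)["total"].sum()


def load(summary, connection):
    """Load: write the result to a table (SQLite here, a warehouse in practice)."""
    summary.to_sql("product_revenue", connection, if_exists="replace", index=False)


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)

result = pd.read_sql("SELECT * FROM product_revenue ORDER BY product", connection)
print(result)
```

Keeping each stage in its own function makes it possible to test the transform step on small in-memory tables without touching files or databases.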

Best Practices

Avoid loops, use vectorized operations
Handle missing values carefully
Use proper data types
Document transformations
Validate output data
Optimize memory usage
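Validating output can be as simple as a function that runs a few checks before the data is saved. The sketch below is illustrative only: the function name, the required columns, and the specific rules are assumptions and should be replaced with the rules of the actual dataset.

```python
import pandas as pd


def validate_sales(frame):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    required = ["product", "price", "quantity"]  # assumed schema

    missing = [column for column in required if column not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # remaining checks need these columns

    if frame[required].isnull().any().any():
        problems.append("null values present")
    if (frame["price"] < 0).any():
        problems.append("negative prices")
    if frame.duplicated().any():
        problems.append("duplicate rows")
    return problems


# Hypothetical inputs: one clean table and one with two problems
good = pd.DataFrame({"product": ["pen"], "price": [1.5], "quantity": [10]})
bad = pd.DataFrame({"product": ["pen", "pen"], "price": [-1.0, -1.0], "quantity": [1, 1]})

print(validate_sales(good))
print(validate_sales(bad))
```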

Common Mistakes

Modifying original data without backup
Ignoring data types
Using loops instead of vectorization
Not checking missing values
Not validating final dataset
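The first mistake is easy to avoid by working on a copy. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical original data that should stay untouched
original = pd.DataFrame({"price": [50.0, 150.0], "quantity": [1, 2]})

# copy() creates an independent DataFrame, so changes do not affect the original
working = original.copy()
working["total"] = working["price"] * working["quantity"]

print(list(original.columns))
print(list(working.columns))
```

For data too large to hold twice in memory, keeping the raw file unchanged on disk serves the same purpose.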

Key Takeaway

Transforming data using Pandas involves cleaning, filtering, aggregating, and restructuring datasets to make them analysis-ready.

Pandas provides powerful and efficient tools that are essential for ETL processes and modern data workflows.
