Data Cleaning

Data Cleaning is the process of detecting and correcting errors, missing values, and inconsistencies in a dataset.

In Data Analytics, clean data is essential for accurate analysis and decision-making.

First, import pandas:

import pandas as pd

Load a dataset:

df = pd.read_csv("data.csv")

Why Data Cleaning is Important

Real-world data often contains:

Missing values
Duplicate records
Incorrect data types
Extra spaces
Inconsistent formatting
Outliers

Fixing these issues before analysis makes the results more reliable.

1. Handling Missing Values

Count missing values in each column:

df.isnull().sum()

Remove rows with missing values (dropna() returns a new DataFrame, so assign the result to keep it):

df = df.dropna()

Remove columns with missing values:

df = df.dropna(axis=1)

Fill missing values with a specific value:

df = df.fillna(0)

Fill a numeric column with its mean value:

df["Age"] = df["Age"].fillna(df["Age"].mean())
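
The calls above can be combined so that each column gets its own strategy. The following is a minimal sketch on made-up data; the column names and the "Unknown" placeholder are assumptions for illustration, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the columns are illustrative only.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dev"],
    "Age": [25.0, np.nan, 35.0, np.nan],
    "City": ["Pune", "Delhi", None, "Mumbai"],
})

# Count missing values per column.
missing_counts = df.isnull().sum()

# Work on a copy so the original stays available for comparison.
cleaned = df.copy()

# Numeric column: fill gaps with the column mean.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].mean())

# Text column: fill gaps with an explicit placeholder.
cleaned["City"] = cleaned["City"].fillna("Unknown")

# Alternative: drop only the rows where one specific column is missing.
rows_with_age = df.dropna(subset=["Age"])

print(missing_counts)
print(cleaned)
```

Whether to fill or drop depends on how much data is missing and how the column will be used; filling with the mean keeps every row but reduces the variance of the column.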

2. Removing Duplicates

Count duplicate rows:

df.duplicated().sum()

Remove duplicates, keeping the first occurrence of each row:

df = df.drop_duplicates()
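
Duplicates are not always identical across every column. As a sketch, assuming a hypothetical CustomerID column that should be unique, the subset and keep parameters of drop_duplicates() control which columns are compared and which copy survives:

```python
import pandas as pd

# Hypothetical sample data; CustomerID is assumed to identify a customer.
df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3, 3],
    "Email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com", "c2@x.com"],
})

# Rows that are identical across all columns.
full_dupes = df.duplicated().sum()

# Drop rows that are identical across all columns.
deduped = df.drop_duplicates()

# Keep one row per CustomerID, preferring the last occurrence.
per_customer = (
    df.drop_duplicates(subset=["CustomerID"], keep="last")
      .reset_index(drop=True)
)

print(full_dupes)
print(per_customer)
```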

3. Fixing Data Types

Check data types:

df.dtypes

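Convert data type (astype(int) raises an error if the column still contains missing values, so fill or drop them first):

df["Age"] = df["Age"].astype(int)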

Convert to datetime:

df["Date"] = pd.to_datetime(df["Date"])
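
Real columns often contain entries that cannot be converted at all. One possible approach, sketched here on made-up data, is to pass errors="coerce" so that unparseable entries become missing values instead of raising an error. The sample values and the date format are assumptions.

```python
import pandas as pd

# Hypothetical sample data with entries that cannot be parsed.
df = pd.DataFrame({
    "Age": ["25", "thirty", "41", None],
    "Date": ["2024-01-15", "2024-02-30", "2024-03-01", None],
})

# Unparseable numbers become NaN instead of raising an error.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# Invalid dates (such as February 30) become NaT.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce")

# The nullable "Int64" dtype stores integers alongside missing values.
df["Age"] = df["Age"].astype("Int64")

print(df.dtypes)
print(df)
```

Coercion hides bad entries rather than fixing them, so it is worth counting the new missing values afterwards with df.isnull().sum().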

4. Renaming Columns

df.rename(columns={"OldName": "NewName"}, inplace=True)
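
When many columns need the same kind of fix, renaming them one by one gets tedious. A minimal sketch, assuming hypothetical column headers with stray spaces and mixed case, applies string methods to df.columns in a single step:

```python
import pandas as pd

# Hypothetical column headers with stray spaces and mixed case.
df = pd.DataFrame(
    [["Ana", "Rao", 50000]],
    columns=[" First Name", "Last Name ", "Annual Salary"],
)

# Trim, lowercase, and replace inner spaces with underscores.
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(df.columns))
```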

5. Removing Extra Spaces

df["Name"] = df["Name"].str.strip()
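
Note that str.strip() only removes whitespace at the start and end of each value. If repeated spaces inside a value also need fixing, one option is to follow it with a regular-expression replace, as in this sketch on made-up names:

```python
import pandas as pd

# Hypothetical sample data with leading, trailing, and repeated inner spaces.
df = pd.DataFrame({"Name": ["  Alice ", "Bob  Smith", " Carol\t"]})

df["Name"] = (
    df["Name"]
    .str.strip()                           # remove leading/trailing whitespace
    .str.replace(r"\s+", " ", regex=True)  # collapse inner runs to one space
)

print(df["Name"].tolist())
```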

6. Replacing Values

df["Gender"] = df["Gender"].replace("M", "Male")
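
A single replace("M", "Male") misses variants such as "m" or "male". As a sketch, assuming the labels below are the only variants present, the values can be normalized first and then mapped with a dictionary passed to replace():

```python
import pandas as pd

# Hypothetical sample data with inconsistent labels.
df = pd.DataFrame({"Gender": ["M", "F", "male", "Female", " m"]})

# Assumed mapping from normalized labels to the final categories.
gender_map = {
    "m": "Male",
    "male": "Male",
    "f": "Female",
    "female": "Female",
}

# Normalize whitespace and case, then replace every known variant at once.
normalized = df["Gender"].str.strip().str.lower()
df["Gender"] = normalized.replace(gender_map)

print(df["Gender"].tolist())
```

Values that are not in the dictionary are left as their normalized form, which makes unexpected labels easy to spot with value_counts().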

7. Handling Outliers (Basic Method)

Keep only rows below a fixed threshold:

df = df[df["Salary"] < 100000]

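Or keep only values below the 95th percentile (this always drops roughly the top 5% of rows, whether or not they are true outliers):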

df = df[df["Salary"] < df["Salary"].quantile(0.95)]
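
A common alternative to a fixed cutoff is the interquartile range (IQR) rule, which flags values far from the middle half of the data on either side. This is a different technique from the two shown above; the sketch below uses made-up salaries and the conventional 1.5 multiplier, which is a convention rather than a requirement.

```python
import pandas as pd

# Hypothetical sample data with one extreme value.
df = pd.DataFrame(
    {"Salary": [30000, 32000, 35000, 36000, 38000, 40000, 500000]}
)

# Quartiles and interquartile range.
q1 = df["Salary"].quantile(0.25)
q3 = df["Salary"].quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond each quartile.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only rows inside the fences (bounds inclusive).
filtered = df[df["Salary"].between(lower, upper)]

print(lower, upper)
print(filtered)
```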

8. Changing Text Case

Convert to lowercase:

df["Name"] = df["Name"].str.lower()

Or convert to uppercase:

df["Name"] = df["Name"].str.upper()
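
Consistent case matters because pandas treats "delhi" and "Delhi" as different values when grouping or counting. A small sketch on a hypothetical City column shows the effect, here using str.title() as a third option:

```python
import pandas as pd

# Hypothetical sample data: the same city written in different cases.
df = pd.DataFrame({"City": ["delhi", "Delhi", "DELHI", "mumbai"]})

# Before normalizing, each spelling counts as a separate value.
unique_before = df["City"].nunique()

# Title case capitalizes the first letter of each word.
df["City"] = df["City"].str.title()
unique_after = df["City"].nunique()

print(unique_before, unique_after)
```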

Data Cleaning Workflow

  1. Load dataset
  2. Check structure and summary
  3. Identify and handle missing values
  4. Remove duplicates
  5. Fix data types
  6. Standardize formatting
  7. Handle outliers
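
The steps above can be sketched as one small pipeline. This is an illustration under stated assumptions, not a universal recipe: the CSV content, the clean() helper, the choice of the median for filling Age, and the IQR rule for Salary are all hypothetical choices made for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file such as data.csv.
raw_csv = io.StringIO(
    "Name,Age,Salary,Date\n"
    " alice ,25,50000,2024-01-05\n"
    "Bob,,52000,2024-01-06\n"
    "Bob,,52000,2024-01-06\n"
    "carol,41,51000,2024-01-07\n"
    "Dan,38,900000,2024-01-08\n"
)


def clean(df):
    """Hypothetical helper applying workflow steps 3-7 to a copy of df."""
    df = df.copy()

    # 3. Handle missing values (assumption: median is acceptable for Age).
    df["Age"] = df["Age"].fillna(df["Age"].median())

    # 4. Remove duplicates.
    df = df.drop_duplicates()

    # 5. Fix data types.
    df["Age"] = df["Age"].astype(int)
    df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")

    # 6. Standardize formatting.
    df["Name"] = df["Name"].str.strip().str.title()

    # 7. Handle outliers with the IQR rule (one option among several).
    q1, q3 = df["Salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["Salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    return df.reset_index(drop=True)


# 1. Load dataset.
df = pd.read_csv(raw_csv)

# 2. Check structure and summary.
df.info()
print(df.isnull().sum())

cleaned = clean(df)
print(cleaned)
```

Wrapping the steps in a function keeps the raw DataFrame untouched, so the cleaned result can be compared against the original at any point.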

Why Data Cleaning is Critical in Analytics

Accurate insights depend on clean data.
Poor data quality leads to wrong decisions.
Data cleaning is commonly estimated to take 60–70% of a data analyst’s time, though the share varies by project.

Key Takeaway

Data Cleaning is the foundation of Data Analytics.
Before analyzing or visualizing data, always clean and prepare it properly for accurate and meaningful results.
