Data Cleaning is the process of detecting and correcting errors, missing values, and inconsistencies in a dataset.
In Data Analytics, clean data is essential for accurate analysis and decision-making.
First, import pandas:
import pandas as pd
Load a dataset:
df = pd.read_csv("data.csv")
Why Data Cleaning is Important
Real-world data often contains:
Missing values
Duplicate records
Incorrect data types
Extra spaces
Inconsistent formatting
Outliers
Cleaning ensures reliable analysis.
1. Handling Missing Values
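Count missing values in each column:
df.isnull().sum()
dropna() and fillna() return a new object and leave df unchanged, so assign the result to keep it.
Remove rows that contain any missing value:
df = df.dropna()
Or remove columns that contain any missing value:
df = df.dropna(axis=1)
Or fill missing values with a specific value:
df = df.fillna(0)
Or fill a numeric column with its mean:
df["Age"] = df["Age"].fillna(df["Age"].mean())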
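The options above can be compared on a small table. This is a minimal sketch: the DataFrame named people and its values are made up for illustration and are not part of data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
people = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dev"],
    "Age": [30.0, np.nan, 40.0, np.nan],
    "City": ["Lima", "Oslo", None, "Pune"],
})

# Count missing values per column.
missing_counts = people.isnull().sum()

# dropna() returns a new DataFrame; people itself is not modified.
complete_rows = people.dropna()

# Fill the numeric gaps with the column mean and assign the result back.
filled = people.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

print(missing_counts)
print(complete_rows)
print(filled)
```

Dropping rows keeps only fully populated records, while filling keeps every row at the cost of introducing estimated values. Which trade-off is acceptable depends on the analysis.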
2. Removing Duplicates
Count duplicate rows:
df.duplicated().sum()
Remove duplicates (drop_duplicates() returns a new DataFrame, so assign the result):
df = df.drop_duplicates()
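By default, a row counts as a duplicate only when every column matches an earlier row. The sketch below, using a made-up orders table, also shows the subset and keep parameters of drop_duplicates() for the case where one column should identify a record.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
orders = pd.DataFrame({
    "OrderID": [1, 2, 2, 3, 3],
    "Customer": ["Ana", "Ben", "Ben", "Cara", "Cara"],
    "Amount": [10.0, 20.0, 20.0, 30.0, 35.0],
})

# Rows that repeat an earlier row in every column.
full_duplicates = int(orders.duplicated().sum())

# Drop exact duplicates only.
deduped = orders.drop_duplicates()

# Treat OrderID as the record key and keep the last row for each key.
latest_per_order = orders.drop_duplicates(subset=["OrderID"], keep="last")

print(full_duplicates)
print(deduped)
print(latest_per_order)
```

Here the two rows for order 3 differ in Amount, so the default check keeps both, while the subset version keeps only the later one.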
3. Fixing Data Types
Check data types:
df.dtypes
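Convert a column to integers (astype(int) raises an error if the column still contains missing values, so handle those first):
df["Age"] = df["Age"].astype(int)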
Convert to datetime:
df["Date"] = pd.to_datetime(df["Date"])
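Real columns often contain entries that cannot be converted at all. One possible approach, sketched here on made-up data, is to pass errors="coerce" so that unparseable entries become missing values instead of raising an error. The nullable "Int64" dtype is used as an assumption so the integer column can hold those missing values.

```python
import pandas as pd

# Hypothetical sample data; values are illustrative only.
raw = pd.DataFrame({
    "Age": ["25", "31", "unknown", None],
    "Date": ["2024-01-15", "2024-02-01", "not a date", "2024-03-10"],
})

typed = raw.copy()

# Unparseable ages become missing; "Int64" is pandas' nullable integer dtype.
typed["Age"] = pd.to_numeric(typed["Age"], errors="coerce").astype("Int64")

# Unparseable dates become NaT (the datetime equivalent of a missing value).
typed["Date"] = pd.to_datetime(typed["Date"], errors="coerce")

print(typed.dtypes)
print(typed)
```

Coercion hides bad entries rather than fixing them, so it is worth counting the new missing values afterwards with isnull().sum().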
4. Renaming Columns
df.rename(columns={"OldName": "NewName"}, inplace=True)
5. Removing Extra Spaces
df["Name"] = df["Name"].str.strip()
6. Replacing Values
df["Gender"] = df["Gender"].replace("M", "Male")
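When a column has several spellings of the same value, replace() also accepts a dictionary. The following is a rough sketch on made-up data; the set of variants in the mapping is an assumption and would need to match what the real column contains.

```python
import pandas as pd

# Hypothetical sample data; values are illustrative only.
staff = pd.DataFrame({"Gender": [" M", "F ", "Male", "f", None]})

# Trim whitespace and unify case first so one mapping covers all variants.
normalized = staff["Gender"].str.strip().str.upper()

# Map known codes to full labels; values missing from the mapping are left as they are.
labels = {"M": "Male", "F": "Female", "MALE": "Male", "FEMALE": "Female"}
staff["Gender"] = normalized.replace(labels)

print(staff)
```

Missing entries stay missing, so they can still be handled separately.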
7. Handling Outliers (Basic Method)
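Keep only rows below a fixed threshold:
df = df[df["Salary"] < 100000]
Or drop rows at or above the 95th percentile (this removes roughly the top 5% of rows, whether or not they are true outliers):
df = df[df["Salary"] < df["Salary"].quantile(0.95)]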
8. Changing Text Case
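Convert to lowercase:
df["Name"] = df["Name"].str.lower()
Or convert to uppercase (use one or the other, since the second call overrides the first):
df["Name"] = df["Name"].str.upper()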
Data Cleaning Workflow
- Load dataset
- Check structure and summary
- Identify missing values
- Remove duplicates
- Fix data types
- Standardize formatting
- Handle outliers
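The workflow above could be sketched as a single function. Everything specific here is an assumption for illustration: the helper name clean_dataset, the salary_cap parameter, the column names, and the inline CSV text that stands in for data.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv.
CSV_TEXT = """Name,Age,Salary
 Ana ,30,50000
Ben,,52000
Ben,,52000
cara,40,51000
Dev,38,900000
"""


def clean_dataset(df: pd.DataFrame, salary_cap: float) -> pd.DataFrame:
    """Hypothetical helper that applies the workflow steps to a copy of df."""
    out = df.copy()
    # Handle missing values: fill Age with the column mean.
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    # Remove exact duplicate rows.
    out = out.drop_duplicates()
    # Fix data types: Age has no gaps left, so it can become an integer.
    out["Age"] = out["Age"].astype(int)
    # Standardize formatting: trim spaces and use title case for names.
    out["Name"] = out["Name"].str.strip().str.title()
    # Handle outliers: keep salaries below the chosen cap.
    out = out[out["Salary"] < salary_cap]
    return out.reset_index(drop=True)


# Load the dataset and check its structure and summary.
raw = pd.read_csv(io.StringIO(CSV_TEXT))
print(raw.dtypes)
print(raw.describe())

cleaned = clean_dataset(raw, salary_cap=100000)
print(cleaned)
```

The function works on a copy, so the raw data stays available for comparison after cleaning.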
Why Data Cleaning is Critical in Analytics
Accurate insights depend on clean data.
Poor data quality leads to wrong decisions.
Data cleaning is often estimated to take up 60–70% of a data analyst's working time.
Key Takeaway
Data Cleaning is the foundation of Data Analytics.
Before analyzing or visualizing data, always clean and prepare it properly for accurate and meaningful results.