Data Preprocessing

Data Preprocessing is the process of cleaning and preparing raw data before using it in Machine Learning.

Raw data is often incomplete, inconsistent, or noisy.
Preprocessing ensures that the data is accurate, clean, and ready for modeling.

It is one of the most important steps in Machine Learning.

Why Data Preprocessing is Important

Data preprocessing helps:

  • Improve model accuracy
  • Handle missing values
  • Remove errors and duplicates
  • Normalize data
  • Convert data into the proper format
  • Reduce noise

Good data leads to better predictions.

Steps in Data Preprocessing

1. Data Cleaning

Data cleaning involves fixing or removing incorrect data.

Common tasks:

  • Handling missing values
  • Removing duplicates
  • Correcting errors
  • Removing irrelevant data

Example:

If age is missing → Replace with mean or median
If duplicate records exist → Remove them
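
The two cleaning rules above can be sketched with pandas. This is a minimal illustration, assuming a small made-up table; the column names and values are hypothetical and not from a real dataset.

```python
import pandas as pd

# Hypothetical toy data: one missing age and one exact duplicate row.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cara", "Dev"],
    "age": [25.0, 40.0, 40.0, None, 31.0],
})

# Remove exact duplicate records, keeping the first occurrence.
df = df.drop_duplicates().reset_index(drop=True)

# Replace the missing age with the median of the observed ages.
median_age = df["age"].median()
df["age"] = df["age"].fillna(median_age)
```

Duplicates are dropped first so that a repeated row does not count twice when the median is computed.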

2. Handling Missing Values

Common techniques:

  • Remove rows with missing values
  • Replace with mean (for numerical data)
  • Replace with median (for numerical data that is skewed or has outliers)
  • Replace with mode (for categorical data)

Choosing the right method depends on the dataset.
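
As a rough comparison of the techniques listed above, the sketch below applies each one to a small hypothetical table with pandas. The column names and values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical data with one missing number and one missing category.
df = pd.DataFrame({
    "salary": [30000.0, 45000.0, None, 45000.0, 120000.0],
    "city": ["Pune", None, "Delhi", "Pune", "Pune"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill numerical gaps with the mean or the median.
mean_filled = df["salary"].fillna(df["salary"].mean())
median_filled = df["salary"].fillna(df["salary"].median())

# Option 3: fill categorical gaps with the mode (most frequent value).
mode_filled = df["city"].fillna(df["city"].mode()[0])
```

In this toy data the single large salary pulls the mean up to 60,000 while the median stays at 45,000, which is one reason the median is often preferred for skewed columns.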

3. Encoding Categorical Data

Most Machine Learning models work with numbers, not text, so categorical data must be converted into a numeric form.

Techniques:

Label Encoding → Assign a number to each category (this implies an order, so it suits ordered categories best)
One-Hot Encoding → Create a separate 0/1 column for each category

Example:

Gender:
Male → 0
Female → 1

Or:

Male → [1,0]
Female → [0,1]
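
The Gender example above could be reproduced with pandas roughly as follows. This is a sketch: the DataFrame and the `label_map` dictionary are illustrative names, not part of any library.

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# Label encoding: map each category to an integer, as in the example above.
label_map = {"Male": 0, "Female": 1}  # hypothetical mapping
df["gender_label"] = df["gender"].map(label_map)

# One-hot encoding: one indicator column per category.
# get_dummies orders columns alphabetically, so reorder to Male, Female.
one_hot = pd.get_dummies(df["gender"], dtype=int)[["Male", "Female"]]
```

Each row of `one_hot` contains a single 1, in the column that matches that row's category.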

4. Feature Scaling

Some algorithms, such as k-nearest neighbors and models trained with gradient descent, perform better when features are on the same scale.

Two common methods:

Standardization

  • Mean = 0
  • Standard deviation = 1

Normalization

  • Values scaled between 0 and 1

Example:

Salary range: 10,000 to 1,000,000
Age range: 18 to 60

Scaling ensures one feature does not dominate others.
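
Both scaling methods reduce to one-line formulas. The NumPy sketch below applies them to hypothetical salary and age values taken from the ranges above; the function names `standardize` and `normalize` are illustrative.

```python
import numpy as np

# Hypothetical values inside the ranges mentioned above.
salary = np.array([10_000.0, 50_000.0, 250_000.0, 1_000_000.0])
age = np.array([18.0, 25.0, 40.0, 60.0])

def standardize(x):
    """Rescale to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

def normalize(x):
    """Min-max scale so the values lie between 0 and 1."""
    return (x - x.min()) / (x.max() - x.min())

salary_std, age_std = standardize(salary), standardize(age)
salary_norm, age_norm = normalize(salary), normalize(age)
```

After either transformation, salary and age occupy comparable ranges, so neither dominates a distance or gradient calculation simply because of its units.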

5. Removing Outliers

Outliers are extreme values that can affect model performance.

Methods:

  • Z-Score
  • IQR (Interquartile Range)
  • Visualization (Boxplots)

Removing outliers can improve model stability, but first check whether an extreme value is an error or a genuine observation.
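
The Z-Score and IQR methods can be sketched with NumPy as below. The data is made up, the 1.5 multiplier is the conventional IQR rule, and the z-score cutoff of 2.5 is an assumption chosen for this tiny sample (3 is a common choice on larger samples).

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([22, 25, 27, 29, 30, 31, 33, 35, 38, 120], dtype=float)

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag points far from the mean in standard-deviation units.
threshold = 2.5  # illustrative cutoff, an assumption for this small sample
z_scores = (values - values.mean()) / values.std()
z_mask = np.abs(z_scores) > threshold

# Keep only the rows that neither rule flags.
cleaned = values[~(iqr_mask | z_mask)]
```

Both rules flag only the value 120 here. The IQR rule is usually less sensitive to the outlier itself, because quartiles shift less than the mean and standard deviation do.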

6. Feature Selection

Not all features are useful.

Feature selection helps:

  • Reduce overfitting
  • Improve performance
  • Reduce training time

Techniques:

  • Correlation analysis
  • Feature importance
  • Recursive feature elimination
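
Of the techniques above, correlation analysis is the simplest to sketch. The example below uses pandas on a made-up table; the column names, the values, and the 0.3 cutoff are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: `score` tracks `hours_studied` but not `shoe_size`.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [3, 1, 2, 2, 1, 3],
    "score": [2.1, 3.9, 6.2, 7.8, 10.1, 11.9],
})

# Absolute Pearson correlation of each feature with the target.
features = df.drop(columns="score")
correlations = features.corrwith(df["score"]).abs()

# Keep features whose correlation exceeds an illustrative cutoff.
cutoff = 0.3
selected = correlations[correlations > cutoff].index.tolist()
```

Pearson correlation only captures linear relationships, so a low value is a hint rather than proof that a feature is useless.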

7. Splitting the Dataset

Before training, split the data into:

  • Training set (70–80%)
  • Testing set (20–30%)

This ensures the model is evaluated on data it did not see during training.
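
A minimal 80/20 split can be sketched with pandas alone, as below. The table is hypothetical and the seed is arbitrary. Scikit-learn's `train_test_split` offers the same idea with more options, such as stratification.

```python
import pandas as pd

# Hypothetical dataset with ten rows.
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Shuffle and take 80% of the rows for training (seed fixed for repeatability).
train = df.sample(frac=0.8, random_state=0)

# The remaining 20% becomes the held-out test set.
test = df.drop(train.index)
```

Because the test rows are exactly those not sampled for training, no row appears in both sets.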

Example Workflow

  1. Load dataset
  2. Handle missing values
  3. Encode categorical variables
  4. Scale features
  5. Remove outliers
  6. Split dataset
  7. Train model
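
The workflow above might look like the following with pandas and scikit-learn. This is a sketch under several assumptions: the table is made up, logistic regression is an arbitrary model choice, and outlier removal is skipped for brevity. It also splits before imputing and scaling, a slight reordering of the steps, so that the test rows cannot influence the fitted preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load dataset (a hypothetical in-memory table stands in for a file).
df = pd.DataFrame({
    "age": [22, 35, np.nan, 41, 29, 52, 38, 27, 45, 33, 60, 24],
    "salary": [20_000, 52_000, 48_000, np.nan, 31_000, 90_000,
               61_000, 28_000, 75_000, 44_000, 120_000, 23_000],
    "gender": ["Male", "Female", "Female", "Male", "Male", "Female",
               "Male", "Female", "Male", "Female", "Male", "Female"],
    "bought": [0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0],
})
X, y = df.drop(columns="bought"), df["bought"]

# Split dataset: 75% training, 25% testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Handle missing values, then scale the numeric columns.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Encode the categorical column with one-hot encoding.
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
])

# Train model: the preprocessing is fitted on the training set only.
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Wrapping the steps in a single `Pipeline` means the same fitted imputer, scaler, and encoder are reused automatically when predicting on new data.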

Tools for Data Preprocessing (Python)

Pandas → Data cleaning
NumPy → Numerical operations
Scikit-learn → Encoding and scaling
Matplotlib/Seaborn → Visualization

Why Data Preprocessing is Critical

Garbage in → Garbage out

If data is poor quality, the model will perform poorly, no matter how advanced the algorithm is.

Key Takeaway

Data Preprocessing prepares raw data for Machine Learning by cleaning, transforming, and organizing it.

It is a crucial step that directly impacts model performance and accuracy.
