{"id":171,"date":"2026-03-03T09:52:59","date_gmt":"2026-03-03T04:52:59","guid":{"rendered":"https:\/\/gigz.pk\/python\/?post_type=lesson&#038;p=171"},"modified":"2026-03-17T09:17:18","modified_gmt":"2026-03-17T04:17:18","slug":"data-preprocessing","status":"publish","type":"lesson","link":"https:\/\/gigz.pk\/python\/lesson\/data-preprocessing\/","title":{"rendered":"Data Preprocessing"},"content":{"rendered":"\n<p>Data Preprocessing is the process of cleaning and preparing raw data before using it in Machine Learning.<\/p>\n\n\n\n<p>Raw data is often incomplete, inconsistent, or noisy.<br>Preprocessing makes the data cleaner, more consistent, and ready for modeling.<\/p>\n\n\n\n<p>It is one of the most important steps in Machine Learning.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why Data Preprocessing is Important<\/h2>\n\n\n\n<p>Data preprocessing helps:<\/p>\n\n\n\n<p>Improve model accuracy<br>Handle missing values<br>Remove errors and duplicates<br>Normalize data<br>Convert data into a proper format<br>Reduce noise<\/p>\n\n\n\n<p>Good data leads to better predictions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Steps in Data Preprocessing<\/h2>\n\n\n\n<h2 class=\"wp-block-heading\">1. Data Cleaning<\/h2>\n\n\n\n<p>Data cleaning involves fixing or removing incorrect data.<\/p>\n\n\n\n<p>Common tasks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Handling missing values<\/li>\n\n\n\n<li>Removing duplicates<\/li>\n\n\n\n<li>Correcting errors<\/li>\n\n\n\n<li>Removing irrelevant data<\/li>\n<\/ul>\n\n\n\n<p>Example:<\/p>\n\n\n\n<p>If age is missing \u2192 Replace it with the mean or median of the column<br>If duplicate records exist \u2192 Remove them<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. 
Handling Missing Values<\/h2>\n\n\n\n<p>Common techniques:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Remove rows with missing values<\/li>\n\n\n\n<li>Replace with mean (for numerical data)<\/li>\n\n\n\n<li>Replace with median (for numerical data that is skewed or has outliers)<\/li>\n\n\n\n<li>Replace with mode (for categorical data)<\/li>\n<\/ul>\n\n\n\n<p>The right method depends on the dataset: how much data is missing and what type of column it is in.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. Encoding Categorical Data<\/h2>\n\n\n\n<p>Most Machine Learning models work with numbers, not text.<\/p>\n\n\n\n<p>So categorical data must be converted.<\/p>\n\n\n\n<p>Techniques:<\/p>\n\n\n\n<p>Label Encoding \u2192 Assign a number to each category (this can imply an order that does not exist)<br>One-Hot Encoding \u2192 Create a separate column for each category<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n\n<p>Gender:<br>Male \u2192 0<br>Female \u2192 1<\/p>\n\n\n\n<p>Or:<\/p>\n\n\n\n<p>Male \u2192 [1,0]<br>Female \u2192 [0,1]<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. Feature Scaling<\/h2>\n\n\n\n<p>Some algorithms, such as those based on distances or gradient descent, perform better when features are on the same scale.<\/p>\n\n\n\n<p>Two common methods:<\/p>\n\n\n\n<p>Standardization<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Mean = 0<\/li>\n\n\n\n<li>Standard deviation = 1<\/li>\n<\/ul>\n\n\n\n<p>Normalization<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Values scaled between 0 and 1<\/li>\n<\/ul>\n\n\n\n<p>Example:<\/p>\n\n\n\n<p>Salary range: 10,000 to 1,000,000<br>Age range: 18 to 60<\/p>\n\n\n\n<p>Without scaling, salary can dominate age simply because its values are much larger. Scaling puts both features on a comparable range.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">5. Removing Outliers<\/h2>\n\n\n\n<p>Outliers are extreme values that can affect model performance.<\/p>\n\n\n\n<p>Methods:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Z-Score<\/li>\n\n\n\n<li>IQR (Interquartile Range)<\/li>\n\n\n\n<li>Visualization (Boxplots)<\/li>\n<\/ul>\n\n\n\n<p>Removing outliers can improve model stability, but first check whether they are errors or valid rare values.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">6. 
Feature Selection<\/h2>\n\n\n\n<p>Not all features are useful.<\/p>\n\n\n\n<p>Feature selection helps:<\/p>\n\n\n\n<p>Reduce overfitting<br>Improve performance<br>Reduce training time<\/p>\n\n\n\n<p>Techniques:<\/p>\n\n\n\n<p>Correlation analysis<br>Feature importance<br>Recursive feature elimination<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">7. Splitting the Dataset<\/h2>\n\n\n\n<p>Before training:<\/p>\n\n\n\n<p>Split data into:<\/p>\n\n\n\n<p>Training set (70\u201380%)<br>Testing set (20\u201330%)<\/p>\n\n\n\n<p>This lets the model be evaluated on data it has not seen during training.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Example Workflow<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Load dataset<\/li>\n\n\n\n<li>Handle missing values<\/li>\n\n\n\n<li>Encode categorical variables<\/li>\n\n\n\n<li>Scale features<\/li>\n\n\n\n<li>Remove outliers<\/li>\n\n\n\n<li>Split dataset<\/li>\n\n\n\n<li>Train model<\/li>\n<\/ol>\n\n\n\n<p>In practice, the split is often done before scaling, and the scaler is fitted on the training set only, so that no information from the testing set leaks into training.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Tools for Data Preprocessing (Python)<\/h2>\n\n\n\n<p>Pandas \u2192 Data cleaning<br>NumPy \u2192 Numerical operations<br>Scikit-learn \u2192 Encoding and scaling<br>Matplotlib\/Seaborn \u2192 Visualization<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why Data Preprocessing is Critical<\/h2>\n\n\n\n<p>Garbage in \u2192 Garbage out<\/p>\n\n\n\n<p>If the data is of poor quality, the model will perform poorly, no matter how advanced the algorithm is.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Key Takeaway<\/h2>\n\n\n\n<p>Data Preprocessing prepares raw data for Machine Learning by cleaning, transforming, and organizing it.<\/p>\n\n\n\n<p>It is a crucial step that directly impacts model performance and accuracy.<\/p>\n\n\n<div class=\"schema-faq wp-block-yoast-faq-block\"><div 
class=\"schema-faq-section\" id=\"faq-question-1773721075628\"><strong class=\"schema-faq-question\"><\/strong> <p class=\"schema-faq-answer\"><\/p> <\/div> <\/div>\n","protected":false},"menu_order":95,"template":"","class_list":["post-171","lesson","type-lesson","status-publish","hentry"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Data Preprocessing - One Language. Endless Possibilities<\/title>\n<meta name=\"description\" content=\"Learn data preprocessing in Machine Learning: clean, encode, scale, and split data to improve model accuracy and performance.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/gigz.pk\/python\/lesson\/data-preprocessing\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Data Preprocessing - One Language. Endless Possibilities\" \/>\n<meta property=\"og:description\" content=\"Learn data preprocessing in Machine Learning: clean, encode, scale, and split data to improve model accuracy and performance.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/gigz.pk\/python\/lesson\/data-preprocessing\/\" \/>\n<meta property=\"og:site_name\" content=\"One Language. Endless Possibilities\" \/>\n<meta property=\"article:modified_time\" content=\"2026-03-17T04:17:18+00:00\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":[\"WebPage\",\"FAQPage\"],\"@id\":\"https:\\\/\\\/gigz.pk\\\/python\\\/lesson\\\/data-preprocessing\\\/\",\"url\":\"https:\\\/\\\/gigz.pk\\\/python\\\/lesson\\\/data-preprocessing\\\/\",\"name\":\"Data Preprocessing - One Language. Endless Possibilities\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/gigz.pk\\\/python\\\/#website\"},\"datePublished\":\"2026-03-03T04:52:59+00:00\",\"dateModified\":\"2026-03-17T04:17:18+00:00\",\"description\":\"Learn data preprocessing in Machine Learning: clean, encode, scale, and split data to improve model accuracy and performance.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/gigz.pk\\\/python\\\/lesson\\\/data-preprocessing\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/gigz.pk\\\/python\\\/lesson\\\/data-preprocessing\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/gigz.pk\\\/python\\\/lesson\\\/data-preprocessing\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/gigz.pk\\\/python\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"PYTHON FOR AI AND LLM (PYAI) > Machine Learning Basics > Data Preprocessing\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/gigz.pk\\\/python\\\/#website\",\"url\":\"https:\\\/\\\/gigz.pk\\\/python\\\/\",\"name\":\"One Language. Endless Possibilities\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/gigz.pk\\\/python\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Data Preprocessing - One Language. Endless Possibilities","description":"Learn data preprocessing in Machine Learning: clean, encode, scale, and split data to improve model accuracy and performance.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/gigz.pk\/python\/lesson\/data-preprocessing\/","og_locale":"en_US","og_type":"article","og_title":"Data Preprocessing - One Language. Endless Possibilities","og_description":"Learn data preprocessing in Machine Learning: clean, encode, scale, and split data to improve model accuracy and performance.","og_url":"https:\/\/gigz.pk\/python\/lesson\/data-preprocessing\/","og_site_name":"One Language. Endless Possibilities","article_modified_time":"2026-03-17T04:17:18+00:00","twitter_card":"summary_large_image","twitter_misc":{"Est. reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":["WebPage","FAQPage"],"@id":"https:\/\/gigz.pk\/python\/lesson\/data-preprocessing\/","url":"https:\/\/gigz.pk\/python\/lesson\/data-preprocessing\/","name":"Data Preprocessing - One Language. 
Endless Possibilities","isPartOf":{"@id":"https:\/\/gigz.pk\/python\/#website"},"datePublished":"2026-03-03T04:52:59+00:00","dateModified":"2026-03-17T04:17:18+00:00","description":"Learn data preprocessing in Machine Learning: clean, encode, scale, and split data to improve model accuracy and performance.","breadcrumb":{"@id":"https:\/\/gigz.pk\/python\/lesson\/data-preprocessing\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/gigz.pk\/python\/lesson\/data-preprocessing\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/gigz.pk\/python\/lesson\/data-preprocessing\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/gigz.pk\/python\/"},{"@type":"ListItem","position":2,"name":"PYTHON FOR AI AND LLM (PYAI) > Machine Learning Basics > Data Preprocessing"}]},{"@type":"WebSite","@id":"https:\/\/gigz.pk\/python\/#website","url":"https:\/\/gigz.pk\/python\/","name":"One Language. Endless Possibilities","description":"","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/gigz.pk\/python\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"}]}},"_links":{"self":[{"href":"https:\/\/gigz.pk\/python\/wp-json\/wp\/v2\/lesson\/171","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gigz.pk\/python\/wp-json\/wp\/v2\/lesson"}],"about":[{"href":"https:\/\/gigz.pk\/python\/wp-json\/wp\/v2\/types\/lesson"}],"wp:attachment":[{"href":"https:\/\/gigz.pk\/python\/wp-json\/wp\/v2\/media?parent=171"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}