Data cleaning and transformation are essential steps in data analysis. Raw data often contains missing values, inconsistencies, or unstructured formats. R provides powerful tools and packages like dplyr and tidyr to clean, transform, and prepare data for analysis.
1. Inspecting Data
Before cleaning, inspect the dataset to identify issues:
# Load data
data <- read.csv("data.csv")
# View first few rows
head(data)
# Check structure and types
str(data)
# Summary statistics
summary(data)
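To try the inspection calls without a data.csv on disk, the following is a minimal sketch that uses a small made-up data frame. The object name `toy` and all of its values are assumptions for illustration, not part of any real dataset.

```r
# Hypothetical toy data standing in for data.csv
toy <- data.frame(
  ID = c(1, 2, 3, 4),
  Name = c("Ana", "Ben", "Cy", "Dee"),
  Age = c(25, NA, 41, 33),
  Amount = c(120.5, 80, NA, 45),
  stringsAsFactors = FALSE
)

# Number of rows and columns
toy_dim <- dim(toy)

# Class of each column, as a named character vector
col_types <- sapply(toy, class)

print(toy_dim)
print(col_types)
```

Checking dimensions and column classes first makes it easier to spot columns that were read with an unexpected type, such as numbers imported as text.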
2. Handling Missing Values
a) Identifying Missing Values
is.na(data) # Logical matrix of missing values
colSums(is.na(data)) # Count missing per column
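As a small sketch of how these counts might be used, the example below computes both the number and the share of missing values per column. The `toy` data frame is a hypothetical stand-in for `data`.

```r
# Hypothetical toy data with some missing values
toy <- data.frame(
  Age = c(25, NA, 41, NA),
  Amount = c(10, 20, NA, 40)
)

# Count of missing values per column
na_counts <- colSums(is.na(toy))

# Proportion of missing values per column
na_share <- colMeans(is.na(toy))

# TRUE if the data frame contains at least one NA
any_missing <- anyNA(toy)

print(na_counts)
print(na_share)
```

The proportion is often more useful than the raw count when deciding whether a column is worth keeping.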
b) Removing Missing Values
data_clean <- na.omit(data) # Removes rows with any missing values
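Because na.omit() drops a row if any column is missing, it can discard more data than intended. One possible alternative, sketched here on a hypothetical `toy` data frame, is to require completeness only in the columns that matter for the analysis.

```r
# Hypothetical toy data: Age and Note each have one missing value
toy <- data.frame(
  Age = c(25, NA, 41),
  Note = c(NA, "x", "y"),
  stringsAsFactors = FALSE
)

# Drops every row that has an NA in any column
all_complete <- na.omit(toy)

# Drops only rows where Age is missing; a missing Note is tolerated
age_complete <- toy[complete.cases(toy["Age"]), ]

print(nrow(all_complete))
print(nrow(age_complete))
```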
c) Replacing Missing Values
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE) # Replace with mean
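The mean is sensitive to outliers, so the median is sometimes used for imputation instead. The sketch below compares the two on made-up values. The `toy` data frame and the `Age_imputed` column are hypothetical names, and median imputation is shown as an alternative rather than as the method used above.

```r
# Hypothetical toy data with one outlier (100) and one missing value
toy <- data.frame(Age = c(20, NA, 30, 100))

mean_age <- mean(toy$Age, na.rm = TRUE)      # pulled upward by the outlier
median_age <- median(toy$Age, na.rm = TRUE)  # less affected by the outlier

# Keep the original column and store the imputed version separately
toy$Age_imputed <- ifelse(is.na(toy$Age), median_age, toy$Age)

print(mean_age)
print(median_age)
print(toy)
```

Writing the result to a new column keeps the original values available for comparison.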
3. Removing Duplicates
data <- data[!duplicated(data), ]
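duplicated() on a whole data frame flags only rows that match in every column. When duplicates should be judged by a key column instead, dplyr's distinct() with .keep_all = TRUE is one option. The sketch assumes a hypothetical `toy` data frame in which `ID` is the key.

```r
library(dplyr)

# Hypothetical toy data: ID 1 appears twice with different amounts
toy <- data.frame(
  ID = c(1, 1, 2),
  Amount = c(10, 15, 20)
)

# No fully identical rows, so this count is zero
n_exact <- sum(duplicated(toy))

# Keep the first row for each ID, retaining all other columns
dedup_by_id <- toy %>% distinct(ID, .keep_all = TRUE)

print(n_exact)
print(dedup_by_id)
```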
4. Renaming Columns
library(dplyr)
data <- data %>%
  rename(
    CustomerID = ID,
    PurchaseAmount = Amount
  )
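In rename(), the new name goes on the left and the existing name on the right. A small sketch on a hypothetical `toy` data frame makes the order easy to verify.

```r
library(dplyr)

# Hypothetical toy data with the original column names
toy <- data.frame(ID = 1:2, Amount = c(10, 20))

# new_name = old_name
renamed <- toy %>%
  rename(CustomerID = ID, PurchaseAmount = Amount)

print(names(renamed))
```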
5. Filtering and Selecting Data
# Filter rows where Age > 30
data_filtered <- data %>% filter(Age > 30)
# Select specific columns
data_selected <- data %>% select(Name, Age, PurchaseAmount)
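filter() accepts several conditions separated by commas, which are combined with AND, and it drops rows where a condition evaluates to NA. The following sketch chains filter() and select() on a hypothetical `toy` data frame.

```r
library(dplyr)

# Hypothetical toy data; Ben's age is missing
toy <- data.frame(
  Name = c("Ana", "Ben", "Cy"),
  Age = c(25, NA, 41),
  PurchaseAmount = c(50, 200, 300),
  stringsAsFactors = FALSE
)

# Both conditions must hold; the row with a missing Age is dropped
result <- toy %>%
  filter(Age > 30, PurchaseAmount >= 100) %>%
  select(Name, PurchaseAmount)

print(result)
```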
6. Creating New Variables
# Add a new column based on existing data
data <- data %>%
  mutate(
    DiscountedAmount = PurchaseAmount * 0.9, # Apply 10% discount
    AgeGroup = ifelse(Age < 30, "Young", "Adult")
  )
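ifelse() works well for two groups. For more than two, dplyr's case_when() is one readable alternative. The cut-offs, the "Senior" and "Unknown" labels, and the `toy` data frame below are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data including a missing age
toy <- data.frame(Age = c(17, 29, 30, 70, NA))

# Conditions are checked in order; the first match wins
toy <- toy %>%
  mutate(
    AgeGroup = case_when(
      is.na(Age) ~ "Unknown",
      Age < 30 ~ "Young",
      Age < 65 ~ "Adult",
      TRUE ~ "Senior"
    )
  )

print(toy)
```

Handling the missing case first avoids silently producing NA labels.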
7. Reshaping Data
a) Wide to Long
library(tidyr)
data_long <- pivot_longer(data, cols = starts_with("Month"), names_to = "Month", values_to = "Sales")
b) Long to Wide
data_wide <- pivot_wider(data_long, names_from = "Month", values_from = "Sales")
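The two reshaping calls can be checked as a round trip. This sketch builds a hypothetical `wide` data frame with columns `Month1` and `Month2`, pivots it to long form, and pivots it back.

```r
library(tidyr)

# Hypothetical wide data: one sales column per month
wide <- data.frame(
  Name = c("Ana", "Ben"),
  Month1 = c(10, 20),
  Month2 = c(30, 40),
  stringsAsFactors = FALSE
)

# Wide to long: one row per Name and Month combination
long <- pivot_longer(
  wide,
  cols = starts_with("Month"),
  names_to = "Month",
  values_to = "Sales"
)

# Long to wide: restores one column per month
back <- pivot_wider(long, names_from = Month, values_from = Sales)

print(long)
print(back)
```

The long form has one row per person and month, which is the layout most plotting and modeling functions expect.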
8. Sorting Data
# Sort by PurchaseAmount descending
data <- data %>% arrange(desc(PurchaseAmount))
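arrange() also accepts several sort keys, applied left to right. The sketch below sorts a hypothetical `toy` data frame by group first and then by amount in descending order within each group.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  AgeGroup = c("Young", "Adult", "Adult"),
  PurchaseAmount = c(50, 20, 90),
  stringsAsFactors = FALSE
)

# Sort by AgeGroup ascending, then PurchaseAmount descending
sorted <- toy %>% arrange(AgeGroup, desc(PurchaseAmount))

print(sorted)
```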
9. Advantages of Data Cleaning and Transformation
- Ensures accuracy and reliability of analysis
- Handles missing or inconsistent data
- Makes data suitable for modeling and visualization
- Streamlines workflows and improves reproducibility
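One way to support reproducibility is to wrap the cleaning steps in a single function so the same sequence runs on every new extract. The function name `clean_purchases`, the `raw` data frame, and its column names are hypothetical. This is a sketch of the idea, not a prescribed pipeline.

```r
library(dplyr)

# Hypothetical helper that chains several cleaning steps from this article
clean_purchases <- function(df) {
  df %>%
    distinct() %>%                                        # remove exact duplicate rows
    rename(CustomerID = ID, PurchaseAmount = Amount) %>%  # clearer column names
    mutate(Age = ifelse(is.na(Age), mean(Age, na.rm = TRUE), Age)) %>%  # mean imputation
    arrange(desc(PurchaseAmount))                         # largest purchases first
}

# Hypothetical raw data with one duplicate row and one missing age
raw <- data.frame(
  ID = c(1, 1, 2, 3),
  Age = c(20, 20, NA, 40),
  Amount = c(10, 10, 30, 20)
)

cleaned <- clean_purchases(raw)
print(cleaned)
```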
Conclusion
Cleaning and transforming data is a critical step in preparing datasets for analysis. With base R functions and packages like dplyr and tidyr, you can handle missing values, remove duplicates, filter and reshape data, and create new variables. Properly cleaned and structured data supports more accurate insights and better decision-making in data analysis projects.