Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, uncover patterns, detect anomalies, and test hypotheses before formal modeling. R provides many tools for both numerical and graphical exploration of data.
1. Understanding the Dataset
Before starting EDA, understand the structure and contents of your data:
# Load data
data <- read.csv("data.csv")
# View first rows
head(data)
# Structure and types
str(data)
# Summary statistics
summary(data)
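If you do not have a data.csv file on hand, a small synthetic data frame is enough to follow along. The sketch below is an assumption for illustration only: the values are randomly generated, the category labels are made up, and the column names are simply those used in the code throughout this article.

```r
# Hypothetical sample data; column names mirror those used in this article
set.seed(42)
n <- 100
data <- data.frame(
  Age = sample(18:70, n, replace = TRUE),
  Gender = sample(c("Female", "Male"), n, replace = TRUE),
  ProductCategory = sample(c("Books", "Clothing", "Electronics"), n, replace = TRUE),
  PurchaseAmount = round(rlnorm(n, meanlog = 4, sdlog = 0.6), 2)
)

# Inject two missing values so that later NA checks have something to find
data$PurchaseAmount[c(5, 20)] <- NA

dim(data)            # number of rows and columns
names(data)          # column names
sapply(data, class)  # storage class of each column
```

With this in place, head(data), str(data), and summary(data) behave as they would on a file loaded with read.csv().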
2. Univariate Analysis
Univariate analysis examines one variable at a time.
a) Numerical Variables
# Histogram
hist(data$Age, main="Age Distribution", xlab="Age", col="skyblue", border="black")
# Boxplot
boxplot(data$PurchaseAmount, main="Purchase Amount Boxplot", ylab="Amount", col="lightgreen")
# Summary statistics
mean(data$PurchaseAmount)
median(data$PurchaseAmount)
sd(data$PurchaseAmount)
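One thing to watch for: mean(), median(), and sd() return NA as soon as the variable contains a missing value. The sketch below uses a made-up vector (not taken from any real dataset) to show the na.rm argument, along with quantile() and IQR() as summaries that are less sensitive to extreme values.

```r
# Hypothetical purchase amounts with one missing value and one extreme value
purchase <- c(12.5, 40, 33.2, NA, 27.9, 250, 31.4, 29.8)

mean(purchase)                  # NA, because one value is missing
mean(purchase, na.rm = TRUE)    # mean of the observed values only
median(purchase, na.rm = TRUE)  # less affected by the extreme value of 250
sd(purchase, na.rm = TRUE)

# Quartiles and interquartile range
quantile(purchase, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
IQR(purchase, na.rm = TRUE)
```

Comparing the mean with the median is a quick skewness check: here the single large value pulls the mean well above the median.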
b) Categorical Variables
# Frequency table
table(data$Gender)
# Proportions
prop.table(table(data$Gender))
# Bar plot
barplot(table(data$Gender), main="Gender Distribution", col="orange")
3. Bivariate Analysis
Bivariate analysis examines relationships between two variables.
a) Numerical vs Numerical
# Scatter plot
plot(data$Age, data$PurchaseAmount, main="Age vs Purchase Amount", xlab="Age", ylab="Purchase Amount", pch=19, col="blue")
# Correlation
cor(data$Age, data$PurchaseAmount)
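By default cor() computes the Pearson coefficient and returns NA if either variable has missing values. As a rough sketch with invented numbers, the use and method arguments of cor() cover the two most common adjustments: dropping incomplete pairs and switching to a rank-based coefficient.

```r
# Hypothetical vectors; the NA mimics a missing purchase record
age      <- c(23, 35, 41, 29, 52, 60, 47, 33)
purchase <- c(20, 35, 48, NA, 61, 75, 50, 30)

# Default behaviour: the missing value propagates to the result
cor(age, purchase)

# Pearson correlation on complete pairs only
pearson <- cor(age, purchase, use = "complete.obs")

# Spearman rank correlation, which captures monotonic rather than linear association
spearman <- cor(age, purchase, use = "complete.obs", method = "spearman")

pearson
spearman
```

If the two coefficients differ noticeably, the scatter plot is worth a second look for curvature or outliers.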
b) Numerical vs Categorical
# Boxplot by category
boxplot(PurchaseAmount ~ Gender, data=data, main="Purchase by Gender", ylab="Purchase Amount", col=c("pink","lightblue"))
c) Categorical vs Categorical
# Contingency table
table(data$Gender, data$ProductCategory)
# Mosaic plot
mosaicplot(table(data$Gender, data$ProductCategory), color=TRUE, main="Gender vs Product Category")
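Raw counts can be hard to compare when the groups differ in size. A minimal sketch, using made-up labels, of converting a contingency table to row proportions with prop.table() and adding totals with addmargins():

```r
# Hypothetical categorical data
gender   <- c("F", "M", "F", "F", "M", "M", "F", "M")
category <- c("Books", "Books", "Clothing", "Books",
              "Electronics", "Clothing", "Clothing", "Electronics")

# Two-way contingency table with named dimensions
tab <- table(Gender = gender, ProductCategory = category)

# Row proportions: share of each product category within each gender
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)

# Counts with row and column totals
addmargins(tab)
```

Using margin = 2 instead gives column proportions, that is, the gender split within each product category.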
4. Using dplyr and ggplot2 for EDA
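The dplyr and ggplot2 packages complement base R: dplyr chains filtering, grouping, and summarising steps into readable pipelines, and ggplot2 builds plots from layers mapped to variables.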
library(dplyr)
library(ggplot2)
# Summary by group
data %>%
  group_by(Gender) %>%
  summarise(
    AvgPurchase = mean(PurchaseAmount),
    MaxPurchase = max(PurchaseAmount),
    MinPurchase = min(PurchaseAmount)
  )
# Scatter plot with ggplot2
ggplot(data, aes(x=Age, y=PurchaseAmount, color=Gender)) +
  geom_point(size=3) +
  ggtitle("Age vs Purchase Amount by Gender")
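For comparison, a similar grouped summary can be produced in base R without loading any packages. The sketch below is one possible approach, not the only one. It uses an invented six-row data frame, passes na.rm = TRUE so that a missing purchase does not turn the group's summaries into NA, and adds a Missing column (an addition for illustration) that counts the dropped values.

```r
# Hypothetical data with one missing purchase amount
data <- data.frame(
  Gender = c("F", "M", "F", "M", "F", "M"),
  PurchaseAmount = c(20, 35, NA, 50, 40, 65)
)

# Split the numeric column by group, then summarise each piece
by_gender <- split(data$PurchaseAmount, data$Gender)
summary_by_gender <- t(sapply(by_gender, function(x) c(
  AvgPurchase = mean(x, na.rm = TRUE),
  MaxPurchase = max(x, na.rm = TRUE),
  MinPurchase = min(x, na.rm = TRUE),
  Missing     = sum(is.na(x))
)))

summary_by_gender  # one row per gender, one column per statistic
```

The same na.rm = TRUE argument can be passed to mean(), max(), and min() inside summarise() in a dplyr pipeline.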
5. Detecting Outliers and Missing Values
# Missing values
colSums(is.na(data))
# Boxplot for outliers
boxplot(data$PurchaseAmount)
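A boxplot shows outliers visually, but it is often useful to get the actual values. By default, boxplot() draws whiskers out to 1.5 times the interquartile range and plots anything beyond them as individual points. The sketch below applies that same rule by hand to a made-up vector; the 1.5 multiplier is a convention, not a fixed threshold, and the variable names are illustrative.

```r
# Hypothetical purchase amounts with one missing and two unusual values
purchase <- c(12.5, 40, 33.2, NA, 27.9, 250, 31.4, 29.8)

# Count missing values in the vector
sum(is.na(purchase))

# Fences based on the 1.5 * IQR rule
q <- unname(quantile(purchase, probs = c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Observed values that fall outside the fences
outliers <- purchase[!is.na(purchase) & (purchase < lower | purchase > upper)]
outliers

# boxplot.stats() returns the points a boxplot would draw as outliers
boxplot.stats(purchase)$out
```

Flagged values are candidates for inspection rather than automatic removal: they may be data-entry errors, or they may be genuine observations that matter for the analysis.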
6. Advantages of EDA
- Understand data distributions and relationships
- Detect anomalies, outliers, and missing values
- Generate hypotheses for modeling
- Guide feature selection and transformation
Conclusion
Exploratory Data Analysis (EDA) is a crucial first step in any data analysis project. By summarizing data, visualizing patterns, and detecting issues, EDA provides insights that inform modeling and decision-making. Tools in R, including base functions, dplyr, and ggplot2, make EDA efficient, flexible, and visually informative.