Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, uncover patterns, detect anomalies, and test hypotheses before formal modeling. R provides many tools for both numerical and graphical exploration of data.

1. Understanding the Dataset

Before starting EDA, understand the structure and contents of your data:

# Load data
data <- read.csv("data.csv")
# View first rows
head(data)
# Structure and types
str(data)
# Summary statistics
summary(data)
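Beyond head(), str(), and summary(), it often helps to confirm the dimensions and column types explicitly, and to convert text columns that hold categories into factors. The sketch below is only an illustration: it builds a small made-up data frame in place of data.csv, reusing the column names from this tutorial's examples (the values themselves are invented).

```r
# Hypothetical stand-in for data.csv; the values are made up for illustration
data <- data.frame(
  Age = c(23, 35, 41, NA, 29),
  PurchaseAmount = c(120.5, 89.0, 240.0, 60.0, 150.0),
  Gender = c("F", "M", "F", "M", "F"),
  ProductCategory = c("Books", "Toys", "Books", "Garden", "Toys"),
  stringsAsFactors = FALSE
)

# Number of rows and columns
dims <- dim(data)
print(dims)

# Class of each column, recorded before any conversion
col_classes <- sapply(data, class)
print(col_classes)

# Convert text columns that represent categories to factors
data$Gender <- factor(data$Gender)
data$ProductCategory <- factor(data$ProductCategory)

# The factor levels list the categories present in the data
print(levels(data$Gender))
```

With factors in place, summary(data) reports counts per category instead of only the column length and class.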

2. Univariate Analysis

Univariate analysis examines one variable at a time.

a) Numerical Variables

# Histogram
hist(data$Age, main="Age Distribution", xlab="Age", col="skyblue", border="black")
# Boxplot
boxplot(data$PurchaseAmount, main="Purchase Amount Boxplot", ylab="Amount", col="lightgreen")
# Summary statistics
mean(data$PurchaseAmount)
median(data$PurchaseAmount)
sd(data$PurchaseAmount)
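One practical caveat: mean(), median(), and sd() return NA as soon as the vector contains a missing value, unless na.rm = TRUE is passed. The following sketch, using a made-up vector in place of data$PurchaseAmount, shows that behaviour and adds quartiles as a more robust description of spread.

```r
# Made-up purchase amounts, including one missing value and one large value
purchase <- c(120.5, 89.0, NA, 60.0, 150.0, 980.0)

# Without na.rm the result is NA because of the missing entry
naive_mean <- mean(purchase)

# Drop missing values explicitly
avg <- mean(purchase, na.rm = TRUE)
med <- median(purchase, na.rm = TRUE)
spread <- sd(purchase, na.rm = TRUE)

# Quartiles and interquartile range are less sensitive to extreme values
quartiles <- quantile(purchase, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
iqr <- IQR(purchase, na.rm = TRUE)

print(c(mean = avg, median = med, sd = spread, IQR = iqr))
```

In this toy vector the single large value pulls the mean well above the median, which is the kind of skew a histogram or boxplot would also reveal.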

b) Categorical Variables

# Frequency table
table(data$Gender)
# Proportions
prop.table(table(data$Gender))
# Bar plot
barplot(table(data$Gender), main="Gender Distribution", col="orange")
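By default table() silently drops missing values, so a frequency table can hide gaps in a categorical column. As a small sketch with an invented vector standing in for data$Gender, the code below counts missing entries explicitly, sorts the categories by frequency, and expresses them as percentages.

```r
# Made-up categorical values, including one missing entry
gender <- c("F", "M", "F", "F", NA, "M", "F")

# Include missing values as their own category when any are present
counts <- table(gender, useNA = "ifany")
print(counts)

# Sort the non-missing categories from most to least frequent
sorted_counts <- sort(table(gender), decreasing = TRUE)

# Percentages of the non-missing observations, rounded to one decimal
pct <- round(100 * prop.table(sorted_counts), 1)
print(pct)
```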

3. Bivariate Analysis

Bivariate analysis examines relationships between two variables.

a) Numerical vs Numerical

# Scatter plot
plot(data$Age, data$PurchaseAmount, main="Age vs Purchase Amount", xlab="Age", ylab="Purchase Amount", pch=19, col="blue")
# Correlation
cor(data$Age, data$PurchaseAmount)
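Note that cor() returns NA if either variable has missing values, and that its default (Pearson) measures only linear association. A minimal sketch with made-up vectors in place of data$Age and data$PurchaseAmount shows how to restrict the calculation to complete pairs and how to request a rank-based (Spearman) correlation instead.

```r
# Made-up values; one age is missing
age <- c(23, 35, 41, NA, 29, 52)
purchase <- c(120, 180, 210, 95, 150, 260)

# The default call returns NA because of the missing age
naive <- cor(age, purchase)

# Use only rows where both values are present
pearson <- cor(age, purchase, use = "complete.obs")

# Spearman correlation is based on ranks, so it captures any monotonic trend
spearman <- cor(age, purchase, use = "complete.obs", method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

Comparing the two coefficients alongside the scatter plot is a quick way to spot relationships that are monotonic but not linear.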

b) Numerical vs Categorical

# Boxplot by category
boxplot(PurchaseAmount ~ Gender, data=data, main="Purchase by Gender", ylab="Purchase Amount", col=c("pink","lightblue"))
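The grouped boxplot can be complemented with the same comparison in numbers. The sketch below assumes a small invented data frame with the tutorial's Gender and PurchaseAmount columns and computes per-group statistics with base R's tapply() and aggregate().

```r
# Made-up data using the tutorial's column names
data <- data.frame(
  Gender = c("F", "M", "F", "M", "F", "M"),
  PurchaseAmount = c(100, 80, 140, 120, 120, 100)
)

# Median purchase amount per gender (the centre line of each box)
group_median <- tapply(data$PurchaseAmount, data$Gender, median)
print(group_median)

# Mean purchase amount per gender, returned as a data frame
group_mean <- aggregate(PurchaseAmount ~ Gender, data = data, FUN = mean)
print(group_mean)
```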

c) Categorical vs Categorical

# Contingency table
table(data$Gender, data$ProductCategory)
# Mosaic plot
mosaicplot(table(data$Gender, data$ProductCategory), color=TRUE, main="Gender vs Product Category")
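Raw counts in a contingency table are hard to compare when the groups differ in size, so row proportions are often easier to read. As an illustration only, the sketch below uses two invented vectors in place of data$Gender and data$ProductCategory.

```r
# Made-up categorical values using the tutorial's variable names
gender <- c("F", "F", "F", "F", "M", "M", "M", "M")
category <- c("Books", "Books", "Books", "Toys", "Books", "Toys", "Toys", "Toys")

# Contingency table with labelled dimensions
tab <- table(Gender = gender, ProductCategory = category)

# Add row and column totals to the counts
print(addmargins(tab))

# Share of each product category within each gender (each row sums to 1)
row_share <- prop.table(tab, margin = 1)
print(row_share)
```

Using margin = 2 instead would give the gender split within each product category.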

4. Using dplyr and ggplot2 for EDA

dplyr and ggplot2 allow more powerful and flexible data exploration.

library(dplyr)
library(ggplot2)
# Summary by group
data %>%
  group_by(Gender) %>%
  summarise(
    AvgPurchase = mean(PurchaseAmount),
    MaxPurchase = max(PurchaseAmount),
    MinPurchase = min(PurchaseAmount)
  )
# Scatter plot with ggplot2
ggplot(data, aes(x=Age, y=PurchaseAmount, color=Gender)) +
  geom_point(size=3) +
  ggtitle("Age vs Purchase Amount by Gender")
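A grouped summary is usually more informative when it also reports how many rows each group has and how many values are missing. The following is a minimal sketch, assuming the dplyr package (version 1.0 or later, for the .groups argument) is installed and using an invented data frame with the tutorial's column names.

```r
library(dplyr)

# Made-up data; one purchase amount is missing
data <- data.frame(
  Gender = c("F", "M", "F", "M", "F"),
  PurchaseAmount = c(100, 80, NA, 120, 140)
)

by_gender <- data %>%
  group_by(Gender) %>%
  summarise(
    Customers = n(),                                   # rows per group
    Missing = sum(is.na(PurchaseAmount)),              # missing amounts per group
    AvgPurchase = mean(PurchaseAmount, na.rm = TRUE),  # mean of the available values
    .groups = "drop"
  ) %>%
  arrange(desc(AvgPurchase))                           # highest average first

print(by_gender)
```

Reporting the group size next to each average makes it easier to judge how much weight a difference between groups deserves.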

5. Detecting Outliers and Missing Values

# Missing values
colSums(is.na(data))
# Boxplot for outliers
boxplot(data$PurchaseAmount)
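The points a boxplot draws beyond its whiskers are, by default, values more than 1.5 times the interquartile range away from the box. As a sketch with a made-up vector standing in for data$PurchaseAmount, the same rule can be applied directly to list the flagged values rather than only plotting them.

```r
# Made-up purchase amounts with one missing value and one extreme value
purchase <- c(60, 89, 120.5, 150, 980, NA)

# Count missing values in the vector
n_missing <- sum(is.na(purchase))

# Quartiles and the 1.5 * IQR fences
q <- quantile(purchase, probs = c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * IQR(purchase, na.rm = TRUE)
lower <- q[[1]] - fence
upper <- q[[2]] + fence

# Values outside the fences are flagged as potential outliers
outliers <- purchase[!is.na(purchase) & (purchase < lower | purchase > upper)]
print(outliers)

# boxplot.stats() returns the points that boxplot() would draw individually;
# it uses hinges, which can differ slightly from quantile() on some samples
print(boxplot.stats(purchase)$out)
```

A flagged value is only a candidate for review; whether it is an error or a genuine extreme observation depends on the context of the data.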

6. Advantages of EDA

  • Understand data distributions and relationships
  • Detect anomalies, outliers, and missing values
  • Generate hypotheses for modeling
  • Guide feature selection and transformation

Conclusion

Exploratory Data Analysis (EDA) is a crucial first step in any data analysis project. By summarizing data, visualizing patterns, and detecting issues, EDA provides insights that inform modeling and decision-making. Tools in R, including base functions, dplyr, and ggplot2, make EDA efficient, flexible, and visually informative.
