Clustering and Unsupervised Learning

Clustering is a form of unsupervised learning, a family of methods that explore patterns in data without predefined labels. These techniques help identify natural groupings, reduce dimensionality, and extract insights from unlabeled datasets.

1. What is Unsupervised Learning?

Unsupervised learning analyzes datasets without target labels. The goal is to discover hidden patterns or structures. Common techniques include:

  • Clustering: Group similar observations together. Examples: K-Means, Hierarchical Clustering
  • Dimensionality Reduction: Reduce the number of variables while retaining important information. Examples: PCA (Principal Component Analysis)

2. K-Means Clustering

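K-Means partitions data into K clusters by assigning each observation to the cluster with the nearest centroid (the cluster mean) and repeatedly updating the centroids to reduce within-cluster variation. The number of clusters, K, is chosen in advance.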

a) Load Data

# Sample dataset
data <- read.csv("customer_data.csv")

# Select numeric variables for clustering
customer_data <- data[, c("Age", "PurchaseAmt")]

b) Apply K-Means

set.seed(123)
kmeans_model <- kmeans(customer_data, centers = 3)

# View cluster assignments
kmeans_model$cluster
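
K-Means relies on Euclidean distance, so a variable measured on a larger scale (for example, a purchase amount in currency units next to an age in years) can dominate the result. The following is a minimal sketch of one way to handle this. It uses simulated data in place of customer_data.csv, standardizes the columns first, and uses nstart to try several random starting points. The simulated values and the names sim_customers, sim_scaled, and km_scaled are illustrative assumptions, not part of the original dataset.

```r
# Simulated stand-in for customer_data.csv (hypothetical values)
set.seed(123)
sim_customers <- data.frame(
  Age         = c(rnorm(50, 25, 3), rnorm(50, 45, 3), rnorm(50, 65, 3)),
  PurchaseAmt = c(rnorm(50, 200, 30), rnorm(50, 800, 30), rnorm(50, 400, 30))
)

# Standardize so both variables contribute comparably to distances
sim_scaled <- scale(sim_customers)

# nstart = 25 runs K-Means from 25 random starts and keeps the best fit
km_scaled <- kmeans(sim_scaled, centers = 3, nstart = 25)

# Cluster sizes and total within-cluster sum of squares
km_scaled$size
km_scaled$tot.withinss
```

Whether to standardize depends on the data: if all variables share the same unit and their raw spread is meaningful, clustering on the original scale may be preferable.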

c) Add Cluster Labels to Data

customer_data$Cluster <- factor(kmeans_model$cluster)
head(customer_data)
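
Once labels are attached, a per-cluster summary helps interpret what each group represents. The sketch below assumes simulated data (the sim_customers data frame and its values are hypothetical) and uses base R's table() and aggregate() to count members and compute mean values per cluster.

```r
# Simulated stand-in for the customer data (hypothetical values)
set.seed(123)
sim_customers <- data.frame(
  Age         = c(rnorm(30, 25, 3), rnorm(30, 55, 3)),
  PurchaseAmt = c(rnorm(30, 200, 30), rnorm(30, 700, 30))
)

# Cluster the two numeric columns and attach the labels as a factor
km <- kmeans(sim_customers, centers = 2, nstart = 10)
sim_customers$Cluster <- factor(km$cluster)

# Number of customers in each cluster
table(sim_customers$Cluster)

# Mean Age and PurchaseAmt within each cluster
cluster_means <- aggregate(cbind(Age, PurchaseAmt) ~ Cluster,
                           data = sim_customers, FUN = mean)
cluster_means
```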

d) Visualize Clusters

library(ggplot2)

ggplot(customer_data, aes(x = Age, y = PurchaseAmt, color = Cluster)) +
  geom_point(size = 3) +
  ggtitle("K-Means Clustering of Customers")

3. Hierarchical Clustering

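Hierarchical clustering builds a nested tree of clusters, which is visualized as a dendrogram. The agglomerative approach used by hclust() starts with each observation as its own cluster and repeatedly merges the two closest clusters.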

# Compute distance matrix
dist_matrix <- dist(customer_data[, c("Age", "PurchaseAmt")], method = "euclidean")

# Perform hierarchical clustering
hc_model <- hclust(dist_matrix, method = "ward.D2")

# Plot dendrogram
plot(hc_model, main = "Hierarchical Clustering Dendrogram")
rect.hclust(hc_model, k = 3, border = "red") # Highlight 3 clusters
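
The dendrogram shows where the tree could be cut, but it does not return cluster labels by itself. A minimal sketch of extracting them with cutree() follows, again assuming simulated data; the names sim_customers, hc_sim, and hc_clusters are hypothetical.

```r
# Simulated stand-in for the customer data (hypothetical values)
set.seed(123)
sim_customers <- data.frame(
  Age         = c(rnorm(50, 25, 3), rnorm(50, 45, 3), rnorm(50, 65, 3)),
  PurchaseAmt = c(rnorm(50, 200, 30), rnorm(50, 800, 30), rnorm(50, 400, 30))
)

# Standardize, compute Euclidean distances, and build the tree
hc_sim <- hclust(dist(scale(sim_customers)), method = "ward.D2")

# Cut the tree into 3 clusters; returns one integer label per row
hc_clusters <- cutree(hc_sim, k = 3)

# Number of observations in each cluster
table(hc_clusters)
```

The resulting vector can be attached to the data frame in the same way as the K-Means labels, for example with factor(hc_clusters).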

4. Principal Component Analysis (PCA)

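PCA reduces dimensionality by projecting the data onto a small number of uncorrelated components, ordered by how much of the data's variance each one captures. With only two input variables, the example here is purely illustrative; PCA is most useful when there are many correlated variables.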

# Standardize data
customer_scaled <- scale(customer_data[, c("Age", "PurchaseAmt")])

# Apply PCA
pca_model <- prcomp(customer_scaled)
summary(pca_model)

# Plot PCA
biplot(pca_model, main = "PCA Biplot")
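
The proportion of variance reported by summary() can also be computed directly from the component standard deviations, which is convenient when deciding how many components to keep. The sketch below assumes simulated, correlated data; sim_customers, pca_sim, and var_explained are hypothetical names.

```r
# Simulated data in which PurchaseAmt is correlated with Age (hypothetical values)
set.seed(123)
sim_customers <- data.frame(Age = rnorm(100, 40, 10))
sim_customers$PurchaseAmt <- 10 * sim_customers$Age + rnorm(100, 0, 50)

# center and scale. standardize the variables inside prcomp()
pca_sim <- prcomp(sim_customers, center = TRUE, scale. = TRUE)

# Variance of each component divided by the total variance
var_explained <- pca_sim$sdev^2 / sum(pca_sim$sdev^2)
var_explained

# Cumulative proportion of variance explained
cumsum(var_explained)

# Component scores: the original rows expressed in PC coordinates
head(pca_sim$x)
```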

5. Advantages of Clustering and Unsupervised Learning

  • Identify natural groupings in data without labels
  • Discover patterns for marketing, segmentation, and research
  • Reduce dimensionality for visualization and modeling
  • Inform decision-making through data-driven insights

6. Conclusion

Clustering and unsupervised learning in R allow analysts to uncover hidden structures in data. Techniques like K-Means, hierarchical clustering, and PCA help segment customers, reduce complexity, and reveal insights that guide business strategy. Mastering these tools enables effective exploratory analysis and data-driven decision-making.
