Classification models are used to predict categorical outcomes, such as whether a customer will buy a product (Yes/No) or which category an observation belongs to. In R, logistic regression and decision trees are two widely used classification techniques.
1. Logistic Regression
Logistic regression models the probability of a binary outcome from one or more predictors. The model outputs a probability between 0 and 1, which is converted to a class label by applying a threshold.
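To make the "probability between 0 and 1" concrete: logistic regression treats the log-odds of the outcome as a linear combination of the predictors and maps that value to a probability with the logistic function. The base-R sketch below shows the mapping; the intercept and slope are made-up numbers for illustration, not estimates from any dataset.

```r
# Hypothetical coefficients, chosen only for illustration
intercept <- -4
slope_age <- 0.1

age <- c(20, 40, 60)

# Linear predictor on the log-odds scale
log_odds <- intercept + slope_age * age

# Logistic (inverse-logit) function maps log-odds to a probability
prob_manual <- 1 / (1 + exp(-log_odds))

# plogis() is base R's built-in version of the same function
prob_builtin <- plogis(log_odds)

print(round(prob_manual, 3))
```

A log-odds of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is a common default threshold.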
a) Load Data
# Load dataset
data <- read.csv("customer_purchases.csv")
# Convert target variable to factor
data$HighValue <- factor(ifelse(data$PurchaseAmt > 1000, "Yes", "No"))
# Inspect data
str(data)
table(data$HighValue)
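If customer_purchases.csv is not available, a synthetic stand-in lets the remaining steps run end to end. The sketch below is one way to build it. The column names match those used in this tutorial, but the category values, distributions, and sample size are assumptions made up for illustration.

```r
set.seed(42)
n <- 500

# Synthetic stand-in for customer_purchases.csv (all values are simulated)
data <- data.frame(
  Age             = sample(18:70, n, replace = TRUE),
  Gender          = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  ProductCategory = factor(sample(c("Electronics", "Clothing", "Grocery"),
                                  n, replace = TRUE))
)

# Simulated purchase amount: rises with age, higher for Electronics
data$PurchaseAmt <- round(
  400 + 10 * data$Age +
    ifelse(data$ProductCategory == "Electronics", 400, 0) +
    rnorm(n, mean = 0, sd = 250),
  2
)

# Same target definition as in the loading step
data$HighValue <- factor(ifelse(data$PurchaseAmt > 1000, "Yes", "No"),
                         levels = c("No", "Yes"))

# Class balance as proportions; a heavily skewed split would affect evaluation
print(prop.table(table(data$HighValue)))
```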
b) Split Data into Training and Test Sets
library(caret)
set.seed(123)
train_index <- createDataPartition(data$HighValue, p = 0.7, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
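For a factor outcome, createDataPartition() samples within each class, so the training and test sets keep roughly the same class proportions. The base-R sketch below illustrates that idea on simulated labels. It is not caret's implementation, and stratified_index is a hypothetical helper name.

```r
set.seed(123)

# Simulated imbalanced outcome (illustrative only): about 30% "Yes"
y <- factor(sample(c("No", "Yes"), 200, replace = TRUE, prob = c(0.7, 0.3)))

# Hypothetical helper: draw a fraction p of the row indices within each class
stratified_index <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_index <- stratified_index(y, p = 0.7)

# Class proportions should be close in both subsets
print(round(prop.table(table(y[train_index])), 3))
print(round(prop.table(table(y[-train_index])), 3))
```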
c) Fit Logistic Regression Model
model_logistic <- glm(HighValue ~ Age + Gender + ProductCategory,
                      data = train_data, family = binomial)
summary(model_logistic)
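The coefficients reported by summary() are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios. The sketch below shows this on simulated data, because the tutorial's CSV is not available here. The data-generating numbers are assumptions, and fit is a stand-in for model_logistic. With a factor response, glm() models the probability of the second level ("Yes" here).

```r
set.seed(1)
n <- 400

# Simulated data (illustrative only): the outcome depends on Age, not Gender
sim <- data.frame(
  Age    = sample(18:70, n, replace = TRUE),
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
p_true <- plogis(-4 + 0.08 * sim$Age)
sim$HighValue <- factor(ifelse(runif(n) < p_true, "Yes", "No"),
                        levels = c("No", "Yes"))

fit <- glm(HighValue ~ Age + Gender, data = sim, family = binomial)

# Coefficients are log-odds; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))

# Wald 95% confidence intervals, also moved to the odds-ratio scale
ci <- exp(confint.default(fit))

print(round(cbind(OddsRatio = odds_ratios, ci), 3))
```

An odds ratio above 1 means the odds of "Yes" increase as the predictor increases, and below 1 means they decrease. An interval that includes 1 suggests the effect is not clearly distinguishable from zero.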
d) Make Predictions
# Predict probabilities
probabilities <- predict(model_logistic, test_data, type = "response")
# Convert probabilities to class labels (threshold = 0.5)
predicted_classes <- ifelse(probabilities > 0.5, "Yes", "No")
predicted_classes <- factor(predicted_classes, levels = c("No", "Yes"))
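The 0.5 threshold is a choice, not a requirement. Lowering it labels more observations "Yes", and raising it labels fewer. The sketch below shows the effect on a handful of made-up probabilities; classify is a hypothetical helper that wraps the same ifelse() logic used above.

```r
# Toy predicted probabilities (made up for illustration)
probabilities <- c(0.05, 0.20, 0.35, 0.45, 0.55, 0.65, 0.80, 0.95)

# Hypothetical helper: turn probabilities into class labels at a given cutoff
classify <- function(p, threshold = 0.5) {
  factor(ifelse(p > threshold, "Yes", "No"), levels = c("No", "Yes"))
}

for (cutoff in c(0.3, 0.5, 0.7)) {
  n_yes <- sum(classify(probabilities, cutoff) == "Yes")
  cat("threshold =", cutoff, "-> predicted Yes:", n_yes, "\n")
}
```

Which cutoff is appropriate depends on the relative cost of false positives and false negatives in the application.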
e) Evaluate Model Performance
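# By default caret treats the first factor level ("No") as the positive class;
# set positive = "Yes" so sensitivity and specificity refer to high-value customers
confusionMatrix(predicted_classes, test_data$HighValue, positive = "Yes")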
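The headline numbers that confusionMatrix() reports can be reproduced by hand from a 2x2 table, which helps when checking what each metric means. The sketch below uses base R only and toy labels made up for illustration, with "Yes" treated as the positive class.

```r
# Toy labels (made up for illustration)
actual    <- factor(c("Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes", "No", "No"),
                    levels = c("No", "Yes"))
predicted <- factor(c("Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No", "No"),
                    levels = c("No", "Yes"))

# Rows = predicted, columns = actual (the orientation caret prints)
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

tp <- cm["Yes", "Yes"]  # true positives
tn <- cm["No", "No"]    # true negatives
fp <- cm["Yes", "No"]   # false positives
fn <- cm["No", "Yes"]   # false negatives

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)  # share of actual "Yes" that were found
specificity <- tn / (tn + fp)  # share of actual "No" that were found

cat("accuracy:", accuracy, " sensitivity:", sensitivity,
    " specificity:", round(specificity, 3), "\n")
```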
2. Decision Trees
Decision trees split the data into branches to classify observations based on feature values.
a) Install and Load Package
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
b) Fit Decision Tree Model
tree_model <- rpart(HighValue ~ Age + Gender + ProductCategory,
                    data = train_data, method = "class")
# Plot the tree
rpart.plot(tree_model, type=2, extra=104, fallen.leaves=TRUE)
c) Make Predictions
pred_tree <- predict(tree_model, test_data, type = "class")
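A classification tree can also return class probabilities, not only labels: predict() on an rpart model accepts type = "prob". The sketch below demonstrates both output types on simulated data. The age-band pattern and the new_customers rows are assumptions made up for illustration, and tree is a stand-in for tree_model.

```r
library(rpart)

set.seed(7)
n <- 300

# Simulated data (illustrative only): "Yes" is likely only in a middle age band
sim <- data.frame(Age = sample(18:70, n, replace = TRUE))
p_true <- ifelse(sim$Age >= 30 & sim$Age <= 50, 0.85, 0.15)
sim$HighValue <- factor(ifelse(runif(n) < p_true, "Yes", "No"),
                        levels = c("No", "Yes"))

tree <- rpart(HighValue ~ Age, data = sim, method = "class")

# Hypothetical new observations to score
new_customers <- data.frame(Age = c(22, 40, 65))

# type = "prob" returns one column of class probabilities per factor level
tree_probs <- predict(tree, new_customers, type = "prob")
print(round(tree_probs, 3))

# type = "class" returns the most probable label for each row
tree_class <- predict(tree, new_customers, type = "class")
print(tree_class)
```

The probabilities are the class proportions in the leaf each observation falls into, so they take only as many distinct values as the tree has leaves.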
d) Evaluate Model Performance
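# Use "Yes" as the positive class so metrics are comparable with the logistic model
confusionMatrix(pred_tree, test_data$HighValue, positive = "Yes")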
3. Comparison of Logistic Regression and Decision Trees
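| Feature | Logistic Regression | Decision Tree |
|---|---|---|
| Type | Parametric, linear in the log-odds | Non-parametric, non-linear |
| Output | Class probability | Class label (class probabilities also available with type = "prob") |
| Interpretability | Coefficients show the direction and size of each effect | Tree structure can be read visually |
| Handling Non-Linearity | Limited without transformations or interaction terms | Handles non-linear relationships and interactions |
| Robustness | Sensitive to outliers | Less sensitive to outliers |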
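The non-linearity contrast can be seen in a small simulation. The sketch below generates data in which "Yes" is likely only for a middle age band, then fits both models on the same single predictor. The simulate helper and all of its numbers are hypothetical, so the accuracies illustrate the mechanism and are not a benchmark.

```r
library(rpart)

set.seed(2024)

# Hypothetical simulator: "Yes" is likely only for ages 30-50 (non-linear in Age)
simulate <- function(n) {
  d <- data.frame(Age = sample(18:70, n, replace = TRUE))
  p_yes <- ifelse(d$Age >= 30 & d$Age <= 50, 0.85, 0.15)
  d$HighValue <- factor(ifelse(runif(n) < p_yes, "Yes", "No"),
                        levels = c("No", "Yes"))
  d
}

train <- simulate(1000)
test  <- simulate(1000)

# Logistic regression with a single linear Age term
fit_glm  <- glm(HighValue ~ Age, data = train, family = binomial)
pred_glm <- ifelse(predict(fit_glm, test, type = "response") > 0.5, "Yes", "No")

# Classification tree on the same predictor
fit_tree  <- rpart(HighValue ~ Age, data = train, method = "class")
pred_tree <- predict(fit_tree, test, type = "class")

# Held-out accuracy for each model
acc_glm  <- mean(pred_glm == test$HighValue)
acc_tree <- mean(pred_tree == test$HighValue)

cat("Logistic regression accuracy:", round(acc_glm, 3), "\n")
cat("Decision tree accuracy:      ", round(acc_tree, 3), "\n")
```

A single linear term cannot represent "low, then high, then low", while the tree can split Age twice. Adding a transformation such as I(Age^2) to the logistic model would be one way to let it capture a pattern like this.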
4. Advantages of Classification Models in R
- Predict categorical outcomes from complex data
- Identify important features that affect the target variable
- Visualize decision-making with trees or probability outputs
- Integrate easily with other R packages for evaluation and reporting
5. Conclusion
Logistic regression and decision trees are two core classification techniques in R. Logistic regression suits binary outcomes where the log-odds are approximately linear in the predictors, while decision trees handle non-linear relationships and interactions and give an intuitive visual explanation. Knowing both methods lets you build predictive models, evaluate their performance on held-out data, and choose the one that fits the problem.