Linear regression is one of the most widely used statistical methods for predicting a continuous dependent variable based on one or more independent variables. R provides a simple and efficient way to build, evaluate, and visualize linear regression models.
1. What is Linear Regression?
Linear regression models the relationship between a dependent variable Y and one or more independent variables X by fitting a linear equation:
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + ε
- β₀ = intercept
- β₁, β₂, … = coefficients for predictors
- ε = error term
Types of Linear Regression:
- Simple Linear Regression: One predictor variable
- Multiple Linear Regression: Two or more predictor variables
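To make the equation concrete, here is a minimal sketch that simulates data from a known line and checks that lm() recovers the intercept and slope. The data is simulated purely for illustration; the names n, x, y, and fit, and the chosen true values (50 and 2), are assumptions and not part of the customer dataset used later.

```r
# Simulated data (hypothetical): true intercept 50, true slope 2, noise sd 5
set.seed(42)
n <- 200
x <- runif(n, min = 18, max = 70)
y <- 50 + 2 * x + rnorm(n, mean = 0, sd = 5)

# Fit Y = b0 + b1 * X by ordinary least squares
fit <- lm(y ~ x)

# The estimates should land close to the true values used in the simulation
coef(fit)
```

Because the error term is random, the estimates will not match the true values exactly, but they should move closer as n grows.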
2. Load and Inspect Data
# Load dataset
data <- read.csv("customer_purchases.csv")
# Inspect data
head(data)
str(data)
summary(data)
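Beyond the inspection calls above, it often helps to check for missing values and to confirm how categorical columns are stored before fitting a model. The sketch below uses a small made-up data frame as a stand-in for customer_purchases.csv; the values and the name data_clean are assumptions for illustration only.

```r
# Hypothetical stand-in for customer_purchases.csv (values are made up)
data <- data.frame(
  PurchaseAmt = c(120.5, 80.0, NA, 200.0, 150.25),
  Age = c(34, 27, 45, 52, 39),
  Gender = c("F", "M", "F", "M", "F"),
  ProductCategory = c("Books", "Toys", "Books", "Electronics", "Toys")
)

# Count missing values per column
colSums(is.na(data))

# Store categorical columns as factors so their levels are explicit
# (lm() would also convert character columns to factors on its own)
data$Gender <- factor(data$Gender)
data$ProductCategory <- factor(data$ProductCategory)

# Drop rows that contain a missing value in any column
data_clean <- na.omit(data)
nrow(data_clean)
```

Dropping incomplete rows is only one option. Whether it is appropriate depends on how much data is missing and why.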
3. Simple Linear Regression
Predicting PurchaseAmt based on Age:
# Fit model
model_simple <- lm(PurchaseAmt ~ Age, data = data)
# Model summary
summary(model_simple)
Key Outputs:
- Coefficients: Effect of Age on PurchaseAmt
- R-squared: Proportion of variance explained by the model
- p-value: Statistical significance of each predictor (a small value suggests its coefficient differs from zero)
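The quantities listed above can also be pulled out of the summary object programmatically instead of being read off the printed output. This is a sketch on simulated data: the data frame, the seed, and the names s, coef_table, age_slope, age_p, r2, and ci are assumptions for illustration.

```r
# Simulated stand-in data (hypothetical): PurchaseAmt rises with Age
set.seed(1)
data <- data.frame(Age = round(runif(100, min = 18, max = 70)))
data$PurchaseAmt <- 20 + 1.5 * data$Age + rnorm(100, sd = 10)

model_simple <- lm(PurchaseAmt ~ Age, data = data)
s <- summary(model_simple)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(s)
age_slope <- coef_table["Age", "Estimate"]
age_p <- coef_table["Age", "Pr(>|t|)"]

# Proportion of variance explained
r2 <- s$r.squared

# 95% confidence intervals for the intercept and slope
ci <- confint(model_simple, level = 0.95)
ci
```

The confidence interval is often more informative than the p-value alone, since it shows the range of slopes that are compatible with the data.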
4. Multiple Linear Regression
Predicting PurchaseAmt based on Age, Gender, and ProductCategory:
# Fit model
model_multiple <- lm(PurchaseAmt ~ Age + Gender + ProductCategory, data = data)
# Summary
summary(model_multiple)
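When a model includes categorical predictors such as Gender and ProductCategory, R expands each factor into dummy variables and treats the first level as the baseline. The sketch below shows this on simulated data; the level names ("F", "M", "Books", "Electronics", "Toys") and the simulated effect sizes are assumptions, not values from the real dataset.

```r
# Simulated stand-in data (hypothetical levels and effects)
set.seed(2)
n <- 150
data <- data.frame(
  Age = round(runif(n, min = 18, max = 70)),
  Gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  ProductCategory = factor(sample(c("Books", "Electronics", "Toys"), n, replace = TRUE))
)
data$PurchaseAmt <- 30 + 1.2 * data$Age +
  15 * (data$ProductCategory == "Electronics") + rnorm(n, sd = 8)

model_multiple <- lm(PurchaseAmt ~ Age + Gender + ProductCategory, data = data)

# One coefficient per non-baseline level; "F" and "Books" are absorbed
# into the intercept because they are the first levels alphabetically
names(coef(model_multiple))

# Adjusted R-squared penalizes extra predictors, which makes it a fairer
# basis than plain R-squared for comparing against the simple model
model_simple <- lm(PurchaseAmt ~ Age, data = data)
summary(model_simple)$adj.r.squared
summary(model_multiple)$adj.r.squared
```

Each dummy coefficient is read as the expected difference in PurchaseAmt relative to the baseline level, holding the other predictors fixed.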
5. Making Predictions
# Predict purchase amounts
predictions <- predict(model_multiple, newdata = data)
# View predicted vs actual
head(data.frame(Actual = data$PurchaseAmt, Predicted = predictions))
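The code above predicts on the same rows the model was fitted to. In practice, predictions are usually wanted for new observations, and predict() can also return an interval around each one. The following sketch uses a simple Age-only model on simulated data to stay short; new_customers and its ages are hypothetical.

```r
# Simulated stand-in data (hypothetical)
set.seed(3)
data <- data.frame(Age = round(runif(100, min = 18, max = 70)))
data$PurchaseAmt <- 20 + 1.5 * data$Age + rnorm(100, sd = 10)
model_simple <- lm(PurchaseAmt ~ Age, data = data)

# New observations must carry the same column names as the predictors
new_customers <- data.frame(Age = c(25, 40, 60))

# interval = "prediction" returns the point estimate (fit) plus lower (lwr)
# and upper (upr) bounds for an individual new purchase
pred <- predict(model_simple, newdata = new_customers,
                interval = "prediction", level = 0.95)
pred
```

For a model with factor predictors, the new data frame would also need those columns, with levels that appeared in the training data.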
6. Evaluating Model Performance
Common metrics for regression models:
# Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((predictions - data$PurchaseAmt)^2))
rmse
# Mean Absolute Error (MAE)
mae <- mean(abs(predictions - data$PurchaseAmt))
mae
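Metrics computed on the same data used for fitting tend to look optimistic. One common alternative is to hold out part of the data and evaluate on it. This is a sketch on simulated data, assuming an 80/20 split; the names train_idx, train, test, rmse_test, and mae_test are illustrative.

```r
# Simulated stand-in data (hypothetical)
set.seed(4)
n <- 300
data <- data.frame(Age = round(runif(n, min = 18, max = 70)))
data$PurchaseAmt <- 20 + 1.5 * data$Age + rnorm(n, sd = 10)

# Randomly assign 80% of rows to training and the rest to testing
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- data[train_idx, ]
test <- data[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows
model <- lm(PurchaseAmt ~ Age, data = train)
test_pred <- predict(model, newdata = test)

# Out-of-sample error metrics
rmse_test <- sqrt(mean((test_pred - test$PurchaseAmt)^2))
mae_test <- mean(abs(test_pred - test$PurchaseAmt))
c(RMSE = rmse_test, MAE = mae_test)
```

RMSE is never smaller than MAE on the same predictions, and a large gap between them points to a few large errors dominating the RMSE.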
7. Visualizing Regression Results
a) Simple Linear Regression Plot
plot(data$Age, data$PurchaseAmt, main="Age vs Purchase Amount", xlab="Age", ylab="Purchase Amount", pch=19, col="blue")
abline(model_simple, col="red", lwd=2)
b) Multiple Regression with ggplot2
library(ggplot2)
ggplot(data, aes(x=Age, y=PurchaseAmt, color=Gender)) +
geom_point(size=3) +
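# Note: with color=Gender, geom_smooth() fits a separate line for each gender; these are not the fitted values of model_multiple
geom_smooth(method="lm", se=FALSE) +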
ggtitle("Multiple Linear Regression: Age & Gender vs Purchase Amount")
8. Assumptions of Linear Regression
- Linearity: Relationship between predictors and outcome is linear
- Independence: Observations are independent
- Homoscedasticity: Constant variance of residuals
- Normality: Residuals are normally distributed
- No multicollinearity: Predictors are not highly correlated
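Check linearity, homoscedasticity, and normality of residuals using diagnostic plots (independence and multicollinearity need separate checks):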
par(mfrow = c(2, 2))
plot(model_multiple)
par(mfrow = c(1, 1))
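The diagnostic plots do not reveal multicollinearity. One common check is the variance inflation factor (VIF), which for each predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the others. The sketch below computes it by hand for numeric predictors on simulated data. The columns Income and Visits and the helper vif_manual are hypothetical and exist only to illustrate the calculation.

```r
# Simulated stand-in data (hypothetical): Income is deliberately tied to Age
set.seed(5)
n <- 200
predictors <- data.frame(Age = round(runif(n, min = 18, max = 70)))
predictors$Income <- 1000 * predictors$Age + rnorm(n, sd = 5000)
predictors$Visits <- rpois(n, lambda = 4)

# Hypothetical helper: VIF for each column of a numeric data frame
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    # Regress predictor v on all remaining predictors
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(predictors)
vifs
```

A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, though the threshold is a convention and not a strict rule. Here Age and Income should score high while Visits stays near 1. For models with factor predictors, a package such as car provides a vif() function that handles them.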
9. Advantages of Linear Regression
- Simple and interpretable
- Provides insight into relationships between variables
- Basis for more advanced predictive models
- Easy to implement and visualize in R
Conclusion
Linear regression in R is a fundamental technique for modeling and predicting continuous outcomes. By fitting simple or multiple regression models, making predictions, evaluating performance, and checking assumptions, you can extract actionable insights and understand relationships in your data. Mastering linear regression is a crucial step in data analysis and machine learning workflows.