Descriptive statistics summarize and describe the main features of a dataset. They provide insights into the distribution, central tendency, variability, and overall patterns of data. R offers built-in functions and packages to calculate descriptive statistics efficiently.
1. Measures of Central Tendency
Central tendency indicates the “center” of a dataset. Common measures include mean, median, and mode.
a) Mean
scores <- c(90, 85, 88, 92, 75, 80, 95)
mean(scores) # Calculates average score
b) Median
median(scores) # Middle value when data is sorted
c) Mode
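R doesn’t have a built-in function for the statistical mode (the base mode() function returns an object's storage mode, not its most frequent value), but you can define one:
get_mode <- function(x) {
  uniq <- unique(x)                          # distinct values, in order of first appearance
  uniq[which.max(tabulate(match(x, uniq)))]  # value with the highest count; ties go to the first
}
get_mode(scores) # Every score occurs once here, so the first value (90) is returned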
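Because which.max() keeps only the first maximum, the function above reports a single value even when several are tied. As a minimal sketch, assuming you want every tied value returned, a hypothetical helper get_modes() (not part of base R) with made-up data could look like this:

```r
# Hypothetical helper (not part of base R): returns every value tied for the highest count
get_modes <- function(x) {
  uniq <- unique(x)                    # distinct values, in order of first appearance
  counts <- tabulate(match(x, uniq))   # how often each distinct value occurs
  uniq[counts == max(counts)]          # keep all values that reach the top count
}

ratings <- c(3, 5, 3, 4, 5, 2)         # illustrative data: 3 and 5 each appear twice
get_modes(ratings)                     # returns 3 and 5
```

The same helper works for character vectors, since unique() and match() are not limited to numbers.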
2. Measures of Dispersion
Dispersion indicates how spread out the data is.
a) Range
range(scores) # Minimum and maximum values
diff(range(scores)) # Difference between maximum and minimum
b) Variance
var(scores) # Sample variance (n - 1 denominator)
c) Standard Deviation
sd(scores) # Square root of variance
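To see what var() and sd() compute, the sketch below rebuilds both by hand using the sample formula, which divides by n - 1 rather than n. The variable names are illustrative only:

```r
scores <- c(90, 85, 88, 92, 75, 80, 95)

n <- length(scores)
deviations <- scores - mean(scores)        # distance of each score from the mean
manual_var <- sum(deviations^2) / (n - 1)  # sample variance
manual_sd  <- sqrt(manual_var)             # standard deviation

all.equal(manual_var, var(scores))         # TRUE
all.equal(manual_sd, sd(scores))           # TRUE
```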
d) Quantiles
quantile(scores) # Default: minimum, three quartiles, and maximum (0%, 25%, 50%, 75%, 100%)
quantile(scores, probs = c(0.25, 0.5, 0.75)) # Only the three quartiles
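The quartiles lead directly to the interquartile range (IQR), the spread of the middle half of the data. A short sketch using base R's IQR(), checked against the quartiles themselves:

```r
scores <- c(90, 85, 88, 92, 75, 80, 95)

q <- quantile(scores, probs = c(0.25, 0.75))  # first and third quartiles
iqr_manual <- unname(q[2] - q[1])             # Q3 - Q1, with the name attribute dropped

all.equal(iqr_manual, IQR(scores))            # TRUE
```

Both calls rely on R's default quantile method, so the two results agree.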
3. Summary Function
summary() provides a quick overview of numeric data including min, 1st quartile, median, mean, 3rd quartile, and max.
summary(scores)
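Real datasets often contain missing values. As a small sketch with made-up data, summary() counts the NAs alongside the other statistics, while functions such as mean() return NA unless na.rm = TRUE is set:

```r
scores_na <- c(90, 85, NA, 92, 75)   # illustrative data with one missing value

summary(scores_na)                   # includes an "NA's" entry counting the missing value
mean(scores_na)                      # NA, because one value is missing
mean(scores_na, na.rm = TRUE)        # 85.5, computed from the four observed values
```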
4. Frequency Tables
For categorical data, frequency counts and proportions are useful.
grades <- c("A", "B", "A", "C", "B", "A")
table(grades) # Counts of each category
prop.table(table(grades)) # Proportion of each category
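To make a frequency table easier to read, the counts can be sorted and the proportions turned into percentages. A minimal sketch using only base R functions:

```r
grades <- c("A", "B", "A", "C", "B", "A")

counts <- table(grades)
sort(counts, decreasing = TRUE)        # most frequent grade first
round(100 * prop.table(counts), 1)     # percentages: A 50, B 33.3, C 16.7
```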
5. Using dplyr for Descriptive Statistics
The dplyr package's summarise() computes several statistics for a data frame in a single call:
library(dplyr)
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Score = c(90, 85, 88, 92)
)
data %>%
  summarise(
    Average = mean(Score),
    Maximum = max(Score),
    Minimum = min(Score),
    SD = sd(Score)
  )
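Summaries are often needed per group rather than for the whole table. The sketch below assumes dplyr is installed and adds a hypothetical Class column (not in the original data) so that group_by() has something to split on:

```r
library(dplyr)

# The Class column is made up for illustration
data <- data.frame(
  Name  = c("Alice", "Bob", "Charlie", "David"),
  Class = c("X", "X", "Y", "Y"),
  Score = c(90, 85, 88, 92)
)

by_class <- data %>%
  group_by(Class) %>%        # one group per distinct Class value
  summarise(
    Average = mean(Score),   # mean score within each class
    Count   = n()            # number of rows within each class
  )
by_class
```

The result has one row per class, which keeps the pattern the same as the ungrouped summary while scaling to any number of groups.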
6. Advantages of Descriptive Statistics
- Provides quick insights into the dataset
- Helps detect patterns, trends, and anomalies
- Forms the basis for inferential statistics
- Simplifies decision-making based on data summaries
Conclusion
Descriptive statistics are the foundation of data analysis in R. By calculating measures of central tendency, dispersion, quantiles, and frequencies, you can summarize large datasets effectively. Tools like summary(), table(), and dplyr::summarise() make it easier to explore and understand data before further analysis.