Summarizing data, a Review

Lucy D’Agostino McGowan

Learning objectives

  • Recall how to summarize one continuous variable
  • Identify variables where a mean is a good summary measure (or not)
  • Explain why we summarize data (what is the big picture?)

One continuous variable

One continuous variable

How can we visualize a single continuous variable?

Histogram

Code
library(tidyverse)  # provides starwars, drop_na(), and ggplot()
starwars |>
  drop_na(height) |>
  ggplot(aes(x = height)) +
  geom_histogram(bins = 30, fill = "#86a293")

Density

Code
starwars |>
  drop_na(height) |>
  ggplot(aes(x = height)) +
  geom_density(color = "#86a293")

Boxplot

Code
starwars |>
  drop_na(height) |>
  ggplot(aes(x = height, y = 1)) +
  geom_boxplot(outlier.shape = NA, color = "#86a293") + 
  geom_jitter(color = "#86a293") + 
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

One continuous variable

How can we numerically summarize a single continuous variable?


starwars |>
  summarise(mean = mean(height, na.rm = TRUE))
# A tibble: 1 × 1
   mean
  <dbl>
1  175.
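The `na.rm = TRUE` argument matters because `height` contains missing values. A minimal sketch of what it does, using a small hypothetical vector (`heights` is made up for illustration and is not the starwars data):

```r
# Hypothetical heights with one missing value (not the starwars data)
heights <- c(172, 167, NA, 202)

# A single missing value makes the whole mean missing
mean(heights)

# na.rm = TRUE drops the NA first, then averages the remaining three values
mean(heights, na.rm = TRUE)
```

The first call returns `NA`; the second averages only the three observed values.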

One continuous variable

Code
library(geomtextpath)
starwars |>
  drop_na(height) |>
  ggplot(aes(x = height)) +
  geom_histogram(bins = 30, fill = "#86a293") +
  geom_textvline(xintercept = 174, 
                 size = 6,        # text size of the label
                 linewidth = 2,   # width of the vertical line
                 label = "mean = 174",
                 hjust = 0.25)

One continuous variable

Why do we calculate a mean?

  • Reduces the dimensionality of the data (from n to 1)
  • Gives a sense of a “typical” observation
    • When is this an accurate representation?

Meaningful means

Symmetric

Code
set.seed(1)

d1 <- tibble(x = rnorm(1000, mean = 10))
ggplot(d1, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Bimodal

Code
d2 <- tibble(x = c(rnorm(500, mean = 10),
                   rnorm(500, mean = 20)))
ggplot(d2, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Skewed

Code
d3 <- tibble(x = rbeta(1000, 2, 5))
ggplot(d3, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Guess the mean for each of these variables.

Meaningful means

Symmetric

Code
ggplot(d1, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d1$x), lwd = 2)

Bimodal

Code
ggplot(d2, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d2$x), lwd = 2)

Skewed

Code
ggplot(d3, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d3$x), lwd = 2)

Does this value represent a “typical” observation?
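One rough way to put a number on "typical" is to ask what share of the observations fall close to the mean. The sketch below regenerates data with the same shapes as the symmetric and bimodal examples. The helper `share_near_mean()` and the cutoff of 1 unit are assumptions made for illustration, not part of the original slides.

```r
set.seed(1)

# Same shapes as the symmetric and bimodal examples
symmetric <- rnorm(1000, mean = 10)
bimodal   <- c(rnorm(500, mean = 10), rnorm(500, mean = 20))

# Hypothetical helper: share of observations within `width` of the mean
share_near_mean <- function(x, width = 1) {
  mean(abs(x - mean(x)) <= width)
}

share_near_mean(symmetric)  # roughly two thirds of the observations
share_near_mean(bimodal)    # close to zero: almost nothing sits near the mean
```

For the bimodal variable the mean lands in the gap between the two clusters, so it describes almost none of the observations.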

Math speak

\[\Large\bar{y} =\sum_{i=1}^n \frac{y_i}{n}\]

Math speak

\[\Large{\require{color}\colorbox{#86a293}{$\bar{y}$}} =\sum_{i=1}^n \frac{y_i}{n}\]

the mean of the variable \(y\)

Math speak

\[\Large\bar{y} ={\require{color}\colorbox{#86a293}{$\sum$}}_{i=1}^n \frac{y_i}{n}\]

add up the observations

Math speak

\[\Large\bar{y} =\sum_{{\require{color}\colorbox{#86a293}{$i=1$}}}^n \frac{y_i}{n}\]

starting from the first observation (\(i = 1\))

Math speak

\[\Large\bar{y} =\sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} \frac{y_i}{{\require{color}\colorbox{#86a293}{$n$}}}\]

\(n\) is the total number of observations

Math speak

\[\Large\bar{y} =\sum_{i=1}^n \frac{{\require{color}\colorbox{#86a293}{$y_i$}}}{n}\]

the value of the continuous variable for observation \(i\)

Math speak

\[\Large\bar{y} =\sum_{i=1}^n \frac{y_i}{\require{color}\colorbox{#86a293}{${n}$}}\]

divide by the total number of observations
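The formula can be checked term by term in R. This is a small sketch with hypothetical observations (the vector `y` below is chosen only to keep the arithmetic easy):

```r
# Hypothetical observations, chosen only to keep the arithmetic easy
y <- c(2, 4, 9)
n <- length(y)        # total number of observations

# Divide each y_i by n, then add the pieces up from i = 1 to n
y_bar <- sum(y / n)

y_bar                       # 5
all.equal(y_bar, mean(y))   # TRUE: matches the built-in mean()
```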

Application Exercise

data
\(y_1\) 3
\(y_2\) 5
\(y_3\) 1
\(y_4\) 7
\(y_5\) 8

  1. Using the data above, what is \(n\)?
  2. What is \(\bar{y}\)?

Data = model + error

Data

Code
d <- tibble(
  i = 1:5,
  y = c(3, 5, 1, 7, 8),
  model = mean(y),
  error = y - model
) 

knitr::kable(d)
i y model error
1 3 4.8 -1.8
2 5 4.8 0.2
3 1 4.8 -3.8
4 7 4.8 2.2
5 8 4.8 3.2
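Two properties of this table can be verified directly: adding the error back to the model recovers each observation, and the errors around the mean cancel out. A short sketch, rebuilding the same `d` so that it runs on its own:

```r
library(tibble)

d <- tibble(
  i = 1:5,
  y = c(3, 5, 1, 7, 8),
  model = mean(y),
  error = y - model
)

# Data = model + error: the two columns add back up to y
all.equal(d$model + d$error, d$y)

# Errors around the mean sum to zero (up to floating-point rounding)
abs(sum(d$error)) < 1e-12
```

Both checks return `TRUE`.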

Data

Code
ggplot(d, aes(x = 1, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), label = "mean = 4.8") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

Code
ggplot(d, aes(x = i, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), label = "mean = 4.8") + 
  geom_segment(aes(y = y, yend = mean(y), x = i, xend = i), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Math Speak

\[\Large y = \beta_0 + \varepsilon\]

Math Speak

\[\Large {\require{color}\colorbox{#86a293}{$y$}} = \beta_0 + \varepsilon\]

This is the vector \(y=\{y_1,\dots,y_n\}\)

Math Speak

\[\Large y = {\require{color}\colorbox{#86a293}{$\beta_0$}} + \varepsilon\]

We call this the “intercept”. When there are no other variables in the model, it is just the mean, \(\bar{y}\).

Math Speak

\[\Large y = \beta_0 + {\require{color}\colorbox{#86a293}{$\varepsilon$}}\]

the error: what is left of \(y\) after subtracting \(\beta_0\)

Data

Code
ggplot(d, aes(x = i, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), lwd = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = y, yend = mean(y), x = i, xend = i), color = "blue") +
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank()) 

Data

Code
ggplot(d, aes(x = i, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), lwd = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_textsegment(aes(y = y, yend = mean(y), x = i, xend = i), color = "blue",
                   label = as.character(expression(epsilon)), parse = TRUE,
                   lwd = 5) +
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

Code
ggplot(d, aes(x = 1, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), lwd = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = y, yend = mean(y), x = 1, xend = 1), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())  

Data

Code
ggplot(d, aes(x = 1, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), lwd = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = y, yend = mean(y), x = 1, xend = 1), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank()) 

Code
d2 <- d[2:4]
names(d2) <- c("$\\mathbf{y}$", "$\\beta_0$", "$\\varepsilon$")

knitr::kable(d2)
\(\mathbf{y}\) \(\beta_0\) \(\varepsilon\)
3 4.8 -1.8
5 4.8 0.2
1 4.8 -3.8
7 4.8 2.2
8 4.8 3.2

Calculating the mean in R

summarise(d, mean_y = mean(y))
# A tibble: 1 × 1
  mean_y
   <dbl>
1    4.8
lm(y ~ 1, data = d)

Call:
lm(formula = y ~ 1, data = d)

Coefficients:
(Intercept)  
        4.8  


  • “intercept only model”
  • lm: linear model
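The fitted model object also holds the pieces of "data = model + error". As a sketch, `coef()` and `resid()` from base R pull out the intercept and the errors. A plain `data.frame` is used here only so the example needs no extra packages; a tibble works the same way.

```r
# Same y values as above, stored in a base R data frame
d <- data.frame(y = c(3, 5, 1, 7, 8))

fit <- lm(y ~ 1, data = d)

# The single coefficient is the intercept, which here equals the mean of y
beta_0 <- coef(fit)[["(Intercept)"]]
all.equal(beta_0, mean(d$y))

# The residuals are the errors: each observation minus the fitted mean
all.equal(unname(resid(fit)), d$y - mean(d$y))
```

Both comparisons return `TRUE`, so the intercept-only model reproduces the `model` and `error` columns computed by hand.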

Application Exercise

Open your 04-appex.qmd file. Load the packages by running the top R chunk of code.

  1. Copy the code below into an R chunk at the bottom of the file:
d <- tibble(
  y = c(3, 5, 1, 7, 8)
)

What do you think this code does? Try typing ?tibble in the Console - what does this function do?

  2. Calculate the mean of y. Do this two ways, using the summarize function and using the lm function.
  3. Add a new variable called error to the data set d that is equal to y minus the mean of y.

Recap

When is the mean an appropriate summary measure to calculate?

What assumptions need to be true in order to use a mean to represent your single continuous variable?