Summarizing data, a Review

Lucy D’Agostino McGowan

Learning objectives

Recall how to summarize one continuous variable
Identify variables where a mean is a good summary measure (or not)
Explain why we summarize data (what is the big picture?)

One continuous variable

How can we visualize a single continuous variable?

Histogram

library(tidyverse)  # provides starwars (dplyr), drop_na() (tidyr), and ggplot2

starwars |>
  drop_na(height) |>
  ggplot(aes(x = height)) +
  geom_histogram(bins = 30, fill = "#86a293")

Density

starwars |>
  drop_na(height) |>
  ggplot(aes(x = height)) +
  geom_density(color = "#86a293")

Boxplot

starwars |>
  drop_na(height) |>
  ggplot(aes(x = height, y = 1)) +
  geom_boxplot(outlier.shape = NA, color = "#86a293") + 
  geom_jitter(color = "#86a293") + 
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

One continuous variable

How can we numerically summarize a single continuous variable?

starwars |>
  summarise(mean = mean(height, na.rm = TRUE))

# A tibble: 1 × 1
   mean
  <dbl>
1  175.
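The `na.rm = TRUE` argument matters here because `height` has missing values (the plots above drop them with `drop_na()`). A minimal sketch of what the argument does, using a small made-up vector rather than the `starwars` data:

```r
# Hypothetical heights with one missing value (illustrative only)
x <- c(150, NA, 180)

mean_with_na    <- mean(x)                # NA: a missing value propagates
mean_without_na <- mean(x, na.rm = TRUE)  # drops the NA, then averages

mean_with_na
mean_without_na
```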

One continuous variable

library(geomtextpath)
starwars |>
  drop_na(height) |>
  ggplot(aes(x = height)) +
  geom_histogram(bins = 30, fill = "#86a293") +
  geom_textvline(xintercept = 174,
                 size = 6,       # text size (lwd would clash with linewidth)
                 linewidth = 2,  # thickness of the vertical line
                 label = "mean = 174",
                 hjust = 0.25)

One continuous variable

Why do we calculate a mean?

To reduce the dimensionality of the data (from $n$ values to 1)
To get a sense of a “typical” observation
- When is this an accurate representation?

Meaningful means

Symmetric

set.seed(1)

d1 <- tibble(x = rnorm(1000, mean = 10))
ggplot(d1, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Bimodal

d2 <- tibble(x = c(rnorm(500, mean = 10),
                   rnorm(500, mean = 20)))
ggplot(d2, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Skewed

d3 <- tibble(x = rbeta(1000, 2, 5))
ggplot(d3, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Guess the mean for each of these variables.

Meaningful means

Symmetric

ggplot(d1, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d1$x), lwd = 2)

Bimodal

ggplot(d2, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d2$x), lwd = 2)

Skewed

ggplot(d3, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d3$x), lwd = 2)

Does this value represent a “typical” observation?
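One rough way to check this numerically is to ask what share of observations falls close to the mean. The sketch below regenerates data shaped like `d1` and `d2` as plain vectors in base R; the helper `share_near_mean()` and the ±1 window are assumptions made for illustration, not part of the original slides.

```r
set.seed(1)

# Same shapes as d1 (symmetric) and d2 (bimodal) above, as plain vectors
x_sym <- rnorm(1000, mean = 10)
x_bim <- c(rnorm(500, mean = 10), rnorm(500, mean = 20))

# Hypothetical helper: proportion of values within `width` of the mean
share_near_mean <- function(x, width = 1) {
  mean(abs(x - mean(x)) <= width)
}

share_sym <- share_near_mean(x_sym)  # a majority of values sit near the mean
share_bim <- share_near_mean(x_bim)  # very few do: the mean falls between the two peaks

share_sym
share_bim
```

When the second share is close to zero, the mean describes a value that almost no observation actually takes.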

Math speak

\[\Large\bar{y} =\sum_{i=1}^n \frac{y_i}{n}\]

Math speak

\[\Large{\require{color}\colorbox{#86a293}{$\bar{y}$}} =\sum_{i=1}^n \frac{y_i}{n}\]

the mean of the variable $y$

Math speak

\[\Large\bar{y} ={\require{color}\colorbox{#86a293}{$\sum$}}_{i=1}^n \frac{y_i}{n}\]

add up the observations

Math speak

\[\Large\bar{y} =\sum_{{\require{color}\colorbox{#86a293}{$i=1$}}}^n \frac{y_i}{n}\]

starting from the first observation

Math speak

\[\Large\bar{y} =\sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} \frac{y_i}{{\require{color}\colorbox{#86a293}{$n$}}}\]

total number of observations

Math speak

\[\Large\bar{y} =\sum_{i=1}^n \frac{{\require{color}\colorbox{#86a293}{$y_i$}}}{n}\]

the value of the continuous variable for observation $i$

Math speak

\[\Large\bar{y} =\sum_{i=1}^n \frac{y_i}{\require{color}\colorbox{#86a293}{${n}$}}\]

divide by the total number of observations
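Putting the pieces together, the formula can be translated into R almost symbol for symbol. This is a minimal sketch with made-up values (not the exercise data); `y`, `n`, and `y_bar` are names chosen here to mirror the notation.

```r
y <- c(2, 4, 9)      # hypothetical observations y_1, y_2, y_3
n <- length(y)       # total number of observations

# Divide each y_i by n, then add up the results from i = 1 to n
y_bar <- sum(y / n)

y_bar
all.equal(y_bar, mean(y))  # compare with R's built-in mean()
```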

`Application Exercise`

data
$y_1$	3
$y_2$	5
$y_3$	1
$y_4$	7
$y_5$	8

Using the data above, what is $n$?
What is $\bar{y}$?

Data = model + error

Data

d <- tibble(
  i = 1:5,
  y = c(3, 5, 1, 7, 8),
  model = mean(y),
  error = y - model
) 

knitr::kable(d)

i	y	model	error
1	3	4.8	-1.8
2	5	4.8	0.2
3	1	4.8	-3.8
4	7	4.8	2.2
5	8	4.8	3.2
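
As a check on the table, each row satisfies data = model + error, and the errors around the mean add up to zero. Written out with the five values above:

\[\bar{y} = \frac{3 + 5 + 1 + 7 + 8}{5} = \frac{24}{5} = 4.8\]

\[3 = 4.8 + (-1.8), \qquad (-1.8) + 0.2 + (-3.8) + 2.2 + 3.2 = 0\]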

Data

ggplot(d, aes(x = 1, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), label = "mean = 4.8") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

ggplot(d, aes(x = i, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), label = "mean = 4.8") + 
  geom_segment(aes(y = y, yend = mean(y), x = i, xend = i), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Math speak

\[\Large y = \beta_0 + \varepsilon\]

Math speak

\[\Large {\require{color}\colorbox{#86a293}{$y$}} = \beta_0 + \varepsilon\]

This is the vector $y=\{y_1,\dots,y_n\}$

Math speak

\[\Large y = {\require{color}\colorbox{#86a293}{$\beta_0$}} + \varepsilon\]

We call this the “intercept”. When there are no other variables, its estimate is just the mean, $\bar{y}$.

Math speak

\[\Large y = \beta_0 + {\require{color}\colorbox{#86a293}{$\varepsilon$}}\]

the error

Data

ggplot(d, aes(x = i, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), lwd = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = y, yend = mean(y), x = i, xend = i), color = "blue") +
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

ggplot(d, aes(x = i, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), lwd = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_textsegment(aes(y = y, yend = mean(y), x = i, xend = i), color = "blue",
                   label = as.character(expression(epsilon)), parse = TRUE,
                   lwd = 5) +
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

ggplot(d, aes(x = 1, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), lwd = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = y, yend = mean(y), x = 1, xend = 1), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

ggplot(d, aes(x = 1, y = y)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$y), lwd = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = y, yend = mean(y), x = 1, xend = 1), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank())

d2 <- d[2:4]
names(d2) <- c("$\\mathbf{y}$", "$\\beta_0$", "$\\varepsilon$")

knitr::kable(d2)

$\mathbf{y}$	$\beta_0$	$\varepsilon$
3	4.8	-1.8
5	4.8	0.2
1	4.8	-3.8
7	4.8	2.2
8	4.8	3.2

Calculating the mean in R

summarise(d, mean_y = mean(y))

# A tibble: 1 × 1
  mean_y
   <dbl>
1    4.8

lm(y ~ 1, data = d)


Call:
lm(formula = y ~ 1, data = d)

Coefficients:
(Intercept)  
        4.8

“intercept only model”
lm: linear model
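
The two results agree because, in an intercept-only model, least squares picks the single value that best fits every observation, and that value is the mean. A minimal sketch in base R (using `data.frame` so it runs without extra packages; `fit`, `beta_0`, and `errors` are names introduced here):

```r
d <- data.frame(y = c(3, 5, 1, 7, 8))

fit <- lm(y ~ 1, data = d)            # intercept-only model

beta_0 <- coef(fit)[["(Intercept)"]]  # estimated intercept
errors <- unname(resid(fit))          # residuals: y minus the fitted mean

all.equal(beta_0, mean(d$y))
all.equal(errors, d$y - mean(d$y))
```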

`Application Exercise`

Open your 04-appex.qmd file. Load the packages by running the top R chunk of code.

Copy the code below into an R chunk at the bottom of the file:

d <- tibble(
  y = c(3, 5, 1, 7, 8)
)

What do you think this code does? Try typing ?tibble in the Console - what does this function do?

Calculate the mean of y. Do this two ways, using the summarize function and using the lm function.
Add a new variable called error to the data set d that is equal to y minus the mean of y.

Recap

When is the mean an appropriate summary measure to calculate?

What assumptions need to be true in order to use a mean to represent your single continuous variable?
