Partitioning Variability

Lucy D’Agostino McGowan

Partitioning variability

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data |>
  summarise(
    sstotal = 
      sum((______ - ______)^2)
    )

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data |>
  summarise(
    sstotal = 
      sum((frequency_score - mean(frequency_score))^2)
    )
# A tibble: 1 × 1
  sstotal
    <dbl>
1  46372.

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data |>
  summarise(
    sstotal = 
      var(frequency_score) * (n() - 1) # the sample variance is SSTotal / (n - 1)
    )
# A tibble: 1 × 1
  sstotal
    <dbl>
1  46372.

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(frequency_score ~ group, data = data)


data |>
  summarise(
    sse = 
      sum((______ - _______)^2)
    )

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(frequency_score ~ group, data = data)


data |>
  summarise(
    sse = 
      sum((frequency_score - fitted(mod))^2)
    )
# A tibble: 1 × 1
     sse
   <dbl>
1 45732.

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(frequency_score ~ group, data = data)


data |>
  summarise(
    sse = 
      sum(residuals(mod)^2)
    )
# A tibble: 1 × 1
     sse
   <dbl>
1 45732.

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(frequency_score ~ group, data = data)


data |>
  summarise(
    sse = 
      sigma(mod)^2 * (n() - 2) # sigma(mod)^2 is the MSE, so MSE * (n - 2) recovers SSE
    )
# A tibble: 1 × 1
     sse
   <dbl>
1 45732.

Variation explained by the model

\[SSModel = \sum (\hat{y}-\bar{y})^2\]

data |>
  summarise(
    ssmodel = 
      sum((______ - ______)^2)
    )

Variation explained by the model

\[SSModel = \sum (\hat{y}-\bar{y})^2\]

data |>
  summarise(
    ssmodel = 
      sum((fitted(mod) - mean(frequency_score))^2)
    )
# A tibble: 1 × 1
  ssmodel
    <dbl>
1     640

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2)
    )
# A tibble: 1 × 3
  sstotal ssmodel    sse
    <dbl>   <dbl>  <dbl>
1  46372.     640 45732.

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse
    )


What will this be?

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse
    )
# A tibble: 1 × 4
  sstotal ssmodel    sse `ssmodel + sse`
    <dbl>   <dbl>  <dbl>           <dbl>
1  46372.     640 45732.          46372.

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel
    )


What will this be?

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel
    )
# A tibble: 1 × 5
  sstotal ssmodel    sse `ssmodel + sse` `sstotal - ssmodel`
    <dbl>   <dbl>  <dbl>           <dbl>               <dbl>
1  46372.     640 45732.          46372.              45732.

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel,
    sstotal - sse
    )


What will this be?

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel,
    sstotal - sse
    )
# A tibble: 1 × 6
  sstotal ssmodel    sse `ssmodel + sse` `sstotal - ssmodel` `sstotal - sse`
    <dbl>   <dbl>  <dbl>           <dbl>               <dbl>           <dbl>
1  46372.     640 45732.          46372.              45732.            640.
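
As an aside, this decomposition is where R-squared comes from: the proportion of total variation explained by the model, SSModel / SSTotal. A quick check, assuming the mod and data objects from above:

640 / 46372            # ~0.0138
summary(mod)$r.squared # matches the Multiple R-squared from summary(mod)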

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^n (y - \bar{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} (y - \bar{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \require{color}\colorbox{#86a293}{$\bar{y}$})^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

How many degrees of freedom?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

\[\Large df_{SSTOTAL}=n-1\]

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - \hat{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} (y - \hat{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - \hat{y})^2\]

How is \(\hat{y}\) estimated with simple linear regression?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How is \(\hat{y}\) estimated with simple linear regression?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\require{color}\colorbox{#86a293}{$\hat{\beta}_0$}+\colorbox{#86a293}{$\hat{\beta}_1$}x))^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How many degrees of freedom?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

\[\Large df_{SSE} = n - 2\]

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = SSModel + SSE\]

\[df_{SSTotal} = df_{SSModel} + df_{SSE} \]

\[n - 1 = df_{SSModel} + (n - 2)\]

Application Exercise

How many degrees of freedom does SSModel have?

\[n - 1 = df_{SSModel} + (n - 2)\]

01:00

Mean squares

\[MSE = \frac{SSE}{n - 2}\]

\[MSModel = \frac{SSModel}{1}\]

What is the pattern?

\[\Large F = \frac{MSModel}{MSE}\]
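
A minimal sketch putting these pieces together, assuming the mod and data objects from the earlier slides:

n <- nrow(data)
ssmodel <- sum((fitted(mod) - mean(data$frequency_score))^2)
sse <- sum(residuals(mod)^2)

msmodel <- ssmodel / 1 # df_SSModel = 1
mse <- sse / (n - 2)   # df_SSE = n - 2
msmodel / mse          # the F statistic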

F-distribution

Under the null hypothesis

# draw 10,000 values from the F distribution with df1 = 1, df2 = 38:
# the null distribution of our F statistic
f <- data.frame(
  stat = rf(n = 10000, df1 = 1, df2 = 38)
)

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the SSModel?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the MSModel?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the SSE?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the MSE?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the SSTotal?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the F statistic?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

Is the F-statistic statistically significant?

p-value

The probability of getting a statistic as extreme as or more extreme than the observed test statistic, given that the null hypothesis is true

F-Distribution

Under the null hypothesis

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = ?\)

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)
  • \(df_{SSE} = ?\)

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)
  • \(df_{SSE} = n - 2 = 38\)

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)
  • \(df_{SSE} = n - 2 = 38\)
  • \(df_{SSModel} = ?\)

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)
  • \(df_{SSE} = n - 2 = 38\)
  • \(df_{SSModel} = 39 - 38 = 1\) (confirmed in R below)
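
To confirm these in R (assuming mod from earlier), the Df column of the ANOVA table holds both:

anova(mod)$Df # 1 for the model (df1) and 38 for the residuals (df2)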

Example

To calculate the p-value under the t-distribution we use pt(). What do you think we use to calculate the p-value under the F-distribution?

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
  • pf()
  • it takes 3 arguments: q, df1, and df2. What do you think we would plug in for q?

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)
  • \(df_{SSE} = n - 2 = 38\) df2
  • \(df_{SSModel} = 39 - 38 = 1\) df1

Example

To calculate the p-value under the t-distribution we use pt(). What do you think we use to calculate the p-value under the F-distribution?

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
pf(0.5318, 1, 38, lower.tail = FALSE)
[1] 0.4703223

Example

Why don’t we multiply this p-value by 2 when we use pf()?

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
pf(0.5318, 1, 38, lower.tail = FALSE)
[1] 0.4703223

F-Distribution

Under the null hypothesis

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")

F-Distribution

Under the null hypothesis

f$shaded <- f$stat > 0.5318 # flag simulated values at least as extreme as the observed F

ggplot(f) + 
  geom_histogram(aes(stat, fill = shaded), bins = 40) + 
  geom_vline(xintercept = 0.5318, lwd = 1.5) +
  labs(x = "F Statistic") +
  theme(legend.position = "none")
  • We observed an F-statistic of 0.5318
  • Are there any negative values in an F-distribution?

F-Distribution

Under the null hypothesis

f$shaded <- f$stat > 0.5318 # flag simulated values at least as extreme as the observed F

ggplot(f) + 
  geom_histogram(aes(stat, fill = shaded), bins = 40) + 
  geom_vline(xintercept = 0.5318, lwd = 1.5) +
  labs(x = "F Statistic") +
  theme(legend.position = "none")
  • The p-value counts values “as extreme or more extreme” than the one observed. In the t-distribution, “more extreme” values, defined as farther from 0, can be positive or negative. Not so for the F, as the quick check below shows!
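
A quick check of that point, using the simulated f data frame from earlier: an F statistic is a ratio of mean squares, each built from squared terms, so it can never be negative.

min(f$stat) # the smallest of the 10,000 simulated F statistics is still positive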

Example

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
summary(mod)

Call:
lm(formula = frequency_score ~ group, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-38.80 -22.55 -11.80  30.20  95.20 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   39.800      7.757   5.131 8.82e-06 ***
groupsquare   -8.000     10.970  -0.729     0.47    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.69 on 38 degrees of freedom
Multiple R-squared:  0.0138,    Adjusted R-squared:  -0.01215 
F-statistic: 0.5318 on 1 and 38 DF,  p-value: 0.4703
  • Notice the p-value for the F-test is the same as the p-value for the \(\hat\beta_1\) t-test
  • This is always true for simple linear regression (with just one \(x\) variable)

What is the F-test testing?

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
  • null hypothesis: the fit of the intercept-only model (with \(\hat\beta_0\) only) and your model (\(\hat\beta_0 + \hat\beta_1x\)) are equivalent
  • alternative hypothesis: the fit of the intercept-only model is significantly worse compared to your model
  • When we only have one variable in our model, \(x\), the p-values from the F and t tests are going to be equivalent; you can make this comparison explicit, as sketched below
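
A sketch of that comparison made explicit, assuming the mod and data objects from earlier: fit the intercept-only model yourself and hand both models to anova(), which runs the same F-test.

mod0 <- lm(frequency_score ~ 1, data = data) # the null, intercept-only model
anova(mod0, mod) # reports the same F statistic (0.5318) and p-value (0.4703)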

Relating the F and the t

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
summary(mod)

Call:
lm(formula = frequency_score ~ group, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-38.80 -22.55 -11.80  30.20  95.20 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   39.800      7.757   5.131 8.82e-06 ***
groupsquare   -8.000     10.970  -0.729     0.47    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.69 on 38 degrees of freedom
Multiple R-squared:  0.0138,    Adjusted R-squared:  -0.01215 
F-statistic: 0.5318 on 1 and 38 DF,  p-value: 0.4703
(-0.729)^2
[1] 0.531441
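
The hand calculation gives 0.531441 rather than 0.5318 only because the printed t value is rounded. Pulling the unrounded statistics from the fitted objects (assuming mod from above) shows the match exactly:

t_val <- summary(mod)$coefficients["groupsquare", "t value"]
t_val^2                 # the squared t statistic...
anova(mod)$`F value`[1] # ...equals the F statistic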

Application Exercise

  1. Open appex-06.qmd
  2. Using your data, predict frequency score from group
  3. What are the degrees of freedom for the Sum of Squares Total, the Sum of Squares Model, and the Sum of Squares Error?
  4. Calculate the following quantities: Sum of Squares Total, Sum of Squares Model, Sum of Squares Error
  5. Calculate the F-statistic for the model and the p-value
  6. What is the null hypothesis? What is the alternative?
06:00