Partitioning Variability

Lucy D’Agostino McGowan

Partitioning variability

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data |>
  summarise(
    sstotal = 
      sum((______ - ______)^2)
    )

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data |>
  summarise(
    sstotal = 
      sum((frequency_score - mean(frequency_score))^2)
    )
# A tibble: 1 × 1
  sstotal
    <dbl>
1  46372.

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data |>
  summarise(
    sstotal = 
      var(frequency_score) * (n() - 1) # the sample variance is SSTotal / (n - 1)
    )
# A tibble: 1 × 1
  sstotal
    <dbl>
1  46372.

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(frequency_score ~ group, data = data)


data |>
  summarise(
    sse = 
      sum((______ - _______)^2)
    )

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(frequency_score ~ group, data = data)


data |>
  summarise(
    sse = 
      sum((frequency_score - fitted(mod))^2)
    )
# A tibble: 1 × 1
     sse
   <dbl>
1 45732.

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(frequency_score ~ group, data = data)


data |>
  summarise(
    sse = 
      sum(residuals(mod)^2)
    )
# A tibble: 1 × 1
     sse
   <dbl>
1 45732.

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(frequency_score ~ group, data = data)


data |>
  summarise(
    sse = 
      sigma(mod)^2 * (n() - 2) # sigma(mod)^2 is the MSE, so MSE * (n - 2) recovers SSE
    )
# A tibble: 1 × 1
     sse
   <dbl>
1 45732.

Variation explained by the model

\[SSModel = \sum (\hat{y}-\bar{y})^2\]

data |>
  summarise(
    ssmodel = 
      sum((______ - ______)^2)
    )

Variation explained by the model

\[SSModel = \sum (\hat{y}-\bar{y})^2\]

data |>
  summarise(
    ssmodel = 
      sum((fitted(mod) - mean(frequency_score))^2)
    )
# A tibble: 1 × 1
  ssmodel
    <dbl>
1     640

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2)
    )
# A tibble: 1 × 3
  sstotal ssmodel    sse
    <dbl>   <dbl>  <dbl>
1  46372.     640 45732.

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse
    )


What will this be?

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse
    )
# A tibble: 1 × 4
  sstotal ssmodel    sse `ssmodel + sse`
    <dbl>   <dbl>  <dbl>           <dbl>
1  46372.     640 45732.          46372.

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel
    )


What will this be?

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel
    )
# A tibble: 1 × 5
  sstotal ssmodel    sse `ssmodel + sse` `sstotal - ssmodel`
    <dbl>   <dbl>  <dbl>           <dbl>               <dbl>
1  46372.     640 45732.          46372.              45732.

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel,
    sstotal - sse
    )


What will this be?

Partitioning variability

data |>
  summarise(
    sstotal = sum((frequency_score - mean(frequency_score))^2),
    ssmodel = sum((fitted(mod) - mean(frequency_score))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel,
    sstotal - sse
    )
# A tibble: 1 × 6
  sstotal ssmodel    sse `ssmodel + sse` `sstotal - ssmodel` `sstotal - sse`
    <dbl>   <dbl>  <dbl>           <dbl>               <dbl>           <dbl>
1  46372.     640 45732.          46372.              45732.            640.
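
As an aside, this decomposition is where R-squared comes from: the proportion of total variation explained by the model, SSModel / SSTotal. A quick check, assuming the mod and data objects from above:

640 / 46372            # ~0.0138
summary(mod)$r.squared # matches the Multiple R-squared from summary(mod)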

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^n (y - \bar{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} (y - \bar{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \require{color}\colorbox{#86a293}{$\bar{y}$})^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

How many degrees of freedom?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

\[\Large df_{SSTOTAL}=n-1\]

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - \hat{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} (y - \hat{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - \hat{y})^2\]

How is \(\hat{y}\) estimated with simple linear regression?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How is \(\hat{y}\) estimated with simple linear regression?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\require{color}\colorbox{#86a293}{$\hat{\beta}_0$}+\colorbox{#86a293}{$\hat{\beta}_1$}x))^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How many degrees of freedom?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

\[\Large df_{SSE} = n - 2\]

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = SSModel + SSE\]

\[df_{SSTotal} = df_{SSModel} + df_{SSE} \]

\[n - 1 = df_{SSModel} + (n - 2)\]

Application Exercise

How many degrees of freedom does SSModel have?

\[n - 1 = df_{SSModel} + (n - 2)\]

01:00

Mean squares

\[MSE = \frac{SSE}{n - 2}\]

\[MSModel = \frac{SSModel}{1}\]

What is the pattern?

\[\Large F = \frac{MSModel}{MSE}\]
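
A minimal sketch putting these pieces together, assuming the mod and data objects from the earlier slides:

n <- nrow(data)
ssmodel <- sum((fitted(mod) - mean(data$frequency_score))^2)
sse <- sum(residuals(mod)^2)

msmodel <- ssmodel / 1 # df_SSModel = 1
mse <- sse / (n - 2)   # df_SSE = n - 2
msmodel / mse          # the F statistic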

F-distribution

Under the null hypothesis

# draw 10,000 values from the F distribution with df1 = 1, df2 = 38:
# the null distribution of our F statistic
f <- data.frame(
  stat = rf(n = 10000, df1 = 1, df2 = 38)
)

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the SSModel?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the MSModel?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the SSE?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the MSE?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the SSTotal?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

What is the F statistic?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(frequency_score ~ group, data = data)
anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               

Is the F-statistic statistically significant?

p-value

The probability of getting a statistic as extreme as or more extreme than the observed test statistic, given that the null hypothesis is true

F-Distribution

Under the null hypothesis

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = ?\)

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)
  • \(df_{SSE} = ?\)

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)
  • \(df_{SSE} = n - 2 = 38\)

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)
  • \(df_{SSE} = n - 2 = 38\)
  • \(df_{SSModel} = ?\)

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)
  • \(df_{SSE} = n - 2 = 38\)
  • \(df_{SSModel} = 39 - 38 = 1\) (confirmed in R below)
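
To confirm these in R (assuming mod from earlier), the Df column of the ANOVA table holds both:

anova(mod)$Df # 1 for the model (df1) and 38 for the residuals (df2)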

Example

To calculate the p-value under the t-distribution we use pt(). What do you think we use to calculate the p-value under the F-distribution?

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
  • pf()
  • it takes 3 arguments: q, df1, and df2. What do you think we would plug in for q?

Degrees of freedom

  • \(n = 40\)
  • \(df_{SSTotal} = 39\)
  • \(df_{SSE} = n - 2 = 38\) df2
  • \(df_{SSModel} = 39 - 38 = 1\) df1

Example

To calculate the p-value under the t-distribution we use pt(). What do you think we use to calculate the p-value under the F-distribution?

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
pf(0.5318, 1, 38, lower.tail = FALSE)
[1] 0.4703223

Example

Why don’t we multiply this p-value by 2 when we use pf()?

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
pf(0.5318, 1, 38, lower.tail = FALSE)
[1] 0.4703223

F-Distribution

Under the null hypothesis

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")

F-Distribution

Under the null hypothesis

f$shaded <- f$stat > 0.5318 # flag simulated values at least as extreme as the observed F

ggplot(f) + 
  geom_histogram(aes(stat, fill = shaded), bins = 40) + 
  geom_vline(xintercept = 0.5318, lwd = 1.5) +
  labs(x = "F Statistic") +
  theme(legend.position = "none")
  • We observed an F-statistic of 0.5318
  • Are there any negative values in an F-distribution?

F-Distribution

Under the null hypothesis

f$shaded <- f$stat > 0.5318 # flag simulated values at least as extreme as the observed F

ggplot(f) + 
  geom_histogram(aes(stat, fill = shaded), bins = 40) + 
  geom_vline(xintercept = 0.5318, lwd = 1.5) +
  labs(x = "F Statistic") +
  theme(legend.position = "none")
  • The p-value counts values “as extreme or more extreme” than the one observed. In the t-distribution, “more extreme” values, defined as farther from 0, can be positive or negative. Not so for the F, as the quick check below shows!
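
A quick check of that point, using the simulated f data frame from earlier: an F statistic is a ratio of mean squares, each built from squared terms, so it can never be negative.

min(f$stat) # the smallest of the 10,000 simulated F statistics is still positive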

Example

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
summary(mod)

Call:
lm(formula = frequency_score ~ group, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-38.80 -22.55 -11.80  30.20  95.20 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   39.800      7.757   5.131 8.82e-06 ***
groupsquare   -8.000     10.970  -0.729     0.47    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.69 on 38 degrees of freedom
Multiple R-squared:  0.0138,    Adjusted R-squared:  -0.01215 
F-statistic: 0.5318 on 1 and 38 DF,  p-value: 0.4703
  • Notice the p-value for the F-test is the same as the p-value for the \(\hat\beta_1\) t-test
  • This is always true for simple linear regression (with just one \(x\) variable)

What is the F-test testing?

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
  • null hypothesis: the fit of the intercept-only model (with \(\hat\beta_0\) only) and your model (\(\hat\beta_0 + \hat\beta_1x\)) are equivalent
  • alternative hypothesis: the fit of the intercept-only model is significantly worse compared to your model
  • When we only have one variable in our model, \(x\), the p-values from the F and t tests are going to be equivalent; you can make this comparison explicit, as sketched below
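
A sketch of that comparison made explicit, assuming the mod and data objects from earlier: fit the intercept-only model yourself and hand both models to anova(), which runs the same F-test.

mod0 <- lm(frequency_score ~ 1, data = data) # the null, intercept-only model
anova(mod0, mod) # reports the same F statistic (0.5318) and p-value (0.4703)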

Relating the F and the t

anova(mod)
Analysis of Variance Table

Response: frequency_score
          Df Sum Sq Mean Sq F value Pr(>F)
group      1    640   640.0  0.5318 0.4703
Residuals 38  45732  1203.5               
summary(mod)

Call:
lm(formula = frequency_score ~ group, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-38.80 -22.55 -11.80  30.20  95.20 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   39.800      7.757   5.131 8.82e-06 ***
groupsquare   -8.000     10.970  -0.729     0.47    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.69 on 38 degrees of freedom
Multiple R-squared:  0.0138,    Adjusted R-squared:  -0.01215 
F-statistic: 0.5318 on 1 and 38 DF,  p-value: 0.4703
(-0.729)^2
[1] 0.531441
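
The hand calculation gives 0.531441 rather than 0.5318 only because the printed t value is rounded. Pulling the unrounded statistics from the fitted objects (assuming mod from above) shows the match exactly:

t_val <- summary(mod)$coefficients["groupsquare", "t value"]
t_val^2                 # the squared t statistic...
anova(mod)$`F value`[1] # ...equals the F statistic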

Application Exercise

  1. Open appex-06.qmd
  2. Using your data, predict frequency score from group
  3. What are the degrees of freedom for the Sum of Squares Total, the Sum of Squares Model, and the Sum of Squares Error?
  4. Calculate the following quantities: Sum of Squares Total, Sum of Squares Model, Sum of Squares Error
  5. Calculate the F-statistic for the model and the p-value
  6. What is the null hypothesis? What is the alternative?
06:00