Simple Linear Regression Part 2

Lucy D’Agostino McGowan

Data

library(tidyverse)

starwars_nojabba <- starwars |>
  drop_na(mass, height) |>
  filter(mass < 1000)

Simple linear regression

mod <- lm(mass ~ height, data = starwars_nojabba)


starwars_nojabba |>
  mutate(y_hat = fitted(mod)) |>
  select(y_hat, mass, height)
# A tibble: 58 × 3
   y_hat  mass height
   <dbl> <dbl>  <int>
 1  74.1    77    172
 2  71.1    75    167
 3  27.6    32     96
 4  92.5   136    202
 5  60.7    49    150
 6  77.8   120    178
 7  69.9    75    165
 8  28.2    32     97
 9  80.9    84    183
10  80.3    77    182
# ℹ 48 more rows

Where did these \(\hat{y}\) values come from?

Fitted values

lm(mass ~ height, 
   data = starwars_nojabba)

Call:
lm(formula = mass ~ height, data = starwars_nojabba)

Coefficients:
(Intercept)       height  
   -31.2505       0.6127  
starwars_nojabba |>
  mutate(y_hat = fitted(mod)) |>
  select(y_hat, mass, height) |>
  slice(1:3)
# A tibble: 3 × 3
  y_hat  mass height
  <dbl> <dbl>  <int>
1  74.1    77    172
2  71.1    75    167
3  27.6    32     96

\(\hat{y}_1 = -31.3 + 0.62 \times 172\)

. . .

\(\hat{y}_2 = -31.3 + 0.62 \times 167\)

. . .

\(\hat{y}_3 = -31.3 + 0.62 \times 96\)

Simple linear regression

How did we decide on this line?

Code
starwars_nojabba <- starwars_nojabba |>
  mutate(fitted = fitted(lm(mass ~ height, data = starwars_nojabba)))

ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_point(color = "#86a293") +
  geom_segment(aes(
    x = height,
    y = mass,
    xend = height,
    yend = fitted
  ),
  color = "blue") +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

Minimize Least Squares

Code
ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_rect(
    aes(
      xmin = height,
      xmax = height + mass - fitted,
      ymin = mass,
      ymax = fitted
    ),
    fill = "blue",
    color = "blue",
    alpha = 0.2
  ) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  geom_point(color = "#86a293") +
  coord_fixed() +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

“Squared Residuals”

Code
ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_rect(
    aes(
      xmin = height,
      xmax = height + mass - fitted,
      ymin = mass,
      ymax = fitted
    ),
    fill = "blue",
    color = "blue",
    alpha = 0.2
  ) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  geom_point(color = "#86a293") +
  coord_fixed() +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

“Residuals”

Code
ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_point(color = "#86a293") +
  geom_segment(aes(
    x = height,
    y = mass,
    xend = height,
    yend = fitted
  ),
  color = "blue") +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

Let’s pause for definitions

Definitions

  • residual \((e)\)
  • squared residual \((e^2)\)
  • sum of squared residuals (SSE)
  • n
  • degrees of freedom
  • standard deviation of the errors \((\sigma_\varepsilon)\)

  • \(y - \hat{y}\)
  • \((y-\hat{y})^2\)
  • \(\sum(y-\hat{y})^2\)
  • sample size
  • \(n - p\) for simple linear regression: \(n - 2\)
  • estimated by \(\hat{\sigma}_\varepsilon=\sqrt{\frac{\textrm{SSE}}{df}}\)

A note about notation

\[\Large \sum(y-\hat{y})^2\]

\[\Large \sum_{i=1}^n(y_i - \hat{y}_i)^2\]

A note about notation

the \(i\) indicates a single individual

\[\Large e_i = y_i - \hat{y}_i\]

A note about notation

for the 1st observation, \(i = 1\)

\[\Large e_1 = y_1 - \hat{y}_1\]

Application Exercise

  1. Go to lucy.shinyapps.io/least-squares/
  2. This shows a scatter plot of 10 data points with a line estimating the relationship between x and y. Drag the blue points to change the line.
  3. See if you can find a line that minimizes the sum of square errors
03:00

Let’s do this in R!

Calculate the residual

mod <- lm(mass ~ height,
          data = starwars_nojabba)

starwars_nojabba |>
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat) |>
  select(mass, height, y_hat, residual)
# A tibble: 58 × 4
    mass height y_hat residual
   <dbl>  <int> <dbl>    <dbl>
 1    77    172  74.1     2.86
 2    75    167  71.1     3.92
 3    32     96  27.6     4.43
 4   136    202  92.5    43.5 
 5    49    150  60.7   -11.7 
 6   120    178  77.8    42.2 
 7    75    165  69.9     5.15
 8    32     97  28.2     3.82
 9    84    183  80.9     3.12
10    77    182  80.3    -3.27
# ℹ 48 more rows

Calculate the residual squared

How could I add the residual squared to this data frame?

mod <- lm(mass ~ height,
          data = starwars_nojabba)

starwars_nojabba |>
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat) |>
  select(mass, height, y_hat, residual)
# A tibble: 58 × 4
    mass height y_hat residual
   <dbl>  <int> <dbl>    <dbl>
 1    77    172  74.1     2.86
 2    75    167  71.1     3.92
 3    32     96  27.6     4.43
 4   136    202  92.5    43.5 
 5    49    150  60.7   -11.7 
 6   120    178  77.8    42.2 
 7    75    165  69.9     5.15
 8    32     97  28.2     3.82
 9    84    183  80.9     3.12
10    77    182  80.3    -3.27
# ℹ 48 more rows

Calculate the residual squared

How could I add the residual squared to this data frame?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba |>
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) |>
  select(mass, height, y_hat, residual_2)
# A tibble: 58 × 4
    mass height y_hat residual_2
   <dbl>  <int> <dbl>      <dbl>
 1    77    172  74.1       8.18
 2    75    167  71.1      15.4 
 3    32     96  27.6      19.6 
 4   136    202  92.5    1890.  
 5    49    150  60.7     136.  
 6   120    178  77.8    1780.  
 7    75    165  69.9      26.5 
 8    32     97  28.2      14.6 
 9    84    183  80.9       9.74
10    77    182  80.3      10.7 
# ℹ 48 more rows

Calculate the SSE

How can I summarize this dataset to calculate the sum of the squared residuals?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba |>
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2)  |>
  select(mass, height, y_hat, residual_2)
# A tibble: 58 × 4
    mass height y_hat residual_2
   <dbl>  <int> <dbl>      <dbl>
 1    77    172  74.1       8.18
 2    75    167  71.1      15.4 
 3    32     96  27.6      19.6 
 4   136    202  92.5    1890.  
 5    49    150  60.7     136.  
 6   120    178  77.8    1780.  
 7    75    165  69.9      26.5 
 8    32     97  28.2      14.6 
 9    84    183  80.9       9.74
10    77    182  80.3      10.7 
# ℹ 48 more rows

Calculate the SSE

How can I summarize this dataset to calculate the sum of the squared residuals?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba |>
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) |>
  summarize(
    sse = sum(residual_2)
    )
# A tibble: 1 × 1
     sse
   <dbl>
1 21276.

Calculate the sample size

How can I add the total sample size?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba |>
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) |>
  summarize(
    sse = sum(residual_2)
    )
# A tibble: 1 × 1
     sse
   <dbl>
1 21276.

Calculate the sample size

How can I add the total sample size?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba |>
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) |>
  summarize(
    sse = sum(residual_2),
    n = n()
    )
# A tibble: 1 × 2
     sse     n
   <dbl> <int>
1 21276.    58

Calculate the degrees of freedom

How can I add the degrees of freedom \((n-p)\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba |>
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) |>
  summarize(
    sse = sum(residual_2),
    n = n()
    )
# A tibble: 1 × 2
     sse     n
   <dbl> <int>
1 21276.    58

Calculate the degrees of freedom

How can I add the degrees of freedom \((n-p)\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba |>
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) |>
  summarize(
    sse = sum(residual_2),
    n = n(),
    df = n - 2
    )
# A tibble: 1 × 3
     sse     n    df
   <dbl> <int> <dbl>
1 21276.    58    56

Calculate the residual standard error

How can I add the total \(\hat{\sigma}_\varepsilon= \sqrt{\frac{\textrm{SSE}}{df}}\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba |>
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) |>
  summarize(
    sse = sum(residual_2),
    n = n(),
    df = n - 2
    )
# A tibble: 1 × 3
     sse     n    df
   <dbl> <int> <dbl>
1 21276.    58    56

Calculate the residual standard error

How can I add the total \(\hat{\sigma}_\varepsilon= \sqrt{\frac{\textrm{SSE}}{df}}\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba |>
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) |>
  summarize(
    sse = sum(residual_2),
    n = n(),
    df = n - 2,
    sigma = sqrt(sse / df)
    )
# A tibble: 1 × 4
     sse     n    df sigma
   <dbl> <int> <dbl> <dbl>
1 21276.    58    56  19.5

Find residual standard error in lm output

mod <- lm(mass ~ height, data = starwars_nojabba)

summary(mod)

Call:
lm(formula = mass ~ height, data = starwars_nojabba)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.006  -7.804   0.508   4.007  57.901 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -31.25047   12.81488  -2.439   0.0179 *  
height        0.61273    0.07202   8.508 1.14e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.49 on 56 degrees of freedom
Multiple R-squared:  0.5638,    Adjusted R-squared:  0.556 
F-statistic: 72.38 on 1 and 56 DF,  p-value: 1.138e-11

Application Exercise

  1. Create a new project from this template in RStudio Pro:
https://github.com/sta-112-s24/appex-07.git
  1. Load the packages and data by running the top chunk of R code
  2. Learn about the PorschePrice data by running ?PorschePrice in your Console
  3. Fit a linear model predicting Price from Mileage
  4. Add a variable called y_hat to the PorschePrice dataset with the predicted y values
  5. Add a variable called residual to the PorschePrice dataset with the residuals
  6. Upload to Canvas
07:00