Multiple linear regression

Lucy D’Agostino McGowan

Simple linear regression

\[y = \beta_0 + \beta_1X + \epsilon\]

\[\epsilon\sim N(0,\sigma_\epsilon)\]

Multiple linear regression

\[y = \beta_0 + \beta_1X_1 + \beta_2X_2+\dots+\beta_kX_k+ \epsilon\]

\[\epsilon\sim N(0,\sigma_\epsilon)\]

Multiple linear regression

How are these coefficients estimated?

\[\hat{y} = \hat\beta_0 + \hat\beta_1X_1 + \hat\beta_2X_2+\dots+\hat\beta_kX_k\]

estimate coefficents that minimize the sum of squared residuals

Let’s do it in R!

Data: Porsche Price
Price, Mileage, Age

Let’s do it in R!

What is my response variable? What are my explanatory variables?

Data: Porsche Price
Price, Mileage, Age

Let’s do it in R!

data("PorschePrice")
lm(Price ~ Mileage + Age, data = PorschePrice)


Call:
lm(formula = Price ~ Mileage + Age, data = PorschePrice)

Coefficients:
(Intercept)      Mileage          Age  
    70.9192      -0.5613      -0.1302

Let’s do it in R!

What is different between this and the lm() functions we have been previously running?

data("PorschePrice")
lm(Price ~ Mileage + Age, data = PorschePrice)


Call:
lm(formula = Price ~ Mileage + Age, data = PorschePrice)

Coefficients:
(Intercept)      Mileage          Age  
    70.9192      -0.5613      -0.1302

Let’s do it in R!

What is different between this and the lm() functions we have been previously running?

model <- lm(Price ~ Mileage + Age, data = PorschePrice)
summary(model)


Call:
lm(formula = Price ~ Mileage + Age, data = PorschePrice)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.930  -3.795  -0.309   4.116  12.811 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  70.9192     2.4835  28.556  < 2e-16 ***
Mileage      -0.5613     0.1141  -4.921 3.76e-05 ***
Age          -0.1302     0.4568  -0.285    0.778    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.291 on 27 degrees of freedom
Multiple R-squared:  0.7951,    Adjusted R-squared:  0.7799 
F-statistic: 52.39 on 2 and 27 DF,  p-value: 5.073e-10

Let’s do it in R!

How would we get the predicted values for \(\hat{y}\)?

model <- lm(Price ~ Mileage + Age, data = PorschePrice)

PorschePrice <- PorschePrice |> 
  mutate(y_hat = fitted(model)) #<<

head(PorschePrice)

  Price Age Mileage    y_hat
1  69.4   3    21.5 58.45976
2  56.9   3    43.0 46.39104
3  49.9   2    19.9 59.48812
4  47.4   4    36.0 50.19016
5  42.9   4    44.0 45.69948
6  36.9   6    49.8 42.18328

Let’s do it in R!

The sample size is \(n = 30\), what would the degrees of freedom for the SSE be now?

\[\Large \sqrt{\frac{SSE}{??}}\]

Let’s do it in R!

The sample size is \(n = 30\), what would the degrees of freedom for the SSE be now?

\[\Large \sqrt\frac{SSE}{n - k - 1}\]

Let’s do it in R!

The sample size is \(n = 30\), what would the degrees of freedom for the SSE be now?

\[\Large\sqrt{ \frac{SSE}{30 - 2 - 1}}\]

Let’s do it in R!

lm(Price ~ Mileage + Age, data = PorschePrice) |> 
  anova()

Analysis of Variance Table

Response: Price
          Df Sum Sq Mean Sq  F value    Pr(>F)    
Mileage    1 5565.7  5565.7 104.7023 8.653e-11 ***
Age        1    4.3     4.3   0.0813    0.7778    
Residuals 27 1435.2    53.2                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Why might we want to do this?

Multiple linear regression for inference

Goal: Discover the relationship between a response (outcome, \(y\)), and an explanatory variable ( \(x\) )

Multiple linear regression for inference

Goal: Discover the relationship between a response (outcome, \(y\)), and an explanatory variable ( \(x\) ) adjusting for all known confounders

Multiple linear regression for inference

What is a confounder?

Goal: Discover the relationship between a response (outcome, \(y\)), and an explanatory variable ( \(x\) ) adjusting for all known confounders

confounder

A confounder is a variable that is associated with both the response variable ( \(y\) ) and the explanatory variable ( \(x\) ). If not accounted for, it can result in seeing a spurious relationship between \(x\) and \(y\).

Confounder example

Code

coords <- list(
  x = c(x = 1, y = 10, z = 5.5),
  y = c(x = 1, y = 1, z = 2)
)
dag <- dagify(y ~ x, coords = coords)
ggdag(dag) + theme_dag()

Confounder example

Code

dag <- dagify(y ~ x, y ~ z, x ~ z, coords = coords)
ggdag(dag) + theme_dag()

Confounder example

Code

dag <- dagify(y ~ z, x ~ z, coords = coords)
ggdag(dag) + theme_dag()

Confounder example

Code

dag <- dagify(y ~ x, coords = coords) |>
 tidy_dagitty() |>
 dag_label(labels = c("x" = "ice cream consumption", 
                      "y" = "crime rates",
                      "z" = "summer"))
  
ggdag(dag, text = FALSE, use_labels = "label") + theme_dag()

Confounder example

Code

dag <- dagify(y ~ z, x ~ z, coords = coords) |>
 tidy_dagitty() |>
 dag_label(labels = c("x" = "ice cream consumption", 
                      "y" = "crime rates",
                      "z" = "summer"))
  
ggdag(dag, text = FALSE, use_labels = "label") + theme_dag()

Confounding example

Armstrong, K.A. (2012). Methods in comparative effectiveness research. Journal of clinical oncology: official journal of the American Society of Clinical Oncology, 30 34, 4208-14.

A quick R aside

So far, the data we’ve been using has been included in an R package
To access this data we just run data("data set")
What if we want to read in other data, for example from a .csv file?

enter: read_csv()

read_csv() is a function from the readr package, which is included when you load the tidyverse

it works like this:

df <- read_csv("the-path-to-your-file.csv")

Where df can be whatever you’d like to call your new dataset

Berkley administration data

Study carried out by the graduate Division of the University of California, Berkeley in the early 70’s to evaluate whether there was a gender bias in graduate admissions
The data come from six departments. For confidentiality we’ll call them A-F.
We have information on whether the applicant was male or female and whether they were admitted or rejected.

First, we will evaluate whether the percentage of males admitted is indeed higher than females, overall. Next, we will calculate the same percentage for each department.

`Application Exercise`

Copy the following template into RStudio Pro:

https://github.com/sta-112-s24/appex-11.git

Run the code to calculate the proportion of each gender admitted and rejected.
Create a bar plot with gender on the x-axis, filled by the % admitted (or rejected)
Amend the code from #2 to calculate the proportion of each gender admitted and rejected by department
Create a bar plot of the percent admitted (and rejected) by gender faceted by department

05:00

Berkley admin data

What was our response variable?
What was our explanatory variable of interest?
What was our confounder?
What was our equation?

Simpson’s paradox

Code

set.seed(1)
data <- tibble(
  x = c(rnorm(25), rnorm(25, 2), rnorm(25, 4), rnorm(25, 6)),
  group = rep(1:4, each = 25),
  y = 5 + 2 * x - 10 * group + rnorm(100, 0, 5)
)
ggplot(data, aes(x, y)) +
  geom_point() + 
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE)

Simpson’s paradox

Code

set.seed(1)
data <- tibble(
  x = c(rnorm(25), rnorm(25, 2), rnorm(25, 4), rnorm(25, 6)),
  group = rep(1:4, each = 25),
  y = 5 + 2.5 * x - 10 * group + rnorm(100, 0, 5)
)
ggplot(data, aes(x, y, color = group)) +
  geom_point() + 
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, aes(group = group)) + 
  theme(legend.position = "none")

Porsche data

lm(Price ~ Mileage + Age, data = PorschePrice)


Call:
lm(formula = Price ~ Mileage + Age, data = PorschePrice)

Coefficients:
(Intercept)      Mileage          Age  
    70.9192      -0.5613      -0.1302

Porsche data

How do you calculate a t statistic for \(\hat{\beta}_2\)?

lm(Price ~ Mileage + Age, data = PorschePrice)


Call:
lm(formula = Price ~ Mileage + Age, data = PorschePrice)

Coefficients:
(Intercept)      Mileage          Age  
    70.9192      -0.5613      -0.1302

Porsche data

How do you calculate a t statistic for \(\hat{\beta}_2\)?

lm(Price ~ Mileage + Age, data = PorschePrice)


Call:
lm(formula = Price ~ Mileage + Age, data = PorschePrice)

Coefficients:
(Intercept)      Mileage          Age  
    70.9192      -0.5613      -0.1302

\(t = \frac{\hat\beta_2}{SE_{\hat\beta_2}}\)
\(t = \frac{\hat\beta_i}{SE_{\hat\beta_i}}\)

Porsche data

What is the null and alternative hypothesis?

lm(Price ~ Mileage + Age, data = PorschePrice)


Call:
lm(formula = Price ~ Mileage + Age, data = PorschePrice)

Coefficients:
(Intercept)      Mileage          Age  
    70.9192      -0.5613      -0.1302

\(\Large H_0: \beta_i = 0\)
\(\Large H_A: \beta_i \neq 0\)

Porsche data

What would the degrees of freedom be for the t-distribution used to calcualte a p-value?

lm(Price ~ Mileage + Age, data = PorschePrice)


Call:
lm(formula = Price ~ Mileage + Age, data = PorschePrice)

Coefficients:
(Intercept)      Mileage          Age  
    70.9192      -0.5613      -0.1302

\(n - k - 1\) = 27

What is that definition of a p-value again?

What about the definition of a confidence interval?

Porche data

How would you calculate a confidence interval for \(\beta_i\)?

model <- lm(Price ~ Mileage + Age, data = PorschePrice)
confint(model)

                 2.5 %     97.5 %
(Intercept) 65.8234089 76.0149140
Mileage     -0.7953816 -0.3272903
Age         -1.0675929  0.8071415

\(\hat\beta_i\pm t^*SE_{\hat\beta_i}\)
\(t^*\) is the critical value from a \(t\) density with degrees of freedom equal to the error df in the model \(n-k-1\), where \(k\) is the number of predictors

Multiple linear regression for prediction

Goal: Discover the best model for predicting a response variable (an outcome variable, \(y\)) using predictors, \(\mathbf{X}\)
Ultimately, we are often comparing models

🛠 toolkit for comparing models

👉 F-test

. . .

👉 \(\Large R^2\)

🛠 F-test for Multiple Linear Regression

Comparing the full model to the intercept only model
\(H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0\)
\(H_A: \textrm{at least one } \beta_i \neq 0\)
We will soon learn a more general version of comparing nested models

🛠 F-test for Multiple Linear Regression

\(F = \frac{MSModel}{MSE}\)
df for the Model?
- k
df for the errors?
- n - k - 1

🛠 \(R^2\) for Multiple Linear Regression

\(R^2= \frac{SSModel}{SSTotal}\)
\(R^2= 1 - \frac{SSE}{SSTotal}\)
As is, if you add a predictor this will always increase. Therefore, we have \(R^2_{adj}\) that has a small “penalty” for adding more predictors

🛠 Adjusted \(R^2\)

\(R^2_{adj} = 1 - \frac{SSE/(n-k-1)}{SSTotal / (n-1)}\)
\(\frac{SSTotal}{n-1} = \frac{\sum(y - \bar{y})^2}{n-1}\) What is this?
- Sample variance! \(S_Y^2\)
\(R^2_{adj} = 1 - \frac{\hat\sigma^2_\epsilon}{S_Y^2}\)

🛠 \(R^2_{adj}\)

\(R^2_{adj} = 1 - \frac{SSE/(n-k-1)}{SSTotal / (n-1)}\)
The denominator stays the same for all models fit to the same response variable and data
the numerator actually increase when a new predictor is added to a model if the decrease in the SSE is not sufficient to offset the decrease in the error degrees of freedom.
So \(R^2_{adj}\) can 👇 when a weak predictor is added to a model

`Application Exercise`

Open appex-11.qmd
Using the NFL2007Standings data create a model that predicts WinPct from PointsFor. Examine the \(R^2\) and \(R^2_{adj}\) values.
Using the NFL2007Standings data create a model that predicts WinPct from PointsFor AND PointsAgainst. Examine the \(R^2\) and \(R^2_{adj}\) values.
Which model do you think is better for predicting win percent?

05:00