Selecting predictors

Lucy D’Agostino McGowan

Selecting variables

When the goal is inference, how do we choose what to include in our model?

  • If we are trying to disentangle “correlation” from “causation” we need to understand the theoretical relationship between the explanatory variable we are interested in, the outcome variable we are interested in, and any confounders.
  • There is no statistical test we can run to check whether we have included all relevant variables
  • Using statistical hypothesis tests to decide what to include in our model can lead to confidence intervals that are too narrow

Selecting variables

  • When the goal is prediction, we can use the metrics we learned about last class to compare models
  • But how do we choose what to put in those models to begin with?
  • What if we pick our models based on what is “statistically significant” (i.e., only include variables where p < 0.05)?

Application Exercise

  1. Get into groups of ~3 and copy the template for your group into RStudio Pro, replacing group-n with your group number (i.e., group 1 would be group-1):
https://github.com/sta-112-s24/appex-14-group-n.git
  2. Run the code in the top R chunk to sample your data

  3. Run the code to fit a linear model predicting y using all of the other predictors. Hint: if you want to use every variable in your dataset as a predictor, you can use the shorthand y ~ . for your formula (a minimal sketch follows this list)

  4. If you were only selecting variables based on statistical significance, what would you include?
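
For step 3, a minimal sketch of what that shorthand looks like; sampled_data is a hypothetical name standing in for whatever the top chunk of your template calls the data frame:

Code
# `y ~ .` regresses y on every other column in the data frame,
# so you do not have to type out each predictor by name
# (sampled_data is a placeholder; use your own data frame's name)
mod_all <- lm(y ~ ., data = sampled_data)
summary(mod_all)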


It’s not just you

Code
library(leaps)
library(tidyverse)  # for tibble(), map_df(), and ggplot2 used below
n <- 10000
sim <- function(i) {
  set.seed(i)
  data <- tibble(
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = rnorm(n),
    x4 = rnorm(n),
    x5 = rnorm(n),
    x6 = rnorm(n),
    x7 = rnorm(n),
    x8 = rnorm(n),
    x9 = rnorm(n),
    x10 = rnorm(n),
    x11 = rnorm(n),
    x12 = rnorm(n),
    x13 = rnorm(n),
    x14 = rnorm(n),
    x15 = rnorm(n),
    x16 = rnorm(n),
    x17 = rnorm(n),
    x18 = rnorm(n),
    x19 = rnorm(n),
    x20 = rnorm(n),
    y = 0.2 * x1 + 0.2 * x2 + 0.2 * x3 + rnorm(n)
  )
  
  mod1 <- lm(y ~ ., data)
  mod2 <- regsubsets(y ~ ., data)
  significant_vars_mod1 <- summary(mod1)$coefficients[,4] < 0.05
  
  mod2_summary <- summary(mod2, all.best = TRUE)
  best_model <- which.min(mod2_summary$bic)
  best_subset_vars <- mod2_summary$which[best_model,][-1]  # Exclude intercept
  
  var_names <- names(significant_vars_mod1)[-1]  # Exclude intercept
  data_frame_result <- tibble(
    variable = var_names,
    significant_in_mod1 = significant_vars_mod1[-1],  # Exclude intercept
    in_best_subset_mod2 = best_subset_vars,
    seed = i
  )
  
  data_frame_result
}
x <- map_df(1:1000, sim)

x |> 
  group_by(variable) |>
  summarise(prop_sig = mean(significant_in_mod1))
# A tibble: 20 × 2
   variable prop_sig
   <chr>       <dbl>
 1 x1          1    
 2 x10         0.059
 3 x11         0.04 
 4 x12         0.058
 5 x13         0.057
 6 x14         0.046
 7 x15         0.054
 8 x16         0.058
 9 x17         0.041
10 x18         0.045
11 x19         0.06 
12 x2          1    
13 x20         0.052
14 x3          1    
15 x4          0.054
16 x5          0.043
17 x6          0.053
18 x7          0.068
19 x8          0.047
20 x9          0.037

If I repeat what you did many times (here, 1,000 times), each variable with no true relationship to y is flagged as statistically significant about 5% of the time

It’s not just you

Code
x |> 
  group_by(variable) |>
  summarise(prop_sig = mean(significant_in_mod1)) |>
  mutate(variable = fct_reorder(variable, prop_sig)) |>
  ggplot(aes(x = prop_sig, y = variable)) + 
  geom_col() + 
  labs(x = "Proportion significant")

Choosing predictors

We could try every combination of all variables

Code
data(MLB2007Standings)
MLB2007Standings <- MLB2007Standings |> 
  select(-Wins, -Losses, -Team)
# install.packages("leaps")
library(leaps)

## run on all subsets
all_subsets <- regsubsets(WinPct ~ ., data = MLB2007Standings)
## find the combination with the smallest BIC 
plot(all_subsets)

Choosing predictors

Code
## fit final model
lm(WinPct ~ BattingAvg + Runs + Saves + WHIP, data = MLB2007Standings)

Call:
lm(formula = WinPct ~ BattingAvg + Runs + Saves + WHIP, data = MLB2007Standings)

Coefficients:
(Intercept)   BattingAvg         Runs        Saves         WHIP  
  0.1716932    1.4406921    0.0003187    0.0041726   -0.3356606  

⚠️ Choosing predictors

  • This could get computationally expensive - you are fitting \(2^k\) models (so if you have 10 predictors, that is 1,024 models you are choosing between, yikes! See the quick check after this list)
  • Another issue with trying every possible combination is that you could overfit your model to your data if you don’t pick the right metric for comparison: that is, the model might fit these particular observations very well, but would do a poor job predicting a new sample.
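
Since the number of candidate models doubles with every added predictor, a quick check of that arithmetic (base R only):

Code
# number of possible models with k candidate predictors
k <- c(5, 10, 20)
2^k   # 32, 1024, and 1,048,576 models respectively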

Overfitting

Code
set.seed(1)
d_samp1 <- tibble(
  x1 = rnorm(20),
  x2 = rnorm(20),
  x3 = rnorm(20),
  x4 = rnorm(20),
  x5 = rnorm(20),
  x6 = rnorm(20),
  x7 = rnorm(20),
  x8 = rnorm(20),
  x9 = rnorm(20),
  x10 = rnorm(20),
  x11 = rnorm(20),
  x12 = rnorm(20),
  x13 = rnorm(20),
  x14 = rnorm(20),
  x15 = rnorm(20),
  y = rnorm(20)
)

set.seed(98)
d_samp2 <- tibble(
  x1 = rnorm(20),
  x2 = rnorm(20),
  x3 = rnorm(20),
  x4 = rnorm(20),
  x5 = rnorm(20),
  x6 = rnorm(20),
  x7 = rnorm(20),
  x8 = rnorm(20),  
  x9 = rnorm(20),
  x10 = rnorm(20),
  x11 = rnorm(20),
  x12 = rnorm(20),
  x13 = rnorm(20),
  x14 = rnorm(20),
  x15 = rnorm(20),
  y =  rnorm(20)
)

mod <- lm(y ~ ., data = d_samp1)
summary(mod)

Call:
lm(formula = y ~ ., data = d_samp1)

Residuals:
       1        2        3        4        5        6        7        8 
-0.31883  0.02333 -0.02733 -0.33500  0.49943  0.32396 -0.31248  0.30849 
       9       10       11       12       13       14       15       16 
-0.02289 -0.42645 -0.10343 -0.13380  0.18120 -0.33629 -0.30803  0.67190 
      17       18       19       20 
 0.17158  0.28788 -0.06399 -0.07926 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -0.6353     0.2992  -2.123   0.1010  
x1            0.9292     0.4621   2.011   0.1147  
x2            0.5034     0.4371   1.152   0.3136  
x3            1.5263     0.5007   3.048   0.0381 *
x4            1.1963     0.5077   2.356   0.0780 .
x5            0.1780     0.4998   0.356   0.7397  
x6            0.9486     0.5354   1.772   0.1511  
x7            0.5901     0.3538   1.668   0.1707  
x8           -0.2217     0.2561  -0.866   0.4355  
x9            0.3176     0.3042   1.044   0.3555  
x10           2.2498     0.6722   3.347   0.0286 *
x11          -0.1485     0.2098  -0.708   0.5180  
x12          -0.3574     0.3537  -1.010   0.3695  
x13          -1.3627     0.4351  -3.132   0.0351 *
x14           2.4456     0.6371   3.839   0.0185 *
x15           1.5671     0.5852   2.678   0.0553 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6686 on 4 degrees of freedom
Multiple R-squared:  0.8922,    Adjusted R-squared:  0.4882 
F-statistic: 2.208 on 15 and 4 DF,  p-value: 0.2311
Code
ggplot(d_samp1, aes(x = fitted(mod), y = residuals(mod))) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0) +
  labs(x = "Fitted value",
       y = "Residuals")

Overfitting

mean(residuals(mod)^2)
[1] 0.08941498
resids <- d_samp2$y - predict(mod, d_samp2)
resids
           1            2            3            4            5            6 
  9.00593710   0.98656438   6.26075868   6.13383066  -0.75454446  -9.73993408 
           7            8            9           10           11           12 
  9.56894306   2.85617756  -4.63838598   5.37057921   2.19513132   3.60929297 
          13           14           15           16           17           18 
 -1.49788006  -0.02425235  -2.28031909  -3.17567864  11.73805521   0.38308719 
          19           20 
-11.11273392  -3.99014053 
mean(resids^2)
[1] 35.85672
Code
ggplot(d_samp2, aes(x = predict(mod, d_samp2), y = resids)) +
  geom_point(size = 3) + 
  geom_point(aes(x = fitted(mod), y = residuals(mod)), color = "cornflowerblue", size = 3) + 
  geom_hline(yintercept = 0) + 
  labs(x = "Fitted (new sample)",
       y = "Residuals (new sample)")

Overfitting

  • Solution: Fit the model on part of the data and calculate the error on the remaining part
  • Cross-validation: fit the model on a subset of observations, see how it performs on the rest
  • Leave-one-out cross-validation: fit the model on n-1 observations, then see how the model performs on the nth (a minimal sketch follows this list)
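
A minimal sketch of full leave-one-out cross-validation on d_samp1, looping over every observation rather than holding out a single row as the code below does:

Code
# for each observation i: fit on the other n-1 rows, then record the
# squared prediction error for the held-out row
loocv_errors <- map_dbl(1:nrow(d_samp1), function(i) {
  fit <- lm(y ~ ., data = d_samp1[-i, ])
  (d_samp1$y[i] - predict(fit, d_samp1[i, ]))^2
})
mean(loocv_errors)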

LOOCV

set.seed(10)
train <- d_samp1 |>
  sample_n(19)
test <- d_samp1 |>
  anti_join(train)
mod <- lm(y ~ ., data = train)
summary(mod)

Call:
lm(formula = y ~ ., data = train)

Residuals:
       1        2        3        4        5        6        7        8 
 0.21045 -0.23308 -0.12596  0.42737  0.02388 -0.30874  0.26494  0.27812 
       9       10       11       12       13       14       15       16 
 0.04443 -0.33224 -0.02625  0.22471  0.03455  0.14295  0.23164 -0.21622 
      17       18       19 
-0.10781 -0.33574 -0.19699 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -1.29965    0.49820  -2.609   0.0798 .
x1           2.61723    1.15476   2.266   0.1083  
x2           0.39510    0.38179   1.035   0.3768  
x3           1.47738    0.43120   3.426   0.0417 *
x4           2.45737    0.92006   2.671   0.0756 .
x5           0.07084    0.43470   0.163   0.8809  
x6           1.39491    0.54188   2.574   0.0822 .
x7           0.10954    0.43321   0.253   0.8167  
x8          -0.20495    0.22021  -0.931   0.4207  
x9           0.20127    0.27177   0.741   0.5126  
x10          4.46902    1.53815   2.905   0.0622 .
x11          0.28387    0.33111   0.857   0.4543  
x12          0.14482    0.44312   0.327   0.7653  
x13         -1.85182    0.48823  -3.793   0.0322 *
x14          4.44539    1.39637   3.184   0.0500 *
x15          3.04203    1.07255   2.836   0.0658 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5742 on 3 degrees of freedom
Multiple R-squared:  0.9387,    Adjusted R-squared:  0.632 
F-statistic: 3.061 on 15 and 3 DF,  p-value: 0.1939
mean(residuals(mod)^2)
[1] 0.05206743
mean((test$y - predict(mod, test))^2)
[1] 19.44538

LOOCV

Code
library(glue)  # for glue() and glue_collapse(), used to build model formulas

get_error <- function(k) {
  train <- d_samp1 |>
    sample_n(19)
  test <- d_samp1 |>
    anti_join(train, by = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "y"))
  form <- as.formula(glue(
    "y ~ {glue_collapse(glue('x{1:k}'), sep = ' + ')}"
  ))
  mod <- lm(form, data = train)
  tibble(
    error = c(mean(residuals(mod)^2), mean((test$y - predict(mod, test))^2)),
    type = c("train", "test"),
    k = k
  )
}

set.seed(1)
err <- map_df(rep(1:7, each = 100), get_error) 
err |>
  group_by(k, type) |>
  summarise(error = mean(error)) |>
  ggplot(aes(x = k, y = error, color = type)) +
  geom_point() +
  geom_line()

AIC

  • It turns out that, for linear regression, AIC mimics the choice from LOOCV, so you don’t have to code up this more complicated method!
Code
get_aic <- function(k) {
   form <- as.formula(glue(
    "y ~ {glue_collapse(glue('x{1:k}'), sep = ' + ')}"
  ))
  mod <- lm(form, data = d_samp1)
  tibble(
    AIC = AIC(mod),
    k = k
  )
}

err <- map_df(1:7, get_aic)
err |>
  ggplot(aes(x = k, y = AIC)) +
  geom_point() +
  geom_line()
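
As a check on the claim above: for a linear model, the LOOCV error can be computed exactly without refitting, by scaling each residual by one minus its hat (leverage) value. A sketch reusing d_samp1, so the values can be compared with the AIC results above:

Code
get_loocv <- function(k) {
  form <- as.formula(glue(
    "y ~ {glue_collapse(glue('x{1:k}'), sep = ' + ')}"
  ))
  mod <- lm(form, data = d_samp1)
  tibble(
    # exact LOOCV mean squared error for a linear model
    loocv_mse = mean((residuals(mod) / (1 - hatvalues(mod)))^2),
    k = k
  )
}
map_df(1:7, get_loocv)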

BIC

  • BIC is equivalent to leave-\(k\)-out cross validation (where \(k = n\left[1 - 1/(\log(n) - 1)\right]\)) for linear models, so we can also use this metric without having to code up the complex cross validation!
Code
get_bic <- function(k) {
   form <- as.formula(glue(
    "y ~ {glue_collapse(glue('x{1:k}'), sep = ' + ')}"
  ))
  mod <- lm(form, data = d_samp1)
  tibble(
    BIC = BIC(mod),
    k = k
  )
}

err <- map_df(1:7, get_bic)
err |>
  ggplot(aes(x = k, y = BIC)) +
  geom_point() +
  geom_line()
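
For a sense of scale, plugging this example's sample size (n = 20) into the formula above gives the number of observations left out per fold:

Code
n <- 20
# k = n * (1 - 1 / (log(n) - 1)) is about 9.98, i.e. BIC here behaves
# roughly like leave-10-out cross validation
n * (1 - 1 / (log(n) - 1))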

Application Exercise

  1. Open appex-14.qmd. Install the leaps package by running install.packages("leaps")
  2. Use the regsubsets function to find the best subset of variables to include in your model with BIC as the criterion
  3. Which variables did you choose? (One way to extract them is sketched below.)
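
If helpful for step 3, one way to pull the variable names out of the BIC-best model, mirroring what the simulation code earlier does (all_subsets stands in for whatever you named your regsubsets fit):

Code
# all_subsets is your fitted regsubsets object
subset_summary <- summary(all_subsets)
# row of the `which` matrix for the lowest-BIC model, dropping the intercept
best <- which.min(subset_summary$bic)
names(which(subset_summary$which[best, -1]))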

It’s not just you

Code
x |> 
  group_by(variable) |>
  summarise(prop_sig = mean(significant_in_mod1)) |>
  mutate(variable = fct_reorder(variable, prop_sig)) |>
  ggplot(aes(x = prop_sig, y = variable)) + 
  geom_col() + 
  labs(x = "Proportion significant") + 
  theme(text=element_text(size=30))

Code
x |> 
  group_by(variable) |>
  summarise(prop_sig = mean(in_best_subset_mod2)) |>
  mutate(variable = fct_reorder(variable, prop_sig)) |>
  ggplot(aes(x = prop_sig, y = variable)) + 
  geom_col() + 
  labs(x = "Proportion chosen in best subset (BIC)") + 
  theme(text=element_text(size=30))

Big picture

inference

we need to use our theoretical understanding of the relationship between variables in order to properly select variables to include in our model

prediction

  1. Choose a set of predictors to assess
  2. Use a metric that balances parsimony with goodness of fit (\(R^2_{adj}\), AIC, BIC) to select the model (a short sketch follows this list)
  3. Best practice is to fit the model on one set of data and test it on another (using something like leave-one-out cross validation), but it turns out AIC and BIC mimic this for linear regression (yay!), so we can reliably use these metrics even when not splitting our data
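
For example, once you have chosen a handful of candidate models (step 1), the comparison in step 2 can be as simple as the sketch below; the candidate formulas are placeholders, and d_samp1 from earlier is reused:

Code
candidates <- list(
  y ~ x1,
  y ~ x1 + x2,
  y ~ x1 + x2 + x3
)
# fit each candidate and collect adjusted R^2, AIC, and BIC side by side
map_df(candidates, function(f) {
  mod <- lm(f, data = d_samp1)
  tibble(
    formula = deparse(f),
    adj_r_squared = summary(mod)$adj.r.squared,
    AIC = AIC(mod),
    BIC = BIC(mod)
  )
})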