Selecting predictors

Lucy D’Agostino McGowan

Selecting variables

When the goal is inference, how do we choose what to include in our model?

  • If we are trying to disentangle “correlation” from “causation” we need to understand the theoretical relationship between the explanatory variable we are interested in, the outcome variable we are interested in, and any confounders.
  • There is no statistical test we can run to check whether we have included all relevant variables
  • Using statistical hypothesis tests to decide what to include in our model can lead to confidence intervals that are too narrow

Selecting variables

  • When the goal is prediction, we can use the metrics we learned about last class to compare models
  • But how do we choose what to put in those models to begin with?
  • What if we pick our models based on what is “statistically significant” (i.e., only include variables where p < 0.05)?

Application Exercise

  1. Get into groups of ~3 and copy the template for your group into RStudio Pro, replacing group-n with your group number (i.e., group 1 would be group-1):
https://github.com/sta-112-s24/appex-14-group-n.git
  2. Run the code in the top R chunk to sample your data

  3. Run the code to fit a linear model predicting y using all of the other predictors. Hint: if you want to use every variable in your dataset as a predictor, you can use the shorthand y ~ . for your formula (a minimal sketch follows this list)

  4. If you were only selecting variables based on statistical significance, what would you include?
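
For step 3, a minimal sketch of what that shorthand looks like; sampled_data is a hypothetical name standing in for whatever the top chunk of your template calls the data frame:

Code
# `y ~ .` regresses y on every other column in the data frame,
# so you do not have to type out each predictor by name
# (sampled_data is a placeholder; use your own data frame's name)
mod_all <- lm(y ~ ., data = sampled_data)
summary(mod_all)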


It’s not just you

Code
library(leaps)
library(tidyverse)  # for tibble(), map_df(), and ggplot2 used below
n <- 10000
sim <- function(i) {
  set.seed(i)
  data <- tibble(
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = rnorm(n),
    x4 = rnorm(n),
    x5 = rnorm(n),
    x6 = rnorm(n),
    x7 = rnorm(n),
    x8 = rnorm(n),
    x9 = rnorm(n),
    x10 = rnorm(n),
    x11 = rnorm(n),
    x12 = rnorm(n),
    x13 = rnorm(n),
    x14 = rnorm(n),
    x15 = rnorm(n),
    x16 = rnorm(n),
    x17 = rnorm(n),
    x18 = rnorm(n),
    x19 = rnorm(n),
    x20 = rnorm(n),
    y = 0.2 * x1 + 0.2 * x2 + 0.2 * x3 + rnorm(n)
  )
  
  mod1 <- lm(y ~ ., data)
  mod2 <- regsubsets(y ~ ., data)
  significant_vars_mod1 <- summary(mod1)$coefficients[,4] < 0.05
  
  mod2_summary <- summary(mod2, all.best = TRUE)
  best_model <- which.min(mod2_summary$bic)
  best_subset_vars <- mod2_summary$which[best_model,][-1]  # Exclude intercept
  
  var_names <- names(significant_vars_mod1)[-1]  # Exclude intercept
  data_frame_result <- tibble(
    variable = var_names,
    significant_in_mod1 = significant_vars_mod1[-1],  # Exclude intercept
    in_best_subset_mod2 = best_subset_vars,
    seed = i
  )
  
  data_frame_result
}
x <- map_df(1:1000, sim)

x |> 
  group_by(variable) |>
  summarise(prop_sig = mean(significant_in_mod1))
# A tibble: 20 × 2
   variable prop_sig
   <chr>       <dbl>
 1 x1          1    
 2 x10         0.059
 3 x11         0.04 
 4 x12         0.058
 5 x13         0.057
 6 x14         0.046
 7 x15         0.054
 8 x16         0.058
 9 x17         0.041
10 x18         0.045
11 x19         0.06 
12 x2          1    
13 x20         0.052
14 x3          1    
15 x4          0.054
16 x5          0.043
17 x6          0.053
18 x7          0.068
19 x8          0.047
20 x9          0.037

If I repeat what you did many times (here, 1,000 times), each variable with no true relationship to y is flagged as statistically significant about 5% of the time

It’s not just you

Code
x |> 
  group_by(variable) |>
  summarise(prop_sig = mean(significant_in_mod1)) |>
  mutate(variable = fct_reorder(variable, prop_sig)) |>
  ggplot(aes(x = prop_sig, y = variable)) + 
  geom_col() + 
  labs(x = "Proportion significant")

Choosing predictors

We could try every combination of all variables

Code
data(MLB2007Standings)
MLB2007Standings <- MLB2007Standings |> 
  select(-Wins, -Losses, -Team)
# install.packages("leaps")
library(leaps)

## run on all subsets
all_subsets <- regsubsets(WinPct ~ ., data = MLB2007Standings)
## find the combination with the smallest BIC 
plot(all_subsets)

Choosing predictors

Code
## fit final model
lm(WinPct ~ BattingAvg + Runs + Saves + WHIP, data = MLB2007Standings)

Call:
lm(formula = WinPct ~ BattingAvg + Runs + Saves + WHIP, data = MLB2007Standings)

Coefficients:
(Intercept)   BattingAvg         Runs        Saves         WHIP  
  0.1716932    1.4406921    0.0003187    0.0041726   -0.3356606  

⚠️ Choosing predictors

  • This could get computationally expensive - you are fitting \(2^k\) models (so if you have 10 predictors, that is 1,024 models you are choosing between, yikes! See the quick check after this list)
  • Another issue with trying every possible combination is that you could overfit your model to your data if you don’t pick the right metric for comparison: that is, the model might fit these particular observations very well, but would do a poor job predicting a new sample.
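
Since the number of candidate models doubles with every added predictor, a quick check of that arithmetic (base R only):

Code
# number of possible models with k candidate predictors
k <- c(5, 10, 20)
2^k   # 32, 1024, and 1,048,576 models respectively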

Overfitting

Code
set.seed(1)
d_samp1 <- tibble(
  x1 = rnorm(20),
  x2 = rnorm(20),
  x3 = rnorm(20),
  x4 = rnorm(20),
  x5 = rnorm(20),
  x6 = rnorm(20),
  x7 = rnorm(20),
  x8 = rnorm(20),
  x9 = rnorm(20),
  x10 = rnorm(20),
  x11 = rnorm(20),
  x12 = rnorm(20),
  x13 = rnorm(20),
  x14 = rnorm(20),
  x15 = rnorm(20),
  y = rnorm(20)
)

set.seed(98)
d_samp2 <- tibble(
  x1 = rnorm(20),
  x2 = rnorm(20),
  x3 = rnorm(20),
  x4 = rnorm(20),
  x5 = rnorm(20),
  x6 = rnorm(20),
  x7 = rnorm(20),
  x8 = rnorm(20),  
  x9 = rnorm(20),
  x10 = rnorm(20),
  x11 = rnorm(20),
  x12 = rnorm(20),
  x13 = rnorm(20),
  x14 = rnorm(20),
  x15 = rnorm(20),
  y =  rnorm(20)
)

mod <- lm(y ~ ., data = d_samp1)
summary(mod)

Call:
lm(formula = y ~ ., data = d_samp1)

Residuals:
       1        2        3        4        5        6        7        8 
-0.31883  0.02333 -0.02733 -0.33500  0.49943  0.32396 -0.31248  0.30849 
       9       10       11       12       13       14       15       16 
-0.02289 -0.42645 -0.10343 -0.13380  0.18120 -0.33629 -0.30803  0.67190 
      17       18       19       20 
 0.17158  0.28788 -0.06399 -0.07926 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -0.6353     0.2992  -2.123   0.1010  
x1            0.9292     0.4621   2.011   0.1147  
x2            0.5034     0.4371   1.152   0.3136  
x3            1.5263     0.5007   3.048   0.0381 *
x4            1.1963     0.5077   2.356   0.0780 .
x5            0.1780     0.4998   0.356   0.7397  
x6            0.9486     0.5354   1.772   0.1511  
x7            0.5901     0.3538   1.668   0.1707  
x8           -0.2217     0.2561  -0.866   0.4355  
x9            0.3176     0.3042   1.044   0.3555  
x10           2.2498     0.6722   3.347   0.0286 *
x11          -0.1485     0.2098  -0.708   0.5180  
x12          -0.3574     0.3537  -1.010   0.3695  
x13          -1.3627     0.4351  -3.132   0.0351 *
x14           2.4456     0.6371   3.839   0.0185 *
x15           1.5671     0.5852   2.678   0.0553 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6686 on 4 degrees of freedom
Multiple R-squared:  0.8922,    Adjusted R-squared:  0.4882 
F-statistic: 2.208 on 15 and 4 DF,  p-value: 0.2311
Code
ggplot(d_samp1, aes(x = fitted(mod), y = residuals(mod))) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0) +
  labs(x = "Fitted value",
       y = "Residuals")

Overfitting

mean(residuals(mod)^2)
[1] 0.08941498
resids <- d_samp2$y - predict(mod, d_samp2)
resids
           1            2            3            4            5            6 
  9.00593710   0.98656438   6.26075868   6.13383066  -0.75454446  -9.73993408 
           7            8            9           10           11           12 
  9.56894306   2.85617756  -4.63838598   5.37057921   2.19513132   3.60929297 
          13           14           15           16           17           18 
 -1.49788006  -0.02425235  -2.28031909  -3.17567864  11.73805521   0.38308719 
          19           20 
-11.11273392  -3.99014053 
mean(resids^2)
[1] 35.85672
Code
ggplot(d_samp2, aes(x = predict(mod, d_samp2), y = resids)) +
  geom_point(size = 3) + 
  geom_point(aes(x = fitted(mod), y = residuals(mod)), color = "cornflowerblue", size = 3) + 
  geom_hline(yintercept = 0) + 
  labs(x = "Fitted (new sample)",
       y = "Residuals (new sample)")

Overfitting

  • Solution: Fit the model on part of the data and calculate the error on the remaining part
  • Cross-validation: fit the model on a subset of observations, see how it performs on the rest
  • Leave-one-out cross-validation: fit the model on n-1 observations, then see how the model performs on the nth (a minimal sketch follows this list)
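
A minimal sketch of full leave-one-out cross-validation on d_samp1, looping over every observation rather than holding out a single row as the code below does:

Code
# for each observation i: fit on the other n-1 rows, then record the
# squared prediction error for the held-out row
loocv_errors <- map_dbl(1:nrow(d_samp1), function(i) {
  fit <- lm(y ~ ., data = d_samp1[-i, ])
  (d_samp1$y[i] - predict(fit, d_samp1[i, ]))^2
})
mean(loocv_errors)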

LOOCV

set.seed(10)
train <- d_samp1 |>
  sample_n(19)
test <- d_samp1 |>
  anti_join(train)
mod <- lm(y ~ ., data = train)
summary(mod)

Call:
lm(formula = y ~ ., data = train)

Residuals:
       1        2        3        4        5        6        7        8 
 0.21045 -0.23308 -0.12596  0.42737  0.02388 -0.30874  0.26494  0.27812 
       9       10       11       12       13       14       15       16 
 0.04443 -0.33224 -0.02625  0.22471  0.03455  0.14295  0.23164 -0.21622 
      17       18       19 
-0.10781 -0.33574 -0.19699 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -1.29965    0.49820  -2.609   0.0798 .
x1           2.61723    1.15476   2.266   0.1083  
x2           0.39510    0.38179   1.035   0.3768  
x3           1.47738    0.43120   3.426   0.0417 *
x4           2.45737    0.92006   2.671   0.0756 .
x5           0.07084    0.43470   0.163   0.8809  
x6           1.39491    0.54188   2.574   0.0822 .
x7           0.10954    0.43321   0.253   0.8167  
x8          -0.20495    0.22021  -0.931   0.4207  
x9           0.20127    0.27177   0.741   0.5126  
x10          4.46902    1.53815   2.905   0.0622 .
x11          0.28387    0.33111   0.857   0.4543  
x12          0.14482    0.44312   0.327   0.7653  
x13         -1.85182    0.48823  -3.793   0.0322 *
x14          4.44539    1.39637   3.184   0.0500 *
x15          3.04203    1.07255   2.836   0.0658 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5742 on 3 degrees of freedom
Multiple R-squared:  0.9387,    Adjusted R-squared:  0.632 
F-statistic: 3.061 on 15 and 3 DF,  p-value: 0.1939
mean(residuals(mod)^2)
[1] 0.05206743
mean((test$y - predict(mod, test))^2)
[1] 19.44538

LOOCV

Code
library(glue)  # for glue() and glue_collapse(), used to build model formulas

get_error <- function(k) {
  train <- d_samp1 |>
    sample_n(19)
  test <- d_samp1 |>
    anti_join(train, by = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "y"))
  form <- as.formula(glue(
    "y ~ {glue_collapse(glue('x{1:k}'), sep = ' + ')}"
  ))
  mod <- lm(form, data = train)
  tibble(
    error = c(mean(residuals(mod)^2), mean((test$y - predict(mod, test))^2)),
    type = c("train", "test"),
    k = k
  )
}

set.seed(1)
err <- map_df(rep(1:7, each = 100), get_error) 
err |>
  group_by(k, type) |>
  summarise(error = mean(error)) |>
  ggplot(aes(x = k, y = error, color = type)) +
  geom_point() +
  geom_line()

AIC

  • It turns out that, for linear regression, AIC mimics the choice from LOOCV, so you don’t have to code up this more complicated method!
Code
get_aic <- function(k) {
   form <- as.formula(glue(
    "y ~ {glue_collapse(glue('x{1:k}'), sep = ' + ')}"
  ))
  mod <- lm(form, data = d_samp1)
  tibble(
    AIC = AIC(mod),
    k = k
  )
}

err <- map_df(1:7, get_aic)
err |>
  ggplot(aes(x = k, y = AIC)) +
  geom_point() +
  geom_line()
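
As a check on the claim above: for a linear model, the LOOCV error can be computed exactly without refitting, by scaling each residual by one minus its hat (leverage) value. A sketch reusing d_samp1, so the values can be compared with the AIC results above:

Code
get_loocv <- function(k) {
  form <- as.formula(glue(
    "y ~ {glue_collapse(glue('x{1:k}'), sep = ' + ')}"
  ))
  mod <- lm(form, data = d_samp1)
  tibble(
    # exact LOOCV mean squared error for a linear model
    loocv_mse = mean((residuals(mod) / (1 - hatvalues(mod)))^2),
    k = k
  )
}
map_df(1:7, get_loocv)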

BIC

  • BIC is equivalent to leave-\(k\)-out cross validation (where \(k = n\left[1 - 1/(\log(n) - 1)\right]\)) for linear models, so we can also use this metric without having to code up the complex cross validation!
Code
get_bic <- function(k) {
   form <- as.formula(glue(
    "y ~ {glue_collapse(glue('x{1:k}'), sep = ' + ')}"
  ))
  mod <- lm(form, data = d_samp1)
  tibble(
    BIC = BIC(mod),
    k = k
  )
}

err <- map_df(1:7, get_bic)
err |>
  ggplot(aes(x = k, y = BIC)) +
  geom_point() +
  geom_line()
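
For a sense of scale, plugging this example's sample size (n = 20) into the formula above gives the number of observations left out per fold:

Code
n <- 20
# k = n * (1 - 1 / (log(n) - 1)) is about 9.98, i.e. BIC here behaves
# roughly like leave-10-out cross validation
n * (1 - 1 / (log(n) - 1))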

Application Exercise

  1. Open appex-14.qmd. Install the leaps package by running install.packages("leaps")
  2. Use the regsubsets function to find the best subset of variables to include in your model with BIC as the criterion
  3. Which variables did you choose? (One way to extract them is sketched below.)
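
If helpful for step 3, one way to pull the variable names out of the BIC-best model, mirroring what the simulation code earlier does (all_subsets stands in for whatever you named your regsubsets fit):

Code
# all_subsets is your fitted regsubsets object
subset_summary <- summary(all_subsets)
# row of the `which` matrix for the lowest-BIC model, dropping the intercept
best <- which.min(subset_summary$bic)
names(which(subset_summary$which[best, -1]))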

It’s not just you

Code
x |> 
  group_by(variable) |>
  summarise(prop_sig = mean(significant_in_mod1)) |>
  mutate(variable = fct_reorder(variable, prop_sig)) |>
  ggplot(aes(x = prop_sig, y = variable)) + 
  geom_col() + 
  labs(x = "Proportion significant") + 
  theme(text=element_text(size=30))

Code
x |> 
  group_by(variable) |>
  summarise(prop_sig = mean(in_best_subset_mod2)) |>
  mutate(variable = fct_reorder(variable, prop_sig)) |>
  ggplot(aes(x = prop_sig, y = variable)) + 
  geom_col() + 
  labs(x = "Proportion chosen in best subset (BIC)") + 
  theme(text=element_text(size=30))

Big picture

inference

we need to use our theoretical understanding of the relationship between variables in order to properly select variables to include in our model

prediction

  1. Choose a set of predictors to assess
  2. Use a metric that balances parsimony with goodness of fit (\(R^2_{adj}\), AIC, BIC) to select the model (a short sketch follows this list)
  3. Best practice is to fit the model on one set of data and test it on another (using something like leave-one-out cross validation), but it turns out AIC and BIC mimic this for linear regression (yay!), so we can reliably use these metrics even when not splitting our data
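
For example, once you have chosen a handful of candidate models (step 1), the comparison in step 2 can be as simple as the sketch below; the candidate formulas are placeholders, and d_samp1 from earlier is reused:

Code
candidates <- list(
  y ~ x1,
  y ~ x1 + x2,
  y ~ x1 + x2 + x3
)
# fit each candidate and collect adjusted R^2, AIC, and BIC side by side
map_df(candidates, function(f) {
  mod <- lm(f, data = d_samp1)
  tibble(
    formula = deparse(f),
    adj_r_squared = summary(mod)$adj.r.squared,
    AIC = AIC(mod),
    BIC = BIC(mod)
  )
})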