Prediction intervals

Lucy D’Agostino McGowan

Confidence intervals

If we used the same sampling method to select different samples and computed an interval estimate for each sample, we would expect the true population parameter (\(\beta_1\)) to fall within the interval estimates 95% of the time.
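One way to see what "95% of the time" means is a small simulation. The sketch below is only an illustration: the true slope, sample size, and noise level are made-up values, not anything from the course data. It draws many samples from the same process, builds a 95% interval for the slope each time, and records how often the interval captures the true slope.

```r
set.seed(1)

true_beta1 <- 2     # hypothetical "true" slope, chosen for illustration
n_sims     <- 1000  # number of repeated samples

covered <- replicate(n_sims, {
  # draw a fresh sample from the same (made-up) population
  x <- rnorm(40)
  y <- 1 + true_beta1 * x + rnorm(40, sd = 3)

  # 95% confidence interval for the slope in this sample
  ci <- confint(lm(y ~ x))["x", ]

  # did this interval capture the true slope?
  ci[1] <= true_beta1 && true_beta1 <= ci[2]
})

# proportion of intervals that contain the true slope (should be near 0.95)
mean(covered)
```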

Confidence interval for \(\beta_1\)

How do we calculate the confidence interval for the slope?

\[\hat\beta_1\pm t^*SE_{\hat\beta_1}\]

How do we calculate it in R?

  • With the confint function:
mod <- lm(frequency_score ~ group, data)
summary(mod)

Call:
lm(formula = frequency_score ~ group, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-38.80 -22.55 -11.80  30.20  95.20 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   39.800      7.757   5.131 8.82e-06 ***
groupsquare   -8.000     10.970  -0.729     0.47    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.69 on 38 degrees of freedom
Multiple R-squared:  0.0138,    Adjusted R-squared:  -0.01215 
F-statistic: 0.5318 on 1 and 38 DF,  p-value: 0.4703
confint(mod)
                2.5 %   97.5 %
(Intercept)  24.09636 55.50364
groupsquare -30.20830 14.20830

How do we calculate it in R?

  • “by hand”
t_star <- qt(0.025, df = nrow(data) - 2, lower.tail = FALSE)
# or
t_star <- qt(0.975, df =  nrow(data) - 2)
-8 - t_star * 10.97
[1] -30.2076
-8 + t_star * 10.97
[1] 14.2076
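Rather than typing in the rounded estimate and standard error, the same "by hand" calculation can pull them straight from the fitted model. The sketch below uses simulated data, because the course data set is not reproduced here. The names `sim_data` and `sim_mod` are hypothetical stand-ins for `data` and `mod`.

```r
set.seed(42)

# simulated stand-in for the course data (hypothetical values)
sim_data <- data.frame(
  group = rep(c("circle", "square"), each = 20),
  frequency_score = rnorm(40, mean = 35, sd = 30)
)
sim_mod <- lm(frequency_score ~ group, data = sim_data)

# pull the estimate and its standard error from the coefficient table
coefs <- coef(summary(sim_mod))
est   <- coefs["groupsquare", "Estimate"]
se    <- coefs["groupsquare", "Std. Error"]

# critical value with n - 2 degrees of freedom
t_star <- qt(0.975, df = df.residual(sim_mod))

by_hand <- c(est - t_star * se, est + t_star * se)
by_hand

# should agree with confint()
confint(sim_mod)["groupsquare", ]
```

Because nothing is rounded along the way, the two results should agree to many decimal places, not just the first few.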

Confidence intervals

There are ✌️ other types of confidence intervals we may want to calculate

  • The confidence interval for the mean response in \(y\) for a given \(x^*\) value
  • The prediction interval for an individual response \(y\) for a given \(x^*\) value
  • Why are these different? Which do you think is easier to estimate?
  • It is harder to predict one response than to predict a mean response. What does this mean in terms of the standard error?
  • The SE of the prediction interval is going to be larger

Confidence intervals

confidence interval for \(\mu_y\) and prediction interval

\[ \hat{y}\pm t^* SE\]

  • \(\hat{y}\) is the predicted \(y\) for a given \(x^*\)
  • \(t^*\) is the critical value for the \(t_{n-2}\) density curve
  • \(SE\) takes ✌️ different values depending on which interval you’re interested in
  • \(SE_{\hat\mu}\)
  • \(SE_{\hat{y}}\)

Which will be larger?

  • \(SE_{\hat\mu} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{n}+\frac{(x^*-\bar{x})^2}{\Sigma(x-\bar{x})^2}}\)
  • \(SE_{\hat{y}}=\hat{\sigma}_\epsilon\sqrt{1 + \frac{1}{n}+\frac{(x^*-\bar{x})^2}{\Sigma(x-\bar{x})^2}}\)
  • What is the difference between these two equations?

  • \(SE_{\hat{y}}=\hat{\sigma}_\epsilon\sqrt{\color{red}1 + \frac{1}{n}+\frac{(x^*-\bar{x})^2}{\Sigma(x-\bar{x})^2}}\)
  • The extra 1 is there because an individual response will vary from the mean response \(\mu_y\) with a standard deviation of \(\sigma_\epsilon\)
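To check the two standard error formulas, we can compute them directly and compare against `predict()`. This is a minimal sketch on simulated data with a numeric predictor. The variables `x`, `y`, `fit`, and `x_star` are made up for illustration and are not part of the course data.

```r
set.seed(123)

# simulated data with a numeric predictor (hypothetical values)
x   <- runif(30, 0, 10)
y   <- 5 + 2 * x + rnorm(30, sd = 3)
fit <- lm(y ~ x)

x_star    <- 4            # the x value we want an interval at
n         <- length(x)
sigma_hat <- sigma(fit)   # residual standard error

# the two standard errors from the formulas above
lev   <- 1 / n + (x_star - mean(x))^2 / sum((x - mean(x))^2)
se_mu <- sigma_hat * sqrt(lev)       # mean response
se_y  <- sigma_hat * sqrt(1 + lev)   # individual response

t_star <- qt(0.975, df = n - 2)
y_hat  <- unname(coef(fit)[1] + coef(fit)[2] * x_star)

ci_hand <- y_hat + c(-1, 1) * t_star * se_mu
pi_hand <- y_hat + c(-1, 1) * t_star * se_y

# compare with predict()
new_x <- data.frame(x = x_star)
predict(fit, newdata = new_x, interval = "confidence")
predict(fit, newdata = new_x, interval = "prediction")
```

Both hand-computed intervals should match the `lwr` and `upr` columns from `predict()`, and `se_y` is always larger than `se_mu` because of the extra 1 under the square root.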

Let’s do it in R!

mod <- lm(frequency_score ~ group, data = data)
predict(mod) 
   1    2    3 
39.8 39.8 39.8 
mod <- lm(frequency_score ~ group, data = data)
predict(mod, interval = "confidence") 
   fit      lwr      upr
1 39.8 24.09636 55.50364
2 39.8 24.09636 55.50364
3 39.8 24.09636 55.50364
mod <- lm(frequency_score ~ group, data = data)
predict(mod, interval = "prediction") 

## WARNING predictions on current data refer to _future_ responses

   fit       lwr      upr
1 39.8 -32.16311 111.7631
2 39.8 -32.16311 111.7631
3 39.8 -32.16311 111.7631
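Where do these numbers come from? As a rough check, we can rebuild both intervals from the rounded values in the `summary(mod)` output. This assumes rows 1 to 3 are in the reference group, which fits their fitted value of 39.8 matching the intercept. For that group the standard error of the mean response is just the standard error of the intercept. Because the inputs are rounded, the results only match the output above to about two decimal places.

```r
# values copied (rounded) from the summary(mod) output
y_hat     <- 39.8    # fitted value for the reference group (the intercept)
se_mu     <- 7.757   # Std. Error of the intercept = SE of that group's mean
sigma_hat <- 34.69   # residual standard error
t_star    <- qt(0.975, df = 38)

# sigma_hat * sqrt(1 + ...) is the same as sqrt(sigma_hat^2 + se_mu^2)
se_y <- sqrt(sigma_hat^2 + se_mu^2)

conf_int <- y_hat + c(-1, 1) * t_star * se_mu
pred_int <- y_hat + c(-1, 1) * t_star * se_y

conf_int  # compare with lwr / upr from interval = "confidence"
pred_int  # compare with lwr / upr from interval = "prediction"
```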

Let’s do it in R!

What if we have new data?

new_data <- data.frame(
  group = c("square", "circle")
)
new_data
   group
1 square
2 circle
predict(
  mod, 
  newdata = new_data, 
  interval = "prediction")
   fit       lwr      upr
1 31.8 -40.16311 103.7631
2 39.8 -32.16311 111.7631

Application Exercise

  1. Open appex-10.qmd
  2. You are interested in the predicted average Porsche Price for Porsche cars that have 50,000 miles previously driven. Calculate this value with an appropriate confidence interval.
  3. You are interested in the predicted Porsche Price for a particular Porsche with 40,000 miles previously driven. Calculate this value with an appropriate interval.