Regression and Correlation

Lucy D’Agostino McGowan

Application Exercise

  1. Copy the following template into RStudio Pro: https://github.com/sta-112-s24/appex-10.git
  2. Load the packages and then examine the PorschePrice data frame
  3. Fit a linear model predicting a Porsche’s price from the mileage
  4. Examine the ANOVA table: what is the F statistic? What is the associated p-value? What hypothesis is it testing?
04:00
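
As a rough sketch of the workflow in this exercise, the block below fits a simple linear model and pulls the F statistic and p-value out of the ANOVA table. It uses R's built-in mtcars data (mpg predicted from wt) as a stand-in, which is an assumption for illustration and not the exercise data. In simple linear regression the F test addresses the null hypothesis that the slope is zero, so the F statistic should match the squared t statistic for the slope.

```r
# Stand-in example on the built-in mtcars data (NOT the PorschePrice data)
mod <- lm(mpg ~ wt, data = mtcars)

# ANOVA table: rows for the predictor and the residuals
tab <- anova(mod)
print(tab)

# F statistic and p-value sit in the first row (the predictor's row)
f_stat <- tab[["F value"]][1]
p_val  <- tab[["Pr(>F)"]][1]

# With a single predictor, F equals the squared t statistic for the slope
t_slope <- summary(mod)$coefficients["wt", "t value"]
same_test <- isTRUE(all.equal(f_stat, t_slope^2))
print(same_test)
```

The same pattern should carry over to any model fit with lm() that has a single predictor; only the formula and the data argument change.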

Partitioning variability

Why?

  • \(y - \bar{y} = (\hat{y} - \bar{y}) + (y - \hat{y})\)
  • \(\sum(y - \bar{y})^2 = \sum(\hat{y} - \bar{y})^2 + \sum(y - \hat{y})^2\)
  • SSTotal = SSModel + SSE
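
The partition above can be checked numerically. The sketch below computes each sum of squares by hand for a least-squares fit and confirms that they add up. The mtcars data and the variable names (ss_total, ss_model, ss_error) are assumptions chosen for illustration.

```r
# Stand-in example on the built-in mtcars data
mod <- lm(mpg ~ wt, data = mtcars)

y     <- mtcars$mpg   # observed response
y_hat <- fitted(mod)  # fitted values from the model
y_bar <- mean(y)      # mean of the response

ss_total <- sum((y - y_bar)^2)      # SSTotal
ss_model <- sum((y_hat - y_bar)^2)  # SSModel
ss_error <- sum((y - y_hat)^2)      # SSE

# The two pieces should add up to the total (up to floating-point error)
partition_holds <- isTRUE(all.equal(ss_total, ss_model + ss_error))
print(partition_holds)
```

The identity relies on the model being fit by least squares with an intercept, which is what lm() does by default.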

Coefficient of determination

Often referred to as \(r^2\), it is the fraction of the response variability that is explained by the model.

Coefficient of determination

  • \(r^2 = \frac{\textrm{Variability explained by the model}}{\textrm{Total variability in } y}\)
  • \(r^2 = \frac{\textrm{SSModel}}{\textrm{SSTotal}}\)
  • \(r^2 = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y-\bar{y})^2}\)

Application Exercise

\[r^2 = \frac{\textrm{SSModel}}{\textrm{SSTotal}}\]

How could you calculate \(r^2\) if all you had was \(\textrm{SSTotal}\) and \(\textrm{SSE}\)?

01:00

Coefficient of determination

  • \(r^2 = \frac{\textrm{Variability explained by the model}}{\textrm{Total variability in } y}\)
  • \(r^2 = \frac{\textrm{SSModel}}{\textrm{SSTotal}}\)
  • \(r^2 = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y-\bar{y})^2}\)
  • \(r^2 = \frac{\textrm{SSTotal} - \textrm{SSE}}{\textrm{SSTotal}}\)
  • \(r^2 = 1 - \frac{\textrm{SSE}}{\textrm{SSTotal}}\)
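
To see that these forms agree, the sketch below computes \(r^2\) from SSTotal and SSE alone and compares it with the value reported by summary(). As before, the mtcars data and the variable names are assumptions for illustration.

```r
# Stand-in example on the built-in mtcars data
mod <- lm(mpg ~ wt, data = mtcars)

y        <- mtcars$mpg
ss_total <- sum((y - mean(y))^2)   # SSTotal
ss_error <- sum(residuals(mod)^2)  # SSE

# Two equivalent ways to get r^2 from SSTotal and SSE
r2_ratio <- (ss_total - ss_error) / ss_total
r2_one   <- 1 - ss_error / ss_total

# Value reported as "Multiple R-squared" in the summary output
r2_summary <- summary(mod)$r.squared

forms_agree <- isTRUE(all.equal(r2_ratio, r2_one)) &&
  isTRUE(all.equal(r2_one, r2_summary))
print(forms_agree)
```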

Let’s do it in R!

mod <- lm(frequency_score ~ group, data = data)
summary(mod)

Call:
lm(formula = frequency_score ~ group, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-38.80 -22.55 -11.80  30.20  95.20 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   39.800      7.757   5.131 8.82e-06 ***
groupsquare   -8.000     10.970  -0.729     0.47    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.69 on 38 degrees of freedom
Multiple R-squared:  0.0138,    Adjusted R-squared:  -0.01215 
F-statistic: 0.5318 on 1 and 38 DF,  p-value: 0.4703

About 1.4% of the variation in the frequency score is explained by group (Multiple R-squared: 0.0138).

Application Exercise

  1. Open appex-10.qmd
  2. Run summary on your model predicting Porsche price from mileage
  3. What is the \(r^2\)? How can you interpret this?
03:00