Lab 05 - Multiple Linear Regression

Due: 2024-04-09 at 11:59pm Turn your .html file in on Canvas

Getting started

Go to RStudio Pro and click:

Step 1. File > New Project
Step 2. “Version Control”
Step 3. Git
Step 4. Copy the following into the “Repository URL”:

https://github.com/sta-112-s24/lab-05-multiple-linear-regression.git

Exercises

The Diamonds data set can be used to examine the predict a diamond’s price using characteristics about the diamond. For this lab, you need to try to find the best model to predict the total price of a diamond. Here “best” is defined as a model that (1) meets the assumptions of multiple regression and (2) has a good model fit, as determined by the metrics we’ve learned so far. I want you to show your work. Your report should include:

  1. A description of the data including: How many observations are in this data set? What are the observations? How many variables are in this data set? Is there any missing data? If so, handle the missing data and report how you did so.
  2. A list of all models you attempted (must be at least 3, if using a method like “best subset”, include a figure showing each model with the corresponding goodness of fit metric)
  3. Figures displaying the checks for the assumptions for multiple linear regression for the final model you pick
  4. The equation for the final model you pick
  5. The model goodness of fit metric for the final model you picked (and how it compared to the other models)
  6. Finally, I want you to use this new data set below to predict what a particular diamond with these specifications would cost, along with an appropriate prediction interval. (Note: it is not a problem if you do not include all of these variables in your model)
new_data <- data.frame(
  Carat = 2,
  Color = "D",
  Clarity = "SI2",
  Depth = 69
)

Some things to consider: you may want to try transformations of the outcome, include polynomial terms to account for non-linearity, and/or include interaction terms. Also, be sure to carefully understand what each of the available variables mean. It would not make sense to use a variable that is a direct function of the outcome in the model, for example.