Types of variables

Lucy D’Agostino McGowan

Variable types

  • There are two major classes of variables
    • numeric (quantitative)
    • categorical

Variable types

  • Recall from the first week of class that you can use the glimpse() function to see all of your variables and their types
data("Diamonds")
glimpse(Diamonds)
Rows: 351
Columns: 6
$ Carat      <dbl> 1.08, 0.31, 0.31, 0.32, 0.33, 0.33, 0.35, 0.35, 0.37, 0.38,…
$ Color      <fct> E, F, H, F, D, G, F, F, F, D, E, F, D, D, F, F, D, D, E, F,…
$ Clarity    <fct> VS1, VVS1, VS1, VVS1, IF, VVS1, VS1, VS1, VVS1, IF, VVS2, I…
$ Depth      <dbl> 68.6, 61.9, 62.1, 60.8, 60.8, 61.5, 62.5, 62.3, 61.4, 60.0,…
$ PricePerCt <dbl> 6693.3, 3159.0, 1755.0, 3159.0, 4758.8, 2895.8, 2457.0, 245…
$ TotalPrice <dbl> 7228.8, 979.3, 544.1, 1010.9, 1570.4, 955.6, 860.0, 860.0, …
  • What are the variables here?
  • fct: “factor”, a type of categorical variable

Variable types

  • Recall from the first week of class that you can use the glimpse() function to see all of your variables and their types
glimpse(starwars)
Rows: 87
Columns: 5
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
  • chr: “character”, a type of categorical variable

Variable types

  • So far, our models have only included numeric (quantitative) variables
  • What would the equation be for predicting \(y\) from \(x\) when \(x\) is numeric?
  • What would happen if \(x\) is categorical?
    • What would the equation be for predicting \(y\) from \(x\) if \(x\) is categorical with 2 levels?
    • What would the equation be for predicting \(y\) from \(x\) if \(x\) is categorical with 3 levels?

Indicator variable

An indicator variable uses two values, usually 0 and 1, to indicate whether a data case does (1) or does not (0) belong to a specific category
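
As a small illustration of the definition above, here is a minimal sketch in base R. The vector `color` is made-up toy data, not the Diamonds dataset, and the name `color_d` is chosen only for this example.

```r
# Toy data (hypothetical): the category of five data cases
color <- c("D", "E", "D", "J", "E")

# 1 when the case belongs to category "D", 0 when it does not
color_d <- ifelse(color == "D", 1, 0)
color_d
```

Each entry of `color_d` answers a yes/no question about one data case: is this case in category D?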

Indicator variables

What does this code do?

Diamonds <- Diamonds |>
  mutate(
    ColorD = ifelse(Color == "D", 1, 0), 
    ColorE = ifelse(Color == "E", 1, 0),
    ColorF = ifelse(Color == "F", 1, 0),
    ColorG = ifelse(Color == "G", 1, 0),
    ColorH = ifelse(Color == "H", 1, 0),
    ColorI = ifelse(Color == "I", 1, 0),
    ColorJ = ifelse(Color == "J", 1, 0)
  )

Indicator variables

What if I wanted to model the relationship between TotalPrice and Color?

Indicator variables

Why is ColorJ NA?

lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI + ColorJ,
   data = Diamonds)

Call:
lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG + 
    ColorH + ColorI + ColorJ, data = Diamonds)

Coefficients:
(Intercept)       ColorD       ColorE       ColorF       ColorG       ColorH  
       1936         3632         2423         7224         7623         6732  
     ColorI       ColorJ  
       5704           NA  
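  • When including indicator variables in a model for k categories, always include only k - 1 of them: the k indicators sum to 1 for every row, which duplicates the intercept, so R cannot estimate the last coefficient and reports NA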
  • The one that is left out is the “reference” category
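
The NA can be reproduced on a tiny example. This is a sketch under stated assumptions: the data frame `toy` and its values are invented for illustration and only mimic the structure of the Diamonds indicators.

```r
# Toy data (hypothetical): a numeric outcome and a three-level category
toy <- data.frame(
  price = c(10, 12, 20, 22, 30, 32),
  color = c("D", "D", "E", "E", "J", "J")
)
toy$ColorD <- ifelse(toy$color == "D", 1, 0)
toy$ColorE <- ifelse(toy$color == "E", 1, 0)
toy$ColorJ <- ifelse(toy$color == "J", 1, 0)

# Every row belongs to exactly one category, so the indicators sum to 1,
# which is the same as the column of 1s used for the intercept
indicator_sums <- rowSums(toy[, c("ColorD", "ColorE", "ColorJ")])
indicator_sums

# With all three indicators included, the last coefficient is reported as NA
fit_all <- lm(price ~ ColorD + ColorE + ColorJ, data = toy)
coef(fit_all)
```

Dropping any one of the three indicators removes the redundancy; the dropped category then plays the role of the reference.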

Indicator variables

What is the reference category?

lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI,
   data = Diamonds)

Call:
lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG + 
    ColorH + ColorI, data = Diamonds)

Coefficients:
(Intercept)       ColorD       ColorE       ColorF       ColorG       ColorH  
       1936         3632         2423         7224         7623         6732  
     ColorI  
       5704  
  • Interpretation: A diamond with color D has an expected total price 3632 dollars higher than a diamond with color J.
  • Interpretation: A diamond with color E has an expected total price 2423 dollars higher than a diamond with color J.
  • Interpretation: A diamond with color J has an expected total price of 1936 dollars.
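
One way to check interpretations like these is to compare the fitted coefficients with group means. The sketch below uses invented toy data (the data frame `toy` is hypothetical), with the rows where both indicators are 0 acting as the reference category.

```r
# Toy data (hypothetical): two indicators; rows with both equal to 0
# belong to the reference category
toy <- data.frame(
  price  = c(10, 12, 20, 22, 30, 32),
  ColorD = c(1, 1, 0, 0, 0, 0),
  ColorE = c(0, 0, 1, 1, 0, 0)
)

fit <- lm(price ~ ColorD + ColorE, data = toy)
coef(fit)

# Group means computed directly from the data
mean_ref <- mean(toy$price[toy$ColorD == 0 & toy$ColorE == 0])
mean_d   <- mean(toy$price[toy$ColorD == 1])

# The intercept matches the reference-group mean, and the ColorD
# coefficient matches the difference between the D mean and the reference mean
c(intercept = mean_ref, ColorD = mean_d - mean_ref)
```

With a single categorical predictor, each indicator's coefficient is the difference between that group's mean and the reference group's mean.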

Indicator variables

What is the reference category?

lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI,
   data = Diamonds)

Call:
lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG + 
    ColorH + ColorI, data = Diamonds)

Coefficients:
(Intercept)       ColorD       ColorE       ColorF       ColorG       ColorH  
       1936         3632         2423         7224         7623         6732  
     ColorI  
       5704  
  • Interpretation: A diamond with color D has an expected total price 3632 dollars higher than a diamond with color J.
  • What is the interpretation for a diamond with Color F?

R is smart

lm(TotalPrice ~ Color, data = Diamonds)

Call:
lm(formula = TotalPrice ~ Color, data = Diamonds)

Coefficients:
(Intercept)       ColorE       ColorF       ColorG       ColorH       ColorI  
       5569        -1209         3592         3990         3100         2071  
     ColorJ  
      -3632  

What is the reference category?

R is smart

lm(TotalPrice ~ Color, data = Diamonds)

Call:
lm(formula = TotalPrice ~ Color, data = Diamonds)

Coefficients:
(Intercept)       ColorE       ColorF       ColorG       ColorH       ColorI  
       5569        -1209         3592         3990         3100         2071  
     ColorJ  
      -3632  
  • What is the interpretation for Color E now?
  • What if we wanted a different reference category?
    • We could code the indicators ourselves
    • We could relevel the factor

Relevel

levels(Diamonds$Color)
[1] "D" "E" "F" "G" "H" "I" "J"
new_levels <- c("J", "D", "E", "F", "G", "H", "I")
Diamonds <- Diamonds |>
  mutate(Color = fct_relevel(Color, new_levels))
levels(Diamonds$Color)
[1] "J" "D" "E" "F" "G" "H" "I"

You can also move a single level to the front like this:

Diamonds <- Diamonds |>
  mutate(Color = fct_relevel(Color, "J"))
levels(Diamonds$Color)
[1] "J" "D" "E" "F" "G" "H" "I"
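
The same reordering can be done without the forcats package. As a minimal sketch, base R's relevel() moves one level to the front; the factor `color` below is toy data invented for this example, not the Diamonds column.

```r
# Toy factor (hypothetical); levels default to alphabetical order
color <- factor(c("D", "E", "J", "D", "J"))
levels(color)

# Move "J" to the front so it becomes the reference level in a model
color_j <- relevel(color, ref = "J")
levels(color_j)
```

Only the order of the levels changes; the values stored for each data case stay the same.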

R is smart

lm(TotalPrice ~ Color, data = Diamonds)

Call:
lm(formula = TotalPrice ~ Color, data = Diamonds)

Coefficients:
(Intercept)       ColorD       ColorE       ColorF       ColorG       ColorH  
       1936         3632         2423         7224         7623         6732  
     ColorI  
       5704  

What is the reference category?

What if the variable is binary?

  • A binary variable is a special type of categorical variable with two levels

ICU example

  • A sample of 200 patients in an ICU
  • We want to see whether a patient’s heart rate is related to whether they were admitted via the emergency room
    • y: Heart rate (beats per minute)
    • x: indicator for emergency room admission
  • Aside: Is this inference or prediction?

Binary x variable

data("ICU")
lm(Pulse ~ Emergency, data = ICU)

Call:
lm(formula = Pulse ~ Emergency, data = ICU)

Coefficients:
(Intercept)    Emergency  
      91.11        10.63  
  • How can we interpret \(\hat{\beta}_0\) now?
  • How can we interpret \(\hat{\beta}_1\)?
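
To explore these questions, it can help to fit the same kind of model to a tiny dataset where the group means are easy to compute by hand. This is a sketch with invented numbers; the data frame `toy` is hypothetical and is not the ICU data.

```r
# Toy data (hypothetical): a numeric outcome and a 0/1 indicator
toy <- data.frame(
  pulse     = c(80, 90, 100, 95, 105, 110),
  emergency = c(0, 0, 0, 1, 1, 1)
)

fit <- lm(pulse ~ emergency, data = toy)
coef(fit)

# Group means computed directly from the data
mean_0 <- mean(toy$pulse[toy$emergency == 0])
mean_1 <- mean(toy$pulse[toy$emergency == 1])

# Compare with the coefficients: the mean when the indicator is 0,
# and the difference in means between the two groups
c(mean_0 = mean_0, difference = mean_1 - mean_0)
```

Comparing the two printed results shows how the intercept and slope relate to the two group means when x only takes the values 0 and 1.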

Application Exercise

  1. Copy the following template into RStudio Pro:
https://github.com/sta-112-s24/appex-12.git
  2. What are the variables in the Diamonds dataset?
  3. What are the levels of the Clarity variable in the Diamonds data?
  4. Fit a model with TotalPrice as the outcome and Clarity as the explanatory variable
  5. Change the reference category to SI1 and refit the model
  6. Add the variable Depth to your model. How do you interpret its coefficient?