Manipulating Data in R

Lucy D’Agostino McGowan

Application Exercise

  1. Create a new project using my template here:
https://github.com/sta-112-s24/appex-04-starwars.git
  1. Run the R chunk labeled "load-packages"
  2. Take a look at the starwars data
04:00

dplyr: verbs for manipulating data

Extract rows with filter() filter
Extract columns with select() select
Arrange/sort rows with arrange() arrange
Make new columns with mutate() mutate
Make group summaries with
group_by() |> summarize()
summarize

filter()

filter()

Extract rows that meet some sort of test

filter(.data = DATA, ...)
  • DATA = Data frame to transform
  • ... = One or more tests filter() returns each row for which the test is TRUE

filter(.data = starwars, species == "Droid")
name species height
Luke Skywalker Human 172
C-3PO Droid 167
R2-D2 Droid 96
Darth Vader Human 202
Leia Organa Human 150

filter(.data = starwars, species == "Droid")
name species height
Luke Skywalker Human 172
C-3PO Droid 167
R2-D2 Droid 96
Darth Vader Human 202
Leia Organa Human 150
name species height
C-3PO Droid 167
R2-D2 Droid 96
R5-D4 Droid 97
IG-88 Droid 200
R4-P17 Droid 96
BB8 Droid NA

filter()

filter(starwars, 
       species == "Droid")
  • One = sets an argument
  • Two == checks for equality
    returns TRUE or FALSE

Logical tests

Test Meaning Test Meaning
x < y Less than x %in% y In (group membership)
x > y Greater than is.na(x) Is missing
== Equal to !is.na(x) Is not missing
x <= y Less than or equal to
x >= y Greater than or equal to
x != y Not equal to

Application Exercise

In the Console, use filter() and logical tests to show…

  1. The data for Humans
  2. The data for characters with a homeworld Alderaan
  3. Rows where the mass is greater than 1000
04:00

filter(starwars, species == "Human")



filter(starwars, homeworld == "Alderaan")



filter(starwars, mass > 1000)

Common mistakes

Using = instead of ==

👎

filter(starwars, 
       homeworld = "Alderaan")

👍

filter(starwars, 
       homeworld == "Alderaan")

Common mistakes

Using = instead of ==

👎

filter(starwars, 
       homeworld = "Alderaan")

👍

filter(starwars, 
       homeworld == "Alderaan")

Quotes

👎

filter(starwars, 
       homeworld == Alderaan)

👍

filter(starwars, 
       homeworld == "Alderaan")

filter() with multiple conditions

Extract rows that meet every test

filter(starwars, 
       species == "Droid",
       height > 100)

filter(.data = starwars, species == "Droid", height > 100)
name species height
Luke Skywalker Human 172
C-3PO Droid 167
R2-D2 Droid 96
Darth Vader Human 202
Leia Organa Human 150

filter(.data = starwars, species == "Droid", height > 100)
name species height
Luke Skywalker Human 172
C-3PO Droid 167
R2-D2 Droid 96
Darth Vader Human 202
Leia Organa Human 150
name species height
C-3PO Droid 167
IG-88 Droid 200

Boolean operators

Operator Meaning
a & b and
a | b or
!a not

Default is “and”

These do the same thing:

filter(starwars, species == "Droid", height > 100)


filter(starwars, species == "Droid" & height > 100)

Application Exercise

Create a new R chunk in your .qmd file. Use filter() and Boolean logical tests to show…

  1. Droids more than 70 kg
  2. Characters that are born before 56 and have blue eyes
  3. Characters that are born before 56 and have blue eyes and are not Human
04:00

filter(starwars, species == "Droid", mass > 70)



filter(starwars, birth_year < 56, 
       eye_color == "blue")



filter(starwars, birth_year < 56, 
       eye_color == "blue", !species == "Human")

Common mistakes

Collapsing multiple tests into one

👎

filter(starwars, 
       50 < mass < 100)

👍

filter(starwars, 
       mass > 50,
       mass < 100)

Using multiple tests instead of %in%

👎

filter(starwars, 
       homeworld == "Alderaan",
       homeworld == "Bespin",
       homeworld == "Corellia")

👍

filter(starwars, 
       homeworld %in% c("Alderaan", "Bespin", "Corellia"))

Common syntax

  • Every dplyr verb function follows the same pattern
  • First argument is a data frame; returns a data frame
VERB(DATA, ...)
  • VERB = dplyr function/verb
  • DATA = Data frame to transform
  • ... = Stuff the verb does

mutate()

Create new columns

mutate(.data, ...)
  • DATA = Data frame to transform
  • ... = Columns to make

mutate(.data = starwars, bmi = mass / (height / 100)^2)
name mass height
Luke Skywalker 77 172
C-3PO 75 167
R2-D2 32 96
Darth Vader 136 202
Leia Organa 49 150

mutate(.data = starwars, bmi = mass / (height / 100)^2)
name mass height
Luke Skywalker 77 172
C-3PO 75 167
R2-D2 32 96
Darth Vader 136 202
Leia Organa 49 150
name bmi
Luke Skywalker 26.03
C-3PO 26.89
R2-D2 34.72
Darth Vader 33.33
Leia Organa 21.78

mutate(.data = starwars, 
       height_m = height / 100,
       bmi = mass / (height_m)^2)
name mass height
Luke Skywalker 77 172
C-3PO 75 167
R2-D2 32 96
Darth Vader 136 202
Leia Organa 49 150

mutate(.data = starwars, 
       height_m = height / 100,
       bmi = mass / (height_m)^2)
name mass height
Luke Skywalker 77 172
C-3PO 75 167
R2-D2 32 96
Darth Vader 136 202
Leia Organa 49 150
name height_m bmi
Luke Skywalker 1.72 26.03
C-3PO 1.67 26.89
R2-D2 0.96 34.72
Darth Vader 2.02 33.33
Leia Organa 1.5 21.78

ifelse()

Do conditional tests within mutate()

ifelse(TEST,      
       VALUE_IF_TRUE, 
       VALUE_IF_FALSE)
  • TEST = A logical test
  • VALUE_IF_TRUE = What happens if test is true
  • VALUE_IF_FALSE = What happens if test is false

Application Exercise

Create a new R chunk in your .qmd file. Use mutate() to

  1. Add a droid column that is TRUE if the species is a Droid
  2. Add a column for the logged weight (hint: use log())
  3. Add a human_droid column that says “Human or Droid” if the character is a human or a droid and “Not Human or Droid” if it’s not
05:00

What if you have multiple verbs?

Make a dataset of just Droids and calculate their BMI

. . .

Pipes!

The |> operator (pipe) takes an object on the left and passes it as the first argument of the function on the right

. . .

starwars |> filter(_, species == "Droid")

What if you have multiple verbs?

These do the same thing!


filter(starwars, species == "Droid")


starwars |> filter(species == "Droid")

What if you have multiple verbs?

Make a dataset of just Droids and calculate their BMI

starwars |>
  filter(species == "Droid") |>
  mutate(bmi = mass / (height / 100)^2)

|>

leave_house(get_dressed(get_out_of_bed(wake_up(me, time = "8:00"), side = "correct"), pants = TRUE, shirt = TRUE), car = TRUE, bike = FALSE)



me |> 
  wake_up(time = "8:00") |> 
  get_out_of_bed(side = "correct") |> 
  get_dressed(pants = TRUE, shirt = TRUE) |> 
  leave_house(car = TRUE, bike = FALSE)

summarize()

Compute a table of summaries

starwars |>
  summarize(mean_height = mean(height, na.rm = TRUE))
name species height
Luke Skywalker Human 172
C-3PO Droid 167
R2-D2 Droid 96
Darth Vader Human 202
Leia Organa Human 150

summarize()

Compute a table of summaries

starwars |>
  summarize(mean_height = mean(height, na.rm = TRUE))
name species height
Luke Skywalker Human 172
C-3PO Droid 167
R2-D2 Droid 96
Darth Vader Human 202
Leia Organa Human 150
mean_height
174.6049
00:30

summarize()

starwars |>
  summarize(mean_height = mean(height, na.rm = TRUE),
            max_height = max(height, na.rm = TRUE))
name species height
Luke Skywalker Human 172
C-3PO Droid 167
R2-D2 Droid 96
Darth Vader Human 202
Leia Organa Human 150
mean_height max_height
174.6049 264

Application Exercise

Create a new R chunk in your .qmd file. Use summarize() to calculate…

  1. The first (minimum) birth year in the dataset
  2. The last (maximum) birth year in the dataset
  3. The number of distinct homeworlds in the dataset (use the cheatsheet)
06:00

starwars |>
  summarize(first = min(birth_year, na.rm = TRUE),
            last = max(birth_year, na.rm = TRUE),
            num_rows = n(),
            num_unique = n_distinct(homeworld))

Application Exercise

Create a new R chunk in your .qmd file. Use filter() and summarize() to calculate

  1. The number of unique homeworlds among humans
  2. The median height for characters from Tatooine born before the year 100
05:00

group_by()

Put rows into groups based on values in a column

starwars |> group_by(species)

. . .

Nothing happens by itself!

. . .

Powerful when combined with summarize()

starwars |>
  group_by(gender) |>
  summarize(n_species = n_distinct(species))

Application Exercise

Create a new R chunk in your .qmd file.

  1. Find the minimum, maximum, and median mass for each species
  2. Find the minimum, maximum, and median mass for each species from Tatooine
06:00

Evaluation

https://docs.google.com/forms/d/e/1FAIpQLSeXqLgZEe2oTa9exX4LI7wtkZvnr9Qp4QZ5ooCDCcIKKH1hZg/viewform?usp=sf_link