Data frame manipulation with dplyr

Working with data frames using dplyr and the tidyverse

Let’s install and load the tidyverse package. Add some chunk options (#| warning: false) to suppress noisy loading messages.

# in the console ONCE: install.packages("tidyverse")

# load the tidyverse library (or just dplyr)

Use read_csv() (a tidyverse version of read.csv()) to load the gapminder dataset as a “tibble”

# use read_csv() to load gapminder and save it as gapminder

# look at gapminder (without head())

The dplyr functions that we’re going to use

  • select(): extract columns from your data frame

  • filter(): filter to rows of your data frame based on a condition

  • mutate(): add columns or modify columns in your data frame

  • summarize(): aggregate information in your columns

  • group_by(): perform an operation separately for each entry in a categorical column

Select()

We can use the select() function to extract specific named columns from our data frame

# use select() to look at just the country, year, lifeExp columns
# the first agrument is the dataframe

We can also remove columns:

# use select() to remove continent, year, and pop columns

Select can also help you rename columns

# use select() to rename lifeExp to life_exp and gdpPercap to gdp_per_cap 
# (keeping all other columns)
# use rename() to do the same thing more succinctly

The pipe |> (formerly %>%)

The pipe greatly improves the readability of our code.

The pipe syntax is: object |> function(). This places the object to the left of |> into the first argument of the function to the right of |>.

The following two pieces of code are equivalent:

# apply head() to gapminder directly

# apply head() to gapminder using the pipe

Another example:

# apply select to gapminder, year, and pop without the pipe

# apply select to gapminder, year, and pop with the pipe

Note: the pipe |> is now a part of the R programming language. Previously, you needed to load the magrittr, dplyr, or tidyverse libraries to access the pipe and it had a different symbol: %>% (there are very minor differences in functionality). This still works, but it is now recommended that you use the newer “native” pipe syntax: |>.

Exercise

  1. Use the pipe to remove the continent column from gapminder. Save the output as an object called gapminder_tmp.
  1. Modify your code by adding another pipe to rename the gdpPercap column as “gdp_per_cap”. You do not need to define a new object.
  1. Try to do this using the base R syntax (this is kind of hard!)

Filtering rows using filter()

The filter function lets you filter to specific rows based on a condition.

Let’s filter to just the data from Australia.

Recall that you can apply logical operations to a vector as follows:

# ask which entries in the country column are equal to "Australia"

We can ask similar questions inside the filter function to use them to subset our rows. Note: no $ extraction required!

# filter() just to Australia (without the pipe)

We can provide multiple conditions using a comma

# filter just to Australia and the years after 1970 (without the pipe)

We can write this code with the pipe

# filter just to Australia and the years after 1970 with the pipe

Exercise

Let’s combine filter and select

  • Filter to the continent of Africa, where the year is equal to 1992

  • Return just the country and lifeExp columns, where I rename lifeExp to be life_exp

Try to switch the order of the select() and filter() operations. Explain why this doesn’t work

Next, try to extract the rows for Australia and Italy only.

Exercise

Filter gapminder to all countries on the Oceania continent for years 1987 and 1992 and select just the country, year, and gdpPercap columns (as gdp_per_cap).

Save the output in an object called gapminder_oceania.

Adding and modifying columns using mutate()

Let’s use mutate to add a gdp column to our data that is the product of pop and gdpPercap

# use mutate to add a gdp column

The code above hasn’t actually modified gapminder.

# print gapminder

Use mutate to round lifeExp to the nearest integer. Save the resulting df as gapminder_tmp

# use mutate to round lifeExp to the nearest integer. 
# Save the resulting df as gapminder_tmp

Exercise

Modify gapminder_tmp using mutate() so that it contains a “expected year of death” column, e.g., deathExp = year + lifeExp. Overwrite your gapminder_tmp object with this new version.

Summarizing data frames using summarize()

Let’s use summarize to compute the average lifeExp in the entire dataset

# compute the average lifeExp value

Combining summarize() with group_by()

Let’s compute the average life expectancy for each continent

# compute the average lifeExp value for each continent

Exercise

Use group_by() and summarize() to compute the maximum population for each country.

Use group_by() with mutate()

Let’s add a mean_life_exp column to our data frame for each continent

# create a new mean_life_exp column containing the average lifeExp # value for each continent

It’s really important to remember to ungroup() your data frame after you have completed your group_by() operation.

# to your previous code add:
# summarize the mean gdpPercap value without ungrouping.
# then try with ungrouping

Exercise

Filter the gapminder object to the countries in Asia after (and including) the year 2000, then compute the average lifeExp for each country.

Group_by multiple columns simultaneously

Let’s compute the average lifeExp for each continent-year combination

# compute the average life expectancy for each continent-year combination

Count

A really handy function for summarizing categorical (character/factor) variables is the count() function.

# count the number of times each continent appears

Arrange

You can arrange the rows of your data frame in ascending or descending order of the values of a column in your data using the arrange() function.

# arrange the rows of gapminder in ascending order of lifeExp
# arrange the rows of gapminder in descending order of lifeExp

Exercise

Use summarize() to compute the median lifeExp for each country, and then arrange the countries in descending order of maximum pop value.