# in the console ONCE: install.packages("tidyverse")
# load the tidyverse library (or just dplyr)
Data frame manipulation with dplyr
Working with data frames using dplyr and the tidyverse
Let’s install and load the tidyverse package. Add some chunk options (#| warning: false
) to suppress noisy loading messages.
Use read_csv()
(a tidyverse version of read.csv()
) to load the gapminder dataset as a “tibble”
# use read_csv() to load gapminder and save it as gapminder
# look at gapminder (without head())
The dplyr functions that we’re going to use
select()
: extract columns from your data framefilter()
: filter to rows of your data frame based on a conditionmutate()
: add columns or modify columns in your data framesummarize()
: aggregate information in your columnsgroup_by()
: perform an operation separately for each entry in a categorical column
Select()
We can use the select() function to extract specific named columns from our data frame
# use select() to look at just the country, year, lifeExp columns
# the first agrument is the dataframe
We can also remove columns:
# use select() to remove continent, year, and pop columns
Select can also help you rename columns
# use select() to rename lifeExp to life_exp and gdpPercap to gdp_per_cap
# (keeping all other columns)
# use rename() to do the same thing more succinctly
The pipe |>
(formerly %>%
)
The pipe greatly improves the readability of our code.
The pipe syntax is: object |> function()
. This places the object to the left of |>
into the first argument of the function to the right of |>
.
The following two pieces of code are equivalent:
# apply head() to gapminder directly
# apply head() to gapminder using the pipe
Another example:
# apply select to gapminder, year, and pop without the pipe
# apply select to gapminder, year, and pop with the pipe
Note: the pipe |>
is now a part of the R programming language. Previously, you needed to load the magrittr, dplyr, or tidyverse libraries to access the pipe and it had a different symbol: %>%
(there are very minor differences in functionality). This still works, but it is now recommended that you use the newer “native” pipe syntax: |>
.
Exercise
- Use the pipe to remove the continent column from gapminder. Save the output as an object called
gapminder_tmp
.
- Modify your code by adding another pipe to rename the gdpPercap column as “gdp_per_cap”. You do not need to define a new object.
- Try to do this using the base R syntax (this is kind of hard!)
Filtering rows using filter()
The filter function lets you filter to specific rows based on a condition.
Let’s filter to just the data from Australia.
Recall that you can apply logical operations to a vector as follows:
# ask which entries in the country column are equal to "Australia"
We can ask similar questions inside the filter function to use them to subset our rows. Note: no $
extraction required!
# filter() just to Australia (without the pipe)
We can provide multiple conditions using a comma
# filter just to Australia and the years after 1970 (without the pipe)
We can write this code with the pipe
# filter just to Australia and the years after 1970 with the pipe
Exercise
Let’s combine filter and select
Filter to the continent of Africa, where the year is equal to 1992
Return just the country and lifeExp columns, where I rename lifeExp to be life_exp
Try to switch the order of the select() and filter() operations. Explain why this doesn’t work
Next, try to extract the rows for Australia and Italy only.
Exercise
Filter gapminder to all countries on the Oceania continent for years 1987 and 1992 and select just the country, year, and gdpPercap columns (as gdp_per_cap).
Save the output in an object called gapminder_oceania.
Adding and modifying columns using mutate()
Let’s use mutate to add a gdp
column to our data that is the product of pop
and gdpPercap
# use mutate to add a gdp column
The code above hasn’t actually modified gapminder.
# print gapminder
Use mutate to round lifeExp to the nearest integer. Save the resulting df as gapminder_tmp
# use mutate to round lifeExp to the nearest integer.
# Save the resulting df as gapminder_tmp
Exercise
Modify gapminder_tmp
using mutate() so that it contains a “expected year of death” column, e.g., deathExp = year + lifeExp
. Overwrite your gapminder_tmp
object with this new version.
Summarizing data frames using summarize()
Let’s use summarize to compute the average lifeExp in the entire dataset
# compute the average lifeExp value
Combining summarize() with group_by()
Let’s compute the average life expectancy for each continent
# compute the average lifeExp value for each continent
Exercise
Use group_by() and summarize() to compute the maximum population for each country.
Use group_by() with mutate()
Let’s add a mean_life_exp
column to our data frame for each continent
# create a new mean_life_exp column containing the average lifeExp # value for each continent
It’s really important to remember to ungroup()
your data frame after you have completed your group_by()
operation.
# to your previous code add:
# summarize the mean gdpPercap value without ungrouping.
# then try with ungrouping
Exercise
Filter the gapminder
object to the countries in Asia after (and including) the year
2000, then compute the average lifeExp
for each country.
Group_by multiple columns simultaneously
Let’s compute the average lifeExp for each continent-year combination
# compute the average life expectancy for each continent-year combination
Count
A really handy function for summarizing categorical (character/factor) variables is the count()
function.
# count the number of times each continent appears
Arrange
You can arrange the rows of your data frame in ascending or descending order of the values of a column in your data using the arrange()
function.
# arrange the rows of gapminder in ascending order of lifeExp
# arrange the rows of gapminder in descending order of lifeExp
Exercise
Use summarize()
to compute the median lifeExp for each country, and then arrange the countries in descending order of maximum pop
value.