Iteration with purrr

The purpose of this lesson is to learn how to iteratively apply functions to all elements contained within an object, such as all columns in a data frame, or all entries in a vector.

The “purrr” R package that we will be using in this lesson is part of the tidyverse, so it is loaded whenever you load the tidyverse package.

# load the tidyverse and demographics dataset
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
demographics <- read_csv("data/demographics.csv")
Rows: 10175 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): interview_examination, gender, race, marital_status, pregnant
dbl (9): respondent_id, age_years, age_months_sc_0_2yr, six_month_period, ag...
lgl (5): served_active_duty_us, served_active_duty_foreign, born_usa, citize...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The first function that we want to iterate is n_distinct(), which comes from the dplyr package in the tidyverse.

Recall that to pull up the help page for a function, you can use the following syntax:

# pull up the help page for n_distinct
?n_distinct

n_distinct() counts the number of unique values in a vector.

# apply n_distinct to the "gender" column of demographics
n_distinct(demographics$gender)
[1] 2
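
One detail worth keeping in mind for columns that contain missing values: by default, n_distinct() counts NA as a distinct value of its own. The sketch below uses a small made-up logical vector (not part of the demographics data) to illustrate the default behavior and the na.rm argument.

```r
library(dplyr)

# a small illustrative vector (hypothetical, not from demographics)
x <- c(TRUE, FALSE, NA, TRUE)

# by default, NA is counted as one of the distinct values
n_with_na <- n_distinct(x)

# na.rm = TRUE drops missing values before counting
n_without_na <- n_distinct(x, na.rm = TRUE)

n_with_na      # 3
n_without_na   # 2
```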

The code below uses the map() function to apply n_distinct() to every column of the demographics data frame in a single call:

# use map() to apply n_distinct to every column of demographics
map(demographics, n_distinct)
$respondent_id
[1] 10175

$interview_examination
[1] 2

$gender
[1] 2

$age_years
[1] 81

$age_months_sc_0_2yr
[1] 26

$race
[1] 6

$six_month_period
[1] 3

$age_months_ex_0_19yr
[1] 241

$served_active_duty_us
[1] 3

$served_active_duty_foreign
[1] 3

$born_usa
[1] 3

$citizen_usa
[1] 3

$time_in_us
[1] 10

$education_youth
[1] 16

$education
[1] 6

$marital_status
[1] 7

$pregnant
[1] 4

$language_english
[1] 2

$household_income
[1] 13

The output of the map() function is a list (more on that in a moment).

The following code shows a simpler example, this time applying the exp() function to all entries/elements in a simple numeric vector:

# use map() to apply exp() to the vector c(4, 5, 6)
map(c(4, 5, 6), exp)
[[1]]
[1] 54.59815

[[2]]
[1] 148.4132

[[3]]
[1] 403.4288
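
One way to think about map() is as shorthand for a loop that applies the function to each element in turn and stores the results in a list. The sketch below writes out such a loop by hand for comparison; the object names loop_result and map_result are just illustrative.

```r
library(purrr)

x <- c(4, 5, 6)

# loop version: pre-allocate a list, then fill it one element at a time
loop_result <- vector("list", length(x))
for (i in seq_along(x)) {
  loop_result[[i]] <- exp(x[[i]])
}

# map() version: the same computation in one line
map_result <- map(x, exp)

# compare the two results
all.equal(loop_result, map_result)
```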

Lists

Lists, like vectors, correspond to a collection of values contained in a single object.

You can use the list() function to define a list, for example:

# define a list called my_list with three elements: 1, 4, and 7
my_list <- list(1, 4, 7)

You can extract elements from a list just as you would from a vector, using the square bracket notation. The code below extracts the third element from my_list:

# extract the third entry from my_list using []
my_list[3]
[[1]]
[1] 7
# ask the class of the object above
class(my_list[3])
[1] "list"

However, the output above is itself a list.

If you want to extract the actual object/value contained within the third element of the list, then you need to use double square brackets:

# extract the third entry from my_list using [[]]
my_list[[3]]
[1] 7
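
To make the difference between the two bracket types concrete, the short sketch below (which re-creates my_list so that it runs on its own) shows that single brackets can return several elements wrapped in a list, while double brackets return one element that you can compute with directly.

```r
my_list <- list(1, 4, 7)

# single brackets keep the list wrapper, so they can return several elements
first_two <- my_list[1:2]
class(first_two)     # "list"
length(first_two)    # 2

# double brackets return the element itself, so arithmetic works on it
third_value <- my_list[[3]]
class(third_value)   # "numeric"
third_value + 1      # 8
```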

Unlike vectors, lists are not vectorized: arithmetic operations such as + cannot be applied directly to a list.

# try to add 1 to my_list
my_list + 1
Error in my_list + 1: non-numeric argument to binary operator
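
If you do need to add 1 to every element of a list, two possible workarounds are sketched below. The first uses map() with a small function written inline (anonymous functions are covered later in this lesson); the second flattens the list with unlist(), which only makes sense when every element is a single number. The object names are illustrative.

```r
library(purrr)

my_list <- list(1, 4, 7)

# option 1: apply a small function to each element, keeping the list structure
plus_one_list <- map(my_list, function(x) x + 1)

# option 2: flatten to a numeric vector first, then use vectorized arithmetic
# (only sensible when every element is a single number)
plus_one_vector <- unlist(my_list) + 1

plus_one_list
plus_one_vector
```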

Why would we ever prefer lists to vectors? The reason is that lists are much more flexible than vectors. While every entry in a vector must be a single value of the same type, the entries of a list can be any object, including data frames, vectors, and even other lists.

# create a list my_complex_list containing 
# (1) the head of demographics, 
# (2) the value 2, and 
# (3) a vector containing "a" and "b"
my_complex_list <- list(head(demographics), 2, c("a", "b"))
my_complex_list
[[1]]
# A tibble: 6 × 19
  respondent_id interview_examination gender age_years age_months_sc_0_2yr race 
          <dbl> <chr>                 <chr>      <dbl>               <dbl> <chr>
1         73557 both interview and e… male          69                  NA black
2         73558 both interview and e… male          54                  NA white
3         73559 both interview and e… male          72                  NA white
4         73560 both interview and e… male           9                  NA white
5         73561 both interview and e… female        73                  NA white
6         73562 both interview and e… male          56                  NA mexi…
# ℹ 13 more variables: six_month_period <dbl>, age_months_ex_0_19yr <dbl>,
#   served_active_duty_us <lgl>, served_active_duty_foreign <lgl>,
#   born_usa <lgl>, citizen_usa <lgl>, time_in_us <dbl>, education_youth <dbl>,
#   education <dbl>, marital_status <chr>, pregnant <chr>,
#   language_english <lgl>, household_income <dbl>

[[2]]
[1] 2

[[3]]
[1] "a" "b"

You can also create a named list by naming the elements when you define it, just as you would name the arguments in a function call:

# create a named version of my_complex_list
my_complex_list <- list(data = head(demographics), 
                        value = 2, 
                        vector = c("a", "b"))
my_complex_list
$data
# A tibble: 6 × 19
  respondent_id interview_examination gender age_years age_months_sc_0_2yr race 
          <dbl> <chr>                 <chr>      <dbl>               <dbl> <chr>
1         73557 both interview and e… male          69                  NA black
2         73558 both interview and e… male          54                  NA white
3         73559 both interview and e… male          72                  NA white
4         73560 both interview and e… male           9                  NA white
5         73561 both interview and e… female        73                  NA white
6         73562 both interview and e… male          56                  NA mexi…
# ℹ 13 more variables: six_month_period <dbl>, age_months_ex_0_19yr <dbl>,
#   served_active_duty_us <lgl>, served_active_duty_foreign <lgl>,
#   born_usa <lgl>, citizen_usa <lgl>, time_in_us <dbl>, education_youth <dbl>,
#   education <dbl>, marital_status <chr>, pregnant <chr>,
#   language_english <lgl>, household_income <dbl>

$value
[1] 2

$vector
[1] "a" "b"

You can then extract entries from the named list using $ or [[]]:

# extract one of the elements from my_complex_list using $
my_complex_list$data
# A tibble: 6 × 19
  respondent_id interview_examination gender age_years age_months_sc_0_2yr race 
          <dbl> <chr>                 <chr>      <dbl>               <dbl> <chr>
1         73557 both interview and e… male          69                  NA black
2         73558 both interview and e… male          54                  NA white
3         73559 both interview and e… male          72                  NA white
4         73560 both interview and e… male           9                  NA white
5         73561 both interview and e… female        73                  NA white
6         73562 both interview and e… male          56                  NA mexi…
# ℹ 13 more variables: six_month_period <dbl>, age_months_ex_0_19yr <dbl>,
#   served_active_duty_us <lgl>, served_active_duty_foreign <lgl>,
#   born_usa <lgl>, citizen_usa <lgl>, time_in_us <dbl>, education_youth <dbl>,
#   education <dbl>, marital_status <chr>, pregnant <chr>,
#   language_english <lgl>, household_income <dbl>
# extract one of the elements from my_complex_list using [[]]
my_complex_list[["data"]]
# A tibble: 6 × 19
  respondent_id interview_examination gender age_years age_months_sc_0_2yr race 
          <dbl> <chr>                 <chr>      <dbl>               <dbl> <chr>
1         73557 both interview and e… male          69                  NA black
2         73558 both interview and e… male          54                  NA white
3         73559 both interview and e… male          72                  NA white
4         73560 both interview and e… male           9                  NA white
5         73561 both interview and e… female        73                  NA white
6         73562 both interview and e… male          56                  NA mexi…
# ℹ 13 more variables: six_month_period <dbl>, age_months_ex_0_19yr <dbl>,
#   served_active_duty_us <lgl>, served_active_duty_foreign <lgl>,
#   born_usa <lgl>, citizen_usa <lgl>, time_in_us <dbl>, education_youth <dbl>,
#   education <dbl>, marital_status <chr>, pregnant <chr>,
#   language_english <lgl>, household_income <dbl>
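
One practical difference between $ and [[]] is that [[]] also works when the element name is stored in a variable, whereas $ looks for an element literally called by the text you type. The sketch below uses a small stand-in list (a made-up data frame in place of head(demographics)) so that it runs without the lesson data.

```r
# a stand-in for my_complex_list that does not need the demographics file
my_complex_list <- list(data = data.frame(id = 1:3),
                        value = 2,
                        vector = c("a", "b"))

# list the element names
names(my_complex_list)

# store the name of the element we want in a variable
element_name <- "vector"

# [[]] evaluates the variable and returns the "vector" element
by_brackets <- my_complex_list[[element_name]]

# $ looks for an element literally named "element_name", which does not exist
by_dollar <- my_complex_list$element_name

by_brackets   # "a" "b"
by_dollar     # NULL
```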

Exercise

Use map() to apply class() to every column in the demographics dataset, and extract the class of the household_income column.

Solution

demographics_class <- map(demographics, class)
demographics_class$household_income
[1] "numeric"
demographics_class[[ncol(demographics)]]
[1] "numeric"
demographics_class[["household_income"]]
[1] "numeric"

Using custom functions in purrr

To iterate with your own custom function, define the function and provide it as the second argument of map().

# define a function called exp_plus_one() that returns exp(x) + 1
exp_plus_one <- function(x) {
  return(exp(x) + 1)
}

# apply it to every entry in the vector c(1, 4, 5)
map(c(1, 4, 5), exp_plus_one)
[[1]]
[1] 3.718282

[[2]]
[1] 55.59815

[[3]]
[1] 149.4132

For simple functions like this, we can define the function inside the map() function itself:

# apply the function exp(x) + 1 to every entry in the vector c(1, 4, 5)
# using an "anonymous" function
map(c(1, 4, 5), function(x) exp(x) + 1)
[[1]]
[1] 3.718282

[[2]]
[1] 55.59815

[[3]]
[1] 149.4132

However, we can go one step further and forgo the function(x) part entirely using what I call the “tilde-dot” shorthand syntax.

Here, we use the tilde ~ symbol to “start” an anonymous function and we use a . to represent the argument of our anonymous function.

# apply the function exp(x) + 1 to every entry in the vector c(1, 4, 5)
# using the "tilde-dot" syntax
map(c(1, 4, 5), ~{exp(.) + 1})
[[1]]
[1] 3.718282

[[2]]
[1] 55.59815

[[3]]
[1] 149.4132
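
For reference, purrr also accepts .x in place of . inside the tilde shorthand, and R 4.1.0 and later provide the base \(x) shorthand for anonymous functions. The sketch below, assuming a recent R and purrr, checks that all three spellings give the same result.

```r
library(purrr)

x <- c(1, 4, 5)

# tilde-dot shorthand, as used in this lesson
res_dot <- map(x, ~ exp(.) + 1)

# the same shorthand with .x as the placeholder
res_dot_x <- map(x, ~ exp(.x) + 1)

# base R lambda shorthand (requires R >= 4.1.0)
res_lambda <- map(x, \(x) exp(x) + 1)

# compare the three results
all.equal(res_dot, res_dot_x)
all.equal(res_dot, res_lambda)
```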

To decide what should go inside the ~{}, I typically test my code on a representative value first.

For example, suppose we want to count the number of missing values in each column of demographics.

Below, I take a single column, demographics$pregnant, and write out the code that I want to apply to it:

# compute the number of missing values in the "pregnant" column of demographics
sum(is.na(demographics$pregnant))
[1] 8866

Generalize this code for the map function:

# use a map function to compute the number of missing values in every column
map(demographics, ~{sum(is.na(.))})
$respondent_id
[1] 0

$interview_examination
[1] 0

$gender
[1] 0

$age_years
[1] 0

$age_months_sc_0_2yr
[1] 9502

$race
[1] 0

$six_month_period
[1] 362

$age_months_ex_0_19yr
[1] 5962

$served_active_duty_us
[1] 3915

$served_active_duty_foreign
[1] 9633

$born_usa
[1] 5

$citizen_usa
[1] 11

$time_in_us
[1] 8353

$education_youth
[1] 7373

$education
[1] 4413

$marital_status
[1] 4409

$pregnant
[1] 8866

$language_english
[1] 0

$household_income
[1] 783

Exercise

Use the tilde-dot shorthand syntax to compute the number of values in each column that are equal to 1. Recall that if a vector has missing values and you want to use sum(), you will need to provide the argument na.rm = TRUE to ignore them.

Solution

First, the code below shows the long-form version, which defines a function and then provides that function as the second argument of map():

add_ones <- function(x) {
  sum(x == 1, na.rm = TRUE)
}

map(demographics, add_ones)
$respondent_id
[1] 0

$interview_examination
[1] 0

$gender
[1] 0

$age_years
[1] 262

$age_months_sc_0_2yr
[1] 42

$race
[1] 0

$six_month_period
[1] 4823

$age_months_ex_0_19yr
[1] 30

$served_active_duty_us
[1] 543

$served_active_duty_foreign
[1] 282

$born_usa
[1] 8262

$citizen_usa
[1] 9220

$time_in_us
[1] 0

$education_youth
[1] 237

$education
[1] 455

$marital_status
[1] 0

$pregnant
[1] 0

$language_english
[1] 9100

$household_income
[1] 0

Then I make this more concise by taking the body of my add_ones() function above, placing it inside ~{}, and replacing the x argument with a .:

map(demographics, ~{sum(. == 1, na.rm = TRUE)})
$respondent_id
[1] 0

$interview_examination
[1] 0

$gender
[1] 0

$age_years
[1] 262

$age_months_sc_0_2yr
[1] 42

$race
[1] 0

$six_month_period
[1] 4823

$age_months_ex_0_19yr
[1] 30

$served_active_duty_us
[1] 543

$served_active_duty_foreign
[1] 282

$born_usa
[1] 8262

$citizen_usa
[1] 9220

$time_in_us
[1] 0

$education_youth
[1] 237

$education
[1] 455

$marital_status
[1] 0

$pregnant
[1] 0

$language_english
[1] 9100

$household_income
[1] 0

Alternative map functions for outputting doubles, characters and data frames

Recall the map() call that we used previously to count the number of missing values in each column of the demographics data:

map(demographics, ~sum(is.na(.)))
$respondent_id
[1] 0

$interview_examination
[1] 0

$gender
[1] 0

$age_years
[1] 0

$age_months_sc_0_2yr
[1] 9502

$race
[1] 0

$six_month_period
[1] 362

$age_months_ex_0_19yr
[1] 5962

$served_active_duty_us
[1] 3915

$served_active_duty_foreign
[1] 9633

$born_usa
[1] 5

$citizen_usa
[1] 11

$time_in_us
[1] 8353

$education_youth
[1] 7373

$education
[1] 4413

$marital_status
[1] 4409

$pregnant
[1] 8866

$language_english
[1] 0

$household_income
[1] 783

A list may not be the most useful format for this information…

Outputting numeric vectors

The map_dbl() function will output a “double” (numeric) vector.

# use map_dbl to count the number of missing values in each column and
# output a numeric vector
map_dbl(demographics, ~sum(is.na(.)))
             respondent_id      interview_examination 
                         0                          0 
                    gender                  age_years 
                         0                          0 
       age_months_sc_0_2yr                       race 
                      9502                          0 
          six_month_period       age_months_ex_0_19yr 
                       362                       5962 
     served_active_duty_us served_active_duty_foreign 
                      3915                       9633 
                  born_usa                citizen_usa 
                         5                         11 
                time_in_us            education_youth 
                      8353                       7373 
                 education             marital_status 
                      4413                       4409 
                  pregnant           language_english 
                      8866                          0 
          household_income 
                       783 
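
purrr has similar variants for other output types, such as map_int() for integer vectors and map_lgl() for logical vectors. The sketch below applies them to a small made-up data frame (toy is a hypothetical stand-in for demographics). Because sum() of a logical vector returns an integer, map_int() works for counting missing values.

```r
library(purrr)

# a small stand-in data frame with some missing values
toy <- data.frame(a = c(1, 2, 3),
                  b = c("x", "y", NA),
                  c = c(NA, NA, TRUE))

# count missing values per column, returning an integer vector
n_missing_int <- map_int(toy, ~ sum(is.na(.)))

# flag which columns contain any missing values, returning a logical vector
any_missing <- map_lgl(toy, ~ any(is.na(.)))

n_missing_int
any_missing
```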

Outputting character vectors

The map_chr() function can output a character vector.

# use map_chr to apply class to every column and output a character vector
map_chr(demographics, class)
             respondent_id      interview_examination 
                 "numeric"                "character" 
                    gender                  age_years 
               "character"                  "numeric" 
       age_months_sc_0_2yr                       race 
                 "numeric"                "character" 
          six_month_period       age_months_ex_0_19yr 
                 "numeric"                  "numeric" 
     served_active_duty_us served_active_duty_foreign 
                 "logical"                  "logical" 
                  born_usa                citizen_usa 
                 "logical"                  "logical" 
                time_in_us            education_youth 
                 "numeric"                  "numeric" 
                 education             marital_status 
                 "numeric"                "character" 
                  pregnant           language_english 
               "character"                  "logical" 
          household_income 
                 "numeric" 

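One caveat: map_chr() expects the function to return exactly one value per element. Every column of demographics has a single class, but some objects, such as date-times, have more than one, in which case map_chr(..., class) would be expected to fail. A possible workaround, sketched below on a made-up data frame (toy is hypothetical), is to paste the classes into a single string.

```r
library(purrr)

# a small stand-in data frame with a date-time column
toy <- data.frame(
  id = 1:2,
  when = as.POSIXct(c("2024-01-01", "2024-01-02"), tz = "UTC")
)

# date-time columns have two classes
class(toy$when)   # "POSIXct" "POSIXt"

# collapse the classes of each column into one string per column
col_classes <- map_chr(toy, ~ paste(class(.), collapse = "/"))

col_classes
```
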
Outputting data frames

One of the most versatile functions is the map_df() function, which outputs a data frame.

# apply map_df to demographics to determine the class of each column, outputting a "wide" data frame
map_df(demographics, class)
# A tibble: 1 × 19
  respondent_id interview_examination gender age_years age_months_sc_0_2yr race 
  <chr>         <chr>                 <chr>  <chr>     <chr>               <chr>
1 numeric       character             chara… numeric   numeric             char…
# ℹ 13 more variables: six_month_period <chr>, age_months_ex_0_19yr <chr>,
#   served_active_duty_us <chr>, served_active_duty_foreign <chr>,
#   born_usa <chr>, citizen_usa <chr>, time_in_us <chr>, education_youth <chr>,
#   education <chr>, marital_status <chr>, pregnant <chr>,
#   language_english <chr>, household_income <chr>

If you want your output to be in a “long” format, the function you are applying should itself return a tibble/data frame (here, one with a single column), and map_df() then stacks these on top of one another.

# use tibble() to create a single-column tibble containing the class of demographics$pregnant
tibble(col_class = class(demographics$pregnant))
# A tibble: 1 × 1
  col_class
  <chr>    
1 character
# modify this code to be the function call in map_df to create a 
# long-form data frame
map_df(demographics, ~tibble(col_class = class(.)))
# A tibble: 19 × 1
   col_class
   <chr>    
 1 numeric  
 2 character
 3 character
 4 numeric  
 5 numeric  
 6 character
 7 numeric  
 8 numeric  
 9 logical  
10 logical  
11 logical  
12 logical  
13 numeric  
14 numeric  
15 numeric  
16 character
17 character
18 logical  
19 numeric  
# provide an .id argument to include the original column names as a variable
map_df(demographics, ~tibble(col_class = class(.)), .id = "variable_name")
# A tibble: 19 × 2
   variable_name              col_class
   <chr>                      <chr>    
 1 respondent_id              numeric  
 2 interview_examination      character
 3 gender                     character
 4 age_years                  numeric  
 5 age_months_sc_0_2yr        numeric  
 6 race                       character
 7 six_month_period           numeric  
 8 age_months_ex_0_19yr       numeric  
 9 served_active_duty_us      logical  
10 served_active_duty_foreign logical  
11 born_usa                   logical  
12 citizen_usa                logical  
13 time_in_us                 numeric  
14 education_youth            numeric  
15 education                  numeric  
16 marital_status             character
17 pregnant                   character
18 language_english           logical  
19 household_income           numeric  
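
Note that as of purrr 1.0.0, map_df() is marked as superseded: it still works, but the recommended replacement is map() followed by list_rbind(). The sketch below shows what the equivalent call might look like on a small made-up tibble (toy is a hypothetical stand-in for demographics), with names_to playing the role of .id.

```r
library(purrr)
library(tibble)

# a small stand-in tibble with three column types
toy <- tibble(a = c(1, 2),
              b = c("x", "y"),
              c = c(TRUE, NA))

# map() returns a named list of one-row tibbles;
# list_rbind() stacks them and stores the list names in "variable_name"
class_long <- map(toy, ~ tibble(col_class = class(.))) |>
  list_rbind(names_to = "variable_name")

class_long
```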

Example

This long format is very useful if you want to create a plot, such as a bar chart of the number of missing values.

# create a bar chart of the number of missing values in each column
# use factors to order the columns by the number of missing values
map_df(demographics, 
       ~tibble(n_missing = sum(is.na(.))), 
       .id = "variable_name") |>
  arrange(n_missing) |>
  mutate(variable_name = fct_inorder(variable_name)) |>
  ggplot() +
  geom_col(aes(x = variable_name, y = n_missing)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, 
                                   hjust = 1,
                                   vjust = 0.5))

An alternative approach is to use map_dbl() and enframe():

# use map_dbl() and enframe() to create a long data frame of the 
# number of missing values in each column
map_dbl(demographics, ~sum(is.na(.))) |>
  enframe()
# A tibble: 19 × 2
   name                       value
   <chr>                      <dbl>
 1 respondent_id                  0
 2 interview_examination          0
 3 gender                         0
 4 age_years                      0
 5 age_months_sc_0_2yr         9502
 6 race                           0
 7 six_month_period             362
 8 age_months_ex_0_19yr        5962
 9 served_active_duty_us       3915
10 served_active_duty_foreign  9633
11 born_usa                       5
12 citizen_usa                   11
13 time_in_us                  8353
14 education_youth             7373
15 education                   4413
16 marital_status              4409
17 pregnant                    8866
18 language_english               0
19 household_income             783
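
By default, enframe() calls its two columns name and value. If you would prefer more descriptive column names, enframe() has name and value arguments for that purpose. The sketch below uses a small made-up tibble (toy is a hypothetical stand-in for demographics) and the same column names as the map_df() example.

```r
library(purrr)
library(tibble)

# a small stand-in tibble with some missing values
toy <- tibble(a = c(1, NA),
              b = c(NA, NA),
              c = c("x", "y"))

# count missing values per column, then convert the named vector to a
# two-column tibble with custom column names
missing_long <- map_dbl(toy, ~ sum(is.na(.))) |>
  enframe(name = "variable_name", value = "n_missing")

missing_long
```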