Recoding variables using if_else and case_when

# load the tidyverse and demographics dataset
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

demographics <- read_csv("data/demographics.csv")

Rows: 10175 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): interview_examination, gender, race, marital_status, pregnant
dbl (9): respondent_id, age_years, age_months_sc_0_2yr, six_month_period, ag...
lgl (5): served_active_duty_us, served_active_duty_foreign, born_usa, citize...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(demographics)

# A tibble: 6 × 19
  respondent_id interview_examination gender age_years age_months_sc_0_2yr race 
          <dbl> <chr>                 <chr>      <dbl>               <dbl> <chr>
1         73557 both interview and e… male          69                  NA black
2         73558 both interview and e… male          54                  NA white
3         73559 both interview and e… male          72                  NA white
4         73560 both interview and e… male           9                  NA white
5         73561 both interview and e… female        73                  NA white
6         73562 both interview and e… male          56                  NA mexi…
# ℹ 13 more variables: six_month_period <dbl>, age_months_ex_0_19yr <dbl>,
#   served_active_duty_us <lgl>, served_active_duty_foreign <lgl>,
#   born_usa <lgl>, citizen_usa <lgl>, time_in_us <dbl>, education_youth <dbl>,
#   education <dbl>, marital_status <chr>, pregnant <chr>,
#   language_english <lgl>, household_income <dbl>

In this document, you will learn how to “recode” variables by replacing values in a variable with other values of your choice, e.g., converting numerically-encoded variables to categorical/character versions, and vice versa.

We will introduce two functions for doing this:

if_else(), which can be used to create binary variables (variables with just two distinct values) based on a logical “condition”.
case_when(), which can be used to create variables with many different values based on several conditions.
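As a quick preview, here is a minimal sketch of both functions applied to a small made-up vector. The `ages` values and the category labels are illustrative assumptions and are not taken from the demographics data.

```r
library(dplyr)

# hypothetical ages, not from the demographics data
ages <- c(4, 17, 35, 70)

# if_else(): one condition, two possible values
is_adult <- if_else(ages >= 18, "adult", "minor")

# case_when(): several conditions, checked in order from top to bottom
age_group <- case_when(
  ages < 13 ~ "child",
  ages < 18 ~ "teen",
  ages < 65 ~ "adult",
  ages >= 65 ~ "senior"
)

is_adult
age_group
```

Because case_when() uses the first condition that is true for each value, the order of the conditions matters: 4 is also less than 18, but it is labelled "child" because that condition comes first.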

if_else

# apply count() to interview_examination
demographics |> count(interview_examination)

# A tibble: 2 × 2
  interview_examination       n
  <chr>                   <int>
1 both interview and exam  9813
2 interview only            362

The interview_examination column has two distinct character values. Let’s convert it to a numeric format using if_else().

# use mutate() to create a new column called interview_examination_numeric
# which is 2 if interview_examination is "both interview and exam"
# and 1 otherwise
# then select just interview_examination and interview_examination_numeric
demographics |> 
  mutate(interview_examination_numeric = if_else(
    interview_examination == "both interview and exam", 
    true = 2, 
    false = 1)
  ) |>
  select(interview_examination, interview_examination_numeric)

# A tibble: 10,175 × 2
   interview_examination   interview_examination_numeric
   <chr>                                           <dbl>
 1 both interview and exam                             2
 2 both interview and exam                             2
 3 both interview and exam                             2
 4 both interview and exam                             2
 5 both interview and exam                             2
 6 both interview and exam                             2
 7 both interview and exam                             2
 8 both interview and exam                             2
 9 interview only                                      1
10 both interview and exam                             2
# ℹ 10,165 more rows
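One thing worth knowing when choosing the replacement values: if_else() expects the true and false values to have compatible types, whereas base R's ifelse() silently coerces them. The sketch below illustrates this on a made-up two-element vector (`x` and the replacement value "one" are assumptions for illustration only).

```r
library(dplyr)

# hypothetical values mirroring the two categories above
x <- c("both interview and exam", "interview only")
cond <- x == "both interview and exam"

# both replacement values are numeric: this works
same_type <- if_else(cond, 2, 1)

# a number and a string: if_else() raises an error instead of coercing
mixed <- tryCatch(
  if_else(cond, 2, "one"),
  error = function(e) "error"
)

# base ifelse() coerces everything to character without warning
base_result <- ifelse(cond, 2, "one")

same_type
mixed
base_result
```

This strictness can help catch recoding mistakes early, since a new column will not quietly end up as character when you intended it to be numeric.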

Exercise

Use if_else() to create a variable called completed_high_school, which has the value "yes" if education is at least 3, and "no" otherwise.

A link to the documentation for the demographics data can be found here.

Solution

Our condition is education >= 3 and we want to replace all values for which this condition is true with "yes", and all other values with "no":

demographics |> 
  mutate(completed_high_school = if_else(
    education >= 3, 
    true = "yes", 
    false = "no")) |>
  select(education, completed_high_school) |>
  sample_n(10)

# A tibble: 10 × 2
   education completed_high_school
       <dbl> <chr>                
 1        NA <NA>                 
 2        NA <NA>                 
 3         4 yes                  
 4         5 yes                  
 5         4 yes                  
 6         3 yes                  
 7         4 yes                  
 8         2 no                   
 9        NA <NA>                 
10         5 yes
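Notice that rows where education is NA end up as NA rather than "no": the condition education >= 3 cannot be evaluated for a missing value, so if_else() returns a missing value too. If you would rather give those rows an explicit label, if_else() has a `missing` argument. The following is a small sketch on a made-up vector (the values and the "unknown" label are assumptions for illustration).

```r
library(dplyr)

# hypothetical education codes, including a missing value
education <- c(NA, 4, 2, 5)

# default behaviour: NA in, NA out
default_na <- if_else(education >= 3, "yes", "no")

# missing = ... supplies a replacement for rows where the condition is NA
labelled_na <- if_else(education >= 3, "yes", "no", missing = "unknown")

default_na
labelled_na
```

Whether to keep the NA or replace it depends on your analysis; keeping it is often the safer choice, since it preserves the fact that the information was not recorded.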

case_when

The case_when() function is similar to if_else(), but it allows you to have an unlimited number of conditions and corresponding recoded values.

In this section we will work with the marital_status column of the demographics data:

# apply count() to the marital_status column
demographics |> count(marital_status)

# A tibble: 7 × 2
  marital_status          n
  <chr>               <int>
1 divorced              659
2 living_with_partner   417
3 married              2965
4 never_married        1112
5 separated             177
6 widowed               436
7 <NA>                 4409

Let’s create a new variable based on marital_status that replaces:

"married" with 3
"living_with_partner" with 2
"divorced", "widowed", "never_married", and "separated" with 1

# use mutate and case_when() to create marital_status_numeric 
# with the values above
demographics |> mutate(marital_status_numeric = case_when(
  marital_status == "married" ~ 3,
  marital_status == "living_with_partner" ~ 2,
  marital_status %in% c("divorced", "widowed", "never_married", "separated") ~ 1)) |>
  select(marital_status, marital_status_numeric)

# A tibble: 10,175 × 2
   marital_status marital_status_numeric
   <chr>                           <dbl>
 1 separated                           1
 2 married                             3
 3 married                             3
 4 <NA>                               NA
 5 married                             3
 6 divorced                            1
 7 <NA>                               NA
 8 widowed                             1
 9 married                             3
10 divorced                            1
# ℹ 10,165 more rows
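Values that match none of the conditions, such as the NA rows above, are set to NA. Recent versions of dplyr (1.1.0 and later) also provide a `.default` argument for a catch-all value. The sketch below, on a made-up vector, shows one thing to watch for: `.default` applies to missing values as well, unless you handle them first.

```r
library(dplyr)

# hypothetical marital status values, including a missing one
status <- c("married", "divorced", NA, "living_with_partner")

# .default is used for every value that matches no condition,
# which includes the NA
with_default <- case_when(
  status == "married" ~ 3,
  status == "living_with_partner" ~ 2,
  .default = 1
)

# handling NA in the first condition keeps it missing
keep_na <- case_when(
  is.na(status) ~ NA,
  status == "married" ~ 3,
  status == "living_with_partner" ~ 2,
  .default = 1
)

with_default
keep_na
```

Listing the remaining categories explicitly with %in%, as in the example above, avoids this issue entirely and makes it obvious which values map to 1.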

Exercise

Use case_when() to convert born_usa and citizen_usa to "yes" and "no". Bonus points: use across().

Solution

To do this without across(), you can write:

demographics |> mutate(born_usa_chr = case_when(born_usa ~ "yes",
                                                !born_usa ~ "no"),
                       citizen_usa_chr = case_when(citizen_usa ~ "yes",
                                                   !citizen_usa ~ "no")) |>
  select(born_usa, born_usa_chr, citizen_usa, citizen_usa_chr) |> 
  sample_n(10)

# A tibble: 10 × 4
   born_usa born_usa_chr citizen_usa citizen_usa_chr
   <lgl>    <chr>        <lgl>       <chr>          
 1 TRUE     yes          TRUE        yes            
 2 TRUE     yes          TRUE        yes            
 3 TRUE     yes          TRUE        yes            
 4 TRUE     yes          TRUE        yes            
 5 FALSE    no           FALSE       no             
 6 FALSE    no           TRUE        yes            
 7 TRUE     yes          TRUE        yes            
 8 TRUE     yes          TRUE        yes            
 9 FALSE    no           TRUE        yes            
10 TRUE     yes          TRUE        yes

Note that born_usa ~ "yes" is equivalent to born_usa == TRUE ~ "yes", but the == TRUE is redundant because born_usa is itself a logical variable. Similarly, !citizen_usa ~ "no" is equivalent to citizen_usa == FALSE ~ "no".
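You can check this equivalence yourself on a small made-up logical vector (the values below are illustrative, not drawn from the data):

```r
library(dplyr)

# hypothetical logical values, including a missing one
born_usa <- c(TRUE, FALSE, NA)

# using the logical column directly as the condition
short_form <- case_when(
  born_usa ~ "yes",
  !born_usa ~ "no"
)

# spelling out the comparison
long_form <- case_when(
  born_usa == TRUE ~ "yes",
  born_usa == FALSE ~ "no"
)

identical(short_form, long_form)
```

In both versions the NA matches neither condition, so it stays NA in the result.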

To use across(), select the columns with a helper such as ends_with("_usa") or contains("usa"). Place a ~ before case_when() to start an anonymous function, and replace the column name with . inside it. The .names argument adds a _chr suffix to the new columns so that the originals are not overwritten. (As before, the == TRUE and == FALSE in the conditions are not strictly needed here, but that’s just because the columns we’re applying case_when() to are logical.)

demographics |> mutate(across(ends_with("_usa"), 
                              ~case_when(. == TRUE ~ "yes",
                                         . == FALSE ~ "no"),
                              .names = "{.col}_chr")) |>
  select(contains("_usa")) |> 
  sample_n(10)

# A tibble: 10 × 4
   born_usa citizen_usa born_usa_chr citizen_usa_chr
   <lgl>    <lgl>       <chr>        <chr>          
 1 TRUE     TRUE        yes          yes            
 2 TRUE     TRUE        yes          yes            
 3 FALSE    TRUE        no           yes            
 4 TRUE     TRUE        yes          yes            
 5 TRUE     TRUE        yes          yes            
 6 TRUE     TRUE        yes          yes            
 7 TRUE     TRUE        yes          yes            
 8 TRUE     TRUE        yes          yes            
 9 TRUE     TRUE        yes          yes            
10 TRUE     TRUE        yes          yes
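As an alternative sketch, the same recoding can be written with R's \(x) anonymous function syntax (available in R 4.1 and later, like the |> pipe) and, since there are only two outcomes, with if_else() instead of case_when(). The `toy` tibble below is a made-up stand-in for the demographics data.

```r
library(dplyr)

# hypothetical stand-in for the two logical columns
toy <- tibble(
  born_usa = c(TRUE, FALSE, NA),
  citizen_usa = c(TRUE, TRUE, FALSE)
)

toy_chr <- toy |>
  mutate(across(
    ends_with("_usa"),
    # x stands for each selected column in turn
    \(x) if_else(x, "yes", "no"),
    .names = "{.col}_chr"
  ))

toy_chr
```

Naming the argument x rather than . can make the function easier to read, but both forms work with across().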