# load the tidyverse and demographics dataset
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
demographics <- read_csv("data/demographics.csv")
Rows: 10175 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): interview_examination, gender, race, marital_status, pregnant
dbl (9): respondent_id, age_years, age_months_sc_0_2yr, six_month_period, ag...
lgl (5): served_active_duty_us, served_active_duty_foreign, born_usa, citize...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(demographics)
# A tibble: 6 × 19
respondent_id interview_examination gender age_years age_months_sc_0_2yr race
<dbl> <chr> <chr> <dbl> <dbl> <chr>
1 73557 both interview and e… male 69 NA black
2 73558 both interview and e… male 54 NA white
3 73559 both interview and e… male 72 NA white
4 73560 both interview and e… male 9 NA white
5 73561 both interview and e… female 73 NA white
6 73562 both interview and e… male 56 NA mexi…
# ℹ 13 more variables: six_month_period <dbl>, age_months_ex_0_19yr <dbl>,
# served_active_duty_us <lgl>, served_active_duty_foreign <lgl>,
# born_usa <lgl>, citizen_usa <lgl>, time_in_us <dbl>, education_youth <dbl>,
# education <dbl>, marital_status <chr>, pregnant <chr>,
# language_english <lgl>, household_income <dbl>
In this document, you will learn how to “recode” variables by replacing values in a variable with other values of your choice, e.g., converting numerically-encoded variables to categorical/character versions, and vice versa.
We will introduce two functions for doing this:
if_else(), which creates binary variables (variables with just two distinct values) based on a logical “condition”.
case_when(), which creates variables with many different values.
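As a quick illustration of the difference between the two, here is a minimal sketch on a small made-up vector. The name `ages`, the cut-offs, and the labels are hypothetical and are not taken from the demographics data.

```r
library(dplyr)

# hypothetical toy values, not taken from the demographics data
ages <- c(4, 17, 35, 70)

# if_else(): one condition, two possible output values
binary <- if_else(ages >= 18, "adult", "child")

# case_when(): several conditions, checked in order, many possible values
multi <- case_when(
  ages < 13 ~ "child",
  ages < 18 ~ "teen",
  ages < 65 ~ "adult",
  TRUE ~ "senior"  # catch-all for everything not matched above
)

binary
multi
```

Each element is assigned the value of the first condition it satisfies, so the order of the conditions in case_when() matters.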
if_else
# apply count() to interview_examination
demographics |>
  count(interview_examination)
# A tibble: 2 × 2
interview_examination n
<chr> <int>
1 both interview and exam 9813
2 interview only 362
Let’s convert interview_examination to a numeric format using if_else().
# use mutate() to create a new column called interview_examination_numeric
# which is 2 if interview_examination is "both interview and exam"
# and 1 otherwise
# then select just interview_examination and interview_examination_numeric
demographics |>
  mutate(
    interview_examination_numeric = if_else(
      interview_examination == "both interview and exam",
      true = 2,
      false = 1
    )
  ) |>
  select(interview_examination, interview_examination_numeric)
# A tibble: 10,175 × 2
interview_examination interview_examination_numeric
<chr> <dbl>
1 both interview and exam 2
2 both interview and exam 2
3 both interview and exam 2
4 both interview and exam 2
5 both interview and exam 2
6 both interview and exam 2
7 both interview and exam 2
8 both interview and exam 2
9 interview only 1
10 both interview and exam 2
# ℹ 10,165 more rows
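The same approach works in the other direction. As a minimal sketch, assuming a small made-up tibble (`toy` is hypothetical and only stands in for the recoded data), the numeric codes can be turned back into character labels:

```r
library(dplyr)

# hypothetical stand-in for the recoded data:
# 2 = both interview and exam, 1 = interview only
toy <- tibble(interview_examination_numeric = c(2, 2, 1))

relabelled <- toy |>
  mutate(
    interview_examination = if_else(
      interview_examination_numeric == 2,
      true = "both interview and exam",
      false = "interview only"
    )
  )

relabelled
```

Here the condition is a comparison on a numeric column, and the true and false values are character strings rather than numbers.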
Exercise
Use if_else() to create a variable called completed_high_school, which has the value "yes" if education is at least 3, and "no" otherwise.
A link to the documentation for the demographics data can be found here.
Solution
Our condition is education >= 3 and we want to replace all values for which this condition is true with "yes", and all other values with "no":
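One way to write this is sketched below on a small made-up tibble so that it runs on its own. The tibble `toy` and its values are hypothetical; with the real data, `demographics` would take its place.

```r
library(dplyr)

# hypothetical stand-in for the demographics data
toy <- tibble(education = c(1, 2, 3, 5, NA))

solution <- toy |>
  mutate(
    completed_high_school = if_else(education >= 3, true = "yes", false = "no")
  )

solution
```

Rows where education is missing get a missing completed_high_school as well, because the condition itself evaluates to NA for them.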
Note that born_usa ~ "yes" is equivalent to born_usa == TRUE ~ "yes", but the == TRUE is redundant because born_usa is itself a logical variable. Similarly, !citizen_usa ~ "no" is equivalent to citizen_usa == FALSE ~ "no".
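The equivalence can be checked on a small example. This is a minimal sketch, assuming a made-up tibble `toy` and hypothetical output column names; it is not the original code from this document.

```r
library(dplyr)

# hypothetical stand-in for the demographics data
toy <- tibble(born_usa = c(TRUE, FALSE, NA))

recoded <- toy |>
  mutate(
    # shorthand: the logical column is used directly as the condition
    born_usa_short = case_when(
      born_usa ~ "yes",
      !born_usa ~ "no"
    ),
    # explicit comparison: same result, but the == TRUE / == FALSE is redundant
    born_usa_long = case_when(
      born_usa == TRUE ~ "yes",
      born_usa == FALSE ~ "no"
    )
  )

# both versions give the same recoded values
identical(recoded$born_usa_short, recoded$born_usa_long)
```

In both versions a missing born_usa matches neither condition, so the recoded value stays missing.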
To use across(), select the columns with a helper such as ends_with("_usa") or contains("usa"), put a ~ before case_when() to start an anonymous function, and replace the column name with . inside it. Again, you don’t need the == TRUE in the condition for this particular example, but only because the columns we’re applying case_when() to are logical.
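A minimal sketch of the across() version, again on a made-up tibble (`toy` and its values are hypothetical, and the "yes"/"no" labels are assumed from the discussion above):

```r
library(dplyr)

# hypothetical stand-in for the demographics data
toy <- tibble(
  respondent_id = c(1, 2, 3),
  born_usa      = c(TRUE, FALSE, NA),
  citizen_usa   = c(TRUE, FALSE, TRUE)
)

recoded_across <- toy |>
  mutate(
    # apply the same case_when() to every column whose name ends in "_usa";
    # inside the anonymous function, . stands for the current column
    across(
      ends_with("_usa"),
      ~ case_when(
        . ~ "yes",
        !. ~ "no"
      )
    )
  )

recoded_across
```

Because no new names are supplied, across() overwrites the selected logical columns with their character versions and leaves the other columns untouched.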