Reshaping (Pivoting) Data

In this lesson, we will be working with a new NHANES table that contains urine albumin and creatinine lab data.

# load the tidyverse and the urine_albumin_creatinine NHANES data
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

labs_data <- read.csv("data/urine_albumin_creatinine.csv")

# take a look at the first 10 rows
head(labs_data, 10)

   respondent_id                         lab measurement
1          73557            ur_albumin_ug_ml        4.30
2          73557             ur_albumin_mg_l        4.30
3          73557         ur_creatinine_mg_dl       39.00
4          73557        ur_creatinine_umol_l     3447.60
5          73557 ur_albumin_creatinine_ratio       11.03
6          73558            ur_albumin_ug_ml      153.00
7          73558             ur_albumin_mg_l      153.00
8          73558         ur_creatinine_mg_dl       50.00
9          73558        ur_creatinine_umol_l     4420.00
10         73558 ur_albumin_creatinine_ratio      306.00

This data is in a long format–there are multiple rows for each observational unit, each corresponding to a different variable being measured (in this case, the different labs).

A wide format, on the other hand, will only have one row for each observational unit (respondent), and each of the individual variables being measured (the labs) will have their own column, each containing the values for the corresponding lab.

You can convert from a long format to a wide format using the pivor_wider() function.

The code below creates a wide-form version of the labs data, and saves it as a new data frame called labs_data_wide:

# use pivot_wider() to convert the labs data from a 
# long format to a wide format. 
# arguments are `names_from` and `values_from`
# save the wide data as `labs_data_wide`.
labs_data_wide <- labs_data |> 
  pivot_wider(names_from = "lab", values_from = "measurement")
labs_data_wide

# A tibble: 8,291 × 6
   respondent_id ur_albumin_ug_ml ur_albumin_mg_l ur_creatinine_mg_dl
           <int>            <dbl>           <dbl>               <dbl>
 1         73557              4.3             4.3                  39
 2         73558            153             153                    50
 3         73559             11.9            11.9                 113
 4         73560             16              16                    76
 5         73561            255             255                   147
 6         73562            123             123                    74
 7         73564             19              19                   242
 8         73566              1.3             1.3                  18
 9         73567             35              35                   215
10         73568             25              25                    31
# ℹ 8,281 more rows
# ℹ 2 more variables: ur_creatinine_umol_l <dbl>,
#   ur_albumin_creatinine_ratio <dbl>

If you want to convert from a wide format dataset into a longer-format dataset, you can instead use the pivot_longer() function.

# use pivot_longer() to convert labs_data_wide back to along format
# arguments are `cols` (use a select helper!), `names_to` and `values_to`
labs_data_wide |> 
  pivot_longer(cols = starts_with("ur_"), 
               names_to = "lab",
               values_to = "measurement")

# A tibble: 41,455 × 3
   respondent_id lab                         measurement
           <int> <chr>                             <dbl>
 1         73557 ur_albumin_ug_ml                    4.3
 2         73557 ur_albumin_mg_l                     4.3
 3         73557 ur_creatinine_mg_dl                39  
 4         73557 ur_creatinine_umol_l             3448. 
 5         73557 ur_albumin_creatinine_ratio        11.0
 6         73558 ur_albumin_ug_ml                  153  
 7         73558 ur_albumin_mg_l                   153  
 8         73558 ur_creatinine_mg_dl                50  
 9         73558 ur_creatinine_umol_l             4420  
10         73558 ur_albumin_creatinine_ratio       306  
# ℹ 41,445 more rows

Use-case for long-format: creating a visualization that compares the different variables using ggplot2.

# create boxplots using the long format for the measurements from each lab
# maybe add a log-scale to the y-axis
labs_data |> ggplot() + 
  geom_boxplot(aes(x = lab, y = measurement)) +
  scale_y_log10()

Warning: Removed 1195 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Exercise

Load the demographics data and use select() to select just the respondent_id, gender, marital_status and race columns and then use pivot_longer() to create the following long-format table (below are just the first 10 rows).

respondent_id name value

1 73557 gender male
2 73557 marital_status separated 3 73557 race black
4 73558 gender male
5 73558 marital_status married
6 73558 race white
7 73559 gender male
8 73559 marital_status married
9 73559 race white
10 73560 gender male

Solution

demographics <- read_csv("data/demographics.csv")

Rows: 10175 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): interview_examination, gender, race, marital_status, pregnant
dbl (9): respondent_id, age_years, age_months_sc_0_2yr, six_month_period, ag...
lgl (5): served_active_duty_us, served_active_duty_foreign, born_usa, citize...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

demographics

# A tibble: 10,175 × 19
   respondent_id interview_examination   gender age_years age_months_sc_0_2yr
           <dbl> <chr>                   <chr>      <dbl>               <dbl>
 1         73557 both interview and exam male          69                  NA
 2         73558 both interview and exam male          54                  NA
 3         73559 both interview and exam male          72                  NA
 4         73560 both interview and exam male           9                  NA
 5         73561 both interview and exam female        73                  NA
 6         73562 both interview and exam male          56                  NA
 7         73563 both interview and exam male           0                   5
 8         73564 both interview and exam female        61                  NA
 9         73565 interview only          male          42                  NA
10         73566 both interview and exam female        56                  NA
# ℹ 10,165 more rows
# ℹ 14 more variables: race <chr>, six_month_period <dbl>,
#   age_months_ex_0_19yr <dbl>, served_active_duty_us <lgl>,
#   served_active_duty_foreign <lgl>, born_usa <lgl>, citizen_usa <lgl>,
#   time_in_us <dbl>, education_youth <dbl>, education <dbl>,
#   marital_status <chr>, pregnant <chr>, language_english <lgl>,
#   household_income <dbl>

demographics |> 
  select(respondent_id, gender, marital_status, race) |>
  pivot_longer(cols = all_of(c("gender", "marital_status", "race")))

# A tibble: 30,525 × 3
   respondent_id name           value    
           <dbl> <chr>          <chr>    
 1         73557 gender         male     
 2         73557 marital_status separated
 3         73557 race           black    
 4         73558 gender         male     
 5         73558 marital_status married  
 6         73558 race           white    
 7         73559 gender         male     
 8         73559 marital_status married  
 9         73559 race           white    
10         73560 gender         male     
# ℹ 30,515 more rows