Joining Data Together

This lesson introduces techniques for “joining” multiple data tables together using a “key” variable

A really helpful resource for learning about joins is the R for Data Science book (specifically the chapter about “Relational Data”).

At the end of this lesson we will have created a single data frame that contains the information from both the demographics data and the urine albumin and creatinine labs data.

# load the tidyverse, the demographics, and the urine_albumin_creatinine data
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

demographics <- read_csv("data/demographics.csv")

Rows: 10175 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): interview_examination, gender, race, marital_status, pregnant
dbl (9): respondent_id, age_years, age_months_sc_0_2yr, six_month_period, ag...
lgl (5): served_active_duty_us, served_active_duty_foreign, born_usa, citize...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

labs <- read_csv("data/urine_albumin_creatinine.csv")

Rows: 41455 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): lab
dbl (2): respondent_id, measurement

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Determining what kind of join to do

There are four kinds of joins that you might want to use for combining two tables into a single table:

An inner join, which will keep only the data for the observational units that appear in both tables
A left join, which will keep all of the data for the observational units in the first (left) table, and will add missing values when there is no available data in the second table.
A right join, which will keep all of the data for the observational units in the second (right) table, and will add missing values when there is no available data in the first table.
A full join, which will keep all of the data for the observational units in both the first (left) table, and will add missing values when there is no available data in the other table.

The kind of join that you will implement will depend on what you want to use your joined data for.

Checking the uniqueness of the key ID variables

Below, we see that the respondent_id column is indeed unique for demographics:

# check whether the respondent ID values in demographics are unique?
identical(unique(demographics$respondent_id), demographics$respondent_id)

[1] TRUE

But not for labs:

# check whether the respondent ID values in labs are unique?
identical(unique(labs$respondent_id), labs$respondent_id)

[1] FALSE

Checking overlap between tables

Are there any observations in the demographics table that do not appear in the labs table?

# Compute the proportion of people in demographics who are also in labs
sum(unique(demographics$respondent_id) %in% unique(labs$respondent_id)) / n_distinct(demographics$respondent_id)

[1] 0.8148403

The calculation above shows that around 81% of observations in the demographics table have available labs data.

Are there any observations in the labs table that do not appear in the demographics table?

# proportion of people in labs who are also in demographics
sum(unique(labs$respondent_id) %in% unique(demographics$respondent_id)) / n_distinct(labs$respondent_id)

[1] 1

The calculation above shows that 100% of observations in the labs table have available demographics data.

Joining demographics and labs

Our two options for to join demographics (left) and labs (right) are:

An inner or right join will only keep the 81% of people in demographics who have available labs data
A left or full join will keep all people in demographics data, and join missing values for those who don’t have labs data available

Since we don’t want to discard any data, we will choose the left/full join.

# perform a left/full join of demographics and labs
print(left_join(demographics, labs, by = "respondent_id"), width = Inf)

# A tibble: 43,339 × 21
   respondent_id interview_examination   gender age_years age_months_sc_0_2yr
           <dbl> <chr>                   <chr>      <dbl>               <dbl>
 1         73557 both interview and exam male          69                  NA
 2         73557 both interview and exam male          69                  NA
 3         73557 both interview and exam male          69                  NA
 4         73557 both interview and exam male          69                  NA
 5         73557 both interview and exam male          69                  NA
 6         73558 both interview and exam male          54                  NA
 7         73558 both interview and exam male          54                  NA
 8         73558 both interview and exam male          54                  NA
 9         73558 both interview and exam male          54                  NA
10         73558 both interview and exam male          54                  NA
   race  six_month_period age_months_ex_0_19yr served_active_duty_us
   <chr>            <dbl>                <dbl> <lgl>                
 1 black                1                   NA TRUE                 
 2 black                1                   NA TRUE                 
 3 black                1                   NA TRUE                 
 4 black                1                   NA TRUE                 
 5 black                1                   NA TRUE                 
 6 white                1                   NA FALSE                
 7 white                1                   NA FALSE                
 8 white                1                   NA FALSE                
 9 white                1                   NA FALSE                
10 white                1                   NA FALSE                
   served_active_duty_foreign born_usa citizen_usa time_in_us education_youth
   <lgl>                      <lgl>    <lgl>            <dbl>           <dbl>
 1 TRUE                       TRUE     TRUE                NA              NA
 2 TRUE                       TRUE     TRUE                NA              NA
 3 TRUE                       TRUE     TRUE                NA              NA
 4 TRUE                       TRUE     TRUE                NA              NA
 5 TRUE                       TRUE     TRUE                NA              NA
 6 NA                         TRUE     TRUE                NA              NA
 7 NA                         TRUE     TRUE                NA              NA
 8 NA                         TRUE     TRUE                NA              NA
 9 NA                         TRUE     TRUE                NA              NA
10 NA                         TRUE     TRUE                NA              NA
   education marital_status pregnant language_english household_income
       <dbl> <chr>          <chr>    <lgl>                       <dbl>
 1         3 separated      <NA>     TRUE                        17500
 2         3 separated      <NA>     TRUE                        17500
 3         3 separated      <NA>     TRUE                        17500
 4         3 separated      <NA>     TRUE                        17500
 5         3 separated      <NA>     TRUE                        17500
 6         3 married        <NA>     TRUE                        40000
 7         3 married        <NA>     TRUE                        40000
 8         3 married        <NA>     TRUE                        40000
 9         3 married        <NA>     TRUE                        40000
10         3 married        <NA>     TRUE                        40000
   lab                         measurement
   <chr>                             <dbl>
 1 ur_albumin_ug_ml                    4.3
 2 ur_albumin_mg_l                     4.3
 3 ur_creatinine_mg_dl                39  
 4 ur_creatinine_umol_l             3448. 
 5 ur_albumin_creatinine_ratio        11.0
 6 ur_albumin_ug_ml                  153  
 7 ur_albumin_mg_l                   153  
 8 ur_creatinine_mg_dl                50  
 9 ur_creatinine_umol_l             4420  
10 ur_albumin_creatinine_ratio       306  
# ℹ 43,329 more rows

print(full_join(demographics, labs, by = "respondent_id"), width = Inf)

# A tibble: 43,339 × 21
   respondent_id interview_examination   gender age_years age_months_sc_0_2yr
           <dbl> <chr>                   <chr>      <dbl>               <dbl>
 1         73557 both interview and exam male          69                  NA
 2         73557 both interview and exam male          69                  NA
 3         73557 both interview and exam male          69                  NA
 4         73557 both interview and exam male          69                  NA
 5         73557 both interview and exam male          69                  NA
 6         73558 both interview and exam male          54                  NA
 7         73558 both interview and exam male          54                  NA
 8         73558 both interview and exam male          54                  NA
 9         73558 both interview and exam male          54                  NA
10         73558 both interview and exam male          54                  NA
   race  six_month_period age_months_ex_0_19yr served_active_duty_us
   <chr>            <dbl>                <dbl> <lgl>                
 1 black                1                   NA TRUE                 
 2 black                1                   NA TRUE                 
 3 black                1                   NA TRUE                 
 4 black                1                   NA TRUE                 
 5 black                1                   NA TRUE                 
 6 white                1                   NA FALSE                
 7 white                1                   NA FALSE                
 8 white                1                   NA FALSE                
 9 white                1                   NA FALSE                
10 white                1                   NA FALSE                
   served_active_duty_foreign born_usa citizen_usa time_in_us education_youth
   <lgl>                      <lgl>    <lgl>            <dbl>           <dbl>
 1 TRUE                       TRUE     TRUE                NA              NA
 2 TRUE                       TRUE     TRUE                NA              NA
 3 TRUE                       TRUE     TRUE                NA              NA
 4 TRUE                       TRUE     TRUE                NA              NA
 5 TRUE                       TRUE     TRUE                NA              NA
 6 NA                         TRUE     TRUE                NA              NA
 7 NA                         TRUE     TRUE                NA              NA
 8 NA                         TRUE     TRUE                NA              NA
 9 NA                         TRUE     TRUE                NA              NA
10 NA                         TRUE     TRUE                NA              NA
   education marital_status pregnant language_english household_income
       <dbl> <chr>          <chr>    <lgl>                       <dbl>
 1         3 separated      <NA>     TRUE                        17500
 2         3 separated      <NA>     TRUE                        17500
 3         3 separated      <NA>     TRUE                        17500
 4         3 separated      <NA>     TRUE                        17500
 5         3 separated      <NA>     TRUE                        17500
 6         3 married        <NA>     TRUE                        40000
 7         3 married        <NA>     TRUE                        40000
 8         3 married        <NA>     TRUE                        40000
 9         3 married        <NA>     TRUE                        40000
10         3 married        <NA>     TRUE                        40000
   lab                         measurement
   <chr>                             <dbl>
 1 ur_albumin_ug_ml                    4.3
 2 ur_albumin_mg_l                     4.3
 3 ur_creatinine_mg_dl                39  
 4 ur_creatinine_umol_l             3448. 
 5 ur_albumin_creatinine_ratio        11.0
 6 ur_albumin_ug_ml                  153  
 7 ur_albumin_mg_l                   153  
 8 ur_creatinine_mg_dl                50  
 9 ur_creatinine_umol_l             4420  
10 ur_albumin_creatinine_ratio       306  
# ℹ 43,329 more rows

Look at the results above, does something look wrong to you?

If your key ID variable has repeated values, your joined data will too.

A good check after completing a join is to make sure that the number of rows in the resulting joined data matches what you expect

# check the number of rows in the joined data
nrow(left_join(demographics, labs, by = "respondent_id"))

[1] 43339

# check the number of rows in demographics
nrow(demographics)

[1] 10175

Therefore, to ensure that we have just one row per observational unit (i.e., per respondent_id value), we need to join the wide-format labs_wide data instead of the long-format labs data.

# compute the wide format of the labs data
labs_wide <- labs |> 
  pivot_wider(names_from = "lab", 
              values_from = "measurement")
labs_wide

# A tibble: 8,291 × 6
   respondent_id ur_albumin_ug_ml ur_albumin_mg_l ur_creatinine_mg_dl
           <dbl>            <dbl>           <dbl>               <dbl>
 1         73557              4.3             4.3                  39
 2         73558            153             153                    50
 3         73559             11.9            11.9                 113
 4         73560             16              16                    76
 5         73561            255             255                   147
 6         73562            123             123                    74
 7         73564             19              19                   242
 8         73566              1.3             1.3                  18
 9         73567             35              35                   215
10         73568             25              25                    31
# ℹ 8,281 more rows
# ℹ 2 more variables: ur_creatinine_umol_l <dbl>,
#   ur_albumin_creatinine_ratio <dbl>

# join demographics and labs_wide
print(left_join(demographics, labs_wide, by = "respondent_id"), width = Inf)

# A tibble: 10,175 × 24
   respondent_id interview_examination   gender age_years age_months_sc_0_2yr
           <dbl> <chr>                   <chr>      <dbl>               <dbl>
 1         73557 both interview and exam male          69                  NA
 2         73558 both interview and exam male          54                  NA
 3         73559 both interview and exam male          72                  NA
 4         73560 both interview and exam male           9                  NA
 5         73561 both interview and exam female        73                  NA
 6         73562 both interview and exam male          56                  NA
 7         73563 both interview and exam male           0                   5
 8         73564 both interview and exam female        61                  NA
 9         73565 interview only          male          42                  NA
10         73566 both interview and exam female        56                  NA
   race             six_month_period age_months_ex_0_19yr served_active_duty_us
   <chr>                       <dbl>                <dbl> <lgl>                
 1 black                           1                   NA TRUE                 
 2 white                           1                   NA FALSE                
 3 white                           2                   NA TRUE                 
 4 white                           1                  119 NA                   
 5 white                           1                   NA FALSE                
 6 mexican american                1                   NA TRUE                 
 7 white                           2                    6 NA                   
 8 white                           2                   NA FALSE                
 9 other hispanic                 NA                   NA FALSE                
10 white                           1                   NA FALSE                
   served_active_duty_foreign born_usa citizen_usa time_in_us education_youth
   <lgl>                      <lgl>    <lgl>            <dbl>           <dbl>
 1 TRUE                       TRUE     TRUE                NA              NA
 2 NA                         TRUE     TRUE                NA              NA
 3 TRUE                       TRUE     TRUE                NA              NA
 4 NA                         TRUE     TRUE                NA               3
 5 NA                         TRUE     TRUE                NA              NA
 6 FALSE                      TRUE     TRUE                NA              NA
 7 NA                         TRUE     TRUE                NA              NA
 8 NA                         TRUE     TRUE                NA              NA
 9 NA                         TRUE     TRUE                NA              NA
10 NA                         TRUE     TRUE                NA              NA
   education marital_status pregnant language_english household_income
       <dbl> <chr>          <chr>    <lgl>                       <dbl>
 1         3 separated      <NA>     TRUE                        17500
 2         3 married        <NA>     TRUE                        40000
 3         4 married        <NA>     TRUE                        70000
 4        NA <NA>           <NA>     TRUE                        60000
 5         5 married        <NA>     TRUE                       100000
 6         4 divorced       <NA>     TRUE                        60000
 7        NA <NA>           <NA>     TRUE                       100000
 8         5 widowed        <NA>     TRUE                        70000
 9         3 married        <NA>     TRUE                       100000
10         3 divorced       <NA>     TRUE                        17500
   ur_albumin_ug_ml ur_albumin_mg_l ur_creatinine_mg_dl ur_creatinine_umol_l
              <dbl>           <dbl>               <dbl>                <dbl>
 1              4.3             4.3                  39                3448.
 2            153             153                    50                4420 
 3             11.9            11.9                 113                9989.
 4             16              16                    76                6718.
 5            255             255                   147               12995.
 6            123             123                    74                6542.
 7             NA              NA                    NA                  NA 
 8             19              19                   242               21393.
 9             NA              NA                    NA                  NA 
10              1.3             1.3                  18                1591.
   ur_albumin_creatinine_ratio
                         <dbl>
 1                       11.0 
 2                      306   
 3                       10.5 
 4                       21.0 
 5                      173.  
 6                      166.  
 7                       NA   
 8                        7.85
 9                       NA   
10                        7.22
# ℹ 10,165 more rows

Exercise

How many rows would you expect an inner join of demographics and labs_wide to have? Check your answer by conducting the join using inner_join()

Solution

The number of rows should match the intersection of the two datasets. Since all people in labs_wide are also in demographics (but not the other way around), the join will have the same number of rows in labs_wide.

nrow(labs_wide)

[1] 8291

The join can be conducted using inner_join() and notice that the number of rows matches labs_wide.

inner_join(demographics, labs_wide, by = "respondent_id")

# A tibble: 8,291 × 24
   respondent_id interview_examination   gender age_years age_months_sc_0_2yr
           <dbl> <chr>                   <chr>      <dbl>               <dbl>
 1         73557 both interview and exam male          69                  NA
 2         73558 both interview and exam male          54                  NA
 3         73559 both interview and exam male          72                  NA
 4         73560 both interview and exam male           9                  NA
 5         73561 both interview and exam female        73                  NA
 6         73562 both interview and exam male          56                  NA
 7         73564 both interview and exam female        61                  NA
 8         73566 both interview and exam female        56                  NA
 9         73567 both interview and exam male          65                  NA
10         73568 both interview and exam female        26                  NA
# ℹ 8,281 more rows
# ℹ 19 more variables: race <chr>, six_month_period <dbl>,
#   age_months_ex_0_19yr <dbl>, served_active_duty_us <lgl>,
#   served_active_duty_foreign <lgl>, born_usa <lgl>, citizen_usa <lgl>,
#   time_in_us <dbl>, education_youth <dbl>, education <dbl>,
#   marital_status <chr>, pregnant <chr>, language_english <lgl>,
#   household_income <dbl>, ur_albumin_ug_ml <dbl>, ur_albumin_mg_l <dbl>, …

Note that if there were observations in labs_wide that were not in demographics, then the resulting inner join will have fewer rows than both tables.