Data Wrangling 4: Merging/Joining Data Frames

r-basics

tidyverse

Author

Horacio Lopez-Nicora

Published

September 23, 2024

1 Introduction

In the past two weeks, we’ve been learning about several functions from the dplyr package, a central workhorse of the tidyverse ecosystem, to manipulate data in data frames.

In real-world data analysis, it’s uncommon to work with just a single data frame. More often than not, you will be dealing with multiple data frames that you need to combine to effectively address your analytical questions.

In this session, we will explore the essential techniques for merging data frames using the _join functions provided by the dplyr package.

Learning objectives for today

Understand Join Fundamentals:
- Grasp the basics of joins and their significance in merging datasets.
- Learn about keys and their use in linking different data frames.
Explore Mutating Joins:
- Get acquainted with mutating join functions in dplyr, such as inner_join, left_join, right_join, and full_join.
- Learn how to add new columns to a data frame using matching records from another.
Discover Filtering Joins:
- Learn how to use filtering join functions like semi_join and anti_join.
- Understand how to retain or exclude records in one data frame based on matches with another data frame.

Setting up

Load the tidyverse meta-package:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2 Data Jazz: Merging with `dplyr` Joins

Just as a jazz band brings together a variety of instruments to create a rich, spontaneous musical experience, data analysis often involves merging different data sets to form a coherent and insightful whole. Each data set can be thought of as an instrument in a jazz ensemble, contributing its unique timbre and melody to the overall composition.

When you use join functions from the `dplyr` package in R Studio, it’s like orchestrating an improvisational jazz performance where the trumpet’s crisp notes (one data set) harmonize with the saxophone’s smooth riffs (another data set), underpinned by the piano’s rhythmic chords (yet another data set).

By cleverly combining these data sets through joins, you can produce an insightful and dynamic data analysis, much like a jazz composition that resonates on multiple levels.

Looking to get excited about jazz? Click here for a book recommendation!

But Beautiful: A Book About Jazz by Geoff Dyer.

To explore the `_join` functions we will create our own data sets:

Musicians data frame:

musicians <- tibble(
  musician_id = 1:5,
  name = c("John Coltrane", "Miles Davis", "Thelonious Monk", "Ella Fitzgerald", "Louis Armstrong"),
  instrument = c("Saxophone", "Trumpet", "Piano", "Vocals", "Trumpet"),
  years_active = c("1945-1967", "1944-1975", "1934-1975", "1934-1993", "1919-1971"),
  genre = c("Jazz", "Jazz", "Bebop", "Swing", "Jazz")
)

Albums data frame:

albums <- tibble(
  album_id = 1:6,
  musician_id = c(1, 2, 3, 2, 4, 5),
  album_title = c("Blue Train", "Kind of Blue", "Monk's Dream", "Bitches Brew", "Ella in Berlin", "Louis Armstrong Plays W.C. Handy"),
  release_year = c(1957, 1959, 1963, 1970, 1960, 1954),
  genre = c("Hard Bop", "Modal Jazz", "Bebop", "Fusion", "Vocal Jazz", "Jazz")
)

Awards data frame:

awards <- tibble(
  award_id = 1:5,
  musician_id = c(1, 2, 4, 4, 5),
  award_name = c("Grammy Hall of Fame", "Grammy Lifetime Achievement Award", "Grammy Hall of Fame", "National Medal of Arts", "Grammy Lifetime Achievement Award"),
  year = c(1999, 1990, 1999, 1987, 1972)
)

Gigs data frame:

gigs <- tibble(
  gig_id = 1:6,
  musician_id = c(1, 2, 2, 4, 1, 5),
  venue = c("Blue Note", "Village Vanguard", "Birdland", "Newport Jazz Festival", "Montreux Jazz Festival", "Cotton Club"),
  date = as.Date(c("1961-11-18", "1961-03-21", "1961-04-12", "1962-07-11", "1965-03-10", "1937-06-26")),
  city = c("New York", "New York", "New York", "Newport", "Montreux", "Harlem")
)

3 Let’s explore the different `_join` funtions

When merging datasets, we typically integrate variables from one dataset into another using a specific “key” variable. Let’s examine our JAZZ datasets and explore the joining verbs:

3.1 Setting the Stage

3.2 Mutating `_join` funtions:

`left_join` (and `right_join`)

Left join: Musicians with all their gigs.

musicians |> left_join(gigs, 
                       by = join_by(musician_id))

Right join: All albums and related musicians.

musicians |> right_join(albums, 
                        by = join_by(musician_id))

`inner_join` (and `full_join`)

Inner join: Musicians and their albums.

musicians |> inner_join(awards, 
                        by = join_by(musician_id))

Full join: Complete overview of musicians and their awards.

musicians |> full_join(awards, 
                        by = join_by(musician_id))

3.3 Exercises I

A) Find all albums by Miles Davis.

Hint (click here)

Use an inner join to list all albums released by Miles Davis. Then use filter() to end up with a data set with information only for Miles Davis.

Solution (click here)

musicians |> inner_join(albums, by = join_by(musician_id)) |>
  filter(name == "Miles Davis")

# A tibble: 2 × 9
  musician_id name        instrument years_active genre.x album_id album_title 
        <dbl> <chr>       <chr>      <chr>        <chr>      <int> <chr>       
1           2 Miles Davis Trumpet    1944-1975    Jazz           2 Kind of Blue
2           2 Miles Davis Trumpet    1944-1975    Jazz           4 Bitches Brew
# ℹ 2 more variables: release_year <dbl>, genre.y <chr>

B) Combine musicians, gigs, and awards information.

Hint (click here)

Perform a left join to combine musicians with their gigs and awards information.

Solution (click here)

musicians |> left_join(gigs, by = join_by(musician_id)) |>
  left_join(awards, by = join_by(musician_id), relationship = "many-to-many")

# A tibble: 8 × 12
  musician_id name   instrument years_active genre gig_id venue date       city 
        <dbl> <chr>  <chr>      <chr>        <chr>  <int> <chr> <date>     <chr>
1           1 John … Saxophone  1945-1967    Jazz       1 Blue… 1961-11-18 New …
2           1 John … Saxophone  1945-1967    Jazz       5 Mont… 1965-03-10 Mont…
3           2 Miles… Trumpet    1944-1975    Jazz       2 Vill… 1961-03-21 New …
4           2 Miles… Trumpet    1944-1975    Jazz       3 Bird… 1961-04-12 New …
5           3 Thelo… Piano      1934-1975    Bebop     NA <NA>  NA         <NA> 
6           4 Ella … Vocals     1934-1993    Swing      4 Newp… 1962-07-11 Newp…
7           4 Ella … Vocals     1934-1993    Swing      4 Newp… 1962-07-11 Newp…
8           5 Louis… Trumpet    1919-1971    Jazz       6 Cott… 1937-06-26 Harl…
# ℹ 3 more variables: award_id <int>, award_name <chr>, year <dbl>

C) List of venues with musician information.

Hint (click here)

Perform a right join to get a list of all gig venues and the associated musician details.

Solution (click here)

musicians |> right_join(gigs, by = join_by(musician_id))

# A tibble: 6 × 9
  musician_id name   instrument years_active genre gig_id venue date       city 
        <dbl> <chr>  <chr>      <chr>        <chr>  <int> <chr> <date>     <chr>
1           1 John … Saxophone  1945-1967    Jazz       1 Blue… 1961-11-18 New …
2           1 John … Saxophone  1945-1967    Jazz       5 Mont… 1965-03-10 Mont…
3           2 Miles… Trumpet    1944-1975    Jazz       2 Vill… 1961-03-21 New …
4           2 Miles… Trumpet    1944-1975    Jazz       3 Bird… 1961-04-12 New …
5           4 Ella … Vocals     1934-1993    Swing      4 Newp… 1962-07-11 Newp…
6           5 Louis… Trumpet    1919-1971    Jazz       6 Cott… 1937-06-26 Harl…

3.4 Filtering `_join` funtions:

`anti_join` (and `semi_join`)

Semi join: Musicians who have received awards.

musicians |> semi_join(awards, 
                        by = join_by(musician_id))

What is the difference between inner_join and semi_join ?

Notice that the observations in both data frames are the similar, but that the inner join adds the variables from the data frame “y”, whereas the semi join only uses the “y” data frame to determine which observations to keep.

3.5 Exercises II

A) Find musicians without any gigs.

Hint (click here)

Using an anti join, find musicians who haven’t have any gigs in this period.

Solution (click here)

musicians |> anti_join(gigs, 
                        by = join_by(musician_id))

4 Merging Data Sets in Base R

The merge function in base R can perform inner joins, left joins, right joins, and full joins by specifying appropriate parameters.

Left join: Musicians with all their gigs.

merge(musicians, gigs, all.x = TRUE, by = "musician_id")

Right join: All albums and related musicians.

merge(musicians, albums, all.y = TRUE, by = "musician_id")

Inner join: Musicians and their albums.

merge(musicians, awards, all = FALSE, by = "musician_id")

Full join: Complete overview of musicians and their awards.

merge(musicians, awards, all = TRUE, by = "musician_id")

With merge(), you can achieve the same results as dplyr’s mutating joins, though dplyr provides a more explicit and often more readable syntax for many users. Both methods are powerful, and understanding both can be very useful in different contexts!

1 Introduction

Learning objectives for today

Setting up

2 Data Jazz: Merging with dplyr Joins

To explore the _join functions we will create our own data sets:

Musicians data frame:

Albums data frame:

Awards data frame:

Gigs data frame:

3 Let’s explore the different _join funtions

3.1 Setting the Stage

3.2 Mutating _join funtions:

left_join (and right_join)

inner_join (and full_join)

3.3 Exercises I

3.4 Filtering _join funtions:

anti_join (and semi_join)

3.5 Exercises II

4 Merging Data Sets in Base R

2 Data Jazz: Merging with `dplyr` Joins

To explore the `_join` functions we will create our own data sets:

3 Let’s explore the different `_join` funtions

3.2 Mutating `_join` funtions:

`left_join` (and `right_join`)

`inner_join` (and `full_join`)

3.4 Filtering `_join` funtions:

`anti_join` (and `semi_join`)