Tidyverse 3: Grouping and Summarizing Data

r-basics
tidyverse
Author

Jelmer Poelstra

Published

March 1, 2024



1 Introduction

Recap of the past two weeks

In the past two weeks, we’ve been learning about 5 functions from the dplyr package, a central workhorse of the tidyverse ecosystem, to manipulate data in data frames:

  • filter() to pick rows (which typically represent observations/samples/individuals)
  • select() to pick columns (which typically represent variables/properties)
  • arrange() to sort data frame rows
  • rename() to change data frame column names
  • mutate() to add and manipulate data frame columns

Learning objectives for today

Today, we will focus on a slightly more complicated and very powerful dplyr function: summarize() to compute summaries across rows, typically across groups of rows.

We will also learn about a couple other handy functions to work with groups of rows: the slice_() family of functions and count().

Setting up

Load the tidyverse:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
install.packages("tidyverse")

We’ll continue working with the diamonds dataframe (automatically loaded with the tidyverse), so let’s take another look at that:

head(diamonds)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48


2 summarize()

2.1 A first example

The summarize() function from the dplyr package can compute across-row data summaries. As a first example, here’s how you can compute the overall mean of the carat column:

diamonds |>
  summarize(carat_mean = mean(carat))
# A tibble: 1 × 1
  carat_mean
       <dbl>
1      0.798

As you can see, this function has quite a different output from the dplyr functions we’d seen so far. All of those returned a manipulated version of our original dataframe, whereas summarize() returns a “completely new” dataframe with a summary of the original data.

However, summarizing across all rows at once with summarize() is much more verbose than the simple, base R:

mean(diamonds$carat)
[1] 0.7979397

2.2 Summarizing by group

The real power of summarize() comes with its ability to compute group-wise summaries. For example, simply by adding .by = cut, it will calculate the mean carat separately for each value of cut:

diamonds |>
  summarize(carat_mean = mean(carat), .by = cut)
# A tibble: 5 × 2
  cut       carat_mean
  <ord>          <dbl>
1 Ideal          0.703
2 Premium        0.892
3 Good           0.849
4 Very Good      0.806
5 Fair           1.05 

Nice! We can also compute multiple summarized variables — and use summarizing functions other than mean():

diamonds |>
  summarize(carat_mean = mean(carat),
            carat_max = max(carat),
            .by = cut)
# A tibble: 5 × 3
  cut       carat_mean carat_max
  <ord>          <dbl>     <dbl>
1 Ideal          0.703      3.5 
2 Premium        0.892      4.01
3 Good           0.849      3.01
4 Very Good      0.806      4   
5 Fair           1.05       5.01

A handy helper function is n(), which will compute the number of rows for each group (i.e. the group sizes, which can be good to know, for example so you don’t make unfounded conclusions based on really small sample sizes):

diamonds |>
  summarize(carat_mean = mean(carat),
            n_diamonds = n(),
            .by = cut)
# A tibble: 5 × 3
  cut       carat_mean n_diamonds
  <ord>          <dbl>      <int>
1 Ideal          0.703      21551
2 Premium        0.892      13791
3 Good           0.849       4906
4 Very Good      0.806      12082
5 Fair           1.05        1610

Here is an overview of the most commonly used functions to compute summaries:

  • mean() & median()
  • min() & max()
  • sum()
  • sd(): standard deviation
  • IQR(): interquartile range
  • n(): counts the number of rows (observations)
  • n_distinct(): counts the number of distinct (unique) values

Two other comments:

  • You can also ask summarize() to compute summaries by multiple columns, which will return separate summaries for each combination of the involved variables — we’ll see this in the exercises.

  • This may be obvious, but whatever column you are computing summaries by (using .by) should be a categorical variable. In our diamond examples, we’re only using columns that are factors, but “regular” character columns will work just fine as well.


Side note: group_by()

The “classic” way of using summarize() with multiple groups is by preceding it with a group_by() call — e.g., the code below is equivalent to our last example above:

diamonds |>
  group_by(cut) |> 
  summarize(carat_mean = mean(carat),
            n_diamonds = n())
# A tibble: 5 × 3
  cut       carat_mean n_diamonds
  <ord>          <dbl>      <int>
1 Fair           1.05        1610
2 Good           0.849       4906
3 Very Good      0.806      12082
4 Premium        0.892      13791
5 Ideal          0.703      21551

The .by argument to summarize() (and other functions!) is a rather recent addition, but I prefer it over group_by():

  • It is simpler, a bit less typing, and makes the summarize() call self-contained
  • When grouping by multiple columns, group_by() has some odd, unhelpful behavior where it keeps some of the groupings, such that you likely need an ungroup() call as well.


3 Exercises I

3.1 Diamond sizes in relation to clarity

A) Compute the mean length (x), width (y), and depth (z) for each diamond clarity. Also include a column with group sizes. Do you see any clear differences?

Solution (click here)
diamonds |>
  summarize(x_mean = mean(x),
            y_mean = mean(y),
            z_mean = mean(z),
            n_diamonds = n(),
            .by = clarity)
# A tibble: 8 × 5
  clarity x_mean y_mean z_mean n_diamonds
  <ord>    <dbl>  <dbl>  <dbl>      <int>
1 SI2       6.40   6.40   3.95       9194
2 SI1       5.89   5.89   3.64      13065
3 VS1       5.57   5.58   3.44       8171
4 VS2       5.66   5.66   3.49      12258
5 VVS2      5.22   5.23   3.22       5066
6 VVS1      4.96   4.98   3.06       3655
7 I1        6.76   6.71   4.21        741
8 IF        4.97   4.99   3.06       1790

B) Diamond size differences between clarities may become more apparent if we use the overall volume:

  • Dig up your mutate() skills to create a volume column vol (length * width * depth)
  • Compute the mean volume for each diamond clarity
Solution (click here)
diamonds |>
  mutate(vol = x * y * z) |>
  summarize(vol_mean = mean(vol), .by = clarity)
# A tibble: 8 × 2
  clarity vol_mean
  <ord>      <dbl>
1 SI2        175. 
2 SI1        138. 
3 VS1        119. 
4 VS2        124. 
5 VVS2        97.6
6 VVS1        82.5
7 I1         205. 
8 IF          83.3



3.2 Diamond prices

A) For each diamond cut, compute the price’s mean, median, minimum, maximum, and standard deviation.

Solution (click here)
diamonds |>
  summarize(price_mean = mean(price),
            price_median = median(price),
            price_min = min(price),
            price_max = max(price),
            price_sd = sd(price),
            .by = cut)
# A tibble: 5 × 6
  cut       price_mean price_median price_min price_max price_sd
  <ord>          <dbl>        <dbl>     <int>     <int>    <dbl>
1 Ideal          3458.        1810        326     18806    3808.
2 Premium        4584.        3185        326     18823    4349.
3 Good           3929.        3050.       327     18788    3682.
4 Very Good      3982.        2648        336     18818    3936.
5 Fair           4359.        3282        337     18574    3560.

B) Find the combination of cut, color, and clarity that on average yields the most expensive diamonds.

For this, you’ll have to group by all these 3 columns. Try this for yourself first, but check out the first solution if you can’t get that part to work.

Grouping solution (click here)

.by = c(cut, color, clarity) will group by these 3 columns at once.

Solution (click here)

After summarizing, it makes sense to arrange in descending order by price, so you can see the cut-color-clarity combinations with the most expensive diamonds:

diamonds |>
  summarize(price = mean(price),
            .by = c(cut, color, clarity)) |>
  arrange(desc(price))
# A tibble: 276 × 4
   cut       color clarity  price
   <ord>     <ord> <ord>    <dbl>
 1 Very Good D     IF      10298.
 2 Good      D     IF      10030.
 3 Ideal     J     I1       9454 
 4 Premium   D     IF       9056.
 5 Premium   J     SI2      7550.
 6 Fair      D     I1       7383 
 7 Premium   J     VVS1     7244.
 8 Ideal     I     SI2      7192.
 9 Premium   I     VS2      7156.
10 Premium   I     SI2      7148.
# ℹ 266 more rows


4 The slice_ functions

Like the filter() function, functions in the slice_ family select specific rows, but have some different functionality that’s quite handy — especially in combination with grouping.

Let’s say we wanted to only get, for each cut, the diamond with the highest value of carat. We can do this pretty easily with the slice_max() function, which will return the row(s) with the highest value for a specified variable:

diamonds |>
  slice_max(carat, by = cut)
# A tibble: 8 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  3.5  Ideal     H     I1       62.8    57 12587  9.65  9.59  6.03
2  4.01 Premium   I     I1       61      61 15223 10.1  10.1   6.17
3  4.01 Premium   J     I1       62.5    62 15223 10.0   9.94  6.24
4  3.01 Good      I     SI2      63.9    60 18242  9.06  9.01  5.77
5  3.01 Good      I     SI2      63.9    60 18242  9.06  9.01  5.77
6  3.01 Good      H     SI2      57.6    64 18593  9.44  9.38  5.42
7  4    Very Good I     I1       63.3    58 15984 10.0   9.94  6.31
8  5.01 Fair      J     I1       65.5    59 18018 10.7  10.5   6.98
Why are we getting more than one diamond per cut in some cases? (Click to see the answer) Because of ties in the cut value. (We’ll get back to this in the next set of exercises.)

You can get more than just the single highest (slice_max()) / lowest (slice_min()) value per group with the n= argument, and can get a specific proportion of rows with prop=:

# Get the 3 highest-depth diamonds for each value of cut
diamonds |>
  slice_max(depth, by = cut, n = 3)
# A tibble: 280 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.01 Ideal   I     VS2      66.7    56 15729  7.75  7.65  5.14
 2  0.71 Ideal   E     SI2      65.5    55  2511  5.62  5.56  3.66
 3  3.01 Ideal   J     I1       65.4    60 16538  8.99  8.93  5.86
 4  0.8  Premium G     SI1      63      59  2760  5.9   5.81  3.69
 5  0.9  Premium I     VS2      63      58  2761  6.16  6.12  3.87
 6  0.72 Premium I     IF       63      57  2795  5.72  5.7   3.6 
 7  0.72 Premium E     VS2      63      55  2802  5.79  5.61  3.59
 8  0.7  Premium E     VS2      63      60  2818  5.64  5.6   3.54
 9  0.81 Premium G     SI1      63      60  2832  5.87  5.81  3.68
10  0.7  Premium G     VS1      63      60  2838  5.64  5.57  3.53
# ℹ 270 more rows
# Get the 1% lowest-carat diamonds for each value of color
diamonds |>
  slice_min(carat, by = color, prop = 0.01)
# A tibble: 926 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.2  Premium   E     SI2      60.2    62   345  3.79  3.75  2.27
 2  0.2  Premium   E     VS2      59.8    62   367  3.79  3.77  2.26
 3  0.2  Premium   E     VS2      59      60   367  3.81  3.78  2.24
 4  0.2  Premium   E     VS2      61.1    59   367  3.81  3.78  2.32
 5  0.2  Premium   E     VS2      59.7    62   367  3.84  3.8   2.28
 6  0.2  Ideal     E     VS2      59.7    55   367  3.86  3.84  2.3 
 7  0.2  Very Good E     VS2      63.4    59   367  3.74  3.71  2.36
 8  0.2  Ideal     E     VS2      62.2    57   367  3.76  3.73  2.33
 9  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
10  0.21 Very Good E     VS2      63.2    54   386  3.82  3.78  2.4 
# ℹ 916 more rows

The full set of slice_ functions
  • slice_head(n = 1) takes the first row.
  • slice_tail(n = 1) takes the last row
  • slice_min(x, n = 1) takes the row with the smallest value in column x.
  • slice_max(x, n = 1) takes the row with the largest value in column x.
  • slice_sample(n = 1) takes one random row.
  • slice(15) takes the 15th row.


5 The count() function

Above, we used the n() helper with summarize() to get the number of rows with each value of cut (i.e. the group sizes) along with other computed summaries.

Getting a “count table” like that is such a common part of Exploratory Data Analysis (EDA) that there’s another dplyr function available for it, count(). And unless you’re already summarize-ing, this one is simpler to use:

diamonds |> count(cut)
# A tibble: 5 × 2
  cut           n
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551

We may also want to sort the output by frequency, which can be useful especially when there are many different values for the focal variable (not really the case here, but you get the point):

diamonds |> count(cut, sort = TRUE)
# A tibble: 5 × 2
  cut           n
  <ord>     <int>
1 Ideal     21551
2 Premium   13791
3 Very Good 12082
4 Good       4906
5 Fair       1610

And you can also create counts for value combinations across multiple columns:

diamonds |> count(cut, color)
# A tibble: 35 × 3
   cut   color     n
   <ord> <ord> <int>
 1 Fair  D       163
 2 Fair  E       224
 3 Fair  F       312
 4 Fair  G       314
 5 Fair  H       303
 6 Fair  I       175
 7 Fair  J       119
 8 Good  D       662
 9 Good  E       933
10 Good  F       909
# ℹ 25 more rows

Base R’s table() function

Also worth mentioning is the base R table() function, which is similar to count(). While its output format is unwieldy for follow-up analyses1, you may prefer its formatting especially when wanting to glance at a 2-way count table to see patterns:

table(diamonds$cut, diamonds$color)
           
               D    E    F    G    H    I    J
  Fair       163  224  312  314  303  175  119
  Good       662  933  909  871  702  522  307
  Very Good 1513 2400 2164 2299 1824 1204  678
  Premium   1603 2337 2331 2924 2360 1428  808
  Ideal     2834 3903 3826 4884 3115 2093  896

Additionally, as a dplyr function, count() only works with data frames. Occasionally, you may also need to create a count table for a vector, and table() can do that:

cut_vec <- diamonds$cut

table(cut_vec)
cut_vec
     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551 


6 Exercises II

6.1 No ties, please

Above, when we first used slice_max(), we got multiple rows for some groups. Check out the help for this function (?slice_max) and get it to print only one row per group, even in the case of ties.

Solution (click here)

The with_ties argument controls this. The default is TRUE (do include ties), so we want to set it to FALSE (don’t include ties):

diamonds |>
  slice_max(carat, by = cut, with_ties = FALSE)
# A tibble: 5 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  3.5  Ideal     H     I1       62.8    57 12587  9.65  9.59  6.03
2  4.01 Premium   I     I1       61      61 15223 10.1  10.1   6.17
3  3.01 Good      I     SI2      63.9    60 18242  9.06  9.01  5.77
4  4    Very Good I     I1       63.3    58 15984 10.0   9.94  6.31
5  5.01 Fair      J     I1       65.5    59 18018 10.7  10.5   6.98



6.2 Random diamonds

Use slice_sample() to get 5 random diamonds for each combination of cut, color and clarity.

Solution (click here)
diamonds |>
  slice_sample(n = 5, by = c(cut, color, clarity))
# A tibble: 1,342 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.2  Ideal   E     SI2      61.3  56    6006  6.85  6.88  4.21
 2  1.02 Ideal   E     SI2      62.6  54.6  4983  6.4   6.44  4.02
 3  0.78 Ideal   E     SI2      61    56    3053  5.92  5.98  3.63
 4  1.11 Ideal   E     SI2      61.2  57    4969  6.7   6.74  4.11
 5  1    Ideal   E     SI2      60.7  55    4077  6.39  6.33  3.86
 6  0.71 Premium E     SI1      60.3  60    2674  5.77  5.74  3.47
 7  0.82 Premium E     SI1      61.2  61    3211  5.98  5.95  3.65
 8  0.51 Premium E     SI1      61.3  54    1546  5.18  5.13  3.16
 9  0.57 Premium E     SI1      62.2  55    1590  5.36  5.31  3.32
10  0.41 Premium E     SI1      62.4  58     755  4.75  4.77  2.97
# ℹ 1,332 more rows



6.3 count()

A) Create a sorted count table of diamond claritys

Solution (click here)
diamonds |> count(clarity, sort = TRUE)
# A tibble: 8 × 2
  clarity     n
  <ord>   <int>
1 SI1     13065
2 VS2     12258
3 SI2      9194
4 VS1      8171
5 VVS2     5066
6 VVS1     3655
7 IF       1790
8 I1        741

B) What are the least common combinations of clarity and depth for diamonds costing less than $5,000?

Hints (click here)
  • You’ll have to filter() first.
  • Can you sort in reverse order with count()? If not, you’ll have to use a separate arrange()
Solution (click here)
diamonds |>
  filter(price < 5000) |> 
  count(clarity, color) |>
  arrange(n)
# A tibble: 56 × 3
   clarity color     n
   <ord>   <ord> <int>
 1 I1      J        26
 2 I1      D        32
 3 IF      D        39
 4 IF      J        40
 5 VVS1    J        51
 6 I1      I        65
 7 VVS2    J        67
 8 I1      E        84
 9 I1      H       108
10 I1      G       114
# ℹ 46 more rows


6.4 Bonus: sort then slice

slice_head() will print the first-occurring row(s), and its twin slice_tail() prints the last row(s). In combination with sorting, these can give more flexibility than the min-max functions.

For each color, use arrange() and slice_head() to get the 3 diamonds with the highest carat, with ties in carat broken by price (highest first).

Solution (click here)
diamonds |>
  arrange(desc(carat), desc(price)) |>
  slice_head(n = 3, by = color)
# A tibble: 21 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  5.01 Fair      J     I1       65.5    59 18018 10.7  10.5   6.98
 2  4.5  Fair      J     I1       65.8    58 18531 10.2  10.2   6.72
 3  4.01 Premium   J     I1       62.5    62 15223 10.0   9.94  6.24
 4  4.13 Fair      H     I1       64.8    61 17329 10     9.85  6.43
 5  3.65 Fair      H     I1       67.1    53 11668  9.53  9.48  6.38
 6  3.5  Ideal     H     I1       62.8    57 12587  9.65  9.59  6.03
 7  4.01 Premium   I     I1       61      61 15223 10.1  10.1   6.17
 8  4    Very Good I     I1       63.3    58 15984 10.0   9.94  6.31
 9  3.67 Premium   I     I1       62.4    56 16193  9.86  9.81  6.13
10  3.4  Fair      D     I1       66.8    52 15964  9.42  9.34  6.27
# ℹ 11 more rows

Back to top

Footnotes

  1. It’s not actually a data frame…↩︎