In the past two weeks, we’ve been learning about 5 functions from the dplyr package, a central workhorse of the tidyverse ecosystem, to manipulate data in data frames:
filter() to pick rows (which typically represent observations/samples/individuals)
select() to pick columns (which typically represent variables/properties)
arrange() to sort data frame rows
rename() to change data frame column names
mutate() to add and manipulate data frame columns
Learning objectives for today
Today, we will focus on a slightly more complicated and very powerful dplyr function: summarize() to compute summaries across rows, typically across groups of rows.
We will also learn about a couple other handy functions to work with groups of rows: the slice_() family of functions and count().
Setting up
Load the tidyverse:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Still need to install the tidyverse? Click here for instructions
install.packages("tidyverse")
We’ll continue working with the diamonds dataframe (automatically loaded with the tidyverse), so let’s take another look at that:
head(diamonds)
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
As you can see, this function has quite a different output from the dplyr functions we’d seen so far. All of those returned a manipulated version of our original dataframe, whereas summarize() returns a “completely new” dataframe with a summary of the original data.
However, summarizing across all rows at once with summarize() is much more verbose than the simple, base R:
mean(diamonds$carat)
[1] 0.7979397
2.2 Summarizing by group
The real power of summarize() comes with its ability to compute group-wise summaries. For example, simply by adding .by = cut, it will calculate the mean carat separately for each value of cut:
# A tibble: 5 × 3
cut carat_mean carat_max
<ord> <dbl> <dbl>
1 Ideal 0.703 3.5
2 Premium 0.892 4.01
3 Good 0.849 3.01
4 Very Good 0.806 4
5 Fair 1.05 5.01
A handy helper function is n(), which will compute the number of rows for each group (i.e. the group sizes, which can be good to know, for example so you don’t make unfounded conclusions based on really small sample sizes):
# A tibble: 5 × 3
cut carat_mean n_diamonds
<ord> <dbl> <int>
1 Ideal 0.703 21551
2 Premium 0.892 13791
3 Good 0.849 4906
4 Very Good 0.806 12082
5 Fair 1.05 1610
Here is an overview of the most commonly used functions to compute summaries:
mean() & median()
min() & max()
sum()
sd(): standard deviation
IQR(): interquartile range
n(): counts the number of rows (observations)
n_distinct(): counts the number of distinct (unique) values
Two other comments:
You can also ask summarize() to compute summaries by multiple columns, which will return separate summaries for each combination of the involved variables — we’ll see this in the exercises.
This may be obvious, but whatever column you are computing summaries by (using .by) should be a categorical variable. In our diamond examples, we’re only using columns that are factors, but “regular” character columns will work just fine as well.
Side note: group_by()
The “classic” way of using summarize() with multiple groups is by preceding it with a group_by() call — e.g., the code below is equivalent to our last example above:
# A tibble: 5 × 3
cut carat_mean n_diamonds
<ord> <dbl> <int>
1 Fair 1.05 1610
2 Good 0.849 4906
3 Very Good 0.806 12082
4 Premium 0.892 13791
5 Ideal 0.703 21551
The .by argument to summarize() (and other functions!) is a rather recent addition, but I prefer it over group_by():
It is simpler, a bit less typing, and makes the summarize() call self-contained
When grouping by multiple columns, group_by() has some odd, unhelpful behavior where it keeps some of the groupings, such that you likely need an ungroup() call as well.
3 Exercises I
3.1 Diamond sizes in relation to clarity
A) Compute the mean length (x), width (y), and depth (z) for each diamond clarity. Also include a column with group sizes. Do you see any clear differences?
B) Find the combination of cut, color, and clarity that on average yields the most expensive diamonds.
For this, you’ll have to group by all these 3 columns. Try this for yourself first, but check out the first solution if you can’t get that part to work.
Grouping solution (click here)
.by = c(cut, color, clarity) will group by these 3 columns at once.
Solution (click here)
After summarizing, it makes sense to arrange in descending order by price, so you can see the cut-color-clarity combinations with the most expensive diamonds:
# A tibble: 276 × 4
cut color clarity price
<ord> <ord> <ord> <dbl>
1 Very Good D IF 10298.
2 Good D IF 10030.
3 Ideal J I1 9454
4 Premium D IF 9056.
5 Premium J SI2 7550.
6 Fair D I1 7383
7 Premium J VVS1 7244.
8 Ideal I SI2 7192.
9 Premium I VS2 7156.
10 Premium I SI2 7148.
# ℹ 266 more rows
4 The slice_ functions
Like the filter() function, functions in the slice_ family select specific rows, but have some different functionality that’s quite handy — especially in combination with grouping.
Let’s say we wanted to only get, for each cut, the diamond with the highest value of carat. We can do this pretty easily with the slice_max() function, which will return the row(s) with the highest value for a specified variable:
diamonds |>slice_max(carat, by = cut)
# A tibble: 8 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
2 4.01 Premium I I1 61 61 15223 10.1 10.1 6.17
3 4.01 Premium J I1 62.5 62 15223 10.0 9.94 6.24
4 3.01 Good I SI2 63.9 60 18242 9.06 9.01 5.77
5 3.01 Good I SI2 63.9 60 18242 9.06 9.01 5.77
6 3.01 Good H SI2 57.6 64 18593 9.44 9.38 5.42
7 4 Very Good I I1 63.3 58 15984 10.0 9.94 6.31
8 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
Why are we getting more than one diamond per cut in some cases? (Click to see the answer)
Because of ties in the cut value. (We’ll get back to this in the next set of exercises.)
You can get more than just the single highest (slice_max()) / lowest (slice_min()) value per group with the n= argument, and can get a specific proportion of rows with prop=:
# Get the 3 highest-depth diamonds for each value of cutdiamonds |>slice_max(depth, by = cut, n =3)
# A tibble: 280 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.01 Ideal I VS2 66.7 56 15729 7.75 7.65 5.14
2 0.71 Ideal E SI2 65.5 55 2511 5.62 5.56 3.66
3 3.01 Ideal J I1 65.4 60 16538 8.99 8.93 5.86
4 0.8 Premium G SI1 63 59 2760 5.9 5.81 3.69
5 0.9 Premium I VS2 63 58 2761 6.16 6.12 3.87
6 0.72 Premium I IF 63 57 2795 5.72 5.7 3.6
7 0.72 Premium E VS2 63 55 2802 5.79 5.61 3.59
8 0.7 Premium E VS2 63 60 2818 5.64 5.6 3.54
9 0.81 Premium G SI1 63 60 2832 5.87 5.81 3.68
10 0.7 Premium G VS1 63 60 2838 5.64 5.57 3.53
# ℹ 270 more rows
# Get the 1% lowest-carat diamonds for each value of colordiamonds |>slice_min(carat, by = color, prop =0.01)
# A tibble: 926 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
2 0.2 Premium E VS2 59.8 62 367 3.79 3.77 2.26
3 0.2 Premium E VS2 59 60 367 3.81 3.78 2.24
4 0.2 Premium E VS2 61.1 59 367 3.81 3.78 2.32
5 0.2 Premium E VS2 59.7 62 367 3.84 3.8 2.28
6 0.2 Ideal E VS2 59.7 55 367 3.86 3.84 2.3
7 0.2 Very Good E VS2 63.4 59 367 3.74 3.71 2.36
8 0.2 Ideal E VS2 62.2 57 367 3.76 3.73 2.33
9 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
10 0.21 Very Good E VS2 63.2 54 386 3.82 3.78 2.4
# ℹ 916 more rows
The full set of slice_ functions
slice_head(n = 1) takes the first row.
slice_tail(n = 1) takes the last row
slice_min(x, n = 1) takes the row with the smallest value in column x.
slice_max(x, n = 1) takes the row with the largest value in column x.
slice_sample(n = 1) takes one random row.
slice(15) takes the 15th row.
5 The count() function
Above, we used the n() helper with summarize() to get the number of rows with each value of cut (i.e. the group sizes) along with other computed summaries.
Getting a “count table” like that is such a common part of Exploratory Data Analysis (EDA) that there’s another dplyr function available for it, count(). And unless you’re already summarize-ing, this one is simpler to use:
diamonds |>count(cut)
# A tibble: 5 × 2
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
We may also want to sort the output by frequency, which can be useful especially when there are many different values for the focal variable (not really the case here, but you get the point):
diamonds |>count(cut, sort =TRUE)
# A tibble: 5 × 2
cut n
<ord> <int>
1 Ideal 21551
2 Premium 13791
3 Very Good 12082
4 Good 4906
5 Fair 1610
And you can also create counts for value combinations across multiple columns:
diamonds |>count(cut, color)
# A tibble: 35 × 3
cut color n
<ord> <ord> <int>
1 Fair D 163
2 Fair E 224
3 Fair F 312
4 Fair G 314
5 Fair H 303
6 Fair I 175
7 Fair J 119
8 Good D 662
9 Good E 933
10 Good F 909
# ℹ 25 more rows
Base R’s table() function
Also worth mentioning is the base R table() function, which is similar to count(). While its output format is unwieldy for follow-up analyses1, you may prefer its formatting especially when wanting to glance at a 2-way count table to see patterns:
table(diamonds$cut, diamonds$color)
D E F G H I J
Fair 163 224 312 314 303 175 119
Good 662 933 909 871 702 522 307
Very Good 1513 2400 2164 2299 1824 1204 678
Premium 1603 2337 2331 2924 2360 1428 808
Ideal 2834 3903 3826 4884 3115 2093 896
Additionally, as a dplyr function, count() only works with data frames. Occasionally, you may also need to create a count table for a vector, and table() can do that:
cut_vec <- diamonds$cuttable(cut_vec)
cut_vec
Fair Good Very Good Premium Ideal
1610 4906 12082 13791 21551
6 Exercises II
6.1 No ties, please
Above, when we first used slice_max(), we got multiple rows for some groups. Check out the help for this function (?slice_max) and get it to print only one row per group, even in the case of ties.
Solution (click here)
The with_ties argument controls this. The default is TRUE (do include ties), so we want to set it to FALSE (don’t include ties):
diamonds |>slice_max(carat, by = cut, with_ties =FALSE)
# A tibble: 5 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
2 4.01 Premium I I1 61 61 15223 10.1 10.1 6.17
3 3.01 Good I SI2 63.9 60 18242 9.06 9.01 5.77
4 4 Very Good I I1 63.3 58 15984 10.0 9.94 6.31
5 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
6.2 Random diamonds
Use slice_sample() to get 5 random diamonds for each combination of cut, color and clarity.
Solution (click here)
diamonds |>slice_sample(n =5, by =c(cut, color, clarity))
# A tibble: 1,342 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.2 Ideal E SI2 61.3 56 6006 6.85 6.88 4.21
2 1.02 Ideal E SI2 62.6 54.6 4983 6.4 6.44 4.02
3 0.78 Ideal E SI2 61 56 3053 5.92 5.98 3.63
4 1.11 Ideal E SI2 61.2 57 4969 6.7 6.74 4.11
5 1 Ideal E SI2 60.7 55 4077 6.39 6.33 3.86
6 0.71 Premium E SI1 60.3 60 2674 5.77 5.74 3.47
7 0.82 Premium E SI1 61.2 61 3211 5.98 5.95 3.65
8 0.51 Premium E SI1 61.3 54 1546 5.18 5.13 3.16
9 0.57 Premium E SI1 62.2 55 1590 5.36 5.31 3.32
10 0.41 Premium E SI1 62.4 58 755 4.75 4.77 2.97
# ℹ 1,332 more rows
6.3count()
A) Create a sorted count table of diamond claritys
# A tibble: 56 × 3
clarity color n
<ord> <ord> <int>
1 I1 J 26
2 I1 D 32
3 IF D 39
4 IF J 40
5 VVS1 J 51
6 I1 I 65
7 VVS2 J 67
8 I1 E 84
9 I1 H 108
10 I1 G 114
# ℹ 46 more rows
6.4 Bonus: sort then slice
slice_head() will print the first-occurring row(s), and its twin slice_tail() prints the last row(s). In combination with sorting, these can give more flexibility than the min-max functions.
For each color, use arrange() and slice_head() to get the 3 diamonds with the highest carat, with ties in carat broken by price (highest first).
Solution (click here)
diamonds |>arrange(desc(carat), desc(price)) |>slice_head(n =3, by = color)
# A tibble: 21 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
2 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72
3 4.01 Premium J I1 62.5 62 15223 10.0 9.94 6.24
4 4.13 Fair H I1 64.8 61 17329 10 9.85 6.43
5 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
6 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
7 4.01 Premium I I1 61 61 15223 10.1 10.1 6.17
8 4 Very Good I I1 63.3 58 15984 10.0 9.94 6.31
9 3.67 Premium I I1 62.4 56 16193 9.86 9.81 6.13
10 3.4 Fair D I1 66.8 52 15964 9.42 9.34 6.27
# ℹ 11 more rows