The dplyr package provides a number of very useful functions for manipulating data frames.
Here we’re going to cover some of the most commonly used functions, and will also use pipes (%>%) to combine them.
select() to pick columns (variables)
filter() to pick rows (observations)
rename() to change column names
arrange() to change the order of rows (i.e., to sort a data frame)
mutate() to modify values in columns and create new columns
summarize() (with group_by()) to compute summaries across rows
Importantly, all these functions take a data frame as the input, and return a new, modified data frame as the output.
dplyr belongs to a broader family of R packages designed for “dataframe-centric” data science called the “Tidyverse”. The other tidyverse package that we’ll cover in today’s workshop is ggplot2 for making plots. You can find more information about the Tidyverse here: https://www.tidyverse.org/.
All of the tidyverse’s packages can be loaded at once as follows:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
That will print quote some output which tells you which packages have been loaded as part of the tidyverse, and which tidyverse functions “conflict with” (in the sense of having the same name as) R functions that were already in your environment.
2 The gapminder data set
In this section and in the afternoon, we will work with the gapminder data set. This data set is available in a package (while most packages are built around functions so as to extend R’s functionality, others, like this one, merely contain data sets), and we can load it as follows:
library(gapminder)
Let’s take a look at the dataset, which is stored in a data frame also called gapminder:
gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
The gapminder data frame is a so-called “tibble”, which is the tidyverse version of a data frame. The main difference is the nicer default printing behavior of tibbles: e.g. the data types of columns are shown, and only a limited number of rows will be printed (hence 1,694 more rows).
As for the dataset itself, note that each row contains some data for a single country in a specific year (across 5-year intervals between 1952 and 2007):
lifeExp is the life expectancy in years
pop is the population size
gdpPercap is the per-capita GDP
3select() to pick columns (variables)
To subset a data frame by keeping or removing certain columns, we can use the select() function.
By default, this function will only keep the columns that you specify, which you typically do simply by listing those columns by name:
In the command above, the first argument was the data frame, whereas the other arguments were the (unquoted!) names of columns we wanted to keep.
The order of the columns in the output data frame is exactly as you list them in select(), and doesn’t need to be the same as in the input data frame. In other words, select() is also one way to reorder columns. In the example above, we moved year to come before country, for example.
We can also specify columns that should be removed, by prefacing their name with a ! (or a -):
There are also ways to select ranges of columns, and to match columns by their partial names, but that is beyond the scope of this short workshop (check the select() help by typing ?select to learn more about this).
4rename() to change column names, and the pipe (%>%)
Our next dplyr function is one of the simplest: rename() to change column names.
The syntax to specify the new and old name within the function is new_name = old_name. For example, building on the column selection we did above, we may want to rename the gdpPercap column:
It is common to use several (dplyr) functions in succession to “wrangle” a dataframe into a format, and with the data, that we want. To do so, we could go on like we did above, successively assigning new data frames and moving on to the next step.
But there is a nicer way of dong this, using so-called “piping” with a pipe operator: we will use the %>% pipe operator.
Let’s start by seeing pipes into action with a reformulation of the code we used above to first select 3 columns and then rename 1 of them:
What happened here is that we took the gapminder data frame, pushed (or “piped”) it into the select() function, whose output was in turn piped into the rename() function.
You can think of the pipe as “then”: take gapminder, then select, then rename.
When using pipes, piped input replaces our previous way of specifying the input with the .data argument.
Under the hood, when you pipe something into a function, this will by default be passed to the first argument of the function. dplyr (and other tidyverse) functions are quite consistent with the first argument always being a data frame, which makes them “pipe-friendly”.
Using pipes is slightly less typing and more readable than successive assignments. (It is also faster and uses less computer memory.)
5filter() to pick rows (observations)
The filter() function can be used to keep only rows that match a condition. Whereas column-selection often simply happens by column name as we’ve seen with select(), row-selection tends to be a more intricate and interesting topic.
Let’s start with the following example, where we want to keep observations (remember, these are countries in a given year) that have a life expectancy (column lifeExp) greater than 80 years:
gapminder %>%filter(lifeExp >80)
# A tibble: 21 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Australia Oceania 2002 80.4 19546792 30688.
2 Australia Oceania 2007 81.2 20434176 34435.
3 Canada Americas 2007 80.7 33390141 36319.
4 France Europe 2007 80.7 61083916 30470.
5 Hong Kong, China Asia 2002 81.5 6762476 30209.
6 Hong Kong, China Asia 2007 82.2 6980412 39725.
7 Iceland Europe 2002 80.5 288030 31163.
8 Iceland Europe 2007 81.8 301931 36181.
9 Israel Asia 2007 80.7 6426679 25523.
10 Italy Europe 2002 80.2 57926999 27968.
# ℹ 11 more rows
So, we specify a condition based on the values in at least one column, and only the rows that satisfy this condition will be kept.
These conditions don’t have to be based on numeric comparisons – for example:
gapminder %>%filter(continent =="Europe")
# A tibble: 360 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Albania Europe 1952 55.2 1282697 1601.
2 Albania Europe 1957 59.3 1476505 1942.
3 Albania Europe 1962 64.8 1728137 2313.
4 Albania Europe 1967 66.2 1984060 2760.
5 Albania Europe 1972 67.7 2263554 3313.
6 Albania Europe 1977 68.9 2509048 3533.
7 Albania Europe 1982 70.4 2780097 3631.
8 Albania Europe 1987 72 3075321 3739.
9 Albania Europe 1992 71.6 3326498 2497.
10 Albania Europe 1997 73.0 3428038 3193.
# ℹ 350 more rows
(Remember to use two equals signs == to test for equality!)
5.1 Filter based on multiple conditions
It’s also possible to filter based on multiple conditions:
gapminder %>%filter(continent =="Asia", year ==2007, lifeExp >80)
# A tibble: 3 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Hong Kong, China Asia 2007 82.2 6980412 39725.
2 Israel Asia 2007 80.7 6426679 25523.
3 Japan Asia 2007 82.6 127467972 31656.
By default, multiple conditions are combined in an AND fashion — in other words, in a given row, each condition needs to be met for that column to be kept.
If you want to combine conditions in an OR fashion, you should use a | as follows:
# A tibble: 300 × 3
year country gdp_per_capita
<int> <fct> <dbl>
1 1952 Argentina 5911.
2 1957 Argentina 6857.
3 1962 Argentina 7133.
4 1967 Argentina 8053.
5 1972 Argentina 9443.
6 1977 Argentina 10079.
7 1982 Argentina 8998.
8 1987 Argentina 9140.
9 1992 Argentina 9308.
10 1997 Argentina 10967.
# ℹ 290 more rows
5.2 Challenge 1
Write a single command (which can span multiple lines and include pipes) that will produce a data frame that has lifeExp, country, and year for Africa but not for other continents. How many rows does your data frame have?
The arrange() function is like the sort function in Excel: it changes the order of the rows based on the values in one or more columns. For example, our data set gapminder is currently sorted alphabetically by country and then by year, but we may instead want to sort observations by population size:
gapminder %>%arrange(pop)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Sao Tome and Principe Africa 1952 46.5 60011 880.
2 Sao Tome and Principe Africa 1957 48.9 61325 861.
3 Djibouti Africa 1952 34.8 63149 2670.
4 Sao Tome and Principe Africa 1962 51.9 65345 1072.
5 Sao Tome and Principe Africa 1967 54.4 70787 1385.
6 Djibouti Africa 1957 37.3 71851 2865.
7 Sao Tome and Principe Africa 1972 56.5 76595 1533.
8 Sao Tome and Principe Africa 1977 58.6 86796 1738.
9 Djibouti Africa 1962 39.7 89898 3021.
10 Sao Tome and Principe Africa 1982 60.4 98593 1890.
# ℹ 1,694 more rows
Sorting can be useful to see the observations with the smallest or largest values for a certain column: above we see that the country and year with the smallest population size is Sao Tome and Principe in 1952.
Default sorting is from small to large, but of course, we may also want to sort in the reverse order. You can do this using the desc() (descending, large-to-small) helper function:
gapminder %>%arrange(desc(pop))
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 China Asia 2007 73.0 1318683096 4959.
2 China Asia 2002 72.0 1280400000 3119.
3 China Asia 1997 70.4 1230075000 2289.
4 China Asia 1992 68.7 1164970000 1656.
5 India Asia 2007 64.7 1110396331 2452.
6 China Asia 1987 67.3 1084035000 1379.
7 India Asia 2002 62.9 1034172547 1747.
8 China Asia 1982 65.5 1000281000 962.
9 India Asia 1997 61.8 959000000 1459.
10 China Asia 1977 64.0 943455000 741.
# ℹ 1,694 more rows
Finally, it is common to want to sort by multiple columns, where ties in the first column are broken by a second column (and so on) – to do so, simply list the columns in the appropriate order:
gapminder %>%arrange(continent, country)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Algeria Africa 1952 43.1 9279525 2449.
2 Algeria Africa 1957 45.7 10270856 3014.
3 Algeria Africa 1962 48.3 11000948 2551.
4 Algeria Africa 1967 51.4 12760499 3247.
5 Algeria Africa 1972 54.5 14760787 4183.
6 Algeria Africa 1977 58.0 17152804 4910.
7 Algeria Africa 1982 61.4 20033753 5745.
8 Algeria Africa 1987 65.8 23254956 5681.
9 Algeria Africa 1992 67.7 26298373 5023.
10 Algeria Africa 1997 69.2 29072015 4797.
# ℹ 1,694 more rows
The above example sorts first by continent and then by country.
7mutate() to modify values in columns and create new columns
So far, we’ve focused on functions that “merely” subset and reorganize data frames. We’ve also seen how we can modify column names. But we haven’t seen how we can change the data or compute derived data in data frames.
We can do this with the mutate() function. For example, say that we want to create a new column that has the population size in millions rather than in individuals:
called pop_million that is the result of dividing the values in the pop column by a million.
gapminder %>%mutate(pop_million = pop /10^6)
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap pop_million
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 8.43
2 Afghanistan Asia 1957 30.3 9240934 821. 9.24
3 Afghanistan Asia 1962 32.0 10267083 853. 10.3
4 Afghanistan Asia 1967 34.0 11537966 836. 11.5
5 Afghanistan Asia 1972 36.1 13079460 740. 13.1
6 Afghanistan Asia 1977 38.4 14880372 786. 14.9
7 Afghanistan Asia 1982 39.9 12881816 978. 12.9
8 Afghanistan Asia 1987 40.8 13867957 852. 13.9
9 Afghanistan Asia 1992 41.7 16317921 649. 16.3
10 Afghanistan Asia 1997 41.8 22227415 635. 22.2
# ℹ 1,694 more rows
So, the code above created a new column called pop_million that is the result of dividing the values in the pop column by a million.
To modify a column rather than adding a new one, simply assign back to the same name:
gapminder %>%mutate(pop = pop /10^6)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8.43 779.
2 Afghanistan Asia 1957 30.3 9.24 821.
3 Afghanistan Asia 1962 32.0 10.3 853.
4 Afghanistan Asia 1967 34.0 11.5 836.
5 Afghanistan Asia 1972 36.1 13.1 740.
6 Afghanistan Asia 1977 38.4 14.9 786.
7 Afghanistan Asia 1982 39.9 12.9 978.
8 Afghanistan Asia 1987 40.8 13.9 852.
9 Afghanistan Asia 1992 41.7 16.3 649.
10 Afghanistan Asia 1997 41.8 22.2 635.
# ℹ 1,694 more rows
7.1 Challenge 2
A: Create a new column called gdp_billion that has the absolute GDP (i.e., not relative to population size) in units of billions.
Click for the solution
gapminder %>%mutate(gdp_billion = gdpPercap * pop /10^9)
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap gdp_billion
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6.57
2 Afghanistan Asia 1957 30.3 9240934 821. 7.59
3 Afghanistan Asia 1962 32.0 10267083 853. 8.76
4 Afghanistan Asia 1967 34.0 11537966 836. 9.65
5 Afghanistan Asia 1972 36.1 13079460 740. 9.68
6 Afghanistan Asia 1977 38.4 14880372 786. 11.7
7 Afghanistan Asia 1982 39.9 12881816 978. 12.6
8 Afghanistan Asia 1987 40.8 13867957 852. 11.8
9 Afghanistan Asia 1992 41.7 16317921 649. 10.6
10 Afghanistan Asia 1997 41.8 22227415 635. 14.1
# ℹ 1,694 more rows
B: (Bonus) Create a new column planet that has the value earth in every row.
Click for the solution
If you simply provide a value, this will be repeated in every row:
gapminder %>%mutate(planet ="earth")
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap planet
<fct> <fct> <int> <dbl> <int> <dbl> <chr>
1 Afghanistan Asia 1952 28.8 8425333 779. earth
2 Afghanistan Asia 1957 30.3 9240934 821. earth
3 Afghanistan Asia 1962 32.0 10267083 853. earth
4 Afghanistan Asia 1967 34.0 11537966 836. earth
5 Afghanistan Asia 1972 36.1 13079460 740. earth
6 Afghanistan Asia 1977 38.4 14880372 786. earth
7 Afghanistan Asia 1982 39.9 12881816 978. earth
8 Afghanistan Asia 1987 40.8 13867957 852. earth
9 Afghanistan Asia 1992 41.7 16317921 649. earth
10 Afghanistan Asia 1997 41.8 22227415 635. earth
# ℹ 1,694 more rows
8summarize() to compute (per-group) summary statistics
In combination with group_by(), the summarize() function can compute data summaries across groups of rows of a data frame.
First, let’s see what summarize() does when used by itself:
Above, we computed the mean for two columns, across all rows. This is already useful, but in combination with the helper function group_by(), summarize() becomes really powerful.
For example, let’s compute the mean GDP and mean life expectancy separately for each continent:
# A tibble: 5 × 3
continent mean_gdpPercap mean_lifeExp
<fct> <dbl> <dbl>
1 Africa 2194. 48.9
2 Americas 7136. 64.7
3 Asia 7902. 60.1
4 Europe 14469. 71.9
5 Oceania 18622. 74.3
group_by() implicitly splits a data frame into groups of rows: here, one group for observations from each continent. After that, operations like in summarize() will happen separately for each group, which is how we ended up with per-continent means.
Finally, another powerful feature is that we can group by multiple variables – for example, by yearandcontinent:
Then, arrange that dataframe in two directions to see the countries with the longest and shortest life expectance – piping into head() as a bonus to only see the top n, here top 1:
# A tibble: 1 × 2
country mean_lifeExp
<fct> <dbl>
1 Iceland 76.5
9 Bonus: count() and n()
A very common operation is to count the number of observations for each group. The dplyr package comes with two related functions that help with this.
For instance, if we wanted to check the number of countries included in the dataset for the year 2002, we can use the count() function. It takes the name of one or more columns that contain the groups we are interested in, and we can optionally sort the results in descending order by adding sort = TRUE:
# A tibble: 5 × 2
continent n
<fct> <int>
1 Africa 52
2 Asia 33
3 Europe 30
4 Americas 25
5 Oceania 2
If we need to use the number of observations in calculations, the n() function is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of the life expectancy per continent:
In your journey to become a skilled data frame wrangler in R, here are some additional topics that are very useful but beyond the scope of this workshop:
Joining/merging – combining multiple dataframes based on one or more shared columns. This can be done with dplyr’s join_*() functions.
Pivoting/reshaping – moving between ‘wide’ and ‘long’ data formats with pivot_wider() and pivot_longer() – this is covered in episode 13 of our focal Carpentries lesson.