# 1 - Introduction -------------------------------------------------------------
# 1.3 - Loading the tidyverse
# 1.4 - The gapminder dataset
# 2 - select() to pick columns (variables) -------------------------------------
# 3 - rename() to change column names ------------------------------------------
# 4 - The pipe |> --------------------------------------------------------------
# 5 - filter() to pick rows (observations) -------------------------------------
# Challenge 1
# Write a single command (which can span multiple lines and include pipes) that
# will produce a data frame that has lifeExp, country, and year for Africa but not
# for other continents.
# How many rows does your data frame have?
# 6 - arrange() to sort data frames --------------------------------------------
# 7 - mutate() to modify values in columns and create new columns --------------
# Challenge 2
# A: Use mutate() to create a new column called gdp_billion that has the absolute
# GDP (i.e., not relative to population size) and does so in units of billions
# (i.e., 2.3 would mean 2.3 billion).
# B: (Bonus) Use mutate() to create a new column planet that has the value earth
# in every row. We didn’t cover this kind of scenario but go ahead and test
# your intuition!
# 8 - summarize() to compute group-wise summary stats --------------------------
# Challenge 3
# Calculate the average life expectancy per country. Which has the longest average
# life expectancy and which has the shortest average life expectancy?
Data frame manipulation with dplyr
1 Introduction
1.1 The dplyr package and the tidyverse
The dplyr package provides a number of very useful functions for manipulating data frames.
In this session, we’ll cover some of the most commonly used functions:
select()
to pick columns (variables)filter()
to pick rows (observations)rename()
to change column namesarrange()
to change the order of rows (i.e., to sort a data frame)mutate()
to modify values in columns and create new columnssummarize()
to compute across-row summaries
All these functions take a data frame as the input, and output a new, modified data frame.
dplyr belongs to a family of R packages designed for “dataframe-centric” data science called the “tidyverse”. The other tidyverse package we’ll cover in today’s workshop is ggplot2 for making plots.
1.2 Setting up
Let’s start a new script for this session:
Open a new R script (Click the
+
symbol in toolbar at the top, then clickR Script
)1.Save the script straight away as
data-structures.R
– you can save it anywhere you like, though it is probably best to save it in a folder specifically for this workshop.If you want the section headers as comments in your script, as in the script I am showing you now, then copy-and-paste the following into your script:
Section headers for your script (Click to expand)
1.3 Loading the tidyverse
All core tidyverse packages can be installed and loaded with a single command. Since you should already have installed the tidyverse2, you only need to load it, which you do as follows:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The output tells you which packages have been loaded as part of the tidyverse.
Also printed above is a “Conflicts” section that may seem ominous. What this means is that two tidyverse functions, filter()
and lag()
, have the same names as two functions from the stats package that were already in your R environment.
(Those stats package functions are part of what is often referred to as “base R”: core R functionality that is always available (loaded) when you start R.)
Due to this function name conflict/collision, for example, the filter()
function from dplyr “masks” the filter()
function from stats: that is, if you write a command with filter()
, it will use the dplyr function and not the stats function.
You can still use a “masked” function, but you will need to preface it with its package name as follows: stats::filter()
.
1.4 The gapminder
data set
In this session and the next one on data visualization, we will work with the gapminder
data set. This data set is available in a package of the same name (while most packages are built around functions so as to extend R’s functionality, others merely contain data sets).
This package, too, you should have already installed3, so you only need to load it right now:
library(gapminder)
Let’s take a look at the dataset:
gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
We can also use the View()
function to look at the data frame — this will open a new tab in our editor pane with a spreadsheet-like look and feel:
View(gapminder)
# (Should display the dataset in an editor pane tab)
The gapminder
data frame is a so-called “tibble”, which is the tidyverse variant of a data frame. The main difference is the nicer default printing behavior of tibbles: e.g. the data types of columns are shown, and only a limited number of rows are printed.
As for the dataset itself, note that each row contains some data for a single country in a specific year (across 5-year intervals between 1952 and 2007), with the following columns:
country
(stored as afactor
)continent
(stored as afactor
)year
(stored as aninteger
)lifeExp
(stored as aninteger
): the life expectancy in yearspop
(stored as aninteger
): the population sizegdpPercap
(stored as adouble
): the per-capita GDP
2 select()
to pick columns (variables)
To subset a data frame by keeping or removing certain columns, we can use the select()
function.
By default, this function will only keep the columns that you specify, which you typically do simply by listing those columns by name:
select(.data = gapminder, year, country, gdpPercap)
# A tibble: 1,704 × 3
year country gdpPercap
<int> <fct> <dbl>
1 1952 Afghanistan 779.
2 1957 Afghanistan 821.
3 1962 Afghanistan 853.
4 1967 Afghanistan 836.
5 1972 Afghanistan 740.
6 1977 Afghanistan 786.
7 1982 Afghanistan 978.
8 1987 Afghanistan 852.
9 1992 Afghanistan 649.
10 1997 Afghanistan 635.
# ℹ 1,694 more rows
In the command above, the first argument was the data frame, whereas the other arguments were the (unquoted!) names of columns we wanted to keep.
The order of the columns in the output data frame is exactly as you list them in select()
, and doesn’t need to be the same as in the input data frame. In other words, select()
is also one way to reorder columns. In the example above, we made year
appear before country
, for example.
We can also specify columns that should be removed, by prefacing their name with a !
(or a -
):
select(.data = gapminder, !continent)
# A tibble: 1,704 × 5
country year lifeExp pop gdpPercap
<fct> <int> <dbl> <int> <dbl>
1 Afghanistan 1952 28.8 8425333 779.
2 Afghanistan 1957 30.3 9240934 821.
3 Afghanistan 1962 32.0 10267083 853.
4 Afghanistan 1967 34.0 11537966 836.
5 Afghanistan 1972 36.1 13079460 740.
6 Afghanistan 1977 38.4 14880372 786.
7 Afghanistan 1982 39.9 12881816 978.
8 Afghanistan 1987 40.8 13867957 852.
9 Afghanistan 1992 41.7 16317921 649.
10 Afghanistan 1997 41.8 22227415 635.
# ℹ 1,694 more rows
3 rename()
to change column names
The next dplyr function is one of the simplest: rename()
to change column names.
The syntax to specify the new and old name within the function is new_name = old_name
— for example, we may want to rename the gdpPercap
column:
rename(.data = gapminder, gdp_per_capita = gdpPercap)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdp_per_capita
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
We can also rename multiple columns at once:
rename(.data = gapminder, gdp_per_capita = gdpPercap, life_exp = lifeExp)
# A tibble: 1,704 × 6
country continent year life_exp pop gdp_per_capita
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
4 The pipe (|>
)
So far, our examples applied a single dplyr function to a data frame, and simply printed the output (a new data frame) to screen.
But in practice, it is common to use several functions in succession to “wrangle” a dataframe into the format we want. For example, if we want to first select()
one or more columns, and then modify the output further by rename()
-ing a column, we could:
<- select(.data = gapminder, year, country, gdpPercap)
gapminder_sel
rename(.data = gapminder_sel, gdp_per_capita = gdpPercap)
# A tibble: 1,704 × 3
year country gdp_per_capita
<int> <fct> <dbl>
1 1952 Afghanistan 779.
2 1957 Afghanistan 821.
3 1962 Afghanistan 853.
4 1967 Afghanistan 836.
5 1972 Afghanistan 740.
6 1977 Afghanistan 786.
7 1982 Afghanistan 978.
8 1987 Afghanistan 852.
9 1992 Afghanistan 649.
10 1997 Afghanistan 635.
# ℹ 1,694 more rows
For more elaborate editing, we could go on like we did above, successively assigning new data frames and moving on to the next step.
But there is a nicer way of dong this, using so-called “piping” with a pipe operator (|>
). Let’s start by seeing a reformulation of the code above with pipes:
|>
gapminder select(year, country, gdpPercap) |>
rename(gdp_per_capita = gdpPercap)
# A tibble: 1,704 × 3
year country gdp_per_capita
<int> <fct> <dbl>
1 1952 Afghanistan 779.
2 1957 Afghanistan 821.
3 1962 Afghanistan 853.
4 1967 Afghanistan 836.
5 1972 Afghanistan 740.
6 1977 Afghanistan 786.
7 1982 Afghanistan 978.
8 1987 Afghanistan 852.
9 1992 Afghanistan 649.
10 1997 Afghanistan 635.
# ℹ 1,694 more rows
What happened here is that we took the gapminder
data frame, pushed (or “piped”) it into the select()
function, whose output was in turn piped into the rename()
function.
You can think of the pipe as “then”: take gapminder
, then select, then rename. Using pipes is slightly less typing and considerably more readable than using successive assignments like we did before.4
When using pipes, we no longer specify the input data frame with the .data
argument, because the input is “already” being piped into the function (by default, the input goes to the function’s first argument).
5 filter()
to pick rows (observations)
The filter()
function keeps only those rows that satisfy one or more conditions. It is similar to Filter functionality in spreadsheets — except that those only change what you display, while filter()
will remove rows. But if that sounds scary, recall what we mentioned earlier:
All these functions take a data frame as the input, and output a new, modified data frame.
Let’s start with the following example, where we want to keep observations (remember, these are countries in a given year) with a life expectancy exceeding 80 years:
|>
gapminder filter(lifeExp > 80)
# A tibble: 21 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Australia Oceania 2002 80.4 19546792 30688.
2 Australia Oceania 2007 81.2 20434176 34435.
3 Canada Americas 2007 80.7 33390141 36319.
4 France Europe 2007 80.7 61083916 30470.
5 Hong Kong, China Asia 2002 81.5 6762476 30209.
6 Hong Kong, China Asia 2007 82.2 6980412 39725.
7 Iceland Europe 2002 80.5 288030 31163.
8 Iceland Europe 2007 81.8 301931 36181.
9 Israel Asia 2007 80.7 6426679 25523.
10 Italy Europe 2002 80.2 57926999 27968.
# ℹ 11 more rows
So, we specify a condition based on the values in at least one column to only keep rows satisfying this condition.
These conditions don’t have to be based on numeric comparisons – for example:
|>
gapminder filter(continent == "Europe")
# A tibble: 360 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Albania Europe 1952 55.2 1282697 1601.
2 Albania Europe 1957 59.3 1476505 1942.
3 Albania Europe 1962 64.8 1728137 2313.
4 Albania Europe 1967 66.2 1984060 2760.
5 Albania Europe 1972 67.7 2263554 3313.
6 Albania Europe 1977 68.9 2509048 3533.
7 Albania Europe 1982 70.4 2780097 3631.
8 Albania Europe 1987 72 3075321 3739.
9 Albania Europe 1992 71.6 3326498 2497.
10 Albania Europe 1997 73.0 3428038 3193.
# ℹ 350 more rows
==
to test for equality!
Filter based on multiple conditions
It’s also possible to filter based on multiple conditions – for example:
|>
gapminder filter(continent == "Asia", year == 2007, lifeExp > 80)
# A tibble: 3 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Hong Kong, China Asia 2007 82.2 6980412 39725.
2 Israel Asia 2007 80.7 6426679 25523.
3 Japan Asia 2007 82.6 127467972 31656.
By default, multiple conditions are combined in an AND fashion — in other words, in a given row, each condition needs to be met for that column to be kept.
If you want to combine conditions in an OR fashion, you should use a |
as follows:
|>
gapminder filter(lifeExp > 80 | gdpPercap > 10000)
# A tibble: 392 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Argentina Americas 1977 68.5 26983828 10079.
2 Argentina Americas 1997 73.3 36203463 10967.
3 Argentina Americas 2007 75.3 40301927 12779.
4 Australia Oceania 1952 69.1 8691212 10040.
5 Australia Oceania 1957 70.3 9712569 10950.
6 Australia Oceania 1962 70.9 10794968 12217.
7 Australia Oceania 1967 71.1 11872264 14526.
8 Australia Oceania 1972 71.9 13177000 16789.
9 Australia Oceania 1977 73.5 14074100 18334.
10 Australia Oceania 1982 74.7 15184200 19477.
# ℹ 382 more rows
Pipeline practice
Finally, let’s practice a bit more with “pipelines” that use multiple dplyr verbs:
|>
gapminder filter(continent == "Americas") |>
select(year, country, gdpPercap) |>
rename(gdp_per_capita = gdpPercap)
# A tibble: 300 × 3
year country gdp_per_capita
<int> <fct> <dbl>
1 1952 Argentina 5911.
2 1957 Argentina 6857.
3 1962 Argentina 7133.
4 1967 Argentina 8053.
5 1972 Argentina 9443.
6 1977 Argentina 10079.
7 1982 Argentina 8998.
8 1987 Argentina 9140.
9 1992 Argentina 9308.
10 1997 Argentina 10967.
# ℹ 290 more rows
Challenge 1
Write a single command (which can span multiple lines and include pipes) that will produce a data frame that has lifeExp
, country
, and year
for Africa but not for other continents. How many rows does your data frame have?
Click for the solution
|>
gapminder filter(continent == "Africa") |>
select(year, country, lifeExp)
# A tibble: 624 × 3
year country lifeExp
<int> <fct> <dbl>
1 1952 Algeria 43.1
2 1957 Algeria 45.7
3 1962 Algeria 48.3
4 1967 Algeria 51.4
5 1972 Algeria 54.5
6 1977 Algeria 58.0
7 1982 Algeria 61.4
8 1987 Algeria 65.8
9 1992 Algeria 67.7
10 1997 Algeria 69.2
# ℹ 614 more rows
6 arrange()
to sort data frames
The arrange()
function is like sorting functionality in Excel: it changes the order of rows based on the values in one or more columns.
For example, gapminder
is currently sorted alphabetically by country
and then by year
, but we may instead want to sort by population size:
|>
gapminder arrange(pop)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Sao Tome and Principe Africa 1952 46.5 60011 880.
2 Sao Tome and Principe Africa 1957 48.9 61325 861.
3 Djibouti Africa 1952 34.8 63149 2670.
4 Sao Tome and Principe Africa 1962 51.9 65345 1072.
5 Sao Tome and Principe Africa 1967 54.4 70787 1385.
6 Djibouti Africa 1957 37.3 71851 2865.
7 Sao Tome and Principe Africa 1972 56.5 76595 1533.
8 Sao Tome and Principe Africa 1977 58.6 86796 1738.
9 Djibouti Africa 1962 39.7 89898 3021.
10 Sao Tome and Principe Africa 1982 60.4 98593 1890.
# ℹ 1,694 more rows
Sorting can be useful to see the observations with the smallest or largest values for a certain column: above we see that the country and year with the smallest population size is Sao Tome and Principe in 1952.
Default sorting is from small to large, but of course, we may also want to sort in the reverse order. You can do this using the desc()
(descending, large-to-small) helper function:
|>
gapminder arrange(desc(pop))
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 China Asia 2007 73.0 1318683096 4959.
2 China Asia 2002 72.0 1280400000 3119.
3 China Asia 1997 70.4 1230075000 2289.
4 China Asia 1992 68.7 1164970000 1656.
5 India Asia 2007 64.7 1110396331 2452.
6 China Asia 1987 67.3 1084035000 1379.
7 India Asia 2002 62.9 1034172547 1747.
8 China Asia 1982 65.5 1000281000 962.
9 India Asia 1997 61.8 959000000 1459.
10 China Asia 1977 64.0 943455000 741.
# ℹ 1,694 more rows
Finally, it is common to want to sort by multiple columns, where ties in the first column are broken by a second column (and so on) – to do so, simply list the columns in the appropriate order:
|>
gapminder arrange(continent, country)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Algeria Africa 1952 43.1 9279525 2449.
2 Algeria Africa 1957 45.7 10270856 3014.
3 Algeria Africa 1962 48.3 11000948 2551.
4 Algeria Africa 1967 51.4 12760499 3247.
5 Algeria Africa 1972 54.5 14760787 4183.
6 Algeria Africa 1977 58.0 17152804 4910.
7 Algeria Africa 1982 61.4 20033753 5745.
8 Algeria Africa 1987 65.8 23254956 5681.
9 Algeria Africa 1992 67.7 26298373 5023.
10 Algeria Africa 1997 69.2 29072015 4797.
# ℹ 1,694 more rows
The above example sorts first by continent and then by country.
7 mutate()
to modify values in columns and create new columns
So far, we’ve focused on functions that “merely” subset and reorganize data frames. We’ve also seen how we can modify column names. But we haven’t seen how we can change the data or compute derived data in data frames.
We can do this with the mutate()
function. For example, say that we want to create a new column that has population sizes in millions rather than in individuals:
|>
gapminder mutate(pop_million = pop / 10^6)
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap pop_million
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 8.43
2 Afghanistan Asia 1957 30.3 9240934 821. 9.24
3 Afghanistan Asia 1962 32.0 10267083 853. 10.3
4 Afghanistan Asia 1967 34.0 11537966 836. 11.5
5 Afghanistan Asia 1972 36.1 13079460 740. 13.1
6 Afghanistan Asia 1977 38.4 14880372 786. 14.9
7 Afghanistan Asia 1982 39.9 12881816 978. 12.9
8 Afghanistan Asia 1987 40.8 13867957 852. 13.9
9 Afghanistan Asia 1992 41.7 16317921 649. 16.3
10 Afghanistan Asia 1997 41.8 22227415 635. 22.2
# ℹ 1,694 more rows
The code above created a new column called pop_million
that is the result of dividing the values in the pop
column by a million.
To modify a column rather than adding a new one, simply “assign back to the same name”:
|>
gapminder mutate(pop = pop / 10^6)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8.43 779.
2 Afghanistan Asia 1957 30.3 9.24 821.
3 Afghanistan Asia 1962 32.0 10.3 853.
4 Afghanistan Asia 1967 34.0 11.5 836.
5 Afghanistan Asia 1972 36.1 13.1 740.
6 Afghanistan Asia 1977 38.4 14.9 786.
7 Afghanistan Asia 1982 39.9 12.9 978.
8 Afghanistan Asia 1987 40.8 13.9 852.
9 Afghanistan Asia 1992 41.7 16.3 649.
10 Afghanistan Asia 1997 41.8 22.2 635.
# ℹ 1,694 more rows
Challenge 2
A: Use mutate()
to create a new column called gdp_billion
that has the absolute GDP (i.e., not relative to population size) and does so in units of billions (i.e., 2.3
would mean 2.3 billion).
Click for the solution
|>
gapminder mutate(gdp_billion = gdpPercap * pop / 10^9)
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap gdp_billion
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6.57
2 Afghanistan Asia 1957 30.3 9240934 821. 7.59
3 Afghanistan Asia 1962 32.0 10267083 853. 8.76
4 Afghanistan Asia 1967 34.0 11537966 836. 9.65
5 Afghanistan Asia 1972 36.1 13079460 740. 9.68
6 Afghanistan Asia 1977 38.4 14880372 786. 11.7
7 Afghanistan Asia 1982 39.9 12881816 978. 12.6
8 Afghanistan Asia 1987 40.8 13867957 852. 11.8
9 Afghanistan Asia 1992 41.7 16317921 649. 10.6
10 Afghanistan Asia 1997 41.8 22227415 635. 14.1
# ℹ 1,694 more rows
B: (Bonus) Use mutate()
to create a new column planet
that has the value earth
in every row. We didn’t cover this kind of scenario, but go ahead and test your intuition!
Click for the solution
If you simply provide a value, this will be repeated in every row:
|>
gapminder mutate(planet = "earth")
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap planet
<fct> <fct> <int> <dbl> <int> <dbl> <chr>
1 Afghanistan Asia 1952 28.8 8425333 779. earth
2 Afghanistan Asia 1957 30.3 9240934 821. earth
3 Afghanistan Asia 1962 32.0 10267083 853. earth
4 Afghanistan Asia 1967 34.0 11537966 836. earth
5 Afghanistan Asia 1972 36.1 13079460 740. earth
6 Afghanistan Asia 1977 38.4 14880372 786. earth
7 Afghanistan Asia 1982 39.9 12881816 978. earth
8 Afghanistan Asia 1987 40.8 13867957 852. earth
9 Afghanistan Asia 1992 41.7 16317921 649. earth
10 Afghanistan Asia 1997 41.8 22227415 635. earth
# ℹ 1,694 more rows
8 summarize()
to compute (groupwise) summary stats
In combination with group_by()
, the summarize()
function can compute data summaries across groups of rows of a data frame.
First, let’s see what summarize()
does when used by itself:
|>
gapminder summarize(mean_gdpPercap = mean(gdpPercap),
mean_lifeExp = mean(lifeExp))
# A tibble: 1 × 2
mean_gdpPercap mean_lifeExp
<dbl> <dbl>
1 7215. 59.5
Above, we computed the mean for two columns, across all rows. This is already useful, but in combination with the helper function group_by()
, summarize()
becomes really powerful by allowing us to compute groupwise stats.
For example, let’s compute the mean GDP and mean life expectancy separately for each continent:
|>
gapminder group_by(continent) |>
summarize(mean_gdpPercap = mean(gdpPercap),
mean_lifeExp = mean(lifeExp))
# A tibble: 5 × 3
continent mean_gdpPercap mean_lifeExp
<fct> <dbl> <dbl>
1 Africa 2194. 48.9
2 Americas 7136. 64.7
3 Asia 7902. 60.1
4 Europe 14469. 71.9
5 Oceania 18622. 74.3
group_by()
implicitly splits a data frame into groups of rows: here, one group for observations from each continent. After that, operations like in summarize()
will happen separately for each group, which is how we ended up with per-continent means.
Finally, another powerful feature is that we can group by multiple variables – for example, by year
and continent
:
|>
gapminder group_by(continent, year) |>
summarize(mean_gdpPercap = mean(gdpPercap),
mean_lifeExp = mean(lifeExp))
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
# A tibble: 60 × 4
# Groups: continent [5]
continent year mean_gdpPercap mean_lifeExp
<fct> <int> <dbl> <dbl>
1 Africa 1952 1253. 39.1
2 Africa 1957 1385. 41.3
3 Africa 1962 1598. 43.3
4 Africa 1967 2050. 45.3
5 Africa 1972 2340. 47.5
6 Africa 1977 2586. 49.6
7 Africa 1982 2482. 51.6
8 Africa 1987 2283. 53.3
9 Africa 1992 2282. 53.6
10 Africa 1997 2379. 53.6
# ℹ 50 more rows
Challenge 3
Calculate the average life expectancy per country. Which has the longest average life expectancy and which has the shortest average life expectancy?
Click for the solution
First, let’s create a dataframe with the mean life expectancy by country:
<- gapminder |>
lifeExp_bycountry group_by(country) |>
summarize(mean_lifeExp = mean(lifeExp))
Then, arrange that dataframe in two directions to see the countries with the longest and shortest life expectance – piping into head()
as a bonus to only see the top n, here top 1:
|>
lifeExp_bycountry arrange(mean_lifeExp) |>
head(n = 1)
# A tibble: 1 × 2
country mean_lifeExp
<fct> <dbl>
1 Sierra Leone 36.8
|>
lifeExp_bycountry arrange(desc(mean_lifeExp)) |>
head(n = 1)
# A tibble: 1 × 2
country mean_lifeExp
<fct> <dbl>
1 Iceland 76.5
9 Bonus material for self-study
9.1 Writing and reading tabular data to and from files
When working with your own data in R, it is common to want to read data from files into your R environment, and write data that is in your R environment to files.
While it’s possible to have R interact with Excel spreadsheet files5, keeping your data in plain-text files generally benefits reproducibility. Tabular plain text files can be stored using:
- A Tab as the column delimiter (often called TSV files, and stored with a
.tsv
extension) - A comma as the column delimiter (often called CSV files, and stored with a
.csv
extension).
We will use functions from the readr package, which is part of the core tidyverse and therefore already loaded into your environment, to practice writing and reading data to and from TSV files.
Writing files
First, let’s write the gapminder
data frame to file using the write_tsv()
function with arguments x
for the R object and file
for the file path:
write_tsv(x = gapminder, file = "gapminder.tsv")
This will have written to a file called gapminder.tsv
in your R working directory (you can see where that is at the top of the Console, or by running getwd()
).
In RStudio’s Files pane, find your new file and click on it. That way, the file will open in the editor.
The file
argument to write_tsv()
takes a file path, meaning that you can specify any location on your computer for it it in addition to its name.
For example, if you had a folder called results
in your current working directory (directory is just another word for folder), you could put the file in there:
write_tsv(x = gapminder, file = "results/gapminder.tsv")
Note that a forward slash /
as a folder delimiter will work regardless of your operating system (even though Windows natively delimits folder by a backslash \
).
Finally, if you want to store the file in a totally different place than where you are now, note that you can also use so-called absolute (or “full”) paths like:
write_tsv(x = gapminder, file = "/Users/poelstra.1/Desktop/gapminder.tsv")
Reading from files
To practice reading data from a file, we’ll use the read_tsv()
function on the file we just created:
<- read_tsv(file = "gapminder.tsv") gapminder_reread
Rows: 1704 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): country, continent
dbl (4): year, lifeExp, pop, gdpPercap
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note that the function is rather chatty by default, telling us how many rows and columns it read in, and what their data types are. Let’s check the resulting object:
gapminder_reread
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
This looks good! Do note that the column’s data types are not identical to what they were (year
and pop
are saved as double
rather than integer
, and country
and continent
as character
rather than factor
). This is largely expected because that kind of metadata is not stored in a plain-text TSV, so read_tsv()
will by default simply make best guesses as to the types.
Alternatively, we could tell read_tsv what the column types should be using abbreviations (f
for factor
, i
for integer
, d
for double
— run ?read_tsv
for more info):
read_tsv(file = "gapminder.tsv", col_types = "ffidi")
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
9.2 count()
and n()
A common operation is to count the number of observations for each group. The dplyr package comes with two related functions that help with this.
For instance, if we wanted to check the number of countries included in the dataset for the year 2002, we can use the count()
function. It takes the name of one or more columns that contain the groups we are interested in, and we can optionally sort the results in descending order by adding sort = TRUE
:
|>
gapminder filter(year == 2002) |>
count(continent, sort = TRUE)
# A tibble: 5 × 2
continent n
<fct> <int>
1 Africa 52
2 Asia 33
3 Europe 30
4 Americas 25
5 Oceania 2
If we need to use the number of observations in calculations like by summarize()
, the n()
function is useful. It will return the total number of observations in the “current group” as applicable — for instance, if we wanted to get the standard error of the life expectancy per continent:
|>
gapminder group_by(continent) |>
summarize(se_life = sd(lifeExp) / sqrt(n()))
# A tibble: 5 × 2
continent se_life
<fct> <dbl>
1 Africa 0.366
2 Americas 0.540
3 Asia 0.596
4 Europe 0.286
5 Oceania 0.775
9.3 Learn more
This material was adapted from this Carpentries lesson episode.
In your journey to become a skilled data frame wrangler in R, here are some additional topics that are very useful but beyond the scope of this workshop:
Joining/merging – combining multiple dataframes based on one or more shared columns. This can be done with dplyr’s
join_*()
functions.Pivoting/reshaping – moving between ‘wide’ and ‘long’ data formats with
pivot_wider()
andpivot_longer()
– this is covered in episode 13 of our focal Carpentries lesson.
Footnotes
Or Click
File
=>New file
=>R Script
.↩︎If not: run
install.packages("tidyverse")
now.↩︎If not: run
install.packages("gapminder")
now.↩︎Using pipes is also faster and uses less computer memory.↩︎
Using the readxl package. This is installed as part of the tidyverse but is not a core tidyverse package and therefore needs to be loaded separately.↩︎