Data frame manipulation with dplyr

Author

Software Carpentry / Jelmer Poelstra

Published

February 11, 2025



1 Introduction

1.1 The dplyr package and the tidyverse

The dplyr package provides a number of very useful functions for manipulating data frames.

In this session, we’ll cover some of the most commonly used functions:

  • select() to pick columns (variables)
  • filter() to pick rows (observations)
  • rename() to change column names
  • arrange() to change the order of rows (i.e., to sort a data frame)
  • mutate() to modify values in columns and create new columns
  • summarize() to compute across-row summaries

All these functions take a data frame as the input, and output a new, modified data frame.

dplyr belongs to a family of R packages designed for “dataframe-centric” data science called the “tidyverse”. The other tidyverse package we’ll cover in today’s workshop is ggplot2 for making plots.

1.2 Setting up

Let’s start a new script for this session:

  1. Open a new R script (Click the + symbol in toolbar at the top, then click R Script)1.

  2. Save the script straight away as data-structures.R – you can save it anywhere you like, though it is probably best to save it in a folder specifically for this workshop.

  3. If you want the section headers as comments in your script, as in the script I am showing you now, then copy-and-paste the following into your script:

Section headers for your script (Click to expand)
# 1 - Introduction -------------------------------------------------------------
# 1.3 - Loading the tidyverse

# 1.4 - The gapminder dataset

# 2 - select() to pick columns (variables) -------------------------------------

# 3 - rename() to change column names ------------------------------------------

# 4 - The pipe |> --------------------------------------------------------------

# 5 - filter() to pick rows (observations) -------------------------------------

# Challenge 1
# Write a single command (which can span multiple lines and include pipes) that
# will produce a data frame that has lifeExp, country, and year for Africa but not
# for other continents.
# How many rows does your data frame have?

# 6 - arrange() to sort data frames --------------------------------------------

# 7 - mutate() to modify values in columns and create new columns --------------

# Challenge 2
# A: Use mutate() to create a new column called gdp_billion that has the absolute
#    GDP (i.e., not relative to population size) and does so in units of billions
#    (i.e., 2.3 would mean 2.3 billion).

# B: (Bonus) Use mutate() to create a new column planet that has the value earth
#    in every row. We didn’t cover this kind of scenario but go ahead and test 
#    your intuition!
  
# 8 - summarize() to compute group-wise summary stats --------------------------

# Challenge 3
# Calculate the average life expectancy per country. Which has the longest average
# life expectancy and which has the shortest average life expectancy?

1.3 Loading the tidyverse

All core tidyverse packages can be installed and loaded with a single command. Since you should already have installed the tidyverse2, you only need to load it, which you do as follows:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The output tells you which packages have been loaded as part of the tidyverse.

Also printed above is a “Conflicts” section that may seem ominous. What this means is that two tidyverse functions, filter() and lag(), have the same names as two functions from the stats package that were already in your R environment.

(Those stats package functions are part of what is often referred to as “base R”: core R functionality that is always available (loaded) when you start R.)

Due to this function name conflict/collision, for example, the filter() function from dplyr “masks” the filter() function from stats: that is, if you write a command with filter(), it will use the dplyr function and not the stats function.

You can still use a “masked” function, but you will need to preface it with its package name as follows: stats::filter().


1.4 The gapminder data set

In this session and the next one on data visualization, we will work with the gapminder data set. This data set is available in a package of the same name (while most packages are built around functions so as to extend R’s functionality, others merely contain data sets).

This package, too, you should have already installed3, so you only need to load it right now:

library(gapminder)

Let’s take a look at the dataset:

gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

We can also use the View() function to look at the data frame — this will open a new tab in our editor pane with a spreadsheet-like look and feel:

View(gapminder)
# (Should display the dataset in an editor pane tab)

The gapminder data frame is a so-called “tibble”, which is the tidyverse variant of a data frame. The main difference is the nicer default printing behavior of tibbles: e.g. the data types of columns are shown, and only a limited number of rows are printed.

As for the dataset itself, note that each row contains some data for a single country in a specific year (across 5-year intervals between 1952 and 2007), with the following columns:

  • country (stored as a factor)
  • continent (stored as a factor)
  • year (stored as an integer)
  • lifeExp (stored as an integer): the life expectancy in years
  • pop (stored as an integer): the population size
  • gdpPercap (stored as a double): the per-capita GDP


2 select() to pick columns (variables)

To subset a data frame by keeping or removing certain columns, we can use the select() function.

By default, this function will only keep the columns that you specify, which you typically do simply by listing those columns by name:

select(.data = gapminder, year, country, gdpPercap)
# A tibble: 1,704 × 3
    year country     gdpPercap
   <int> <fct>           <dbl>
 1  1952 Afghanistan      779.
 2  1957 Afghanistan      821.
 3  1962 Afghanistan      853.
 4  1967 Afghanistan      836.
 5  1972 Afghanistan      740.
 6  1977 Afghanistan      786.
 7  1982 Afghanistan      978.
 8  1987 Afghanistan      852.
 9  1992 Afghanistan      649.
10  1997 Afghanistan      635.
# ℹ 1,694 more rows

In the command above, the first argument was the data frame, whereas the other arguments were the (unquoted!) names of columns we wanted to keep.

The order of the columns in the output data frame is exactly as you list them in select(), and doesn’t need to be the same as in the input data frame. In other words, select() is also one way to reorder columns. In the example above, we made year appear before country, for example.

We can also specify columns that should be removed, by prefacing their name with a ! (or a -):

select(.data = gapminder, !continent)
# A tibble: 1,704 × 5
   country      year lifeExp      pop gdpPercap
   <fct>       <int>   <dbl>    <int>     <dbl>
 1 Afghanistan  1952    28.8  8425333      779.
 2 Afghanistan  1957    30.3  9240934      821.
 3 Afghanistan  1962    32.0 10267083      853.
 4 Afghanistan  1967    34.0 11537966      836.
 5 Afghanistan  1972    36.1 13079460      740.
 6 Afghanistan  1977    38.4 14880372      786.
 7 Afghanistan  1982    39.9 12881816      978.
 8 Afghanistan  1987    40.8 13867957      852.
 9 Afghanistan  1992    41.7 16317921      649.
10 Afghanistan  1997    41.8 22227415      635.
# ℹ 1,694 more rows

There are also ways to e.g. select ranges of columns, but that is beyond the scope of this short workshop. Check the select() help by typing ?select to learn more.


3 rename() to change column names

The next dplyr function is one of the simplest: rename() to change column names.

The syntax to specify the new and old name within the function is new_name = old_name — for example, we may want to rename the gdpPercap column:

rename(.data = gapminder, gdp_per_capita = gdpPercap)
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdp_per_capita
   <fct>       <fct>     <int>   <dbl>    <int>          <dbl>
 1 Afghanistan Asia       1952    28.8  8425333           779.
 2 Afghanistan Asia       1957    30.3  9240934           821.
 3 Afghanistan Asia       1962    32.0 10267083           853.
 4 Afghanistan Asia       1967    34.0 11537966           836.
 5 Afghanistan Asia       1972    36.1 13079460           740.
 6 Afghanistan Asia       1977    38.4 14880372           786.
 7 Afghanistan Asia       1982    39.9 12881816           978.
 8 Afghanistan Asia       1987    40.8 13867957           852.
 9 Afghanistan Asia       1992    41.7 16317921           649.
10 Afghanistan Asia       1997    41.8 22227415           635.
# ℹ 1,694 more rows

We can also rename multiple columns at once:

rename(.data = gapminder, gdp_per_capita = gdpPercap, life_exp = lifeExp)
# A tibble: 1,704 × 6
   country     continent  year life_exp      pop gdp_per_capita
   <fct>       <fct>     <int>    <dbl>    <int>          <dbl>
 1 Afghanistan Asia       1952     28.8  8425333           779.
 2 Afghanistan Asia       1957     30.3  9240934           821.
 3 Afghanistan Asia       1962     32.0 10267083           853.
 4 Afghanistan Asia       1967     34.0 11537966           836.
 5 Afghanistan Asia       1972     36.1 13079460           740.
 6 Afghanistan Asia       1977     38.4 14880372           786.
 7 Afghanistan Asia       1982     39.9 12881816           978.
 8 Afghanistan Asia       1987     40.8 13867957           852.
 9 Afghanistan Asia       1992     41.7 16317921           649.
10 Afghanistan Asia       1997     41.8 22227415           635.
# ℹ 1,694 more rows


4 The pipe (|>)

So far, our examples applied a single dplyr function to a data frame, and simply printed the output (a new data frame) to screen.

But in practice, it is common to use several functions in succession to “wrangle” a dataframe into the format we want. For example, if we want to first select() one or more columns, and then modify the output further by rename()-ing a column, we could:

gapminder_sel <- select(.data = gapminder, year, country, gdpPercap)

rename(.data = gapminder_sel, gdp_per_capita = gdpPercap)
# A tibble: 1,704 × 3
    year country     gdp_per_capita
   <int> <fct>                <dbl>
 1  1952 Afghanistan           779.
 2  1957 Afghanistan           821.
 3  1962 Afghanistan           853.
 4  1967 Afghanistan           836.
 5  1972 Afghanistan           740.
 6  1977 Afghanistan           786.
 7  1982 Afghanistan           978.
 8  1987 Afghanistan           852.
 9  1992 Afghanistan           649.
10  1997 Afghanistan           635.
# ℹ 1,694 more rows

For more elaborate editing, we could go on like we did above, successively assigning new data frames and moving on to the next step.

But there is a nicer way of dong this, using so-called “piping” with a pipe operator (|>). Let’s start by seeing a reformulation of the code above with pipes:

gapminder |>
  select(year, country, gdpPercap) |>
  rename(gdp_per_capita = gdpPercap)
# A tibble: 1,704 × 3
    year country     gdp_per_capita
   <int> <fct>                <dbl>
 1  1952 Afghanistan           779.
 2  1957 Afghanistan           821.
 3  1962 Afghanistan           853.
 4  1967 Afghanistan           836.
 5  1972 Afghanistan           740.
 6  1977 Afghanistan           786.
 7  1982 Afghanistan           978.
 8  1987 Afghanistan           852.
 9  1992 Afghanistan           649.
10  1997 Afghanistan           635.
# ℹ 1,694 more rows

What happened here is that we took the gapminder data frame, pushed (or “piped”) it into the select() function, whose output was in turn piped into the rename() function.

You can think of the pipe as “then”: take gapminder, then select, then rename. Using pipes is slightly less typing and considerably more readable than using successive assignments like we did before.4

When using pipes, we no longer specify the input data frame with the .data argument, because the input is “already” being piped into the function (by default, the input goes to the function’s first argument).

For code readability, it is good practice to always start a new line after a pipe |>


5 filter() to pick rows (observations)

The filter() function keeps only those rows that satisfy one or more conditions. It is similar to Filter functionality in spreadsheets — except that those only change what you display, while filter() will remove rows. But if that sounds scary, recall what we mentioned earlier:

All these functions take a data frame as the input, and output a new, modified data frame.

Therefore, you would only modify existing data if you would assign the output back to the input object, like in the example below with the select() function:

cats <- cats |> select(name, coat)

Let’s start with the following example, where we want to keep observations (remember, these are countries in a given year) with a life expectancy exceeding 80 years:

gapminder |>
  filter(lifeExp > 80)
# A tibble: 21 × 6
   country          continent  year lifeExp      pop gdpPercap
   <fct>            <fct>     <int>   <dbl>    <int>     <dbl>
 1 Australia        Oceania    2002    80.4 19546792    30688.
 2 Australia        Oceania    2007    81.2 20434176    34435.
 3 Canada           Americas   2007    80.7 33390141    36319.
 4 France           Europe     2007    80.7 61083916    30470.
 5 Hong Kong, China Asia       2002    81.5  6762476    30209.
 6 Hong Kong, China Asia       2007    82.2  6980412    39725.
 7 Iceland          Europe     2002    80.5   288030    31163.
 8 Iceland          Europe     2007    81.8   301931    36181.
 9 Israel           Asia       2007    80.7  6426679    25523.
10 Italy            Europe     2002    80.2 57926999    27968.
# ℹ 11 more rows

So, we specify a condition based on the values in at least one column to only keep rows satisfying this condition.

These conditions don’t have to be based on numeric comparisons – for example:

gapminder |>
  filter(continent == "Europe")
# A tibble: 360 × 6
   country continent  year lifeExp     pop gdpPercap
   <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
 1 Albania Europe     1952    55.2 1282697     1601.
 2 Albania Europe     1957    59.3 1476505     1942.
 3 Albania Europe     1962    64.8 1728137     2313.
 4 Albania Europe     1967    66.2 1984060     2760.
 5 Albania Europe     1972    67.7 2263554     3313.
 6 Albania Europe     1977    68.9 2509048     3533.
 7 Albania Europe     1982    70.4 2780097     3631.
 8 Albania Europe     1987    72   3075321     3739.
 9 Albania Europe     1992    71.6 3326498     2497.
10 Albania Europe     1997    73.0 3428038     3193.
# ℹ 350 more rows
Remember to use two equals signs == to test for equality!

Filter based on multiple conditions

It’s also possible to filter based on multiple conditions – for example:

gapminder |>
  filter(continent == "Asia", year == 2007, lifeExp > 80)
# A tibble: 3 × 6
  country          continent  year lifeExp       pop gdpPercap
  <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
1 Hong Kong, China Asia       2007    82.2   6980412    39725.
2 Israel           Asia       2007    80.7   6426679    25523.
3 Japan            Asia       2007    82.6 127467972    31656.

By default, multiple conditions are combined in an AND fashion — in other words, in a given row, each condition needs to be met for that column to be kept.

If you want to combine conditions in an OR fashion, you should use a | as follows:

gapminder |>
  filter(lifeExp > 80 | gdpPercap > 10000)
# A tibble: 392 × 6
   country   continent  year lifeExp      pop gdpPercap
   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
 1 Argentina Americas   1977    68.5 26983828    10079.
 2 Argentina Americas   1997    73.3 36203463    10967.
 3 Argentina Americas   2007    75.3 40301927    12779.
 4 Australia Oceania    1952    69.1  8691212    10040.
 5 Australia Oceania    1957    70.3  9712569    10950.
 6 Australia Oceania    1962    70.9 10794968    12217.
 7 Australia Oceania    1967    71.1 11872264    14526.
 8 Australia Oceania    1972    71.9 13177000    16789.
 9 Australia Oceania    1977    73.5 14074100    18334.
10 Australia Oceania    1982    74.7 15184200    19477.
# ℹ 382 more rows

Pipeline practice

Finally, let’s practice a bit more with “pipelines” that use multiple dplyr verbs:

gapminder |>
  filter(continent == "Americas") |>
  select(year, country, gdpPercap) |>
  rename(gdp_per_capita = gdpPercap)
# A tibble: 300 × 3
    year country   gdp_per_capita
   <int> <fct>              <dbl>
 1  1952 Argentina          5911.
 2  1957 Argentina          6857.
 3  1962 Argentina          7133.
 4  1967 Argentina          8053.
 5  1972 Argentina          9443.
 6  1977 Argentina         10079.
 7  1982 Argentina          8998.
 8  1987 Argentina          9140.
 9  1992 Argentina          9308.
10  1997 Argentina         10967.
# ℹ 290 more rows


Challenge 1

Write a single command (which can span multiple lines and include pipes) that will produce a data frame that has lifeExp, country, and year for Africa but not for other continents. How many rows does your data frame have?

Click for the solution
gapminder |>
  filter(continent == "Africa") |>
  select(year, country, lifeExp)
# A tibble: 624 × 3
    year country lifeExp
   <int> <fct>     <dbl>
 1  1952 Algeria    43.1
 2  1957 Algeria    45.7
 3  1962 Algeria    48.3
 4  1967 Algeria    51.4
 5  1972 Algeria    54.5
 6  1977 Algeria    58.0
 7  1982 Algeria    61.4
 8  1987 Algeria    65.8
 9  1992 Algeria    67.7
10  1997 Algeria    69.2
# ℹ 614 more rows
It has 624 rows.


6 arrange() to sort data frames

The arrange() function is like sorting functionality in Excel: it changes the order of rows based on the values in one or more columns.

For example, gapminder is currently sorted alphabetically by country and then by year, but we may instead want to sort by population size:

gapminder |>
  arrange(pop)
# A tibble: 1,704 × 6
   country               continent  year lifeExp   pop gdpPercap
   <fct>                 <fct>     <int>   <dbl> <int>     <dbl>
 1 Sao Tome and Principe Africa     1952    46.5 60011      880.
 2 Sao Tome and Principe Africa     1957    48.9 61325      861.
 3 Djibouti              Africa     1952    34.8 63149     2670.
 4 Sao Tome and Principe Africa     1962    51.9 65345     1072.
 5 Sao Tome and Principe Africa     1967    54.4 70787     1385.
 6 Djibouti              Africa     1957    37.3 71851     2865.
 7 Sao Tome and Principe Africa     1972    56.5 76595     1533.
 8 Sao Tome and Principe Africa     1977    58.6 86796     1738.
 9 Djibouti              Africa     1962    39.7 89898     3021.
10 Sao Tome and Principe Africa     1982    60.4 98593     1890.
# ℹ 1,694 more rows

Sorting can be useful to see the observations with the smallest or largest values for a certain column: above we see that the country and year with the smallest population size is Sao Tome and Principe in 1952.

Default sorting is from small to large, but of course, we may also want to sort in the reverse order. You can do this using the desc() (descending, large-to-small) helper function:

gapminder |>
  arrange(desc(pop))
# A tibble: 1,704 × 6
   country continent  year lifeExp        pop gdpPercap
   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
 1 China   Asia       2007    73.0 1318683096     4959.
 2 China   Asia       2002    72.0 1280400000     3119.
 3 China   Asia       1997    70.4 1230075000     2289.
 4 China   Asia       1992    68.7 1164970000     1656.
 5 India   Asia       2007    64.7 1110396331     2452.
 6 China   Asia       1987    67.3 1084035000     1379.
 7 India   Asia       2002    62.9 1034172547     1747.
 8 China   Asia       1982    65.5 1000281000      962.
 9 India   Asia       1997    61.8  959000000     1459.
10 China   Asia       1977    64.0  943455000      741.
# ℹ 1,694 more rows

Finally, it is common to want to sort by multiple columns, where ties in the first column are broken by a second column (and so on) – to do so, simply list the columns in the appropriate order:

gapminder |>
  arrange(continent, country)
# A tibble: 1,704 × 6
   country continent  year lifeExp      pop gdpPercap
   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
 1 Algeria Africa     1952    43.1  9279525     2449.
 2 Algeria Africa     1957    45.7 10270856     3014.
 3 Algeria Africa     1962    48.3 11000948     2551.
 4 Algeria Africa     1967    51.4 12760499     3247.
 5 Algeria Africa     1972    54.5 14760787     4183.
 6 Algeria Africa     1977    58.0 17152804     4910.
 7 Algeria Africa     1982    61.4 20033753     5745.
 8 Algeria Africa     1987    65.8 23254956     5681.
 9 Algeria Africa     1992    67.7 26298373     5023.
10 Algeria Africa     1997    69.2 29072015     4797.
# ℹ 1,694 more rows

The above example sorts first by continent and then by country.


7 mutate() to modify values in columns and create new columns

So far, we’ve focused on functions that “merely” subset and reorganize data frames. We’ve also seen how we can modify column names. But we haven’t seen how we can change the data or compute derived data in data frames.

We can do this with the mutate() function. For example, say that we want to create a new column that has population sizes in millions rather than in individuals:

gapminder |>
  mutate(pop_million = pop / 10^6)
# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap pop_million
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>       <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.        8.43
 2 Afghanistan Asia       1957    30.3  9240934      821.        9.24
 3 Afghanistan Asia       1962    32.0 10267083      853.       10.3 
 4 Afghanistan Asia       1967    34.0 11537966      836.       11.5 
 5 Afghanistan Asia       1972    36.1 13079460      740.       13.1 
 6 Afghanistan Asia       1977    38.4 14880372      786.       14.9 
 7 Afghanistan Asia       1982    39.9 12881816      978.       12.9 
 8 Afghanistan Asia       1987    40.8 13867957      852.       13.9 
 9 Afghanistan Asia       1992    41.7 16317921      649.       16.3 
10 Afghanistan Asia       1997    41.8 22227415      635.       22.2 
# ℹ 1,694 more rows

The code above created a new column called pop_million that is the result of dividing the values in the pop column by a million.

To modify a column rather than adding a new one, simply “assign back to the same name”:

gapminder |>
  mutate(pop = pop / 10^6)
# A tibble: 1,704 × 6
   country     continent  year lifeExp   pop gdpPercap
   <fct>       <fct>     <int>   <dbl> <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8.43      779.
 2 Afghanistan Asia       1957    30.3  9.24      821.
 3 Afghanistan Asia       1962    32.0 10.3       853.
 4 Afghanistan Asia       1967    34.0 11.5       836.
 5 Afghanistan Asia       1972    36.1 13.1       740.
 6 Afghanistan Asia       1977    38.4 14.9       786.
 7 Afghanistan Asia       1982    39.9 12.9       978.
 8 Afghanistan Asia       1987    40.8 13.9       852.
 9 Afghanistan Asia       1992    41.7 16.3       649.
10 Afghanistan Asia       1997    41.8 22.2       635.
# ℹ 1,694 more rows


Challenge 2

A: Use mutate() to create a new column called gdp_billion that has the absolute GDP (i.e., not relative to population size) and does so in units of billions (i.e., 2.3 would mean 2.3 billion).

Click for the solution
gapminder |>
    mutate(gdp_billion = gdpPercap * pop / 10^9)
# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap gdp_billion
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>       <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.        6.57
 2 Afghanistan Asia       1957    30.3  9240934      821.        7.59
 3 Afghanistan Asia       1962    32.0 10267083      853.        8.76
 4 Afghanistan Asia       1967    34.0 11537966      836.        9.65
 5 Afghanistan Asia       1972    36.1 13079460      740.        9.68
 6 Afghanistan Asia       1977    38.4 14880372      786.       11.7 
 7 Afghanistan Asia       1982    39.9 12881816      978.       12.6 
 8 Afghanistan Asia       1987    40.8 13867957      852.       11.8 
 9 Afghanistan Asia       1992    41.7 16317921      649.       10.6 
10 Afghanistan Asia       1997    41.8 22227415      635.       14.1 
# ℹ 1,694 more rows

B: (Bonus) Use mutate() to create a new column planet that has the value earth in every row. We didn’t cover this kind of scenario, but go ahead and test your intuition!

Click for the solution

If you simply provide a value, this will be repeated in every row:

gapminder |>
  mutate(planet = "earth")
# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap planet
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <chr> 
 1 Afghanistan Asia       1952    28.8  8425333      779. earth 
 2 Afghanistan Asia       1957    30.3  9240934      821. earth 
 3 Afghanistan Asia       1962    32.0 10267083      853. earth 
 4 Afghanistan Asia       1967    34.0 11537966      836. earth 
 5 Afghanistan Asia       1972    36.1 13079460      740. earth 
 6 Afghanistan Asia       1977    38.4 14880372      786. earth 
 7 Afghanistan Asia       1982    39.9 12881816      978. earth 
 8 Afghanistan Asia       1987    40.8 13867957      852. earth 
 9 Afghanistan Asia       1992    41.7 16317921      649. earth 
10 Afghanistan Asia       1997    41.8 22227415      635. earth 
# ℹ 1,694 more rows


8 summarize() to compute (groupwise) summary stats

In combination with group_by(), the summarize() function can compute data summaries across groups of rows of a data frame.

First, let’s see what summarize() does when used by itself:

gapminder |>
  summarize(mean_gdpPercap = mean(gdpPercap),
            mean_lifeExp = mean(lifeExp))
# A tibble: 1 × 2
  mean_gdpPercap mean_lifeExp
           <dbl>        <dbl>
1          7215.         59.5

Above, we computed the mean for two columns, across all rows. This is already useful, but in combination with the helper function group_by(), summarize() becomes really powerful by allowing us to compute groupwise stats.

For example, let’s compute the mean GDP and mean life expectancy separately for each continent:

gapminder |>
  group_by(continent) |>
  summarize(mean_gdpPercap = mean(gdpPercap),
            mean_lifeExp = mean(lifeExp))
# A tibble: 5 × 3
  continent mean_gdpPercap mean_lifeExp
  <fct>              <dbl>        <dbl>
1 Africa             2194.         48.9
2 Americas           7136.         64.7
3 Asia               7902.         60.1
4 Europe            14469.         71.9
5 Oceania           18622.         74.3

group_by() implicitly splits a data frame into groups of rows: here, one group for observations from each continent. After that, operations like in summarize() will happen separately for each group, which is how we ended up with per-continent means.

Finally, another powerful feature is that we can group by multiple variables – for example, by year and continent:

gapminder |>
  group_by(continent, year) |>
  summarize(mean_gdpPercap = mean(gdpPercap),
            mean_lifeExp = mean(lifeExp))
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
# A tibble: 60 × 4
# Groups:   continent [5]
   continent  year mean_gdpPercap mean_lifeExp
   <fct>     <int>          <dbl>        <dbl>
 1 Africa     1952          1253.         39.1
 2 Africa     1957          1385.         41.3
 3 Africa     1962          1598.         43.3
 4 Africa     1967          2050.         45.3
 5 Africa     1972          2340.         47.5
 6 Africa     1977          2586.         49.6
 7 Africa     1982          2482.         51.6
 8 Africa     1987          2283.         53.3
 9 Africa     1992          2282.         53.6
10 Africa     1997          2379.         53.6
# ℹ 50 more rows


Challenge 3

Calculate the average life expectancy per country. Which has the longest average life expectancy and which has the shortest average life expectancy?

Click for the solution

First, let’s create a dataframe with the mean life expectancy by country:

lifeExp_bycountry <- gapminder |>
   group_by(country) |>
   summarize(mean_lifeExp = mean(lifeExp))

Then, arrange that dataframe in two directions to see the countries with the longest and shortest life expectance – piping into head() as a bonus to only see the top n, here top 1:

lifeExp_bycountry |>
   arrange(mean_lifeExp) |>
   head(n = 1)
# A tibble: 1 × 2
  country      mean_lifeExp
  <fct>               <dbl>
1 Sierra Leone         36.8
lifeExp_bycountry |>
   arrange(desc(mean_lifeExp)) |>
   head(n = 1)
# A tibble: 1 × 2
  country mean_lifeExp
  <fct>          <dbl>
1 Iceland         76.5





9 Bonus material for self-study

9.1 Writing and reading tabular data to and from files

When working with your own data in R, it is common to want to read data from files into your R environment, and write data that is in your R environment to files.

While it’s possible to have R interact with Excel spreadsheet files5, keeping your data in plain-text files generally benefits reproducibility. Tabular plain text files can be stored using:

  • A Tab as the column delimiter (often called TSV files, and stored with a .tsv extension)
  • A comma as the column delimiter (often called CSV files, and stored with a .csv extension).

We will use functions from the readr package, which is part of the core tidyverse and therefore already loaded into your environment, to practice writing and reading data to and from TSV files.

Writing files

First, let’s write the gapminder data frame to file using the write_tsv() function with arguments x for the R object and file for the file path:

write_tsv(x = gapminder, file = "gapminder.tsv")

This will have written to a file called gapminder.tsv in your R working directory (you can see where that is at the top of the Console, or by running getwd()).

Checking out the file

In RStudio’s Files pane, find your new file and click on it. That way, the file will open in the editor.

The file argument to write_tsv() takes a file path, meaning that you can specify any location on your computer for it it in addition to its name.

For example, if you had a folder called results in your current working directory (directory is just another word for folder), you could put the file in there:

write_tsv(x = gapminder, file = "results/gapminder.tsv")

Note that a forward slash / as a folder delimiter will work regardless of your operating system (even though Windows natively delimits folder by a backslash \).

Finally, if you want to store the file in a totally different place than where you are now, note that you can also use so-called absolute (or “full”) paths like:

write_tsv(x = gapminder, file = "/Users/poelstra.1/Desktop/gapminder.tsv")

Reading from files

To practice reading data from a file, we’ll use the read_tsv() function on the file we just created:

gapminder_reread <- read_tsv(file = "gapminder.tsv")
Rows: 1704 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): country, continent
dbl (4): year, lifeExp, pop, gdpPercap

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note that the function is rather chatty by default, telling us how many rows and columns it read in, and what their data types are. Let’s check the resulting object:

gapminder_reread
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

This looks good! Do note that the column’s data types are not identical to what they were (year and pop are saved as double rather than integer, and country and continent as character rather than factor). This is largely expected because that kind of metadata is not stored in a plain-text TSV, so read_tsv() will by default simply make best guesses as to the types.

Alternatively, we could tell read_tsv what the column types should be using abbreviations (f for factor, i for integer, d for double — run ?read_tsv for more info):

read_tsv(file = "gapminder.tsv", col_types = "ffidi")
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

To read and write CSV files instead, use read_csv() / write_csv() in the same way.


9.2 count() and n()

A common operation is to count the number of observations for each group. The dplyr package comes with two related functions that help with this.

For instance, if we wanted to check the number of countries included in the dataset for the year 2002, we can use the count() function. It takes the name of one or more columns that contain the groups we are interested in, and we can optionally sort the results in descending order by adding sort = TRUE:

gapminder |>
    filter(year == 2002) |>
    count(continent, sort = TRUE)
# A tibble: 5 × 2
  continent     n
  <fct>     <int>
1 Africa       52
2 Asia         33
3 Europe       30
4 Americas     25
5 Oceania       2

If we need to use the number of observations in calculations like by summarize(), the n() function is useful. It will return the total number of observations in the “current group” as applicable — for instance, if we wanted to get the standard error of the life expectancy per continent:

gapminder |>
  group_by(continent) |>
  summarize(se_life = sd(lifeExp) / sqrt(n()))
# A tibble: 5 × 2
  continent se_life
  <fct>       <dbl>
1 Africa      0.366
2 Americas    0.540
3 Asia        0.596
4 Europe      0.286
5 Oceania     0.775

9.3 Learn more

This material was adapted from this Carpentries lesson episode.

In your journey to become a skilled data frame wrangler in R, here are some additional topics that are very useful but beyond the scope of this workshop:

  • Joining/merging – combining multiple dataframes based on one or more shared columns. This can be done with dplyr’s join_*() functions.

  • Pivoting/reshaping – moving between ‘wide’ and ‘long’ data formats with pivot_wider() and pivot_longer() – this is covered in episode 13 of our focal Carpentries lesson.


Back to top

Footnotes

  1. Or Click File => New file => R Script.↩︎

  2. If not: run install.packages("tidyverse") now.↩︎

  3. If not: run install.packages("gapminder") now.↩︎

  4. Using pipes is also faster and uses less computer memory.↩︎

  5. Using the readxl package. This is installed as part of the tidyverse but is not a core tidyverse package and therefore needs to be loaded separately.↩︎