Data wrangling: using select(), arrange(), and mutate()

r-basics
tidyverse
Author

Jessica Cooperstone

Published

September 9, 2024


Artwork by @allison_horst

1 Introduction

We are going to start off this semester of code club with a series of sessions on how to “wrangle” your data. It can be a struggle to get data into a format that is amenable for analysis, so we will be going through a series of functions and approaches that will get you more comfortable with manipulating your data.

Last Code Club, we didn’t get to go over the function select() - so we are going to do that today, along with the function mutate(). In brief:

1.1 Load packages

First we will load the packages that we will use today.

library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor) # for cleaning column names

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

1.2 Data

We are going to use data from The World Factbook, put together by the CIA to “provides basic intelligence on the history, people, government, economy, energy, geography, environment, communications, transportation, military, terrorism, and transnational issues for 265 world entities.” I thought this data would give us some opportunities to flex our R skills, and learn a bit about the world.

The data we are going to download can be found here, though I have saved the file, added it to our Code Club Github, and included some code below for you to download it.

If you downloaded this file two weeks ago, you don’t need to download it again, you just need to make sure you know where it is on your computer to read it in.

download.file(
  url = "https://raw.githubusercontent.com/osu-codeclub/osu-codeclub.github.io/main/posts/S08E01_wrangling_01/data/factbook-2015.csv",
  destfile = "factbook_download_2015.csv"
)

We can read it in using the tidyverse function from the readr package called read_csv().

factbook_2015 <- read_csv("factbook_download_2015.csv")
Rows: 217 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Country Name, Country Code
dbl (51): Population, total, Population growth (annual %), Surface area (sq....

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We can look at this data another way, using the function head() to look at the first six rows, and every column.

head(factbook_2015)
# A tibble: 6 × 53
  `Country Name` `Country Code` `Population, total` Population growth (annual …¹
  <chr>          <chr>                        <dbl>                        <dbl>
1 Afghanistan    AFG                       33753499                        3.12 
2 Albania        ALB                        2880703                       -0.291
3 Algeria        DZA                       39543154                        2.00 
4 American Samoa ASM                          51368                       -1.64 
5 Andorra        AND                          71746                        0.174
6 Angola         AGO                       28127721                        3.62 
# ℹ abbreviated name: ¹​`Population growth (annual %)`
# ℹ 49 more variables: `Surface area (sq. km)` <dbl>,
#   `Poverty headcount ratio at national poverty lines (% of population)` <dbl>,
#   `GNI, Atlas method (current US$)` <dbl>,
#   `GNI per capita, Atlas method (current US$)` <dbl>,
#   `GNI, PPP (current international $)` <dbl>,
#   `GNI per capita, PPP (current international $)` <dbl>, …

Two weeks ago, we cleaned up the column names, so let’s do that again. If you want to remember why, you can go back to that lesson.

factbook_2015 <- clean_names(factbook_2015)

2 Choose columns with select()

Often you will want to pick only certain columns in your dataframe, and you can do this with the function select(). You can pick columns generally by:

  • their names
  • their position
  • characteristics of that column

If we want to know how the arguments to select() work, we can access the documentation material about the function.

?select()

Let’s first select columns by their names. Let’s pick just the country_name, population_total, and surface_area_sq_km.

factbook_2015 |> 
  select(country_name, population_total, surface_area_sq_km)
# A tibble: 217 × 3
   country_name        population_total surface_area_sq_km
   <chr>                          <dbl>              <dbl>
 1 Afghanistan                 33753499             652860
 2 Albania                      2880703              28750
 3 Algeria                     39543154            2381741
 4 American Samoa                 51368                200
 5 Andorra                        71746                470
 6 Angola                      28127721            1246700
 7 Antigua and Barbuda            89941                440
 8 Argentina                   43131966            2780400
 9 Armenia                      2878595              29740
10 Aruba                         104257                180
# ℹ 207 more rows

Those columns are also the 1st, 3rd, and 5th columns in our data frame, do we can select them by their indices, or by their location.

factbook_2015 |> 
  select(1, 3, 5)
# A tibble: 217 × 3
   country_name        population_total surface_area_sq_km
   <chr>                          <dbl>              <dbl>
 1 Afghanistan                 33753499             652860
 2 Albania                      2880703              28750
 3 Algeria                     39543154            2381741
 4 American Samoa                 51368                200
 5 Andorra                        71746                470
 6 Angola                      28127721            1246700
 7 Antigua and Barbuda            89941                440
 8 Argentina                   43131966            2780400
 9 Armenia                      2878595              29740
10 Aruba                         104257                180
# ℹ 207 more rows

In general I would recommend against this because its really hard to remember which column indices are which variables today, nevermind returning back to old code 1 year from now.

We can also select columns that are consecutive, using the : operator. Below I’m selecting the columns country_name through population_growth_annual_percent.

factbook_2015 |> 
  select(country_name:population_growth_annual_percent)
# A tibble: 217 × 4
   country_name        country_code population_total population_growth_annual_…¹
   <chr>               <chr>                   <dbl>                       <dbl>
 1 Afghanistan         AFG                  33753499                       3.12 
 2 Albania             ALB                   2880703                      -0.291
 3 Algeria             DZA                  39543154                       2.00 
 4 American Samoa      ASM                     51368                      -1.64 
 5 Andorra             AND                     71746                       0.174
 6 Angola              AGO                  28127721                       3.62 
 7 Antigua and Barbuda ATG                     89941                       0.787
 8 Argentina           ARG                  43131966                       1.08 
 9 Armenia             ARM                   2878595                      -0.393
10 Aruba               ABW                    104257                       0.638
# ℹ 207 more rows
# ℹ abbreviated name: ¹​population_growth_annual_percent

We can remove columns using the ! operator:

factbook_2015 |> 
  select(!country_code)
# A tibble: 217 × 52
   country_name       population_total population_growth_an…¹ surface_area_sq_km
   <chr>                         <dbl>                  <dbl>              <dbl>
 1 Afghanistan                33753499                  3.12              652860
 2 Albania                     2880703                 -0.291              28750
 3 Algeria                    39543154                  2.00             2381741
 4 American Samoa                51368                 -1.64                 200
 5 Andorra                       71746                  0.174                470
 6 Angola                     28127721                  3.62             1246700
 7 Antigua and Barbu…            89941                  0.787                440
 8 Argentina                  43131966                  1.08             2780400
 9 Armenia                     2878595                 -0.393              29740
10 Aruba                        104257                  0.638                180
# ℹ 207 more rows
# ℹ abbreviated name: ¹​population_growth_annual_percent
# ℹ 48 more variables:
#   poverty_headcount_ratio_at_national_poverty_lines_percent_of_population <dbl>,
#   gni_atlas_method_current_us <dbl>,
#   gni_per_capita_atlas_method_current_us <dbl>,
#   gni_ppp_current_international <dbl>, …

We can also select data based on its characteristics. We can select using selection helpers like:

For example, we might want each column that has anything to do with gross domestic product, or gdp. We can select all of the columns which contain the string “gdp” in their name. I’m also going to add country_name so we know what we’re working with.

factbook_2015 |> 
  select(country_name, contains("gdp"))
# A tibble: 217 × 15
   country_name     water_productivity_t…¹ gdp_current_us gdp_growth_annual_pe…²
   <chr>                             <dbl>          <dbl>                  <dbl>
 1 Afghanistan                       0.943    19134221645                  1.45 
 2 Albania                          12.4      11386853113                  2.22 
 3 Algeria                          20.9     187494000000                  3.2  
 4 American Samoa                   NA          673000000                  3.15 
 5 Andorra                          NA         2789881259                  1.43 
 6 Angola                          128.       90496420507                  0.944
 7 Antigua and Bar…                327.        1437755556                  1.45 
 8 Argentina                        15.8     594749000000                  2.73 
 9 Armenia                           3.23     10553337518                  3.20 
10 Aruba                            NA         2962907263                 -0.624
# ℹ 207 more rows
# ℹ abbreviated names:
#   ¹​water_productivity_total_constant_2015_us_gdp_per_cubic_meter_of_total_freshwater_withdrawal,
#   ²​gdp_growth_annual_percent
# ℹ 11 more variables: inflation_gdp_deflator_annual_percent <dbl>,
#   agriculture_forestry_and_fishing_value_added_percent_of_gdp <dbl>,
#   industry_including_construction_value_added_percent_of_gdp <dbl>, …

We can also select all column that meet a certain predicate. For example, we can pick all of the column that are of the type character.

factbook_2015 |> 
  select(where(is.character))
# A tibble: 217 × 2
   country_name        country_code
   <chr>               <chr>       
 1 Afghanistan         AFG         
 2 Albania             ALB         
 3 Algeria             DZA         
 4 American Samoa      ASM         
 5 Andorra             AND         
 6 Angola              AGO         
 7 Antigua and Barbuda ATG         
 8 Argentina           ARG         
 9 Armenia             ARM         
10 Aruba               ABW         
# ℹ 207 more rows

We can also combine selections using the & (and), | (or), and ! (not) operators. For example, if I want the columns about GNI but only the international ones I can:

factbook_2015 |> 
  select(starts_with("gni") & ends_with("international"))
# A tibble: 217 × 2
   gni_ppp_current_international gni_per_capita_ppp_current_international
                           <dbl>                                    <dbl>
 1                   77739869000                                     2300
 2                   33981401261                                    11800
 3                  541234000000                                    13690
 4                            NA                                       NA
 5                            NA                                       NA
 6                  190283000000                                     6760
 7                    1766297473                                    19640
 8                  848960000000                                    19680
 9                   30505448341                                    10600
10                    3718236361                                    35660
# ℹ 207 more rows

We can also use select() to order our columns, as the order we select them in dictates the order they will exist in our dataframe.

2.1 Practice

Come up with 3 different ways to select the columns about children, and make sure you also include a country column so you know what you’re looking at.

Need a hint?

Here are the columns that I’m considering to be about children:

[1] “country_name”
[17] “mortality_rate_under_5_per_1_000_live_births”
[18] “prevalence_of_underweight_weight_for_age_percent_of_children_under_5”
[19] “immunization_measles_percent_of_children_ages_12_23_months”
[20] “primary_completion_rate_total_percent_of_relevant_age_group”
[21] “school_enrollment_secondary_percent_gross”
[22] “school_enrollment_primary_and_secondary_gross_gender_parity_index_gpi”

Click for the solution

By name and for columns in a row, using the :.

factbook_2015 |> 
  select(country_name, 
         mortality_rate_under_5_per_1_000_live_births:
         school_enrollment_primary_and_secondary_gross_gender_parity_index_gpi)
# A tibble: 217 × 7
   country_name        mortality_rate_under_5_per_1_000…¹ prevalence_of_underw…²
   <chr>                                            <dbl>                  <dbl>
 1 Afghanistan                                       72.7                     NA
 2 Albania                                            9.6                     NA
 3 Algeria                                           25.3                     NA
 4 American Samoa                                    NA                       NA
 5 Andorra                                            3.6                     NA
 6 Angola                                            87.9                     19
 7 Antigua and Barbuda                               10.9                     NA
 8 Argentina                                         11.8                     NA
 9 Armenia                                           14.5                     NA
10 Aruba                                             NA                       NA
# ℹ 207 more rows
# ℹ abbreviated names: ¹​mortality_rate_under_5_per_1_000_live_births,
#   ²​prevalence_of_underweight_weight_for_age_percent_of_children_under_5
# ℹ 4 more variables:
#   immunization_measles_percent_of_children_ages_12_23_months <dbl>,
#   primary_completion_rate_total_percent_of_relevant_age_group <dbl>,
#   school_enrollment_secondary_percent_gross <dbl>, …

By index:

factbook_2015 |> 
  select(1, 17:22) 
# A tibble: 217 × 7
   country_name        mortality_rate_under_5_per_1_000…¹ prevalence_of_underw…²
   <chr>                                            <dbl>                  <dbl>
 1 Afghanistan                                       72.7                     NA
 2 Albania                                            9.6                     NA
 3 Algeria                                           25.3                     NA
 4 American Samoa                                    NA                       NA
 5 Andorra                                            3.6                     NA
 6 Angola                                            87.9                     19
 7 Antigua and Barbuda                               10.9                     NA
 8 Argentina                                         11.8                     NA
 9 Armenia                                           14.5                     NA
10 Aruba                                             NA                       NA
# ℹ 207 more rows
# ℹ abbreviated names: ¹​mortality_rate_under_5_per_1_000_live_births,
#   ²​prevalence_of_underweight_weight_for_age_percent_of_children_under_5
# ℹ 4 more variables:
#   immunization_measles_percent_of_children_ages_12_23_months <dbl>,
#   primary_completion_rate_total_percent_of_relevant_age_group <dbl>,
#   school_enrollment_secondary_percent_gross <dbl>, …

By name characteristics:

factbook_2015 |> 
  select(country_name, contains("under_5"), contains("children"),
         contains("primary"), contains("school"))
# A tibble: 217 × 7
   country_name        mortality_rate_under_5_per_1_000…¹ prevalence_of_underw…²
   <chr>                                            <dbl>                  <dbl>
 1 Afghanistan                                       72.7                     NA
 2 Albania                                            9.6                     NA
 3 Algeria                                           25.3                     NA
 4 American Samoa                                    NA                       NA
 5 Andorra                                            3.6                     NA
 6 Angola                                            87.9                     19
 7 Antigua and Barbuda                               10.9                     NA
 8 Argentina                                         11.8                     NA
 9 Armenia                                           14.5                     NA
10 Aruba                                             NA                       NA
# ℹ 207 more rows
# ℹ abbreviated names: ¹​mortality_rate_under_5_per_1_000_live_births,
#   ²​prevalence_of_underweight_weight_for_age_percent_of_children_under_5
# ℹ 4 more variables:
#   immunization_measles_percent_of_children_ages_12_23_months <dbl>,
#   primary_completion_rate_total_percent_of_relevant_age_group <dbl>,
#   school_enrollment_primary_and_secondary_gross_gender_parity_index_gpi <dbl>, …

These are just some ways!

3 Sorting data with arrange()

A nice helper function for looking at your data is arrange() which sorts your data.

We can sort our data based on how much forest (forest_area_sq_km) each country has.

factbook_2015 |> 
  arrange(forest_area_sq_km)
# A tibble: 217 × 53
   country_name             country_code population_total population_growth_an…¹
   <chr>                    <chr>                   <dbl>                  <dbl>
 1 Gibraltar                GIB                     32520                  0.209
 2 Monaco                   MCO                     36760                  1.78 
 3 Nauru                    NRU                     11185                  2.21 
 4 Qatar                    QAT                   2414573                  8.65 
 5 Curacao                  CUW                    157980                  1.32 
 6 Faroe Islands            FRO                     48816                  0.722
 7 Greenland                GRL                     56114                 -0.322
 8 Malta                    MLT                    445053                  2.39 
 9 Sint Maarten (Dutch par… SXM                     38825                  2.98 
10 Aruba                    ABW                    104257                  0.638
# ℹ 207 more rows
# ℹ abbreviated name: ¹​population_growth_annual_percent
# ℹ 49 more variables: surface_area_sq_km <dbl>,
#   poverty_headcount_ratio_at_national_poverty_lines_percent_of_population <dbl>,
#   gni_atlas_method_current_us <dbl>,
#   gni_per_capita_atlas_method_current_us <dbl>,
#   gni_ppp_current_international <dbl>, …

By default, arrange() sorts small to big, if we want to go from big to small we can set arrange(desc()) to sort by descending.

factbook_2015 |> 
  arrange(desc(forest_area_sq_km))
# A tibble: 217 × 53
   country_name       country_code population_total population_growth_annual_p…¹
   <chr>              <chr>                   <dbl>                        <dbl>
 1 Russian Federation RUS                 144640716                        0.279
 2 Brazil             BRA                 205188205                        0.846
 3 Canada             CAN                  35704498                        0.760
 4 United States      USA                 320738994                        0.736
 5 China              CHN                1379860000                        0.581
 6 Australia          AUS                  23815995                        1.44 
 7 Congo, Dem. Rep.   COD                  78656904                        3.39 
 8 Indonesia          IDN                 259091970                        1.11 
 9 Peru               PER                  30711863                        1.17 
10 India              IND                1322866505                        1.19 
# ℹ 207 more rows
# ℹ abbreviated name: ¹​population_growth_annual_percent
# ℹ 49 more variables: surface_area_sq_km <dbl>,
#   poverty_headcount_ratio_at_national_poverty_lines_percent_of_population <dbl>,
#   gni_atlas_method_current_us <dbl>,
#   gni_per_capita_atlas_method_current_us <dbl>,
#   gni_ppp_current_international <dbl>, …

It’s not too surprising that the biggest countries in forest-feasible lattitudes have the most forest.

We can also add select() into our pipe, so that we don’t have to scroll so far to see what the actual amount of forest is.

factbook_2015 |> 
  select(country_name, surface_area_sq_km, forest_area_sq_km) |> 
  arrange(desc(forest_area_sq_km))
# A tibble: 217 × 3
   country_name       surface_area_sq_km forest_area_sq_km
   <chr>                           <dbl>             <dbl>
 1 Russian Federation           17098250          8149305.
 2 Brazil                        8515770          5038848 
 3 Canada                        9879750          3471157.
 4 United States                 9831510          3100950 
 5 China                         9562911          2102942.
 6 Australia                     7741220          1330945 
 7 Congo, Dem. Rep.              2344860          1316621.
 8 Indonesia                     1913580           950279 
 9 Peru                          1285220           731945.
10 India                         3287260           708280 
# ℹ 207 more rows

Wow almost 50% of Russia is forested.

3.1 Practice

Which countries have the lowest cell phone subscriptions? mobile_cellular_subscriptions_per_100_people

Need a hint?

You can use the function arrange() to sort your columns. The default arranging is from low to high, so if you want to go from high to low, you can set arrange(desc()).

Click for the solution
factbook_2015 |> 
  select(country_name, mobile_cellular_subscriptions_per_100_people) |> 
  arrange(mobile_cellular_subscriptions_per_100_people)
# A tibble: 217 × 2
   country_name              mobile_cellular_subscriptions_per_100_people
   <chr>                                                            <dbl>
 1 Korea, Dem. People's Rep.                                         12.8
 2 Eritrea                                                           14.2
 3 Micronesia, Fed. Sts.                                             20.6
 4 Central African Republic                                          25.8
 5 South Sudan                                                       25.9
 6 Cuba                                                              29.4
 7 Djibouti                                                          31.2
 8 Marshall Islands                                                  31.4
 9 Kiribati                                                          35.1
10 Chad                                                              38.7
# ℹ 207 more rows

4 Creating new columns with mutate()

Sometimes we might want to use our existing columns to create new ones. For example, we might want to know which country has the highest percentage of its land as forest. In our current data set, we can see amount of forested land (forest_area_sq_km) but this doesn’t take into account the size of the country. Fortunately, we also have the variable surface_area_sq_km so this is something we can calculate ourselves.

Conceptually, a new variable representing the percentage of forested land can be created by taking forest_area_sq_km, dividing it by surface_area_sq_km and multiplying by 100.

The function mutate() uses the syntax new_variable = existing_variables. I am also saving over our existing dataframe so we can see our new variable. I am then using select() to see only the variables I want, and order by desc(forest_percent) to see the countries with the highest percentage of forest at the top.

factbook_2015 <- factbook_2015 |> 
  mutate(forest_percent = (forest_area_sq_km / surface_area_sq_km)*100)

factbook_2015 |> 
  select(country_name, forest_area_sq_km, surface_area_sq_km, forest_percent) |> 
  arrange(desc(forest_percent))
# A tibble: 217 × 4
   country_name          forest_area_sq_km surface_area_sq_km forest_percent
   <chr>                             <dbl>              <dbl>          <dbl>
 1 Suriname                        152517.             163820           93.1
 2 Micronesia, Fed. Sts.              643.                700           91.8
 3 Palau                              410.                460           89.1
 4 Equatorial Guinea                24902.              28050           88.8
 5 Gabon                           235900              267670           88.1
 6 Solomon Islands                  25266.              28900           87.4
 7 American Samoa                     173.                200           86.4
 8 Guyana                          184614.             214970           85.9
 9 Papua New Guinea                360244.             462840           77.8
10 Seychelles                         337                 460           73.3
# ℹ 207 more rows

Wow! Suriname is 93% forested!

We might be interested to have forest_area_sq_km and surface_area_sq_km show up in scientific notation,. We will use the function format() within a mutate() function to indicate that scientific notation should be true (i.e., scientific = TRUE).

factbook_2015 <- factbook_2015 |> 
  mutate(forest_area_sq_km = format(forest_area_sq_km, scientific = TRUE),
         surface_area_sq_km = format(surface_area_sq_km, scientific = TRUE))

Did it work?

factbook_2015 |> 
  select(country_name, forest_area_sq_km, surface_area_sq_km) |> 
  arrange(desc(surface_area_sq_km))
# A tibble: 217 × 3
   country_name             forest_area_sq_km surface_area_sq_km
   <chr>                    <chr>             <chr>             
 1 Canada                   3.471157e+06      9.879750e+06      
 2 United Arab Emirates     3.173000e+03      9.864790e+04      
 3 United States            3.100950e+06      9.831510e+06      
 4 Sao Tome and Principe    5.500000e+02      9.600000e+02      
 5 China                    2.102942e+06      9.562911e+06      
 6 Turks and Caicos Islands 1.052000e+02      9.500000e+02      
 7 Tanzania                 4.809000e+05      9.473000e+05      
 8 Hungary                  2.060820e+04      9.303000e+04      
 9 Cyprus                   1.727100e+03      9.251000e+03      
10 Nigeria                  2.244345e+05      9.237700e+05      
# ℹ 207 more rows

Excellent.

4.1 Practice

Which country has the highest dollar amount of GDP due to agricultlure, foresty and fishing products? What is that amount in US dollars?

Need a hint?

Our data has gdp_current_us and agriculture_forestry_and_fishing_value_added_percent_of_gdp. From this you can calculate how much the total GDP from these sources.

Click for the solution
factbook_2015 |> 
  mutate(
    ag_gdp = agriculture_forestry_and_fishing_value_added_percent_of_gdp * gdp_current_us) |> 
  select(country_name, ag_gdp, agriculture_forestry_and_fishing_value_added_percent_of_gdp, gdp_current_us) |> 
  arrange(desc(ag_gdp))
# A tibble: 217 × 4
   country_name        ag_gdp agriculture_forestry_and_fishing_…¹ gdp_current_us
   <chr>                <dbl>                               <dbl>          <dbl>
 1 China              9.28e13                                8.39        1.11e13
 2 India              3.40e13                               16.2         2.10e12
 3 United States      1.88e13                                1.03        1.83e13
 4 Indonesia          1.16e13                               13.5         8.61e11
 5 Nigeria            1.02e13                               20.6         4.93e11
 6 Brazil             7.78e12                                4.32        1.80e12
 7 Pakistan           6.98e12                               23.3         3.00e11
 8 Turkiye            5.94e12                                6.87        8.64e11
 9 Russian Federation 5.28e12                                3.87        1.36e12
10 Japan              4.60e12                                1.03        4.44e12
# ℹ 207 more rows
# ℹ abbreviated name:
#   ¹​agriculture_forestry_and_fishing_value_added_percent_of_gdp
Back to top