Data wrangling: using select(), arrange(), and mutate()
r-basics
tidyverse
Author
Jessica Cooperstone
Published
September 9, 2024
1 Introduction
We are going to start off this semester of code club with a series of sessions on how to “wrangle” your data. It can be a struggle to get data into a format that is amenable for analysis, so we will be going through a series of functions and approaches that will get you more comfortable with manipulating your data.
Last Code Club, we didn’t get to go over the function select() - so we are going to do that today, along with the function mutate(). In brief:
mutate() creates new columns based on existing ones
1.1 Load packages
First we will load the packages that we will use today.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor) # for cleaning column names
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
1.2 Data
We are going to use data from The World Factbook, put together by the CIA to “provides basic intelligence on the history, people, government, economy, energy, geography, environment, communications, transportation, military, terrorism, and transnational issues for 265 world entities.” I thought this data would give us some opportunities to flex our R skills, and learn a bit about the world.
The data we are going to download can be found here, though I have saved the file, added it to our Code Club Github, and included some code below for you to download it.
If you downloaded this file two weeks ago, you don’t need to download it again, you just need to make sure you know where it is on your computer to read it in.
Rows: 217 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country Name, Country Code
dbl (51): Population, total, Population growth (annual %), Surface area (sq....
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can look at this data another way, using the function head() to look at the first six rows, and every column.
head(factbook_2015)
# A tibble: 6 × 53
`Country Name` `Country Code` `Population, total` Population growth (annual …¹
<chr> <chr> <dbl> <dbl>
1 Afghanistan AFG 33753499 3.12
2 Albania ALB 2880703 -0.291
3 Algeria DZA 39543154 2.00
4 American Samoa ASM 51368 -1.64
5 Andorra AND 71746 0.174
6 Angola AGO 28127721 3.62
# ℹ abbreviated name: ¹`Population growth (annual %)`
# ℹ 49 more variables: `Surface area (sq. km)` <dbl>,
# `Poverty headcount ratio at national poverty lines (% of population)` <dbl>,
# `GNI, Atlas method (current US$)` <dbl>,
# `GNI per capita, Atlas method (current US$)` <dbl>,
# `GNI, PPP (current international $)` <dbl>,
# `GNI per capita, PPP (current international $)` <dbl>, …
Two weeks ago, we cleaned up the column names, so let’s do that again. If you want to remember why, you can go back to that lesson.
factbook_2015 <-clean_names(factbook_2015)
2 Choose columns with select()
Often you will want to pick only certain columns in your dataframe, and you can do this with the function select(). You can pick columns generally by:
their names
their position
characteristics of that column
If we want to know how the arguments to select() work, we can access the documentation material about the function.
?select()
Let’s first select columns by their names. Let’s pick just the country_name, population_total, and surface_area_sq_km.
In general I would recommend against this because its really hard to remember which column indices are which variables today, nevermind returning back to old code 1 year from now.
We can also select columns that are consecutive, using the : operator. Below I’m selecting the columns country_name through population_growth_annual_percent.
where(): selects columns where the statement given in the argument is TRUE
all_of(): matches all of the variable names in a character vector
any_of(): matches any of the names in a character vector
For example, we might want each column that has anything to do with gross domestic product, or gdp. We can select all of the columns which contain the string “gdp” in their name. I’m also going to add country_name so we know what we’re working with.
We can also select all column that meet a certain predicate. For example, we can pick all of the column that are of the type character.
factbook_2015 |>select(where(is.character))
# A tibble: 217 × 2
country_name country_code
<chr> <chr>
1 Afghanistan AFG
2 Albania ALB
3 Algeria DZA
4 American Samoa ASM
5 Andorra AND
6 Angola AGO
7 Antigua and Barbuda ATG
8 Argentina ARG
9 Armenia ARM
10 Aruba ABW
# ℹ 207 more rows
We can also combine selections using the & (and), | (or), and ! (not) operators. For example, if I want the columns about GNI but only the international ones I can:
# A tibble: 217 × 2
gni_ppp_current_international gni_per_capita_ppp_current_international
<dbl> <dbl>
1 77739869000 2300
2 33981401261 11800
3 541234000000 13690
4 NA NA
5 NA NA
6 190283000000 6760
7 1766297473 19640
8 848960000000 19680
9 30505448341 10600
10 3718236361 35660
# ℹ 207 more rows
We can also use select() to order our columns, as the order we select them in dictates the order they will exist in our dataframe.
2.1 Practice
Come up with 3 different ways to select the columns about children, and make sure you also include a country column so you know what you’re looking at.
Need a hint?
Here are the columns that I’m considering to be about children:
# A tibble: 217 × 7
country_name mortality_rate_under_5_per_1_000…¹ prevalence_of_underw…²
<chr> <dbl> <dbl>
1 Afghanistan 72.7 NA
2 Albania 9.6 NA
3 Algeria 25.3 NA
4 American Samoa NA NA
5 Andorra 3.6 NA
6 Angola 87.9 19
7 Antigua and Barbuda 10.9 NA
8 Argentina 11.8 NA
9 Armenia 14.5 NA
10 Aruba NA NA
# ℹ 207 more rows
# ℹ abbreviated names: ¹mortality_rate_under_5_per_1_000_live_births,
# ²prevalence_of_underweight_weight_for_age_percent_of_children_under_5
# ℹ 4 more variables:
# immunization_measles_percent_of_children_ages_12_23_months <dbl>,
# primary_completion_rate_total_percent_of_relevant_age_group <dbl>,
# school_enrollment_secondary_percent_gross <dbl>, …
By index:
factbook_2015 |>select(1, 17:22)
# A tibble: 217 × 7
country_name mortality_rate_under_5_per_1_000…¹ prevalence_of_underw…²
<chr> <dbl> <dbl>
1 Afghanistan 72.7 NA
2 Albania 9.6 NA
3 Algeria 25.3 NA
4 American Samoa NA NA
5 Andorra 3.6 NA
6 Angola 87.9 19
7 Antigua and Barbuda 10.9 NA
8 Argentina 11.8 NA
9 Armenia 14.5 NA
10 Aruba NA NA
# ℹ 207 more rows
# ℹ abbreviated names: ¹mortality_rate_under_5_per_1_000_live_births,
# ²prevalence_of_underweight_weight_for_age_percent_of_children_under_5
# ℹ 4 more variables:
# immunization_measles_percent_of_children_ages_12_23_months <dbl>,
# primary_completion_rate_total_percent_of_relevant_age_group <dbl>,
# school_enrollment_secondary_percent_gross <dbl>, …
# A tibble: 217 × 7
country_name mortality_rate_under_5_per_1_000…¹ prevalence_of_underw…²
<chr> <dbl> <dbl>
1 Afghanistan 72.7 NA
2 Albania 9.6 NA
3 Algeria 25.3 NA
4 American Samoa NA NA
5 Andorra 3.6 NA
6 Angola 87.9 19
7 Antigua and Barbuda 10.9 NA
8 Argentina 11.8 NA
9 Armenia 14.5 NA
10 Aruba NA NA
# ℹ 207 more rows
# ℹ abbreviated names: ¹mortality_rate_under_5_per_1_000_live_births,
# ²prevalence_of_underweight_weight_for_age_percent_of_children_under_5
# ℹ 4 more variables:
# immunization_measles_percent_of_children_ages_12_23_months <dbl>,
# primary_completion_rate_total_percent_of_relevant_age_group <dbl>,
# school_enrollment_primary_and_secondary_gross_gender_parity_index_gpi <dbl>, …
These are just some ways!
3 Sorting data with arrange()
A nice helper function for looking at your data is arrange() which sorts your data.
We can sort our data based on how much forest (forest_area_sq_km) each country has.
# A tibble: 217 × 3
country_name surface_area_sq_km forest_area_sq_km
<chr> <dbl> <dbl>
1 Russian Federation 17098250 8149305.
2 Brazil 8515770 5038848
3 Canada 9879750 3471157.
4 United States 9831510 3100950
5 China 9562911 2102942.
6 Australia 7741220 1330945
7 Congo, Dem. Rep. 2344860 1316621.
8 Indonesia 1913580 950279
9 Peru 1285220 731945.
10 India 3287260 708280
# ℹ 207 more rows
Wow almost 50% of Russia is forested.
3.1 Practice
Which countries have the lowest cell phone subscriptions? mobile_cellular_subscriptions_per_100_people
Need a hint?
You can use the function arrange() to sort your columns. The default arranging is from low to high, so if you want to go from high to low, you can set arrange(desc()).
# A tibble: 217 × 2
country_name mobile_cellular_subscriptions_per_100_people
<chr> <dbl>
1 Korea, Dem. People's Rep. 12.8
2 Eritrea 14.2
3 Micronesia, Fed. Sts. 20.6
4 Central African Republic 25.8
5 South Sudan 25.9
6 Cuba 29.4
7 Djibouti 31.2
8 Marshall Islands 31.4
9 Kiribati 35.1
10 Chad 38.7
# ℹ 207 more rows
4 Creating new columns with mutate()
Sometimes we might want to use our existing columns to create new ones. For example, we might want to know which country has the highest percentage of its land as forest. In our current data set, we can see amount of forested land (forest_area_sq_km) but this doesn’t take into account the size of the country. Fortunately, we also have the variable surface_area_sq_km so this is something we can calculate ourselves.
Conceptually, a new variable representing the percentage of forested land can be created by taking forest_area_sq_km, dividing it by surface_area_sq_km and multiplying by 100.
The function mutate() uses the syntax new_variable = existing_variables. I am also saving over our existing dataframe so we can see our new variable. I am then using select() to see only the variables I want, and order by desc(forest_percent) to see the countries with the highest percentage of forest at the top.
We might be interested to have forest_area_sq_km and surface_area_sq_km show up in scientific notation,. We will use the function format() within a mutate() function to indicate that scientific notation should be true (i.e., scientific = TRUE).
# A tibble: 217 × 3
country_name forest_area_sq_km surface_area_sq_km
<chr> <chr> <chr>
1 Canada 3.471157e+06 9.879750e+06
2 United Arab Emirates 3.173000e+03 9.864790e+04
3 United States 3.100950e+06 9.831510e+06
4 Sao Tome and Principe 5.500000e+02 9.600000e+02
5 China 2.102942e+06 9.562911e+06
6 Turks and Caicos Islands 1.052000e+02 9.500000e+02
7 Tanzania 4.809000e+05 9.473000e+05
8 Hungary 2.060820e+04 9.303000e+04
9 Cyprus 1.727100e+03 9.251000e+03
10 Nigeria 2.244345e+05 9.237700e+05
# ℹ 207 more rows
Excellent.
4.1 Practice
Which country has the highest dollar amount of GDP due to agricultlure, foresty and fishing products? What is that amount in US dollars?
Need a hint?
Our data has gdp_current_us and agriculture_forestry_and_fishing_value_added_percent_of_gdp. From this you can calculate how much the total GDP from these sources.