tidy data : a way of organizing data where columns contain variables and rows contain observations
Terms
variable : contains values of an attribute that is measured (i.e., length, sex, pH, color)
observation : contains all the values measured on the same unit (i.e., person, sample, day)
Functions
pivot_longer() : a function that “lengthens” data by transforming a set of columns containing values into two columns - one containing the previous column names, and the other contain all of the values. This is the inverse of pivot_wider()
pivot_wider() : a function that “widens” data by transforming a set of two columns (though can be more) containing names and values into a larger set of columns. This is the inverse of pivot_longer()
So far, we’ve been learning a lot of functions within the tidyverse to manipulate datasets
filter() to choose rows
select() to choose columns
arrange() to sort data
rename() to change data names
mutate() to add and change data
summarise() to produce data summaries
Today we will be learning more about the structure of datasets, specifically tidy data .
Most of what we have been working in code club this semester has actually been using tidy data, as most functions in the tidyverse are used to create or manipulate tidy data.
While we have been focusing on how to manipulate tidy data, today we will be learning two functions that can be used to manipulate and more powerfully create tidy data - pivot_longer() and pivot_wider()
2.1 Set up
Lets get our favorite library loaded up
#install.packages("tidyverse") # delete first hash and run to install if you havent alreadylibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
3 Tidy data
3.1 Data Structure
Most datasets we work with in R are organized in a roughly rectangular grid-like format made up of rows and columns that contain some values.
Each value in a dataset belongs to some variable and some observation.
A variable contains values of an attribute that was measured (i.e., length, sex, pH, color)
An observation contains all the values measured on the same thing (unit) (i.e., person, sample, day)
Examples:
If I’m measuring the height of a population each person will be one observation and the variable I am measuring is height
If I’m counting the number of geese I see on campus throughout the year every day, the variables will be number of geese I see, and the day I count them on. The observation will be one instance of me counting geese on a specific day
There is more than one way to organize this information in a table - for instance here we can flip (transpose) the table to have student names in columns and assessments in rows
SIde Note: Different Organizations Serve Different Purposes
Different ways of organizing data can serve different purposes - some ways of organizing data are great for collecting data, some good for analyzing data, some both, and some neither.
3.2.1 The look of tidy data
Roughly speaking, in tidy data columns contain variables and rows contain observations
Here’s the dataset manually remade in a tidy format
Here, the values, variables, and observations are more clear
The variables are
the names of the students (in the name column)
the assessments (in the assessment column)
the grades (in the grade column)
Each observation here is a particular grade, on each assessmnet, for each student.
Note
Look at the dimensions of this table. From this, we can immediatly tell the number of variables, and the number of observations in this dataset
3.3 Tidy data definition
Tidy data is a standard way of organizing data, and is defined by 3 things
In tidy data
Each variable is one column; columns are single variables
Each observation is one row; rows are single observations
Each value is a cell; cells are values
Messy data is defined as any data organization that is not tidy data
Why use tidy data?
Having data structured as tidy data makes it easier to manipulate and analyze data in tidy R - in fact, most of the functions we have been learning up to now have been preformed on tidy data - this is the tidyverse after all
Some functions require tidy data, and most operations on data in the tidyverse are made significantly easier by having your data as tidy data (i.e., data visualization, statistics)
Having data in a standardized format makes it easier to collaborate on
Order of Columns
A note on ordering of columns : While order of columns doesn’t really matter for data to be considered tidy, it is standard and common practice to put meta data (aka fixed or random variables) first, followed by the columns for measured data (i.e., sample name columns go first)
4 Making data tidy - pivot_longer() and pivot_wider()
Not all data is tidy, which creates the need for us to know how to clean it.
4.1 Pivoting Longer
One common problem mentioned in R for data science and by Hadley Wickham is when you have column names that are actually variable values (when a single variable is spread across multiple columns)
Sidenote
Having a set of columns with variable values for names is actually very common in metabomolics or genomics data, where column titles are metabolites, genes, taxa, etc.
Here is a dataset with this issue - This data denotes the population of Afghanistan, Brazil, and China for the years 1999 and 2000
table4b
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
Question
What are the variables in the table4b data?
Solution (click here)
The variables are
country
year
population
This format of data is sometimes called wide data, where the row of column names (the very first row) actually contains variable values.
In this data it doesn’t look very wide, but imagine how much wider if data was taken every year up to or past this year. Wide data tends to get wide very quickly.
For an example of very wide data, look at the who dataset
data(who) who
# A tibble: 7,240 × 60
country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghani… AF AFG 1980 NA NA NA NA
2 Afghani… AF AFG 1981 NA NA NA NA
3 Afghani… AF AFG 1982 NA NA NA NA
4 Afghani… AF AFG 1983 NA NA NA NA
5 Afghani… AF AFG 1984 NA NA NA NA
6 Afghani… AF AFG 1985 NA NA NA NA
7 Afghani… AF AFG 1986 NA NA NA NA
8 Afghani… AF AFG 1987 NA NA NA NA
9 Afghani… AF AFG 1988 NA NA NA NA
10 Afghani… AF AFG 1989 NA NA NA NA
# ℹ 7,230 more rows
# ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
# new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
# new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
# new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
# new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
# new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, …
To make this data tidy, we need to the the column names (the year variable) and turn it into its own column (called year), and take all of the population data and put that into a second variable column (called population)
Previously with the classroom data we did this manually, but there is a much better way to pivot from wide messy data, to longer tidy data with the function pivot_longer
Lets look at the help file for it
?pivot_longer
From this, we see that the pivot_longer function can take a lot of aruguments, but there are only a few important ones
cols, in which we select the columns to pivot (i.e., the columns whos names contain a variable)
names_to, in which we create a new column for where our old column names are to be put into
values_to, in which we create a second new column for the values to be put into
Here’s what that looks like in our example
tidy_table4b <- table4b |>pivot_longer(cols =`1999`:`2000`, names_to ="year", values_to ="population") # note here the numbers are in backticks. This is because R treats column names as characters, and calling a name as number data won't work here. # the new column names also need to be in quotes heretidy_table4b
# A tibble: 6 × 3
country year population
<chr> <chr> <dbl>
1 Afghanistan 1999 19987071
2 Afghanistan 2000 20595360
3 Brazil 1999 172006362
4 Brazil 2000 174504898
5 China 1999 1272915272
6 China 2000 1280428583
using pivot_longer we turned a 3x3 table where the column names stored variable values into a 6x3 table where each column contains a single variable, and each row a single observation - and presto, tidy data!
One other thing to note here is that we are not pivoting every column. In some datasets you may only have a few columns that you would like to be longer
4.1.1 More examples and uses of pivot
Pivoting isn’t just for cleaning up! When combined with other functions, pivoting can also be a very useful way to manipulate data to make it easier to analyze
Last week, we used the summary function to find the mean height, width and depth of diamonds in the diamonds dataset
For this we had to create a new column for every dimensional mean, which a.) takes a while to type, especially if we wanted to find the mean of further values and b.) comes out to not be tidy data
Another way to find the means for these columns is to use the pivot_longer function to make another column which we can group the values by
# A tibble: 3 × 2
dimension mean_value
<chr> <dbl>
1 x 5.73
2 y 5.73
3 z 3.54
# note in cols, I'm not specifying x, y, and z individually. In the dataset they are all next to each other, so I can use the : operator to signify that I want columns x through z. This can be really helpful on wider datasets.
Here we only have three values, but for wider datasets this can be a very powerful tool
4.1.2 Pivot longer problems
Question 1
Make the classroom data from earlier into a tidy format using pivot_longer
# A tibble: 12 × 3
name assesment grade
<chr> <chr> <chr>
1 Billy quiz1 <NA>
2 Billy quiz2 D
3 Billy test1 C
4 Suzy quiz1 F
5 Suzy quiz2 <NA>
6 Suzy test1 <NA>
7 Lionel quiz1 B
8 Lionel quiz2 C
9 Lionel test1 B
10 Jenny quiz1 A
11 Jenny quiz2 A
12 Jenny test1 B
Question 2
Find the means, minimum values, maximum values, and standard deviations of all of the numeric columns in the diamonds dataset using pivot_longer followed by summarise
# A tibble: 7 × 5
parameter mean min max sd
<chr> <dbl> <dbl> <dbl> <dbl>
1 carat 0.798 0.2 5.01 0.474
2 depth 61.7 43 79 1.43
3 table 57.5 43 95 2.23
4 price 3933. 326 18823 3989.
5 x 5.73 0 10.7 1.12
6 y 5.73 0 58.9 1.14
7 z 3.54 0 31.8 0.706
Question 3 - Challange
Take a look at the very wide dataset who
data(who)head(who)
# A tibble: 6 × 60
country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanis… AF AFG 1980 NA NA NA NA
2 Afghanis… AF AFG 1981 NA NA NA NA
3 Afghanis… AF AFG 1982 NA NA NA NA
4 Afghanis… AF AFG 1983 NA NA NA NA
5 Afghanis… AF AFG 1984 NA NA NA NA
6 Afghanis… AF AFG 1985 NA NA NA NA
# ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
# new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
# new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
# new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
# new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
# new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>,
# new_sn_f014 <dbl>, new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, …
This data contains records of the counts of confirmed tuberculosis cases by country, year, and demographic group. The demographic groups are broken down by sex (m, f) and age (0-14, 15-25, 25-34, 35-44, 45-54, 55-64, unknown).
After using pivot_longer, find the country with the most cases of tb in 2011.
Note: The first four columns (country, iso2, iso3, and year) are already properly formatted and do not need to be pivoted here.
Hint (click here)
NA values are liekly to mess up your analysis here - check the pivot_longer() help file to see what arguments are available to remove NA values
# A tibble: 210 × 3
country year count
<chr> <dbl> <dbl>
1 China 2011 865059
2 India 2011 642311
3 Indonesia 2011 313601
4 South Africa 2011 289111
5 Bangladesh 2011 148198
6 Pakistan 2011 127571
7 Russian Federation 2011 104320
8 Philippines 2011 93580
9 Democratic People's Republic of Korea 2011 85564
10 Kenya 2011 84548
# ℹ 200 more rows
China is our winner with a sum of 865059 cases in 2011
4.2 Pivoting Wider
The other function of the day here is pivot_wider
pivot_wider , as the name might imply, is the inverse of pivot_longer and creates columns from from values stored in another column
4.2.1 Uses of pivot_wider()
You may be thinking, “why would I ever need this? I thought the whole point was to avoid having columns of variable values”
pivot_wider is an especially useful function in many situations. Here are only a few:
pivot_wider is useful where we have the opposite problem as we saw when learning pivot_longer - when a multiple variables are stored in a single column (single observation is scattered across more than one row)
pivot_wider can be useful when we want to make calculations across columns
Not every function in R accepts tidy data, sometimes we do need wide data! (e.g., anything that requires a matrix, PCA)
4.2.2 Using pivot_wider()
Lets see how pivot_wider() is used
This is a dataset containing population data and the number of cases of some disease in Afghanistan, Brazil, and China for the years 1999 and 2000
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Here, we have columns containing more than one variable this time (type has both cases and population).
Lets check out pivot_wider to see how we can undo this and make our data tidy again
?pivot_wider
Again, pivot_wider has a bunch of arguments, but this time there are really only two essential ones
names_from which denotes the column in which our new column names are stored
values_from which denotes the column in which the values we will be filling our new columns with are stored
Note:
There are fewer important arguments in pivot_wider - which one is missing its inverse from pivot_longer?
In one of the questions from using pivot_longer, we put all of the numeric variables into two columns, here named parameter and value. Can you return the data tampered_diamonds to its original state?
Warning: Values from `value` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} |>
dplyr::summarise(n = dplyr::n(), .by = c(cut, color, clarity, parameter)) |>
dplyr::filter(n > 1L)
# A tibble: 276 × 10
cut color clarity carat depth table price x y z
<ord> <ord> <ord> <list> <list> <lis> <lis> <lis> <lis> <lis>
1 Ideal E SI2 <dbl [469]> <dbl [469]> <dbl> <dbl> <dbl> <dbl> <dbl>
2 Premium E SI1 <dbl [614]> <dbl [614]> <dbl> <dbl> <dbl> <dbl> <dbl>
3 Good E VS1 <dbl [89]> <dbl [89]> <dbl> <dbl> <dbl> <dbl> <dbl>
4 Premium I VS2 <dbl [315]> <dbl [315]> <dbl> <dbl> <dbl> <dbl> <dbl>
5 Good J SI2 <dbl [53]> <dbl [53]> <dbl> <dbl> <dbl> <dbl> <dbl>
6 Very Good J VVS2 <dbl [29]> <dbl [29]> <dbl> <dbl> <dbl> <dbl> <dbl>
7 Very Good I VVS1 <dbl [69]> <dbl [69]> <dbl> <dbl> <dbl> <dbl> <dbl>
8 Very Good H SI1 <dbl [547]> <dbl [547]> <dbl> <dbl> <dbl> <dbl> <dbl>
9 Fair E VS2 <dbl [42]> <dbl [42]> <dbl> <dbl> <dbl> <dbl> <dbl>
10 Very Good H VS1 <dbl [257]> <dbl [257]> <dbl> <dbl> <dbl> <dbl> <dbl>
# ℹ 266 more rows
Uh-oh! Uh-oh! With what we have learned so far, we cant undo pivot_longer with pivot_wider here! Can you figure out why?
# A tibble: 1,897 × 5
cut color clarity parameter n
<ord> <ord> <ord> <chr> <int>
1 Fair D I1 carat 4
2 Fair D I1 depth 4
3 Fair D I1 price 4
4 Fair D I1 table 4
5 Fair D I1 x 4
6 Fair D I1 y 4
7 Fair D I1 z 4
8 Fair D SI2 carat 56
9 Fair D SI2 depth 56
10 Fair D SI2 price 56
# ℹ 1,887 more rows
# example of multiple replicatestampered_diamonds |>filter(cut=="Premium", color =="D", clarity=="VS2")
# A tibble: 2,373 × 5
cut color clarity parameter value
<ord> <ord> <ord> <chr> <dbl>
1 Premium D VS2 carat 0.22
2 Premium D VS2 depth 59.3
3 Premium D VS2 table 62
4 Premium D VS2 price 404
5 Premium D VS2 x 3.91
6 Premium D VS2 y 3.88
7 Premium D VS2 z 2.31
8 Premium D VS2 carat 0.71
9 Premium D VS2 depth 62.5
10 Premium D VS2 table 60
# ℹ 2,363 more rows
multiple values for each “type” of diamond - not uniquely identifiable, thus we can’t get to the original dataset