Data Wrangling 5: Tidy Data and Pivoting

r-basics

tidyverse

Author

Aaron Max Wiedemer and Horacio Lopez-Nicora

Published

September 30, 2024

Tidy Data By Julia Lowndes and Allison Horst

1 Overview

Document Summary

tidy data : a way of organizing data where columns contain variables and rows contain observations

Terms

variable : contains values of an attribute that is measured (i.e., length, sex, pH, color)
observation : contains all the values measured on the same unit (i.e., person, sample, day)

Functions

pivot_longer() : a function that “lengthens” data by transforming a set of columns containing values into two columns - one containing the previous column names, and the other contain all of the values. This is the inverse of pivot_wider()
pivot_wider() : a function that “widens” data by transforming a set of two columns (though can be more) containing names and values into a larger set of columns. This is the inverse of pivot_longer()

1.1 Starting note

Much of this page is inspired from and has examples from - Hadley Wickam’s Tidy Data - R for Data Science

2 Introduction

So far, we’ve been learning a lot of functions within the tidyverse to manipulate datasets

filter() to choose rows
select() to choose columns
arrange() to sort data
rename() to change data names
mutate() to add and change data
summarise() to produce data summaries

Today we will be learning more about the structure of datasets, specifically tidy data .

Most of what we have been working in code club this semester has actually been using tidy data, as most functions in the tidyverse are used to create or manipulate tidy data.

While we have been focusing on how to manipulate tidy data, today we will be learning two functions that can be used to manipulate and more powerfully create tidy data - pivot_longer() and pivot_wider()

2.1 Set up

Lets get our favorite library loaded up

#install.packages("tidyverse") # delete first hash and run to install if you havent already
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

3 Tidy data

3.1 Data Structure

Most datasets we work with in R are organized in a roughly rectangular grid-like format made up of rows and columns that contain some values.

Each value in a dataset belongs to some variable and some observation.

A variable contains values of an attribute that was measured (i.e., length, sex, pH, color)
An observation contains all the values measured on the same thing (unit) (i.e., person, sample, day)

Examples:

If I’m measuring the height of a population each person will be one observation and the variable I am measuring is height

height <- tibble(
 height_in = c(rnorm(43, m=67, sd=8))
 )

height

# A tibble: 43 × 1
   height_in
       <dbl>
 1      55.1
 2      70.2
 3      60.9
 4      60.0
 5      68.1
 6      76.8
 7      51.8
 8      61.3
 9      64.5
10      76.5
# ℹ 33 more rows

If I’m counting the number of geese I see on campus throughout the year every day, the variables will be number of geese I see, and the day I count them on. The observation will be one instance of me counting geese on a specific day

geese <- tibble(
 day = seq(1:365),
 geese_number = c(trunc(rnorm(365, m=24, sd=10)))
 )

geese

# A tibble: 365 × 2
     day geese_number
   <int>        <dbl>
 1     1           19
 2     2           14
 3     3           23
 4     4           17
 5     5           25
 6     6           31
 7     7           49
 8     8            7
 9     9           33
10    10           50
# ℹ 355 more rows

This may sound like review, but in practice can be somewhat tricky

Question

Looking at the diamonds dataset, what is the observation and variable of the the value diamonds[5,7] ?

Solution (click here)

diamonds[5,7]

# A tibble: 1 × 1
  price
  <int>
1   335

The observation here is the 5th diamond and the variable we are measuring is price

3.2 Pre-Tidy Date

“All tidy data sets are alike. Each messy dataset is messy in its own way” - Hadley Wickham.

At risk of stating the obvious, organizing values in datasets can be accomplished in many ways

Here with an example of table of student grades on several assignments adapted from Hadley Wickam’s Tidy Data

classroom <- tribble(
  ~name,    ~quiz1, ~quiz2, ~test1,
  "Billy",  NA,     "67",    "75",
  "Suzy",   "44",    NA,     NA,
  "Lionel", "86",    "74",    "84",
  "Jenny",  "94",    "98",    "88"
  )

classroom

# A tibble: 4 × 4
  name   quiz1 quiz2 test1
  <chr>  <chr> <chr> <chr>
1 Billy  <NA>  67    75   
2 Suzy   44    <NA>  <NA> 
3 Lionel 86    74    84   
4 Jenny  94    98    88

There is more than one way to organize this information in a table - for instance here we can flip (transpose) the table to have student names in columns and assessments in rows

tribble(
  ~assessment, ~Billy, ~Suzy, ~Lionel, ~Jenny,
  "quiz1",     NA,     "44",   "86",     "94",
  "quiz2",     "67",    NA,    "74",     "98",
  "test1",     "75",    NA,    "84",     "88"
  )

# A tibble: 3 × 5
  assessment Billy Suzy  Lionel Jenny
  <chr>      <chr> <chr> <chr>  <chr>
1 quiz1      <NA>  44    86     94   
2 quiz2      67    <NA>  74     98   
3 test1      75    <NA>  84     88

This data isn’t tidy data yet though.

SIde Note: Different Organizations Serve Different Purposes

Different ways of organizing data can serve different purposes - some ways of organizing data are great for collecting data, some good for analyzing data, some both, and some neither.

3.2.1 The look of tidy data

Roughly speaking, in tidy data columns contain variables and rows contain observations

Here’s the dataset manually remade in a tidy format

tidy_classroom <- tibble( name = c( rep( "Billy", 3), rep( "Jenny", 3), rep("Lionel", 3), rep("Suzy", 3)),
        assessment = c( rep(c("quiz1", "quiz2", "test1"), 4)),
        grade = c(NA, "67", "75", "94", "98", "88", "86", "74", "84", "44", NA, NA))


tidy_classroom

# A tibble: 12 × 3
   name   assessment grade
   <chr>  <chr>      <chr>
 1 Billy  quiz1      <NA> 
 2 Billy  quiz2      67   
 3 Billy  test1      75   
 4 Jenny  quiz1      94   
 5 Jenny  quiz2      98   
 6 Jenny  test1      88   
 7 Lionel quiz1      86   
 8 Lionel quiz2      74   
 9 Lionel test1      84   
10 Suzy   quiz1      44   
11 Suzy   quiz2      <NA> 
12 Suzy   test1      <NA>

Here, the values, variables, and observations are more clear

The variables are

the names of the students (in the name column)
the assessments (in the assessment column)
the grades (in the grade column)

Each observation here is a particular grade, on each assessmnet, for each student.

Note

Look at the dimensions of this table. From this, we can immediatly tell the number of variables, and the number of observations in this dataset

3.3 Tidy data definition

Tidy data is a standard way of organizing data, and is defined by 3 things

In tidy data

Each variable is one column; columns are single variables
Each observation is one row; rows are single observations
Each value is a cell; cells are values

Messy data is defined as any data organization that is not tidy data

Why use tidy data?

Having data structured as tidy data makes it easier to manipulate and analyze data in tidy R - in fact, most of the functions we have been learning up to now have been preformed on tidy data - this is the tidyverse after all
Some functions require tidy data, and most operations on data in the tidyverse are made significantly easier by having your data as tidy data (i.e., data visualization, statistics)
Having data in a standardized format makes it easier to collaborate on

Order of Columns

A note on ordering of columns : While order of columns doesn’t really matter for data to be considered tidy, it is standard and common practice to put meta data (aka fixed or random variables) first, followed by the columns for measured data (i.e., sample name columns go first)

4 Making data tidy - `pivot_longer()` and `pivot_wider()`

Not all data is tidy, which creates the need for us to know how to clean it.

4.1 Pivoting Longer

One common problem mentioned in R for data science and by Hadley Wickham is when you have column names that are actually variable values (when a single variable is spread across multiple columns)

Sidenote

Having a set of columns with variable values for names is actually very common in metabomolics or genomics data, where column titles are metabolites, genes, taxa, etc.

Here is a dataset with this issue - This data denotes the population of Afghanistan, Brazil, and China for the years 1999 and 2000

table4b

# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

Question

What are the variables in the table4b data?

Solution (click here)

The variables are

country
year
population

This format of data is sometimes called wide data, where the row of column names (the very first row) actually contains variable values.

In this data it doesn’t look very wide, but imagine how much wider if data was taken every year up to or past this year. Wide data tends to get wide very quickly.

For an example of very wide data, look at the who dataset

data(who)
 who

# A tibble: 7,240 × 60
   country  iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
   <chr>    <chr> <chr> <dbl>       <dbl>        <dbl>        <dbl>        <dbl>
 1 Afghani… AF    AFG    1980          NA           NA           NA           NA
 2 Afghani… AF    AFG    1981          NA           NA           NA           NA
 3 Afghani… AF    AFG    1982          NA           NA           NA           NA
 4 Afghani… AF    AFG    1983          NA           NA           NA           NA
 5 Afghani… AF    AFG    1984          NA           NA           NA           NA
 6 Afghani… AF    AFG    1985          NA           NA           NA           NA
 7 Afghani… AF    AFG    1986          NA           NA           NA           NA
 8 Afghani… AF    AFG    1987          NA           NA           NA           NA
 9 Afghani… AF    AFG    1988          NA           NA           NA           NA
10 Afghani… AF    AFG    1989          NA           NA           NA           NA
# ℹ 7,230 more rows
# ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
#   new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
#   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
#   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
#   new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
#   new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, …

To make this data tidy, we need to the the column names (the year variable) and turn it into its own column (called year), and take all of the population data and put that into a second variable column (called population)

Previously with the classroom data we did this manually, but there is a much better way to pivot from wide messy data, to longer tidy data with the function pivot_longer

Lets look at the help file for it

?pivot_longer

From this, we see that the pivot_longer function can take a lot of aruguments, but there are only a few important ones

cols, in which we select the columns to pivot (i.e., the columns whos names contain a variable)
names_to, in which we create a new column for where our old column names are to be put into
values_to, in which we create a second new column for the values to be put into

Here’s what that looks like in our example

tidy_table4b <- table4b |> 
  pivot_longer(cols = `1999`:`2000`, 
               names_to = "year", 
               values_to = "population") 

# note here the numbers are in backticks. This is because R treats column names as characters, and calling a name as number data won't work here. 

# the new column names also need to be in quotes here

tidy_table4b

# A tibble: 6 × 3
  country     year  population
  <chr>       <chr>      <dbl>
1 Afghanistan 1999    19987071
2 Afghanistan 2000    20595360
3 Brazil      1999   172006362
4 Brazil      2000   174504898
5 China       1999  1272915272
6 China       2000  1280428583

using pivot_longer we turned a 3x3 table where the column names stored variable values into a 6x3 table where each column contains a single variable, and each row a single observation - and presto, tidy data!

One other thing to note here is that we are not pivoting every column. In some datasets you may only have a few columns that you would like to be longer

4.1.1 More examples and uses of pivot

Pivoting isn’t just for cleaning up! When combined with other functions, pivoting can also be a very useful way to manipulate data to make it easier to analyze

Last week, we used the summary function to find the mean height, width and depth of diamonds in the diamonds dataset

data(diamonds)

diamonds |> 
  summarise(mean_x = mean(x),
            mean_y = mean(y),
            mean_z = mean(z))

# A tibble: 1 × 3
  mean_x mean_y mean_z
   <dbl>  <dbl>  <dbl>
1   5.73   5.73   3.54

For this we had to create a new column for every dimensional mean, which a.) takes a while to type, especially if we wanted to find the mean of further values and b.) comes out to not be tidy data

Another way to find the means for these columns is to use the pivot_longer function to make another column which we can group the values by

diamonds |> 
pivot_longer(cols = c(x:z), 
             names_to = "dimension", 
             values_to = "value") |> 
  summarise(mean_value = mean(value),
            .by = dimension)

# A tibble: 3 × 2
  dimension mean_value
  <chr>          <dbl>
1 x               5.73
2 y               5.73
3 z               3.54

# note in cols, I'm not specifying x, y, and z individually. In the dataset they are all next to each other, so I can use the : operator to signify that I want columns x through z. This can be really helpful on wider datasets.

Here we only have three values, but for wider datasets this can be a very powerful tool

4.1.2 Pivot longer problems

Question 1

Make the classroom data from earlier into a tidy format using pivot_longer

classroom <- tribble(
  ~name,    ~quiz1, ~quiz2, ~test1,
  "Billy",  NA,     "D",    "C",
  "Suzy",   "F",    NA,     NA,
  "Lionel", "B",    "C",    "B",
  "Jenny",  "A",    "A",    "B"
  )

Solution (click here)

classroom |> 
  pivot_longer(cols = c(quiz1,quiz2, test1), 
               names_to = "assesment", 
               values_to = "grade")

# A tibble: 12 × 3
   name   assesment grade
   <chr>  <chr>     <chr>
 1 Billy  quiz1     <NA> 
 2 Billy  quiz2     D    
 3 Billy  test1     C    
 4 Suzy   quiz1     F    
 5 Suzy   quiz2     <NA> 
 6 Suzy   test1     <NA> 
 7 Lionel quiz1     B    
 8 Lionel quiz2     C    
 9 Lionel test1     B    
10 Jenny  quiz1     A    
11 Jenny  quiz2     A    
12 Jenny  test1     B

Question 2

Find the means, minimum values, maximum values, and standard deviations of all of the numeric columns in the diamonds dataset using pivot_longer followed by summarise

Hint (click here)

Which are the numeric columns?

str(diamonds)

tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Solution (click here)

diamonds |> 
  pivot_longer(cols = c(carat,depth:z), 
               names_to = "parameter", 
               values_to = "value") |> 
  summarise(mean = mean(value),
            min = min(value),
            max = max(value),
            sd = sd(value),
            .by = parameter)

# A tibble: 7 × 5
  parameter     mean   min      max       sd
  <chr>        <dbl> <dbl>    <dbl>    <dbl>
1 carat        0.798   0.2     5.01    0.474
2 depth       61.7    43      79       1.43 
3 table       57.5    43      95       2.23 
4 price     3933.    326   18823    3989.   
5 x            5.73    0      10.7     1.12 
6 y            5.73    0      58.9     1.14 
7 z            3.54    0      31.8     0.706

Question 3 - Challange

Take a look at the very wide dataset who

data(who)

head(who)

# A tibble: 6 × 60
  country   iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
  <chr>     <chr> <chr> <dbl>       <dbl>        <dbl>        <dbl>        <dbl>
1 Afghanis… AF    AFG    1980          NA           NA           NA           NA
2 Afghanis… AF    AFG    1981          NA           NA           NA           NA
3 Afghanis… AF    AFG    1982          NA           NA           NA           NA
4 Afghanis… AF    AFG    1983          NA           NA           NA           NA
5 Afghanis… AF    AFG    1984          NA           NA           NA           NA
6 Afghanis… AF    AFG    1985          NA           NA           NA           NA
# ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
#   new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
#   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
#   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
#   new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
#   new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>,
#   new_sn_f014 <dbl>, new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, …

This data contains records of the counts of confirmed tuberculosis cases by country, year, and demographic group. The demographic groups are broken down by sex (m, f) and age (0-14, 15-25, 25-34, 35-44, 45-54, 55-64, unknown).

After using pivot_longer, find the country with the most cases of tb in 2011.

Note: The first four columns (country, iso2, iso3, and year) are already properly formatted and do not need to be pivoted here.

Hint (click here)

NA values are liekly to mess up your analysis here - check the pivot_longer() help file to see what arguments are available to remove NA values

Solution (click here)

who |> 
  pivot_longer(cols = !c(country:year), 
               names_to = "demographic", 
               values_to = "tb_counts", 
               values_drop_na = TRUE) |> 
  summarise(count = sum(tb_counts),
            .by = c(country,year)) |> 
  filter(year=='2011') |> 
  arrange(desc(count))

# A tibble: 210 × 3
   country                                year  count
   <chr>                                 <dbl>  <dbl>
 1 China                                  2011 865059
 2 India                                  2011 642311
 3 Indonesia                              2011 313601
 4 South Africa                           2011 289111
 5 Bangladesh                             2011 148198
 6 Pakistan                               2011 127571
 7 Russian Federation                     2011 104320
 8 Philippines                            2011  93580
 9 Democratic People's Republic of Korea  2011  85564
10 Kenya                                  2011  84548
# ℹ 200 more rows

China is our winner with a sum of 865059 cases in 2011

4.2 Pivoting Wider

The other function of the day here is pivot_wider

pivot_wider , as the name might imply, is the inverse of pivot_longer and creates columns from from values stored in another column

4.2.1 Uses of `pivot_wider()`

You may be thinking, “why would I ever need this? I thought the whole point was to avoid having columns of variable values”

pivot_wider is an especially useful function in many situations. Here are only a few:

pivot_wider is useful where we have the opposite problem as we saw when learning pivot_longer - when a multiple variables are stored in a single column (single observation is scattered across more than one row)
pivot_wider can be useful when we want to make calculations across columns
Not every function in R accepts tidy data, sometimes we do need wide data! (e.g., anything that requires a matrix, PCA)

4.2.2 Using `pivot_wider()`

Lets see how pivot_wider() is used

This is a dataset containing population data and the number of cases of some disease in Afghanistan, Brazil, and China for the years 1999 and 2000

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Here, we have columns containing more than one variable this time (type has both cases and population).

Lets check out pivot_wider to see how we can undo this and make our data tidy again

?pivot_wider

Again, pivot_wider has a bunch of arguments, but this time there are really only two essential ones

names_from which denotes the column in which our new column names are stored
values_from which denotes the column in which the values we will be filling our new columns with are stored

Note:

There are fewer important arguments in pivot_wider - which one is missing its inverse from pivot_longer?

Lets see it in action

table2 |> 
  pivot_wider(names_from = type, values_from = count)

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

We have now seperated our variables into seperate columns, and now our rows each contain only one observation. Tidy data!

4.2.3 Pivot Wider Problems

Question 1

Undo the tidy_classroom and make it look like one of its untidy form using pivot_wider

tidy_classroom <- tibble( name = c( rep( "Billy", 3), rep( "Jenny", 3), rep("Lionel", 3), rep("Suzy", 3)),
        assessment = c( rep(c("quiz1", "quiz2", "test1"), 4)),
        grade = c(NA, "67", "75", "94", "98", "88", "86", "74", "84", "44", NA, NA))

Solution (click here)

tidy_classroom |> 
  pivot_wider(names_from = assessment, values_from = grade)

# A tibble: 4 × 4
  name   quiz1 quiz2 test1
  <chr>  <chr> <chr> <chr>
1 Billy  <NA>  67    75   
2 Jenny  94    98    88   
3 Lionel 86    74    84   
4 Suzy   44    <NA>  <NA>

Question 2

In one of the questions from using pivot_longer, we put all of the numeric variables into two columns, here named parameter and value. Can you return the data tampered_diamonds to its original state?

tampered_diamonds <- diamonds |> 
  pivot_longer(cols = c(carat, depth:z), names_to = "parameter", values_to = "value")

note : if you encounter an error, be sure to check the hint

Hint (click here)

tampered_diamonds |> 
  pivot_wider(names_from = parameter, values_from = value)

Warning: Values from `value` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} |>
  dplyr::summarise(n = dplyr::n(), .by = c(cut, color, clarity, parameter)) |>
  dplyr::filter(n > 1L)

# A tibble: 276 × 10
   cut       color clarity carat       depth       table price x     y     z    
   <ord>     <ord> <ord>   <list>      <list>      <lis> <lis> <lis> <lis> <lis>
 1 Ideal     E     SI2     <dbl [469]> <dbl [469]> <dbl> <dbl> <dbl> <dbl> <dbl>
 2 Premium   E     SI1     <dbl [614]> <dbl [614]> <dbl> <dbl> <dbl> <dbl> <dbl>
 3 Good      E     VS1     <dbl [89]>  <dbl [89]>  <dbl> <dbl> <dbl> <dbl> <dbl>
 4 Premium   I     VS2     <dbl [315]> <dbl [315]> <dbl> <dbl> <dbl> <dbl> <dbl>
 5 Good      J     SI2     <dbl [53]>  <dbl [53]>  <dbl> <dbl> <dbl> <dbl> <dbl>
 6 Very Good J     VVS2    <dbl [29]>  <dbl [29]>  <dbl> <dbl> <dbl> <dbl> <dbl>
 7 Very Good I     VVS1    <dbl [69]>  <dbl [69]>  <dbl> <dbl> <dbl> <dbl> <dbl>
 8 Very Good H     SI1     <dbl [547]> <dbl [547]> <dbl> <dbl> <dbl> <dbl> <dbl>
 9 Fair      E     VS2     <dbl [42]>  <dbl [42]>  <dbl> <dbl> <dbl> <dbl> <dbl>
10 Very Good H     VS1     <dbl [257]> <dbl [257]> <dbl> <dbl> <dbl> <dbl> <dbl>
# ℹ 266 more rows

Uh-oh! Uh-oh! With what we have learned so far, we cant undo pivot_longer with pivot_wider here! Can you figure out why?

Solution (click here)

# code from the error read out
tampered_diamonds %>%
  dplyr::group_by(cut, color, clarity, parameter) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop")  %>%
  dplyr::filter(n > 1L)

# A tibble: 1,897 × 5
   cut   color clarity parameter     n
   <ord> <ord> <ord>   <chr>     <int>
 1 Fair  D     I1      carat         4
 2 Fair  D     I1      depth         4
 3 Fair  D     I1      price         4
 4 Fair  D     I1      table         4
 5 Fair  D     I1      x             4
 6 Fair  D     I1      y             4
 7 Fair  D     I1      z             4
 8 Fair  D     SI2     carat        56
 9 Fair  D     SI2     depth        56
10 Fair  D     SI2     price        56
# ℹ 1,887 more rows

# example of multiple replicates
tampered_diamonds |> 
  filter(cut=="Premium", color == "D", clarity=="VS2")

# A tibble: 2,373 × 5
   cut     color clarity parameter  value
   <ord>   <ord> <ord>   <chr>      <dbl>
 1 Premium D     VS2     carat       0.22
 2 Premium D     VS2     depth      59.3 
 3 Premium D     VS2     table      62   
 4 Premium D     VS2     price     404   
 5 Premium D     VS2     x           3.91
 6 Premium D     VS2     y           3.88
 7 Premium D     VS2     z           2.31
 8 Premium D     VS2     carat       0.71
 9 Premium D     VS2     depth      62.5 
10 Premium D     VS2     table      60   
# ℹ 2,363 more rows

multiple values for each “type” of diamond - not uniquely identifiable, thus we can’t get to the original dataset

5 Citations

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

https://r4ds.had.co.nz/tidy-data.html#wider

1 Overview

1.1 Starting note

2 Introduction

2.1 Set up

3 Tidy data

3.1 Data Structure

3.2 Pre-Tidy Date

3.2.1 The look of tidy data

3.3 Tidy data definition

Why use tidy data?

4 Making data tidy - pivot_longer() and pivot_wider()

4.1 Pivoting Longer

4.1.1 More examples and uses of pivot

4.1.2 Pivot longer problems

4.2 Pivoting Wider

4.2.1 Uses of pivot_wider()

4.2.2 Using pivot_wider()

4.2.3 Pivot Wider Problems

5 Citations

4 Making data tidy - `pivot_longer()` and `pivot_wider()`

4.2.1 Uses of `pivot_wider()`

4.2.2 Using `pivot_wider()`