install.packages("readxl")
Tidyverse 5: Reading in data
1 Introduction
So far in code club this semester, all of the data we have used has been either made up by us or able to be loading from data embedded within base R or a package.
Today we are going to go over reading in external data into R. The tidyverse core package that contains functions for reading in (and writing out) data is readr
and readxl
allows reading in Excel data. This package allows us to read in rectangular data. Some of the material from today is adapted from a past code club by Jelmer Poelstra.
Useful references for today:
readr
documentation (for reading in delimited files)readxl
docuemtation (for reading in Excel files)- Data import with the tidyverse cheatsheet
- Data import chapter 7 in R for Data Science
There are other non-core tidyverse packages for reading in data (including directly from Google Drive), you can find them here though I will not go through them today.
We are going to use a new package today.
And then will start by loading the libraries we need for today.
library(tidyverse)
library(readxl)
But before we actually read in data, let’s talk a little bit about the different ways in which input data can be formatted.
2 What are some common data formatting styles?
2.1 .csv
The most common formatting style for data is comma separated values, which often has the file extension .csv
. This means that the data is delimited (i.e., separated) by commas. Often we open these files in a program like Excel, Numbers, or similar, which puts each value into its own cell, but the data actually looks like this:
Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
Here, the first row are our column headers, and each subsequent row contains information about one student, including their student ID, full name, favorite food, meal plan, and age.
If we look at this data as a table, it looks like this:
Student ID | Full Name | favourite.food | mealPlan | AGE |
---|---|---|---|---|
1 | Sunil Huffmann | Strawberry yoghurt | Lunch only | 4 |
2 | Barclay Lynn | French fries | Lunch only | 5 |
3 | Jayendra Lyne | N/A | Breakfast and lunch | 7 |
4 | Leon Rossini | Anchovies | Lunch only | NA |
5 | Chidiegwu Dunkel | Pizza | Breakfast and lunch | five |
6 | Güvenç Attila | Ice cream | Lunch only | 6 |
2.2 .tsv
Another common data format is as tab separated values, or .tsv
. Here, instead of being separated by commas like in a .csv
, the values are separated by tabs. Looking at the same students data in a .tsv
format would look like this:
Student ID Full Name favourite.food mealPlan AGE
1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 Barclay Lynn French fries Lunch only 5
3 Jayendra Lyne N/A Breakfast and lunch 7
4 Leon Rossini Anchovies Lunch only
5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 Güvenç Attila Ice cream Lunch only 6
Both comma-separated nad tab-separated values can also be stored with other file extensions like
.txt
.
2.3 .xlsx
and .xls
These are Microsoft-specific file formats created using the program Excel. Since using Excel spreadsheets so so common, there are special packages that allow the reading in of these file types. If you open up a file in Excel, you can save it in a variety of different file formats, just please do this with care, as Excel has reputation of turning gene names in to dates.
2.4 Which file format to use?
The truth is all file formats can be read into R, and some might be better for certain instances. You might read on the internet that its better to use .csv
files over .xlsx.
and .xls
for compatibility sake, but there are also benefits to Excel formats where lots of data can be stored in one file in different sheets.
Pick the data format that works best for your application.
3 Download data to import
The code below will let you download some (5) files to your computer for the exercise today, which you can then import. These files were created by Jelmer from R for Data Science for a code club many code clubs ago.
The code below allows you to download five files in different formats and with different data structure. We will practice data import with these files. Remember when you execute this code, your files will be downloaded to your working directory.
# saves the url where the file is located
<- "https://github.com/biodash/biodash.github.io/raw/master/content/codeclub/S05E05/students.csv"
url_csv # downloads the file and saves in your working directory as students.csv
download.file(url = url_csv, destfile = "students.csv")
<- "https://github.com/biodash/biodash.github.io/raw/master/content/codeclub/S05E05/students_noheader.csv"
url_csv_noheader download.file(url = url_csv_noheader, destfile = "students_noheader.csv")
<- "https://github.com/biodash/biodash.github.io/raw/master/content/codeclub/S05E05/students_with_meta.csv"
url_csv_meta download.file(url = url_csv_meta, destfile = "students_with_meta.csv")
<- "https://github.com/biodash/biodash.github.io/raw/master/content/codeclub/S05E05/students.tsv"
url_tsv download.file(url = url_tsv, destfile = "students.tsv")
<- "https://github.com/osu-codeclub/osu-codeclub.github.io/raw/main/posts/S07E10_tidy_05/students.xlsx"
url_excel download.file(url = url_excel, destfile = "students.xlsx")
Once you’ve downloaded all these files, take a minute to open them all up on your computer (you can use Excel, Numbers, whatever) to see how they are different. Be sure not to make any changes to the filesa.
4 Reading in delimited files
There are a few functions that allow you to read in tabular data in readr
, and they are part of the read_*()
family.
The following functions work similarly:
read_csv()
: reads in comma-separated values filesread_csv2()
: reads in semi-colon-separated values files where commas are the decimal point (common in Europe) instead of the period (as in the US)read_tsv()
: reads in tab-separated values filesread_delim()
: reads in delimited files, where the delimiter can be specified. If you were to specify the delimiter to be “,”, this could be the save asread_csv()
.
Let’s start by reading in our “students.csv” file and then work through how changing how we set the arguments alters how our data is imported.
The first (and only required) argument to read_*()
is the file, which we will provide in quotes as it is a string. If we were to have our file in a folder called “data” within our working directory, we could provide “data/students.csv”. If our file we wanted to read in was in some other location on our computer, we could provide the full path.
<- read_csv("students.csv") students_csv
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can see a message after we’ve imported our data giving us information about how this process went. In this case, we’ve read in a file with 6 rows and 5 columns. The delimiter is ","
and we have information about what type our columns have been imported in. Here, the columns favourite.food, mealPlan, and AGE have been parsed as character type, and Student ID is numeric (i.e., dbl).
We can get a sense of our data by using the function head()
to see the first 6 rows, all columns.
head(students_csv)
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
5 Reading in Excel files
I expect many of you have data stored in .xlsx
or .xls
files that you’d like to read into R. The package readxl
makes easy work of this. This packages has functions including:
read_xlsx()
: reads in.xlsx
filesread_xls()
: reads in.xls
filesread_excel()
: reads in both.xlsx
and.xls
files and tries to guess which one your file is. This usually but does not always works well.
There are subtle differences between
.xlsx
and.xls
files but I don’t think they’re that important to get into here.
Let’s read in our Excel file.
<- read_xlsx("students.xlsx")
students_excel
head(students_excel)
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
By default the first sheet is read in. If we want to know what the names of the different sheets are, we can run the following code providing the name of our file.
excel_sheets("students.xlsx")
[1] "students" "students_2"
We can set what sheet specifically we want to read in using the argument sheet = ""
.
<- read_xlsx("students.xlsx",
students_excel sheet = "students")
6 Handling file reading issues
6.1 Specifying column types
Let’s say that the column favourite.food is actually a factor (i.e., categorical), and not a character. We can specify that column should be a factor using the argument col_types =
. The short codes you can specify in can be found here.
<- read_csv("students.csv",
students_csv_mealplan col_types = "nccfc")
glimpse(students_csv_mealplan)
Rows: 6
Columns: 5
$ `Student ID` <dbl> 1, 2, 3, 4, 5, 6
$ `Full Name` <chr> "Sunil Huffmann", "Barclay Lynn", "Jayendra Lyne", "Leo…
$ favourite.food <chr> "Strawberry yoghurt", "French fries", "N/A", "Anchovies…
$ mealPlan <fct> Lunch only, Lunch only, Breakfast and lunch, Lunch only…
$ AGE <chr> "4", "5", "7", NA, "five", "6"
6.2 Setting column names
You might have a datafile that doesn’t include headers, or column names. These functions by default use col_names = TRUE
and assume that your file does have headers. If you don’t, you can set this argument to FALSE
and provide column headers. Let’s just try reading in a file without headers just to see what the result is.
<- read_csv("students_noheader.csv") students_no_headers
Rows: 5 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Sunil Huffmann, Strawberry yoghurt, Lunch only, 4
dbl (1): 1
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(students_no_headers)
# A tibble: 5 × 5
`1` `Sunil Huffmann` `Strawberry yoghurt` `Lunch only` `4`
<dbl> <chr> <chr> <chr> <chr>
1 2 Barclay Lynn French fries Lunch only 5
2 3 Jayendra Lyne N/A Breakfast and lunch 7
3 4 Leon Rossini Anchovies Lunch only <NA>
4 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
5 6 Güvenç Attila Ice cream Lunch only 6
We can see that the first row has become the column names, and this is not what we want. We can indicate the column names using the argument col_names =
.
<- read_csv("students_noheader.csv",
students_no_headers col_names = c("Student ID", "Full Name", "favourite.food",
"mealPlan", "AGE"))
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(students_no_headers)
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
These column names are a bit sloppy, so we can also clean them up using janitor::clean_names()
. If we apply clean_names()
, we can see what happens.
<- janitor::clean_names(students_no_headers)
students_no_headers_clean
head(students_no_headers_clean)
# A tibble: 6 × 5
student_id full_name favourite_food meal_plan age
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
6.3 Handling header or metadata lines
Sometimes the headers of your data might not be in the first row. In the case that you want to read in data that has non-data rows, you can indicate those to skip in the reading in process.
<- read_delim("students_with_meta.csv",
students_metadata delim = ",",
skip = 2) # since there are two lines we don't want to read in
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(students_metadata)
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
This is particularly useful if you are reading in data that comes off an instrument or system that formats data in a specific way.
6.4 What to do with NAs
Often your dataset will have missing data, though different people will code NAs differently. They may be simply missing, N/A, 999, or in some other way. You can tell your data import function how NAs are coded so they can be parsed correctly.
In this data, we have NAs that are coded as “NA”, and “N/A”. The default NA values are R are na = c("NA", "")
, so our data deviates from what would by default be considered missing.
<- read_csv("students.csv",
students_missing na = c("NA", "N/A", "")) # adding N/A as a NA option
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(students_missing)
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne <NA> Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
7 Practice
Download and successfully read in the file below. I would recommend that after you download it, you open it to have some information to guide your import.
# where the file is located
<- "https://github.com/osu-codeclub/osu-codeclub.github.io/raw/main/posts/S07E10_tidy_05/dogs2.csv"
dogs_url
# download it
download.file(url = dogs_url,
destfile = "dogs.csv")
Hint (click here)
Look at your file and the documentation for read_*()
to see what arguments you can set to aid in the import of your data. Take a look at comment =
and skip =
Solution (click here)
<- read_csv("dogs2.csv",
dogs comment = "%",
skip = 3)
Rows: 195 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Breed
dbl (8): 2013 Rank, 2014 Rank, 2015 Rank, 2016 Rank, 2017 Rank, 2018 Rank, 2...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dogs)
# A tibble: 6 × 9
Breed `2013 Rank` `2014 Rank` `2015 Rank` `2016 Rank` `2017 Rank` `2018 Rank`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Retri… 1 1 1 1 1 1
2 Frenc… 11 9 6 6 4 4
3 Germa… 2 2 2 2 2 2
4 Retri… 3 3 3 3 3 3
5 Bulld… 5 4 4 4 5 5
6 Poodl… 8 7 8 7 7 7
# ℹ 2 more variables: `2019 Rank` <dbl>, `2020 Rank` <dbl>
7.1 Take some data that you have and try and read it into R
This you should do on your own! Find a file that you have and try and read it in and calculate some summary information.
8 Bonus, writing out data
Often you will generate data in R that you want to write out to a spredsheet form. You can do this using the write_*()
functions.
write_delim()
writes out a delimited file, where you specify the delimiterwrite_csv()
writes out a csv file
Here, we provide the name of the R object we want to write out, and it will be written to your working directory.
write_csv(students_csv)