# 2 - Vectors ------------------------------------------------------------------
# 2.1 - Single-element vectors and quoting
# 2.2 - Multi-element vectors
# 2.3 - Vectorization
# Challenge 1
# A. Start by making a vector x with the whole numbers 1 through 26.
# Then, subtract 0.5 from each element in the vector and save the result in vector y.
# Check your results by printing both vectors.
# B. What do you think will be the result of the following operation?
# 1:5 * 1:5
# 2.4 - Exploring vectors
# 2.5 - Extracting elements from vectors
# 3 - Data frames --------------------------------------------------------------
# 3.1 - Data frame intro
# 4 - Data types ---------------------------------------------------------------
# 4.1 - R's main data types
# 4.2 - Factors
# 4.3 - A vector can only contain one data type
# Challenge 2
# What type of vector (if any) do you think each of the following will produce?
# Try it out and see if you were right.
# typeof("TRUE")
# typeof(banana)
# typeof(c(2, 6, "3"))
# Bonus / trick question:
# typeof(18, 3)
# 4.4 - Automatic type coercion
# 4.5 - Manual type conversion
R’s data structures and data types
1 Introduction
What we’ll cover
In this session, we will learn about R’s data structures and data types.
Data structures are the kinds of objects that R can store data in. Here, we will cover the two most common ones: vectors and data frames.
Data types are how R distinguishes between different kinds of data like numbers and character strings. Here, we’ll talk about the 4 main data types: character, integer, double, and logical. We’ll also cover factors, a construct related to the data types.
Setting up
To make it easier to keep track of what we do, we’ll write our code in a script (and send it to the console from there). Here is how to create and save a new R script:
- Open a new R script (click the + symbol in the toolbar at the top, then click R Script).
- Save the script straight away as data-structures.R. You can save it anywhere you like, though it is probably best to save it in a folder specifically for this workshop.
- If you want the section headers as comments in your script, as in the script I am showing you now, then copy-and-paste the following into your script:
Section headers for your script (Click to expand)
2 Data structure 1: Vectors
The first data structure we will explore is the simplest: the vector. A vector in R is essentially a collection of one or more items. Moving forward, we’ll call such individual items “elements”.
2.1 Single-element vectors and quoting
Vectors can consist of just a single element, so each of the two lines of code below creates a vector:
vector1 <- 8
vector2 <- "panda"
Two things are worth noting about the "panda" example, which is a so-called character string (or string for short):
- "panda" constitutes one element, not 5 (its number of letters).
- Unlike when dealing with numbers, we have to quote the string.
Character strings need to be quoted because they are otherwise interpreted as R objects. For example, because our vectors vector1 and vector2 are objects, we refer to them without quotes:
# [Note that R will show auto-complete options after you type 3 characters]
vector1
[1] 8
vector2
[1] "panda"
Therefore, the code below doesn’t work, because there is no object called panda:
vector_fail <- panda
Error: object 'panda' not found
2.2 Multi-element vectors
A common way to make vectors with multiple elements is by using the c() (combine) function:
c(2, 6, 3)
[1] 2 6 3
c() can also append elements to an existing vector:
# First we create a vector:
vector_to_append <- c("vhagar", "meleys")
vector_to_append
[1] "vhagar" "meleys"
# Then we append another element to it:
c(vector_to_append, "balerion the dread")
[1] "vhagar" "meleys" "balerion the dread"
To create vectors with series of numbers, a couple of shortcuts are available. First, you can make series of whole numbers with the : operator:
1:10
[1] 1 2 3 4 5 6 7 8 9 10
Second, you can use a function like seq() for fine control over the sequence:
myseq <- seq(from = 6, to = 8, by = 0.2)
myseq
[1] 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0
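As a small additional sketch (not part of the original lesson), seq() can also be given the desired number of elements through its length.out argument, and the : operator can count down. The object names below are made up for illustration:

```r
# Ask for 5 evenly spaced values between 0 and 1
evenly_spaced <- seq(from = 0, to = 1, length.out = 5)
evenly_spaced
# [1] 0.00 0.25 0.50 0.75 1.00

# The : operator also works in descending order
countdown <- 10:7
countdown
# [1] 10  9  8  7
```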
2.3 Vectorization
Consider the output of this command:
myseq * 2
[1] 12.0 12.4 12.8 13.2 13.6 14.0 14.4 14.8 15.2 15.6 16.0
Above, every individual element in myseq was multiplied by 2. We call this behavior “vectorization” and this is a key feature of the R language. (Alternatively, you may have expected this code to repeat myseq twice, but this did not happen!)
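To illustrate vectorization a bit further, here is a minimal sketch that uses a made-up vector of temperatures (temps_c and temps_f are hypothetical names). Every arithmetic step is applied to each element, whereas actually repeating a vector is the job of the rep() function:

```r
# Hypothetical temperatures in degrees Celsius
temps_c <- c(0, 20, 37)

# Each element is converted to Fahrenheit in a single expression
temps_f <- temps_c * 9 / 5 + 32
temps_f
# [1] 32.0 68.0 98.6

# To repeat the whole vector instead, use rep()
temps_twice <- rep(temps_c, times = 2)
temps_twice
# [1]  0 20 37  0 20 37
```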
Challenge 1
A. Start by making a vector x with the whole numbers 1 through 26. Then, subtract 0.5 from each element in the vector and save the result in vector y. Check your results by printing both vectors.
Solution:
x <- 1:26
x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26
y <- x - 0.5
y
[1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5
[16] 15.5 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5
B. What do you think will be the result of the following operation? Try it out and see if you were right.
1:5 * 1:5
Solution:
1:5 * 1:5
[1] 1 4 9 16 25
Both vectors are of length 5 which will lead to “element-wise matching”: the first element in the first vector will be multiplied with the first element in the second vector, the second element in the first vector will be multiplied with the second element in the second vector, and so on.
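As an aside that goes slightly beyond the challenge: when the two vectors differ in length, R reuses ("recycles") the shorter one. The sketch below uses made-up vectors (long_vec and short_vec are hypothetical names) where the longer length is a multiple of the shorter one; when it is not, R still recycles but also prints a warning.

```r
long_vec <- 1:6
short_vec <- c(10, 100)

# short_vec is reused as 10, 100, 10, 100, 10, 100
recycled <- long_vec * short_vec
recycled
# [1]  10 200  30 400  50 600
```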
2.4 Exploring vectors
R has many built-in functions to get information about vectors and other types of objects, such as:
Get the first and last few elements, respectively, with head() and tail():
# Print the first 6 elements:
head(myseq)
[1] 6.0 6.2 6.4 6.6 6.8 7.0
# Both head and tail take an argument `n` to specify the number of elements to print:
head(myseq, n = 2)
[1] 6.0 6.2
# Print the last 6 elements:
tail(myseq)
[1] 7.0 7.2 7.4 7.6 7.8 8.0
Get the number of elements with length():
length(myseq)
[1] 11
Get arithmetic summaries like sum() and mean() for vectors with numbers:
# sum() will sum the values of all elements
sum(myseq)
[1] 77
# mean() will compute the mean (average) across all elements
mean(myseq)
[1] 7
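A few more base R summary functions work the same way. The sketch below recreates myseq so that it can be run on its own; the names smallest, largest, and both_ends are only for illustration:

```r
myseq <- seq(from = 6, to = 8, by = 0.2)

# Smallest and largest values
smallest <- min(myseq)
largest <- max(myseq)

# range() returns both at once, as a vector of length 2
both_ends <- range(myseq)
both_ends

# summary() prints several summary statistics in one go
summary(myseq)
```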
2.5 Extracting elements from vectors
Extracting elements from objects like vectors is often referred to as “indexing”. In R, we can do this using bracket notation. For example:
Get the second element:
myseq[2]
[1] 6.2
Get the second through the fifth elements:
myseq[2:5]
[1] 6.2 6.4 6.6 6.8
Get the first and eighth elements:
myseq[c(1, 8)]
[1] 6.0 7.4
To put this in a general way: we can extract elements from a vector by using another vector, whose values are the positional indices of the elements in the original vector.
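Because the index is itself just a vector, it can be stored in an object first, and its values may be reordered or repeated. A minimal sketch (positions, picked, and reordered are hypothetical names; myseq is recreated so the code runs on its own):

```r
myseq <- seq(from = 6, to = 8, by = 0.2)

# Store the positional indices in their own vector
positions <- c(1, 8)
picked <- myseq[positions]
picked
# [1] 6.0 7.4

# Indices can be repeated and given in any order
reordered <- myseq[c(3, 3, 1)]
reordered
# [1] 6.4 6.4 6.0
```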
3 Data structure 2: Data frames
3.1 R stores tabular data in “data frames”
One of R’s most powerful features is its built-in ability to deal with tabular data – i.e., data with rows and columns like you are familiar with from spreadsheets like those you create with Excel.
In R, tabular data is stored in objects that are called “data frames”, the second R data structure we’ll cover in some depth. Let’s start by making a toy data frame with information about 3 cats:
cats <- data.frame(
  name = c("Luna", "Thomas", "Daisy"),
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2)
)
cats
    name   coat weight
1   Luna calico    2.1
2 Thomas  black    5.0
3  Daisy  tabby    3.2
Above:
- We created 3 vectors and pasted them side-by-side to create a data frame in which each vector constitutes a column.
- We gave each vector a name (e.g., coat), and those names became the column names.
- The resulting data frame has 3 rows (one for each cat) and 3 columns (each with a type of info about the cats, like coat color).
Data frames are typically (and best) organized like above, where:
- Each column contains a different “variable” (e.g. coat color, weight)
- Each row contains a different “observation” (data on e.g. one cat/person/sample)
That’s all we’ll say about data frames for now, but in today’s remaining sessions we will explore this key R data structure more!
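As a quick check of the row and column counts described above, base R has functions that report the dimensions and column names of a data frame. This sketch recreates the cats data frame so it can be run on its own:

```r
cats <- data.frame(
  name = c("Luna", "Thomas", "Daisy"),
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2)
)

nrow(cats)    # number of rows (observations)
# [1] 3
ncol(cats)    # number of columns (variables)
# [1] 3
names(cats)   # column names
# [1] "name"   "coat"   "weight"
```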
4 Data types
4.1 R’s main Data Types
R distinguishes different kinds of data, such as character strings and numbers, in a formal way, using several pre-defined “data types”. The behavior of R in various operations will depend heavily on the data type – for example, the below fails:
"valerion" * 5
Error in "valerion" * 5: non-numeric argument to binary operator
We can ask what type of data something is in R using the typeof() function:
typeof("valerion")
[1] "character"
R sets the data type of "valerion" to character, which we commonly refer to as character strings or strings. In formal terms, the failed command did not work because R will not allow us to perform mathematical operations on vectors of type character.
The character data type most commonly contains letters, but anything that is placed between quotes ("...") will be interpreted as the character data type, even plain numbers:
typeof("5")
[1] "character"
Besides character, the other 3 common data types are:
- double/numeric: numbers that can have decimal points:
typeof(3.14)
[1] "double"
- integer: whole numbers only:
typeof(1:3)
[1] "integer"
- logical (either TRUE or FALSE, unquoted!):
typeof(TRUE)
[1] "logical"
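One detail that the examples above do not show: a whole number typed on its own is stored as double, and adding the L suffix makes it an integer. Comparisons, in turn, produce logical values. A small sketch (the object names are made up):

```r
type_plain <- typeof(5)       # a plain number is a double
type_plain
# [1] "double"

type_suffix <- typeof(5L)     # the L suffix requests an integer
type_suffix
# [1] "integer"

type_compare <- typeof(5 > 3) # a comparison yields TRUE or FALSE
type_compare
# [1] "logical"
```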
4.2 Factors
Categorical data, like treatments in an experiment, can be stored as “factors” in R. Factors are useful for statistical analyses and for plotting, e.g. because they allow you to specify a custom order.
diet_vec <- c("high", "medium", "low", "low", "medium")
diet_vec
[1] "high" "medium" "low" "low" "medium"
factor(diet_vec)
[1] high medium low low medium
Levels: high low medium
In the example above, we turned a character vector into a factor. Its “levels” (low, medium, high) are sorted alphabetically by default, but we can manually specify an order that makes more sense:
diet_fct <- factor(diet_vec, levels = c("low", "medium", "high"))
diet_fct
[1] high medium low low medium
Levels: low medium high
This ordering would be automatically respected in plots and statistical analyses.
For most intents and purposes, it makes sense to think of factors as another data type, even though technically, they are a kind of data structure built on the integer data type:
typeof(diet_fct)
[1] "integer"
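To make the link between factors and integers more concrete, the sketch below (which recreates diet_fct so it runs on its own) shows the levels and the integer codes that R stores for each element. The names diet_levels and diet_codes are only for illustration:

```r
diet_vec <- c("high", "medium", "low", "low", "medium")
diet_fct <- factor(diet_vec, levels = c("low", "medium", "high"))

# The levels, in the order we specified
diet_levels <- levels(diet_fct)
diet_levels
# [1] "low"    "medium" "high"

# Each element is stored as the position of its level: high = 3, medium = 2, low = 1
diet_codes <- as.integer(diet_fct)
diet_codes
# [1] 3 2 1 1 2

# table() counts the elements per level, following the level order
table(diet_fct)
```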
4.3 A vector can only contain one data type
Individual vectors, and therefore also individual columns in data frames, can only be composed of a single data type.
R will silently pick the “best-fitting” data type when you enter or read data into a data frame. So let’s see what the data types are in our cats data frame:
str(cats)
'data.frame': 3 obs. of 3 variables:
$ name : chr "Luna" "Thomas" "Daisy"
$ coat : chr "calico" "black" "tabby"
$ weight: num 2.1 5 3.2
- The name and coat columns are character, abbreviated chr.
- The weight column is double/numeric, abbreviated num.
Challenge 2
What type of vector (if any) do you think each of the following will produce? Try it out and see if you were right.
typeof("TRUE")
typeof(banana)
typeof(c(2, 6, "3"))
Bonus / trick question:
typeof(18, 3)
Solutions:
typeof("TRUE")
[1] "character"
- "TRUE" is character (and not logical) because of the quotes around it.
typeof(banana)
Error: object 'banana' not found
- Recall the earlier example: this returns an error because the object banana does not exist. Any unquoted string (that is not a special keyword like TRUE and FALSE) is interpreted as a reference to an object in R.
typeof(c(2, 6, "3"))
[1] "character"
- We’ll talk about why this produces a character vector in the next section.
typeof(18, 3)
Error in typeof(18, 3): unused argument (3)
- This produces an error because the typeof() function only accepts a single argument, which is an R object like a vector. Because we did not wrap 18, 3 within c() (i.e. we did not use c(18, 3)), we ended up passing two arguments to the function, and this resulted in an error.
- If you guessed that it would have returned integer (or double) twice, you were on the right track: you couldn’t have known that the function does not accept multiple objects.
4.4 Automatic Type Coercion
That a character vector was returned by c(2, 6, "3") in the challenge above is due to something called type coercion.
When R encounters a mix of types (here, numbers and characters) to be combined into a single vector, it will force them all to be the same type. It “must” do this because, as pointed out above, a vector can consist of only a single data type.
Type coercion can be the source of many surprises, and is one reason we need to be aware of the basic data types and how R will interpret them.
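To make the coercion behavior more concrete, here is a minimal sketch of what happens for a few mixes of the four main types. In each case, R picks the type that can represent all the elements: logical gives way to integer, integer to double, and everything to character. The object names are made up for illustration:

```r
# logical + integer -> integer (TRUE becomes 1)
mix_lgl_int <- c(TRUE, 2L)
typeof(mix_lgl_int)
# [1] "integer"

# integer + double -> double
mix_int_dbl <- c(2L, 3.5)
typeof(mix_int_dbl)
# [1] "double"

# double + character -> character (3.5 becomes "3.5")
mix_dbl_chr <- c(3.5, "a")
typeof(mix_dbl_chr)
# [1] "character"
```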
4.5 Manual Type Conversion
Luckily, you are not simply at the mercy of whatever R decides to do automatically, but can convert vectors at will using the as. group of functions:
To see which functions are available, type “as.” and then press the Tab key.
as.integer(c("0", "2", "4"))
[1] 0 2 4
as.character(c(0, 2, 4))
[1] "0" "2" "4"
As you may have guessed, though, not all type conversions are possible — for example:
as.double("kiwi")
Warning: NAs introduced by coercion
[1] NA
(NA is R’s way of denoting missing data; see the bonus section on missing values below for more.)
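A few more conversions that are worth trying, as a hedged sketch rather than an exhaustive overview: as.integer() drops the decimal part instead of rounding, as.logical() only understands a limited set of strings, and is.na() can tell you which elements failed to convert. The names below (and the use of suppressWarnings() to silence the coercion warning) are choices made for this example:

```r
# Decimals are truncated, not rounded
truncated <- as.integer(3.9)
truncated
# [1] 3

# "TRUE" is recognized, but "yes" is not and becomes NA
from_text <- as.logical(c("TRUE", "yes"))
from_text
# [1] TRUE   NA

# Find out which elements could not be converted to numbers
mixed <- c("1.5", "kiwi", "3")
converted <- suppressWarnings(as.double(mixed))
failed <- is.na(converted)
failed
# [1] FALSE  TRUE FALSE
```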
5 Bonus material for self-study
5.1 Changing vector elements using indexing
Above, we saw how we can extract elements of a vector using indexing. To change elements in a vector, simply use the bracket on the other side of the arrow – for example:
Change the first element to 30:
myseq[1] <- 30
myseq
[1] 30.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0
Change the last element to 0:
myseq[length(myseq)] <- 0
myseq
[1] 30.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 0.0
Change the second element to the mean value of the vector:
myseq[2] <- mean(myseq)
myseq
[1] 30.000000 8.454545 6.400000 6.600000 6.800000 7.000000 7.200000
[8] 7.400000 7.600000 7.800000 0.000000
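The same approach works for several elements at once. The sketch below uses a separate, made-up vector (scores is a hypothetical name) so that myseq is left alone:

```r
scores <- c(4, 8, 15, 16, 23)

# Set the first and third elements to 0 in one step
scores[c(1, 3)] <- 0

# Replace the second and third elements with two new values
scores[2:3] <- c(80, 150)
scores
# [1]   0  80 150  16  23
```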
5.2 Extracting columns from a data frame
We can extract individual columns from a data frame using the $ operator:
cats$weight
[1] 2.1 5.0 3.2
cats$coat
[1] "calico" "black" "tabby"
This kind of operation returns a vector, which can be indexed as well:
cats$weight[2]
[1] 5
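Besides $, base R offers bracket-based ways to pull out a single column, which should return the same vector. A minimal sketch that recreates cats so it runs on its own (by_dollar, by_double_bracket, and by_row_col are hypothetical names):

```r
cats <- data.frame(
  name = c("Luna", "Thomas", "Daisy"),
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2)
)

by_dollar <- cats$weight
by_double_bracket <- cats[["weight"]]   # column name as a quoted string
by_row_col <- cats[, "weight"]          # all rows, one column

# Since a column is a vector, vector functions work on it directly
mean(cats$weight)
```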
5.3 More on the logical data type
Let’s add a column to our cats data frame indicating whether each cat does or does not like string:
cats$likes_string <- c(1, 0, 1)
cats
    name   coat weight likes_string
1   Luna calico    2.1            1
2 Thomas  black    5.0            0
3  Daisy  tabby    3.2            1
So, likes_string is numeric, but the 1s and 0s actually represent TRUE and FALSE.
We could instead use the logical data type here, by converting this column with the as.logical() function, which will turn 0s into FALSE and everything else, including 1, into TRUE:
as.logical(cats$likes_string)
[1] TRUE FALSE TRUE
And to actually modify this column in the data frame itself, we would do this:
cats$likes_string <- as.logical(cats$likes_string)
cats
    name   coat weight likes_string
1   Luna calico    2.1         TRUE
2 Thomas  black    5.0        FALSE
3  Daisy  tabby    3.2         TRUE
You might think that 1/0 could be a handier coding than TRUE/FALSE because it may make it easier, for example, to count the number of times something is true or false. But consider the following:
TRUE + TRUE
[1] 2
So, logicals can be used as if they were numbers, in which case FALSE represents 0 and TRUE represents 1.
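This is what makes counting with logicals convenient. In the sketch below, which uses stand-alone vectors mirroring the cats columns, sum() counts the TRUE values and mean() gives the proportion of TRUE values. The names n_likes, prop_likes, and n_heavy are made up for illustration:

```r
likes_string <- c(TRUE, FALSE, TRUE)

# Number of TRUE values
n_likes <- sum(likes_string)
n_likes
# [1] 2

# Proportion of TRUE values
prop_likes <- mean(likes_string)

# A comparison returns a logical vector, so it can be counted the same way
weight <- c(2.1, 5.0, 3.2)
n_heavy <- sum(weight > 3)
n_heavy
# [1] 2
```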
5.4 Missing values (NA)
R has a concept of missing data, which is important in statistical computing, as not all information/measurements are always available for each sample.
In R, missing values are coded as NA (and like TRUE/FALSE, this is not a character string, so it is not quoted):
# This vector will contain one missing value
vector_NA <- c(1, 3, NA, 7)
vector_NA
[1] 1 3 NA 7
A key thing to be aware of with NAs is that many functions that operate on vectors will return NA if any element in the vector is NA:
sum(vector_NA)
[1] NA
The way to get around this is by setting na.rm = TRUE in such functions, for example:
sum(vector_NA, na.rm = TRUE)
[1] 11
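It can also be useful to find out where the missing values are. As a minimal sketch, the base R function is.na() returns a logical vector that can be counted with sum(); na.rm = TRUE works in mean() too. The names which_missing, n_missing, and mean_without_na are only for illustration:

```r
vector_NA <- c(1, 3, NA, 7)

# TRUE for each element that is missing
which_missing <- is.na(vector_NA)
which_missing
# [1] FALSE FALSE  TRUE FALSE

# Count the missing values
n_missing <- sum(which_missing)
n_missing
# [1] 1

# The mean of the 3 non-missing values (1, 3 and 7)
mean_without_na <- mean(vector_NA, na.rm = TRUE)
```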
5.5 A few other data structures in R
We did not go into details about R’s other data structures, which are less common than vectors and data frames. Two that are worth mentioning briefly, though, are:
Matrix, which can be convenient when you have tabular data that is exclusively numeric (excluding names/labels).
List, which is more flexible (and complicated) than vectors: it can contain multiple data types, and can also be hierarchically structured.
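For a first impression of these two structures, here is a brief sketch with made-up contents (small_matrix and cat_info are hypothetical names). It only shows how such objects can be created and inspected:

```r
# A matrix with 2 rows and 3 columns, filled column by column
small_matrix <- matrix(1:6, nrow = 2)
dim(small_matrix)
# [1] 2 3

# A list can hold elements of different data types
cat_info <- list(name = "Luna", weight = 2.1, likes_string = TRUE)

# Named list elements can be extracted with $, like data frame columns
typeof(cat_info$name)
# [1] "character"
typeof(cat_info$likes_string)
# [1] "logical"
```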
Bonus Challenge
An important part of every data analysis is cleaning input data. Here, you will clean a cat data set that has an added observation with a problematic data entry.
Start by creating the new data frame:
cats_v2 <- data.frame(
  name = c("Luna", "Thomas", "Daisy", "Oliver"),
  coat = c("calico", "black", "tabby", "tabby"),
  weight = c(2.1, 5.0, 3.2, "2.3 or 2.4")
)
Then move on to the tasks below, filling in the blanks (_____) and running the code:
# 1. Explore the data frame,
# including with an overview that shows the columns' data types:
cats_v2
_____(cats_v2)
# 2. The "weight" column has the incorrect data type _____.
# The correct data type is: _____.
# 3. Correct the 4th weight with the mean of the two given values,
# then print the data frame to see the effect:
cats_v2$weight[4] <- 2.35
cats_v2
# 4. Convert the weight column to the right data type:
cats_v2$weight <- _____(cats_v2$weight)
# 5. Calculate the mean weight of the cats:
_____
Solution:
# 1. Explore the data frame,
# including with an overview that shows the columns' data types:
cats_v2
    name   coat     weight
1   Luna calico        2.1
2 Thomas  black          5
3  Daisy  tabby        3.2
4 Oliver  tabby 2.3 or 2.4
str(cats_v2)
'data.frame': 4 obs. of 3 variables:
$ name : chr "Luna" "Thomas" "Daisy" "Oliver"
$ coat : chr "calico" "black" "tabby" "tabby"
$ weight: chr "2.1" "5" "3.2" "2.3 or 2.4"
# 2. The "weight" column has the incorrect data type CHARACTER.
# The correct data type is: DOUBLE.
# 3. Correct the 4th weight data point with the mean of the two given values,
# then print the data frame to see the effect:
cats_v2$weight[4] <- 2.35
cats_v2
    name   coat weight
1   Luna calico    2.1
2 Thomas  black      5
3  Daisy  tabby    3.2
4 Oliver  tabby   2.35
# 4. Convert the weight column to the right data type:
cats_v2$weight <- as.double(cats_v2$weight)
# 5. Calculate the mean weight of the cats:
mean(cats_v2$weight)
[1] 3.1625
5.6 Learn more
To learn more about data types and data structures, see this episode from a separate Carpentries lesson.