R’s data structures and data types

Author

Software Carpentry / Jelmer Poelstra

Published

February 11, 2025



1 Introduction

In this session, we will learn about R’s data structures and data types.

  • Data structures are the kinds of objects that R can store data in. Here, we will cover the two most common ones: vectors and data frames.

  • Data types are how R distinguishes between different kinds of data like numbers and character strings. Here, we’ll talk about the 4 main data types: character, integer, double, and logical.

During this session, I will write my code in a script and send it to the console from there. So, I will start by opening a new script (+ symbol in toolbar at the top => “R Script”, or “File” => “New file” => “R Script”), and I will save it straight away as data-structures.R in a folder on my Desktop (create a new folder there or anywhere else for this workshop, if you haven’t done so already).

I will also change my Pane Layout to have the script and the console side-by-side.


2 Data structure 1: Vectors

The first data structure we will explore is the simplest: the vector. A vector in R is essentially a list of one or more items. Moving forward, we’ll call these individual items “elements”.

2.1 Single-element vectors and quoting

Vectors can consist of just a single element, so in each of the two lines of code below, a vector is in fact created:

vector1 <- 8
vector2 <- "panda"

Two things are worth noting about the second example with a character string:

  • “panda” constitutes one element, not 5 (its number of letters).

  • We have to quote the string (either double or single quotes are fine, with the former more common). This is because unquoted character strings are interpreted as R objects – for example, vector1 and vector2 above are objects, and should be referred to without quotes:

# [Note that R will show auto-complete options after you type 3 characters]
vector1
[1] 8
vector2
[1] "panda"

Conversely, the below doesn’t work, because there is no object called panda:

vector_fail <- panda
Error: object 'panda' not found

As a side note, in the R console, you can press the up arrow to retrieve the previous command, and do so repeatedly to go back to older commands. Let’s practice that to get back our vector1 command:

vector1
[1] 8


2.2 Multi-element vectors

A common way to make vectors with multiple elements is to use the c() (combine) function:

c(2, 6, 3)
[1] 2 6 3

(In the above example, I didn’t assign the vector to an object, but a vector was created nevertheless.)

c() can also append elements to an existing vector:

vector_append <- c("vhagar", "meleys")
vector_append
[1] "vhagar" "meleys"
c(vector_append, "balerion the dread")
[1] "vhagar"             "meleys"             "balerion the dread"

To create vectors with series of numbers, a couple of shortcuts are available. First, you can make series of whole numbers with the : operator:

1:10
 [1]  1  2  3  4  5  6  7  8  9 10

Second, you can use a function like seq() for fine control over the sequence:

vector_seq <- seq(from = 6, to = 8, by = 0.2)
vector_seq
 [1] 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0
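As a small additional sketch (not part of the original lesson), seq() can also be asked for a fixed number of evenly spaced values via its length.out argument, and the : operator can count downwards. The object names vector_even and vector_down are only illustrative:

```r
# Sketch: ask seq() for a fixed number of evenly spaced values
# ('vector_even' and 'vector_down' are illustrative names)
vector_even <- seq(from = 0, to = 1, length.out = 5)
vector_even

# The : operator counts down when the first number is the larger one
vector_down <- 5:1
vector_down
```

Whether by or length.out is more convenient depends on whether you know the step size or the number of values you want.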


2.3 Vectorization

In R, you can do the following:

vector_seq * 2
 [1] 12.0 12.4 12.8 13.2 13.6 14.0 14.4 14.8 15.2 15.6 16.0

Above, we multiplied every single element in vector_seq by 2. Another way of looking at this is that 2 was recycled as many times as necessary to operate on each element in vector_seq. We call this “vectorization” and this is a key feature of the R language. This behavior may seem intuitive, but in most languages you’d need a special construct like a loop to operate on each value in a vector.

(Alternatively, you may have expected this code to repeat vector_seq twice, but this did not happen! R has the function rep() for that. For more about vectorization, see episode 9 from our Carpentries lesson.)
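To make the contrast concrete, here is a minimal sketch that compares the vectorized multiplication with rep() and with an explicit loop. The object names doubled, repeated and doubled_loop are made up for this illustration:

```r
vector_seq <- seq(from = 6, to = 8, by = 0.2)

# Vectorized: 2 is recycled across all 11 elements
doubled <- vector_seq * 2
length(doubled)

# Repeating the whole vector twice needs rep() instead
repeated <- rep(vector_seq, times = 2)
length(repeated)

# The loop that the vectorized multiplication saves us from writing
doubled_loop <- numeric(length(vector_seq))
for (i in seq_along(vector_seq)) {
  doubled_loop[i] <- vector_seq[i] * 2
}

# Both approaches give the same result
identical(doubled, doubled_loop)
```

The vectorized form is shorter and is generally the idiomatic choice in R.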




2.4 Challenge 1

A. Start by making a vector x with the whole numbers 1 through 26. Then, subtract 0.5 from each element in the vector and save the result in vector y. Check your results by printing both vectors.

Solution:
x <- 1:26
x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26
y <- x - 0.5
y
 [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5
[16] 15.5 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5


B. What do you think will be the result of the following operation?

1:5 * 1:5
Solution:
1:5 * 1:5
[1]  1  4  9 16 25

Both vectors are of length 5 which will lead to “element-wise matching”: the first element in the first vector will be multiplied with the first element in the second vector, the second element in the first vector will be multiplied with the second element in the second vector, and so on.
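When the two vectors differ in length, the shorter one is recycled, just like the single number 2 was earlier. A small sketch, with recycled as an illustrative object name:

```r
# Sketch: the shorter vector (1:2) is recycled three times to match 1:6,
# so the multiplications are 1*1, 2*2, 3*1, 4*2, 5*1, 6*2
recycled <- 1:6 * 1:2
recycled
```

If the longer length is not a multiple of the shorter one, R still recycles but prints a warning, which is usually a sign that something is off.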



2.5 Exploring vectors

R has many built-in functions to get information about vectors and other types of objects, such as:

  • head() and tail() to get the first few and last few elements, respectively:
head(vector_seq)
[1] 6.0 6.2 6.4 6.6 6.8 7.0
# Both head and tail take an argument `n` to specify the number of elements to print:
head(vector_seq, n = 2)
[1] 6.0 6.2
tail(vector_seq)
[1] 7.0 7.2 7.4 7.6 7.8 8.0


  • length() to get the number of elements:
length(vector_seq)
[1] 11


  • Functions like sum() and mean(), if the vector contains numbers:
# sum() will sum the values of all elements
sum(vector_seq)
[1] 77
# mean() will compute the mean (average) across all elements
mean(vector_seq)
[1] 7


2.6 Extracting elements from vectors

We can extract elements of a vector by “indexing” them using bracket notation. Here are a couple of examples:

  • Get the second element:
vector_seq[2]
[1] 6.2
  • Get the elements 2 through 5:
vector_seq[2:5]
[1] 6.2 6.4 6.6 6.8
  • Get the first and eighth elements:
vector_seq[c(1, 8)]
[1] 6.0 7.4

To change an element in a vector, use bracket notation on the left-hand side of the assignment arrow:

# Change the first element to '30':
vector_seq[1] <- 30
vector_seq
 [1] 30.0  6.2  6.4  6.6  6.8  7.0  7.2  7.4  7.6  7.8  8.0
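Because this kind of assignment overwrites the original values, one option is to work on a copy. The sketch below recreates vector_seq from scratch, changes two elements in a copy at once, and also shows that a negative index drops an element instead of selecting it. The name vector_copy is just an example:

```r
vector_seq <- seq(from = 6, to = 8, by = 0.2)

# Work on a copy so that the original stays untouched
vector_copy <- vector_seq
vector_copy[c(1, 2)] <- c(30, 40)
vector_copy[1:3]

# The original vector is unchanged
vector_seq[1:3]

# A negative index returns everything except that element
vector_seq[-1]
```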


3 Data structure 2: Data frames

3.1 R stores tabular data in “data frames”

One of R’s most powerful features is its built-in ability to deal with tabular data – i.e., data with rows and columns like you are familiar with from spreadsheets.

In R, tabular data is stored in objects that are called “data frames”. Data frames are the second and final R data structure that we’ll cover in some depth.

Let’s start by making a toy data frame with some information about 3 cats:

cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1)
  )

cats
    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1

What we really did above is to create 3 vectors, all of length 3, and pasted them side-by-side to create a data frame. We also gave each vector a name, which became the column names.

The resulting data frame has 3 rows (one for each cat) and 3 columns (each with a type of info about the cats, like coat color).

In data frames, typically:

  • Separate variables (e.g. coat color, weight) are spread across columns,
  • Separate “observations” (e.g., cat/person, sample) are spread across rows.
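A quick way to check the shape of a data frame is with nrow(), ncol(), dim() and names(). This sketch recreates the cats data frame so that it can be run on its own:

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1)
)

# Number of rows (observations) and columns (variables)
nrow(cats)
ncol(cats)

# Both at once, rows first
dim(cats)

# The column names, taken from the names we gave each vector
names(cats)
```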


3.2 Extracting columns from a data frame

We can extract individual columns from a data frame by specifying their names using the $ operator:

cats$weight
[1] 2.1 5.0 3.2
cats$coat
[1] "calico" "black"  "tabby" 

This kind of operation will return a vector. We won’t go into more detail about exploring (or manipulating) data frames, because we will do that with the dplyr package in the next episode.
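Since an extracted column is a regular vector, everything we learned about vectors applies to it. A minimal sketch, where weights is an illustrative object name:

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1)
)

# The extracted column is a vector...
weights <- cats$weight
is.vector(weights)

# ...so vector functions, indexing, and vectorized math all work
mean(weights)
cats$weight[2]
cats$weight * 1000
```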


4 Data types

4.1 R’s main Data Types

R distinguishes between several kinds of data, such as between character strings and numbers, in a formal way, and uses several “data types” to do so. The behavior of R in various operations will heavily depend on the data type – for example, the below fails:

"valerion" * 5
Error in "valerion" * 5: non-numeric argument to binary operator

We can ask what type of data something is in R using the typeof() function:

typeof("valerion")
[1] "character"

So the data type is character, which we commonly refer to as character strings or strings. In formal terms, R will not allow us to perform mathematical functions on vectors of type character.

The character data type typically contains letters but can have any character, including numbers, as long as it is quoted:

typeof("5")
[1] "character"

Besides character, the other 3 common data types are double (also called numeric), integer, and logical.

  • double / numeric – numbers that can have decimal points:
typeof(3.14)
[1] "double"
  • integer – whole numbers only:
typeof(1:3)
[1] "integer"
  • logical (either TRUE or FALSE – unquoted!):
typeof(TRUE)
[1] "logical"
typeof(FALSE)
[1] "logical"
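One detail that may be surprising: a whole number typed on its own is stored as a double, not as an integer. To request an integer explicitly, add the L suffix. The sketch below uses the illustrative names type_plain and type_suffix:

```r
# A number typed without a suffix is a double, even if it is a whole number
type_plain <- typeof(1)
type_plain

# The L suffix asks for an integer
type_suffix <- typeof(1L)
type_suffix

# is.numeric() is TRUE for both integers and doubles
is.numeric(1L)
is.numeric(1.5)
```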



4.2 Challenge 2

What do you expect each of the following to produce?

typeof("TRUE")
typeof(banana)
Solution:
  1. "TRUE" is character because of the quotes around it.
  2. Recall the earlier example: this returns an error because the object banana does not exist.


4.3 Vectors and data frame columns can only have 1 data type

Vectors and individual columns in data frames can only be composed of a single data type. R will silently pick the “best-fitting” data type when you enter or read data into a data frame. Let’s see what the data types are in our cats data frame:

str(cats)
'data.frame':   3 obs. of  3 variables:
 $ coat        : chr  "calico" "black" "tabby"
 $ weight      : num  2.1 5 3.2
 $ likes_string: num  1 0 1
  • The coat column is character, abbreviated chr.
  • The weight column is double/numeric, abbreviated num.
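  • The likes_string column is also double/numeric, abbreviated num (not integer, which would be abbreviated int): numbers entered like 1 and 0 are stored as double by default.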
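As an alternative to str(), the type of every column can be checked in one go by applying typeof() to each column with sapply(). This is a sketch that assumes R 4.0 or later, where strings in data.frame() stay character by default. The name col_types is illustrative:

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1)
)

# Apply typeof() to each column; the result is a named character vector
col_types <- sapply(cats, typeof)
col_types
```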



4.4 Challenge 3

Given what we’ve learned so far, what type of vector do you think the following will produce?

quiz_vector <- c(2, 6, "3")
Solution:

It produces a character vector:

quiz_vector
[1] "2" "6" "3"
typeof(quiz_vector)
[1] "character"

We’ll talk about what happened here in the next section.



4.5 Automatic Type Coercion

What happened in the code from the challenge above is something called type coercion, which can be the source of many surprises, and is one reason we need to be aware of the basic data types and how R will interpret them. When R encounters a mix of types (here double and character) to be combined into a single vector, it will force them all to be the same type.

Here is another example:

coercion_vector <- c("a", TRUE)
coercion_vector
[1] "a"    "TRUE"
typeof(coercion_vector)
[1] "character"

As in the two examples we’ve seen, you will most commonly run into situations where numbers or logicals are converted to characters.

The nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame!
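For the four data types covered here, coercion follows a fixed order: logical is converted to integer, integer to double, and double to character, so the result always has the most flexible type present. A small sketch with made-up object names:

```r
# logical + integer -> integer (TRUE becomes 1)
mix_lgl_int <- c(TRUE, 2L)
mix_lgl_int
typeof(mix_lgl_int)

# integer + double -> double
mix_int_dbl <- c(2L, 2.5)
typeof(mix_int_dbl)

# double + character -> character
mix_dbl_chr <- c(2.5, "a")
typeof(mix_dbl_chr)
```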


4.6 Manual Type Conversion

Luckily, you are not simply at the mercy of whatever R decides to do automatically, but can convert vectors at will using the as. group of functions (here, try RStudio’s auto-complete function: Type “as.” and then press the TAB key):

as.double(c("0", "2", "4"))
[1] 0 2 4
as.character(c(0, 2, 4))
[1] "0" "2" "4"

As another example, in our cats data, likes_string is numeric, but the 1s and 0s actually represent TRUE and FALSE (a common way of representing them).

cats$likes_string
[1] 1 0 1

We could use the logical data type here, by converting this column with the as.logical() function, which will turn 0s into FALSE and every other number, including 1, into TRUE:

as.logical(cats$likes_string)
[1]  TRUE FALSE  TRUE

As you may have guessed, though, not all type conversions are possible:

as.double("kiwi")
Warning: NAs introduced by coercion
[1] NA

(NA is R’s way of denoting missing data.)
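In practice, it can help to find out which entries failed to convert. One possible approach, sketched below with a made-up raw_weights vector, is to convert first and then use is.na() to look up the problematic original values:

```r
# Hypothetical example data: weights that were read in as character strings
raw_weights <- c("2.1", "5", "2.3 or 2.4")

# Convert, silencing the coercion warning because we check for NAs ourselves
converted <- suppressWarnings(as.double(raw_weights))
converted

# Which elements could not be converted?
is.na(converted)

# Show the original entries that caused trouble
raw_weights[is.na(converted)]
```

Note that this only works as intended if the original data had no missing values to begin with.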




5 Factors

In R, categorical data, like different treatments in an experiment, can be stored as “factors”. Factors are useful for statistical analyses and also for plotting, the latter because you can specify a custom order among the so-called “levels” of the factor.

diet_vec <- c("high", "medium", "low", "low", "medium", "high")
factor(diet_vec)
[1] high   medium low    low    medium high  
Levels: high low medium

In the example above, we turned a regular vector into a factor. The levels are sorted alphabetically by default, but we can manually specify an order that makes more sense and that would carry through if we plotted data associated with this factor:

diet_fct <- factor(diet_vec, levels = c("low", "medium", "high"))
diet_fct
[1] high   medium low    low    medium high  
Levels: low medium high

For most intents and purposes, it makes sense to think of factors as another data type, even though technically, they are a kind of data structure built on the integer data type:

typeof(diet_fct)
[1] "integer"
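To see how a factor is put together, we can look at its levels and at the integer codes stored underneath. The sketch below recreates diet_fct and uses only base R functions:

```r
diet_vec <- c("high", "medium", "low", "low", "medium", "high")
diet_fct <- factor(diet_vec, levels = c("low", "medium", "high"))

# The levels, in the order we specified
levels(diet_fct)

# The underlying integer codes: each one is a position in the levels
as.integer(diet_fct)

# class() reports "factor", even though typeof() reports "integer"
class(diet_fct)

# Count the observations per level, shown in level order
table(diet_fct)
```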

5.1 Challenge 4

An important part of every data analysis is cleaning input data. Here, you will clean a cat data set that has an added observation with a problematic data entry.

Start by creating the new data frame:

cats_v2 <- data.frame(
  name = c("Luna", "Misty", "Bella", "Oliver"),
  coat = c("calico", "black", "tabby", "tabby"),
  weight = c(2.1, 5.0, 3.2, "2.3 or 2.4"),
  likes_string = c(1, 0, 1, 1)
)

Then move on to the tasks below, filling in the blanks (_____) and running the code:

# 1. Explore the data frame,
#    including with an overview that shows the columns' data types:
cats_v2
_____(cats_v2)

# 2. The "weight" column has the incorrect data type _____.
#    The correct data type is: _____.

# 3. Correct the 4th weight with the mean of the two given values,
#    then print the data frame to see the effect:
cats_v2$weight[4] <- 2.35
cats_v2

# 4. Convert the weight column to the right data type:
cats_v2$weight <- _____(cats_v2$weight)

# 5. Calculate the mean weight of the cats:
_____


Solution:
# 1. Explore the data frame,
#    including with an overview that shows the columns' data types:
cats_v2
    name   coat     weight likes_string
1   Luna calico        2.1            1
2  Misty  black          5            0
3  Bella  tabby        3.2            1
4 Oliver  tabby 2.3 or 2.4            1
str(cats_v2)
'data.frame':   4 obs. of  4 variables:
 $ name        : chr  "Luna" "Misty" "Bella" "Oliver"
 $ coat        : chr  "calico" "black" "tabby" "tabby"
 $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
 $ likes_string: num  1 0 1 1
# 2. The "weight" column has the incorrect data type CHARACTER.
#    The correct data type is: DOUBLE.

# 3. Correct the 4th weight data point with the mean of the two given values,
#    then print the data frame to see the effect:
cats_v2$weight[4] <- 2.35
cats_v2
    name   coat weight likes_string
1   Luna calico    2.1            1
2  Misty  black      5            0
3  Bella  tabby    3.2            1
4 Oliver  tabby   2.35            1
# 4. Convert the weight column to the right data type:
cats_v2$weight <- as.double(cats_v2$weight)

# 5. Calculate the mean weight of the cats:
mean(cats_v2$weight)
[1] 3.1625


6 Learn more

This material was adapted from this Carpentries lesson episode. To learn more about data types and data structures, see this episode from a separate Carpentries lesson.



7 Bonus

7.1 Writing and reading tabular data

Let’s practice writing and reading data. First, we will write data that is in our R environment to a file, and then we will read data that is in a file into our R environment.

Via functions from an add-on package, R can interact with Excel spreadsheet files, but keeping your data in plain-text files generally benefits reproducibility. Tabular plain text files can be stored using a Tab as the delimiter (these are often called TSV files, and stored with a .tsv extension) or with a comma as the delimiter (these are often called CSV files, and stored with a .csv extension).

We will use the write.csv function to write the cats data frame to a CSV file in our current working directory:

write.csv(x = cats, file = "feline-data.csv", row.names = FALSE)

Here, we are explicitly naming all arguments, which can be good practice for clarity:

  • x is the R object to write to file
  • file is the file name (which can include directories/folders)
  • We are setting row.names = FALSE to avoid writing the row names, which by default are just row numbers.

In RStudio’s Files pane, let’s find our new file, click on it, and then click “View File”. That way, the file will open in the editor, where it should look like this:

"coat","weight","likes_string"
"calico",2.1,1
"black",5,0
"tabby",3.2,1

(Note that R adds double quotes "..." around strings – if you want to avoid this, add quote = FALSE to write.csv().)


Let’s also practice reading data from a file into R. We’ll use the read.csv() function for the file we just created:

cats_reread <- read.csv(file = "feline-data.csv")
cats_reread
    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1

A final note: write.csv() and read.csv() are really just more specific convenience versions of the write.table() and read.table() functions, which can be used to write and read tabular data in any kind of plain text file.
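As an illustration of these more general functions, here is a sketch that writes the cats data frame to a Tab-delimited (TSV) file and reads it back. It writes to a temporary file via tempfile() so that nothing is left behind in your working directory, and it assumes R 4.0 or later, where strings are read in as character. The names tsv_file and cats_tsv are illustrative:

```r
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1)
)

# Write a Tab-delimited file to a temporary location
tsv_file <- tempfile(fileext = ".tsv")
write.table(x = cats, file = tsv_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Read it back in: header = TRUE says the first line has the column names
cats_tsv <- read.table(file = tsv_file, header = TRUE, sep = "\t")
str(cats_tsv)

# R picked the data types again while reading: likes_string is now integer
typeof(cats_tsv$likes_string)

# Clean up the temporary file
unlink(tsv_file)
```

Note how likes_string was double in the original data frame but comes back as integer, because R picks the best-fitting type again when reading the file.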

7.2 A few other data structures in R

We did not go into details about R’s other data structures, which are less common than vectors and data frames. Two that are worth mentioning briefly, though, are:

  • Matrix, which can be convenient when you have tabular data that is exclusively numeric (excluding names/labels).

  • List, which is more flexible (and complicated) than vectors: it can contain multiple data types, and can also be hierarchically structured.
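To give a first impression of these two structures, here is a minimal sketch. The objects num_matrix and cat_list, and their contents, are made up for illustration:

```r
# A matrix: tabular data of a single type, here 2 rows and 3 columns
num_matrix <- matrix(1:6, nrow = 2, ncol = 3)
num_matrix
dim(num_matrix)

# A list: its elements can have different data types
cat_list <- list(name = "Luna", weight = 2.1, likes_string = TRUE)
str(cat_list)

# Named list elements can be extracted with $, like data frame columns
cat_list$weight
```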


7.3 Missing values (NA)

R has a concept of missing data, which is important in statistical computing, as not all information/measurements are always available for each sample.

In R, missing values are coded as NA (and this is not a character string, so it is not quoted):

# This vector will contain one missing value
vector_NA <- c(1, 3, NA, 7)
vector_NA
[1]  1  3 NA  7

The main reason to bring this up so early in your R journey is that you should be aware of the following: many functions that operate on vectors will return NA if any of the elements in the vector is NA:

sum(vector_NA)
[1] NA

The way to get around this is by setting na.rm = TRUE in such functions, for example:

sum(vector_NA, na.rm = TRUE)
[1] 11
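It is also useful to be able to find and count missing values. A short sketch using is.na(), which returns TRUE for each element that is NA:

```r
vector_NA <- c(1, 3, NA, 7)

# Which elements are missing?
is.na(vector_NA)

# How many are missing? (each TRUE counts as 1)
sum(is.na(vector_NA))

# na.rm = TRUE works the same way in mean() and similar functions
mean(vector_NA, na.rm = TRUE)
```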


7.4 More on the logical data type

If you think 1/0 could be more useful than TRUE/FALSE because it’s easier to count the number of cases something is true or false, consider:

TRUE + TRUE
[1] 2

So, logicals can be used as if they were numbers, in which case FALSE represents 0 and TRUE represents 1.
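This makes counting easy: sum() on a logical vector gives the number of TRUE values, and mean() gives their proportion. A small sketch with illustrative vectors likes_string and weights:

```r
likes_string <- c(TRUE, FALSE, TRUE)

# Number of TRUE values
sum(likes_string)

# Proportion of TRUE values
mean(likes_string)

# The same trick works on the result of a comparison
weights <- c(2.1, 5.0, 3.2)
sum(weights > 3)
```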
