R’s data structures and data types

Author

Software Carpentry / Jelmer Poelstra

Published

February 11, 2025



1 Introduction

What we’ll cover

In this session, we will learn about R’s data structures and data types.

  • Data structures are the kinds of objects that R can store data in. Here, we will cover the two most common ones: vectors and data frames.

  • Data types are how R distinguishes between different kinds of data like numbers and character strings. Here, we’ll talk about the 4 main data types: character, integer, double, and logical. We’ll also cover factors, a construct related to the data types.

Setting up

To make it easier to keep track of what we do, we’ll write our code in a script (and send it to the console from there) – here is how to create and save a new R script:

  1. Open a new R script (click the + symbol in the toolbar at the top, then click R Script)1.

  2. Save the script straight away as data-structures.R – you can save it anywhere you like, though it is probably best to save it in a folder specifically for this workshop.

  3. If you want the section headers as comments in your script, as in the script I am showing you now, then copy-and-paste the following into your script:

Section headers for your script:
# 2 - Vectors ------------------------------------------------------------------
# 2.1 - Single-element vectors and quoting

# 2.2 - Multi-element vectors

# 2.3 - Vectorization

# Challenge 1
# A. Start by making a vector x with the whole numbers 1 through 26.
#    Then, subtract 0.5 from each element in the vector and save the result in vector y.
#    Check your results by printing both vectors.

# B. What do you think will be the result of the following operation?
#    1:5 * 1:5

# 2.4 - Exploring vectors

# 2.5 - Extracting elements from vectors

# 3 - Data frames --------------------------------------------------------------
# 3.1 - Data frame intro

# 4 - Data types ---------------------------------------------------------------
# 4.1 - R's main data types

# 4.2 - Factors

# 4.3 - A vector can only contain one data type

# Challenge 2
# What type of vector (if any) do you think each of the following will produce?
# Try it out and see if you were right.
#   typeof("TRUE")
#   typeof(banana)
#   typeof(c(2, 6, "3"))
# Bonus / trick question:
#   typeof(18, 3)

# 4.4 - Automatic type coercion
# 4.5 - Manual type conversion


2 Data structure 1: Vectors

The first data structure we will explore is the simplest: the vector. A vector in R is essentially a collection of one or more items. Moving forward, we’ll call such individual items “elements”.

2.1 Single-element vectors and quoting

Vectors can consist of just a single element, so each of the two lines of code below creates a vector:

vector1 <- 8
vector2 <- "panda"

Two things are worth noting about the "panda" example, which is a so-called character string (or string for short):

  • "panda" constitutes one element, not 5 (its number of letters).
  • Unlike when dealing with numbers, we have to quote the string.2

Character strings need to be quoted because they are otherwise interpreted as R objects – for example, because our vectors vector1 and vector2 are objects, we refer to them without quotes:

# [Note that R will show auto-complete options after you type 3 characters]
vector1
[1] 8
vector2
[1] "panda"

Therefore, the code below doesn’t work, because there is no object called panda:

vector_fail <- panda
Error: object 'panda' not found

2.2 Multi-element vectors

A common way to make vectors with multiple elements is by using the c (combine) function:

c(2, 6, 3)
[1] 2 6 3

Unlike in the first couple of vector examples, we didn’t save the above vector to an object: it was simply printed to the console, but it was created all the same.

c() can also append elements to an existing vector:

# First we create a vector:
vector_to_append <- c("vhagar", "meleys")
vector_to_append
[1] "vhagar" "meleys"
# Then we append another element to it:
c(vector_to_append, "balerion the dread")
[1] "vhagar"             "meleys"             "balerion the dread"
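
One point that is easy to miss: c() returns a new vector and does not change the existing one. The sketch below illustrates this (the object name dragons is made up for this example); to keep the appended element, we assign the result back to an object:

```r
# Create a vector, then "append" to it without saving the result
dragons <- c("vhagar", "meleys")
c(dragons, "balerion the dread")   # printed to the console, but not stored
length(dragons)                    # still 2

# Assign the result to keep the appended element
dragons <- c(dragons, "balerion the dread")
length(dragons)                    # now 3
```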

To create vectors with series of numbers, a couple of shortcuts are available. First, you can make series of whole numbers with the : operator:

1:10
 [1]  1  2  3  4  5  6  7  8  9 10

Second, you can use a function like seq() for fine control over the sequence:

myseq <- seq(from = 6, to = 8, by = 0.2)
myseq
 [1] 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0
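
As a further sketch of what seq() can do (the object names here are made up for illustration), we can ask for a fixed number of evenly spaced values with the length.out argument, or count downwards with a negative by:

```r
# Five evenly spaced values between 0 and 1
evenly_spaced <- seq(from = 0, to = 1, length.out = 5)
evenly_spaced   # 0.00 0.25 0.50 0.75 1.00

# A descending series in steps of 2
countdown <- seq(from = 10, to = 2, by = -2)
countdown       # 10 8 6 4 2
```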

2.3 Vectorization

Consider the output of this command:

myseq * 2
 [1] 12.0 12.4 12.8 13.2 13.6 14.0 14.4 14.8 15.2 15.6 16.0

Above, every individual element in myseq was multiplied by 2. We call this behavior “vectorization” and this is a key feature of the R language. (Alternatively, you may have expected this code to repeat myseq twice, but this did not happen!)

For more about vectorization, see episode 9 from the Carpentries lesson that this material is based on.
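
Vectorization is not limited to multiplication. As a minimal sketch with made-up temperature values, the same element-by-element behavior applies to longer arithmetic expressions and to many functions, such as sqrt():

```r
# Hypothetical temperatures in degrees Celsius
temps_celsius <- c(12.5, 18.0, 21.3)

# The whole expression is applied to every element
temps_fahrenheit <- temps_celsius * 9 / 5 + 32
temps_fahrenheit   # 54.50 64.40 70.34

# Many functions are vectorized too
roots <- sqrt(c(4, 9, 16))
roots              # 2 3 4
```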


Challenge 1


A. Start by making a vector x with the whole numbers 1 through 26. Then, subtract 0.5 from each element in the vector and save the result in vector y. Check your results by printing both vectors.

Solution:
x <- 1:26
x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26
y <- x - 0.5
y
 [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5
[16] 15.5 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5

B. What do you think will be the result of the following operation? Try it out and see if you were right.

1:5 * 1:5
Solution:
1:5 * 1:5
[1]  1  4  9 16 25

Both vectors are of length 5, which leads to “element-wise matching”: the first element in the first vector is multiplied by the first element in the second vector, the second element in the first vector is multiplied by the second element in the second vector, and so on.
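
The same element-wise matching is what makes calculations across two related vectors convenient. A small sketch with made-up measurements:

```r
# Hypothetical widths and heights of three rectangles
widths  <- c(2, 3, 4)
heights <- c(10, 20, 30)

# Each width is multiplied by the height in the same position
areas <- widths * heights
areas   # 20 60 120
```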


2.4 Exploring vectors

R has many built-in functions to get information about vectors and other types of objects, such as:

Get the first and last few elements, respectively, with head() and tail():

# Print the first 6 elements:
head(myseq)
[1] 6.0 6.2 6.4 6.6 6.8 7.0
# Both head and tail take an argument `n` to specify the number of elements to print:
head(myseq, n = 2)
[1] 6.0 6.2
# Print the last 6 elements:
tail(myseq)
[1] 7.0 7.2 7.4 7.6 7.8 8.0

Get the number of elements with length():

length(myseq)
[1] 11

Get arithmetic summaries like sum() and mean() for vectors with numbers:

# sum() will sum the values of all elements
sum(myseq)
[1] 77
# mean() will compute the mean (average) across all elements
mean(myseq)
[1] 7
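
A few more summary functions work the same way. The sketch below re-creates myseq so that it can be run on its own, and uses the base R functions min(), max(), and range():

```r
myseq <- seq(from = 6, to = 8, by = 0.2)

smallest <- min(myseq)    # smallest value: 6
largest  <- max(myseq)    # largest value: 8
both     <- range(myseq)  # smallest and largest together: 6 8
both
```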


2.5 Extracting elements from vectors

Extracting elements from objects like vectors is often referred to as “indexing”. In R, we can do this using bracket notation – for example:

  • Get the second element:

    myseq[2]
    [1] 6.2
  • Get the second through the fifth elements:

    myseq[2:5]
    [1] 6.2 6.4 6.6 6.8
  • Get the first and eighth elements:

    myseq[c(1, 8)]
    [1] 6.0 7.4

To put this in a general way: we can extract elements from a vector by using another vector, whose values are the positional indices of the elements in the original vector.
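
Because the index is itself a vector, we can store it in an object first, or change its order. A minimal sketch (the name positions is made up for this example; myseq is re-created so the code runs on its own):

```r
myseq <- seq(from = 6, to = 8, by = 0.2)

# Store the positions in a vector, then use that vector as the index
positions <- c(1, 8)
picked <- myseq[positions]     # same as myseq[c(1, 8)]
picked                         # 6.0 7.4

# The result follows the order of the index vector
reordered <- myseq[c(3, 1)]
reordered                      # 6.4 6.0

# Use length() to get the last element without knowing its position
last_one <- myseq[length(myseq)]
last_one                       # 8
```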


3 Data structure 2: Data frames

3.1 R stores tabular data in “data frames”

One of R’s most powerful features is its built-in ability to deal with tabular data – i.e., data with rows and columns, as you may know from spreadsheet programs like Excel.

In R, tabular data is stored in objects that are called “data frames”, the second R data structure we’ll cover in some depth. Let’s start by making a toy data frame with information about 3 cats:

cats <- data.frame(
  name = c("Luna", "Thomas", "Daisy"),
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2)
  )

cats
    name   coat weight
1   Luna calico    2.1
2 Thomas  black    5.0
3  Daisy  tabby    3.2

Above:

  • We created 3 vectors and pasted them side-by-side to create a data frame in which each vector constitutes a column.
  • We gave each vector a name (e.g., coat), and those names became the column names.
  • The resulting data frame has 3 rows (one for each cat) and 3 columns (each with a type of info about the cats, like coat color).
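
To make the link between vectors and columns explicit, here is a small sketch that first creates the vectors as separate objects and then combines them (the object names are made up for this example). The base R functions nrow(), ncol(), and names() report the dimensions and column names:

```r
# Create the vectors first...
cat_names   <- c("Luna", "Thomas", "Daisy")
cat_weights <- c(2.1, 5.0, 3.2)

# ...then combine them into a data frame, one column per vector
cats_small <- data.frame(name = cat_names, weight = cat_weights)

nrow(cats_small)    # number of rows: 3
ncol(cats_small)    # number of columns: 2
names(cats_small)   # column names: "name" "weight"
```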

Data frames are typically (and best) organized like above, where:

  • Each column contains a different “variable” (e.g. coat color, weight)
  • Each row contains a different “observation” (data on e.g. one cat/person/sample)

That’s all we’ll say about data frames for now, but in today’s remaining sessions we will explore this key R data structure more!


4 Data types

4.1 R’s main data types

R distinguishes different kinds of data, such as character strings and numbers, in a formal way, using several pre-defined “data types”. The behavior of R in various operations will depend heavily on the data type – for example, the below fails:

"valerion" * 5
Error in "valerion" * 5: non-numeric argument to binary operator

We can ask what type of data something is in R using the typeof() function:

typeof("valerion")
[1] "character"

R sets the data type of "valerion" to character; values of this type are commonly called character strings, or strings for short. In formal terms, the earlier command failed because R does not allow arithmetic on vectors of type character.

The character data type most commonly contains letters, but anything that is placed between quotes ("...") will be interpreted as the character data type — even plain numbers:

typeof("5")
[1] "character"

Besides character, the other 3 common data types are:

  • double / numeric – numbers that can have decimal points:

    typeof(3.14)
    [1] "double"
  • integer – whole numbers only:

    typeof(1:3)
    [1] "integer"
  • logical (either TRUE or FALSE – unquoted!):

    typeof(TRUE)
    [1] "logical"
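
One detail that can be surprising: a whole number typed on its own is stored as a double, not as an integer. As a short sketch, adding the suffix L asks R for an integer, and is.numeric() returns TRUE for both types:

```r
typeof(5)        # "double": plain numbers are doubles by default
typeof(5L)       # "integer": the L suffix requests an integer
typeof(1:3)      # "integer": the : operator produces integers

is.numeric(5)    # TRUE
is.numeric(5L)   # TRUE
is.numeric("5")  # FALSE: this is a character string
```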

4.2 Factors

Categorical data, like treatments in an experiment, can be stored as “factors” in R. Factors are useful for statistical analyses and for plotting, e.g. because they allow you to specify a custom order.

diet_vec <- c("high", "medium", "low", "low", "medium")
diet_vec
[1] "high"   "medium" "low"    "low"    "medium"
factor(diet_vec)
[1] high   medium low    low    medium
Levels: high low medium

In the example above, we turned a character vector into a factor. Its “levels” (low, medium, high) are sorted alphabetically by default, but we can manually specify an order that makes more sense:

diet_fct <- factor(diet_vec, levels = c("low", "medium", "high"))
diet_fct
[1] high   medium low    low    medium
Levels: low medium high

This ordering would be automatically respected in plots and statistical analyses.


For most intents and purposes, it makes sense to think of factors as another data type, even though technically, they are a kind of data structure built on the integer data type:

typeof(diet_fct)
[1] "integer"
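
To see what this means in practice, the sketch below re-creates diet_fct and then looks at its parts with the base R functions levels(), as.integer(), and class(). Each element is stored as the position of its level:

```r
diet_vec <- c("high", "medium", "low", "low", "medium")
diet_fct <- factor(diet_vec, levels = c("low", "medium", "high"))

levels(diet_fct)       # "low" "medium" "high"
as.integer(diet_fct)   # 3 2 1 1 2: "high" is level 3, "medium" is level 2, ...
class(diet_fct)        # "factor"
```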

4.3 A vector can only contain one data type

Individual vectors, and therefore also individual columns in data frames, can only be composed of a single data type.

R will silently pick the “best-fitting” data type when you enter or read data into a data frame. So let’s see what the data types are in our cats data frame:

str(cats)
'data.frame':   3 obs. of  3 variables:
 $ name  : chr  "Luna" "Thomas" "Daisy"
 $ coat  : chr  "calico" "black" "tabby"
 $ weight: num  2.1 5 3.2
  • The name and coat columns are character, abbreviated chr.
  • The weight column is double/numeric, abbreviated num.
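
If we only want the data types, one possible approach (a sketch, not the only way) is to apply typeof() to every column with sapply(), a base R function that has not been introduced in this session. The data frame is re-created here so the code runs on its own:

```r
# stringsAsFactors = FALSE is the default since R 4.0.0; it is spelled out
# here so that older R versions give the same result
cats <- data.frame(
  name = c("Luna", "Thomas", "Daisy"),
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  stringsAsFactors = FALSE
)

# Apply typeof() to each column
col_types <- sapply(cats, typeof)
col_types   # name: "character", coat: "character", weight: "double"
```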

Challenge 2

What type of vector (if any) do you think each of the following will produce? Try it out and see if you were right.

typeof("TRUE")
typeof(banana)
typeof(c(2, 6, "3"))

Bonus / trick question:

typeof(18, 3)
Solutions:
typeof("TRUE")
[1] "character"
  1. "TRUE" is character (and not logical) because of the quotes around it.
typeof(banana)
Error: object 'banana' not found
  2. Recall the earlier example: this returns an error because the object banana does not exist. Any unquoted string (that is not a special keyword like TRUE or FALSE) is interpreted as a reference to an object in R.
typeof(c(2, 6, "3"))
[1] "character"
  3. We’ll talk about why this produces a character vector in the next section.
typeof(18, 3)
Error in typeof(18, 3): unused argument (3)
  4. This produces an error because typeof() only accepts a single argument, which is an R object like a vector. Because we did not wrap 18, 3 within c() (i.e. we did not use c(18, 3)), we ended up passing two arguments to the function, and this resulted in an error.

    If you guessed that it would return integer (or double) twice, you were on the right track: you couldn’t have known that the function does not accept multiple objects.


4.4 Automatic type coercion

That a character vector was returned by c(2, 6, "3") in the challenge above is due to something called type coercion.

When R encounters a mix of types (here, numbers and characters) to be combined into a single vector, it will force them all to be the same type. It “must” do this because, as pointed out above, a vector can consist of only a single data type.

Type coercion can be the source of many surprises, and is one reason we need to be aware of the basic data types and how R will interpret them.
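
The sketch below shows which type wins when types are mixed within c(). Roughly speaking, R picks the most flexible type present: logical gives way to integer, integer to double, and everything to character:

```r
typeof(c(TRUE, 2L))    # "integer": TRUE becomes 1
typeof(c(2L, 3.5))     # "double"
typeof(c(3.5, "a"))    # "character"

# With all of them mixed, every element becomes a character string
mixed <- c(TRUE, 2, "a")
mixed                  # "TRUE" "2" "a"
```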


4.5 Manual type conversion

Luckily, you are not simply at the mercy of whatever R decides to do automatically, but can convert vectors at will using the as. group of functions:

Try to use RStudio’s auto-complete functionality here: type “as.” and then press the Tab key.
as.integer(c("0", "2", "4"))
[1] 0 2 4
as.character(c(0, 2, 4))
[1] "0" "2" "4"

As you may have guessed, though, not all type conversions are possible — for example:

as.double("kiwi")
Warning: NAs introduced by coercion
[1] NA

(NA is R’s way of denoting missing data – see section 5.4 in the bonus material for more.)
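
A few more conversions are worth trying, because the results are not always what one might expect. The sketch below uses only base R functions and made-up values:

```r
# as.integer() drops the decimal part, it does not round
truncated <- as.integer(3.9)
truncated            # 3
round(3.9)           # 4, for comparison

# as.logical() understands strings like "TRUE" but not "yes"
as.logical("TRUE")   # TRUE
as.logical("yes")    # NA

# Numbers stored as strings convert cleanly
as.double("2.5")     # 2.5
```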





5 Bonus material for self-study

5.1 Changing vector elements using indexing

Above, we saw how we can extract elements of a vector using indexing. To change elements in a vector, use the same bracket notation on the left-hand side of the assignment arrow – for example:

  • Change the first element to 30:

    myseq[1] <- 30
    myseq
     [1] 30.0  6.2  6.4  6.6  6.8  7.0  7.2  7.4  7.6  7.8  8.0
  • Change the last element to 0:

    myseq[length(myseq)] <- 0
    myseq
     [1] 30.0  6.2  6.4  6.6  6.8  7.0  7.2  7.4  7.6  7.8  0.0
  • Change the second element to the mean value of the vector:

    myseq[2] <- mean(myseq)
    myseq
     [1] 30.000000  8.454545  6.400000  6.600000  6.800000  7.000000  7.200000
     [8]  7.400000  7.600000  7.800000  0.000000
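
We can also change several elements in one go by indexing with a vector. A minimal sketch, starting again from a fresh copy of myseq:

```r
myseq <- seq(from = 6, to = 8, by = 0.2)

# Replace the first two elements with two new values
myseq[c(1, 2)] <- c(100, 200)

# Replace the third and fourth elements with the same single value,
# which R reuses for each position
myseq[3:4] <- 0

head(myseq)   # 100 200 0 0 6.8 7
```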

5.2 Extracting columns from a data frame

We can extract individual columns from a data frame using the $ operator:

cats$weight
[1] 2.1 5.0 3.2
cats$coat
[1] "calico" "black"  "tabby" 

This kind of operation will return a vector – and can be indexed as well:

cats$weight[2]
[1] 5
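
Since an extracted column is a regular vector, the functions from the vectors section work on it. The sketch below re-creates cats so it runs on its own, and also shows the double-bracket notation, an equivalent way to extract a column by name:

```r
cats <- data.frame(
  name = c("Luna", "Thomas", "Daisy"),
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2)
)

# Use vector functions on a column
mean_weight <- mean(cats$weight)
mean_weight            # about 3.43
length(cats$weight)    # 3

# cats[["weight"]] returns the same vector as cats$weight
identical(cats$weight, cats[["weight"]])   # TRUE
```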

5.3 More on the logical data type

Let’s add a column to our cats data frame indicating whether each cat does or does not like string:

cats$likes_string <- c(1, 0, 1)
cats
    name   coat weight likes_string
1   Luna calico    2.1            1
2 Thomas  black    5.0            0
3  Daisy  tabby    3.2            1

So, likes_string is numeric, but the 1s and 0s actually represent TRUE and FALSE.

We could instead use the logical data type here, by converting this column with the as.logical() function, which turns 0s into FALSE and every other number, including 1, into TRUE:

as.logical(cats$likes_string)
[1]  TRUE FALSE  TRUE

And to actually modify this column in the data frame itself, we would do this:

cats$likes_string <- as.logical(cats$likes_string)
cats
    name   coat weight likes_string
1   Luna calico    2.1         TRUE
2 Thomas  black    5.0        FALSE
3  Daisy  tabby    3.2         TRUE

You might think that 1/0 could be a handier coding than TRUE/FALSE because it may make it easier, for example, to count the number of times something is true or false. But consider the following:

TRUE + TRUE
[1] 2

So, logicals can be used as if they were numbers, in which case FALSE represents 0 and TRUE represents 1.
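
This means that counting is just as easy with logicals. As a short sketch with a stand-alone logical vector: sum() gives the number of TRUE values and mean() gives the proportion of TRUE values:

```r
likes_string <- c(TRUE, FALSE, TRUE)

n_true <- sum(likes_string)      # number of TRUEs: 2
n_true

prop_true <- mean(likes_string)  # proportion of TRUEs: about 0.67
prop_true
```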


5.4 Missing values (NA)

R has a concept of missing data, which is important in statistical computing because not every measurement is available for every sample.

In R, missing values are coded as NA (and like TRUE/FALSE, this is not a character string, so it is not quoted):

# This vector will contain one missing value
vector_NA <- c(1, 3, NA, 7)
vector_NA
[1]  1  3 NA  7

A key thing to be aware of with NAs is that many functions that operate on vectors will return NA if any element in the vector is NA:

sum(vector_NA)
[1] NA

The way to get around this is by setting na.rm = TRUE in such functions, for example:

sum(vector_NA, na.rm = TRUE)
[1] 11
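
The same argument works for mean(), and the base R function is.na() tells us where the missing values are. A minimal sketch:

```r
vector_NA <- c(1, 3, NA, 7)

mean(vector_NA)                           # NA
mean_no_na <- mean(vector_NA, na.rm = TRUE)
mean_no_na                                # about 3.67 (11 divided by 3)

# is.na() returns TRUE for each missing element
is.na(vector_NA)                          # FALSE FALSE TRUE FALSE

# Because TRUE counts as 1, sum() gives the number of missing values
n_missing <- sum(is.na(vector_NA))
n_missing                                 # 1
```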

5.5 A few other data structures in R

We did not go into details about R’s other data structures, which are less common than vectors and data frames. Two that are worth mentioning briefly, though, are:

  • Matrix, which can be convenient when you have tabular data that is exclusively numeric (excluding names/labels).

  • List, which is more flexible (and complicated) than vectors: it can contain multiple data types, and can also be hierarchically structured.
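
To give a rough impression of these two structures, here is a small sketch with made-up contents (the object names are hypothetical):

```r
# A matrix: numbers arranged in rows and columns, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)       # 2 3 (rows, columns)
m[2, 3]      # element in row 2, column 3: 6

# A list: elements can have different types and lengths
cat_info <- list(
  name = "Luna",
  weight = 2.1,
  toys = c("ball", "string")
)
typeof(cat_info)   # "list"
cat_info$toys      # "ball" "string"
```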


Bonus Challenge

An important part of every data analysis is cleaning input data. Here, you will clean a cat data set that has an added observation with a problematic data entry.

Start by creating the new data frame:

cats_v2 <- data.frame(
  name = c("Luna", "Thomas", "Daisy", "Oliver"),
  coat = c("calico", "black", "tabby", "tabby"),
  weight = c(2.1, 5.0, 3.2, "2.3 or 2.4")
)

Then move on to the tasks below, filling in the blanks (_____) and running the code:

# 1. Explore the data frame,
#    including with an overview that shows the columns' data types:
cats_v2
_____(cats_v2)

# 2. The "weight" column has the incorrect data type _____.
#    The correct data type is: _____.

# 3. Correct the 4th weight with the mean of the two given values,
#    then print the data frame to see the effect:
cats_v2$weight[4] <- 2.35
cats_v2

# 4. Convert the weight column to the right data type:
cats_v2$weight <- _____(cats_v2$weight)

# 5. Calculate the mean weight of the cats:
_____
Solution:
# 1. Explore the data frame,
#    including with an overview that shows the columns' data types:
cats_v2
    name   coat     weight
1   Luna calico        2.1
2 Thomas  black          5
3  Daisy  tabby        3.2
4 Oliver  tabby 2.3 or 2.4
str(cats_v2)
'data.frame':   4 obs. of  3 variables:
 $ name  : chr  "Luna" "Thomas" "Daisy" "Oliver"
 $ coat  : chr  "calico" "black" "tabby" "tabby"
 $ weight: chr  "2.1" "5" "3.2" "2.3 or 2.4"
# 2. The "weight" column has the incorrect data type CHARACTER.
#    The correct data type is: DOUBLE.

# 3. Correct the 4th weight data point with the mean of the two given values,
#    then print the data frame to see the effect:
cats_v2$weight[4] <- 2.35
cats_v2
    name   coat weight
1   Luna calico    2.1
2 Thomas  black      5
3  Daisy  tabby    3.2
4 Oliver  tabby   2.35
# 4. Convert the weight column to the right data type:
cats_v2$weight <- as.double(cats_v2$weight)

# 5. Calculate the mean weight of the cats:
mean(cats_v2$weight)
[1] 3.1625

5.6 Learn more

To learn more about data types and data structures, see this episode from a separate Carpentries lesson.


Footnotes

  1. Or click File => New File => R Script.

  2. Either double quotes ("...") or single quotes ('...') work, but the former are most commonly used by convention.