R Basics 5: Data Manipulation With Base R

r-basics

Author

Horacio Lopez-Nicora

Published

February 9, 2024

1 Introduction

Recap of last week

Last week, we discussed vectorized operations and introduced the concept of a data frame. Furthermore, you successfully created a data frame containing multiple columns and rows.

Today, we will explore data manipulation using base R syntax. It is important to note that there are countless ways to achieve the same objective, particularly in base R. Whenever possible, let us consider simpler syntax.

2 Data manipulation with (base) R

2.1 Create a new dataset

###

Please create a new data frame and name it df. This data frame should consist of three columns (Control, Trt1, and Trt2) with 50 observations each. The Control column should contain 50 data points that follow a normal distribution with a mean of 35 and a standard deviation of 10. Likewise, the Trt1 column should have a mean of 37 and a standard deviation of 10, and the Trt2 column should have a mean of 75 and a standard deviation of 10. Additionally, add five blocks (a, b, c, d, e), each repeating 10 times. Let’s use the function set.seed(1234) to work with the same values. (Click for the answer)

set.seed(1234)
df <- data.frame(Control = rnorm(50, 35, 10),
                 Trt1 = rnorm(50, 37, 10),
                 Trt2 = rnorm(50, 75, 10),
                 Block = rep(c("a", "b", "c", "d", "e"), 10))

Data mostly come in two shapes – “long” format and “wide” format.

What type of data do you think is df? (Click for the answer)

Our data frame df is in wide format.

2.2 Extract variables (columns)

There are multiple ways to extract/select variables/columns. Here are two methods that we have previously used:

df[, c("Control", "Trt2")] # by name

df[, c(1, 2)]  # by column index

2.3 Make new variables (columns)

Let’s create two new variables from existing ones:

df$Trt1.log <- log(df$Trt1)
df$Trt2.log <- log(df$Trt2)

2.4 Extract observations (rows)

There are multiple ways to extract/filter observations/rows. Here are two ways we can do this:

# Using [,]
df[df$Trt1.log < 3.5, ]

df[df$Trt2.log > 4.2 & df$Block == "a", ]

# Using subset
subset(df, df$Trt2.log > 4.2 & df$Block == "a")

2.5 Arrange observations (rows)

Sorting is an operation that we typically perform when manipulating our dataset.

# ascending order of Block (alphabetic) followed by ascending order of Trt2.log
df[order(df$Block, df$Trt2.log) , ]

# descending order of Block (alphabetic) followed by ascending order of Trt2.log
df[order(rev(df$Block), df$Trt2.log) , ]

2.6 Summarize observations (rows)

There are numerous ways to accomplish this task, and we will discover additional methods as we progress to the tidyverse package.

# Manually create a data.frame
data.frame(Trt1.mean = mean(df$Trt1),
           Trt1.sd = sd(df$Trt1),
           Trt2.mean = mean(df$Trt2),
           Trt2.sd = sd(df$Trt2))

2.7 Summarize rows within groups

Typically, our goal is to summarize data according to specific variables. Below is how we can achieve this:

# First operate in the data.frame by group
df_by <- by(df, 
   INDICES = list(df$Block),
   FUN = function(x){
     data.frame(Block = unique(x$Block),
                Control.mean = mean(x$Control),
                Control.sd = sd(x$Control),
                Trt1.mean = mean(x$Trt1),
                Trt1.sd = sd(x$Trt1),
                Trt2.mean = mean(x$Trt2),
                Trt2.sd = sd(x$Trt2))
   })

# Then combine the results into a data.frame
do.call(rbind, df_by)

Alternatively, we can use the aggregate() function:

df.by <- do.call(data.frame, aggregate(cbind(Control, Trt1, Trt2) ~ Block, data = df, FUN = function(x) c(mean = mean(x), sd = sd(x) ) ))
df.by

Let’s take a look at each treatment graphically (we will do more data viz soon!):

barplot(df.by$Control.mean, names.arg =  paste(df.by$Block)) 

barplot(df.by$Trt1.mean, names.arg =  paste(df.by$Block))

barplot(df.by$Trt2.mean, names.arg =  paste(df.by$Block))

2.8 Reshape our data frame

Wide to long format:

In data analysis, the need to reshape the data frequently arises in order to enhance manageability and usefulness. Reshaping the data entails converting it from one format, such as wide, to another, like long, or vice versa. Such transformations aid in facilitating data accessibility, simplifying analysis, and providing more information.

l <- reshape(df, 
             varying = c("Control", "Trt1", "Trt2", "Trt1.log", "Trt2.log"), 
             v.names = "Yield", 
             timevar = "Treatment", 
             times = c("Control", "Trt1", "Trt2", "Trt1.log", "Trt2.log"), 
             new.row.names = 1:1000, 
             direction = "long")

l

3 Practice

Before seeking assistance from others, it is generally advisable for you to attempt to resolve the problem on your own. R provides comprehensive tools for accessing documentation and searching for help.

3.1 Exercise 1

Let’s use the same data frame df we created at the beginning of this session (Click for the answer)

Please go to the beginning of this session and re-run the code.

3.2 Exercise 2

A more informative variable could be one showing the difference between the treatment and the control. Create two new variables: Trt1.Delta and Trt2.Delta. (Click for the answer)

df$Trt1.Delta <- df$Trt1 - df$Control
df$Trt2.Delta <- df$Trt2 - df$Control

3.3 Exercise 3

Create a new data frame df_delta containing the following variables (columns): Block, Trt1.Delta, and Trt2.Delta. (Click for the answer)

df_delta <- df[, c("Block", "Trt1.Delta", "Trt2.Delta")]

3.4 Exercise 4

Summarize Trt1.Delta and Trt2.Delta by Block. Produce the mean and standard deviation for each variable. (Click for the answer)

df.by <- do.call(data.frame, aggregate(cbind(Trt1.Delta, Trt2.Delta) ~ Block, data = df, FUN = function(x) c(mean = mean(x), sd = sd(x) ) ))
df.by

3.5 Exercise 5

In the end, our goal is to conduct statistical analysis to assess the impact of the treatment. However, the current data format does not allow us to proceed with our analysis. To fix this, please transform the data frame df_delta from a wide format to a long format. Please name this new data frame as long_delta. (Click for the answer)

l <- reshape(df_delta, 
             varying = c("Trt1.Delta", "Trt2.Delta"), 
             v.names = "Yield", 
             timevar = "Treatment", 
             times = c("Trt1.Delta", "Trt2.Delta"), 
             new.row.names = 1:1000, 
             direction = "long")