set.seed(1234)
df <- data.frame(Control = rnorm(50, 35, 10),
                 Trt1 = rnorm(50, 37, 10),
                 Trt2 = rnorm(50, 75, 10),
                 Block = rep(c("a", "b", "c", "d", "e"), 10))
R Basics 5: Data Manipulation With Base R
1 Introduction
Recap of last week
Last week, we discussed vectorized operations and introduced the concept of a data frame. Furthermore, you successfully created a data frame containing multiple columns and rows.
Today, we will explore data manipulation using base R syntax. Base R usually offers several ways to achieve the same objective. Where more than one approach works, we will prefer the simpler syntax.
2 Data manipulation with (base) R
2.1 Create a new dataset
Please create a new data frame and name it df. This data frame should consist of three columns (Control, Trt1, and Trt2) with 50 observations each. The Control column should contain 50 data points that follow a normal distribution with a mean of 35 and a standard deviation of 10. Likewise, the Trt1 column should have a mean of 37 and a standard deviation of 10, and the Trt2 column should have a mean of 75 and a standard deviation of 10. Additionally, add a Block column with five blocks (a, b, c, d, e), each repeated 10 times. Let’s call set.seed(1234) first so that we all work with the same values.
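One way to check the result is sketched below. It assumes df is built with the data.frame() call shown at the top of this page. The checks themselves (dim(), str(), table(), colMeans()) are only a suggestion, not part of the original exercise.

```r
# Rebuild df as described in the task (same seed, same order of rnorm() calls)
set.seed(1234)
df <- data.frame(Control = rnorm(50, 35, 10),
                 Trt1 = rnorm(50, 37, 10),
                 Trt2 = rnorm(50, 75, 10),
                 Block = rep(c("a", "b", "c", "d", "e"), 10))

dim(df)          # number of rows and columns
str(df)          # type of each column
table(df$Block)  # each block should appear 10 times
# Sample means should be close to (not exactly) 35, 37, and 75
colMeans(df[, c("Control", "Trt1", "Trt2")])
```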
Data mostly come in two shapes: “long” format and “wide” format.
What type of data do you think df is?
Answer: our data frame df is in wide format.
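To make the two shapes concrete, here is a small sketch with made-up numbers. The data frames wide and long and the column names Plot, Treatment, and Value are hypothetical and only serve as an illustration.

```r
# Hypothetical toy data: two plots, each measured under two treatments.
# Wide format: one row per plot, one column per treatment.
wide <- data.frame(Plot = c(1, 2),
                   Control = c(30, 40),
                   Trt1 = c(35, 45))

# Long format: the same four measurements, one row per measurement,
# with a column that says which treatment each value belongs to.
long <- data.frame(Plot = c(1, 2, 1, 2),
                   Treatment = c("Control", "Control", "Trt1", "Trt1"),
                   Value = c(30, 40, 35, 45))

wide
long
```

Both objects hold the same values. Only the layout differs.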
2.2 Extract variables (columns)
There are multiple ways to extract/select variables/columns. Here are two methods that we have previously used:
df[, c("Control", "Trt2")] # by name
df[, c(1, 2)] # by column index (columns 1 and 2 are Control and Trt1)
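A few other base R ways to pull out a single column are sketched below on a hypothetical data frame named toy. The main difference between them is whether the result is a plain vector or still a data frame.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Control = c(30, 40, 50),
                  Trt1 = c(35, 45, 55),
                  Block = c("a", "b", "c"))

toy$Control                     # a vector
toy[["Control"]]                # a vector; the name can also come from a variable
toy[, "Control"]                # a single column drops to a vector by default
toy[, "Control", drop = FALSE]  # drop = FALSE keeps a one-column data frame
toy["Control"]                  # list-style indexing also returns a data frame
```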
2.3 Make new variables (columns)
Let’s create two new variables from existing ones:
df$Trt1.log <- log(df$Trt1)
df$Trt2.log <- log(df$Trt2)
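As an alternative sketch, base R's transform() can add several columns in one call and lets you refer to existing columns without the df$ prefix. The data frame toy below is hypothetical.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Trt1 = c(10, 20), Trt2 = c(30, 40))

# transform() returns a copy with the new columns added,
# so the result must be assigned back to keep it
toy <- transform(toy,
                 Trt1.log = log(Trt1),
                 Trt2.log = log(Trt2))
toy
```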
2.4 Extract observations (rows)
There are multiple ways to extract/filter observations/rows. Here are two ways we can do this:
# Using [,]
df[df$Trt1.log < 3.5, ]
df[df$Trt2.log > 4.2 & df$Block == "a", ]
# Using subset (column names can be used directly inside subset)
subset(df, Trt2.log > 4.2 & Block == "a")
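The two approaches treat missing values differently, which the sketch below illustrates on a hypothetical data frame toy that contains one NA. Our simulated df has no missing values, so this only matters for real data.

```r
# Hypothetical toy data frame with one missing value
toy <- data.frame(x = c(1, NA, 3), Block = c("a", "a", "b"))

# Logical indexing: the NA comparison produces a row filled with NA
toy[toy$x > 0, ]

# which() keeps only the positions where the condition is TRUE
toy[which(toy$x > 0), ]

# subset() drops rows where the condition is NA
subset(toy, x > 0)
```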
2.5 Arrange observations (rows)
Sorting is an operation that we typically perform when manipulating our dataset.
# ascending order of Block (alphabetic) followed by ascending order of Trt2.log
df[order(df$Block, df$Trt2.log), ]
# descending order of Block (alphabetic) followed by ascending order of Trt2.log
# (rev() only reverses a vector; it does not sort it in descending order)
df[order(df$Block, df$Trt2.log, decreasing = c(TRUE, FALSE), method = "radix"), ]
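It helps to see what order() actually returns. It does not return sorted values but the row positions that would put the values in order, which is why it is used inside the square brackets. A minimal sketch with a made-up vector x:

```r
# Hypothetical vector, used only for illustration
x <- c(30, 10, 20)

idx <- order(x)  # positions of the smallest, second smallest, ... value
idx              # 2 3 1
x[idx]           # 10 20 30

order(x, decreasing = TRUE)  # 1 3 2
```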
2.6 Summarize observations (rows)
There are numerous ways to summarize a data set, and we will meet additional methods when we move on to the tidyverse packages.
# Manually create a data.frame
data.frame(Trt1.mean = mean(df$Trt1),
Trt1.sd = sd(df$Trt1),
Trt2.mean = mean(df$Trt2),
Trt2.sd = sd(df$Trt2))
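Typing every statistic by hand gets tedious as the number of columns grows. One possible shortcut, sketched on a hypothetical data frame toy, is to let sapply() apply the same summary function to each column.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Trt1 = c(1, 2, 3), Trt2 = c(10, 20, 30))

# Apply the same function to every column; the result is a matrix
# with one row per statistic and one column per variable
stats <- sapply(toy, function(x) c(mean = mean(x), sd = sd(x)))
stats
```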
2.7 Summarize rows within groups
Typically, our goal is to summarize data according to specific variables. Below is how we can achieve this:
# First operate on the data.frame by group
df_by <- by(df,
            INDICES = list(df$Block),
            FUN = function(x){
              data.frame(Block = unique(x$Block),
                         Control.mean = mean(x$Control),
                         Control.sd = sd(x$Control),
                         Trt1.mean = mean(x$Trt1),
                         Trt1.sd = sd(x$Trt1),
                         Trt2.mean = mean(x$Trt2),
                         Trt2.sd = sd(x$Trt2))
            })
# Then combine the results into a data.frame
do.call(rbind, df_by)
Alternatively, we can use the aggregate() function:
df.by <- do.call(data.frame,
                 aggregate(cbind(Control, Trt1, Trt2) ~ Block,
                           data = df,
                           FUN = function(x) c(mean = mean(x), sd = sd(x))))
df.by
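The do.call(data.frame, ...) wrapper may look mysterious. The sketch below, on a hypothetical data frame toy, shows what it does. When FUN returns more than one value, aggregate() stores them in a single matrix column, and do.call(data.frame, ...) splits that matrix into ordinary columns.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Block = c("a", "a", "b", "b"),
                  Trt1 = c(1, 3, 10, 20))

agg <- aggregate(Trt1 ~ Block, data = toy,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))
ncol(agg)            # 2: Block plus one matrix column called Trt1
is.matrix(agg$Trt1)  # TRUE

# Flatten the matrix column into separate columns
flat <- do.call(data.frame, agg)
names(flat)          # "Block" "Trt1.mean" "Trt1.sd"
```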
Let’s take a look at each treatment graphically (we will do more data viz soon!):
barplot(df.by$Control.mean, names.arg = paste(df.by$Block))
barplot(df.by$Trt1.mean, names.arg = paste(df.by$Block))
barplot(df.by$Trt2.mean, names.arg = paste(df.by$Block))
2.8 Reshape our data frame
Wide to long format:
Data often need to be reshaped before they can be analyzed. Reshaping converts a data frame from one format to another, for example from wide to long or vice versa, without changing the values it holds. Many modelling and plotting functions expect long format, with one row per observation and a column that identifies the group.
df_long <- reshape(df,
                   varying = c("Control", "Trt1", "Trt2", "Trt1.log", "Trt2.log"),
                   v.names = "Yield",
                   timevar = "Treatment",
                   times = c("Control", "Trt1", "Trt2", "Trt1.log", "Trt2.log"),
                   new.row.names = 1:1000,
                   direction = "long")
df_long
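The reverse direction, long to wide, also uses reshape(). The sketch below runs a round trip on a hypothetical data frame wide. Note one assumption that differs from the call above: idvar = "Block" is passed explicitly, which works here only because each block appears once in the toy data.

```r
# Hypothetical toy data frame with one row per block
wide <- data.frame(Block = c("a", "b"),
                   Control = c(30, 40),
                   Trt1 = c(35, 45))

# Wide to long: stack Control and Trt1 into a single Yield column
long <- reshape(wide,
                varying = c("Control", "Trt1"),
                v.names = "Yield",
                timevar = "Treatment",
                times = c("Control", "Trt1"),
                idvar = "Block",
                direction = "long")

# Long to wide: one Yield column per treatment again
back <- reshape(long,
                idvar = "Block",
                timevar = "Treatment",
                v.names = "Yield",
                direction = "wide")
back
```

The new column names are built from v.names and the treatment labels, so the original names are not restored automatically.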
3 Practice
Before seeking assistance from others, it is generally advisable for you to attempt to resolve the problem on your own. R provides comprehensive tools for accessing documentation and searching for help.
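A few of the built-in help tools are sketched below. The functions looked up here (reshape, order) are just examples taken from this session.

```r
# Type these at the console to open documentation:
#   ?reshape                 help page for a function (same as help("reshape"))
#   help.search("reshape")   search all installed help pages (same as ??reshape)
#   example(order)           run the examples from a help page

args(order)       # show a function's arguments without opening the help page
apropos("^resh")  # list objects whose names start with "resh"
```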
3.1 Exercise 1
Let’s use the same data frame df we created at the beginning of this session.
Answer: go to the beginning of this session and re-run the code.
3.2 Exercise 2
A more informative variable could be one showing the difference between the treatment and the control. Create two new variables: Trt1.Delta and Trt2.Delta.
Answer:
df$Trt1.Delta <- df$Trt1 - df$Control
df$Trt2.Delta <- df$Trt2 - df$Control
3.3 Exercise 3
Create a new data frame df_delta containing the following variables (columns): Block, Trt1.Delta, and Trt2.Delta.
Answer:
df_delta <- df[, c("Block", "Trt1.Delta", "Trt2.Delta")]
3.4 Exercise 4
Summarize Trt1.Delta and Trt2.Delta by Block. Produce the mean and standard deviation for each variable.
Answer:
df.by <- do.call(data.frame,
                 aggregate(cbind(Trt1.Delta, Trt2.Delta) ~ Block,
                           data = df,
                           FUN = function(x) c(mean = mean(x), sd = sd(x))))
df.by
3.5 Exercise 5
In the end, our goal is to conduct a statistical analysis to assess the impact of the treatment. However, the current wide format does not allow us to proceed with that analysis. To fix this, please transform the data frame df_delta from wide format to long format. Name the new data frame long_delta.
Answer:
long_delta <- reshape(df_delta,
                      varying = c("Trt1.Delta", "Trt2.Delta"),
                      v.names = "Yield",
                      timevar = "Treatment",
                      times = c("Trt1.Delta", "Trt2.Delta"),
                      new.row.names = 1:1000,
                      direction = "long")