```
set.seed(1234)
<- data.frame(Control = rnorm(50, 35, 10),
df Trt1 = rnorm(50, 37, 10),
Trt2 = rnorm(50, 75, 10),
Block = rep(c("a", "b", "c", "d", "e"), 10))
```

# R Basics 5: Data Manipulation With Base R

## 1 Introduction

**Recap of last week**

Last week, we discussed vectorized operations and introduced the concept of a data frame. Furthermore, you successfully created a data frame containing multiple columns and rows.

Today, we will explore data manipulation using base R syntax. It is important to note that there are countless ways to achieve the same objective, particularly in base R. Whenever possible, let us consider simpler syntax.

## 2 Data manipulation with (base) R

### 2.1 Create a new dataset

###

##
**Please create a new data frame and name it df. This data frame should consist of three columns (Control, Trt1, and Trt2) with 50 observations each. The Control column should contain 50 data points that follow a normal distribution with a mean of 35 and a standard deviation of 10. Likewise, the Trt1 column should have a mean of 37 and a standard deviation of 10, and the Trt2 column should have a mean of 75 and a standard deviation of 10. Additionally, add five blocks (a, b, c, d, e), each repeating 10 times. Let’s use the function **`set.seed(1234)`

to work with the same values. **(Click for the answer)**

`set.seed(1234)`

to work with the same values.##
**Data mostly come in two shapes – “long” format and “wide” format.**

**
What type of data do you think is **`df`

? **(Click for the answer)**

**Data mostly come in two shapes – “long” format and “wide” format.**

`df`

?Our data frame `df`

is in **wide** format.

### 2.2 Extract variables (columns)

There are multiple ways to extract/select variables/columns. Here are two methods that we have previously used:

```
c("Control", "Trt2")] # by name
df[,
c(1, 2)] # by column index df[,
```

### 2.3 Make new variables (columns)

Let’s create two new variables from existing ones:

```
$Trt1.log <- log(df$Trt1)
df$Trt2.log <- log(df$Trt2) df
```

### 2.4 Extract observations (rows)

There are multiple ways to extract/filter observations/rows. Here are two ways we can do this:

```
# Using [,]
$Trt1.log < 3.5, ]
df[df
$Trt2.log > 4.2 & df$Block == "a", ]
df[df
# Using subset
subset(df, df$Trt2.log > 4.2 & df$Block == "a")
```

### 2.5 Arrange observations (rows)

Sorting is an operation that we typically perform when manipulating our dataset.

```
# ascending order of Block (alphabetic) followed by ascending order of Trt2.log
order(df$Block, df$Trt2.log) , ]
df[
# descending order of Block (alphabetic) followed by ascending order of Trt2.log
order(rev(df$Block), df$Trt2.log) , ] df[
```

### 2.6 Summarize observations (rows)

There are numerous ways to accomplish this task, and we will discover additional methods as we progress to the `tidyverse`

package.

```
# Manually create a data.frame
data.frame(Trt1.mean = mean(df$Trt1),
Trt1.sd = sd(df$Trt1),
Trt2.mean = mean(df$Trt2),
Trt2.sd = sd(df$Trt2))
```

### 2.7 Summarize rows within groups

Typically, our goal is to summarize data according to specific variables. Below is how we can achieve this:

```
# First operate in the data.frame by group
<- by(df,
df_by INDICES = list(df$Block),
FUN = function(x){
data.frame(Block = unique(x$Block),
Control.mean = mean(x$Control),
Control.sd = sd(x$Control),
Trt1.mean = mean(x$Trt1),
Trt1.sd = sd(x$Trt1),
Trt2.mean = mean(x$Trt2),
Trt2.sd = sd(x$Trt2))
})
# Then combine the results into a data.frame
do.call(rbind, df_by)
```

Alternatively, we can use the `aggregate()`

function:

```
<- do.call(data.frame, aggregate(cbind(Control, Trt1, Trt2) ~ Block, data = df, FUN = function(x) c(mean = mean(x), sd = sd(x) ) ))
df.by df.by
```

Let’s take a look at each treatment graphically (**we will do more data viz soon!**):

```
barplot(df.by$Control.mean, names.arg = paste(df.by$Block))
barplot(df.by$Trt1.mean, names.arg = paste(df.by$Block))
barplot(df.by$Trt2.mean, names.arg = paste(df.by$Block))
```

### 2.8 Reshape our data frame

Wide to long format:

In data analysis, the need to reshape the data frequently arises in order to enhance manageability and usefulness. Reshaping the data entails converting it from one format, such as wide, to another, like long, or vice versa. Such transformations aid in facilitating data accessibility, simplifying analysis, and providing more information.

```
<- reshape(df,
l varying = c("Control", "Trt1", "Trt2", "Trt1.log", "Trt2.log"),
v.names = "Yield",
timevar = "Treatment",
times = c("Control", "Trt1", "Trt2", "Trt1.log", "Trt2.log"),
new.row.names = 1:1000,
direction = "long")
l
```

## 3 Practice

Before seeking assistance from others, it is generally advisable for you to attempt to resolve the problem on your own. R provides comprehensive tools for accessing documentation and searching for help.

### 3.1 **Exercise 1**

##
**Let’s use the same data frame **`df`

we created at the beginning of this session **(Click for the answer)**

`df`

we created at the beginning of this sessionPlease go to the beginning of this session and re-run the code.

### 3.2 Exercise 2

##
**A more informative variable could be one showing the difference between the treatment and the control. Create two new variables: **`Trt1.Delta`

and `Trt2.Delta`

. **(Click for the answer)**

`Trt1.Delta`

and `Trt2.Delta`

.```
$Trt1.Delta <- df$Trt1 - df$Control
df$Trt2.Delta <- df$Trt2 - df$Control df
```

### 3.3 Exercise 3

##
**Create a new data frame **`df_delta`

containing the following variables (columns): `Block`

, `Trt1.Delta`

, and `Trt2.Delta`

. **(Click for the answer)**

`df_delta`

containing the following variables (columns): `Block`

, `Trt1.Delta`

, and `Trt2.Delta`

.`<- df[, c("Block", "Trt1.Delta", "Trt2.Delta")] df_delta `

### 3.4 Exercise 4

##
**Summarize **`Trt1.Delta`

and `Trt2.Delta`

by `Block`

. Produce the mean and standard deviation for each variable. **(Click for the answer)**

`Trt1.Delta`

and `Trt2.Delta`

by `Block`

. Produce the mean and standard deviation for each variable.```
<- do.call(data.frame, aggregate(cbind(Trt1.Delta, Trt2.Delta) ~ Block, data = df, FUN = function(x) c(mean = mean(x), sd = sd(x) ) ))
df.by df.by
```

### 3.5 Exercise 5

##
**In the end, our goal is to conduct statistical analysis to assess the impact of the treatment. However, the current data format does not allow us to proceed with our analysis. To fix this, please transform the data frame **`df_delta`

from a wide format to a long format. Please name this new data frame as `long_delta`

. **(Click for the answer)**

`df_delta`

from a wide format to a long format. Please name this new data frame as `long_delta`

.```
<- reshape(df_delta,
l varying = c("Trt1.Delta", "Trt2.Delta"),
v.names = "Yield",
timevar = "Treatment",
times = c("Trt1.Delta", "Trt2.Delta"),
new.row.names = 1:1000,
direction = "long")
```