Introduction to ggplot
Code structure, data, aesthetic mappings, and geoms
1 Introduction
The very popular R package ggplot2 is based on a system called the Grammar of Graphics by Leland Wilkinson, which aims to create grammatical rules for the development of graphics. It is part of a larger group of packages called “the tidyverse.”
1.1 What is the tidyverse?
The package ggplot2 is a part of a larger collection of packages called “the tidyverse” that are designed for data science. You can certainly use R without using the tidyverse, but it has many packages that I think will make your life a lot easier. We can install just ggplot2 or install all of the packages in the core tidyverse (which is what I’d recommend since we will use the others too), which include:
- dplyr: for data manipulation
- ggplot2: a “grammar of graphics” for creating beautiful plots
- readr: for reading in rectangular data (i.e., Excel-style formatting)
- tibble: using tibbles as modern/better dataframes
- stringr: handling strings (i.e., text or stuff in quotes)
- forcats: for handling categorical variables (i.e., factors) (meow!)
- tidyr: to make “tidy data”
- purrr: for enhancing functional programming (also meow!)
- lubridate: for working with dates
We have used many of these other packages in Code Club. There are more tidyverse packages outside of these core nine, and we will talk about some of them another time.
tl;dr Tidyverse has a lot of packages that make data analysis easier. None of them are required, but I think you’ll find many tidyverse approaches easier and more intuitive than using base R.
You can find here some examples of comparing tidyverse and base R syntax.
1.2 Installing ggplot & tidyverse
To install packages in R that are on the Comprehensive R Archive Network (CRAN), you can use the function install.packages().
install.packages("tidyverse")
install.packages("ggplot2")
We only need to install packages once. But every time we want to use them, we need to “load” them, which we can do using the function library(). Since you will likely use the tidyverse functions often, it’s a good habit to add the code library(tidyverse) to the top of each of your scripts/RMarkdown/Quarto documents.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
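If you share a script with someone who may not have the tidyverse yet, one option is to install it only when it is missing. This is a minimal sketch, not something the lesson requires; the variable names needed and missing_pkgs are just illustrative.

```r
# Packages this script depends on (names chosen for illustration)
needed <- c("tidyverse")

# requireNamespace() returns FALSE when a package is not installed
missing_pkgs <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]

# Install only what is missing, then load as usual
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
library(tidyverse)
```

This keeps install.packages() from re-downloading packages every time the script runs.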
2 What is “ggplot?”
The “gg” in ggplot stands for “grammar of graphics” and all plots share a common template. This is fundamentally different than plotting using a program like Excel, where you first pick your plot type, and then you add your data. With ggplot, you start with data, add a coordinate system, and then add “geoms,” which indicate what type of plot you want. A cool thing about ggplot is that you can add and layer different geoms together, to create a fully customized plot that is exactly what you want. If this sounds nebulous right now, that’s okay, we are going to talk more about this.
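As a preview of that common template, here is a small sketch of the data, then mappings, then geoms pattern. It assumes the mpg example dataset that ships with ggplot2 (not the data used in the rest of this lesson), and each piece is explained step by step in the sections that follow.

```r
library(ggplot2)

# 1. data: mpg is a small example dataset bundled with ggplot2
# 2. aesthetic mappings: which variables go on x and y
# 3. geoms: layers that decide how the data are drawn
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +                 # first layer: one point per car
  geom_smooth(method = "lm")     # second layer: a fitted straight line

p
```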
3 What can you do with ggplot?
Let’s start by looking at the different types of plots that can be made using ggplot2. We will do this by looking at the ggplot2 cheatsheet.
4 A plotting framework
You can think about a ggplot as being composed of layers. You start with your data, and continue to add layers until you get the plot that you want. This might sound a bit abstract so I am going to talk through this with an example.
First, let’s load some practice data. We are going to use a fun 🐧 data set from the package palmerpenguins. If you don’t already have this, you can download it with the code below:
install.packages("palmerpenguins")
Then we can load the data.
library(palmerpenguins)
The dataset itself is called penguins. Let’s look at it using the function glimpse().
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Let’s start by trying to make a simple scatterplot, where we see the relationship between bill_length_mm and bill_depth_mm.
4.1 Data
The first argument passed to your plot is the data. How did I know that? It’s in the documentation, which you can open by putting a ? in front of the function name.
?ggplot
The simplest ggplot code you can write just uses the ggplot() function and indicates the data we want to use. Because data is the default first argument, you can actually omit the data = part of this code and it will work just the same.
ggplot(data = penguins)
Why do we not see a plot? Well we haven’t told R what to plot! We are getting the first “base” layer of the plot.
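One way to convince yourself that this really is just the base layer is to store the plot in an object and look inside it. This is a small sketch; p is an illustrative name, and it relies on ggplot objects exposing their layers through $layers, which I would expect to be empty at this point.

```r
library(ggplot2)
library(palmerpenguins)

# Store the plot instead of printing it
p <- ggplot(data = penguins)

# Is it a ggplot object?
inherits(p, "ggplot")

# How many layers (geoms) have been added so far?
length(p$layers)
```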
You can also pipe the data to the ggplot() function, using either |> or %>%. When reading code, you can interpret the pipe as “and then.” Here, take the penguins data, and then run ggplot(). Writing code in this way is my preference, so I tend to code like this. We talked in more detail about the pipe in past Code Clubs.
penguins |>
  ggplot()
Still nothing. Well that’s what we would expect.
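If the pipe is new to you, this tiny sketch shows that piping is only a different way of writing the same call. It uses nrow() purely as an example function, and the native pipe |> needs R 4.1 or later.

```r
library(palmerpenguins)

# The pipe passes the left-hand side as the first argument of the function
# on the right, so these two lines do the same thing
rows_direct <- nrow(penguins)
rows_piped <- penguins |> nrow()

identical(rows_direct, rows_piped)
```

In the same way, penguins |> ggplot() is another way of writing ggplot(penguins).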
4.2 Aesthetic mappings aes()
Now that we’ve indicated our data, we can add aesthetic mappings so we can work towards actually seeing a plot. We want to make a scatterplot where on the x-axis we have bill length (bill_length_mm), and on the y-axis we have bill depth (bill_depth_mm).
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm))
So we have progressed from a blank plot, but we still do not have a plot by basically anyone’s definition. Why not?
Even though we have indicated to R our data and aesthetic mappings, we have not indicated what precisely to do with our data. We have said what we want on x and y (and now we can see those axis labels appearing), but we have not indicated what type of plot we want. We can do that in the next step, by adding a geom_.
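To see what the plot object knows at this stage, we can again store it and inspect it. This is a sketch with the illustrative name p; I would expect it to show the x and y mappings but still no layers.

```r
library(ggplot2)
library(palmerpenguins)

p <- penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm))

# The global aesthetic mappings stored on the plot
p$mapping

# Still no geoms, so there is nothing to draw yet
length(p$layers)
```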
4.3 Geoms geom_
Now let’s indicate what type of plot we want. In this example, we are going to make a scatterplot, and to do that we will use geom_point().
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
We have a plot! It’s not a finished plot, but it’s a plot and we can work from here.
Let’s say we wanted to see whether penguins of different species are in different places on our plot. We can take the variable species and map it to the aesthetic color.
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Note what R has done for us - we now see each dot colored based on which species it is, and we also have a new legend.
What if we wanted to add a line that shows the relationship between bill_length_mm and bill_depth_mm for each species? We can layer in another geom. Here we will use geom_smooth().
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "lm") # the method is a linear model
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
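To get a feel for what method = "lm" is doing when color is mapped to species, here is a rough base R sketch that fits a separate linear model for each species. It is only meant to mirror the idea of one fitted line per group, not to reproduce ggplot2 internals; fits is an illustrative name.

```r
library(palmerpenguins)

# Split the data by species and fit bill depth ~ bill length within each piece.
# lm() drops rows with missing values by default.
fits <- lapply(
  split(penguins, penguins$species),
  function(d) lm(bill_depth_mm ~ bill_length_mm, data = d)
)

# One column per species; the rows are the intercept and the slope
sapply(fits, coef)
```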
4.3.1 Global vs. local aes()
A note about aesthetic mappings now that we have introduced geoms: aes() can go in two places:
- in the ggplot() call, which means they will be inherited by every layer of the plot
- in a specific geom_, in which case those aesthetics apply only to that specific geom.
So we can make the same plot we saw above by mapping aesthetics within geom_point().
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
penguins |>
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species))
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Let’s look at an example where changing the location of the aesthetic mappings does make a difference.
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "lm") # the method is a linear model
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
penguins |>
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_smooth(method = "lm") # the method is a linear model
`geom_smooth()` using formula = 'y ~ x'
Error in `geom_smooth()`:
! Problem while computing stat.
ℹ Error occurred in the 2nd layer.
Caused by error in `compute_layer()`:
! `stat_smooth()` requires the following missing aesthetics: x and y.
What happened here? We got an error in geom_smooth(): “`stat_smooth()` requires the following missing aesthetics: x and y”.
This happened because we have only set our x and y aesthetics in geom_point() and not in geom_smooth(), so R doesn’t know what to map x and y to. When we map our aesthetics globally, we don’t have this problem because x and y are inherited by every subsequent layer.
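If you did want to keep everything local, one way to fix the error is to give geom_smooth() its own x and y mappings. This is a sketch of that option (p is an illustrative name); because color is mapped only in geom_point() here, I would expect a single fitted line for all of the data.

```r
library(ggplot2)
library(palmerpenguins)

p <- penguins |>
  ggplot() +
  # local mappings for the points
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  # the smoother needs its own x and y, since nothing is set globally
  geom_smooth(aes(x = bill_length_mm, y = bill_depth_mm), method = "lm")

p
```

Repeating the mappings in every geom gets tedious, which is one reason to put shared aesthetics in the ggplot() call.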
We can also do a combination of global and local setting.
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm") # the method is a linear model
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species)) +
  geom_smooth(aes(color = species), method = "lm") +
  geom_smooth(method = "lm", color = "black") # the method is a linear model
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
When we set color only in geom_point(), we do not “group” by color (here, by species), so we get our smoothed line for all the data (instead of by species).
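Grouping does not have to come from color. As a sketch of an alternative, the group aesthetic asks for one line per species while the line color is set to a single fixed value; p is an illustrative name.

```r
library(ggplot2)
library(palmerpenguins)

p <- penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species)) +
  # group = species splits the fit by species without mapping color
  geom_smooth(aes(group = species), method = "lm", color = "black")

p
```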
4.3.2 Mapping vs. ‘setting’
If you want to map a variable to an aesthetic, it MUST be within the aes() statement. If you just want to change the color to “blue” for example, it should be outside the aes() statement. Look at the difference.
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = "#088F8F") # set one fixed color (a blue-green hex code) for every point
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
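If we put “blue” inside our aesthetic mappings instead, we get something that doesn’t make sense: ggplot treats “blue” as a data value rather than a color, so every point gets a default palette color and the legend gains an entry labelled “blue.”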
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = "blue"))
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
tl;dr If you are mapping a variable to an aesthetic, put it inside aes(); if not, put it outside.
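You can also see the difference without drawing anything by looking at the layer objects themselves. This is a sketch that leans on how I understand ggplot2 to store layers (fixed values in aes_params, mappings in mapping, with color standardized to colour); the names set_layer and mapped_layer are illustrative.

```r
library(ggplot2)

# Setting: a fixed value given outside aes()
set_layer <- geom_point(color = "blue")

# Mapping: the string "blue" is treated like data inside aes()
mapped_layer <- geom_point(aes(color = "blue"))

# Fixed values that are applied to every point
set_layer$aes_params

# Mappings, which are passed through a color scale and get a legend
mapped_layer$mapping
```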
5 Practice
Create a plot that shows the relationship between flipper length and body mass. Color your points based on the sex of the penguins.
Need a hint?
Try using the variables flipper_length_mm, body_mass_g, and sex. You can map them to x, y, and color.
Solution
penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, color = sex)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Not happy with the missing values? We can remove them.
penguins |>
  drop_na(flipper_length_mm, body_mass_g, sex) |> # drop missing values
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, color = sex)) +
  geom_point()
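Before dropping rows, it can help to count how many values are missing in each column you are plotting. This is a small base R sketch; na_counts is an illustrative name. As far as I understand it, the warning above mentions only 2 rows because points with a missing x or y are removed, while points with a missing sex are still drawn in a grey NA color.

```r
library(palmerpenguins)

# Count missing values in each of the three plotted columns
na_counts <- colSums(
  is.na(penguins[, c("flipper_length_mm", "body_mass_g", "sex")])
)

na_counts
```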
5.1 Different geoms
Create a boxplot that shows the distribution of body mass for penguins on the different islands.
Need a hint?
The geom for a boxplot is called geom_boxplot().
Solution
penguins |>
  ggplot(aes(x = island, y = body_mass_g)) +
  geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
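If you would like numbers to go with the boxes, here is a rough sketch that computes quantiles of body mass for each island in base R. The middle three rows (25%, 50%, 75%) correspond to the box and its median line; the 0% and 100% rows are the raw minimum and maximum, which need not match the whisker ends when outliers are drawn separately. The name mass_by_island is illustrative.

```r
library(palmerpenguins)

# Split body mass by island, then compute quantiles for each island
mass_by_island <- sapply(
  split(penguins$body_mass_g, penguins$island),
  quantile,
  na.rm = TRUE # ignore the missing body mass values
)

# One column per island; rows are the 0%, 25%, 50%, 75%, and 100% quantiles
mass_by_island
```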
5.2 Mapping to other aesthetics
Create a scatterplot that shows the relationship between bill length and bill depth, but color the points based on what island the penguins are from, and make the points a different shape based on sex.
Need a hint?
You can use the aesthetic shape = in the same way you use color.
Solution
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm,
             color = island, shape = sex)) +
  geom_point()
Warning: Removed 11 rows containing missing values or values outside the scale range
(`geom_point()`).
Not happy with missing values? We can remove them.
penguins |>
  drop_na(bill_length_mm, bill_depth_mm, island, sex) |> # drop missing values
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm,
             color = island, shape = sex)) +
  geom_point()