Introduction to ggplot

Code structure, data, aesthetic mappings, and geoms

ggplot
dataviz
plotting
tidyverse
Author

Jessica Cooperstone

Published

January 14, 2025

1 Introduction

A fuzzy monster in a beret and scarf, critiquing their own column graph on a canvas in front of them while other assistant monsters (also in berets) carry over boxes full of elements that can be used to customize a graph (like themes and geometric shapes). In the background is a wall with framed data visualizations. Stylized text reads “ggplot2: build a data masterpiece.

Figure from Allison Horst

The very popular R package ggplot2 is based on a system called the Grammar of Graphics by Leland Wilkinson which aims to create a grammatical rules for the development of graphics. It is part of a larger group of packages called “the tidyverse.”

1.1 What is the tidyverse?

The package ggplot2 is a part of a larger collection of packages called “the tidyverse” that are designed for data science. You can certainly use R without using the tidyverse, but it has many packages that I think will make your life a lot easier. We can install just ggplot2 or install all of the packages in the core tidyverse (which is what I’d recommend since we will use the others too), which include:

  • dplyr: for data manipulation
  • ggplot2: a “grammar of graphics” for creating beautiful plots
  • readr: for reading in rectangular data (i.e., Excel-style formatting)
  • tibble: using tibbles as modern/better dataframes
  • stringr: handling strings (i.e., text or stuff in quotes)
  • forcats: for handling categorical variables (i.e., factors) (meow!)
  • tidyr: to make “tidy data”
  • purrr: for enhancing functional programming (also meow!)
  • lubridate: for working with dates

We have used many of these other packages in Code Club. There are more tidyverse packages outside of these core nine, and we will talk about some of them another time.

tl;dr Tidyverse has a lot of packages that make data analysis easier. None of them are required, but I think you’ll find many tidyverse approaches easier and more intuitive than using base R.

You can find here some examples of comparing tidyverse and base R syntax.

1.2 Installing ggplot & tidyverse

To install packages in R that are on the Comprehensive R Archive Network (CRAN), you can use the function install.packages().

install.packages("tidyverse")
install.packages("ggplot2")

We only need to install packages once. But, every time we want to use them, we need to “load” them, and can do this using the function library(). Since you will likely often use the tidyverse functions, it’s a good habit to add the code library(tidyverse) to the top of each of your scripts/RMarkdown/Quarto documents.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2 What is “ggplot?”

The “gg” in ggplot stands for “grammar of graphics” and all plots share a common template. This is fundamentally different than plotting using a program like Excel, where you first pick your plot type, and then you add your data. With ggplot, you start with data, add a coordinate system, and then add “geoms,” which indicate what type of plot you want. A cool thing about ggplot is that you can add and layer different geoms together, to create a fully customized plot that is exactly what you want. If this sounds nebulous right now, that’s okay, we are going to talk more about this.

A group of fuzzy round monsters with binoculars, backpacks and guide books looking up a graphs flying around with wings (like birders, but with exploratory data visualizations). Stylized text reads “ggplot2: visual data exploration.”

Figure from Allison Horst

3 What can you do with ggplot?

Let’s start by looking at the different types of plots that can be made using ggplot2. We will do this by looking at the ggplot2 cheatsheet.

4 A plotting framework

A pictorial depiction of the different ggplot layers, starting with data, aesthetics, geometries, scales, facets, coordinates, labels, and themes

Figure from Andrew Heiss

You can think about a ggplot as being composed of layers. You start with your data, and continue to add layers until you get the plot that you want. This might sound a bit abstract so I am going to talk through this with an example.

First, let’s load some practice data. We are going to use a fun 🐧 data set from the package palmerpenguins. If you don’t already have this, you can download it with the code below:

install.packages("palmerpenguins")

Then we can load the data.

library(palmerpenguins)

The dataset itself is called penguins. Let’s look at it using the function glimpse().

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Let’s start by trying to make a simple scatterplot, where we see the relationship between bill_length_mm and bill_depth_mm.

4.1 Data

The first argument passed to your plot is the data. How did I know that? It’s in the documentation.

?ggplot()

The simplest ggplot code you can write, just using the ggplot() function and indicating the data we want to use. Because data is the default first argument, you can actually omit the data = part of this code and it will work just the same.

ggplot(data = penguins)

Why do we not see a plot? Well we haven’t told R what to plot! We are getting the first “base” layer of the plot.

You can also pipe |> or %>%, the data to the ggplot function. When reading code, you can interpret the pipe as “and then.” Here, take the penguins data, and then, run ggplot(). Writing code in this way is my preference so I tend to code like this. We talked in more detail about the pipe in past Code Clubs.

penguins |> 
  ggplot()

Still nothing. Well that’s what we would expect.

4.2 Aesthetic mappings aes()

Now that we’ve indicated our data, we can add aesthetics mapping so we can work towards actually see a plot. We want to make a scatterplot where on the x-axis we have bill length (bill_length_mm), and on the y-axis we have bill depth (bill_depth_mm).

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm))

So we have progressed from a blank plot, but we still do not have a plot by basically anyone’s defintion. Why not?

Even though we have indicated to R our data and aesthetic mappings, we have not indicated what precisely to do with our data. We have said what we want on x and y (and now we can see those labelled appearing) but we have not indicated what type of plot we want. And, we can do that in the next step, by adding a geom_.

4.3 Geoms geom_

Now let’s indicate what type of plot we want. In this example, we are going to make a scatterplot, and to do that we will use geom_point()

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

We have a plot! It’s not a finished plot, but its a plot and we can work from here.

Let’s say we wanted to see whether penguins of different species are in different places on our plot. We can take the variable species and map it to the aesthetic color.

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Note what R has done for us - we now see each dot colored based on which species it is, and we also have a new legend.

What if we wanted to add a line that shows the relationship between bill_length_mm and bill_depth_mm for each species? We can layer in another geom, here we will use geom_smooth.

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "lm") # the method is a linear model
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

4.3.1 Global vs. local aes()

A note about aesthetic mappings now that we have introduced geoms -aes() can go in two places:

  • in the ggplot() call, and this means they will inherit for every layer of the plot
  • in a specific geom_, and those aesthetics will only be for that specific geom.

So we can make the same plot we saw above by mapping aesthetics within geom_point().

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

penguins |> 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species))
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Let’s look at example where changing the location of the aesthetic mappings does make a difference.

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "lm") # the method is a linear model
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

penguins |> 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_smooth(method = "lm") # the method is a linear model
`geom_smooth()` using formula = 'y ~ x'
Error in `geom_smooth()`:
! Problem while computing stat.
ℹ Error occurred in the 2nd layer.
Caused by error in `compute_layer()`:
! `stat_smooth()` requires the following missing aesthetics: x and y.

What happened here? We got an “error in geom_smooth()...stat_smooth()` requires the following miss aesthetics: x and y”.

This happened because we have only set our x and y aesthetics in geom_point() and not in geom_smooth() so R doesn’t know what to map x and y to. When we map our aesthetics globally, we don’t have this problem because x and y inherit for every subsequent layer.

We can also do a combination of global and local setting.

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm") # the method is a linear model
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species)) +
  geom_smooth(aes(color = species), method = "lm") +
  geom_smooth(method = "lm", color = "black") # the method is a linear model
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

When we set color only in geom_point(), we do not “group” by color (here, by species) so we get our smoothed line for all the data (instead of by species).

4.3.2 Mapping vs. ‘setting’

If you want to map a variable to an aesthetic, it MUST be within the aes() statement. If you just want to change the color to “blue” for example, it should be outside the aes() statement. Look at the difference.

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = "#088F8F") 
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

If we put “blue” instead our aesthetic mappings, we get something that doesn’t make sense.

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = "blue"))
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

tl:dr if mapping a variable to an aesthetic, inside aes(), if not, then outside.

5 Practice

Create a plot that shows the relationship between flipper length and body mass. Color your points based on the sex of the penguins.

Need a hint?

Try using the variables flipper_length, body_mass_g, and sex. You can make x, y, and color.

Click for the solution
penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, color = sex)) + 
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Not happy with the missing value? We can remove it.

penguins |> 
  drop_na(flipper_length_mm, body_mass_g, sex) |> # drop missing values 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, color = sex)) + 
  geom_point()

5.1 Different geoms

Create a boxplot that shows the distribution of body mass for penguins on the different islands.

Need a hint?

The geom for a boxplot is called geom_boxplot().

Click for the solution
penguins |> 
  ggplot(aes(x = island, y = body_mass_g)) +
  geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

5.2 Mapping to other aesthetics

Create a scatterplot that shows the relationship between bill length and bill depth, but color the points based on what island the penguins are from, and make the points a different shape based on sex.

Need a hint?

You can make the aesthetic shape = in the same way you use color.

Click for the solution
penguins |> 
  ggplot(aes(x = bill_length_mm, bill_depth_mm, 
             color = island, shape = sex)) +
  geom_point()
Warning: Removed 11 rows containing missing values or values outside the scale range
(`geom_point()`).

Not happy with missing values? We can remove them.

penguins |> 
  drop_na(bill_length_mm, bill_depth_mm, island, sex) |>  # drop missing values
  ggplot(aes(x = bill_length_mm, bill_depth_mm, 
             color = island, shape = sex)) +
  geom_point()

Back to top