Iterating part III: purrr’s map() function

A deeper dive into iterating with purrr’s map() function.

functional programming
purrr
Author

Jelmer Poelstra

Published

April 1, 2025



1 Intro and getting started

In the last two weeks, you’ve learned about a couple of effective coding strategies for situations where you need to repeat an operation, for example across different subsets of your data. Instead of copy-pasting your code and making small edits for every copy, you can use for loops and functions like map().

Today, we will start with a recap of these approaches and then dive deeper into the map() function (and the very similar map_vec()).

The map() function is part of the purrr package, one of the core tidyverse packages that can be loaded with library(tidyverse). We will also load the familiar palmerpenguins dataset for a couple of examples.

library(palmerpenguins)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


2 Iteration recap

When we iterate, we repeat a procedure for each value/element in a certain collection.

Let’s say we have three sets of measurements (each stored in a vector):

vec1 <- c(3, 74, 18)
vec2 <- c(33, 14, 25)
vec3 <- c(10, 88, 47)

We want to compute the mean value for each of these. The simplest approach would be to just repeat the code to do so three times, changing only the identity of the vector that we operate on:

mean(vec1)
[1] 31.66667
mean(vec2)
[1] 24
mean(vec3)
[1] 48.33333

But we’d like to avoid such code repetition. First, we may run into situations where we have many more than three collections. Second, the code that we need to repeat may be much longer than just a call to mean(). All in all, the copy-and-paste routine gets tedious, is error-prone, and makes it more difficult to edit the repeated code.

In the previous two sessions, Jess has shown us two different ways of avoiding code repetition. The first is the for loop, which is a very widely used technique in programming, though it is not nearly as common in R as in many other languages. We can iterate over our vectors with a for loop as follows (note that I am putting them together in a list to do so):

for (vec in list(vec1, vec2, vec3)) {
  print(mean(vec))
}
[1] 31.66667
[1] 24
[1] 48.33333

A more compact and elegant way of iterating is using functional programming, where a function does the iteration — here, the map() function:

map(.x = list(vec1, vec2, vec3), .f = mean)
[[1]]
[1] 31.66667

[[2]]
[1] 24

[[3]]
[1] 48.33333

Some notes on the syntax of the map() function:

  • The first argument (.x) is the collection you want to iterate over, which can be a vector, list, or dataframe.
  • The second argument (.f) is the function that you want to apply to each element of the collection.
  • The name of that function is written without parentheses: mean and not mean()!

Under the hood, the function mean() will be run three times, each time with one of the vectors as its argument.
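To make the “under the hood” behavior concrete, here is a minimal sketch that compares map() to an explicit for loop that stores each result in a list. The names vecs, result_loop, and result_map are only illustrative and not part of the original material.

```r
library(purrr)

# Illustrative copies of the three measurement vectors
vecs <- list(c(3, 74, 18), c(33, 14, 25), c(10, 88, 47))

# Explicit loop: pre-allocate a list, then fill one slot per vector
result_loop <- vector("list", length(vecs))
for (i in seq_along(vecs)) {
  result_loop[[i]] <- mean(vecs[[i]])
}

# map() takes care of the same bookkeeping for us
result_map <- map(.x = vecs, .f = mean)

# Check whether both approaches give the same list of means
all.equal(result_loop, result_map)
```

The loop version needs a place to store results and an index to keep track of, which is the kind of overhead that map() hides.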


3 Beyond the basics of map()

Returning vectors

By default, map() will return a list. But in some cases, like here, we may prefer to get a vector instead. We can do this with a slight variant on map(), map_vec():

map_vec(.x = list(vec1, vec2, vec3), .f = mean)
[1] 31.66667 24.00000 48.33333
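As a small sketch of the difference in return types, the code below runs map() and map_vec() on the same input. It also shows map_dbl(), another purrr variant that returns a numeric (double) vector specifically. The object names are made up for illustration.

```r
library(purrr)

# Illustrative copies of the three measurement vectors
vecs <- list(c(3, 74, 18), c(33, 14, 25), c(10, 88, 47))

res_list <- map(.x = vecs, .f = mean)      # a list with three elements
res_vec  <- map_vec(.x = vecs, .f = mean)  # a vector; type follows the results
res_dbl  <- map_dbl(.x = vecs, .f = mean)  # a double vector

# Inspect what kind of object each call returned
class(res_list)
class(res_vec)
class(res_dbl)
```

Whether you prefer the general map_vec() or a type-specific variant such as map_dbl() is mostly a matter of how strictly you want the output type to be checked.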

How to handle additional arguments?

What if we need to pass additional arguments to the function that map() calls for us?

For example, let’s say we had an NA in our data, which means that by default, mean() will return NA:

# Change the second element of vec1 to NA
vec1[2] <- NA
vec1
[1]  3 NA 18
mean(vec1)
[1] NA
map_vec(.x = list(vec1, vec2, vec3), .f = mean)
[1]       NA 24.00000 48.33333

We can avoid this by using na.rm = TRUE in a stand-alone call to mean()

mean(vec1, na.rm = TRUE)
[1] 10.5

…but how can we do that with the map() function? The below doesn’t work:

map_vec(.x = list(vec1, vec2, vec3), .f = mean(na.rm = TRUE))
Error in mean.default(na.rm = TRUE): argument "x" is missing, with no default

Instead, we need to pass any additional arguments separately, basically as if they were arguments of map():

map_vec(.x = list(vec1, vec2, vec3), .f = mean, na.rm = TRUE)
[1] 10.50000 24.00000 48.33333
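One way to think about this is sketched below, with the vectors redefined so the example stands on its own: map_vec() forwards na.rm = TRUE to every call of mean(), so the result should match three stand-alone calls. The failed attempt, by contrast, makes R run mean(na.rm = TRUE) by itself, without any data, instead of handing the function over to map_vec(). The names via_map and by_hand are illustrative.

```r
library(purrr)

# The three vectors, with the NA already in vec1
vec1 <- c(3, NA, 18)
vec2 <- c(33, 14, 25)
vec3 <- c(10, 88, 47)

# Extra arguments after .f are passed along to each call of mean()
via_map <- map_vec(.x = list(vec1, vec2, vec3), .f = mean, na.rm = TRUE)

# The equivalent stand-alone calls
by_hand <- c(
  mean(vec1, na.rm = TRUE),
  mean(vec2, na.rm = TRUE),
  mean(vec3, na.rm = TRUE)
)

# Check whether both approaches agree
all.equal(via_map, by_hand)
```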

Defining an anonymous function within map()

It is possible to write the function call within map() using parentheses — but this would essentially entail defining a function on the fly, which you can do as follows:

map_vec(.x = list(vec1, vec2, vec3), .f = function(x) mean(x, na.rm = TRUE))
[1] 10.50000 24.00000 48.33333

In practice, the above syntax is most commonly used in slightly more complex situations. For example, you may want to count the number of NAs in the following way, where the is.na() function is nested within the sum() function:

sum(is.na(vec1))
[1] 1

This poses a challenge to the standard map() syntax, but can be easily achieved with the anonymous function syntax shown above:

map_vec(.x = list(vec1, vec2, vec3), .f = function(x) sum(is.na(x)))
[1] 1 0 0
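As a side note, R 4.1.0 and later also accept a shorter notation for anonymous functions, in which \(x) stands in for function(x). The sketch below, using illustrative object names, runs the NA count both ways so you can compare them.

```r
library(purrr)

# The three vectors, including the NA in the first one
vecs <- list(c(3, NA, 18), c(33, 14, 25), c(10, 88, 47))

# Full anonymous function notation
n_na_long <- map_vec(.x = vecs, .f = function(x) sum(is.na(x)))

# Shorthand notation (requires R >= 4.1.0)
n_na_short <- map_vec(.x = vecs, .f = \(x) sum(is.na(x)))

# Check whether both notations give the same counts
identical(n_na_long, n_na_short)
```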


Exercise 1

A) Once again compute the mean of the three sets of measurements with map(), but now also pass the argument trim = 0.1 to the mean() function. (If you’re interested, type ?mean for some information about what this argument does.)

Click here for the solution

Arguments are added as if they were arguments of map() / map_vec(), so you simply add trim = 0.1 after a comma and don’t use additional parentheses:

map_vec(.x = list(vec1, vec2, vec3), .f = mean, na.rm = TRUE, trim = 0.1)
[1] 10.50000 24.00000 48.33333

Which runs the following under the hood:

mean(vec1, na.rm = TRUE, trim = 0.1)
mean(vec2, na.rm = TRUE, trim = 0.1)
mean(vec3, na.rm = TRUE, trim = 0.1)

B) Now use map_vec() with the length() function to compute the length of each of our vectors.

Click here for the solution
map_vec(.x = list(vec1, vec2, vec3), .f = length)
[1] 3 3 3

Exercise 2

A) R does not have a built-in function to compute the standard error. Here is how we can do so — by taking the standard deviation using the sd() function and dividing it by the square root (sqrt()) of the number of observations (length()):

sd(vec1, na.rm = TRUE) / sqrt(length(vec1))
[1] 6.123724
  • Based on the above code, define your own function that computes the standard error. For a refresher on defining your own functions, see this section from last week’s material.

  • Then, use your custom function inside map_vec() to compute the standard error for our data.

Click here for the solution for the custom function
se <- function(x) {
  sd(x, na.rm = TRUE) / sqrt(length(x))
}
Click here for the full solution
se <- function(x) {
  sd(x, na.rm = TRUE) / sqrt(length(x))
}

map_vec(.x = list(vec1, vec2, vec3), .f = se)
[1]  6.123724  5.507571 22.526528

B) Now, restructure your code to compute the standard error with an anonymous function inside map_vec().

Click here for the solution
map_vec(
  .x = list(vec1, vec2, vec3),
  .f = function(x) sd(x, na.rm = TRUE) / sqrt(length(x))
)
[1]  6.123724  5.507571 22.526528
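One caveat worth noting: length(x) also counts NAs, so for vec1 the code above divides by the square root of 3 even though only two values contribute to sd(). If you would rather use the number of non-missing observations, a possible variant is sketched below. The name se_narm is hypothetical, and whether this is the behavior you want depends on your analysis.

```r
library(purrr)

# Hypothetical variant: drop NAs first, so that sd() and length()
# are based on the same set of observations
se_narm <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# The three vectors, including the NA in the first one
vecs <- list(c(3, NA, 18), c(33, 14, 25), c(10, 88, 47))

se_values <- map_vec(.x = vecs, .f = se_narm)
se_values
```

Only the first value should differ from the earlier output, since vec1 is the only vector that contains an NA.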


4 Automatic/implicit iteration in R

The example we used above with the three sets of measurements is rather contrived. This speaks to the fact that in R, we don’t have to explicitly iterate with for loops or map() except in more complex situations.

One reason is that iteration is in many cases “automagic” in R due to vectorization, so we don’t have to iterate over the values in vectors:

measurements_inch <- c(3, 74, 18)

# Multiply each value in the vector by 2.54:
2.54 * measurements_inch
[1]   7.62 187.96  45.72
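For comparison, the sketch below does the same conversion with explicit iteration via map_vec(). It should give the same numbers, which illustrates why explicit iteration is unnecessary here. The names cm_vectorized and cm_mapped are made up for this example.

```r
library(purrr)

measurements_inch <- c(3, 74, 18)

# Vectorized: the multiplication is applied to every element at once
cm_vectorized <- 2.54 * measurements_inch

# Explicit iteration: the anonymous function is called once per element
cm_mapped <- map_vec(.x = measurements_inch, .f = function(x) 2.54 * x)

# Check whether both approaches agree
all.equal(cm_vectorized, cm_mapped)
```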

Also, we usually have (or can put) our data in data frames, where vectorization across rows applies as well:

data.frame(measurements_inch) |>
  # Multiply each value in the column by 2.54:
  mutate(measurements_cm = 2.54 * measurements_inch)
  measurements_inch measurements_cm
1                 3            7.62
2                74          187.96
3                18           45.72

Speaking of implicit iteration in data frames, dplyr has functionality to repeat operations across subsets of the data without having to explicitly iterate over these subsets. For example, once the data from the above three-vector example is in a data frame…

all <- data.frame(
  value = c(vec1, vec2, vec3),
  group = rep(c("vec1", "vec2",  "vec3"), each = 3) 
)

all
  value group
1     3  vec1
2    NA  vec1
3    18  vec1
4    33  vec2
5    14  vec2
6    25  vec2
7    10  vec3
8    88  vec3
9    47  vec3

…we can use summarize() with .by to repeat computations across our sets:

all |> summarize(mean = mean(value, na.rm = TRUE), .by = group)
  group     mean
1  vec1 10.50000
2  vec2 24.00000
3  vec3 48.33333
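For comparison, here is a sketch of how the same group-wise means could be computed with explicit iteration: base R’s split() turns the value column into a list with one vector per group, which map_vec() can then iterate over. The data frame is recreated here under the illustrative name measurements.

```r
library(purrr)

# Same data as above, recreated so this example is self-contained
measurements <- data.frame(
  value = c(3, NA, 18, 33, 14, 25, 10, 88, 47),
  group = rep(c("vec1", "vec2", "vec3"), each = 3)
)

# split() returns a named list with one numeric vector per group
values_by_group <- split(measurements$value, measurements$group)

# Iterate over the groups; the result is a named vector of means
group_means <- map_vec(.x = values_by_group, .f = mean, na.rm = TRUE)
group_means
```

This works, but it takes an extra step and returns a vector rather than a data frame, so summarize() with .by is usually the more convenient choice for this kind of task.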

However, operating across multiple columns of a dataframe is a bit more challenging, and in the final section below, we’ll see how to do this with map().


5 Using map() to iterate across columns of a data frame

A data frame is really a special case of a list, one in which each vector is of the same length and constitutes a column. Therefore, iterating over a dataframe with a function like map() means that you’ll repeat the operation for each column.

For example, it’s easy to check what type of data each column contains by using map_vec() with the class() function:

map_vec(.x = penguins, .f = class)
          species            island    bill_length_mm     bill_depth_mm 
         "factor"          "factor"         "numeric"         "numeric" 
flipper_length_mm       body_mass_g               sex              year 
        "integer"         "integer"          "factor"         "integer" 

Similarly, the n_distinct() function computes the number of distinct (unique) values, and we can run that on each column like so:

map_vec(.x = penguins, .f = n_distinct)
          species            island    bill_length_mm     bill_depth_mm 
                3                 3               165                81 
flipper_length_mm       body_mass_g               sex              year 
               56                95                 3                 3 
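The statement that a data frame is a list of columns can be checked directly. The sketch below uses a small made-up data frame (toy_df and its columns are hypothetical) rather than penguins, so it runs without any extra data packages.

```r
library(purrr)

# A small, made-up data frame with three columns of different types
toy_df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  score = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# A data frame is a list, and its length is the number of columns
is.list(toy_df)
length(toy_df) == ncol(toy_df)

# So iterating over it visits one column at a time
col_classes <- map_vec(.x = toy_df, .f = class)
col_classes
```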

You can also operate on multiple columns using dplyr’s across() function, which should be used inside another dplyr function, most commonly summarise() or mutate().

In the simplest cases, like our map() examples above, across() is more verbose than map():

penguins |> summarise(across(.cols = everything(), .fns = n_distinct))
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    <int>  <int>          <int>         <int>             <int>       <int>
1       3      3            165            81                56          95
# ℹ 2 more variables: sex <int>, year <int>

Notes on the across() syntax as shown above:

  • Its main arguments are:
    • .cols – corresponding to .x in map() (the things to operate on)
    • .fns – corresponding to .f in map() (the function to repeat)
  • You should always make an explicit column selection, so for the simplest case of operating across all columns, it’s best to use everything() (see footnote 1).

On the other hand, it is much easier to perform group-wise computations with summarise(across()) than with map():

penguins |>
  summarise(across(.cols = everything(), .fns = n_distinct), .by = species)
# A tibble: 3 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>      <int>          <int>         <int>             <int>       <int>
1 Adelie         3             79            50                33          56
2 Gentoo         1             76            40                26          48
3 Chinstrap      1             55            33                25          34
# ℹ 2 more variables: sex <int>, year <int>
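To give a sense of why the map() route is more work, here is one possible sketch of a group-wise computation with explicit iteration. It uses a small made-up data frame (toy and its columns are hypothetical): the data is first split into one data frame per group, and map_vec() is then nested inside map().

```r
library(dplyr)
library(purrr)

# A small, made-up data frame with a grouping column
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  x = c(1, 1, 2, 3, 3),
  y = c("u", "v", "w", "w", "w"),
  stringsAsFactors = FALSE
)

# dplyr approach: one call, result is a data frame with one row per group
by_across <- toy |>
  summarise(across(.cols = everything(), .fns = n_distinct), .by = group)

# map() approach: split into per-group data frames, then iterate over
# the groups (outer map) and over the columns within each group (inner map_vec)
by_map <- split(toy, toy$group) |>
  map(function(df) map_vec(.x = df, .f = n_distinct))

by_across
by_map
```

Note that the map() version returns a list of named vectors and also counts the grouping column itself, so further steps would be needed to get to the tidy data frame that summarise() produces directly.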

Finally, if you need to use additional arguments for the function that across() calls, you should use the anonymous function notation that was explained above under “Defining an anonymous function within map()”:

penguins |>
  summarise(across(
    .cols = where(is.numeric),
    .fns = function(x) mean(x, na.rm = TRUE)
    ))
# A tibble: 1 × 5
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
           <dbl>         <dbl>             <dbl>       <dbl> <dbl>
1           43.9          17.2              201.       4202. 2008.

Exercise 3

A) Use map_vec() to compute the mean value for each column in the penguins dataframe. Why are you getting warning messages and NAs?

Click here for the solution
map_vec(.x = penguins, .f = mean, na.rm = TRUE)
Warning in mean.default(.x[[i]], ...): argument is not numeric or logical:
returning NA
Warning in mean.default(.x[[i]], ...): argument is not numeric or logical:
returning NA
Warning in mean.default(.x[[i]], ...): argument is not numeric or logical:
returning NA
          species            island    bill_length_mm     bill_depth_mm 
               NA                NA          43.92193          17.15117 
flipper_length_mm       body_mass_g               sex              year 
        200.91520        4201.75439                NA        2008.02907 

The warning messages and NAs (despite using na.rm = TRUE) appear because some of the columns, like species and island, don’t contain numbers at all.

B) Can you modify the penguins dataframe before passing it to map_vec() so it only contains columns with numbers?

Click here for a hint on the general workflow

You could save the modified dataframe and then use map_vec(), but we’d prefer to use pipes as follows:

penguins |>
  select(bill_length_mm) |>
  map_vec(.f = mean, na.rm = TRUE)
bill_length_mm 
      43.92193 

(The above only selects one column, though. You’ll still have to work on the select() function call!)

Click here for a hint on column selection

The naive way to select all numeric columns would be to first figure out which are numeric, and then simply list all of those inside select().

However, there is a handy helper function to select columns by type, where() (see this Help page). Can you figure out how to use it to select numeric columns?

Click here for the solution
penguins |>
  select(where(is.numeric)) |>
  map_vec(.f = mean, na.rm = TRUE)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
         43.92193          17.15117         200.91520        4201.75439 
             year 
       2008.02907 

C) Count the number of NAs in each column of the penguins dataframe within map_vec().

Click here for the solution
map_vec(.x = penguins, .f = function(x) sum(is.na(x)))
          species            island    bill_length_mm     bill_depth_mm 
                0                 0                 2                 2 
flipper_length_mm       body_mass_g               sex              year 
                2                 2                11                 0 
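As an aside, this particular task can also be done without explicit iteration: base R’s is.na() works on a whole data frame, and colSums() then adds up the TRUE values per column. The sketch below compares the two approaches on a small made-up data frame (toy_na and its columns are hypothetical).

```r
library(purrr)

# A small, made-up data frame with some missing values
toy_na <- data.frame(
  height = c(1, NA, 3),
  label = c("x", NA, NA),
  flag = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Explicit iteration over the columns
na_map <- map_vec(.x = toy_na, .f = function(x) sum(is.na(x)))

# Base R alternative: is.na() returns a logical matrix, colSums() counts TRUEs
na_base <- colSums(is.na(toy_na))

na_map
na_base
```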

Footnotes

  1. The across() function does (still) work without .cols, and will then select all columns, but this behavior is “deprecated” (outdated) and should not be used.