A deeper dive into iterating with purrr’s map() function.
functional programming
purrr
Author
Jelmer Poelstra
Published
April 1, 2025
1 Intro and getting started
In the last two weeks, you’ve learned about a couple of effective coding strategies for situations where you need to repeat an operation, for example across different subsets of your data. Instead of copy-pasting your code and making small edits for every copy, you can use for loops and functions like map().
Today, we will start with a recap of these approaches and then dive deeper into the map() function (and the very similar map_vec()).
The map() function is part of the purrr package, one of the core tidyverse packages that can be loaded with library(tidyverse). We will also load the familiar palmerpenguins dataset for a couple of examples.
library(palmerpenguins)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
2 Iteration recap
When we iterate, we repeat a procedure for each value/element in a certain collection.
Let’s say we have three sets of measurements (each stored in a vector):
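The vector definitions are not shown at this point. The values below are an assumption, reconstructed from the means and the data frame printed elsewhere on this page, so treat them as a sketch rather than the original code:

```r
# Assumed values, reconstructed from the outputs shown on this page
vec1 <- c(3, 74, 18)
vec2 <- c(33, 14, 25)
vec3 <- c(10, 88, 47)
```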
We want to compute the mean value for each of these. The simplest approach would be to just repeat the code to do so three times, changing only the identity of the vector that we operate on:
mean(vec1)
[1] 31.66667
mean(vec2)
[1] 24
mean(vec3)
[1] 48.33333
But we’d like to be able to avoid such code repetition. First, we may run into situations where we have many more than three collections. Second, the code that we need to repeat may be much longer than just a call to mean(). All in all, the copy-and-paste routine can get very tedious, is error-prone, and would also make it more difficult to edit the repeated code.
In the previous two sessions, Jess has shown us two different ways of avoiding code repetition. The first is the for loop, which is a very widely used technique in programming, though it is not nearly as common in R as in many other languages. We can iterate over our vectors with a for loop as follows (note that I am putting them together in a list to do so):
for (vec in list(vec1, vec2, vec3)) {
  print(mean(vec))
}
[1] 31.66667
[1] 24
[1] 48.33333
A more compact and elegant way of iterating is using functional programming, where a function does the iteration — here, the map() function:
The first argument (.x) is the collection you want to iterate over, which can be a vector, list, or dataframe.
The second argument (.f) is the function that you want to apply to each element of the collection.
The name of that function is written without parentheses: mean and not mean()!
Under the hood, the function mean() will be run three times, each time with one of the vectors as its argument.
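The call that these points describe could look like the sketch below. The vector values are assumed from outputs elsewhere on this page, and the result is stored in a hypothetical variable named `means` purely for illustration:

```r
library(purrr)

# Assumed values, reconstructed from the outputs shown on this page
vec1 <- c(3, 74, 18)
vec2 <- c(33, 14, 25)
vec3 <- c(10, 88, 47)

# .x is the collection to iterate over, .f the function (no parentheses)
means <- map(.x = list(vec1, vec2, vec3), .f = mean)
means
```

The result is a list with one element per input vector.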
3 Beyond the basics of map()
Returning vectors
By default, map() will return a list. But in some cases, like here, we may prefer to get a vector instead. We can do this with a slight variant on map(), map_vec():
map_vec(.x = list(vec1, vec2, vec3), .f = mean)
[1] 31.66667 24.00000 48.33333
How to handle additional arguments?
What if we need to pass additional arguments to the function that map() calls for us?
For example, let’s say we had an NA in our data, which means that by default, mean() will return NA:
# Change the second element of vec1 to NA
vec1[2] <- NA
vec1
[1] 3 NA 18
mean(vec1)
[1] NA
map_vec(.x = list(vec1, vec2, vec3), .f = mean)
[1] NA 24.00000 48.33333
We can avoid this by using na.rm = TRUE in a stand-alone call to mean()…
mean(vec1, na.rm = TRUE)
[1] 10.5
…but how can we do that with the map() function? The below doesn’t work:
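The code for this step is not shown here, so the following is only a sketch of the likely failing attempt and of one approach that does work. It assumes the vector values seen elsewhere on this page. The `tryCatch()` wrapper and the names `vecs`, `failed`, and `means` are illustrative additions so that the block runs from top to bottom:

```r
library(purrr)

# Assumed values; vec1 already has its second element set to NA
vec1 <- c(3, NA, 18)
vec2 <- c(33, 14, 25)
vec3 <- c(10, 88, 47)
vecs <- list(vec1, vec2, vec3)

# Writing mean(...) with parentheses calls mean() right away, without data,
# so this raises an error (caught here so the rest of the block still runs)
failed <- tryCatch(
  map_vec(.x = vecs, .f = mean(na.rm = TRUE)),
  error = function(e) "error"
)
failed

# Extra arguments can instead be added after .f, separated by commas;
# map_vec() passes them on to mean()
means <- map_vec(.x = vecs, .f = mean, na.rm = TRUE)
means
```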
It is possible to write the function call within map() using parentheses — but this would essentially entail defining a function on the fly, which you can do as follows:
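A minimal sketch of such an on-the-fly (anonymous) function, assuming the vector values used elsewhere on this page. The argument name `x` is arbitrary:

```r
library(purrr)

# Assumed values; vec1 already has its second element set to NA
vec1 <- c(3, NA, 18)
vec2 <- c(33, 14, 25)
vec3 <- c(10, 88, 47)

# function(x) defines a nameless function; x stands for the current element
means <- map_vec(
  .x = list(vec1, vec2, vec3),
  .f = function(x) mean(x, na.rm = TRUE)
)
means
```

In R 4.1 and later, `\(x)` can be used as shorthand for `function(x)`.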
In practice, the above syntax is most commonly used in slightly more complex situations. For example, you may want to count the number of NAs in the following way, where the is.na() function is nested within the sum() function:
sum(is.na(vec1))
[1] 1
This poses a challenge to the standard map() syntax, but can be easily achieved with the anonymous function syntax shown above:
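As a sketch of how the nested call could be wrapped in an anonymous function, again assuming the vector values used elsewhere on this page (the name `n_missing` is only for illustration):

```r
library(purrr)

# Assumed values; vec1 already has its second element set to NA
vec1 <- c(3, NA, 18)
vec2 <- c(33, 14, 25)
vec3 <- c(10, 88, 47)

# For each vector, count how many of its values are NA
n_missing <- map_vec(
  .x = list(vec1, vec2, vec3),
  .f = function(x) sum(is.na(x))
)
n_missing
```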
Exercise 1
A) Once again compute the mean of the three sets of measurements with map(), but now also pass the argument trim = 0.1 to the mean() function. (If you’re interested, type ?mean for some information about what this argument does.)
Click here for the solution
Arguments are added as if they were arguments of map() / map_vec(), so you simply add trim = 0.1 after a comma and don’t use additional parentheses:
mean(vec1, na.rm = TRUE, trim = 0.1)
mean(vec2, na.rm = TRUE, trim = 0.1)
mean(vec3, na.rm = TRUE, trim = 0.1)
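Written with map_vec(), the approach described in this solution might look like the sketch below, which should give the same results as the three stand-alone mean() calls. The vector values are assumed from outputs elsewhere on this page:

```r
library(purrr)

# Assumed values; vec1 already has its second element set to NA
vec1 <- c(3, NA, 18)
vec2 <- c(33, 14, 25)
vec3 <- c(10, 88, 47)

# na.rm and trim are simply listed after .f and passed on to mean()
trimmed <- map_vec(
  .x = list(vec1, vec2, vec3),
  .f = mean,
  na.rm = TRUE,
  trim = 0.1
)
trimmed
```

With only three values per vector, trimming 10% from each end removes nothing, so the results match the untrimmed means.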
B) Now use map_vec() with the length() function to compute the length of each of our vectors.
Click here for the solution
map_vec(.x = list(vec1, vec2, vec3), .f = length)
[1] 3 3 3
Exercise 2
A) R does not have a built-in function to compute the standard error. Here is how we can do so — by taking the standard deviation using the sd() function and dividing it by the square root (sqrt()) of the number of observations (length()):
sd(vec1, na.rm = TRUE) / sqrt(length(vec1))
[1] 6.123724
Based on the above code, define your own function that computes the standard error. For a refresher on defining your own functions, see this section from last week’s material.
Then, use your custom function inside map_vec() to compute the standard error for our data.
Click here for the solution for the custom function
se <- function(x) {
  sd(x, na.rm = TRUE) / sqrt(length(x))
}
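The second half of the exercise, using the custom function inside map_vec(), could then be sketched as follows. The vector values are assumed from outputs elsewhere on this page, and `ses` is a hypothetical variable name:

```r
library(purrr)

# Assumed values; vec1 already has its second element set to NA
vec1 <- c(3, NA, 18)
vec2 <- c(33, 14, 25)
vec3 <- c(10, 88, 47)

# Custom standard error function, as defined in the solution above
se <- function(x) {
  sd(x, na.rm = TRUE) / sqrt(length(x))
}

# A custom function is passed to .f by name, just like a built-in one
ses <- map_vec(.x = list(vec1, vec2, vec3), .f = se)
ses
```

Note that length(x) also counts NA values, so for vec1 this divides by the square root of 3, matching the stand-alone computation above.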
The example we used above with the three sets of vectors is rather contrived. This speaks to the fact that in R, we don’t have to explicitly iterate with for loops or map() except in more complex situations.
This is because, first, iteration is in many cases “automagic” in R due to vectorization, so we don’t have to iterate over the values in vectors:
measurements_inch <- c(3, 74, 18)
# Multiply each value in the vector by 2.54:
2.54 * measurements_inch
[1] 7.62 187.96 45.72
Also, we usually have (or can put) our data in data frames, where vectorization across rows applies as well:
data.frame(measurements_inch) |>
  # Multiply each value in the vector by 2.54:
  mutate(measurements_cm = 2.54 * measurements_inch)
Iterate across groups with summarize() and .by
Speaking of implicit iteration in data frames, dplyr has functionality to repeat operations across subsets of the data without having to explicitly iterate over these subsets. For example, once the data from the above three-vector example is in a data frame…
all <- data.frame(
  value = c(vec1, vec2, vec3),
  group = rep(c("vec1", "vec2", "vec3"), each = 3)
)
all
value group
1 3 vec1
2 NA vec1
3 18 vec1
4 33 vec2
5 14 vec2
6 25 vec2
7 10 vec3
8 88 vec3
9 47 vec3
…we can use summarize() with .by to repeat computations across our sets:
all |>
  summarize(mean = mean(value, na.rm = TRUE), .by = group)
group mean
1 vec1 10.50000
2 vec2 24.00000
3 vec3 48.33333
However, operating across multiple columns of a dataframe is a bit more challenging, and in the final section below, we’ll see how to do this with map().
5 Using map() to iterate across columns of a data frame
A data frame is really a special case of a list, one in which each vector is of the same length and constitutes a column. Therefore, iterating over a dataframe with a function like map() means that you’ll repeat the operation for each column.
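A quick way to see this for yourself is sketched below. The variable names are only for illustration:

```r
library(palmerpenguins)

# A data frame passes the is.list() check...
is_a_list <- is.list(penguins)
is_a_list

# ...and its length as a list equals its number of columns
n_elements <- length(penguins)
n_columns <- ncol(penguins)
c(n_elements, n_columns)
```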
For example, it’s easy to check what type of data each column contains by using map_vec() with the class() function:
map_vec(.x = penguins, .f = class)
species island bill_length_mm bill_depth_mm
"factor" "factor" "numeric" "numeric"
flipper_length_mm body_mass_g sex year
"integer" "integer" "factor" "integer"
Similarly, the n_distinct() function computes the number of distinct (unique) values, and we can run that on each column like so:
map_vec(.x = penguins, .f = n_distinct)
species island bill_length_mm bill_depth_mm
3 3 165 81
flipper_length_mm body_mass_g sex year
56 95 3 3
An alternative approach: the across() function
You can also operate on multiple columns using dplyr’s across() function, which should be used inside another dplyr function, most commonly summarise() or mutate().
For usage in the simplest cases, like for our map() examples above, using across() is more verbose than map():
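The code for this box is not shown here, so the following is a sketch rather than the original. The first call mirrors the n_distinct() example from the main text. The second adds `.by = species`, which is an assumption based on the shape of the per-species tibble printed next:

```r
library(palmerpenguins)
library(dplyr)

# Number of distinct values in each column, across the whole data frame
distinct_all <- penguins |>
  summarize(across(.cols = everything(), .fns = n_distinct))
distinct_all

# The same computation, repeated separately for each species
distinct_by_species <- penguins |>
  summarize(across(.cols = everything(), .fns = n_distinct), .by = species)
distinct_by_species
```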
# A tibble: 3 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <int> <int> <int> <int> <int>
1 Adelie 3 79 50 33 56
2 Gentoo 1 76 40 26 48
3 Chinstrap 1 55 33 25 34
# ℹ 2 more variables: sex <int>, year <int>
Finally, if you need to use additional arguments for the function that across() calls, you should use the anonymous function notation that was explained in the first box on this page:
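One possible illustration, assuming we want n_distinct() to ignore missing values via its na.rm argument (the original example may have used a different function):

```r
library(palmerpenguins)
library(dplyr)

# The anonymous function lets us pass na.rm = TRUE to n_distinct()
distinct_no_na <- penguins |>
  summarize(
    across(
      .cols = everything(),
      .fns = function(x) n_distinct(x, na.rm = TRUE)
    )
  )
distinct_no_na
```

For columns that contain NA, such as sex, the count is now one lower than in the map_vec() output shown earlier.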
Exercise 3
A) Use map_vec() to compute the mean value for each column in the penguins dataframe. Why are you getting warning messages and NAs?
Click here for the solution
map_vec(.x = penguins, .f = mean, na.rm = TRUE)
Warning in mean.default(.x[[i]], ...): argument is not numeric or logical:
returning NA
Warning in mean.default(.x[[i]], ...): argument is not numeric or logical:
returning NA
Warning in mean.default(.x[[i]], ...): argument is not numeric or logical:
returning NA
species island bill_length_mm bill_depth_mm
NA NA 43.92193 17.15117
flipper_length_mm body_mass_g sex year
200.91520 4201.75439 NA 2008.02907
The warning messages and NAs (despite using na.rm = TRUE) appear because some of the columns, like species and island, don’t contain numbers at all.
B) Can you modify the penguins dataframe before passing it to map_vec() so it only contains columns with numbers?
Click here for a hint on the general workflow
You could save the modified dataframe and then use map_vec(), but we’d prefer to use pipes as follows:
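A sketch of that piped workflow is shown below. The choice of bill_length_mm is arbitrary and only serves as a starting point:

```r
library(palmerpenguins)
library(dplyr)
library(purrr)

# Select columns first, then pipe the result into map_vec() as its .x
one_mean <- penguins |>
  select(bill_length_mm) |>
  map_vec(.f = mean, na.rm = TRUE)
one_mean
```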
(The above only selects one column though, you’ll still have to work on the select() function call!)
Click here for a hint on column selection
The naive way to select all numeric columns would be to first figure out which are numeric, and then simply list all of those inside select().
However, there is a handy helper function to select columns by type: where(): see this Help page. Can you figure out how to use it to select numeric columns?
The across() function does (still) work without the .cols argument, and will then select all columns, but this behavior is “deprecated” (outdated) and should not be used.