Statistics, Science, Random Ramblings

A blog mostly about data and R

Reduce, Filter, Find and more: R's unknown heroes?

Posted at — Oct 15, 2020

The base package contains a set of functions summarised under one common help page as Common Higher-Order Functions in Functional Programming Languages. These functions are Reduce, Filter, Find, Map, Negate and Position. They are really useful, especially when working with lists, yet I only discovered them recently, and they appear to be relatively unknown.

In this post I will walk you through them and hopefully illustrate that they are useful and deserve more attention.

Before we begin we will set a seed for reproducibility.

set.seed(0809)

Reduce

Many programming languages have reducers, which collapse some input into a single value. For example, it is easy to sum a vector using a reducer.

data <- 1:100
Reduce(function(x, y) x + y, data)
## [1] 5050
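Reduce has a couple of extras worth knowing about: an init argument that supplies a starting value, and accumulate = TRUE, which returns all intermediate results instead of only the final one. A quick sketch:

```r
# accumulate = TRUE keeps every intermediate result,
# which here amounts to a cumulative sum
Reduce(`+`, 1:5, accumulate = TRUE)
## [1]  1  3  6 10 15

# init supplies a starting value for the reduction
Reduce(`+`, 1:5, init = 100)
## [1] 115
```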

Functions in base R are unfortunately often inconsistent in their argument order: while the apply family takes the data first and then the function, it is the other way around for the functions covered in this post.

Summing a vector is not exactly where a reducer shines given there is sum, but you are of course not limited to such simple actions.

dflist <- list()
dflist$var1 <- data.frame(
    index = sample(100, 70, replace = FALSE),
    data = rnorm(70))
dflist$var2 <- data.frame(
    index = sample(100, 70, replace = FALSE),
    data = rnorm(70))
dflist$var3 <- data.frame(
    index = sample(100, 70, replace = FALSE),
    data = rnorm(70))
head(dflist$var1)
##   index       data
## 1    96 -0.2172096
## 2    43  0.2087588
## 3    33  0.2258928
## 4     3 -1.2601701
## 5    61 -0.2457201
## 6    94  0.3557803

Let’s assume we want to join all these data frames into one. This is easy using Reduce.

joined <- Reduce(function(x, y) dplyr::full_join(x, y, by = "index"), dflist)
head(joined)
##   index     data.x     data.y       data
## 1    96 -0.2172096         NA  2.1861660
## 2    43  0.2087588  1.9315462  1.5866881
## 3    33  0.2258928         NA  0.1491109
## 4     3 -1.2601701  1.3754394         NA
## 5    61 -0.2457201 -0.1122428 -0.4668343
## 6    94  0.3557803  0.2250966 -0.6917175

Nice. The column names are ugly, but fortunately the columns are added in the same order as they appear in our input list.

colnames(joined)[-1] <- names(dflist)
head(joined)
##   index       var1       var2       var3
## 1    96 -0.2172096         NA  2.1861660
## 2    43  0.2087588  1.9315462  1.5866881
## 3    33  0.2258928         NA  0.1491109
## 4     3 -1.2601701  1.3754394         NA
## 5    61 -0.2457201 -0.1122428 -0.4668343
## 6    94  0.3557803  0.2250966 -0.6917175

In this case we have reduced a list of data frames to a single one.
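Reducing is not limited to joins either; for instance, finding the common elements of an arbitrary number of vectors is a one-liner:

```r
# The intersection of any number of vectors in a single call
Reduce(intersect, list(1:10, 5:15, 8:12))
## [1]  8  9 10
```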

Filter

The Filter function is almost self-explanatory. Let’s create a list of vectors of random data.

vlist <- lapply(1:500, function(x) rnorm(sample(100, 1)))
head(vlist, n = 3)
## [[1]]
##  [1]  0.14654264 -1.59871229  1.31606365 -0.94209953 -0.11370812  1.62622760
##  [7] -0.14067118  0.64724496  0.15097800 -2.20464045  1.69921332 -0.41241822
## [13] -0.81433539  0.46966305  0.83877240 -0.04004403  0.79305087 -1.13656109
## [19]  0.53795521  0.38059841  1.41302113  0.34022742 -0.38370447  0.26579416
## [25]  0.22156801 -0.59288470  1.81715082  0.82614125 -0.92116337  0.86557695
## [31] -0.42587301  0.35998214  2.03493624 -1.03060378  0.05972806 -0.22740067
## [37]  0.93614992  0.39537180 -0.26582876 -0.95188922 -0.42890996  0.09118043
## [43]  2.54315077  1.46714751 -0.81925437 -0.62716330 -0.26453089  0.60175881
## 
## [[2]]
##  [1] -1.84737446  0.31359091  0.32133103  0.95312967 -0.48817293 -0.82938857
##  [7]  0.85463259  0.55218706 -0.25984122  0.19306059  1.44391945  0.44339821
## [13]  1.22126565  1.21504783  0.64697871  1.06278504  1.12492301 -0.63206884
## [19]  0.59991901  1.28890590 -0.01225846 -0.19838356  1.70789276  0.23468874
## [25] -0.40031207  0.32190156  0.73948137 -1.64207348 -0.02101392  0.61324601
## [31] -0.04908262 -1.56145261 -2.10742397 -0.24688728 -2.05114010  2.24237056
## [37]  1.39220709  1.33418784  0.51253936 -0.09434968  0.39085905  0.45668357
## [43] -0.74045275  0.75733294 -0.50151198  0.57714794 -1.67395793  1.25935204
## [49] -0.76024239 -0.20739095 -1.37388056 -0.60620558 -0.59249755 -1.05960860
## [55] -0.65359800 -0.46495141  0.24861045 -0.89241740 -1.60474512  0.52351051
## [61] -0.59720753  0.40752107 -1.11638928  2.60699871  0.95955923  1.81058090
## [67] -1.26713879  0.85522636  0.92669215
## 
## [[3]]
##  [1] -2.66540380  1.75887192  0.73571882 -0.32644552  1.42074684  0.63448049
##  [7]  0.56128750 -0.85229628  1.55715905 -1.14988165 -1.68846580  0.88899404
## [13] -0.37591209 -0.96017974 -1.80370895  0.31478438 -0.31269648 -0.60966367
## [19] -0.48034356 -1.16691746 -0.19119550 -0.04398318 -1.27105605  0.18778268
## [25] -0.19126845 -1.16007202  1.20815234  0.09259228  0.17550677  1.27724235
## [31] -0.87229310  1.42735133 -0.29932805 -1.43058204  0.91248320  0.42877782
## [37]  0.02762169

Assume we want to filter the list to those items where the sum is at least 10. Using Filter this is straightforward.

filtered <- Filter(function(x) sum(x) >= 10, vlist)
length(filtered)
## [1] 45

The length shows that we did indeed filter out quite a bit of data. Let’s have a closer look:

sapply(filtered, sum)
##  [1] 17.10347 11.31184 13.24680 20.35947 14.20383 10.32478 11.01560 11.53005
##  [9] 12.22508 10.16311 10.32248 19.51816 13.57972 15.78467 12.91138 16.29072
## [17] 10.22657 14.98985 16.41516 14.70465 10.01238 10.93764 10.01407 10.01546
## [25] 15.92181 18.54769 15.36236 25.01672 11.08992 12.00762 11.72382 15.63941
## [33] 11.20629 10.15452 10.50183 14.21937 15.52676 18.35091 10.09651 10.03426
## [41] 12.30421 12.47863 11.30195 18.98647 14.38877

This looks about right.
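Since a data frame is technically a list of columns, Filter can also be used to select columns by a predicate. A small sketch with made-up column names:

```r
# Keep only the numeric columns of a data frame
df <- data.frame(a = 1:3, b = letters[1:3], c = c(1.5, 2.5, 3.5))
Filter(is.numeric, df)
##   a   c
## 1 1 1.5
## 2 2 2.5
## 3 3 3.5
```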

Find and Position

Both Find and Position do very similar things, so I’ll cover them together.

Assume we have a lot of values.

vals <- sample(1e7, 1e5)
head(vals)
## [1] 4639329 3753050 9561538 7975142  453770 5941340

We want to retrieve the first value larger than 9,750,000:

Find(function(x) x > 9750000, vals)
## [1] 9873421

While this is useful, finding the index of that value is often even more useful.

Position(function(x) x > 9750000, vals)
## [1] 47

Which is easy to verify.

vals[47]
## [1] 9873421

Hopefully this saves you from awkward constructions using which in the future.
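It is also worth knowing what happens when nothing matches: Find returns NULL and Position returns NA. Both additionally accept a right argument to search from the end of the input:

```r
# No element satisfies the predicate
Find(function(x) x > 10, 1:5)
## NULL
Position(function(x) x > 10, 1:5)
## [1] NA

# right = TRUE searches from the last element backwards
Position(function(x) x %% 2 == 0, 1:10, right = TRUE)
## [1] 10
```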

Negate

Negate takes a function returning logical values and negates its output. This is especially useful when creating new functions.

`%notin%` <- Negate(`%in%`)

Yes, this does exactly what it says it does.

1 %notin% c(2:10)
## [1] TRUE
1 %notin% c(1:10)
## [1] FALSE

This might also be useful in combination with grepl, which lacks an invert argument.
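A sketch of that idea; the helper name not_matching is just something I made up for illustration:

```r
# A negated grepl, handy for keeping non-matching elements
not_matching <- Negate(grepl)
fruits <- c("apple", "banana", "cherry")
fruits[not_matching("an", fruits)]
## [1] "apple"  "cherry"
```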

Map

Finally, we have Map, which is just a wrapper around mapply and thus probably the least essential of the functions covered here. The main difference between mapply and Map is that the latter does not try to simplify its result.

val1 <- list(1:5, seq(10, 50, 10), seq(100, 500, 100))
sq <- seq(10, 100, 20)
val2 <- list(sq, sq, sq)

With Map you specify a function to apply to the inputs and then the inputs.

Map(function(x, y) x + y, val1, val2)
## [[1]]
## [1] 11 32 53 74 95
## 
## [[2]]
## [1]  20  50  80 110 140
## 
## [[3]]
## [1] 110 230 350 470 590

In this case, for each element in val1 we added the element in the corresponding position from val2.

If we do the same with mapply we get a simplified output.

mapply(function(x, y) x + y, val1, val2)
##      [,1] [,2] [,3]
## [1,]   11   20  110
## [2,]   32   50  230
## [3,]   53   80  350
## [4,]   74  110  470
## [5,]   95  140  590

These are the same results as above, just in a different format.
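In fact, passing SIMPLIFY = FALSE to mapply should give the same result as Map:

```r
identical(
    Map(function(x, y) x + y, val1, val2),
    mapply(function(x, y) x + y, val1, val2, SIMPLIFY = FALSE)
)
## [1] TRUE
```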

With Map and mapply, the inputs you specify can also be used as positional arguments to the function you want to execute. So we can, for example, sample a varying amount of numbers.

Map(sample, rep(100, times = 5), seq(5, 25, 5))
## [[1]]
## [1] 28 69 71 35 58
## 
## [[2]]
##  [1]  84 100  80  49  81  18  34  98  77  82
## 
## [[3]]
##  [1] 26 97 13 18 59  2 37 73 61 31 75 62 98 46 29
## 
## [[4]]
##  [1]  8 47 34 42  2  5 94  9 83 37 27 76 36 43 32 22 13 81 45 55
## 
## [[5]]
##  [1] 68 11 96  1 92 82 64 93 23 62 88 81 46 60 65 75 28 53 34 35 66 77 24 50 54

You should make sure to specify inputs of the same length, though; otherwise they will be recycled, which might lead to unexpected results.
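The recycling is easy to demonstrate; here the two-element vector is silently reused three times to match the longer input:

```r
# c(0, 10) is recycled against 1:6, alternating the two offsets
unlist(Map(`+`, 1:6, c(0, 10)))
## [1]  1 12  3 14  5 16
```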

Conclusions

In this post we had a look at what R calls Common Higher-Order Functions in Functional Programming Languages. While everything these functions do can be achieved by other means in R, they nonetheless provide clearly named shortcuts for common usage patterns. So far I have almost never seen them in the wild, which is why I only learned about them recently. I suppose that because it is not too hard to implement their behaviour by other means, or to use a package providing similar functionality, these functions remain somewhat unpopular. The next time you find yourself fiddling with lapply or purrr::map, you might want to reconsider and use one of these functions instead; it might improve the clarity of your code.