The base package contains a set of functions summarised under one common help page as Common Higher-Order Functions in Functional Programming Languages. These functions are Reduce, Filter, Find, Map, Negate and Position. They are really useful, especially when working with lists, but judging by how recently I discovered them myself, they appear to be relatively unknown. In this post I will walk you through them and hopefully illustrate that they are useful and deserve more attention.
Before we begin we will set a seed for reproducibility.
set.seed(0809)
Reduce
Many programming languages have reducers, which reduce some input to a single value. For example, it is easy to sum a vector using a reducer.
data <- 1:100
Reduce(function(x, y) x + y, data)
## [1] 5050
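As a side note, Reduce also accepts an init value to seed the computation and an accumulate flag that returns all intermediate results rather than just the final value:

```r
# Seed the reduction with a starting value
Reduce(`+`, 1:5, init = 100)
## [1] 115

# Keep the intermediate results (a cumulative sum here)
Reduce(`+`, 1:5, accumulate = TRUE)
## [1]  1  3  6 10 15
```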
Functions in base R are unfortunately often inconsistent in the order of their arguments. While the apply family takes the data first and then the function, it is the other way around for the functions in this post. Summing a vector is not exactly where a reducer shines, given that sum already exists, but you are of course not limited to such simple operations.
dflist <- list()
dflist$var1 <- data.frame(
  index = sample(100, 70, replace = FALSE),
  data = rnorm(70)
)
dflist$var2 <- data.frame(
  index = sample(100, 70, replace = FALSE),
  data = rnorm(70)
)
dflist$var3 <- data.frame(
  index = sample(100, 70, replace = FALSE),
  data = rnorm(70)
)
head(dflist$var1)
## index data
## 1 96 -0.2172096
## 2 43 0.2087588
## 3 33 0.2258928
## 4 3 -1.2601701
## 5 61 -0.2457201
## 6 94 0.3557803
Let’s assume we want to join all these data frames into one. This is easy using Reduce.
joined <- Reduce(function(x, y) dplyr::full_join(x, y, by = "index"), dflist)
head(joined)
## index data.x data.y data
## 1 96 -0.2172096 NA 2.1861660
## 2 43 0.2087588 1.9315462 1.5866881
## 3 33 0.2258928 NA 0.1491109
## 4 3 -1.2601701 1.3754394 NA
## 5 61 -0.2457201 -0.1122428 -0.4668343
## 6 94 0.3557803 0.2250966 -0.6917175
Nice. The column names are ugly, but fortunately the columns are added in the same order as they appear in our input list.
colnames(joined)[-1] <- names(dflist)
head(joined)
## index var1 var2 var3
## 1 96 -0.2172096 NA 2.1861660
## 2 43 0.2087588 1.9315462 1.5866881
## 3 33 0.2258928 NA 0.1491109
## 4 3 -1.2601701 1.3754394 NA
## 5 61 -0.2457201 -0.1122428 -0.4668343
## 6 94 0.3557803 0.2250966 -0.6917175
In this case we have reduced a list of data frames to a single one.
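As an aside, the same reduction works without the dplyr dependency, since base merge() with all = TRUE performs a full join. A small self-contained sketch (dfs is a toy stand-in for dflist), keeping in mind that merge() sorts the result by the "by" column instead of preserving the input order:

```r
# Full join via base merge() instead of dplyr::full_join();
# all = TRUE keeps rows that are missing on either side.
dfs <- list(
  data.frame(index = c(1, 2), a = c(10, 20)),
  data.frame(index = c(2, 3), b = c(200, 300))
)
joined_base <- Reduce(function(x, y) merge(x, y, by = "index", all = TRUE), dfs)
joined_base$index
## [1] 1 2 3
```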
Filter
The Filter function is almost self-explanatory.
Let’s create a list of vectors of random data.
vlist <- lapply(1:500, function(x) rnorm(sample(100, 1)))
head(vlist, n = 3)
## [[1]]
## [1] 0.14654264 -1.59871229 1.31606365 -0.94209953 -0.11370812 1.62622760
## [7] -0.14067118 0.64724496 0.15097800 -2.20464045 1.69921332 -0.41241822
## [13] -0.81433539 0.46966305 0.83877240 -0.04004403 0.79305087 -1.13656109
## [19] 0.53795521 0.38059841 1.41302113 0.34022742 -0.38370447 0.26579416
## [25] 0.22156801 -0.59288470 1.81715082 0.82614125 -0.92116337 0.86557695
## [31] -0.42587301 0.35998214 2.03493624 -1.03060378 0.05972806 -0.22740067
## [37] 0.93614992 0.39537180 -0.26582876 -0.95188922 -0.42890996 0.09118043
## [43] 2.54315077 1.46714751 -0.81925437 -0.62716330 -0.26453089 0.60175881
##
## [[2]]
## [1] -1.84737446 0.31359091 0.32133103 0.95312967 -0.48817293 -0.82938857
## [7] 0.85463259 0.55218706 -0.25984122 0.19306059 1.44391945 0.44339821
## [13] 1.22126565 1.21504783 0.64697871 1.06278504 1.12492301 -0.63206884
## [19] 0.59991901 1.28890590 -0.01225846 -0.19838356 1.70789276 0.23468874
## [25] -0.40031207 0.32190156 0.73948137 -1.64207348 -0.02101392 0.61324601
## [31] -0.04908262 -1.56145261 -2.10742397 -0.24688728 -2.05114010 2.24237056
## [37] 1.39220709 1.33418784 0.51253936 -0.09434968 0.39085905 0.45668357
## [43] -0.74045275 0.75733294 -0.50151198 0.57714794 -1.67395793 1.25935204
## [49] -0.76024239 -0.20739095 -1.37388056 -0.60620558 -0.59249755 -1.05960860
## [55] -0.65359800 -0.46495141 0.24861045 -0.89241740 -1.60474512 0.52351051
## [61] -0.59720753 0.40752107 -1.11638928 2.60699871 0.95955923 1.81058090
## [67] -1.26713879 0.85522636 0.92669215
##
## [[3]]
## [1] -2.66540380 1.75887192 0.73571882 -0.32644552 1.42074684 0.63448049
## [7] 0.56128750 -0.85229628 1.55715905 -1.14988165 -1.68846580 0.88899404
## [13] -0.37591209 -0.96017974 -1.80370895 0.31478438 -0.31269648 -0.60966367
## [19] -0.48034356 -1.16691746 -0.19119550 -0.04398318 -1.27105605 0.18778268
## [25] -0.19126845 -1.16007202 1.20815234 0.09259228 0.17550677 1.27724235
## [31] -0.87229310 1.42735133 -0.29932805 -1.43058204 0.91248320 0.42877782
## [37] 0.02762169
Assume we want to filter the list to those items where the sum is at least 10. Using Filter, this is straightforward.
filtered <- Filter(function(x) sum(x) >= 10, vlist)
length(filtered)
## [1] 45
The length shows that we did indeed filter out quite a bit of data. Let’s have a closer look:
sapply(filtered, sum)
## [1] 17.10347 11.31184 13.24680 20.35947 14.20383 10.32478 11.01560 11.53005
## [9] 12.22508 10.16311 10.32248 19.51816 13.57972 15.78467 12.91138 16.29072
## [17] 10.22657 14.98985 16.41516 14.70465 10.01238 10.93764 10.01407 10.01546
## [25] 15.92181 18.54769 15.36236 25.01672 11.08992 12.00762 11.72382 15.63941
## [33] 11.20629 10.15452 10.50183 14.21937 15.52676 18.35091 10.09651 10.03426
## [41] 12.30421 12.47863 11.30195 18.98647 14.38877
This looks about right.
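Since a data frame is internally a list of columns, Filter can also be used to select columns by type; a small sketch with a made-up data frame:

```r
# Filter() applied to a data frame keeps the columns for which
# the predicate returns TRUE -- here, the numeric ones.
df <- data.frame(index = 1:3, label = c("a", "b", "c"), value = c(0.1, 0.2, 0.3))
names(Filter(is.numeric, df))
## [1] "index" "value"
```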
Find
and Position
Both Find and Position do very similar things, so I’ll cover them together.
Assume we have a lot of values.
vals <- sample(1e7, 1e5)
head(vals)
## [1] 4639329 3753050 9561538 7975142 453770 5941340
We want to retrieve the first value larger than 9,750,000:
Find(function(x) x > 9750000, vals)
## [1] 9873421
While this is useful, finding its index is even more useful.
Position(function(x) x > 9750000, vals)
## [1] 47
This is easy to verify:
vals[47]
## [1] 9873421
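Both functions search from the left by default but accept a right argument to search from the end, and they handle misses gracefully: Find returns NULL and Position returns NA (adjustable via its nomatch argument). A quick sketch:

```r
x <- c(3, 8, 2, 9, 5)

Position(function(v) v > 4, x)                # first match from the left
## [1] 2
Position(function(v) v > 4, x, right = TRUE)  # first match from the right
## [1] 5

Find(function(v) v > 100, x)      # no match: NULL
Position(function(v) v > 100, x)  # no match: NA
```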
I hope you refrain from awkward constructions using which from now on.
Negate
Negate takes a function returning logical values and negates its output. This is especially useful when creating new functions.
`%notin%` <- Negate(`%in%`)
Yes, this does exactly what it says it does.
1 %notin% c(2:10)
## [1] TRUE
1 %notin% c(1:10)
## [1] FALSE
This might also be useful for grepl, which lacks an invert option.
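For instance, a negated grepl expresses "keep everything that does not match" quite readably; a small sketch with made-up file names:

```r
# Negate(grepl) returns TRUE exactly where grepl() would return FALSE
no_match <- Negate(grepl)

files <- c("tmp_cache.rds", "results.csv", "tmp_log.txt")
files[no_match("^tmp", files)]
## [1] "results.csv"
```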
Map
Finally we have Map, which is just a wrapper around mapply and thus probably the least useful of the functions covered here. The main difference between mapply and Map is that the latter does not try to simplify the result.
val1 <- list(1:5, seq(10, 50, 10), seq(100, 500, 100))
sq <- seq(10, 100, 20)
val2 <- list(sq, sq, sq)
With Map you specify a function to apply to the inputs, and then the inputs themselves.
Map(function(x, y) x + y, val1, val2)
## [[1]]
## [1] 11 32 53 74 95
##
## [[2]]
## [1] 20 50 80 110 140
##
## [[3]]
## [1] 110 230 350 470 590
In this case, for each element in val1 we added the element in the corresponding position from val2.
If we do the same with mapply we get a simplified output.
mapply(function(x, y) x + y, val1, val2)
## [,1] [,2] [,3]
## [1,] 11 20 110
## [2,] 32 50 230
## [3,] 53 80 350
## [4,] 74 110 470
## [5,] 95 140 590
These are the same results as above, just in a different format.
With Map and mapply, the inputs you specify are used as positional arguments to the function you want to execute. So, for example, we can draw samples of varying sizes.
Map(sample, rep(100, times = 5), seq(5, 25, 5))
## [[1]]
## [1] 28 69 71 35 58
##
## [[2]]
## [1] 84 100 80 49 81 18 34 98 77 82
##
## [[3]]
## [1] 26 97 13 18 59 2 37 73 61 31 75 62 98 46 29
##
## [[4]]
## [1] 8 47 34 42 2 5 94 9 83 37 27 76 36 43 32 22 13 81 45 55
##
## [[5]]
## [1] 68 11 96 1 92 82 64 93 23 62 88 81 46 60 65 75 28 53 34 35 66 77 24 50 54
You should make sure to specify inputs of the same length though, otherwise they will be recycled, which might lead to unexpected results.
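To make the recycling visible, here is a small sketch where the shorter input is silently repeated against the longer one:

```r
# The second input (length 2) is recycled against the first (length 6),
# so y alternates between 0 and 10 across the six calls.
Map(function(x, y) x + y, 1:6, c(0, 10))
# returns a list of 1, 12, 3, 14, 5, 16
```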
In this post we had a look at what R calls Common Higher-Order Functions in Functional Programming Languages. While everything these functions do can be achieved by other means in R, they nonetheless provide shortcuts for common use patterns, with clear names. So far I have almost never seen them in the wild, which is why I only recently learned about them myself. I suppose it is not too hard to implement their behaviour by other means, or to use a package providing similar functionality, which makes these functions somewhat unpopular.
The next time you find yourself fiddling with lapply or purrr::map, you might want to reconsider and use one of these functions instead; it might improve the clarity of your code.