# Statistics, Science, Random Ramblings

## A blog mostly about data and R

The `base` package contains a set of functions summarised under one common help page as Common Higher-Order Functions in Functional Programming Languages: `Reduce`, `Filter`, `Find`, `Map`, `Negate` and `Position`. They are really useful, especially when working with lists, yet they appear to be relatively unknown; I only discovered them recently myself.

In this post I will walk you through them and hopefully illustrate that they are useful and deserve more attention.

Before we begin we will set a seed for reproducibility.

```r
set.seed(0809)
```

## `Reduce`

Many programming languages have reducers, which reduce some input to a single value. For example it is easy to sum a vector using a reducer.

```r
data <- 1:100
Reduce(function(x, y) x + y, data)
```

```
## [1] 5050
```

Functions in `base` are unfortunately often inconsistent with the order of arguments. While the apply family takes the data first, then the function, it is the other way around with the functions in this post.

Summing a vector is not exactly where a reducer shines given there is `sum`, but you are of course not limited to such simple actions.

```r
dflist <- list()
dflist$var1 <- data.frame(
  index = sample(100, 70, replace = FALSE),
  data = rnorm(70))
dflist$var2 <- data.frame(
  index = sample(100, 70, replace = FALSE),
  data = rnorm(70))
dflist$var3 <- data.frame(
  index = sample(100, 70, replace = FALSE),
  data = rnorm(70))
head(dflist$var1)
```

```
##   index       data
## 1    96 -0.2172096
## 2    43  0.2087588
## 3    33  0.2258928
## 4     3 -1.2601701
## 5    61 -0.2457201
## 6    94  0.3557803
```

Let’s assume we want to join all these data frames into one. This is easy using `Reduce`.

```r
joined <- Reduce(function(x, y) dplyr::full_join(x, y, by = "index"), dflist)
head(joined)
```

```
##   index     data.x     data.y       data
## 1    96 -0.2172096         NA  2.1861660
## 2    43  0.2087588  1.9315462  1.5866881
## 3    33  0.2258928         NA  0.1491109
## 4     3 -1.2601701  1.3754394         NA
## 5    61 -0.2457201 -0.1122428 -0.4668343
## 6    94  0.3557803  0.2250966 -0.6917175
```

Nice. The column names are ugly, but fortunately the columns are added in the same order as they appear in our input list.

```r
colnames(joined)[-1] <- names(dflist)
head(joined)
```

```
##   index       var1       var2       var3
## 1    96 -0.2172096         NA  2.1861660
## 2    43  0.2087588  1.9315462  1.5866881
## 3    33  0.2258928         NA  0.1491109
## 4     3 -1.2601701  1.3754394         NA
## 5    61 -0.2457201 -0.1122428 -0.4668343
## 6    94  0.3557803  0.2250966 -0.6917175
```

In this case we have reduced a list of data frames to a single one.
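Beyond this, the same help page documents two more arguments worth knowing: `init` supplies a starting value, and `accumulate = TRUE` returns all intermediate results instead of just the final one. A quick sketch:

```r
# Running totals instead of a single value:
Reduce(`+`, 1:5, accumulate = TRUE)
## [1]  1  3  6 10 15

# An initial value is combined with the input before reducing:
Reduce(`+`, 1:5, init = 100)
## [1] 115
```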

## `Filter`

The `Filter` function is almost self explanatory. Let’s create a list of vectors of random data.

```r
vlist <- lapply(1:500, function(x) rnorm(sample(100, 1)))
head(vlist, 3)
```

```
## [[1]]
##    0.14654264 -1.59871229  1.31606365 -0.94209953 -0.11370812  1.62622760
##   -0.14067118  0.64724496  0.15097800 -2.20464045  1.69921332 -0.41241822
##  -0.81433539  0.46966305  0.83877240 -0.04004403  0.79305087 -1.13656109
##   0.53795521  0.38059841  1.41302113  0.34022742 -0.38370447  0.26579416
##   0.22156801 -0.59288470  1.81715082  0.82614125 -0.92116337  0.86557695
##  -0.42587301  0.35998214  2.03493624 -1.03060378  0.05972806 -0.22740067
##   0.93614992  0.39537180 -0.26582876 -0.95188922 -0.42890996  0.09118043
##   2.54315077  1.46714751 -0.81925437 -0.62716330 -0.26453089  0.60175881
##
## [[2]]
##   -1.84737446  0.31359091  0.32133103  0.95312967 -0.48817293 -0.82938857
##    0.85463259  0.55218706 -0.25984122  0.19306059  1.44391945  0.44339821
##   1.22126565  1.21504783  0.64697871  1.06278504  1.12492301 -0.63206884
##   0.59991901  1.28890590 -0.01225846 -0.19838356  1.70789276  0.23468874
##  -0.40031207  0.32190156  0.73948137 -1.64207348 -0.02101392  0.61324601
##  -0.04908262 -1.56145261 -2.10742397 -0.24688728 -2.05114010  2.24237056
##   1.39220709  1.33418784  0.51253936 -0.09434968  0.39085905  0.45668357
##  -0.74045275  0.75733294 -0.50151198  0.57714794 -1.67395793  1.25935204
##  -0.76024239 -0.20739095 -1.37388056 -0.60620558 -0.59249755 -1.05960860
##  -0.65359800 -0.46495141  0.24861045 -0.89241740 -1.60474512  0.52351051
##  -0.59720753  0.40752107 -1.11638928  2.60699871  0.95955923  1.81058090
##  -1.26713879  0.85522636  0.92669215
##
## [[3]]
##   -2.66540380  1.75887192  0.73571882 -0.32644552  1.42074684  0.63448049
##    0.56128750 -0.85229628  1.55715905 -1.14988165 -1.68846580  0.88899404
##  -0.37591209 -0.96017974 -1.80370895  0.31478438 -0.31269648 -0.60966367
##  -0.48034356 -1.16691746 -0.19119550 -0.04398318 -1.27105605  0.18778268
##  -0.19126845 -1.16007202  1.20815234  0.09259228  0.17550677  1.27724235
##  -0.87229310  1.42735133 -0.29932805 -1.43058204  0.91248320  0.42877782
##   0.02762169
```

Assume we want to filter the list to those items where the sum is at least 10. Using `Filter` this is straightforward.

```r
filtered <- Filter(function(x) sum(x) >= 10, vlist)
length(filtered)
```

```
## [1] 45
```

The length shows that we did indeed filter out quite a bit of data. Let's have a closer look:

```r
sapply(filtered, sum)
```

```
##   17.10347 11.31184 13.24680 20.35947 14.20383 10.32478 11.01560 11.53005
##   12.22508 10.16311 10.32248 19.51816 13.57972 15.78467 12.91138 16.29072
##  10.22657 14.98985 16.41516 14.70465 10.01238 10.93764 10.01407 10.01546
##  15.92181 18.54769 15.36236 25.01672 11.08992 12.00762 11.72382 15.63941
##  11.20629 10.15452 10.50183 14.21937 15.52676 18.35091 10.09651 10.03426
##  12.30421 12.47863 11.30195 18.98647 14.38877
```

## `Find` and `Position`

Both `Find` and `Position` do very similar things, so I’ll cover them together.

Assume we have a lot of values.

```r
vals <- sample(1e7, 1e5)
head(vals)
```

```
## [1] 4639329 3753050 9561538 7975142  453770 5941340
```

We want to retrieve the first value larger than 9,750,000:

```r
Find(function(x) x > 9750000, vals)
```

```
## [1] 9873421
```

While this is useful, it is often even more helpful to find the value's index.

```r
Position(function(x) x > 9750000, vals)
```

```
## [1] 47
```

Which is easy to verify.

```r
vals[47]
```

```
## [1] 9873421
```

I hope this helps you to avoid awkward constructions involving `which` from now on.
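Both functions also accept a `right` argument to search from the end of the input, and `Position` additionally takes a `nomatch` value. A quick sketch with a small made-up vector:

```r
x <- c(2, 7, 9, 3)
# Search from the right instead of the left:
Position(function(v) v > 5, x, right = TRUE)
## [1] 3
# Control what is returned when nothing matches:
Position(function(v) v > 100, x, nomatch = 0)
## [1] 0
Find(function(v) v > 100, x)  # returns NULL when nothing matches
```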

## `Negate`

`Negate` takes a function returning logical values and inverts its output. This is especially useful when creating new functions.

```r
`%notin%` <- Negate(`%in%`)
```

Yes, this does exactly what it says it does.

```r
1 %notin% c(2:10)
```

```
## [1] TRUE
```

```r
1 %notin% c(1:10)
```

```
## [1] FALSE
```

This might also be useful for `grepl` which lacks an `invert` option.
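A minimal sketch of that idea (the function name is made up):

```r
# A negated grepl: TRUE for elements that do NOT match the pattern
not_matching <- Negate(grepl)
not_matching("^a", c("apple", "banana", "avocado"))
## [1] FALSE  TRUE FALSE
```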

## `Map`

Finally we have `Map`, which is just a wrapper around `mapply` and thus probably the least essential of the functions covered here. The main difference between `mapply` and `Map` is that the latter does not try to simplify the result.

```r
val1 <- list(1:5, seq(10, 50, 10), seq(100, 500, 100))
sq <- seq(10, 100, 20)
val2 <- list(sq, sq, sq)
```

With `Map` you specify a function to apply to the inputs and then the inputs.

```r
Map(function(x, y) x + y, val1, val2)
```

```
## [[1]]
## [1] 11 32 53 74 95
##
## [[2]]
## [1]  20  50  80 110 140
##
## [[3]]
## [1] 110 230 350 470 590
```

In this case, for each element in `val1` we added the element in the corresponding position from `val2`.

If we do the same with `mapply` we get a simplified output.

```r
mapply(function(x, y) x + y, val1, val2)
```

```
##      [,1] [,2] [,3]
## [1,]   11   20  110
## [2,]   32   50  230
## [3,]   53   80  350
## [4,]   74  110  470
## [5,]   95  140  590
```

These are the same results as above, just in a different format.
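In fact, as far as I can tell from its definition, `Map` simply calls `mapply` with `SIMPLIFY = FALSE`, so the two calls below produce identical results:

```r
# Same inputs as above
val1 <- list(1:5, seq(10, 50, 10), seq(100, 500, 100))
sq <- seq(10, 100, 20)
val2 <- list(sq, sq, sq)

identical(
  Map(function(x, y) x + y, val1, val2),
  mapply(function(x, y) x + y, val1, val2, SIMPLIFY = FALSE)
)
## [1] TRUE
```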

With `Map` and `mapply` the inputs you specify are also passed as positional arguments to the function you want to execute. So we can, for example, sample varying amounts of numbers.

```r
Map(sample, rep(100, times = 5), seq(5, 25, 5))
```

```
## [[1]]
## [1] 28 69 71 35 58
##
## [[2]]
## [1]  84 100  80  49  81  18  34  98  77  82
##
## [[3]]
## [1] 26 97 13 18 59  2 37 73 61 31 75 62 98 46 29
##
## [[4]]
## [1]  8 47 34 42  2  5 94  9 83 37 27 76 36 43 32 22 13 81 45 55
##
## [[5]]
## [1] 68 11 96  1 92 82 64 93 23 62 88 81 46 60 65 75 28 53 34 35 66 77 24 50 54
```

You should make sure to specify inputs of the same length though, otherwise they will be recycled, which might lead to unexpected results.
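A small sketch of how recycling can bite: with inputs of lengths four and two, the shorter one is silently reused.

```r
# 1:2 is recycled against 1:4, giving 1+1, 2+2, 3+1, 4+2
unlist(Map(`+`, 1:4, 1:2))
## [1] 2 4 4 6
```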

## Conclusions

In this post we had a look at what R calls Common Higher-Order Functions in Functional Programming Languages. While everything these functions do can be achieved by other means in R, they nonetheless provide shortcuts for common usage patterns, with clear names. I have hardly ever seen them in the wild, which is why I only learned about them recently. I suppose that because it is not too hard to implement their behaviour by other means, or to use a package providing similar functionality, these functions have remained somewhat unpopular. The next time you find yourself fiddling with `lapply` or `purrr::map`, though, you might want to reconsider and use one of these functions instead; it might improve the clarity of your code.