# Statistics, Science, Random Ramblings

## A blog mostly about data and R

The `base` package that ships with every R installation includes the functions `agrep` and `agrepl`, which perform fuzzy (approximate) pattern matching and are quite useful when working with text. Let's start with some example data:

```r
animals <- c("alpaca", "bear", "cat", "duck", "emu", "fox",
             "goose", "hamster", "ibis", "jaguar", "kangaroo", "moose",
             "owl", "penguin", "quetzal", "rabbit", "snail", "toad",
             "urial", "viper", "wolf", "yak", "zebra")
```

We can now fuzzy-search the data.

```r
result <- agrep("moose", animals)
animals[result]
```
```
## [1] "goose" "moose"
```

In this case both goose and moose were matched, as they differ only in the initial letter.

As with `grep`, the result is a vector of the indices where a match occurred.

```r
result
```
```
## [1]  7 12
```
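If you want the matched strings rather than their positions, `agrep` also takes a `value` argument, just like `grep`. A minimal sketch, using a shortened vector (`animals_subset`, introduced here only for illustration):

```r
# A few entries from the animals vector, so the example runs on its own
animals_subset <- c("goose", "moose", "owl")

# value = TRUE returns the matching elements instead of their indices
matches <- agrep("moose", animals_subset, value = TRUE)
matches
```

This saves the extra subsetting step when the indices themselves are not needed.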

We can further tweak this. Consider:

```r
result <- agrep("meese", animals)
animals[result]
```
```
## character(0)
```

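This did not return anything, as the distance between the singular moose and the (not quite correct) plural meese is too large: two letters differ, while the default `max.distance` of 0.1 allows only a single edit for a five-letter pattern. There is, however, a `max.distance` parameter that can be raised: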

```r
result <- agrep("meese", animals, max.distance = 0.3)
animals[result]
```
```
## [1] "moose"
```

Setting `max.distance` too high returns a lot of noise, so be careful with it and have a clear idea of what content you expect to match.

```r
result <- agrep("meese", animals, max.distance = 0.7)
animals[result]
```
```
##  [1] "bear"    "emu"     "goose"   "hamster" "ibis"    "moose"   "penguin"
##  [8] "quetzal" "snail"   "viper"   "zebra"
```
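One way to build such expectations is to look at the edit distances directly. The sketch below uses `adist` from the `utils` package, which is not part of the walkthrough above, so treat it as a suggestion. Note that `adist` compares whole strings by default, while `agrep` also accepts an approximate match against just a part of each string. That is probably why short words such as emu show up at a high `max.distance`: once four of the five pattern letters may be dropped, a single shared letter is enough.

```r
candidates <- c("moose", "goose", "emu")

# Generalized Levenshtein distance between the pattern and each whole string
distances <- drop(adist("meese", candidates))
names(distances) <- candidates
distances
```

Comparing these numbers with the length of the pattern gives a rough feeling for which `max.distance` separates plausible matches from noise.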

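`max.distance` can be expressed either as a fraction (between 0 and 1) of the pattern length or as an integer number of edits. It also accepts a list of sub-parameters (`cost`, `insertions`, `deletions`, `substitutions`, `all`) that can be tweaked to your needs, but you should see `?agrep` for those.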
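To illustrate the different forms, the sketch below passes `max.distance` first as an integer and then as a list. The list components (`cost`, `all`, `insertions`, `deletions`, `substitutions`) are the ones described in `?agrep`. The particular limits and the shortened `animals_subset` vector are only illustrative choices, so check the help page before relying on them.

```r
animals_subset <- c("goose", "moose", "owl")

# Integer form: allow at most 2 edits in total
by_count <- agrep("meese", animals_subset, max.distance = 2, value = TRUE)

# List form: allow up to 2 substitutions, but no insertions or deletions
by_type <- agrep("meese", animals_subset, value = TRUE,
                 max.distance = list(cost = 2, all = 2, insertions = 0,
                                     deletions = 0, substitutions = 2))
by_count
by_type
```

Restricting the kinds of edits can help when you expect a specific sort of typo, such as swapped letters rather than missing ones.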

Of course, there is also `agrepl`, which returns a logical vector instead of indices but otherwise works exactly the same.

```r
agrepl("moose", animals)
```
```
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```
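A logical vector is convenient for subsetting, for example to filter the rows of a data frame. In the sketch below the `zoo` data frame and its `count` column are made up purely for illustration:

```r
# Hypothetical data: animal names with made-up counts
zoo <- data.frame(animal = c("goose", "moose", "owl"),
                  count  = c(3, 1, 5),
                  stringsAsFactors = FALSE)

# Keep only the rows whose name fuzzily matches "moose"
near_moose <- zoo[agrepl("moose", zoo$animal), ]
near_moose
```

The same logical vector could also be counted with `sum()` or negated with `!` to drop the fuzzy matches instead.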

The name of the place where I live is spelled in at least three different ways in the wild, so `agrep` comes in quite handy here.

```r
place_names <- c("Düsseldorf", "Duesseldorf", "Dusseldorf")
agrep("düsseldorf", place_names, ignore.case = TRUE, max.distance = 0.15)
```
```
## [1] 1 2 3
```

Of course, this could also be handled with a regular expression, but `agrep` seems like the easier choice here. You could also use regular expressions with `agrep` (by setting `fixed = FALSE`), but this seems like a slippery slope.
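For comparison, a regular expression covering the same three spellings could look like the sketch below. The pattern is just one possible way to write it, and it has to list every variant explicitly, whereas the `agrep` call only needs one spelling and a tolerance.

```r
place_names <- c("Düsseldorf", "Duesseldorf", "Dusseldorf")

# Explicit alternation for the three ways of writing the umlaut
regex_hits <- grep("^d(ü|ue|u)sseldorf$", place_names, ignore.case = TRUE)
regex_hits
```

Unlike the fuzzy search, this pattern will not catch a spelling you did not anticipate, which is the trade-off between the two approaches.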

So, to sum things up: there is fuzzy matching bundled with your installation of R in the `agrep` function, which in some cases might be the easier choice compared to building regular expressions.