Statistics, Science, Random Ramblings

A blog mostly about data and R

Fuzzy text matching in R

Posted on Apr 21, 2020

The base package that ships with every R installation includes the functions agrep and agrepl, which perform fuzzy (approximate) pattern matching and are quite useful when working with text. To try them out, we first need some data:

animals <- c("alpaca", "bear", "cat", "duck", "emu", "fox",
             "goose", "hamster", "ibis", "jaguar", "kangaroo", "moose",
             "owl", "penguin", "quetzal", "rabbit", "snail", "toad",
             "urial", "viper", "wolf", "yak", "zebra")

We can now fuzzy-search the data.

result <- agrep("moose", animals)
animals[result]
## [1] "goose" "moose"

In this case both goose and moose were matched, as they differ only in the initial letter.

As with grep, the result is a vector of the indices at which a match occurred.

result
## [1]  7 12
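If the matching strings themselves are wanted rather than their positions, agrep also accepts value = TRUE, just like grep. A minimal sketch, repeating the animals vector so that it runs on its own:

```r
animals <- c("alpaca", "bear", "cat", "duck", "emu", "fox",
             "goose", "hamster", "ibis", "jaguar", "kangaroo", "moose",
             "owl", "penguin", "quetzal", "rabbit", "snail", "toad",
             "urial", "viper", "wolf", "yak", "zebra")

# value = TRUE returns the matched elements instead of their indices
matched <- agrep("moose", animals, value = TRUE)
matched
## [1] "goose" "moose"
```

This saves the extra subsetting step animals[result] when the indices are not needed.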

We can further tweak this. Consider:

result <- agrep("meese", animals)
animals[result]
## character(0)

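This did not return anything, as the distance between the singular moose and the (not quite correct) plural meese is too large: two letters differ, and the default max.distance of 0.1 allows only a single edit for a five-letter pattern. The max.distance parameter can be raised, however: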

result <- agrep("meese", animals, max.distance = 0.3)
animals[result]
## [1] "moose"
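The numbers behind these results can be checked with adist, which computes the edit distance between strings. The sketch below is only a rough guide: adist compares whole strings by default, whereas agrep also accepts an approximate match against part of a string. A fractional max.distance is scaled by the pattern length and rounded up, so for a five-letter pattern 0.1 allows one edit and 0.3 allows two.

```r
# adist() lives in the utils package, which is attached by default.
# drop() turns the 1 x 2 result matrix into a plain vector.
d_moose <- drop(adist("moose", c("goose", "moose")))
d_moose
## [1] 1 0

d_meese <- drop(adist("meese", c("goose", "moose")))
d_meese
## [1] 3 2
```

With "moose" as the pattern both candidates are within one edit, while "meese" is two edits away from moose and three from goose, which is why only moose shows up once two edits are allowed.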

Setting max.distance too high returns a lot of noise, so be careful with it and have some idea of what content you expect to match against.

The following is obviously bad:

result <- agrep("meese", animals, max.distance = 0.7)
animals[result]
##  [1] "bear"    "emu"     "goose"   "hamster" "ibis"    "moose"   "penguin"
##  [8] "quetzal" "snail"   "viper"   "zebra"

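max.distance can be given either as a fraction between 0 and 1, which is taken relative to the pattern length and rounded up to a whole number of edits, or as an integer number of edits. It also accepts a list with components such as cost, insertions, deletions and substitutions for finer control; see ?agrep for those.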
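As a quick check of the two forms, here is a minimal sketch using the earlier "meese" search. For this five-letter pattern a fraction of 0.3 rounds up to two edits, so passing the integer 2 should behave the same way. The animals vector is repeated so that the snippet runs on its own.

```r
animals <- c("alpaca", "bear", "cat", "duck", "emu", "fox",
             "goose", "hamster", "ibis", "jaguar", "kangaroo", "moose",
             "owl", "penguin", "quetzal", "rabbit", "snail", "toad",
             "urial", "viper", "wolf", "yak", "zebra")
pattern <- "meese"

# The fraction is scaled by the pattern length and rounded up
ceiling(0.3 * nchar(pattern))
## [1] 2

by_fraction <- agrep(pattern, animals, max.distance = 0.3, value = TRUE)
by_integer  <- agrep(pattern, animals, max.distance = 2, value = TRUE)
by_integer
## [1] "moose"
```

The integer form is easier to reason about when patterns vary in length, since a fixed fraction allows more edits for longer patterns.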

Of course, there is also agrepl, which returns a logical vector instead of indices but otherwise works exactly the same.

agrepl("moose", animals)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
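A logical vector is handy because it can be combined with other conditions before subsetting. The following sketch narrows the fuzzy match down to animals starting with "m"; the extra condition is only an illustration and not part of the original example.

```r
animals <- c("alpaca", "bear", "cat", "duck", "emu", "fox",
             "goose", "hamster", "ibis", "jaguar", "kangaroo", "moose",
             "owl", "penguin", "quetzal", "rabbit", "snail", "toad",
             "urial", "viper", "wolf", "yak", "zebra")

# Number of fuzzy matches
n_matches <- sum(agrepl("moose", animals))
n_matches
## [1] 2

# Combine the fuzzy match with a second (illustrative) condition
m_only <- animals[agrepl("moose", animals) & startsWith(animals, "m")]
m_only
## [1] "moose"
```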

The name of the place where I live is spelled in at least three different ways in the wild, so agrep comes in quite handy here.

place_names <- c("Düsseldorf", "Duesseldorf", "Dusseldorf")
agrep("düsseldorf", place_names, ignore.case = TRUE, max.distance = 0.15)
## [1] 1 2 3

Of course, this could also be handled using a regular expression, but agrep seems like the easier choice here. You could also use regular expressions with agrep (by setting fixed = FALSE), but this seems like a slippery slope.
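For comparison, a regular expression covering the three spellings might look like the sketch below. The pattern is just one possible way to write it and is not taken from any particular source.

```r
place_names <- c("Düsseldorf", "Duesseldorf", "Dusseldorf")

# Spell out the three variants of the second letter explicitly
is_match <- grepl("^d(ü|ue|u)sseldorf$", place_names, ignore.case = TRUE)
is_match
## [1] TRUE TRUE TRUE
```

The regular expression accepts exactly the listed variants and nothing else, whereas agrep accepts any spelling within the allowed distance, including typos nobody thought of in advance.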

So, to sum things up: there is fuzzy matching bundled with your installation of R in the agrep function, which in some cases might be the easier choice compared to building regular expressions.