The base package of every R installation includes the functions agrep and agrepl, which perform fuzzy (approximate) pattern matching and are quite useful when working with text.
As an example, take the following vector of animal names:
animals <- c("alpaca", "bear", "cat", "duck", "emu", "fox",
"goose", "hamster", "ibis", "jaguar", "kangaroo", "moose",
"owl", "penguin", "quetzal", "rabbit", "snail", "toad",
"urial", "viper", "wolf", "yak", "zebra")
We can now fuzzy-search the data.
result <- agrep("moose", animals)
animals[result]
## [1] "goose" "moose"
In this case both goose and moose were matched, as they differ only in the initial letter.
As with grep, the result is a vector of the indices at which a match occurred.
result
## [1] 7 12
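If the matched strings are wanted rather than their positions, agrep also accepts value = TRUE, again mirroring grep. A minimal sketch, using a shortened, made-up version of the vector above (animals_small is only an illustrative name):

```r
# Shortened, illustrative version of the animals vector
animals_small <- c("cat", "goose", "moose", "owl")

# value = TRUE returns the matching elements instead of their indices
matches <- agrep("moose", animals_small, value = TRUE)
matches
```

This saves the extra indexing step when only the matched text is of interest.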
We can further tweak this. Consider:
result <- agrep("meese", animals)
animals[result]
## character(0)
This returned nothing because the distance between the singular moose and the (not quite correct) plural meese is too large: two letters would have to be substituted.
There is, however, a max.distance parameter that can be set:
result <- agrep("meese", animals, max.distance = 0.3)
animals[result]
## [1] "moose"
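To see why the default failed and 0.3 worked, the edit distance can be inspected with adist from the utils package. This is only a sketch of the reasoning: according to ?agrep, the default max.distance is 0.1, and a fractional value is multiplied by the pattern length and rounded up. The variable names below are made up for illustration.

```r
# adist() lives in the utils package; the utils:: prefix makes that explicit
pattern <- "meese"

# Generalized Levenshtein distance between the two whole words
d <- utils::adist(pattern, "moose")[1, 1]

# Number of edits allowed for a fractional max.distance
allowed_default <- ceiling(0.1 * nchar(pattern))
allowed_relaxed <- ceiling(0.3 * nchar(pattern))

d <= allowed_default  # is the match within the default limit?
d <= allowed_relaxed  # is it within the relaxed limit?
```

Note that agrep matches the pattern against substrings, while adist compares whole strings by default, so the value computed here is an upper bound on the distance agrep works with.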
Setting max.distance too high returns a lot of noise, so be careful with it and have some idea of what content you expect to match against.
The following is obviously bad:
result <- agrep("meese", animals, max.distance = 0.7)
animals[result]
## [1] "bear" "emu" "goose" "hamster" "ibis" "moose" "penguin"
## [8] "quetzal" "snail" "viper" "zebra"
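One way to get a feeling for a sensible threshold is to count the matches over a few candidate values before settling on one. The following is a rough sketch rather than a recommended procedure; the chosen thresholds and the names distances and n_matches are arbitrary.

```r
animals <- c("alpaca", "bear", "cat", "duck", "emu", "fox",
             "goose", "hamster", "ibis", "jaguar", "kangaroo", "moose",
             "owl", "penguin", "quetzal", "rabbit", "snail", "toad",
             "urial", "viper", "wolf", "yak", "zebra")

# Candidate thresholds to compare (arbitrary choice)
distances <- c(0.1, 0.3, 0.5, 0.7)

# Count how many elements match at each threshold
n_matches <- vapply(
  distances,
  function(d) length(agrep("meese", animals, max.distance = d)),
  integer(1)
)
names(n_matches) <- distances
n_matches
```

A sudden jump in the number of matches is a hint that the threshold has become too permissive for the data at hand.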
max.distance can be expressed either as a fraction between 0 and 1 (relative to the pattern length) or as an integer number of edits.
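As a small illustration of the two notations: for a five-letter pattern, a fraction of 0.3 should correspond to two allowed edits, assuming the fraction is scaled by the pattern length and rounded up as described in ?agrep. The word list is made up for this sketch.

```r
pattern <- "meese"
words <- c("goose", "moose", "mouse")  # illustrative data

# Fractional form: 0.3 * 5 characters, rounded up, gives 2 edits
by_fraction <- agrep(pattern, words, max.distance = 0.3)

# Integer form: 2 edits, stated directly
by_count <- agrep(pattern, words, max.distance = 2)

identical(by_fraction, by_count)
```

The integer form is easier to reason about when patterns vary in length, while the fraction scales the tolerance with the pattern.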
Additionally, there are sub-parameters that can be tweaked to your needs; see ?agrep for those.
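As a hedged sketch of what such tweaking can look like: ?agrep documents that max.distance may also be a list with components such as all, insertions, deletions and substitutions. The call below tries to allow up to two edits in total while ruling out insertions and deletions, so that only substitutions remain. The word list is made up, and the exact defaults for components left out should be checked in ?agrep.

```r
words <- c("goose", "moose", "mouse")  # illustrative data

# Allow at most two edits overall, none of them insertions or deletions
subs_only <- agrep(
  "meese", words, value = TRUE,
  max.distance = list(all = 2, insertions = 0, deletions = 0)
)
subs_only
```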
Of course, there is also agrepl, which returns a logical vector instead of indices but otherwise works exactly the same.
agrepl("moose", animals)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
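The logical vector is convenient for filtering rows of a data frame. A minimal sketch with made-up data (the sightings data frame and its columns are purely illustrative):

```r
# Hypothetical data: animal names with a count per animal
sightings <- data.frame(
  animal = c("goose", "moose", "owl"),
  count  = c(3, 1, 5),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row, usable directly as a row filter
is_match <- agrepl("moose", sightings$animal)
sightings[is_match, ]
```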
The name of the place where I live has at least three spellings that you will encounter in the wild, so agrep comes in quite handy here.
place_names <- c("Düsseldorf", "Duesseldorf", "Dusseldorf")
agrep("düsseldorf", place_names, ignore.case = TRUE, max.distance = 0.15)
## [1] 1 2 3
Of course, this could also be handled with a regular expression, but agrep seems like the easier choice here.
You could also use regular expressions with agrep (by setting fixed = FALSE), but this seems like a slippery slope.
So, to sum things up: fuzzy matching comes bundled with your installation of R in the form of the agrep function, which in some cases might be an easier choice than building regular expressions.