Welcome toVigges Developer Community-Open, Learning,Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.0k views
in Technique[技术] by (71.8m points)

algorithm - Finding how similar two strings are

I'm looking for an algorithm that takes 2 strings and will give me back a "factor of similarity".

Basically, I will have an input that may be misspelled, have letters transposed, etc, and I have to find the closest match(es) in a list of possible values that I have.

This is not for searching in a database. I'll have an in-memory list of 500 or so strings to match against, all under 30 chars, so it can be relatively slow.

I know this exists, i've seen it before, but I can't remember its name.


Edit: Thanks for pointing out Levenshtein and Hamming. Now, which one should I implement? They basically measure different things, both of which can be used for what I want, but I'm not sure which one is more appropriate.

I've read up on the algorithms, Hamming seems obviously faster. Since neither will detect two characters being transposed (ie. Jordan and Jodran), which I believe will be a common mistake, which will be more accurate for what I want? Can someone tell me a bit about the trade-offs?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Ok, so the standard algorithms are:

1) Hamming distance Only good for strings of the same length, but very efficient. Basically it simply counts the number of distinct characters. Not useful for fuzzy searching of natural language text.

2) Levenstein distance. The Levenstein distance measures distance in terms of the number of "operations" required to transform one string to another. These operations include insertion, deletion and substition. The standard approach of calculating the Levenstein distance is to use dynamic programming.

3) Generalized Levenstein/(Damerau–Levenshtein distance) This distance also takes into consideration transpositions of characters in a word, and is probably the edit distance most suited for fuzzy matching of manually-entered text. The algorithm to compute the distance is a bit more involved than the Levenstein distance (detecting transpositions is not easy). Most common implementations are a modification of the bitap algorithm (like grep).

In general you would probably want to consider an implementation of the third option implemented in some sort of nearest neighbour search based on a k-d tree


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to Vigges Developer Community for programmer and developer-Open, Learning and Share

2.1m questions

2.1m answers

63 comments

56.6k users

...