Module texcraft_stdext::algorithms::spellcheck
source · Expand description
Spell checking using Levenshtein distance
This module contains a simple spell check facilty in the form a find_close_words function. This function accepts a word and a dictionary, which is a list of valid words. It find the words in the dictionary that are closest to the original world, where “closest” is defined using Levenshtein distance.
The result of the search is a list of WordDiffs. This data structure describes how the initial word can be changed to the dictionary word using a combination of keep, add, subtract and modify DiffOps.
Implementation notes
The Levenshtein distance calculation is implemented using dynamic programming and has the smallest possible space and time complexity (both O(n^2)). Note the space complexity could be O(n) if we only cared about the minimal distance between the strings, but we want to return the full WordDiff. We use only O(n) space during the calculation, but each WordDiff is also O(n), so we end up with O(n^2).
To see why we can only use O(n) space,
let n = a.len()
and m = b.len()
and consider the n x m
matrix X
where X[i][j]
is the comparison
between a[:i]
and b[:j]
.
The notation a[:i]
means all characters in a
between 0 and i
inclusive.
The solution is X[n][m]
. The recursive relation is:
X[i][j] = {
X[i-1][j-1] if a[i] == b[j] // append the (keep the character a[i]) op
// to the best WordDiff for (a[:i-1], b[:j-1])
1 + min (
X[i-1][j], // append the (subtract the character a[i]) op
// to the best WordDiff for (a[:i-1], b[j])
X[i][j-1], // append the (add the character b[j]) op
// to the best WordDiff for (a[:i], b[j-1])
X[i-1][j-1], // append the (modify the character a[i] to b[j]) op
// to the best WordDiff for (a[:i-1], b[:j-1])
) otherwise
}
We calculate the matrix by iterating over the j
variable first and then the i
variable:
for j in 0..(b.len() + 1) {
for i in 0..(a.len() + 1) {
// calculate X[i][j] using X[i-1][j], X[i][j-1] and X[i-1][j-1]
}
}
With this calculation order, we observe that X[i][j]
only depends on the last m+2
elements of X
that have
been calculated. Specifically,
X[i-1][j]
was calculated in the previous iteration of thei
loop, so1
iteration before.X[i][j-1]
was calculated in the previous iteration of thej
loop with the samei
index, som+1
iterations before because thei
variable takesm+1
values.X[i-1][j-1]
was calculatedm+2
iterations before.
So, we don’t need to store the full X
matrix at all: we just need to store the last m+2
elements that were calculated.
We use a circular buffer of size m+2
to do this.
Structs
Enums
Functions
- Find words in the provided dictionary that are close to the search word.