Module texcraft_stdext::algorithms::spellcheck

Expand description

Spell checking using Levenshtein distance

This module contains a simple spell check facilty in the form a find_close_words function. This function accepts a word and a dictionary, which is a list of valid words. It find the words in the dictionary that are closest to the original world, where “closest” is defined using Levenshtein distance.

The result of the search is a list of WordDiffs. This data structure describes how the initial word can be changed to the dictionary word using a combination of keep, add, subtract and modify DiffOps.

Implementation notes

The Levenshtein distance calculation is implemented using dynamic programming and has the smallest possible space and time complexity (both O(n^2)). Note the space complexity could be O(n) if we only cared about the minimal distance between the strings, but we want to return the full WordDiff. We use only O(n) space during the calculation, but each WordDiff is also O(n), so we end up with O(n^2).

To see why we can only use O(n) space, let n = a.len() and m = b.len() and consider the n x m matrix X where X[i][j] is the comparison between a[:i] and b[:j]. The notation a[:i] means all characters in a between 0 and i inclusive. The solution is X[n][m]. The recursive relation is:

X[i][j] = {
    X[i-1][j-1] if a[i] == b[j]  // append the (keep the character a[i]) op
                                 //  to the best WordDiff for (a[:i-1], b[:j-1])
    1 + min (
        X[i-1][j],    // append the (subtract the character a[i]) op
                      //  to the best WordDiff for (a[:i-1], b[j])
        X[i][j-1],    // append the (add the character b[j]) op
                      //  to the best WordDiff for (a[:i], b[j-1])
        X[i-1][j-1],  // append the (modify the character a[i] to b[j]) op
                      //  to the best WordDiff for (a[:i-1], b[:j-1])
    ) otherwise
}

We calculate the matrix by iterating over the j variable first and then the i variable:

for j in 0..(b.len() + 1) {
    for i in 0..(a.len() + 1) {
        // calculate X[i][j] using X[i-1][j], X[i][j-1] and X[i-1][j-1]
    }
}

With this calculation order, we observe that X[i][j] only depends on the last m+2 elements of X that have been calculated. Specifically,

X[i-1][j] was calculated in the previous iteration of the i loop, so 1 iteration before.
X[i][j-1] was calculated in the previous iteration of the j loop with the same i index, so m+1 iterations before because the i variable takes m+1 values.
X[i-1][j-1] was calculated m+2 iterations before.

So, we don’t need to store the full X matrix at all: we just need to store the last m+2 elements that were calculated. We use a circular buffer of size m+2 to do this.

Structs

WordDiff

Enums

DiffOp

Functions

find_close_words
Find words in the provided dictionary that are close to the search word.