[update: we’re now using Aspell and the Text::Aspell module]
HIP 4 contains a spellchecking “did you mean?” facility which, although not as powerful as Google’s, is certainly a step in the right direction. One of the basic rules of designing any web-based system that supports searching or browsing is to always give the user choices, even if they have gone down a virtual one-way street and hit a dead end.
Unfortunately it’s going to be another few months before SirsiDynix release the UK enhanced version of HIP 4 for beta testing, so I thought I’d have a stab at adding the facility to our existing HIP 3.04 server.
Fortunately Perl provides a number of modules for this kind of thing, including String::Approx, Text::Metaphone, and Text::Soundex.
String::Approx is good at catching simple typos (e.g. Hudersfield or embarassement) whereas the latter two modules attempt to find “sounds like” matches — for example, when given batched, Text::Metaphone suggests scratched, thatched and matched.
To set something like this up, you need a word list. You could download one (e.g. a list of dictionary words), but it makes more sense to generate your own. In my case I’ve parsed Horizon’s title table to create a list of keywords and their frequencies. That’s given me a list of nearly 67,000 keywords that all bring up matches in either a general or title keyword search.
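As a rough illustration of building such a list (a Python sketch, not the actual Perl code), the snippet below counts how many titles each keyword appears in. The sample titles, the `keyword_frequencies` name and the "letters only, lower-cased" tokenising rule are all assumptions made for the example. The real data came from Horizon’s title table, and the post doesn’t say exactly how frequency was counted.

```python
import re
from collections import Counter

# Hypothetical sample titles; the real list was parsed from Horizon's title table.
TITLES = [
    "A history of Huddersfield",
    "Huddersfield: a history in photographs",
    "The thatched cottage",
]


def keyword_frequencies(titles):
    """Return a Counter mapping each keyword to the number of titles containing it."""
    counts = Counter()
    for title in titles:
        # Assumption: a keyword is any run of ASCII letters, lower-cased.
        # set() means a word repeated within one title is only counted once.
        words = set(re.findall(r"[a-z]+", title.lower()))
        counts.update(words)
    return counts


freq = keyword_frequencies(TITLES)
# Print the keywords, most frequent first (ties in alphabetical order).
for word, count in sorted(freq.items(), key=lambda item: (-item[1], item[0])):
    print(word, count)
```

Keeping the frequency alongside each keyword is handy later on, since it gives you one possible way to rank suggestions when several candidates match.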
Once I’d got the keyword list, I ran it through Text::Metaphone and Text::Soundex to generate the relevant phonetic values — doing that in advance means that your spellchecking code can run faster as it doesn’t need to generate the values again for each incoming request.
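To show what “generating the values in advance” might look like, here is a small Python sketch (again, not the original Perl) that computes a classic four-character Soundex code for each keyword and stores the results in a lookup table. The `soundex` function is a hand-written version of the standard American Soundex rules rather than the Text::Soundex module itself, and the `build_index` name and the sample keywords are made up for the example.

```python
import re
from collections import defaultdict

# Letter-to-digit table used by classic American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code for word ('' if it has no letters)."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels and y have no code, so they reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]  # pad with zeros, then truncate to four characters


def build_index(words):
    """Precompute a code -> [keywords] table so each request is a single lookup."""
    index = defaultdict(list)
    for w in words:
        index[soundex(w)].append(w)
    return index


# Hypothetical keywords standing in for the real list.
INDEX = build_index(["huddersfield", "embarrassment", "thatched"])
print(INDEX[soundex("hudersfield")])
```

At request time only the incoming search term needs to be encoded. Everything else is a dictionary lookup against values worked out beforehand.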
Next up, I wrote an Apache mod_perl handler to create the suggestions from a given search term. As String::Approx can often give the best results, the term is run against that first. If no suggestions are found, the term is run against Text::Metaphone and then Text::Soundex in turn to find broader “sounds like” suggestions.
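The “try the strictest matcher first, then fall back” logic can be sketched in a few lines of Python. This isn’t the mod_perl handler itself. Python’s `difflib.get_close_matches` stands in for String::Approx, and a deliberately crude “consonant skeleton” key stands in for the Metaphone and Soundex stages. The keyword list, the function names and the 0.8 cutoff are assumptions picked for the example.

```python
import difflib
import re

# Hypothetical keywords standing in for the real keyword list.
KEYWORDS = ["huddersfield", "embarrassment", "beautiful", "matched", "thatched"]


def approx_matches(term, words):
    """Typo-level matches (stand-in for String::Approx)."""
    return difflib.get_close_matches(term, words, n=5, cutoff=0.8)


def skeleton(word):
    """Crude phonetic key: first letter plus the remaining consonants,
    with runs of the same letter collapsed (stand-in for Metaphone/Soundex)."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    rest = re.sub(r"[aeiouyhw]", "", word[1:])
    rest = re.sub(r"(.)\1+", r"\1", rest)
    return word[0] + rest


def phonetic_matches(term, words):
    """Broader "sounds like" matches: keywords sharing the term's skeleton key."""
    key = skeleton(term)
    return [w for w in words if skeleton(w) == key]


def suggest(term, words, matchers=(approx_matches, phonetic_matches)):
    """Run each matcher in turn and return the first non-empty list of suggestions."""
    term = term.lower()
    for matcher in matchers:
        found = matcher(term, words)
        if found:
            return found
    return []


print(suggest("hudersfield", KEYWORDS))
print(suggest("butifel", KEYWORDS))
```

Passing the matchers in as a tuple keeps the ordering explicit, so adding a third stage or swapping one out doesn’t mean touching the loop.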
Assuming that one of the modules comes up with at least one suggestion, that suggestion gets displayed in HIP.
There’s still more work to do, as the suggestions only appear for a failed single-keyword search. Handling two (or more) misspelled words is technically challenging: what’s the best method of presenting all the possible options to a user? You could just give them a list of possibilities, but I’d prefer to give them something they can click on to initiate a new search.