HIPpie — how to build a dictionary

Many thanks to those of you who’ve tested the code from yesterday! Those of you outside of the UK might want to see if this version works slightly faster for you:
hippie_spellcheck_v0.02.txt
The next thing I’ll be looking at is how to optimise the spellchecker dictionary for each library. Some of you will already have read this in the email I sent out this morning or in the comment I left previously, but I’m thinking of attacking it this way:
1) Start off with a standard word list (e.g. the 1000 most commonly used English words) to create the spellcheck dictionary for your library, as the vast majority should match something on your catalogue.
2) Add some extra code to your HIP so that all successful keyword searches get logged. Those keywords can then be added to your dictionary.
It could even be that starting with an empty dictionary might prove to be more effective (i.e. don’t bother with step 1) — just let the “network effect” of your users searching your OPAC generate the dictionary from scratch (how “2.0” is that?!)
To avoid any privacy issues, the code for capturing the successful keywords could be hosted locally on your own web server (I should be able to knock up suitable Perl and PHP scripts for you to use). Then, periodically, you’d upload your keyword list to HIPpie so that it can add the words to your spellchecker dictionary.
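As a rough illustration of the idea above (not the actual HIPpie scripts, which would be Perl or PHP), here is a minimal Python sketch of turning a locally kept search log into a keyword list ready for upload. The log layout (search terms, a tab, then the hit count) and the function names are assumptions made up for this example.

```python
import re


def keywords_from_log(log_lines):
    """Collect the keywords from searches that returned at least one hit.

    Each log line is assumed to look like "<search terms><TAB><hit count>".
    That layout is hypothetical; a real log could look quite different.
    """
    words = set()
    for line in log_lines:
        terms, _, hits = line.rstrip("\n").rpartition("\t")
        if not hits.isdigit() or int(hits) == 0:
            continue  # skip malformed lines and searches that found nothing
        words.update(re.findall(r"[a-z']+", terms.lower()))
    return words


def build_dictionary(seed_words, log_lines):
    """Merge a seed word list (which may be empty) with the logged keywords."""
    seed = {w.lower() for w in seed_words}
    return sorted(seed | keywords_from_log(log_lines))


# Example: two successful searches and one failed (misspelled) search.
log = [
    "harry potter\t42",
    "pnuemonia\t0",
    "Pneumonia\t17",
]
dictionary = build_dictionary(["the", "of"], log)
print(dictionary)
```

Passing an empty list as the seed corresponds to skipping step 1 and letting the searches build the dictionary from scratch.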
What about if you don’t have SirsiDynix HIP? Well, as mentioned previously, the spellchecker has been implemented as a web service (more info here), and the HIP spellchecker makes use of that web service to get a suggestion. At the moment it only returns text or XML, but I’m planning to add JSON as an option soon. Also, if you have a look at the HIP stylesheet changes, you can see the general flow of the code:
1) insert a div with an id of “hippie_spellchecker” into the HTML
2) make a call to “https://library.hud.ac.uk/hippie_perl/spellchecker2.pl” with your library ID (currently “demo”) and the search term(s) as the parameters
3) the call to “spellchecker2.pl” returns JavaScript to update the div from step 1
4) clicking on the spelling suggestion triggers the “hippie_search” JavaScript function which is responsible for creating a search URL suitable for the OPAC (which might include things like a session ID or an index to search)
None of the above 4 steps are specifically tied to the SirsiDynix HIP and should be transferable to other OPACs. I’ve put together a small sample HTML page that does nothing apart from pull in a suggestion using those 4 steps:
example001.html
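For anyone trying this with another OPAC, the URL-building parts of steps 2 and 4 can be sketched roughly as follows. This is only an illustration in Python (the real code is JavaScript in the stylesheet). The query parameter names ("library", "term", "session", "index") and the OPAC address are hypothetical, so check the web service documentation for the real ones.

```python
from urllib.parse import urlencode

SPELLCHECKER_URL = "https://library.hud.ac.uk/hippie_perl/spellchecker2.pl"


def spellcheck_request_url(library_id, search_terms):
    """Step 2: build the call to the spellchecker web service.

    Only the library ID and the search term(s) are sent.
    The parameter names "library" and "term" are assumptions.
    """
    query = urlencode({"library": library_id, "term": search_terms})
    return f"{SPELLCHECKER_URL}?{query}"


def opac_search_url(opac_base, suggestion, session_id=None, index=None):
    """Step 4: build a search link for the local OPAC from a suggestion.

    The session ID and index are added here, on the library's side,
    so they never need to be sent to the spellchecker service.
    All parameter names are hypothetical.
    """
    params = {"term": suggestion}
    if session_id:
        params["session"] = session_id
    if index:
        params["index"] = index
    return f"{opac_base}?{urlencode(params)}"


request_url = spellcheck_request_url("demo", "newmonia thrombrosis")
search_url = opac_search_url(
    "https://opac.example.org/search",  # placeholder OPAC address
    "pneumonia thrombosis",
    session_id="ABC123",
    index="keyword",
)
print(request_url)
print(search_url)
```

Keeping the two functions separate makes it easy to swap in whatever search URL format a particular OPAC expects without touching the spellchecker call.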
If you do want to have a go with your own OPAC, please let me know — at some point I’ll need people to register their libraries so that each can have their own dictionary, and I might start limiting the number of requests that any single IP address can make using the “demo” account. Also, it would be good to build up a collection of working implementations for different OPACs.

HIPpie “Did you mean?” ready for testing (again!)

A thousand apologies to those of you who’ve been waiting for HIPpie to reach the testing phase — has it really been 6 months since I last posted anything?! HIPpie was/is a project that I’ll be doing in my spare time and, unfortunately, since Christmas, my spare time has been taken up with everything but working on HIPpie!
Anyway, having realised that it’s so long since I posted anything, I was shamed into making some time and I’m now at the stage where some brave HIP 3.x library can alpha test the spellchecker code. Ideally you want to be doing this on a test HIP 3.x server, unless you’re feeling particularly reckless.
The usual caveats apply — make sure you safely back up any files you edit and you promise not to hold me responsible if your server room mysteriously burns down shortly after you add the code. Also, altering your XSL stylesheets may have an impact on what support SirsiDynix will be able to give you.
To test the code, you’ll need to edit the searchinput.xsl stylesheet. Once you’ve found the file, make a safe backup before you make any changes! Open up the file and scroll down to around line 580, where you should see a <center> tag. After that tag, you need to paste in the contents of this file:
hippie_spellcheck_v0.01.txt
Save the altered file and give your HIP server a minute to pick up the altered stylesheet. Now fire up a web browser and run a search for a misspelled word. If you get an error message, double-check the changes you made to the stylesheet and, if all else fails, revert to your backed-up version. Touch wood, you should get a “did you mean” suggestion which looks like this:
[Screenshot: hippie_v001]
If you do test the code, please feed back!
Notes
This test version of the code is using a fairly small American dictionary of words, so you may not get appropriate suggestions for your locale.

Spellchecker + Network Effect = Better Spellchecker?

I’ve been having a few email discussions relating to whether or not it’s best to use a standard dictionary of words for an OPAC spellchecker or an index created from the actual holdings of that library…
Standard dictionary
pros: correct spelling
cons: suggestion might not find any results, might not contain buzz/new words
Custom dictionary
pros: suggestions should find results
cons: will contain mis-spellings (e.g. “mangement”), needs regular updates, might be difficult to extract the words from ILS/LMS/OPAC
I’m beginning to think that the best of both worlds might be to start with a standard dictionary and then let your users/patrons build upon that. In other words, whenever someone carries out a successful keyword search on the OPAC, automatically add the keyword(s) they used to your dictionary so that they can appear as spelling suggestions in the future.
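To make the "best of both worlds" idea a bit more concrete, here is a small Python sketch. It uses the standard library's difflib for the fuzzy matching, which is just a stand-in and not necessarily how a real OPAC spellchecker would do it. The class and method names are invented for this example.

```python
import difflib


class GrowingDictionary:
    """A standard word list that grows as users run successful searches."""

    def __init__(self, seed_words=()):
        self.words = {w.lower() for w in seed_words}

    def learn(self, search_terms, hit_count):
        """Add the keywords of a search, but only if it found something."""
        if hit_count > 0:
            self.words.update(search_terms.lower().split())

    def suggest(self, word):
        """Return the closest known word, or None if nothing is close."""
        word = word.lower()
        if word in self.words:
            return None  # already a known word, nothing to suggest
        matches = difflib.get_close_matches(
            word, sorted(self.words), n=1, cutoff=0.8
        )
        return matches[0] if matches else None


spellchecker = GrowingDictionary(["management", "library", "history"])
before = spellchecker.suggest("folksonmy")     # new word, not known yet
spellchecker.learn("folksonomy tagging", 12)   # a patron's successful search
after = spellchecker.suggest("folksonmy")      # now the word can be suggested
print(before, after)
```

Note that this inherits the custom dictionary's weakness: if the catalogue itself contains a typo such as “mangement”, a search for it will succeed and the misspelling will be learned too.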
Any comments?

HIPpie “Did you mean?” ready for testing

I’ve just finished plugging the first bit of HIPpie into our test OPAC:
[Screenshot: hippie_spellchecker]
I’m gonna be out of the office for most of next week (3 days in London at Online Information 2007), but I’ll start contacting those of you who said you’d like to be involved with the testing. The test code just requires you to paste a short block of JavaScript into one of the HIP stylesheets (searchinput.xsl).
At present, the version I’ve plugged into our test OPAC uses a generic US word list, but the idea is to allow libraries to either upload their own word lists or choose from country specific ones.
Although the code needs to be able to create links that contain the HIP profile string and the session ID, neither of these is actually passed back to the server at Huddersfield (just in case session privacy is an issue).

HIPpie update (20/Nov/2007)

Just a quick update — I’ve not had too much spare time to work on HIPpie since announcing it, partly due to work and conference commitments, but I have been slowly beavering away.
The first chunk is some of the back-end code for the “did you mean” spell checker. To try and make the code as re-usable as possible (especially for other OPACs), the back-end has been coded so that it can be used as a standalone web service:
library.hud.ac.uk/wikis/hippie/index.php/Spell_checker
Various options can also be specified to affect the output, e.g. for “newmonia thrombrosis”:

The grand plan is that anyone who wants to make use of it (either as a web service or the code that will embed into HIP) will have an account. By logging into the account, they’ll be able to specify a dictionary to use (e.g. standard US English) or they’ll be able to upload their own word list (e.g. generated from the indexes in the ILS).
It’s still early days, but if anyone has any comments or suggestions, please get in touch!

(Hopefully) coming soon — HIPpie

In the last couple of months, I’ve had several email exchanges with Dynix & Horizon libraries who were interested in using some of the “2.0” features that I’ve added to our OPAC at Huddersfield, but the technical challenges (setting up an extra web server, MySQL database, etc) would have been too much.
I’ve been thinking for a while that some of the features could be done if someone else (e.g. me) were to handle the techie stuff. All the library would need to do would be to add a few lines of JavaScript to the relevant XSL stylesheets…
[Image: hippie]
HIPpie was the best name that I could think of in the bath last night, and (unless the SirsiDynix lawyers come down on me like a tonne of bricks) it stands for HIP patron interface enhancements (HIP being the product name of the Dynix and Horizon OPAC).
It’s still mostly vapourware (i.e. I haven’t finished writing the code yet), but if you’re running HIP version 2 or version 3 and you fancy adding any of the following to your OPAC, then please get in touch (email d.c.pattern [at] hud.ac.uk):

  • RSS feeds for keyword searches
  • “did you mean” spelling suggestions
  • email alerts for keyword searches
  • user reviews
  • user ratings

I’ve deliberately picked features that I don’t think are being offered via other channels (e.g. LibraryThing for Libraries or Jim Taylor).
Unfortunately HIP version 4 was never released in the UK, so I’m not sure how easy it would be to add the features to that version, but if there’s someone out there who’s familiar with the stylesheets and is willing to experiment…?
HIP is the only OPAC I’m intimately familiar with, but if other people can figure out ways of making the features work with other products, then that’d be cool.
HIPpie will be offered for free and will hopefully stay that way, unless it becomes incredibly popular.
Like I say, it ain’t ready yet, but please get in touch if you’re interested in testing it once it’s ready!