January 2006 – Self Plagiarism is Style

“Did You Mean?” for HIP 2 or 3

[update: we’re now using Aspell and the Text::Aspell module]
HIP 4 contains a spellchecking “did you mean?” facility which, although not as powerful as Googles, is certainly a step in the right direction. One of the basic rules of designing any web based system that supports searching or browsing is to always give the user choices — even if they have gone down a virtual one way street and hit a dead end.
Unfortunately it’s going to be another few months before SirsiDynix release the UK enhanced version of HIP 4 for beta testing, so I thought I’d have a stab at adding the facility to our existing HIP 3.04 server.
Fortunately Perl provides a number of modules for this kind of thing, including String::Approx, Text::Metaphone, and Text::Soundex.
String::Approx is good at catching simple typos (e.g. Hudersfield or embarassement) whereas the latter two modules attempt to find “sounds like” matches — for example, when given batched, Text::Metaphone suggests scratched, thatched and matched.
To set something like this up, you need to have a word list. You could download one (e.g. a list of dictionary words), but it makes more sense to generate your own — in my case I’ve parsed Horizon’s title table to create a list of keywords and frequency. That’s given me a list of nearly 67,000 keywords that all bring up matches in either a general or title keyword search.
Once I’d got the keyword list, I ran it through Text::Metaphone and Text::Soundex to generate the relevant phonetic values — doing that in advance means that your spellchecking code can run faster as it doesn’t need to generate the values again for each incoming request.
Next up, I wrote an Apache mod_perl handle to create the suggestions from a given search term. As String::Approx can often give the best results, the term is run against that first. If no suggestions are found, the term is run against Text::Metaphone and then Text::Soundex in turn to find broader “sounds like” suggestions.
Assuming that one of the modules comes up with a least one suggestion, then that gets displayed in HIP:

There’s still more work to do, as the suggestions only appear for a failed single keyword. Handling two misspelled words (or more) is technically challenging — what’s the best method of presenting all the possible options to a user? You could just give them a list of possibilities, but I’d prefer to give them something they can click on to initiate a new search.

Amazon.co.uk Greasemonkey script

This Greasemonkey script (based on Carrick Mundell‘s script) for the Firefox web browser adds details of (and links to) our library holdings to the Amazon.co.uk site:

To use the script, you’ll need to do the following:
1) Make sure you have the latest version of Firefox installed
2) Go to http://greasemonkey.mozdev.org/ and install the Greasemonkey extension

3) Close down all your Firefox browser windows and restart it
4) You should now see a little smiling Greasemonkey at the bottom right hand corner of the browser

5) Now go to https://library.hud.ac.uk/firefox/, click on the hudamazon.user.js script, and then click on the “Install” button.

6) Now go to Amazon.co.uk and search for some books!

~ o ~

For reference, here are the various messages that might appear:
1) Available…
Copies of the item are available.

2) Available in electronic format…
Access is available to an electronic version of the item.

3) Due back…
The item is currently on loan.

4) Other editions of this title are available…
We don’t have this specific edition of the item, but we do have others – click on the link to show them all.

5) ISBN not found…
The Amazon ISBN doesn’t match anything on our catalogue, but you can click on the link to start a title keyword search.

REST output from Huddersfield’s catalogue

Inspiried by John Blyberg’s middleware which provides REST output from the Ann Arbor catalogue, I’ve put together something similar for ours:

https://library.hud.ac.uk/rest/info.html

Here are some sample results:

http://161.112.232.203:4128/rest/keyword/author/tolkien
http://161.112.232.203:4128/rest/keyword/subject/yorkshire railways
http://161.112.232.203:4128/rest/record/id/411760
http://161.112.232.203:4128/rest/record/isbn/1904633048

I’ve decided to include more information in the output than John did — primarily because I want to use the REST output to power a staff OPAC. Amongst other things, the output includes:

borrowing suggestions (based on historical circ data)
links to cover scans and thumbnails
loan history data (at both the bib and item level)
“other edition” links (using the OCLC xISBN service)

The output is littered with xlink links, which can be used to issue further REST requests.
For those of you who like gory techie details, the REST output is generated by approx 1,000 lines of Perl coded as a single mod_perl handle. The code works by fetching the XML output from HIP and then parsing it to strip out anything that’s not required in the REST output. At the same time, extra information is pulled in from other sources (e.g. direct from the Horizon database, and from the xISBN service).
Unfortunately, looking at the XML output from other HIP servers, I doubt the code can quickly be used by other Horizon sites. Also, not everyone has their own mod_perl server to the the code on. However, if anyone wants to play around with the code then please send me an email (d.c.pattern [at] hud.ac.uk). There’s also a cloud on the Horizon (pun itended) relating to getting XML output out of HIP 4 — it seems Dynix have chosen to make it harder (not easier) to do this with the latest version of their OPAC (boo! hiss!).
I’ve already said that I’m planning to use the REST output to power a staff OPAC, but what I’m really keen on is letting our students loose on the data for use in final year projects, etc. I’m also planning to use the output for a revised version of the Amazon Greasemonkey script.
The University is gradually moving towards a portal environment and I’m hoping the REST output will come in handy for dropping live catalogue content into other systems.
There’s still quite a bit of work to do, especially with adding information for journals. We’ve already got live journal information from our SFX OpenURL server appearing in our OPAC, so I might as well include that in the REST output too:

Have fun!