July 2011 – Self Plagiarism is Style

Librarian/Shambrarian Venn Diagram

To go with Ned’s “Great Library Stereotypometer”, which seems to be lacking one vital item, here’s a handy Venn Diagram…

You may find it useful to copy the diagram out onto a small piece of card and keep it about your person for reference purposes.
If you are a librarian and you meet a shambrarian:

DO ask questions such as “would you like some more cake?” and “what is your favourite cake anecdote?”
DO feel free to compliment the shambrarian if they are wearing a particularly witty t-shirt
DO NOT bore the shambrarian by talking about your recent holiday tour of “Ye Olde Gin Palaces of London Town” or by reciting verbatim your top 50 gin-based cocktail recipes
DO NOT attempt to sexually arouse the shambrarian by showing them photographs of library porn (e.g. this, this, this or this)
UNDER NO CIRCUMSTANCES should you say “if all the librarians got together, we could easily index the entire web… probably using an index card based system”

If you are a shambrarian and you meet a librarian:

DO ask them questions such as “where is your closed stack¹?” and “what is the Dewey classification for Chocolate Guinness Cake?”
DO feel free to compliment the librarian if you think that they have particularly nice cupcakes
DO NOT bore the librarian by showing them your Roy Tennant Fan Club membership card
DO NOT embarrass the librarian by asking them if “colon classification” means what you think it means
DO NOT attempt to sexually arouse the librarian by showing them photographs of shambrarian porn (e.g. this, this, this or this)
UNDER NO CIRCUMSTANCES should you say “Google Scholar is much better than that *very* expensive product your library just bought”

¹ The “closed stack” is where librarians store their cakes and usually has a “NO ENTRY — LIBRARIANS ONLY” sign on the door. If the librarian does not have enough room in their office, the closed stack may also be used to house the library’s gin distillery.

Co-operative Group advertising in Rupert Murdoch’s newspapers

A copy of the email I’ve just sent to the Co-operative, who I’ve banked with all my working life…

To: customer.relations@co-op.co.uk
Dear Sir or Madam
I am writing to express my disappointment that the Co-operative Group has decided to continue advertising in Rupert Murdoch’s newspapers: http://bit.ly/mzXiTI
As a banking customer of nearly 20 years, I have been extremely proud of your ethical stance. However, I believe that this is compromised by your public support of his newspapers at this time. By continuing to advertise in them, I also feel that the Co-operative Group is implicitly condoning their unlawful and highly immoral reporting practices.
In order to help me decide whether or not to close my bank account, I would be grateful if you could respond within 7 days with an explanation as to why you believe your continued advertising is in the best interest of your existing members and customers.
yours
Dave Pattern

Extending the availability messages in Summon

At the recent SummonCamp in New Orleans, there was a question about the local “Availability:” messages that appear in Summon for things like books, e.g.

Availability: available, Huddersfield (Loan Collection Floor 6 – 2 wk loan)

By default, Summon either scrapes your OPAC or makes use of an ILS/LMS API to get real-time availability. If neither is available, or if the OPAC takes too long to respond, a “check availability” message appears instead (which typically links through to the item page on the OPAC).
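As a rough illustration of that fallback behaviour (not Summon's actual code), here is a minimal Python sketch. The function name `availability_message` and the `lookup` callable are hypothetical stand-ins for whatever scraper or API client is in use; the only assumption is that a slow lookup raises `TimeoutError` and an unavailable one returns nothing.

```python
from typing import Callable, Optional

FALLBACK_MESSAGE = "check availability"


def availability_message(
    record_id: str,
    lookup: Callable[[str], Optional[str]],  # hypothetical scraper/API client
) -> str:
    """Return a live availability string, or the generic fallback message."""
    try:
        status = lookup(record_id)
    except TimeoutError:
        # The OPAC (or API) took too long to answer.
        status = None
    if not status:
        return FALLBACK_MESSAGE
    return f"Availability: {status}"
```

Passing the lookup in as a parameter keeps the fallback logic separate from whichever data source sits behind it.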
Early on in our Summon implementation, we were concerned about the potential impact on our OPAC — SirsiDynix HIP — of screen scraping. In particular, HIP wasn’t designed to be scraped like this or to be indexed by search engines (many Horizon sites deliberately block Google et al from indexing their HIP) and it creates a new session ID for each request. As each new session takes up some of the OPAC server’s resources, there’s a theoretical limit to the number of concurrent sessions the OPAC can maintain before slowing down (or even crashing). Also, if you’ve done a search in Summon that delivers 25 book results, it takes time for the OPAC to respond to the 25 HTTP requests generated by Summon, and so you often end up getting the “check availability” message anyway.
So, working with Andrew Nagy at Serials Solutions, we implemented a very basic DLF XML web service (code and brief documentation available here) that bypasses our OPAC and pulls the live availability data straight from the Horizon database. Not only does it ensure the OPAC doesn’t take a performance hit, it’s also extremely fast (especially if you run it using mod_perl with a persistent database connection to Horizon) — you can see a typical response (for this book) here: library.hud.ac.uk/perl/summon/dlf.pl?497856
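For readers who want a feel for what such a service returns, here is a rough Python sketch (it is not the Perl script linked above). It turns item rows, as they might be read from the ILS database, into a DLF-style availability document. The namespace and element names are assumptions based on the DLF ILS-DI simple availability format and should be checked against the specification; `build_dlf_response` and the shape of `items` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for DLF ILS-DI documents; verify against the spec.
DLF_NS = "http://diglib.org/ilsdi/1.1"
ET.register_namespace("dlf", DLF_NS)


def _tag(name):
    """Qualify an element name with the DLF namespace."""
    return f"{{{DLF_NS}}}{name}"


def build_dlf_response(record_id, items):
    """Build an availability document for one bibliographic record.

    items: iterable of (item_id, status, message) tuples, e.g. rows
    pulled straight from the ILS database (hypothetical shape).
    """
    collection = ET.Element(_tag("collection"))
    record = ET.SubElement(collection, _tag("record"))
    ET.SubElement(record, _tag("bibliographic"), {"id": record_id})
    for item_id, status, message in items:
        avail = ET.SubElement(record, _tag("simpleavailability"))
        ET.SubElement(avail, _tag("identifier")).text = item_id
        ET.SubElement(avail, _tag("availabilitystatus")).text = status
        ET.SubElement(avail, _tag("availabilitymsg")).text = message
    return ET.tostring(collection, encoding="unicode")
```

Because the function only formats data it is handed, the database query (and any persistent connection) can live elsewhere.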
In his Code4Lib Journal article, “Hacking Summon”, Michael B. Klein talks about enhancing an availability API to include extra info and even embedded hyperlinks. This would also be a great way of adding item-level hold/request functionality to Summon.
At Huddersfield, we’ve done something similar to Michael for our e-resource/database level links, e.g.:

Availability: available, online resource (University network login required)

To help with known item searching, we’ve created some dummy MARC records on our library catalogue for most of the resources listed on our e-resources wiki and these get pushed out to Summon (in the same way that book MARC records do). If the user clicks on the result, they get passed through to the relevant wiki page. However, we also decided we wanted to try and save the user a mouse-click by embedding the actual URL to the resource into the availability message.
To do this, we extended the DLF script so that it detects when an incoming availability request from Summon is for one of the dummy MARC records (rather than a book). The script then does the following:

as the link to the wiki page for that resource is part of the dummy MARC record (the 856 field), it pulls that URL from the record in Horizon
it then web scrapes that wiki page to extract the actual link to the e-resource (in this particular case, it’s an EZproxy’d link)
the DLF XML is then generated, including the link: library.hud.ac.uk/perl/summon/dlf.pl?646531
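The steps above can be sketched roughly in Python as follows. This is an illustration under stated assumptions, not the DLF script itself: the record is a plain dict with hypothetical `is_eresource` and `url_856` keys, `fetch` is a caller-supplied function that returns the wiki page's HTML, and the e-resource link is assumed to be the first link whose URL contains the text "ezproxy".

```python
from html.parser import HTMLParser
from typing import Callable, Optional


class FirstMatchingLink(HTMLParser):
    """Remember the first <a href> whose URL contains a marker string."""

    def __init__(self, marker: str) -> None:
        super().__init__()
        self.marker = marker
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            self.href = href


def resolve_eresource_link(
    record: dict,
    fetch: Callable[[str], str],  # hypothetical: URL in, page HTML out
    marker: str = "ezproxy",      # assumed way of spotting the proxied link
) -> Optional[str]:
    """Return the direct e-resource URL for a dummy record, else None."""
    if not record.get("is_eresource"):
        return None  # an ordinary book: use the normal item lookup instead
    wiki_url = record["url_856"]  # the wiki link stored in the 856 field
    parser = FirstMatchingLink(marker)
    parser.feed(fetch(wiki_url))
    # If no proxied link is found, fall back to the wiki page itself.
    return parser.href or wiki_url
```

The returned URL would then be embedded in the availability message when the XML is generated.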

One thing that we’ve not done yet, but plan to do, is to include an extra step that queries our E-Resources Blog to check if there are any known problems for that e-resource. If there are, then a link through to the relevant blog post would also be included.

Relevancy ranking in Summon

Yesterday, Tim Fletcher tweeted me a question about Summon:

How does Summon rank results? is there a logic?

…it’s not the kind of question that you can answer in 140 characters, but I quickly knocked off an email to Tim. This morning David F. Flanders suggested I should also blog the response.
So, first of all, a quick caveat: much of the following was gleaned from various presentations over the last couple of years or so and may not be 100% accurate (I’m particularly good at misunremembering stuff!)
The first time I saw Summon (back in early 2009), I believe Serials Solutions were still using the default relevancy ranking that comes with the Open Source Lucene software (which is documented here). In a nutshell, Lucene generates a score for each indexed item (that matches the search query) and then those items are sorted by score (in descending order) to produce the ranked results.
I’ve read quite a few times that the relevancy ranking engine in Lucene is regarded as one of the best, which might be one of the reasons why SirsiDynix recently moved Enterprise from using Brainware to Lucene.
When you mention Lucene, chances are Solr won’t be too far behind. Solr (which is also Open Source) extends Lucene to provide a host of extra features, including facets.
As Summon has developed, and in response to customer feedback, Serials Solutions have gradually tweaked the way their Lucene installation generates the scores by giving each result an additional boost (or reduction) depending on a variety of factors, including:

Currency – newer items are given a slight boost over older items
Content type – books, ebooks and journal articles get a boost to their scores, whilst newspaper articles and book reviews have their scores reduced
Local collections – things that come from the user’s library (e.g. books, repository items, local archives, etc) get a little boost
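To make the idea of boosting concrete, here is a small Python sketch. It is not how Serials Solutions implement it: every multiplier below is a made-up illustrative value, and the `Result` fields and function names are hypothetical. It simply shows a base score being nudged up or down by the three factors listed above before the results are sorted.

```python
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    base_score: float   # score as produced by the search engine
    year: int
    content_type: str
    is_local: bool


# Illustrative values only; the real boosts are not public knowledge here.
CONTENT_TYPE_BOOST = {
    "book": 1.2,
    "ebook": 1.2,
    "journal article": 1.2,
    "newspaper article": 0.8,
    "book review": 0.8,
}
LOCAL_BOOST = 1.1


def adjusted_score(result: Result, current_year: int) -> float:
    """Apply currency, content type and local collection boosts."""
    age = max(0, current_year - result.year)
    currency = 1.0 + 0.1 / (1 + age)  # newer items get a slightly larger boost
    content = CONTENT_TYPE_BOOST.get(result.content_type, 1.0)
    local = LOCAL_BOOST if result.is_local else 1.0
    return result.base_score * currency * content * local


def rank(results, current_year: int):
    """Sort results by adjusted score, highest first."""
    return sorted(results, key=lambda r: adjusted_score(r, current_year), reverse=True)
```

With these example numbers, a book can outrank a newspaper article that started with a slightly higher base score.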

Additionally, the Summon search engine handles certain words and phrases differently. For example, Lucene normally treats the singular and plural versions of words as the same, so searches for “africa hospital” and “africas hospitals” both bring back roughly the same number of results. However, Summon understands that “africa aid” isn’t the same thing as “africa aids”.
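One way to picture this is a plural-stripping step with a list of protected words. The Python sketch below is a deliberately naive illustration, not Lucene's or Summon's actual algorithm; the `PROTECTED` set and both function names are hypothetical.

```python
# Words that must not be reduced to a singular form (hypothetical list).
PROTECTED = {"aids"}


def normalise(term: str) -> str:
    """Lower-case a term and strip a simple trailing plural 's'."""
    t = term.lower()
    if t in PROTECTED:
        return t
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        return t[:-1]
    return t


def normalise_query(query: str) -> list:
    """Normalise every whitespace-separated term in a query."""
    return [normalise(t) for t in query.split()]
```

Here “africas hospitals” and “africa hospital” normalise to the same terms, while “africa aids” and “africa aid” stay distinct because “aids” is protected.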
Given that few users go beyond the first page of results (I was told the exact figure last week, but it’s slipped from my memory — I think it was less than 5%?), Serials Solutions put a lot of effort into trying to ensure that the most relevant results appear on that first page. Given that the Summon master index is fast approaching 1,000,000,000 items, that’s no trivial task!
As they say, the proof of the pudding is in the eating, so feel free to run some searches on our Summon instance to see how well you think it ranks the results.