Library Stuff – Page 3 – Self Plagiarism is Style

Librarian/Shambrarian Venn Diagram

To go with Ned’s “Great Library Stereotypometer”, which seems to be lacking one vital item, here’s a handy Venn Diagram…

You may find it useful to copy the diagram out onto a small piece of card and keep about your person for reference purposes.
If you are a librarian and you meet a shambrarian:

DO ask questions such as “would you like some more cake?” and “what is your favourite cake anecdote?”
DO feel free to compliment the shambrarian if they are wearing a particularly witty t-shirt
DO NOT bore the shambrarian by talking about your recent holiday tour of “Ye Olde Gin Palaces of London Town” or by reciting verbatim your top 50 gin based cocktail recipes
DO NOT attempt to sexually arouse the shambrarian by showing them photographs of library porn (e.g. this, this, this or this)
UNDER NO CIRCUMSTANCES should you say “if all the librarians got together, we could easily index the entire web… probably using an index card based system”

If you are a shambrarian and you meet a librarian:

DO ask them questions such as “where is your closed stack¹?” and “what is the Dewey classification for Chocolate Guinness Cake?”
DO feel free to compliment the librarian if you think that they have particularly nice cupcakes
DO NOT bore the librarian by showing them your Roy Tennant Fan Club membership card
DO NOT embarrass the librarian by asking them if “colon classification” means what you think it means
DO NOT attempt to sexually arouse the librarian by showing them photographs of shambrarian porn (e.g. this, this, this or this)
UNDER NO CIRCUMSTANCES should you say “Google Scholar is much better than that *very* expensive product your library just bought”

¹ The “closed stack” is where librarians store their cakes and usually has a “NO ENTRY — LIBRARIANS ONLY” sign on the door. If the librarian does not have enough room in their office, the closed stack may also be used to house the library’s gin distillery.

Extending the availability messages in Summon

At the recent SummonCamp in New Orleans, there was a question about the local “Availability:” messages that appear in Summon for things like books, e.g.

Availability: available, Huddersfield (Loan Collection Floor 6 – 2 wk loan)

By default, Summon either scrapes your OPAC or makes use of an ILS/LMS API to get real time availability. If neither is available, or if the OPAC takes too long to respond, a “check availability” message appears instead (which typically links through to the item page on the OPAC).
Early on in our Summon implementation, we were concerned about the potential impact on our OPAC — SirsiDynix HIP — of screen scraping. In particular, HIP wasn’t designed to be scraped like this or to be indexed by search engines (many Horizon sites deliberately block Google et al from indexing their HIP) and it creates a new session ID for each request. As each new session takes up some of the OPAC server’s resources, there’s a theoretical limit to the number of concurrent sessions the OPAC can maintain before slowing down (or even crashing). Also, if you’ve done a search in Summon that delivers 25 book results, it takes time for the OPAC to respond to the 25 HTTP requests generated by Summon, and so you often end up getting the “check availability” message anyway.
So, working with Andrew Nagy at Serials Solutions, we implemented a very basic DLF XML web service (code and brief documentation available here) that bypasses our OPAC and pulls the live availability data straight from the Horizon database. Not only does it ensure the OPAC doesn’t take a performance hit, it’s also extremely fast (especially if you run it using mod_perl with a persistent database connection to Horizon) — you can see a typical response (for this book) here: library.hud.ac.uk/perl/summon/dlf.pl?497856
In his Code4Lib Journal article — “Hacking Summon” — Michael B. Klein talks about enhancing an availability API to include extra info and even embedded hyperlinks. This would also be a great way of including item level hold/request functionality into Summon.
At Huddersfield, we’ve done something similar to Michael for our e-resource/database level links, e.g.:

Availability: available, online resource (University network login required)

To help with known item searching, we’ve created some dummy MARC records on our library catalogue for most of the resources listed on our e-resources wiki and these get pushed out to Summon (in the same way that book MARC records do). If the user clicks on the result, they get passed through to the relevant wiki page. However, we also decided we wanted to try and save the user a mouse-click by embedding the actual URL to the resource into the availability message.
To do this, we extended the DLF script so that it detects when an incoming availability request from Summon is for one of the dummy MARC records (rather than a book). The script then does the following:

as the link to the wiki page for that resource is part of the dummy MARC record (the 856 field), it extracts that URL from the record in Horizon
it then web scrapes that wiki page to extract the actual link to the e-resource (in this particular case, it’s an EZproxy’d link)
the DLF XML is then generated, including the link: library.hud.ac.uk/perl/summon/dlf.pl?646531
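
As a rough illustration of the second step (scraping the wiki page for the proxied link) and of the message that ends up in the response, here is a minimal Python sketch. It is not the actual script, which is written in Perl. The proxy hostname, the sample HTML and the function names are all hypothetical, and it assumes the first proxied link on the wiki page is the one that's wanted.

```python
from html.parser import HTMLParser

# Hypothetical EZproxy hostname; the real one is not given in this post.
PROXY_HOST = "ezproxy.example.ac.uk"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_proxied_link(wiki_html):
    """Return the first link that goes via the proxy, or None if there isn't one."""
    collector = LinkCollector()
    collector.feed(wiki_html)
    for href in collector.hrefs:
        if PROXY_HOST in href:
            return href
    return None


def availability_message(link):
    """Build the text shown after "Availability:" for an e-resource record."""
    if link is None:
        # Fall back to the generic message when no link could be scraped.
        return "check availability"
    return f"available, online resource ({link})"


# Made-up wiki page: one internal link and one proxied "Access Link".
sample_page = """
<p><a href="/wiki/Help">Help</a></p>
<p><a href="https://ezproxy.example.ac.uk/login?url=https://www.example.com/db">Access Link</a></p>
"""

print(availability_message(find_proxied_link(sample_page)))
```

In a real service the message (and link) would be wrapped in the DLF XML rather than printed, and the scrape would need a sensible timeout so that a slow wiki doesn't hold up the availability response.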

One thing that we’ve not done yet, but plan to do, is to include an extra step that queries our E-Resources Blog to check if there are any known problems for that e-resource. If there were, then a link through to the relevant blog post would also be included.

Relevancy ranking in Summon

Yesterday, Tim Fletcher tweeted me a question about Summon:

How does Summon rank results? is there a logic?

…it’s not the kind of question that you can answer in 140 characters, but I quickly knocked off an email to Tim. This morning David F. Flanders suggested I should also blog the response.
So, first of all, a quick caveat: much of the following was gleaned from various presentations over the last couple of years or so and may not be 100% accurate (I’m particularly good at misunremembering stuff!)
The first time I saw Summon (back in early 2009), I believe Serials Solutions were still using the default relevancy ranking that comes with the Open Source Lucene software (which is documented here). In a nutshell, Lucene generates a score for each indexed item (that matches the search query) and then those items are sorted by score (in descending order) to produce the ranked results.
I’ve read quite a few times that the relevancy ranking engine in Lucene is regarded as one of the best, which might be one of the reasons why SirsiDynix recently moved Enterprise from using Brainware to Lucene.
When you mention Lucene, chances are Solr won’t be too far behind. Solr (which is also Open Source) extends Lucene to provide a host of extra features, including facets.
As Summon has developed, and in response to customer feedback, Serials Solutions have gradually tweaked the way their Lucene installation generates the scores by giving each result an additional boost (or reduction) depending on a variety of factors, including:

Currency – newer items are given a slight boost over older items
Content type – books, ebooks and journal articles get a boost to their scores, whilst newspaper articles and book reviews have their scores reduced
Local collections – things that come from the user’s library (e.g. books, repository items, local archives, etc) get a little boost
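
To make the idea of boosting a bit more concrete, here is a minimal Python sketch of a base score being nudged up or down by the three factors listed above. It is purely illustrative: every multiplier, the five-year "recent" window and all of the names are invented, and it is not how Serials Solutions actually implement their ranking.

```python
from dataclasses import dataclass

# All of these numbers are invented for illustration only.
CONTENT_TYPE_BOOST = {
    "book": 1.2,
    "ebook": 1.2,
    "journal article": 1.2,
    "newspaper article": 0.8,
    "book review": 0.8,
}
LOCAL_BOOST = 1.1    # item comes from the user's own library
RECENT_BOOST = 1.05  # item was published within RECENT_YEARS
RECENT_YEARS = 5


@dataclass
class Result:
    title: str
    base_score: float  # the score the search engine produced for the query
    content_type: str
    year: int
    is_local: bool


def boosted_score(result, current_year):
    """Apply the content type, local collection and currency boosts."""
    score = result.base_score
    score *= CONTENT_TYPE_BOOST.get(result.content_type, 1.0)
    if result.is_local:
        score *= LOCAL_BOOST
    if current_year - result.year <= RECENT_YEARS:
        score *= RECENT_BOOST
    return score


def rank(results, current_year):
    """Sort results by boosted score, highest first."""
    return sorted(results, key=lambda r: boosted_score(r, current_year), reverse=True)


results = [
    Result("A review of the book", 1.0, "book review", 2011, False),
    Result("The book itself", 0.9, "book", 2011, True),
]
for r in rank(results, 2011):
    print(r.title)
```

In this toy example the book overtakes the review despite starting with a lower base score, which is the kind of reordering the boosts are meant to achieve.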

Additionally, the Summon search engine handles certain words and phrases differently. For example, Lucene normally treats the singular and plural version of words as the same, so searches for “africa hospital” and “africas hospitals” both bring back roughly the same number of results. However, Summon understands that “africa aid” isn’t the same thing as “africa aids”.
Given that few users go beyond the first page of results (I was told the exact figure last week, but it’s slipped from my memory — I think it was less than 5%?), Serials Solutions put a lot of effort into trying to ensure that the most relevant results appear on that first page. Given that the Summon master index is fast approaching 1,000,000,000 items, that’s no trivial task!
As they say, the proof of the pudding is in the eating, so feel free to run some searches on our Summon instance to see how well you think it ranks the results.

Hurricanes and shrimp po’boys (part 1)

I’m jetlagged (this is the first time I’ve had jetlag that feels like being drunk) and still coming down from an ALA-induced high, but here goes a blog post!
I’m currently fortunate enough to be a member of the Serials Solutions Summon Advisory Board, and last week saw the fourth pre-ALA meeting, this time in the one and only New Orleans, the home of hurricane cocktails, shrimp po’boys, high heat & humidity and more seafood than you can shake a stick at…

(seafood platter at the Grand Isle Restaurant)
Summon Advisory Board notes

there are now more than 250 Summon customers around the world
the company is currently concentrating on comprehensiveness (in terms of coverage and seamless access to articles)
gone are the days when Serials Solutions had to approach publishers and argue the case for them to make their content available in Summon — most publishers now realise the value and are approaching the company directly to have their content added
John Law’s mantra is currently “relevancy, relevancy, relevancy!” — with 800,000,000 items in Summon, relevancy is key to ensuring the user gets the right articles on the first page of results
it wasn’t until I saw some demo searches that the awesomeness of the deal with HathiTrust Collection integration began to sink in — librarians of the world, this truly is a game changer! (on a practical note, it’s going to take Serials Solutions a little while to complete the indexing of the entire HathiTrust Collection)
a pilot with JSTOR means that a Summon search box is integrated into the JSTOR web site interface — it appears when a JSTOR search produces only a small number (or zero) results, so that the user’s search can be expanded to other journal platforms
due to being en route from the UK to New Orleans, I’d missed this announcement, but the long-awaited deal with Elsevier has been signed
for journal articles, Serials Solutions create “super records” that combine the best metadata from multiple sources — this is de-duping on steroids!
coming soon — discipline searching (currently 63 subject disciplines have been defined, which work at the journal title and journal article level)
coming soon — new article linking improvements (when relevant, Summon results will link directly to the article abstract page on the supplier’s platform, instead of using OpenURLs)
Daniel Forsman (Chalmers University of Technology, Gothenburg, Sweden) suggested that we should promote Summon to our users as being more comprehensive than Google Scholar
although librarians often get hung-up on what’s not in Summon, some analysis by a Summon customer indicated that the non-indexed content is often low quality “filler material” added by aggregator platforms to bump up journal totals

(a bourbon nightcap after the Advisory Board Meeting)

EZproxy and Summon

I’ll flesh out this blog post later on today, but just wanted to post some screenshots (partly as a rebuttal to Nicole’s blog post “Some thoughts about (authentication) discovery aimed at librarians”) to show how well EZproxy fits as the authentication layer between a discovery service (such as Summon) and journal articles on publisher sites.
As Nicole well knows, I’m not a librarian and I couldn’t give two hoots about the “official CILIP endorsed librarian way of doing things” (n.b. my quote, not Nicole’s) when it comes to e-resource access. All I care about is trying to get the user to where they want to be (e.g. the full text of a journal article) with the least number of mouse clicks, and the least amount of swearing, frustration and death-threats against the library for making it so flippin’ difficult ;-D
[edit] Apologies — I didn’t mean to imply that librarians don’t care about users. I just took offence when I felt Nicole’s post implied this was a librarian problem and/or that librarians were the root cause of the problem. As I’m not a librarian myself, I felt it was wrong to imply that anything I say or do is endorsed by, or represents, librarianship in general, or is the way a librarian would choose to do it. To the best of my knowledge, librarians prefer not to have barriers (such as stupidly complicated publisher log in pages) in the way when it comes to accessing information.
This first example is about as good as it gets. A student uses Summon to locate an article (“Ethics, Public Policy, and Global Warming”)…

…when they click on the article link, Summon opens a new browser window and passes the OpenURL details for the article to the link resolver (360 Link). If the user isn’t already authenticated (e.g. by accessing Summon via the University Portal or via the VLE), they’ll need to log in. If they have already authenticated, then they don’t see this screen at all.
The login process logs the user into EZproxy, and also establishes an Athens session in the background (which isn’t required to access the article, but might be useful if they end up wandering off to look at other resources)…

…as this particular article is on JSTOR, the user is able to view the article straight away (via the “Page Scan” preview) or they can choose to download the PDF…

So, from Summon, there’s either a single click (if the user has already authenticated) or two clicks (if the user needs to log in) to get to the full-text (or a page that has a link to the PDF). Ignoring the ethical/moral/technical/philosophical issues of using a proxy solution instead of Shib, I think this is as good as it gets for students.
If they do have to authenticate, it’s a familiar login page and they’re not having to figure out which link on the publisher’s web site to use — do they try putting their university network login details into the username & password fields (1), do they scroll through a list of nearly 200 institutions (2) to find Huddersfield (and are we “University of Huddersfield” or “Huddersfield University”?) …or can they remember that the librarian told them to look for the “Athens” link (3) during the library induction all those months ago?

Plus, if they’ve found this article via Google Scholar, how do they even know if they have access to it? If you want to frustrate a student, nothing does it better than pointing them at a useful article that they can’t access ;-D
This doesn’t mean that there isn’t a role for Shib/Athens, but I feel it’s a different part of the jigsaw puzzle. If I’m an off-campus Huddersfield student wanting to get to ScienceDirect, there’s lots of ways to get there, but one of the simplest is to just Google “science direct huddersfield” (we don’t tell students about EZproxy, so they would never include that as a search term on Google)…

…where the first result takes them through to our electronic resources wiki page for ScienceDirect (which is where most of the other routes end up)…

…the first “Access Link” is the Athens link to ScienceDirect and the (slightly superfluous) note beneath is really just for students who’ve gone directly to ScienceDirect and who aren’t sure which of the various login options to select on the site.

CILIP Cymru Conference

I’m journeying down to Llandrindod Wells tomorrow to give a presentation about usage data to the Welsh Libraries, Archives and Museums Conference (hashtag #cilipw11). I’ve been promised that there’ll be real ale there 🙂
You can grab a draft copy of my presentation (“If you want to get laid, go to college…”) from here (15MB).
The main web links in the presentation are:
– JISC Library Impact Data Project
– JISC Activity Data Programme (including a list of the projects)
– Rufus Pollock (Open Data and Componentization, XTech 2007)
– Paul Walk (“The coolest thing to do with your data will be thought of by someone else”)
– University of Huddersfield – Open Data Release (from Dec 2008)

Are books becoming more important?

Being a shamistician, rather than a statistician, I’m not sure how much importance to attach to this, but I thought it was interesting enough to share!
The JISC Library Impact Data Project has given us an opportunity to churn through our usage data again and, following on from the last blog post, we’ve been looking at the statistical significance (if any!) of the correlations we’re finding in the data.
The book loan data for the last 5 years of undergrads (who graduated with a specific honour) has a small overall Pearson correlation of -0.17 (see this blog post for an explanation of why it’s negative) with a high statistical significance (p-value of 0). However, when we looked at just the 2009/10 data (which is the period the other project partners are providing data for), we found a stronger correlation (-0.20).
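
To make the statistic concrete, here is a minimal Python sketch of how a Pearson correlation is calculated. The figures are made-up toy data, not project data, and coding the degree classes as the numbers 1 to 4 (1 for a first, 4 for a third) is an assumption made purely for this illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from each mean.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Toy data (invented): one entry per student.
grades = [1, 1, 2, 2, 3, 3, 4, 4]     # 1 = first ... 4 = third (assumed coding)
loans = [30, 12, 25, 8, 20, 5, 10, 2]  # books borrowed by that student

r = pearson(grades, loans)
print(f"Pearson correlation: {r:.2f}")
```

With this coding, a negative result simply means that the students with the numerically lower class number tend to have more loans. The value always lies between -1 and 1, and the closer it is to 0, the weaker the relationship.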
If we go a step further and look at the Pearson correlation each year, there appears to be a possible underlying trend at Huddersfield…

If you accept that there might be a trend there (with the Pearson correlation value increasing over time), then it raises an interesting question… are books becoming an increasingly important part of achieving a higher grade?

5 years of book loans and grades

I’m just starting to pull our data out for the JISC Library Impact Data Project and I thought it might be interesting to look at 5 years of grades and book loans. Unfortunately, our e-resource usage data and our library visits data only goes back as far as 2005, but our book loan data goes back to the mid 1990s, so we can look at a full 3 years of loans for each graduating student.
The following graph shows the average number of books borrowed by undergrad students who graduated with a specific honour (1, 2:1, 2:2 or 3) in that particular academic year…

…and, to try and tease out any trends, here’s a line graph version…

Just a couple of general comments:

the usage & grade correlation (see original blog post) for books seems to be fairly consistent over the last 5 years, although there is a widening in the usage by the lowest & highest grades
the usage by 2:2 and 3 students seems to be in gradual decline, whilst usage by those who gain the highest grade (1) seems to be on the increase

Sliding down the long tail

At a recent event in Edinburgh, I was asked about how we generate the “people who borrowed this, also borrowed…” suggestions in our OPAC and whether or not there are privacy issues with generating them.
Last week, I popped over to Manchester for a meeting of the JISC funded SALT (Surfacing the Academic Long Tail), which is one of the recently funded Activity Data projects. Part of the discussion at the meeting was around how to generate recommendations for items that haven’t circulated many times.
At both events, I promised to put together a blog post detailing the method we use, so here it is!
To generate recommendations for book A, we find every person who’s borrowed that book. Just to simplify things, let’s say only 4 people have borrowed that book. We then find every book that those 4 people have borrowed. As a Venn diagram, where each set represents the books borrowed by that person, it’d look like…

To generate useful and relevant recommendations (and also to help protect privacy), we set a threshold and ignore anything below that. So, if we decide to set the threshold at 3 or more, we can ignore anything in the red and orange segments, and just concentrate on the yellow and green intersections…

There’ll always be at least one book in the green intersection — the book we’re generating the recommendations for, so we can ignore that.
If we sort the books that appear in those intersections by how many borrowers they have in common (in descending order), we should get a useful list of recommendations. For example, if we do this for “Social determinants of health” (ISBN 9780198565895), we get the following titles (the figures in square brackets are the number of people who borrowed both books and the total number of loans for the suggested book)…

Health promotion: foundations for practice [43 / 1312]
The helping relationship: process and skills [41 / 248]
Skilled interpersonal communication: research, theory and practice [31 / 438]
Public health and health promotion: developing practice [29 / 317]
The sociology of health and illness [29 / 188]
Promoting health: a practical guide [28 / 704]
Sociology: themes and perspectives [28 / 612]
Understanding social problems: issues in social policy [28 / 300]
Psychology: the science of mind and behaviour [27 / 364]
Health policy for health care professionals [25 / 375]
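
For anyone who prefers code to Venn diagrams, the method described above can be sketched in a few lines of Python. This is an illustration rather than the code behind the OPAC: the loan data is a made-up toy example, and the function and variable names are hypothetical.

```python
from collections import Counter


def recommend(book, loans, threshold):
    """Suggest books for `book` based on shared borrowers.

    `loans` maps a borrower id to the set of books that person has borrowed.
    Returns (title, borrowers in common) pairs, most borrowers in common first.
    """
    # 1. Find everyone who has borrowed the book.
    borrowers = [person for person, books in loans.items() if book in books]

    # 2. Count how many of those borrowers also borrowed each other book.
    in_common = Counter()
    for person in borrowers:
        for other in loans[person]:
            if other != book:  # ignore the book we started from
                in_common[other] += 1

    # 3. Drop anything below the threshold, then sort by the count.
    kept = [(title, n) for title, n in in_common.items() if n >= threshold]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept


# Toy data: four people borrowed book A, a fifth person did not.
loans = {
    "p1": {"A", "B", "C"},
    "p2": {"A", "B", "C", "D"},
    "p3": {"A", "B", "E"},
    "p4": {"A", "B", "C"},
    "p5": {"B", "F"},
}

print(recommend("A", loans, threshold=3))
```

Here books D and E are each shared with only one borrower of A, so they fall below the threshold of 3 and never appear in the output, which is also what stops a single person's borrowing history from leaking into the suggestions.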

When we trialled generating suggestions this way, we found a couple of issues:

more often than not, the suggested books tend to be ones that are popular and circulate well already — is there a danger that this creates a closed loop, where more relevant but less popular books don’t get recommended?
the suggested books are often more general — e.g. the suggestions for a book on MySQL might be ones that cover databases in general, rather than specifically just MySQL

To try and address those concerns, we tweaked the sorting to take into account the total number of times the suggested book has been borrowed. So, if 10 people have borrowed book A and book B, and book B has only been borrowed by 12 people in total, we could infer that there’s a strong link between both books.
If we divide the number of common borrowers (10) by the total number of people who’ve borrowed the suggested book (12), we’ll end up with a figure between 0 and 1 that we can use to sort the titles. Here’s a list that uses 15 and above as the threshold…

Status syndrome : how your social standing directly affects your health [15 / 33]
Values for care practice [15 / 61]
The study of social problems: seven perspectives [18 / 90]
Essentials of human anatomy & physiology [15 / 81]
Human health and disease [15 / 88]
Social problems: an introduction to critical constructionism [15 / 90]
Thinking about social problems: an introduction to constructionist perspectives [21 / 127]
The helping relationship: process and skills [41 / 248]
Health inequality: an introduction to theories, concepts and methods [20 / 122]
The sociology of health and illness [29 / 188]
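
The tweaked sort can be sketched in Python as follows. The figures are five of the pairs from the list above, deliberately given out of order; the function name and data layout are hypothetical, and this is an illustration rather than the code behind the OPAC.

```python
def rank_by_ratio(candidates, threshold):
    """Sort suggestions by (borrowers in common / total for the suggested book).

    Each candidate is a (title, in_common, total) tuple. Anything with fewer
    than `threshold` borrowers in common is dropped before sorting.
    """
    kept = [c for c in candidates if c[1] >= threshold]
    return sorted(kept, key=lambda c: c[1] / c[2], reverse=True)


# Figures taken from the square brackets in the list above.
candidates = [
    ("The helping relationship: process and skills", 41, 248),
    ("Status syndrome : how your social standing directly affects your health", 15, 33),
    ("The sociology of health and illness", 29, 188),
    ("Values for care practice", 15, 61),
    ("The study of social problems: seven perspectives", 18, 90),
]

for title, in_common, total in rank_by_ratio(candidates, threshold=15):
    print(f"{in_common / total:.3f}  {title} [{in_common} / {total}]")
```

Sorting on the ratio is what pushes "Status syndrome" (15 out of 33) above "The helping relationship" (41 out of 248), even though the latter has far more borrowers in common.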

…and if we used a lower threshold of 5, we’d get…

If you think of the 3 sets of suggestions in terms of the Long Tail, the first set favours popular items that will mostly appear in the green (“head”) section, the second will be further along the tail, and the third, even further along.

As we move along the tail, we begin to favour books that haven’t been borrowed as often and we also begin to see a few more eclectic suggestions appearing (e.g. the “How effective have National Healthy School Standards…” literature based study).
One final factor that we include in our OPAC suggestions is whether or not the suggested book belongs to the same stock collection in the library — if it does, then the book gets a slight boost.

JISC Activity Data Programme

I’m chuffed to bits that the Library Impact Data bid that Huddersfield submitted, along with 7 project partner institutions, was one of the successful ones in the JISC Activity Data Programme and the project will kick off on Tuesday this week!

… the aim of this project is to prove a statistically significant correlation between library usage and student attainment. The project will collect anonymised data from University of Bradford, De Montfort University, University of Exeter, University of Lincoln, Liverpool John Moores University, University of Salford, Teesside University as well as Huddersfield. By identifying subject areas or courses which exhibit low usage of library resources, service improvements can be targeted. Those subject areas or courses which exhibit high usage of library resources can be used as models of good practice.

If you’re interested, keep an eye on the project blog: https://library.hud.ac.uk/blogs/projects/lidp/