Sliding down the long tail

At a recent event in Edinburgh, I was asked about how we generate the “people who borrowed this, also borrowed…” suggestions in our OPAC and whether or not there are privacy issues with generating them.
Last week, I popped over to Manchester for a meeting of the JISC-funded SALT (Surfacing the Academic Long Tail) project, which is one of the recently funded Activity Data projects. Part of the discussion at the meeting was around how to generate recommendations for items that haven’t circulated many times.
At both events, I promised to put together a blog post detailing the method we use, so here it is!
To generate recommendations for book A, we find every person who’s borrowed that book. Just to simplify things, let’s say only 4 people have borrowed that book. We then find every book that those 4 people have borrowed. As a Venn diagram, where each set represents the books borrowed by that person, it’d look like…

To generate useful and relevant recommendations (and also to help protect privacy), we set a threshold and ignore anything below that. So, if we decide to set the threshold at 3 or more, we can ignore anything in the red and orange segments, and just concentrate on the yellow and green intersections…

There’ll always be at least one book in the green intersection — the book we’re generating the recommendations for, so we can ignore that.
If we sort the books that appear in those intersections by how many borrowers they have in common (in descending order), we should get a useful list of recommendations. For example, if we do this for “Social determinants of health” (ISBN 9780198565895), we get the following titles (the figures in square brackets are the number of people who borrowed both books and the total number of loans for the suggested book)…

  1. Health promotion: foundations for practice [43 / 1312]
  2. The helping relationship: process and skills [41 / 248]
  3. Skilled interpersonal communication: research, theory and practice [31 / 438]
  4. Public health and health promotion: developing practice [29 / 317]
  5. The sociology of health and illness [29 / 188]
  6. Promoting health: a practical guide [28 / 704]
  7. Sociology: themes and perspectives [28 / 612]
  8. Understanding social problems: issues in social policy [28 / 300]
  9. Psychology: the science of mind and behaviour [27 / 364]
  10. Health policy for health care professionals [25 / 375]
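
As a rough illustration of the steps described above, here is a minimal Python sketch. It is not the code behind our OPAC: the `recommend` function, the `loans` dictionary and the borrower IDs are all made up for the example, and the toy data stands in for the four borrowers in the Venn diagram.

```python
from collections import Counter

def recommend(book, loans, threshold=3):
    """Rank other books by how many borrowers they share with `book`.

    `loans` maps a borrower ID to the set of books that person has borrowed
    (a hypothetical structure chosen for this sketch).
    """
    shared = Counter()
    for borrowed in loans.values():
        if book in borrowed:
            # Every other book this co-borrower took out gets one vote.
            shared.update(borrowed - {book})
    # Ignore anything below the threshold, then sort by shared borrowers.
    ranked = [(title, n) for title, n in shared.items() if n >= threshold]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

# Invented toy data: four people have borrowed book "A".
loans = {
    "p1": {"A", "B", "C", "D"},
    "p2": {"A", "B", "C"},
    "p3": {"A", "B", "C", "E"},
    "p4": {"A", "B", "F"},
    "p5": {"B", "C", "G"},  # never borrowed "A", so is ignored
}

print(recommend("A", loans, threshold=3))  # [('B', 4), ('C', 3)]
```

Removing book A from each borrower's set before counting is the same as ignoring the book we're generating recommendations for in the green intersection.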

When we trialled generating suggestions this way, we found a couple of issues:

  • more often than not, the suggested books tend to be ones that are popular and circulate well already. Is there a danger that this creates a closed loop, where more relevant but less popular books don’t get recommended?
  • the suggested books are often more general — e.g. the suggestions for a book on MySQL might be ones that cover databases in general, rather than specifically just MySQL

To try and address those concerns, we tweaked the sorting to take into account the total number of times the suggested book has been borrowed. So, if 10 people have borrowed book A and book B, and book B has only been borrowed by 12 people in total, we could infer that there’s a strong link between both books.
If we divide the number of common borrowers (10) by the total number of people who’ve borrowed the suggested book (12), we’ll end up with a figure between 0 and 1 (here, 10 / 12 ≈ 0.83) that we can use to sort the titles. Here’s a list that uses 15 and above as the threshold…

…and if we used a lower threshold of 5, we’d get…

  1. Status syndrome : how your social standing directly affects your health [15 / 33]
  2. What is the real cost of more patient choice? [5 / 12]
  3. Interpersonal helping skills [5 / 12]
  4. Coaching and mentoring in higher education : a learning-centred approach [6 / 15]
  5. Understanding social policy [5 / 13]
  6. Managing and leading in inter-agency settings [11 / 29]
  7. Read, reflect, write : the elements of flexible reading, fluent writing, independent learning [5 / 14]
  8. Community psychology : in pursuit of liberation and well-being [6 / 20]
  9. Communication skills for health and social care [9 / 32]
  10. How effective have National Healthy School Standards and the National Healthy School programme been, in contributing to improvements in children’s health? [5 / 18]

If you think of the 3 sets of suggestions in terms of the Long Tail, the first set favours popular items that will mostly appear in the green (“head”) section, the second will be further along the tail, and the third, even further along.

As we move along the tail, we begin to favour books that haven’t been borrowed as often and we also begin to see a few more eclectic suggestions appearing (e.g. the “How effective have National Healthy School Standards…” literature-based study).
One final factor that we include in our OPAC suggestions is whether or not the suggested book belongs to the same stock collection in the library — if it does, then the book gets a slight boost.
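The ratio-based sorting can be sketched along the same lines. This is only an illustration: the function name, the tuple layout and the size of the collection boost (1.1) are assumptions, as the post doesn't say how big the "slight boost" is. The borrower figures are taken from the lists above, and the same-collection flags are left as `False` because that information isn't given here.

```python
def rank_by_ratio(candidates, threshold=5, same_collection_boost=1.1):
    """Sort candidates by shared borrowers divided by the suggested book's total.

    Each candidate is (title, shared, total, same_collection).
    """
    scored = []
    for title, shared, total, same_collection in candidates:
        if shared < threshold:
            continue  # below the threshold, so ignored
        score = shared / total
        if same_collection:
            score *= same_collection_boost  # hypothetical boost size
        scored.append((title, round(score, 3)))
    return sorted(scored, key=lambda pair: -pair[1])

candidates = [
    ("Health promotion: foundations for practice", 43, 1312, False),
    ("Status syndrome", 15, 33, False),
    ("Interpersonal helping skills", 5, 12, False),
    ("Communication skills for health and social care", 9, 32, False),
]

for title, score in rank_by_ratio(candidates, threshold=5):
    print(f"{score:.3f}  {title}")
```

With these figures, the very popular "Health promotion" title drops to the bottom of the list (43 / 1312 is roughly 0.03), which is the effect we were after.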

Here comes Summ(er|on)

It’s probably a sign of getting old and decrepit, but this year has just flown by — it doesn’t seem like two minutes since we kicked off our implementation of Serials Solutions’ Summon and now it’s gone fully live (it actually went fully live halfway through the Mashed Library event we ran the other week).
The bulk of the implementation was done and dusted by early January 2010, and the majority of the implementation time was spent populating 360 Link (the Serials Solutions link resolver) with our journal holdings, a task our Journals Team found much easier than when we implemented SFX back in 2006. As the plan had always been to run Summon in parallel to MetaLib during the 2009/10 academic year, it meant we had lots of time to play and tweak.
We flipped the link resolver over from SFX to 360 Link in late January and then formally “soft” launched Summon during the University’s Research Festival in early March. Throughout the academic year, usage of Summon has been growing and the vast majority of the feedback has been positive 🙂
As part of the JISC Summon4HN Project, we’ll be documenting the implementation and releasing chunks of code that we hope might be of use to the community, including:

  • code for automating the export of deleted, new and updated MARC records from Horizon so that they can be imported into Summon (or VuFind, AquaBrowser, etc)
  • code for creating “dummy” journal title records (so that known journal titles can be easily located in Summon, e.g. American Journal of Nursing)
  • a basic mod_perl implementation of the DLF spec for exposing availability data for library collections
  • details of the various tweaks we’ve made to our 360 Link instance

Also, as part of the roll out of Summon, we’ve been revamping our E-Resources Wiki to provide a browseable list of resources — as with the journal titles, we’ve been dropping dummy MARC records into Summon so that known resources can be located via a search (e.g. Mintel Reports).

Non/low library usage and final grades

Whilst chatting to one of the delegates at yesterday’s “Gaining business intelligence from user activity data” event (my PowerPoint slides can be grabbed from here) about non & low-usage of library services/resources, I began wondering how that relates to final grades.
In the previous blog post, we’ve seen that there appears to be evidence of a correlation between usage and grades, but that doesn’t really give an indication of how many students are non/low users. For example, if we happened to know that 25% of all students never borrow anything from the library, does that mean that 25% of students who gain the highest grades don’t borrow a book?
Let’s churn the data again 🙂
In the following 3 graphs, we’re looking at:

  • X axis: bands of usage (zero usage, then incremental bands of 20, then everything over 180 uses)
  • Y axis: as a percentage, what proportion of the students who achieved a particular grade are in each band
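
For anyone wanting to try the same banding on their own data, here is a minimal Python sketch. The function names, the band labels (e.g. "180+") and the sample records are all invented for the example; it isn't the script used to draw the graphs.

```python
from collections import Counter, defaultdict

def usage_band(uses):
    """Return the band label for a usage count: 0, 1-20, ..., 161-180, 180+."""
    if uses == 0:
        return "0"
    if uses > 180:
        return "180+"
    low = ((uses - 1) // 20) * 20 + 1
    return f"{low}-{low + 19}"

def band_percentages(students):
    """For each grade, the percentage of its students that fall in each band."""
    counts = defaultdict(Counter)
    for grade, uses in students:
        counts[grade][usage_band(uses)] += 1
    return {
        grade: {band: 100 * n / sum(bands.values()) for band, n in bands.items()}
        for grade, bands in counts.items()
    }

# Invented sample: (final grade, number of uses over the final 3 years)
students = [
    ("1", 0), ("1", 15), ("1", 45), ("1", 200),
    ("3", 0), ("3", 0), ("3", 5), ("3", 30),
]

print(band_percentages(students)["3"])
```

The percentages are calculated within each grade, so each grade's bands add up to 100%.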

One of the things to look for is which grade peaks in each band of usage.
Borrowing
The usage bands represent the number of items borrowed from the library during the final 3 years of study…
caveat: we have a lot of distance learners across the world and we wouldn’t expect them to borrow anything from the library
In terms of non-usage (i.e. never borrowing an item), there’s a marked difference between those who get the two highest grades (1 and 2:1) and those who get the lowest honours grade (3). It seems that those who get a third-class honour are twice as likely to be non-users as those who get a first-class or 2:1 degree.
E-Resource Usage
The usage bands represent the number of times the student logged into MetaLib (or AthensDA) during the final 3 years of study…
caveat: this is a relatively crude measure of e-resource usage, as it doesn’t measure what the student accessed or how long they accessed each e-resource
Even at a quick glance, we can see that this graph tells a different story to the previous one: the number of non-users is lower, but there’s a huge (worrying?) amount of low usage (the “1-20” band). I can only speculate on that:

  • did students try logging in but found the e-resources too difficult to use?
  • how much of an impact do the barriers to off-campus access (e.g. having to know when & how to authenticate using Athens or Shibboleth) have on repeat usage?
  • are students finding the materials they need for their studies outside of the subscription materials?

As I mentioned previously, Summon is a different kettle of fish to MetaLib, so it’s unlikely we’ll be able to capture comparative usage data — if you’ve tried using Summon, you’ll know that you don’t need to log in to use it (authentication only kicks in when you try to access the full-text). However, we’re confident that Summon’s ease-of-use and the work we’ve done to improve off-campus access will result in a dramatic increase in e-resource usage.
As before, we see it’s those students who graduate with a third-class honour who are the most likely to be non or low-users of e-resources.
Visits to the Library
The usage bands represent the number of visits to the library during the final 3 years of study…
caveat: we have a lot of distance learners across the world and we wouldn’t expect them to visit the library
Again, the graph shows that those who gain a third-class degree are twice as likely to never visit the library as those who gain a first-class or 2:1.

Library usage and final grades

It’s high time I started blogging again, so let’s start off with something that my colleagues in the library have been talking about at recent conferences — the link between the usage of library services and the final academic grades achieved by students.
As a bit of background to this, it’s probably worth mentioning that we’ve had an ongoing project (since 2006?) in the library looking at non and low-usage of library resources. That project has helped identify the long-term trends in book borrowing, e-resource usage and library visits by the students at Huddersfield. Plus, we’ve used that information to help identify specific courses and cohorts of students who probably aren’t using the library as much as they should be, as well as the most effective time during a course to do refresher training.
Towards the back end of last year, we worked with the Student Records Team to build up a profile of library usage by the previous 2 years’ worth of graduates. For each graduate, we compared their final degree grade with their last 3 years of library usage data, specifically:

  • Items loaned — how many things did they borrow from the library?
  • MetaLib/AthensDA logins — how often did they access e-resources?
  • Entry stats — how many times did they venture in to the library?

Now, I’ll be the first to admit that these are basic & crude measures…

  • A student might borrow many items, but maybe he’s just working his way through our DVD collection for fun.
  • A login to MetaLib doesn’t tell you what they looked at or how long they used our e-resources.
  • Students might (and do) come into the library for purely social reasons.
  • Using the library is just one part of the overall academic experience.

…but they are rough indicators, useful for a quick initial check to see if there is a correlation. Plus, we know from the non & low-usage project that there are still many students who (for many reasons) don’t use the library much.
So, let’s churn the data! 🙂
Here’s the average usage by the 3,400 or so undergraduate degree students who graduated with an honour in the 2007/8 academic year:
2007/8
In terms of visits to the library, there’s no overall correlation (the average number of visits per student ranges from 109 to 120), although we do see some correlation at the level of individual courses. What does this tell us (if anything)? I’d say it’s evidence that the library is for everyone, regardless of their ability and academic prowess.
We do see a correlation with stock usage and e-resource usage. Those who achieved a first (1) on average borrowed twice as many items as those who got a third (3) and logged into MetaLib/AthensDA to access e-resources 3.5 times as much. The correlation is fairly linear across the grades, although there’s a noticeable jump up in e-resource usage (when compared to stock borrowing) in those who gained a first.
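For the curious, the per-grade averages could be worked out with something like the Python sketch below. The record layout and the function name are assumptions made for the example, and the numbers are invented, not the Huddersfield data.

```python
from collections import defaultdict
from statistics import mean

def average_usage_by_grade(graduates):
    """Average each usage measure across the graduates in each grade.

    Each record is (grade, items_borrowed, e_resource_logins, library_visits).
    """
    by_grade = defaultdict(list)
    for grade, loans, logins, visits in graduates:
        by_grade[grade].append((loans, logins, visits))
    return {
        grade: tuple(mean(column) for column in zip(*rows))
        for grade, rows in by_grade.items()
    }

# Invented records, purely to show the shape of the calculation.
graduates = [
    ("1", 60, 140, 115),
    ("1", 80, 210, 121),
    ("3", 30, 40, 110),
    ("3", 40, 60, 118),
]

averages = average_usage_by_grade(graduates)
print(averages["1"])  # (70, 175, 118)
print(averages["3"])  # (35, 50, 114)
```

An average like this can hide a lot (a handful of very heavy users will pull it up), so it's only a first look at the data.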
Now the data for the 3,200 or so students from the following academic year, 2008/9:
2008/9
As before, no particular correlation with visits to the library, but a noticeable correlation with stock & e-resource usage. Again we see that jump in e-resource usage for those who got the highest grade.
Note too that the average usage has increased. We’ve not changed the way we measure logins or item circulation, so this is real year-on-year growth. (Side note: as we make the move from MetaLib to Summon, the concept of an “e-resource login” will change dramatically, so we won’t be able to accurately compare year-on-year in future.)
Finally, here’s both years of graduates’ usage combined onto a single graph:
2007/8 & 2008/9
I’m curious about that jump in e-resource usage. Does it mean, to gain the best marks, students need to be looking online for the best journal articles, rather than relying on the printed page? If that is the case, will Summon have a measurably positive impact on improving grades (it certainly makes it a lot easier to find relevant articles quickly)?
Going forward, we’ve still got a lot of work to do drilling down into the data: analysing it by individual courses, looking deeper into the books that were borrowed and the e-resources that were accessed, etc. We also need to prove that all this has statistical significance. Not only that, but how can we use that knowledge and insight to improve the services which the library offers? It’d be foolish to say “borrow more books and you’ll get better grades”, but maybe we can continue to help guide students to the most relevant materials for their studies.
It’s all exciting stuff and, believe me, the University of Huddersfield Library is a great environment to work in… I just wish there were more hours in the day! 🙂

Quick plug: CILIP U&CR Y&H Open Source event

Just a quick plug to say that there are still spaces available at the “Open Source: Free Speech, Free Beer and Free Kittens!” event at Huddersfield on Friday 26th June. Full details and a link to the booking form are available on the CILIP University College and Research Group web site.
Speakers at the event include:
– Ken Chad (Ken Chad Consulting)
– Nick Dimant and Jonathan Field (PTFS Europe)
– Nicolas Morin (BibLibre)
– Richard Wallis (Talis)
…although I don’t think there’ll be any free beer or kittens on offer to delegates, there will be a free lunch which is kindly being sponsored by PTFS Europe 🙂

Web service for the free book usage data

For ages, I’ve been meaning to get around to adding a web service front end to the book usage data that we released in December. So, better late than never, here it is!
It’s not the fastest bit of code I’ve ever written, but (if there’s enough interest) I could speed it up.
The web service can be called a couple of different ways:
1) using an ISBN
Examples:
a) https://library.hud.ac.uk/api/usagedata/isbn=0415014190 (“Language in the news”)
b) https://library.hud.ac.uk/api/usagedata/isbn=159308000X (“The Adventures of Huckleberry Finn”)
Assuming a match is located, data for 1 or more items will be returned. This will include FRBR style matching using the LibraryThing thingISBN data, as shown in the second example where we don’t have an item which exactly matches the given ISBN.
2) using an ID number
Examples:
a) https://library.hud.ac.uk/api/usagedata/id=125120 (“Language and power”)
The item ID numbers are included in the suggestion data and are the internal bibliographic ID numbers used by our library management system.
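If you'd rather call the service from a script, something like the following Python sketch should do. The helper names are made up, the URL patterns are simply copied from the examples above, and `fetch_usage` assumes the service is reachable and returns well-formed XML (it makes no assumptions about the element names).

```python
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE_URL = "https://library.hud.ac.uk/api/usagedata/"

def usage_url(isbn=None, item_id=None):
    """Build a request URL for either an ISBN or an internal item ID."""
    if (isbn is None) == (item_id is None):
        raise ValueError("supply exactly one of isbn or item_id")
    if isbn is not None:
        return f"{BASE_URL}isbn={isbn}"
    return f"{BASE_URL}id={item_id}"

def fetch_usage(url, timeout=30):
    """Fetch the URL and parse the response as XML (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return ET.parse(response).getroot()

print(usage_url(isbn="0415014190"))
print(usage_url(item_id=125120))
# root = fetch_usage(usage_url(isbn="0415014190"))  # uncomment to call the service
```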
——————-
edit 1: I should also have mentioned that the XML returned is essentially the same format as described here.
edit 2: I’ve now re-written the code as a mod_perl script (to make it faster when using ISBNs) and slightly altered the URL.

Keyword search data

We’ve been logging all keyword searches on our OPAC for nearly 3 years and now have details for over 3 million searches. Just in case the data is of any use to anyone, I’ve uploaded an aggregated XML version to our web server: https://library.hud.ac.uk/data/keyworddata/
As with the usage data, we’re putting it out there with no strings attached by using an Open Data Commons Licence.
The XML file contains a list of about 8,500 keywords. For each keyword, there’s a list of other terms that have been used with that keyword in multi-term searches. The readme file contains more information about the structure.
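To give a feel for that kind of aggregation, here is a minimal Python sketch of how co-occurring terms could be counted from a raw list of searches. It isn't the code that produced the XML file and it doesn't reflect the file's actual structure (see the readme for that); the function name and the sample searches are invented.

```python
from collections import Counter, defaultdict

def co_occurring_terms(searches):
    """Map each keyword to a count of the other terms searched alongside it."""
    related = defaultdict(Counter)
    for query in searches:
        terms = set(query.lower().split())
        for term in terms:
            # Single-term searches have no other terms, so add nothing.
            for other in terms - {term}:
                related[term][other] += 1
    return related

# Invented sample searches
searches = ["mysql", "mysql php", "php web design", "mysql database design"]

related = co_occurring_terms(searches)
print(sorted(related["mysql"].items()))
# [('database', 1), ('design', 1), ('php', 1)]
```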

Books that connect users

I thought it would be interesting to trawl the data and find out which books have been borrowed by the largest number of different courses within the university. I forget what the correct Graph Theory term is, but these are the books (nodes?) that connect together (edges?) the largest number of separate groups of students (networks?). The figure in brackets is the number of different courses that have borrowed the book.

  1. Questionnaire design, interviewing and attitude measurement by Oppenheim (245)
  2. Doing your research project: a guide for first-time researchers in education and social science (3rd ed) by Bell (215)
  3. Real world research: a resource for social scientists and practitioner-researchers (2nd ed) by Robson (190)
  4. Organisational behaviour and analysis: an integrated approach by Rollinson, Broadfield & Edwards (167)
  5. Sociology (3rd ed) by Giddens (161)
  6. The reflective practitioner: how professionals think in action by Schön (152)
  7. Experiential learning: experience as the source of learning and development by Kolb (150)
  8. Strategic management: awareness and change (3rd ed) by Thompson (134)
  9. Strategic management: an analytical introduction (3rd ed) by Luffman (133)
  10. Sociology: themes and perspectives (5th ed) by Haralambos & Holborn (133)
  11. Educating the reflective practitioner: toward a new design for teaching and learning in the professions by Schön (131)
  12. The good research guide: for small-scale social research projects by Denscombe (129)
  13. Qualitative data analysis: an expanded sourcebook (2nd ed) by Miles & Huberman (127)
  14. Health promotion: foundations for practice (2nd ed) by Naidoo & Wills (125)
  15. Team roles at work by Belbin (124)
  16. Research methods in education (5th ed) by Cohen, Manion & Morrison (124)
  17. How to research by Blaxter, Hughes & Tight (124)
  18. Understanding organizations (4th ed) by Handy (123)
  19. Basics of qualitative research: techniques and procedures for developing grounded theory (2nd ed) by Strauss & Corbin (121)
  20. The study skills handbook by Cottrell (120)
  21. Health promotion: models and values (2nd ed) by Downie, Tannahill & Tannahill (120)
  22. Doing qualitative research: a practical handbook by Silverman (116)
  23. Marketing by Lancaster & Reynolds (116)
  24. Reflection: turning experience into learning by Boud, Keogh & Walker (113)
  25. Management (6th ed) by Stoner, Freeman & Gilbert (109)
  26. No sweat!: the indispensable guide to reports and dissertations by Irving & Smith (109)
  27. The good study guide by Northedge (106)
  28. Research methods for nurses and the caring professions (6th ed) by Abbott & Sapsford (106)
  29. Marketing by Lancaster & Reynolds (106)
  30. Operations and the management of change by Gilgeous (106)

Conversely, these are the books that have only ever been borrowed by students on one specific course. The figure in brackets is the number of loans.

  1. The meaning of everyday occupation by Hasselkus (61)
  2. Perspectives in human occupation: participation in life by Kramer, Hinojosa & Royeen (48)
  3. Introduction to podopediatrics (2nd ed) by Thomson & Volpe (42)
  4. Occupational therapy without borders: learning from the spirit of survivors by Algado, Pollard & Kronenberg (38)
  5. Transformation through occupation by Watson & Swartz (38)
  6. Operating department practice A-Z by Smith & Williams (31)
  7. Lully, lulla, thou little tiny child: for soprano solo and SATB (unaccompanied), op.25 no.2 by Leighton (31)
  8. Five childhood lyrics: for unaccompanied mixed voices by Rutter (31)
  9. Task analysis: an occupational performance approach by Watson & Llorens (30)
  10. Conditions in occupational therapy: effect on occupational performance (3rd ed) by Atchison & Dirette (30)
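
Both lists boil down to counting the distinct courses per book. A minimal Python sketch of that idea is below; the function names, the (book, course) loan records and the titles are all invented for the example, so treat it as an illustration rather than the query we ran.

```python
from collections import Counter, defaultdict

def courses_per_book(loans):
    """Map each book to the set of distinct courses whose students borrowed it."""
    courses = defaultdict(set)
    for book, course in loans:
        courses[book].add(course)
    return courses

def most_connected(loans):
    """Books ordered by the number of distinct borrowing courses, descending."""
    courses = courses_per_book(loans)
    return sorted(((book, len(c)) for book, c in courses.items()),
                  key=lambda pair: (-pair[1], pair[0]))

def single_course_books(loans):
    """Books borrowed by only one course, with their loan counts, descending."""
    courses = courses_per_book(loans)
    loan_counts = Counter(book for book, _ in loans)
    only_one = [(book, loan_counts[book])
                for book, c in courses.items() if len(c) == 1]
    return sorted(only_one, key=lambda pair: (-pair[1], pair[0]))

# Invented loan records: one (book, course of the borrower) pair per loan
loans = [
    ("Research methods", "Nursing"),
    ("Research methods", "Marketing"),
    ("Research methods", "Education"),
    ("Study skills", "Nursing"),
    ("Study skills", "Marketing"),
    ("Podiatry basics", "Podiatry"),
    ("Podiatry basics", "Podiatry"),
    ("Choral score", "Music"),
]

print(most_connected(loans)[:2])   # [('Research methods', 3), ('Study skills', 2)]
print(single_course_books(loans))  # [('Podiatry basics', 2), ('Choral score', 1)]
```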