Free book usage data from the University of Huddersfield

I’m very proud to announce that Library Services at the University of Huddersfield has just done something that would have perhaps been unthinkable a few years ago: we’ve just released a major portion of our book circulation and recommendation data under an Open Data Commons/CC0 licence. In total, there’s data for over 80,000 titles derived from a pool of just under 3 million circulation transactions spanning a 13-year period.
I would like to lay down a challenge to every other library in the world to consider doing the same.
This isn’t about breaching borrower/patron privacy — the data we’ve released is thoroughly aggregated and anonymised. This is about sharing potentially useful data to a much wider community and attaching as few strings as possible.
I’m guessing some of you are thinking: “what use is the data to me?”. Well, possibly of very little use — it’s just a droplet in the ocean of library transactions and it’s only data from one medium-sized new University, somewhere in the north of England. However, if just a small number of other libraries were to release their data as well, we’d be able to begin seeing the wider trends in borrowing.
The data we’ve released essentially comes in two big chunks:
1) Circulation Data

This breaks down the loans by year, by academic school, and by individual academic courses. This data will primarily be of interest to other academic libraries. UK academic libraries may be able to directly compare borrowing by matching up their courses against ours (using the UCAS course codes).

2) Recommendation Data

This is the data which drives the “people who borrowed this, also borrowed…” suggestions in our OPAC. This data had previously been exposed as a web service with a non-commercial licence, but is now freely available for you to download. We’ve also included data about the number of times the suggested title was borrowed before, at the same time, or afterwards.

Smaller data files provide further details about our courses, the relevant UCAS course codes, and expanded ISBN lookup indexes (many thanks to Tim Spalding for allowing the use of thingISBN data to enable this!).
All of the data is in XML format and, in the coming weeks, I’m intending to create a number of web services and APIs which can be used to fetch subsets of the data.
The clock has been ticking to get all of this done in time for the “Sitting on a gold mine: improving provision and services for learners by aggregating and using learner behaviour data” event, organised by the JISC TILE Project. Therefore, the XML format is fairly simplistic. If you have any comments about the structuring of the data, please let me know.
I mentioned that the data is a subset of our entire circulation data. The criteria for inclusion were that the relevant MARC record must contain an ISBN and that borrowing must have been significant. So, you won’t find any titles without ISBNs in the data, nor any books which have only been borrowed a couple of times.
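As a rough illustration of those inclusion criteria, the Python sketch below keeps only titles that have an ISBN and at least a minimum number of loans. The record layout, the field names, the sample values and the threshold of 5 are all assumptions made for the example; the actual cut-off used for the release isn’t stated here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cut-off: the post only says borrowing "must have been significant".
MIN_LOANS = 5


@dataclass
class Title:
    bib_id: int            # bibliographic record ID (illustrative)
    isbn: Optional[str]    # None when the MARC record has no ISBN
    loans: int             # total number of loans for the title


def eligible(titles, min_loans=MIN_LOANS):
    """Keep titles that have an ISBN and at least min_loans loans."""
    return [t for t in titles if t.isbn and t.loans >= min_loans]


# Made-up sample records, purely for demonstration.
sample = [
    Title(1, "9780141439518", 120),  # has ISBN, well borrowed: kept
    Title(2, None, 300),             # no ISBN: dropped
    Title(3, "9780199535569", 2),    # borrowed only twice: dropped
]

print([t.bib_id for t in eligible(sample)])
```

Lowering min_loans would let rarely borrowed titles through, so the threshold is where the trade-off between coverage and "significant" borrowing gets decided.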
So, this data is just a droplet — a single pixel in a much larger picture.
Now it’s up to you to think about whether or not you can augment this with data from your own library. If you can’t, I want to know what the barriers to sharing are. Then I want to know how we can break down those barriers.
I want you to imagine a world where a first year undergraduate psychology student can run a search on your OPAC and have the results ranked by the most popular titles as borrowed by their peers on similar courses around the globe.
I want you to imagine a book recommendation service that makes Amazon’s look amateurish.
I want you to imagine a collection development tool that can tap into the latest borrowing trends at a regional, national and international level.
Sounds good? Let’s start talking about how we can achieve it.

FAQ (OK, I’m trying to anticipate some of your questions!)
Q. Why are you doing this?
A. We’ve been actively mining circulation data for the benefit of our students since 2005. The “people who borrowed this, also borrowed…” feature in our OPAC has been one of the most successful and popular additions (second only to adding a spellchecker). The JISC TILE Project has been debating the benefits of larger scale aggregations of usage data and we believe that would greatly increase the end benefit to our users. We hope that the release of the data will stimulate a wider debate about the advantages and disadvantages of aggregating usage data.
Q. Why Open Data Commons / CC0?
A. We believe this is currently the most suitable licence to release the data under. Restrictions limit (re)use and we’re keen to see this data used in imaginative ways. In an ideal world, there would be services to harvest the data, crunch it, and then expose it back to the community, but we’re not there yet.
Q. What about borrower privacy?
A. There’s a balance to be struck between safeguarding privacy and allowing usage data to improve our services. It is possible to have both. Data mining is typically about looking for trends — it’s about identifying sizeable groups of users who exhibit similar behaviour, rather than looking for unique combinations of borrowing that might relate to just one individual. Setting a suitable threshold on the minimum group size ensures anonymity.
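To make the threshold idea concrete, here is a minimal Python sketch. It counts distinct borrowers per (course, title) group and only publishes the counts for groups that reach a minimum size, so borrower IDs never appear in the output. The tuple layout, the names, the sample values and the threshold of 3 are assumptions for illustration, not a description of the process actually used at Huddersfield.

```python
from collections import defaultdict

# Hypothetical minimum number of distinct borrowers per published group.
MIN_GROUP_SIZE = 3


def aggregate_loans(loans, min_group_size=MIN_GROUP_SIZE):
    """Aggregate loans into per-group borrower counts.

    loans: iterable of (borrower_id, course_code, isbn) tuples.
    Returns {(course_code, isbn): distinct_borrower_count}, keeping only
    groups with at least min_group_size distinct borrowers.
    """
    borrowers = defaultdict(set)
    for borrower_id, course_code, isbn in loans:
        borrowers[(course_code, isbn)].add(borrower_id)
    return {
        key: len(ids)
        for key, ids in borrowers.items()
        if len(ids) >= min_group_size
    }


# Made-up sample data, purely for demonstration.
loans = [
    (1, "COURSE_A", "ISBN_X"),
    (2, "COURSE_A", "ISBN_X"),
    (3, "COURSE_A", "ISBN_X"),
    (3, "COURSE_A", "ISBN_X"),  # repeat loan by the same borrower
    (4, "COURSE_B", "ISBN_Y"),  # a group of one: suppressed
]

print(aggregate_loans(loans))
```

Counting distinct borrowers rather than raw loans matters here: one keen reader renewing the same book many times shouldn’t be able to push a group over the threshold.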

42 thoughts on “Free book usage data from the University of Huddersfield”

  1. Many thanks, Patrick! It was a bit of a rush to get the data ready, so I didn’t think too long or hard about the structure of the XML. Just let me know if any changes would make it easier to convert.

  2. Many thanks to everyone who’s blogged about the announcement: you’ve already raised some interesting issues.
    Huddersfield has invested heavily in e-book/e-journal provision and this isn’t included in the data. From memory, borrowing levels haven’t really changed much in the last few years (I’ll have to go back to the raw data to confirm that), but e-resource usage has continued to shoot up.
    Sadly, data mining the physical stock usage is easy peasy compared to getting a handle on e-resource usage :-S

  3. Dave,
    Gee, I can’t imagine anyone in our business ever being in a rush to push something out! 🙂
    So far, the XSLTs I’m writing for the sample data are coming out very nicely, and it’s been great fun. The data has been great to work with.
    I hope to have something to share with you very soon.
    Thanks again. This is really the kind of idea that will get (and already has gotten) lots of people excited.

  4. Hi Dave,
    Good good stuff.
    Can you briefly describe the tools you used to create this? Perl? Any libs? Did you need to create a database (obviously aside from your LMS/Horizon db) to work with the data?
    Obviously a lot of the circ data code will be Huddersfield/Horizon specific, but is there any potential for the final stages of the process (outputting the XML) to be released?

  5. Hi Chris
    I’ll put something together in the next few days. The starting point was the “circ_tran” table in Horizon, which I believe was something added for UK customers. It stores 3 useful bits of data:
    1) item#: the ID of a physical copy of a book, which can then be mapped to a bib(liographic) ID. Each MARC record in Horizon has a bib ID, and that might have zero or more items linked to it (e.g. if we have 8 copies of a book, then there’ll normally be 8 item records).
    2) borrower#: the ID of a borrower. This can be used to locate all of the item#s that the person borrowed (e.g. “select item# from circ_tran where borrower#=123”), and can be used to pick up other attributes (e.g. academic course) from elsewhere.
    3) A timestamp of when the item# was borrowed.
    As long as your system stores something equivalent to those 3 pieces of info for each transaction, then you should be able to generate the data. It’s also the info that I use to populate a table for the “people who borrowed” suggestions.
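For anyone wanting to try this on their own system, the three fields above are enough for a simple co-borrowing count. The Python sketch below is only one possible approach under stated assumptions, not the code behind the Huddersfield data: it maps each item to its bib ID, records the first day each borrower took out each title, and then counts, for every pair of titles, how many borrowers took the second title before, on the same day as, or after the first. The function name, the data layout, the sample values and the "same calendar day" definition of "at the same time" are all assumptions.

```python
from collections import defaultdict
from datetime import datetime
from itertools import permutations


def co_borrowing(transactions, item_to_bib):
    """Count co-borrowed titles, one vote per borrower.

    transactions: iterable of (item_id, borrower_id, timestamp) tuples.
    item_to_bib:  dict mapping item_id -> bib ID.
    Returns counts[bib][other_bib] = {"before": n, "same": n, "after": n},
    where the key says when other_bib was borrowed relative to bib.
    """
    # First day on which each borrower took out each title.
    first_loan = defaultdict(dict)  # borrower_id -> {bib: date}
    for item_id, borrower_id, timestamp in transactions:
        bib = item_to_bib.get(item_id)
        if bib is None:
            continue  # item with no known bib record
        day = timestamp.date()
        seen = first_loan[borrower_id]
        if bib not in seen or day < seen[bib]:
            seen[bib] = day

    counts = defaultdict(
        lambda: defaultdict(lambda: {"before": 0, "same": 0, "after": 0})
    )
    for bibs in first_loan.values():
        # Every ordered pair of distinct titles this borrower took out.
        for a, b in permutations(bibs, 2):
            if bibs[b] < bibs[a]:
                when = "before"
            elif bibs[b] == bibs[a]:
                when = "same"
            else:
                when = "after"
            counts[a][b][when] += 1
    return counts


# Made-up sample data: two copies of title A, one each of B and C.
item_to_bib = {101: "A", 102: "A", 201: "B", 301: "C"}
transactions = [
    (101, 1, datetime(2008, 1, 10, 9, 30)),
    (201, 1, datetime(2008, 1, 10, 14, 0)),
    (301, 1, datetime(2008, 2, 1, 11, 0)),
    (102, 2, datetime(2008, 3, 5, 10, 0)),
    (201, 2, datetime(2008, 3, 1, 16, 0)),
]

result = co_borrowing(transactions, item_to_bib)
print(result["A"]["B"])
```

Before publishing anything built this way, pairs supported by only a handful of borrowers would need to be dropped, for the privacy reasons covered in the FAQ above.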

  6. I believe BiblioCommons, which is now in production at Oakville Public Library, uses circulation activity in defining relevancy. It would be good for more library engines to use this kind of data internally, in addition to exposing it. We had some catalogue stability problems this Fall that made me wonder about using Google Sitemaps in combination with a custom Google search engine. The piece that eluded me, but which would make the most difference, was how to set up PageRank in this scenario. For example, would using circulation data to define pages that point to the bibliographic page for a work bring the power of PageRank to the equation if Google was indexing library content? I have often wondered if Google Books, which is a good example of Google’s engine without the benefit of PageRank, would work better if circ or citation data could fulfill this kind of role. Anyway, it totally rocks that you have exposed this data. As more works get digitized and have a new life as web objects, maybe this will be one of the most effective ways to help them fit into their new environment.

  7. I did wonder about adding authors, but decided that the ISBN might be the best overall identifier.
    What does everyone think? Would it be useful to include authors in the main circulation data? Would it be more useful to create a separate file that contains more book metadata (author, year of publication, publisher, etc)?

  8. It seems to me that it might be overkill for you to do that. LibraryThing, xISBN, and the RDF Book Mashup can get at most, if not all of that info, already.
    I suppose it might hinge on whether applications perform well doing calls to those other services?
    Happy New Year!

  9. Hi Dave, as others have said, this is a great initiative by Huddersfield. I wondered whether OPAC lookup data would also be valuable: how many times books were looked up on the OPAC, how many times they were unavailable because all copies were on loan, etc. Maybe it’s there, but a quick glance at the demo XML files suggests not?

  10. Pingback: Looking at Data
  11. Brilliant! I’m sure this will be very useful over here. We just need all our institutions to do this now!

  12. Dear Dave,
    Does this dataset provide such information: bookId, userId, borrow_time?
    Such information is quite useful for book recommendation.

  13. Hi
    It contains a bookid, but not userid (for privacy reasons) or borrow_time. That’s partly why we created a second XML file (suggestion_data.xml) of recommendations. See the readme file for further details.

  14. Dear Dave,
    Are the suggestion datasets mined? If so, I think they could help the semantic web, for example in creating ontologies. If you have achieved that (created an ontology or a semantic web application), please let me know and share it.
