opendata – Self Plagiarism is Style

Simple API for JISC MOSAIC Project Developer Competition data

For those of you interested in the developer competition being run by the JISC MOSAIC Project, I’ve put together a quick & dirty API for the available data sets. If it’s easier for you, you can use this API to develop your competition entry rather than working with the entire downloaded data set.

edit (31/Jul/2009): Just to clarify — the developer competition is open to anyone, not just UK residents (however, UK law applies to how the competition is being run). Fingers crossed, the Project Team is hopeful that a few more UK academic libraries will be adding their data sets to the pot in early August.

The URL to use for the API is https://library.hud.ac.uk/mosaic/api.pl and you’ll need to supply a ucas and/or isbn parameter to get a response back (in XML), e.g.:

The “ucas” value is a UCAS Course Code. You can find these codes by going to the UCAS web site and doing a “search by subject”. Not all codes will generate output using the API, but you can find a list of codes that do appear in the MOSAIC data sets here.
If you use both a “ucas” and “isbn” value, the output will be limited to just transactions for that ISBN on courses with that UCAS course code.
You can also use these extra parameters in the URL…

show=summary — only show the summary section in the XML output
show=data — only show the data in the XML output (i.e. hide the summary)
prog=… — only show data for the specified progression level (e.g. staff, UG1, etc, see documentation for full list)
year=… — only show data for the specified academic year (e.g. 2005 = academic year 2005/6)

rows=… — max number of rows of data to include (default is 500) n.b. the summary section shows the breakdown for all rows, not just the ones included by the rows limit
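As a rough illustration of how the parameters above fit together, here’s a minimal Python sketch that assembles a request URL. The helper name build_mosaic_url is made up for this example and isn’t part of the API, the parameter names are simply the ones listed above, and the UCAS code and ISBN in the usage line are placeholders. It only builds the URL; it doesn’t call the API.

```python
from urllib.parse import urlencode

API_URL = "https://library.hud.ac.uk/mosaic/api.pl"


def build_mosaic_url(ucas=None, isbn=None, show=None, prog=None, year=None, rows=None):
    """Build a request URL for the API (hypothetical helper, not part of the API)."""
    if ucas is None and isbn is None:
        raise ValueError("supply a ucas and/or isbn parameter")
    if show not in (None, "summary", "data"):
        raise ValueError("show must be 'summary' or 'data'")
    params = {"ucas": ucas, "isbn": isbn, "show": show,
              "prog": prog, "year": year, "rows": rows}
    # Drop anything that wasn't supplied so the URL only carries real filters.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{API_URL}?{query}"


# Placeholder values: swap in a real UCAS course code and ISBN.
print(build_mosaic_url(ucas="XXXX", isbn="0415014190", show="summary", year=2005))
```

Leaving out unset parameters means the API’s own defaults (such as the 500 row limit) still apply.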

The format of the XML is pretty much the same as shown in the project documentation guide, except that I’ve added a summary section to the output.
Notes
The API was knocked together quite quickly, so please report any bugs! Also, I can’t guarantee that the API is 100% stable, so please let me know (e.g. via Twitter) if it appears to be down.

Web service for the free book usage data

For ages, I’ve been meaning to get around to adding a web service front end to the book usage data that we released in December. So, better late than never, here it is!
It’s not the fastest bit of code I’ve ever written, but (if there’s enough interest) I could speed it up.
The web service can be called a couple of different ways:
1) using an ISBN
Examples:
a) https://library.hud.ac.uk/api/usagedata/isbn=0415014190 (“Language in the news”)
b) https://library.hud.ac.uk/api/usagedata/isbn=159308000X (“The Adventures of Huckleberry Finn”)
Assuming a match is located, data for 1 or more items will be returned. This will include FRBR style matching using the LibraryThing thingISBN data, as shown in the second example where we don’t have an item which exactly matches the given ISBN.
2) using an ID number
Examples:
a) https://library.hud.ac.uk/api/usagedata/id=125120 (“Language and power”)
The item ID numbers are included in the suggestion data and are the internal bibliographic ID numbers used by our library management system.
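If you’d like to build these URLs in code, something along these lines might do. It’s only a sketch: the helper names are invented for this example, the URL pattern is copied from the examples above, and restricting ISBNs to the 10-digit form (with the standard check digit test) is an assumption of the sketch rather than a documented rule of the web service.

```python
import re

BASE_URL = "https://library.hud.ac.uk/api/usagedata/"


def is_valid_isbn10(isbn):
    """Check the ISBN-10 check digit (weights 10 down to 1, sum divisible by 11)."""
    if not re.fullmatch(r"\d{9}[\dX]", isbn):
        return False
    values = [10 if ch == "X" else int(ch) for ch in isbn]
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def usage_data_url(isbn=None, item_id=None):
    """Build a URL in the style of the examples above (hypothetical helper)."""
    if (isbn is None) == (item_id is None):
        raise ValueError("supply exactly one of isbn or item_id")
    if isbn is not None:
        # Normalise hyphenated or lower-case input before checking it.
        isbn = isbn.replace("-", "").upper()
        if not is_valid_isbn10(isbn):
            raise ValueError(f"not a valid ISBN-10: {isbn}")
        return f"{BASE_URL}isbn={isbn}"
    return f"{BASE_URL}id={int(item_id)}"


print(usage_data_url(isbn="159308000X"))
print(usage_data_url(item_id=125120))
```

Checking the ISBN before sending the request just saves a wasted call on a mistyped number.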
——————-
edit 1: I should also have mentioned that the XML returned is essentially the same format as described here.
edit 2: I’ve now re-written the code as a mod_perl script (to make it faster when using ISBNs) and slightly altered the URL.

Keyword search data

We’ve been logging all keyword searches on our OPAC for nearly 3 years and now have details for over 3 million searches. Just in case the data is of any use to anyone, I’ve uploaded an aggregated XML version to our web server: https://library.hud.ac.uk/data/keyworddata/
As with the usage data, we’re putting it out there with no strings attached by using an Open Data Commons Licence.
The XML file contains a list of about 8,500 keywords. For each keyword, there’s a list of other terms that have been used with that keyword in multi-term searches. The readme file contains more information about the structure.
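To give a feel for this kind of aggregation, here’s a minimal Python sketch that turns a list of searches into "keyword, plus the other terms used with it" counts. It isn’t the code that produced the XML file, the sample searches are made up, and it makes no attempt to strip out stop words or read the real file (see the readme for the actual structure).

```python
from collections import Counter, defaultdict
from itertools import permutations


def aggregate_searches(searches):
    """Map each keyword to a Counter of the other terms it was searched with."""
    related = defaultdict(Counter)
    for search in searches:
        # Treat a search as a set of lower-cased terms, so repeats count once.
        terms = sorted(set(search.lower().split()))
        for keyword, other in permutations(terms, 2):
            related[keyword][other] += 1
    return related


# Made-up searches, purely for illustration.
log = ["language power", "language news", "Language and power"]
related = aggregate_searches(log)
print(related["language"].most_common())
```

Single-term searches add nothing here, since there are no other terms to pair them with.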

Yay for Talis!

Congratulations to both Talis and LibLime!

Talis, the UK market leader in providing academic and public library solutions, and LibLime, the leader in open solutions for libraries, are pleased to announce a partnership to make available over five million bibliographic records to the library community on the ‡biblios.net platform.
— Talis and LibLime Open Data on ‡biblios.net

How cool is that?

Talis Podcast

I can’t remember if I was using my “posh telephone voice”, but Richard Wallis has just posted a podcast that was recorded yesterday afternoon with Patrick Murray-John.
It’s definitely worth fast-forwarding past my inane waffley bits to listen to Patrick’s comments, as he makes some great points. Using usage data for marketing purposes wasn’t something that had occurred to me, but it’s a fantastic idea!
Even though it was an informal chat, I kept feeling twinges of “job interview syndrome” — that horrible sensation you get when you’re busy talking and you realise you’ve forgotten what the actual question was :-S
For my sins, I’m going to be doing something about OPACs and usage data at the upcoming JISC Developer Happiness Days event along with Ken Chad.
p.s. Can I propose a drinking game for this podcast? The rules are you have to have a drink every time someone mentions Tony Hirst’s name ;-D

Free book usage data from the University of Huddersfield

I’m very proud to announce that Library Services at the University of Huddersfield has just done something that would have perhaps been unthinkable a few years ago: we’ve just released a major portion of our book circulation and recommendation data under an Open Data Commons/CC0 licence. In total, there’s data for over 80,000 titles derived from a pool of just under 3 million circulation transactions spanning a 13-year period.
https://library.hud.ac.uk/usagedata/
I would like to lay down a challenge to every other library in the world to consider doing the same.
This isn’t about breaching borrower/patron privacy — the data we’ve released is thoroughly aggregated and anonymised. This is about sharing potentially useful data to a much wider community and attaching as few strings as possible.
I’m guessing some of you are thinking: “what use is the data to me?”. Well, possibly of very little use — it’s just a droplet in the ocean of library transactions and it’s only data from one medium-sized new University, somewhere in the north of England. However, if just a small number of other libraries were to release their data as well, we’d be able to begin seeing the wider trends in borrowing.
The data we’ve released essentially comes in two big chunks:
1) Circulation Data

This breaks down the loans by year, by academic school, and by individual academic courses. This data will primarily be of interest to other academic libraries. UK academic libraries may be able to directly compare borrowing by matching up their courses against ours (using the UCAS course codes).

2) Recommendation Data

This is the data which drives the “people who borrowed this, also borrowed…” suggestions in our OPAC. This data had previously been exposed as a web service with a non-commercial licence, but is now freely available for you to download. We’ve also included data about the number of times the suggested title was borrowed before, at the same time, or afterwards.
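For anyone wondering how "also borrowed" counts like these might be derived, here’s a minimal Python sketch. It is not the code behind our OPAC: the (borrower, item, day) loan tuples, the function name, and the decision to look only at each borrower’s first loan of an item are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations


def also_borrowed(loans):
    """Count, per item, which other items the same borrowers took out and when.

    `loans` is an iterable of (borrower, item, day) tuples; only each
    borrower's first loan of an item is considered.
    """
    first_loan = defaultdict(dict)
    for borrower, item, day in loans:
        previous = first_loan[borrower].get(item)
        if previous is None or day < previous:
            first_loan[borrower][item] = day

    counts = defaultdict(lambda: defaultdict(lambda: {"before": 0, "same": 0, "after": 0}))
    for items in first_loan.values():
        for item, other in permutations(items, 2):
            # Was the suggested title (other) borrowed before, with, or after item?
            if items[other] < items[item]:
                when = "before"
            elif items[other] == items[item]:
                when = "same"
            else:
                when = "after"
            counts[item][other][when] += 1
    return counts


# Made-up loans: borrower, item, day number.
loans = [("b1", "A", 1), ("b1", "B", 3), ("b2", "A", 5), ("b2", "B", 5), ("b3", "B", 2)]
counts = also_borrowed(loans)
print(dict(counts["A"]))
```

Note that the borrower identifiers are only used while counting; nothing in the output refers back to an individual.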

Smaller data files provide further details about our courses, the relevant UCAS course codes, and expanded ISBN lookup indexes (many thanks to Tim Spalding for allowing the use of thingISBN data to enable this!).
All of the data is in XML format and, in the coming weeks, I’m intending to create a number of web services and APIs which can be used to fetch subsets of the data.
The clock has been ticking to get all of this done in time for the “Sitting on a gold mine: improving provision and services for learners by aggregating and using learner behaviour data” event, organised by the JISC TILE Project. Therefore, the XML format is fairly simplistic. If you have any comments about the structuring of the data, please let me know.
I mentioned that the data is a subset of our entire circulation data: the criteria for inclusion were that the relevant MARC record must contain an ISBN and borrowing must have been significant. So, you won’t find any titles without ISBNs in the data, nor any books which have only been borrowed a couple of times.
So, this data is just a droplet — a single pixel in a much larger picture.
Now it’s up to you to think about whether or not you can augment this with data from your own library. If you can’t, I want to know what the barriers to sharing are. Then I want to know how we can break down those barriers.
I want you to imagine a world where a first year undergraduate psychology student can run a search on your OPAC and have the results ranked by the most popular titles as borrowed by their peers on similar courses around the globe.
I want you to imagine a book recommendation service that makes Amazon’s look amateurish.
I want you to imagine a collection development tool that can tap into the latest borrowing trends at a regional, national and international level.
Sounds good? Let’s start talking about how we can achieve it.

FAQ (OK, I’m trying to anticipate some of your questions!)
Q. Why are you doing this?
A. We’ve been actively mining circulation data for the benefit of our students since 2005. The “people who borrowed this, also borrowed…” feature in our OPAC has been one of the most successful and popular additions (second only to adding a spellchecker). The JISC TILE Project has been debating the benefits of larger scale aggregations of usage data and we believe that would greatly increase the end benefit to our users. We hope that the release of the data will stimulate a wider debate about the advantages and disadvantages of aggregating usage data.
Q. Why Open Data Commons / CC0?
A. We believe this is currently the most suitable licence to release the data under. Restrictions limit (re)use and we’re keen to see this data used in imaginative ways. In an ideal world, there would be services to harvest the data, crunch it, and then expose it back to the community, but we’re not there yet.
Q. What about borrower privacy?
A. There’s a balance to be struck between safeguarding privacy and allowing usage data to improve our services. It is possible to have both. Data mining is typically about looking for trends — it’s about identifying sizeable groups of users who exhibit similar behaviour, rather than looking for unique combinations of borrowing that might relate to just one individual. Setting a suitable threshold on the minimum group size ensures anonymity.
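As a very rough sketch of the minimum group size idea, the Python below counts distinct borrowers per (item, course) group and drops any group that falls below a threshold. The function name, the shape of the loan tuples and the default threshold of 5 are all assumptions for illustration; they aren’t the values or code used for the released data.

```python
from collections import defaultdict


def aggregate_with_threshold(loans, min_group_size=5):
    """Count distinct borrowers per (item, course), dropping small groups."""
    borrowers = defaultdict(set)
    for borrower, item, course in loans:
        borrowers[(item, course)].add(borrower)
    # Only groups with enough distinct borrowers make it into the output.
    return {group: len(people) for group, people in borrowers.items()
            if len(people) >= min_group_size}


# Made-up loans: borrower, item, course.
loans = [("b1", "A", "psych"), ("b2", "A", "psych"), ("b2", "A", "psych"),
         ("b3", "A", "law")]
print(aggregate_with_threshold(loans, min_group_size=2))
```

Counting distinct borrowers rather than raw loans means one keen reader can’t push a group over the threshold on their own.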

Coming soon, to a blog near here…

Okay — I’m the first to admit I don’t blog enough… I still haven’t even blogged about how great Mashed Library 2008 was (luckily other attendees have already blogged about it!)
Anyway, unless I get run over by a bus, later on this week I’m going to post something fairly big — well, it’s about 90MB which perhaps isn’t that “big” these days — that I’m hoping will get a lot of people in the library world talking. What I’ll be posting will just be a little droplet, but I’m hoping one day it’ll be part of a small stream …or perhaps even a little river.


Show Us a Better Way

Thanks to Iman Moradi for highlighting this site:

Show Us a Better Way
Tell us what you’d build with public information and we could help fund your idea!
Ever been frustrated that you can’t find out something that ought to be easy to find? Ever been baffled by league tables or ‘performance indicators’? Do you think that better use of public information could improve health, education, justice or society at large?
The UK Government wants to hear your ideas for new products that could improve the way public information is communicated.
To show they are serious, the Government is making available gigabytes of new or previously invisible public information especially for people to use in this competition.
Go on, Show Us A Better Way.

The UK Government has come under a lot of criticism in the last few years for not making publicly funded data available, so does this mark a sea change in attitude?
My second thought when I read the web page was that you could do the same with your library… although I’m not suggesting you offer a top prize of £20,000!

Show Us a Better Way
Tell us what you’d do to improve the library and we could make it a reality!
Ever been frustrated that you can’t find out something that ought to be easy to find? Ever been baffled by library resources or the library services on offer?
We want to hear your ideas for new ways that we can improve our services.
Go on, Show Us A Better Way.

Alternatively, as we begin to make our library data available for re-use, this would be a great way of promoting unintended uses.