May 2009 – Self Plagiarism is Style

For ages, I’ve been meaning to add a web service front end to the book usage data that we released in December. So, better late than never, here it is!
It’s not the fastest bit of code I’ve ever written, but (if there’s enough interest) I could speed it up.
The web service can be called a couple of different ways:
1) using an ISBN
Examples:
a) https://library.hud.ac.uk/api/usagedata/isbn=0415014190 (“Language in the news”)
b) https://library.hud.ac.uk/api/usagedata/isbn=159308000X (“The Adventures of Huckleberry Finn”)
Assuming a match is located, data for one or more items will be returned. This will include FRBR-style matching using the LibraryThing thingISBN data, as shown in the second example, where we don’t have an item that exactly matches the given ISBN.
2) using an ID number
Examples:
a) https://library.hud.ac.uk/api/usagedata/id=125120 (“Language and power”)
The item ID numbers are included in the suggestion data and are the internal bibliographic ID numbers used by our library management system.
——————-
edit 1: I should also have mentioned that the XML returned is essentially the same format as described here.
edit 2: I’ve now rewritten the code as a mod_perl script (to make it faster when using ISBNs) and slightly altered the URL.
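For anyone wanting to call the service from a script, here is a rough Python sketch of the two URL forms shown above. The base URL is copied from the examples; the function names `usage_url` and `fetch_usage_xml` are made up for illustration, and nothing here assumes anything about the XML beyond it being returned as the response body.

```python
from urllib.request import urlopen

# Base URL copied from the examples in the post.
BASE_URL = "https://library.hud.ac.uk/api/usagedata/"


def usage_url(isbn=None, item_id=None):
    """Build a request URL from exactly one of an ISBN or an item ID."""
    if (isbn is None) == (item_id is None):
        raise ValueError("supply exactly one of isbn or item_id")
    if isbn is not None:
        return f"{BASE_URL}isbn={isbn}"
    return f"{BASE_URL}id={item_id}"


def fetch_usage_xml(url, timeout=30):
    """Return the raw XML response as bytes (this performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Building the URLs needs no network access.
print(usage_url(isbn="0415014190"))
print(usage_url(item_id=125120))
```

Calling `fetch_usage_xml(usage_url(isbn="159308000X"))` would then request the second ISBN example, where the match comes via thingISBN.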

At Mashed Library UK 2009, we’re planning to kick the event off with six 30-minute opening sessions. We’ve got two rooms, so there’ll be a session running in each room at the same time. Since a delegate can’t be in two places at the same time, they’ll only be able to go to three of the six sessions. So, how do you ensure that you keep everyone happy and that you don’t have too many clashes (i.e. having to miss a session you’d have quite liked to have gone to)?
Having never organised an event before, I’m guessing the usual way would be to try and schedule sessions together that target different audiences? However, that sounds like a potential headache inducer and I’m a programmer, not a planner!
So, what we’re going to do, once we’ve got all six sessions finalised, is to let each of the 60 odd delegates (and by that I mean we’ve got more than 60 delegates!) rank the sessions in order of preference. So, their 1st, 2nd, and 3rd choices would be the three sessions that they’d most like to go to.
With that kind of data, you’d expect to see some clustering (i.e. delegates making the same or similar choices) and so (in theory) there will be an optimal sequencing of sessions that will give the most delegates the best chance of going to their three top choices.
There’s a wide variety of programming techniques for finding optimal solutions to problems, from the simple to the complex (e.g. simulated annealing and genetic algorithms). However, because I’d got a bath running, I decided to knock up a quick hack using the simplest method — randomly generate a session sequence and then see how well it meets the choices of the delegates. By the way, if you want to learn more about calculating optimal solutions, see “Programming Collective Intelligence” by Toby Segaran (ISBN 9780596529321).
With any optimal solution code, you need a way of measuring the success of a given solution. To my mind, that would be “happiness”: if you find a solution that gives a delegate the ability to attend their top three choices, they’ll be very happy, but if you have a session clash for their 1st and 2nd choices, they won’t be happy. Once you’ve calculated the overall “happiness” for all the delegates, you can compare that particular solution with other random solutions (i.e. “does this session sequence generate more happiness or less than the previous one?”).
I hadn’t planned on releasing the code, as it really was a 5 minute “quick and dirty” hack, but Ben tweeted to say he might find it useful, so I’ve uploaded the Perl script to here. I’ve also included a sample file containing some dummy delegate choices.
For each delegate, there’s a comma-separated list of ranks, one per session in session order (1 = top choice)…

Andy    2,4,3,5,6,1

…so Andy’s top choice is session 6, followed by session 1, then session 3, etc.
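To make the file format concrete, here is a small Python sketch (not the Perl script itself) that turns one line of the sample file into a list of sessions in order of preference. The function name `parse_preferences` is invented for illustration, and the sketch assumes the name and the ranks are separated by whitespace, as in the example line.

```python
def parse_preferences(line):
    """Split 'Name  r1,r2,...' into the name and the sessions ordered by rank.

    The Nth number is the rank given to session N, so sorting the session
    numbers by their rank puts the top choice first.
    """
    name, ranks_text = line.split(None, 1)
    ranks = [int(rank) for rank in ranks_text.split(",")]
    sessions = range(1, len(ranks) + 1)
    ordered = sorted(sessions, key=lambda session: ranks[session - 1])
    return name, ordered


name, ordered = parse_preferences("Andy    2,4,3,5,6,1")
print(name, ordered)
```

For Andy's line, the first three entries of the ordered list are sessions 6, 1 and 3, matching the description above.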
If you run the Perl script, it’ll pick a random session sequence and calculate the happiness. It’ll keep on looping and trying to find better solutions until it finds one that can’t be improved upon. You’d probably want to run the code several times to ensure that the final solution really is the best one. You might want to also try one of the alternative $overall calculations to see if that produces the same session sequence.
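The general shape of that loop can be sketched in a few lines of Python. This is not the uploaded Perl script: the scoring here is a deliberately simple stand-in (count the delegates with no 1st/2nd or 2nd/3rd clash), the loop stops after a fixed number of tries rather than using whatever stopping rule the script has, and every delegate except Andy is invented.

```python
import random


def random_schedule(sessions, rng):
    """Shuffle the sessions and pair them up into consecutive time slots."""
    shuffled = list(sessions)
    rng.shuffle(shuffled)
    return [tuple(shuffled[i:i + 2]) for i in range(0, len(shuffled), 2)]


def clash_free_delegates(schedule, top_choices):
    """Stand-in score: delegates whose 1st/2nd and 2nd/3rd choices don't collide."""
    slot_of = {s: idx for idx, slot in enumerate(schedule) for s in slot}
    count = 0
    for first, second, third in top_choices.values():
        if slot_of[first] != slot_of[second] and slot_of[second] != slot_of[third]:
            count += 1
    return count


def random_search(sessions, top_choices, tries=500, seed=None):
    """Try random schedules and keep the best-scoring one seen so far."""
    rng = random.Random(seed)
    best_schedule, best_score = None, -1
    for _ in range(tries):
        candidate = random_schedule(sessions, rng)
        score = clash_free_delegates(candidate, top_choices)
        if score > best_score:
            best_schedule, best_score = candidate, score
    return best_schedule, best_score


# Andy's top three come from the example line; Beth and Cary are made up.
top_choices = {"Andy": (6, 1, 3), "Beth": (6, 3, 2), "Cary": (1, 4, 6)}
schedule, score = random_search(range(1, 7), top_choices, seed=1)
print(schedule, score)
```

With six sessions in three slots there are only 15 distinct pairings, so a few hundred random tries is ample here; for bigger problems the simulated annealing or genetic approaches mentioned earlier start to earn their keep.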
Here’s an example of an early solution…

[1]     session 1 = 11 delegate(s)
[1]     session 6 = 4 delegate(s)
[2]     session 5 = 6 delegate(s)
[2]     session 4 = 9 delegate(s)
[3]     session 2 = 8 delegate(s)
[3]     session 3 = 7 delegate(s)
HAPPINESS = 87 (5.8)
        1       Andy    -4.8
        3       Beth    -2.8
        3       Cary    -2.8
        9       Dave    +3.2
        5       Earl    -0.8
        9       Fred    +3.2
        9       Gene    +3.2
        3       Hans    -2.8
        9       Iggy    +3.2
        5       Jane    -0.8
        5       Karl    -0.8
        9       Leah    +3.2
        9       Macy    +3.2
        3       Neil    -2.8
        5       Owen    -0.8
CLASHES = 7 / OVERALL = 12.4285714285714 / DIFF = 38.4

In the above output, it’s proposing to run sessions 1 & 6 together, then 5 & 4, and finally 2 & 3. By looking at the delegate choices, you can easily calculate which of the two concurrent sessions each delegate would prefer to go to (i.e. 11 delegates would choose to go to session 1).
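That per-slot choice is easy to sketch in Python: in each slot, a delegate goes to whichever of the two sessions they ranked better. The function names are invented for illustration, the schedule is the one from the output above, and only Andy's ranks come from the post (Beth's are made up).

```python
from collections import Counter


def attended_sessions(schedule, ranks):
    """For each slot, pick the session this delegate ranked best (lowest number)."""
    return [min(slot, key=lambda session: ranks[session - 1]) for slot in schedule]


def attendance_counts(schedule, all_ranks):
    """Tally how many delegates would turn up to each session."""
    counts = Counter()
    for ranks in all_ranks.values():
        counts.update(attended_sessions(schedule, ranks))
    return counts


schedule = [(1, 6), (5, 4), (2, 3)]  # sessions 1 & 6, then 5 & 4, then 2 & 3
all_ranks = {
    "Andy": [2, 4, 3, 5, 6, 1],  # from the example line in the post
    "Beth": [1, 2, 3, 4, 5, 6],  # invented for illustration
}
print(attended_sessions(schedule, all_ranks["Andy"]))
print(attendance_counts(schedule, all_ranks))
```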
The code also calculates a “happiness” value for each delegate. If a delegate gets to go to their 1st, 2nd and 3rd choices, then they’d get a maximum happiness score of 9 (3 x 3 points). If a 1st choice session is being run at the same time as their 2nd choice (or a 2nd at the same time as the 3rd), that would make them unhappy, so a point is deducted. If a 1st choice runs at the same time as their 3rd choice, they’d probably accept that (however, nothing is added to their happiness score).
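Read literally, that rule could be sketched in Python as below. Treat it as an assumption-laden illustration rather than the script's real scoring: the sample output only ever shows scores of 1, 3, 5 and 9, so the actual script evidently weights things differently, which is why the penalty is left as a parameter. The function names are invented.

```python
def clashes(schedule, top_three):
    """Count 1st/2nd and 2nd/3rd choice collisions for one delegate."""
    slot_of = {s: idx for idx, slot in enumerate(schedule) for s in slot}
    first, second, third = top_three
    return int(slot_of[first] == slot_of[second]) + int(slot_of[second] == slot_of[third])


def happiness(schedule, top_three, penalty=1):
    """Start from the maximum of 9 and deduct `penalty` per clash.

    A 1st choice clashing with a 3rd choice is tolerated, so it costs nothing.
    The default penalty of 1 is a literal reading of the text, not a value
    confirmed from the script.
    """
    return 9 - penalty * clashes(schedule, top_three)


schedule = [(1, 6), (5, 4), (2, 3)]
print(happiness(schedule, (6, 1, 3)))  # 1st and 2nd choices share a slot
print(happiness(schedule, (6, 4, 3)))  # all three choices in different slots
print(happiness(schedule, (1, 5, 6)))  # only 1st and 3rd share a slot
```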
Once all the scores have been calculated, we get an overall happiness of 87 (out of a possible 135, i.e. 15 delegates x the maximum happiness score of 9) and the average happiness is 5.8 out of 9.
We can also see how (un)happy each delegate is and how much they deviate from the average happiness. Dave, Fred, Gene, Iggy, Leah and Macy all get to go to their top 3 choices, so they’ve all got scores of 9 out of 9. Andy is very unhappy (1 out of 9). The others are somewhere in the middle, so they’ve all had to make compromises and won’t be going to their top 3 sessions.
There are 7 clashes (when a 1st choice runs at the same time as the 2nd, or the 2nd at the same time as the 3rd). Ideally, we’d like to keep the clashes to a minimum.
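The summary figures can be re-derived from the per-delegate scores printed in the first example. In the Python sketch below, the scores and the clash count are copied from that output; the formulas for OVERALL (total happiness divided by clashes) and DIFF (sum of the absolute deviations from the average) are inferred from the numbers matching, not taken from the script.

```python
# Per-delegate scores and clash count copied from the first example output.
scores = {
    "Andy": 1, "Beth": 3, "Cary": 3, "Dave": 9, "Earl": 5,
    "Fred": 9, "Gene": 9, "Hans": 3, "Iggy": 9, "Jane": 5,
    "Karl": 5, "Leah": 9, "Macy": 9, "Neil": 3, "Owen": 5,
}
clash_count = 7

total = sum(scores.values())
average = total / len(scores)
deviations = {name: score - average for name, score in scores.items()}

# Inferred formulas: these reproduce the printed figures for both examples.
diff = sum(abs(d) for d in deviations.values())
overall = total / clash_count

print(f"HAPPINESS = {total} ({average:.1f}) out of {9 * len(scores)}")
print(f"Andy deviates by {deviations['Andy']:+.1f}")
print(f"CLASHES = {clash_count} / OVERALL = {overall:.4f} / DIFF = {diff:.1f}")
```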
Here’s an example of a better solution (which might actually be the optimal solution for the dummy data)…

[1]     session 3 = 9 delegate(s)
[1]     session 5 = 6 delegate(s)
[2]     session 4 = 9 delegate(s)
[2]     session 6 = 6 delegate(s)
[3]     session 1 = 10 delegate(s)
[3]     session 2 = 5 delegate(s)
HAPPINESS = 101 (6.73333333333333)
        5       Andy    -1.73333333333333
        9       Beth    +2.26666666666667
        3       Cary    -3.73333333333333
        3       Dave    -3.73333333333333
        5       Earl    -1.73333333333333
        3       Fred    -3.73333333333333
        5       Gene    -1.73333333333333
        9       Hans    +2.26666666666667
        5       Iggy    -1.73333333333333
        9       Jane    +2.26666666666667
        9       Karl    +2.26666666666667
        9       Leah    +2.26666666666667
        9       Macy    +2.26666666666667
        9       Neil    +2.26666666666667
        9       Owen    +2.26666666666667
CLASHES = 2 / OVERALL = 50.5 / DIFF = 36.2666666666667

The average happiness is now up to 6.73 per delegate and there are only 2 clashes, which is much better. Cary, Dave and Fred will be the most affected by this particular session scheduling, but we now have 8 delegates attending their top choices.
So, the big question will be: what happens when we get the real data from the 60 odd delegates who are coming to Mashed Library? Stay tuned for the answer!
