OAI

For a couple of months this year, the library world was aflame with rage at the proposed OCLC licensing policy regarding bibliographic records. It was a justifiable complaint, although I basically stayed out of it: it just didn’t affect me very much. After much gnashing of teeth, petitions, open letters from consortia, etc., OCLC eventually rescinded their proposal.

Righteous indignation: 1, “the man”: 0.

While this could certainly be counted as a success (I think, although it means we default to the much more ambiguous 1987 guidelines), there is a bit of a mixed message here about where the library community’s priorities lie. It’s great that you now have the right to share your data, but, really, how do you expect to do it?

It has been a little over a year since the Jangle 1.0 specification was released; 15 months or so since all of the major library vendors (with one exception) agreed to the Digital Library Federation’s “Berkeley Accord”; and we’re at the anniversary of the workshop where the vendors actually agreed on how we would implement a “level 1” DLF API.

So far, not a single vendor at the table has honored their commitment, and I have seen no sign of any intention to do so, with the exception of Koha (although, interestingly, the work there is not being done by the company that represented Koha in the Accord).

I am going to focus here on the DLF ILS-DI API, rather than Jangle, because it is something we all agreed to. For all intents and purposes, Jangle and the ILS-DI are interchangeable: I think anybody who has invested any energy in either project would be thrilled if either one actually caught on and was implemented in a major ILMS. Both specifications share the same scope and purpose. The resources required to support one would be the same as for the other; the only difference between the two is the client-side interface. Jangle technically meets all of the recommendations of the ILS-DI, but not with the bindings that we, the vendors, agreed to (although there is an ‘adapter’ to bridge that gap). Despite having spent the last two years of my life working on Jangle, I would be thrilled to no end if the ILS-DI saw broad uptake. I couldn’t care less about the serialization; I only care about the access.
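To make “access” concrete: the point of either API is that a client should be able to ask the ILS simple questions over plain HTTP. Here is a minimal sketch of what that could look like from the client side, assuming a GetAvailability-style binding; the service URL, parameters, and element names below are illustrative placeholders, not the normative bindings.

```python
# A sketch of a client hitting a hypothetical ILS-DI-style availability
# service over plain HTTP. The URL, parameters, and element names are
# placeholders; the point is that this is all a discovery layer needs.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://opac.example.edu/ils-di"  # hypothetical service root

def get_availability(record_id):
    url = "%s/availability?id=%s&id_type=bib" % (BASE, record_id)
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    # Yield whatever availability status elements the server returns.
    for elem in tree.iter():
        if elem.tag.endswith("availabilitystatus"):
            yield elem.text

for status in get_availability("b1234567"):
    print(status)
```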

There is only one reason that the vendors are not honoring their commitment:  libraries aren’t demanding that they do.

Why is this?  Why the rally to ensure that our bibliographic data is free for us to share when we lack the technology to actually do the sharing?

When you look at the open source OPAC replacements (I’m only going to refer to the OSS ones here, because they are transparent, as opposed to their commercial counterparts): VuFind, Blacklight, Scriblio, etc., and take stock of the hoops that have to be jumped through to populate their indexes and check availability, it’s enough to make most libraries throw their hands in the air and walk away. There are batch dumps of MARC records. Rsync jobs to get the data to the OPAC server. Cron jobs to get the MARC into the discovery system. Screen scrapers and one-off “drivers” to parse holdings and status. It is a complete mess.
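To make the mess concrete, here is a sketch of the kind of one-off “driver” these discovery layers end up shipping: scrape the classic OPAC’s HTML and guess at the status. The URL and markup pattern below are hypothetical, which is exactly the problem; every ILS (and every skin of every ILS) needs its own.

```python
# The screen-scraping "driver" pattern: fetch the OPAC's item page and
# regex out a status string. The URL and HTML pattern are hypothetical,
# and the whole thing breaks the moment the vendor changes the markup.
import re
import urllib.request

OPAC_ITEM_URL = "http://opac.example.edu/record=%s"  # hypothetical

def scrape_status(bib_id):
    page = urllib.request.urlopen(OPAC_ITEM_URL % bib_id).read()
    html = page.decode("utf-8", "replace")
    match = re.search(r'<td class="status">\s*([^<]+)</td>', html)
    return match.group(1).strip() if match else "UNKNOWN"

print(scrape_status("b1234567"))
```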

The same mess holds for every Primo, Encore, Worldcat Local, AquaBrowser, etc. that isn’t sold to an internal customer.

If you’ve ever wondered why the third-party integration and enrichment services are ultimately somewhat unsatisfying (think BookSite.com, or how LibraryThing for Libraries is really only useful when you can actually find something), this is it. The vendors have made it nearly impossible for a viable ecosystem to exist because there is no good way to access the library’s own data.

And it has got to stop.

For the OCLC withdrawal to mean anything, libraries have got to put pressure on their vendors to support one of the two open APIs, migrate to a vendor that does, or circumvent the vendors entirely by implementing the specifications themselves (and sharing the result with their peers). This cartel of closed access is stifling innovation and, ultimately, hurting library users.

I’ll hold up my end (and ensure it’s ILS-DI compatible via this) and work towards it being officially supported here, but the 110 or so Alto customers aren’t exactly going to make or break this.

Hold your vendor’s feet to the fire and insist they uphold their commitment.

I am still feeling my way around Python. I have yet to grasp the zen of being Pythonic, but I am at least coming to grips with real object orientation (as opposed to the named hashes of PHP), and I am actually taking the leap into error handling, which, if you have dealt with any of the myriad bugs in my other projects, you know has been a bit of a foreign concept to me.

Python project #2 is known as RepoMan (thanks to Ed Summers for the name). It attempts to solve a problem that not one but two other open source projects have already solved admirably (more on that in a bit). RepoMan is an OAI repository indexer that makes said repository available via SRU. I created it in an attempt to make our DSpace implementation searchable from remote applications (namely, the site search and the upcoming alternative opac). It’s an extremely simple two-script project that took only a week to get running, largely due to the existence of two similar and available Python scripts that I could modify for my own use. It’s also due to the help of Ed Summers and Aaron Lav.

The harvester is, basically, Thom Hickey’s one-page OAI harvester with some minor modifications. I have added error handling (the two lines I added to compensate for malformed XML must have been over the “one page limit”), and instead of outputting to a text file, it shoves the records into a Lucene index (thanks to PyLucene). This part still needs some work (I’m not sure what it would do with an “updated” record, for example), but it makes a nice index of the Dublin Core fields, plus a field for the whole record for “default” searches. This was a good exercise for me in working with XML, Python, and Lucene, because I was having some trouble when trying to index the MODS records for the alternative opac.
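Thom’s harvester isn’t reproduced here, but the core of any OAI-PMH harvest is the same loop: page through ListRecords responses by resumption token and hand each record off to the indexer. This is a rough sketch under those assumptions; the base URL is a placeholder, and index_record() stands in for the PyLucene step.

```python
# A rough sketch of the OAI-PMH harvest loop: page through ListRecords
# via resumption tokens. BASE_URL is a placeholder, and index_record()
# stands in for the PyLucene indexing step.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://dspace.example.edu/oai/request"  # placeholder

def index_record(record):
    pass  # the real code adds a Lucene Document with the DC fields

def harvest(base_url, prefix="oai_dc"):
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    while True:
        body = urllib.request.urlopen(
            base_url + "?" + urllib.parse.urlencode(params)).read()
        try:
            root = ET.fromstring(body)
        except ET.ParseError:
            break  # stand-in for the malformed-XML handling above
        for record in root.iter(OAI + "record"):
            index_record(record)
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}

harvest(BASE_URL)
```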

The SRU server is, basically, Dan Chudnov’s SRU implementation for unalog. It needed to be de-Quixote-fied, and it is now, in fact, much more robust than Dan’s original (of course, unalog’s implementation doesn’t need to be as “robust”, since its metadata is much more uniform), but certainly having a working model to modify made this go much, much faster. The nice part is that there may be some stuff in there that Dan will want to fold back into unalog.

So, here is the result. The only operations currently supported are explain and searchRetrieve, and the majority of CQL relations are unsupported, but it does most of the queries I need it to do and, most importantly, it’s less than a week old.
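For flavor, here is a toy sketch of the dispatch side of an SRU server like this one: parse the GET parameters, route on the operation, and answer with an explain or searchRetrieve response. None of this is RepoMan’s actual code; search() is a stub for the Lucene-backed query, and real SRU responses are namespaced and carry version and record elements omitted here.

```python
# A toy sketch of SRU dispatch: route on the "operation" parameter and
# answer explain / searchRetrieve requests. search() is a stub for the
# Lucene-backed query; real SRU responses are namespaced and carry
# version and record elements omitted here for brevity.
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

def search(cql_query):
    return []  # stub: the real server translates CQL and hits Lucene

def app(environ, start_response):
    params = parse_qs(environ.get("QUERY_STRING", ""))
    operation = params.get("operation", ["explain"])[0]
    if operation == "searchRetrieve":
        hits = search(params.get("query", [""])[0])
        body = ("<searchRetrieveResponse>"
                "<numberOfRecords>%d</numberOfRecords>"
                "</searchRetrieveResponse>" % len(hits))
    else:
        body = "<explainResponse/>"
    start_response("200 OK", [("Content-Type", "text/xml")])
    return [body.encode("utf-8")]

make_server("", 8080, app).serve_forever()
```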

So the burning question here is: why on earth would I waste time developing this when OCKHAM’s Harvest-to-Query is out there, and, even more specifically, OCLC’s SRW/U implementation for DSpace is available? Further, I knew full well that these projects existed before I started.

Lemme tell ya.

Harvest-to-Query looked very promising. I began down this road, but stopped about halfway down the installation document. Granted, anything that uses Perl, TCL and PHP has to be, well, something… After all, those were the first three languages I learned (and in the same order!). Adding in IndexData’s Zebra seemed logical as well, since it has a built-in Z39.50 server. Still, this didn’t exactly solve my problem. I’d have to install yazproxy as well in order to meet my SRU requirement. Requiring Perl, TCL, PHP, Zebra and yazproxy is a bit much to maintain for this project. Too many dependencies, and I am too easily distracted.

OCLC’s SRW/U seemed so obvious. It seemed easy. It seemed perfect. Except our DSpace admin couldn’t get it to work. Oh, I inquired. I nagged. I pestered. That still didn’t make it work. I have very limited permissions on the machine that DSpace runs on (and no permissions for Tomcat), so there was little I could do to help. This also served a specific purpose, but didn’t necessarily address any other OAI providers we might have.

So, enter RepoMan. Another wheel that closely resembles all the other wheels out there, but possibly with minor cosmetic changes. Let a thousand wheels be invented.