I’ve been dealing with a lot of dumb technology problems lately.

The Ümlaut has uncovered stupid issues with both Voyager and SFX in the last week (Voyager’s Z39.50 server returns every record in the database if you do an ‘or’ search on the 001 and SFX’s handling of title changes is too complicated to even mention).

This simple fix makes me very happy, though.

Some backstory:

In June, I had a meeting with our Digital Initiatives department about the effect of the Ümlaut on their services, namely DSpace. I told them that it was impractical to search SMARTech (our DSpace instance) for every incoming citation, since, theoretically, SMARTech should appear in the Google/Yahoo results. When we tested the theory, our results looked like this. This is obviously ugly, but even worse, it probably discourages discovery of the items in the repository. One has to assume that people are mainly finding items via the search engines. When they see results like that, with no real indication of what they’re looking at, they will probably just move on (even if DSpace is holding the preprint to what they’re searching for).

I left the meeting and asked Dorothea Salo if she, too, had this problem and if she knew a fix for it. About 20 minutes later, she had this awesome title hack worked up. I sent it to our DSpace admin (I, thankfully, don’t deal with it directly) and now we get to bask in the glory of our new and spiffy title listing in the search engines.

Thanks, Dorothea, for doing something genius and simple that DSpace should have done years ago.

I am still feeling my way around Python. I have yet to grasp the zen of being Pythonic, but I am at least coming to grips with real object orientation (as opposed to the named hashes of PHP) and am actually taking the leap into error handling, which, as anyone who has dealt with the myriad bugs in my other projects would know, has been a bit of a foreign concept to me.

Python project #2 is known as RepoMan (thanks to Ed Summers for the name). It attempts to solve a problem that not one but two other open source projects have already solved admirably (I’ll go into more about this in a bit). RepoMan is an OAI repository indexer that makes said repository available via SRU. I created it in an attempt to make our DSpace implementation searchable from remote applications (namely, the site search and the upcoming alternative opac). It’s an extremely simple two-script project that has taken only a week to get running, largely thanks to the existence of two similar, available Python scripts that I could modify for my own use. It’s also due to the help of Ed Summers and Aaron Lav.

The harvester is, basically, Thom Hickey’s one page OAI harvester with some minor modifications. I have added error handling (the two lines I added to compensate for malformed XML must have been over the “one page limit”) and, instead of outputting to a text file, it shoves the records into a Lucene index (thanks to PyLucene). This part still needs some work (I’m not sure what it would do with an “updated” record, for example), but it makes a nice index of the Dublin Core fields, plus a field for the whole record, for “default” searches. This was a good exercise for me in working with XML, Python and Lucene, because I was having some trouble when trying to index the MODS records for the alternative opac.
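To make the indexing step a little more concrete, here is a minimal sketch of how one harvested batch might be turned into per-field documents. It is not RepoMan’s actual code: the function name `records_to_docs` is made up for illustration, it assumes the records arrive as `oai_dc` inside an OAI-PMH `ListRecords` response, the “default” field here is just the Dublin Core text joined together (an approximation of indexing the whole record), and it stops at plain Python dicts rather than guessing at PyLucene calls.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH 2.0 and simple Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def records_to_docs(xml_text):
    """Turn one OAI-PMH ListRecords response into a list of field dicts.

    Hypothetical helper: each dict would then be handed to whatever
    indexer is in use (Lucene, in RepoMan's case).
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Malformed XML: give up on this batch instead of crashing the harvest.
        return []

    docs = []
    for record in root.iter(OAI_NS + "record"):
        doc = {}
        identifier = record.find(OAI_NS + "header/" + OAI_NS + "identifier")
        if identifier is not None and identifier.text:
            # Keeping the OAI identifier gives an indexer a key to match on.
            doc["id"] = identifier.text.strip()
        # Collect every Dublin Core element as a (repeatable) field.
        for element in record.iter():
            if element.tag.startswith(DC_NS) and element.text and element.text.strip():
                name = element.tag[len(DC_NS):]
                doc.setdefault(name, []).append(element.text.strip())
        # "default" gathers all Dublin Core text for unfielded searches.
        values = [v for name, vals in doc.items() if name != "id" for v in vals]
        doc["default"] = " ".join(values)
        docs.append(doc)
    return docs
```

Because each document keeps its OAI identifier, an indexer could, in principle, delete any existing entry with the same identifier before adding the new one, which is one possible way to cope with “updated” records.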

The SRU server is, basically, Dan Chudnov’s SRU implementation for unalog. It needed to be de-Quixotefied and is, in fact, much more robust than Dan’s original (of course, unalog’s implementation doesn’t need to be as “robust”, since the metadata is much more uniform), but certainly having a working model to modify made this go much, much faster. The nice part is that there might be some stuff in there that Dan might want to put back into unalog.

So, here is the result. The operations currently supported are explain and searchRetrieve, and the majority of CQL relations are unsupported, but it does most of the queries I need it to do and, most importantly, it’s less than a week old.
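For anyone who hasn’t poked at SRU before, the two supported operations are just HTTP GET requests with a handful of standard parameters. Here is a rough sketch of how a client might build those URLs. The base URL, the `sru_url` helper, the `dc.title` index name and the choice of version 1.1 are all assumptions for illustration, not details of the RepoMan server.

```python
from urllib.parse import urlencode


def sru_url(base_url, operation="explain", query=None,
            start=1, maximum=10, version="1.1"):
    """Build an SRU request URL (hypothetical helper, not part of RepoMan)."""
    params = {"operation": operation, "version": version}
    if operation == "searchRetrieve":
        if not query:
            raise ValueError("searchRetrieve needs a CQL query")
        # query is a CQL string; startRecord is 1-based in SRU.
        params.update({
            "query": query,
            "startRecord": start,
            "maximumRecords": maximum,
        })
    return base_url + "?" + urlencode(params)


# Assumed endpoint, for illustration only.
BASE = "http://example.org/sru"
explain_url = sru_url(BASE)
search_url = sru_url(BASE, "searchRetrieve", query="dc.title = preprint")
```

A remote application, such as a site search, would fetch `search_url` and parse the XML response it gets back.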

So the burning question here is: why on earth would I waste time developing this when OCKHAM’s Harvest-to-Query is out there, and, even more specifically, OCLC’s SRW/U implementation for DSpace is available? Further, I knew full well that these projects existed before I started.

Lemme tell ya.

Harvest-to-Query looked very promising. I began down this road, but stopped about halfway down the installation document. Granted, anything that uses Perl, TCL and PHP has to be, well, something… After all, those were the first three languages I learned (and in the same order!). Adding in IndexData’s Zebra seemed logical as well since it has a built-in Z39.50 server. Still, this didn’t exactly solve my problem. I’d have to install yazproxy, as well, in order to achieve my SRU requirement. Requiring Perl, TCL, PHP, Zebra and yazproxy is a bit much to maintain for this project. Too many dependencies and I am too easily distracted.

OCLC’s SRW/U seemed so obvious. It seemed easy. It seemed perfect. Except our DSpace admin couldn’t get it to work. Oh, I inquired. I nagged. I pestered. That still didn’t make it work. I have very limited permissions on the machine that DSpace runs on (and no permissions for Tomcat), so there was little I could do to help. It also would have served only this one specific purpose, without necessarily addressing any other OAI providers that we might have.

So, enter RepoMan. Another wheel that closely resembles all the other wheels out there, but possibly with minor cosmetic changes. Let a thousand wheels be invented.