Archive

SRU

Like most Indexdata products, Metaproxy is an incredibly useful, although damn near impenetrable, application.  I have been using it to manage our Z39.50 access in the project I’ve been working on for the last year or so (Talis Aspire Digitised Content, if you’re interested).  Its purpose is two-fold: most importantly, it makes the Z39.50 services available via SRU (so we don’t need Z39.50 support in our apps themselves), and it also allows us to federate a bunch of library catalogs in one target, so we can get a cross-section of major libraries in one query.

For the last nine months or so, I’ve been using Metaproxy quite successfully, albeit in a very low-volume and tightly scoped context, without really understanding in the slightest how it works (as one tends to do with Indexdata-produced software) or what the configuration settings really did.  Despite the fact that we were pointing at a little under twenty-five Z39.50 targets, it just worked (after some initial trial and error) even though those targets made up a diverse cross-section of ILMSes (Voyager, Symphony, Aleph, Prism).  Granted, we’re only searching on ISBN and ISSN right now, but none of the currently existing catalogs required any special configuration.

There was a notable vendor that wasn’t represented, however.

Recently, TADC has gone from ‘closed pilot’ to ‘available for any institution to request a demo’.  A university recently requested a demo, and when I added their Z39.50 target (which, you will not be surprised to learn, came from the vendor we hadn’t dealt with) to Metaproxy, I noticed I kept getting ‘invalid combination of attributes for index’ errors when I tried to do ISBN and ISSN queries via SRU (although, interestingly, not via Z39.50).

If you’re not familiar with queries in Z39.50, they have an incredibly opaque construction where every query element takes (up to) 6 ‘attributes’:

  1. use
    which field you want to search: title, author, etc.
  2. relation
    =, >, <, exact, etc.
  3. position
    first in field, first in subfield, anywhere in field, etc.
  4. structure
    word, phrase, keyword, date, etc.
  5. truncation
    left, right, no truncation, etc.
  6. completeness
    incomplete subfield, complete subfield, complete field

So a query for ISBN=1234567890 in Prefix Query Format (PQF – what Z39.50 uses) would look like:

@attr 1=7 @attr 2=3 @attr 3=2 @attr 4=1 @attr 5=100 @attr 6=1 1234567890

To translate this, you refer to the Bib-1 attribute set, but to break it down, it’s saying: search the ISBN field for strings that start with ‘1234567890’ followed by a space (and possibly more characters) or the end of the field. Servers will often have default behavior on fields so you don’t, in practice, always have to send all 6 attributes (often you only need to send the use attribute), but accepted attribute combinations are completely at the discretion of the server.  The servers we were pointing at up until now were happy with @attr 1=7 @attr 2=3 1234567890
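
Incidentally, if you want to see for yourself which attribute combinations a given server will tolerate, yaz-client (part of the YAZ toolkit) is a handy way to throw raw PQF at it. Something like this, using the Library of Congress target that shows up in the configuration examples later on:

yaz-client z3950.loc.gov:7090/voyager
Z> find @attr 1=7 @attr 2=3 1234567890
Z> show 1

If the server doesn’t like the combination, you get a Bib-1 diagnostic back instead of a hit count.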

Since this is a horribly arcane search syntax, CQL was developed to replace it.  To do the above query in CQL, all you need is:

bath.isbn = 1234567890

Where bath defines the context set (the vocabulary of fields), and isbn is the field to search (yes, I realize that the Bath context set is deprecated, but the Bibliographic context set requires index modifiers, which nobody supports, as far as I can tell).

However, to make this work in Metaproxy, you need a mapping to translate the incoming CQL to PQF to send to the Z39.50 server.  And this is where our demo instance was breaking down. When I changed the mapping to work with the newly added catalog, some (but not all!) of the existing catalogs would stop returning results for ISBN/ISSN queries. I needed a different configuration for them, which meant that I actually had to figure out how Metaproxy works.

Metaproxy’s documentation explains that it is basically made up of three components:

  • Packages

    A package is request or response, encoded in some protocol, issued by a client, making its way through Metaproxy, sent to or received from a server, or sent back to the client.

    The core of a package is the protocol unit – for example, a Z39.50 Init Request or Search Response, or an SRU searchRetrieve URL or Explain Response. In addition to this core, a package also carries some extra information added and used by Metaproxy itself.

    Um, ok.  To be honest, I still don’t really understand what packages are.  They don’t seem to exist in the example configurations, or at least not in the ones I care about.

  • Routes

    Packages make their way through routes, which can be thought of as programs that operate on the package data-type. Each incoming package initially makes its way through a default route, but may be switched to a different route based on various considerations.

    Well, this seems to make sense, at least. A request can be routed through certain paths. Check.

  • Filters

    Filters provide the individual instructions within a route, and effect the necessary transformations on packages. A particular configuration of Metaproxy is essentially a set of filters, described by configuration details and arranged in order in one or more routes. There are many kinds of filter – about a dozen at the time of writing with more appearing all the time – each performing a specific function and configured by different information.

    The word “filter” is sometimes used rather loosely, in two different ways: it may be used to mean a particular type of filter, as when we speak of “the auth_simple filter” or “the multi filter”; or it may be used to mean a specific instance of a filter within a Metaproxy configuration.

    Ugh, well that’s clear as mud, isn’t it? But, ok, so these are the things that do what you want to happen in the route. The documentation for these is pretty spartan, as well (considering they’re supposed to do the bulk of the work), but maybe through some trial and error we can figure it out.

All of this is declared in an XML configuration file; here’s an example that comes supplied with Metaproxy’s sources.  In this file you have a metaproxy root element, and under that you have a start tag where you declare the default route that every request goes through to begin with.

Generally, this is also where you’d declare your global-level filters.  Filters can be defined in two ways: with an id attribute that can be called one or more times from within routes, or you can put a filter element directly in the route (probably without an id attribute), which can be thought of as being scoped locally to that route.
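
Stripped down to its bones, the overall shape of the file looks something like this (the namespace and version are what I have in my own config; double-check them against the example shipped with your copy of Metaproxy):

<?xml version="1.0"?>
<metaproxy xmlns="http://indexdata.com/metaproxy" version="1.0">
  <start route="start"/>
  <filters>
    <!-- globally defined filters, referenced by id -->
  </filters>
  <routes>
    <route id="start">
      <!-- locally scoped filters and refid references -->
    </route>
  </routes>
</metaproxy>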

For the global filters, you put a filters element under your metaproxy root element under which there are one or more filter elements. These filter elements should have id and type attributes (the types are available here).  You would also define the behavior of your filter here.  Here’s an example of how we define our two different CQL to PQF mappings:

<filters>
  <filter id="default-cql-rpn" type="cql_rpn">
    <conversion file="../etc/cql2pqf.txt" />
  </filter>
  <filter id="other-cql-rpn" type="cql_rpn">
    <conversion file="../etc/other-cql2pqf.txt" />
  </filter>
</filters>
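
The conversion files themselves are in YAZ’s CQL-to-PQF mapping format, which maps CQL indexes, relations, and so on to the Bib-1 attributes that should be emitted. A cut-down sketch of what one of ours looks like follows; the attribute values are exactly the sort of thing that differs between the two files, which is the whole reason for having two of them:

# which use attribute to emit for each CQL index
index.cql.serverChoice = 1=1016
index.bath.isbn        = 1=7
index.bath.issn        = 1=8
# relation, position, structure and truncation attributes
relation.eq            = 2=3
position.any           = 3=3
structure.*            = 4=1
truncation.none        = 5=100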

To call these filters from your route, you use an empty element with the refid attribute:

<filter refid="default-cql-rpn" />

Also under the metaproxy element is a routes tag. The routes element can have one or more route blocks. One of these needs to have an id that matches the route attribute in your start tag (i.e. your default route). All of the examples (I think) use the id ‘start’.

In your default route, you can declare your common filters, such as the HTTP listener for the SRU service (more on that in a moment). You can also log the incoming requests. Here’s an example:

<routes>
  <route id="start">
    <filter type="log">
      <message>Something</message>
    </filter>
    <filter refid="id-declared-for-filter-in-your-filters-block" />
    ...
  </route>
</routes>

The first filter element in that route is locally scoped; it can’t be called from anywhere else. The second one calls a filter that was defined in your filters section. That same filter could, in theory, also be called by a different route, keeping the configuration file (relatively) DRY.

The filters are applied sequentially, in the order they appear in the route.
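
As for the HTTP listener mentioned above: in my start route, the front of that sequence is a frontend_net filter (which binds the listening port) followed by an sru_z3950 filter (which translates incoming SRU requests into Z39.50 for everything downstream). Roughly like this (the port value and exact syntax are from memory, so check them against the example configuration):

<route id="start">
  <filter type="frontend_net">
    <port>@:9000</port>
  </filter>
  <filter type="sru_z3950" />
  <filter type="log">
    <message>Something</message>
  </filter>
  ...
</route>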

It is in one of the locally scoped filters within a route that you’d define the settings of the databases you are proxying. The filter type for these is ‘virt_db’.

<route id="start">
  ...
  <filter type="virt_db">
    <virtual>
      <database>lc</database>
      <target>z3950.loc.gov:7090/voyager</target>
    </virtual>
  </filter>
  ...
</route>
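
As far as I can tell, the database name you declare here is what clients actually see: it becomes the Z39.50 database name and the path portion of the SRU URL on the proxy. So, with the configuration above, an SRU search against the ‘lc’ virtual database would look something like this (the host and port being wherever your HTTP listener is bound):

http://localhost:9000/lc?version=1.1&operation=searchRetrieve&query=bath.isbn%3D1234567890&maximumRecords=1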

It is also in the virt_db filter that you can branch into different routes, so different databases can have different configurations. It would look something like this:

<routes>
  <route id="start">
    ...
    <filter type="virt_db">
      <virtual>
        <database>lc</database>
        <target>z3950.loc.gov:7090/voyager</target>
      </virtual>
      <virtual route="otherdbroute">
        <database>otherdb</database>
        <target>example.org:210/otherdb</target>
      </virtual>
    </filter>
    ...
  </route>
  <route id="otherdbroute">
    <filter type="log">
      <message>Other DB route</message>
    </filter>
    <filter refid="other-cql-rpn" />
    ...
  </route>
</routes>

It’s important to note here that the branched route will return to the route where it was initiated after it completes, so if there are more filters declared after the point where this route is called, they will be applied to it as well.

It’s also important to note (at least this is how things appear to work for me) that if a filter of the same type is called more than once for a particular route, only the first one seems to get applied. In our example above, you could apply the cql_rpn filters like this:

<routes>
  <route id="start">
    ...
    <filter type="virt_db">
      <virtual>
        <database>lc</database>
        <target>z3950.loc.gov:7090/voyager</target>
      </virtual>
      <virtual route="otherdbroute">
        <database>otherdb</database>
        <target>example.org:210/otherdb</target>
      </virtual>      
    </filter>
    <filter refid="default-cql-rpn" />
    ...
  </route>
  <route id="otherdbroute">
    <filter type="log">
      <message>Other DB route</message>
    </filter>
    <filter refid="other-cql-rpn" />
    ...
  </route>
</routes>

In this case, requests for “lc” would have “default-cql-rpn” applied to them. Requests for “otherdb” would have “other-cql-rpn” applied to them, but they seem to ignore the “default-cql-rpn” filter that comes later (I, for one, found this extremely counter-intuitive). So rather than having your edge cases overwrite your default configuration, you set your edge cases first and then set a default for anything that hasn’t already had a particular filter applied to it.

Also somewhat counter-intuitively, if you’re searching multiple databases with a single query and the databases require different configurations, you configure it like this:

<routes>
  <route id="start">
    ...
    <filter type="virt_db">
      <virtual>
        <database>lc</database>
        <target>z3950.loc.gov:7090/voyager</target>
      </virtual>
      <virtual route="otherdbroute">
        <database>otherdb</database>
        <target>example.org:210/otherdb</target>
      </virtual>
      <virtual>
        <database>all</database>
        <target>z3950.loc.gov:7090/voyager</target>
        <target>example.org:210/otherdb</target>
      </virtual>
    </filter>
    <filter type="multi">
      <target route="otherdbroute">example.org:210/otherdb</target>
    </filter>
    <filter refid="default-cql-rpn" />

    ...
  </route>
  <route id="otherdbroute">
    <filter type="log">
      <message>Other DB route</message>
    </filter>
    <filter refid="other-cql-rpn" />
    ...
  </route>
</routes>

Since the route can’t be applied to the whole aggregate of databases (the other databases would fail), you declare the route for a particular target in a ‘multi’ filter. Again, I think the multi filter has to appear before the default-cql-rpn filter is called in order for this to work.

I hope this helps somebody. If nothing else, it will probably help me when I eventually need to remember how all of this works.

Here are the steps I just took to install metaproxy (which requires yaz and yaz++) on Red Hat Enterprise Linux 6.2.  The reason for this exercise is that Indexdata’s RPMs don’t work on 6.2 (the versions of boost-devel and icu-devel they require seem to only be available in 5.5).  Since I expect Indexdata to eventually release 6.2-compatible RPMs, I installed all of this into /opt/local (so it’s easy to remove — of course, if you’re already using /opt/local, you might want to try somewhere else).  Also, this assumes you’ll put a metaproxy.xml in /opt/local/etc/metaproxy/, so keep that in mind.

  1. yum install boost boost-devel icu icu-devel libxml2 libxml2-devel gnutls gnutls-devel libxslt libxslt-devel gcc-c++ libtool
  2. Install yaz:
    1. wget http://ftp.indexdata.dk/pub/yaz/yaz-4.2.33.tar.gz
    2. tar -zxvf yaz-4.2.33.tar.gz
    3. cd yaz-4.2.33
    4. ./configure --prefix=/opt/local
    5. make
    6. make install
  3. Install yaz++
    1. wget http://ftp.indexdata.dk/pub/yazpp/yazpp-1.3.0.tar.gz
    2. tar -zxvf yazpp-1.3.0.tar.gz
    3. cd yazpp-1.3.0
    4. ./configure --prefix=/opt/local/ --with-yaz=/opt/local/bin
    5. make
    6. make install
  4. Install metaproxy
    1. wget http://ftp.indexdata.dk/pub/metaproxy/metaproxy-1.3.36.tar.gz
    2. tar -zxvf metaproxy-1.3.36.tar.gz
    3. cd metaproxy-1.3.36
    4. ./configure --prefix=/opt/local --with-yazpp=/opt/local/bin/
    5. make
    6. make install
  5. cd /opt/local
  6. mkdir etc; mkdir etc/metaproxy; mkdir etc/sysconfig
  7. Copy this gist as /etc/rc.d/init.d/metaproxy
  8. chmod 744 /etc/rc.d/init.d/metaproxy
  9. Copy this gist as /opt/local/etc/sysconfig/metaproxy
  10. chkconfig --add /etc/rc.d/init.d/metaproxy
  11. /etc/init.d/metaproxy start
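
Before leaning on the init script, it’s worth a quick sanity check that the binary and the configuration actually load. From memory, that looks something like the line below (check the man page if the option name has changed):

/opt/local/bin/metaproxy --config /opt/local/etc/metaproxy/metaproxy.xml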

I’ve been waiting for a while to have this title. Well, actually, not a long while, and that’s testimony to how quickly I’m able to develop things in Rails.

While I think SFX is a fine product and we are completely and utterly dependent upon it for many things, it does still have its shortcomings. It does not have a terribly intuitive interface (no link resolver that I’m aware of has one) and there are some items it just doesn’t resolve well, such as conference proceedings. Since conference proceedings and technical reports are huge for us, I decided we needed something that resolved these items better. That’s when the idea of the übeResolver (now mainly known as ‘the umlaut’) was born.

Although I had been working with Ed Summers on the Ruby OpenURL libraries before Code4Lib 2006, I really began working on umlaut earlier this month when I thought I might have something coherent together in time before the ELUNA proposal submission deadline. Although I barely had anything functional on the 8th (the deadline — 2 days after I had really broken ground), I could see that this was actually feasible and doable.

Three weeks later and it’s really starting to take shape (although it’s really, really slow right now). Here are some examples:

The journal ‘Science’

A book: ‘Advances in Communication Control Networks’

Conference Proceeding

Granted, the conference proceeding is less impressive as a result of IEEE being available via SFX (although, in this case, it’s getting the link from our catalog) and the fact that I’m having less luck with SPIE conferences (they’re being found, but I’m having some problems zeroing in on the correct volume — more on that in a bit), but I think that since this is the result of < 14 days of development time, it isn’t a bad start.

Now on to what it’s doing. If the item is a “book”, it queries our catalog for the ISBN; asks xISBN for other matches and queries our catalog for those; does a title/author search; and does a conferenceName/title/year search. If there are matches, it then asks the opac for holdings data. If the item is either not held or not available, it does the same to our consortial catalog. Currently it’s doing both, regardless, because I haven’t worried about performance.

It checks the catalog via SRU and tries to fill out the OpenURL ContextObject with more information (such as publisher and place). This would be useful to then export into a citation manager (which most link resolvers have fairly minimal support for). While it has the MODS records, it also grabs LCSH and Table of Contents (if they exist). When I find an item with more data, I’ll grab it as well (such as abstracts, etc.).
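
Those catalog lookups are just plain SRU searchRetrieve requests asking for MODS, along these lines (the host here is made up, and the index names depend entirely on how the catalog’s SRU server is configured):

http://catalog.example.edu:7090/voyager?version=1.1&operation=searchRetrieve&query=bath.isbn%3D1234567890&recordSchema=mods&maximumRecords=5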

It then queries Amazon Web Services for more information (editorial content, similar items, etc.).

It still needs to check SFX, but, unfortunately, that would slow it down even more.

For journals, it checks SFX first. If there’s no volume, issue, date or article title, it will try to get coverage information. Unfortunately, SFX’s XML interface doesn’t send this, so I have to get this information from elsewhere. When I made our Ejournal Suggest service, I had to create a database of journals and journal titles and I have since been adding functionality to it (since I am running reports from SFX for titles and it includes the subject associations, I load them as well — it includes coverage, too, so including that field was trivial). So when I get the SFX result document back, I parse it for its services (getFullText, getDocumentDelivery, getCitedBy, etc.) and if no article information is sent, I make a web service request to a little PHP/JSON widget I have on the Ejournal Suggest database that gets back coverage, subjects and other similar journals based on the ISSN. The ‘other similar journals’ are 10 (arbitrary number) other journals that appear in the same subject headings, ordered by number of clickthroughs in the last month. This doesn’t appear if there is an article, because I haven’t decided if it’s useful in that case (plus the user has a link to the ‘journal level’ if they wish).

Umlaut then asks the opac for holdings and tries to parse the holdings records to determine if a specific issue is held in print (this works well if you know the volume number — I have thought about how to parse just a year, but haven’t implemented it yet). If there are electronic holdings, it attempts to dedupe.

There is still a lot more work to do with journals, although I hope to be able to implement this soon. The getCitedBy options will vary between grad students/faculty and undergrads. Since we have very limited seats to Web of Science, undergraduates will, instead, get their getCitedBy links to Google Scholar. Graduate students and faculty will get both Web of Science and Google Scholar. Also, if no fulltext results are found, it will then go out to the search engines to try to find something (whether it finds the original item or a postprint in arxiv.org or something). We will also have getAbstracts and getTOCs services enabled so the user can find other databases that might be useful or table of content services, accordingly. Further, I plan on associating the subject guides with SFX Subjects and LCC, so we can make recommendations from a specific subject guide (and actually promote the guide a bit) based, contextually, on what the user is already looking at. By including the SFX Target name in the subject items (which is an existing field that’s currently unused), we could also match on the items themselves.

The real value in umlaut, however, will come in its unAPI interface. Since we’ll have Z39.88 ContextObjects, MODS records, Amazon Web Services results and who knows what else, umlaut could feed an Atom store (such as unalog) with a whole hell of a lot of data. This would totally up the ante of scholarly social bookmarking services (such as Connotea and Cite-U-Like) by behaving more like personal libraries that match on a wide variety of metadata, not just url or title. The associations that users make can also aid umlaut in recommendations of other items.

The idea here is not to replace the current link resolver; the intention is to enhance it. SFX makes excellent middleware, but I think its interface leaves a bit to be desired. By utilizing its strength, we can layer more useful services on top of it. Also, a user can add other affiliations that they belong to in their profile, so umlaut can check their local public library or, if they are taking classes at another university, they can include those.

At this point I can already hear you saying, “But Ross, not everyone uses SFX”. How true! I propose a microformat for link resolver results that could be parsed by umlaut (and, in an ‘eating your own dog food’ fashion, I will eventually add this to umlaut’s template), making any link resolver available to umlaut.

There is another problem that I’ve encountered while working on this project, though. Last week and the week before, while I was doing the bulk of the SRU development, I kept on noticing (and reporting) our catalog (and, more often, its Z39.50 server) going down. Like many times a day. After concluding that, in fact, I was probably causing the problem, I finally got around to doing something that I’ve been meaning to do for months (and that I would recommend to everyone else who wants to actually make useful systems): exporting the bib database into something better. Last week I imported our catalog into Zebra and sometime this week I will have a system that syncs the database every other hour (we already have the plumbing for this for our consortial catalog). I am also experimenting with Cheshire3 (since I think its potential is greater — it’s possible we may use both for different purposes). The advantage to this (besides not crashing our catalog every half hour) is that I can index it any way I want/need to, as well as store the data any way I need to, in order to make sure that users get the best experience they can.

Going back to the SPIE conferences, there is no way in Voyager that I can narrow my results below the 360+ hits I get for “SPIE Proceedings” in 2003. At least, not from the citations I get from Compendex (which is where anyone would get the idea to look for SPIE Proceedings in our catalog, anyway). With an exported database, however, I could index the volume and pinpoint the exact record in our catalog. Or, if that doesn’t scale (for instance, if they’re all done a little differently), I can pound the hell out of our zebra (or cheshire3 or whatever) server looking for the proper volume without worrying about impacting all of our other services. I can also ‘game the system’ a bit and store bits in places that I can query when I need them. Certainly this makes umlaut (and other services) more difficult to share with other libraries (at least, other libraries that don’t have similar setups to ours), but I think these sorts of solutions are essential to improving access to our collections.

Oh yeah, and lest you think that mirroring your bib database is too much to maintain: Zebra can import MARC records (so you can use your opac’s MARC export utility) and our entire bib database (705,000 records) takes up less than 2GB of storage. The more indexes added, the larger the database size, of course, but I am indexing a LOT in that.
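
For the curious, the import itself is just Zebra’s indexing utility pointed at a directory of exported records, something like the two commands below (the config file and directory names are whatever your zebra.cfg and export job use):

zebraidx -c zebra.cfg update marc-export/
zebraidx -c zebra.cfg commit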

I am still feeling my way around Python. I have yet to grasp the zen of being Pythonic, but I am at least coming to grips with real object orientation (as opposed to the named hashes of PHP) and am actually taking the leap into error handling, which, if you have dealt with any of the myriad bugs in any of my other projects, you’d know has been a bit of a foreign concept to me.

Python project #2 is known as RepoMan (thanks to Ed Summers for the name). It attempts to solve a problem that not one but two other open-source projects have already solved admirably (I’ll go into this more in a bit). RepoMan is an OAI repository indexer that makes said repository available via SRU. I created it in an attempt to make our DSpace implementation searchable from remote applications (namely, the site search and the upcoming alternative opac). It’s an extremely simple two-script project that has only taken a week to get running, largely due to the existence of two similar and available python scripts that I could modify for my own use. It’s also due to the help of Ed Summers and Aaron Lav.

The harvester is, basically, Thom Hickey’s one-page OAI harvester with some minor modification. I have added error handling (the two lines I added to compensate for malformed XML must have been over the “one page limit”) and, instead of outputting to a text file, it shoves the records into a Lucene index (thanks to PyLucene). This part still needs some work (I’m not sure what it would do with an “updated” record, for example), but it makes a nice index of the Dublin Core fields, plus a field for the whole record, for “default” searches. This was a good exercise for me in working with XML, Python and Lucene, because I was having some trouble when trying to index the MODS records for the alternative opac.
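
The gist of the harvesting side, minus the Lucene plumbing, is just the standard OAI-PMH ListRecords/resumptionToken loop. A minimal sketch follows (in present-day Python rather than what the one-page harvester actually uses, with a made-up endpoint and the indexing step stubbed out):

import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://dspace.example.edu/oai/request"  # hypothetical OAI-PMH endpoint


def index_record(record):
    # stub: this is where the real script would push the record into Lucene
    header_id = record.find(OAI + "header/" + OAI + "identifier")
    print("harvested", header_id.text if header_id is not None else "(no identifier)")


def harvest(base_url, metadata_prefix="oai_dc"):
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI + "record"):
            index_record(record)
        token = tree.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # subsequent requests carry only the verb and the resumption token
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
        time.sleep(1)  # be polite to the repository


if __name__ == "__main__":
    harvest(BASE_URL)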

The SRU server is, basically, Dan Chudnov’s SRU implementation for unalog. It needed to be de-Quixotefied and is, in fact, much more robust than Dan’s original (of course, unalog’s implementation doesn’t need to be as “robust”, since the metadata is much more uniform), but certainly having a working model to modify made this go much, much faster. The nice part is that there might be some stuff in there that Dan might want to put back into unalog.

So, here is the result. The operations currently supported are explain and searchRetrieve, and the majority of CQL relations are unsupported, but it does most of the queries I need it to do and, most importantly, it’s less than a week old.

So the burning question here is: why on earth would I waste time developing this when OCKHAM’s Harvest-to-Query is out there, and, even more specifically, OCLC’s SRW/U implementation for DSpace is available? Further, I knew full well that these projects existed before I started.

Lemme tell ya.

Harvest-to-Query looked very promising. I began down this road, but stopped about halfway down the installation document. Granted, anything that uses Perl, Tcl and PHP has to be, well, something… After all, those were the first three languages I learned (and in the same order!). Adding in Indexdata’s Zebra seemed logical as well, since it has a built-in Z39.50 server. Still, this didn’t exactly solve my problem. I’d have to install yazproxy, as well, in order to achieve my SRU requirement. Requiring Perl, Tcl, PHP, Zebra and yazproxy is a bit much to maintain for this project. Too many dependencies and I am too easily distracted.

OCLC’s SRW/U seemed so obvious. It seemed easy. It seemed perfect. Except our DSpace admin couldn’t get it to work. Oh, I inquired. I nagged. I pestered. That still didn’t make it work. I have very limited permissions on the machine that DSpace runs on (and no permissions for Tomcat), so there was little I could do to help. It also served a specific purpose, but didn’t necessarily address any other OAI providers that we might have.

So, enter RepoMan. Another wheel that closely resembles all the other wheels out there, but possibly with minor cosmetic changes. Let a thousand wheels be invented.

SRW/U is to Yngwie J. Malmsteen as OpenSearch is to Keith Richards.

Yngwie Malmsteen is technically superior yet aesthetically unlistenable (unimplementable, in the case of SRW/U).

Keith Richards is sloppy, unsophisticated and writes timeless melodies that resonate with the masses (OpenSearch is sloppy and unsophisticated — time will tell whether OpenSearch becomes “timeless” [seems doubtful, honestly], but there are certainly a lot of OpenSearch targets).

Mike Taylor wrote a very insightful reaction to Dan’s worklog posting (and an incredibly objective one, given his investment in and relationship to SRW/U). And he’s right.

And I’m right. Unless SRW/U can capture some of the mojo that OpenSearch has, it might as well be Yngwie J. Malmsteen.