Sometime in November, I came to the realization that I had horribly misinterpreted the NISO Z39.88/OpenURL 1.0 spec.  I’m on the NISO Advisory Committee for OpenURL (which makes this even more embarrassing) and was reviewing the proposal for the Request Transfer Message Community Profile and its associated metadata formats when it dawned on me that my mental model was completely wrong.  For those of you who have primarily dealt with KEV-based OpenURLs (which is 99% of all the OpenURLs in the wild), I would wager that your mental model is probably wrong, too.

A quick primer on OpenURL:

  • OpenURL is a standard for transporting ContextObjects (basically a reference to something, in practice, mostly bibliographic citations)
  • A ContextObject (CTX, for short from now on) is comprised of Entities that help define what it is.  Entities can be one of six kinds:
    • Referent – this is the meat of the CTX, what it’s about, what you’re trying to get context about.  A CTX must have one referent and only one.
    • ReferringEntity – defines the resource that cited the referent.  This is optional and can only appear once.
    • Referrer – the source of where the CTX came from (i.e. the A&I database).  This is optional and can only appear once.
    • Requester – this is information about who is making the request (i.e. the user’s IP address).  This is optional and can only appear once.
    • ServiceType – this defines what sorts of services are being requested about the referent (i.e. getFullText, document delivery services, etc.).  There can be zero or many ServiceType entities defined in the CTX.
    • Resolver – these are messages specifically to the resolver about the request.  There can be zero or more Resolver entities defined in the CTX.
  • All entities are basically the same in what they can hold:
    • Identifiers (such as DOI or IP Address)
    • By-Value Metadata (the metadata is included in the Entity)
    • By-Reference Metadata (the Entity has a pointer to a URL where you can retrieve the metadata, rather than including it in the CTX itself)
    • Private Data (presumably data, possibly confidential, between the entity and the resolver)
  • A CTX can also contain administrative data, which defines the version of the ContextObject, a timestamp and an identifier for the CTX (all optional)
  • Community Profiles define valid configurations and constraints for a given use case (for instance, scholarly search services are defined differently than document delivery).  Context objects don’t actually specify any community profile they conform to.  This is a rather loose agreement between the resolver and the context object source:   if you provide me with a SAP1, SAP2 or Dublin Core compliant OpenURL, I can return something sensible.
  • There are currently two registered serializations for OpenURL:  Key/Encoded Values, where all of the values are output in a single string, formatted as key=value and delimited by ampersands (this is what the majority of all OpenURLs that currently exist look like), and XML (which is much rarer, but also much more powerful)
  • There is no standard OpenURL ‘response’ format.  Given the nature of OpenURL, it’s highly unlikely that one could be created that would meet all expected needs.  A better alternative would be for a particular community profile to define a response format since the scope would be more realistic and focused.

Looking back on this, I’m not sure how “quick” this is, but hopefully it can bootstrap those of you that have only cursory knowledge of OpenURL (or less).  Another interesting way to look at OpenURL is Jeff Young’s 6 questions approach, which breaks OpenURL down to “who”, “what”, “where”, “when”, “why” and “how”.

One of the great failings of OpenURL (in my mind, at least) is the complete and utter lack of documentation, examples, dialog or tutorials about its use or potential.  In fact, outside of COinS, maybe, there is no notion of “community” to help promote OpenURL or cultivate awareness or adoption.  To be fair, I am as guilty as anybody for this failure, since I had proposed making a community site for OpenURL, but a shift in job responsibilities and then a wholesale change in employers, coupled with the hacking of the server it was to live on, left this by the wayside.  I’m putting this back on my to do list.

What this lack of direction leads to is that would-be implementors wind up making a lot of assumptions about OpenURL.  The official spec published at NISO is a tough read and is generally discouraged by the “inner core” of the OpenURL universe (the Herbert van de Sompels, the Eric Hellmans, the Karen Coyles, etc.) in favor of the “Implementation Guidelines” documents.  However, only the KEV Guidelines are actually posted there.  The only other real avenue for trying to come to grips with OpenURL is to dissect the behavior of link resolvers.  Again, in almost every instance this means you’re working with KEVs and the downside of KEVs is that they give you a very naive view of OpenURL.

KEVs, by their very nature, are flat and expose next to nothing about the structure of the model of the context object they represent.  Take the following, for example:


Ugly, I know, but bear with me for a moment.  From this example, let’s focus on the Referent:


and then let’s make this a little more human readable:

rft_val_fmt:  info:ofi/fmt:kev:mtx:book
rft.genre:  book
rft.aulast:  Vergnaud
rft.auinit:  J.-R
rft.btitle:  Dépendances et niveaux de représentation en syntaxe
rft.date:  1985
rft.pub:  Benjamins
rft.place:  Amsterdam, Philadelphia

Looking at this example, it’s certainly easy to draw some conclusions about the referent, the most obvious being that it’s a book.

Actually (and this is where it gets complicated and I begin to look pedantic) it’s really only telling you “I am sending some by-value metadata in the info:ofi/fmt:kev:mtx:book format”, not that the thing is actually a book.  (The info:ofi/fmt:kev:mtx:book metadata values do state that, via genre, but ignore that for a minute, since genre is optional.)

The way this actually should be thought of:

       ContextObject:
          Referent:
             Metadata by Value:
                Format:  info:ofi/fmt:kev:mtx:book
                   Genre:  book
                   Btitle:  Dépendances et niveaux de représentation en syntaxe
          ReferringEntity:
             Identifier:  urn:isbn:0262531283
             Metadata by Value:
                Format:  info:ofi/fmt:kev:mtx:book
                   Genre:  book
                   Isbn:  0262531283
                   Btitle:  Minimalist Program
          Referrer:
             Identifier:  info:sid/
          ServiceType:
             Metadata by Value:
                Format:  info:ofi/fmt:kev:mtx:sch_svc
                   Abstract:  yes
So, this should still seem fairly straightforward, but the hierarchy certainly isn’t evident in the KEV.  It’s a good starting point to begin talking about the complexity of working with OpenURL, though, especially if you’re trying to create a service that consumes OpenURL context objects.
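
To make the flat-versus-hierarchical point concrete, here is a minimal sketch that regroups a KEV query string by entity. It assumes the usual KEV key conventions (an entity prefix such as rft, rfe, rfr, req, svc or res, then `_id`, `_val_fmt`, `_dat` or a `.field` metadata key). The method name `kev_to_hierarchy`, the shape of the returned hash and the `info:sid/example.org:demo` referrer identifier are all made up for illustration; this is not how any particular library does it.

```ruby
require "uri"

# KEV key prefixes and the entity each one describes.
ENTITY_PREFIXES = {
  "rft" => :referent,
  "rfe" => :referring_entity,
  "rfr" => :referrer,
  "req" => :requester,
  "svc" => :service_type,
  "res" => :resolver
}.freeze

# Regroup a flat KEV query string into admin data plus one hash per entity.
# Simplification: a repeated metadata key (e.g. several authors) overwrites
# the earlier value instead of accumulating.
def kev_to_hierarchy(query)
  ctx = { admin: {}, entities: {} }
  URI.decode_www_form(query).each do |key, value|
    prefix, separator, field = key.partition(/[_.]/)
    entity_name = ENTITY_PREFIXES[prefix]
    if entity_name.nil?
      ctx[:admin][key] = value # ctx_ver, url_ver and friends
      next
    end
    entity = (ctx[:entities][entity_name] ||=
      { identifiers: [], format: nil, metadata: {}, by_reference: {}, private_data: nil })
    if separator == "."
      entity[:metadata][field] = value # by-value metadata, e.g. rft.genre
    else
      case field
      when "id"      then entity[:identifiers] << value
      when "val_fmt" then entity[:format] = value
      when "dat"     then entity[:private_data] = value
      else entity[:by_reference][field] = value # ref_fmt, ref
      end
    end
  end
  ctx
end

query = URI.encode_www_form([
  ["ctx_ver", "Z39.88-2004"],
  ["rft_val_fmt", "info:ofi/fmt:kev:mtx:book"],
  ["rft.genre", "book"],
  ["rft.btitle", "Dépendances et niveaux de représentation en syntaxe"],
  ["rfe_id", "urn:isbn:0262531283"],
  ["rfr_id", "info:sid/example.org:demo"], # hypothetical referrer identifier
  ["svc_val_fmt", "info:ofi/fmt:kev:mtx:sch_svc"],
  ["svc.abstract", "yes"]
])
ctx = kev_to_hierarchy(query)
```

After this, `ctx[:entities][:referent]` holds a format and a metadata hash but an empty identifier list, which is exactly the distinction the flat string hides.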

Back to the referent metadata.  The context object didn’t have to send the data in the “metadata by value” stanza.  It could have just sent the identifier “urn:isbn:9027231141” (and note in the above example, it didn’t have an identifier at all).  It could also have sent metadata in the Dublin Core format, MARC21, MODS, ONIX or all of the above (the Metadata By Value element is repeatable) if you wanted to make sure your referent could be parsed by the widest range of resolvers. While all of these are bibliographic formats, in Request Transfer Message context objects (which would be used for document delivery, which got me started down this whole path), you would conceivably have one or more of the aforementioned metadata types plus a Request Transfer Profile Referent type that describes the sorts of interlibrary loan-ish types of data that accompany the referent as well as an ISO Holdings Schema metadata element carrying the actual items a library has, their locations and status.

If you have only run across KEVs describing journal articles or books, this may come as a bit of a surprise.  Instead of saying the above referent is a book, it becomes important to say that the referent contains a metadata package (as Jonathan Rochkind calls it) that is in this (OpenURL specific) book format.  In this regard, OpenURL is similar to METS.  It wraps other metadata documents and defines the relationships between them.  It is completely indifferent to the data it is transporting and makes no attempt to define it or format it in any way.  The Journal, Book, Patent and Dissertation formats were basically contrived to make compatibility with OpenURL 0.1 easier, but they are not directly associated with OpenURL and could have just as easily been replaced with, say, BibTeX or RIS (although the fact that they were created alongside Z39.88 and are maintained by the same community makes the distinction difficult to see).

What this means, then, is that in order to know anything about a given entity, you also need to know about the metadata format that is being sent about it.  And since that metadata could literally be in any format, it means there are a lot of variables that need to be addressed just to know what a thing is.

For the Umlaut, I wrote an OpenURL library for Ruby as a means to parse and create OpenURLs.  Needless to say, it was originally written with that naive, KEV-based, mental model (plus some other just completely errant assumptions about how context objects worked) and, because of this, I decided to completely rewrite it.  I am still in the process of this, but am struggling with some core architectural concepts and am throwing this out to the larger world as an appeal for ideas or advice.

Overall the design is pretty simple:  there is a ContextObject object that contains a hash of the administrative metadata and then attributes (referent, referrer, requester, etc.) that contain Entity objects.

The Entity object has arrays of identifiers, private data and metadata.
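
Roughly, and only as a sketch of the shape described here (the attribute names are my guesses at something reasonable, not the library's actual API):

```ruby
# An entity holds identifiers, private data and metadata packages.
class Entity
  attr_reader :identifiers, :private_data, :metadata

  def initialize
    @identifiers = []
    @private_data = []
    @metadata = [] # one element per metadata package, in any format
  end
end

# A context object holds administrative data plus its entities.
class ContextObject
  attr_reader :admin, :referent, :service_types, :resolvers
  attr_accessor :referring_entity, :referrer, :requester

  def initialize
    @admin = {}              # version, timestamp, identifier (all optional)
    @referent = Entity.new   # exactly one referent
    @referring_entity = nil  # optional, at most one
    @referrer = nil          # optional, at most one
    @requester = nil         # optional, at most one
    @service_types = []      # zero or more
    @resolvers = []          # zero or more
  end
end

ctx = ContextObject.new
ctx.admin["ctx_ver"] = "Z39.88-2004"
ctx.referent.identifiers << "urn:isbn:9027231141"
ctx.referent.metadata << {
  format: "info:ofi/fmt:kev:mtx:book",
  values: { "genre" => "book" }
}
```

Here the metadata packages are plain hashes, which sidesteps the question of what a package object should actually be.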

And then this is where I start to run aground.

The original (and current) plan was to populate the metadata array with native metadata objects that are generated by registering metadata classes in a MetadataFactory class.  The problem, you see, is that I don’t want to get into the business of having to create classes to parse and access every kind of metadata format that gets approved for Z39.88.  For example, Ed Summers’ ruby-marc has already solved the problem of effectively working with MARC in Ruby, so why do I want to reinvent that wheel?  The counter argument is that, by delegating these responsibilities to third party libraries, there is no consistency of APIs between “metadata packages”.  A method used in format A may very well raise an exception (or, worse, overwrite data) in format B.

There is a secondary problem: third party libraries aren’t going to have any idea that they’re in an OpenURL context object or even know what that is.  This means there would have to be some class that handles functionality like XML serialization (since ruby-marc doesn’t know that Z39.88 refers to it as info:ofi/fmt:xml:xsd:MARC21), although this can be handled by the specific metadata factory class.  This would also be necessary when parsing an incoming OpenURL since, theoretically, every library could have a different syntax for importing XML, KEVs or whatever other serialization is devised in the future.
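
One possible shape for that factory, as a sketch rather than a decision: handlers are registered per format identifier, and each result is wrapped in a small envelope that remembers the Z39.88 format name the native object knows nothing about. The `register` and `build` methods and the envelope keys are invented for this example. A MARC21 handler would presumably hand the raw record to ruby-marc inside its block, but I've left that out rather than guess at the call.

```ruby
# Registry of per-format parsers, keyed by Z39.88 format identifier.
class MetadataFactory
  @parsers = {}

  class << self
    # Register a block that turns raw input into a native metadata object.
    def register(format_id, &parser)
      @parsers[format_id] = parser
    end

    # Wrap the native object so the context object can still serialize it
    # under the right format identifier.
    def build(format_id, raw)
      parser = @parsers[format_id]
      # Unknown formats are carried along untouched instead of rejected.
      return { format: format_id, native: raw, parsed: false } unless parser

      { format: format_id, native: parser.call(raw), parsed: true }
    end
  end
end

# The KEV book format is simple enough to keep as a hash with string keys.
MetadataFactory.register("info:ofi/fmt:kev:mtx:book") do |raw|
  raw.transform_keys(&:to_s)
end

book = MetadataFactory.build("info:ofi/fmt:kev:mtx:book", { genre: "book" })
marc = MetadataFactory.build("info:ofi/fmt:xml:xsd:MARC21", "<record/>")
```

This keeps the format bookkeeping in one place, but it does nothing about the inconsistent APIs of the native objects themselves, which is the part I'm still stuck on.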

So I’m looking for advice on how to proceed.  All ideas welcome.

It’s been a while since I’ve written anything about the Ümlaut. It’s been a while since I’ve written about anything, really. Lots of reasons for that: been frantically trying to pull the Ümlaut together in time to launch for fall semester, and I’ve got this little bit of business going on…

Still, it’s probably time to touch on some of the changes that have happened.

  1. The backend has been completely updated

    The initial design was… shaky… at best. While the new backend is probably still shaky (it is, after all, my creation), it’s certainly more thought-out. Incoming requests are split by their referent and referrers (see Thom Hickey’s translation for these arcane terms) and the referent is checked against a ‘Ferret store’ of referents. The rationale here is that citations take a bunch of forms in regards to their completeness and quality, so we do some fulltext searching against an index of previously resolved referents to see if we’ve dealt with this before.

    It then stores the referent and the response object as Marshaled data, which is great, except it royally screws up trying to tail -f the logs.

  2. New design

    Heather King, our web designer here at Tech, has vastly improved the design. There’s still quite a bit more to do (isn’t there always?), but we’ve got a good, eminently usable, interface to build upon. The bits that need to be cleaned up (mainly references to other citations) won’t be that hard to clean up.

  3. Added Social Bookmarking Support

    Well, read-only support. Connotea, Yahoo’s MyWeb and Unalog support were pretty easy to add courtesy of their existing APIs. The downside is that I can only hope to find bookmarks based on URL which… doesn’t work well. I really wish Connotea would get some sort of fielded searching going on. support, which would be great, can’t really happen until they ease the restrictions on how often you can hit it.

    CiteULike was a bit more of a hack, as it has no API. Instead, I am finding CiteULike pages in the Google/Yahoo searches I was already doing, grabbing the RSS/PRISM and tags and then doing a search (again, retrieving the RSS/PRISM) with the tags in the first article. It’s working pretty well, although I need to work out title matching since both Yahoo and Google truncate HTML titles. I plan on adding the CiteULike journal table of contents feeds this way, too.

  4. Improved performance thanks to the magic of AJAX

    Let’s face it. The Ümlaut was bloody slow. There were a couple of reasons for this, but it was mainly due to hitting Connotea and trying to harvest any OAI records it could find for the given citation. The kicker was that this was probably unnecessary for the majority of the users. Now we’ve pushed the social bookmarkers and the OAI repositories to a ‘background task’ that gets called via JavaScript when the page renders. It’s not technically AJAX as much as a remote procedure call, but AJAX is WebTwoFeriffic! Besides, this is a Rails project. Gotta kick up the buzz technologies.

  5. Now storing subjects in anticipation of recommender service

    The Ümlaut now grabs subject headings from Pubmed (MeSH); LCSH from our catalogs; SFX subjects; tags from Connotea, CiteULike, MyWeb and unalog; and subjects from the OAI repositories and stores them with the referent. It also stores all of these in the Ferret store. The goal here is to search on subject headings to find recommendations to other similar items. As of this writing, there is only one citation with subject associations, so there’s nothing really to see here.

The big todo that’s on my plate for the rest of the week is adding statistics gathering. I’ve got my copy of An Architecture for the Aggregation and Analysis of Scholarly Usage Data by Bollen and Van de Sompel printed out and I plan on incorporating their bX concept for this.

I’ve been waiting for a while to have this title. Well, actually, not a long while, and that’s testimony to how quickly I’m able to develop things in Rails.

While I think SFX is a fine product and we are completely and utterly dependent upon it for many things, it does still have its shortcomings. It is not a terribly intuitive interface (no link resolver that I’m aware of has one) and there are some items it just doesn’t resolve well, such as conference proceedings. Since conference proceedings and technical reports are huge for us, I decided we needed something that resolved these items better. That’s when the idea of the übeResolver (now mainly known as ‘the umlaut’) was born.

Although I had been working with Ed Summers on the Ruby OpenURL libraries before Code4Lib 2006, I really began working on umlaut earlier this month when I thought I might have something coherent together in time before the ELUNA proposal submission deadline. Although I barely had anything functional on the 8th (the deadline — 2 days after I had really broken ground), I could see that this was actually feasible and doable.

Three weeks later and it’s really starting to take shape (although it’s really, really slow right now). Here are some examples:

The journal ‘Science’

A book: ‘Advances in Communication Control Networks’

Conference Proceeding

Granted, the conference proceeding is less impressive as a result of IEEE being available via SFX (although, in this case, it’s getting the link from our catalog) and the fact that I’m having less luck with SPIE conferences (they’re being found, but I’m having some problems zeroing in on the correct volume; more on that in a bit), but I think that since this is the result of fewer than 14 days of development time, it isn’t a bad start.

Now on to what it’s doing.

If the item is a “book”, it queries our catalog for the ISBN; asks xISBN for other matches and queries our catalog for those; does a title/author search; and does a conferenceName/title/year search. If there are matches, it then asks the OPAC for holdings data. If the item is either not held or not available, it does the same against our consortial catalog. Currently it’s doing both, regardless, because I haven’t worried about performance.
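
The book lookup is essentially a cascade of progressively looser searches. A toy sketch of that idea follows, with an in-memory array standing in for the catalog (the real thing talks SRU) and an array argument standing in for xISBN's answer. The method names, record fields and sample data are hypothetical, and unlike the current code this version stops at the first strategy that finds anything.

```ruby
# Try each [label, callable] pair in order; stop at the first one that
# returns any records.
def first_match(strategies)
  strategies.each do |label, search|
    records = search.call
    return { strategy: label, records: records } unless records.empty?
  end
  nil
end

# catalog is an array of hashes; related_isbns is what xISBN would return.
def resolve_book(citation, catalog, related_isbns = [])
  by_isbn = ->(isbn) { catalog.select { |rec| rec[:isbn] == isbn } }
  first_match([
    [:isbn,  -> { by_isbn.call(citation[:isbn]) }],
    [:xisbn, -> { related_isbns.flat_map { |isbn| by_isbn.call(isbn) } }],
    [:title, -> { catalog.select { |rec| rec[:title].casecmp?(citation[:title].to_s) } }]
    # a conferenceName/title/year search would be one more entry here
  ])
end

catalog = [{ isbn: "0262531283", title: "Minimalist Program" }]
citation = { isbn: "0000000000", title: "minimalist program" }
result = resolve_book(citation, catalog)
```

With the sample data the ISBN and xISBN strategies come back empty, so `result[:strategy]` is `:title`.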

It checks the catalog via SRU and tries to fill out the OpenURL ContextObject with more information (such as publisher and place). This would be useful to then export into a citation manager (which most link resolvers have fairly minimal support for). While it has the MODS records, it also grabs LCSH and Table of Contents (if they exist). When I find an item with more data, I’ll grab it as well (such as abstracts, etc.).

It then queries Amazon Web Services for more information (editorial content, similar items, etc.).

It still needs to check SFX, but, unfortunately, that would slow it down even more.

For journals, it checks SFX first. If there’s no volume, issue, date or article title, it will try to get coverage information. Unfortunately, SFX’s XML interface doesn’t send this, so I have to get this information from elsewhere. When I made our Ejournal Suggest service, I had to create a database of journals and journal titles and I have since been adding functionality to it (since I am running reports from SFX for titles and it includes the subject associations, I load them as well — it includes coverage, too, so including that field was trivial). So when I get the SFX result document back, I parse it for its services (getFullText, getDocumentDelivery, getCitedBy, etc.) and if no article information is sent, I make a web service request to a little PHP/JSON widget I have on the Ejournal Suggest database that gets back coverage, subjects and other similar journals based on the ISSN. The ‘other similar journals’ are 10 (arbitrary number) other journals that appear in the same subject headings, ordered by number of clickthroughs in the last month. This doesn’t appear if there is an article, because I haven’t decided if it’s useful in that case (plus the user has a link to the ‘journal level’ if they wish).
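
The ‘other similar journals’ selection boils down to a filter and a sort. The actual widget is PHP against a database; the following is just the same logic sketched in Ruby over an array of hashes, with field names (`:issn`, `:subjects`, `:clickthroughs`) and sample data that I've made up.

```ruby
# Journals that share at least one subject heading with the given ISSN,
# most clicked-through first, capped at `limit`.
def similar_journals(issn, journals, limit: 10)
  target = journals.find { |j| j[:issn] == issn }
  return [] unless target

  journals
    .reject { |j| j[:issn] == issn }
    .select { |j| (j[:subjects] & target[:subjects]).any? }
    .sort_by { |j| -j[:clickthroughs] }
    .first(limit)
end

journals = [
  { issn: "1111-1111", subjects: ["Physics"],           clickthroughs: 5 },
  { issn: "2222-2222", subjects: ["Physics", "Optics"], clickthroughs: 40 },
  { issn: "3333-3333", subjects: ["History"],           clickthroughs: 90 },
  { issn: "4444-4444", subjects: ["Physics"],           clickthroughs: 12 }
]
similar = similar_journals("1111-1111", journals)
```

The History journal is dropped despite having the most clickthroughs, because it shares no subject heading with the journal being looked up.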

Umlaut then asks the opac for holdings and tries to parse the holdings records to determine if a specific issue is held in print (this works well if you know the volume number — I have thought about how to parse just a year, but haven’t implemented it yet). If there are electronic holdings, it attempts to dedupe.

There is still a lot more work to do with journals, although I hope to be able to implement this soon. The getCitedBy options will vary from grad students/faculty to undergrads. Since we have very limited seats to Web of Science, undergraduates will, instead, get their getCitedBy links to Google Scholar. Graduate students and faculty will get both Web of Science and Google Scholar. Also, if no fulltext results are found, it will then go out to the search engines to try to find something (whether it finds the original item or a postprint in or something). We will also have getAbstracts and getTOCs services enabled so the user can find other databases that might be useful or table of contents services, accordingly. Further, I plan on associating the subject guides with SFX Subjects and LCC, so we can make recommendations from a specific subject guide (and actually promote the guide a bit) based, contextually, on what the user is already looking at. By including the SFX Target name in the subject items (which is an existing field that’s currently unused), we could also match on the items themselves.

The real value in umlaut, however, will come in its unAPI interface. Since we’ll have Z39.88 ContextObjects, MODS records, Amazon Web Services results and who knows what else, umlaut could feed an Atom store (such as unalog) with a whole hell of a lot of data. This would totally up the ante of scholarly social bookmarking services (such as Connotea and CiteULike) by behaving more like personal libraries that match on a wide variety of metadata, not just URL or title. The associations that users make can also aid umlaut in recommendations of other items.

The idea here is not to replace the current link resolver; the intention is to enhance it. SFX makes excellent middleware, but I think its interface leaves a bit to be desired. By utilizing its strength, we can layer more useful services on top of it. Also, a user can add other affiliations that they belong to in their profile, so umlaut can check their local public library or, if they are taking classes at another university, they can include those.

At this point I can already hear you saying, “But Ross, not everyone uses SFX”. How true! I propose a microformat for link resolver results that could be parseable by umlaut (and in an ‘eating your own dog food’ fashion, will add this to umlaut’s template, eventually), making any link resolver available to umlaut.

There is another problem that I’ve encountered while working on this project, though, too. Last week and the week before, while I was doing the bulk of the SRU development, I kept on noticing (and reporting) our catalog (and, more often, its Z39.50 server) going down. Like many times a day. After concluding that, in fact, I was probably causing the problem, I finally got around to doing something that I’ve been meaning to do for months (and I would recommend to everyone else if they want to actually make useful systems): exporting the bib database into something better. Last week I imported our catalog into Zebra and sometime this week I will have a system that syncs the database every other hour (we already have the plumbing for this for our consortial catalog). I am also experimenting with Cheshire3 (since I think its potential is greater; it’s possible we may use both for different purposes). The advantage to this (besides not crashing our catalog every half hour) is that I can index it any way I want/need to as well as store the data any way I need to in order to make sure that users get the best experience they can.

Going back to the SPIE conferences, there is no way in Voyager that I can limit my results to fewer than 360+ results for “SPIE Proceedings” in 2003. At least, not from the citations I get from Compendex (which is where anyone would get the idea to look for SPIE Proceedings in our catalog, anyway). With an exported database, however, I could index the volume and pinpoint the exact record in our catalog. Or, if that doesn’t scale (for instance, if they’re all done a little differently), I can pound the hell out of our Zebra (or Cheshire3 or whatever) server looking for the proper volume without worrying about impacting all of our other services. I can also ‘game the system’ a bit and store bits in places that I can query when I need them. Certainly this makes umlaut (and other services) more difficult to share with other libraries (at least, other libraries that don’t have similar setups to ours), but I think these sorts of solutions are essential to improving access to our collections.

Oh yeah, and lest you think that mirroring your bib database is too much to maintain: Zebra can import MARC records (so you can use your OPAC’s MARC export utility) and our entire bib database (705,000 records) takes up less than 2GB of storage. The more indexes added, the larger the database size, of course, but I am indexing a LOT in that.