Monthly Archives: April 2009

I had the opportunity to attend and present at the excellent ELAG conference last week in Bratislava, Slovakia.  The event was advertised as being somewhat of a European Code4Lib, but in reality, the format seemed to me to be more in line with Access, which in my mind is a plus.

Being the ugly American that I am, I made a series of provocative statements both in my presentation and in the Twitter “back channel” (or whatever they call hash tagging an event) about vendors, library standards, and a seeming disdain for both.  I feel like I should probably clarify my position here a bit, since Twitter is a terrible medium for in-depth communication and I didn’t go into much detail in my presentation (outside of saying vendor development teams were populated by scalliwags and ne’er-do-wells from previous gigs in finance, communications and publishing).

Here is the point I was angling towards in my presentation: your Z39.50 implementation is never going to get any better than it was in 2001. Outside of critical bug fixes, I would wager the Z39.50 implementation has not even been touched since it was introduced, never mind improved. The reason for this is my above “joke” about the development teams being staffed by people that do not have a library background. They are literally just ignoring the Z-server and praying that nothing breaks in unit and regression testing. There are only a handful of people that understand how Z39.50 works and they are all employed by IndexData. For everybody else, it’s just voodoo that was there when they got here, but is a requirement for each patch and release.

Thing is, even as hardware gets faster, and ILSes (theoretically) get more sophisticated, the Z-server just gets worse. You would think that if this is the most common and consistent mechanism to get data out of ILSes, we would have seen some improvement in implementations as the need for better interoperability increases, but that is just not a reality I have witnessed. The last two ILSes I primarily worked with (Voyager and Unicorn) I would routinely, and completely accidentally, bring down just by trying to use the Z39.50 server as a data source in applications. For the Umlaut, I had to export the Voyager bib database into an external Zebra index to prevent the ILS from crashing multiple times a day just to look up incoming OpenURL requests. Let me note that the vast majority of these lookups were just ISSN or ISBN. Unsurprisingly, the Zebra index held up with no problems. It’s still working, in fact.
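
For a sense of scale, the lookups in question are about as simple as Z39.50 queries get. A minimal sketch using the ruby-zoom gem, assuming a local Zebra server on port 210; the host, database name and ISBN are all illustrative:

require 'zoom'

# One incoming OpenURL request ~= one ISBN lookup against the local Zebra index
ZOOM::Connection.open('localhost', 210) do |conn|
  conn.database_name = 'voyager'                  # illustrative database name
  conn.preferred_record_syntax = 'USMARC'
  rset = conn.search('@attr 1=7 9780262062442')   # Bib-1 use attribute 7 = ISBN
  p rset[0]                                       # first matching MARC record, if any
end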

Talis uses Zebra for Alto. It’s probably the main reason we can check off “SRU Support” in an RFP when practically nobody else can. But, again, this means the Z/SRU-server is sort of “outside” the development plan, delegated to IndexData. Our SRU servers technically aren’t even conformant to the spec, since we don’t serve explain documents. I’m not sure anybody at Talis was even aware of this until I pointed it out last year.

All of this is not intended to demonize vendors (really!) or bite the hand that feeds me.  It’s also not intended to denigrate library standards.  I’m merely trying to be pragmatic and, more importantly, I’m hoping we can make library development a less frustrating and backwards exercise for all parties (even the cads and scalliwags).

My point is that initiatives like the DLF ILS-DI, on paper, make a lot of sense.  I completely understand why they chose to implement their model using a handful of library standards (OAI-PMH, SRU).  The standards are there, why not use them?  The problem is in the reality of the situation.  If the specification “requires” SRU for search, how many vendors do you think will just slap Yaz Proxy in front of their existing (shaky, flaky) Z39.50 server and call it a day?  The OAI-PMH provider should be pretty trivial, but I would not expect any company to provide anything innovative with regards to sets or different metadata formats.
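
For reference, the two bindings the ILS-DI leans on boil down to plain HTTP requests like the following (the base URLs here are made up):

An OAI-PMH harvest of Dublin Core records changed since the start of the year:
  http://ils.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2009-01-01

An SRU searchRetrieve with a CQL title query:
  http://ils.example.org/sru?version=1.1&operation=searchRetrieve&query=dc.title+any+lagoon&maximumRecords=10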

As long as libraries are not going to be writing the software they use themselves, they need to reconcile themselves to the fact that the software their suppliers provide is more than likely not going to be written by librarians or library technologists.  If this is the case, what’s the better alternative?  Clinging to half-assed implementations of our incredibly niche standards?  Or figuring out what technologies are developing outside of the library realm that could be used to deliver our data and services?  Is there really, honestly, no way we could figure out how to use OpenSearch to do the things we expect SRU to do?
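
To make that concrete, here is roughly what an OpenSearch description document for a catalog search looks like. It covers most of what we actually use SRU’s explain record for, and any half-decent web developer already knows how to consume it (the URLs and parameter names are, of course, made up):

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Example Catalog</ShortName>
  <Description>Search the example library catalog</Description>
  <Url type="application/atom+xml"
       template="http://catalog.example.org/search?q={searchTerms}&amp;page={startPage?}&amp;format=atom"/>
  <Url type="text/html"
       template="http://catalog.example.org/search?q={searchTerms}&amp;page={startPage?}"/>
</OpenSearchDescription>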

I realize I have an axe to grind here, but this isn’t really about Jangle.

I have seen OpenURL bandied about as a “solution” to problems outside of its current primary use of “retrieving context based services from scholarly citations” (I know this is not what OpenURL’s sole use case is, but it’s all it’s being used for.  Period).  The most recent example of this was in a workshop (that I didn’t participate in) at ELAG about how libraries could share social data, such as tagging, reviews, etc. in order to create the economies of scale needed to make these concepts work satisfactorily.  Since they needed a way to “identify” things in their collection (books, journals, articles, maps, etc.) somebody had the (understandable, re: DLF) idea to use OpenURL as the identifier mechanism.

I realize that I have been accused of being “allergic” to OpenURL, but in general, my advice is this: if you have a problem and you think OpenURL is the answer, there is probably a simpler and better answer if you approach it from outside of a library POV.

The drawbacks of Z39.88 for this scenario are numerous, but I didn’t go into detail with my criticisms on Twitter.  Here are a few reasons why I would recommend against OpenURL for this (and they are not exclusive to this potential application):

  1. OpenURL context objects are not identifiers.  They are a means to describe a resource, not identify it.  A context object may contain an identifier in its description.  Use that, scrap the rest of it (see the sketch after this list).
  2. Because a context object is a description and not an identifier, it would have to be parsed to try to figure out what exactly it is describing.  This is incredibly expensive, error prone and more sophisticated than necessary.
  3. It was not entirely clear how the context objects would be used in this scenario.  Would they just be embedded in, say, an XML document as a clue as to what is being tagged or reviewed?  Or would the consuming service actually be an OpenURL resolver that took these context objects and returned some sort of response?  If it’s the former, what would the base URI be?  If it’s the latter… well, there’s a lot there, but let’s start simple, what sort of response would it return?
  4. There is no current infrastructure defined in OpenURL for these sorts of requests.  While there are metadata formats that could handle journals, articles, books, etc., it seems as though this would just scratch the surface of what would need context objects (music, maps, archival collections, films, etc.).  There are no ‘service types’ defined for this kind of usage (tags, reviews, etc.). The process for adding metadata formats or community profiles is not nimble, which would make it prohibitively difficult to add new functionality when the need arises.
  5. Such an initiative would have to expect to interoperate with non-library sources.  Libraries, even banding together, are not going to have the scale or attraction of LibraryThing, Freebase, IMDB, Amazon, etc.  Expecting any of these services to adopt OpenURL to share data is naive and a waste of time and energy.
  6. There’s already a way to share this data, called SIOC.  What we should be working towards, rather than pursuing OpenURL, is designing a URI structure for these sorts of resources in a service like this.  Hell, I could even be talked into info URIs over OpenURLs for this.
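
To illustrate the first point: a KEV context object describing a book looks something like the lines below (the values are made up and the query string is wrapped for readability), and the only piece of it that two independent systems will reliably agree on is the rft_id.

url_ver=Z39.88-2004&ctx_ver=Z39.88-2004
  &rft_val_fmt=info:ofi/fmt:kev:mtx:book
  &rft.btitle=An+Example+Book&rft.aulast=Author&rft.date=2009
  &rft_id=urn:isbn:9780123456789

Matching on the description means normalizing and comparing titles, names and dates; matching on urn:isbn:9780123456789 is a string comparison.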

We could further isolate ourselves by insisting on using our standards.  Navel gaze, keep the data consistent and standard.  To me, however, it makes more sense to figure out how to bridge this gap.  After all, the real prize here is to be able to augment our highly structured metadata with the messy, unstructured web.  A web that isn’t going to fiddle around with OpenURL.  Or Z39.50.  Or NCIP.  I have a feeling the same is ultimately true with our vendors.

There comes a point that we have to ask if our relentless commitment to library-specific standards (in cases when there are viable alternatives) is actually causing more harm than help.

While what I’m posting here might be incredibly obvious to anyone that understands unicode or Ruby better than me, it was new to me and might be new to you, so I’ll share.

Since Ed already let the cat out of the bag about LCSubjects.org, I can explain the backstory here.  At lcsh.info, Ed made the entire dataset available as N-Triples, so just before he yanked the site, I grabbed the data and have been holding onto it since.  I wrote a simple little N-Triples parser in Ruby to rewrite some of the data before I loaded it into the platform store I have.  My first pass at this was really buggy: I wasn’t parsing N-Triples literals well at all, leaving out quoted text within the literal and whatnot.  I also, inadvertently, was completely ignoring the escaped unicode within the literals and sending it through verbatim.
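
For context, each statement in the dump is a single line, and my bug was in pulling the object literal out intact.  A rough sketch of the kind of split I was doing (this particular statement is made up for illustration):

line = '<http://lcsubjects.org/subjects/sh12345678#concept> ' +
       '<http://www.w3.org/2004/02/skos/core#prefLabel> ' +
       '"An \"escaped\" label with unicode: 7\u2070" .'
if line =~ /^(<[^>]+>|_:\w+)\s+(<[^>]+>)\s+(.+?)\s*\.\s*$/
  subject, predicate, object = $1, $2, $3
  # object still holds the raw literal, escapes and all:
  #   "An \"escaped\" label with unicode: 7\u2070"
end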

N-Triples escapes unicode the same way Python string literals do (or at least this is how I’ve understood it), so 7⁰03ʹ43ʺN 151⁰56ʹ25ʺE is serialized into N-Triples like: 7\u207003\u02B943\u02BAN 151\u207056\u02B925\u02BAE.  Try as I might, I could not figure out how to turn that back into unicode.

Jonathan Rochkind recommended that I look at the Ruby JSON library for some guidance, since JSON also encodes this way.  With that, I took a peek in JSON::Pure::Parser and modified parse_string for my needs.  So, if you have escaped unicode strings like this, and want them to be unicode, here’s a simple class to handle it.

# Ruby 1.8-era setup: $KCODE/jcode for UTF-8 string handling, iconv for UTF-16BE -> UTF-8
$KCODE = 'u'
require 'strscan'
require 'iconv'
require 'jcode'
# Decodes \uXXXX escape sequences in a string (adapted from JSON::Pure::Parser#parse_string)
class UTF8Parser < StringScanner
  # matches a run of \uXXXX escapes, escaped characters and plain bytes
  STRING = /(([\x0-\x1f]|[\\\/bfnrt]|\\u[0-9a-fA-F]{4}|[\x20-\xff])*)/nx
  UNPARSED = Object.new
  UNESCAPE_MAP = Hash.new { |h, k| h[k] = k.chr }
  UNESCAPE_MAP.update({
    ?"  => '"',
    ?\\ => '\\',
    ?/  => '/',
    ?b  => "\b",
    ?f  => "\f",
    ?n  => "\n",
    ?r  => "\r",
    ?t  => "\t",
    ?u  => nil,
  })
  # converts the big-endian UTF-16 byte pairs built from \uXXXX escapes into UTF-8
  UTF16toUTF8 = Iconv.new('utf-8', 'utf-16be')
  def initialize(str)
    super(str)
    @string = str
  end
  def parse_string
    if scan(STRING)
      return '' if self[1].empty?
      string = self[1].gsub(%r((?:\\[\\bfnrt"/]|(?:\\u(?:[A-Fa-f\d]{4}))+|\\[\x20-\xff]))n) do |c|
        if u = UNESCAPE_MAP[$&[1]]
          u
        else # \uXXXX
          bytes = ''
          i = 0
          while c[6 * i] == ?\\ && c[6 * i + 1] == ?u
            bytes << c[6 * i + 2, 2].to_i(16) << c[6 * i + 4, 2].to_i(16)
            i += 1
          end
          UTF16toUTF8.iconv(bytes)
        end
      end
      if string.respond_to?(:force_encoding)
        string.force_encoding(Encoding::UTF_8)
      end
      string
    else
      UNPARSED
    end
  rescue Iconv::Failure => e
    # JSON's GeneratorError isn't defined in this class, so raise a plain RuntimeError
    raise "Caught #{e.class}: #{e}"
  end
end
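
And to close the loop on the coordinates that started all of this, usage looks like the following (under Ruby 1.8, which is where $KCODE, iconv and jcode all still live):

coords = '7\u207003\u02B943\u02BAN 151\u207056\u02B925\u02BAE'
puts UTF8Parser.new(coords).parse_string
# => 7⁰03ʹ43ʺN 151⁰56ʹ25ʺE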