My relationship with Ruby nowadays is roughly akin to that of somebody addicted to painkillers. I know it’s not good for me (since everything I work on nowadays is RDF, XML or both), but I’m still able to be productive, and the pain of quitting, while it would be better for everybody in the long run, just isn’t something I have time for right now. Maybe someday I’ll make the jump back to Python (since it’s actually pretty good at dealing with both RDF and XML), but for now I’ll just find workarounds to my problems (unlike others, I am completely incapable of juggling more than one language).
I first ran into my big XML and Ruby problem a couple of weeks ago while working on the TalisLMS connector for Jangle. I’ve, of course, run into it before, but it has never been a total show stopper like this. In order to add the Resource entity (Jangle-ese for bibliographic records) to the TalisLMS connector, I am querying the Platform store the OPAC uses. I’m using the Platform rather than the Zebra index that comes with Alto (the records are indexed in both places) because the modified date isn’t sortable in Zebra and that would be an issue when serializing everything to Atom. The records are transformed into a proprietary RDF format (called BibRDF) when loaded into the Platform (this is for the benefit of Prism, our OPAC). In order to get the MARC records (there’s no route back to the MARC from BibRDF), I have to pull the UniqueIdentifier field (which is the mapped 001) out of the BibRDF, throw it into a Z39.50 client (Ruby/ZOOM) and query the Zebra index. In order to get enough metadata to create a valid Atom entry, I needed to be able to parse the BibRDF (which comes out of the Platform as RDF/XML), since that is the default record format.
And this is where I ran into problems. I have the default number of records returned by Jangle set to 100. That’s a pretty sweet spot both for servers to handle the load and for clients to deal with the resulting Atom document. Well, you’d think it was, anyway, except REXML was taking about 10 seconds to parse the Platform response into Ruby objects.
I realize the Rubyists out there are already dismissing this and scrolling down to the comment box to write “well don’t use REXML, you dumbass”, but let me explain. I generally don’t use REXML (unless it’s something very small and simple), instead opting for Hpricot for parsing XML. I’ve tended to avoid LibXML in Ruby. When I first tried it, it segfaulted a lot, but that was the past… my reason for avoiding it lately is that I have this stubborn ideal about having things work with JRuby, and that’s just not going to be an option with LibXML (before you scroll down and add another comment about the Ruby/ZOOM requirement, it will eventually be replaced with Ruby-SRU… probably). Hpricot was falling flat on its face with the BibRDF namespace prefixes, though (j.0:UniqueIdentifier). It seems to have problems with periods in the prefix, so that was a no go.
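For what it’s worth, a period is a perfectly legal character in an XML namespace prefix, and a namespace-aware parser should match on the namespace URI rather than on the prefix anyway. Here is a minimal sketch of pulling the identifiers out with REXML that way. The sample document and the namespace URI (http://example.org/bibrdf#) are made up for illustration; only the j.0 prefix and the UniqueIdentifier field name come from the real data.

```ruby
require "rexml/document"

# Hypothetical sample: the namespace URI and the values are invented.
# Only the "j.0" prefix and the field name are taken from the post.
SAMPLE_BIBRDF = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:j.0="http://example.org/bibrdf#">
    <rdf:Description rdf:about="http://example.org/items/1">
      <j.0:UniqueIdentifier>12345</j.0:UniqueIdentifier>
    </rdf:Description>
    <rdf:Description rdf:about="http://example.org/items/2">
      <j.0:UniqueIdentifier>67890</j.0:UniqueIdentifier>
    </rdf:Description>
  </rdf:RDF>
XML

# Map our own prefix ("bib") to the namespace URI, so the XPath
# never has to mention the document's awkward "j.0" prefix.
BIB_NAMESPACES = { "bib" => "http://example.org/bibrdf#" }

# Returns the text of every UniqueIdentifier element in the document.
def unique_identifiers(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//bib:UniqueIdentifier", BIB_NAMESPACES).map(&:text)
end

puts unique_identifiers(SAMPLE_BIBRDF).inspect
```

This is only about correctness, of course. It does nothing for the speed problem.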
So I had REXML and I had horrible performance. Now what?
Well, JSON is fast in Ruby, so I thought that might be an option. The Platform has a transform service: if you pass an argument with the URL for an XSLT stylesheet, it will output the result in the format you want. Googling found several projects that would turn XML into JSON via XSLT (this one seems the best if you have an XSLT 2.0 parser), but they weren’t quite what I needed. I wanted to preserve the original RDF/XML since I was just going to be turning around and regurgitating it back to the Jangle server, anyway. I just needed a quick way to grab the UniqueIdentifier, MainAuthor and LastModified fields and shove the rest of the XML into an object attribute.
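On the Ruby side, consuming that kind of output can be as small as the sketch below. The JSON shape is an assumption: I’m pretending the stylesheet emits an array of objects with the three field names above plus an "rdf" key holding the untouched RDF/XML as a string. The Resource struct and the resources_from_json helper are likewise just names picked for the example.

```ruby
require "json"

# Hypothetical record holder: three extracted fields plus the
# original RDF/XML, kept as an opaque string to hand back to Jangle.
Resource = Struct.new(:id, :author, :modified, :rdf_xml)

# Turns the (assumed) JSON array into Resource objects.
# UniqueIdentifier is treated as required; the other keys may be absent.
def resources_from_json(json)
  JSON.parse(json).map do |record|
    Resource.new(
      record.fetch("UniqueIdentifier"),
      record["MainAuthor"],
      record["LastModified"],
      record["rdf"]
    )
  end
end

# Made-up sample of what the transform service might return.
sample = <<~JSON
  [
    {
      "UniqueIdentifier": "12345",
      "MainAuthor": "Doe, Jane",
      "LastModified": "2009-01-15T12:00:00Z",
      "rdf": "<rdf:RDF>...</rdf:RDF>"
    }
  ]
JSON

resources_from_json(sample).each do |resource|
  puts "#{resource.id} (#{resource.author}), modified #{resource.modified}"
end
```

The point is that Ruby never parses the RDF/XML at all: it arrives as a string inside the JSON and leaves the same way.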
I have always chafed at the thought of actually doing anything in XSLT. In retrospect (after I’ve been using it almost exclusively for a month, now), I realize that my opinion was probably actually the result of the data that I was trying to transform (EAD, the metadata format designed to punish technologists) rather than XSLT itself (the project got sucked into a vortex when I tried working with the EAD directly with Ruby, too). Still, I had always resisted. The syntax is weird, variables confused me, I just never got the hang of it.
But, damn, it’s fast.
I wasn’t done, yet, though. The DLF ILS-DI Adapter for Jangle’s OAI-PMH service was sooooo slow. Requests were literally taking around 35 seconds each. This was because I was using FeedTools to parse the Atom documents and Builder::XmlMarkup to generate the OAI-PMH output. And this was silly. Atom is a very short hop to OAI-PMH, and there was really no need to manipulate the data itself at all. However, I did need to add stuff to the final XML output that I wouldn’t know until it was time to render. So I wrote these two XSLTs. I have patterns in there which are identified by “##verb##” or “##requestUrl##”, etc. This way, I can load the XSLT file into my Ruby script, replace the patterns with their real values via regex, and then transform the Atom to OAI-PMH using libxslt-ruby. Requests are now down to about 5 seconds. Not bad.
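The placeholder step looks roughly like the sketch below. The stylesheet fragment is a made-up stand-in (the real files are much longer), and fill_placeholders and xml_escape are names chosen for the example. The escaping is my own addition here: a request URL with an ampersand in it would otherwise leave the stylesheet malformed. The filled-in string would then be handed to libxslt-ruby, which isn’t shown.

```ruby
# Hypothetical stylesheet fragment using the ##name## placeholder
# convention; the real XSLT files are not reproduced here.
XSLT_TEMPLATE = <<~XSLT
  <xsl:template match="/">
    <request verb="##verb##">##requestUrl##</request>
  </xsl:template>
XSLT

XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }

# Escapes the characters that would break XML text or attribute values.
def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Replaces every ##name## marker with the escaped value for "name".
# Raises KeyError if the template asks for a value we don't have,
# which beats silently shipping a literal ##verb## to the client.
def fill_placeholders(template, values)
  template.gsub(/##(\w+)##/) do
    name = Regexp.last_match(1)
    xml_escape(values.fetch(name))
  end
end

puts fill_placeholders(
  XSLT_TEMPLATE,
  "verb" => "ListRecords",
  "requestUrl" => "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc"
)
```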
All in all I’m pretty happy with this. And I don’t have to quit my addiction just yet.
For those of you who noticed that libxslt-ruby doesn’t quite jibe with my JRuby requirement, well, I guess I’m not very dogmatic at the end of the day (which is right about now).