Archive

Ruby on Rails

I have been working on Fancy-Pants quite a bit in the last couple of weeks. This is an AJAX layer over Voyager’s WebVoyage — an attempt to de-suck-ify its interface a bit. Why is it called Fancy-Pants? Well, Voyager still has the same underwear, it’s just got a new set of britches.

There are two main problems that it’s trying to solve:

  1. For items that have more than one MFHD, WebVoyage won’t show any item information in the title list.
  2. We wanted to link to 856 URLs from the title list.

Now, we’re already doing the second one, but it’s not implemented particularly well. While we were solving those problems, we wanted to see what we could do about that god-awful table-based display.

I took NCSU’s Endeca layout as the baseline template for what I wanted the results to look like. Right now, Fancy-Pants can only be accessed via this Greasemonkey script [get Greasemonkey here]. Greasemonkey, of course, wouldn’t be a requirement, but we’re using it to inject the initial javascript call since we’re having to work on a live system.

For the title list screen, the javascript loops through the bib ids on the page (it grabs them from the ‘save record’ checkboxes) and sends them to a Ruby on Rails app that queries Voyager’s Oracle database and builds a new result set. The javascript then hides the original page results (display: none) and inserts a div with the new results. If there are multiple 856es or locations, the result gets expanding/collapsing divs to show/hide them.
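
On the Rails side, the action is roughly this shape. This is a hedged sketch with made-up names (HoldingsController, Holding.for_bib), not the actual Fancy-Pants code:

    # Hypothetical sketch, not the actual Fancy-Pants code.
    # The Greasemonkey-injected JavaScript sends the bib IDs it scraped
    # from the 'save record' checkboxes; this action looks up holdings/856
    # data against Voyager's Oracle tables and returns an HTML fragment
    # that the script drops in over the hidden WebVoyage results.
    class HoldingsController < ApplicationController
      def lookup
        bib_ids = params[:bib_ids].to_s.split(',').map { |id| id.strip }

        # Holding.for_bib is assumed to wrap the SQL against Voyager's
        # MFHD/item/856 tables and return location, status and URL data.
        @results = bib_ids.map { |bib_id| [bib_id, Holding.for_bib(bib_id)] }

        render :partial => 'results', :layout => false
      end
    end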

I send the query terms to Yahoo’s spell check API and return a link to any suggestions it gives. No, this isn’t ideal, but I’m still in the proof-of-concept stage.

Things I still want to do with the title list screen are:

  1. Come up with a way to show what the item is (journal, microform, map, etc.) — I’ve started on this, but it’s very rough
  2. Make the ‘sort by’ dropdown a row of links
  3. Turn the ‘Narrow my search’ button/page into a faceted navigation menu with options that make sense for the result set (for instance, limiting language to Dutch, Middle (ca. 1050-1350) isn’t going to come into play that much). Also add some logical facets a la Evergreen
  4. Rework the ‘save record’ feature so it persists for the entire session and can save directly to Zotero, EndNote, BibTeX, CiteULike or Connotea.
  5. COinS and UnAPI
  6. Give it the same style as the rest of our new web design.

I’m currently not doing much with the record view page, but I am adding a direct link to the record. I plan on integrating Umlaut responses here, as well as other context-sensitive items – especially those that don’t conform well to OpenURL requests.

If you were able to install the Greasemonkey script and want to try it out, go to GIL’s keyword search and try:

  1. senate hearings — this is a good example of multiple mfhds/856es
  2. thomas friedmann — a good example of “Did you mean”

Also try a journal search for “Nature”. Then try whatever floats your boat and let me know how it worked. If you notice that it’s really slow, this is actually because of Voyager. The “Available online” and relevance icons are all rendered dynamically and they just grind the output to a halt. When we go live with this, we’d disable those features in WebVoyage to speed things up.

Fancy-Pants is by no means a final product. I view it as a bridge between what we have and an upcoming Solr-based catalog interface. The Solr catalog will still need to interface with Voyager, so Fancy-Pants would transition to that. Ultimately, I would like this whole process to lead to the Communicat.

I have a very conflicted relationship with our archives department. While their projects still need to get done, their services get very little use (especially when compared to other pending projects), and every time I get near any of their projects, it starts to become “Ross Singer and the Tar Baby”. Everything, EVERY SINGLE THING, archives/special collections has ever created is arcane, dense, and enormously time-consuming. It always seems simple and then it somehow always turns into EAD.

About a year ago, I was tasked with developing something to help archives publish their recently converted EAD 2002 finding aids as HTML. I knew that XSLT stylesheets existed to do the heavy lifting here, so I thought it wouldn’t be too hard to get up and running.

But it never works out that way with archives. The stylesheets that did what they wanted worked only in Saxon, and Saxon only works with Java (well, now with .Net, too — but that didn’t help at the time). I wasn’t going to take on a language that I only poke at with a long stick on the most ambitious of days for some archives project (no offense, archives, but come on…). My whining caught the attention of Jeremy Frumkin, who pointed me at a project that Terry Reese had done. It was a PHP/MySQL project that took the EAD, called Saxon from the command line and printed the resulting HTML. It also indexed various parts of the finding aid and put them in a MySQL database to enable search. I could never get the search part to work very well with our data, so I gave up on that part, focusing instead on the Saxon -> XSLT -> cached HTML part.

I set up a module in our intranet that let the archivists upload XML files, preview the result with the stylesheet, roll back to an earlier version, etc. No matter what, though, this was a hack. And, most importantly, I never got search working. Also, the detailed description (the dsc) was incredibly difficult to get to display the way archives wanted it.

Another thing that nagged at me: for all the work I (and they) had invested in this, how was it really any better than the HTML finding aids they were migrating from? They had put all that work into encoding the EAD, and all we were really doing was displaying a web page that was less flexible in its output than the previous HTML.

This summer, I met with archives to discuss how we were going to enable searching. Their vision seemed pretty simple to implement.

Why do I keep falling for that? How many hours had I already invested in something I thought would be trivial?

The new system was (of course) built in Rails. The plan was to circumvent XSLT altogether so:

  1. The web designer could have more control over how things worked without punching the XSLT tarbaby.
  2. We could make some of that stuff in the EAD “actionable”, like the subject headings, personal names, corporate names, etc.
  3. We could avoid the XSLT nightmare that is displaying the Box/Folder/etc. list.

The fulltext indexing would be provided by Ferret. I thought I could hammer this out in a couple of weeks. I think dumb things like this a lot.
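
For reference, indexing and searching with Ferret looks roughly like the sketch below. The field names and paths are made up; this is not the production indexing code.

    require 'rubygems'
    require 'ferret'

    # Open (or create) an index on disk and add a finding aid to it.
    index = Ferret::Index::Index.new(:path => '/var/indexes/finding_aids')
    index << { :id      => 'ms001',
               :title   => 'Jane Doe Papers',
               :content => 'full text extracted from the EAD goes here' }

    # Search the index; the block yields internal document ids and scores.
    index.search_each('georgia') do |doc_id, score|
      puts "#{index[doc_id][:title]} (score: #{score})"
    end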

The infrastructure went up pretty quickly. Uploading, indexing, and displaying the browse lists (up until now, they had to get the web designer to add their newly encoded finding aid to the various pages to link to it) all took a little over a week, maybe. Search took maybe another week. For launch purposes, I wasn’t worried about pagination of search results. There were only around 210 finding aids in the system, so a search for “georgia” (which cast the widest net) didn’t really make a terribly unmanageable results page. That’s the nice thing about working with such a small dataset that’s not heavily used. Inefficiencies don’t matter that much. I’ve since added pagination (took a day or two).

No, like last time, the real burden was displaying the finding aid. My initial plan, parsing the XML and divvying it up into a bunch of Ruby objects, was taking a lot of time. EAD is inordinately difficult to put into logical containers with easily usable attributes. That’s just not how EAD rolls. I found I was severely delaying launch by trying to shoehorn my vision onto the finding aid, which archives didn’t really care about, of course. They just wanted searching. So, while I kept working on my EAD/Ruby object modeler, I deferred to XSLT to get the system out the door.

Rather than using Saxon this time, I opted for Ruby/XSLT (based on libxml/libxslt) for the main part of the document and Ruby scripting/templates for the detailed collection list. The former worked pretty well (and fast!), but the latter was turning into a nightmare of endless recursion. When I tried looping through all of the levels (EAD can have 9 levels of recursion, c01-c09, starting at any number — I think — and going to any depth), my attempts either exposed the horrid performance of REXML (a native Ruby XML parser) or turned into exercises in navigating the recursion that would leave you clutching at whatever fibers of sanity remain by the time you get that far into an EAD document.
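
The Ruby/XSLT half is the easy part; from memory of the ruby-xslt gem’s API, it looks something like this (the file names are placeholders):

    require 'rubygems'
    require 'xml/xslt'

    # Transform the top-level parts of the finding aid with libxslt.
    # Placeholder file names; the real stylesheet is the EAD one we use.
    xslt = XML::XSLT.new
    xslt.xml = 'finding_aid.xml'
    xslt.xsl = 'ead_to_html.xsl'
    html = xslt.serve   # returns the transformed document as a string

    File.open('finding_aid.html', 'w') { |f| f.write(html) }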

Finally I found what I thought was my answer: a nifty little Ruby library called CobraVsMongoose that transforms a REXML document into a Ruby hash (or vice versa). It was unbelievably fast and made working with this nested structure a WHOLE lot easier. There was some strangeness to overcome (naturally). For instance, if an element’s child nodes include an element that appears more than once, it will nest the children in an array; if there are no repeating elements, it will put the child nodes in a hash (or the other way around, I can’t remember), so you have to check which one you’ve got and process it accordingly. Still, it was fast, easy to manipulate and allowed me to launch the finding aids system.
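
The CobraVsMongoose round trip is basically one call each way. A small sketch, with the output shape recalled from the BadgerFish convention it follows (so treat the exact keys as an assumption):

    require 'rubygems'
    require 'cobravsmongoose'

    xml  = '<c01 level="series"><did><unittitle>Correspondence</unittitle></did></c01>'
    hash = CobraVsMongoose.xml_to_hash(xml)
    # => {"c01"=>{"@level"=>"series", "did"=>{"unittitle"=>{"$"=>"Correspondence"}}}}

    # Repeating children come back as an Array instead of a nested Hash,
    # so anything walking the structure has to check which it got first:
    node = hash['c01']['did']
    if node.is_a?(Array)
      node.each { |child| p child }
    else
      p node
    end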

Everybody was happy.

And then archives actually looked at the detailed description. Anything in italics or bold or in those weird ‘emph’ tags wasn’t being displayed. Ok, no problem, I just need to mess with the CobraVsMongoose function… oh wait. Yeah, what I hadn’t really thought about was that Ruby hashes don’t preserve sort order, so there was no way to get the italicized or bolded or whatevered text to display where it was supposed to in the output.

Damn you, tarbaby!

Back to the drawing board. I decided to return to REXML, emboldened by what I thought was a better handle on the dsc structure and some better approaches to the recursion. Every c0n element would get mapped to a Ruby object (Collection, Series, Subseries, Folder, Item, etc.); these would nest as children of each other and have partial views to display them properly.

On my first pass, it was taking 28 seconds for our largest finding aid to display. TWENTY EIGHT SECONDS?! I would like to note, as finding aids go, ours aren’t very large. So, after tweaking a bit more (eliminating checking for children in Item elements, for example), I got it down to 24 seconds. Still not exactly the breathtaking performance I had hoped for.

What was bugging me was how quick CobraVsMongoose was. Somehow, despite also using REXML, it was fast. Really fast. And here my implementation seemed more like TurtleVsSnail. I was all set to turn to Libxml-Ruby (which would require a bunch of refactoring to migrate from REXML) when I found my problem and its solution. This was last night at 11PM.

While poring over the REXML API docs, I noticed that the REXML::Element.each_element method’s argument was called ‘xpath’. Terry had written about how dreadfully slow XPath queries are with REXML and, as a result, I thought I was avoiding them. When I removed the path argument from the each_element call in one of my methods and just iterated through each child element to see if its name matched, it cut the processing time in half! So, while 12 seconds was certainly no thoroughbred, I was definitely on the right track. When I eliminated every XPath from the recursion process, I got it down to about 5 seconds. Add a touch of fragment caching and the natural performance boost of a production vs. development site in Rails, and I think we’ve got a “good enough for now” solution.
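
To make the change concrete, here is the shape of it (a sketch, not the actual finding-aid code; the element names are EAD’s, the file name is a placeholder):

    require 'rexml/document'

    doc = REXML::Document.new(File.read('finding_aid.xml'))
    dsc = doc.root.elements['archdesc/dsc']   # a one-off XPath lookup is fine

    # Slow: a string argument to each_element is evaluated as an XPath query.
    dsc.each_element('c01') do |c01|
      puts c01.attributes['level']
    end

    # Much faster: iterate the child elements and compare names yourself.
    dsc.each_element do |child|
      next unless child.name == 'c01'
      puts child.attributes['level']
    end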

There’s a little more to do with this project. I plan on adding an OpenSearch interface (I have all the plumbing from adding it to the Umlaut), an OAI provider and an SRU interface (when I finally get around to porting that CQL parser). And, yeah, finishing the EAD Object Model.

But right now, archives and I need to spend a little time away from each other.

In the meantime, here’s the development site.

The Umlaut “launched” last Monday. I wouldn’t call it the most graceful of take-offs, but I think it’s pretty much working now.

We immediately ran into a problem with ProQuest as a source. ProQuest sends a query string that starts with an ampersand (“&ctx_ver=foo…”) which Rails strongly disliked. Thankfully, it was still part of the break, so traffic was light. It gave me the opportunity to isolate and fix the myriad little bugs that could really only have been found by exposing the Umlaut to the wild and crazy OpenURLs that occur in nature. Ed Summers talked me into (and through) patching Rails so it didn’t choke on those ProQuest queries, although we later saw that the same patch was already in Rails Edge. It’s nice how hackable this framework is, though.
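
Just to illustrate the shape of the problem (this is not the actual Rails patch, only a sketch of the workaround idea): the fix amounts to dropping the stray leading ampersand before the query string gets parsed.

    require 'cgi'

    # Sketch of the workaround idea only; the real fix was a patch inside
    # Rails' request parsing (and, as it turned out, was already in Edge).
    def normalize_query_string(raw)
      raw.sub(/\A&+/, '')
    end

    raw = '&ctx_ver=Z39.88-2004&rft.genre=article'
    CGI.parse(normalize_query_string(raw))
    # => {"ctx_ver"=>["Z39.88-2004"], "rft.genre"=>["article"]}
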
There was also a bit of a gaffe when my Trac project went ballistic after being crawled by Googlebot, bringing down my server (and with it all journal coverage and our Zebra index of the OPAC — this nice little denial of service just so happened to bring down the Umlaut as well). As a result, the Trac site is down until I can figure out how to keep it from sending my machine into a frenzy (although I’m in the process of moving the Zebra index to our Voyager server — as I type this, in fact — and the Umlaut will be able to handle its own journal coverage tomorrow, probably).

There was also a bit of a panic on Friday when the mongrel_cluster would throw 500 errors whenever it served a session created by another mongrel in the cluster. Ugh. Little did I know that you can’t use PStore sessions with mongrel_cluster. I scaled back to one mongrel server over the weekend (which, of course, was overkill considering the Umlaut was crapping out on the web services my desktop machine wasn’t providing to it anyway) while I migrated to ActiveRecord for session storage. It seems to be working fine now and we’re back up to 5 mongrel servers. Yeah, hadn’t even thought about that one…
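
For anyone hitting the same wall: the switch is a one-line config change plus a sessions table. This is from memory of Rails 1.x conventions, so treat it as a sketch rather than gospel.

    # config/environment.rb
    # Swap the default file-based PStore sessions (each mongrel only sees
    # its own session files) for sessions stored in the database, which
    # every mongrel in the cluster can read.
    Rails::Initializer.run do |config|
      config.action_controller.session_store = :active_record_store
    end

    # Then create the sessions table: newer Rails versions ship a
    # `rake db:sessions:create` task; otherwise a hand-written migration
    # with `session_id` and `data` columns does the job.
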
Yeah, it wasn’t perfect. But all in all, there were no catastrophes, and it was nice having the safety of the class break. So, for the rest of the week, I can focus on getting our EAD finding aids searchable.

Joy.

It’s been a while since I’ve written anything about the Umlaut. It’s been a while since I’ve written about anything, really. Lots of reasons for that: been frantically trying to pull the Umlaut together in time to launch for fall semester, and I’ve got this little bit of business going on…

Still, it’s probably time to touch on some of the changes that have happened.

  1. The backend has been completely updated

     The initial design was… shaky… at best. While the new backend is probably still shaky (it is, after all, my creation), it’s certainly more thought-out. Incoming requests are split by their referent and referrer (see Thom Hickey’s translation for these arcane terms), and the referent is checked against a ‘Ferret store’ of referents. The rationale here is that citations come in with widely varying completeness and quality, so we do some fulltext searching against an index of previously resolved referents to see if we’ve dealt with this one before.

     It then stores the referent and the response object as Marshaled data, which is great, except it royally screws up trying to tail -f the logs. (There’s a minimal sketch of the idea just after this list.)

  2. New design

     Heather King, our web designer here at Tech, has vastly improved the design. There’s still quite a bit more to do (isn’t there always?), but we’ve got a good, eminently usable interface to build upon. The bits that need to be cleaned up (mainly references to other citations) won’t be that hard to clean up.

  3. Added Social Bookmarking Support

     Well, read-only support. Connotea, Yahoo’s MyWeb and unalog support were pretty easy to add courtesy of their existing APIs. The downside is that I can only hope to find bookmarks based on URL, which… doesn’t work well. I really wish Connotea would get some sort of fielded searching going on. Del.icio.us support, which would be great, can’t really happen until they ease the restrictions on how often you can hit it.

     CiteULike was a bit more of a hack, as it has no API. Instead, I find CiteULike pages in the Google/Yahoo searches I was already doing, grab the RSS/PRISM and tags, and then do a search (again, retrieving the RSS/PRISM) with the tags from the first article. It’s working pretty well, although I need to work out title matching since both Yahoo and Google truncate HTML titles. I plan on adding the CiteULike journal table-of-contents feeds this way, too.

  4. Improved performance thanks to the magic of AJAX

     Let’s face it: the Umlaut was bloody slow. There were a couple of reasons for this, but it was mainly due to hitting Connotea and trying to harvest any OAI records it could find for the given citation. The kicker was that this was probably unnecessary for the majority of users. Now we’ve pushed the social bookmarkers and the OAI repositories to a ‘background task’ that gets called via javascript when the page renders. It’s not technically AJAX so much as a remote procedure call, but AJAX is WebTwoFeriffic! Besides, this is a Rails project. Gotta kick up the buzz technologies.

  5. Now storing subjects in anticipation of a recommender service

     The Umlaut now grabs subject headings from PubMed (MeSH); LCSH from our catalogs; SFX subjects; tags from Connotea, CiteULike, MyWeb and unalog; and subjects from the OAI repositories, and stores them with the referent. It also stores all of these in the Ferret store. The goal here is to search on subject headings to find recommendations for other, similar items. As of this writing, there is only one citation with subject associations, so there’s nothing really to see here.
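
For what it’s worth, the Marshal trick mentioned in the first item looks something like this. It is a hypothetical sketch (the model and column names are mine, not the Umlaut’s actual schema):

    # Hypothetical sketch, not the Umlaut's actual models. The serialized
    # response object goes into a binary/text column and is revived on the
    # next request that matches the same referent.
    class StoredResponse < ActiveRecord::Base
      def response=(obj)
        # Marshal.dump produces a binary blob (the same stuff that makes
        # the logs ugly under tail -f).
        self[:payload] = Marshal.dump(obj)
      end

      def response
        Marshal.load(self[:payload]) if self[:payload]
      end
    end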

The big todo that’s on my plate for the rest of the week is adding statistics gathering. I’ve got my copy of An Architecture for the Aggregation and Analysis of Scholarly Usage Data by Bollen and Van de Sompel printed out and I plan on incorporating their bX concept for this.

I’ve been waiting for a while to have this title. Well, actually, not a long while, and that’s testimony to how quickly I’m able to develop things in Rails.

While I think SFX is a fine product and we are completely and utterly dependent on it for many things, it does still have its shortcomings. It does not have a terribly intuitive interface (no link resolver that I’m aware of does), and there are some items it just doesn’t resolve well, such as conference proceedings. Since conference proceedings and technical reports are huge for us, I decided we needed something that resolved these items better. That’s when the idea of the überResolver (now mainly known as ‘the umlaut’) was born.

Although I had been working with Ed Summers on the Ruby OpenURL libraries before Code4Lib 2006, I really began working on umlaut earlier this month, when I thought I might have something coherent together in time for the ELUNA proposal submission deadline. Although I barely had anything functional on the 8th (the deadline — 2 days after I had really broken ground), I could see that this was actually feasible.

Three weeks later and it’s really starting to take shape (although it’s really, really slow right now). Here are some examples:

The journal ‘Science’

A book: ‘Advances in Communication Control Networks’

Conference Proceeding

Granted, the conference proceeding is less impressive as a result of IEEE being available via SFX (although, in this case, it’s getting the link from our catalog) and the fact that I’m having less luck with SPIE conferences (they’re being found, but I’m having some problems zeroing in on the correct volume — more on that in a bit), but I think that since this is the result of < 14 days of development time, it isn’t a bad start.

Now on to what it’s doing. If the item is a “book”, it queries our catalog by ISBN; asks xISBN for other matches and queries our catalog for those; does a title/author search; and does a conference name/title/year search. If there are matches, it then asks the opac for holdings data. If the item is either not held or not available, it does the same against our consortial catalog. Currently it’s doing both regardless, because I haven’t worried about performance.

It checks the catalog via SRU and tries to fill out the OpenURL ContextObject with more information (such as publisher and place), which would be useful for exporting to a citation manager (something most link resolvers have fairly minimal support for). While it has the MODS records in hand, it also grabs LCSH and tables of contents (if they exist). When I find an item with more data (abstracts, etc.), I’ll grab that as well.
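
An SRU request is just a GET with a CQL query, so the catalog lookup amounts to building a URL and parsing the MODS out of the response. A hedged sketch (the host, port and index name below are placeholders, not our actual setup):

    require 'cgi'
    require 'open-uri'
    require 'rexml/document'

    # Placeholder base URL and CQL index; not our actual server config.
    base  = 'http://catalog.example.edu:7090/voyager'
    query = CGI.escape('bath.isbn=0060731327')
    url   = "#{base}?version=1.1&operation=searchRetrieve&recordSchema=mods" +
            "&maximumRecords=5&query=#{query}"

    response = REXML::Document.new(open(url).read)

    # Each recordData element wraps a MODS record; publisher, place, LCSH
    # and table-of-contents notes get pulled out of those and folded back
    # into the ContextObject.
    REXML::XPath.each(response, "//*[local-name()='recordData']") do |record|
      # ...
    end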

It then queries Amazon Web Services for more information (editorial content, similar items, etc.).

It still needs to check SFX, but, unfortunately, that would slow it down even more.

For journals, it checks SFX first. If there’s no volume, issue, date or article title, it will try to get coverage information. Unfortunately, SFX’s XML interface doesn’t send this, so I have to get it from elsewhere. When I made our Ejournal Suggest service, I had to create a database of journals and journal titles, and I have since been adding functionality to it (since the reports I run from SFX for titles include the subject associations, I load those as well — they include coverage, too, so adding that field was trivial).

So when I get the SFX result document back, I parse it for its services (getFullText, getDocumentDelivery, getCitedBy, etc.) and, if no article information was sent, I make a web service request to a little PHP/JSON widget sitting on the Ejournal Suggest database that returns coverage, subjects and other similar journals based on the ISSN. The ‘other similar journals’ are 10 (an arbitrary number) other journals that appear in the same subject headings, ordered by number of clickthroughs in the last month. This doesn’t appear if there is an article, because I haven’t decided whether it’s useful in that case (plus the user has a link to the ‘journal level’ if they want it).
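
The round trip to the coverage widget is just a JSON fetch keyed on ISSN. The URL and response fields below are hypothetical stand-ins, not the real Ejournal Suggest endpoint:

    require 'open-uri'
    require 'rubygems'
    require 'json'

    # Hypothetical endpoint and field names; the real widget is a small
    # PHP/JSON script sitting on the Ejournal Suggest database.
    issn = '0028-0836'   # Nature
    data = JSON.parse(open("http://journals.example.edu/suggest.php?issn=#{issn}").read)

    coverage = data['coverage']          # holdings ranges per target
    subjects = data['subjects']          # SFX subject associations
    similar  = data['similar_journals']  # ten most-clicked journals sharing a subject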

Umlaut then asks the opac for holdings and tries to parse the holdings records to determine if a specific issue is held in print (this works well if you know the volume number — I have thought about how to parse just a year, but haven’t implemented it yet). If there are electronic holdings, it attempts to dedupe.
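
The holdings check is basically pattern matching over the holdings statement. A sketch of the idea only; the statement format here is a simplified assumption, and real MFHD data is far messier:

    # Sketch only: assumes holdings statements shaped like
    # "v.12(1998)-v.30(2006)".
    def volume_held?(holdings_statement, volume)
      holdings_statement.scan(/v\.(\d+)\([^)]*\)\s*-\s*v\.(\d+)/).any? do |first, last|
        (first.to_i..last.to_i).include?(volume.to_i)
      end
    end

    volume_held?('v.12(1998)-v.30(2006)', 19)   # => true
    volume_held?('v.12(1998)-v.30(2006)', 42)   # => false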

There is still a lot more work to do with journals, although I hope to be able to implement it soon. The getCitedBy options will vary between grad students/faculty and undergrads. Since we have very limited seats for Web of Science, undergraduates will instead get getCitedBy links to Google Scholar; graduate students and faculty will get both Web of Science and Google Scholar. Also, if no fulltext results are found, it will go out to the search engines to try to find something (whether that’s the original item or a postprint in arxiv.org or wherever). We will also have getAbstracts and getTOCs services enabled so the user can find other databases or table-of-contents services that might be useful. Further, I plan on associating the subject guides with SFX Subjects and LCC, so we can make recommendations from a specific subject guide (and actually promote the guide a bit) based, contextually, on what the user is already looking at. By including the SFX Target name in the subject guide items (an existing field that’s currently unused), we could also match on the items themselves.

The real value in umlaut, however, will come in its unAPI interface. Since we’ll have Z39.88 ContextObjects, MODS records, Amazon Web Services results and who knows what else, umlaut could feed an Atom store (such as unalog) with a whole hell of a lot of data. This would totally up the ante for scholarly social bookmarking services (such as Connotea and Cite-U-Like) by letting them behave more like personal libraries that match on a wide variety of metadata, not just URL or title. The associations that users make can also aid umlaut in recommending other items.

The idea here is not to replace the current link resolver; the intention is to enhance it. SFX makes excellent middleware, but I think its interface leaves a bit to be desired. By utilizing its strengths, we can layer more useful services on top of it. Also, a user can add other affiliations they belong to in their profile, so umlaut can check their local public library or, if they are taking classes at another university, those institutions as well.

At this point I can already hear you saying, “But Ross, not everyone uses SFX”. How true! I propose a microformat for link resolver results that umlaut could parse (and, in an ‘eating your own dog food’ fashion, I will eventually add it to umlaut’s own template), making any link resolver available to umlaut.

There is another problem I’ve encountered while working on this project, too. Last week and the week before, while I was doing the bulk of the SRU development, I kept noticing (and reporting) our catalog (and, more often, its Z39.50 server) going down. Like, many times a day. After concluding that, in fact, I was probably causing the problem, I finally got around to doing something I’ve been meaning to do for months (and would recommend to everyone else who wants to actually build useful systems): exporting the bib database into something better.

Last week I imported our catalog into Zebra, and sometime this week I will have a system that syncs the database every other hour (we already have the plumbing for this for our consortial catalog). I am also experimenting with Cheshire3 (since I think its potential is greater — it’s possible we may use both for different purposes). The advantage to this (besides not crashing our catalog every half hour) is that I can index the data any way I want or need to, and store it any way I need to, to make sure users get the best experience they can.

Going back to the SPIE conferences: there is no way in Voyager that I can narrow my results below the 360+ hits for “SPIE Proceedings” in 2003. At least, not from the citations I get from Compendex (which is where anyone would get the idea to look for SPIE Proceedings in our catalog, anyway). With an exported database, however, I could index the volume and pinpoint the exact record in our catalog. Or, if that doesn’t scale (for instance, if they’re all done a little differently), I can pound the hell out of our Zebra (or Cheshire3 or whatever) server looking for the proper volume without worrying about impacting all of our other services. I can also ‘game the system’ a bit and store bits in places that I can query when I need them. Certainly this makes umlaut (and other services) more difficult to share with other libraries (at least, other libraries that don’t have setups similar to ours), but I think these sorts of solutions are essential to improving access to our collections.

Oh yeah, and lest you think that mirroring your bib database is too much to maintain: Zebra can import MARC records (so you can use your OPAC’s MARC export utility), and our entire bib database (705,000 records) takes up less than 2GB of storage. The more indexes you add, the larger the database, of course, but I am indexing a LOT in that.

Since my foray into Python a couple of months ago, I’ve been enjoying branching out into new languages.

I had pitched the concept of a link resolver router for the state universal catalog to a committee I sit on (this group talks about SFX links in the 856 tag and whatnot). The problem with making links for a publicly available resource point to your institutional resolver is just that: it’s pointing at your institutional resolver, despite the fact that your audience could be coming from anywhere. This plays out even more in a venue such as a universal catalog, since there’s not really a “home institution” to point a resolver link at, anyway. OCLC and UKOLN both have resolver routers, and OCLC’s certainly is an option, but I don’t feel comfortable with the possibility that all of our member institutions might have to pay for the service (in the future). My other problem with OCLC’s service is that you can only belong to one institution, and I have never liked that (especially as more and more institutions have link resolvers).

So, in this committee I mentioned that it would be pretty simple to make a router, and since I was having trouble getting people to understand what exactly I was talking about, I decided to make a proof-of-concept. And, since I was making a proof-of-concept, I thought it’d be fun to try it in Ruby on Rails.

Now, a resolver router is about the simplest concept possible. It doesn’t really do anything but take requests and pass them off to the appropriate resolver. It’s a resolver abstraction layer, if you will. I thought this was a nice, small project to cut my Ruby teeth on. There’s a little bit of a database, a little bit of AJAX. It’s also useful, unlike making a cookbook from a tutorial or something.
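
The core of it really is that small. A hedged sketch of the redirect action (the model and column names are mine, not the actual app’s):

    # Hypothetical sketch; model/column names are made up.
    # The router just looks up the chosen resolver's base URL and bounces
    # the incoming OpenURL to it with the original query string intact.
    class RouterController < ApplicationController
      def resolve
        resolver = Resolver.find(params[:resolver_id])
        # In practice you would strip the router's own parameters out of
        # the query string before passing it along.
        redirect_to "#{resolver.base_url}?#{request.query_string}"
      end
    end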

It took about three days to make this. After you pick your resolver (Choose a couple! Add your own!), you’ll be taken to a page to choose between your various communities for appropriate copy.

I chose this particular citation because it shows a huge limitation of link resolvers (if you choose Georgia Tech’s resolver and Emory’s resolver, for instance): despite the fact that this item is freely available, it does not appear in my resolver. That’s not really the use case I envision, though. I am thinking more of a case like my co-worker, Heather, who should have access to Georgia Tech’s collection, Florida State’s resources (she’s in grad school there), Richland County Public Library (she lives in Columbia, SC), and the University of South Carolina (where her husband is a librarian). The resolver router alleviates the need to search for a given citation in the various communities (indeed, even the need to think of or know where to look within those communities).

Sometime later this winter, I’ll have an even better use case. I’ll keep that under wraps for now.

Now, my impression of Ruby on Rails… For a project like this, it is absolutely amazing. I cannot believe I was able to learn the language from scratch and implement something that works (with this amount of functionality) in such a short amount of time. By bypassing the need to create the “framework” for the application, you can just dive into implementation.

In fact, I think my time to implementation would have been even faster if the resources/tutorials out there didn’t suck out loud. Most references point to these tutorials to get you started, but they really aren’t terribly helpful: they explain nothing about why they are doing what they are doing. I found this blog posting to be infinitely more useful. Her blog in general is going in my aggregator, I think.

When it comes to learning Ruby, this is a masterful work of art… but… not terribly useful if you just want to look things up. I recommend this for that.

Anyway, I am so impressed with Ruby on Rails that I am currently planning on using it for the “alternative opac project”, which is now being code-named “Communicat”. More on this shortly (although I did actually develop the database schema today).