Monthly Archives: December 2006

Well, my inaugural fantasy football season is over. Despite not having a freaking clue what I was doing (and it being the year I have watched the least football in… ages), my team somehow limped into the championship game against Jeremy.

I knew I was screwed. I barely got past Carol in the playoff round; we were tied going into the Monday night game. It boiled down to Peyton Manning (my team) having a great game and T.J. Houshmandzadeh (Carol’s team) merely showing up.

And then I immediately overanalyzed my team and choked. Fearing that the Bears were going to sit all their starters (since their playoff position is decided), I benched Desmond Clark for Alge Crumpler (no net difference in points this weekend) and I benched Bernard Berrian (eight points) and played Lee Evans instead. This would, on the surface, seem ok, but to get Lee Evans, I dropped Travis Henry (another career day and how about them Titans, y’all?) leaving God knows how many points sitting in free agency.

I then focused all of my attention on matchups and injuries. I benched Fast Willie since he was facing the Ravens (good choice) and substituted him with Deuce McAllister (who probably had one of his best games of the season) — so that was a total net gain. Since I had cut Travis Henry (ugh, why?) and Joseph Addai (meh) for the Dolphins defense (more on that in a bit), I started Marion Barber III alongside McAllister. This dude… When he’s on my bench he scores, like, 5 TDs/game. When I play him he runs 6 times for 3 yards.

My wide receivers were a mess. Marty Booker was hurt, so I traded him for Braylon Edwards who subsequently sucked. I didn’t play Laveranues Coles, since he was “questionable” and this worked out fine since he got hurt (I mean for me, not him).

So, I didn’t play the Dolphins’ defense in the end. In my angst, I read several articles by fantasy football pundits (what kind of a job is that?) talking about how overmatched the Bears defense was for the Lions.

Man, they sucked. For two weeks in a row.

So I lost. And I’m not even sure I “beat myself” (to use football losing team jargon). I’m not even sure that my poor choices could overcome that day Steven Jackson had.

Anyway, fantasy football was fun. I came, I saw, I lost to Jeremy.

Better luck next year.

I have been working on Fancy-Pants quite a bit in the last couple of weeks. This is an AJAX layer over Voyager’s WebVoyage — an attempt to de-suck-ify its interface a bit. Why is it called Fancy-Pants? Well, Voyager still has the same underwear, it’s just got a new set of britches.

There are two main problems that it’s trying to solve:

  1. For items that have more than one MFHD, WebVoyage won’t show any item information in the title list.
  2. We wanted to link to 856 URLs from the title list.

Now, we’re already doing the second one, but it’s not implemented particularly well. While we were solving those problems, we wanted to see what we could do about that god-awful table based display.

I took NCSU’s Endeca layout as the baseline template for what I wanted the results to look like. Right now, Fancy-Pants can only be accessed via this Greasemonkey script [get Greasemonkey here]. Greasemonkey, of course, wouldn’t be a requirement, but we’re using it to inject the initial javascript call since we’re having to work on a live system.

For the title list screen, the javascript is looping through the bib ids on the page (it grabs them from the ‘save record’ checkboxes) and sends them to a Ruby on Rails app that queries Voyager’s Oracle database and builds a new result set. The javascript hides the original page results (display: none) and inserts a div with the new results. If there are multiple 856es or locations, the result has expanding/collapsing divs to show/hide them.
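
A rough sketch of the grouping step on the Rails side, assuming the database query has already returned one flat row per holding or 856 link. The `Row` struct, the `build_results` helper, the field names and the sample data are all made up for illustration; this is not the actual Fancy-Pants code.

```ruby
# Hypothetical row shape: one row per holding (MFHD) or 856 link for a bib record.
Row = Struct.new(:bib_id, :location, :url)

# Group flat rows into one result per bib id, keeping the order in which the
# ids appeared on the original title list page.
def build_results(bib_ids, rows)
  by_bib = rows.group_by(&:bib_id)
  bib_ids.map do |id|
    matches = by_bib.fetch(id, [])
    locations = matches.map(&:location).compact.uniq
    urls = matches.map(&:url).compact.uniq
    {
      bib_id: id,
      locations: locations,
      urls: urls,
      # Expanding/collapsing is only needed when there is more than one entry.
      collapsible: locations.size > 1 || urls.size > 1
    }
  end
end

# Made-up sample data.
rows = [
  Row.new(101, "Main Stacks", nil),
  Row.new(101, "Government Documents", nil),
  Row.new(101, nil, "http://example.org/hearing"),
  Row.new(202, "Main Stacks", nil)
]
results = build_results([202, 101], rows)
```

Keeping the results in the order of the incoming ids means the replacement div can mirror the ordering of the page it hides.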

I send the query terms to Yahoo’s spell check API and will return a link to any suggestions it gives. No, this isn’t the ideal, but I’m still in proof-of-concept stage.

Things I still want to do with title list screen are:

  1. Come up with a way to show what the item is (journal, microform, map, etc.) — I’ve started on this, but it’s very rough
  2. Make the ‘sort by’ dropdown a row of links
  3. Turn the ‘Narrow my search’ button/page into a faceted navigation menu with options that make sense for the result set (for instance, limiting language to Dutch, Middle (ca. 1050-1350) isn’t going to come into play that much). Also add some logical facets a la Evergreen
  4. Replace the ‘save record’ feature to work during the entire session and be able to save directly to Zotero, Endnote, Bibtex, CiteULike or Connotea.
  5. COinS and UnAPI
  6. Give it the same style as the rest of our new web design.

I’m currently not doing much with the record view page, but I am adding a direct link to the record. I plan on integrating Umlaut responses here, as well as other context sensitive items – especially those that don’t conform well to OpenURL requests.

If you were able to install the Greasemonkey script and want to try it out, go to GIL’s keyword search and try:

  1. senate hearings — this is a good example of multiple mfhds/856es
  2. thomas friedmann — a good example of “Did you mean”

Also try a journal search for “Nature”. Then try whatever floats your boat and let me know how it worked. If you notice that it’s really slow, this is actually because of Voyager. The “Available online” and relevance icons are all rendered dynamically and they just grind the output to a halt. When we go live with this, we’d disable those features in WebVoyage to speed things up.

Fancy-Pants is by no means a final product. I view this as a bridge between what we have and an upcoming Solr based catalog interface. The Solr catalog will still need to interface with Voyager, so Fancy-Pants would transition to that. Ultimately, I would like this whole process to lead to the Communicat.

I have a very conflicted relationship with our archives department. While their projects still need to get done, their services get very little use (especially when compared to other pending projects) and every time I get near any of their projects, it starts to become “Ross Singer and the Tar Baby”. Everything, EVERY SINGLE THING, archives/special collections has ever created is arcane, dense, and enormously time consuming. It always seems simple and then it somehow always turns into EAD.

About a year ago, I was tasked with developing something to help archives publish their recently converted EAD 2002 finding aids to HTML. I knew that XSLT stylesheets existed to do the heavy lifting, so I thought it wouldn’t be too hard to get this up and running.

But it never works out that way with archives. The stylesheets that did what they liked worked only in Saxon, and Saxon only works with Java (well, now with .NET, too, but that still didn’t help at the time). I wasn’t going to take on a language that I only poke at with a long stick on the most ambitious of days for some archives project (no offense, archives, but come on…). My whining caught the attention of Jeremy Frumkin, who pointed me at a project that Terry Reese had done. It was a PHP/MySQL project that took the EAD, called Saxon from the command line and printed the resulting HTML. It also indexed all these parts of the finding aid and put them in a MySQL database to enable search. I could never get the search part to work very well with our data, so I gave up on that part, focusing instead on the Saxon->XSLT->cache to HTML part.

I set up a module in our intranet that let the archivists upload xml files, preview the result with the stylesheet, rollback to an earlier version, etc. No matter what, though, this was a hack. And, most importantly, I never got search working. Also, the detailed description (the dsc) was incredibly difficult to get to display how archives wanted it.

Another thing that nagged at me was, for all this work I (and they) had invested in this, how was this really any better than the HTML finding aids they were migrating from? They put all this work into encoding the EAD and all that we were really doing was displaying a web page that was less flexible in its output than the previous HTML.

This summer, I met with archives to discuss how we were going to enable searching. Their vision seemed pretty simple to implement.

Why do I keep falling for that? How many hours had I already invested in something I thought would be trivial?

The new system was (of course) built in Rails. The plan was to circumvent XSLT altogether so:

  1. The web designer could have more control over how things worked without punching the XSLT tarbaby.
  2. We could make some of that stuff in the EAD “actionable”, like the subject headings, personal names, corporate names, etc.
  3. We could avoid the XSLT nightmare that is displaying the Box/Folder/etc. list.

The fulltext indexing would be provided by Ferret. I thought I could hammer this out in a couple of weeks. I think dumb things like this a lot.

The infrastructure went up pretty quickly. Uploading, indexing, and displaying the browse lists (up until now, they had to get the web designer to add their newly encoded finding aid to the various pages to link to it) all took a little over a week, maybe. Search took maybe another week. For launch purposes, I wasn’t worried about pagination of search results. There were only around 210 finding aids in the system, so a search for “georgia” (which cast the widest net) didn’t really make a terribly unmanageable results page. That’s the nice thing about working with such a small dataset that’s not heavily used. Inefficiencies don’t matter that much. I’ve since added pagination (took a day or two).

No, like last time, the real burden was displaying the finding aid. My initial plan, parsing the XML and divvying it up into a bunch of Ruby objects, was taking a lot of time. EAD is inordinately difficult to put into logical containers that have easily usable attributes. That’s just not how EAD rolls. I found I was severely delaying launch trying to shoehorn my vision on the finding aid, which archives didn’t really care about, of course. They just wanted searching. So, while I was working out my EAD/Ruby object modeler, I deferred to XSLT to get the system out the door.

Rather than using Saxon this time, I opted for Ruby/XSLT (based on libxml/libxslt) for the main part of the document and ruby scripting/templates for the detailed collection list. The former worked pretty well (and fast!) but the latter was turning into a nightmare of endless recursion. When I tried looping through all of the levels (EAD 2002 can have 12 levels of nesting, c01-c12, starting at any number, I think, and going to any depth), my vain attempts either exposed the horrid performance of REXML (a native Ruby XML parser) or navigated the recursion in ways that would leave you clutching at the fibers of sanity that remain when you get this far in an EAD document.

Finally I found what I thought was my answer: a nifty little Ruby library called CobraVsMongoose that would transform a REXML document to a Ruby hash (or vice-versa). It was unbelievably fast and made working with this nested structure a WHOLE lot easier. There was some strangeness to overcome (naturally). For instance, if an element’s child nodes include an element that appears more than once, it will nestle the children in an array. If there are no repeating elements, it will put the child nodes in a Hash (or the other way around, I can’t remember), so you have to check what the object is and process it accordingly. Still, it was fast, easy to manipulate and allowed me to launch the finding aids system.
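
The array-or-hash check can be hidden behind one small helper. This is a minimal sketch: the two hashes are hand-written stand-ins for the kind of structure such a conversion might produce (an assumed shape, not actual CobraVsMongoose output), and `as_list` and `titles` are hypothetical helper names.

```ruby
# Hand-written stand-ins for converted XML (assumed shape, with element text
# stored under a "$" key). One has a single c02 child, the other has two.
single = { "c02" => { "unittitle" => { "$" => "Correspondence" } } }
repeated = { "c02" => [
  { "unittitle" => { "$" => "Correspondence" } },
  { "unittitle" => { "$" => "Photographs" } }
] }

# Always hand back an array. Kernel#Array is avoided on purpose: called on a
# Hash it would split the hash into [key, value] pairs.
def as_list(value)
  case value
  when nil then []
  when Array then value
  else [value]
  end
end

# With the helper in place, callers no longer care which form they were given.
def titles(node)
  as_list(node["c02"]).map { |child| child["unittitle"]["$"] }
end

single_titles = titles(single)
repeated_titles = titles(repeated)
empty_titles = titles({})
```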

Everybody was happy.

And then archives actually looked at the detailed description. Anything in italics or bold or in those weird ‘emph’ tags wasn’t being displayed. Ok, no problem, I just need to mess with the CobraVsMongoose function… oh wait. Yeah, what I hadn’t really thought about was that Ruby hashes don’t preserve sort order, so there was no way to get the italicized or bolded or whatevered text to display where it was supposed to in the output.
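
To see why order matters here, a small sketch that walks the child nodes of an element directly with REXML, where text and inline elements come back in document order. The sample title and the `render_inline` helper are made up for illustration, and HTML escaping is left out to keep it short.

```ruby
require "rexml/document"

# Made-up sample: text, an inline emph element, then more text.
xml = '<unittitle>Letters to <emph render="italic">The Atlanta Constitution</emph>, 1952</unittitle>'
doc = REXML::Document.new(xml)

# Walk the child nodes in document order so inline markup stays where it was
# written. A hash keyed by element name has no slot for that interleaving.
def render_inline(element)
  element.children.map do |node|
    case node
    when REXML::Text
      node.value
    when REXML::Element
      inner = render_inline(node)
      node.name == "emph" ? "<em>#{inner}</em>" : inner
    else
      "" # comments and processing instructions are skipped
    end
  end.join
end

html = render_inline(doc.root)
```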

Damn you, tarbaby!

Back to the drawing board. I decided to return to REXML, emboldened now by (what I thought was) a better handle on the dsc structure and some better approaches to the recursion. Every c0n element would get mapped to a Ruby object (Collection, Series, Subseries, Folder, Item, etc.) and nest as children to each other and have partial views that would display them properly.
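
A minimal sketch of that kind of mapping, assuming a single generic `Component` struct labelled by the EAD `level` attribute in place of separate Series/Folder/Item classes. The struct, the helper names and the sample XML are all hypothetical; the real finding aids system is not shown in this post.

```ruby
require "rexml/document"

# Hypothetical model: one node type, labelled by the EAD level attribute.
Component = Struct.new(:level, :title, :children)

# Direct child elements only, skipping text nodes and comments.
def child_elements(element)
  element.children.select { |node| node.is_a?(REXML::Element) }
end

# Numbered components (c01, c02, ...) directly under this element.
def components(element)
  child_elements(element).select { |e| e.name =~ /\Ac\d\d\z/ }
end

# Recursively turn a c0n element into a Component with nested children.
def build_component(element)
  did = child_elements(element).find { |e| e.name == "did" }
  title = did && child_elements(did).find { |e| e.name == "unittitle" }
  Component.new(
    element.attributes["level"] || "unknown",
    title ? title.text.to_s.strip : "",
    components(element).map { |e| build_component(e) }
  )
end

# Flatten a component tree into indented lines, one per component.
def outline(component, depth = 0)
  ["#{'  ' * depth}#{component.level}: #{component.title}"] +
    component.children.flat_map { |child| outline(child, depth + 1) }
end

xml = <<~XML
  <dsc>
    <c01 level="series">
      <did><unittitle>Correspondence</unittitle></did>
      <c02 level="file">
        <did><unittitle>Letters, 1952</unittitle></did>
      </c02>
    </c01>
  </dsc>
XML
tree = components(REXML::Document.new(xml).root).map { |e| build_component(e) }
lines = outline(tree.first)
```

In a Rails app, each component would presumably be handed to a partial where `outline` builds a string here.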

On my first pass, it was taking 28 seconds for our largest finding aid to display. TWENTY EIGHT SECONDS?! I would like to note, as finding aids go, ours aren’t very large. So, after tweaking a bit more (eliminating checking for children in Item elements, for example), I got it down to 24 seconds. Still not exactly the breathtaking performance I had hoped for.

What was bugging me was how quick CobraVsMongoose was. Somehow, despite also using REXML, it was fast. Really fast. And here my implementation seemed more like TurtleVsSnail. I was all set to turn to Libxml-Ruby (which would require a bunch of refactoring to migrate from REXML) when I found my problem and its solution. This was last night at 11PM.

While poring over the REXML API docs, I noticed that the REXML::Element.each_element method’s argument was called ‘xpath’. Terry had written about how dreadfully slow XPath queries were with REXML and, as a result, I thought I was avoiding them. When I removed the path arg from the each_element call in one of my methods and just iterated through each child element to see if its name matched, it cut the processing time in half! So, while 12 seconds was certainly no thoroughbred, it was definitely the right track. When I eliminated every xpath in the recursion process, I got it down to about 5 seconds. Add a touch of fragment caching and the natural performance boost of a production vs. development site in rails, and I think we’ve got a “good enough for now” solution.
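
The two styles look like this side by side. It is a sketch on a tiny made-up fragment, so it only shows that both loops select the same elements and makes no timing claims.

```ruby
require "rexml/document"

# Made-up fragment: a c01 with two c02 children among other elements.
xml = "<c01><did/><c02 id='a'/><scopecontent/><c02 id='b'/></c01>"
parent = REXML::Document.new(xml).root

# With an argument: the string is treated as an XPath expression.
via_xpath = []
parent.each_element("c02") { |el| via_xpath << el.attributes["id"] }

# Without an argument: visit every child element and compare names by hand.
via_names = []
parent.each_element { |el| via_names << el.attributes["id"] if el.name == "c02" }
```

Both loops pick out the same two elements, so switching from one to the other inside the recursion should not change what gets displayed.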

There’s a little more to do with this project. I plan on adding an OpenSearch interface (I have all the plumbing from adding it to the Umlaut), an OAI provider and an SRU interface (when I finally get around to porting that CQL parser). And, yeah, finishing the EAD Object Model.

But right now, archives and I need to spend a little time away from each other.

In the meantime, here’s the development site.