Polishing the Turd

If YPOW, like MPOW, is an Endeavor Voyager site, you’ve got some decisions ahead. Francisco Partners, naturally, would like you to migrate to Aleph, and I have no doubt that Ex Libris is, as I write this, busily working on a means to make that easy for Voyager libraries to do. But ILS migrations are painful, no matter how easy the backend process might be. There’s staff training, user training, managing new workflows, site integration; lots of things to deal with. Also, your functionality may not be a 1:1 relationship to what you currently have. How do you work around services you depended upon?

Since soon our contracts with Endeavor Information Systems will be next to worthless, I propose, Voyager customers, that we take ownership of our systems. For the price of a full Oracle (or SQL Server? — does Voyager support other RDBMSes?) license (many of us already have this), we can get write permissions to our DB and make our own interfaces. We wouldn’t need to worry about staff clients (for now), since we already have cataloging, circulation, acquisitions, etc. modules that work. When we’re ready for different functionality, however, we can create a new middleware (in fact, I’m planning to break ground on this in the next two weeks) to allow for web clients or, even better, piggyback on Evergreen’s staff clients and let somebody else do the hard work. If we had native clients in the new middleware, a library could use any database backend they wanted (just migrate the data from Oracle into something else). The key is write access to the database.

By taking ownership of our ILS, we can push developments we want, such as NCIP, a ‘Next Gen OPAC’, better link resolver integration, better metasearch integration, etc. without the pain of starting all over again (with potentially the same results, who is to say that whatever you choose as an ILS wouldn’t eventally get bought and killed off, as well?). Putting my money (or lack thereof) where my mouth is, I plan on migrating Fancy Pants to use such a backend (read only db access, for now, we still have a support contract, after all). I’m calling this project ‘Bon Voyage’. After reading Birkin’s post on CODE4LIB, I would like to make a similar service for Voyager that would basically take the place of the Z39.50 server and access to the database. Fancy Pants wouldn’t be integrated into Bon Voyage, it would just be another client (since it was always only meant as a stopgap, anyway).

What we’ll have is a framework for getting at the database backend (it’d be safe to say this will be a rails project) with APIs to access bib, item, patron, etc. information. Once the models are created, it will be relatively simple to transition to ‘write’ access when that becomes necessary. Making a replacement for WebVoyage would be fairly trivial once the architecture is in place. Web based staff clients would also be fairly simple. I think EG staff client integration wouldn’t be too hard since it would just be an issue of outputting our data to something the EG clients want (JSON, I believe) and translating the client’s reponse. That would need to be investigated more, however (I’m on paternity leave and not doing things like that right now 🙂

Would anybody find this useful?
It seems the money we spend on an ILS could be better spent elsewhere. I don’t think this would be a product we could distribute outside of the the current Voyager customer base (at least, not until it was completely native… maybe not even then- we’d have to work this out with Francisco Partners, I guess), but I think that that is big enough to be sustainable on its own.

I have been working on Fancy-Pants quite a bit in the last couple of weeks. This is an AJAX layer over Voyager’s WebVoyage — an attempt to de-suck-ify its interface a bit. Why is it called Fancy-Pants? Well, Voyager still has the same underwear, it’s just got a new set of britches.

There are two main problems that it’s trying to solve:

  1. For items that have more than one MFHD, WebVoyage won’t show any item information in the title list.
  2. We wanted to link to 856 URLs from the title list.

Now, we’re already doing the second one, but it’s not implemented particularly well. While we were solving those problems, we wanted to see what we could do about that god-awful table based display.

I took NCSU’s Endeca layout as the baseline template for what I wanted the results to look like. Right now, Fancy-Pants can only be accessed via this Greasemonkey script [get Greasemonkey here]. Greasemonkey, of course, wouldn’t be a requirement, but we’re using it to inject the initial javascript call since we’re having to work on a live system.

For the title list screen, the javascript is looping through the bib ids on the page (it grabs them from the ‘save record’ checkboxes) and sends them to a Ruby on Rails app that queries Voyager’s Oracle database and builds a new result set. The javascript hides the original page results (display: none) and inserts a div with the new results. If there are multiple 856es or locations, the result has expanding/collapsing divs to show/hide them.

I send the query terms to Yahoo’s spell check API and will return a link to any suggestions it gives. No, this isn’t the ideal, but I’m still in proof-of-concept stage.

Things I still want to do with title list screen are:

  1. Come up with a way to show what the item is (journal, microform, map, etc.) — I’ve started on this, but it’s very rough
  2. Make the ‘sort by’ dropdown a row of links
  3. Turn the ‘Narrow my search’ button/page into a faceted navigation menu with options that make sense for the result set (for instance, limiting language to Dutch, Middle (ca. 1050-1350) isn’t going to come into play that much). Also add some logical facets a la Evergreen
  4. Replace the ‘save record’ feature to work during the entire session and be able to save directly to Zotero, Endnote, Bibtex, CiteULike or Connotea.
  5. COinS and UnAPI
  6. Give it the same style as the rest of our new web design.

I’m currently not doing much with the record view page, but I am adding a direct link to the record. I plan on integrating Umlaut responses here, as well as other context sensitive items – especially those that don’t conform well to OpenURL requests.

If you were able to install the Greasemonkey script and want to try it out, go to GIL’s keyword search and try:

  1. senate hearings — this is a good example of multiple mfhds/856es
  2. thomas friedmann — a good example of “Did you mean”

Also try a journal search for “Nature”. Then try whatever floats your boat and let me know how it worked. If you notice that it’s really slow, this is actually because of Voyager. The “Available online” and relevance icons are all rendered dynamically and they just grind the output to a halt. When we go live with this, we’d disable those features in WebVoyage to speed things up.
Fancy-pants is by no means a final product. I view this as a bridge between what we have and an upcoming Solr based catalog interface. The Solr catalog will still need to interface with Voyager, so Fancy-pants would transition to that. Ultimately, I would like this whole process to eventually lead to the Communicat.

I’ve been waiting for a while to have this title. Well, actually, not a long while, and that’s testimony to how quickly I’m able to develop things in Rails.

While I think SFX is fine product and we are completely and utterly dependent upon it for many things, it does still have its shortcomings. It is not a terribly intuitive interface (no link resolver that I’m aware of has one) and there are some items it just doesn’t resolve well, such as conference proceedings. Since conference proceedings and technical reports are huge for us, I decided we needed something that resolved these items better. That’s when the idea of the übeResolver (now mainly known as ‘the umlaut’) was born.

Although I had been working with Ed Summers on the Ruby OpenURL libraries before Code4Lib 2006, I really began working on umlaut earlier this month when I thought I might have something coherent together in time before the ELUNA proposal submission deadline. Although I barely had anything functional on the 8th (the deadline — 2 days after I had really broken ground), I could see that this was actually feasible and doable.

Three weeks later and it’s really starting to take shape (although it’s really, really slow right now). Here are some examples:

The journal ‘Science’

A book: ‘Advances in Communication Control Networks’

Conference Proceeding

Granted, the conference proceeding is less impressive as a result of IEEE being available via SFX (although, in this case, it’s getting the link from our catalog) and the fact that I’m having less luck with SPIE conferences (they’re being found, but I’m having some problems zeroing in on the correct volume — more on that in a bit), but I think that since this is the result of < 14 days of development time, it isn't a bad start. Now on to what it's doing. If the item is a "book", it queries our catalog for ISBN; asks xISBN for other matches, queries our catalog for that; does a title/author search; does a conferenceName/title/year search. If there are matches, it then asks the opac for holdings data. If the item is either not held or not available, it does the same to our consortial catalog. Currently it’s doing both, regardless, because I haven’t worried about performance.

It checks the catalog via SRU and tries to fill out the OpenURL ContextObject with more information (such as publisher and place). This would be useful to then export into a citation manager (which most link resolvers have fairly minimal support for). While it has the MODS records, it also grabs LCSH and Table of Contents (if they exist). When I find an item with more data, I’ll grab it as well (such as abstracts, etc.).

It then queries Amazon Web Services for more information (editorial content, similar items, etc.).

It still needs to check SFX, but, unfortunately, that would slow it down even more.

For journals, it checks SFX first. If there’s no volume, issue, date or article title, it will try to get coverage information. Unfortunately, SFX’s XML interface doesn’t send this, so I have to get this information from elsewhere. When I made our Ejournal Suggest service, I had to create a database of journals and journal titles and I have since been adding functionality to it (since I am running reports from SFX for titles and it includes the subject associations, I load them as well — it includes coverage, too, so including that field was trivial). So when I get the SFX result document back, I parse it for its services (getFullText, getDocumentDelivery, getCitedBy, etc.) and if no article information is sent, I make a web service request to a little PHP/JSON widget I have on the Ejournal Suggest database that gets back coverage, subjects and other similar journals based on the ISSN. The ‘other similar journals’ are 10 (arbitrary number) other journals that appear in the same subject headings, ordered by number of clickthroughs in the last month. This doesn’t appear if there is an article, because I haven’t decided if it’s useful in that case (plus the user has a link to the ‘journal level’ if they wish).

Umlaut then asks the opac for holdings and tries to parse the holdings records to determine if a specific issue is held in print (this works well if you know the volume number — I have thought about how to parse just a year, but haven’t implemented it yet). If there are electronic holdings, it attempts to dedupe.

There is still a lot more work to do with journals, although I hope to be able to implement this soon. The getCitedBy options will vary from grad students/faculty to undergrads. Since we have very limited seats to Web of Science, undergraduates will, instead, get their getCitedBy links to Google Scholar. Graduate students and faculty will get both Web of Science and Google Scholar. Also, if no fulltext results are found, it will then go out to the search engines to try to find something (whether it finds the original item or a postprint in or something). We will also have getAbstracts and getTOCs services enabled so the user can find other databases that might be useful or table of content services, accordingly. Further, I plan on associating the subject guides with SFX Subjects and LCC, so we can make recommendations from a specific subject guide (and actually promote the guide a bit) based, contextually, by what the user is already looking at. By including the SFX Target name in the subject items (which is an existing field that’s currently unused), we could also match on the items themselves.
The real value in umlaut, however, will come in its unAPI interface. Since we’ll have Z39.88 ContextObjects, MODS records, Amazon Web Services results and who knows what else, umlaut could feed an Atom store (such as unalog) with a whole hell of a lot of data. This would totally up the ante of scholarly social bookmarking services (such as Connotea and Cite-U-Like) by behaving more like personal libraries that match on a wide variety of metadata, not just url or title. The associations that users make can also aid umlaut in recommendations of other items.

The idea here is not a replacement of the current link resolver, the intention is to enhance it. SFX makes excellent middleware, but I think it’s interface leaves a bit to be desired. By utilizing its strength, we can layer more useful services on top of it. Also, a user can add other affiliations that they belong to in their profile, so umlaut can check their local public library or, if they are taking classes at another university, they can include those.

At this point I can already hear you saying, “But Ross, not everyone uses SFX”. How true! I propose a microformat for link resolver results that could be parseable by umlaut (and in an ‘eating your own dog food’ fashion, will add this to umlaut’s template, eventually), making any link resolver available to umlaut.

There is another problem that I’ve encountered while working on this project, though, too. Last week and the week before, while I was doing the bulk of the SRU development, I kept on noticing (and reporting) our catalog (and, more often, it’s Z39.50 server) going down. Like many times a day. After concluding that, in fact, I was probably causing the problem, I finally got around to doing something that I’ve been meaning to do for months (and I would recommend to everyone else if they want to actually make useful systems): exporting the bib database into something better. Last week I imported our catalog into Zebra and sometime this week I will have a system that syncs the database every other hour (we already have the plumbing for this for our consortial catalog). I am also experimenting with Cheshire3 (since I think it’s potential is greater — it’s possible we may use both for different purposes). The advantage to this (besides not crashing our catalog every half hour) is that I can index it any way want/need to as well as store the data any way I need to in order to make sure that users get the best experience they can.

Going back to the SPIE conferences, there is no way in Voyager that I can limit my results to less than 360+ results for “SPIE Proceedings” in 2003. At least, not from the citations I get from Compendex (which is where anyone would get the idea to look for SPIE Proceedings in our catalog, anyway). With an exported database, however, I could index the volume and pinpoint the exact record in our catalog. Or, if that doesn’t scale (for instance, if they’re all done a little differently), I can pound the hell out our zebra (or cheshire3 or whatever) server looking for the proper volume without worrying about impacting all of our other services. I can also ‘game the system’ a bit and store bits in places that I can query when I need them. Certainly this makes umlaut (and other services) more difficult to share to other libraries (at least, other libraries that don’t have similar setups to ours), but I think these sorts of solutions are essential to improving access to our collections.

Oh yeah, and lest you think that mirroring your bib database is too much to maintain: Zebra can import marc records (so you can use your opac’s marc export utility) and our entire bib database (705,000 records) takes up less than 2GB of storage. The more indexes added, the larger the database size, of course, but I am indexing a LOT in that.

I have mentioned here several times the “alternative to the catalog” project I am trying to implement at Tech. One of the problems that I’ve had is naming the project something that lets people realize what I’m talking about, without the political hairiness of saying “catalog replacement” (since that’s technically not true, anyway).

In a meeting two weeks ago (about subject guides), I was drawing the concept of this project on the whiteboard of our conference room. It’s been up ever since and in the middle, I had written “ALTOpac” because that was an easy way to loosely describe it in a way that the uninitiated in the room could envision where I was starting from. Sitting in another meeting today, the capitalized letters jumped out at me: ALTO. It means nothing.

And I like that. Of course it still doesn’t explain what it’s about. That’s what subtitles are for.

Now, let me explain what the hell Alto is and what it is supposed to do.

Alto is a “community-based collection builder and search engine”.

Come to think of it, that might not actually clear anything up.

Let’s back up a bit, shall we?

To say searching the catalog is “searching our collection” is quite arbitrary and false. Metasearch doesn’t really solve this problem, since you’d still only point the metasearch engine at certain assets and it’s non-trivial to make relationships between assets. Metasearch is part of the solution, but hardly the panacea.

Again, our “collection” is an ambiguous term and shouldn’t be solely determined by our collection development policies/budget. It is our opinion that if something is important enough to be added to a reserves list (even a web page), it should technically be part of our collection. I would not, however, say it should be cataloged (and that’s why this isn’t a catalog replacement project, see?). If an item is even bookmarked (via a local social bookmarking service, such as unalog or connotea) it should then become part of our collection. A 1927 engineering textbook from Purdue’s catalog? Index it! If a member of our community finds it important enough to want to come back to and share with a group, it’s important enough for us to aggregate into our “collection”. Relevance comes later (keep reading, if you’re interested).

There are also relationships that our community (for the sake of argument, let’s start with “Georgia Tech”) builds that are highly relevant for finding connections between disparate “things”. So, the items put on reserve for a particular course have an umbrella of commonality between them that should be utilized for anyone that runs across any of these items. The relevance ranking should be even greater for a user that happens to be a member of the group in question (for instance, is enrolled in the class).

If Alto has a citation management-esque feature in it, users can very specifically group relevant resources together based on a project. Resources can be anything: books, websites, articles, searches, chat transcripts, trails, you name it.

And all of this should feed the “relevance beast”, as it were.

So that’s some background. Given that we’ll have some formal subject classifications for these objects (from the OPAC or from metasearch or whatever), we should be able to bridge the formal to folksonomy to make sense of how people have classified their saved things.

We can then begin to cluster search results. Format, subject, concept, group, policies… All of these can be browsed after the search begins. The search results will be a combination of metadata objects and library content. If some of the results appear in a given “subject guide”, the guide will a suggested resource (and will, in turn, push some resources into the result set).

The goal is to open the silos we have created around our resources/services. It would break down the ambiguity between “collections”, “services” and “policies” since they’re all interrelated.

How do we plan to do this? Glad you asked (you’re still reading, right?)!

We’ve exported all of the bib records from our catalog. The plan is to use METS as our wrapper around MODS. We’ll then harvest our institutional repository and index our website. That’s a pretty good base to start with. All of this is stored in a dbXML database and indexed with Lucene.

If users want to harvest a collection from citeseer or OAIster, that will be available and will become part of our collection. Annotations, links to reviews, links to content to index will all be made available.

I’m leaving a lot out and glossing some of this over… but it starts to put the idea on “paper” for me to come back later.

I’ve mentioned several times in this space the OPAC redesign project that Art and I are working on. There hasn’t really been anything to show, to date, because it’s taken a very long time to get actually get the data out of Voyager. There are easier and faster ways we could have done this, probably, but we’ve been a little bogged down trying to get this to work in Art’s webdav environment. This has required sucking the data out of Oracle according to LCC and that’s been no easy task. GovDocs are in a different hierarchy, based on SUDOC (CODOC, for Art).

In the meantime, I get emails from Art at 12:30 at night, 7:30 in the morning that say things like:

I am woefully weak on python but I know you have been working with python lately and I wondered if the approach I am using makes sense. I am persisting date modified information with a python shelve. So it looks like:

shelf[url] = last_modified

This seems to work wonderfully, but I needed to add:

import dumbdbm

for the shelf to have somewhere to put the info. What I think is supposed to happen is that the shelf command looks for some sort of database option and cycles through them all looking for storage. The “import dumbdbm” seems to be a way to add an option if no other is found. Have you ever tried anything like this? I wanted to use pickle/cpickle but a million links would probably throttle it.

… I, of course, have no idea what he’s talking about, but it’s flattering nonetheless that he thinks I might.

Anyway, last week I started actually working with PyLucene and our metadata mirror files (Art, meanwhile, is doing similar work with Cocoon/Lucene) and I came across what is possibly the most useful byproduct of this project. While I was preparing the logic to frbrize the mirror data, it struck me that it doesn’t have to be perfect, at first.

By separating the data from the ILS, we can create any kind of interface we want, indeed several, should we choose, without worrying about affecting the backend system at all. We can combine records, add metadata as necessary, remove it if it doesn’t work properly, tweak our search algorithms, and incorporate it into any sort of system we want, because it would have absolutely no effect on the ILS itself. We’ll still have the original “authority” should we mess anything up too badly and we’ll have all kinds of value that couldn’t (and probably shouldn’t) go in a “conventional opac”.

This sort of abstraction from the “inventory control system” is such a basic programming principle that I have to wonder why no vendors implement it (even I, as an untrained hacker understand the importance of this). It also abstracts the user interface from the catalogers a bit — added bonus. Catalogers are great for many things, but designing user interfaces generally isn’t one of them.