Monthly Archives: August 2005

I just caught the end of Dark City. I love this movie, despite its cheesiness, and I think where you stand on this flick speaks volumes about you as a nerd.

In general, I see the computer geek community as made up of two camps. The first is the Slashdot community, the engineering (and engineering-aspirant) type: efficiency, economy, and practicality rule above aesthetics. Things that kick ass are valued more than objects of elegance. Principle carries more weight than pragmatism.

Then there is the other group. This is a nerd set that values form as well as function. Perfection gives way to pragmatism. Strong coding skills aren’t necessary (they help, of course), because “rules” are an “impediment to creativity”. “Innovation” is the watchword above “propriety”. Thankfully, I would place the vast majority of #code4lib in this arena.

“Thankfully”, for two reasons:

  • It shows hope for the library development community
  • I hang out there all day, and I tend to dislike the former group

Another way to look at this schism is “The Matrix” vs. “Dark City”. Both movies are based on the same premise: Cartesian logic. They both center around thinking things that cannot be certain that anything exists besides themselves. The difference is that one is a kick-ass blockbuster smash and the other is a low-budget cult classic.

You can enjoy both movies (I certainly do), but if you laugh more at “I know kung-fu” than at seeing the wires attached to Rufus Sewell as he fights with the bad guy who looks like Reducto from “Harvey Birdman: Attorney at Law”, you fall squarely in the innovator camp rather than the engineer camp. What is your Keanu tolerance? It says a lot.

Although I’m not sure why, the world does need both “The Matrix” and “Dark City” fans. They serve different purposes.

But when you’re recruiting a geek, it’s good to know what you’re getting. Try the “The Matrix” vs. “Dark City” question and evaluate from there.

I’ve mentioned several times in this space the OPAC redesign project that Art and I are working on. There hasn’t really been anything to show, to date, because it’s taken a very long time to actually get the data out of Voyager. There are probably easier and faster ways we could have done this, but we’ve been a little bogged down trying to get this to work in Art’s WebDAV environment. This has required sucking the data out of Oracle according to LCC, and that’s been no easy task. GovDocs are in a different hierarchy, based on SUDOC (CODOC, for Art).

In the meantime, I get emails from Art at 12:30 at night, 7:30 in the morning that say things like:

I am woefully weak on python but I know you have been working with python lately and I wondered if the approach I am using makes sense. I am persisting date modified information with a python shelve. So it looks like:

shelf[url] = last_modified

This seems to work wonderfully, but I needed to add:

import dumbdbm

for the shelf to have somewhere to put the info. What I think is supposed to happen is that the shelf command looks for some sort of database option and cycles through them all looking for storage. The “import dumbdbm” seems to be a way to add an option if no other is found. Have you ever tried anything like this? I wanted to use pickle/cpickle but a million links would probably throttle it.

… I, of course, have no idea what he’s talking about, but it’s flattering nonetheless that he thinks I might.
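For anyone else puzzling over that email, here is a minimal sketch of the pattern Art describes, written against current Python rather than whatever he is running. It assumes the modern standard library, where `dumbdbm` has become `dbm.dumb` and `shelve.open` falls back to it on its own when no other dbm backend is installed, so no extra import is needed. The function names, the URL, and the date format are made up for illustration.

```python
import os
import shelve
import tempfile


def has_changed(shelf_path, url, last_modified):
    """Return True if no date is stored for url, or a different one is."""
    with shelve.open(shelf_path) as shelf:
        return shelf.get(url) != last_modified


def record_last_modified(shelf_path, url, last_modified):
    """Persist the date under the URL, as in shelf[url] = last_modified."""
    with shelve.open(shelf_path) as shelf:
        shelf[url] = last_modified


if __name__ == "__main__":
    # Hypothetical URL and date, stored in a throwaway directory.
    shelf_path = os.path.join(tempfile.mkdtemp(), "last_modified")
    url = "http://example.org/catalog/record/1"
    print(has_changed(shelf_path, url, "2005-08-01"))
    record_last_modified(shelf_path, url, "2005-08-01")
    print(has_changed(shelf_path, url, "2005-08-01"))
```

Because a shelf is keyed on disk, only the entries you touch are read or written, which is presumably why it appeals over one giant pickle of a million links.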

Anyway, last week I started actually working with PyLucene and our metadata mirror files (Art, meanwhile, is doing similar work with Cocoon/Lucene) and I came across what is possibly the most useful byproduct of this project. While I was preparing the logic to frbrize the mirror data, it struck me that it doesn’t have to be perfect, at first.

By separating the data from the ILS, we can create any kind of interface we want, indeed several, should we choose, without worrying about affecting the backend system at all. We can combine records, add metadata as necessary, remove it if it doesn’t work properly, tweak our search algorithms, and incorporate it into any sort of system we want, because it would have absolutely no effect on the ILS itself. We’ll still have the original “authority” should we mess anything up too badly and we’ll have all kinds of value that couldn’t (and probably shouldn’t) go in a “conventional opac”.
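To make the “doesn’t have to be perfect” point concrete, here is a rough sketch of combining records in a mirror copy. It is not our actual code and it is nowhere near real FRBR: it just groups records whose normalized author and title match. The record fields (`id`, `author`, `title`) are assumptions for the sake of the example.

```python
import re
from collections import defaultdict


def work_key(record):
    """Build a crude grouping key from author and title (an assumed heuristic)."""
    text = f"{record.get('author', '')} {record.get('title', '')}".lower()
    # Collapse punctuation and whitespace so cataloging variants line up.
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def group_into_works(records):
    """Map each work key to the ids of the mirror records that share it."""
    works = defaultdict(list)
    for record in records:
        works[work_key(record)].append(record["id"])
    return dict(works)


if __name__ == "__main__":
    mirror = [
        {"id": "b1", "author": "Hemingway, Ernest", "title": "The Old Man and the Sea"},
        {"id": "b2", "author": "Hemingway, Ernest.", "title": "The old man and the sea /"},
        {"id": "b3", "author": "Melville, Herman", "title": "Moby Dick"},
    ]
    for key, ids in group_into_works(mirror).items():
        print(key, ids)
```

If the key turns out to be too greedy or too timid, you change `work_key` and regroup. The records in the ILS never know anything happened.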

This sort of abstraction from the “inventory control system” is such a basic programming principle that I have to wonder why no vendors implement it (even I, as an untrained hacker, understand the importance of this). As an added bonus, it also abstracts the user interface from the catalogers a bit. Catalogers are great for many things, but designing user interfaces generally isn’t one of them.

Despite our waning patronage (both physically and virtually), librarians never cease their criticism of the barbarism of the unwashed masses for not adopting their love of rich metadata.

“Dumbing down the catalog”
“I don’t think it’s too much to ask a student to learn what the library catalog is”
“Thousands of hits”
“Did A9 even bother to look at SRW/U?”

Let’s take the first (widely used) statement. A system that is able to take a natural language query and present to the user a list that contains many of the things they are looking for early in the result set is not dumb. Hemingway, Ernest is dumb. Not understanding what I, the user, mean when I type “Ernest Hemingway” is dumb. This standard is applied to librarians; why not the catalog? A librarian doesn’t explicitly require the patron to know what they’re looking for before helping them with a reference question, but we expect the patron to form a perfect Boolean query to isolate that rare manuscript (acquired in 1963 and widely unheard of) that would be the “perfect complement to their term paper”.

Number two: I don’t expect a student to learn how to use a slide rule, either. It’s not necessary for them to know what double clutching is. It wouldn’t be the end of the world if they have never seen a typewriter’s correction ribbon. Technology makes awkward systems obsolete.

With regard to the “thousands of hits” meme (which Alane Wilson argued against quite convincingly), how many hits would a user get if all of our databases were searched simultaneously? What if they are getting a sufficiently small set of results, but only because they’re looking in the wrong place? I am seldom unhappy with my Google results as a starting place.

As for the last one: should A9 have? Does SRW/U really make any sense whatsoever to 95% of the world outside of libraries? Why doesn’t the SRW/U crowd try to work with the OpenSearch community? Because we say ours is better, so the other shouldn’t be trifled with.

To be clear, it’s possible to layer OpenSearch on top of SRU; Georgia Tech does it. Is one superior to the other? SRW/U is certainly more sophisticated. Despite what you will read to the contrary, however, OpenSearch is much, much easier to implement. If you know the metadata schema of the SRW/U server, simple SRU clients are possible, but, as with Z39.50 before it, there are no constraints on what you might get from an SRW/U server.

OpenSearch, while limited and limiting (for certain), has a somewhat different purpose than SRW/U. SRW/U is a protocol for searching for and retrieving metadata. OpenSearch is a spec for searching for and retrieving search results. This may sound redundant, but there is a nuanced difference. No matter the OpenSearch source, the results will always look the same, so it is very simple to integrate into a display (yet not so simple to actually do anything else with the result). While SRW/U is definitely more versatile, transforming your results to OpenSearch has its advantages. But this is a hard sell to the library world, because the “metadata isn’t rich enough”.
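To show the difference in what a client has to know up front, here is a small sketch that builds a request URL each way. The hostnames, the `dc.title` index, and the function names are assumptions for illustration, not anybody’s real service. The SRU parameters (`operation`, `version`, `query`, `maximumRecords`) and the OpenSearch `{searchTerms}` placeholder come from the respective specs.

```python
from urllib.parse import quote_plus, urlencode


def sru_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve URL; the caller must already know the server's CQL indexes."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"


def opensearch_url(template, terms):
    """Fill the {searchTerms} placeholder of an OpenSearch URL template."""
    return template.replace("{searchTerms}", quote_plus(terms))


if __name__ == "__main__":
    # Hypothetical endpoints, shown only to compare the two request styles.
    print(sru_url("http://sru.example.edu/catalog", 'dc.title = "dark city"'))
    print(opensearch_url("http://opac.example.edu/search?q={searchTerms}&format=rss", "dark city"))
```

The SRU request needs a query written against indexes that vary by server, and what comes back depends on the record schema. The OpenSearch request is just words in a slot, and what comes back is a feed.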

It’s time we stopped scorning and ignoring the outside world, because they are doing fine without us. Aaron Krowne notes that a huge amount of scholarly content is freely available, which further weakens our position in society and makes it all the more important that we co-opt popular culture, rather than ridicule it. Our standards are great… now let’s see how they can interface with the real world.

Fairly recently on the Web4Lib mailing list, a thread started by Jim Campbell (of UVA), David Walker (of Cal State San Marcos), and others prompted me to ask what the role of the OPAC is in the modern library.

Outside of “Inventory Control System”, I don’t feel like I got a very good or meaningful response.

I have been thinking a lot about something Karen Schneider wrote a while ago about the need for search interfaces to be search/browse. By this, I mean you begin your session by typing some words in a box and your interface adapts itself contextually to the results and what you should be looking at, so your “browse” options would be logical based on the context of your results.

If your terms were to bring back government documents, say, you would also have the ability to browse our GovDocs research guide or email our GovDocs librarian. If your search brought back a database (for example, ABI/Inform), then the page should also link to the subject guide that includes ABI/Inform (in this case, the Business guide).

This, of course, requires that the “library website” be in a format that makes it potentially servable in this manner. For our site, I have proposed that our content be broken down into small sections (rather than pages) that can be classified and served as necessary.

If you are an undergrad and one of your results happens to be one of your reserves items (which I’ll get to in a minute), there’s not much need to see the faculty policies for placing something on reserve. There is a use, however, in seeing the circulation policies regarding said reserve as appropriate to an undergrad.

If your search results in a journal that we get through an aggregator that sucks (meaning Lexis-Nexis or Factiva or their ilk), present tutorials on getting to the journal through that aggregator (or just a tutorial, in general).

Searches should be weighted contextually, as well. Objects that appear in your reserves lists or subject disciplines should have more relevance than other things. Circulation/clickthroughs should boost relevance (although I realize that non-circulating items present a problem here).
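As a back-of-the-napkin sketch of what that weighting might look like, assuming each hit arrives with a base score from the search engine: the boost values, field names, and the shape of the user profile below are all invented for illustration, and would need tuning against real data.

```python
def contextual_score(hit, user, reserve_boost=2.0, subject_boost=1.5, use_weight=0.01):
    """Scale a hit's base score by the user's context and the item's usage."""
    score = hit["score"]
    if hit["id"] in user["reserves"]:
        score *= reserve_boost
    if hit["subject"] in user["subjects"]:
        score *= subject_boost
    # Non-circulating items simply have no circulations; clickthroughs still count.
    usage = hit.get("circulations", 0) + hit.get("clickthroughs", 0)
    return score * (1 + use_weight * usage)


def rerank(hits, user):
    """Return hits ordered by contextual score, highest first."""
    return sorted(hits, key=lambda hit: contextual_score(hit, user), reverse=True)


if __name__ == "__main__":
    user = {"reserves": {"b"}, "subjects": {"business"}}
    hits = [
        {"id": "a", "score": 1.0, "subject": "history"},
        {"id": "b", "score": 0.8, "subject": "business", "circulations": 10},
    ]
    for hit in rerank(hits, user):
        print(hit["id"], contextual_score(hit, user))
```

Multiplying rather than adding keeps the engine’s own relevance in play, so a reserve item that barely matches the query still doesn’t leapfrog everything.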

The important thing I want to see is the relationship between objects and content. My search brings back a journal. Besides the obvious information I want to know about the thing (esp. things not included, like, what is it about?), tell me what databases index this thing; what other journals are similar; what is the current ToC (if available via RSS); are there preprints from this journal in our institutional repository/ETDs; etc. If there is any library-created content related to a particular object, I want that, too.

I want to break down the silos between our resources and content and different collections.

And, yes, I think articles and other database content should be included in that as well (if you have the credentials to view them – if not, an indication of what you’d see if you were logged in).

If we don’t include the entirety of our collections, I am not entirely sure what the purpose of the catalog is.

I notice a lot of the ‘blogs I read regularly have recently had a similar posting to this one:

Wow, it’s been so long since I last posted. Well, it’s time to catch up.

I suspect for many, summer is the busy time of year. This is when you have to pack in all of the important projects before fall semester begins. It’s also conference season and, hopefully, you might be able to sneak in a vacation (oh well, two out of three for me).

Anyway, here’s to getting this blog beast out of hibernation.