Monthly Archives: June 2009

For the last couple of weeks I’ve returned to working on the Alto Jangle connector, at least part-time.  I had shelved development on it for a while; I had a hard time finding anybody interested in using it and had reached a point where the development database I was working against was making it difficult to know what to expect in a real, live Alto system.  After I got wind of a couple of libraries that might be interested in it, I thought I should at least get it in a usable state.

One of the things that was vexing me prior to my hiatus was how to get Sybase to page through results in a semi-performant way.  I had originally blamed it on Grails, then when I played around with refactoring the connector in PHP (using Quercus, which is pretty slick by the way, to provide Sybase access via JDBC — the easiest way to do it) I realized that paging is just outside of Sybase’s capabilities.

And when you’re so used to MySQL, PostgreSQL and SQLite, this sort of makes your jaw drop (although, in its defense, it appears that this isn’t all that easy in Oracle, either — however, it’s at least possible in Oracle).

There seem to be two ways to do something like getting rows 375,000 – 375,099 from all of the rows in a table:

  1. Use cursors
  2. Use SET ROWCOUNT 375100, then loop through and throw out the first 375,000 results.

The first option isn’t really viable.  You need write access to the database and it’s unclear how to make this work in most database abstraction libraries.  I don’t actually know that cursors do anything differently than option 2 besides pushing the looping to the database engine itself.  I was actually using cursors in my first experiments in JRuby using java.sql directly, but since I wasn’t aware of this problem at the time, I didn’t check to see how well it performed.

Option 2 is a mess, but this appears to be how GORM/Hibernate deals with paging in Sybase.  Cursors aren’t available in Quercus’ version of PDO, so it was how I had to deal with paging in my PHP prototypes, as well.  When I realized that PHP was not going to be any faster than Grails, I decided to just stick with Grails (“regular C-PHP” is out — compiling in Sybase support is far too heavy a burden).
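
To make the cost of option 2 concrete, here is a minimal Python sketch of the skip-and-discard approach. It is an illustration only: `fake_table` and `page_by_discard` are hypothetical names, a plain generator stands in for a Sybase result set, and `islice` plays the role of SET ROWCOUNT. It is not meant to show how GORM or Hibernate actually implements paging.

```python
from itertools import islice


def fake_table(total_rows):
    """Hypothetical stand-in for a Sybase result set: yields rows one at a time."""
    for row_id in range(1, total_rows + 1):
        yield {"id": row_id}


def page_by_discard(rows, offset, limit):
    """Emulate SET ROWCOUNT (offset + limit), then throw away the first `offset` rows.

    Returns the requested page and the number of rows that had to be read.
    """
    capped = islice(rows, offset + limit)  # plays the role of SET ROWCOUNT
    page = []
    rows_read = 0
    for rows_read, row in enumerate(capped, start=1):
        if rows_read > offset:  # everything up to the offset is discarded
            page.append(row)
    return page, rows_read


page, rows_read = page_by_discard(fake_table(500_000), offset=375_000, limit=100)
print(page[0]["id"], page[-1]["id"], rows_read)  # 375001 375100 375100
```

The point of the sketch is the last number: to hand back 100 rows, 375,100 rows have to be read, so the work grows with the offset.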

This paging thing still needed to be addressed.  Offsets of 400,000 and more were taking more than twelve minutes to return.  How much more, I don’t know; I killed the request at the 12-minute mark.  While some of this might be the result of a bad or missing index, any way you cut it, it wasn’t going to be acceptable.

I was kicking around the idea of exporting the “models” of the Jangle entities into a local HSQLDB (or whatever) mirror and then working the paging off of that.  I couldn’t help but think that this was sort of a waste, though — exporting from one RDBMS to another solely for the benefit of paging.  You’d have to keep them in sync somehow and still refer to the original Sybase DB for things like relationships and current item or borrower status.  For somebody that’s generally pretty satisfied with hacking together kludgy solutions to problems, this seemed a little too hack-y… even for my standards.

Instead, I settled on a different solution that could potentially bring a bunch of other features along with it.  Searchable is a Grails plugin for Compass, a project to easily integrate Lucene indexes with your Java domain classes (this would be analogous to Rails’ acts_as_ferret).  When your Grails application starts up, Searchable will begin to index whatever models you declared as, well, searchable.  You can even set options to store all of your attributes, even if they’re not actual database fields, alleviating the need to hit the original database at all, which is nice.  Initial indexing doesn’t take long: our “problem” table that took twelve minutes to respond takes less than five minutes to fully index.  It would probably take considerably less than that if the data were consistent (some of the methods to set the attributes can be pretty slow if the data is wonky, since they try multiple paths to find the actual values of the attribute).

What this then affords us is consistent access times, regardless of the size of the offset:  the 4,000th page is as fast as the second:  between 2.5 and 3.5 seconds (our development database server is extremely underpowered and I access it via the VPN — my guess is that a real, live situation would be much faster).

The first page is a bit slower.  I can’t use the Lucene index for the first page of results because there’s no way for Searchable to know if the WORKS_META table has changed since the last request, since these changes wouldn’t be happening through Grails.  Since performance for the first hundred rows out of Sybase isn’t bad, the connector just uses it for the first page, then syncs the Lucene index with the database at the end of the request.  Each additional page then pulls from Lucene.  Since these pages wouldn’t exist until after the Lucene index is created, and the Lucene index is recreated every time the Grails app is started, I added a controller method that checks the count of the Sybase table and the count of the Lucene index to confirm that they’re in sync.  It’s worth noting that if the Lucene index has already been created once, it will be available right away after Grails starts: the reindexing is still happening, but in a temp location that will be moved to the default location once it’s complete, overwriting the old index.
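
The routing described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the connector’s Grails code: `PagedConnector` and its methods are hypothetical names, one list stands in for the Sybase table and another for the Lucene index, and the “sync” is a plain copy rather than a real reindex.

```python
class PagedConnector:
    """Toy model: page 1 comes from the database, later pages from an index."""

    def __init__(self, db_rows, page_size=100):
        self.db_rows = db_rows      # stand-in for the Sybase table
        self.index = []             # stand-in for the Lucene index
        self.page_size = page_size

    def sync_index(self):
        """Bring the index up to date with the database (here: just copy it)."""
        self.index = list(self.db_rows)

    def in_sync(self):
        """Compare row counts, in the spirit of the controller check."""
        return len(self.index) == len(self.db_rows)

    def get_page(self, page_number):
        """Return (source, rows) for a 1-based page number."""
        start = (page_number - 1) * self.page_size
        end = start + self.page_size
        if page_number == 1:
            rows = self.db_rows[start:end]  # the first page is cheap in the database
            self.sync_index()               # pick up changes made outside the app
            return "database", rows
        return "index", self.index[start:end]


connector = PagedConnector(db_rows=list(range(1, 1001)))
print(connector.in_sync())                             # False
source, rows = connector.get_page(1)
print(source, rows[0], rows[-1], connector.in_sync())  # database 1 100 True
source, rows = connector.get_page(4)
print(source, rows[0], rows[-1])                       # index 301 400
```

A count comparison is a cheap check, but it only catches inserts and deletes; an update that leaves the row count unchanged would slip past it.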

The side benefit to using Searchable is that it will make adding search functionality to the Alto connector that much easier.  Building SQL statements from the CQL queries in the OpenBiblio connector was a complete pain in the butt.  CQL to Lucene syntax should be considerably easier.  It seems like it would be possible for these Lucene indexes to eventually alleviate the need for the bundled Zebra index that comes with Alto, but that’s just me talking, not any sort of strategic goal.
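
As a rough idea of why CQL to Lucene looks easier, here is a minimal Python sketch that translates the simplest kind of CQL query. It is an assumption-laden toy, not the connector’s code: `cql_to_lucene` is a hypothetical helper that handles only flat `index = term` clauses joined by and/or/not, with no parentheses, relation modifiers, or escaping of Lucene special characters, and it treats CQL’s `=` as a plain field match.

```python
import re

# Matches `index = term`, where term is a bare word or a double-quoted phrase.
CLAUSE = re.compile(r'(\w[\w.]*)\s*=\s*("[^"]*"|\S+)')
BOOLEAN = re.compile(r"(and|or|not)\b", re.IGNORECASE)


def cql_to_lucene(cql):
    """Translate flat CQL such as `title = fish and author = "smith, j"`.

    Raises ValueError for anything outside that tiny subset.
    """
    cql = cql.strip()
    parts = []
    pos = 0
    expect_clause = True  # clauses and booleans must alternate
    while pos < len(cql):
        if expect_clause:
            match = CLAUSE.match(cql, pos)
            if not match:
                raise ValueError(f"expected 'index = term' at position {pos}")
            parts.append(f"{match.group(1)}:{match.group(2)}")
        else:
            match = BOOLEAN.match(cql, pos)
            if not match:
                raise ValueError(f"expected and/or/not at position {pos}")
            parts.append(match.group(1).upper())
        pos = match.end()
        while pos < len(cql) and cql[pos].isspace():
            pos += 1  # skip whitespace between tokens
        expect_clause = not expect_clause
    if expect_clause:
        raise ValueError("query is empty or ends with a boolean operator")
    return " ".join(parts)


print(cql_to_lucene('title = fish and author = "smith, j"'))
# title:fish AND author:"smith, j"
```

Each CQL clause maps onto a single `field:term` pair and each boolean onto its uppercase Lucene counterpart, which is most of the translation for queries this simple.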

Anyway, thanks to Lucene, Sybase is behaving mostly like a modern RDBMS, which is a refreshing change.

In a world where library management systems are sophisticated and modern…

I was doing some Google searches about SKOS, trying to figure out the exact distinction between skos:ConceptScheme and skos:Collection (it’s much more clear to me now) and I came across this article in

Introducing SKOS

The article is fine, but it’s not what compelled me to write a blog post.  I was struck by a comment on that page titled What about Topic Maps?:

This new W3C standard obviously has a huge overlap with the very mature ISO standard Topic Maps. Topic Maps were originally conceived for (almost) exactly the same problem space as SKOS, and they are widely used. (For example, all major library cataloging software either supports Topic Maps or soon will.)

However, Topic Maps proved to be more generally useful, so they are often compared and contrasted with RDF itself. The surprising difficulty of making Topic Maps and RDF work together is exactly the “extra level of indirection” mentioned by the author of this article about SKOS.

It is very strange that neither this article, nor the referenced XTech paper, mentions Topic Maps.

What is the relationship between SKOS and Topic Maps? How does this fit in with the work (as reported In Edd Dumbill’s blog) on interoperability between Topic Maps and RDF/OWL?

Now, I have no idea if “yitzgale” is some sort of alias of Alexander Johannesen, but let’s assume “no” (for one thing, that comment is far too optimistic about library technology).  The sentence “[f]or example, all major library cataloging software either supports Topic Maps or soon will” is sort of stunning in both the claim it makes and its total lack of accuracy.  I feel pretty confident in my familiarity with library cataloging software and I can say with some degree of certainty that there is no support for topic maps today (hell, MARC21, MFHD and Unicode support are pushing it – and those are just incremental changes).  This comment was written four years ago.

And yet, there’s part of me that feels robbed.  Where is the topic map support in my library system?  I don’t even really know anything about TM, but I still feel it would be a damn sight better than what we’ve got now.  What reality is this that yitzgale is living in, with its fancy library systems and librarians and vendors willing to embrace a radical change in how things are done?  I want in.

I might even be able to jump off my RDF bandwagon for it.