Better Paging Through Better Searching

For the last couple of weeks I’ve returned to working on the Alto Jangle connector, at least part-time.  I had shelved development on it for a while; I had a hard time finding anybody interested in using it and had reached a point where the development database I was working against was making it difficult to know what to expect in a real, live Alto system.  After I got wind of a couple of libraries that might be interested in it, I thought I should at least get it into a usable state.

One of the things that was vexing me prior to my hiatus was how to get Sybase to page through results in a semi-performant way.  I had originally blamed it on Grails, then when I played around with refactoring the connector in PHP (using Quercus, which is pretty slick by the way, to provide Sybase access via JDBC — the easiest way to do it) I realized that paging is just outside of Sybase’s capabilities.

And when you’re so used to MySQL, PostgreSQL and SQLite, this sort of makes your jaw drop (although, in its defense, it appears that this isn’t all that easy in Oracle, either — however, it’s at least possible in Oracle).

There seem to be two ways to do something like getting rows 375,000 – 375,099 from all of the rows in a table:

  1. Use cursors
  2. Use SET ROWCOUNT 375100, then loop through the results and throw out the first 375,000.

The first option isn’t really viable.  You need write access to the database and it’s unclear how to make this work in most database abstraction libraries.  I don’t actually know that cursors do anything differently than option 2 besides pushing the looping to the database engine itself.  I was actually using cursors in my first experiments in JRuby using java.sql directly, but since I wasn’t aware of this problem at the time, I didn’t check to see how well it performed.

Option 2 is a mess, but this appears to be how GORM/Hibernate deals with paging in Sybase.  Cursors aren’t available in Quercus’ version of PDO, so it was how I had to deal with paging in my PHP prototypes, as well.  When I realized that PHP was not going to be any faster than Grails, I decided to just stick with Grails (“regular C-PHP” is out — compiling in Sybase support is far too heavy a burden).
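A rough sketch of what option 2 amounts to, in Python, with an in-memory SQLite table standing in for Sybase.  The table contents, the column names and the function names are all made up for illustration, and this is not how GORM/Hibernate actually implements it; it only shows why the cost grows with the offset.

```python
import sqlite3


def make_demo_db(row_count=1000):
    """Build a throwaway in-memory table (a hypothetical stand-in for WORKS_META)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE works_meta (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO works_meta VALUES (?, ?)",
        [(i, f"Work {i}") for i in range(1, row_count + 1)],
    )
    return conn


def page_by_discarding(conn, offset, limit):
    """Return (page, rows_read) for rows [offset, offset + limit)."""
    if limit <= 0:
        return [], 0
    # On Sybase, SET ROWCOUNT <offset + limit> would cap the result set.
    # SQLite has no such statement, so this stand-in simply stops reading.
    cursor = conn.execute("SELECT id, title FROM works_meta ORDER BY id")
    page = []
    rows_read = 0
    for row in cursor:
        rows_read += 1
        if rows_read > offset:
            page.append(row)  # past the offset: keep the row
        if rows_read >= offset + limit:
            break  # the cap has been reached
    return page, rows_read


if __name__ == "__main__":
    conn = make_demo_db()
    page, rows_read = page_by_discarding(conn, offset=900, limit=100)
    print(len(page), "rows kept out of", rows_read, "rows read")
```

Every row before the offset still has to be fetched and thrown away, so the work done is proportional to offset + limit rather than to the size of the page.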

This paging thing still needed to be addressed.  Offsets of 400,000 and more were taking more than twelve minutes to return.  How much more, I don’t know; I killed the request at the 12-minute mark.  While some of this might be the result of a bad or missing index, any way you cut it, it wasn’t going to be acceptable.

I was kicking around the idea of exporting the “models” of the Jangle entities into a local HSQLDB (or whatever) mirror and then working the paging off of that.  I couldn’t help but think that this was sort of a waste, though — exporting from one RDBMS to another solely for the benefit of paging.  You’d have to keep them in sync somehow and still refer to the original Sybase DB for things like relationships and current item or borrower status.  For somebody that’s generally pretty satisfied with hacking together kludgy solutions to problems, this seemed a little too hack-y… even for my standards.

Instead, I settled on a different solution that could potentially bring a bunch of other features along with it.  Searchable is a Grails plugin for Compass, a project to easily integrate Lucene indexes with your Java domain classes (this would be analogous to Rails’ acts_as_ferret).  When your Grails application starts up, Searchable will begin to index whatever models you declared as, well, searchable.  You can even set options to store all of your attributes, even if they’re not actual database fields, alleviating the need to hit the original database at all, which is nice.  Initial indexing doesn’t take long: our “problem” table that took more than twelve minutes to respond takes less than five minutes to fully index.  It would probably take considerably less than that if the data were consistent (some of the methods to set the attributes can be pretty slow if the data is wonky, since they try multiple paths to find the actual values of the attribute).

What this then affords us is consistent access times, regardless of the size of the offset:  the 4,000th page is as fast as the second:  between 2.5 and 3.5 seconds (our development database server is extremely underpowered and I access it via the VPN — my guess is that a real, live situation would be much faster).

The first page is a bit slower.  I can’t use the Lucene index for the first page of results because there’s no way for Searchable to know whether the WORKS_META table has changed since the last request, as these changes wouldn’t be happening through Grails.  Since performance for the first hundred rows out of Sybase isn’t bad, the connector just uses Sybase for the first page, then syncs the Lucene index with the database at the end of the request.  Each additional page then pulls from Lucene.  Since these pages can’t be served until the Lucene index exists, and the index is recreated every time the Grails app is started, I added a controller method that compares the count of the Sybase table with the count of the Lucene index to confirm that they’re in sync.  (It’s worth noting that if the Lucene index has already been created once, it will be available right away after Grails starts.  The reindexing is still happening, but in a temp location that is moved to the default location once it’s complete, overwriting the old index.)
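The routing described above can be sketched roughly like this.  It is a toy in Python, not the connector’s Grails code: plain lists stand in for the Sybase table and the Lucene index, and the class and method names are invented for illustration.

```python
class PagedSource:
    """Toy model: `db_rows` plays the Sybase table, `index_rows` the Lucene index."""

    def __init__(self, db_rows, page_size=100):
        self.db_rows = db_rows      # authoritative data (hypothetical stand-in)
        self.index_rows = []        # mirror used for every page after the first
        self.page_size = page_size

    def sync_index(self):
        # Stand-in for bringing the Lucene index up to date with the database.
        self.index_rows = list(self.db_rows)

    def in_sync(self):
        # The count comparison: table row count vs. index document count.
        return len(self.index_rows) == len(self.db_rows)

    def get_page(self, page_number):
        if page_number == 1:
            # First page always comes from the database, which is fast enough
            # at offset zero; the index is synced at the end of the request.
            page = self.db_rows[: self.page_size]
            self.sync_index()
            return "database", page
        if not self.in_sync():
            raise RuntimeError("index is not in sync with the database yet")
        start = (page_number - 1) * self.page_size
        return "index", self.index_rows[start : start + self.page_size]


if __name__ == "__main__":
    source = PagedSource(list(range(1, 1001)))
    print(source.get_page(1)[0])   # served by the database stand-in
    print(source.get_page(5)[0])   # served by the index stand-in
```

Requesting a later page before the index has caught up raises an error here; a real connector might instead fall back to the database or ask the client to retry.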

The side benefit to using Searchable is that it will make adding search functionality to the Alto connector that much easier.  Building SQL statements from the CQL queries in the OpenBiblio connector was a complete pain in the butt.  CQL to Lucene syntax should be considerably easier.  It seems like it would be possible for these Lucene indexes to eventually alleviate the need for the bundled Zebra index that comes with Alto, but that’s just me talking, not any sort of strategic goal.
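To give a feel for why CQL to Lucene might be the easier mapping, here is a deliberately tiny Python sketch.  It only understands clauses of the form index = term joined by and/or, the field mapping is hypothetical, and it does no escaping of Lucene special characters, so treat it as an illustration rather than a real CQL parser.

```python
import re

# Hypothetical mapping from CQL index names to Lucene field names.
FIELD_MAP = {"dc.title": "title", "dc.creator": "author"}

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\S+')


def cql_to_lucene(cql):
    """Translate a very small subset of CQL into Lucene query syntax."""
    tokens = TOKEN.findall(cql)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.lower() in ("and", "or"):
            out.append(tok.upper())  # Lucene boolean operators are upper case
            i += 1
            continue
        # Anything else must be the three-token shape: index = term
        if i + 2 >= len(tokens) or tokens[i + 1] != "=":
            raise ValueError(f"unsupported CQL near {tok!r}")
        field = FIELD_MAP.get(tok, tok)
        out.append(f"{field}:{tokens[i + 2]}")
        i += 3
    return " ".join(out)


if __name__ == "__main__":
    print(cql_to_lucene('dc.title = "moby dick" and dc.creator = melville'))
```

Anything outside that subset (other relations, proximity, parentheses, sorting) is rejected or ignored here and would need real parsing.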

Anyway, thanks to Lucene, Sybase is behaving mostly like a modern RDBMS, which is a refreshing change.

  1. Toke Eskildsen said:

    There is a penalty in Lucene for requesting pages at high offsets as it uses a sliding window sorter to arrive at the page. However, one of the first things to notice about Lucene is that it’s blazingly fast for a lot of things including this scenario, so paging to high numbers just adds a few milliseconds to overall search time.

    Regarding measured search time, then I guess that your 2.5-3.5 seconds from above is due to the searcher not being warmed as 3 seconds is a lot for a Lucene search. Try performing 5-10 searches, then measure response times.

  2. Ross said:

    Thanks for the tips, Toke!

    Obviously, whatever overhead Lucene gives me on paging is going to be orders upon orders of magnitude faster than the equivalent in Sybase — I’ll take it 🙂

    The long loading times have nothing to do with Lucene: bringing back the “objects” stored in Lucene is indeed blazingly fast (tenths of a millisecond for 100 records). Unfortunately, it still needs to hit the Sybase DB to get some associations that would be far too volatile to store in a mirror. Sybase on an underpowered machine + access over a VPN connection + zero query optimization = fairly slow response times.

    I’m not worried about that part — I feel pretty confident that that will speed up in a real environment (plus some attention paid to the actual queries).
