Solr

There are any number of reasons you can credit for Solr’s status as the standard-bearer of faceted full-text searching:  it’s free, fast, works shockingly well out of the box without any tweaking, has a simple and intuitive HTTP API (making it available in the programming language of your choice) and is, by far, the easiest “enterprise-level” application to get up and running.  None of its “competitors” (Sphinx, Xapian, Endeca, etc.), despite any individual advantages they might have, can claim all of these features, which goes a long way towards explaining Solr’s popularity.

The library world has definitely taken a shine to Solr:  from discovery interfaces like VuFind and Primo, to repositories like Fedora, to full-text aggregators like Summon, you can find Solr under the hood of most of the hot products and services available right now.  The fact that a library can install VuFind and have a slick, jaw-droppingly powerful OPAC replacement that puts its legacy interface to shame in about an hour is almost entirely a by-product of how simple Solr is to get up and running.  It’s no wonder so many libraries are adopting it (compare it to SOPAC, which is also built in PHP and about as old, but uses Sphinx for its full-text indexing and is hardly ever seen in the wild).

Without a doubt, Solr is pretty much a no-brainer if you are able to run Jetty (or Tomcat or JBoss or Glassfish or whatever):  with enough hardware, Solr can scale up to pretty much whatever your need might be.  The problem (at least in my mind) is that Solr doesn’t scale down terribly well.  If you host your content on a cheap, shared web hosting provider or a VPS, for example, Solr is either unavailable or impractical (it doesn’t live well in small-memory environments).  The hosted Solr options are fairly expensive, and while there are cheap, shared web hosting providers that do offer Java application servers, switching vendors just to provide faceted search for your mid-size Drupal or Omeka site might not be entirely practical or desirable.

I find myself proof-of-concept-ing a lot of hacks to projects like VuFind, Blacklight, Kochief and whatnot and running these things off of my shared web server.  It’s older, underpowered and only has 1GB of RAM.  Since I’m not running any of these projects in production (really just making things available for others to see), it was annoying to have Solr gobbling up 20% of the available RAM for these little pet projects.  What I wanted was something that acted more or less like Solr when you pointed an application at it that expected Solr to be there, but with a small footprint that could run (almost) anywhere and more or less disappear when idle.

So it was for this scenario that I wrote CheapSkate: a Solr emulator written in Ruby.  It uses Ferret, the Ruby port of Lucene, as the full-text indexing engine and Sinatra to supply the HTTP API.  Ferret is fast, scales quite well and responds to the same search syntax as Solr, so I knew it could handle the search aspect pretty easily.  Faceting (as can be expected) proved the harder part.  Originally, I was storing the values of fields in an RDBMS and using that to provide the facets.  Read performance was ok, although anything over 5,000 results would start to bog down – the real problem was the write performance, which was simply woeful.  Part of the issue was that this design was completely schemaless:  you could send anything to CheapSkate and facet on any field, regardless of size.  It also tried to maintain the type of the incoming field value:  dates were stored as dates, numbers stored as integers and so on.  Basically the lack of constraints made it wildly inefficient.
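To give a sense of the shape of the thing, here’s a rough sketch of how a Sinatra app can masquerade as Solr’s select handler.  This is illustrative, not CheapSkate’s actual source:  the route and the run_search helper are invented for the example.

```ruby
require 'rubygems'
require 'sinatra'
require 'json'

# A bare-bones sketch of a Solr-ish /select endpoint.  `run_search`
# is a hypothetical stand-in for the Ferret query logic.
get '/solr/select' do
  q     = params['q'] || '*:*'
  start = (params['start'] || 0).to_i
  rows  = (params['rows']  || 10).to_i

  docs, num_found = run_search(q, start, rows)  # hypothetical helper

  content_type 'application/json'
  { 'responseHeader' => { 'status' => 0, 'params' => params },
    'response'       => { 'numFound' => num_found, 'start' => start,
                          'docs' => docs } }.to_json
end
```

Anything that speaks HTTP and parses JSON (or Solr’s other response formats) can talk to an endpoint like this without knowing it isn’t really Solr.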

Eventually, I dropped the RDBMS component and started playing around with Ferret’s terms capabilities.  If you set a particular field to be untokenized, your field values appear exactly as you put them in.  This is perfect for faceting (you don’t want stemming and whatnot applied to your query filters, and since the strings aren’t normalized or downcased they look right in the UI) and is basically the same thing Solr itself does.  Instead of a schema.xml, CheapSkate has a schema.yml, but it works essentially the same way:  you define your fields, which should be tokenized (that is, which fields allow full-text search) and which should not (i.e. facet fields), and what datatype each field should be.

CheapSkate doesn’t support all of the field types that Solr does, but it supports strings, numbers, dates and booleans.
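Just to illustrate, a schema.yml might look something like the following.  The field names here are invented and the actual key names in CheapSkate may differ; this is only meant to show the tokenized/untokenized and datatype ideas.

```yaml
# Hypothetical CheapSkate schema.yml -- key names are illustrative.
fields:
  title:
    type: string
    tokenized: true       # full-text searchable
  author:
    type: string
    tokenized: true
  format:
    type: string
    tokenized: false      # stored as-is, so it can serve as a facet
  publication_date:
    type: date
    tokenized: false
  num_pages:
    type: number
    tokenized: false
  is_electronic:
    type: boolean
    tokenized: false
```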

One neat thing about Ferret is that you can pass a Ruby Proc to the search method as a search option.  This proc then has access to the search results as Ferret is finding them.  CheapSkate uses this to find the terms in the untokenized fields for each search hit, throws them in a Hash and generates a hit count for each term.  This is a lot faster than getting all the document ids from the search, looping over them and generating your term hash after the search is completed.  That said, this is still definitely the bottleneck for CheapSkate.  If the search result has more than 10,000–15,000 hits, performance begins to be pretty heavily impacted by grabbing the facets.  I’m not terribly concerned by this:  data sets with search results in the 20,000+ range start to creep into “you would be better off just using Solr” territory.  For my proofs of concept, this has only really reared its head in VuFind when filtering on something like “Book” (with no search terms) for a 50,000-record collection.  In other words, it happens for fairly non-useful searches.
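Roughly, the trick looks like this (a sketch using Ferret’s :filter_proc search option; the facet field names are made up, and I’m glossing over multi-valued fields):

```ruby
require 'rubygems'
require 'ferret'

index = Ferret::Index::Index.new(:path => './index')

facet_fields = [:format, :language]                 # hypothetical facet fields
facets = Hash.new { |h, k| h[k] = Hash.new(0) }

# Ferret calls this proc for each hit as it finds it; returning true
# keeps the hit in the result set, so the facet tally comes along
# "for free" during the search rather than in a second pass.
tally = lambda do |doc_id, score, searcher|
  doc = searcher[doc_id]
  facet_fields.each do |field|
    value = doc[field]
    facets[field][value] += 1 if value
  end
  true
end

top_docs = index.search('title:ruby', :filter_proc => tally)
# facets now holds something like {:format => {'Book' => 42, ...}, ...}
```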

Overall, I’ve been pretty happy with how CheapSkate is working.  For regular searching it does pretty well (although, like I said, I’m not trying to run a production discovery system that pleases both librarians and users).  There’s a very poorly designed “more like this” handler that really needs an overhaul and there is no “did you mean” (spellcheck).  This hasn’t been a huge priority, because I don’t really like the spellcheck in Solr all that much, anyway.  That said, if somebody really wanted this and had an idea of how it would be implemented in Ferret, I’d be happy to add it.

Ideally, I’d like to see something like CheapSkate in PHP using Zend_Search_Lucene, since that would be accessible to virtually everybody, but that’s a project for somebody else.

In the meantime, if you want to see some examples of CheapSkate in action:

One important caveat for projects like VuFind and Blacklight:  CheapSkate doesn’t work with Solrmarc, which requires Solr to return responses in the javabin format (it may be possible to hack together something that looks enough like javabin to fool Solrmarc; I just haven’t figured it out).  My workaround has been to populate a local Solr index with Solrmarc and then just dump all of the documents out of Solr into CheapSkate.
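The dump itself is simple enough.  Something along these lines works, assuming CheapSkate exposes a Solr-style update handler that accepts a JSON array of documents (the hosts, ports, paths and page size here are all placeholders):

```ruby
require 'rubygems'
require 'json'
require 'net/http'

solr       = Net::HTTP.new('localhost', 8983)  # local Solr populated by Solrmarc
cheapskate = Net::HTTP.new('localhost', 4567)  # CheapSkate (Sinatra's default port)

start, rows = 0, 500
loop do
  # Page through every document in the Solr index.
  res  = solr.get("/solr/select?q=*:*&wt=json&start=#{start}&rows=#{rows}")
  docs = JSON.parse(res.body)['response']['docs']
  break if docs.empty?

  # Assumes CheapSkate accepts Solr-style JSON updates at this path.
  cheapskate.post('/solr/update/json', docs.to_json,
                  'Content-Type' => 'application/json')
  start += rows
end
```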

For a long time, I was massively confused about what the Platform was or did.  Months after I started at Talis, I was still fairly unclear on what the Platform actually did.  I’ve now got my head around it, use it, and have a pretty good understanding of why and how it’s useful, but I fully realize that a lot of people (and by people I really mean library people) don’t, and don’t really care to learn.

What they want is Solr.  Actually, no, what they want is a magical turnkey system that takes their crappy OPAC (or whatever) data and transmogrifies it into a modern, of-the-web type discovery system.  What is powering that discovery system is mostly irrelevant if it behaves halfway decently and is pretty easy to get up and running for a proof-of-concept.  These two points, of course, are why Solr is so damned popular; to say that it meets those criteria is a massive understatement.  The front-end of that Solr index is another story entirely, but Solr itself is a piece of cake.

Almost from the time I started at Talis, I have thought that a Solr-clone API for the Platform would make sense.  Although the Platform doesn’t have all of the functionality of Solr, it has several of the sexy bits (Lucene syntax and faceting, for example), and if it had some way to respond to an out-of-the-box Solr client, it seemed to me that it would be a lot easier to turn an off-the-shelf Solr-powered application (à la VuFind or Blacklight) into a Platform-powered, RDF/linked-data application with minimal customization.  The Platform is not Solr and in many ways is quite different from Solr — but if it can exploit its similarities with Solr enough to leverage Solr’s pretty awesome client base, it’ll make it easier to open the door for the things the Platform is good at.  Alternately, if the search capabilities of the Platform become too limited compared to Solr, the data is open — just index it in Solr.  Theoretically, if the API is a Solr clone, you should be able to point your application at either.

The proof-of-concept project I’m working on right now is basically a reënvisioned Communicat:  a combination discovery interface; personal and group resource collection aggregator; resource-list content management system (for course reserves, say, or subject/course guides, etc.); and “discovered” resources (articles, books, etc.) cache and recommendation service.  None of these would be terribly sophisticated at first pass; I’m just trying to get (and show) a clearer understanding of how a Communicat might work.  As such, I’m trying to do as little development from the ground up as I can get away with.

I’ll go into more detail later as it starts to get fleshed out, but for the discovery and presentation piece, I plan on using Blacklight.  Of the OSS discovery interfaces, it’s the most versatile for the wide variety of resources I would hope to see in a Communicat-like system.  It’s also Ruby, so I feel the most comfortable hacking away at it.  That also meant I needed the aforementioned Solr-like API for the Platform, so I hastily cobbled together something using Pho and Sinatra.  I’m calling it pret-a-porter, and the sources are available on GitHub.

You can see it in action here.  The first part of the path corresponds to whichever Platform store you want to search.  The only “Response Writers” available are Ruby and JSON (I’ll add an XML response as soon as I can — I just needed Ruby for Blacklight, and JSON came basically for free along with it).  It’s incredibly naive and rough at this point, but it’s a start.  Most importantly, I have Blacklight working against it.  Here’s Blacklight running off of a Prism 3 store.  It took a little bit of customization of Blacklight to make this work, but it would still be interchangeable with a Solr index (assuming you were still planning on using the Platform for your data storage).  And when I say a “little bit”, I mean very little:  both pieces (pret-a-porter and the Blacklight implementation) took less than three days total to get running.
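For the curious, Solr’s Ruby response writer (which Blacklight consumes) is just a Ruby-literal rendering of the standard Solr response, so pret-a-porter only has to emit something shaped like this.  The values below are made up for illustration:

```ruby
# Roughly what a wt=ruby response looks like (made-up values):
{
  'responseHeader' => {
    'status' => 0,
    'QTime'  => 4,
    'params' => { 'q' => 'title:ruby', 'wt' => 'ruby', 'rows' => '10' }
  },
  'response' => {
    'numFound' => 2,
    'start'    => 0,
    'docs'     => [
      { 'id' => 'doc-1', 'title' => 'Programming Ruby' },
      { 'id' => 'doc-2', 'title' => 'The Ruby Way' }
    ]
  }
}
```

A plain Ruby Hash serialized with inspect gets you most of the way there, which is why the Ruby writer was the cheapest one to build, and why JSON came along almost for free.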

If only the rest of the Communicat could come together that quickly!