There are any number of reasons you can point to for Solr’s status as the standard bearer of faceted full-text searching: it’s free, fast, works shockingly well out of the box without any tweaking, has a simple and intuitive HTTP API (making it available in the programming language of your choice) and is, by far, the easiest “enterprise-level” application to get up and running. None of its “competitors” (Sphinx, Xapian, Endeca, etc.), despite any individual advantages they might have, can claim all of these features, which goes a long way towards explaining Solr’s popularity.
The library world has definitely taken a shine to Solr: from discovery interfaces like VuFind and Primo, to repositories like Fedora, to full-text aggregators like Summon, you can find Solr under the hood of most of the hot products and services available right now. The fact that a library can install VuFind and, in about an hour, have a slick, jaw-droppingly powerful OPAC-replacement that puts their legacy interface to shame is almost completely the by-product of how simple Solr is to get up and running. It’s no wonder so many libraries are adopting it (compare it to SOPAC, which is also built in PHP and about as old, but uses Sphinx for the full-text indexing and is hardly ever seen in the wild).
Without a doubt, Solr is pretty much a no-brainer if you are able to run Jetty (or Tomcat or JBoss or Glassfish or whatever): with enough hardware, Solr can scale up to pretty much whatever your need might be. The problem (at least the problem in my mind) is that Solr doesn’t scale down terribly well. If you host your content from a cheap, shared web hosting provider or a VPS, for example, Solr is not available or not practical (it doesn’t live in small memory environments well). The hosted Solr options are fairly expensive and while there are cheap, shared web hosting providers that do provide Java Application Servers, switching vendors to provide faceted search for your mid-size Drupal or Omeka site might not be entirely practical or desirable.
I find myself proof-of-concept-ing a lot of hacks to projects like VuFind, Blacklight, Kochief and whatnot, and I run these things off of my shared web server. It’s older, underpowered and only has 1GB of RAM. Since I’m not running any of these projects in production (just really making things available for others to see), it was really annoying to have Solr gobbling up 20% of the available RAM for these little pet projects. What I wanted was something that acted more or less like Solr when you pointed an application at it that expected Solr to be there, but with a small footprint that could run (almost) anywhere and more or less disappear when it was idle.
So it was for this scenario that I wrote CheapSkate: a Solr emulator written in Ruby. It uses Ferret, the Ruby port of Lucene, as the full-text indexing engine and Sinatra to supply the HTTP API. Ferret is fast, scales quite well and responds to the same search syntax as Solr, so I knew it could handle the search aspect pretty easily. Faceting (as can be expected) proved the harder part. Originally, I was storing the values of fields in an RDBMS and using that to provide the facets. Read performance was ok, although anything over 5,000 results would start to bog down – the real problem was the write performance, which was simply woeful. Part of the issue was that this design was completely schemaless: you could send anything to CheapSkate and facet on any field, regardless of size. It also tried to maintain the type of the incoming field value: dates were stored as dates, numbers stored as integers and so on. Basically the lack of constraints made it wildly inefficient.
Eventually, I dropped the RDBMS component and started playing around with Ferret’s terms capabilities. If you set a particular field to be untokenized, your field values appear exactly as you put them in. This is perfect for faceting (since you don’t want stemming and whatnot on your query filters and your strings aren’t normalized or downcased or anything so they look right in the UI) and is basically the same thing Solr itself does. Instead of a schema.xml, CheapSkate has a schema.yml, but it works essentially the same way: you define your fields, what should be tokenized (that is, which fields allow full-text search) or not (i.e. facet fields) and what datatype the field should be.
CheapSkate doesn’t support all of the field types that Solr does, but it supports strings, numbers, dates and booleans.
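To make the tokenized/untokenized split concrete, here is a minimal sketch of how a YAML schema along these lines could be read and sorted into full-text fields and facet fields. The key names (fields, tokenized, type), the example field names and the split_fields helper are all hypothetical; the real schema.yml layout may well differ.

```ruby
require 'yaml'

# Hypothetical schema layout for illustration only; the real
# schema.yml keys and field names may differ.
SCHEMA_YAML = <<~YAML
  fields:
    title:
      tokenized: true
      type: string
    format:
      tokenized: false
      type: string
    publish_date:
      tokenized: false
      type: date
YAML

# Returns [full_text_field_names, facet_field_names].
# Tokenized fields are searchable as full text; untokenized fields
# keep their exact values, which is what makes them usable as facets.
def split_fields(schema)
  tokenized, untokenized = schema['fields'].partition { |_name, opts| opts['tokenized'] }
  [tokenized.map(&:first), untokenized.map(&:first)]
end

search_fields, facet_fields = split_fields(YAML.safe_load(SCHEMA_YAML))
puts "full-text: #{search_fields.join(', ')}"
puts "facets:    #{facet_fields.join(', ')}"
```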
One neat thing about Ferret is that you can pass a Ruby Proc to the search method as a search option. This proc then has access to the search results as Ferret is finding them. CheapSkate uses this to find the terms in the untokenized fields for each search hit, throws them in a Hash and generates a hit count for each term. This is a lot faster than getting all the document ids from the search, looping over them and generating your term hash after the search is completed. That said, this is still definitely the bottleneck for CheapSkate. If the search result has more than 10,000-15,000 hits, performance begins to get pretty heavily impacted by grabbing the facets. I’m not terribly concerned by this, since data sets with search results in the 20,000+ range start to creep into the “you would be better off just using Solr” domain. For my proofs of concept, this has only really raised its head in VuFind when filtering on something like “Book” (with no search terms) for a 50,000 record collection. What I mean to say is, this happens for fairly non-useful searches.
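The counting idea can be sketched in plain Ruby without Ferret at all. In the toy below, fake_search stands in for the search engine and simply calls a proc once per hit; the DOCS hash, fake_search and facet_counts are made-up names for illustration and are not CheapSkate’s actual code or Ferret’s API.

```ruby
# Toy stand-in for an index: doc id => untokenized (facet) field values.
DOCS = {
  0 => { 'format' => ['Book'], 'language' => ['English'] },
  1 => { 'format' => ['Book'], 'language' => ['French'] },
  2 => { 'format' => ['Map'],  'language' => ['English'] }
}.freeze

# Stand-in for the search engine: calls the proc once per hit,
# as each hit is "found", and returns the total number of hits.
def fake_search(matching_ids, per_hit)
  matching_ids.each { |doc_id| per_hit.call(doc_id) }
  matching_ids.size
end

# Tally facet values inside the per-hit callback, so no second pass
# over the full list of document ids is needed after the search.
def facet_counts(matching_ids, facet_fields)
  counts = Hash.new { |hash, field| hash[field] = Hash.new(0) }
  collector = proc do |doc_id|
    facet_fields.each do |field|
      DOCS[doc_id].fetch(field, []).each { |value| counts[field][value] += 1 }
    end
  end
  fake_search(matching_ids, collector)
  counts
end

p facet_counts([0, 1, 2], %w[format language])
```

The work per hit is proportional to the number of facet values on that document, which is consistent with facet collection becoming the slow part once a search matches tens of thousands of records.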
Overall, I’ve been pretty happy with how CheapSkate is working. For regular searching it does pretty well (although, like I said, I’m not trying to run a production discovery system that pleases both librarians and users). There’s a very poorly designed “more like this” handler that really needs an overhaul and there is no “did you mean” (spellcheck). This hasn’t been a huge priority, because I don’t really like the spellcheck in Solr all that much, anyway. That said, if somebody really wanted this and had an idea of how it would be implemented in Ferret, I’d be happy to add it.
Ideally, I’d like to see something like CheapSkate in PHP using Zend_Search_Lucene, since that would be accessible to virtually everybody, but that’s a project for somebody else.
In the meantime, if you want to see some examples of CheapSkate in action:
- Here’s that VuFind instance with 50,000 MARC records (from the California College of the Arts)
- Kochief with around 10,000 MARC records (from the Library of Congress, via Blacklight)
- Drupal with just over 50 nodes
- WordPress with just under 150 posts & pages (this blog).
One important caveat for projects like VuFind and Blacklight: CheapSkate doesn’t work with Solrmarc, which requires Solr to return responses in the javabin format (it may be possible to hack out something that looks enough like javabin to fool Solrmarc, I just haven’t figured it out). My workaround has been to populate a local Solr index with Solrmarc and then just dump all of the documents out of Solr into CheapSkate.
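One piece of that dump step could look roughly like the sketch below, which turns documents (as hashes, for example parsed from a Solr JSON response) into Solr’s standard XML add message. It assumes CheapSkate’s update handler accepts that message format, which is an assumption rather than something confirmed here, and xml_escape and solr_add_xml are hypothetical helper names. Fetching from Solr and posting to CheapSkate are left out.

```ruby
# Escape the characters that are unsafe in XML text and attribute values.
def xml_escape(text)
  text.to_s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;').gsub('"', '&quot;')
end

# Build a Solr-style <add> message from an array of document hashes.
# Multi-valued fields (arrays) become one <field> element per value.
# Assumption: the receiving update handler understands this format.
def solr_add_xml(docs)
  doc_nodes = docs.map do |doc|
    fields = doc.flat_map do |name, value|
      Array(value).map { |v| %(<field name="#{xml_escape(name)}">#{xml_escape(v)}</field>) }
    end
    "<doc>#{fields.join}</doc>"
  end
  "<add>#{doc_nodes.join}</add>"
end

puts solr_add_xml([{ 'id' => '1', 'format' => ['Book', 'Q&A'] }])
```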