This is a preprint of a column I wrote for the Journal of Electronic Resources Librarianship called Memo from the Systems Office. The edited version appeared in Volume 20, Issue 3.
In Search of a Really “Next Generation” Catalog
Ever since North Carolina State University Libraries launched their Endeca-based OPAC replacement at the beginning of 2006, the library world has been completely obsessed with ditching its old, tired catalog interfaces (and with good reason) for the greener pastures of more sophisticated indexing, more accurate relevance ranking, dust jackets and the most coveted feature of all: facets. Despite the fact that Medialab had brought AquaBrowser to the U.S. market nearly a year earlier, NC State set the rules of the game and the target that the rest of the profession was to aim for. And, indeed, it was quite a radical and welcome change; the interface to the catalog had barely changed since it had been migrated from command line terminals to the web. The ink had barely dried on NC State’s press release before a whole host of similar products was announced by the library vendors.
We are now starting to see the fruits of their labor, as well as a handful of open source projects, making their way into production. Alongside AquaBrowser and Endeca, we now have Innovative Interfaces’ Encore, Ex Libris’ Primo, Prism 3 by my employer, Talis, and OCLC’s WorldCat Local. In the near future, VTLS and SirsiDynix should also be rolling out similar products. If the price tag on any of these offerings is too much for your library to afford, Villanova University, the University of Virginia, and Plymouth State University have all created comparable functionality in their free and open source projects: VuFind, Blacklight and Scriblio, respectively.
An environmental scan of all the options certainly finds more similarities between the products than differences: facets, dust jackets, user-created tagging, bookmarkable URLs. Much of the functionality is thanks to the Solr project from the Apache Software Foundation: a full-text indexer that produces Google-like search results, paired with a simple faceting engine that further limits the scope of the search context. In most respects, Solr’s feature set is nearly indistinguishable from Endeca’s Profind product. Couple that with the fact that Solr is an open source application that is free to integrate into your product, and it is easy to see why it is a popular choice. Many of the next generation catalog replacements (although not all) use Solr to do the heavy lifting, which reduces the differences mainly to the minutiae of how the data is indexed and presented. There are exceptions, of course, such as AquaBrowser’s visualized display of search results or the way that WorldCat Local includes other OCLC data sources, such as their union catalog, name authorities and article databases, in its search results. It is probably not coincidental that these are two of the products that are not based on Solr.
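For readers who have not seen faceting from the inside, the idea can be sketched in a few lines of Python. This is only a toy illustration and does not use Solr or any vendor's API: the records, the field names (`format`, `subject`) and the helper functions `search` and `facet_counts` are all hypothetical. It shows a search producing a result set, the facet counts computed over that set, and a facet value then being used to narrow the search.

```python
from collections import Counter

# Hypothetical toy records; the field names are illustrative only
# and do not follow a MARC or Solr schema.
RECORDS = [
    {"title": "Maps of the Carolinas", "format": "Map", "subject": "North Carolina"},
    {"title": "History of North Carolina", "format": "Book", "subject": "North Carolina"},
    {"title": "Carolina Gardening", "format": "Book", "subject": "Gardening"},
    {"title": "Journal of Southern History", "format": "Journal", "subject": "History"},
]


def search(records, term, filters=None):
    """Return records whose title contains `term` and that match every facet filter."""
    filters = filters or {}
    term = term.lower()
    return [
        r for r in records
        if term in r["title"].lower()
        and all(r.get(name) == value for name, value in filters.items())
    ]


def facet_counts(results, field_name):
    """Count how many results carry each value of `field_name`."""
    return Counter(r[field_name] for r in results)


# A keyword search, the facet counts shown alongside it, and a narrowed search.
hits = search(RECORDS, "carolina")
format_facets = facet_counts(hits, "format")
narrowed = search(RECORDS, "carolina", {"format": "Book"})

print(len(hits), dict(format_facets))
print([r["title"] for r in narrowed])
```

A real engine does the same thing against an inverted index rather than a list scan, but the user-facing behavior (counts per value, click to narrow) is what the products above have in common.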
In all fairness, these products do offer a significant advantage over Endeca: since Profind has no out-of-the-box interface, all development must be done locally. On top of the sticker price, a library would also need to find the resources to build something to actually use it, requirements outside the reach of the great majority of institutions. The library vendors, on the other hand, offer relatively turn-key solutions, giving libraries a much simpler path to improved functionality.
As much of an improvement as these OPAC replacements are (and, certainly, they are a vast improvement over the status quo), they are all still based on some fundamentally flawed principles: they are all still relatively closed-world silos intended to index MARC records. This is no criticism of the MARC format, catalogers or cataloging practices, but the way that data is represented in catalog records is ill-suited for a next generation OPAC. The records are sparse, in many cases very sparse. Records are seldom updated. There is also no distinction of discrete concepts within the record: what exists is a blob of metadata about a work with strings identifying the creator or subject. To relate different blobs, the OPAC matches on those strings. This constraint is an unnecessary holdover in a modern library system. The creators, the subjects, the publishers, all of the distinct concepts that appear in the self-contained record should be first-class citizens, their own distinct records with their own distinct behaviors, rather than being merely strings to search on. Another shortcoming is that all records are displayed more or less equally. Despite the differences between a map and a journal, and, more importantly, in a user’s expectation of what they would use a map or a journal for, they both appear in roughly the same template with many of the same labels and options. What we have now are systems that ape Amazon’s look and feel with a minute fraction of the kind of data that makes Amazon compelling.
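The difference between matching on strings and treating a creator as a first-class record can be shown with a small Python sketch. Everything here is an assumption made for illustration: the `Creator` and `Work` classes, the `creator/1` identifier and the sample headings are hypothetical and are not drawn from any catalog or product discussed in this column.

```python
from dataclasses import dataclass, field


@dataclass
class Creator:
    """A creator as its own record, not a string inside a bibliographic record."""
    identifier: str
    preferred_name: str
    variant_names: list = field(default_factory=list)


@dataclass
class Work:
    title: str
    creator_string: str  # the heading a string-matching catalog relies on
    creator_id: str      # a link to a Creator record (hypothetical identifier)


twain = Creator("creator/1", "Twain, Mark, 1835-1910", ["Clemens, Samuel Langhorne"])

# Three records for the same person, entered with three different headings.
WORKS = [
    Work("Adventures of Huckleberry Finn", "Twain, Mark, 1835-1910", "creator/1"),
    Work("Life on the Mississippi", "Twain, Mark", "creator/1"),
    Work("Personal Recollections of Joan of Arc", "Clemens, Samuel Langhorne", "creator/1"),
]


def works_by_string(works, heading):
    """Relate records the way a string-matching OPAC does: exact heading match."""
    return [w.title for w in works if w.creator_string == heading]


def works_by_creator(works, creator):
    """Relate records through the creator's identifier instead of its label."""
    return [w.title for w in works if w.creator_id == creator.identifier]


by_string = works_by_string(WORKS, twain.preferred_name)
by_entity = works_by_creator(WORKS, twain)

print(len(by_string), len(by_entity))
```

In this sketch the string match finds only the record whose heading is identical, while the identifier link finds all three; the creator record is also a natural place to hang biography, variant names or links out to other data.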
These new, shiny silos also do very little to harness the potential of the communities they serve. There is practically nothing the users can do to influence the way the system works and very little they can do to add useful data to the records, outside of comments, ratings and tagging. These, sadly, have minimal value given how small the user populations of most libraries are, especially compared to the number of resources in their collections. There is almost no effort made to integrate the data into the larger information ecosystem of the web; instead, outside data sources must be pulled in and indexed internally to be acknowledged. As the catalog takes a less central role in the library, this approach becomes less practical.
There are some glimmers of hope. VuFind, for example, has a plugin for crude author pages, based on name authority records, that pulls biographical data from Wikipedia along with local holdings for the author’s works. WorldCat Local goes further with this, integrating OCLC’s WorldCat Identities service with authors and showing all works by a given person, noting what is available locally and what might be held at nearby libraries. Prism 3, while not utilizing it at this stage, is built on Talis’ semantic web-based Platform, giving it the potential of tapping into, integrating, and, conversely, feeding open data from all over the net.
Scriblio takes a more direct approach to integrating into the wider web. Built on the blogging platform WordPress, it taps into the existing social framework optimized for Web 2.0 style services (what Casey Bisson, Scriblio’s creator, calls “the Google Economy”). Instead of recreating bookmarking services or methods to define identity or prevent comment spam, Scriblio is able to defer to services such as del.icio.us, OpenID and Akismet, and to use Technorati, Google Blog Search and trackbacks to discover content linking to the catalog. WordPress’s broad selection of plugins gives Scriblio the opportunity to include data from third parties such as last.fm for records about musical works, Flickr for images, and maps from Google or Yahoo.
Although Scriblio is a perfect netizen, it is a little unclear how effective it is as a library discovery system. Many of the criticisms leveled against the other next generation catalog replacements also apply to Scriblio: the data is largely unenhanced and immutable; the community has little effect on the records; and there is no distinction in the display for different kinds of data. Scalability becomes another issue. WordPress was never designed for the amount of data that a large research library would have. After all, it is a personal publishing platform intended for content creation with minimal emphasis on searching. Scriblio’s interface in particular seems to be optimized for the display of monographs. Serials, databases and how they relate to one another, an integral part of the modern research library, may not translate nearly as well.
Another intriguing alternative is BiblioCommons. BiblioCommons, by a company of the same name, adds a rich social network to the library by leveraging user-contributed content and borrowing history and by creating personal profiles that in turn form communities of interest around similar tastes. The focus is on the users, user behavior and user experience rather than on the bibliographic metadata, an emphasis that would allow communities to shape their libraries in ways that work best for them. Creating services based on user data requires quite large populations to generate enough activity to make informed decisions. This model works well for public libraries with large pools of borrowers. Because public libraries are more general in scope, smaller public library districts can feed into the same user base, provided the groups are relatively culturally and linguistically homogeneous. Academic libraries do not have this luxury, however; it remains to be seen how BiblioCommons’ model adapts to meet their needs.
The downside of BiblioCommons is that, even after nearly two years of presentations and announcements, as of this writing there are no production instances and all of the betas are private. There is no way to know how well their approach scales, works or fits in with the rest of the information landscape because it has yet to be subjected to the rigors of real world usage. Regardless of how BiblioCommons fares, their method and user-centric approach are worth a look.
As much as the interfaces are changing in style from their ancestor the card catalog, so too are catalogers realizing that the underlying data must evolve as well. The development of RDA (Resource Description and Access), FRBR and the Dublin Core Abstract Model (DCAM) is inspired, in part, by the inefficiencies in applying MARC and traditional cataloging techniques to the modern information universe. There is a classic “chicken or the egg” scenario here, though. RDA is unlikely to gain much support until it is adopted by the cataloging modules in the library management systems, but that adoption would probably be the most significant change to online library systems since their creation and would require a major overhaul on the part of the vendors, further slowing its uptake.
Rather than the integrated library system, perhaps it should be the discovery systems that drive the adoption and proliferation of RDA and the DCAM in libraries. With their focus on resources rather than records, on integrating data from other sources and on generating a collaborative web of data, these models seem a perfect fit for the kinds of next generation search and discovery systems needed in our increasingly distributed and digitized landscape.
Disintegrate the bibliographic data from the inventory control system, let it incorporate and display however it wants or needs to, and leave the circulation, purchasing and serials prediction to the back office application. Like it or not, it is the direction the current crop of catalog replacements is taking us anyway; it is time to shed the trappings of the card catalog and reconfigure our assets to work with the web instead of around it. Until we start to work with the data as it is intended, rather than how it has traditionally been structured, these next generation tools that we find so innovative will merely underwhelm.