This is a preprint of a column I wrote for the Journal of Electronic Resources Librarianship called Memo from the Systems Office. The edited version appeared in Volume 20, Issue 2.
The Knowledgebase Kibbutz
As libraries’ collections increasingly go digital, their dependence on knowledgebases to access and maintain these electronic holdings grows with them. Unlike other library-based knowledge management systems (catalogs, institutional repositories, etc.), the data found in the knowledgebases of link resolvers or electronic resource management systems is not generally modeled, created or updated by librarians (although, admittedly, a lot of local work is done to modify or fix this data). Much like the subscription-based resources they track, it is difficult to know what exactly libraries are allowed to do with their knowledgebase data. What is known is that a handful of companies are doing roughly the same work and their customers are often simultaneously fixing the same errors and inaccuracies in the data these vendors are aggregating. The entire process is proprietary, inefficient and redundant, and is controlled by a bevy of players that have no incentive to change. A centralized, standardized approach, maintained by librarians, publishers and vendors, could not only reduce total cost of ownership, but also improve the quality of the data as well as the services built upon such a repository.
Last spring, the UK Serials Group (UKSG) issued a report titled “Link Resolvers and the Serials Supply Chain” aimed squarely at this issue (full disclosure: my employer, Talis, is a member of the UKSG Knowledge Bases and Related Technologies working group). It discussed the reasons the current environment evolved the way it did, the problems and inefficiencies that exist with the status quo, and a possible, centralized alternative to improve the situation. Interestingly, the proposal put forth assumed a single organization would be responsible for shepherding such a service instead of relying on a distributed, social-software-based, Web 2.0 method to maintain this knowledge (although, obviously, someone or something would still have to host such a service). In the course of the report, the authors give a passing, somewhat dismissive mention of the Jointly Administered Knowledge Environment (JAKE), which, although it ran out of steam years before the term “Web 2.0” was even coined, was based entirely on those principles.
JAKE, although now dead and buried (jake-db.org was decommissioned January 31st, 2007 after years of neglect), was hardly a “failure” as the UKSG report would indicate. Started in 1999 by the Cushing/Whitney Medical Library at Yale, its functionality went on to form the basis of two different currently available link resolvers (OCLC Openly Informatics’ WorldCat Link Manager, née 1Cate, and Simon Fraser University’s CUFTS). Although no working copies remain, a Google search for “jointly administered knowledge environment” yields over 25,000 hits, of which a sizable proportion are library web pages pointing at one of the four former JAKE locations. While this certainly appears to be an example of “the tragedy of the commons” (many people benefiting, few people contributing), the reality of why JAKE failed to take firmer root is probably more complicated than that. Since it had no direct interaction with commercial link resolvers and, as such, provided a questionable real-life benefit to library users, the motivation for librarians not directly involved in the project to maintain it was fairly minimal. Since JAKE predated the rise of the social web, there was no precedent for how to cultivate a community to keep it running. Since JAKE did not fit into the workflows of the librarians or the publishers, there was little incentive to invest a lot of effort in the submission or upkeep of the data. JAKE was doomed not because it wasn’t useful but because it could not articulate its usefulness.
The danger in the UKSG approach is the creation of another closed, subscription-based service like Crossref or WorldCat. While the credibility and existing relationships with the major stakeholders (publishers, aggregators, libraries, vendors) that an organization like OCLC or Crossref already has would certainly be desirable and give an immediate boost to such a project, past performance does not show much hope for clearing up the issue of what subscribers are actually allowed to do with the data found in a centralized, paid-access knowledgebase. It also does nothing to address the openness required to build a community on this service, which in turn raises questions about how the data contained within would be maintained. Along with people contributing to the database and journal information, the service would also need application developers, from libraries, library vendors, publishers, database vendors and beyond, to write applications that depend not only on the existence of this service but on the accuracy of its data. If business interests are staked on the success of this venture, and, even more importantly, multiple and diverse business interests, its chances of survival improve.
By no means would this project be trivial. From technical, political and administrative perspectives alike, there are countless pitfalls. Modeling this data is hard. The only evidence needed to back up this statement is the arcane ways libraries and publishers have tried to model it in the past (MARC21 Format for Holdings Data, ONIX for Serials Online Holdings, etc.). Agreeing on the criteria for inclusion in such a knowledgebase would take months of debate. Adoption of this resource would require librarians to feel confident in its accuracy; just because the data will be available to correct does not mean that anybody will want to update vast percentages of it. The major link resolver and ERMS vendors have little incentive to participate: regardless of how awkward or inefficient their current processes are, they work, and maintaining the status quo requires no massive refactoring of workflows and code. Perhaps the biggest question would be, who would host this service? Who has the infrastructure to support a project like this? Who will pay for the servers, bandwidth and upkeep? Would the stakeholders feel comfortable if this was provided by a vendor, regardless of how open the community is? When the community reaches an impasse, who will have the authority to make executive decisions? These questions are all currently unanswered, and the prospect of an organization like NISO (or another similar body) being tasked to solve them means that there are no solutions in sight.
The upside of this service would be so great, though. The obvious potential lies in the evolution of the knowledgebase as we currently know it: a standard for publishers and providers to point to for their holdings submissions; a centralized source to add and disseminate targets provided by librarians when the vendor community cannot or will not; and a means to eliminate the redundancies of maintaining multiple knowledgebases between vendor offerings. Beyond that, the community-contributed knowledgebase also creates opportunities for new services. Besides a crop of new or improved OpenURL resolver, ERMS and metasearch offerings that would likely spring forth, more tangentially related applications would be possible as well. With a centralized registry of primarily electronic resources, a uniform identifier can be given to items such as databases or e-book packages, since there is, shamefully, nothing that addresses that need currently. Preprint/post-print/working paper coverage can be associated with original resources. Relationships can be defined between publications and web services that fall outside of the traditional library purview: journal tables of contents found at Cite-U-Like, full text coverage at Google Book Search, and more. By keeping the data open, people from outside the library community can utilize, reconstitute and extend our data, providing libraries with services that they would not have imagined or had the means to create.
Even the pitfalls are surmountable. Organizations such as the Internet Archive’s Open Library and Wikipedia have had to deal with the realities of openly editable and therefore highly dynamic content. Mechanisms, whether automated or manual, could be put into place to monitor or spot check edits for accuracy. Libraries could lock their resolvers to specific edits of a particular resource until they (or a predetermined “trusted” agent: a consortial partner or their vendor, say) approve the most recent edition. Library software vendor buy-in can be achieved in two ways: comprehensive publisher and aggregator support or simple economics. As the UKSG report states, the publishing community would relish a single, definitive means for formatting and distributing their holdings. Compared with many ad hoc arrangements with resolver vendors, subscription agents, and ERMS products, a consistent and streamlined process is a much easier business case to sell to the publishers. If the publishers commit to solely producing their holdings through a centralized service, the vendors have little choice but to acknowledge and use it. As for the economics, it is time-consuming and therefore expensive to maintain a large and accurate knowledgebase. Simon Fraser University’s subscription costs for their CUFTS link resolver are almost exclusively used for cost recovery for their knowledgebase maintenance. OhioLINK and the Colorado Alliance of Research Libraries (CARL) must face similar predicaments for their OLinks and GoldRush products, respectively. The smaller traditional commercial vendors are probably looking at an even bleaker return on investment. This is, no doubt, why OCLC Openly’s WorldCat Link Manager’s knowledgebase is used behind the scenes by the majority of link resolvers on the market today.
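The edit-locking idea described above, in which a library's resolver stays pinned to an approved edit of a record until the library or a trusted agent signs off on a newer one, might be sketched roughly as follows. This is only an illustration under assumptions: the class names, fields and the coverage strings are all hypothetical and do not describe any existing knowledgebase or resolver product.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(frozen=True)
class Revision:
    """One community edit of a knowledgebase record (hypothetical shape)."""
    number: int
    editor: str
    coverage: str  # e.g. "1995-2007"


@dataclass
class PinnedRecord:
    """A community-edited record plus one library's approved revision."""
    title: str
    trusted_agents: Set[str]
    revisions: List[Revision] = field(default_factory=list)
    approved: Optional[int] = None  # revision number the resolver is locked to

    def add_edit(self, editor: str, coverage: str) -> Revision:
        # Anyone may edit; a new edit never changes what the resolver sees.
        rev = Revision(len(self.revisions) + 1, editor, coverage)
        self.revisions.append(rev)
        return rev

    def approve(self, agent: str, number: int) -> None:
        # Only a predetermined trusted agent may move the pin.
        if agent not in self.trusted_agents:
            raise PermissionError(f"{agent} is not a trusted agent")
        if not 1 <= number <= len(self.revisions):
            raise ValueError(f"no revision {number}")
        self.approved = number

    def resolver_view(self) -> Optional[Revision]:
        # The link resolver only ever reads the approved revision.
        if self.approved is None:
            return None
        return self.revisions[self.approved - 1]
```

In this sketch a later community edit is recorded but ignored by the resolver until, say, a consortial partner listed in `trusted_agents` calls `approve` on it, which is one simple way to reconcile open editing with a library's need for stable data.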
However, if the link resolver and ERMS suppliers outside of the big three (Ex Libris, Serials Solutions and OCLC Openly) contributed to a collaborative knowledgebase, they might, together, have a large enough market to influence the data providers to contribute to it (which, in turn, may be an end-around to the first solution). The non-participating vendors could use the community as a means to share unsupported targets created by their customers by providing an import/export mechanism to their current, proprietary knowledgebases. There are multiple benefits to having a resource like this; it does not necessarily have to power the link resolver product directly as long as it can be integrated seamlessly. The vendors jumped at supporting Google’s demands for Google Scholar export, so why not this?
Most likely, the best solution would be for somebody to just get a handful of the stakeholders together and create something that works, preferably with an OpenURL link resolver or electronic resource management system modified or built to use it (open source or commercial, it does not matter). The knowledgebase crisis is not going away, and as the digital universe expands, especially to new and different formats, it will only get more difficult to manage. By tapping into the power of the entire community, from the beginning of the publishing chain to the end user, the knowledgebase becomes self-sustaining and finds new and interesting uses along the way. The world does not need another closed-access, subscription-based library data silo. We have enough of those already.