This is a preprint of a column I wrote for the Journal of Electronic Resources Librarianship called Memo from the Systems Office. The edited version appeared in Volume 20, Issue 1.
Opening Up Access to Open Access
As the corpus of gray literature grows and the price of serials rises, it becomes increasingly important to explore ways to integrate the free and open web seamlessly into our collections. Users, after all, are discovering these materials all the time via sites like Google Scholar and Scirus or by searching arXiv.org or CiteSeer directly. Leveraging such resources not only spackles gaps within our holdings, it also cheaply and efficiently expands the range of content, especially content that might otherwise fall outside of a library’s collection development policy (but not necessarily outside the scope of its constituents’ research). While large academic libraries would certainly benefit from exploiting these archives, the real winners would be smaller libraries with more limited collections. Add access to digitized items in Google Book Search, the Open Content Alliance, Project Gutenberg and similar initiatives from Microsoft and Amazon, and even the smallest community college library begins to resemble its more financially endowed peers. Ideally, as researchers discover, use and cite materials in these green archives alongside their vended counterparts, they will be further encouraged to submit their own scholarly work to such repositories, further increasing the volume and proportion of items available.
The problem, however, is that documents such as these are largely invisible to the traditional library research workflow. Outside of the possibility of being included in a metasearch application (which is hardly ubiquitous, or even necessarily desirable, among many libraries), there is little to no chance of preprints or working papers being found in the normal “search a licensed database/link to the OpenURL link resolver/link to a licensed full text database” chain. Our systems don’t track individual objects, since that problem is both vast and messy. Instead, they define the existence of articles and their availability to the user by broad association: “This object is part of this serial which, based on the date, has been identified as part of this institution’s collection.” Monographic materials follow the same path: they are either in the catalog or available via interlibrary loan. The time is nigh for a service that bridges the gap between the published document and its open access preprint, postprint and working paper kin.
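To make the chain above concrete, the citation metadata that travels through it is typically packaged as an OpenURL. A minimal sketch of constructing one, using a hypothetical resolver address and the key/encoded-value (KEV) convention from the OpenURL 1.0 standard:

```python
from urllib.parse import urlencode

# Hypothetical base URL for an institution's link resolver.
RESOLVER_BASE = "http://resolver.example.edu/openurl"

def build_openurl(metadata):
    """Build an OpenURL 1.0 KEV query string for a journal article citation."""
    params = {
        "url_ver": "Z39.88-2004",                       # OpenURL 1.0 version
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",  # journal metadata format
    }
    # Citation elements describing the referent are prefixed with "rft."
    params.update({"rft." + key: value for key, value in metadata.items()})
    return RESOLVER_BASE + "?" + urlencode(params)

citation = {
    "genre": "article",
    "atitle": "Linking Service to Open Access Repositories",
    "jtitle": "D-Lib Magazine",
    "date": "2007",
    "aulast": "Sugita",
}
print(build_openurl(citation))
```

A link resolver receiving this request has only the citation to work with; whether any open access copy exists is exactly the question the rest of this column takes up.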
This is hardly a new idea. Last year, a consortium of Japanese libraries teamed with OCLC Openly Informatics on the AIRway project, a means of opening their institutional repository content to their link resolvers. [1,2] The project focused on local content, however, albeit local content aggregated from several large Japanese universities, so the metadata is tightly controlled and, inherently, limited. It doesn’t begin to address finding resources outside of the library’s (or, in this case, the consortium’s) defined “collection”. It is also a fairly specific solution: an OpenURL service embedded in a DSpace implementation, using a locally standardized qualified Dublin Core profile, to be consumed by a 1Cate-based link resolver. This is not intended as a criticism; rather, it is meant to point out the differences in objectives. Indeed, AIRway’s goals are worthwhile; any access to our institutional repositories is a step in the right direction.
Similarly, Georgia Tech’s Ümlaut project sought to expose the free preprint archives in the link resolving chain (full disclosure: I was directly involved in the design, development and implementation of the Ümlaut). Rather than searching for open access materials directly, however, the Ümlaut utilized the Google and Yahoo web search APIs (Application Programming Interfaces) and, among other activities, weeded through the web search results for materials held in predefined open access archives (identified mainly via the Directory of Open Access Repositories and the Registry of Open Access Repositories [3,4]), applying simple heuristics to determine whether or not a link was in fact to the full text of the citation (rather than to a citation in a bibliography or just a random, irrelevant match on the article title and author keywords). This approach isn’t without its problems: false positives are possible depending on the OpenURL context object metadata, and discovery is left to the mercy of the web crawlers. A single but very significant example: arXiv.org only allows its content to be crawled by Google, but Google’s search API can be very unstable and frequently fails, which essentially renders the documents contained there non-existent for the searcher at that moment. Compounding the problem, Google no longer provides API keys for its web search service, making this method unavailable to other libraries that might wish to pursue such a mechanism; as a result, arXiv.org, easily one of the more important open access archives, could not be checked for a given paper. Still, this is a surprisingly effective means of linking to open content. The functionality is tightly integrated into the Ümlaut, however, and would be difficult to detach for broader use.
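The Ümlaut’s source is the authority on its actual heuristics; the sketch below merely illustrates the general shape of such a filter, with a made-up repository whitelist standing in for data drawn from OpenDOAR and ROAR. It accepts a web search hit only if the host is a known open access archive and most of the citation’s title words appear in the result title, which screens out bibliography pages that merely cite the paper:

```python
import re
from urllib.parse import urlparse

# Hypothetical whitelist; in practice this would be populated from registries
# such as the Directory of Open Access Repositories (OpenDOAR) and ROAR.
OA_REPOSITORY_HOSTS = {"arxiv.org", "citeseer.ist.psu.edu", "eprints.soton.ac.uk"}

def looks_like_fulltext(citation, result_url, result_title):
    """Crude heuristic: is this search hit plausibly the cited full text?"""
    host = urlparse(result_url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in OA_REPOSITORY_HOSTS:
        return False  # not a recognized open access archive
    # Require a strong overlap between citation and result title words.
    cite_words = set(re.findall(r"\w+", citation["title"].lower()))
    hit_words = set(re.findall(r"\w+", result_title.lower()))
    return len(cite_words & hit_words) / len(cite_words) >= 0.8
```

Even a filter this simple catches the most common false positive (a match on a page that cites the article rather than hosts it), though it remains hostage to whatever the crawler has indexed.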
A centralized service seems the most logical way to deliver maximum value to a widespread audience using a variety of link resolvers (among other services) from different vendors, and many of the necessary pieces are already in place to make such a web service rather trivial to implement. A large share of open access repositories are harvested by the University of Michigan’s OAIster service, making it a good single point for searching digital resources. The downside is that OAIster can suffer from poor performance and has an awkward, non-standard interface for remote access. Thankfully, the Danish company Index Data has made OAIster available through their OpenContent service, which brings immense improvements in response time and in accessibility via Z39.50 and SRU. The service also provides access to the Open Content Alliance’s collection of digitized books as well as Project Gutenberg, conveniently offering a single source for a vast amount of open access content. LibraryThing recently introduced a service that associates scans in Google Book Search with entries in LibraryThing. [7,8] This data could be crossed with LibraryThing’s thingISBN or OCLC’s xISBN ISBN concordance services to increase the chances of finding a digitized copy (although, realistically, items with ISBNs are not likely to be out of copyright). Similar work could be done for Microsoft’s and Amazon’s nascent digitization programs. The incoming OpenURL would be stored with the objects discovered through these sources, which would both improve the metadata surrounding the digital resources and open up other possibilities for services to utilize this vast pool of content.
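SRU is what makes this kind of aggregation easy to consume: it is an ordinary HTTP GET carrying a CQL query. A minimal sketch, with a hypothetical endpoint address standing in for the OpenContent service (whose actual URL and supported indexes may differ):

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Hypothetical SRU endpoint; the real OpenContent address may differ.
SRU_BASE = "http://opencontent.indexdata.com/sru"

def build_sru_url(title, max_records=5):
    """Build an SRU searchRetrieve request with a CQL title query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": 'dc.title = "%s"' % title,
        "maximumRecords": max_records,
    }
    return SRU_BASE + "?" + urlencode(params)

def sru_search(title):
    """Fetch and parse the SRU response, returning its <record> elements."""
    with urlopen(build_sru_url(title)) as response:
        tree = ET.parse(response)
    ns = {"srw": "http://www.loc.gov/zing/srw/"}
    return tree.findall(".//srw:record", ns)
```

Because the request and response are plain XML over HTTP, any link resolver, catalog or browser plugin could issue the same search without Z39.50 tooling.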
The Open Access Resolver would work roughly like this: an OpenURL-enabled application (be it a database, a link resolver or a browser plugin) sends an OpenURL to the OA Resolver along with the desired response format (defaulting to a human readable HTML interface). The resolver determines the type of resource being sought (book, article, etc.) and queries the appropriate services accordingly; the book search services would be ignored if the item is an article, for example. The OA Resolver then returns its response in the format requested: XML to be integrated into another service, HTML to be viewed by the user, or a simple boolean text string from which another application can generate a link to the Open Access Resolver to present its options. This makes it a useful service to be consumed by other institutional applications such as the OpenURL link resolver, the catalog or other discovery mechanisms. It is also effective as a standalone link resolver for researchers who may not have access to one through their library.
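The dispatch logic described above can be sketched in a few lines. The backend functions here are placeholders (the real ones would query OAIster/OpenContent, the book digitization indexes and so on), and the field names assume an OpenURL 1.0 KEV request:

```python
from urllib.parse import parse_qs

def search_article_sources(fields):
    """Placeholder: query OAIster/OpenContent for preprints and postprints."""
    return []

def search_book_sources(fields):
    """Placeholder: query OCA, Project Gutenberg and book-scan concordances."""
    return []

def resolve(openurl_query, response_format="html"):
    """Route an incoming OpenURL to the appropriate open access searches."""
    fields = {k: v[0] for k, v in parse_qs(openurl_query).items()}
    genre = fields.get("rft.genre", "article")
    if genre in ("book", "bookitem"):
        hits = search_book_sources(fields)     # skip article indexes for books
    else:
        hits = search_article_sources(fields)  # skip book scans for articles
    if response_format == "boolean":
        return "true" if hits else "false"     # cheap yes/no for other apps
    return hits  # a real service would render these as XML or HTML
```

The boolean response is the cheapest to consume: a vendor link resolver could issue it per citation and only display an “open access copy” link when the answer comes back true.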
From a purely technical standpoint, this is not a terribly difficult challenge, especially if the expectations of accuracy and success currently applied to OpenURL link resolvers are also extended to such a project. The issues instead are of agreement on how the service should work, vendor implementation and uptime. What kind of request and response format would work best for the current crop of link resolvers (and other services)? Would vendors utilize this service, or implement their own proprietary variants with resources included through deals they can strike with content providers?
The value of such an application would be immeasurable to libraries and even greater to their users. The vastness of content freely available on the web yet unavailable to our systems cannot and should not be ignored. How long can we afford to overlook free versions of the resources that we spend so much to collect?
1. Sugita, S., Horikoshi, K., Suzuki, M., Kataoka, S., Hellman, E. S., Suzuki, K., et al. (2007). Linking Service to Open Access Repositories. D-Lib Magazine, 13(3/4). Retrieved September 20, 2007, from http://www.dlib.org/dlib/march07/sugita/03sugita.html