A URI Scheme for SuDocs

Jonathan Rochkind recently started a thread on the Code4lib mailing list asking how to register an info URI for SuDocs.  Ray Denenberg responded with an explanation of the process.  I won’t get into my opinions of info URIs or the merits of either side of the ensuing debate that spun out from this thread, but my takeaway was that Jonathan wasn’t really looking for an info URI, anyway.

What Jonathan wanted was:

  • A generic URI model to define SuDocs
  • For this model to be maintained and hosted by somebody other than him
  • If possible, the URIs be resolvable to something that made sense for the supplied SuDoc

I think these are reasonable desires.

I also thought that there were existing structures out there that could meet his requirements without going through the “start up costs” of registering an info URI.  Also, info URIs are not of the web, so after going through the work of creating a ‘standard’, you cannot actually use it directly to figure out what the SuDoc is referring to.

SuDocs (like all other aspects of Government Documents) are arcane, niche and not understood by anyone other than GovDoc librarians, who seem to be rare.  That being said, there is a pretty convenient web presence implicit in SuDoc — in order for a SuDoc to exist, it needs to appear in the GPO’s catalog.  Since anything that appears in the GPO’s catalog can be seen on the web, we have a basis for a dereferenceable URI structure.

The GPO uses Ex Libris’ Aleph and whatever Aleph’s out of the box web OPAC is for their catalog.  Last week, I was Googling for some information about SRU and Aleph and it led me to this page about constructing CCL queries into Aleph (note, please disregard almost everything written on this page about SRU, CQL, etc., as it’s almost completely b.s.).  Figuring there must be some way to search on SuDocs, I tried a couple of combinations of things, until I found this page in the GPO catalog.  Ok, so the index for SuDocs is called “GVD”.

This gives us URLs like:  http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3DE%202.11/3:EL%202

Now, this could work, but it’s incredibly awkward.  It’s also extremely fragile since it’s software (in this case, Aleph) dependent, and if it were to break, we would need the GPO to redirect us to the right place.

This is, of course, exactly what PURLs were designed to do.  I had never actually set up a PURL and almost didn’t for this, since the purl.org service said that it wasn’t working properly so it would be disabled for another week.  However, all the links were there, so I forged ahead.  I was in the process of setting up a regular PURL when I ran across partial redirects.  I figured something like this had to exist for PURLs that were used for RDF vocabularies and the like, but wasn’t aware of how they worked.

Anyway, they’re extremely simple.  Basically you set up a base URL (http://purl.org/NET/foo/) and for anything requested past that base URL (e.g. http://purl.org/NET/foo/bar), the extra path is appended verbatim to the PURL’s target URL and the request is redirected there.

So, I set up a partial redirect PURL at the base:  http://purl.org/NET/sudoc/

The expectation is that it will be followed by a properly URL escaped SuDoc:  E 2.11/3:EL 2 becomes http://purl.org/NET/sudoc/E%202.11/3:EL%202.  The PURL then tacks that SuDoc onto http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3D and redirects you to http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3DE%202.11/3:EL%202.
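
To make the mechanics concrete, here is a minimal Python sketch of the two steps: escaping a SuDoc into a PURL, and the string concatenation that the partial redirect effectively performs.  The function names are made up for this illustration, and the second function only imitates what purl.org does on its side; it is not purl.org’s actual code.

```python
from urllib.parse import quote

# Base of the partial-redirect PURL and the GPO catalog URL it points at
# (both taken from the post).
PURL_BASE = "http://purl.org/NET/sudoc/"
GPO_TARGET = "http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3D"


def sudoc_to_purl(sudoc: str) -> str:
    """Escape a bare SuDoc and append it to the PURL base.

    "/" and ":" are left unescaped so the result matches the form
    used in the post; spaces become %20.
    """
    return PURL_BASE + quote(sudoc, safe="/:")


def simulate_partial_redirect(purl: str) -> str:
    """Imitate the partial redirect: whatever follows the base is
    appended verbatim to the target URL. (Hypothetical helper.)"""
    if not purl.startswith(PURL_BASE):
        raise ValueError("not a SuDoc PURL: " + purl)
    return GPO_TARGET + purl[len(PURL_BASE):]


if __name__ == "__main__":
    uri = sudoc_to_purl("E 2.11/3:EL 2")
    print(uri)
    print(simulate_partial_redirect(uri))
```

Run as a script, this prints the PURL and the catalog URL shown above for E 2.11/3:EL 2.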

What you have then is a unique identifier for a SuDoc that resolves to a human readable representation of what the URI stands for.  If the GPO changes OPACs or Ex Libris changes Aleph’s URL scheme or the GPO comes up with a better representation of SuDoc, it doesn’t matter as long as the actual SuDoc class number can be used to redirect the user to the new location.

Obviously, there’s an expectation here that PURLs remain indefinitely and that purl.org is never lost to a third party that repurposes it for other uses.  However, there are major parts of the web that rely on purl.org, so there are a lot of people that would fight to not see this happen.

Basically, I think these are the sorts of simple solutions we should be using to solve these sorts of problems on the web.  We are no longer the center of the information universe, and it’s time that we accepted that and began to use the tools that the rest of the world is using to solve the same problems that everybody else is dealing with.

How many other ‘identifiers’ could mint persistent, dereferenceable URIs this way?  I look forward to finding out.

By the way, there are currently three possible points of failure for this URI scheme:  purl.org, GPO and me.  I would prefer not to be a single point of failure, so if you would like to be added as a maintainer to this PURL, please let me know and I would be happy to add you.

5 comments
  1. Nice, thanks Ross.

    That redirect feature of Purl is nice, I didn’t know about that. I still don’t entirely understand how those Purls get redirected to a legal Aleph URL. But I’ll read up on the purl documentation. Or, can you provide a capture of exactly how you configured purl to do this?

    This experiment makes me think about some other features of identifiers.

    It is important that I can tell from the URI alone that this IS a sudoc. Which indeed I can, so that’s good. If it starts http://purl.org/NET/sudoc, I know it’s a sudoc. That turns out to be important.

    But that has implications for what people say about identifiers being ‘opaque’. If these were merely in the universe of http://purl.org/ID, I couldn’t tell from the URI alone that it was a sudoc at all. That would be problematic.

    If I couldn’t tell that it WAS a sudoc at all, it would not be useful to me. If I could only tell it was a sudoc after making an http call, it would be a performance nightmare.

  2. Ah, I see the trick on how the purl redirect works now, cool.

  3. Also, readers note that if you are going to embed such a URI in an OpenURL KEV rft_id, the sudoc itself ends up looking ‘double escaped’.

    Take a sudoc: E 2.11/3:EL 2

    Escape it once to turn it into a proper URI: http://purl.org/NET/sudoc/E%202.11/3:EL%202

    Escape that URI to embed it in a KEV rft_id:

    &rft_id=http%3A%2F%2Fpurl.org%2FNET%2Fsudoc%2FE%25202.11%2F3%3AEL%25202

    A little bit confusing. This will be more confusing if the blog comment messes up those strings. 🙂

  4. I like this approach as a lightweight way of de-referencing identifiers in a relatively sustainable way.

    However, I’ve got some comments 🙂
    1) If you go to http://catalog.gpo.gov/F?func=file&file_name=help-1 and scroll right to the bottom of the page it seems that GPO support persistent URLs for some pages of the catalog – although at the moment it looks like these are just top level pages not deeper links. It suggests that there is at least some awareness that the catalog isn’t ‘link friendly’ and so it might be that staff at the GPO are willing to either maintain the redirection Ross has set up, or even support a local redirection service.

    2) There is an alternative link syntax you could use of the form http://catalog.gpo.gov/F/?func=find-b&find_code=GVD&request=E%202.11/3:EL%202
    Not that it is much different, but I prefer the splitting of the index name and search string into two parameters

    3) Obviously what you are actually doing is providing a link to a search result, rather than to a specific document. I guess from Jonathan’s comments he is happy with this – the PURL resolves to something fairly meaningful – but probably worth noting the distinction. The defaults that GPO have set on the catalog mean that even when the deeplink search only results in a single record they still only show the brief record, and not the full record (this is a default setting, the user can change the behaviour by adjusting the preferences in their session – but not something you can adjust when deeplinking).

    4) The syntax for linking to the full metadata record for a specific document is of the pattern http://catalog.gpo.gov/F?func=direct&doc_number=000407525. The ‘doc_number’ is an Aleph allocated system number with no link to available metadata – so the only way you can check this is to look up the record via metadata first. There should be a 1 to 1 correspondence between SuDOC IDs and their system numbers (i.e. each SuDOC ID should be recorded in the metadata of one and only one record). If this is the case GPO could provide a translation mechanism relatively easily if they have licensed the Aleph X-Server (an API which returns XML) or by any other means they wanted.

    I realise that some of this is pie-in-the-sky if GPO aren’t able to deliver this (or aren’t interested) but it took me a bit of poking around to find this stuff out, so thought it was worth recording somewhere!

  5. For my purposes, I actually don’t need resolvability at all, I just need the URI that I can recognize as a SuDoc and extract the bare SuDoc from.

    But rsinger thought it should have some kind of resolvability too, and it’s kind of neat.

    I have discovered that SuDocs are somewhat less predictable than I thought. Pre ~1976 SuDocs are in fact unlikely to show up in catalog.gpo.gov at all, so you might have a valid SuDoc which still doesn’t resolve.

    Also, while identical SuDocs aren’t _supposed_ to be given to different documents, it seems that they are more often than I had hoped — especially for that pre-~1976 period. Since they don’t have those old SuDocs in a computer catalog at all, it’s hard for them to make sure new ones don’t conflict with pre-existing ones, I guess.

    This is all just from ‘reverse engineering’, and finding friendly experienced gov docs librarians to talk to (who aren’t generally used to thinking of SuDocs in this way). I have not successfully made any contact with GPO personally.
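
Two points raised in the comments can be sketched in a few lines of Python: recognizing a SuDoc URI from its prefix alone (no HTTP call needed) and recovering the bare SuDoc from it, and escaping the whole URI a second time to embed it in an OpenURL KEV rft_id.  This is only an illustration under the URI convention described in the post; the function names are hypothetical.

```python
from urllib.parse import quote, unquote

# Prefix that marks a URI as a SuDoc under the scheme in the post.
SUDOC_PREFIX = "http://purl.org/NET/sudoc/"


def is_sudoc_uri(uri: str) -> bool:
    """True if the URI is under the SuDoc PURL base and carries a value."""
    return uri.startswith(SUDOC_PREFIX) and len(uri) > len(SUDOC_PREFIX)


def extract_sudoc(uri: str) -> str:
    """Strip the prefix and undo the URL escaping to get the bare SuDoc."""
    if not is_sudoc_uri(uri):
        raise ValueError("not a SuDoc URI: " + uri)
    return unquote(uri[len(SUDOC_PREFIX):])


def to_kev_rft_id(uri: str) -> str:
    """Escape the entire URI for use as a KEV value.

    Every reserved character is escaped, including the "%" signs already
    present, which is why the SuDoc ends up looking double escaped.
    """
    return "rft_id=" + quote(uri, safe="")


if __name__ == "__main__":
    uri = "http://purl.org/NET/sudoc/E%202.11/3:EL%202"
    print(extract_sudoc(uri))
    print(to_kev_rft_id(uri))
```

For the example URI, the second line printed matches the rft_id string in comment 3, with %20 turned into %2520.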
