Archive

URIs

I’ve been accused of being several things in the Linked Data community this week:  a circular reasoner, a defender of the status quo “just because that’s how we’ve always done it”, and (implicitly) an httpRange-14 apologist.  Quite frankly, none of these is true or quite what I mean (and I’m, of course, overdramatizing the accusations), but let’s focus on the last point for now (which may clear up some of the other points as well).

Ed’s post (as he explains at the end) is a reference to me calling bullshit on his claim that “[he] think[s] httpRange-14 is an elaborate scholarly joke”.  Let me be clear from the outset that I am not particularly dogmatic on this issue.  That is, I don’t think the internet will break if the resource and carrier are conflated, but I also don’t think it’s that hard to keep them separated, and I think the value in doing so outweighs any perceived costs.

First off, let me explain what httpRange-14 is to the uninitiated (skip on ahead if you feel pretty comfortable with this).  In linked data (or semantic web, you can choose the words that feel best to you), we run into a problem with identifiers and what, exactly, they are identifying.  Let’s say I want to talk about Chattanooga.  Well, “Chattanooga” is not a web resource, but if I want to talk about it unambiguously, it needs an identifier, preferably an HTTP URI, so other people can refer to it unambiguously and say things about it and discover it.  Ideally, this web representation would also have human readable (HTML) and machine readable (RDF, XML, etc.) versions.  But the important distinction here is that the city of Chattanooga cannot be retrieved on the web, only these HTML, RDF, XML surrogates.  If the surrogate has the same URI (identifier) as the resource it’s describing, it starts to get difficult to figure out what we’re talking about.

So to try to make this a little clearer, let’s say I am making this representation of Chattanooga for people to use:

<http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee.rdf>
    rdf:type <http://www.geonames.org/ontology#P.PPL> ;
    <http://www.geonames.org/ontology#population> "155554"^^xsd:integer.

But I also feel I need to let people know some administrative data about it, so they know when it was last modified and by whom, etc., so:

<http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee.rdf>
    rdf:type <http://www.geonames.org/ontology#P.PPL> ;
    <http://www.geonames.org/ontology#population> "155554"^^xsd:integer ;
    dcterms:creator <http://dilettantes.code4lib.org/about#me> ;
    dcterms:created "2010-07-09"^^xsd:date ;
    dcterms:modified "2010-07-09T11:25:00-06:00"^^xsd:dateTime .

Now things get confusing.  My new assertions (dcterms:creator/created/modified) are being applied to the same resource as my city, so I am saying that I created a city of 155,554 people today (what have you done today, chump?).

The way we get around this is through a layer of indirection: basically, we just use two URIs.  You request an RDF document from http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee.rdf and it has something like:

<http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee#place>
  rdf:type <http://www.geonames.org/ontology#P.PPL> ;
  <http://www.geonames.org/ontology#population> "155554"^^xsd:integer.
<http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee.rdf>
    rdf:type <http://xmlns.com/foaf/0.1/Document> ;
    <http://xmlns.com/foaf/0.1/primaryTopic> <http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee#place> ;
    dcterms:creator <http://dilettantes.code4lib.org/about#me> ;
    dcterms:created "2010-07-09"^^xsd:date ;
    dcterms:modified "2010-07-09T11:25:00-06:00"^^xsd:dateTime .

And this keeps things a little clearer.  I created the document you’re looking at today, not the resource that the document is describing.  So this way when you say that my RDF is terrible (fair accusation) you’re not necessarily saying that about the city of Chattanooga (and vice versa).  You can read more about this at Cool URIs for the Semantic Web (by the way, I tend to favor the “hash URI” approach, for simplicity’s sake).
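Part of why the hash URI approach feels simple to me is that the client does the work: the fragment never goes over the wire, so the thing and the document about it can never collapse into one identifier.  Here is a rough Python sketch of that split.  The function name is made up for illustration, and I’m assuming the server gets you from the fragment-less URL to the .rdf document on its own (via content negotiation or a redirect):

```python
from urllib.parse import urldefrag


def document_for(thing_uri: str) -> str:
    """Return the URL an HTTP client would actually request for a hash URI.

    The fragment (#place) is stripped before the request is made, so the
    server only ever sees the document URL.
    """
    document_url, _fragment = urldefrag(thing_uri)
    return document_url


# The city itself (a thing that cannot be retrieved over HTTP).
place = "http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee#place"

# The document that gets fetched when someone looks the city up.
print(document_for(place))
```

Statements about creation and modification dates hang off the second URI; statements about population hang off the first.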

Now back to Ed’s post.  His argument is that if he uses http://en.wikipedia.org/wiki/William_Shakespeare as his identifier (referent, really) we should be smart enough to know when we say that this URI is a foaf:Person and that it was dcterms:created on “2001-10-14” that we’re referring to two different things.

The first comment is from Ian (full disclosure: my boss, fuller disclosure: this doesn’t mean I agree with him) who simultaneously “completely agrees” with Ed and yet supplies an argument that punches a gigantic hole in the side of Ed’s thesis.

To put it another way, sure, maybe we can tell that dcterms:created is a strange assertion for a foaf:Person and we have other ways to tell that Shakespeare was born in 1564 (via a bio:Birth resource or something), but this breaks down for books and all sorts of other entities.  So you have dcterms:created “2003-09-04” and dcterms:creator <http://en.wikipedia.org/wiki/Douglas_Coupland> on http://en.wikipedia.org/wiki/Girlfriend_in_a_Coma_%28novel%29 and we’ve now sown some confusion: is that the novel’s creator and creation date, or the Wikipedia page’s?  This ambiguity becomes more problematic down the road when the context changes (that is, assumptions I can make about Wikipedia and Wikipedia’s model don’t necessarily apply elsewhere).

Right around the time I graduated from high school, the guitarist in my band at the time made me a cassette copy of Jimi Hendrix’s “Jimi Plays Monterey”.  The sound quality was pretty terrible and, as I recall, my tape player ate it once, making it even worse.  Still, I loved that album (Jimi, while playing Dylan’s “Like a Rolling Stone”, says “I know I missed a verse, it’s alright, baby.”): I love the songs, I love the playing, I love the energy of the performance.  The medium that album came to me on, however, was subpar.  There are general attributes of “cassette tapes” and then there was “this particular recording on this particular cassette”.

At the same time in my life, I had a compact disc of the BulletBoys’ eponymous album.  Fidelity-wise, the sound of this album was orders of magnitude better than my copy of “Jimi Plays Monterey”, but pretty much everything else about it sucked.

The carrier is not the content.  Being able to refer to the quality of my dilapidated cassette without dragging the Jimi Hendrix Experience into it is useful.  I should be able to say that my BulletBoys CD sounded better than my Hendrix tape without that being a staggering example of bad taste.

In libraries, we have a long history of data ambiguity.  We have struggled enough to figure out the semantics in our AACR2/ISBD data that when we have the chance to easily and concretely identify the things we are talking about, we should take it.  I am not proposing abstracting things into oblivion with resources on top of resources – just sensibly being sure you’re talking about what you say you are.

Unfortunately, one of my problems with the new RDA vocabularies is that in several instances they schmush multiple statements together to avoid modeling the “hard parts” (this is precisely the same issue I have with Ian’s later comment).  For example, RDA has a bunch of properties that are intended to “hand wave” around the complexities of FRBR, such as http://RDVocab.info/Elements/otherDistinguishingCharacteristicOfTheExpression.  So you’d have something like:

<http://example.org/1>
    <http://RDVocab.info/Elements/title> "Something: a something something" ;
    <http://RDVocab.info/Elements/titleOfTheWork> "Something" .

What you’ve done here with “titleOfTheWork” is say that <http://example.org/1> has a work, is itself not a work and the work’s title is “Something”.   That’s some attribute!  But if we can say all of that, why would we not just model the work?! Even if we don’t know where in the WEMI chain <http://example.org/1> falls, if we did something like this:

<http://example.org/1>
    dcterms:title "Something: a something something" ;
    ex:hasWork <http://example.org/works/1234> .

<http://example.org/works/1234>
    a <http://RDVocab.info/uri/schema/FRBRentitiesRDA/Work>;
    dcterms:title "Something" .

we’ve now done something useful, unambiguous and reusable (and not ignoring FRBR while simultaneously defining it).  The closed nature of IFLA’s development of these vocabularies doesn’t lead me to have much hope, though.

But, again, back to Ed.  Like I said, I really don’t think the internet will fall apart and satellites will come crashing to the earth if we don’t adhere consistently to httpRange-14.  No, the reason I call bullshit on Ed’s statement is that he finds the use of owl:sameAs on resources such as http://purl.org/NET/marccodes/muscomp/sn#genre to be inappropriate.  While in his post he claims it’s fine that we conflate the resource of William Shakespeare as a foaf:Person and foaf:Document that was modified on “2010-06-28T17:02:41-04:00”, he on the other hand questions the appropriateness of <http://purl.org/NET/marccodes/muscomp/sn#genre> owl:sameAs <http://dbpedia.org/resource/Sonatas> because doing so implies that <http://purl.org/NET/marccodes/muscomp/sn#genre> has a photo collection at <http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Sonata> (which, in fact, has little to do with the musical genre and actually has a lot of pictures of Hyundais, among other things).

This is a perfectly fair, valid and important point (and one that absolutely needs to be addressed), but doesn’t this also mean he actually cares that we say what we really mean?
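To make the owl:sameAs worry concrete, here is a deliberately naive Python sketch of what a consumer does when it takes owl:sameAs at face value: every statement about one URI gets copied onto the other.  This is a single pass over subjects only, not a real OWL reasoner, and `ex:hasPhotoCollection` is a placeholder property name I’m using for illustration, not necessarily the one DBpedia publishes:

```python
SAME_AS = "owl:sameAs"


def smush(triples):
    """Copy statements across owl:sameAs links (one naive pass, subjects only)."""
    triples = set(triples)
    aliases = [(s, o) for (s, p, o) in triples if p == SAME_AS]
    inferred = set(triples)
    for a, b in aliases:
        for s, p, o in triples:
            if p == SAME_AS:
                continue
            if s == a:
                inferred.add((b, p, o))  # what is true of a is now true of b
            if s == b:
                inferred.add((a, p, o))  # and vice versa
    return inferred


genre = "http://purl.org/NET/marccodes/muscomp/sn#genre"
sonatas = "http://dbpedia.org/resource/Sonatas"
photos = "http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Sonata"

data = {
    (genre, SAME_AS, sonatas),
    # Placeholder property name (an assumption, see above).
    (sonatas, "ex:hasPhotoCollection", photos),
}

for triple in sorted(smush(data)):
    print(triple)
```

The MARC genre code ends up with a photo collection full of Hyundais, which is exactly the kind of statement nobody meant to make.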

Jonathan Rochkind recently started a thread on the Code4lib mailing list asking how to register an info URI for SuDocs.  Ray Denenberg responded with an explanation of the process.  I won’t get into my opinions of info URIs or the merits of either side of the ensuing debate that spun out from this thread, but my takeaway was that Jonathan wasn’t really looking for an info URI, anyway.

What Jonathan wanted was:

  • A generic URI model to define SuDocs
  • For this model to be maintained and hosted by somebody other than him
  • If possible, the URIs be resolvable to something that made sense for the supplied SuDoc

I think these are reasonable desires.

I also thought that there were existing structures out there that could meet his requirements without going through the “start up costs” of registering an info URI.  Also, info URIs are not of the web, so after going through the work of creating a ‘standard’, you cannot actually use it directly to figure out what the SuDoc is referring to.

SuDocs (like all other aspects of Government Documents) are arcane, niche and not understood by anyone other than GovDoc librarians, who seem to be rare.  That being said, there is a pretty convenient web presence implicit in SuDoc — in order for a SuDoc to exist, it needs to appear in the GPO’s catalog.  Since anything that appears in the GPO’s catalog can be seen on the web, we have a basis for a dereferenceable URI structure.

The GPO uses Ex Libris’ Aleph and whatever Aleph’s out of the box web OPAC is for their catalog.  Last week, I was Googling for some information about SRU and Aleph and it led me to this page about constructing CCL queries into Aleph (note, please disregard almost everything written on this page about SRU, CQL, etc., as it’s almost completely b.s.).  Figuring there must be some way to search on SuDocs, I tried a couple of combinations of things, until I found this page in the GPO catalog.  Ok, so the index for SuDocs is called “GVD”.

This gives us URLs like:  http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3DE%202.11/3:EL%202

Now, this could work, but it’s incredibly awkward.  It’s also extremely fragile since it’s software (in this case, Aleph) dependent, and if it were to break, it would require the GPO to redirect us to the right place.

This is, of course, exactly what PURLs were designed to do.  I had never actually set up a PURL and almost didn’t for this, since the purl.org service said that it wasn’t working properly so it would be disabled for another week.  However, all the links were there, so I forged ahead.  I was in the process of setting up a regular PURL, when I ran across partial redirects.  I figured something like this had to exist for PURLs that were used for RDF vocabularies and the like, but wasn’t aware of how they work.

Anyway, they’re extremely simple.  Basically you set up a base URL (http://purl.org/NET/foo/) and anything requested past that base URL (e.g. http://purl.org/NET/foo/bar) will be appended verbatim to the PURL’s target URL, and the request is redirected there.
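As I understand it, the behavior boils down to a prefix swap.  A tiny Python sketch of the idea (this is not purl.org’s actual code, the function name is mine, and the example.org target is made up):

```python
def resolve_partial_redirect(requested: str, purl_base: str, target_base: str) -> str:
    """Swap the registered PURL base for the target base, keeping the rest verbatim."""
    if not requested.startswith(purl_base):
        raise ValueError("request is not under the registered PURL base")
    remainder = requested[len(purl_base):]
    return target_base + remainder


print(resolve_partial_redirect(
    "http://purl.org/NET/foo/bar",
    purl_base="http://purl.org/NET/foo/",
    target_base="http://example.org/target/",  # hypothetical target
))
```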

So, I set up a partial redirect PURL at the base:  http://purl.org/NET/sudoc/

The expectation is that it would be followed by a properly URL escaped SuDoc:  E 2.11/3:EL 2 becomes http://purl.org/NET/sudoc/E%202.11/3:EL%202, and the PURL service then tacks that escaped SuDoc onto http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3D and redirects you to http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3DE%202.11/3:EL%202.
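If you want to mint these URIs in code, something like the following Python sketch should do it.  The function names are mine, and the choice to leave “/” and “:” unescaped is an assumption that simply matches the example above:

```python
from urllib.parse import quote

PURL_BASE = "http://purl.org/NET/sudoc/"
GPO_BASE = "http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3D"


def sudoc_uri(sudoc: str) -> str:
    """Build the PURL for a SuDoc class number, escaping spaces but not / or :."""
    return PURL_BASE + quote(sudoc, safe="/:")


def catalog_url(uri: str) -> str:
    """Where the partial redirect currently sends a SuDoc PURL."""
    if not uri.startswith(PURL_BASE):
        raise ValueError("not a SuDoc PURL")
    return GPO_BASE + uri[len(PURL_BASE):]


uri = sudoc_uri("E 2.11/3:EL 2")
print(uri)
print(catalog_url(uri))
```

Only `GPO_BASE` would have to change if the catalog moved; the minted URIs stay the same.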

What you have then is a unique identifier for a SuDoc that resolves to a human readable representation of what the URI stands for.  If the GPO changes OPACs or Ex Libris changes Aleph’s URL scheme or the GPO comes up with a better representation of SuDoc, it doesn’t matter as long as the actual SuDoc class number can be used to redirect the user to the new location.

Obviously, there’s an expectation here that PURLs remain indefinitely and that purl.org is never lost to a third party that repurposes it for other uses.  However, there are major parts of the web that rely on purl.org, so there are a lot of people that would fight to not see this happen.

Basically, these are the sorts of simple solutions that I feel we should be using to solve these sorts of problems on the web.  We are no longer the center of the information universe, and it’s time that we accepted that and began to use the tools that the rest of the world is using to solve the same problems that everybody else is dealing with.

How many other ‘identifiers’ could be minted as persistent, dereferenceable URIs this way?  I look forward to finding out.

By the way, there are currently three possible points of failure for this URI scheme:  purl.org, GPO and me.  I would prefer not to be a single point of failure, so if you would like to be added as a maintainer to this PURL, please let me know and I would be happy to add you.