Archive


I have been slowly taking the MARC codes lists and modeling them as linked data. I released a handful of them several months ago (geographic area codes, countries and languages) and have added more as I get inspired or have some free time. Most recently, I’ve added the Form of Item, Target Audience and Instruments and Voices terms.

The motivation behind modeling these lists is that they are extremely low-hanging fruit: they are controlled vocabularies that (usually) appear in the MARC fixed (or, at any rate, controlled) fields, which means they should be pretty easy to link to from our source data. The identifiers are based on the actual code values, so nothing has to be looked up when converting MARC into RDF.

I’ll go over each code list, explain its function, and describe how to link to it from MARC:

Geographic Area Codes

The purpose of these is a little vague: they’re hard to classify as to what exactly they are. There are states (Tennessee), countries (India), continents (Europe), geographic features (Andes, Congo River, Great Rift Valley), areas or regions (Tropics, “Southwest, New” (whatever that means), “Africa, French-speaking Equatorial”), hemispheres (Southern Hemisphere), planets (Uranus), and then there are entries for things like “Outer Space” and “French Community” (which, as I understand it, is sort of the French analog to the British Commonwealth); in short, they are all over the map (literally).

I have modeled these things as wgs84:SpatialThings.  I don’t know if that is 100% appropriate (e.g. “French Community”) and am open to recommendations for other classes.  Given that they are somewhat hierarchical and are used to define the geographic “subject” of a work, it might be more appropriate to model them using SKOS.

The geographic area code is found in the MARC 043$a (a repeatable subfield in a non-repeatable field) and should be a seven-character string (although this may vary based on local cataloging practices).  Most codes will be much shorter than that: the specification requires right-padding with hyphens (“-”) to seven characters (“aa-----”).  To turn this into a MARC Codes URI, you’ll drop the trailing hyphens and append “#location”:

http://purl.org/NET/marccodes/gacs/aa#location

http://purl.org/NET/marccodes/gacs/n-us-md#location
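
Here’s roughly what that normalization looks like in Ruby (a quick sketch; the method name is mine, not part of any released library):

    # Turn a raw 043$a value into a MARC Codes URI by dropping the padding
    # hyphens and appending the fragment identifier.
    def gacs_uri(code)
      normalized = code.strip.sub(/-+\z/, '')
      "http://purl.org/NET/marccodes/gacs/#{normalized}#location"
    end

    gacs_uri('aa-----')   #=> "http://purl.org/NET/marccodes/gacs/aa#location"
    gacs_uri('n-us-md')   #=> "http://purl.org/NET/marccodes/gacs/n-us-md#location"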

I’m not sure what the “best” property to use to link to these resources actually is, but I have been using <http://purl.org/dc/terms/spatial> (although, admittedly, not consistently).  This would entail that these resources are also a <http://purl.org/dc/terms/Location>, which is something I can live with.

Not all of the geographic area codes are linked to anything, but some are linked to the authorities at http://id.loc.gov/authorities/, dbpedia, geonames, etc.

Country Codes

These are a little more consistent than the geographic area codes, but they are definitely not all “countries”.  With a few exceptions (United States Misc. Caribbean Islands), they are actual political entities, ranging from countries (Guatemala) to states/provinces/territories (Indiana, Nova Scotia, Gibraltar, Gaza Strip).

Like the geographic area codes, I’ve modeled these as wgs84:SpatialThings.

They can appear in several places in the MARC record:  they will almost always appear in the 008, positions 15-17, as the “country of publication”.  If one code isn’t enough to convey the full story of the production of a particular resource (!), codes may also appear in the 044$a (a repeatable subfield in a non-repeatable field).  There are a few other fields where country codes could appear:  the 535$g, 775$f and the 851$g; I have no idea how common it is to find them there (and they have a different meaning: the 535/851 define the location of the item, for example).

To generate the country code URI, take the value from the MARC 008[15-17] or 044$a, strip any leading or trailing spaces and append “#location”.  The URIs look like:

http://purl.org/NET/marccodes/countries/aw#location

http://purl.org/NET/marccodes/countries/sa#location
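
In Ruby, pulling both the 008 and 044 values out of a record looks something like this (a sketch assuming the ruby-marc gem; the filename is just an example):

    require 'marc'

    def country_uris(record)
      codes = []
      codes << record['008'].value[15, 3] if record['008']   # country of publication
      record.fields('044').each do |field|
        field.subfields.each { |sf| codes << sf.value if sf.code == 'a' }
      end
      codes.map { |code| "http://purl.org/NET/marccodes/countries/#{code.strip}#location" }
    end

    MARC::Reader.new('records.mrc').each do |record|
      puts country_uris(record)
    end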

To link to these resources, I’ve been using the RDA:placeOfPublication property, although I’m sure there are plenty of others that are appropriate (seems like a logical property for BIBO, for example).

The original code lists are also grouped by region, but there are no actual codes for this.  I created some for the purposes of linked data:

http://purl.org/NET/marccodes/countries/regions/1#location

http://purl.org/NET/marccodes/countries/regions/2#location

etc. (until 12).

Since the country codes in MARC only note the place of publication, they are far less valuable than the geographic area codes (even though the latter are much more ambiguous in meaning): it’s much more interesting when you can say that all of these things:

http://api.talis.com/stores/rsinger-dev4/services/sparql?query=SELECT+%3Fs%0D%0AWHERE+{%0D%0A%3Fs+%3Fp+%3Chttp%3A%2F%2Fpurl.org%2FNET%2Fmarccodes%2Fgacs%2Fe-ie%23location%3E%0D%0A}&output=json

are referring to the same place as all of these things:

http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&should-sponge=&query=select+distinct+%3Fs+where+%0D%0A{%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2Fcountry%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FIreland%3E}&debug=on&timeout=&format=text%2Fhtml

which, in turn, are referring to the same place as this:

http://www.freebase.com/view/en/ireland/-/book/book_subject/works

which, in my mind, has tremendous potential.

Language Codes

Unbeknownst to me before undertaking this project, the Library of Congress is the maintenance agency for ISO 639-2, and the ISO codes are a derivative of the MARC code list.  They aren’t a 1:1 mapping (there are 22 codes that differ in the ISO list), but they’re extremely close.  What is particularly nice about this is that most locale/language libraries are aware of these codes, so it’s fairly easy to map to other code lists (notably ISO 639-1, which is used by xml:lang).

The Library of Congress publishes an XML version of the list, which is what I used to model it as linked data.  One of the nice features of this list is that it has attributes on the name that denote whether or not there’s an authority record for it:

<name authorized="yes">Abkhaz</name>

which we can then take, tack the substring ” language” onto it and look it up in http://id.loc.gov/authorities:

http://id.loc.gov/authorities/sh85000169#concept

giving us a link between things created in a particular language and things created about that language.

To use the language codes, take the value of positions 35-37 of the 008 or the 041 (the different subfields of the 041 each identify a different part of the resource that might be in a different language, so check the spec on this one).  I doubt it appears much in actual data, but the 242$y might also have the language of the translated title.

Take that value (be sure to strip any trailing/leading whitespace — it’s supposed to be 3 characters: no more, no less) and plug it into the following URI template:

http://purl.org/NET/marccodes/languages/{abc}#lang

for example:

http://purl.org/NET/marccodes/languages/tur#lang

http://purl.org/NET/marccodes/languages/myn#lang

etc.
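
Gathering every language URI from a record (the 008 plus all of the 041 subfields) might look like this, again assuming ruby-marc:

    require 'marc'

    def language_uris(record)
      codes = []
      codes << record['008'].value[35, 3] if record['008']
      record.fields('041').each do |field|
        field.subfields.each do |sf|
          next if %w[2 6 8].include?(sf.code)   # skip source/linkage subfields
          codes << sf.value
        end
      end
      codes.map(&:strip).select { |code| code.length == 3 }.uniq.map do |code|
        "http://purl.org/NET/marccodes/languages/#{code}#lang"
      end
    end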

The language resources link to id.loc.gov (as mentioned above) as well as Lingvoj/Lexvo (they link to both, where appropriate, since there are likely many data sources out there still using the Lingvoj URIs).  There are a handful (for example, Swedish) that link to dbpedia, but since those links are available in Lexvo, it’s not essential that they appear here.

Musical Composition Codes

There are two codes lists that are directly related to music-based resources (sound recordings, scores and video): the musical composition codes and the Instruments and Voices codes.  Given that there has been a lot of work put into modeling music data for the linked data cloud, I thought it would be most useful to orient both of these lists to be used with the Music Ontology.

The composition codes basically denote the “genre” of the music contained in the resource.  The list is extremely classical-centric and sometimes lumps a lot of different forms into one genre code (try “Divertimentos, serenades, cassations, divertissements, and notturni” on for size), but it is definitely a start for finding like resources.

They are modeled as mo:Genre resources and include links to id.loc.gov, dbpedia and wikipedia.  To get the code, use either positions 18-19 of the MARC 008 field or the 047$a (both the field and the subfield are repeatable).  The normalized code should always be two alphabetic characters, lowercased.

They go into a URI template like:

http://purl.org/NET/marccodes/muscomp/{ab}#genre

such as:

http://purl.org/NET/marccodes/muscomp/sy#genre

or

http://purl.org/NET/marccodes/muscomp/mz#genre

It would be really useful to find other data sources that use mo:Genre to link these to.

Form of Item Codes

This is a very small list that broadly describes the format of the resource being described.  It is probably most useful with dcterms:format, so they’ve all been modeled with the rdf:type dcterms:MediaType.  A full third of the codes describe microforms (granted, out of 9 total), which should give you some sense of how relevant these are.

Getting the code from the MARC record depends on the kind of record you’re looking at.  For books, serials, sound recordings, scores, computer files and mixed materials, take position 23 of the 008.  For visual materials and maps, use position 29.  It should be one lowercase alphabetic character.

URIs look like:

http://purl.org/NET/forms/{a}#form
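
Given a ruby-marc record, a rough sketch of that logic looks like this (the leader-based type test is my own shorthand: maps are leader/06 e or f, visual materials are g, k, o or r; double-check it against your data):

    def form_of_item_uri(record)
      leader_type = record.leader[6, 1]
      position = %w[e f g k o r].include?(leader_type) ? 29 : 23
      code = record['008'] && record['008'].value[position, 1]
      return nil unless code =~ /\A[a-z]\z/
      "http://purl.org/NET/forms/#{code}#form"
    end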

The resources link to http://id.loc.gov/authorities (think Genre/Form terms), http://id.loc.gov/vocabulary/graphicMaterials and (for a couple) dbpedia.

Ideally, these will eventually link to whatever is analogous in RDA (if somebody can point that out to me).

Frequency of Issue Codes

Unlike the previous code list, this one seems much more useful.  It is used to define how often a continuing resource is updated.  Unfortunately, it is extremely print-centric (the only term more frequent than “daily” is “Continuously updated”, which is defined as “Updated more frequent than daily.”), but some of the terms would seem to hold value even outside of the library context (Annual, Biweekly, Quarterly, etc.).  It doesn’t take a tremendous leap of the imagination to see how these might be useful for event calendars (Monthly, etc.) or for GoodRelations-type datasets (“Semi-annual Blowout Sale!”).

To get the code from the MARC record, check the 008[18] or the 853-855$w.  Presumably, this should only appear for continuing resources (SER).  It’s a one-letter code, lowercased.

The URIs look like:

http://purl.org/NET/marccodes/frequency/{x}#term

They are modeled as dcterms:Frequency resources and link to dbpedia where available.

Target Audience Codes

This is another fairly short, extremely generalized list.  It is most likely primarily useful for determining the age level of children’s resources (5 of the 8 terms are for juvenile age groups).  They are of rdf:type dcterms:AgentClass.  Resources are linked (where appropriate, and maybe even a few that aren’t) to dbpedia and http://id.loc.gov/authorities/.

For books, music (scores, sound recordings), computer files and visual materials, get the code from the 008[22].  It is one letter, lowercased.  URIs follow the fairly consistent form we’ve seen thus far:

http://purl.org/NET/marccodes/target/{x}#term

http://purl.org/NET/marccodes/target/c#term

http://purl.org/NET/marccodes/target/f#term

Instruments and Voices Codes

The terms describe the instruments or vocal groups that either appear in a particular resource (sound recordings, for example) or that it is intended for (scores).  Like many of the other code lists, these are quite general and maddeningly biased towards classical music (Continuo, Celeste and Viola d’amore, but no banjo or sitar, for instance).  Like the form of musical composition terms, I modeled these to be used with the Music Ontology, namely as the object of mo:instrument.  mo:Instrument has this note:

Any taxonomy can be used to subsume this concept. The default one is one extracted by Ivan Herman from the Musicbrainz instrument taxonomy, conforming to SKOS. This concept holds a seeAlso link towards this taxonomy.

so these terms have been modeled as skos:Concepts.  There are skos:exactMatch relationships to the Musicbrainz taxonomy where appropriate (as well as links to id.loc.gov/authorities and dbpedia).  The original code list implies a hierarchy (“Larger ensemble – Dance orchestra” should be thought of as “Dance orchestra” with the broader term “Larger ensemble”), but that hierarchy isn’t actually used in MARC.  I broke these broader terms out on their own for this vocabulary, since it seemed useful in a linked data context and wouldn’t actually hurt anything (the codes are two letters, so the “broader terms” just use the first letter).

To get the code, use the MARC 048 subfield $a or $b (for ensemble or solo parts, respectively) and take the first two characters (which must be letters).  This code may be followed by a two-digit number (left-padded with zeroes) signifying how many parts.  Drop this number, if present.

URI template:

http://purl.org/NET/marccodes/musperf/{xx}#term

http://purl.org/NET/marccodes/musperf/cd#term

http://purl.org/NET/marccodes/musperf/ed#term
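
In Ruby (ruby-marc again), the 048 parsing looks about like this; the broader-term URI follows the one-letter convention described above:

    require 'marc'

    def musperf_uris(record)
      record.fields('048').flat_map do |field|
        field.subfields.select { |sf| %w[a b].include?(sf.code) }.map do |sf|
          code = sf.value[0, 2].downcase   # e.g. "ca01" => "ca"; drop the part count
          { :term    => "http://purl.org/NET/marccodes/musperf/#{code}#term",
            :broader => "http://purl.org/NET/marccodes/musperf/#{code[0, 1]}#term" }
        end
      end
    end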

Other Codes

I am not sure when or if I will model any more codes lists.  Ideally, the Library of Congress should be doing these (they’ve done the relator codes, and preservation events lists).  The only other lists I can see much value in are the Specific Material Form Terms (the MARC 007) and the MARC Organization codes.

I have done a bit of work on the specific material forms list, but it’s fairly complicated.  My current approach is a hybrid of controlled vocabularies and RDF schema (after all, it makes sense for a globe to be rdf:type <http://purl.org/NET/marccodes/smd/terms/Globe> rather than have that be some property set on an untyped resource).  For an RDF schema, though, I would prefer a “better” namespace than purl.org/NET/, although perhaps it doesn’t really matter much.

No matter what, it would certainly push the limits of my freebie Heroku account that this is currently running on.

I am definitely open to any ideas or recommendations people might have for these (and requests for other lists to be converted).  I’d also be interested to see if you are able to use them with your data.

For a couple of months this year, the library world was aflame with rage at the proposed OCLC licensing policy regarding bibliographic records.  It was a justifiable complaint, although I basically stayed out of it:  it just didn’t affect me very much.  After much gnashing of teeth, petitions, open letters from consortia, etc. OCLC eventually rescinded their proposal.

Righteous indignation: 1, “the man”: 0.

While this could certainly be counted as a success (I think, although it means we default to the much more ambiguous 1987 guidelines), there is a bit of a mixed message here about where the library community’s priorities lie.  It’s great that you now have the right to share your data, but, really, how do you expect to do it?

It has been a little over a year since the Jangle 1.0 specification was released; 15 months or so since all of the major library vendors (with one exception) agreed to the Digital Library Federation’s “Berkeley Accord”; and we’re at the anniversary of the workshop where the vendors actually agreed on how we would implement a “level 1” DLF API.

So far, not a single vendor at the table has honored their commitment, and I have seen no intention to do so with the exception of Koha (although, interestingly, not by the company represented in the Accord).

I am going to focus here on the DLF ILS-DI API, rather than Jangle, because it is something we all agreed to.  For all intents and purposes, Jangle and the ILS-DI are interchangeable:  I think anybody who has invested any energy in either project would be thrilled if either one actually caught on and was implemented in a major ILMS.  Both specifications share the same scope and purpose.  The resources required to support one would be the same as for the other; the only differences between the two are the client-side interfaces.  Jangle technically meets all of the recommendations of the ILS-DI, but it doesn’t conform to the bindings that we, the vendors, agreed to (although there is an ‘adapter’ to bridge that gap).  Despite having spent the last two years of my life working on Jangle, I would be thrilled to no end if the ILS-DI saw broad uptake.  I couldn’t care less about the serialization; I only care about the access.

There is only one reason that the vendors are not honoring their commitment:  libraries aren’t demanding that they do.

Why is this?  Why the rally to ensure that our bibliographic data is free for us to share when we lack the technology to actually do the sharing?

When you look at the open source OPAC replacements (I’m only going to refer to the OSS ones here, because they are transparent, as opposed to their commercial counterparts) such as VuFind, Blacklight and Scriblio, and take stock of the hoops that have to be jumped through to populate their indexes and check availability, most libraries would throw their hands in the air and walk away.  There are batch dumps of MARC records.  Rsync jobs to get the data to the OPAC server.  Cron jobs to get the MARC into the discovery system.  Screen scrapers and one-off “drivers” to parse holdings and status.  It is a complete mess.

It’s also the case for every Primo, Encore, Worldcat Local, AquaBrowser, etc. that isn’t sold to an internal customer.

If you’ve ever wondered why the third party integration and enrichment services are ultimately somewhat unsatisfying (think BookSite.com or how LibraryThing for Libraries is really only useful when you can actually find something), this is it.  The vendors have made it nearly impossible for a viable ecosystem to exist because there is no good way to access the library’s own data.

And it has got to stop.

For the OCLC withdrawal to mean anything, libraries have either got to put pressure on their vendors to support one of the two open APIs, migrate to a vendor that does support the open APIs, or circumvent the vendors entirely by implementing the specifications themselves (and sharing with their peers).  This cartel of closed access is stifling innovation and, ultimately, hurting the library users.

I’ll hold up my end (and ensure it’s ILS-DI compatible via this) and work towards it being officially supported here, but the 110 or so Alto customers aren’t exactly going to make or break this.

Hold your vendor’s feet to the fire and insist they uphold their commitment.

In a world where library management systems are sophisticated and modern…

I was doing some Google searches about SKOS, trying to figure out the exact distinction between skos:ConceptScheme and skos:Collection (it’s much more clear to me now) and I came across this article in XML.com:

Introducing SKOS

The article is fine, but it’s not what compelled me to write a blog post.  I was struck by a comment on that page titled What about Topic Maps?:

This new W3C standard obviously has a huge overlap with the very mature ISO standard Topic Maps. Topic Maps were originally conceived for (almost) exactly the same problem space as SKOS, and they are widely used. (For example, all major library cataloging software either supports Topic Maps or soon will.)

However, Topic Maps proved to be more generally useful, so they are often compared and contrasted with RDF itself. The surprising difficulty of making Topic Maps and RDF work together is exactly the “extra level of indirection” mentioned by the author of this article about SKOS.

It is very strange that neither this article, nor the referenced XTech paper, mentions Topic Maps.

What is the relationship between SKOS and Topic Maps? How does this fit in with the work (as reported in Edd Dumbill’s blog) on interoperability between Topic Maps and RDF/OWL?

Now, I have no idea if “yitzgale” is some sort of alias of Alexander Johannesen; let’s assume “no” (for one thing, that comment is far too optimistic about library technology).  The sentence “[f]or example, all major library cataloging software either supports Topic Maps or soon will” is sort of stunning in both the claim it makes and its total lack of accuracy.  I feel pretty confident in my familiarity with library cataloging software, and I can say with some degree of certainty that there is no support for topic maps today (hell, MARC21, MFHD and Unicode support are pushing it, and those are just incremental changes).  This comment was written four years ago.

And yet, there’s part of me that feels robbed.  Where is the topic map support in my library system?  I don’t even really know anything about TM, but I still feel it would be a damn sight better than what we’ve got now.  What reality is this that yitzgale is living in, with its fancy library systems and librarians and vendors willing to embrace a radical change in how things are done?  I want in.

I might even be able to jump off my RDF bandwagon for it.

I was reading Brian’s appeal for more Emerils in the library world (bam!), noticed Steven Bell’s comment (his blog posting was a response to one by Steven in the first place) and it got me thinking.

First off, I don’t necessarily buy into Brian’s argument.  Maybe it’s due to the fact that he’s younger than me, but my noisy, unwanted opinions aren’t because I didn’t get a pretty enough pony for my sixteenth birthday or because I saw Jason Kidd’s house on Cribs™ and want to see my slam dunk highlights on SportsCenter on my 40″ flat screens in every bathroom.  It’s because I feel I have something to offer libraries and I genuinely want to help effect change.  Really, I know this is what motivates Brian, too, despite his E! Network thesis, because we worked together and I know his ideas.

Brian doesn’t have to worry about his fifteen minutes coming to a close anytime soon.  Although at first blush it would appear that the niche he has carved out for himself is potentially flash-in-the-pan-y (Facebook, Second Life, library gaming, other Library 2.0 conceits), the motivation for why he does what he does is anything but.  He is really just trying to meet users where they are, on their terms, to help them with their library experience.

Technologies will change and so, too, will Brian, but that’s not the point.  He’ll adapt and adjust his methods to best suit what comes down the pike, as it comes down the pike (proactively, rather than reactively) and continue to be a vanguard in engaging users on their own turf.  More importantly, though, I think he can continue to be a voice in libraries because he works in a library and if you have some creative initiative it’s very easy to stand out and make yourself heard.

Brian and I used to joke about the library rock star lifestyle:  articles, accolades, speaking gigs, etc.  A lot of this comes pretty easily, however.  If you can articulate some rational ideas and show a little something to back those ideas up, you can quickly make a name for yourself.  Information science wants visionary people (regardless of whether or not anybody follows that leader) and librarians want to hear new ideas for how to solve old problems.  Being a rock star is pretty easy; being a revolutionary is considerably harder.

I made the jump from library to vendor because I wanted to see my ideas affect a larger radius than what I could do at a single node.  It has been an interesting adjustment and I’m definitely still trying to find my footing.  It has been much, much more difficult to stand out because I am suddenly surrounded by a bunch of people that are much smarter than me, much better developers than me, and have more experience applying technology on a large scale.  This is not to say that I haven’t worked with brilliant people in libraries (certainly I have, Brian among them), but the ratio has never been quite like this.  Add to that the fact that being a noisy, opinionated voice within a vendor has its immediate share of skeptics and cynics (who are the ‘rock stars’ in the vendor community?  Stephen Abram?  Shoot me.), and I may find myself falling into Steven Bell’s dustbin.  Then again, I might be able to eventually influence the sorts of changes that inspired me to make the leap in the first place.  I can do without the stardom in that case.

San Francisco is a truly miserable town to try to recuperate from a sore throat.  Not that anyone would consider a place with the nickname ‘Fog City’ to be a good place to convalesce, especially since walking around in the wet and cold is so desirable given the charms of the city.  And so I found myself last week (and my first week at Talis), slogging through precipitation to the Open Content Alliance’s Annual Meeting at the Officer’s Club at the Presidio.

Festivities got off to a late start and began with everyone in the room (all hundred-something of us) standing up and introducing ourselves.  While this helped put names to faces (otherwise I never would have met John Mignault) and identify some of the OCA’s initiatives, it had the added effect of pushing us completely off schedule.

Brewster Kahle talked for a while about the need to focus on texts from between 1923 and 1963 (or whatever); a vast majority of these works are likely out of copyright, it just takes some research or contacting the rights holder.  We cannot just digitize pre-1923 works; that has diminishing returns of value.  Post-1964 materials will likely never be available, so this middle group needs to be exploited.

Next, we got a report out on some of the activities the OCA has been undertaking for the last year: microfilm digitization from UIUC; the Biodiversity Heritage Library; printing on demand; and briefly on the Open Library.  The microfilm digitization project is interesting.  They can scan something like a roll every hour.  I had a similar idea when I was at Tech, although my plan had been to use the existing, public microform scanners.  Obviously that would have been a lot slower and with possibly mixed results, but a lot cheaper.

The Biodiversity Heritage Library is a consortium of 10 natural history museums and botanical gardens (including the Smithsonian, the New York Botanical Garden and Kew Gardens) working to create a subject-specific portal to their collections.  If there were a lot of details given about this, I must have tuned them out.  Same goes for the Open Library project; I’m pretty sure both of these updates were rather light on specifics.  Printing on demand was presented by a vendor:  the simple message was that this is becoming affordable, and the cool part is that the economics are exactly the same for printing one copy of a book as for printing 1,000 copies, at around $0.01 per page.

After a break (where I talked to John Mignault the whole time), we had breakout sessions.  I attended “Sharing and Integration of Bibliographic Records” and intended to sit and listen.  Instead, I wound up talking.  A lot.  One of the main topics of conversation was a proposal by Bowker (the ISBN issuing agency for North America) to supply ISBNs for digitized copies of works that would not originally have had an ISBN (which, in the context of the Open Content Alliance, would be nearly all of them).  Bowker has offered 3 million ISBNs (250,000 per library) as a gift to the non-profit sources in the OCA.  After a library burns through its 250k, the ISBNs are $0.10 each.

Superficially, you could either love or hate this proposal, but as debate wore on, it became easy to both love and hate it.  I think I argued for both sides during the course of the session.  While the payoffs seem logical (it would be nice to discover these materials by related ISBN, certainly), there are some pitfalls as well.  For instance, the OCA isn’t terribly effective at discouraging multiple institutions from digitizing the same edition/run of a particular book, which means that two different scans of essentially the same book could potentially have two different ISBNs.  This actually complies with the ISBN specification, but it certainly wouldn’t comply with people’s expectations.

Also, it is unclear how these records should be treated in services like OCLC.  If they get an ISBN, does that mean a new record should be added?  If multiple scans are created, what is the appropriate 856 to add to an existing record?  What is the ‘authoritative’ URL?

We also talked a bit about the static nature of the Internet Archive’s metadata.  It is assumed that the metadata will not change after a scan is loaded into the archive, but this is hardly the reality.  How then does the IA learn of the updates made at the owning institution?  This seems like the perfect application of RDF; the IA would just point at the owning institution’s record, but that would obviously require some infrastructure.  The notion of a pingback, like weblogs use, was raised.

After lunch, we got reports from the breakouts.  The ILL/Scan-on-demand group came up with a process to share still-in-copyright items.  They also recommended that no ILL charges be made on these requests.  This is really quite an interesting development and I’m awfully impressed they made the progress they did.  It’s obviously because I wasn’t there to argue about every little point.

Brewster Kahle then had those interested in the ISBN deal go back into a room and work things out.  The same issues came up, but this time I feel they were largely ignored.  Kahle wants to implement ISBNs and really wouldn’t take any other answer, so the plan is to figure out how to make this work.  Since he’s willing to pony up a lot of the cash to make it happen, it’s certainly his call.  His argument was, basically, “it will work or it won’t”.  Sure, but how will it affect the use of ISBNs in the meantime?

We broke up to listen to Carl Lagoze speak about life after we’ve digitized everything.  His thesis, in a nutshell: what are we going to do after we’ve aggregated everything into repositories?  It’s imperative that we come up with value on top of this data we’re accumulating; it’s not enough just to collect it.  We need to make associations, content and the means to allow our researchers to leverage the objects we have.  He opened the floor to discussion on this topic and comments varied.  I think I was still hung up on the ISBN thing and wasn’t in the mood to argue anymore.

I skipped the reception to meet Ian and Paul and get my Macbook Pro.  Unfortunately, I had to fly out the next morning, but all in all it was a good trip!

…although LinuX_Xploit_Crew, with all due respect, I think it actually is.

Oh well, we’re back with a new theme (which nobody will see except to read comments, since I’m pretty sure all traffic comes from the code4lib planet) and an updated WordPress install. Look out, world!

So, in the downtime here’s a non-comprehensive rundown of what I’ve been working on:

  1. I’ve written an improved (at least, I think it’s improved) alternative to Docutek’s ERes RSS interface. Frankly, Docutek’s sucked. Maybe we have an outdated version of ERes, but the RSS feeds would give errors because you have to click through a copyright scare page before you can view reserves, and you can’t point the RSS links through that form to get the item. I wrote a little Ruby/Camping app that takes URLs like http://eres.library.gatech.edu/course/WS-1001-A/Fall/2007 and turns that into a usable feed (see the sketch after this list). I needed the course id/term/year format to show them in Sakai. My favorite part of this project was finding RubyScript2Exe, which lets me bundle just one file (the compiled Camping app) plus a configuration file. Granted, an ASP.NET app would be even easier for sites to install, but I didn’t have time to learn ASP.NET. I have more ideas of what I would like to do with this (such as showing current circ status for physical reserves), but in the chaos that is our library reorg, I haven’t gotten around to even showing anybody what I’ve written so far.
  2. I broke ground on a Metalib X-Server Ruby library. It took me a while to wrap my head around how this needed to be modelled, but I think it’s starting to take shape. It doesn’t actually perform queries yet, but it connects to the server, lets you set the portal association, and finds and sets the category to search. Quicksets and MySets are all derivations of the same concept, so I don’t think it will take me long to incorporate actual searching. As a proof of concept, I plan on embedding this library in MemoryHole, our Solr-based discovery app. I’ve actually stopped development of MemoryHole so we can focus on VuFind, since they do functionally the same thing and I’d rather help make VuFind better than replicate everything it does, only in Ruby. The reason I’m doing this proof of concept in MemoryHole rather than VuFind is solely due to familiarity and time.
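
The actual ERes app is a Camping app, but at its core it’s just a transformation like the sketch below (the scraping step is stubbed out with a placeholder; none of this is the real code):

    require 'rss'

    # Placeholder for the real work: scraping the ERes course page (behind its
    # copyright click-through) and returning [title, url] pairs.
    def fetch_reserve_items(course_id, term, year)
      [['Week 1 readings', 'http://eres.library.gatech.edu/item/1']]   # example data only
    end

    def reserves_feed(course_id, term, year)
      RSS::Maker.make('2.0') do |maker|
        maker.channel.title       = "Reserves: #{course_id} (#{term} #{year})"
        maker.channel.link        = "http://eres.library.gatech.edu/course/#{course_id}/#{term}/#{year}"
        maker.channel.description = "Course reserves for #{course_id}"
        fetch_reserve_items(course_id, term, year).each do |title, url|
          maker.items.new_item do |item|
            item.title = title
            item.link  = url
          end
        end
      end.to_s
    end

    puts reserves_feed('WS-1001-A', 'Fall', '2007')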

In other news, my last post seems to have caused a bit of a stir. My plan is to write a response, but the short of it is that I feel the arguments for an MLS are extremely classist.

Also my bathroom is finished and it looks great.

ERESidue


Can anyone give a rational explanation as to why a job with a description like this:

DESCRIPTION: Provides technology and computer support for the Vanderbilt Library. The major areas of responsibility include developing, maintaining and assisting in the enhancement of interfaces to web-enabled database applications (currently implemented in perl, PHP, and MySQL). The position also helps establish and maintain guidelines (coding standards, version control, etc.) for the development of new applications in support of library patrons, staff, and faculty across the university. This position will also provide first line backup for Unix system administration. Other duties and assignments will be negotiated based on the successful candidate’s expertise, team needs, and library priorities.

would require an MLS? Library experience? Sure, I can see why that would be desirable. While I find it ridiculous when many libraries require an MLS for what is essentially an IT manager position, Vandy is upping the ante here and requiring it for a developer/jr. sysadmin.

I guess that’s a way to prop up the profession.

Before I left for Guatemala, Ian Davis at Talis asked if I could give him a dump of our MARC records to load into Talis Platform. I had been talking in the #code4lib channel about how I was pushing the idea of using Talis Source to make simple, ad-hoc union catalogs; we could make one for Georgia Tech & Emory (we have joint degree programs) or Arche or Georgia Tech/Atlanta-Fulton Public Library, etc. My thinking was that by utilizing the Talis Platform, we could forgo much of the headache in actually making a union catalog for somewhat marginal use cases (the public library one notwithstanding).

About a week after I got back from Guatemala, I had an email from Richard Wallis with some URLs to play around with to access my Bigfoot store. He showed me search services, facet services and augment services. I wasn’t able to really dive into it much at the time, but since I’m working on a total site search project for the library, I thought this would be a good chance to kick the tires a bit and include catalog results.

After two days of poking around, I have formed some opinions of it, have some recommendations for it, and have written a Ruby library to access it.

1) The Item Service

This is certainly the most straightforward and, for many people, the most useful service of the bunch. The easiest way to think of the item service is as an HTTP-based Lucene service (a la Solr or Lucene-WS) over your bib records. It returns something OpenSearch-y (it claims to be an RSS 1.0 document), but it doesn’t validate. That being said, FeedTools happily consumed it (more on that later) and the semantics should be familiar to anyone who has looked at OpenSearch before. Each item node also contains a Dublin Core representation of the record and a link to a marcxml representation. I’m not sure if there’s a description document for Bigfoot.
Although the query syntax is pure Lucene (title:”The Lexus and the Olive Tree”), the downside is that the indexes aren’t documented anywhere, and I doubt there would be any way to add new ones (for example, my guess is I wouldn’t be able to get an index for the 490/440$v that I use for the Umlaut). I don’t see returning the results as OAI_DC being too much of a problem, since the RSS item includes a title (which would have been tricky between the DC and the marcxml). My Ruby library might not generate valid DC; I haven’t really looked into it.
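
Consuming it from Ruby is about as simple as you’d hope; something like the following works (the endpoint path and parameter name here are illustrative placeholders, not taken from the Bigfoot docs):

    require 'cgi'
    require 'feed_tools'

    # Placeholder URL: substitute the real item service endpoint and query
    # parameter from the Platform documentation.
    url = 'http://api.talis.com/stores/rsinger-dev4/items?query=' +
          CGI.escape('title:"The Lexus and the Olive Tree"')

    feed = FeedTools::Feed.open(url)
    feed.items.each do |item|
      puts item.title
      puts item.link
    end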

The docs also mention you can POST items to your Bigfoot store, but they don’t mention what your data needs to look like (MARC?) or what credentials you need to add something (I mean, it must be more than just your store name, right?). My hope is to add this functionality to bigfoot-ruby soon (especially since my data is from a bulk export from last October).

2) The Facet Service

This one is intriguing, definitely, since faceted searching is all the rage right now. The search syntax is basically the same as the Item Service’s, except you also send a comma-delimited list of the fields you would like facets for. What you get back is either an XML or XHTML document of your results.

For each field you request, you get back a set of terms (you can specify how many you want, with a default of 5) that appear most frequently in that field. You also get an approximation of how many results you would get in each facet and a URL to search on that facet. It’s quite fast, although, realistically, you can’t do much with the output of a facet search alone.

Again, it’s difficult to know what you can facet on (subject, creator and date are all useful — I’m sure there are others) and the facet that (for me, at least) held the most promise — type — is too broad to do much with (it uses Leader position 7, but lumps the BKS and SER types together in a label called “text”). I would like to see Talis implement something like my MARC::TypedRecord concept so one could facet on things like government document or conference. You could separate newspapers from journals and globes from maps. Still, the text analysis of the non-fixed fields is powerful and useful and beats the hell out of trying to implement something like that locally.

In bigfoot-ruby, I have provided two ways to do a faceted search: you can just do the search and get back Facet objects containing the terms and search URLs, or you can facet with items, which executes the item searches automatically (in turn getting a definitive number of results for the query as well). Since I didn’t bother to implement threading, getting facets with items can be pretty slow.

3) The Augment Service

To be honest, I’m having a hard time figuring out useful scenarios for the augment service. The idea is that you give it the URI of an RSS feed, and this service will enhance it with data from your Bigfoot store (at least, that is sort of how I understand it works). Richard’s example for me was to feed it the output of an xISBN query (which isn’t in RSS 1.0, AFAIK, but, for the sake of example…) and the augment service would fill in the data for ISBNs your library holds. The API example page mentions Wikipedia, but I don’t know where other than the Talis Platform that you can get Wikipedia entries formatted properly. I tried sending it the results of an Umlaut2 OpenSearch query, but it didn’t do anything with it. Presumably this RSS 1.0 feed needs the bib data to be sent in a certain way (my guess is in OAI_DC, like the Item Service), but I’m not sure. The only use case I can think of for this service is a much simpler way to check for ISBN concordance (rather than isbn:(123456789X|223456789X|323456789X|etc.))

Overall, I’m really impressed with the Talis API. It is a LOT easier to use than, say, Z39.50, and by using OpenSearch it seems more natural to integrate into existing web services than SRU.

Bigfoot-ruby is definitely a work in progress. I think I would like to split the Search class into ItemService and FacetService. I don’t like how results is an Array for items and a Hash for facets. Just seems sloppy. I need to document it, of course and I would like to implement Item POST. This project also made me realize how bloody slow FeedTools is. I am currently using it in both the Umlaut and the Finding Aids to provide OpenSearch, but I think it’s really too sluggish to justify itself.

Thanks, Talis, for getting me started with Bigfoot and giving me the opportunity to play around with it. Also, thanks to Ed Summers for fixing SVN on Code4lib.org. You wouldn’t be able to download it and futz around with it yourself, otherwise.

If YPOW, like MPOW, is an Endeavor Voyager site, you’ve got some decisions ahead. Francisco Partners, naturally, would like you to migrate to Aleph, and I have no doubt that Ex Libris is, as I write this, busily working on a means to make that easy for Voyager libraries to do. But ILS migrations are painful, no matter how easy the backend process might be. There’s staff training, user training, managing new workflows, site integration; lots of things to deal with. Also, the new functionality may not map 1:1 to what you currently have. How do you work around services you depended upon?

Since our contracts with Endeavor Information Systems will soon be next to worthless, I propose, Voyager customers, that we take ownership of our systems. For the price of a full Oracle (or SQL Server? — does Voyager support other RDBMSes?) license (many of us already have this), we can get write permissions to our DB and make our own interfaces. We wouldn’t need to worry about staff clients (for now), since we already have cataloging, circulation, acquisitions, etc. modules that work. When we’re ready for different functionality, however, we can create new middleware (in fact, I’m planning to break ground on this in the next two weeks) to allow for web clients or, even better, piggyback on Evergreen’s staff clients and let somebody else do the hard work. If we had native clients in the new middleware, a library could use any database backend they wanted (just migrate the data from Oracle into something else). The key is write access to the database.

By taking ownership of our ILS, we can push the developments we want, such as NCIP, a ‘Next Gen OPAC’, better link resolver integration, better metasearch integration, etc., without the pain of starting all over again (with potentially the same results; who is to say that whatever you choose as an ILS wouldn’t eventually get bought and killed off as well?). Putting my money (or lack thereof) where my mouth is, I plan on migrating Fancy Pants to use such a backend (read-only DB access for now; we still have a support contract, after all). I’m calling this project ‘Bon Voyage’. After reading Birkin’s post on CODE4LIB, I would like to make a similar service for Voyager that would basically take the place of the Z39.50 server and direct access to the database. Fancy Pants wouldn’t be integrated into Bon Voyage; it would just be another client (since it was always only meant as a stopgap, anyway).

What we’ll have is a framework for getting at the database backend (it’d be safe to say this will be a Rails project) with APIs to access bib, item, patron, etc. information. Once the models are created, it will be relatively simple to transition to ‘write’ access when that becomes necessary. Making a replacement for WebVoyage would be fairly trivial once the architecture is in place. Web-based staff clients would also be fairly simple. I think EG staff client integration wouldn’t be too hard, since it would just be an issue of outputting our data as something the EG clients want (JSON, I believe) and translating the client’s response. That would need to be investigated more, however (I’m on paternity leave and not doing things like that right now 🙂
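
To make that concrete, here’s the flavor of model I have in mind (purely a sketch: it assumes Voyager’s BIB_TEXT reporting table, an Oracle adapter, and made-up connection details, and the column names should be checked against your own database):

    require 'active_record'

    ActiveRecord::Base.establish_connection(
      :adapter  => 'oracle',      # whichever Oracle adapter you have installed
      :database => 'VGER',        # placeholder connection details
      :username => 'ro_user',
      :password => 'secret'
    )

    # Read-only model over Voyager's BIB_TEXT reporting table.
    class BibText < ActiveRecord::Base
      set_table_name  'BIB_TEXT'
      set_primary_key 'BIB_ID'
    end

    # e.g. titles matching a keyword
    BibText.find(:all, :conditions => ['lower(title) LIKE ?', '%olive tree%']).each do |bib|
      puts "#{bib.bib_id}: #{bib.title}"
    end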

Would anybody find this useful?
It seems the money we spend on an ILS could be better spent elsewhere. I don’t think this would be a product we could distribute outside of the current Voyager customer base (at least, not until it was completely native… maybe not even then; we’d have to work this out with Francisco Partners, I guess), but I think that base is big enough to be sustainable on its own.