One of the byproducts of the “Communicat” work I had done at Georgia Tech was a variant of Ed Summers’ ruby-marc that went into more explicit detail about the contents of the MARC record (as opposed to ruby-marc, which focuses on its structure).  It had been living for the last couple of years as a branch within ruby-marc, but that was never a particularly ideal approach.  These enhancements were somewhat out of scope for ruby-marc as a general MARC parser/writer, so it’s not as if the branch was ever going to see the light of day as trunk.  As a result, it was a massive pain in the butt for me to use locally:  I couldn’t easily add it as a gem (since it would have replaced the real ruby-marc, which I use far too much to live without), which meant that I had to explicitly include it in whatever projects I wanted to use it in and update any included paths accordingly.

So as I found myself, yet again, copying the TypedRecords directory into another local project (this one to map MARC records to RDF), I decided it was time to make this its own project.

One of the amazingly wonderful aspects of Ruby is the notion of “opening up an object or class”.  For those not familiar with Ruby, the language allows you to take basically any object or class and reopen it to add your own attributes, methods, etc.  If you feel that some particular functionality is missing from a given Ruby object, you can just redefine it, adding or overriding methods, without having to reimplement the entire thing.  For example:

class String
  def shout
    "#{self.upcase}!!!!"
  end
end

str = "Hello World"
str.shout
=> "HELLO WORLD!!!!"

And just like that, your String objects gained the ability to get a little louder and a little more obnoxious.

So rather than design the typed records concept as a replacement for ruby-marc, it made more sense to treat it as an extension to ruby-marc.  By monkey patching, the regular MARC parser/writer can remain the same, but if you want to look a little more closely at the contents, it will override the behavior of the original classes and objects and add a whole bunch of new functionality.  For MARC records, it’s analogous to how Facets adds all kinds of convenience methods to String, Fixnum, Array, etc.
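
To make that concrete, here is a small sketch of the approach (the method below is hypothetical and not something enhanced_marc actually defines): reopen ruby-marc’s record class and bolt on whatever you need.

require 'marc'

# Reopen ruby-marc's Record class and add behavior; everything else
# about MARC::Record keeps working exactly as before.
module MARC
  class Record
    # Hypothetical convenience method: a crude serial test based on
    # leader byte 7 (the bibliographic level).
    def serial?
      leader[7, 1] == 's'
    end
  end
end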

So, now it has its own github project:  enhanced-marc.

If you want to install it:

  gem sources -a http://gems.github.com
  sudo gem install rsinger-enhanced_marc

There are some really simple usage instructions on the project page, and I’ll try to get the rdocs together as soon as I can.  In a nutshell, it works almost exactly like ruby-marc does:

require 'enhanced_marc'

records = []
reader = MARC::Reader.open('marc.dat')
reader.each do |record|
  records << record
end

As it parses each record, it examines the leader to determine what kind of record it is:

  • MARC::BookRecord
  • MARC::SerialRecord
  • MARC::MapRecord
  • MARC::ScoreRecord
  • MARC::SoundRecord
  • MARC::VisualRecord
  • MARC::MixedRecord

and adds a bunch of format-specific methods appropriate for, say, a map.
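
For the curious, the typing hinges on a couple of leader bytes: position 06 is the type of record and position 07 the bibliographic level.  A rough sketch of the kind of inspection involved (a hypothetical helper, not enhanced_marc’s actual class map):

require 'marc'

# Map a record's leader bytes to a rough format label.
def record_format(record)
  case record.leader[6, 1]       # byte 6: type of record
  when 'e', 'f' then 'map'
  when 'c', 'd' then 'score'
  when 'i', 'j' then 'sound recording'
  when 'g'      then 'visual material'
  when 'p'      then 'mixed material'
  when 'a', 't'                  # language material: check byte 7
    record.leader[7, 1] == 's' ? 'serial' : 'book'
  else 'unknown'
  end
end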

It’s possible to then simply extract either the MARC codes or the (English) human readable string that the MARC code represents:

record.class
=> MARC::SerialRecord
record.frequency
=> "d"
record.frequency(true)
=> "Daily"
record.serial_type(true)
=> "Newspaper"
record.is_conference?
=> false

or, say:

record.class
=> MARC::VisualRecord
record.is_govdoc?
=> true
record.audience_level
=> "j"
record.material_type(true)
=> "Videorecording"
record.technique(true)
=> "Animation"

And so on.

There is still quite a bit I need to add.  It pretty much ignores mixed records at the moment; that’s something I’ll need to get to eventually, but they are uncommon enough that it’s currently a lower priority.  I also need to provide some methods that evaluate the 007 field.  I haven’t gotten to this yet because it’s just a ton of tedium.  It would be useful, though, so I want to get it in there.

If there is interest, it could perhaps be extended to include authority records or holdings records.  It would also be handy to have convenience methods on the data fields:

record.isbn
=> "0977616630"
record.control_number
=> "793456"

Anyway, hopefully somebody might find this to be useful.

For a couple of months this year, the library world was aflame with rage at the proposed OCLC licensing policy regarding bibliographic records.  It was a justifiable complaint, although I basically stayed out of it:  it just didn’t affect me very much.  After much gnashing of teeth, petitions, open letters from consortia, etc., OCLC eventually rescinded their proposal.

Righteous indignation: 1, “the man”: 0.

While this could certainly be counted as a success (I think, although it means we default to the much more ambiguous 1987 guidelines), there is a bit of a mixed message here about where the library community’s priorities lie.  It’s great that you now have the right to share your data, but, really, how do you expect to do it?

It has been a little over a year since the Jangle 1.0 specification was released; 15 months or so since all of the major library vendors (with one exception) agreed to the Digital Library Federation’s “Berkeley Accord”; and we’re at the anniversary of the workshop where the vendors actually agreed on how we would implement a “level 1” DLF API.

So far, not a single vendor at the table has honored their commitment, and I have seen no sign of any intention to do so, with the exception of Koha (although, interestingly, not by the company represented in the Accord).

I am going to focus here on the DLF ILS-DI API, rather than Jangle, because it is something we all agreed to.  For all intents and purposes, Jangle and the ILS-DI are interchangeable:  I think anybody who has invested any energy in either project would be thrilled if either one actually caught on and was implemented in a major ILMS.  Both specifications share the same scope and purpose.  The resources required to support one would be the same as the other; the only difference between the two is the client-side interface.  Jangle technically meets all of the recommendations of the ILS-DI, but not to the bindings that we, the vendors, agreed to (although there is an ‘adapter’ to bridge that gap).  Despite having spent the last two years of my life working on Jangle, I would be thrilled to no end if the ILS-DI saw broad uptake.  I couldn’t care less about the serialization; I only care about the access.

There is only one reason that the vendors are not honoring their commitment:  libraries aren’t demanding that they do.

Why is this?  Why the rally to ensure that our bibliographic data is free for us to share when we lack the technology to actually do the sharing?

When you look at the open source OPAC replacements (I’m only going to refer to the OSS ones here, because they are transparent, as opposed to their commercial counterparts):  VuFind, Blacklight, Scriblio, etc., and take stock of the hoops that have to be jumped through to populate their indexes and check availability, most libraries would throw their hands in the air and walk away.  There are batch dumps of MARC records.  Rsync jobs to get the data to the OPAC server.  Cron jobs to get the MARC into the discovery system.  Screen scrapers and one-off “drivers” to parse holdings and status.  It is a complete mess.

It’s also the case for every Primo, Encore, Worldcat Local, AquaBrowser, etc. that isn’t sold to an internal customer.

If you’ve ever wondered why the third party integration and enrichment services are ultimately somewhat unsatisfying (think BookSite.com or how LibraryThing for Libraries is really only useful when you can actually find something), this is it.  The vendors have made it nearly impossible for a viable ecosystem to exist because there is no good way to access the library’s own data.

And it has got to stop.

For the OCLC withdrawal to mean anything, libraries have got to put pressure on their vendors to support one of the two open APIs, migrate to a vendor that does support the open APIs, or circumvent the vendors entirely by implementing the specifications themselves (and sharing with their peers).  This cartel of closed access is stifling innovation and, ultimately, hurting library users.

I’ll hold up my end (and ensure it’s ILS-DI compatible via this) and work towards it being officially supported here, but the 110 or so Alto customers aren’t exactly going to make or break this.

Hold your vendor’s feet to the fire and insist they uphold their commitment.

For a long time, I was massively confused about what the Platform was or did; months after I started at Talis, I was still fairly unclear about what it actually did.  I’ve now got my head around it, use it, and have a pretty good understanding of why and how it’s useful, but I fully realize that a lot of people (and by people I really mean library people) don’t, and don’t really care to learn.

What they want is Solr.  Actually, no, what they want is a magical turnkey system that takes their crappy OPAC (or whatever) data and transmogrifies it into a modern, of-the-web type discovery system.  What is powering that discovery system is mostly irrelevant if it behaves halfway decently and is pretty easy to get up and running for a proof-of-concept.  These two points, of course, are why Solr is so damned popular; to say that it meets those criteria is a massive understatement.  The front-end of that Solr index is another story entirely, but Solr itself is a piece of cake.

Almost from the time I started at Talis I have thought that a Solr-clone API for the Platform would make sense.  Although the Platform doesn’t have all of the functionality of Solr, it has several of the sexy bits (Lucene syntax and faceting, for example), and if it had some way to respond to an out-of-the-box Solr client, it seemed to me that it would make it a lot easier to turn an off-the-shelf Solr-powered application (a la VuFind or Blacklight) into a Platform-powered, RDF/linked data application with minimal customization.  It’s not Solr and in many ways is quite different from Solr — but if it can exploit its similarities with Solr enough to leverage the pretty awesome client base that Solr has, it’ll make it easier to open the door for things the Platform is good at.  Alternately, if the search capabilities of the Platform become too limited compared to Solr, the data is open — just index it in Solr.  Theoretically, if the API is a Solr clone, you should be able to point your application at either.
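
To give a sense of how thin such a shim could be, here is a hypothetical, stripped-down sketch of a Solr-style select endpoint in Sinatra.  The Platform query itself is stubbed out, so treat the helper and the parameter handling as assumptions rather than working code:

require 'sinatra'
require 'json'

# Hypothetical helper standing in for the actual Platform call: imagine
# it runs a Lucene-syntax query against the named store and returns
# [total_hits, array_of_document_hashes].
def query_platform(store, q, start, rows)
  [0, []]
end

# Mimic Solr's /select interface closely enough that an off-the-shelf
# Solr client doesn't notice the difference.
get '/:store/select' do
  q     = params['q'] || '*:*'
  start = (params['start'] || 0).to_i
  rows  = (params['rows'] || 10).to_i

  total, docs = query_platform(params['store'], q, start, rows)

  solr_response = {
    'responseHeader' => { 'status' => 0, 'params' => { 'q' => q } },
    'response' => { 'numFound' => total, 'start' => start, 'docs' => docs }
  }

  # Solr's "ruby" response writer is essentially a Ruby hash literal
  params['wt'] == 'ruby' ? solr_response.inspect : solr_response.to_json
end

In principle, a stock Solr client pointed at /somestore/select would get back a shape it already understands; the real work, obviously, is in the query translation that the stub glosses over.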

The proof-of-concept project I’m working on right now is basically a reënvisioned Communicat:  a combination discovery interface; personal and group resource collection aggregator; resource-list content management system (for course reserves, say, or subject/course guides, etc.);  and “discovered” resources (articles, books, etc.) cache and recommendation service.  None of these would be terribly sophisticated at a first pass, I’m just trying to get (and show) a clearer understanding of how a Communicat might work.  As such, I’m trying to do as little development from the ground up as I can get away with.

I’ll go into more detail later as it starts to get fleshed out some, but for the discovery and presentation piece, I plan on using Blacklight.  Of the OSS discovery interfaces, it’s the most versatile for the wide variety of resources I would hope to be in a Communicat-like system.  It’s also Ruby, so I feel the most comfortable hacking away at it.  It also meant I needed the aforementioned Solr-like API for the Platform, so I hastily cobbled together something using Pho and Sinatra.  I’m calling it pret-a-porter, and the sources are available on Github.

You can see it in action here.  The first part of the path corresponds with whatever Platform store you want to search.  The only “Response Writers” available are Ruby and JSON (I’ll add an XML response as soon as I can — I just needed Ruby for Blacklight and JSON came basically for free along with it).  It’s incredibly naive and rough at this point, but it’s a start.  Most importantly, I have Blacklight working against it.  Here’s Blacklight running off of a Prism 3 store.  It took a little bit of customization of Blacklight to make this work, but it would still be interchangeable with a Solr index (assuming you were still planning on using the Platform for your data storage).  When I say a “little bit”, I mean very little.  Both pieces (pret-a-porter and the Blacklight implementation) took less than three days total to get running.

If only the rest of the Communicat could come together that quickly!

There were three main reasons that I took the old lcsh.info data that I had lying around and made http://lcsubjects.org:

  1. There were projects (including internal Talis ones) that really wanted to use that data and impatience was growing as to when the Library of Congress would launch id.loc.gov.
  2. Leigh Dodds had just released Pho and needed testers.  I had also, to date, done virtually nothing interesting with the Platform and wanted a somewhat turnkey operation to get started with it.
  3. While it’s great that the Library of Congress has made this data available, what is really interesting is seeing how this stuff relates to other data sets.  It’s unlikely that LoC will be too open to experimentation in this regard (these are, after all, authorities), so LCSubjects.org seemed a good place to provide both that experimentation and community-driven editing (which will, hopefully, be coming soon — per an idea proposed by Chris Clarke, I would like to store user-added changes in their own named graphs, but that support needs to be added to the Platform).  This should make it more dynamic and interesting, while still deferring “authority” to the Library of Congress.

In the pursuit of number three, I had a handful of what I hoped were fairly “low hanging fruit” projects to help kickstart this process and actually make LCSubjects linked data instead of just linkable data (since that was fairly redundant to id.loc.gov/authorities/, anyway).  I have rolled out the first of these, which was an attempt to provide some sense of geocoding to the geographic headings.

There are just over 58,000 geographic subject headings in the current dump that LoC makes available.  11,362 of these have a ° symbol in them (always in a non-machine-readable editorial note).  I decided to take this subset, see how many I could identify as a single geographic “point” (i.e., a single, valid latitude/longitude coordinate pair), convert those from degree/minute/second format to decimal format, and then see how many had a direct match to points in Geonames.

Given that these are entered as prose notes, the matching was fairly spotty.  I was able to identify 9,127 distinct “points”.  837 concepts had either too many coordinates (concepts like this one or this one, for example) or only one.  It’s messy stuff.  This also means there are about another 1,000 that missed my regex completely (/[0-9]*°[^NSEW]*[NSEW]\b/), but I haven’t had time to investigate what these might look like.  Given that these are just text notes, though, I was pretty surprised at the number of actual positive matches I got.  These are now available in the triples using the Basic Geo (WGS84 lat/long) vocabulary.
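
The conversion itself is trivial once the numbers are extracted; the extraction is the spotty part.  A rough sketch of the general idea (the note below is invented, and the real notes are far messier):

# Convert a degree/minute/second coordinate to decimal degrees,
# flipping the sign for southern and western hemispheres.
def dms_to_decimal(deg, min, sec, hemisphere)
  decimal = deg + min / 60.0 + sec / 3600.0
  %w[S W].include?(hemisphere) ? -decimal : decimal
end

# An invented example note; the real notes are far less consistent.
note = "(W 86°47′30″--W 86°40′00″/N 36°17′30″--N 36°10′00″)"

coords = note.scan(/([NSEW])\s*(\d+)°(?:(\d+)[′'])?(?:(\d+)[″"])?/).map do |h, d, m, s|
  dms_to_decimal(d.to_i, m.to_i, s.to_i, h)
end
# => [-86.791..., -86.666..., 36.291..., 36.166...]
# A heading with exactly one lat/long pair counts as a single "point".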

Making the links to Geonames wasn’t nearly as successful.  Only about 197 points matched.  Some of those that did could be considered questionable (click on the geonames link to see what I mean).  Others are pretty perfect.

All in all, a pretty successful experiment.  I’d like to take another pass at it and see how many prefLabels or altLabels match to the Geonames names and add those, as well.  Also, just after I added the triples, there was an announcement for LinkedGeoData.org, which will probably provide much better wgs84:location coverage (I can do searches like http://linkedgeodata.org/triplify/near/%latitude%,%longitude%/1 which would find points of interest within 1 meter of my coordinate pair).  So stay tuned for those links.

Lastly, one of the cooler by-products of adding these coordinates is functionality like this, which gives you all of the LCSH with coordinates found roughly inside the geographic boundaries of Tennessee (TN is a parallelogram, so this box-style query isn’t perfect).

For Ian Davis’ birthday, Danny Ayers sent out an email asking people to make some previously unavailable datasets accessible as linked data as Ian’s present.  It was a pretty neat idea.  One that I wish I had thought of.

Given that Ian is my boss (prior to about a month ago, Ian was just nebulously “above me” somewhere in the Talis hierarchy, but I now report to him directly) one could cynically make the claim that by providing Ian a ‘linked data gift’ that I would just be currying favor by being a kiss-ass.  You could make that claim, sure, but evidently you are not aware of how I hurt the company.

Anyway, as my contribution, I decided to take the data dumps from LibraryThing that Tim Spalding pretty graciously makes available [whoa, between the time I first started this post and now, the data has gone AWOL; I suppose I did this just in time].  The data isn’t always very current and not all of the files are terribly useful (the tags one, for example, doesn’t offer much since the tags aren’t associated with anything — it’s just words and their counts), but it’s data, and between ThingISBN and the WikipediaCitations I thought it would be worth it.

I wanted to take a very pragmatic approach to this: no triple store, no search, no RDF libraries, minimal interface.  Mostly this was inspired by Ed Summers’ work with the Library of Congress Authorities, but, also, if Tim (or whoever at LibraryThing) saw that making LibraryThing linked data was as easy as a few template tweaks (as opposed to a major change in their development stack), this exercise was much more likely to actually make its way into LibraryThing.

What I ended up with (the first pass released before the end of Ian’s birthday, I might add) was LODThing: a very simple application written with Ruby’s Sinatra framework, DataMapper, and SQLite.  The entire application is less than 230 lines of Ruby (including the web app and data loader) plus two HAML templates and two builder templates for the HTML/RDFa and RDF/XML, respectively.  The SQL database has three tables, including the join table.  This is really simple stuff.  The only real reason it took a couple of days to create was trying to get the data loaded into SQLite from these huge XML files.  Nokogiri is fast (well, Ruby fast), but a 200 MB XML file is pretty big.  It was nice to get acquainted with Nokogiri’s pull parser, though.
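
For the curious, the pull parser is pretty painless.  A sketch of what the thingISBN load might look like (the element and attribute names are my guesses at the dump format, so treat them as assumptions):

require 'nokogiri'

# Stream a very large XML file without loading it all into memory;
# Nokogiri::XML::Reader hands you each node as it is parsed.
isbns_by_work = Hash.new { |h, k| h[k] = [] }

Nokogiri::XML::Reader(File.open('thingISBN.xml')).each do |node|
  next unless node.name == 'isbn' &&
              node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
  isbns_by_work[node.attribute('work')] << node.inner_xml.strip
end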

There are a few things to take away from this exercise.

  1. When data is freely available, it’s really quite simple to reconstitute it into linked data without any need to depart from your traditional technology stack.  There is nothing even remotely semantic-webby about LODThing except its output.
  2. We now have an interesting set of URIs and relationships to start to express and model FRBR relationships.
  3. The Wikipedia citations data is extremely useful and could certainly be fleshed out more.  One could imagine querying DBpedia or Freebase on these concepts, identifying whether the Wikipedia article is actually referring to the work itself, and using that.  Right now LODThing makes no claims about the relationships except that it’s a reference from Wikipedia.

LODThing isn’t really intended for human consumption, so there’s no real “default way in”.  The easiest way to use it is to make a URI from an ISBN:

If you know the LibraryThing ‘work ID’, you can get in that way, too:

Also, you can get all of these resources as RDF/XML by replacing the .html with .rdf.

So, Tim, you wrote on the LT API page that you would love to see what people are doing with your data:  here you go.  It would be even more awesome if it made its way back into LT — after all, it would alleviate some of the need for you to have a special API for this stuff.

Also, special thanks to Toby Inkster for providing a ton of help in getting this to resemble something that a linked data aware agent would actually want and finally turning the httpRange-14 light bulb on over my head.  He also immediately linked to it from his Amazon API LODifiier, which is sort of cool, too.

I’ll be happy to throw the sources into a github repository if anybody’s interested in them.

For the last couple of weeks I’ve returned to working on the Alto Jangle connector, at least part-time.  I had shelved development on it for a while; I had a hard time finding anybody interested in using it and had reached a point where the development database I was working against was making it difficult to know what to expect from a real, live Alto system.  After I got wind of a couple of libraries that might be interested in it, I thought I should at least get it into a usable state.

One of the things that was vexing me prior to my hiatus was how to get Sybase to page through results in a semi-performant way.  I had originally blamed it on Grails, then when I played around with refactoring the connector in PHP (using Quercus, which is pretty slick by the way, to provide Sybase access via JDBC — the easiest way to do it) I realized that paging is just outside of Sybase’s capabilities.

And when you’re so used to MySQL, PostgreSQL and SQLite, this sort of makes your jaw drop (although, in its defense, it appears that this isn’t all that easy in Oracle, either — however, it’s at least possible in Oracle).

There seem to be two ways to do something like getting rows 375,000 – 375,099 from all of the rows in a table:

  1. Use cursors
  2. Use SET ROWCOUNT 375100 and loop through, throwing out the first 375,000 results.

The first option isn’t really viable.  You need write access to the database and it’s unclear how to make this work in most database abstraction libraries.  I don’t actually know that cursors do anything differently than option 2 besides pushing the looping to the database engine itself.  I was actually using cursors in my first experiments in JRuby using java.sql directly, but since I wasn’t aware of this problem at the time, I didn’t check to see how well it performed.

Option 2 is a mess, but this appears to be how GORM/Hibernate deals with paging in Sybase.  Cursors aren’t available in Quercus’ version of PDO, so it was how I had to deal with paging in my PHP prototypes, as well.  When I realized that PHP was not going to be any faster than Grails, I decided to just stick with Grails (“regular C-PHP” is out — compiling in Sybase support is far too heavy a burden).
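
For the record, option 2 boils down to something like the following JRuby/java.sql sketch.  The connection details, table, and column are hypothetical; the point is the SET ROWCOUNT-plus-skip pattern:

require 'java'
java_import 'java.sql.DriverManager'

# Hypothetical connection details and table; the point is the pattern.
conn = DriverManager.get_connection(
  'jdbc:sybase:Tds:dbhost:5000/alto', 'user', 'secret')

offset, limit = 375_000, 100
stmt = conn.create_statement

# Cap the result set at offset + limit rows...
stmt.execute("SET ROWCOUNT #{offset + limit}")
rs = stmt.execute_query('SELECT work_id FROM works_meta ORDER BY work_id')

# ...then walk past the first `offset` rows and keep the remainder.
rows, seen = [], 0
while rs.next
  seen += 1
  next if seen <= offset
  rows << rs.get_string('work_id')
end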

This paging thing still needed to be addressed.  Offsets of 400,000 and more were taking more than twelve minutes to return.  How much more, I don’t know — I killed the request at the twelve-minute mark.  While some of this might be the result of a bad or missing index, any way you cut it, it wasn’t going to be acceptable.

I was kicking around the idea of exporting the “models” of the Jangle entities into a local HSQLDB (or whatever) mirror and then working the paging off of that.  I couldn’t help but think that this was sort of a waste, though — exporting from one RDBMS to another solely for the benefit of paging.  You’d have to keep them in sync somehow and still refer to the original Sybase DB for things like relationships and current item or borrower status.  For somebody that’s generally pretty satisfied with hacking together kludgy solutions to problems, this seemed a little too hacky… even by my standards.

Instead, I settled on a different solution that could potentially bring a bunch of other features along with it.  Searchable is a Grails plugin for Compass, a project to easily integrate Lucene indexes with your Java domain classes (this would be analogous to Rails’ acts_as_ferret).  When your Grails application starts up, Searchable will begin to index whatever models you declared as, well, searchable.  You can even set options to store all of your attributes, even if they’re not actual database fields, alleviating the need to hit the original database at all, which is nice.  Initial indexing doesn’t take long — our “problem” table that took twelve minutes to respond takes less than five minutes to fully index.  It would probably take considerably less than that if the data were consistent (some of the methods to set the attributes can be pretty slow if the data is wonky — they try multiple paths to find the actual values of the attribute).

What this then affords us is consistent access times, regardless of the size of the offset:  the 4,000th page is as fast as the second:  between 2.5 and 3.5 seconds (our development database server is extremely underpowered and I access it via the VPN — my guess is that a real, live situation would be much faster).

The first page is a bit slower.  I can’t use the Lucene index for the first page of results because there’s no way for Searchable to know whether the WORKS_META table has changed since the last request, since those changes wouldn’t be happening through Grails.  Since performance for the first hundred rows out of Sybase isn’t bad, the connector just uses it for the first page, then syncs the Lucene index with the database at the end of the request.  Each additional page then pulls from Lucene.  Since these pages wouldn’t exist until after the Lucene index is created, and the Lucene index is recreated every time the Grails app is started, I added a controller method that checks the count of the Sybase table against the count of the Lucene index to confirm that they’re in sync (it’s worth noting that if the Lucene index has already been created once, it will be available right away after Grails starts — the reindexing still happens, but in a temp location that is moved to the default location, overwriting the old index, once it’s complete).

The side benefit to using Searchable is that it will make adding search functionality to the Alto connector that much easier.  Building SQL statements from the CQL queries in the OpenBiblio connector was a complete pain in the butt; CQL to Lucene syntax should be considerably easier.  It seems like it would be possible for these Lucene indexes to eventually alleviate the need for the bundled Zebra index that comes with Alto, but that’s just me talking, not any sort of strategic goal.

Anyway, thanks to Lucene, Sybase is behaving mostly like a modern RDBMS, which is a refreshing change.

In a world where library management systems are sophisticated and modern…

I was doing some Google searches about SKOS, trying to figure out the exact distinction between skos:ConceptScheme and skos:Collection (it’s much more clear to me now) and I came across this article in XML.com:

Introducing SKOS

The article is fine, but it’s not what compelled me to write a blog post.  I was struck by a comment on that page titled What about Topic Maps?:

This new W3C standard obviously has a huge overlap with the very mature ISO standard Topic Maps.  Topic Maps were originally conceived for (almost) exactly the same problem space as SKOS, and they are widely used. (For example, all major library cataloging software either supports Topic Maps or soon will.)

However, Topic Maps proved to be more generally useful, so they are often compared and contrasted with RDF itself. The surprising difficulty of making Topic Maps and RDF work together is exactly the “extra level of indirection” mentioned by the author of this article about SKOS.

It is very strange that neither this article, nor the referenced XTech paper, mentions Topic Maps.

What is the relationship between SKOS and Topic Maps? How does this fit in with the work (as reported in Edd Dumbill’s blog) on interoperability between Topic Maps and RDF/OWL?

Now, I have no idea if “yitzgale” is some sort of alias of Alexander Johannesen; let’s assume “no” (for one thing, that comment is far too optimistic about library technology).  The sentence “[f]or example, all major library cataloging software either supports Topic Maps or soon will” is sort of stunning in both the claim it makes and its total lack of accuracy.  I feel pretty confident in my familiarity with library cataloging software, and I can say with some degree of certainty that there is no support for topic maps today (hell, MARC21, MFHD and Unicode support are pushing it – and those are just incremental changes).  This comment was written four years ago.

And yet, there’s part of me that feels robbed.  Where is the topic map support in my library system?  I don’t even really know anything about TM, but I still feel it would be a damn sight better than what we’ve got now.  What reality is this that yitzgale is living in, with its fancy library systems and librarians and vendors willing to embrace a radical change in how things are done?  I want in.

I might even be able to jump off my RDF bandwagon for it.

I cannot conceive of a day that I might charge for a webinar about Jangle.  I expect that day will never come.

Still, it pains me to see a NISO Webinar on Interoperability:

http://www.niso.org/news/events/2009/interop09

It pains me for a couple of reasons.  First, a hundred bucks for a webinar?  Come on, NISO, get over yourself.

Secondly, I have tremendous respect for and was happy to participate in the DLF ILS-DI Berkeley Accord, but it’s, at best, a half measure, is no longer being actively developed and has, for all intents and purposes, lost its sponsorship.

Jangle isn’t perfect, and I realize there’s not a NISO standard to be found (well, you can send Z39.2…), but if you’re going to talk about interoperability, there’s not a more pragmatic and simple approach currently on the table.