For your consideration: yet another MARC-in-JSON proposal pt. 1

Note: this post is broken into two parts. If you are just interested in the MARC-in-JSON technical specification proposal, look here. If you’re interested in the justification and history of it, read on.

The easy and obvious reaction to the suggestion of providing new serializations for MARC is to reach for Roy Tennant’s suggestion and ask for its head. Let’s face it, though: MARC isn’t going anywhere anytime soon, and we have yet to produce a truly lossless alternative format. With the rise of JSON as the generally preferred means of communication between machines on the web, it was only a matter of time before proposals for a standardized way to serialize MARC as JSON began to materialize.

So far we have two well-publicized suggestions: one by Bill Dueber, at the University of Michigan; and one by Andrew Houghton, who works at OCLC Research. They are quite different, and each has its advantages and disadvantages. The tricky part of representing MARC in JSON is preserving field and subfield order: simply representing a record as an object (hash, associative array, dictionary, map, hashtable, etc.) will not do the trick (despite being the most programmatically friendly option), since it would be impossible to ensure a round trip from MARC to JSON and back into an identical MARC record. Instead, we need to make some compromises (which both proposals do), which is not entirely unexpected with a 40-year-old standard.
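
To make that concrete: the sample record further down carries two 035 fields, so an object keyed straight by tag, like the sketch below (which is just my illustration, not anyone’s proposal), has nowhere to put the second one without duplicating a key, and most JSON parsers will silently keep only the last value:

{
  "010": { "a": "68009499" },
  "035": { "a": "(RLIN)MIUG0000733-B" },
  "035": { "a": "(CaOTULAS)159818014" }
}

Even if the two 035 values were collected into an array under one key, the original position of each field relative to the rest of the record would still be lost.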

Let me explain a bit about each proposal, starting with Bill’s.

Bill takes the approach that JSON would more or less simply be used for transmission: it is optimized to serialize and parse very quickly. It is, however, not a terribly friendly format to work with natively. A MARC-HASH record is basically a hash with arrays of arrays representing the fields and subfields, with the hash providing administrative data (version, type) and the leader. The fields key contains the array of arrays. These arrays have either two values (for control fields) or four (for data fields), with the last value being an array of subfields (which are also two-value arrays). The first value in these arrays is always the MARC tag. For data fields, the second and third values are the indicators. For control fields, the second value is the field value.

Here’s an example, provided by Bill:

{
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string"
    "fields" : [
       ["001", "001 value"]
       ["002", "002 value"]
       ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth" ],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }

Because all of the field data sits in nested arrays, a client has to loop through every array to find a specific field. That isn’t difficult, of course; it’s just not terribly friendly.
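
Here’s a minimal sketch (my own, in Python, with the record trimmed down from Bill’s example above) of what that lookup looks like in practice, pulling the 245 $a out by position:

import json

# A parsed MARC-HASH record, trimmed down from Bill's example above.
record = json.loads("""
{
  "type": "marc-hash",
  "version": [1, 0],
  "leader": "leader string",
  "fields": [
    ["001", "001 value"],
    ["245", "1", "0", [["a", "Capitalism, primitive and modern;"],
                       ["c", "[by] T. Scarlett Epstein."]]]
  ]
}
""")

title = None
for field in record["fields"]:
    if field[0] == "245":             # data fields look like [tag, ind1, ind2, subfields]
        for code, value in field[3]:  # each subfield is a [code, value] pair
            if code == "a":
                title = value
                break
        break

print(title)  # Capitalism, primitive and modern;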

Andrew’s approach is modeled after MARCXML. A record is represented by an object with three attributes: leader (string), controlfield (array), and datafield (array). The controlfields are represented as objects with two keys: tag (string) and data (string). Datafields have three keys: tag (string), ind (string for both indicators) and subfield (array). The subfield array contains objects with two keys: code (string) and data (string).
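
Andrew’s own examples aren’t reproduced here, but based on that description a record would look roughly like this (my sketch, reusing fields from Bill’s record above):

{
  "leader": "leader string",
  "controlfield": [
    { "tag": "001", "data": "001 value" },
    { "tag": "002", "data": "002 value" }
  ],
  "datafield": [
    {
      "tag": "245",
      "ind": "10",
      "subfield": [
        { "code": "a", "data": "Capitalism, primitive and modern;" },
        { "code": "b", "data": "some aspects of Tolai economic growth" },
        { "code": "c", "data": "[by] T. Scarlett Epstein." }
      ]
    }
  ]
}

Because controlfield, datafield, and subfield are all arrays, order is preserved, but every value comes wrapped in an extra layer of objects compared to Bill’s bare arrays.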

By replicating MARCXML, Andrew’s proposal should be pretty recognizable to anyone that’s familiar with MARC. Bill’s concept is going to be incredibly fast to serialize and parse. Unfortunately, I don’t think either of them is particularly good JSON (Andrew’s model is, of course, further hampered by the fact that MARCXML isn’t particularly good XML). As JSON becomes a first class citizen on the web, with libraries (JSONPath) and datastores (MongoDB, CouchDB, Persevere) specialized for it, it seems as though a MARC-in-JSON format should not only be good MARC, but good JSON, too.

This led me to develop yet another way to serialize MARC as JSON. While JSONPath or MongoDB, by themselves, aren’t realistic functional requirements (since neither is “standard” in any way), they do represent how people who are seriously using JSON expect it to work. If JSONPath is superseded by some other XPath-analogous-y means of traversing and searching a JSON data structure, it will likely contain many similarities to JSONPath, which is similar (although not identical) to XPath. These technologies indicate that people are looking at JSON as a first-class data structure in its own right, and not just as a convenient format for passing around data that is simple to serialize and parse into other forms.

And, so, with that backstory in place, proceed to my proposal to represent MARC in JSON.
