Archive

Monthly Archives: September 2010

Note: to see the backstory and justification of this proposal, please see the preceding post.

MARC-in-JSON is a proposed JSON schema for representing MARC records as JSON. It is the outgrowth of working with MARC data in MongoDB and is intended to be both a faithful representation of MARC as well as a logical and useful model to work natively in JSON-centric environments. Ideally, this serialization could eventually replace binary MARC as the default format. The round trip of a MARC-in-JSON record from MARC to JSON back to MARC is lossless and preserves field/subfield order.

An example MARC bibliographic record, represented as text:

LEADER 01471cjm a2200349 a 4500
001 5674874
005 20030305110405.0
007 sdubsmennmplu
008 930331s1963 nyuppn eng d
035 $9 (DLC) 93707283
906 $a 7 $b cbc $c copycat $d 4 $e ncip $f 19 $g y-soundrec
010 $a 93707283
028 02 $a CS 8786 $b Columbia
035 $a (OCoLC)13083787
040 $a OClU $c DLC $d DLC
041 0 $d eng $g eng
042 $a lccopycat
050 00 $a Columbia CS 8786
100 1 $a Dylan, Bob, $d 1941-
245 14 $a The freewheelin' Bob Dylan $h [sound recording].
260 $a [New York, N.Y.] : $b Columbia, $c [1963]
300 $a 1 sound disc : $b analog, 33 1/3 rpm, stereo. ; $c 12 in.
500 $a Songs.
511 0 $a The composer accompanying himself on the guitar ; in part with instrumental ensemble.
500 $a Program notes by Nat Hentoff on container.
505 0 $a Blowin' in the wind -- Girl from the north country -- Masters of war -- Down the highway -- Bob Dylan's blues -- A hard rain's a-gonna fall -- Don't think twice, it's all right -- Bob Dylan's dream -- Oxford town -- Talking World War III blues -- Corrina, Corrina -- Honey, just allow me one more chance -- I shall be free.
650 0 $a Popular music $y 1961-1970.
650 0 $a Blues (Music) $y 1961-1970.
856 41 $3 Preservation copy (limited access) $u http://hdl.loc.gov/loc.mbrsrs/lp0001.dyln
952 $a New
953 $a TA28
991 $b c-RecSound $h Columbia CS 8786 $w MUSIC

The same bibliographic record serialized as MARC-in-JSON would appear as follows (pretty-printed with whitespace and line breaks for readability):

{
    "leader":"01471cjm a2200349 a 4500",
    "fields":
    [
        {
            "001":"5674874"
        },
        {
            "005":"20030305110405.0"
        },
        {
            "007":"sdubsmennmplu"
        },
        {
            "008":"930331s1963    nyuppn              eng d"
        },
        {
            "035":
            {
                "subfields":
                [
                    {
                        "9":"(DLC)   93707283"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "906":
            {
                "subfields":
                [
                    {
                        "a":"7"
                    },
                    {
                        "b":"cbc"
                    },
                    {
                        "c":"copycat"
                    },
                    {
                        "d":"4"
                    },
                    {
                        "e":"ncip"
                    },
                    {
                        "f":"19"
                    },
                    {
                        "g":"y-soundrec"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "010":
            {
                "subfields":
                [
                    {
                        "a":"   93707283 "
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "028":
            {
                "subfields":
                [
                    {
                        "a":"CS 8786"
                    },
                    {
                        "b":"Columbia"
                    }
                ],
                "ind1":"0",
                "ind2":"2"
            }
        },
        {
            "035":
            {
                "subfields":
                [
                    {
                        "a":"(OCoLC)13083787"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "040":
            {
                "subfields":
                [
                    {
                        "a":"OClU"
                    },
                    {
                        "c":"DLC"
                    },
                    {
                        "d":"DLC"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "041":
            {
                "subfields":
                [
                    {
                        "d":"eng"
                    },
                    {
                        "g":"eng"
                    }
                ],
                "ind1":"0",
                "ind2":" "
            }
        },
        {
            "042":
            {
                "subfields":
                [
                    {
                        "a":"lccopycat"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "050":
            {
                "subfields":
                [
                    {
                        "a":"Columbia CS 8786"
                    }
                ],
                "ind1":"0",
                "ind2":"0"
            }
        },
        {
            "100":
            {
                "subfields":
                [
                    {
                        "a":"Dylan,
                         Bob,
                        "
                    },
                    {
                        "d":"1941-"
                    }
                ],
                "ind1":"1",
                "ind2":" "
            }
        },
        {
            "245":
            {
                "subfields":
                [
                    {
                        "a":"The freewheelin' Bob Dylan"
                    },
                    {
                        "h":"
                        [
                            sound recording
                        ]
                        ."
                    }
                ],
                "ind1":"1",
                "ind2":"4"
            }
        },
        {
            "260":
            {
                "subfields":
                [
                    {
                        "a":"
                        [
                            New York,
                             N.Y.
                        ]
                         :"
                    },
                    {
                        "b":"Columbia,
                        "
                    },
                    {
                        "c":"
                        [
                            1963
                        ]
                        "
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "300":
            {
                "subfields":
                [
                    {
                        "a":"1 sound disc :"
                    },
                    {
                        "b":"analog,
                         33 1/3 rpm,
                         stereo. ;"
                    },
                    {
                        "c":"12 in."
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "500":
            {
                "subfields":
                [
                    {
                        "a":"Songs."
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "511":
            {
                "subfields":
                [
                    {
                        "a":"The composer accompanying himself on the guitar ; in part with instrumental ensemble."
                    }
                ],
                "ind1":"0",
                "ind2":" "
            }
        },
        {
            "500":
            {
                "subfields":
                [
                    {
                        "a":"Program notes by Nat Hentoff on container."
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "505":
            {
                "subfields":
                [
                    {
                        "a":"Blowin' in the wind -- Girl from the north country -- Masters of war -- Down the highway -- Bob Dylan's blues -- A hard rain's a-gonna fall -- Don't think twice,
                         it's all right -- Bob Dylan's dream -- Oxford town -- Talking World War III blues -- Corrina,
                         Corrina -- Honey,
                         just allow me one more chance -- I shall be free."
                    }
                ],
                "ind1":"0",
                "ind2":" "
            }
        },
        {
            "650":
            {
                "subfields":
                [
                    {
                        "a":"Popular music"
                    },
                    {
                        "y":"1961-1970."
                    }
                ],
                "ind1":" ",
                "ind2":"0"
            }
        },
        {
            "650":
            {
                "subfields":
                [
                    {
                        "a":"Blues (Music)"
                    },
                    {
                        "y":"1961-1970."
                    }
                ],
                "ind1":" ",
                "ind2":"0"
            }
        },
        {
            "856":
            {
                "subfields":
                [
                    {
                        "3":"Preservation copy (limited access)"
                    },
                    {
                        "u":"http://hdl.loc.gov/loc.mbrsrs/lp0001.dyln"
                    }
                ],
                "ind1":"4",
                "ind2":"1"
            }
        },
        {
            "952":
            {
                "subfields":
                [
                    {
                        "a":"New"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "953":
            {
                "subfields":
                [
                    {
                        "a":"TA28"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "991":
            {
                "subfields":
                [
                    {
                        "b":"c-RecSound"
                    },
                    {
                        "h":"Columbia CS 8786"
                    },
                    {
                        "w":"MUSIC"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        }
    ]
}

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in IETF RFC 2119.

MARC-in-JSON records MUST conform to the following JSON schema:

{
    "description":"A MARC Record",
    "type": "object",
    "properties": {
        "leader": {
            "type": "string",
            "minLength": 24,
            "maxLength": 24
        },
        "fields": {
            "type": "array",
            "items": {
                "type":[
                    {
                        "type": "object",
                        "description":"A MARC Control Field",
                        "additionalProperties":{
                            "type":"string"
                        }
                    },
                    {
                        "type": "object",
                        "additionalProperties":{
                            "type":"object",
                            "description":"A MARC Variable Field",
                            "properties":{
                                "ind1":{
                                    "type":"string",
                                    "minLength":1,
                                    "maxLength":1
                                },
                                "ind2":{
                                    "type":"string",
                                    "minLength":1,
                                    "maxLength":1
                                },
                                "subfields":{
                                    "type":"array",
                                    "items":{
                                        "type":"object",
                                        "description":"A MARC Subfield",
                                        "additionalProperties":{
                                            "type":"string"
                                        }
                                    }
                                }
                            }
                        }
                    }
                    ]
                },
            "additionalProperties": false
        }
    },
    "additionalProperties": false
}

Download this schema.

MARC-in-JSON consists of four (4) object types:

Record objects
The base representation of the MARC record. It MUST be a JSON object with two properties:

  • leader, which MUST be a string, exactly 24 characters in length.
  • fields, an array which MUST only contain control field and variable field objects.

Record objects MAY be contained in a JSON array.

Control field objects
MARC control fields MUST be represented as a JSON object with a single key/value pair. The key MUST be a string conforming to a valid MARC field tag value (generally three alphanumeric characters). The value of the object MUST be a string.
Variable field objects
Variable fields MUST be represented as JSON objects with a single key/value pair. The key MUST be a string conforming to a valid MARC field tag value (generally three alphanumeric characters). The value of the object MUST be a JSON object with three properties:

  • ind1: a one (1) character string representing the 1st MARC field indicator
  • ind2: a one (1) character string representing the 2nd MARC field indicator
  • subfields: an array containing at least one subfield object
Subfield objects
MARC subfields MUST be represented as JSON objects with a single key/value pair. The key MUST be a string conforming to a valid MARC subfield code value (generally a single alphanumeric character). The value MUST be a string representing the value of the subfield. A subfield object MUST only appear in a variable field object subfields array.

The content of a MARC-in-JSON object MUST be UTF-8 encoded or UTF-8 escaped according to the JSON standard (RFC 4627).  MARC-8, UTF-16 or UTF-32 SHALL NOT be permitted under MARC-in-JSON.

There are currently two implementations conforming to this specification for serialization:

Note: this post is broken into two parts. If you are just interested in the MARC-in-JSON technical specification proposal, look here. If you’re interested in the justification and history of it, read on.

The easy and obvious reaction to the suggestion of providing new serializations to MARC is to reach for Roy Tennant’s suggestion and ask for its head. Let’s face it, though, MARC is going nowhere soon and we have yet to produce a truly lossless alternative format. With the rise of JSON as the generally preferred means of communication between machines on the web, it was only a matter of time before the proposals for a standardized way to serialize MARC as JSON began to materialize.

So far we have two well-publicized suggestions: one by Bill Dueber, at the University of Michigan; and one by Andrew Houghton, who works at OCLC Research. They are quite different and each have their advantages and disadvantages. The tricky part of representing MARC in JSON is preserving field and subfield order: simply representing an object (hash, associative array, dictionary, map, hashtable, etc.) will not do the trick (despite being the most programmatically friendly) since it would be impossible to ensure a round trip from MARC to JSON back into an identical MARC record. Instead, we need to make some compromises (which both proposals do), which is not entirely unexpected with a 40 year old standard.

Let me explain a bit about each proposal, starting with Bill’s.

Bill takes the approach that JSON would more or less simply be used for transmission: it is optimized to serialize and parse very quickly. It is, however, not a a terribly friendly format to work with natively. A MARC-HASH record is basically a hash with arrays of arrays representing the fields and subfields, with the hash providing administrative data (version, type) and the leader. The fields key contains the array of arrays. These arrays either have two values (for control fields) or four (for data fields) with the last value being an array of subfields (which are also two value arrays). The first value in these arrays is always the MARC tag. For data fields, the second and third values are the indicators. For control fields the second value is the field value.

Here’s an example, provided by Bill:

{
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string"
    "fields" : [
       ["001", "001 value"]
       ["002", "002 value"]
       ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ],
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ],
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth" ],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }

By putting all of the field data in nested arrays, a client would have to loop through every array to find specific fields. This isn’t difficult, of course, it’s just not terribly friendly.

Andrew’s approach is modeled after MARCXML. A record is represented by an object with three attributes: leader (string), controlfield (array), and datafield (array). The controlfields are represented as objects with two keys: tag (string) and data (string). Datafields have three keys: tag (string), ind (string for both indicators) and subfield (array). The subfield array contains objects with two keys: code (string) and data (string).

By replicating MARCXML, Andrew’s proposal should be pretty recognizable to anyone that’s familiar with MARC. Bill’s concept is going to be incredibly fast to serialize and parse. Unfortunately, I don’t think either of them is particularly good JSON (Andrew’s model is, of course, further hampered by the fact that MARCXML isn’t particularly good XML). As JSON becomes a first class citizen on the web, with libraries (JSONPath) and datastores (MongoDB, CouchDB, Persevere) specialized for it, it seems as though a MARC-in-JSON format should not only be good MARC, but good JSON, too.

This led to me to develop yet another way to serialize MARC as JSON. While JSONPath or MongoDB, by themselves, aren’t realistic functional requirements (since neither is “standard” in any way), they do represent how people that are seriously using JSON expect it work. If JSONPath is superceded by some other XPath-analogous-y means of traversing and searching a JSON data structure, it will likely contain many similarities to JSONPath, which is similar (although not identical) to XPath. These technologies indicate that people are looking at JSON as a first class data structure of its own and not just for simply passing data around that’s simple to serialize and parse into other forms.

And, so, with that backstory in place, proceed to my proposal to represent MARC in JSON.