I have been slowly taking the MARC codes lists and modeling them as linked data. I released a handful of them several months ago (geographic area codes, countries and languages) and have added more as I get inspired or have some free time. Most recently, I’ve added the Form of Item, Target Audience and Instruments and Voices terms.

The motivation behind modeling these lists is that they are extremely low-hanging fruit: they are controlled vocabularies that (usually) appear in the MARC fixed (or, at any rate, controlled) fields. What this means is that they should be pretty easy to link to from our source data. The identifiers are based on the actual code values in an effort to not actually have to look anything up when converting MARC into RDF.

I’ll go over each code list and explain its function and how to link to it from MARC:

Geographic Area Codes

The purpose of these is a little vague: it is hard to classify exactly what they are.  There are states (Tennessee), countries (India), continents (Europe), geographic features (Andes, Congo River, Great Rift Valley), areas or regions (Tropics; “Southwest, New”, whatever that means; “Africa, French-speaking Equatorial”), hemispheres (Southern hemisphere), planets (Uranus) and then there are entries for things like “Outer Space” and “French Community” (which, as I understand it, is sort of the French analog to the British Commonwealth).  In short, they are all over the map (literally).

I have modeled these things as wgs84:SpatialThings.  I don’t know if that is 100% appropriate (e.g. “French Community”) and am open to recommendations for other classes.  Given that they are somewhat hierarchical and are used to define the geographic “subject” of a work, it might be more appropriate to model them using SKOS.

The geographic area code is found in the MARC 043$a (which is a repeatable subfield in a non-repeatable field) and should be a 7 character string (although this may vary based on local cataloging practices).  Most codes will be much shorter than this: the specification requires right-padding with hyphens (“-”) to seven characters (“aa-----”).  To turn this into a MARC Codes URI, you’ll drop the trailing hyphens and append “#location”:

http://purl.org/NET/marccodes/gacs/aa#location

http://purl.org/NET/marccodes/gacs/n-us-md#location
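As a minimal sketch of that normalization in Python (the function and constant names here are my own, not part of any library), assuming the 043$a value has already been pulled out of the record as a string:

```python
GAC_BASE = "http://purl.org/NET/marccodes/gacs/"


def gac_uri(raw_code: str) -> str:
    """Build a MARC Codes URI from a raw 043$a value such as 'aa-----'."""
    # Drop stray whitespace first, then the right-padding hyphens.
    # Hyphens inside the code (as in 'n-us-md') are left alone.
    code = raw_code.strip().rstrip("-")
    if not code:
        raise ValueError(f"no geographic area code in {raw_code!r}")
    return f"{GAC_BASE}{code}#location"
```

Only trailing hyphens are removed, so a code like “n-us---” becomes “n-us” while keeping its internal hyphen.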

I’m not sure what is actually the “best” property to use to link to these resources, but I have been using <http://purl.org/dc/terms/spatial> (although, admittedly, not consistently).  This would entail that these resources are also a <http://purl.org/dc/terms/Location> which is something I can live with.

Not all of the geographic area codes are linked to anything, but some are linked to the authorities at http://id.loc.gov/authorities/, dbpedia, geonames, etc.

Country Codes

These are a little more consistent than the geographic area codes, but they are definitely not all “countries”.  With a few exceptions (United States Misc. Caribbean Islands) they are actual “political entities”, with countries (Guatemala), and states/provinces/territories (Indiana, Nova Scotia, Gibraltar, Gaza Strip).

Like the geographic area codes, I’ve modeled these as wgs84:SpatialThings.

They can appear in several places in the MARC record.  They will almost always appear in the 008 in positions 15-17 as the “country of publication”.  If one code isn’t enough to convey the full story of the production of a particular resource (!), the code may also appear in the 044$a (repeatable subfield, non-repeatable field).  There are three other subfields that the country codes could appear in: the 535$g, 775$f and the 851$g.  I have no idea how common it would be to find them there, and they have a different meaning (the 535 and 851 define the location of the item, for example).

To generate the country code URI, take the value from the MARC 008[15-17] or 044$a, strip any leading or trailing spaces and append “#location”.  The URIs look like:

http://purl.org/NET/marccodes/countries/aw#location

http://purl.org/NET/marccodes/countries/sa#location
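A rough sketch of the same steps in Python, assuming the 008 is available as a plain 40-character string (the helper names are hypothetical):

```python
COUNTRY_BASE = "http://purl.org/NET/marccodes/countries/"


def country_code_from_008(field_008: str) -> str:
    """Return the country of publication code from an 008 string."""
    # MARC positions are zero-based and inclusive, so 15-17 is the slice [15:18].
    return field_008[15:18].strip()


def country_uri(raw_code: str) -> str:
    """Build a MARC Codes URI from an 008[15-17] or 044$a value."""
    # Two-letter codes are padded with a blank in the 008, hence the strip.
    code = raw_code.strip()
    if not code:
        raise ValueError(f"no country code in {raw_code!r}")
    return f"{COUNTRY_BASE}{code}#location"
```

Keeping the extraction and the URI building separate means the same `country_uri` works for 044$a values, which arrive without any positional slicing.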

To link to these resources, I’ve been using the RDA:placeOfPublication property, although I’m sure there are plenty of others that are appropriate (seems like a logical property for BIBO, for example).

The original code lists are also grouped by region, but there are no actual codes for this.  I created some for the purposes of linked data:

http://purl.org/NET/marccodes/countries/regions/1#location

http://purl.org/NET/marccodes/countries/regions/2#location

etc. (until 12).

Because MARC only uses the country codes to note the place of publication, they are far less valuable than the geographic area codes (which are much more ambiguous in meaning).  It’s much more interesting when you can say that all of these things:

http://api.talis.com/stores/rsinger-dev4/services/sparql?query=SELECT+%3Fs%0D%0AWHERE+{%0D%0A%3Fs+%3Fp+%3Chttp%3A%2F%2Fpurl.org%2FNET%2Fmarccodes%2Fgacs%2Fe-ie%23location%3E%0D%0A}&output=json

are referring to the same place as all of these things:

http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&should-sponge=&query=select+distinct+%3Fs+where+%0D%0A{%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2Fcountry%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FIreland%3E}&debug=on&timeout=&format=text%2Fhtml

which, in turn, are referring to the same place as this:

http://www.freebase.com/view/en/ireland/-/book/book_subject/works

which, in my mind, has tremendous potential.

Language Codes

Unbeknownst to me prior to undertaking this project, the Library of Congress is actually the maintenance agency for ISO 639-2 and the ISO codes are actually a derivative of the MARC codes list.  They aren’t actually a 1:1 mapping (there are 22 codes that are different in the ISO list), but they’re extremely close.  What is particularly nice about this is that most locale/language libraries are aware of these codes so it’s fairly easy to map to other locales (notably ISO 639-1, which is used by xml:lang).

The Library of Congress publishes an XML version of the list which is what I used to model it as linked data.  One of the nice features of this list was that it has attributes on the name that denote whether or not there’s an authority record for it:

<name authorized="yes">Abkhaz</name>

which we can then take, tack the substring “ language” onto it and look it up in http://id.loc.gov/authorities:

http://id.loc.gov/authorities/sh85000169#concept

giving us a link between things created in a particular language and things created about that language.
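The label-building half of that step might look something like the sketch below. It only handles a single name element like the one shown above; the rest of the XML file’s structure and the actual lookup against id.loc.gov are deliberately left out, and the function name is my own invention.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def authority_label(name_xml: str) -> Optional[str]:
    """Return the heading to look up for a <name> element, or None if unauthorized."""
    element = ET.fromstring(name_xml)
    # Only names flagged authorized="yes" are expected to have an authority record.
    if element.get("authorized") != "yes" or not element.text:
        return None
    return f"{element.text.strip()} language"
```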

To use the language codes, take the value of positions 35-37 of the 008 or of the 041 (each 041 subfield identifies a different part of the resource that may be in a different language, so check the spec on this one).  It probably rarely appears in actual data, but the 242$y might have the language of the translated title.

Take that value (be sure to strip any trailing/leading whitespace — it’s supposed to be 3 characters: no more, no less) and plug it into the following URI template:

http://purl.org/NET/marccodes/languages/{abc}#lang

for example:

http://purl.org/NET/marccodes/languages/tur#lang

http://purl.org/NET/marccodes/languages/myn#lang

etc.
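Here is a small Python sketch of filling in that template. The lowercasing and the decision to return None for anything that isn’t three letters (blanks or fill characters, say) are my assumptions about sensible handling, not something the code list requires.

```python
import re
from typing import Optional

LANGUAGE_TEMPLATE = "http://purl.org/NET/marccodes/languages/{code}#lang"


def language_uri(raw_code: str) -> Optional[str]:
    """Build a language URI from an 008[35-37], 041 or 242$y value."""
    code = raw_code.strip().lower()
    # Exactly three letters: no more, no less.
    if not re.fullmatch(r"[a-z]{3}", code):
        return None
    return LANGUAGE_TEMPLATE.format(code=code)


def language_uri_from_008(field_008: str) -> Optional[str]:
    # Positions 35-37 are zero-based and inclusive, so slice [35:38].
    return language_uri(field_008[35:38])
```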

The language resources link to id.loc.gov (as mentioned above) as well as Lingvoj/Lexvo (they link to both, where appropriate, since many data sources out there are likely still using the Lingvoj URIs).  There are a handful (for example, Swedish) that link to dbpedia, but since those links are available in Lexvo, it’s not essential they appear here.

Musical Composition Codes

There are two codes lists that are directly related to music-based resources (sound recordings, scores and video): the musical composition codes and the Instruments and Voices codes.  Given that there has been a lot of work put into modeling music data for the linked data cloud, I thought it would be most useful to orient both of these lists to be used with the Music Ontology.

The composition codes basically denote the “genre” of the music contained in the resource.  It’s extremely classical-centric and sometimes lumps a lot of different forms into one genre code (try Divertimentos, serenades, cassations, divertissements, and notturni on for size), but they are definitely a start for finding like resources.

They are modeled as mo:Genre resources and include links to id.loc.gov, dbpedia and wikipedia.  To get the code, either use positions 18-19 of the MARC 008 field or the 047$a (both a repeating field and subfield).  The normalized code should always be two alpha characters long, and downcased.

They go into a URI template like:

http://purl.org/NET/marccodes/muscomp/{ab}#genre

such as:

http://purl.org/NET/marccodes/muscomp/sy#genre

or

http://purl.org/NET/marccodes/muscomp/mz#genre
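Since the 047$a can repeat, a record may yield several genre URIs. The sketch below takes a list of already-extracted code values (however you got them out of the record) and returns the distinct URIs in order; the function name and the choice to silently skip anything that isn’t two letters are assumptions of mine.

```python
import re
from typing import Iterable, List

MUSCOMP_TEMPLATE = "http://purl.org/NET/marccodes/muscomp/{code}#genre"


def muscomp_uris(raw_codes: Iterable[str]) -> List[str]:
    """Turn raw composition codes into distinct genre URIs, keeping first-seen order."""
    uris: List[str] = []
    for raw in raw_codes:
        code = raw.strip().lower()
        # Skip blanks, fill characters and anything else that is not two letters.
        if not re.fullmatch(r"[a-z]{2}", code):
            continue
        uri = MUSCOMP_TEMPLATE.format(code=code)
        if uri not in uris:
            uris.append(uri)
    return uris
```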

It would be really useful to find other datasources that use mo:Genre to link these to.

Form of Item Codes

This is a very small list that broadly describes the format of the resource being described.  This is probably most useful to use with dcterms:format, so they’ve all been modeled with the rdf:type dcterms:MediaType.  A full third of the codes describe microforms (granted, out of 9 total), which should give you some sense of how relevant these are.

Getting the code from the MARC record is dependent on the kind of record you’re looking at.  For books, serials, sound recordings, scores, computer files and mixed materials, take position 23 of the 008 (counting from zero, like the other 008 positions above).  For visual materials and maps use position 29.  The code should be a single lowercase alphabetic character.

URIs look like:

http://purl.org/NET/forms/{a}#form
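Because the position depends on the kind of record, a lookup table keeps the logic readable. In this sketch the record-kind labels (“book”, “map” and so on) are hypothetical names I made up; working out the kind from the leader is left to the caller.

```python
from typing import Optional

FORM_TEMPLATE = "http://purl.org/NET/forms/{code}#form"

# Hypothetical labels for the record kinds, mapped to the 008 position to read.
FORM_POSITION_BY_KIND = {
    "book": 23,
    "serial": 23,
    "sound_recording": 23,
    "score": 23,
    "computer_file": 23,
    "mixed_material": 23,
    "visual_material": 29,
    "map": 29,
}


def form_of_item_uri(field_008: str, kind: str) -> Optional[str]:
    """Return the form of item URI, or None when the position holds no letter code."""
    position = FORM_POSITION_BY_KIND[kind]
    # Slicing (rather than indexing) yields "" for a short 008 instead of raising.
    code = field_008[position:position + 1]
    if not ("a" <= code <= "z"):
        return None
    return FORM_TEMPLATE.format(code=code)
```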

The resources link to http://id.loc.gov/authorities (think Genre/Form terms), http://id.loc.gov/vocabulary/graphicMaterials and (for a couple) dbpedia.

Ideally, these will eventually link to whatever is analogous in RDA (if somebody can point that out to me).

Frequency of Issue Codes

Unlike the previous code list, this one seems much more useful.  It is used to define how often a continuing resource is updated.  Unfortunately, it is extremely print-centric (the only term more frequent than “daily” is “Continuously updated” which is defined as “Updated more frequent than daily.”), but some of the terms would seem to hold value even outside of the library context (Annual, Biweekly, Quarterly, etc.).  It doesn’t take a tremendous leap of the imagination to see how these might be useful for events calendars (Monthly, etc.) or for GoodRelations-type datasets (“Semi-annual Blowout Sale!”).

To get the code from the MARC record, check the 008[18] or the 853-855$w.  Presumably, this should only appear for continuing resources (SER).  It’s a one letter code, lower cased.

The URIs look like:

http://purl.org/NET/marccodes/frequency/{x}#term

They are modeled as dcterms:Frequency resources and link to dbpedia where available.

Target Audience Codes

This is another fairly short, extremely generalized list.  It is most likely to be useful for determining the age level of children’s resources (5 of the 8 terms are for juvenile age groups).  They are of rdf:type dcterms:AgentClass.  Resources are linked (where appropriate, and maybe even a few that aren’t) to dbpedia and http://id.loc.gov/authorities/.

For books, music (scores, sound recordings), computer files and visual materials, get the code from the 008[22].  It is one letter, lower cased.  URIs follow the fairly consistent form we’ve seen thus far:

http://purl.org/NET/marccodes/target/{x}#term

http://purl.org/NET/marccodes/target/c#term

http://purl.org/NET/marccodes/target/f#term
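The frequency and target audience codes are both single letters at a fixed 008 position, so one table-driven helper can cover the pair. This is only a sketch: the list names used as keys are mine, and checking that the record is of a kind where the position actually means frequency or audience is left to the caller.

```python
from typing import Optional

# list name -> (zero-based 008 position, URI template)
ONE_LETTER_LISTS = {
    "frequency": (18, "http://purl.org/NET/marccodes/frequency/{code}#term"),
    "target": (22, "http://purl.org/NET/marccodes/target/{code}#term"),
}


def one_letter_uri(field_008: str, list_name: str) -> Optional[str]:
    """Return the URI for a one-letter 008 code, or None if the position is not a letter."""
    position, template = ONE_LETTER_LISTS[list_name]
    code = field_008[position:position + 1]
    # Blanks and fill characters mean there is nothing to link to.
    if not ("a" <= code <= "z"):
        return None
    return template.format(code=code)
```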

Instruments and Voices Codes

The terms describe the instruments or vocal groups that either appear (for sound recordings, for example) or are intended (scores) for a particular resource.  Like many of the other codes lists, these are quite general and maddeningly biased towards classical music (Continuo, Celeste, Viola d’amore, but no banjo or sitar, for instance).  Like the form of musical composition terms, I modeled these to use with the Music Ontology, namely as the object of mo:instrument.  mo:Instrument has this note:

Any taxonomy can be used to subsume this concept. The default one is one extracted by Ivan Herman
from the Musicbrainz instrument taxonomy, conforming to SKOS. This concept holds a seeAlso link
towards this taxonomy.

so these terms have been modeled as skos:Concepts.  There are skos:exactMatch relationships to the Musicbrainz taxonomy where appropriate (as well as links to id.loc.gov/authorities and dbpedia).  The original code lists had an implication of hierarchy (“Larger ensemble – Dance orchestra” should be thought of as “Dance orchestra” with broader term “Larger ensemble”), but that’s not actually used in MARC.  I broke these broader terms out on their own for this vocabulary, since it seemed useful in a linked data context and wouldn’t actually hurt anything (the codes are two letters, so the “broader terms” are just using the first letter).

To get the code, use the MARC 048 subfield a or b (for ensemble or solo parts, respectively) and take the first two characters (which must be letters).  This code may be followed by a two-digit number (left padded with zeroes) signifying how many parts.  Drop this number, if present.

URI template:

http://purl.org/NET/marccodes/musperf/{xx}#term

http://purl.org/NET/marccodes/musperf/cd#term

http://purl.org/NET/marccodes/musperf/ed#term
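A sketch of that parsing in Python, using a regular expression to separate the two-letter code from the optional part count. The function name is hypothetical, and lowercasing the input is my own assumption.

```python
import re
from typing import Optional

MUSPERF_TEMPLATE = "http://purl.org/NET/marccodes/musperf/{code}#term"

# Two letters, optionally followed by a two-digit number of parts.
MUSPERF_PATTERN = re.compile(r"([a-z]{2})(\d{2})?")


def musperf_uri(raw_value: str) -> Optional[str]:
    """Build an instruments and voices URI from an 048$a or 048$b value."""
    match = MUSPERF_PATTERN.fullmatch(raw_value.strip().lower())
    if match is None:
        return None
    # group(1) is the code itself; the part count in group(2) is dropped.
    return MUSPERF_TEMPLATE.format(code=match.group(1))
```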

Other Codes

I am not sure when or if I will model any more codes lists.  Ideally, the Library of Congress should be doing these (they’ve done the relator codes, and preservation events lists).  The only other lists I can see much value in are the Specific Material Form Terms (the MARC 007) and the MARC Organization codes.

I have done a bit of work on the specific material forms list, but it’s fairly complicated.  My current approach is a hybrid of controlled vocabularies and RDF schema (after all, it makes sense for a globe to be rdf:type <http://purl.org/NET/marccodes/smd/terms/Globe> rather than that be some property set on an untyped resource).  For an RDF schema, though, I would prefer a “better” namespace than purl.org/NET/, although perhaps it doesn’t really matter much.

No matter what, it would certainly push the limits of my freebie Heroku account that this is currently running on.

I am definitely open to any ideas or recommendations people might have for these (and requests for other lists to be converted).  I’d also be interested to see if you are able to use them with your data.

Note: to see the backstory and justification of this proposal, please see the preceding post.

MARC-in-JSON is a proposed JSON schema for representing MARC records as JSON. It is the outgrowth of working with MARC data in MongoDB and is intended to be both a faithful representation of MARC and a logical and useful model to work with natively in JSON-centric environments. Ideally, this serialization could eventually replace binary MARC as the default format. The round trip of a MARC-in-JSON record from MARC to JSON back to MARC is lossless and preserves field/subfield order.

An example MARC bibliographic record, represented as text:

LEADER 01471cjm a2200349 a 4500
001 5674874
005 20030305110405.0
007 sdubsmennmplu
008 930331s1963 nyuppn eng d
035 $9 (DLC) 93707283
906 $a 7 $b cbc $c copycat $d 4 $e ncip $f 19 $g y-soundrec
010 $a 93707283
028 02 $a CS 8786 $b Columbia
035 $a (OCoLC)13083787
040 $a OClU $c DLC $d DLC
041 0 $d eng $g eng
042 $a lccopycat
050 00 $a Columbia CS 8786
100 1 $a Dylan, Bob, $d 1941-
245 14 $a The freewheelin' Bob Dylan $h [sound recording].
260 $a [New York, N.Y.] : $b Columbia, $c [1963]
300 $a 1 sound disc : $b analog, 33 1/3 rpm, stereo. ; $c 12 in.
500 $a Songs.
511 0 $a The composer accompanying himself on the guitar ; in part with instrumental ensemble.
500 $a Program notes by Nat Hentoff on container.
505 0 $a Blowin' in the wind -- Girl from the north country -- Masters of war -- Down the highway -- Bob Dylan's blues -- A hard rain's a-gonna fall -- Don't think twice, it's all right -- Bob Dylan's dream -- Oxford town -- Talking World War III blues -- Corrina, Corrina -- Honey, just allow me one more chance -- I shall be free.
650 0 $a Popular music $y 1961-1970.
650 0 $a Blues (Music) $y 1961-1970.
856 41 $3 Preservation copy (limited access) $u http://hdl.loc.gov/loc.mbrsrs/lp0001.dyln
952 $a New
953 $a TA28
991 $b c-RecSound $h Columbia CS 8786 $w MUSIC

The same bibliographic record serialized as MARC-in-JSON would appear as follows (pretty-printed with whitespace and line breaks for readability):

{
    "leader":"01471cjm a2200349 a 4500",
    "fields":
    [
        {
            "001":"5674874"
        },
        {
            "005":"20030305110405.0"
        },
        {
            "007":"sdubsmennmplu"
        },
        {
            "008":"930331s1963    nyuppn              eng d"
        },
        {
            "035":
            {
                "subfields":
                [
                    {
                        "9":"(DLC)   93707283"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "906":
            {
                "subfields":
                [
                    {
                        "a":"7"
                    },
                    {
                        "b":"cbc"
                    },
                    {
                        "c":"copycat"
                    },
                    {
                        "d":"4"
                    },
                    {
                        "e":"ncip"
                    },
                    {
                        "f":"19"
                    },
                    {
                        "g":"y-soundrec"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "010":
            {
                "subfields":
                [
                    {
                        "a":"   93707283 "
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "028":
            {
                "subfields":
                [
                    {
                        "a":"CS 8786"
                    },
                    {
                        "b":"Columbia"
                    }
                ],
                "ind1":"0",
                "ind2":"2"
            }
        },
        {
            "035":
            {
                "subfields":
                [
                    {
                        "a":"(OCoLC)13083787"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "040":
            {
                "subfields":
                [
                    {
                        "a":"OClU"
                    },
                    {
                        "c":"DLC"
                    },
                    {
                        "d":"DLC"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "041":
            {
                "subfields":
                [
                    {
                        "d":"eng"
                    },
                    {
                        "g":"eng"
                    }
                ],
                "ind1":"0",
                "ind2":" "
            }
        },
        {
            "042":
            {
                "subfields":
                [
                    {
                        "a":"lccopycat"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "050":
            {
                "subfields":
                [
                    {
                        "a":"Columbia CS 8786"
                    }
                ],
                "ind1":"0",
                "ind2":"0"
            }
        },
        {
            "100":
            {
                "subfields":
                [
                    {
                        "a":"Dylan, Bob,"
                    },
                    {
                        "d":"1941-"
                    }
                ],
                "ind1":"1",
                "ind2":" "
            }
        },
        {
            "245":
            {
                "subfields":
                [
                    {
                        "a":"The freewheelin' Bob Dylan"
                    },
                    {
                        "h":"[sound recording]."
                    }
                ],
                "ind1":"1",
                "ind2":"4"
            }
        },
        {
            "260":
            {
                "subfields":
                [
                    {
                        "a":"[New York, N.Y.] :"
                    },
                    {
                        "b":"Columbia,"
                    },
                    {
                        "c":"[1963]"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "300":
            {
                "subfields":
                [
                    {
                        "a":"1 sound disc :"
                    },
                    {
                        "b":"analog, 33 1/3 rpm, stereo. ;"
                    },
                    {
                        "c":"12 in."
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "500":
            {
                "subfields":
                [
                    {
                        "a":"Songs."
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "511":
            {
                "subfields":
                [
                    {
                        "a":"The composer accompanying himself on the guitar ; in part with instrumental ensemble."
                    }
                ],
                "ind1":"0",
                "ind2":" "
            }
        },
        {
            "500":
            {
                "subfields":
                [
                    {
                        "a":"Program notes by Nat Hentoff on container."
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "505":
            {
                "subfields":
                [
                    {
                        "a":"Blowin' in the wind -- Girl from the north country -- Masters of war -- Down the highway -- Bob Dylan's blues -- A hard rain's a-gonna fall -- Don't think twice, it's all right -- Bob Dylan's dream -- Oxford town -- Talking World War III blues -- Corrina, Corrina -- Honey, just allow me one more chance -- I shall be free."
                    }
                ],
                "ind1":"0",
                "ind2":" "
            }
        },
        {
            "650":
            {
                "subfields":
                [
                    {
                        "a":"Popular music"
                    },
                    {
                        "y":"1961-1970."
                    }
                ],
                "ind1":" ",
                "ind2":"0"
            }
        },
        {
            "650":
            {
                "subfields":
                [
                    {
                        "a":"Blues (Music)"
                    },
                    {
                        "y":"1961-1970."
                    }
                ],
                "ind1":" ",
                "ind2":"0"
            }
        },
        {
            "856":
            {
                "subfields":
                [
                    {
                        "3":"Preservation copy (limited access)"
                    },
                    {
                        "u":"http://hdl.loc.gov/loc.mbrsrs/lp0001.dyln"
                    }
                ],
                "ind1":"4",
                "ind2":"1"
            }
        },
        {
            "952":
            {
                "subfields":
                [
                    {
                        "a":"New"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "953":
            {
                "subfields":
                [
                    {
                        "a":"TA28"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        },
        {
            "991":
            {
                "subfields":
                [
                    {
                        "b":"c-RecSound"
                    },
                    {
                        "h":"Columbia CS 8786"
                    },
                    {
                        "w":"MUSIC"
                    }
                ],
                "ind1":" ",
                "ind2":" "
            }
        }
    ]
}
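To give a feel for working with this structure natively, here is a small Python sketch that pulls subfield values out of a parsed record (for example, the result of `json.loads`). The helper name is hypothetical and not part of the proposal.

```python
from typing import List


def subfield_values(record: dict, tag: str, code: str) -> List[str]:
    """Return every value of subfield `code` in fields tagged `tag`, in record order."""
    values: List[str] = []
    for field in record["fields"]:
        body = field.get(tag)
        # Control fields map the tag to a plain string, so only dict bodies have subfields.
        if isinstance(body, dict):
            for subfield in body["subfields"]:
                if code in subfield:
                    values.append(subfield[code])
    return values
```

Against the record above, `subfield_values(record, "650", "a")` would collect the two subject headings in the order they appear.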

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in IETF RFC 2119.

MARC-in-JSON records MUST conform to the following JSON schema:

{
    "description":"A MARC Record",
    "type": "object",
    "properties": {
        "leader": {
            "type": "string",
            "minLength": 24,
            "maxLength": 24
        },
        "fields": {
            "type": "array",
            "items": {
                "type":[
                    {
                        "type": "object",
                        "description":"A MARC Control Field",
                        "additionalProperties":{
                            "type":"string"
                        }
                    },
                    {
                        "type": "object",
                        "additionalProperties":{
                            "type":"object",
                            "description":"A MARC Variable Field",
                            "properties":{
                                "ind1":{
                                    "type":"string",
                                    "minLength":1,
                                    "maxLength":1
                                },
                                "ind2":{
                                    "type":"string",
                                    "minLength":1,
                                    "maxLength":1
                                },
                                "subfields":{
                                    "type":"array",
                                    "items":{
                                        "type":"object",
                                        "description":"A MARC Subfield",
                                        "additionalProperties":{
                                            "type":"string"
                                        }
                                    }
                                }
                            }
                        }
                    }
                    ]
                },
            "additionalProperties": false
        }
    },
    "additionalProperties": false
}

Download this schema.

MARC-in-JSON consists of four (4) object types:

Record objects
The base representation of the MARC record. It MUST be a JSON object with two properties:

  • leader, which MUST be a string, exactly 24 characters in length.
  • fields, an array which MUST only contain control field and variable field objects.

Record objects MAY be contained in a JSON array.

Control field objects
MARC control fields MUST be represented as a JSON object with a single key/value pair. The key MUST be a string conforming to a valid MARC field tag value (generally three alphanumeric characters). The value of the object MUST be a string.

Variable field objects
Variable fields MUST be represented as JSON objects with a single key/value pair. The key MUST be a string conforming to a valid MARC field tag value (generally three alphanumeric characters). The value of the object MUST be a JSON object with three properties:

  • ind1: a one (1) character string representing the 1st MARC field indicator
  • ind2: a one (1) character string representing the 2nd MARC field indicator
  • subfields: an array containing at least one subfield object

Subfield objects
MARC subfields MUST be represented as JSON objects with a single key/value pair. The key MUST be a string conforming to a valid MARC subfield code value (generally a single alphanumeric character). The value MUST be a string representing the value of the subfield. A subfield object MUST only appear in a variable field object subfields array.
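The four object types can be checked with a few small predicates. The following Python sketch is illustrative rather than normative: it works on already-parsed JSON, the function names are my own, and it does not verify that tags or subfield codes are valid MARC values.

```python
def _single_pair(obj):
    """Return the (key, value) of a one-entry dict, or None for anything else."""
    if isinstance(obj, dict) and len(obj) == 1:
        return next(iter(obj.items()))
    return None


def is_subfield(obj) -> bool:
    pair = _single_pair(obj)
    return pair is not None and isinstance(pair[0], str) and isinstance(pair[1], str)


def is_control_field(obj) -> bool:
    # Same shape as a subfield: a single tag mapped to a single string.
    return is_subfield(obj)


def _is_indicator(value) -> bool:
    return isinstance(value, str) and len(value) == 1


def is_variable_field(obj) -> bool:
    pair = _single_pair(obj)
    if pair is None or not isinstance(pair[1], dict):
        return False
    body = pair[1]
    if set(body) != {"ind1", "ind2", "subfields"}:
        return False
    subfields = body["subfields"]
    return (
        _is_indicator(body["ind1"])
        and _is_indicator(body["ind2"])
        and isinstance(subfields, list)
        and len(subfields) >= 1
        and all(is_subfield(s) for s in subfields)
    )


def is_record(obj) -> bool:
    if not isinstance(obj, dict) or set(obj) != {"leader", "fields"}:
        return False
    leader, fields = obj["leader"], obj["fields"]
    return (
        isinstance(leader, str)
        and len(leader) == 24
        and isinstance(fields, list)
        and all(is_control_field(f) or is_variable_field(f) for f in fields)
    )
```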

The content of a MARC-in-JSON object MUST be UTF-8 encoded or UTF-8 escaped according to the JSON standard (RFC 4627).  MARC-8, UTF-16 or UTF-32 SHALL NOT be permitted under MARC-in-JSON.
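Both permitted encodings are easy to produce with Python’s standard `json` module; the two wrapper functions below are hypothetical names for illustration only.

```python
import json


def to_utf8_bytes(record: dict) -> bytes:
    # ensure_ascii=False keeps non-ASCII characters as-is; the text is then UTF-8 encoded.
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def to_escaped_text(record: dict) -> str:
    # The default (ensure_ascii=True) writes non-ASCII characters as \uXXXX escapes.
    return json.dumps(record)
```

Either output parses back to the same record, so the choice only affects size and how readable the raw bytes are.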

There are currently two implementations conforming to this specification for serialization:

Note: this post is broken into two parts. If you are just interested in the MARC-in-JSON technical specification proposal, look here. If you’re interested in the justification and history of it, read on.

The easy and obvious reaction to the suggestion of providing new serializations to MARC is to reach for Roy Tennant’s suggestion and ask for its head. Let’s face it, though, MARC is going nowhere soon and we have yet to produce a truly lossless alternative format. With the rise of JSON as the generally preferred means of communication between machines on the web, it was only a matter of time before the proposals for a standardized way to serialize MARC as JSON began to materialize.

So far we have two well-publicized suggestions: one by Bill Dueber, at the University of Michigan; and one by Andrew Houghton, who works at OCLC Research. They are quite different and each has its advantages and disadvantages. The tricky part of representing MARC in JSON is preserving field and subfield order: simply representing an object (hash, associative array, dictionary, map, hashtable, etc.) will not do the trick (despite being the most programmatically friendly) since it would be impossible to ensure a round trip from MARC to JSON back into an identical MARC record. Instead, we need to make some compromises (which both proposals do), which is not entirely unexpected with a 40-year-old standard.
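The problem with a plain hash is easy to demonstrate. In this Python sketch the fields are simplified to (tag, value) pairs (real data fields also carry indicators and subfields), and the function names are made up for the illustration.

```python
FIELDS = [
    ("035", "(DLC)   93707283"),
    ("010", "   93707283 "),
    ("035", "(OCoLC)13083787"),
]


def as_plain_hash(fields):
    """Key the fields by tag: convenient, but a repeated tag overwrites the earlier one."""
    return {tag: value for tag, value in fields}


def as_ordered_array(fields):
    """One single-key object per field: repeated tags and their order both survive."""
    return [{tag: value} for tag, value in fields]
```

With the three fields above, the plain hash ends up with two entries and no record of where the first 035 sat, while the array keeps all three in their original order.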

Let me explain a bit about each proposal, starting with Bill’s.

Bill takes the approach that JSON would more or less simply be used for transmission: it is optimized to serialize and parse very quickly. It is, however, not a terribly friendly format to work with natively. A MARC-HASH record is basically a hash with arrays of arrays representing the fields and subfields, with the hash providing administrative data (version, type) and the leader. The fields key contains the array of arrays. These arrays either have two values (for control fields) or four (for data fields) with the last value being an array of subfields (which are also two value arrays). The first value in these arrays is always the MARC tag. For data fields, the second and third values are the indicators. For control fields the second value is the field value.

Here’s an example, provided by Bill:

{
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
}

By putting all of the field data in nested arrays, a client would have to loop through every array to find specific fields. This isn’t difficult, of course; it’s just not terribly friendly.
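As a rough illustration of what that looping looks like, here is a minimal Ruby sketch against a trimmed-down copy of Bill's example. The helper names fields_for and subfield_values are hypothetical, invented for this sketch; they are not part of any MARC-HASH library.

```ruby
require 'json'

# A trimmed-down MARC-HASH record based on the example above.
marc_hash = JSON.parse(<<~RECORD)
  {
    "type": "marc-hash",
    "version": [1, 0],
    "leader": "leader string",
    "fields": [
      ["001", "001 value"],
      ["035", " ", " ", [["a", "(RLIN)MIUG0000733-B"]]],
      ["035", " ", " ", [["a", "(CaOTULAS)159818014"]]],
      ["245", "1", "0", [["a", "Capitalism, primitive and modern;"],
                         ["c", "[by] T. Scarlett Epstein."]]]
    ]
  }
RECORD

# Hypothetical helper: every lookup is a linear scan over the fields array.
def fields_for(record, tag)
  record["fields"].select { |field| field[0] == tag }
end

# Hypothetical helper: subfields live in the last element of a data field.
def subfield_values(field, code)
  field.last.select { |sub| sub[0] == code }.map { |sub| sub[1] }
end

system_numbers = fields_for(marc_hash, "035").flat_map { |f| subfield_values(f, "a") }
puts system_numbers.inspect
```

Everything is addressed by array position (tag at index 0, subfields at the end), so the client has to know the layout rather than ask for a key by name.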

Andrew’s approach is modeled after MARCXML. A record is represented by an object with three attributes: leader (string), controlfield (array), and datafield (array). The controlfields are represented as objects with two keys: tag (string) and data (string). Datafields have three keys: tag (string), ind (string for both indicators) and subfield (array). The subfield array contains objects with two keys: code (string) and data (string).
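For comparison, here is a sketch of a record shaped the way that description reads, built as a Ruby hash and dumped as JSON. Only the key names (leader, controlfield, datafield, tag, data, ind, subfield, code) come from the description above; the values are borrowed from Bill's example, and I'm assuming the two indicators are simply concatenated into one string.

```ruby
require 'json'

# A record following the key names described above. The values are
# illustrative; "ind" is assumed to hold both indicators as one string.
record = {
  "leader" => "leader string",
  "controlfield" => [
    { "tag" => "001", "data" => "001 value" }
  ],
  "datafield" => [
    { "tag" => "245", "ind" => "10",
      "subfield" => [
        { "code" => "a", "data" => "Capitalism, primitive and modern;" },
        { "code" => "c", "data" => "[by] T. Scarlett Epstein." }
      ] }
  ]
}

puts JSON.pretty_generate(record)

# Finding a title still means scanning the datafield array by tag.
title_field = record["datafield"].find { |field| field["tag"] == "245" }
title = title_field["subfield"].find { |sub| sub["code"] == "a" }["data"]
puts title
```

The keys are self-describing, but the tags and subfield codes are still values inside arrays, so a lookup is a scan here as well.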

By replicating MARCXML, Andrew’s proposal should be pretty recognizable to anyone that’s familiar with MARC. Bill’s concept is going to be incredibly fast to serialize and parse. Unfortunately, I don’t think either of them is particularly good JSON (Andrew’s model is, of course, further hampered by the fact that MARCXML isn’t particularly good XML). As JSON becomes a first class citizen on the web, with libraries (JSONPath) and datastores (MongoDB, CouchDB, Persevere) specialized for it, it seems as though a MARC-in-JSON format should not only be good MARC, but good JSON, too.

This led me to develop yet another way to serialize MARC as JSON. While JSONPath or MongoDB, by themselves, aren’t realistic functional requirements (since neither is “standard” in any way), they do represent how people that are seriously using JSON expect it to work. If JSONPath is superseded by some other XPath-analogous-y means of traversing and searching a JSON data structure, it will likely contain many similarities to JSONPath, which is similar (although not identical) to XPath. These technologies indicate that people are looking at JSON as a first class data structure of its own and not just as something that’s simple to serialize and parse into other forms for passing data around.

And so, with that backstory in place, let’s proceed to my proposal to represent MARC in JSON.

One of the byproducts of the “Communicat” work I had done at Georgia Tech was a variant of Ed Summers’ ruby-marc that went into more explicit detail regarding the contents inside the MARC record (as opposed to ruby-marc, which focuses on its structure).  It had been living for the last couple of years as a branch within ruby-marc, but this was never a particularly ideal approach.  These enhancements were sort of out of scope for ruby-marc as a general MARC parser/writer, so it’s not as if this branch was ever going to see the light of day as trunk.  As a result, it was a massive pain in the butt for me to use locally:  I couldn’t easily add it as a gem (since it would have replaced the real ruby-marc, which I use far too much to live without), which meant that I would have to explicitly include it in whatever projects I wanted to use it in and update any included paths accordingly.

So as I found myself, yet again, copying the TypedRecords directory into another local project (this one to map MARC records to RDF), I decided it was time to make this its own project.

One of the amazingly wonderful aspects of Ruby is the notion of “opening up an object or class”.  For those not familiar with Ruby, the language allows you to take basically any object or class, redefine it and add your own attributes, methods, etc.  So if you feel that there is some particular functionality missing from a given Ruby object, you can just redefine it, adding or overriding the existing methods, without having to reimplement the entire thing.  So, for example:

class String
  def shout
    "#{self.upcase}!!!!"
  end
end

str = "Hello World"
str.shout
=> "HELLO WORLD!!!!"

And just like that, your String objects gained the ability to get a little louder and a little more obnoxious.

So rather than design the typed records concept as a replacement for ruby-marc, it made more sense to treat it as an extension to ruby-marc.  By monkey patching, the regular marc parser/writer can remain the same, but if you want to look a little more closely at the contents, it will override the behavior of the original classes and objects and add a whole bunch of new functionality.  For MARC records, it’s analogous to how Facets adds all kinds of convenience methods to String, Fixnum, Array, etc.

So, now it has its own github project:  enhanced-marc.

If you want to install it:

  gem sources -a http://gems.github.com
  sudo gem install rsinger-enhanced_marc

There are some really simple usage instructions on the project page and I’ll try to get the rdocs together as soon as I can.  In a nutshell it works almost just like ruby-marc does:

require 'enhanced_marc'

records = []
# ruby-marc readers are created with MARC::Reader.new, given a path or IO
reader = MARC::Reader.new('marc.dat')
reader.each do |record|
  records << record
end

As it parses each record, it examines the leader to determine what kind of record it is:

  • MARC::BookRecord
  • MARC::SerialRecord
  • MARC::MapRecord
  • MARC::ScoreRecord
  • MARC::SoundRecord
  • MARC::VisualRecord
  • MARC::MixedRecord

and adds a bunch of format specific methods appropriate for, say, a map.
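To give a feel for what "examining the leader" involves, here is a rough sketch of that kind of dispatch. This is not the actual enhanced-marc code: the record_kind method is hypothetical, it returns class names as plain strings so it runs without any gems, and the grouping of type codes into the classes above is my own reading of the MARC 21 leader (position 06 is the type of record, position 07 the bibliographic level).

```ruby
# Hypothetical sketch: map a MARC leader to one of the class names above.
# The grouping of codes is an assumption, not enhanced-marc's real logic.
def record_kind(leader)
  type  = leader[6] # Leader/06: type of record
  level = leader[7] # Leader/07: bibliographic level

  case type
  when "a"
    # Language material: the bibliographic level separates serials from books.
    %w[b i s].include?(level) ? "MARC::SerialRecord" : "MARC::BookRecord"
  when "t"                then "MARC::BookRecord"   # manuscript language material
  when "c", "d"           then "MARC::ScoreRecord"  # notated music
  when "e", "f"           then "MARC::MapRecord"    # cartographic material
  when "i", "j"           then "MARC::SoundRecord"  # sound recordings
  when "g", "k", "o", "r" then "MARC::VisualRecord" # visual materials
  when "p"                then "MARC::MixedRecord"  # mixed materials
  else "MARC::Record" # anything unrecognized stays a plain record
  end
end

puts record_kind("00000nas a2200000 a 4500") # MARC::SerialRecord
puts record_kind("00000nem a2200000 a 4500") # MARC::MapRecord
```

Once the kind is known, the format specific methods can be mixed into that one record without touching any of the others.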

It’s possible to then simply extract either the MARC codes or the (English) human readable string that the MARC code represents:

record.class
=> MARC::SerialRecord
record.frequency
=> "d"
record.frequency(true)
=> "Daily"
record.serial_type(true)
=> "Newspaper"
record.is_conference?
=> false

or, say:

record.class
=> MARC::VisualRecord
record.is_govdoc?
=> true
record.audience_level
=> "j"
record.material_type(true)
=> "Videorecording"
record.technique(true)
=> "Animation"

And so on.

There is still quite a bit I need to add.  It pretty much ignores mixed records at the moment.  It’s something I’ll need to eventually get to, but these are uncommon enough that it’s currently a lower priority.  I also need to provide some methods that evaluate the 007 field.  I haven’t gotten to this yet, simply because it’s a ton of tedium.  It would be useful, though, so I want to get it in there.

If there is interest, it could perhaps be extended to include authority records or holdings records.  It would also be handy to have convenience methods on the data fields:

record.isbn
=> "0977616630"
record.control_number
=> "793456"
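Methods like these don't exist yet, but the same "open up the class" trick would cover them. Here is a minimal sketch using a stand-in SketchRecord class (hypothetical, so that it runs without any gems) rather than the real MARC::Record; the field layout it assumes is a simplification.

```ruby
# Stand-in for a parsed record so the sketch runs on its own. Fields are
# [tag, value] pairs for control fields and [tag, {code => value}] pairs
# for data fields, which is a simplification of a real record.
class SketchRecord
  def initialize(fields)
    @fields = fields
  end

  # Contents of the first field with the given tag, or nil if absent.
  def field(tag)
    pair = @fields.find { |t, _| t == tag }
    pair && pair[1]
  end
end

# Reopen the class to bolt on the convenience methods.
class SketchRecord
  # The control number lives in the 001 field.
  def control_number
    field("001")
  end

  # The ISBN lives in subfield a of the 020 field.
  def isbn
    data = field("020")
    data && data["a"]
  end
end

record = SketchRecord.new([["001", "793456"], ["020", { "a" => "0977616630" }]])
puts record.isbn
puts record.control_number
```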

Anyway, hopefully somebody might find this to be useful.