Metaproxy routes and filters, as I understand them

Like most Indexdata products, Metaproxy is an incredibly useful, although damn near impenetrable, application.  I have been using it to manage our Z39.50 access in the project I’ve been working on for the last year or so (Talis Aspire Digitised Content, if you’re interested).  Its purpose is two-fold: most importantly, it makes the Z39.50 services available via SRU (so we don’t need Z39.50 support in our apps themselves), and it also allows us to federate a bunch of library catalogs in one target, so we can get a cross-section of major libraries in one query.

For the last nine months or so, I’ve been using Metaproxy quite successfully, albeit in a very low-volume and tightly scoped context, without having the slightest idea how it works (as one tends to do with Indexdata-produced software) or what the configuration settings really do.  Despite the fact that we were pointing at a little less than twenty-five Z39.50 targets, it just worked (after some initial trial and error), even though these targets made up a diverse cross-section of ILMSes (Voyager, Symphony, Aleph, Prism).  Granted, we’re only searching on ISBN and ISSN right now, but none of the currently existing catalogs required any special configuration.

There was a notable vendor that wasn’t represented, however.

Recently, TADC has gone from ‘closed pilot’ to ‘available for any institution to request a demo’.  A university recently requested a demo and when I added their Z39.50 target (which you will not be surprised to learn was the vendor that we hadn’t dealt with) to Metaproxy, I noticed I kept getting ‘invalid combination of attributes for index’ errors when I would try to do ISBN and ISSN queries via SRU (although, interestingly, not via Z39.50).

If you’re not familiar with queries in Z39.50, they have an incredibly opaque construction where every query element takes (up to) 6 ‘attributes’:

  1. use
    which field you want to search: title, author, etc.
  2. relation
    =, >, <, exact, etc.
  3. position
    first in field, first in subfield, anywhere in field, etc.
  4. structure
    word, phrase, keyword, date, etc.
  5. truncate
    left, right, no truncation, etc.
  6. completeness
    incomplete subfield, complete subfield, complete field

So a query for ISBN=1234567890 in Prefix Query Format (PQF, the text notation Indexdata’s tools use for writing Z39.50 queries) would look like:

@attr 1=7 @attr 2=3 @attr 3=2 @attr 4=1 @attr 5=100 @attr 6=1 1234567890

To translate this, you refer to the Bib-1 attribute set, but to break it down, it’s saying: search the ISBN field for strings that start with ‘1234567890’ followed by a space (and possibly more characters) or the end of the field. Servers will often have default behavior on fields so you don’t, in practice, always have to send all 6 attributes (often you only need to send the use attribute), but accepted attribute combinations are completely at the discretion of the server.  The servers we were pointing at up until now were happy with @attr 1=7 @attr 2=3 1234567890.
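
To make the numbering concrete, here is a minimal Python sketch that assembles a PQF string from named Bib-1 attributes. The helper name build_pqf and its keyword names are made up for illustration (they are not part of YAZ or Metaproxy), and it assumes the search term is a single token that needs no quoting.

```python
# Bib-1 attribute types, numbered as in the list above.
ATTRIBUTE_TYPES = {
    "use": 1,
    "relation": 2,
    "position": 3,
    "structure": 4,
    "truncate": 5,
    "completeness": 6,
}


def build_pqf(term, **attributes):
    """Return a PQF string for one term (hypothetical helper, not a YAZ API)."""
    unknown = set(attributes) - set(ATTRIBUTE_TYPES)
    if unknown:
        raise ValueError(f"unknown attribute(s): {sorted(unknown)}")
    # Emit attributes in type order (1..6) so the output is predictable.
    ordered = sorted(attributes.items(), key=lambda item: ATTRIBUTE_TYPES[item[0]])
    parts = [f"@attr {ATTRIBUTE_TYPES[name]}={value}" for name, value in ordered]
    parts.append(term)
    return " ".join(parts)


# The fully specified ISBN query from the text.
full_query = build_pqf(
    "1234567890",
    use=7, relation=3, position=2, structure=1, truncate=100, completeness=1,
)
# The shorter form that the existing targets accepted.
short_query = build_pqf("1234567890", use=7, relation=3)
print(full_query)
print(short_query)
```

Leaving an attribute out of the call simply leaves it out of the query, which is the “let the server apply its default” case described above.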

Since this is horribly arcane search syntax, CQL was developed to replace it.  To do the above query in CQL, all you need is:

bath.isbn = 1234567890

Where bath defines the context set (the vocabulary of fields), and isbn is the field to search (yes, I realize that the Bath context set is deprecated, but the Bibliographic context set requires index modifiers, which nobody supports, as far as I can tell).
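
For illustration, this is roughly how that CQL query travels in an SRU searchRetrieve request. The host, port and database path below are placeholders (I’m assuming the database name goes in the URL path); the parameter names are the standard SRU 1.1 ones.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint: host, port and database name are assumptions.
BASE_URL = "http://localhost:9000/lc"


def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "maximumRecords": str(maximum_records),
    }
    # urlencode percent-encodes the values, so "=" and spaces are safe to send.
    return f"{base_url}?{urlencode(params)}"


url = sru_search_url(BASE_URL, "bath.isbn = 1234567890")
print(url)
# Decoding the URL again gives back the original CQL string.
print(parse_qs(urlsplit(url).query)["query"][0])
```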

However, to make this work in Metaproxy, you need a mapping to translate the incoming CQL to PQF to send to the Z39.50 server.  And this is where our demo instance was breaking down. When I changed the mapping to work with the newly added catalog, some (but not all!) of the existing catalogs would stop returning results for ISBN/ISSN queries. I needed a different configuration for them, which meant that I actually had to figure out how Metaproxy works.
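
As a toy model of what such a mapping does (this is not YAZ’s actual converter, and it only understands the simple index = term form), each mapping below is a lookup table from a CQL index to the PQF attributes to send. The contents of OTHER_MAPPING are hypothetical: they only stand in for “a target that wants a different attribute combination”.

```python
# Mapping that suits the existing targets: use attribute plus relation.
DEFAULT_MAPPING = {
    "bath.isbn": "@attr 1=7 @attr 2=3",
    "bath.issn": "@attr 1=8 @attr 2=3",
}

# Hypothetical mapping for a pickier target: use attribute only.
OTHER_MAPPING = {
    "bath.isbn": "@attr 1=7",
    "bath.issn": "@attr 1=8",
}


def cql_to_pqf(cql, mapping):
    """Translate a one-clause 'index = term' CQL query using a lookup table."""
    index, separator, term = cql.partition("=")
    index, term = index.strip(), term.strip()
    if not separator or not term:
        raise ValueError(f"expected 'index = term', got {cql!r}")
    if index not in mapping:
        raise ValueError(f"no mapping for index {index!r}")
    return f"{mapping[index]} {term}"


default_pqf = cql_to_pqf("bath.isbn = 1234567890", DEFAULT_MAPPING)
other_pqf = cql_to_pqf("bath.isbn = 1234567890", OTHER_MAPPING)
print(default_pqf)
print(other_pqf)
```

The same CQL comes out as two different PQF queries, which is why one mapping file could not serve every catalog.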

Metaproxy’s documentation explains that it is basically made up of three components:

  • Packages

    A package is request or response, encoded in some protocol, issued by a client, making its way through Metaproxy, send to or received from a server, or sent back to the client.

    The core of a package is the protocol unit – for example, a Z39.50 Init Request or Search Response, or an SRU searchRetrieve URL or Explain Response. In addition to this core, a package also carries some extra information added and used by Metaproxy itself.

    Um, ok.  To be honest, I still don’t really understand what packages are.  They don’t seem to exist in the example configurations, or at least not in the ones I care about.

  • Routes

    Packages make their way through routes, which can be thought of as programs that operate on the package data-type. Each incoming package initially makes its way through a default route, but may be switched to a different route based on various considerations.

    Well, this seems to make sense, at least. A request can be routed through certain paths. Check.

  • Filters

    Filters provide the individual instructions within a route, and effect the necessary transformations on packages. A particular configuration of Metaproxy is essentially a set of filters, described by configuration details and arranged in order in one or more routes. There are many kinds of filter – about a dozen at the time of writing with more appearing all the time – each performing a specific function and configured by different information.

    The word “filter” is sometimes used rather loosely, in two different ways: it may be used to mean a particular type of filter, as when we speak of “the auth_simple filter” or “the multi filter”; or it may be used to be a specific instance of a filter within a Metaproxy configuration.

    Ugh, well that’s clear as mud, isn’t it? But, ok, so these are the things that do what you want to happen in the route. The documentation for these is pretty spartan, as well (considering they’re supposed to do the bulk of the work), but maybe through some trial and error we can figure it out.

All of this is declared in an XML configuration file; an example comes supplied with Metaproxy’s sources.  In this file you have a metaproxy root element, and under that a start element where you declare the default route that every request goes through to begin with.

Generally, this is also where you’d declare your global-level filters.  Filters can be defined in two ways: with an id attribute, so that they can be called one or more times from within routes, or as a filter element placed directly in the route (probably without an id attribute), which can be thought of as being scoped locally to that route.

For the global filters, you put a filters element under your metaproxy root element, under which there are one or more filter elements. These filter elements should have id and type attributes (the available types are listed in Metaproxy’s documentation).  You would also define the behavior of your filter here.  Here’s an example of how we define our two different CQL to PQF mappings:

<filters>
  <filter id="default-cql-rpn" type="cql_rpn">
    <conversion file="../etc/cql2pqf.txt" />
  </filter>
  <filter id="other-cql-rpn" type="cql_rpn">
    <conversion file="../etc/other-cql2pqf.txt" />
  </filter>
</filters>

To call these filters from your route, you use an empty element with the refid attribute:

<filter refid="default-cql-rpn" />

Also under the metaproxy element is a routes tag. The routes element can have one or more route blocks. One of these needs to have an id that matches the route attribute in your start tag (i.e. your default route). All of the examples (I think) use the id ‘start’.

In your default route, you can declare your common filters, such as the HTTP listener for the SRU service. You can also log the incoming requests. Here’s an example:

<routes>
  <route id="start">
    <filter type="log">
      <message>Something</message>
    </filter>
    <filter refid="id-declared-for-filter-in-your-filters-block" />
    ...
  </route>
</routes>

The first filter element in that route is locally scoped; it can’t be called again. The second one calls a filter that was defined in your filters section. That same filter could, in theory, also be called by a different route, keeping the configuration file (relatively) DRY.
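
Since a mistyped refid is easy to miss, here is a small sketch that parses a configuration and reports any refid with no matching filter id, as well as a start route that doesn’t exist. The function check_config and the sample configuration (including the deliberately broken missing-filter reference) are made up for illustration; this is not a tool that ships with Metaproxy.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; "missing-filter" is deliberately left undefined.
SAMPLE_CONFIG = """\
<metaproxy>
  <start route="start"/>
  <filters>
    <filter id="default-cql-rpn" type="cql_rpn">
      <conversion file="../etc/cql2pqf.txt"/>
    </filter>
  </filters>
  <routes>
    <route id="start">
      <filter type="log">
        <message>Something</message>
      </filter>
      <filter refid="default-cql-rpn"/>
      <filter refid="missing-filter"/>
    </route>
  </routes>
</metaproxy>
"""


def local_name(tag):
    """Drop any '{namespace}' prefix ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def check_config(xml_text):
    """Return a list of human-readable problems found in the config."""
    root = ET.fromstring(xml_text)
    filter_ids, route_ids, refids = set(), set(), []
    start_route = None
    for element in root.iter():
        name = local_name(element.tag)
        if name == "start":
            start_route = element.get("route")
        elif name == "route" and element.get("id"):
            route_ids.add(element.get("id"))
        elif name == "filter":
            if element.get("id"):
                filter_ids.add(element.get("id"))
            if element.get("refid"):
                refids.append(element.get("refid"))
    problems = []
    if start_route not in route_ids:
        problems.append(f"start route {start_route!r} is not defined")
    for refid in refids:
        if refid not in filter_ids:
            problems.append(f"refid {refid!r} has no matching filter id")
    return problems


problems = check_config(SAMPLE_CONFIG)
print(problems)
```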

The filters are applied sequentially, in the order they appear in the route.

It is in one of the locally scoped filters within a route where you’d define the settings of the databases you are proxying. The filter type for these is ‘virt_db’.

<route id="start">
  ...
  <filter type="virt_db">
    <virtual>
      <database>lc</database>
      <target>z3950.loc.gov:7090/voyager</target>
    </virtual>
  </filter>
  ...
</route>

It is at this point that you can branch into different routes so different databases can have different configurations. It would look something like:

<routes>
  <route id="start">
    ...
    <filter type="virt_db">
      <virtual>
        <database>lc</database>
        <target>z3950.loc.gov:7090/voyager</target>
      </virtual>
      <virtual route="otherdbroute">
        <database>otherdb</database>
        <target>example.org:210/otherdb</target>
      </virtual>
    </filter>
    ...
  </route>
  <route id="otherdbroute">
    <filter type="log">
      <message>Other DB route</message>
    </filter>
    <filter refid="other-cql-rpn" />
    ...
  </route>
</routes>

It’s important to note here that the branched route will return to the route where it was initiated after it completes, so if there are more filters declared after this route is called, they will be applied to this route as well.

It’s also important to note (at least this is how things appear to work for me) that if a filter of the same type is called more than once for a particular route, only the first one seems to get applied. In our example above, you could apply the cql_rpn filters like this:

<routes>
  <route id="start">
    ...
    <filter type="virt_db">
      <virtual>
        <database>lc</database>
        <target>z3950.loc.gov:7090/voyager</target>
      </virtual>
      <virtual route="otherdbroute">
        <database>otherdb</database>
        <target>example.org:210/otherdb</target>
      </virtual>
    </filter>
    <filter refid="default-cql-rpn" />
    ...
  </route>
  <route id="otherdbroute">
    <filter type="log">
      <message>Other DB route</message>
    </filter>
    <filter refid="other-cql-rpn" />
    ...
  </route>
</routes>

In this case, requests for “lc” would have “default-cql-rpn” applied to them. Requests for “otherdb” would have “other-cql-rpn” applied to them, but they seem to ignore the “default-cql-rpn” filter that comes later (I, for one, found this extremely counter-intuitive). So rather than having your edge cases overwrite your default configuration, you set your edge cases first and then set a default for anything that hasn’t already had a particular filter applied to it.
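
One possible explanation, at least for cql_rpn, is that the filter’s job is to turn a CQL query into RPN, so once one cql_rpn filter has converted the query, a later one finds no CQL left to convert. That is only my reading and is not confirmed against Metaproxy’s source; the toy model below (all class and function names are made up) just illustrates that reading.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """Toy stand-in for a search request travelling through a route."""
    query: str
    query_type: str = "cql"
    applied: list = field(default_factory=list)


def make_cql_rpn(filter_id, mapping):
    """Build a toy cql_rpn filter that converts CQL to PQF-style text."""
    def apply(package):
        if package.query_type != "cql":
            return package  # Already converted: nothing left to do.
        index, _, term = package.query.partition("=")
        package.query = f"{mapping[index.strip()]} {term.strip()}"
        package.query_type = "rpn"
        package.applied.append(filter_id)
        return package
    return apply


default_cql_rpn = make_cql_rpn("default-cql-rpn", {"bath.isbn": "@attr 1=7 @attr 2=3"})
# Hypothetical mapping for the target that needs different attributes.
other_cql_rpn = make_cql_rpn("other-cql-rpn", {"bath.isbn": "@attr 1=7"})


def run(filters, package):
    """Apply filters in order, the way a route does."""
    for apply in filters:
        package = apply(package)
    return package


# "otherdb": the branched route's filter runs first, then the default one.
otherdb = run([other_cql_rpn, default_cql_rpn], Package("bath.isbn = 1234567890"))
# "lc": only the default filter is in the path.
lc = run([default_cql_rpn], Package("bath.isbn = 1234567890"))
print(otherdb.applied, otherdb.query)
print(lc.applied, lc.query)
```

In this model the default filter is still called for “otherdb”, but it has nothing to do, which matches the “first one wins” behavior without any special override logic.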

Also somewhat counter-intuitively, if you’re searching multiple databases with a single query and the databases require different configurations, you configure it like this:

<routes>
  <route id="start">
    ...
    <filter type="virt_db">
      <virtual>
        <database>lc</database>
        <target>z3950.loc.gov:7090/voyager</target>
      </virtual>
      <virtual route="otherdbroute">
        <database>otherdb</database>
        <target>example.org:210/otherdb</target>
      </virtual>
      <virtual>
        <database>all</database>
        <target>z3950.loc.gov:7090/voyager</target>
        <target>example.org:210/otherdb</target>
      </virtual>
    </filter>
    <filter type="multi">
      <target route="otherdbroute">example.org:210/otherdb</target>
    </filter>
    <filter refid="default-cql-rpn" />

    ...
  </route>
  <route id="otherdbroute">
    <filter type="log">
      <message>Other DB route</message>
    </filter>
    <filter refid="other-cql-rpn" />
    ...
  </route>
</routes>

Since the route can’t be applied to the whole aggregate of databases (the other databases would fail), you declare the route for a particular target in a ‘multi’ filter. Again, I think this route would have to appear before the default-cql-rpn filter is called to work.
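
To double-check which route is declared for each target of a federated database, something like the following sketch can read the configuration back. It assumes the element layout shown above (virt_db with virtual/database/target, multi with a route attribute on target); the function name routes_for_database and the trimmed sample are made up, and it only reads the declarations rather than reproducing Metaproxy’s actual routing decisions.

```python
import xml.etree.ElementTree as ET

SAMPLE_ROUTE = """\
<route id="start">
  <filter type="virt_db">
    <virtual>
      <database>lc</database>
      <target>z3950.loc.gov:7090/voyager</target>
    </virtual>
    <virtual route="otherdbroute">
      <database>otherdb</database>
      <target>example.org:210/otherdb</target>
    </virtual>
    <virtual>
      <database>all</database>
      <target>z3950.loc.gov:7090/voyager</target>
      <target>example.org:210/otherdb</target>
    </virtual>
  </filter>
  <filter type="multi">
    <target route="otherdbroute">example.org:210/otherdb</target>
  </filter>
</route>
"""


def routes_for_database(route_xml, database):
    """Map each target of a virtual database to its declared route.

    A value of None means no special route is declared for that target.
    """
    route = ET.fromstring(route_xml)
    # Per-target routes declared in 'multi' filters.
    multi_routes = {
        target.text.strip(): target.get("route")
        for target in route.findall("filter[@type='multi']/target")
        if target.get("route")
    }
    for virtual in route.findall("filter[@type='virt_db']/virtual"):
        if virtual.findtext("database", "").strip() != database:
            continue
        # Fall back to the route on the virtual element itself, if any.
        whole_db_route = virtual.get("route")
        return {
            target.text.strip(): multi_routes.get(target.text.strip(), whole_db_route)
            for target in virtual.findall("target")
        }
    raise KeyError(database)


all_routes = routes_for_database(SAMPLE_ROUTE, "all")
print(all_routes)
```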

I hope this helps somebody. If nothing else, it will probably help me when I eventually need to remember how all of this works.
