Monday, 8 April 2013

Musings on data and identifiers, prompted by a visit to the Ashmolean Museum, Oxford

Flint handaxe

So, it being the Easter school holidays, we all went for a family outing to the Ashmolean Museum in Oxford. And within about two minutes (because I am a geek) I started spotting identifiers and thinking about how the physical objects in the museum are analogous to datasets.

Take for example the flint handaxe pictured above. It's obviously a thing in its own right, well defined and with clear boundaries. But in a cabinet full of other artifacts (even some other handaxes) how can you uniquely identify it? Well, you can stick a label next to it (the number 1) and then connect that local identifier to some metadata on display in the case:

Metadata for the flint handaxe (1.)
That works, but it means that the positions of the artifacts are fixed in the case, so reorganising things risks disconnecting the object from its metadata. The number 1 is only a local identifier too - there were plenty of other cases in the gallery which all had something in there with the number 1 attached to it - so as a unique identifier it's not much good. And in this case, there were actually two handaxes identified with the number 1.

If you look closely at the surface of the handaxe, you'll see a number written on it in black ink: 1955.439a. This number (which I'm guessing is an accession number, with the year the artifact was first put into the museum as the first part) is also repeated in small print at the end of the metadata blurb.

So, the moral from this example is that local identifiers are useful, but objects really do need unique identifiers which are present in both the dataset/artifact itself, and its corresponding metadata.
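In data terms, that might look something like the minimal Python sketch below. It's only an illustration: the case names, the second accession number and the `resolve` helper are all made up, and only 1955.439a comes from the handaxe itself. The point is that the local label is resolved through the unique identifier, rather than pointing straight at the metadata.

```python
# Hypothetical catalogue: unique accession number -> metadata record.
# Only "1955.439a" is taken from the handaxe; the other entry is invented.
catalogue = {
    "1955.439a": {"description": "Flint handaxe"},
    "EXAMPLE.0001": {"description": "Some other artifact (invented)"},
}

# Each display case keeps its own local labels, so label 1 is reused
# across cases and is only meaningful inside one case.
case_labels = {
    "case_A": {1: "1955.439a"},
    "case_B": {1: "EXAMPLE.0001"},
}


def resolve(case: str, label: int) -> dict:
    """Find the metadata for a local label by going via the unique id."""
    accession_number = case_labels[case][label]
    return catalogue[accession_number]
```

Rearranging a case then only means updating `case_labels`; the link between the accession number and its metadata stays put.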

Here we have a large, well defined dataset - sorry - artifact (and a pretty impressive one too!). There isn't another statue of Sobek this size (or at all as far as I could see) in the Ashmolean museum. So it could be identified as "the restored statue of Sobek in the Ashmolean museum", and you'd probably get away with that as most people would know that's the one you meant.

Sobek's identifier
But it too still has an identifier, and it's right there on his shoulder, not hidden underneath where people can't see it.

Sobek's metadata
And it's also connected with his metadata.

A collection from an A-Group burial
In this case we have a dataset that's a collection of other self-contained datasets. Each dataset/pot has its own individual value, but has greater value as part of the larger collection. These particular datasets were all found in the same location at the same time, so have a very definite connection - they were all grave goods excavated from one grave in Farras, Sudan.
Close up of some of the grave goods
Just because a dataset is part of a larger data collection, it doesn't mean the dataset has to be exactly the same as its fellows - in fact a wide variety of stuff makes the collection more valuable. Note though that the storage for the whole collection (i.e. the cabinet) has to take into account the different sizes and different display needs for each of the individual datasets/artifacts.

And of course, each of the artifacts has its own id (sort of - the group of 7 semi-precious stones only has one id between them) as well as a local identifier to link it to its metadata.

Collection metadata and individual item metadata
The collection itself has its own metadata too, which puts the individual items' metadata into context.

Non textual metadata

And it also has metadata that is better expressed in the form of graphics rather than text - the diagram of the goods where they were found in the grave and an actual photo. These figures too have their own metadata in their captions - so we've got metadata about metadata happening here, and all of it is important to keep and display.

Faience Shabtis
Here we have a data collection that is joined by theme rather than by geographic location. These statues are all shabtis, but came from different places and were ingested into the museum at different times.

Faience shabti metadata (15.)
They all have unique ids though, and in the case of this data collection, only the collection metadata is displayed. I'd imagine though that if you went looking in the museum records, you'd find information on each of the individual shabti, filed under their id.

With digital data we've got it easier in one way, in that the same dataset/shabti can be in multiple collections at the same time and displayed in lots of different ways in different places. The downside is that it can be hard to know exactly what dataset is being displayed where and is part of what collection. That's why the permanent, unique ids are so vital to keep track of things.
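As a rough sketch of what I mean (in Python, with entirely made-up ids and collection names), collections can just be sets of permanent ids, and the question "which collections is this dataset part of?" becomes a simple lookup:

```python
# Hypothetical persistent ids and collection names, for illustration only.
collections = {
    "faience_shabtis": {"pid:shabti-1", "pid:shabti-2"},
    "recent_acquisitions": {"pid:shabti-2", "pid:ring-1"},
}


def collections_containing(pid: str) -> list[str]:
    """Return the names of every collection that includes this id."""
    return sorted(name for name, members in collections.items() if pid in members)
```

The same shabti can sit in as many collections as you like, because the collections only hold its id, not a copy of the thing itself.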

Granularity issue! Mosaic tiles
And here we have a classic granularity issue - a pile of mosaic tiles. In theory, you could write a unique id on each one of these tesserae (might be a bit fiddly), but then you'd have to put each of those ids into the metadata. Given that the value of these tiles isn't in the individual objects but in the whole collection, I can understand why the museum curators decided to label them as one thing.

Metadata for the mosaic tiles (49.)
Because the dataset is in lots of pieces (files), none of which is uniquely identified, there is always the risk that a piece may become detached from its collection and lost/misidentified. Moving this particular dataset around the place could be quite problematic - but on the other hand, there are so many pieces that losing one or two in transit might not be too much of a problem. On issues of granularity, data repository managers, like museum curators, need to decide for themselves how they're going to deal with their datasets/artifacts.

Silver ring, temporarily removed
And finally, what do you do if you've published a dataset, but have to take it down for whatever reason? Simple - leave the metadata about the dataset intact, and stick a note on it saying what was removed, who removed it and when. There was another one of these notices that I spotted (but didn't photograph) which gave the reason for the removal (restoration) and also a photo of the artifact, all on the little "Temporarily removed" card.
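For a dataset, the "Temporarily removed" card could be sketched along these lines. This is just one possible shape in Python, not how any particular repository does it: the `Record` fields, the `withdraw` function and the example id are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    """Hypothetical metadata record that survives removal of the item."""
    identifier: str
    description: str
    available: bool = True
    removed_by: Optional[str] = None
    removed_on: Optional[date] = None
    removal_reason: Optional[str] = None


def withdraw(record: Record, who: str, when: date,
             reason: Optional[str] = None) -> Record:
    """Mark the item as removed but keep its identifier and description."""
    record.available = False
    record.removed_by = who
    record.removed_on = when
    record.removal_reason = reason
    return record
```

The identifier and description are never touched, so anyone following the id still finds out what the thing was, who took it away, when, and (if recorded) why.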

I think we worry about data a lot, because it's so hard to draw distinct lines around what is and what isn't a dataset. But honestly, there's such a wide variety of stuff in museums that all have identifiers and methods of curation that I really do think we need to worry less about how to turn a dataset into a standardised book, and think of them more as artifacts/things that come in all sorts of shapes and sizes.

Oh, and if you're in Oxford, do go check out the Ashmolean museum. It's great, and has lots more stuff than just the pieces I took photos of!


  1. Sarah -
    excellent analogy and nicely-drawn connections to datasets. You did not really get into preservation; artifacts tend to be long-lived whereas digital datasets can easily get destroyed. However, your points on PIDs and granularity are well taken: one can hardly put an id on each atom (or ion) in a molecule in a substance. However, it is good to have a PID at the smallest atomic (indivisible) level - this then allows different collections (groupings) by different facets (or attributes) of objects (datasets or artifacts) with some common properties (attributes, facets). I won't even get started on metadata...

  2. Yes, preservation is something I've not yet really got my head around, especially when it comes to data, though I think that there's as much effort needed to curate and preserve digital things as there is physical things.

    And then we get into questions of how valuable is a dataset that's been repaired/recreated? Poor old Sobek up there had lost most of his snout and had it replaced. In that case, I think he's far more dramatic having been repaired, but I'm sure there are some archaeologists/historians out there who would argue that he shouldn't have been.

    Interestingly enough, the Ashmolean have an entire section on restoring and preserving artefacts where they discuss this problem - there's a painting (Allegory of Faith) where a communion wafer near the chalice was painted over (which they discovered while x-raying it). The question then became: do they restore the painting to its original state with the wafer, or restore it to its slightly younger state without the wafer?

    I think there's a lot of stuff we can learn from museums and libraries when it comes to dealing with digital data (and vice versa!)

    1. Thanks Sarah. I agree that we can pick up some tips from the museum community about identifiers. I've long been of the opinion that museums may offer a better analogy for data curation than archives or libraries do.

      The UK Museum Collections Management Standard SPECTRUM 4.0 provides a potentially valuable model that we might consider as we develop research data management infrastructure in UK HEIs. SPECTRUM maps to BSI PAS 197: Code of Practice for Cultural Collections Management. The standard is available for download at

      For accreditation, 'museums to have in place eight procedures known as the SPECTRUM Primary Procedures. If set up correctly, these eight procedures, backed up by a written Procedural Manual, constitute a basic collections management system adequate to provide accountability and to ensure that a museum knows at any time exactly which items it is legally responsible for and where each item is located.’

      SPECTRUM 4.0 describes workflows for each of the eight procedures and also provides some good ideas for metadata that may need to be captured at each stage as well.

    2. Interesting stuff - thanks for pointing it out!

  3. One of the things that interested me is that you describe almost everything as "metadata", although in practice it's not. It's just data. This is particularly obvious here because most of the "metadata" is not data about data, it's data about a lump of stone.

    The granularity issue is a good one also. In biology, we see that records get merged and demerged over time as our ideas change. At first sight, I wouldn't have thought this would be so relevant for physical objects, but even here, I guess it's true.

    1. Ah, I was drawing the analogy that anything, even a lump of rock, can be considered a dataset (for a famous example, take the Rosetta stone - a lump of rock embodying information), so therefore data about said lump of rock can be considered metadata.

      I could say that a dataset is like an artefact - your definition of what is or isn't one depends on your frame of reference. I'd be tempted to say that any object that has been created by someone for a purpose is analogous to a dataset, but then I'm sure a geologist would come along and argue that a rock that's been untouched by human hands still is of interest and can be considered a dataset too.

      All analogies break down at some point!