Citing Bytes - Adventures in Data Citation: April 2013

Flint handaxe

So, it being the Easter school holidays, we all went for a family outing to the Ashmolean Museum in Oxford. And within about two minutes (because I am a geek) I started spotting identifiers and thinking about how the physical objects in the museum are analogous to datasets.

Take for example the flint handaxe pictured above. It's obviously a thing in its own right, well defined and with clear boundaries. But in a cabinet full of other artifacts (even some other hand axes) how can you uniquely identify it? Well, you can stick a label next to it (the number 1) and then connect that local identifier to some metadata on display in the case:

Metadata for the flint handaxe (1.)

That works, but it means that the positions of the artifacts are fixed in the case, so reorganising things risks disconnecting the object from its metadata. The number 1 is only a local identifier too - there were plenty of other cases in the gallery which all had something in there with the number 1 attached to it - so as a unique identifier it's not much good. And in this case, there were actually two handaxes identified with the number 1.

If you look closely at the surface of the handaxe, you'll see a number written on it in black ink: 1955.439a. This number (which I'm guessing is an accession number, with the year the artifact was first put into the museum as the first part) is also repeated in small print at the end of the metadata blurb.

So, the moral from this example is that local identifiers are useful, but objects really do need unique identifiers which are present in both the dataset/artifact itself, and its corresponding metadata.
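To make the local-versus-unique distinction concrete, here's a minimal sketch in Python. Everything in it is an assumption for illustration: the case names, the `EXAMPLE-0002` id and the lookup functions are made up, and only the accession number 1955.439a comes from the handaxe itself.

```python
# Illustrative sketch only: the case names and "EXAMPLE-0002" are hypothetical.
# The catalogue is keyed by the unique accession id written on each object.
catalogue = {
    "1955.439a": {"description": "flint handaxe"},
    "EXAMPLE-0002": {"description": "made-up object in another case"},
}

# Each display case has its own local labels, which point at accession ids.
cases = {
    "case A": {1: "1955.439a"},
    "case B": {1: "EXAMPLE-0002"},
}


def resolve_local(case_name, label):
    """A local label only means something once you know which case it is in."""
    return cases[case_name][label]


def lookup(accession_id):
    """The accession id finds the metadata no matter where the object sits."""
    return catalogue[accession_id]


# The label 1 points at different objects in different cases...
print(resolve_local("case A", 1), resolve_local("case B", 1))
# ...but the accession id always leads to the same metadata.
print(lookup("1955.439a")["description"])
```

Rearranging a case only means updating the small `cases` mapping. The link between the object and its metadata survives because it goes through the id written on the object.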

Sobek

Here we have a large, well defined dataset - sorry - artifact (and a pretty impressive one too!) There isn't another statue of Sobek this size (or at all as far as I could see) in the Ashmolean Museum. So it could be identified as "the restored statue of Sobek in the Ashmolean Museum", and you'd probably get away with that as most people would know that's the one you meant.

Sobek's identifier

But it too still has an identifier, and it's right there on his shoulder, not hidden underneath where people can't see it.

Sobek's metadata

And it's also connected with his metadata.

A collection from an A-Group burial

In this case we have a dataset that's a collection of other self-contained datasets. Each dataset/pot has its own individual value, but has greater value as part of the larger collection. These particular datasets were all found in the same location at the same time, so have a very definite connection - they were all grave goods excavated from one grave in Farras, Sudan.

Close up of some of the grave goods

Just because a dataset is part of a larger data collection, it doesn't mean the dataset has to be exactly the same as its fellows - in fact a wide variety of stuff makes the collection more valuable. Note though that the storage for the whole collection (i.e. the cabinet) has to take into account the different sizes and different display needs for each of the individual datasets/artifacts.

And of course, each of the artifacts has its own id (sort of - the group of 7 semi-precious stones only has one id between them) as well as a local identifier to link it to its metadata.

Collection metadata and individual item metadata

The collection itself has its own metadata too, which puts the individual items' metadata into context.

Non textual metadata

And it also has metadata that is better expressed in the form of graphics rather than text - the diagram of the goods where they were found in the grave and an actual photo. These figures too have their own metadata in their captions - so we've got metadata about metadata happening here, and all of it is important to keep and display.

Faience Shabtis

Here we have a data collection that is joined by theme rather than by geographic location. These statues are all shabtis, but came from different places and were ingested into the museum at different times.

Faience shabti metadata (15.)

They all have unique ids though, and in the case of this data collection, only the collection metadata is displayed. I'd imagine though that if you went looking in the museum records, you'd find information on each of the individual shabti, filed under their id.

With digital data we've got it easier in one way, in that the same dataset/shabti can be in multiple collections at the same time and displayed in lots of different ways in different places. The downside is that it can be hard to know exactly what dataset is being displayed where and is part of what collection. That's why the permanent, unique ids are so vital to keep track of things.
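As a rough sketch of that idea, here collections hold only ids, so the same item can sit in several collections at once and you can still ask where it lives. The collection names and ids are invented for the example.

```python
# Illustrative sketch only: all collection names and item ids are hypothetical.
# Collections store item ids, not copies of the items themselves.
collections = {
    "faience-shabtis": {"SHABTI-001", "SHABTI-002"},
    "recent-displays": {"SHABTI-002", "POT-007"},
}


def collections_containing(item_id, all_collections):
    """Return the names of every collection that includes the given id."""
    return sorted(
        name for name, members in all_collections.items() if item_id in members
    )


print(collections_containing("SHABTI-002", collections))
```

Without a permanent id there is nothing reliable to search on, which is exactly the "what is displayed where" problem described above.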

Granularity issue! Mosaic tiles

And here we have a classic granularity issue - a pile of mosaic tiles. In theory, you could write a unique id on each one of these tesserae (might be a bit fiddly), but then you'd have to put each of those ids into the metadata. Given that the value of these tiles lies not in the individual objects but in the whole collection, I can understand why the museum curators decided to label them as one thing.

Metadata for the mosaic tiles (49.)

Because the dataset is in lots of pieces (files), none of which is uniquely identified, there is always the risk that a piece may become detached from its collection and lost/misidentified. Moving this particular dataset around the place could be quite problematic - but on the other hand, there are so many pieces that losing one or two in transit might not be too much of a problem. On issues of granularity, data repository managers, like museum curators, need to decide for themselves how they're going to deal with their datasets/artifacts.

Silver ring, temporarily removed

And finally, what do you do if you've published a dataset, but have to take it down for whatever reason? Simple - leave the metadata about the dataset intact, and stick a note on it saying what was removed, who removed it and when. There was another one of these notices that I spotted (but didn't photograph) which gave the reason for the removal (restoration) and also a photo of the artifact, all on the little "Temporarily removed" card.
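In repository terms, the "Temporarily removed" card might look something like the sketch below. The record structure, field names and example values are all my own assumptions and not any particular repository's schema.

```python
# Illustrative sketch only: the Record/Removal fields and the example values
# are hypothetical, not taken from a real repository or from the museum.
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Removal:
    removed_by: str
    removed_on: date
    reason: str


@dataclass
class Record:
    identifier: str
    title: str
    payload: Optional[bytes]           # the data itself; None once withdrawn
    removal: Optional[Removal] = None  # the "temporarily removed" card


def withdraw(record, removed_by, reason, when):
    """Take the data down but keep the identifier and metadata in place."""
    record.payload = None
    record.removal = Removal(removed_by=removed_by, removed_on=when, reason=reason)
    return record


ring = Record("EXAMPLE-RING-01", "Silver ring", payload=b"...")
withdraw(ring, removed_by="curator", reason="restoration", when=date(2013, 4, 8))
print(ring.title, ring.removal.reason)
```

The identifier still resolves to something meaningful after the withdrawal, so anyone following an old citation finds out what happened and doesn't hit a dead end.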

I think we worry about data a lot, because it's so hard to draw distinct lines around what is and what isn't a dataset. But honestly, there's such a wide variety of stuff in museums, all of it with identifiers and methods of curation, that I really do think we need to worry less about how to turn a dataset into a standardised book, and think of datasets more as artifacts/things that come in all sorts of shapes and sizes.

Oh, and if you're in Oxford, do go check out the Ashmolean Museum. It's great, and has lots more stuff than just the pieces I took photos of!


Monday, 8 April 2013

Musings on data and identifiers, prompted by a visit to the Ashmolean Museum, Oxford