Tuesday, 22 November 2011

What is a dataset, when you get down to it?

Picture by me and PowerPoint.
It's possible to spend a lot of time arguing about what a dataset actually is (and believe me, plenty of people have, myself included!).

I don't have a definitive answer, but I tend to default to the idea that a dataset is whatever is scientifically meaningful on its own. For example, a single peak flood measurement at a certain place for a given year could count as a dataset, but a single rain gauge measurement from a site where the gauge has been in place for years wouldn't. And of course, it all depends on the scientific domain as well.

Sometimes projects can act as a convenient guide - if a project was run for x years and provided a wodge of data, then that data can be packaged up as a project dataset. Sometimes a dataset can be all the data resulting from a given instrument for a given period of operation. The important thing is that common sense needs to be applied to how "thinly sliced" a dataset should be. I really don't want to see the concept of minimum publishable unit applied to data, thankyouverymuch!

An analogy that I tend to use a lot is a book. Like a book, a citeable (DOI-able) dataset should be easily identifiable, stable, complete and (hopefully) have enough information in it so that you can understand what it's all about, without having to refer to (too many) other sources of information. Yes, the dataset can be structured in such a way that you can refer to parts of it easily (chapter and verse analogy), but it doesn't mean that every single segment of the dataset should have its own DOI (or that each verse in a book should be published independently in its own cover).

My completely off-the-cuff and not entirely serious example of how you'd go about referencing segments of a particular dataset is:
  • Honeydew, B., Beaker, Gonzo, T.G., Years 2001, 2005 and 2009 from “Statistics of egg-laying in Pitch Perfect Poultry, 2000-2010” doi:10.12345/abcdefg.

Of course, datasets are more than books, and there are lots of different ways of slicing and dicing them to produce scientifically meaningful datasets. At the moment, because we're in the early stages of assigning DOIs to our hosted datasets, we're pretty much making decisions on a case-by-case basis, in the hope that some general guidelines will surface along the way. (Thankfully, they do seem to be.)

One idea that quickly got assigned to the "not now - tricky" pile is the notion that users might at some stage want to create a new derived dataset made up of smaller bits of other people's datasets, and would then want to cite this derived dataset as a whole. This "user-defined" citation would save space in the valuable real estate of a paper's references, and would provide a link to a list of the other sub-citations, in a format that is both human- and machine-readable. Provided that each of the sub-citations allowed you to easily and accurately get to the relevant sections of the other datasets, the derived dataset would count as a citeable object.
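To make the idea a bit more concrete, here's a rough sketch of what such a list of sub-citations could look like. It's purely illustrative: the class names, the field names and the "A. Compiler" person are all made up for the example, and the DOI is the pretend one from the egg-laying citation above, not a real identifier. It isn't a description of any system we actually run.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class SubCitation:
    """One slice of someone else's dataset (all fields are hypothetical)."""
    authors: str
    title: str
    doi: str
    segment: str  # which part of the parent dataset was used


@dataclass
class DerivedDataset:
    """A user-defined collection that is cited as a whole."""
    compiler: str  # more a compiler or editor than an author
    title: str
    parts: list = field(default_factory=list)

    def to_json(self) -> str:
        # Machine-readable form: nested dataclasses become plain dicts.
        return json.dumps(asdict(self), indent=2)

    def to_text(self) -> str:
        # Human-readable form: a heading plus one line per sub-citation.
        lines = [f"{self.title} (compiled by {self.compiler})"]
        for n, p in enumerate(self.parts, start=1):
            lines.append(f'  {n}. {p.authors}, {p.segment} from "{p.title}" doi:{p.doi}')
        return "\n".join(lines)


# Made-up example data, reusing the not-entirely-serious citation above.
derived = DerivedDataset(
    compiler="A. Compiler",
    title="Example derived dataset",
    parts=[
        SubCitation(
            authors="Honeydew, B., Beaker, Gonzo, T.G.",
            title="Statistics of egg-laying in Pitch Perfect Poultry, 2000-2010",
            doi="10.12345/abcdefg",
            segment="Years 2001, 2005 and 2009",
        )
    ],
)

print(derived.to_text())
print(derived.to_json())
```

The same object gives both views, so the human-readable list and the machine-readable one can't drift apart. Note that the person responsible is recorded as a compiler rather than an author, which is exactly the role question that makes this tricky.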

This is achievable now, since the technologies are ready and mature, but it rapidly gets tricky when you start thinking about the roles involved and how to assign credit - the author of this new derived dataset is not so much an author as a compiler or editor, for example.

Having hierarchies of dataset citations isn't so problematic. For example, we've already decided that for large datasets which are continually modified by appending new files (for example, the rain gauge measurements mentioned above, where files are created on a daily basis), we can assign a DOI to a given period's worth of data at a time. For the rain gauge measurements, it's convenient and sensible to assign a DOI to each year's worth of data after the year's complete, and then, when the rain gauge is moved or otherwise taken out of service, to give the entire time series one DOI.
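As a back-of-the-envelope illustration of that rule, the little sketch below works out which chunks of a daily-file time series would be ready for a DOI. The function name, its arguments and the labels it returns are all hypothetical, and it assumes "year" means calendar year. It also assumes a final, incomplete year doesn't get a DOI of its own, which is just my guess for the example.

```python
from datetime import date


def citable_units(file_dates, today, retired=False):
    """Return labels for the units that could be given a DOI.

    file_dates: dates of the daily files appended so far.
    today: the current date; a year only counts once it is over.
    retired: True once the gauge has been moved or taken out of service.
    """
    years = sorted({d.year for d in file_dates})

    # One unit per completed calendar year that has data.
    units = [f"year {y}" for y in years if y < today.year]

    # The whole time series becomes a unit only after retirement.
    if retired and years:
        units.append(f"series {years[0]}-{years[-1]}")
    return units


# Made-up file dates for a gauge that is still running in late 2011.
dates = [date(2009, 12, 31), date(2010, 1, 1), date(2010, 6, 1), date(2011, 3, 1)]

print(citable_units(dates, today=date(2011, 11, 22)))
# ['year 2009', 'year 2010']

print(citable_units(dates, today=date(2011, 11, 22), retired=True))
# ['year 2009', 'year 2010', 'series 2009-2011']
```

The yearly units and the whole-series unit sit happily side by side, which is the hierarchy in a nutshell: the series DOI doesn't replace the yearly ones, it just wraps them.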

Citation is actually a really good prod for us, to encourage us to really crystallize our thinking about what a dataset is, and how to deal with it. It's all too easy to have fuzzy datasets being random piles of files, or entries in a database table, without having defined any rules on where their edges are. I don't have the answers, but I do feel like we're getting close to at least some of them!

A lot of the thoughts in this post came about after conversations with the many people involved in various citation workshops, projects, etc., including, but not limited to, my co-workers in the NERC SIS data citation and publication project and the CODATA Task Group on Data Citation. Thanks are due to them all! (I'm sure I'll be repeating that lots in future posts too!)
