Monday 12 August 2013

How to review a dataset: a couple of case studies

"Same graph as last year, but now I have an additional dot"
http://vadlo.com/cartoons.php?id=149

As part of the PREPARDE project, I've been doing some thinking recently about how exactly one would go about peer-reviewing data. So far, the project (and friends and other interested parties) have come up with some general principles, which are still being discussed and will be published soon. 

Being of a more pragmatic and experimental bent myself, I thought I'd try actually reviewing some publicly accessible datasets and see what I could learn from the process. Standard disclaimers apply: with a sample size of 2, and an admittedly biased way of choosing which datasets to review, this is not going to be statistically valid!

I'm also bypassing a bit of the review process that would probably be done by the journal's editorial assistant, who would ask important questions like: 
  • Does the dataset have a permanent identifier? 
  • Does it have a landing page (or README file or similar) with additional information/metadata, which allows you to determine that this is indeed the dataset you're looking for?
  • Is it in an accredited/trusted repository?*
  • Is the dataset accessible? If not, are the terms and conditions for access clearly defined?
If the answer to any of those questions is no, then the editorial assistant should just bounce the dataset back to the author without even sending it to scientific review, as the poor scientific reviewer would have no chance of either accessing or understanding the data.
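Those pre-checks are mechanical enough that you could imagine scripting them. Here's a minimal sketch in Python; the field names and the example record are my own invention, purely for illustration, not any repository's actual metadata schema:

```python
# Pre-review checklist an editorial assistant might run before sending a
# dataset out to scientific review. Field names are hypothetical.
REQUIRED_FIELDS = ["identifier", "landing_page", "repository", "access_terms"]

def editorial_precheck(record):
    """Return the list of missing requirements; an empty list means the
    dataset can go forward to scientific review."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

# A made-up record describing the first dataset reviewed below:
record = {
    "identifier": "doi:10.1594/PANGAEA.806618",
    "landing_page": "https://doi.pangaea.de/10.1594/PANGAEA.806618",
    "repository": "PANGAEA",
    "access_terms": "CC-BY 3.0",
}
print(editorial_precheck(record))  # -> []: all pre-checks pass
print(editorial_precheck({}))      # -> all four fields missing: bounce it back
```

If anything comes back in that list, the dataset goes straight back to the author, before a scientific reviewer's time is spent on it.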

In my opinion, the main purpose of peer review of data is to check for obvious mistakes and determine whether the dataset (or article) is of value to the scientific community. I also err on the side of pragmatism - for most things, quality is assessed over the long term by how much the thing is used. Data's no different. So, for the most part, the purpose of scientific peer review is to determine if there's enough information with the data to allow it to be reused.

Dataset 1: Institute of Meteorology and Geophysics (2013): Air temperature and precipitation time series from weather station Obergurgl, 1953-1959. University of Innsbruck, doi:10.1594/PANGAEA.806618.


I found this dataset by going to Pangaea.de and typing "precipitation" into their search box, and then looking at the search results until I found a title that I liked the sound of and thought I'd have the domain expertise to review. (Told you the process was biased!)

Then I started poking around and asking myself a few questions:
  • Are the access terms and conditions appropriate? 
    • Open access and downloadable with a click of a button, so yes. The landing page also clearly states that the data is licensed under CC-BY 3.0.
  • Is the format of the data acceptable? 
    • You can download the dataset as tab-delimited text, in a variety of standards you can choose from a drop-down menu. You can also view the first 2,000 rows in a nicely formatted HTML table on the webpage.
  • Does the format conform to community standards?
    • I'm used to stuff in netCDF, but I suspect tab delimited text is more generic.
  • Can I open the files and view the data? (If not, reject straight away)
    • I can view the first 2,000 lines on the webpage. Downloading the file was no problem, but the .tab extension confused my computer. I tried opening it in Notepad first (which looked terrible), but quickly figured out that Excel would open the file and format it nicely for me.
  • Is the metadata appropriate? Does it accurately describe the data?
    • Yes. I can't spot any glaring errors, and short of going to the measurement site itself and measuring, I have to trust that the latitude and longitude are correct, but that's to be expected.
  • Are there unexplained/non-standard acronyms in the dataset title/metadata?
    • No. I like the way the DATE/TIME parameter links out to a description of the format it follows.
  • Is the data calibrated? If so, is the calibration supplied?
    • No mention of calibration, but these are old measurements from the 1950s, so I'm not surprised.
  • Is information/metadata given about how/why the dataset was collected? (This may be found in publications associated with the dataset)
  • Are the variable names clear and unambiguous, and defined (with their units)?
    • Yes, in a Parameter(s) table on the landing page. I'm not sure why they decided to call temperature "TTT", but it's easy enough to figure out, since the units are given next to the variable name. 
    • It also took me a minute to work out what 7-21h and 21-7h meant next to Precipitation, sum in the table. Looking at the date/time of the measurements made me realise that the precipitation was summed between 7am and 9pm for one measurement, and between 9pm and 7am (the following morning) for the other - an artefact of when the measurements were actually taken.
    • The metadata gives the height above ground of the sensor, but doesn't give the height above mean sea level of the measurement station - you have to go to the dataset collection page to find that out. It does say the location is in the Central Alps, though.
  • Is there enough information provided so that data can be reused by another researcher?
    • Yes, I think so
  • Is the data of value to the scientific community? 
    • Yes, it's measurement data that can't be repeated.
  • Does the data have obvious mistakes? 
    • Not that I can see. The precision of the precipitation measurement is 0.1mm, which is small, but plausible. 
  • Does the data stay within expected ranges?
    • Yes. I can't spot any negative rain rates, or sub-zero temperatures in the middle of summer.
  • If the dataset contains multiple data variables, is it clear how they relate to each other?
    • Yes - the temperature and precipitation measurements are related according to the time of the measurement. 
Verdict: Accept. I'm pretty sure I'd be able to use this data, if I ever needed precipitation measurements from the 1950s in the Austrian Alps.
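Some of those checks (negative rainfall, sub-zero midsummer temperatures, combining the 7-21h and 21-7h sums into daily totals) are easy to automate once the tab-delimited file is parsed. A minimal sketch in Python - the column names and sample rows below are invented for illustration, not copied from the real Obergurgl file:

```python
import csv
import io

# Invented sample rows in a PANGAEA-style tab-delimited layout; the real
# column headings and values will differ.
SAMPLE = """\
Date/Time\tTTT [degC]\tPrecip [mm] (7-21h)\tPrecip [mm] (21-7h)
1953-01-01\t-4.2\t0.3\t0.0
1953-01-02\t-3.8\t0.0\t1.2
1953-07-01\t14.5\t2.1\t0.4
"""

def sanity_check(rows):
    """Flag obviously impossible values: negative rainfall, or sub-zero
    temperatures in midsummer (July/August)."""
    problems = []
    for row in rows:
        date = row["Date/Time"]
        temp = float(row["TTT [degC]"])
        # The two partial sums (7am-9pm and 9pm-7am) combine into a daily total.
        rain = float(row["Precip [mm] (7-21h)"]) + float(row["Precip [mm] (21-7h)"])
        if rain < 0:
            problems.append((date, "negative rainfall"))
        month = int(date.split("-")[1])
        if month in (7, 8) and temp < 0:
            problems.append((date, "sub-zero midsummer temperature"))
    return problems

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
print(sanity_check(rows))  # -> []: no obvious mistakes in this sample
```

A script like this doesn't replace the reviewer's judgement, but it does scale to all 2,000+ rows, rather than just the ones that happen to catch your eye on the webpage.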

I found this dataset in a very similar way to before, i.e. by going to figshare.com and typing "precipitation" into their search box, ticking the box in the advanced search to restrict results to datasets, and then picking the first appropriate-sounding title.

At first glance, I haven't a clue what this dataset is about. The data itself is easily viewed on the webpage as a table with some location codes (explained a bit in the description - I think they're in the USA?) and some figures for annual rainfall and coefficients of variation.

Going through my questions:
  • Are the access terms and conditions appropriate? 
    • Don't know. It's obviously open, but I don't know what license it's under (if any)
  • Is the format of the data acceptable? 
    • I can easily download it as an Excel spreadsheet (make comments as you'd like regarding Excel and proprietary formats and backwards compatibility...)
  • Does the format conform to community standards?
    • No, but I can open the files easily, so it's not too bad
  • Can I open the files and view the data? (If not, reject straight away)
    • Yes
  • Is the metadata appropriate? Does it accurately describe the data?
    • No
  • Are there unexplained/non-standard acronyms in the dataset title/metadata?
    • Yes
  • Is the data calibrated? If so, is the calibration supplied?
    • No idea
  • Is information/metadata given about how/why the dataset was collected? (This may be found in publications associated with the dataset)
  • Are the variable names clear and unambiguous, and defined (with their units)?
    • No
  • Is there enough information provided so that data can be reused by another researcher?
    • No
  • Is the data of value to the scientific community? 
    • I have no idea
  • Does the data have obvious mistakes? 
    • No idea
  • Does the data stay within expected ranges?
    • Well, there's no negative rainfall - other than that, who knows?
  • If the dataset contains multiple data variables, is it clear how they relate to each other?
    • Not clear
Verdict: Reject. On the figshare site, there simply isn't enough metadata to review the dataset, or even to figure out what the data is. Yes, "Annual rainfall (mm)" is clear enough, but that makes me ask: for which year? Or is it averaged? Or what?

But! Looking at the paper which is linked to the dataset reveals an awful lot more information. This dataset is the figures behind table 1 of the paper, shared in a way that makes them easier to use in other work (which I approve of). The paper also has a paragraph about the precipitation data in the table, describing what it is and how it was created. 

It turns out the main purpose of this dataset was to study the plant resource use by populations of desert tortoises (Gopherus agassizii) across a precipitation gradient in the Sonoran Desert of Arizona, USA. And, from the look of the paper (very much outside my field!), it did the job it was supposed to, and might be of use for other people studying animals in that region. My main concern is that if the dataset ever becomes disconnected from the paper, the dataset as it stands would be pretty much worthless.

Here's a picture of a desert tortoise:
[Image: DesertTortoise.JPG]
Desert Tortoise (Gopherus agassizii) in Rainbow Basin near Barstow, California. Photograph taken by Mark A. Wilson (Department of Geology, The College of Wooster). Public Domain

Conclusions

So, what have I learned from this little experiment?
  1. There's an awful lot of metadata and information in a journal article that relates to a dataset (which is good) and linking the two is vital if you're not going to duplicate information from the paper in the same location as the dataset. BUT! if the link between the dataset and the paper is broken, you've lost all the information about the dataset, rendering it useless.
  2. Having standard (and possibly mandatory) metadata fields which have to be filled out before the dataset is stored in the repository means that you've got a far better chance of being able to understand the dataset without having to look elsewhere for information (that might be spread across multiple publications). The downside of this is that it increases the effort needed to deposit the data in the repository, duplicates metadata and may increase the chances of error (when the metadata stored with the dataset differs from that in the publication).
  3. I picked a pair of fairly easy datasets to review, and it took me about 3 hours (admittedly, there was a large proportion of that which was devoted to writing this post). 
  4. Having a list of questions to answer does help very much with the data review process. The questions above are ones I've come up with myself, based on my knowledge of datasets and also of observational measurements. They'll not be applicable for every scientific domain, so I think they're only really guidelines. But I'd be surprised if there weren't some common questions there.
  5. Data review probably isn't as tricky as people fear. Besides, there's always the option of rejecting stuff out of hand if, for example, you can't open the downloaded data file. It's the dataset authors' responsibility (with some help from the data repository) to make the dataset usable and understandable if they want it to be published.
  6. Searching for standard terms like "precipitation" in data repositories can return some really strange results.
  7. Desert tortoises are cute!
I'd very much like to thank the authors whose datasets I've reviewed (assuming they ever see this). They put their data out there, open to everyone, and I'm profoundly grateful! Even in the case where I'd reject the dataset as not being suitable to publish in a data journal, I still think the authors did the right thing in making it available, seeing as it's an essential part of another published article.
______
* Believe me, we've had a lot of discussions about what exactly it means to be an accredited/trusted repository. I'll be blogging about it later.