Monday, 14 November 2011

Why data citation is important - a personal tale.

Way back in the day, when I was a wet-behind-the-ears graduate student, my first proper science job was in pre-processing a large scientific dataset. My job was to convert signal levels received from a satellite (Italsat) radio beacon (at 20, 40 and 50 GHz) into attenuation levels. In other words, convert this:

 to this:

with the eventual aim of producing something like this:

a process which involved 4 major steps, 4 different computer programs, and 16 intermediate files for each day of measurements. Each month of preprocessed data represented somewhere between a couple of days and a week's worth of effort. It was a job where attention to detail was important, and you really had to know what you were looking at from a scientific perspective.

I started work on this project in 1999. In 2006 (five years after the dataset was finished) we finally got a publication out of it:

Ventouras, S., S. A. Callaghan, and C. L. Wrench (2006), Long-term statistics of tropospheric attenuation from the Ka/U band ITALSAT satellite experiment in the United Kingdom, Radio Sci., 41, RS2007, doi:10.1029/2005RS003252.

It's been cited twice, both times by me. 

We shared our data with another group. They got a publication out of it in 2003, three years before we did. We weren't part of the author list, though I believe we got an acknowledgement. 

A quick Google Scholar search for "Italsat Sparsholt" gives 48 papers which mention Italsat (the satellite) and Sparsholt (the receive station where the data came from), 37 of which weren't written by members of the project team.

But of course, it's citations, not acknowledgements, that count when it comes to measuring how influential your work is.

And yes, I suppose we could have published quicker. But our job was to collect and quality-control and generally make our datasets as good as they possibly could be. And they are good, and they are important, but unfortunately not in a way that's easily measured.

So, that's why I'm pushing so hard for datasets to be accepted as first-class scholarly outputs. I've spent years of my life making a dataset the best it can be, only to be pipped to the post when it comes to publishing, and having no way of knowing if that work has actually been worthwhile or not. (And no, I'm not bitter, honest!)

Data citation is something I believe in, because I've been there. I've also submitted data to a data centre (and got infuriated with the format requirements and metadata requests). But now, many years down the line, I'm on the data management side of the fence, and I can see how important it is to encourage scientists who produce data to put their data in archives/data centres where it can be properly looked after. Giving them credit through data citation has got to be part of it, at least until the point where science as a whole comes up with a better method for tracking scientific impact and importance!

(All pictures from: S. Ventouras, C.L. Wrench, S.A. Callaghan, "Measurement and analysis of satellite beacon transmissions at frequencies up to 50 GHz. Part 1: Attenuation Statistics and Frequency Scaling of Attenuation Values", Project report for the Radio Communications Agency, September 2000. Updated March 2003.)


As an aside - if anyone needs convincing of the importance of digital archiving and curation - the only way I could get the images above into this blog post was by taking a digital photo of the hard copy of the report. The original files were in a format (.ps) that Windows doesn't seem to like anymore...
