Tuesday 26 November 2013

Citing dynamic data


Beautiful animation from http://uxblog.idvsolutions.com/2013/07/a-breathing-earth.html - go check out the larger versions!
Citing dynamic data is a topic that keeps coming around. Every time data citation is mentioned, someone points out that data citation is not like text citation, because people can and will want to get their hands on the most recent data in a dataset, and simply don't want to wait for a frozen version. There's also confusion about what makes something citeable or not (see "DOI != citeable" by Carl Boettiger), tied into the whole question of using DOIs for citation and the requirements a dataset has to meet before a DOI is assigned.

As I've said many times before, citing data is all about using it to support the scholarly record. We have other methods of linking data to papers, or data to other data - that's what the Internet is all about after all. I maintain that citation is all about getting back to exactly the thing the author of the article was talking about when they put the citation in the article.

If you’re citing something so you can simply point to it ("the most recent version of the dataset can be found at blah"), and aren’t really that worried about whether it’s changed since the pointer was made, then you can do that easily with a citation containing an HTTP link. That way you go automatically to the most recent version of the dataset.

If, however, you need to be sure that the user gets back to exactly the same data each time, because that's the data you used in your analysis, then that data becomes part of the scientific record and needs to be frozen. How you get back to that exact version is up to the dataset archive – it can be done via frozen snapshots, or by backing out changes on a database – whatever works.

(For a more in-depth discussion of frozen data versus active data, see the previous post here.)

Even if you’re using a DOI to get to a frozen version of the dataset, there should still be a link on the DOI landing page which points to the most recent version of the dataset. So if a scientist wants to get to the most recent version of the dataset, but only has a DOI to a frozen version, then they can still get to the most recent version in a couple of hops.

It is (theoretically) possible to record all changes to a dynamic dataset and guarantee (audited by someone) that, if needed, the data repository could back out all those changes to recreate the original dataset as it was on a certain date. However, the BODC did a few tests a while back, and discovered that backing out the changes made to their database would take weeks, depending on how long ago the requested version was. (This is a technical issue though, so I’m sure people are already working on solving it.)

You could set up a system where a citation is simply a unique reference based on a database identifier and the timestamp of extraction – as is already done in some cases. The main issue with this (in my opinion) is convincing users and journal editors that this is an appropriate way to cite the data. It’s been done in some fields (e.g. accession numbers) but hasn’t really gained world-wide traction. I know from our own experience at the BADC that telling people to cite our data using our own (permanent) URLs didn’t get anywhere, because people don’t trust URLs. (To be fair, we were telling them this at a time when data citation was even less used than it is now, so attitudes may well have changed since.)
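To make the idea concrete, here's a minimal sketch (in Python) of what such an identifier-plus-timestamp citation could look like. All of the names and the citation layout here are invented for illustration; the point is simply that the dataset identifier, the query and the extraction time together pin down exactly what was used.

```python
from datetime import datetime, timezone

def dynamic_data_citation(authors, title, publisher, dataset_id, query, accessed=None):
    """Build a citation string for an extract from a dynamic dataset.

    The extract is pinned by the dataset identifier, the query used,
    and the timestamp of extraction, rather than by a frozen copy.
    """
    accessed = accessed or datetime.now(timezone.utc)
    return (
        f"{authors} ({accessed.year}): {title}. {publisher}. "
        f"Dataset ID {dataset_id}, subset '{query}', "
        f"extracted {accessed.strftime('%Y-%m-%dT%H:%MZ')}."
    )

# Hypothetical example
print(dynamic_data_citation(
    authors="Bloggs, J. et al.",
    title="UK hourly surface temperature (ongoing)",
    publisher="Example Data Centre",
    dataset_id="uk-temp-hourly",
    query="station=heathrow, 2010-2013",
))
```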

Frozen data is definitely the easiest and safest type to cite. But we regularly manage datasets that are continually being updated, and for a long-term time series we can't afford to wait the twenty-odd years for the series to be finished and frozen before we start using and citing it.

So we've got a few work-arounds.
  1. For a long-running dataset, we break the dataset up into appropriate chunks, and assign DOIs to those chunks. These chunks are generally defined on a time basis (yearly, monthly), and this works particularly well for datasets where new data is continually being appended, but the old data isn't being changed. (Using a dead-tree analogy, the chunks are volumes of the same work, released in a series at different times - think of the novels in A Song of Ice and Fire, for example - now that's a long-running dataset which is still being updated*)
    1. A related method is the ONS (Office for National Statistics) model, where the database is cited with a DOI and an access date, on the understanding that the database is only changed by appending new data to it – hence any data from before the access date will not have changed between now and when the citation was made. As soon as old data is updated, the database is frozen and archived, and a new DOI is assigned to the new version. 
  2. For datasets where the data is continually being updated, and old measurements are being changed as well as new measurements appended, we take snapshots of the dataset at a given point in time; those snapshots are frozen and have the DOIs assigned to them. This is effectively version control for a changing dataset, and it parallels the system used for software releases. (A minimal sketch of this snapshot approach is given below.)
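For illustration only, here's a rough Python sketch of that snapshot approach, assuming a simple file-based dataset. The paths and manifest layout are invented, but the idea is that the frozen, checksummed copy is the thing that gets the DOI.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def freeze_snapshot(active_dir, archive_root, version):
    """Copy the current state of an active dataset into a versioned
    snapshot, and write a manifest with per-file checksums so the
    frozen copy can be verified later."""
    src = Path(active_dir)
    dest = Path(archive_root) / f"v{version}"
    shutil.copytree(src, dest)

    manifest = {
        "version": version,
        "frozen_at": datetime.now(timezone.utc).isoformat(),
        "files": {},
    }
    for f in sorted(dest.rglob("*")):
        if f.is_file():
            manifest["files"][str(f.relative_to(dest))] = hashlib.sha256(
                f.read_bytes()
            ).hexdigest()

    (dest / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))
    return dest  # this frozen directory is what a DOI would be assigned to

# e.g. freeze_snapshot("/data/active/rainfall", "/archive/rainfall", "1.0")
```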
It's worth noting that we're not the only group thinking about these issues; there are a lot of clever people out there trying to come up with solutions. The key thing is bringing them all together so that the different solutions work with, rather than against, each other - one of the key tenets of the RDA.

DOIs aren’t suitable for everything, and citing dynamic data is a problem that we have to get our heads around. It may well turn out that citing frozen datasets is a special case, in which case we’ll need to come up with another solution. But we need to get people used to citing data first!

So, in summary – if all you want from a citation is a link to the current version of the data, use a URL. If you want to get back to the exact version of the data used in the paper, so that you can check and verify the authors' results, that's when you need a DOI.

_________________________________________
* Pushing the analogy a bit further - I'd bet there are hordes of "Game of Thrones" fans out there who'd dearly love to get their hands on the active version of the next book in "A Song of Ice and Fire", but I'm pretty sure George R.R. Martin would prefer they didn't!

Frozen Datasets are Useful, So are Active ones

"Frozen Raspberry are Tasty" by epSos.de

I think there's a crucial distinction we need to draw between data that is "active" or "working" and data that is  "finished" or "frozen"*, i.e. suitable for publication/consumption by others.

There are a lot of parallels that can be drawn between writing a novel (or a text book, or an article, or a blog post) and creating a dataset. When I sit down to write a blog post, sometimes I start at the beginning and write until I reach the end. If I were doing that interactively, it might be useful for a reader to watch me type, and get access to the post as I'm adding to it. I'm not that disciplined a writer, however - I reread and rewrite things. I go back, I shuffle text around, and to be honest, it'd get very confusing for someone watching the whole process. (Not to mention the fact that I don't really want people to watch while I'm writing - it'd feel a bit uncomfortable and odd.)

In fact, this post has just been created as a separate entity in its own right - it was originally part of the next post on citing dynamic data  - so if the reader wanted to cite the above paragraph and was only accessing the working draft of the dynamic data post, well, when they came back to the dynamic data post, that paragraph wouldn't be there anymore.

It's only when the blog post is what I consider to be finished, and is spell-checked and proofread, that I hit the publish button.

Now, sometimes I write collaboratively. I recently put in a grant proposal which involved coordinating people from all around the world, and I wrote the proposal text openly on a Google document with the help of a lot of other people. That text was constantly in flux, with additions and changes being made all the time. But it was only finally nailed down and finished just before I hit the submit button and sent it in to the funders. Now that that's done, the text is frozen, and is the official version of record, as (if it gets funded) it will become part of the official project documentation.

The process of creating a dataset can be a lot like that. Researchers understandably want to check their data before making it available to other people, in case others find errors. They work collaboratively in group workspaces, where a dataset may change a lot very quickly, without proper version control, and that's ok. There has to be a step that says "this dataset is now suitable for use by other people and is a version of record" - i.e. hitting the submit, or the publish, button.

But at the same time, creating datasets can be more like writing a multi-volume epic than a blog post. They take time, and need to be released in stages (or versions, or volumes, if you'd prefer). But each of those volumes/versions is a "finished" thing in its own right.

I'm a firm believer that if you cite something, you're using it to support your argument. In that case, any reader who reads your argument needs to be able to get to the thing you've used to support it. If that thing doesn't exist anymore, or has changed since you cited it, then your argument immediately falls flat. And that is why it's dangerous to cite active datasets. If you're using data to support your argument, that data needs to be part of the record, and it needs to be frozen. Yes, it can be superseded, or flat out wrong, but the data still has to be there.

You don't have this issue when citing articles - an article is always frozen before it is published. The closest analogy in the text world for active data is things like wiki pages, but they're generally not accepted in scholarly publishing to be suitable citation sources, because they change.

But if you're not looking to use data to support your argument, you're just doing the equivalent of saying "the dataset can be found at blah", well, that's when a link to a working dataset might be more appropriate.

My main point here is that you need to know whether the dataset is active or frozen before you link/cite it, as that can determine how you do the linking/citing. The user of the link/citation needs to know whether the dataset is active or not as well.

In the text world, a reader can tell from the citation (usually the publisher info) whether the cited text is active or frozen. For example, a paper from the Journal of Really Important Stuff (probably linked with a DOI) will be frozen, whereas a Wikipedia page (linked with a URL) won't be. For datasets, the publisher is likely to be the same (the host repository) whether the data is frozen or not - hence, ideally, we need a method of determining the "frozen-ness" of the data from the citation string text.

In the NERC data centres, it's easy. If the text after the "Please cite this dataset as:" bit on the dataset catalogue page has a DOI in it, then the dataset is frozen, and won't be changed. If it's got a URL, the dataset is still active. Users can still cite it, but the caveat there is that it will change over time.
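As a toy illustration (not an actual NERC tool), here's roughly how that convention could be encoded in Python - if the recommended citation contains a DOI, treat the dataset as frozen; if it only contains an ordinary URL, treat it as active:

```python
import re

def citation_looks_frozen(citation):
    """Rough check of the convention described above: a DOI in the
    recommended citation implies a frozen dataset; a plain URL implies
    the dataset is still active (and may change)."""
    if re.search(r"\b(doi:|https?://(dx\.)?doi\.org/)10\.\d{4,9}/\S+", citation, re.I):
        return True   # frozen, won't be changed
    if re.search(r"https?://\S+", citation):
        return False  # active; citable, but with the caveat that it will change
    return None       # no actionable link at all

print(citation_looks_frozen("... doi:10.1594/PANGAEA.806618"))      # True
print(citation_looks_frozen("... http://badc.nerc.ac.uk/data/xyz"))  # False
```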

We'll always have active datasets and we'll want to link to them (and potentially even freeze bits of them to cite). We (and others) are still trying to figure out the best ways to do this, and we haven't figured it out completely yet, but we're getting there! Stay tuned for the next blog post, all about citing dynamic (i.e. active) data.

In the meantime, when you're thinking of citing data, just take a moment to think about whether it's active or not, and how that will affect your citing method. Active versus frozen is an important distinction!

____________________________
* I love analogies and terminology. Even in this situation, calling something frozen implies that you can de-frost it and refreeze it (but once that's done, is it still the same thing?) More to ponder...

Thursday 14 November 2013

Presentations, presentations, presentations...

Scruffy Duck helps me prepare my slides before LCPD13, Malta
Long time, no post and all that - but I'm still here!

The past few months have been a bit busy, what with the RDA Second Plenary, the DataCite Summer Meeting, and the CODATA and Force 11 Task Groups on Data Citation meetings in Washington DC, followed by Linking and Contextualising Publications and Datasets, in Malta, and a quick side trip to CERN for the ODIN codesprint and first year conference. (My slides from the presentations at the DataCite, LCPD and ODIN meetings are all up on their respective sites.)

On top of that I also managed to decide it'd be a good idea to apply for a COST Action on data publication. Thankfully 48 other people from 25 different countries decided that it'd be a good idea too, and the proposal got submitted last Friday (and now we wait...) Oh, and I put a few papers in for the International Digital Curation Conference being held in San Francisco in February next year.

Anyway, they're all my excuse for not having blogged for a while, despite the list I've been building up of things to blog about. This post is really by way of an update, and also to break the dry spell. Normal service (or whatever passes for it 'round these parts) will be resumed shortly.

And just to make it interesting, a couple of my presentations this year were videoed. So you can hear me present about the CODATA TG on data citation's report "Out of Cite, Out of Mind" here. And the lecture I gave on data management for the OpenAIRE workshop on May 28 in Ghent, Belgium can be found here.

Friday 6 September 2013

My Story Collider story - now available for all your listening needs

Way back last year, I was lucky/brave/foolhardy enough to take part in a Story Collider event where I stood on stage in front of a microphone and told a story about my life in science*.

And here is that very recording! With many thanks to the fine folk at the Story Collider for agreeing to let me post it on my blog.


_________________
*This was right in the middle of my three month long missing voice period, so I sound a bit croaky.

Monday 12 August 2013

How to review a dataset: a couple of case studies

"Same graph as last year, but now I have an additional dot"
http://vadlo.com/cartoons.php?id=149

As part of the PREPARDE project, I've been doing some thinking recently about how exactly one would go about peer-reviewing data. So far, the project (and friends and other interested parties) have come up with some general principles, which are still being discussed and will be published soon. 

Being of a more pragmatic and experimental bent myself, I thought I'd try to actually review some publicly accessible datasets out there and see what I could learn from the process. Standard disclaimers: with a sample size of 2, and an admittedly biased way of choosing which datasets to review, this is not going to be statistically valid!

I'm also bypassing a bit of the review process that would probably be done by the journal's editorial assistant, asking important questions like: 
  • Does the dataset have a permanent identifier? 
  • Does it have a landing page (or README file or similar) with additional information/metadata, which allows you to determine that this is indeed the dataset you're looking for?
  • Is it in an accredited/trusted repository?*
  • Is the dataset accessible? If not, are the terms and conditions for access clearly defined?
If the answer to any of those questions is no, then the editorial assistant should just bounce the dataset back to the author without even sending it to scientific review, as the poor scientific reviewer will have no chance of either accessing the data, or understanding it.
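Those editorial checks lend themselves to a simple checklist. Here's a hedged sketch of what that might look like in code - the field names and the "trusted repository" list are entirely made up for illustration:

```python
def editorial_pre_checks(dataset):
    """Run the basic editorial checks before a dataset goes to scientific
    review; any failure means bouncing it back to the author.

    `dataset` is a hypothetical dict, e.g.:
    {"identifier": "doi:10.1234/abcd", "landing_page": "https://...",
     "repository": "PANGAEA", "accessible": True, "access_terms": None}
    """
    trusted_repositories = {"PANGAEA", "BADC", "figshare"}  # placeholder list
    problems = []

    if not dataset.get("identifier"):
        problems.append("no permanent identifier")
    if not dataset.get("landing_page"):
        problems.append("no landing page / README with metadata")
    if dataset.get("repository") not in trusted_repositories:
        problems.append("repository not accredited/trusted")
    if not dataset.get("accessible") and not dataset.get("access_terms"):
        problems.append("not accessible and no clear terms of access")

    return problems  # empty list == ready for scientific review
```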

In my opinion, the main purpose of peer-review of data is to check for obvious mistakes and determine if the dataset (or article) is of value to the scientific community. I also err on the side of pragmatism - for most things, quality is assessed over the long term by how much the thing is used. Data's no different. So, for the most part, the purpose of the scientific peer review is to determine if there's enough information with the data to allow it to be reused.

Dataset 1: Institute of Meteorology and Geophysics (2013): Air temperature and precipitation time series from weather station Obergurgl, 1953-1959. University of Innsbruck, doi:10.1594/PANGAEA.806618


I found this dataset by going to Pangaea.de and typing "precipitation" into their search box, and then looking at the search results until I found a title that I liked the sound of and thought I'd have the domain expertise to review. (Told you the process was biased!)

Then I started poking around and asking myself a few questions:
  • Are the access terms and conditions appropriate? 
    • Open access and downloadable with a click of a button, so yes. It's also clearly stated that the license for the data is CC-BY 3.0.
  • Is the format of the data acceptable? 
    • You can download the dataset as tab-delimited text in a wide variety of standards that you can choose from a drop down menu. You can also view the first 2,000 rows in a nicely formatted html table on the webpage.
  • Does the format conform to community standards?
    • I'm used to stuff in netCDF, but I suspect tab delimited text is more generic.
  • Can I open the files and view the data? (If not, reject straight away)
    • I can view the first 2,000 lines on the webpage. Downloading the file was no problem, but the .tab extension confused my computer. I tried opening it in Notepad first (which looked terrible), but then quickly figured out that I could open the file in Excel and it would format it nicely for me. (A programmatic alternative is sketched after the verdict below.)
  • Is the metadata appropriate? Does it accurately describe the data?
    • Yes. I can't spot any glaring errors, and short of going to the measurement site itself and measuring, I have to trust that the latitude and longitude are correct, but that's to be expected.
  • Are there unexplained/non-standard acronyms in the dataset title/metadata?
    • No. I like the way parameter DATE/TIME is linked out to a description of the format that it follows.
  • Is the data calibrated? If so, is the calibration supplied?
    • No mention of calibration, but these are old measurements from the 1950s, so I'm not surprised.
  • Is information/metadata given about how/why the dataset was collected? (This may be found in publications associated with the dataset)
  • Are the variable names clear and unambiguous, and defined (with their units)?
    • Yes, in a Parameter(s) table on the landing page. I'm not sure why they decided to call temperature "TTT", but it's easy enough to figure out, given the units are given next to the variable name. 
    • It also took me a minute to figure out what the 7-21h and 21-7h meant in the table next to the Precipitation, sum - but looking at the date/time of the measurements made me realise that it meant the precipitation was summed over the time between 7am and 9pm for one measurement and 9pm and 7am (the following morning) for the other - an artefact of when the measurements were actually taken.
    • The metadata gives the height above ground of the sensor, but doesn't give the height above mean sea level for the measurements station - you have to go to the dataset collection page to find that out. It does say that location is in the Central Alps though.
  • Is there enough information provided so that data can be reused by another researcher?
    • Yes, I think so
  • Is the data of value to the scientific community? 
    • Yes, it's measurement data that can't be repeated.
  • Does the data have obvious mistakes? 
    • Not that I can see. The precision of the precipitation measurement is 0.1mm, which is small, but plausible. 
  • Does the data stay within expected ranges?
    • Yes. I can't spot any negative rain rates, or sub-zero temperatures in the middle of summer.
  • If the dataset contains multiple data variables, is it clear how they relate to each other?
    • Yes - the temperature and precipitation measurements are related according to the time of the measurement. 
Verdict: Accept. I'm pretty sure I'd be able to use this data, if I ever needed precipitation measurements from the 1950s in the Austrian Alps.
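Coming back to the file-format point above: for anyone who'd rather not fight Excel, here's a hedged sketch of reading the downloaded tab-delimited file with pandas. The filename is made up, the column names would come from the parameter table on the landing page, and some downloads prepend a metadata header block that may need skipping:

```python
import pandas as pd

# The .tab extension is just tab-separated text; read it as such.
df = pd.read_csv("Obergurgl_1953-1959.tab", sep="\t")

print(df.columns.tolist())  # e.g. DATE/TIME, TTT [deg C], precipitation sums...
print(df.describe())        # quick range check: any impossible values?
```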

I found this dataset in a very similar way to before, i.e. by going to figshare.com and typing "precipitation" into their search box, ticking the box in the advanced search to restrict the results to datasets, and then picking the first appropriate-sounding title.

At first glance, I haven't a clue what this dataset is about. The data itself is easily viewed on the webpage as a table with some location codes (explained a bit in the description - I think they're in the USA?) and some figures for annual rainfall and coefficients of variation.

Going through my questions:
  • Are the access terms and conditions appropriate? 
    •  Don't know. It's obviously open, but I don't know what license it's under (if any)
  • Is the format of the data acceptable? 
    •  I can easily download it as an Excel spreadsheet (make comments as you'd like regarding Excel and proprietary formats and backwards compatibility...)
  • Does the format conform to community standards?
    •  No, but I can open them easily, so it's not too bad
  • Can I open the files and view the data? (If not, reject straight away)
    •  Yes
  • Is the metadata appropriate? Does it accurately describe the data?
    •  No
  • Are there unexplained/non-standard acronyms in the dataset title/metadata?
    •  Yes
  • Is the data calibrated? If so, is the calibration supplied?
    •  No idea
  • Is information/metadata given about how/why the dataset was collected? (This may be found in publications associated with the dataset)
  • Are the variable names clear and unambiguous, and defined (with their units)?
    •  No
  • Is there enough information provided so that data can be reused by another researcher?
    •  No
  • Is the data of value to the scientific community? 
    •  I have no idea
  • Does the data have obvious mistakes? 
    •  No idea
  • Does the data stay within expected ranges?
    •  Well, there's no minus rainfall - other than that, who knows?
  • If the dataset contains multiple data variables, is it clear how they relate to each other?
    •  Not clear
Verdict: Reject. On the figshare site, there simply isn't enough metadata to review the dataset, or even figure out what the data is. Yes, "Annual rainfall (mm)" is clear enough, but that makes me ask: for what year? Or is it averaged? Or what?

But! Looking at the paper which is linked to the dataset reveals an awful lot more information. This dataset is the figures behind table 1 of the paper, shared in a way that makes them easier to use in other work (which I approve of). The paper also has a paragraph about the precipitation data in the table, describing what it is and how it was created. 

It turns out the main purpose of this dataset was to study the plant resource use by populations of desert tortoises (Gopherus agassizii) across a precipitation gradient in the Sonoran Desert of Arizona, USA. And, from the look of the paper (very much outside my field!), it did the job it was supposed to, and might be of use to other people studying animals in that region. My main concern is that if the dataset ever becomes disconnected from that paper, then the dataset as it stands would be pretty much worthless.

Here's a picture of a desert tortoise:
Desert Tortoise (Gopherus agassizii) in Rainbow Basin near Barstow, California. Photograph taken by Mark A. Wilson (Department of Geology, The College of Wooster). Public Domain

Conclusions

So, what have I learned from this little experiment?
  1. There's an awful lot of metadata and information in a journal article that relates to a dataset (which is good) and linking the two is vital if you're not going to duplicate information from the paper in the same location as the dataset. BUT! if the link between the dataset and the paper is broken, you've lost all the information about the dataset, rendering it useless.
  2. Having standard (and possibly mandatory) metadata fields which have to be filled out before the dataset is stored in the repository means that you've got a far better chance of being able to understand the dataset without having to look elsewhere for information (that might be spread across multiple publications). The downside of this is that it increases the effort needed to deposit the data in the repository, duplicates metadata, and may increase the chances of error (when the metadata with the dataset is different from that in the publication). (A minimal sketch of such a mandatory-fields check is given after this list.)
  3. I picked a pair of fairly easy datasets to review, and it took me about 3 hours (admittedly, there was a large proportion of that which was devoted to writing this post). 
  4. Having a list of questions to answer does help very much with the data review process. The questions above are ones I've come up with myself, based on my knowledge of datasets and also of observational measurements. They'll not be applicable for every scientific domain, so I think they're only really guidelines. But I'd be surprised if there weren't some common questions there.
  5. Data review probably isn't as tricky as people fear. Besides, there's always the option of rejecting stuff out of hand if, for example, you can't open the downloaded data file. It's the dataset authors' responsibility (with some help from the data repository) to make the dataset usable and understandable if they want it to be published.
  6. Searching for standard terms like "precipitation" in data repositories can return some really strange results.
  7. Desert tortoises are cute!
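On point 2, here's a rough sketch of what a mandatory-metadata check at deposit time could look like; the field names are invented for illustration, not taken from any particular repository:

```python
# Hypothetical list of mandatory fields a repository might require at deposit
MANDATORY_FIELDS = [
    "title", "creators", "description", "temporal_coverage",
    "spatial_coverage", "variables_and_units", "licence", "related_papers",
]

def missing_metadata(record):
    """Return the mandatory fields that are absent or empty in a deposit."""
    return [f for f in MANDATORY_FIELDS if not record.get(f)]

deposit = {
    "title": "Annual rainfall at selected sites",
    "creators": ["A. Researcher"],
    "description": "",            # empty: would be flagged
    "licence": "CC-BY 3.0",
}
print(missing_metadata(deposit))
# ['description', 'temporal_coverage', 'spatial_coverage',
#  'variables_and_units', 'related_papers']
```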
I'd very much like to thank the authors whose datasets I've reviewed (assuming they ever see this). They put their data out there, open to everyone, and I'm profoundly grateful! Even in the case where I'd reject the dataset as not being suitable to publish in a data journal, I still think the authors did the right thing in making it available, seeing as it's an essential part of another published article.
______
* Believe me, we've had a lot of discussions about what exactly it means to be an accredited/trusted repository. I'll be blogging about it later.


Monday 17 June 2013

NFDP13: Plenary Panel Two: Where do we want to go?


This is my final post about the Now and Future of Data Publishing symposium and is a write-up of my speaking notes from the last plenary panel of the day.

As before, I didn't have any slides, but used the above xkcd picture as a backdrop, because I thought it summed things up nicely!

My topic was: "Data and society - how can we ensure future political discussions are evidence led?"

"I work for the British Atmospheric Data Centre, and we happen to be one of the data nodes hosting the data from the 5th Climate Model Intercomparison Project (CMIP5). What this means is that we're hosting and managing about 2 Petabytes worth of climate model output, which will feed into the next Intergovernmental Panel on Climate Change's Assessment Report and will be used national and local governments to set policy given future projections of climate change.

But if we attempted to show politicians the raw data from these model runs, they'd probably need to go and have a quiet lie down in a darkened room somewhere. The raw data is just too much and too complicated, for anyone other than the experts. That's why we need to provide tools and services. But we also need to keep the raw data so the outputs of those tools and services can be verified.

Communication is difficult. It's hard enough to cross scientific domains, let alone the scientist/non-scientist divide. As repositories, we collect metadata about the datasets in our archives, but this metadata is often far too specific and specialised for a member of the general public or a science journalist to understand. Data papers allow users to read the story of the dataset and find out details of how and why it was made, while putting it into context. And data papers are a lot easier for humans to read than an XML catalogue page.

Data publication can help us with transparency and trust. Politicians can't be scientific experts - they need to be political experts. So they need to rely on advisors who are scientists or science journalists for that advice - and preferably more than one advisor.

Making researchers' data open means that it can be checked by others. Publishing data (in the formal data journal sense) means that it'll be peer-reviewed, which (in theory at least) will cut down on fraud. It's harder to fake a dataset than a graph - I know this personally, because I spent my PhD trying to simulate radar measurements of rain fields, with limited success!

With data publishing, researchers can publish negative results. The dataset is what it is and can be published even if it doesn't support a hypothesis - helpful when it comes to avoiding going down the wrong research track.

As for what we, personally, can do? I'd say: lead by example. If you're a researcher, be open with your data (if you can - not all data should be open, for very good reasons, for example if it's health data and personal details are involved). If you're an editor, reviewer, or funder, simply ask the question: "where's the data?"

And everyone: pick on your MP. Query the statistics reported by them (and the press), ask for evidence. Remember, the Freedom of Information Act is your friend.

And never forget, 87.3% of statistics are made up on the spot!"

_________________________________________________________

Addendum:
After a question from the audience, I did need to make clear that when you're pestering people about where their data is, be nice about it! Don't buttonhole people at parties or back them into corners. Instead of yelling "where's your data?!?" ask questions like: "Your work sounds really interesting. I'd love to know more, do you make your data available anywhere?" "Did you hear about this new data publication thing? Yeah, it means you can publish your data in a trusted repository and get a paper out of it to show the promotion committee." Things like that.

If you're talking the talk, don't forget to walk the walk.

Thursday 13 June 2013

NFDP13: Panel 3.1 Citation, Impacts, Metrics

This was my title and abstract for the panel session on Citation, Impact, Metrics at the Now and Future of Data Publishing event in Oxford on 22nd May 2013.

Data Citation Principles
I’ll talk about data citation principles and the work done by the CODATA task group on Data Citation. I’ll also touch on the implications of data publication for data repositories and for the researchers who create the data.

And here is a write-up of my presentation notes:

(No, I didn't have any slides - I just used the above PhD comic as a background)

"Hands up if you think data is important. (Pretty much all the audience's hands went up) That's good!
Hands up if you've ever written a journal paper... (Some hands went up) ... and feel you've got credit. (some hands went down again)
Hands up if you've ever created a dataset... (fewer hands up) ... and got credit. (No hands up!)
So, if data's so important, why aren't the creators getting the credit for it?

We're proposing data citation and publication as a method to give researchers credit for their efforts in creating data. The problem is that citation is designed to link one paper to another - that's it. And those papers are printed and frozen in dead-tree format. We've loaded citation with other purposes, for example attribution, discovery, credit. But citation isn't really a good fit for data, because data is so changeable and/or takes such a long time, and so many people, to create.

But to make data publication and citation work, data needs to be frozen to become the version of record that will allow the science to become reproducible. Yes, this might be considered a special case of dealing with data, but it's an important one. The version of record can always link to the most up-to-date version of the dataset after all.

Research is getting to be all about impact - how a researcher's work affects the rest of the world. To quantify impact we need metrics. Citation counts for papers are well known and well established metrics, which is why we're piggybacking on them for data. Institutions, funders and repositories all need metrics to support their impact claims too. For example a repository manager can use citation to track how researchers are using the data downloaded from the repository.

The CODATA task group on data citation is an international group. We've written a report: "Citation of data: the current state of practice, policy and technology". It's currently with the external reviewers and we're hoping to release it this summer. It's a big document ~190 pages. In it there are ten data citation principles: 

  1. Status of Data: Data citations should be accorded the same importance in the scholarly record as the citation of other objects. 
  2. Attribution: A citation to data should facilitate giving scholarly credit and legal attribution to all parties responsible for those data.
  3. Persistence: Citations should refer to objects that persist. 
  4. Access: Citations should facilitate access to data by humans and by machines.
  5. Discovery: Citations should support the discovery of data. 
  6. Provenance: Citations should facilitate the establishment of provenance of data.
  7. Granularity: Citations should support the finest-grained description necessary to identify the data. 
  8. Verifiability: Citations should contain information sufficient to identify the data unambiguously. 
  9. Metadata Standards: A citation should employ existing metadata standards.
  10. Flexibility: Citation methods should be sufficiently flexible to accommodate the variant practices among communities.
            None of these are particularly controversial, though as we try citing more and more datasets, the devil will be in the detail.

            Citation does have the benefit that researchers already are used to doing it as part of their standard practice. The technology also exists, so what we need to do is encourage the culture change so data citation is the norm. I think we're getting there."

            The Now and Future of Data Publishing - Oxford, 22 May 2013

            Book printing in the 15th century - Wikimedia Commons

            St Anne's College, Oxford, was host to a large group of researchers, librarians, data managers and academic publishers for the Now and Future of Data Publishing symposium, funded by the Jisc Managing Research Data programme in partnership with BioSharing, DataONE, Dryad, the International Association of Scientific, Technical and Medical Publishers, and Wiley-Blackwell.

            There was a lot of tweeting done over the course of the day (#nfdp13) so I won't repeat it here. (I've made a storify of all my tweets and retweets - unfortunately storify couldn't seem to find the #nfdp13 tweets for other people, so I couldn't add them in.) I was also on two of the panels, so may have missed a few bits of information there - it's hard to tweet when you're sitting on a stage in front of an audience!

            A few things struck me about the event:

            • It was really good to see so many enthusiastic people there!
            • The meme on the difference between *p*ublication (i.e. on a blog post) and *P*ublication (i.e. in a peer-reviewed journal) is spreading.
            • I've got a dodgy feeling about the use of "data descriptors" instead of "data papers" in Nature's Scientific Data - it feels like publishing the data in that journal doesn't give it the full recognition it deserves. Also, as a scientist, I want to publish papers, not data descriptors. I can't report a data descriptor to the REF, but I can report a data paper.
            • It wasn't just me showing cartoons and pictures of cats
            • I could really do with finding the time to sit down and think properly about Parsons' and Fox's excellent article Is Data Publication the Right Metaphor? (and maybe even the time to write a proper response).
            • Only archiving the data that's directly connected with a journal article risks authors only keeping the cherry-picked data they used to justify their conclusions. It also doesn't cover the vast range of scientific data that is important and irreproducible, but isn't the direct subject of a paper. Nor does it offer any solution for the problem of negative data. Still, archiving the data used in the paper is a good thing to do - it just shouldn't be the only archiving done.
            • Our current methods of scientific publication have worked for 300 years - that's pretty good going, even if we do need to update them now!
            I'll write up my notes for what I said in each of my panels - stay tuned!

            Monday 8 April 2013

            Musings on data and identifiers, prompted by a visit to the Ashmolean Museum, Oxford


            Flint handaxe

            So, it being the Easter school holidays, we all went for a family outing to the Ashmolean Museum in Oxford. And within about two minutes (because I am a geek) I started spotting identifiers and thinking about how the physical objects in the museum are analogous to datasets.

            Take for example the flint handaxe pictured above. It's obviously a thing in its own right, well defined and with clear boundaries. But in a cabinet full of other artifacts (even some other hand axes) how can you uniquely identify it? Well, you can stick a label next to it (the number 1) and then connect that local identifier to some metadata on display in the case:

            Metadata for the flint handaxe (1.)
            That works, but it means that the positions of the artifacts are fixed in the case, so reorganising things risks disconnecting the object from its metadata. The number 1 is only a local identifier too - there were plenty of other cases in the gallery which all had something in there with the number 1 attached to it - so as a unique identifier it's not much good. And in this case, there were actually 2 handaxes identified with the number 1.

            If you look closely at the surface of the handaxe, you'll see a number written on it in black ink: 1955.439a. This number (which I'm guessing is an accession number, with the year the artifact was first put into the museum as the first part) is also repeated in small print at the end of the metadata blurb.

            So, the moral from this example is that local identifiers are useful, but objects really do need unique identifiers which are present in both the dataset/artifact itself, and its corresponding metadata.


            Sobek
            Here we have a large, well defined dataset - sorry - artifact (and a pretty impressive one too!) There isn't another statue of Sobek this size (or at all, as far as I could see) in the Ashmolean museum. So it could be identified as "the restored statue of Sobek in the Ashmolean museum", and you'd probably get away with that, as most people would know which one you meant.

            Sobek's identifier
            But it too still has an identifier, and it's right there on his shoulder, not hidden underneath where people can't see it.

            Sobek's metadata
            And it's also connected with his metadata.

            A collection from an A-Group burial
            In this case we have a dataset that's a collection of other self-contained datasets. Each dataset/pot has its own individual value, but has greater value as part of the larger collection. These particular datasets were all found in the same location at the same time, so have a very definite connection - they were all grave goods excavated from one grave in Farras, Sudan.
            Close up of some of the grave goods
            Just because a dataset is part of a larger data collection, it doesn't mean the dataset has to be exactly the same as its fellows - in fact a wide variety of stuff makes the collection more valuable. Note though that the storage for the whole collection (i.e. the cabinet) has to take into account the different sizes and different display needs for each of the individual datasets/artifacts.

            And of course, each of the artifacts has its own id (sort of - the group of 7 semi-precious stones only has one id between them) as well as a local identifier to link it to its metadata.



            Collection metadata and individual item metadata
            The collection itself has its own metadata too, which puts the individual items' metadata into context.

            Non textual metadata

            And it also has metadata that is better expressed in the form of graphics rather than text - the diagram showing where the goods were found in the grave, and an actual photo. These figures too have their own metadata in their captions - so we've got metadata about metadata happening here, and all of it is important to keep and display.

            Faience Shabtis
            Here we have a data collection that is joined by theme rather than by geographic location. These statues are all shabtis, but came from different places and were ingested into the museum at different times.

            Faience shabti metadata (15.)
            They all have unique ids though, and in the case of this data collection, only the collection metadata is displayed. I'd imagine though that if you went looking in the museum records, you'd find information on each of the individual shabti, filed under their id.

            With digital data we've got it easier in one way, in that the same dataset/shabti can be in multiple collections at the same time and displayed in lots of different ways in different places. The downside is that it can be hard to know exactly what dataset is being displayed where and is part of what collection. That's why the permanent, unique ids are so vital to keep track of things.

            Granularity issue! Mosaic tiles
            And here we have a classic granularity issue - a pile of mosaic tiles. In theory, you could write a unique id on each one of these tesserae (it might be a bit fiddly), but then you'd have to put each of those ids into the metadata. Given that the value of these tiles isn't in themselves as individual objects, but in the whole collection, I can understand why the museum curators decided to label them as one thing.

            Metadata for the mosaic tiles (49.)
            Because the dataset is in lots of pieces (files), none of which is uniquely identified, there is always the risk that a piece may become detached from its collection and lost/misidentified. Moving this particular dataset around the place could be quite problematic - but on the other hand, there are so many pieces that losing one or two in transit might not be too much of a problem. On issues of granularity, data repository managers, like museum curators, need to decide for themselves how they're going to deal with their datasets/artifacts.

            Silver ring, temporarily removed
            And finally, what do you do if you've published a dataset, but have to take it down for whatever reason? Simple - leave the metadata about the dataset intact, and stick a note on it saying what was removed, who removed it and when. There was another one of these notices that I spotted (but didn't photograph) which gave the reason for the removal (restoration) and also a photo of the artifact, all on the little "Temporarily removed" card.

            I think we worry about data a lot, because it's so hard to draw distinct lines around what is and what isn't a dataset. But honestly, there's such a wide variety of stuff in museums that all have identifiers and methods of curation that I really do think we need to worry less about how to turn a dataset into a standardised book, and think of them more as artifacts/things that come in all sorts of shapes and sizes.

            Oh, and if you're in Oxford, do go check out the Ashmolean museum. It's great, and has lots more stuff than just the pieces I took photos of!

            Thursday 28 March 2013

            Data paper published!




            I am very pleased to announce that the data paper:

            S. A. Callaghan, J. Waight, J. L. Agnew, C. J. Walden, C. L. Wrench, S. Ventouras, “The GBS dataset: measurements of satellite site diversity at 20.7 GHz in the UK”, Geoscience Data Journal, 17 March 2013, DOI: 10.1002/gdj3.2

            has been published!

            This paper gives the details about the second large dataset that I created, and gives permanent links to the dataset itself, which is stored in the BADC archives.

            It's been a long road to get here (the dataset itself was finished in 2005) but I figure 8 years between completion of the dataset and publication is ok - especially when you consider we had to launch a new journal to publish it!

            Read all about it here

            Monday 25 February 2013

            How is a scarf like a dataset?

            My teal blue feather and fan scarf

            No, it's not a riddle! 

            It struck me recently that there are lots of parallels one can draw between the act of creating and describing a dataset and the act of hand knitting something. (Bear with me on this - it'll make sense, honest!)

            The picture above is my scarf. I'm very fond of it. I knitted it myself, and it's warm and comfortable and goes well with a lot of my clothes.

            When you're hand knitting a scarf, you take a ball of yarn, and you cast on stitches to make a row, then you keep adding rows until you run out of yarn, the scarf gets to the right length, or you get fed up with knitting.

            The yarn in a ball doesn't contain any information or structure, but by the act of putting stitches into it, you're encoding something. In the case of my scarf above, it's a repeating pattern called feather and fan stitch, but it could just as easily be another pattern, or no pattern at all. If you wanted to get really fancy, you could encode all sorts of information into a knitted item - the most famous example of this is Madame Defarge in Dickens's "A Tale of Two Cities", knitting the names of the upper classes doomed to die at the guillotine into a scarf.

            (Pushing the analogy a bit far, each stitch could represent a bit in a dataset, with a knit stitch signifying a zero and a purl stitch a one, but in this case that's not so helpful, as I've got yarn overs and knit-two-togethers as well as knit and purl stitches in there.)

            My scarf was created by a process of appending - each new row got added to the previous one, like a dataset where each new measurement gets appended to the previous one to make a time series. The scarf has a fixed number of stitches in each row, the same as a dataset where a fixed number of measurements are taken each day. This doesn't have to be the case - I've seen plenty of patterns for scarves out there with variable row lengths. It all depends on the look you want it to have, or what the knitting is supposed to be - you use variable row lengths to shape the sleeves of a jumper, for instance.

            Sometimes my data got corrupted. I dropped a stitch, or miscounted the number of knit-two-togethers that I needed to do, and came out with the wrong number of stitches at the end of the row. Usually when this happens you have to pull out the stitches until you get back to the place where you can fix the mistake, and then re-knit the rows you've pulled apart. It can get a bit annoying, especially when you're ripping out perfectly good rows to fix a mistake you hadn't spotted before, which is several rows (and possibly hours of knitting time) below.

            I know for a fact that my scarf is not perfect. I've made mistakes there, and I'd feel really uncomfortable having someone scrutinise it and point out all my errors. Thankfully, no one's planning on peer-reviewing my scarf - though they would if I entered it into one of the knitting competitions you sometimes get at village fetes.

            Like a dataset, I could have kept adding stitches and rows to my scarf ad infinitum, but there came a point when I actually wanted to wear it, so that meant I had to finish it off (i.e. cast off the stitches and sew in the ends). I could have used it while it was still being knitted (er... maybe as a pot holder, or a lap warmer?) but the knitting needles would have got in the way. It wouldn't have been ideal. Even if I had decided that I didn't want it to be a scarf after all, and was happy with it as a washcloth (a very sparkly one), I still would have had to have cast off and finished it properly, otherwise the first time I used it, it would have pulled apart into a big tangle of yarn. The same is true for datasets - if you're going to use them, you need them to be properly finished off - i.e. a firm definition of what pieces of data you are using, and what pieces you're not.

            So, I finished my scarf/dataset, and I can now use it for the purpose for which it was intended - to keep my neck warm in a stylish yet comfortable way. Now what?

            Well, I have a lot of scarves. So I need some way of identifying it, storing it, and maybe even reproducing it (when it wears out, or someone wants to make themselves one just like it). In other words, I need metadata about my scarf.

            Descriptive metadata is easy. At a very basic level it's things like colour: "teal blue" and what it is: "scarf". But even with something this simple, you still need to have common language to make sure that the descriptors are understood. "Teal blue" makes perfect sense to me, but might not mean anything to someone else, who might think it looks a bit green.

            Thankfully, there are other ways of describing the scarf. I can say that it's 200cm long and 20cm wide, and that it was made from King Cole Haze Glitter DK (the type of yarn), colourway 124 - Ocean, with dyelot 67233. All those last pieces of metadata, though too specific for general use, describe the scarf accurately (though not completely), and make a start at providing the information needed to recreate it.

            For recreating the scarf, I need all the metadata about what yarn was used, but I also need the size of the needles I knitted it on (4mm). I need the pattern that I used (18 stitch feather and fan, with a 2 stitch garter stitch border at the edges). I need the number of stitches I cast on (54) and my tension (how tightly I knit in this pattern - 28 rows and 27 stitches for a 10cm by 10cm square). You don't need any of this information to wear the scarf, but it is important to keep it if you want to recreate it!
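            Just to push the analogy into code: here's a light-hearted sketch of the scarf's metadata record, split into the descriptive fields you need to identify it and the fields you need to recreate it. The field names are made up, but the values all come from the post above.

```python
# Hypothetical metadata record for the scarf: descriptive vs. reproduction fields
scarf_metadata = {
    "descriptive": {
        "object": "scarf",
        "colour": "teal blue",
        "length_cm": 200,
        "width_cm": 20,
    },
    "reproduction": {
        "yarn": "King Cole Haze Glitter DK",
        "colourway": "124 - Ocean",
        "dyelot": "67233",
        "needle_size_mm": 4,
        "pattern": "18-stitch feather and fan, 2-stitch garter stitch border",
        "cast_on_stitches": 54,
        "tension": "28 rows x 27 stitches per 10cm x 10cm",
    },
}
```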

            (As an aside, I didn't keep all the metadata about how I made the scarf and what yarn I used for it written down somewhere, which meant that when I came to write this post, I needed to work it out all over again. In other words, metadata should be collected from the start and stored somewhere safe, regardless of what it's describing!)

            I then also need to make sure my scarf is stored correctly when I'm not using it, so it doesn't get lost, or (heaven forbid!) corrupted (i.e eaten by moths or shredded by mice). I also need to be able to tell people where it's stored, so that when I ask my other half to fetch it for me, I can also tell him that it's hanging on the door of my wardrobe.

            I want to be able to cite my scarf when I'm talking about it. Mostly, I just do it by saying "my teal blue feather and fan scarf", to distinguish it from the other scarves I have hanging around the place. I could get fancy and assign it a KOI (a Knitted Object Identifier) but most of my handknits are sufficiently distinct that a casual glance can tell which is which from a short description!

            And finally, because I've put a lot of time and effort into making my scarf, I'd like to get credit for doing that. Which, for me anyway, is covered when someone says to me "that's a nice scarf" and I can respond with "thanks! I made it myself" and a proud smile.

            There's more to this analogy than the special case of one dataset/scarf being created by one single creator, but I'm sure I've bludgeoned you with enough knitting terminology already today. I'm sure I can stretch the analogy further, but that's something for another post!

            I'll leave you with a challenge, to think about something you've made yourself, with your own two hands. It can be anything; a nice meal, a garden, a piece of clothing, a lego model, a painting, a piece of furniture. Something that you made yourself and are proud of. Got something?

            The way you feel about that thing is the way that dataset creators feel about their data, especially if that dataset has been created through great effort and took a lot of time. Everyone wants acknowledgement and credit for the work that they do. Data creators are no different!








            Tuesday 29 January 2013

            Data journals - a soon-to-be-obsolete stepping stone to something better?

            Stepping Stones bmark 75 

            During the PREPARDE project workshop at the International Digital Curation Conference, one of the presenters raised the thought that data journals may just be a temporary phenomenon, pending better data organisation and credit. (I'm paraphrasing from memory here, so forgive me if I get it wrong!) Their thinking was that we want to make data a first class scientific object and will do so through data citation, and that we will also want to enhance existing scientific publications with links back to the data they use and associated interactive gubbins. Therefore data journals, which publish datasets along with a brief paper describing them, won't be needed, because you'll either cite the data directly, or have links to the data in an analysis-and-conclusions article.

            I'm not arguing with the need for proper data citations, or the benefits they'll give. I also agree that analysis-and-conclusions articles will and should have better links to the data that underlies them. I do think though that there's an awfully big jump between a dataset, in a repository, ready to be cited, and a full analysis-and-conclusions article.

            (A brief digression - I know we're piggy-backing on article publication to provide data creators with the credit they deserve for creating the datasets, and this is nowhere near the ideal way of doing it! But that's a subject for another post, so, for today, let's go with the whole data publication thing as a given.)

            Let's start with direct citations of datasets. Ok, so you've created your dataset and you've put it in a repository somewhere, and cite it using a permanent id (DOI/ARK/whatever). Using that citation, another researcher can go and find your dataset where it's stored, and will have at least the minimum level of metadata given in the citation (Authors, Title, Publisher, etc.) What the user of the dataset doesn't get is any indication of how useful the data is likely to be (apart from what they can guess through their knowledge of the authors' and repository's reputation), and they may not get any information at all about whether or not the dataset meets any community standards, is in appropriate formats, or has extra supporting metadata or documentation.

            This isn't a particularly likely situation for most discipline-based repositories, which have a certain amount of domain knowledge to ensure that community standards are met. But institutional or general repositories, which may have to cover subject areas from art history to zoology, simply won't be able to provide this depth of knowledge. So a data citation can easily provide the who, where and maybe the what of a dataset (who created it? where is it stored? what is it - or at least what is it called?), but doesn't automatically provide any information on how or why the dataset was created - which is important when it comes to judging the quality and reuse potential of the dataset.

            Looking from the other end, analysis-and-conclusions papers tend to be pretty long things, and they often have to describe a lot in terms of the methods used for the analysis. Having to explain the data collection and processing method before you even get to the analysis methods is a pain (even if you'd only have to do it once and would then cite that first paper), but is still an essential part of the paper if the conclusions are to hold up. 

            Yes, it will be great to click on a graph and be taken to the raw data that created that plot, but you'd still need to provide metadata for that subset of the dataset (and most repositories only store and cite the full dataset, not subsets). Clicking through to a subset of the data doesn't give the whole picture of the dataset either - what if that particular data subset was cherry-picked to best support the conclusions drawn in the paper? There are technical issues there which I'm sure will be solved, but they haven't been yet.

            It's also about the target audience. If I'm looking for datasets that might be useful to me, I don't want to be trawling through pages of analytical methods to find them. Ditto if I'm interested in new statistical techniques - all the stuff about how the data was collected is noise to me. Splitting the publication between a data article (which gives all the information about calibrations and instrument set-up and the like) and an analysis-and-conclusions article, and citing the former from the latter, seems sensible to me. Not to mention that it might work out quicker to publish two smaller papers than one large one (and they would certainly be easier to write and review!)

            So I really do think there's a long-term place for data journals, between data citation and analysis-and-conclusions articles. Data articles allow for the publication of more information about a dataset (and in a more human-readable way) than can be captured in a simple metadata scheme and a repository catalogue. Data articles also provide a mechanism for the dataset's community to judge the scientific quality and potential reuse of the dataset through peer-review (open or closed, pre- or post-publication). 

            I think a data article is also a sign that the data producer is proud of their data and is willing to publicise it and share it with the community. I know that if I had a rubbish dataset that I didn't want other people using, but had been told by someone important that it had to be in a repository, then I'd be sure to put it somewhere with the minimum amount of metadata. Yes, it could still be cited, but it wouldn't necessarily be easy to use!

            There's only one way to find out if data journals are just a temporary stepping stone between data and analysis-and-conclusions articles until data citation becomes common practice and enhanced publications really get off the ground. And that's to keep working to raise the profile of data citation and data publication (whether in a data article, or as a first class part of an analysis-and-conclusions article) so it becomes the norm that data is made available as part of any scientific publication. 

            In the meantime, let's keep talking about these issues, and raising these points. The more we talk about them and the more we try to make data citation and enhanced publications happen, the more we're raising consciousness about the importance of data in science. That's all to the good!