Data Citation Principles
I’ll talk about data citation principles and the work done by the CODATA task group on Data Citation. I’ll also touch on the implications of data publication for data repositories and for the researchers who create the data.
And here is a write-up of my presentation notes:
(No, I didn't have any slides - I just used the above PhD comic as a background)
"Hands up if you think data is important. (Pretty much all the audience's hands went up) That's good!
Hands up if you've ever written a journal paper... (Some hands went up) ... and feel you've got credit. (some hands went down again)
Hands up if you've ever created a dataset... (fewer hands up) ... and got credit. (No hands up!)
So, if data's so important, why aren't the creators getting the credit for it?
We're proposing data citation and publication as a way to give researchers credit for their efforts in creating data. The problem is that citation was designed to link one paper to another - that's it. And those papers are printed on and frozen in dead tree format. We've since loaded citation with other purposes, such as attribution, discovery and credit. But citation isn't really a good fit for data, because data is so changeable and/or takes so long and so many people to create.
But to make data publication and citation work, data needs to be frozen to become the version of record that will allow the science to become reproducible. Yes, this might be considered a special case of dealing with data, but it's an important one. The version of record can always link to the most up-to-date version of the dataset after all.
Research is getting to be all about impact - how a researcher's work affects the rest of the world. To quantify impact we need metrics. Citation counts for papers are well-known and well-established metrics, which is why we're piggybacking on them for data. Institutions, funders and repositories all need metrics to support their impact claims too. For example, a repository manager can use citations to track how researchers are using the data downloaded from the repository.
The CODATA task group on data citation is an international group. We've written a report: "Citation of data: the current state of practice, policy and technology". It's currently with the external reviewers and we're hoping to release it this summer. It's a big document, about 190 pages. In it there are ten data citation principles:
- Status of Data: Data citations should be accorded the same importance in the scholarly record as the citation of other objects.
- Attribution: A citation to data should facilitate giving scholarly credit and legal attribution to all parties responsible for those data.
- Persistence: Citations should refer to objects that persist.
- Access: Citations should facilitate access to data by humans and by machines.
- Discovery: Citations should support the discovery of data.
- Provenance: Citations should facilitate the establishment of provenance of data.
- Granularity: Citations should support the finest-grained description necessary to identify the data.
- Verifiability: Citations should contain information sufficient to identify the data unambiguously.
- Metadata Standards: A citation should employ existing metadata standards.
- Flexibility: Citation methods should be sufficiently flexible to accommodate the variant practices among communities.
None of these are particularly controversial, though as we try citing more and more datasets, the devil will be in the detail.
Citation does have the benefit that researchers are already used to doing it as part of their standard practice. The technology also exists, so what we need to do is encourage the culture change so that data citation becomes the norm. I think we're getting there."
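As a rough illustration of how a few of the principles above (attribution, persistence, granularity, verifiability) might show up in an actual citation, here is a minimal Python sketch. It is not taken from the CODATA report: the `DataCitation` class, its field names, the layout of the formatted string and the DOI `10.1234/example` are all assumptions made up for this example, loosely following the creator/year/title/version/publisher/identifier pattern that many repositories suggest.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical record of the pieces a data citation might carry."""
    creators: List[str]            # attribution: everyone responsible for the data
    year: int
    title: str
    version: str                   # the frozen "version of record"
    publisher: str                 # e.g. the repository holding the data
    identifier: str                # persistence: a resolvable identifier such as a DOI
    subset: Optional[str] = None   # granularity: the part of the dataset actually used

    def format(self) -> str:
        """Render the citation as a single human-readable line."""
        authors = "; ".join(self.creators)
        text = (
            f"{authors} ({self.year}): {self.title}. "
            f"Version {self.version}. {self.publisher}."
        )
        if self.subset:
            text += f" Subset: {self.subset}."
        return f"{text} {self.identifier}"


# Made-up example values; the DOI below does not point at a real dataset.
citation = DataCitation(
    creators=["Doe, J.", "Roe, R."],
    year=2013,
    title="Example rainfall measurements",
    version="1.0",
    publisher="Example Data Repository",
    identifier="https://doi.org/10.1234/example",
    subset="stations 1-5",
)
print(citation.format())
```

Keeping the version and the optional subset as separate fields is one way to let the citation point at exactly the data that was used, while the identifier can still lead a reader on to the latest version of the dataset.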