Tuesday, 7 February 2012

Lunchtime lecture to the British Geological Survey


I was invited to give a talk to the British Geological Survey on the 25th January, on the topic of data citation and publishing, and why it's important. I've been doing this talk in a variety of guises in different places for a while now, but I thought it'd be good to put it up here too. Consider it an on-line lecture, if you will.

(Click on any of the slide images to see the larger versions)

The key point here is that science should be reproducible: different people running the same experiment at different times should get the same result. Unfortunately, until someone invents a working time machine, you can't just pop back to last week to collect some observational data, which is why we have to archive it properly.
Often, the only part of the scientific process that gets published is the conclusions drawn from a dataset. And if the data's rubbish, the conclusions will be too. But we won't know that until we can look at the data.

This is a bit of blurb about the data citation project, and the NERC data centres, and why we care about data in the first place.

There's a nice picture drawn by Robert Hooke in the above slide, showing us that in the past it might have been tedious and time-consuming to collect data, but it was at least (relatively) easy to publish. Not so much anymore.

And we're only going to be getting more data... Lots of people call it "the data deluge". If we're going to be flooded with data, it's time to start building some arks!

Data sharing is often put forward as a way of dealing with the data deluge. It has its good points...

...but in this day and age of economic belt-tightening, hoarding data might be the only thing that gets you a grant.
Data producers put a lot of effort into creating their datasets, and at the moment there's no formal way of recognising that effort - the sort of recognition that would help data producers when it comes to facing a promotion board.

There are lots of drivers to making data freely available, and to cite and publish it. From a purely pragmatic view, and wearing my data centre hat, we want a carrot to encourage people to store their data with us in appropriate formats and with complete metadata.

The project aims can basically be summed up as us wanting a mechanism to give credit to the scientists who give us data, because we know how tricky a job it is. But it has to be done if the scientific record is to stand.

The figure in this slide is key here, especially when it comes to drawing the distinction between "published" with a small "p" and "Published" with a big "P". We want to get data out into the open, and at the same time have it "Published", providing guarantees as to its persistence and general quality. What we definitely don't want is to have the data locked away on a floppy disk in a filing cabinet in an office somewhere.
Data centres are fitting into the middle ground between open and closed, and "published" and "Published", and we're hoping to help move things in the right directions.

Repeating the point, because it's important. (With added lolcats for emphasis!)

I'm far from an expert on cloud computing, but there are many questions to be answered before shoving datasets into the cloud or onto a webpage. Issues like discoverability, permanence and trust are all things that data centres can help with.

This is an example of thousand-year-old data that's been preserved very well indeed. Unfortunately we've lost the supporting information and the context that went with it, meaning we've got several different translations with different meanings.
It's not enough to simply store the bits and bytes; we need the context and metadata too.

It's easy enough to stick your dataset on a webpage, but it takes effort to ensure it's all properly documented, and that other people can use it without your input. There are also risks - someone might find errors, or use your work to win funding.
Data centres know that the work involved in preparing a dataset for use by others is essential, and that's why we want to help the data producers and ensure they get credit for it.
Of course, in cases where sharing data is mandatory but the data producer doesn't really want to do it, it's a simple matter of not doing the prep work - then the data's unusable to anyone but its creators.
(The example files in the pictures come from one of my own datasets, before they were put into the BADC with all their metadata and in netCDF. I know what they are, but no one else would...)
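
For the curious, here's a minimal sketch (in Python, using the netCDF4 library) of the sort of self-describing file we aim for. Everything here, from the filename to the attribute values, is invented for illustration; the metadata attributes loosely follow the CF-style conventions used at the BADC:

```python
from netCDF4 import Dataset

# Create a netCDF file that carries its own documentation.
nc = Dataset("rain_gauge_2011.nc", "w")

# Global metadata: the context a future user would otherwise lack.
nc.title = "Daily rain gauge measurements, Example Site, 2011"
nc.institution = "Example Research Group"  # invented for illustration
nc.source = "Tipping-bucket rain gauge, serial no. 1234"
nc.history = "2012-01-05: converted from raw logger files"

# Dimensions and variables, each with units and a human-readable name.
nc.createDimension("time", None)  # unlimited, so new days can be appended
time = nc.createVariable("time", "f8", ("time",))
time.units = "days since 2011-01-01 00:00:00"
rain = nc.createVariable("rainfall", "f4", ("time",))
rain.units = "mm"
rain.long_name = "daily accumulated rainfall"

time[:] = [0, 1, 2]
rain[:] = [0.0, 4.2, 1.5]
nc.close()
```

A file like this answers the "what is it, who made it, and what are the units?" questions on its own, without needing its creator on hand.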

So, we're going to cite data using DOIs, and these are the reasons why. The main ones being: they're commonly used for papers, and scientists are familiar with them.
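
(As an aside, one practical attraction of DOIs is the resolver infrastructure that already exists around them. Here's a hedged sketch in Python: for DOIs registered with Crossref or DataCite, the doi.org resolver supports content negotiation, so you can ask for a formatted citation rather than the landing page. The DOI below is the real paper DOI from the Ioannidis reference elsewhere on this blog; a registered dataset DOI would work the same way.)

```python
import requests

# Content negotiation against the DOI resolver: ask for a formatted
# citation instead of being redirected to the landing page.
doi = "10.1371/journal.pmed.0020124"  # a real paper DOI, cited elsewhere on this blog
resp = requests.get(
    "https://doi.org/" + doi,
    headers={"Accept": "text/x-bibliography; style=apa"},
)
print(resp.text)  # e.g. "Ioannidis, J. P. A. (2005). Why most published ..."
```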

Now we're getting into the detail. These are our rules about what sort of data we can/will cite. Note that these are self-imposed rules, and we're being pretty strict about them. That's because we want a DOI-ed dataset to be something worth having.

Serving data is data centres' day job - we take it in from scientists and make it available to other interested parties.
The data citation project is working on a method of citing data using DOIs - which will give the dataset our "data centre stamp of approval", meaning we think it's of good technical quality and we commit to keeping it indefinitely.
The scientific quality of a dataset has to be evaluated by peer review by scientists in the same domain. That's going to be a tricky job, and we're partnering up with academic publishers to work further on this.

Data Publication, with associated scientific peer review would be good for science as a whole, and also good for the data producers. It would allow us to test the conclusions published in the literature, and provide a more complete scientific record.

Of course, publishing data can't really be done in the traditional academic journal way. We need to take advantage of all these new technologies.

We're not the first to think of this - data journals already exist, and more are on the horizon. There does seem to be a groundswell of opinion that data is becoming more and more important, and citation and publication of data are key.

This pretty much sums up the situation with the project at the moment. At the end of this phase, all the NERC data centres will have at least one dataset in their archive with an associated DOI, and we'll have published guideline documents for the data centres and data producers about the requirements for a dataset to be assigned a DOI.
Users are coming to us and asking for DOIs, and we're hoping to get more scientists interested in them. We're also working with the journals that have expressed an interest in data publication, encouraging them to mandate dataset citation in their papers too.
I really do feel like we're gathering momentum on this!





Thursday, 2 February 2012

JISC Grant Funding 01/12: Digital Infrastructure Programme

JISC have announced their latest Managing Research Data call. Of particular interest (to me, anyway) is:


Managing Research Data: Innovative Data Publication

Projects to design and implement innovative technical models and organisational partnerships to encourage and enable publication of research data.

Total funding of up to £320,000 for 2-4 projects of between £80,000 and £150,000 per project.
Jun 2012 – Jul 2013.

Closing date is 12:00 noon UK time on 16 March 2012. More details here.

Friday, 16 December 2011

IDCC 2011 - notes from day 1 plenary talks

The SS Great Britain, location of the opening reception

There were some absolutely amazing speakers at IDCC11, and I'd heartily encourage you to go and watch the videos that were made of the event. Below are the take-home messages I scribbled down in my notebook.

[Anything in square brackets and italics is my own comment/thought]

Opening Keynote by Ewan McIntosh (NoTosh)
Ewan started off by challenging us to be problem finders, rather than problem solvers, as that's where innovations are really made: by finding a problem and then solving it. There's a lot of stuff out there that just doesn't work, because it's not got a problem to solve.

Scientists have to be careful - taking too much time to make sure that the data's correct can mean that we sit on it until it becomes useless. Communication of the data is as important as the data itself.

Even open data isn't really open, because people can't use it. Note that "open" does not mean "free".

Ewan went into a school where the kids were having problems listening and talking. And he got them to put on their very own TEDx event. A load of 7-8 year olds watched a lot of TED talks, and then they presented their own. [The photos from this event were amazing!]

He said that we've got to look at the impact of our data in the real world, and if we're not enthusiastic about what we're doing, no one else will be. Media literacy is also in the eye of the beholder.

He left us with some challenges for how we deal with data and science:
1. Tell a story
2. Create curiosity
3. Create wonder
4. Find a user pain (and solve it)
5. Create a reason to trade data

[I'm very pleased that the last two points are being addressed by the whole data citation thing I've been working on!]

David Lynn (Wellcome Trust)
The Wellcome Trust has a data management and sharing policy that was published in January 2007. In it, researchers are required to maximise access to data and produce a data management plan, while the Trust commits to meeting the costs of data sharing.

David's key challenges for data sharing were:

  • Infrastructure
  • Cultural (including incentives and recognition)
  • Technical
  • Professional (including training and career development of data specialists [hear hear!])
  • Ethical

Jeff Haywood (University of Edinburgh)
The University's mission: the creation, dissemination and curation of knowledge. 

One example is the Tobar an Dualchais site, which hosts an archive of video, audio, text and images of Scottish songs and stories from the 1930s on.

But to do data management, there need to be incentives - something of value for researchers at every level.

Herding cats is easy - put fish at the end of the room where you want them to go!

Internal pressure from researchers came first. They wanted storage, which is a different problem from research data management.

Edinburgh's policy is that responsibility for research data management lies primarily with the PIs. New research proposals have to be accompanied by data management plans. The university will archive stuff that is important, and that funders/other repositories won't/can't.

One of their solutions is Dropbox-like storage, which is also easily accessible from off-site and for collaborators.

Andrew Charlesworth (University of Bristol)
Focusing on the legal aspects of data.

People are interested in the workflows/processes/methodologies in science as well as the data.

There are legal implications of releasing data, including data protection, confidentiality, IPR etc...

Leaving safe storage to researchers over long periods of time is problematic: people leave, technology changes, and there are issues around security for personal data, FOI requests, and data deletion/ownership...

Most legal and ethical problems arise because of:
  • lack of control (ownership)
  • lack of metadata
  • poor understanding of legal/ethical issues
  • not adjusting policies to new circumstances
  • lack of sanction (where do consequences of data loss/breach/misuse fall?)

We can't just open data; we have to put it into context.

We want to avoid undue legalisation, so use risk assessments rather than blanket rules.

Institutions and researchers should be prepared for FOI requests.

"Avoiding catching today's hot potatoes with the oven gloves of yesterday."

Mark Hahnel (FigShare)
"Scientists are egomaniacs...but it's not their fault."

We could leverage altmetrics on top of normal metrics to get extra information.

The new FigShare website will be released in January. Datasets on it are released under CC0, while everything else is CC-BY. Stuff put on the FigShare site can be cited using a DOI.

A fileset is anything that has more than one file in it.

Victoria Stodden (Columbia University)
Talking about reproducible research

"Without code, you don't have data." Open code is part of open data. Reproducability scopes what to share and how.

[I got a bit confused during her talk, until I realised that code doesn't just mean computer code, but all the workflows associated with producing a scientific result]

Scientific culture should be shaped so that scientific knowledge doesn't dissipate. Reproducibility requires tools, infrastructure and incentives [and in the case of observational data, a time machine].

Many deep intellectual contributions are only captured in code - hence it's difficult to access these implementations without the code.

Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124

Heather Piwowar (DataOne)
Science is based on "standing on the shoulders of giants" - but "building broad shoulders is hard work" and it doesn't help you become top dog.

Researchers overwhelmingly agree that sharing data is the right thing to do and that they'll get more citations.

We need to facilitate the deep recognition of the labour of dataset creation, and encourage researchers to have CV sections for data and code.

There is a place for quick and dirty solutions.

We have a big problem in that citation info is often behind paywalls - we need open bibliography. Moreover, we need open access to full text, as a citation alone doesn't tell us whether the dataset was critiqued or not. We also need access to other metrics, like repository download stats.

Call to action!

  • Raise our expectations about what we can mash up, and our roles
  • Raise our voices
  • Get excited and make things! [I like this one!]

A future where what kind of impact something makes is as important as how much impact it makes.

[Heather very kindly has made all of her presentation notes available on her blog.]

Thursday, 15 December 2011

Link roundup

Blog posts:

The Skinny on Data Publication - "It turns out data publication is similar to data management: no one is against the concept per se, but they are against all of the work, angst, and effort involved in making it a reality."

Save Scholarly Ideas, Not the Publishing Industry (a rant) - "The scholarly publishing industry used to offer a service. It used to be about making sure that knowledge was shared as broadly as possible to those who would find it valuable using the available means of distribution: packaged paper objects shipped through mail to libraries and individuals. It made a profit off of serving an audience. These days, the scholarly publishing industry operates as a gatekeeper, driven more by profits than by the desire to share information as widely as possible. It stopped innovating and started resting on its laurels."

My Data Management Plan -a satire - "When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter."

altmetrics: a manifesto - "No one can read everything. We rely on filters to make sense of the scholarly literature, but the narrow, traditional filters are being swamped. However, the growth of new, online scholarly tools allows us to make new filters; these altmetrics reflect the broad, rapid impact of scholarship in this burgeoning ecosystem. We call for more tools and research based on altmetrics."

Papers

Systematic documentation and analysis of human genetic variation in hemoglobinopathies using the microattribution approach, Giardine et al. Nature Genetics 43, 295–301 (2011) doi:10.1038/ng.785

On the utility of identification schemes for digital earth science data: an assessment and recommendations, Duerr et al. Earth Science Informatics, Springer-Verlag, July 2011, doi:10.1007/s12145-011-0083-6

Data Reviews, peer-reviewed research data. Marjan Grootveld and Jeff van Egmond (editors). DANS studies in Digital Archiving 5. Data Archiving and Networked Services (DANS) - 2011. ISBN 978-94-90531-07-2.

Services

Cite my Data - "The ANDS Cite My Data service will allow research organisations to assign Digital Object Identifiers (DOIs) to research datasets or collections."

total-impact.org - "Create a collection of research objects you want to track. We'll provide you a report of the total impact of this collection."

figshare.com - "Scientific publishing as it stands is an inefficient way to do science on a global scale. A lot of time and money is being wasted by groups around the world duplicating research that has already been carried out. FigShare allows you to share all of your data, negative results and unpublished figures. In doing this, other researchers will not duplicate the work, but instead may publish with your previously wasted figures, or offer collaboration opportunities and feedback on preprint figures."

Tuesday, 13 December 2011

Report from IDCC 2011 - Data for Impact workshop


with thanks to www.phdcomics.com


I spent most of last week in Bristol at the 7th International Digital Curation Conference, and had a grand old time talking about data and citations. The first thing I went to was a workshop entitled "Data for Impact: Can research assessment create effective incentives for best practice in data sharing?"


The short answer to this is, yes, but...


There's no denying that the Research Excellence Framework ("REF", for short) impacts on how research is disseminated in this country. An example was given: engineers typically publish their work in conference proceedings that are very well refereed and very competitive, with high impact in the field, internationally. But because these conference proceedings weren't counted in the RAE, the message came back to the engineering departments that they had to publish in high-impact journals. So the engineers duly did, with the net result that this (badly) damaged their international standing.


There's the double whammy too, that the REF is essentially a data collection exercise, and the universities put a lot of time and effort into it - but there's no data strategy associated with the REF, and data isn't a part of it!


The REF is very concerned with publications (the number that got mentioned was that publications form 65% of the return), so we had a lot of discussion on how we could piggy-back on publications, and essentially produce "data publications" to get data counted in the REF. (Which is what I'm trying to do at the moment...)


Leaving aside the question of why we're piggy-backing on a centuries-old mechanism for publicising scientific work (i.e. journals) when we could be taking advantage of this cool new technology to create other solutions, there are other issues associated with this. Sure, we can assign DOIs to all the data we can think of (in suitable, stable repositories, of course), but that doesn't mean they'll be properly cited in the literature. People aren't used to citing data, they haven't understood the benefits of it, and, perhaps most importantly, the metrics aren't there to track data citation!


We talked a fair bit about metrics, specifically altmetrics, as a way of quantifying the impact of a particular piece of work (whether data or not). These haven't really gained much ground when it comes to the REF, mainly, I suspect, because they lack a critical mass of users, though it is early days. There's some really interesting stuff there, and I for one will be heading over to total-impact.org and figshare.com in the not too distant future to play with what they've been doing.


If we could convince the REF to count data, either as a separate research output, or even as a publication type, then that would be excellent. Sure, there were concerns that if data was a publication type, then it would be ignored in favour of high-impact journal publications (why count your dataset when you've got multiple Nature papers and four slots to report publications in?) but it could make life better for those researchers who never get a Nature paper, because they're so busy looking after their data.


I suspect, though, that it's too late to get data into the next REF in 2014 - but maybe the one after that? Time to start lobbying the high-up people who make those sorts of decisions!

Tuesday, 22 November 2011

What is a dataset, when you get down to it?

Picture by me and PowerPoint.
It's possible to spend a lot of time arguing about what a dataset actually is (and believe me, plenty of people have, myself included!).

I don't have a definitive answer, but for myself, I tend to default to the idea of what's scientifically meaningful as a dataset. For example, a single peak flood measurement at a certain place for a given year could count as a dataset, but a single rain gauge measurement from a site where the gauge has been in place for years wouldn't. And of course, it all depends on the scientific domain as well.

Sometimes projects can act as a convenient guide - if a project was run for x years and provided a wodge of data, then that data can be packaged up as a project dataset. Sometimes a dataset can be all the data resulting from a given instrument for a given period of operation. The important thing is that common sense needs to be applied to how "thinly sliced" a dataset should be. I really don't want to see the concept of minimum publishable unit applied to data, thankyouverymuch!

An analogy that I tend to use a lot is a book. Like a book, a citeable (DOI-able) dataset should be easily identifiable, stable and complete, and should (hopefully) have enough information in it that you can understand what it's all about without having to refer to (too many) other sources. Yes, the dataset can be structured so that you can refer to parts of it easily (the chapter-and-verse analogy), but that doesn't mean every single segment of the dataset should have its own DOI (or that each verse in a book should be published independently under its own cover).

My completely off-the-cuff and not entirely serious example of how you'd go about referencing segments of a particular dataset is:
  • Honeydew, B., Beaker, Gonzo, T.G., Years 2001, 2005 and 2009 from “Statistics of egg-laying in Pitch Perfect Poultry, 2000-2010”, doi:10.12345/abcdefg.
Of course, datasets are more than books, and there's lots of different ways of slicing and dicing them to produce scientifically meaningful datasets. At the moment, because we're in the early stages of assigning DOIs to our hosted datasets, we're pretty much making a decision on a case by case basis, in the hopes that some general guidelines will surface along the way. (Thankfully, they do seem to be.)

One idea that quickly got assigned to the "not now - tricky" pile is the notion that users might want at some stage to effectively create a new derived dataset which is made up of smaller bits of other people's datasets, and would then want to cite this derived dataset as a whole. This "user-defined" citation would save space in the valuable real estate of a paper's references, and would provide a link to a list of the other sub-citations, in a format that was both human and machine readable. Provided that each of the sub-citations allowed you to easily and accurately get to the relevant sections of the other datasets, then the derived dataset would count as a citeable object.
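
To make the idea concrete, here's a rough sketch (plain Python; every name and DOI in it is invented) of what such a machine-readable derived-dataset citation might look like - a compiled record whose sub-citations each point at a precise slice of someone else's dataset:

```python
# A hypothetical structure for a "user-defined" derived-dataset citation.
derived_dataset = {
    "compiler": "Honeydew, B.",  # a compiler/editor role, not an author
    "title": "Derived poultry statistics",  # invented for illustration
    "sub_citations": [
        {
            "doi": "10.12345/abcdefg",  # the made-up DOI from the example above
            "subset": {"years": [2001, 2005, 2009]},
        },
        {
            "doi": "10.12345/hijklmn",  # another invented DOI
            "subset": {"station": "site-42"},
        },
    ],
}

# A human-readable rendering for a paper's reference list...
for sub in derived_dataset["sub_citations"]:
    print("doi:{}, subset: {}".format(sub["doi"], sub["subset"]))
# ...while the same record stays machine-readable as JSON.
```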

This is achievable now - the technologies are ready and mature - but it rapidly gets tricky when you start thinking about the roles involved and how to assign credit: the author of this new derived dataset is not so much an author as a compiler or editor, for example.

Hierarchies of dataset citations aren't so problematic. For example, we've already made the decision that for large datasets which are continually modified by appending new files (such as the rain gauge measurements mentioned above, where files are created on a daily basis), we can assign a DOI to a given period's worth of data at a time. For the rain gauge measurements, it's convenient and sensible to assign a DOI to each year's worth of data once the year's complete, and then, when the rain gauge is moved or otherwise taken out of service, to give the entire time series one DOI.
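
In code terms, that granularity decision might look something like the following rough sketch. Note that the mint_doi function is entirely hypothetical - in practice, DOI registration goes through an agency such as DataCite - and the filenames are invented:

```python
from collections import defaultdict
from datetime import date

def mint_doi(description):
    """Hypothetical stand-in for registering a DOI with an agency
    such as DataCite; returns an invented identifier."""
    return "10.12345/{}".format(abs(hash(description)) % 10**7)

# Daily rain gauge files, keyed by observation date (invented examples).
daily_files = {
    date(2010, 1, 1): "rg_20100101.nc",
    date(2010, 7, 2): "rg_20100702.nc",
    date(2011, 3, 15): "rg_20110315.nc",
}

# Group the appended daily files into candidate annual datasets...
by_year = defaultdict(list)
for d, filename in daily_files.items():
    by_year[d.year].append(filename)

# ...and only mint a DOI once a year is complete, i.e. frozen.
current_year = 2012
annual_dois = {year: mint_doi("rain gauge data, {}".format(year))
               for year in by_year if year < current_year}

# When the gauge is finally retired, the whole series gets one DOI too.
series_doi = mint_doi("rain gauge data, full series, 2010-2011")
```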

Citation is actually a really good prod for us, encouraging us to crystallize our thinking about what a dataset is and how to deal with it. It's all too easy for fuzzy datasets to be random piles of files, or entries in a database table, with no defined rules on where their edges are. I don't have the answers, but I do feel like we're getting close to at least some of them!
____________________________________
A lot of the thoughts in this post came about after conversations with the many people involved in various citation workshops/projects etc, including, but not limited to, my co-workers in the NERC SIS data citation and publication project and the CODATA Task Group on Data Citation. Thanks are due to them all! (I'm sure I'll be repeating that lots in future posts too!)