Citing Bytes - Adventures in Data Citation: December 2011

Friday 16 December 2011

IDCC 2011 - notes from day 1 plenary talks

The SS Great Britain, location of the opening reception

There were some absolutely amazing speakers at IDCC11, and I'd heartily encourage you to go and watch the videos that were made of the event. Below are the take-home messages I scribbled down in my notebook.

[Anything in square brackets and italics is my own comment or thought]

Opening Keynote by Ewan McIntosh (NoTosh)
Ewan started off by challenging us to be problem finders, rather than problem solvers, as that's where the innovations are really made, by finding a problem and then solving it. There's a lot of stuff out there that just doesn't work, because it's not got a problem to solve.

Scientists have to be careful - taking too much time to make sure that the data's correct can mean that we sit on it until it becomes useless. Communication of the data is as important as the data itself.

Even open data isn't really open, because people can't use it. Note that "open" does not mean "free".

Ewan went into a school where the kids were having problems listening and talking. And he got them to put on their very own TEDx event. A load of 7-8 year olds watched a lot of TED talks, and then they presented their own. [The photos from this event were amazing!]

He said that we've got to look at the impact of our data in the real world, and if we're not enthusiastic about what we're doing, no one else will be. Media literacy is also in the eye of the beholder.

He left us with some challenges for how we deal with data and science:
1. Tell a story
2. Create curiosity
3. Create wonder
4. Find a user pain (and solve it)
5. Create a reason to trade data

[I'm very pleased that the last two points are being addressed by the whole data citation thing I've been working on!]

David Lynn (Wellcome Trust)
The Wellcome Trust has a data management and sharing policy that was published in January 2007. In it, researchers are required to maximise access to data and produce a data management plan, while the Trust commits to meet the costs of data sharing.

David's key challenges for data sharing were:

Infrastructure
Cultural (including incentives and recognition)
Technical
Professional (including training and career development of data specialists [hear hear!])
Ethical

Jeff Haywood (University of Edinburgh)

The University's mission: the creation, dissemination and curation of knowledge.

For example the Tobar an Dualchais site, which hosts an archive of video, audio, text and images of Scottish songs and stories from the 1930s on.

But to do data management, there need to be incentives, something of value for researchers at every level.

Herding cats is easy - put fish at the end of the room where you want them to go!

Internal pressure from researchers came first. They wanted storage, which is a different problem from research data management.

Edinburgh's policy is that responsibility for research data management lies primarily with the PIs. New research proposals have to be accompanied by data management plans. The university will archive stuff that is important, and that funders/other repositories won't/can't.

One of their solutions is drop-box-like storage, which is also easily accessible from off-site and for collaborators.

Andrew Charlesworth (University of Bristol)

Focusing on the legal aspects of data.

People are interested in the workflows/processes/methodologies in science as well as the data.

There are legal implications of releasing data, including data protection, confidentiality, IPR, etc.

Leaving safe storage to researchers over long periods of time is problematic: people leave, technology changes, and there are issues around security for personal data, FOI requests, and data deletion/ownership.

Most legal and ethical problems arise because of:

lack of control (ownership)
lack of metadata
poor understanding of legal/ethical issues
not adjusting policies to new circumstances
lack of sanction (where do consequences of data loss/breach/misuse fall?)

We can't just open data, we have to put it into context.

We want to avoid undue legalisation, so use risk assessments rather than blanket rules.

Institutions and researchers should be prepared for FOI requests.

"Avoiding catching today's hot potatoes with the oven gloves of yesterday."

Mark Hahnel (FigShare)

"Scientists are egomaniacs...but it's not their fault."

We could leverage altmetrics on top of normal metrics to get extra information.

The new FigShare website will be released in January. Datasets on it are released under CC0, while everything else is CC-BY. Stuff put on the FigShare site can be cited using a DOI.

Filesets are anything that has more than one file in it.

Victoria Stodden (Columbia University)

Talking about reproducible research

"Without code, you don't have data." Open code is part of open data. Reproducibility scopes what to share and how.

[I got a bit confused during her talk, until I realised that code doesn't just mean computer code, but all the workflows associated with producing a scientific result]

Scientific culture should be made so that scientific knowledge doesn't dissipate. Reproducibility requires tools, infrastructure and incentives [and in the case of observational data, a time machine]

Many deep intellectual contributions are only captured in code - hence it's difficult to access these implementations without the code.

Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124

Heather Piwowar (DataONE)
Science is based on "standing on the shoulders of giants" - but "building broad shoulders is hard work" and it doesn't help you become top dog.

Researchers overwhelmingly agree that sharing data is the right thing to do and that they'll get more citations.

We need to facilitate the deep recognition of the labour of dataset creation, and encourage researchers to have CV sections for data and code.

There is a place for quick and dirty solutions.

We have a big problem in that citation info is often behind paywalls - we need open bibliography. Moreover, we need open access to full text, as a citation alone doesn't tell us if the dataset was critiqued or not. We also need access to other metrics, like repository download stats.

Call to action!

Raise our expectations about what we can mash up, and our roles
Raise our voices
Get excited and make things! [I like this one!]

A future where what kind of impact something makes is as important as how much impact it makes.

[Heather very kindly has made all of her presentation notes available on her blog.]

Thursday 15 December 2011

Link roundup

Blog posts:

The Skinny on Data Publication - "It turns out data publication is similar to data management: no one is against the concept per se, but they are against all of the work, angst, and effort involved in making it a reality."

Save Scholarly Ideas, Not the Publishing Industry (a rant) - "The scholarly publishing industry used to offer a service. It used to be about making sure that knowledge was shared as broadly as possible to those who would find it valuable using the available means of distribution: packaged paper objects shipped through mail to libraries and individuals. It made a profit off of serving an audience. These days, the scholarly publishing industry operates as a gatekeeper, driven more by profits than by the desire to share information as widely as possible. It stopped innovating and started resting on its laurels."

My Data Management Plan - a satire - "When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter."

altmetrics: a manifesto - "No one can read everything. We rely on filters to make sense of the scholarly literature, but the narrow, traditional filters are being swamped. However, the growth of new, online scholarly tools allows us to make new filters; these altmetrics reflect the broad, rapid impact of scholarship in this burgeoning ecosystem. We call for more tools and research based on altmetrics."

Papers

Systematic documentation and analysis of human genetic variation in hemoglobinopathies using the microattribution approach, Giardine et al., Nature Genetics 43, 295–301 (2011), doi:10.1038/ng.785

On the utility of identification schemes for digital earth science data: an assessment and recommendations, Duerr et al., Earth Science Informatics, Springer-Verlag, July 2011, doi:10.1007/s12145-011-0083-6

Data Reviews, peer-reviewed research data. Marjan Grootveld and Jeff van Egmond (editors). DANS studies in Digital Archiving 5. Data Archiving and Networked Services (DANS) - 2011. ISBN 978-94-90531-07-2.

Services

Cite my Data - "The ANDS Cite My Data service will allow research organisations to assign Digital Object Identifiers (DOIs) to research datasets or collections."

total-impact.org - "Create a collection of research objects you want to track. We'll provide you a report of the total impact of this collection."

figshare.com - "Scientific publishing as it stands is an inefficient way to do science on a global scale. A lot of time and money is being wasted by groups around the world duplicating research that has already been carried out. FigShare allows you to share all of your data, negative results and unpublished figures. In doing this, other researchers will not duplicate the work, but instead may publish with your previously wasted figures, or offer collaboration opportunities and feedback on preprint figures."

Tuesday 13 December 2011

Report from IDCC 2011 - Data for Impact workshop

with thanks to www.phdcomics.com

I spent most of last week in Bristol at the 7th International Digital Curation Conference, and had a grand old time talking about data and citations. The first thing I went to was a workshop entitled "Data for Impact: Can research assessment create effective incentives for best practice in data sharing?"

The short answer to this is, yes, but...

There's no denying that the Research Excellence Framework ("REF", for short) impacts on how research is disseminated in this country. An example was given: engineers typically publish their work in conference proceedings that are very well refereed and very competitive, with high impact in the field, internationally. But because these conference proceedings weren't counted in the RAE (the Research Assessment Exercise, the REF's predecessor), the message came back to the engineering departments that they had to publish in high impact journals. So the engineers duly did, with the net result that this (badly) impacted their international standing.

There's the double whammy too, that the REF is essentially a data collection exercise, and the universities put a lot of time and effort into it - but there's no data strategy associated with the REF, and data isn't a part of it!

The REF is very concerned with publications (the number that got mentioned was that publications form 65% of the return), so we had a lot of discussion on how we could piggy-back on publications, and essentially produce "data publications" to get data counted in the REF. (Which is what I'm trying to do at the moment...)

Leaving aside the question of why we're piggy-backing on a centuries-old mechanism for publicizing scientific work (i.e. journals) when we could be taking advantage of this cool new technology to create other solutions, there are other issues associated with this. Sure, we can assign DOIs to all the data we can think of (in suitable, stable repositories, of course), but that doesn't mean they'll be properly cited in the literature. People aren't used to citing data, they haven't understood the benefits of it, and, perhaps most importantly, the metrics aren't there to track data citation!

We talked a fair bit about metrics, specifically altmetrics, as a way of quantifying the impact of a particular piece of work (whether data or not). These haven't really gained any ground when it comes to the REF, mainly, I suspect, because they lack a critical mass of users, though it is early days. There's some really interesting stuff, and I for one will be heading over to total-impact.org and figshare.com in the not too distant future to play with what they've been doing over there.

If we could convince the REF to count data, either as a separate research output, or even as a publication type, then that would be excellent. Sure, there were concerns that if data was a publication type, then it would be ignored in favour of high-impact journal publications (why count your dataset when you've got multiple Nature papers and four slots to report publications in?) but it could make life better for those researchers who never get a Nature paper, because they're so busy looking after their data.

I suspect though that it's too late to get data into the next REF in 2014, but maybe the one after that? Time to start lobbying the high-up people who make those sorts of decisions!