Friday, 16 December 2011

IDCC 2011 - notes from day 1 plenary talks

The SS Great Britain, location of the opening reception

There were some absolutely amazing speakers at IDCC11, and I'd heartily encourage you to go and watch the videos that were made of the event. Below are the take-home messages I scribbled down in my notebook.

[Anything in square brackets and italics is my own comment/thought]

Opening Keynote by Ewan McIntosh (NoTosh)
Ewan started off by challenging us to be problem finders rather than problem solvers, as that's where innovations are really made: by finding a problem and then solving it. There's a lot of stuff out there that just doesn't work, because it's not got a problem to solve.

Scientists have to be careful - taking too much time to make sure that the data's correct can mean that we sit on it until it becomes useless. Communication of the data is as important as the data itself.

Even open data isn't really open, because people can't use it. Note that "open" does not mean "free".

Ewan went into a school where the kids were having problems listening and talking. And he got them to put on their very own TEDx event. A load of 7-8 year olds watched a lot of TED talks, and then they presented their own. [The photos from this event were amazing!]

He said that we've got to look at the impact of our data in the real world, and if we're not enthusiastic about what we're doing, no one else will be. Media literacy is also in the eye of the beholder.

He left us with some challenges for how we deal with data and science:
1. Tell a story
2. Create curiosity
3. Create wonder
4. Find a user pain (and solve it)
5. Create a reason to trade data

[I'm very pleased that the last two points are being addressed by the whole data citation thing I've been working on!]

David Lynn (Wellcome Trust)
The Wellcome Trust has a data management and sharing policy that was published in January 2007. In it, researchers are required to maximise access to data and produce a data management plan, while the Trust commits to meet the costs of data sharing.

David's key challenges for data sharing were:

  • Infrastructure
  • Cultural (including incentives and recognition)
  • Technical
  • Professional (including training and career development of data specialists [hear hear!])
  • Ethical

Jeff Haywood (University of Edinburgh)
The University's mission: the creation, dissemination and curation of knowledge. 

For example, the Tobar an Dualchais site hosts an archive of video, audio, text and images of Scottish songs and stories from the 1930s on.

But to do data management, there need to be incentives - something of value for researchers at every level.

Herding cats is easy - put fish at the end of the room where you want them to go!

Internal pressure from researchers came first. They wanted storage, which is a different problem from research data management.

Edinburgh's policy is that responsibility for research data management lies primarily with the PIs. New research proposals have to be accompanied by data management plans. The university will archive stuff that is important, and that funders/other repositories won't/can't.

One of their solutions is drop-box-like storage, which is also easily accessible from off-site and for collaborators.

Andrew Charlesworth (University of Bristol)
Focusing on the legal aspects of data.

People are interested in the workflows/processes/methodologies in science as well as the data.

There are legal implications of releasing data, including data protection, confidentiality, IPR etc...

Leaving safe storage to researchers over long periods of time is problematic because people leave, technology changes, security for personal data, FOI requests, deleting data/ownership...

Most legal and ethical problems arise because of:
  • lack of control (ownership)
  • lack of metadata
  • poor understanding of legal/ethical issues
  • not adjusting policies to new circumstances
  • lack of sanction (where do consequences of data loss/breach/misuse fall?)

We can't just open data, we have to put it into context.

We want to avoid undue legalisation, so use risk assessments rather than blanket rules.

Institutions and researchers should be prepared for FOI requests.

"Avoiding catching today's hot potatoes with the oven gloves of yesterday."

Mark Hahnel (FigShare)
"Scientists are egomaniacs...but it's not their fault."

We could leverage altmetrics on top of normal metrics to get extra information.

The new FigShare website will be released in January.  Datasets on it are released under CC0, while everything else is CC-BY. Stuff put on the FigShare site can be cited using a DOI.

Filesets are anything that has more than one file in it.

Victoria Stodden (Columbia University)
Talking about reproducible research

"Without code, you don't have data." Open code is part of open data. Reproducibility scopes what to share and how.

[I got a bit confused during her talk, until I realised that code doesn't just mean computer code, but all the workflows associated with producing a scientific result]

Scientific culture should be shaped so that scientific knowledge doesn't dissipate. Reproducibility requires tools, infrastructure and incentives [and in the case of observational data, a time machine]

Many deep intellectual contributions are only captured in code - hence it's difficult to access these implementations without the code.

Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124

Heather Piwowar (DataOne)
Science is based on "standing on the shoulders of giants" - but "building broad shoulders is hard work" and it doesn't help you become top dog.

Researchers overwhelmingly agree that sharing data is the right thing to do and that they'll get more citations.

We need to facilitate deep recognition of the labour of dataset creation, and encourage researchers to have CV sections for data and code.

There is a place for quick and dirty solutions.

We have a big problem in that citation info is often behind paywalls - we need open bibliography. More than that, we need open access to full text, as a citation alone doesn't tell us whether the dataset was critiqued or not. We also need access to other metrics, like repository download stats.

Call to action!

  • Raise our expectations about what we can mash up, and about our roles
  • Raise our voices
  • Get excited and make things! [I like this one!]

A future where what kind of impact something makes is as important as how much impact it makes.

[Heather very kindly has made all of her presentation notes available on her blog.]

Thursday, 15 December 2011

Link roundup

Blog posts:

The Skinny on Data Publication - "It turns out data publication is similar to data management: no one is against the concept per se, but they are against all of the work, angst, and effort involved in making it a reality."

Save Scholarly Ideas, Not the Publishing Industry (a rant) - "The scholarly publishing industry used to offer a service. It used to be about making sure that knowledge was shared as broadly as possible to those who would find it valuable using the available means of distribution: packaged paper objects shipped through mail to libraries and individuals. It made a profit off of serving an audience. These days, the scholarly publishing industry operates as a gatekeeper, driven more by profits than by the desire to share information as widely as possible. It stopped innovating and started resting on its laurels."

My Data Management Plan - a satire - "When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter."

altmetrics: a manifesto - "No one can read everything. We rely on filters to make sense of the scholarly literature, but the narrow, traditional filters are being swamped. However, the growth of new, online scholarly tools allows us to make new filters; these altmetrics reflect the broad, rapid impact of scholarship in this burgeoning ecosystem. We call for more tools and research based on altmetrics."


Systematic documentation and analysis of human genetic variation in hemoglobinopathies using the microattribution approach, Giardine et al., Nature Genetics 43, 295–301 (2011), doi:10.1038/ng.785

On the utility of identification schemes for digital earth science data: an assessment and recommendations, Duerr et al., Earth Science Informatics, Springer-Verlag, July 2011, doi:10.1007/s12145-011-0083-6

Data Reviews, peer-reviewed research data. Marjan Grootveld and Jeff van Egmond (editors). DANS studies in Digital Archiving 5. Data Archiving and Networked Services (DANS) - 2011. ISBN 978-94-90531-07-2.


Cite my Data - "The ANDS Cite My Data service will allow research organisations to assign Digital Object Identifiers (DOIs) to research datasets or collections."

"Create a collection of research objects you want to track. We'll provide you a report of the total impact of this collection."

"Scientific publishing as it stands is an inefficient way to do science on a global scale. A lot of time and money is being wasted by groups around the world duplicating research that has already been carried out. FigShare allows you to share all of your data, negative results and unpublished figures. In doing this, other researchers will not duplicate the work, but instead may publish with your previously wasted figures, or offer collaboration opportunities and feedback on preprint figures."

Tuesday, 13 December 2011

Report from IDCC 2011 - Data for Impact workshop


I spent most of last week in Bristol at the 7th International Digital Curation Conference, and had a grand old time talking about data and citations. The first thing I went to was a workshop entitled "Data for Impact: Can research assessment create effective incentives for best practice in data sharing?"

The short answer to this is, yes, but...

There's no denying that the Research Excellence Framework ("REF", for short) impacts on how research is disseminated in this country. An example was given: engineers typically publish their work in conference proceedings that are very well refereed and very competitive, with high impact in the field, internationally. But because these conference proceedings weren't counted in the RAE, the message came back to the engineering departments that they had to publish in high impact journals. So the engineers duly did, with the net result that this (badly) impacted their international standing.

There's the double whammy too, that the REF is essentially a data collection exercise, and the universities put a lot of time and effort into it - but there's no data strategy associated with the REF, and data isn't a part of it!

The REF is very concerned with publications (the number that got mentioned was that publications form 65% of the return), so we had a lot of discussion on how we could piggy-back on publications, and essentially produce "data publications" to get data counted in the REF. (Which is what I'm trying to do at the moment...)

Leaving aside the question of why we're piggy-backing on a centuries-old mechanism for publicizing scientific work  (i.e. journals) when we could be taking advantage of this cool new technology to create other solutions; there are other issues associated with this. Sure, we can assign DOIs to all the data we can think of (in suitable, stable repositories, of course), but that doesn't mean they'll be properly cited in the literature. People aren't used to citing data, they haven't understood the benefits of it, and, perhaps most importantly, the metrics aren't there to track data citation!

We talked a fair bit about metrics, specifically altmetrics, as a way of quantifying the impact of a particular piece of work (whether data or not). These haven't really gained any ground when it comes to the REF, mainly, I suspect, because they lack a critical mass of users, though it is early days. There's some really interesting stuff, and I for one will be heading over in the not too distant future to play with what they've been doing.

If we could convince the REF to count data, either as a separate research output, or even as a publication type, then that would be excellent. Sure, there were concerns that if data was a publication type, then it would be ignored in favour of high-impact journal publications (why count your dataset when you've got multiple Nature papers and four slots to report publications in?) but it could make life better for those researchers who never get a Nature paper, because they're so busy looking after their data.

I suspect though that it's too late to get data into the next REF in 2014, but maybe the one after that?  Time to start lobbying the high-up people who make those sorts of decisions!

Tuesday, 22 November 2011

What is a dataset, when you get down to it?

Picture by me and Powerpoint. 
It's possible to spend a lot of time arguing about what a dataset actually is (and believe me, plenty of people have, myself included!)

I don't have a definitive answer, but for myself, I tend to default to the idea of what's scientifically meaningful as a dataset. For example, a single peak flood measurement at a certain place for a given year could count as a dataset, but a single rain gauge measurement from a site where the gauge has been in place for years wouldn't. And of course, it all depends on the scientific domain as well.

Sometimes projects can act as a convenient guide - if a project was run for x years and provided a wodge of data, then that data can be packaged up as a project dataset. Sometimes a dataset can be all the data resulting from a given instrument for a given period of operation. The important thing is that common sense needs to be applied to how "thinly sliced" a dataset should be. I really don't want to see the concept of minimum publishable unit applied to data, thankyouverymuch!

An analogy that I tend to use a lot is a book. Like a book, a citeable (DOI-able) dataset should be easily identifiable, stable, complete and (hopefully) have enough information in it so that you can understand what it's all about, without having to refer to (too many) other sources of information. Yes, the dataset can be structured in such a way that you can refer to parts of it easily (chapter and verse analogy), but it doesn't mean that every single segment of the dataset should have its own DOI (or that each verse in a book should be published independently in its own cover).

My completely off-the-cuff and not entirely serious example of how you'd go about referencing segments of a particular dataset is:
  • Honeydew, B., Beaker, Gonzo, T.G., Years 2001, 2005 and 2009 from “Statistics of egg-laying in Pitch Perfect Poultry, 2000-2010” doi:10.12345/abcdefg.
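Just to make the "chapter and verse" idea concrete, here's a throwaway sketch of how such a subset reference could be assembled - every author, title and DOI here is exactly as made up as in the example above:

```python
def cite_subset(authors, years, title, span, doi):
    """Build a citation string for a subset (here, selected years)
    of a dataset, in the style of the off-the-cuff example above."""
    if len(years) == 1:
        year_list = str(years[0])
    else:
        # "2001, 2005 and 2009" style list of years
        year_list = ", ".join(str(y) for y in years[:-1]) + " and " + str(years[-1])
    return f"{authors}, Years {year_list} from “{title}, {span}” doi:{doi}."

citation = cite_subset(
    "Honeydew, B., Beaker, Gonzo, T.G.",
    [2001, 2005, 2009],
    "Statistics of egg-laying in Pitch Perfect Poultry",
    "2000-2010",
    "10.12345/abcdefg",
)
```

Nothing clever going on - the point is just that a subset citation is mechanically derivable from the parent dataset's citation plus a slice description.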
Of course, datasets are more than books, and there's lots of different ways of slicing and dicing them to produce scientifically meaningful datasets. At the moment, because we're in the early stages of assigning DOIs to our hosted datasets, we're pretty much making a decision on a case by case basis, in the hopes that some general guidelines will surface along the way. (Thankfully, they do seem to be.)

One idea that quickly got assigned to the "not now - tricky" pile is the notion that users might want at some stage to effectively create a new derived dataset which is made up of smaller bits of other people's datasets, and would then want to cite this derived dataset as a whole. This "user-defined" citation would save space in the valuable real estate of a paper's references, and would provide a link to a list of the other sub-citations, in a format that was both human and machine readable. Provided that each of the sub-citations allowed you to easily and accurately get to the relevant sections of the other datasets, then the derived dataset would count as a citeable object.

This is achievable now - the technologies are ready and mature - but it rapidly starts getting tricky when you start thinking about the roles involved and how to assign credit: the author of this new derived dataset is not so much an author as a compiler or editor, for example.
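For what it's worth, a minimal sketch of what such a machine-readable "user-defined" citation might look like - all of the DOIs and field names here are invented for illustration, not any real service's format:

```python
import json

# A derived dataset as a citeable object: one top-level record that
# resolves to a list of sub-citations, each pointing at the relevant
# slice of someone else's dataset. All identifiers are invented.
derived = {
    "doi": "10.12345/derived.001",
    "creator_role": "compiler",  # not "author" - the credit question above
    "parts": [
        {"doi": "10.12345/abcdefg", "subset": "years 2001, 2005 and 2009"},
        {"doi": "10.54321/hijklmn", "subset": "site 42, 2003-2007"},
    ],
}

# Serialised with indentation, the record is both human- and
# machine-readable, which is the property the citation needs.
record = json.dumps(derived, indent=2)
```

The key design point is that each entry in `parts` must be enough, on its own, to get a reader accurately to the relevant section of the source dataset.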

Hierarchies of dataset citations aren't so problematic. For example, we've already made the decision that for large datasets which are continually modified by appending new files (for example, the rain gauge measurements mentioned above, where files are created on a daily basis), we can assign a DOI to a given period's worth of data at a time. For the rain gauge measurements, it's convenient and sensible to assign a DOI to each year's worth of data once the year's complete, and then, when the rain gauge is moved or otherwise taken out of service, to give the entire time series one DOI.
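The year-granule approach could be sketched like so (the DOI prefix and naming scheme here are invented for illustration, not what any data centre actually mints):

```python
from collections import defaultdict
from datetime import date

def assign_year_dois(daily_files, prefix="10.12345/raingauge"):
    """Group daily file dates into one granule per year and give each
    granule its own (invented) DOI; the whole time series gets a single
    DOI once the instrument is retired."""
    granules = defaultdict(list)
    for d in daily_files:
        granules[d.year].append(d)
    # one DOI per completed year's worth of data
    year_dois = {year: f"{prefix}.{year}" for year in sorted(granules)}
    # minted only when the gauge is moved or taken out of service
    series_doi = f"{prefix}.series"
    return year_dois, series_doi

days = [date(2010, 1, 1), date(2010, 6, 15), date(2011, 3, 2)]
year_dois, series_doi = assign_year_dois(days)
```

In other words, the granules are frozen (and therefore citeable) one period at a time, while the series-level DOI only appears once the whole thing is stable.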

Citation is actually a really good prod for us, to encourage us to really crystallize our thinking about what a dataset is, and how to deal with it. It's all too easy to have fuzzy datasets being random piles of files, or entries in a database table, without having defined any rules on where their edges are. I don't have the answers, but I do feel like we're getting close to at least some of them!

A lot of the thoughts in this post came about after conversations with the many people involved in various citation workshops/projects etc., including, but not limited to, my co-workers in the NERC SIS data citation and publication project and the CODATA Task Group on Data Citation. Thanks are due to them all! (I'm sure I'll be repeating that lots in future posts too!)

Monday, 14 November 2011

Why data citation is important - a personal tale.

Way back in the day, when I was a wet-behind-the-ears graduate student, my first proper science job was in pre-processing a large scientific dataset. My job was to convert signal levels received from a satellite (Italsat) radio beacon (at 20, 40 and 50 GHz) into attenuation levels. In other words, convert this:

 to this:

with the eventual aim of producing something like this:

a process which involved 4 major steps, 4 different computer programmes, and 16 intermediate files for each day of measurements. Each month of preprocessed data represented somewhere between a couple of days and a week's worth of effort. It was a job where attention to detail was important, and you really had to know what you were looking at from a scientific perspective.

I started work on this project in 1999. In 2006 (five years after the dataset was finished) we finally got a publication out of it:

Ventouras, S., S. A. Callaghan, and C. L. Wrench (2006), Long-term statistics of tropospheric attenuation from the Ka/U band ITALSAT satellite experiment in the United Kingdom, Radio Sci., 41, RS2007, doi:10.1029/2005RS003252.

It's been cited twice, both times by me. 

We shared our data with another group. They got a publication out of it in 2003, three years before we did. We weren't part of the author list, though I believe we got an acknowledgement. 

A quick Google Scholar search for "Italsat Sparsholt" gives 48 papers which mention Italsat (the satellite) and Sparsholt (the receive station where the data came from), 37 of which weren't written by members of the project team.

But of course, it's citations, not acknowledgements, that count when it comes to measuring how influential your work is.

And yes, I suppose we could have published quicker. But our job was to collect, quality-control and generally make our datasets as good as they possibly could be. And they are good, and they are important, but unfortunately not in a way that's easily measured.

So, that's why I'm pushing so hard for datasets to be accepted as first-class scholarly outputs. I've spent years of my life making a dataset the best it can be, only to be pipped to the post when it comes to publishing, and I have no way of knowing if that work has actually been worthwhile or not. (And no, I'm not bitter, honest!)

Data citation is something I believe in, because I've been there. I've also submitted data to a data centre (and got infuriated with the format requirements and metadata requests). But now, many years down the line, I'm on the data management side of the fence, and I can see how important it is to encourage scientists who produce data to put their data in archives/data centres where it can be properly looked after. Giving them credit through data citation has got to be part of it, at least until the point where science as a whole comes up with a better method for tracking scientific impact and importance!

(All pictures from: S. Ventouras, C.L. Wrench, S.A. Callaghan, “Measurement and analysis of satellite beacon transmissions at frequencies up to 50 GHz. Part 1: Attenuation Statistics and Frequency Scaling of Attenuation Values”, project report for the Radio Communications Agency, September 2000. Updated March 2003.)


As an aside - if anyone needs convincing of the importance of digital archiving and curation - the only way I could get the images above into this blog post was by taking a digital photo of the hard copy of the report. The original files were in a format (.ps) that Windows doesn't seem to like anymore...