Monday, 1 December 2014

2nd Data Management Workshop, University of Cologne, 28-29 November 2014

Cologne cathedral at night (and in the rain)

I was very honoured to be invited as a guest speaker at the 2nd Data Management Workshop, held at the University of Cologne on the 28-29 November 2014. 

It was a very interesting workshop, with many excellent national and international speakers. What I especially liked was its focus on interactions between the attendees - the coffee and lunch breaks were particularly long, which gave everyone the chance to really look at the many posters that had been submitted to the workshop, and talk to the people who were presenting them. The workshop proceedings will also be published as a special issue on data management in ISPRS International Journal of Geo-Information - I'm expecting further details of that to be on the workshop website in due course.

I took about 8 pages of hand-scribbled notes from the talks, so I won't be inflicting them all on you. Instead I'll just pull out the highlights that jumped out at me. The talks themselves were videoed, and will be made available on-line too.

The workshop opened with a pair of presentations from Stefan Winkler-Nees and Brit Redohl, both from the German Research Foundation (DFG), discussing the mechanisms in Germany for funding data management activities. They seemed very keen to receive more applications for data management funding!

Kevin Ashley (Digital Curation Centre) was next, giving an overview of the landscape of data management - highlighting the DCC guidance documents and Jisc's Research Data Spring, as well as the need for good research data management to root out cases of fraud, and aid data reuse. A key quote I jotted down was "Often your data tells stories that your publications do not."

Arnulf Christl (Metaspatial) gave an amusing and informative talk about open source software and what we can learn from it when it comes to open data. He made the very valid point that scientific data should be clearly licensed, as this allows attribution and credit to be given to the creators. He also showed a video, which everyone enjoyed!

Tomi Kauppinen (Aalto University School of Science) spoke about linked data and our need for online tools to visualise and assess data, as well as the fact that linked data makes data, and data about data, machine-processable.

Jane Greenberg (Dryad) gave an overview of the data publishing system in operation at Dryad, their guidance on data citation, and the costs involved in creating the Dryad metadata records. (This discussion of data publication was a theme that kept coming back throughout the workshop.)

Cyril Pommier (French National Institute for Agricultural Research, INRA) gave a talk about the data management difficulties in coupling phenotype with plant genome studies, for studies into crop security, adaptation to climate change, etc. (Being a physicist, a lot of the science went straight over my head, but what I found fascinating was the fact that the data management problems being described were the same ones that we get in atmospheric science, so we may have more in common, from a data management point of view, than not. Which made me think - how many of the solutions are applicable across domains? We need to find out!)

The second day of the workshop kicked off with a pair of archaeological talks. First up was Gerd-Christian Weniger (Neanderthal Museum) talking about making 3D scans of items from the Pleistocene period, including Neanderthal fossils. They use Confluence, which is a business wiki, as their repository software, as it allows easy up- and download of data. These scans, and the high resolution surface scans of rock art and stone tools, allow research to be done without having to travel to where the original tool or fossil is actually held - opening up the artifacts for study by schools and teacher training.

Katie Green (Archaeology Data Service) gave a talk about how the ADS does what it does, touching on their workflows for ingest and data publication (with the journal Internet Archaeology, who are also publishing data papers). She talked about the Jisc project, investigating the value of ADS to the community (a related project looked at the BADC last year) - a synthesis report can be found here.

Marjan Grootveld (Data Archiving and Networked Services) talked about how DANS operates, specifically about their front office - back office model for dealing with researchers, where the front office provide guidance and information, while the back office deal with the technical aspects of storage and preservation. DANS provide training for front office staff, who can be embedded in university libraries and other locations. Another quote that resonated with me was: "Data management planning is more important than the plan".

Wolfram Horstmann (State and University Library of Gottingen) discussed data services and policies from universities, funding bodies and journals. He also contrasted a "post hoc data library", which is strong in service reputation but weak in subject specific expertise, with an "ad hoc data library", which has good subject specific knowledge but often no recurrent funding. Of course, hybrids of these two exist.

And Hans Pfeiffenberger (Alfred Wegener Institute and Earth Systems Science Data) finished off the workshop with a discussion about data publication, giving examples of lessons that were learned from data papers published in ESSD. He also showed us that all these data publication issues are not new - Kepler's laws were based on Tycho Brahe's data and observations, which Kepler only got access to after Brahe's death. ESSD requires authors to describe the provenance of the data, the methods used to create/collect it, the limitations of the data, and provide estimates of the error. Reviewers must look at the data, and assess the consistency of the data and the article.

I'd like to thank the organisers again for inviting me to the workshop - and I hope to visit Cologne again sometime!

Monday, 17 November 2014

The WISE Awards 2014

HRH Princess Anne presenting the RCUK-sponsored Champion award. 

The WISE awards are known as “the Oscars of the Scientific World”. They recognise and celebrate individual women and girls who are ideal role models to inspire the next generation of girls to go into STEM careers, as well as the teachers, careers advisers and women in leadership who support and grow the talent pipeline.

I was honoured to have been asked to be a judge for this year's awards, especially as it meant I could attend the gala dinner where the awards were presented to the winners by HRH Princess Anne. It was a lovely evening, and I got the chance to meet and talk to some amazing and inspiring women in STEM.

Further details of the awards, details of all the nominees, and the list of all the winners can be found on the WISE website. I'd also really like to encourage everyone who reads my blog to seriously think about their friends and colleagues and if they think someone would be a good fit for an award, then please nominate them! 

WISE also held a daytime event at the Southbank Centre that day (Thursday 13th November), entitled “Time for Action: The STEM workforce we want to build for the next 30 years”, where they formally announced the release of their new report “‘Not for people like me?’ Under-represented groups in science, technology and engineering”.

That session was particularly interesting, as it made us think about how we describe ourselves (say, if we were at a speed-dating event). I describe myself by what I do, along with half of the people in the room at the time. The other half described themselves by what personal attributes they have. Job adverts recruiting for STEM posts need to reflect this.

Also in job adverts - the language used in them can be very off-putting for women. Things that are especially off-putting are if the company appears to be "arrogant", if the advert is unclear about what the job actually is, and if there is no salary quoted.

Another point was made that it's not enough to talk about the outputs of a piece of work ("we built a bridge"), but we should also talk about the outcomes ("and this joined a community together"). This resonated with me, because the reason I do science is because I want to change the world for the better, even if only in a little way. (Another motivator is "being an expert", which I admit works for me too!)

All of the points made (and there's many more in the report - well worth a read!) are backed up by full references. The author, Prof Averil Macdonald, really did a good job on making it accessible and readable, while at the same time backing up every assertion she makes.

WISE are pushing to get "1 million more" women into STEM, on the grounds that that number would take the total proportion of women in STEM up to 30%, which is generally accepted as critical mass. It's not going to be easy, but tackling the way STEM is presented will be a good start. As Imran Khan (Chief Executive - British Science Association) said during the panel discussion: "It's not about changing the girls, it's about changing the science". And changing the way that science is taught in schools - we should be teaching the methods of science, not just teaching the facts.

Another session that was really great was the workshop presented by the Institution of Civil Engineers (ICE). It started with the usual statistics and plots, but then went and got a group of five young apprentices to tell their story about how they came to be an apprentice. These girls were amazing! All of them had a non-standard route into their jobs. Common themes included failed exams, not getting the right grades, or having chosen the wrong subjects at school (which makes me even more sure that the way the UK's school and exam system insists on specialising in a limited number of subjects at age fifteen really does cause problems!). It's these types of stories that we need to be publicising. After all, if we want to do it all, we just have to do things a little differently.

So, to sum up!
  1. Read (and share) the report!
  2. And when the call for nominees for next year's awards is out - think about who you could nominate!

Wednesday, 12 November 2014

My donation to the Museum of Curiosity

Goin' nuts with the label maker by Bryan Kennedy, on Flickr

No, unfortunately I haven't been asked to be on the popular Radio 4 show "The Museum of Curiosity". But just in case I ever am, I've already decided what I'd like to donate.

But first, a bit of background on the show. The Museum of Curiosity is a panel comedy show, but with a twist. Instead of funny people being funny, they get in funny people, and experts on all sorts of things (and sometimes those people are one and the same), and they have a bit of a chat with the show presenter about their life and work. And each of the guests then gets to donate something to the Museum.

In the show's own words:
"The idea of the show is to bring together the most interesting people we can find and ask them to submit one item each to fill the Museum's empty plinths"

The seventh season is being broadcast at the moment (you can catch up on the listen again part of the BBC Radio 4 website), and over the seven seasons there have been such weird and wacky things donated as the alphabet, a pubic louse, silence, Father Christmas, nothing and Epping Forest.

So, dear producers of the Museum of Curiosity. If I were ever to be invited to make a donation to the Museum, here's what it'd be:

(drum roll please!)

A Telepathic Label Maker!

(Note, this is a label maker that makes telepathic labels, not a label maker that is telepathic.)

And as for my reason for donating it - well, the Museum has an awful lot of stuff in it already, with more being added to it all the time. And some of the things (like silence, or nothing) are the sort of things that are really hard to identify if you don't know what it is you're looking at. The label maker would allow the curator of the Museum to label everything*, and provide the casual visitor with all the information they'd need to understand the exhibits, and provide credit to the person who donated it in the first place.

As for the telepathic labels, well, that's me thinking ahead. Assuming that the Museum is around for a long, long time (like I'm sure the curator hopes it is), language is going to change, so a label written in present day English (or heaven forbid, jargon!) won't be very useful. A telepathic label will be able to change to address the person (or alien!) who is viewing it in their own preferred language. Plus, it'd be a big saving on translation services, and would draw a lot more visitors in. 

I await your call, Mr Curator!

* I'm deliberately not using the word metadata here (in case it scares off the media types), though that's essentially what I'm talking about.

Thursday, 2 October 2014

The Joint Declaration of Data Citation Principles

If there's anyone who reads my blog, and doesn't know about the Joint Declaration of Data Citation principles by now, I'd be very surprised! But just in case...

The Joint Declaration of Data Citation principles was a real community effort, bringing together a large number of individuals and groups (including the CODATA-ICSTI Task Group on Data Citation Standards and Practices, DataCite, and the Research Data Alliance's Publishing Data Interest and Working groups, among many others) in order to refine, standardise and harmonise the data citation principles that have been previously published.

Simon Hodson, over at the CODATA blog, gives a great overview of the background to the principles and the process that created the harmonised versions.

I'd really encourage everyone to go and endorse the principles - you can do this on a personal level, or even on an institutional one.

Of course, we're not resting on our laurels now that the principles are out there and being endorsed - a follow on group has been created to implement the principles. If you're interested in joining this implementation effort, then please do get in touch with the group leaders!

Friday, 13 June 2014

Link roundup

In no particular order, some interesting stuff that has been cluttering up my browser tabs...

"Sound, reproducible scholarship rests upon a foundation of robust, accessible data. For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record. In other words, data should be considered legitimate, citable products of research. Data citation, like the citation of other evidence and sources, is good research practice and is part of the scholarly ecosystem supporting data reuse.

In support of this assertion, and to encourage good practice, we offer a set of guiding principles for data within scholarly literature, another dataset, or any other research object."

I strongly recommend that everyone with an interest in data citation endorses these principles, either on an individual basis, or on behalf of their organisation!

For Comment: The Role of Publishers in Access to Data

"Call to Action
We envision a future information ecosystem where research data is considered an integral part of scholarly communications. We propose a new metaphor to characterize our vision: a social contract. This contract is an agreement amongst all stakeholders based on shared, governing principles: data should be preserved, discoverable, measured, and integrated into evaluation processes; and data sharing is a fundamental practice. Adherence to this social contract will entail dramatic changes to existing workflows; technologies; and social norms for all the members of the research ecosystem."

"Scientists can be reluctant to share data because of the need to publish journal articles and receive recognition. But what if the data sets were actually a better way of getting credit for your work? Chris Belter measured the impact of a few openly accessible data sets and compared to journal articles in his field. His results provide hard evidence that the production, archival, and sharing of data may actually be a more effective way to contribute to the advancement of scientific knowledge."

DOIs and the danger of data “quality”

"...NERC state “by assigning a DOI the [Environmental Data Centre] are giving it a ‘data center stamp of approval’”. Effectively they see a DOI name (or by implication any other form of Persistent Uniform Resource Locator (PURL)) as a quality check-mark in addition to its role as a reference to an object. Except the DOI system isn’t designed to suggest the “quality goes in before the name goes on”. Just to remind myself, I quickly looked at the International DOI Foundation handbook and it doesn’t mention data quality. Identification, yes. Resolution, yes. Management, yes. Quality, no."

(With a response from myself)

Citation Rates Highlight Uphill Battle for Women in Research Careers

"One of the most important and institutionalized forms of science communication is the peer-reviewed journal article. These articles are essential to disseminating information among researchers in specific fields of study, and the extent to which those journal articles are cited by researchers in later articles is of enormous professional importance to researchers – particularly researchers who work in academic settings. But it appears that many researchers face an uphill battle when it comes to getting citations and related professional benefits. Specifically, researchers who are women."

What’s the Point of Academic Publishing?

"In December 2013, Nobel Prize-winning physicist Peter Higgs made a startling announcement. “Today I wouldn't get an academic job,” he told The Guardian. “It's as simple as that. I don't think I would be regarded as productive enough.”

Higgs noted that quantity, not quality, is the metric by which success in the sciences is measured. Unlike in 1964, when he was hired, scientists are now pressured to churn out as many papers as possible in order to retain their jobs. Had he not been nominated for the Nobel, Higgs says, he would have been fired. His scientific discovery was made possible by his era’s relatively lax publishing norms, which left him time to think, dream, and discover."

Scientists losing data at a rapid rate

"In their parents' attic, in boxes in the garage, or stored on now-defunct floppy disks — these are just some of the inaccessible places in which scientists have admitted to keeping their old research data. Such practices mean that data are being lost to science at a rapid rate, a study has now found.

The authors of the study, which is published today in Current Biology, looked for the data behind 516 ecology papers published between 1991 and 2011. The researchers selected studies that involved measuring characteristics associated with the size and form of plants and animals, something that has been done in the same way for decades. By contacting the authors of the papers, they found that, whereas data for almost all studies published just two years ago were still accessible, the chance of them being so fell by 17% per year. Availability dropped to as little as 20% for research from the early 1990s."

Guidelines / Recommendations for Citing Data

An excellent set of resources from the Virtual Solar Observatory.

The Robot Army of Good Enough

"Pretty much any organization of any size has certain themes, beliefs and outlooks baked into them. Some of them might be obvious from the outside. Others are so inherent that the members might not even notice they’re completely steeped in it.

At the Internet Archive, there’s a philosophy set about access and acceptance of materials and presentation of said materials that’s pretty inherent throughout the engineering and the website. Paraphrased, in my own words, it’s this:

  • Always provide the original.
  • Never ask why a user wants something.
  • Now is better than tomorrow.
  • We can hold it.
  • Many inexpensively is better than one or none luxuriously.
  • Never send a person where a machine can go.
  • Enjoy yourself."

"Being the largest land predator, the fearsome and enigmatic Polar Bear is seen by many as a powerful symbol to highlight of the threats to the environment through global warming. With a new publication on the Polar Bear genome out last week in Cell, they surprisingly are also an impressive example of how far data publication and citation has come in the last few years, and help debunk many of the negative arguments about the early release of datasets in this manner."

How Bitcoin’s Technology Could Revolutionize Intellectual Property Rights

"The bitcoin block chain is well known for its use as a ledger for digital currency transactions, but it has the potential for other, more radical uses too – uses that are only now beginning to be explored.

The online service Proof of Existence is an example of how the power of this new technology can have applications far beyond the world of finance, in this case, giving a glimpse of how bitcoin could one day have a substantial impact in the fields of intellectual property and law.

Although in its initial stages, Proof of Existence can be used to demonstrate document ownership without revealing the information it contains, and to provide proof that a document was authored at a particular time."
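For anyone wondering how you can prove something about a document without revealing it, the usual trick is a cryptographic hash. The snippet below is only a minimal sketch of that general idea, and not a description of how Proof of Existence itself is implemented: the function names and the example document are my own inventions, and the step of recording the digest on a block chain is left out entirely.

```python
import hashlib


def document_digest(content: bytes) -> str:
    """Return the SHA-256 digest of a document as a hex string."""
    return hashlib.sha256(content).hexdigest()


def matches_digest(content: bytes, published_digest: str) -> bool:
    """Check whether a document reproduces a previously published digest."""
    return document_digest(content) == published_digest


# Hypothetical document contents, used only for illustration.
original = b"My unpublished dataset description, version 1"
digest = document_digest(original)

# Only the digest would be made public; the document itself stays private.
print(len(digest))                              # 64 hex characters
print(matches_digest(original, digest))         # True
print(matches_digest(original + b"!", digest))  # False: any change alters the digest
```

Anyone who is later shown the document can recompute the digest and compare it with the published one, which is what makes a timestamped digest useful as evidence.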


"is a free open peer review platform developed by a growing community of volunteer research scholars who envision a new era of openness and transparency in scholarly evaluation and communication. Join us and let’s liberate research together!"

Frontiers for Young Minds

"Frontiers in Neuroscience for Young Minds is a scientific journal that includes young people (from 8 to 15) in the review of articles. This has the double benefit of bringing kids into the world of scientific research – many of them for the first time – and offering active scientists a platform for reaching out to the broadest of all publics.

All articles in Frontiers for Young Minds will be reviewed and approved for publication by young people themselves. Established neuroscientists will mentor these young Review Editors and help them review the manuscript and focus their queries to authors. To avoid overburdening the young Review Editors, revised manuscripts will in turn be reviewed by one of the stellar Associate Editors of Frontiers for Young Minds."

Friday, 2 May 2014

Who owns the data?


So, who does own a dataset, anyway?

Is it the researcher who sets up the instrument and makes the measurement?
Is it the company that built the instrument?
Is it the organisation that operates the instrument (from whom the researcher has bought instrument time)?
Is it the researcher's institution, who employs the researcher to make measurements?
Is it the institution's data repository, who publish the data, or restrict access to it?
Is it the funder whose grant pays the institution for the researcher to make the measurement?
Is it the government, who provides the funder with the budget to hand out grants?
Is it the tax payer, whose taxes fund the government?*

Like so many things in life, the answers to these questions are "well, it depends..."

Ownership is a social construct. I own a car because I have a document in my filing cabinet giving details of the car make and model, saying that I do. This document is also registered in a national database (the DVLA) saying that the car specified is mine. The car itself sits outside my house, and I have the key, which means I can use it, and other people can't without my express permission. If the car gets stolen, it's uniquely registered, so there's a good chance that (barring an experience with a quick respray and fake plates) it'll still be identifiable as mine.

I also have many books. These are mine, because I bought them. But they're not uniquely identified - most don't even have my name written on them, and I don't have a register of them, not even one independently verified by an external body. If a desperate book thief were to come and nick one of my books, well, I'd be very unlikely to get exactly that same volume back again. Yet I still own them, and feel possessive about them.

[Edited to add: my better half points out that if someone steals a book from me, they take away my ability to read that book. If someone steals a digital object, like a dataset, they're stealing a copy, and unless they destroy the original, then it's still available for use by the original owner.]

And that feeling of possession is key to how people react to data. The person who feels the most strongly about the data is the researcher who created it (part of the IKEA effect, that leads to people valuing things that they assemble, customize or build themselves more highly than premade, finished goods**). But an owner of something can have no feelings for it at all, as witnessed by all those paintings locked in a vault somewhere until their value improves.

That's why I think ownership is not a helpful thing to think about when it comes to data. Ownership focuses on possession - who has the data now. With it being so easy to make copies of datasets, many people can be "owners" - i.e. have the dataset in their possession. Ownership for data then becomes about who holds the "one, true dataset", and can then assert rights based on this***. 

As for the responsibilities of owners, well, I may be having a failure of imagination here, but I can't really think of any. I am perfectly within my rights to burn my book without asking anyone's permission (though causing a nuisance to the neighbours with the smoke wouldn't be good). And if someone nicks my car and goes joyriding, I'm not responsible for the damage they do. If I own a dataset, I can delete it, change it, whatever. Other people might want to use it, but tough. I own it. I get to decide what to do with it.

It's better, then, to think more about the other roles involved in data, the roles that have responsibilities as well as rights. Roles like the data creator (the researcher who made the measurement), who is responsible for the contents of the dataset and the supporting information around it, and deserves credit for their work. Roles like data publisher (the data repository and/or library), who is responsible for releasing the data to defined subsets of the population. Roles like data licenser, the party responsible for determining what other parts of the population are allowed access to the dataset, and under what conditions. Roles like data archiver, who decides whether a dataset should still be kept or should be deleted as it's no longer useful.

These roles don't have to be carried out by individuals; institutions are capable of doing them as well. For example, the Unseen University could act as the licenser, corporate author and publisher of data that it holds. Corporate authorship is particularly useful for datasets with large numbers of creators, as it enables credit while keeping the number of names in the citation string to meaningful levels (see as an example the list of volunteers for Galaxy Zoo at - note that the url for the list names them all as authors!).
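For the more code-minded, here's one way the roles-not-ownership idea could be written down. It's only a sketch under my own assumptions: the class and method names are hypothetical, the role descriptions are paraphrased from the paragraphs above, and no real repository necessarily models things this way.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    """Roles around a dataset, each carrying a responsibility."""
    CREATOR = "responsible for the contents of the dataset and its supporting information"
    PUBLISHER = "responsible for releasing the data to defined subsets of the population"
    LICENSER = "decides who is allowed access to the dataset, and under what conditions"
    ARCHIVER = "decides whether the dataset is still kept or is deleted"


@dataclass
class DatasetRecord:
    title: str
    # Maps each role to the people or institutions carrying it out.
    roles: dict[Role, list[str]] = field(default_factory=dict)

    def assign(self, role: Role, party: str) -> None:
        """Record that a person or institution carries out a role."""
        self.roles.setdefault(role, []).append(party)

    def roles_of(self, party: str) -> list[Role]:
        """List every role a given person or institution holds."""
        return [role for role, parties in self.roles.items() if party in parties]


# Hypothetical example: one individual creator, one institution with two roles.
record = DatasetRecord("Example dataset")
record.assign(Role.CREATOR, "A. Researcher")
record.assign(Role.PUBLISHER, "Unseen University")
record.assign(Role.LICENSER, "Unseen University")

print([role.name for role in record.roles_of("Unseen University")])
# ['PUBLISHER', 'LICENSER']
```

Notice that there's no "owner" field anywhere: every entry says who is responsible for what, which is the whole point.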

So, when discussing data, especially with the people who have put weeks, months and years of their life into the datasets they've created, it's a good idea to think about more than ownership of the data. Think and talk about those other roles and responsibilities. That way it becomes less about asserting rights and possessiveness, and more about the data itself.

And, in the future, as data becomes more open, and the mechanisms exist for giving the data creators (and their employers, funders and support staff) the credit they deserve, then hopefully the issue of ownership won't be so much of a problem.

* This happens to be my personal opinion. The results of publicly funded research should be made available for the benefit of all. In other words, open, unless there's a damn good reason not to.
** The proper link to the paper publishing this study is , but it's paywalled.
*** I'm sure I'm missing out all sorts of technical, legal stuff here...

Friday, 24 January 2014

Cite what you use

Poster for "Screen as Landscape" Exhibition at the Stanley Picker Gallery, Kingston University, December 2011, and "Screen as Landscape", Dan Hays, PhD thesis, 2012, Kingston University. (From

What do you cite, the dataset or the data article? Or should it be both?

There's a lot of confusion about this, mainly stemming from the whole notion that the data article is a direct citation substitute (or proxy) for the dataset it describes (which, to be fair, it can be). Citing both the dataset and the data article gives rise to accusations of "salami-slicing" and double accounting, whereas citing only the dataset could be seen as taking citations away from the article (or vice versa).

The way I see it is that the dataset, and its corresponding data article are two separate, though related, things. It's time for another analogy!

Consider the Fine Arts. If you were wanting to do a PhD in the Fine Arts, you would need to produce a Work of Art (or possibly several, depending on your chosen form of Art) and you would also need to write a thesis about that Work, providing information about how you created the Work, why you did it the way you did, the context and reasoning behind it, and all that sort of important background information.

Now, if I was wanting to write a critique of your Work of Art, I could do so without ever reading your thesis. In that case it'd be entirely appropriate to cite the Work, but I'd have no need to cite the thesis.

If, on the other hand, I was wanting to write an article about the history and practice of a technique you used to create your Work of Art, and I read and used information from your thesis to support my argument, then I'd definitely need to cite your thesis. (I could choose to cite the Work of Art as well, in passing, but might not need to. After all, anyone wanting to find out about the Work can read the thesis I've cited and get to it that way. And I'm not actually discussing the Work itself.)

With me so far?

Ok, so the Work of Art is the dataset, and the thesis is the data article. It starts getting a bit murky in the data world, because often there isn't enough contextualising information in the dataset itself to allow it to be used/critiqued/whatever easily, and that information is captured and published in the data article (which is one of the main reasons for having data articles - to make that sort of important information and metadata available!).

Historically, in many disciplines (in the dark days before data citation), important datasets were cited by proxy - i.e. the authors of the dataset published a paper about it, and then others cited that paper as a stand-in for the dataset. The citation counts for that paper then became the citation counts for the dataset, which had the virtue of being simple enough and a valid work-around to the problem of the lack of a common practice of data citation.

But now we have the situation where a dataset can be cited independently from its data article. And we have the following situations:
  1. Both dataset and article are cited. Data creator is very happy (two citations!). Data publisher is happy (citation!). Data article publisher is happy (citation!). Reader of the citing article may not be happy (potential accusations of double counting of citations and salami-slicing...). Publisher of citing article might not be happy (not enough space in reference lists, potentially two citations that look like they're for the same thing).
  2. Only the dataset is cited. Data creator is happy (citation!). Data publisher is happy (citation!). Data article publisher is not happy (though might be mollified by the fact that there are links from the dataset back to the data article). Reader of the citing article may not be happy (may want more info about the dataset that is only provided in the data article). Publisher of citing article is probably not bothered one way or another (depending on journal policies for citing data).
  3. Only the data article is cited. Data creator is happy (citation!). Data publisher is not so happy (but probably resigned: no citation, but link from data article to dataset, so not as bad as old days with no link to the data at all). Data article publisher is happy (citation!). Reader of the citing article may not be happy (may want a direct link to the data). Publisher of citing article is content (situation normal).
It's a balancing act!

Honestly? I do think cultural norms will evolve within the different research domains over time. We should be prepared to give them a gentle nudge if they look like they're going completely haywire, but for the most part I'd say let them grow.

And for me, when asked "But what should I cite?!?", my default answer will be "Cite what you use".

  • If you use a data article to understand and make use of a dataset, cite them both.
  • If you use a dataset, but don't use any of the extra information given in the data article, cite the dataset.
  • If you use a data article, but don't do anything with the dataset, cite the article.
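The three bullets above boil down to a very small decision rule. Here it is as a sketch, just to show how little there is to it; the function name and the labels it returns are my own, not any kind of standard.

```python
def what_to_cite(used_dataset: bool, used_article: bool) -> list[str]:
    """Return the things to cite, given what was actually used."""
    cite = []
    if used_dataset:
        cite.append("dataset")
    if used_article:
        cite.append("data article")
    return cite


print(what_to_cite(used_dataset=True, used_article=True))   # ['dataset', 'data article']
print(what_to_cite(used_dataset=True, used_article=False))  # ['dataset']
print(what_to_cite(used_dataset=False, used_article=True))  # ['data article']
```

There's deliberately no special case in there for avoiding double counting: if you used both, you cite both.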

Cite what you use!