Tuesday 18 December 2012

SpotOn London


This post is a bit "better late than never", given that it's been just over a month since SpotOn London happened. And I'm not too sure what I can actually say about the conference, other than it was amazing, and there was cake!

Actually, no, I can say more than that. SpotOn was an unusual conference for me, in that I'm used to traditional academic conferences where you have people presenting their latest research, followed by a few questions from the floor, and then on to the next thing. SpotOn was a series of sessions which were almost all panel discussions, where questions from the floor made up the vast majority of the conversation. Add to that format science communicators, tools developers, researchers and policy makers, and you've got a potent mix of people to really get the conversations going.

The whole thing was kicked off by Ben Goldacre's gloriously chaotic talk about, um, data and randomised trials and stuff, featuring such nuggets of information as the fact that it's possible to buy uranium off Amazon (but they won't ship it to the UK), and some gleeful choices of words that made me glad I wasn't drinking tea at the time I heard them.

The second keynote was given by Kamila Markram talking about the publishing process, and drivers for open science, in the context of frontiersin.org - a combination publishing and social networking platform for scientists.

Both keynotes (in fact, all the sessions) were videoed, so I recommend going and having a listen.

I got tapped to sit on the Data Reuse panel, along with Mark Hahnel of Figshare and Ross Mounce, even though my voice was still a bit ropey. Gratifyingly, the session was standing room only, and we covered topics including open data, reuse, credit for making data open, data publication and citation, peer-review of data and the impact of data. (If you don't have time to watch the video, the Storify of the session does a good job of capturing all the main points, and a few asides too!)

The other things that have stuck in my mind (a month or so later) include:
  • The Assessing Social Media Impact session, where we could tell we were making an impact because of how quickly spammers started targeting the #solo12impact hashtag. (Storify here.)
  • Preaching to the choir in the Incentivising Open Access and Open Science: Carrot and Stick session, where there was plenty of talk about making other people do things to make science open, but precious little about how to do it yourself. I subscribe to the view that it's either important, so we should do it, or it's not, so we shouldn't and should stop talking about it! And with the whole carrot and stick thing - yes, researchers are not donkeys, but they are human, and we are herd animals! Lead by example! (Storify is here.)
  • The ScienceGrrl crowd - flying the flag for female scientists!
  • The fact that, of all the badges being given out, the first one to completely disappear was the one saying "Data is the new black".
All in all, a really good conference for meeting new people and getting fired up about all sorts of really cool stuff. I'll be back next year!



Tuesday 13 November 2012

When science and stories collide

The Story Collider

I'm back in the office today after a wonderfully intense couple of days at SpotOn London 2012 - which I'll be blogging more about in another post. 

But first, I want to talk about the Story Collider - the fringe event which kicked off the whole conference for me, and which was held on the Saturday night in the upstairs room of a pub in Camden (not the usual location for scientific shenanigans, to be fair!)

I'm still not entirely sure how I wound up there, hiding at the back of the room, frantically reading and re-reading my notes. Well, yes, I do know how I wound up there. When the email came around to the registered conference attendees asking for storytellers, I took a look at it and thought "that could be interesting - I wonder if my story is appropriate?" And it went from there. The organisers liked the sound of my story outline, and that was it. I was on the list to tell my tale.

(I was, of course, blithely ignoring the fact that I was due to vanish into the wilds of West Wales the weekend before the show. Oh, and the fact that my voice was somewhat on the croaky side, and not showing any signs of coming back...)

Anyway, the Story Collider is part stand-up comedy, part confessional, and aims to bring people together to listen to and to tell stories about science in their lives. Its format is simple: half a dozen storytellers, each talking for about ten minutes, standing alone on a stage in front of a microphone.

I think I can safely say it was one of the scariest experiences I've had in a long time. I'm no stranger to the stage, but there's a big difference between presenting research (where you can hide behind PowerPoint slides and acronyms) or singing songs (where the words are already written and you know them by heart), and standing in front of strangers, telling them about something that actually, really happened to you, and how it made you feel. (The feelings part was the hardest!)

Be that as it may - I did it. I was shaking like a leaf when I got off that stage, but I did it! 

The audience was lovely - only a science crowd would have given me a cheer when I told them how I was finally going to get my dataset published. And I got a lot of laughs, and a lot of really nice comments afterwards too - the ones that stuck in my mind were the ones that said how nice it was to hear a story about the actual trials and tribulations of doing science.

The whole event was recorded, so I'm hoping there'll be podcasts of the show coming out in the not-too-distant future. I'd really like to listen to the other stories that were told that night again, as being second last in the running order meant that I was too distracted by being nervous to give them my full attention!

Many thanks to all the Story Collider organisers for giving me the chance to tell my story, and to my fellow storytellers and the audience for being so supportive, and for laughing and cheering! If you get the chance to go to a Story Collider event, or even talk at one, go for it!

One theme that kept coming back in the discussions at SpotOn London was how much we scientists need to get better at telling stories and talking to people. The Story Collider provides an excellent way of doing just that.

Citing Sensitive Data - workshop report

"Burned" DVD, microwaved to ensure total elimination of private data.
"Burned" DVD, microwaved to ensure total elimination of private data , bNightRStar

On the 29th October, I went to the British Library for a workshop on the topic of managing and citing sensitive data, one of a series of workshops all about data citation.

I won't go into the detail of what was said during the presentations as all the slides are available on-line here, and there's a good blog post summarising the workshop here.

I will take the opportunity to re-iterate what I said in my previous post about how citation doesn't equal open. Though I will expand on it further and say that there need to be extremely good reasons for keeping data closed when public money has funded its collection (reasons along the lines of patient confidentiality, saving endangered species, etc., not "but I need extra time to write a paper!").

After all the presentations, we were split up into groups, and made to do some work, it being a workshop and all. First of all, we had to come up with some example scenarios for how to cite data given certain access conditions or embargoes, and then we had to swap these with another group and try to solve them. This turned out to be a lot of fun, though I did somehow manage to wind up in the group that was threatening to fire people left, right and centre if they didn't behave!

The Yellow group were looking at access conditions for a study where different participants had given different levels of consent. The solutions they came up with were: 1) have an umbrella DOI for the whole dataset, with multiple DOIs for the subsets with different access conditions; 2) have a hierarchical DOI; or 3) have an umbrella DOI linking to subsets. The trade-off here was clarity versus nuance, and it was generally agreed that communities in different disciplines would have to decide on the best approach. We also can't draw an inference from a subset of the data without taking the whole dataset into account.
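
(To make the umbrella-DOI option a bit more concrete, here's a minimal sketch of how the relationships might be recorded. All the identifiers and access labels are invented - 10.5072 is the DataCite test prefix - so treat this as an illustration, not anyone's actual scheme.)

    # Illustrative only: an umbrella DOI whose record points to subset DOIs
    # carrying different access conditions. All identifiers are hypothetical.
    dataset = {
        "doi": "10.5072/study-umbrella",
        "title": "Participant study (all consent levels)",
        "subsets": [
            {"doi": "10.5072/study-open", "access": "open"},              # full consent
            {"doi": "10.5072/study-restricted", "access": "by request"},  # partial consent
            {"doi": "10.5072/study-closed", "access": "closed"},          # consent withheld
        ],
    }

    for subset in dataset["subsets"]:
        print(subset["doi"], "->", subset["access"])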

The Red group were looking at embargoed data. First up was "researchers want to gain more research credit". Suggestions included: early deposit, while the embargo is still in force; access by request during the embargo; DOI minted on deposit; an open landing page in the repository (so people know the data exists, even if they can't access it yet) with the end-of-embargo date on it; and the metadata should be specified on deposit too.

Next the Red group looked at the situation of longitudinal cohort studies which may change and have multi-layered embargoes. Access to variables could be dependent on layers of the dataset, with access to layers potentially increasing in time. The suggestion was to have multiple DOIs for multiple layers, with links between the landing pages to show how the layers fit together.

The Green group also looked at embargoes - specifically the situation where there was retrospective withdrawal of permission for a dataset and the data was embargoed while an investigation took place. (The assumption was that the DOI had already been minted for the dataset.) Suggested action was: retain the same landing page, but add text to it detailing the embargo and the expected date when the investigations would end (compliant with the institution's investigations policy). A user option to register to get notified when the dataset becomes un-embargoed would be a nice thing to have. When the investigation is complete, update the metadata depending on the results. And, at the beginning of the data collection, make sure that the permissions and data policy are set out clearly!

The Blue group were looking at access criteria, in two cases. First up was "White rhino numbers and GPS tracking information". The suggestions were: assigning a DOI to the analysed rather than the raw data, and applying access conditions to the raw data so that user credentials can be verified. The format of the public dataset could be varied, e.g. releasing it as snapshots instead of time series, or delaying the release of the dataset until the death of the tagged rhinos. Some of the rich descriptive data might also be kept back from the DataCite metadata store in order to protect the subjects.

The second scenario the Blue group looked at was animal experiments - medical testing on guinea pigs with photos and survival times. This one was noted as being difficult - though there was agreement that releasing data should be guided by funders and ethics committees. The metadata should not name individuals, and the possibility of embargoing data, or publishing subsets (without photos?) should be investigated.


In the general discussion afterwards it was (quite rightly!) pointed out that it's ok to cite and make available different levels of data (raw/processed) as raw data might well be completely incomprehensible to non-experts. We also had a lot of discussion about those two favourite topics in data citation - granularity and versioning. Happily enough, they'll be the subject of the next workshop, booked for Mon 3rd Dec. 

Friday 19 October 2012

Why Citation does not equal Open

Open Means Never Having to Say You're Sorry
By cogdogblog. http://www.flickr.com/photos/cogdog/7155294657/in/pool-67039204@N00/ 

Recently I've had a few emails from people expressing concern about data licensing, especially when it comes to assigning DOIs to datasets so they can be formally cited. The assumption seems to be that if a dataset has a DOI assigned to it, the data must therefore be open. This isn't the case.

Citation and open data seem to have got tangled together. Yes, citation is a mechanism for encouraging researchers to make their data open, but it doesn't follow that everything you cite has to be open.

Let's take an example of a journal paper. You can cite a journal paper whether it's open or not, and the citation simply gives information about the paper and where you can find it. The DOI for the paper will take you to a landing page, and the landing page then tells you what restrictions are on the paper (if any). It's commonplace to cite a paper that you have to pay to access - I know I've done it many a time.
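
(As an aside on the mechanics: a DOI isn't a location, it's an identifier that the central DOI proxy resolves to the current landing page. So turning a cited DOI into a clickable link is just a matter of prefixing it - a trivial sketch:)

    # Turn a bare DOI into a resolvable link via the DOI proxy.
    def doi_to_url(doi):
        return "http://dx.doi.org/" + doi

    # The proxy redirects to the publisher's landing page, whatever the
    # access conditions on the paper itself.
    print(doi_to_url("10.1000/182"))  # the DOI Handbook's own DOI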

Similarly, say you want to cite the Book of Kells (Trinity College Dublin MS 58). That's easy - in fact I've just done it. But to access it as a casual reader, you'd need to travel to Dublin, go to Trinity College Library, pay €9, and look at whatever page happens to be open on display at that particular time. (I'm sure there are more stringent restrictions on researchers who actually want to be able to flick through the pages!)

So, there's plenty of precedent for researchers citing things that aren't open, or are restricted in some way. Data will be no different.

DataCite themselves have accounted for some situations where access to data might be restricted (because of confidentiality issues, embargo periods, etc.) in the publication year in the mandatory properties and also in the date element in the optional properties of the DOI metadata schema.

Publication Year: "If an embargo period has been in effect, use the date when the embargo period ends."
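
(By way of illustration - the field names below follow the DataCite schema, but all the values are made up, and 10.5072 is the DataCite test prefix - an embargoed dataset's mandatory publication year and optional Date property might be filled in like this:)

    # A minimal, illustrative DataCite-style record for an embargoed dataset.
    # Field names follow the DataCite metadata schema; values are invented.
    metadata = {
        "identifier": "10.5072/example-embargoed-dataset",
        "creators": ["Example, A."],
        "title": "Example embargoed dataset",
        "publisher": "Example Data Centre",
        "publicationYear": "2014",  # the year the embargo ends, per the guidance above
        "dates": [{"dateType": "Available", "date": "2014-10-29"}],  # embargo lifts
    }

    print(metadata["publicationYear"], metadata["dates"][0]["date"])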

The landing page for a DOI-ed dataset needs to be completely open with the relevant information about why there is restricted access and/or what to do to get full access.

There's a JISC-British Library DataCite workshop planned, focusing on managing and citing sensitive data, taking place on Monday 29th October in the British Library Conference Centre, which will look in greater detail at exactly these sorts of issues. Registration is still open!

For my part, I want to spread the word that you can cite data without having to make it open. Open data is, of course, something to be encouraged wherever possible. But scientists are nervous enough about open data and the possibility of getting scooped, or having legal or IPR issues cause problems. Going for the softly, softly approach of citing data whether it's open or not will allow researchers to get used to the idea of data citation. Once they get credit for their work in creating the datasets, that's when we can show them how much more credit they can get for making them open.

And in a lot of cases, data needs to be restricted for very good reasons (for example, protecting patient confidentiality). Penalising the researchers who created those datasets by not allowing them citations because their data can't be made open just seems unfair.

Wednesday 27 June 2012

DataCite Summer Meeting June 2012


Sand castles by experts in Copenhagen


This is the last one of these posts - as it's the end of my notes from the Tallinn/Copenhagen trip. Unfortunately it wasn't the last of the meetings I had to go to; the final one was a report-drafting meeting of the CODATA working group on data citation, which doesn't come with any presentation notes, but did mean I missed the second half of the DataCite meeting.

Anyway, notes from the DataCite Summer Meeting presentations I did get to see are below:

NordBib conference notes, Copenhagen, June 2012


Inside the lecture theatre at the Black Diamond

The NordBib conference was all about Structural Frameworks for Open Digital Research - Strategy, Policy & Infrastructure. I kind of fell into attending it by accident, as I was in Copenhagen for the OpenAIREplus workshop before it, and the DataCite meeting after it, so it seemed sensible to go to this one too.

It was an interesting conference: on the one hand very high-level EU strategising, while on the other, the audience seemed to mainly consist of librarians and people interested in data without that much in the way of actual concrete experience of data management. So I wound up having lots of conversations with lots of people, all interested in finding out what we in the UK and the NERC data centres have been up to.

All the presentations from the conference are available here (which kind of makes my notes redundant, but never mind!)

Monday 25 June 2012

OpenAIREplus workshop - notes from the breakout session


One of Copenhagen's bridges being opened, so we can sail through!

1. Funders and data policy
 * Lots of interest in the data value checklist - compare UK and Australian data value checklist
 * It's cheaper to keep data rather than recreate it
 * Can you require open availability of data brought into a project? Case by case negotiation
 * Multiple funders - which data policy will be applied?
 * CODATA preparing a toolkit for funders about open data policy
 * Role of institutional repositories? Data centres are good places to handle data pools
 * Need clear metadata!
 * How to handle data management plans once the project is over? Fund data management post-project. Should remain institutional responsibility.
 * Identifiers - need researcher identifiers, funder acknowledgements, DOIs - all to pull together project information and data
 * Are there international approaches to data management plans?

2. Institutional policy
 * Most institutions don't have a policy yet because they're not easy to create
 * What other steps need to be done before policy?
 * Hierarchy - who to get involved - academic champions
 * Broad overview - what are the needs of researchers - they don't want extra admin
 * Don't contradict other policies or legislation
 * Smaller institutions don't have the money or effort to get into big data infrastructures
 * Policy can guide researchers on what to do with their data
 * What should be deposited, what should be kept
 * How can we help institutions develop data management plans?
 * Guidelines on developing data management policies
 * What kinds of questions do we need to answer before drafting policy?

3. Researchers and publishers
 * Current examples are life and environmental sciences
 * We need other examples in other fields
 * Researchers need acknowledgement for their work on data - not having it stops them sharing
 * Quality issues are important - need principles for peer review of research data
 * Users of data are candidates to review it
 * There are varying degrees of openness in peer-review - which will be appropriate for data?
 * What stops researchers sharing data? Quality, promotion, confidence in the value of the data
 * We can give researchers more confidence in their data by promoting community standards
 * Change behaviour so that data management is done every day, instead of just at the beginning and end of the project.
 * Publishers can influence researchers when it comes to data management.
 * Metrics are needed, data citation, but also alt.metrics
 * Need for good examples of data management to educate researchers
 * Need a list of trusted databases/repositories
 * URLs aren't trusted, because they break!

4. Technical
 * Finland is constructing a national data catalogue, containing a mix of metadata records and data
 * OpenAIRE data model and services are using trust levels for entities and (automatic and man-made) relations
 * Need to guarantee long term data availability for enhanced publications to be trustworthy, or at least know what bits will last for how long
 * Level of trust needed to develop services to show levels of preservation
 * Services should still exist for low trust objects - e.g. use a robot to check if the object is still there, and if not, drop the connection.

5. OpenAIREplus
 * Are there licensing restrictions for metadata?
 * Case studies of scientific communities should be published as soon as possible
 * Credit for researchers is important
 * Libraries have a role too - even if there is a fear of data management
 * Universities are very disparate - makes it hard politically to agree on data policy.


OpenAIREplus workshop notes - 11th June 2012, Copenhagen

The Black Diamond from the water - not a bad little conference venue!


“Linking Open Access publications to data – policy development and implementation"

The next stop in my marathon run of conferences/workshops/meetings was Copenhagen, and the Black Diamond - the home of the Royal Library of Denmark - for the OpenAIREplus workshop and the NordBib conference (more on that in a later post).

This post contains my notes from the presentations given as part of the OpenAIREplus workshop, and boy, they crammed a lot in there! Some really fascinating talks, and my only complaint would be that we didn't have enough time in the breakout sessions to really get into the meat of things. But given that we'd started at 8.30am, I'm not sure where we could have found any more time!

(Insert usual disclaimer about notes here...)


Thursday 21 June 2012

Some notes from "Editing in the Digital World: 11th EASE General Assembly and Conference"




A friendly face at Tallinn Technical University

I was invited to Tallinn to talk at the European Association of Science Editors (EASE) General Assembly and Conference, "Editing in the Digital World", to present in their session about "Publishing Data". I was very pleasantly surprised by how much interest the delegates had in this particular subject - in fact we had to move rooms for the session, as the first room was too small!

Unfortunately, I couldn't take notes in the session I was presenting in, and was only around for a couple of keynotes (as I had to head off to Copenhagen for yet more meetings - see future blog posts coming soon!). I did manage to take notes for those keynotes, and here they are, for your reading pleasure.

(As always, these are notes, so all errors grammatical, factual, spelling and otherwise are mine! No warranty is given, etc. etc. etc.)

The slides from my presentation can be found here.

Tuesday 3 April 2012

RDMF8 - Feedback from the breakout sessions

Breakout by dspindle, on Flickr


Group 1: What are the main barriers to researcher/publisher collaboration and how might they be transcended?

  • Who owns the information?
    • Researchers have a proprietary interest. Journals and institutions also talk about the data being "theirs". Issues of trust.
  • Need to make clear what value-adds publishers make.
    • Publishers are making user-driven solutions.
  • Integrated systems are important 
    • saves duplication of time/effort
    • Feed through of metadata from data collection systems to publication systems
    • The DCC has a key role in research support and infrastructure systems, including sharing metadata.
  • Researcher apathy
    • Publishers make it easier to put data in repositories
    • Vicious circle of not getting credit for data means less likely to share. 
    • Lots of collaboration from everyone needed


Group 2: Can peer review of data be performed as scrupulously as peer review of publications? Is there a universal framework that can be applied?

  • Peer-review of data is not the same as peer-review of publications
    • Data integrity checks
    • Scientific review
    • User review
  • A taxonomy of data review processes is needed.
  • Publishers need explicit guidelines on expectations of reviewers regarding data.
  • Trust in datasets/repositories
    • Encouraging the wider use of DOIs is essential as it allows researchers to find datasets and evaluate them, starting an evolutionary process for trust.
    • There are a number of emerging standards for trusted repositories, but they're not commonly known.
  • Compound or manifest method for publishing the data, article, methods etc.
  • The role of publishers
    • varies widely across communities.
    • Publishers are probably not the best people to archive data.
    • Learned society publishers have a key role to educate researchers about data.
  • Institutions
    • have a key role as part of mobilising the scientific community
    • The expectations of institutions regarding data have to be spelled out.

Group 3: What future is there for national and institutional data repositories to provide a platform for
data publication?

  • The future's great!
  • At the moment, institutional data policies are patchy.
    • A good incentive for building an institutional repository is that it will provide a complete record of all institutional research outputs.
  • Data is a first class scientific output
  • Institutional repositories should be based on a national template of good practice
    • Some journals are taking this role at the moment, not sure if someone else should.
  • Reuse of datasets is a key driver.
  • Is there mileage in offering a cash prize for the best demonstration of data reuse?

Summary and next steps
Everyone at the meeting was given the job of cascading information about data publication to their colleagues/funders/institution. The DCC promised to engage with funders and others to the extent it can within the UK.

Getting research data sharing right brings in real economic benefits, and that's something we don't have to persuade government about. We need to identify areas for action where everyone gains. We might find ourselves in situations where the effort and the benefit don't fall to the same people, so we need to be prepared for that.

RDMF8 - Notes from presentations, Fri 30th March.

Lecture Notes by Editor_Tupp, on Flickr

David Tempest (Elsevier) - "Journals and data publishing: enhancing, linking and mining"

  • David's role is to work on strategy and policy for all the ways people access Elsevier's data, including open access, mechanisms for access, access for data.
  • Elsevier's approach: Interconnections between data and publications are important. Scientists who create data need to have their effort recognised and valued. When journals add value and/or incur significant cost, then their contributions also need to be recognised and valued.
  • There are many potential new roles - want to embrace an active test-and-learn approach. Will be sensitive to different practices in different domains. Want to work in collaboration with others. Key is sustainability - want to ensure that information is available for the long term.
  • Publishing Research Consortium survey: Data sets/models/algorithms shown as being important yet difficult to access.
  • Paradox in data availability and access: asking researchers gives positive reasons for sharing data (increased collaboration, reduced duplication of effort, improved efficiencies), but also negative reasons (being scooped, no career rewards for sharing, effort needed to make data suitable for sharing). Embargo periods allow researchers to maximise the use of their data. 
  • Researchers should be certifying data, not the publisher.
  • Articles on ScienceDirect can link to many external sources - continuing to work on external and reciprocal linking (at the moment there are 40 different linking agreements). Example: linking with Pangaea.
  • Article of the future: tabbed so the user can move from one bit of the article to another very quickly, incorporating all the different data elements into the text but also in the tabs. Elsevier are rolling it out across all their journals (alongside the traditional view).
  • Supplementary data: options for expandable boxes containing supplementary information in the text.
  • Content mining: researchers want to do this, so Elsevier are doing a lot to enhance and enable content mining wherever they can. An analogy was shown with physical mining workflows (and some nice pictures too).


Brian McMahon (International Union of Crystallography) - "Research data archiving and publication in a well-defined physical science discipline"

  • Challenge for publishers engaging with data is the diversity of data. 
  • The IUCr is unusual among international unions in that they publish their own journals. Two journals publish crystal structure reports. These are the most structured and disciplined publications, and handling them has had to be integrated within more general publishing workflows.
  • Brian gave a very nice description of a crystallographic (x-ray diffraction) experiment, handily explaining what's involved for all the non-crystallographers in the audience.
  • Data can mean any or all of: raw measurements from an experiment, processed numerical observations, derived structural information, variable parameters in the experimental set-up or numerical modelling and interpretation, bibliographic and linking information. Make no distinction between data and metadata - metadata are data that are of secondary interest to the current focus of attention.
  • Crystallographic Information Framework (CIF): human readable and easily machine parseable, with a simple tag and value structure (see the sketch after this list). There are XML instances of CIF data. CIF can be used as a vehicle for article submission: within the CIF file, the abstract and other text are carried as data fields. The CIF file can then be reformatted to produce a more standard paper format.
  • CIF standard is documented and open.
  • "Standards are great: everyone should have one!" Important to get started - whatever you can standardise you can leverage. 
  • There is a web service called checkCIF, which runs the same programs as are used to check data on submission of a paper. Authors are encouraged to use this before submission. The author uploads a CIF file, and the programs generate a report flagging outlying values. If an anomaly is detected, then the paper will not be passed on through the publishing process unless the anomaly is addressed by the author. The reviewer sees the outlier flag and the response, and makes a judgement about it.
  • Why publish data? Reproducibility, verification, safeguard against error/fraud, expansion of research, example materials for teaching/learning, long-term preservation, systematic collection for comparative studies. Each community has to assess the cost-benefit of each of these reasons for themselves.
  • IUCr policies: Derived data made freely available. Working on a formal policy for primary data. 
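
To give a flavour of that tag-and-value structure, here's a minimal sketch that parses a few CIF-style lines. The data block is invented, and real CIF has much more machinery (loop_ constructs, quoted strings, multi-line text fields) that this deliberately ignores:

    # Illustrative only: parse simple one-line CIF tag/value pairs.
    cif_text = """\
    data_example
    _chemical_formula_sum    'C6 H6'
    _cell_length_a           7.29
    _cell_length_b           9.21
    _cell_length_c           6.89
    """

    entries = {}
    for line in cif_text.splitlines():
        line = line.strip()
        if line.startswith("_"):             # tags begin with an underscore
            tag, value = line.split(None, 1)
            entries[tag] = value.strip("'")  # strip simple quoting

    print(entries["_chemical_formula_sum"])  # -> C6 H6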


Rebecca Lawrence (F1000) - "Data publishing: peer review, shared standards and collaboration"

  • F1000's core service is post-publication peer-review in biology and medicine. 1500 new evaluations per month, >120k total so far. New: F1000 posters, F1000 Research (F1000R, launching later this year)
  • F1000R is addressing alternatives to current scholarly publishing approaches: speed (immediate publication), peer review (open, post-publication peer review), dissemination of findings (wide variety of formats, e.g. submitting as a poster), sharing of primary data (sharing, publication and refereeing of datasets). Gold open access, CC-BY.
  • F1000R - post publication formal open refereeing process. Submission to publication lag is days versus months for traditional journals.
  • Majority of big journals don't see data papers as prior publications.
  • Key areas requiring stakeholder collaboration for data publication: workflows, cross-linking, data centre accreditation, data peer review.
  • Datasets: issues of common/mineable formats (DCXL), deposit in relevant subject repositories where possible, otherwise in a stable general data host (Dryad, FigShare, institutional data repository), what counts as an "approved repository", what level of permanency guarantees?
  • Protocol info: enough for reuse, ultimate aim is computer mineable, MIBBI standards too extreme, F1000R looking at ISA framework with Oxford and Harvard groups.
  • Authors want it to be quick and simple: minimal effort, maximal reuse of metadata capture, smooth workflow between article, institutional repositories, data centres
  • Incentives to share data: show view/download statistics (higher than researchers think!), impact measures to show value to funders, encourage data citation in main article references (need to agree a standard data citation approach)
  • Refereeing data: the time required to view many data files, and how does a reviewer know the data's OK without repeating the experiment or analysing the data themselves? Showed ESSD example guidelines, Pensoft, BMC Research Notes.
  • Community discussion for peer-review: is the method appropriate? Is there enough information for replication? Appropriate controls? Usual format/structure? Data limitations described? Does data "look" ok?
  • F1000R sanity check will pick up: format and suitable basic structure, standard basic protocol structure adhered to, data stored in appropriate stable location.
  • F1000R focus on whether work is scientifically sound, not novelty/interest. Encourage author-referee discussion.
  • Challenges: referee incentives, author revision incentives, clarity on referee status, knowledge of referee status away from site (CrossMark), management of versions (what and how to cite)

Ubiquity Press have guidelines for how they choose and suggest data repositories.

Todd Vision (Dryad/National Evolutionary Synthesis Centre) - "Coupling data and manuscript submission: some lessons from Dryad"

  • Roles for all stakeholders in linking/archiving data/publications
  • Basic idea of packing information into one thing (the paper) not threatened by enhanced publications, nano publications, data papers.
  • Requesting data from researchers after publication doesn't work very well.
  • There is a logical point in time to archive data associated with publications, during the publication process. That's when researchers are motivated to clean up and make data available. 
  • Joint Data Archiving Policy - start of Dryad. Grass-roots effort, rolled out slowly, in the knowledge that there wasn't the infrastructure to handle the long tail data. Response to this policy has been very positive. Embargo on data for a year after publication key to community acceptance.
  • Dryad requirements (handed down from on high): Less than 15 minutes to complete the deposit through repository interface (once files etc. had been completed). Long term preservation important.
  • Paper provides large amounts of rich metadata associated with dataset. Orphan data, as long as one has the paper associated with it, can still be valuable. Long-tail data very information rich.
  • Journals refer authors to Dryad or other suitable repositories.
  • Curation is the most expensive part of the process. The data DOI (assigned by Dryad) is put into the article, in whatever version of the article.
  • Dryad also has authors submitting data outside the integrated systems with specific journals. 
  • Data made available through CC0. About 1/3 of the files get embargoed. Some journals disallow the embargo.
  • Dryad have handshaking set up with specialised repositories, working with TreeBASE, trying to make progress with GenBank. Will require a lot of community effort on standards.
  • Adding new journals, ~1/month. Getting closer to financial sustainability all the time.
  • Legacy data being added. Data being added as a result of it being challenged in the press.
  • Incentives - in some cases data has a different author list from article author list - providing credit for dataset authors.
  • Sustainability - deposit cost covered up front. Governed by a membership non-profit organization. Gold Open Access funding model, with different options: Journal subscriptions, pre-purchase of deposits, retrospective purchase of deposits, pay-per-deposit (paid by authors). Deposit fees ~£30/$50!
  • "Perfect is the enemy of the good" for long tailed data. Repository governance should be community initiative. Lot of room for education about how to prepare data for re-use, how to make data citations actually count. Do we have enough researcher incentives, or are publisher mandates and citation enough?
  • Limit of 10 GB for dataset. Curation costs for lots of files/complicated metadata drive the costs of deposit.
  • Reuse of Dryad data: median 12 downloads in a year after deposit. Leader has ~2,000 downloads. Room for improvement in tracking how people use the downloaded data.

Simon Coles (University of Southampton) - "Making the link from laboratory to article"

  • Talk focussed on the researcher perspective: doing research and writing papers!
  • We don't think a lot about researchers' notebooks, or how they actually work and record what they're doing.
  • Faraday's notebooks are a great example. He recorded over 30,000 experiments and devised a metadata scheme for indexing and tagging experiments.
  • The notion that we're drowning in a sea of data is true and important to researchers.
  • Researchers manage and discuss data in relative isolation.
  • At some level, academics really do want to share, but they want recognition. There's also "how do I get my PhD student to share with me?"
  • Data puts a large burden on the journals, and it's not clear what the benefits are for the journals.
  • An example was shown of Dial-a-molecule, an EPSRC grand challenge, where information about molecules is provided very efficiently and quickly, all predicated on informatics.
  • We need to understand all the experiments ever done, and the negative results are as important as the positive ones.
  • Mining data is a big scientific driver.
  • Chemistry data is: scribblings in a book, the process of mixing stuff, analysis and characterisation of compound using instruments and computers, images, molecules, spectra, all the raw data coming out of instruments. And data ranges from highly structured to difficult to describe.
  • In chemistry publications, the abstract contains complicated information that is difficult to catalogue and understand. The experimental section has reams of coded text providing the recipes for what was done. Supplementary information also has pages of text.
  • There is a problem with information loss, for example, when an author chooses one point from a complete spectrum to report on in the text.
  • With structured data the problem is largely solved. The problem is with unstructured data.
  • My Lab Notebook provides an on-line research diary to capture what you're doing when you're doing it. This allows a stripped down paper to be written, containing links to the notebook.

Christopher Gutteridge (University of Southampton) - "Publishing open data"

  • Christopher's remit is to find open data at the University of Southampton and publish it in a joined up way. Or, in other words "Allow the bearer to publish any non-confidential data in our realm without let or hindrance".
  • His job title is "architect", but he thinks that "gardener" might be more appropriate.
  • Working on the principle that "we'd be smarter if we knew what we knew".
  • He started working with buildings on the grounds that they (usually) stay in the same place, and aren't known for suing.
  • There is a danger with this sort of work, in that the temptation is strong to start by opening up and designing a new system/data model, instead of seeing what's already out there first.
  • Best practice is to simply list what's available (buildings... people...) and what key is used for them.
  • He showed an example page about a building, with information about the building, a map of where it is, and a picture of what it looks like, all of which make it a lot easier for visitors and students to find the building. A PhD student put a load of information about the buildings into some code to generate a map of the university buildings. This took a lot of effort to build, but is easy to maintain.
  • Linked data on the web should have some text with it, explaining what it is, for the use of a random person who has just been dumped on the page courtesy of Google.
  • If we want to link between research facilities and papers, then each facility needs a unique id. There is value for research data in linking it with the infrastructure that produced it.
  • Most of the links of value to an organisation are internal.
  • Homework for the forum attendees: think about the vocabulary we need to specify data management requirements.
  • Further details at http://blogs.ecs.soton.ac.uk/data/

RDMF8 - Discussions Thurs 29th March


The discussion was lively on the Thursday evening (I think we ran out of steam on the Friday, but it was still an excellent event). Below are the points that were raised:

  • Journals have a significant role in driving the connections between data and publications. The example given was that Nature demanding accession numbers in the 1970s was a key driver for setting up data repositories.
  • We've only just started with interactive data in papers, and we really do need to think about what readers need and want. Publishers need to become more aware of how researchers work, and get involved further upstream of paper production.
  • What is the journals' role in the preservation of data? Not sure if there is a need for publishers to get into the data repository business. There is a need to move away from supplementary information, and think about how to preserve it. We all have a responsibility to maintain data.
  • Big question: how do we define a trusted repository? Trusted repositories should be "community endorsed". Publishers are driven by the norms in each scientific community. What are sustainable models for repositories?
  • An easy way to get more out of supplementary information would be to support it in more and different formats.
  • What constitutes the version of record for datasets?
  • The peer-review process is unfunded - how would it change with the integration of data? Nature did a survey where they found that a high percentage of respondents wanted peer-review of data, but didn't want to be the ones to actually do the review. 
  • What role should repositories play in the peer-review of data?
  • Data papers might help the peer-review process, as it'd break up the procedure of review. For example, in the publication of protocols, the Royal Society of Chemistry checks the data to ensure it is internally consistent, a process separate from peer-review. Could this be part of a new role for technical editors?
  • There is a CrossRef initiative (CrossMark) in the works which will allow users to see what version a section of a paper is by hovering over it - allowing users to be aware of post publication changes.
  • The UK Data Archive have a system of high impact and low impact changes for deciding when/if changes in a dataset trigger a new DOI (a toy sketch of that kind of rule is below this list).
  • Where should data citations be put? In the text? Footnotes? There is concern about things being in the reference list which aren't peer-reviewed, and dual citations. Some publications limit their reference lists.
  • UKDA are approaching publishers to suggest methods of citation for the social sciences.
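
That versioning point is worth a sketch. The rule below is my own invention for illustration - the impact categories and numbering are not the UKDA's actual scheme - but it shows the shape of the idea: high-impact changes mint a new DOI, low-impact ones just revise the record behind the existing DOI.

    # Toy illustration: high-impact changes (ones that could alter scientific
    # conclusions) mint a new DOI; low-impact ones (e.g. metadata fixes) bump
    # a minor version behind the same DOI. Categories and numbering invented.
    def apply_change(version, doi_serial, impact):
        major, minor = version
        if impact == "high":
            return (major + 1, 0), doi_serial + 1  # new DOI needed
        return (major, minor + 1), doi_serial      # same DOI, revised version

    version, doi_serial = (1, 0), 1
    for impact in ["low", "high", "low"]:
        version, doi_serial = apply_change(version, doi_serial, impact)
        print(f"v{version[0]}.{version[1]} -> doi 10.5072/dataset-v{doi_serial}")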

Notes from RDMF8: Engaging with the Publishers - Integrating data and publications talk, Thurs 29th March, Southampton

Men with printing press, circa 1930s by Seattle Municipal Archives, on Flickr


Ruth Wilson (Nature Publishing Group) set the scene for us with an excellent keynote talk, which led into some very spirited discussion, both after the talk and down at the bar before dinner. I scribbled down 3 1/2 pages of notes, so I'm not going to transcribe them all (that would be silly) but instead will aim to get the key points down as I understood them. If it's a case of tl;dr, then skip down to the end to the talk's conclusions, and you'll get the gist.
  • NPG's main driving factors for their interest in data publication are: ensuring the transparency of the scientific process, and to speed up the scientific process.
  • Data needs to be: available, findable, interpretable, re-usable and citable.
  • Increasing amounts of information are integral to the article (and even more are supplementary). How can we link to data with no serving repository?
  • Interactive data is becoming important - things like 3D structure, re-graphing information, adding/removing traces, downloading the data behind graphs/figures, and geospatial data on maps. These are all being pulled together in things like Elsevier's article of the future.
  • Supplementary data has become "a limitless bag of stuff!", often with the data locked in PDF. Supplementary information is adversely affecting the review process, in that it puts extra pressure on authors, reviewers and readers. There was a 65% increase in supplementary information between 2008 and 2011. Sometimes it's only tenuously linked to the article, or it can be integral to the article but put in supplementary information due to stringent journal space restrictions.
  • Nature Neuroscience will be trialling a new type of paper from April 2012, where the authors will submit one seamless article, putting all of the essential information into it. Editors will then work with the referees and the authors to determine what elements should stay in the paper, and what should be considered supplementary. The plan is that it will make people think what's integral to the paper and ensure all the information submitted is peer-reviewed.
  • Nature are also investigating an extended on-line version of articles (in html and pdf) where there can be up to 14 extra figures or tables included.
  • Nature Chemistry was shown as an example: they publish a lot of compounds, where the synthetic procedure for the compounds is in the supplementary information, and gets pulled through to the on-line article in an interactive way.
  • Linking and integration between journals and data repositories is important. NPG are looking for bidirectional linking between article and data, and are seeking more serious, interactive integration.
  • NPG has a condition that "authors are required to make materials, data and associated protocols promptly available to others without undue qualifications". It also "strongly recommends" data deposit in subject repositories.
  • Regarding data publications, the call for data to be a first class scientific object was acknowledged, along with the interest publishers now have in data (as shown by the increasing number of fledgling data publications).
  • Data papers were described as being a detailed descriptor of the dataset, with no conclusions, instead focussing on increasing interoperability and reuse. The data should be held in a trusted repository (definition of trusted to be defined!), with linking and integration between the paper and data. Credit would be given through citation for data producers, and would also provide attribution and credit for data managers, who might not qualify for authorship of a traditional paper.
The conclusions:
  • Linking publications and data strengthens the scientific record and improves transparency
  • Funders' policies are a key driver for integrating data and publications
  • Journals can and do influence data deposition
  • Not a situation of one size fits all!
  • Partnerships are important (institutions, repositories, publishers, researchers, funders), but the roles are not well established, and business models need to be determined.

Monday 19 March 2012

Discussion notes from: Equality for Women in Science - Sometime, now, NEVER?

Women in Science by Kraemer Family Library, on Flickr (http://www.flickr.com/photos/27640054@N08/5526926256/)

After the presentations in the morning (and some very gooey brownies at lunch), we reconvened for a discussion session, followed by a plenary session reporting on the results of the discussion. Below are the random notes I made during the discussions and the plenary.


  • On the differences in perceptions between men and women: in chemistry, for women a PhD was viewed as an ordeal, while for men it was a rite of passage.
  • We should value talent - Beckham doesn't leave football when he's injured, why should a woman leave science when she has a baby? 
  • We need to collect information about where people who leave science actually go to. This point was then expanded to a general one about needing to get some social scientists to look into this whole situation more.
  • The point was made that attitudes can be changed by legislation (for example, the social unacceptability of smoking as a result of the smoking ban), though the counter-example was that people tend to pay selective attention to legislation (for example, the mobile phone ban while driving).
  • Legal quota systems often aren't fair - for example, when there's a legal requirement to have a 50-50 sex ratio for exam invigilation, and only 20% of the lecturers are one sex, the burden of invigilation falls unfairly hard on them.
  • Money does talk - if university departments required an award (such as Athena Swan Silver) before they could get a grant, things would change pretty quickly!
  • Headhunters are often key to pull women up through into boards, and should therefore be worked with. There was a proposal that universities should require headhunters to return equal numbers of male and female candidates for a post. Networks and professional bodies are also useful for women to be part of, as headhunters often go to them to find candidates.
  • The example of the Chemistry department at the University of York was referenced repeatedly (as you'd expect, given that they're a trailblazer). They had to move from large quantities of effort to a smaller, higher quality of effort, and the head of department was key to making this work as a guardian of culture.
  • Regarding self confidence issues, it was acknowledged that actually both sexes need help with self confidence. Women admit their lack of self confidence, while men bluster, but both situations are problematic.
  • Everyone acknowledged that it was harder for men to say things like "can we start the meeting later? I need to take my kids to school." This shouldn't be the case! One participant said that in her department, the standing rule was no meetings outside of school hours, and that this worked well for everyone. 
  • We need to harness enlightened self-interest to change things. 
  • We also shouldn't shrink the problem - we can't extrapolate out from the local situation to encompass all of science. Yet it's at the local level where the most drive to fix things can be found, and local implementation is important to cover the multiplicity of issues, because there's no one problem across all the sciences.
  • One delegate raised the question: is the lack of women in academia actually a problem? And if it is, is it not a problem that will be self-correcting once the universities realise? (To me, this sounded like "the market will sort it out" sort of thinking I've come across in my radio days. Unfortunately, when it comes to things like that, the market isn't that helpful.)
  • Bottom up and top down initiatives need to meet, probably at the level of the Principal Investigator, as they're the ones who train the next generation of scientists and pass on the culture. 
  • Yes, we need more facts and information, but we can't afford to sit on our hands waiting for them - we need to take action as well.
My discussion group came up with the following points (which I managed to scribble down):
  • Profile problem: people need to be made aware of the leaky pipeline problem.
  • Acceptance: many people don't even believe the leaky pipe is a problem. Leaders and guardians of culture need to be targeted and trained about unconscious bias.
  • Positive campaign: pooling best practice and publicising the benefits to all
  • Role models: (surprisingly to me) the evidence for their usefulness is very weak, and there are both positive and negative results in having role models.
We made the following proposal: that social science research should be commissioned (by the research councils?) to answer our questions, like how to quantify the value to the economy of fixing the leaky pipe, and then the results and the facts collected as part of the research should be made publicly available.

So, in other words, we did what scientists everywhere do - decided that we needed further research! Still, it was a very interesting discussion, and I'm hoping that people went away from the conference that bit more determined to change their departmental culture for the better!

Friday 16 March 2012

And now for something completely different...Equality in Science

In the midst of this week's haze of grant-proposal writing, I took a day out to attend the "Equality for Women in Science - Sometime, now, NEVER?" conference happening at the International Space Innovation Centre at Harwell Oxford, as it was conveniently located just across the road from my office. I also went wearing my hat as chair of the RAL Women in STEM committee. (That's me in the above photo, by the way - taken for a brochure about women scientists in STFC.)

Slightly depressingly, as you'd expect for a conference on equality, it was a female-dominated event, with about 12 men in the 100-strong audience. (Coincidentally, that's about the same proportion as women in STEM in STFC.) So I had a certain feeling that there was a bit of preaching to the choir going on, but still.

We started with a keynote from John Perkins, Chief Scientific Advisor for BIS. He was really pushing the point that the leaky pipeline damages the UK economy, to the tune of millions, and that we needed to fix it.

We then had Jocelyn Bell Burnell reporting on a Royal Society of Edinburgh study which is due to be launched on the 4th April, and so we got a sneak preview, which was confidential. She did set the scene quite nicely with some quotes from the Good Wife's Guide, 1955.

Paul Walton (University of York) presented some really interesting stats showing that the ability of women to progress through the system hasn't changed, and this extends across all disciplines. Scarily, if this trend continues, it'll take until 2109 to reach parity in civil engineering, and maybe never in clinical dentistry. The Chemistry department at York are the only department to receive an Athena SWAN gold award, and Paul told us the 12 year story of how they got there. It took a lot of leadership, he said, to change the culture. And they focussed on fairness, which is something everyone can get behind. (He also had a really neat optical illusion trick to make you see a colour photo when it was really black and white - illustrating the prejudices that we all have.)

Ottoline Leyser talked about the pressure cooker of academia - the publish or perish mantra that scares women away, and scares men into staying (because they don't want to seem a wimp), which is bad for everyone. She too was pushing the whole "parenting is a parent's issue" and "it's a culture in science issue, not a women in science issue", which I really agree with!

The last presentation was from Denis Bartholomew, who was proposing the use of quotas to get more women on boards and in higher positions of authority. This didn't go down particularly well; I felt that he needed more evidence to support his thinking, especially when presenting to scientists! Still, he had a good analogy: that smoking only really became socially unacceptable after the legislation for the smoking ban came into force.

After lunch we had some really good discussion sessions. Which I will report another time, because it's Friday afternoon, and time for me to go home!

New article in IJDC

Sarah Callaghan, Steve Donegan, Sam Pepler, Mark Thorley, Nathan Cunningham, Peter Kirsch, Linda Ault, Patrick Bell, Rod Bowie, Adam Leadbetter, Roy Lowry, Gwen Moncoiffé, Kate Harrison, Ben Smith-Haddon, Anita Weatherby, Dan Wright
Making Data a First Class Scientific Output: Data Citation and Publication by NERC’s Environmental Data Centres. International Journal of Digital Curation, Vol 7, No 1 (2012).

Abstract

The NERC Science Information Strategy Data Citation and Publication project aims to develop and formalise a method for formally citing and publishing the datasets stored in its environmental data centres. It is believed that this will act as an incentive for scientists, who often invest a great deal of effort in creating datasets, to submit their data to a suitable data repository where it can properly be archived and curated. Data citation and publication will also provide a mechanism for data producers to receive credit for their work, thereby encouraging them to share their data more freely.

Tuesday 7 February 2012

Lunchtime lecture to the British Geological Survey


I was invited to give a talk to the British Geological Survey on the 25th January, on the topic of data citation and publishing, and why it's important. I've been doing this talk in a variety of guises in different places for a while now, but I thought it'd be good to put it up here too. Consider it an on-line lecture, if you will.

(Click on any of the slide images to see the larger versions)

The key point here is that science should be reproducible: different people running the same experiment at different times should get the same result. Unfortunately, until someone invents a working time machine, you can't just pop back to last week to collect some observational data, so that's why we have to archive it properly.
Often, the only part of the scientific process that gets published is the conclusions drawn from a dataset. And, if the data's rubbish, so will the conclusions be. But we won't know that until we can look at the data.

This is a bit of blurb about the data citation project, and the NERC data centres, and why we care about data in the first place.

There's a nice picture drawn by Robert Hooke in the above slide - showing us that in the past it might have been tedious and time consuming to collect data, but it was at least (relatively) easy to publish. Not so much anymore.

And we're only going to be getting more data... Lots of people call it "the data deluge". If we're going to be flooded with data, it's time to start building some arks!

Data sharing is often put forward as a way of dealing with the data deluge. It has its good points...

...but in this day and age of economic belt-tightening, hoarding data might be the only thing that gets you a grant.
Data producers put a lot of effort into creating their datasets, and at the moment there's no formal way of recognising this - the kind of recognition that would help data producers when it comes to facing a promotion board.

There are lots of drivers to making data freely available, and to cite and publish it. From a purely pragmatic view, and wearing my data centre hat, we want a carrot to encourage people to store their data with us in appropriate formats and with complete metadata.

The project aims can basically be summed up as us wanting a mechanism to give credit to the scientists who give us data, because we know how tricky a job it is. But it has to be done if the scientific record is to stand.

The figure in this slide is key here, especially when it comes to drawing the distinction between "published" with a small "p" and "Published" with a big "P". We want to get data out into the open, and at the same time have it "Published", providing guarantees as to its persistence and general quality. What we definitely don't want is to have the data locked away on a floppy disk in a filing cabinet in an office somewhere.
Data centres are fitting into the middle ground between open and closed, and "published" and "Published", and we're hoping to help move things in the right directions.

Repeating the point, because it's important. (With added lolcats for emphasis!)

I'm far from an expert on cloud computing, but there are many questions to be answered before shoving datasets into the cloud or on a webpage. These things, like discoverability, permanence, trust, etc, are all things that data centres can help with.

This is an example of thousand-year-old data that's been preserved very well indeed. Unfortunately we've lost the supporting information and the context that went with it, meaning we've got several different translations with different meanings.
It's not enough to simply store the bits and bytes, we need the context and metadata too.

It's easy enough to stick your dataset on a webpage, but it takes effort to ensure it's all properly documented, and that other people can use it without your input. There are also risks - someone might find errors, or use your work to win funding.
Data centres know that the work involved in preparing a dataset for use by others is needed, and that's why we want to help the data producers and ensure they get credit for it.
Of course, in some cases where sharing data is mandatory, but the data producer doesn't really want to do it, it's a simple matter of not doing the prep work, and then the data's unusable to anyone but the creators.
(The example files in the pictures come from one of my own datasets, before they were put into the BADC with all their metadata and in netCDF. I know what they are, but no one else would...)

So, we're going to cite data using DOIs, and these are the reasons why - the main ones being that they're commonly used for papers, and scientists are familiar with them.
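
(To make that concrete, a dataset citation built around a DOI looks something like the Creator (Year): Title. Publisher. Identifier form that DataCite suggests. Everything in the little sketch below is invented - 10.5072 is the DataCite test prefix:)

    # Illustrative only: composing a dataset citation string.
    creator, year = "Example, A.", 2012
    title = "Example atmospheric measurements dataset"
    publisher = "NERC British Atmospheric Data Centre"
    doi = "10.5072/example-dataset"
    print(f"{creator} ({year}): {title}. {publisher}. doi:{doi}")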

Now we're getting into the detail. These are our rules about what sort of data we can/will cite. Note that these are self-imposed rules, and we're being pretty strict about them. That's because we want a DOI-ed dataset to be something worth having.

Serving data is our day job as data centres - we take it in from scientists and we make it available to other interested parties.
The data citation project is working on a method of citing data using DOIs - which will give the dataset our "data centre stamp of approval", meaning we think it's of good technical quality and we commit to keeping it indefinitely.
The scientific quality of a dataset has to be evaluated by peer review by scientists in the same domain. That's going to be a tricky job, and we're partnering up with academic publishers to work further on this.

Data Publication, with associated scientific peer review would be good for science as a whole, and also good for the data producers. It would allow us to test the conclusions published in the literature, and provide a more complete scientific record.

Of course, publishing data can't really be done in the traditional academic journal way. We need to take advantage of all these new technologies.

We're not the first to think of this - data journals already exist, and more are on the horizon. There does seem to be a groundswell of opinion that data is becoming more and more important, and citation and publication of data are key.

This pretty much sums up the situation with the project at the moment. At the end of this phase, all the NERC data centres will have at least one dataset in their archive with associated DOI, and we'll have guideline documents published for the data centre and data producers about the requirements for a dataset to be assigned a DOI.
Users are coming to us and asking for DOIs, and we're hoping to get more scientists interested in them. We're also working with the journals who express an interest in data publication, and encouraging them to mandate dataset citation in their papers too.
I really do feel like we're gathering momentum on this!





Thursday 2 February 2012

JISC Grant Funding 01/12: Digital Infrastructure Programme

JISC have announced their latest Managing Research Data call. Of particular interest (to me, anyway) is:


Managing Research Data: Innovative Data Publication

Projects to design and implement innovative technical models and organisational partnerships to encourage and enable publication of research data.

Total funding of up to £320,000 for 2-4 projects of between £80,000 and £150,000 per project.
Jun 2012 – Jul 2013.

Closing date is 12:00 noon UK time on 16 March 2012. More details here.