Tuesday, 3 April 2012

RDMF8 - Feedback from the breakout sessions

Breakout by dspindle, on Flickr


Group 1: What are the main barriers to researcher/publisher collaboration and how might they be transcended?

  • Who owns the information?
    • Researchers have a proprietary interest. Journals and institutions also talk about the data being "theirs". Issues of trust.
  • Need to make clear what value publishers add.
    • Publishers are developing user-driven solutions.
  • Integrated systems are important
    • They save duplication of time and effort
    • Feed-through of metadata from data-collection systems to publication systems
    • The DCC has a key role in research support and infrastructure systems, including sharing metadata.
  • Researcher apathy
    • Publishers make it easier to put data in repositories
    • Vicious circle: not getting credit for data makes researchers less likely to share it.
    • Lots of collaboration from everyone needed


Group 2: Can peer review of data be performed as scrupulously as peer review of publications? Is there a universal framework that can be applied?

  • Peer-review of data is not the same as peer-review of publications
    • Data integrity checks
    • Scientific review
    • User review
  • A taxonomy of data review processes is needed.
  • Publishers need explicit guidelines on expectations of reviewers regarding data.
  • Trust in datasets/repositories
    • Encouraging the wider use of DOIs is essential, as it allows researchers to find datasets and evaluate them, starting an evolutionary process for trust (a sketch of DOI-based metadata lookup follows this list).
    • There are a number of emerging standards for trusted repositories, but they're not commonly known.
  • Compound or manifest method for publishing the data, article, methods etc.
  • The role of publishers
    • varies widely across communities.
    • Publishers are probably not the best people to archive data.
    • Learned society publishers have a key role to educate researchers about data.
  • Institutions
    • have a key role as part of mobilising the scientific community
    • The expectations of institutions regarding data have to be spelled out.
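On the DOI point above: a dataset DOI is not just a link but a hook for machine-readable metadata. Below is a minimal sketch of looking up DataCite metadata for a dataset DOI via content negotiation against the doi.org resolver. The DOI used is hypothetical, and the media type should be verified against current DataCite documentation.

```python
# Sketch of resolving a dataset DOI to metadata via DataCite content
# negotiation. The DOI is hypothetical; verify the Accept media type
# against current DataCite/crosscite documentation before relying on it.
import requests

def fetch_datacite_metadata(doi):
    """Ask the DOI resolver for DataCite JSON metadata for a dataset DOI."""
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.datacite.datacite+json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

meta = fetch_datacite_metadata("10.5061/dryad.example123")  # hypothetical DOI
print(meta.get("titles"), meta.get("publicationYear"))
```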

Group 3: What future is there for national and institutional data repositories to provide a platform for data publication?

  • The future's great!
  • At the moment, institutional data policies are patchy.
    • A strong incentive for building an institutional repository is that it provides a complete record of all institutional research outputs.
  • Data is a first-class scientific output
  • Institutional repositories should be based on a national template of good practice
    • Some journals are taking on this role at the moment; it is not clear whether someone else should.
  • Reuse of datasets is a key driver.
  • Is there mileage in offering a cash prize for the best demonstration of data reuse?

Summary and next steps
Everyone at the meeting was given the job of cascading information about data publication to their colleagues/funders/institution. The DCC promised to engage with funders and others to the extent it can within the UK.

Getting research data sharing right brings real economic benefits, and that's something we don't have to persuade government about. We need to identify areas for action where everyone gains. We might find ourselves in situations where the effort and the benefit don't fall to the same people, so we need to be prepared for that.

RDMF8 - Notes from presentations, Fri 30th March.

Lecture Notes by Editor_Tupp, on Flickr

David Tempest (Elsevier) - "Journals and data publishing: enhancing, linking and mining"

  • David's role is to work on strategy and policy for all the ways people access Elsevier's data, including open access, access mechanisms, and access to data.
  • Elsevier's approach: Interconnections between data and publications are important. Scientists who create data need to have their effort recognised and valued. When journals add value and/or incur significant cost, then their contributions also need to be recognised and valued.
  • There are many potential new roles - want to embrace an active test-and-learn approach. Will be sensitive to different practices in different domains. Want to work in collaboration with others. Key is sustainability - want to ensure that information is available for the long term.
  • Publishing research consortium survey: Data sets/models/algorithms shown as being important yet difficult to access.
  • Paradox in data availability and access: asking researchers gives positive reasons for sharing data (increased collaboration, reduced duplication of effort, improved efficiencies), but also negative reasons (being scooped, no career rewards for sharing, effort needed to make data suitable for sharing). Embargo periods allow researchers to maximise the use of their data. 
  • Researchers should be certifying data, not the publisher.
  • Articles on ScienceDirect can link to many external sources - continuing to work on external and reciprocal linking (at the moment there are 40 different linking agreements). Example: linking with Pangaea.
  • Article of the future: tabbed so the user can move from one bit of an article to another very quickly, incorporating all the different data elements into the text but also in the tabs. Elsevier are rolling it out across all their journals (alongside the traditional view).
  • Supplementary data: options for expandable boxes containing supplementary information in the text.
  • Content mining: researchers want to do this, so Elsevier are doing a lot to enhance and enable content mining wherever they can. An analogy was shown with physical mining workflows (and some nice pictures too).


Brian McMahon (International Union of Crystallography) - "Research data archiving and publication in a well-defined physical science discipline"

  • The challenge for publishers engaging with data is the diversity of that data.
  • The IUCr is unusual among international unions in that it publishes its own journals. Two of its journals publish crystal structure reports; these are the most structured and disciplined publications, and the IUCr has had to integrate their handling within more general publishing workflows.
  • Brian gave a very nice description of a crystallographic (x-ray diffraction) experiment, handily explaining what's involved for all the non-crystallographers in the audience.
  • Data can mean any or all of: raw measurements from an experiment, processed numerical observations, derived structural information, variable parameters in the experimental set-up or numerical modelling and interpretation, bibliographic and linking information. No distinction is made between data and metadata - metadata are data that are of secondary interest to the current focus of attention.
  • Crystallographic Information Framework (CIF): human-readable and easily machine-parseable, with a simple tag-and-value structure (a minimal parsing sketch follows this list). There are XML instances of CIF data. CIF can be used as a vehicle for article submission: within the CIF file, the abstract and other text are carried as data fields, and the file can be reformatted to produce a more standard paper format.
  • CIF standard is documented and open.
  • "Standards are great: everyone should have one!" Important to get started - whatever you can standardise you can leverage. 
  • There is a web service called checkCIF, which runs the same programs as are used to check data on submission of a paper. Authors are encouraged to use it before submission: the author uploads a CIF file, and the programs generate a report flagging outlying values. If an anomaly is detected, the paper will not be passed on through the publishing process unless the anomaly is addressed by the author. The reviewer sees the outlier flag and the author's response and makes a judgement about it.
  • Why publish data? Reproducibility, verification, safeguard against error/fraud, expansion of research, example materials for teaching/learning, long-term preservation, systematic collection for comparative studies. Each community has to assess the cost-benefit of each of these reasons for themselves.
  • IUCr policies: Derived data made freely available. Working on a formal policy for primary data. 
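To make the tag-and-value idea concrete, here is a minimal sketch of parsing a CIF-style snippet and flagging outlying values, loosely in the spirit of checkCIF. The toy parser ignores CIF's loop_ constructs and multi-line values, and the plausible ranges are invented for illustration; they are not IUCr's actual validation criteria.

```python
# Toy CIF-style tag/value parsing plus a checkCIF-like range check.
# Real CIF has loop_ constructs, multi-line values and typed dictionaries;
# the thresholds below are invented, not IUCr's actual checkCIF criteria.

SAMPLE_CIF = """
data_example_structure
_cell_length_a      5.4307
_cell_length_b      5.4307
_cell_angle_alpha   90.0
_diffrn_ambient_temperature  293
"""

def parse_cif(text):
    """Parse simple '_tag value' lines into a dict (toy parser)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_"):
            tag, _, value = line.partition(" ")
            values[tag] = value.strip()
    return values

# Hypothetical plausible ranges; an outlier would be flagged for the
# author to explain before the paper proceeds through review.
PLAUSIBLE_RANGES = {
    "_cell_length_a": (2.0, 100.0),      # angstroms
    "_cell_angle_alpha": (30.0, 150.0),  # degrees
}

def flag_outliers(values):
    """Print an alert for any tagged value outside its expected range."""
    for tag, (lo, hi) in PLAUSIBLE_RANGES.items():
        if tag in values:
            v = float(values[tag])
            if not lo <= v <= hi:
                print(f"ALERT: {tag} = {v} outside expected range [{lo}, {hi}]")

flag_outliers(parse_cif(SAMPLE_CIF))
```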


Rebecca Lawrence (F1000) - "Data publishing: peer review, shared standards and collaboration"

  • F1000's core service is post-publication peer review in biology and medicine: 1,500 new evaluations per month, >120k in total so far. New: F1000 Posters, F1000 Research (F1000R, launching later this year).
  • F1000R addresses alternatives to current scholarly publishing approaches: speed (immediate publication), peer review (open, post-publication peer review), dissemination of findings (wide variety of formats, e.g. submitting as a poster), sharing of primary data (sharing, publication and refereeing of datasets). Gold open access, CC-BY.
  • F1000R - post publication formal open refereeing process. Submission to publication lag is days versus months for traditional journals.
  • Majority of big journals don't see data papers as prior publications.
  • Key areas requiring stakeholder collaboration for data publication: workflows, cross-linking, data centre accreditation, data peer review.
  • Datasets: issues of common/mineable formats (DCXL), deposit in relevant subject repositories where possible, otherwise in a stable general data host (Dryad, FigShare, institutional data repository), what counts as an "approved repository", what level of permanency guarantees?
  • Protocol info: enough for reuse, ultimate aim is computer mineable, MIBBI standards too extreme, F1000R looking at ISA framework with Oxford and Harvard groups.
  • Authors want it to be quick and simple: minimal effort, maximal reuse of metadata capture, smooth workflow between article, institutional repositories, data centres
  • Incentives to share data: show view/download statistics (higher than researchers think!), impact measures to show value to funders, encourage data citation in main article references (a standard data citation approach needs to be agreed).
  • Refereeing data: the time required to view many data files, and how does a reviewer know the data are OK without repeating the experiment or analysing the data themselves? Showed the ESSD example guidelines, Pensoft, BMC Research Notes.
  • Community discussion for peer-review: is the method appropriate? Is there enough information for replication? Appropriate controls? Usual format/structure? Data limitations described? Does data "look" ok?
  • The F1000R sanity check will pick up: format and suitable basic structure, standard basic protocol structure adhered to, data stored in an appropriate stable location (a sketch of this kind of check follows this list).
  • F1000R focus on whether work is scientifically sound, not novelty/interest. Encourage author-referee discussion.
  • Challenges: referee incentives, author revision incentives, clarity on referee status, knowledge of referee status away from the site (CrossMark), management of versions (what and how to cite).
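As a rough illustration of the sanity check mentioned above, here is a sketch of an automated pre-refereeing pass. The accepted formats, approved hosts and submission structure are all invented for illustration; they are not F1000 Research's actual rules.

```python
# Sketch of a pre-refereeing sanity check of the kind described above.
# The formats, "stable locations" and submission dict are invented
# examples, not F1000 Research's actual rules.
from pathlib import Path

MINEABLE_FORMATS = {".csv", ".tsv", ".xml", ".json"}
STABLE_HOSTS = {"Dryad", "FigShare", "institutional repository"}

def sanity_check(submission):
    """Return a list of problems found in a hypothetical submission dict."""
    problems = []
    for filename in submission["data_files"]:
        if Path(filename).suffix.lower() not in MINEABLE_FORMATS:
            problems.append(f"{filename}: not in a common/mineable format")
    if submission["data_host"] not in STABLE_HOSTS:
        problems.append(
            f"data host '{submission['data_host']}' is not an approved stable location")
    if not submission.get("protocol"):
        problems.append("no protocol information supplied")
    return problems

for issue in sanity_check({
    "data_files": ["results.csv", "figures.pdf"],
    "data_host": "personal website",
    "protocol": "",
}):
    print("FAIL:", issue)
```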

Ubiquity Press have guidelines for how they choose and suggest data repositories.

Todd Vision (Dryad/National Evolutionary Synthesis Centre) - "Coupling data and manuscript submission: some lessons from Dryad"

  • Roles for all stakeholders in linking/archiving data/publications
  • The basic idea of packing information into one thing (the paper) is not threatened by enhanced publications, nanopublications or data papers.
  • Requesting data from researchers after publication doesn't work very well.
  • There is a logical point in time to archive data associated with publications: during the publication process. That's when researchers are motivated to clean up their data and make it available.
  • Joint Data Archiving Policy - the start of Dryad. A grass-roots effort, rolled out slowly, in the knowledge that there wasn't the infrastructure to handle long-tail data. The response to this policy has been very positive. An embargo on data for a year after publication was key to community acceptance.
  • Dryad requirements (handed down from on high): less than 15 minutes to complete the deposit through the repository interface (once the files etc. have been prepared). Long-term preservation is important.
  • Paper provides large amounts of rich metadata associated with dataset. Orphan data, as long as one has the paper associated with it, can still be valuable. Long-tail data very information rich.
  • Journals refer authors to Dryad or other suitable repositories.
  • Curation is the most expensive part of the process. The data DOI (assigned by Dryad) is put into the article, in whatever version of the article (see the citation sketch after this list).
  • Dryad also has authors submitting data outside the integrated systems with specific journals. 
  • Data made available through CC0. About 1/3 of the files get embargoed. Some journals disallow the embargo.
  • Dryad has handshaking set up with specialised repositories: working with TreeBASE, trying to make progress with GenBank. This will require a lot of community effort on standards.
  • Adding new journals, ~1/month. Getting closer to financial sustainability all the time.
  • Legacy data being added. Data being added as a result of it being challenged in the press.
  • Incentives - in some cases data has a different author list from article author list - providing credit for dataset authors.
  • Sustainability - deposit cost covered up front. Governed by a membership non-profit organization. Gold Open Access funding model, with different options: Journal subscriptions, pre-purchase of deposits, retrospective purchase of deposits, pay-per-deposit (paid by authors). Deposit fees ~£30/$50!
  • "Perfect is the enemy of the good" for long tailed data. Repository governance should be community initiative. Lot of room for education about how to prepare data for re-use, how to make data citations actually count. Do we have enough researcher incentives, or are publisher mandates and citation enough?
  • There is a limit of 10 GB per dataset. Curation costs for datasets with many files or complicated metadata drive the cost of deposit.
  • Reuse of Dryad data: median 12 downloads in a year after deposit. Leader has ~2,000 downloads. Room for improvement in tracking how people use the downloaded data.
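To illustrate the data-DOI point above, here is a sketch of assembling a citable reference for an archived dataset. The author names, title and DOI suffix are hypothetical (10.5061/dryad is Dryad's DOI prefix, but check the repository's own citation guidance), and note that the dataset's author list may legitimately differ from the article's.

```python
# Sketch of assembling a citable reference for an archived dataset.
# The DOI suffix, names and title are hypothetical; the format loosely
# follows the "Data from:" style Dryad uses, but check the repository's
# own citation guidance before relying on it.

def data_citation(authors, year, article_title, doi):
    """Format a dataset citation resolvable through the doi.org proxy."""
    return (f"{authors} ({year}) Data from: {article_title}. "
            f"Dryad Digital Repository. https://doi.org/{doi}")

print(data_citation(
    authors="Smith J, Jones A",       # dataset authors may differ from
    year=2012,                        # the article's author list
    article_title="An example study of data reuse",
    doi="10.5061/dryad.example123",   # hypothetical suffix
))
```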

Simon Coles (University of Southampton) - "Making the link from laboratory to article"

  • Talk focussed on the researcher perspective: doing research and writing papers!
  • We don't think a lot about researchers' notebooks, or how they actually work and record what they're doing.
  • Faraday's notebooks are a great example. He recorded over 30,000 experiments and devised a metadata scheme for indexing and tagging experiments.
  • The notion that we're drowning in a sea of data is true and important to researchers.
  • Researchers manage and discuss data in relative isolation.
  • At some level, academics really do want to share, but they want recognition. There's also "how do I get my PhD student to share with me?"
  • Data puts a large burden on the journals, and it's not clear what the benefits are for the journals.
  • An example was shown of Dial-a-molecule, an EPSRC grand challenge, where information about molecules is provided very efficiently and quickly, all predicated on informatics.
  • We need to understand all the experiments ever done, and the negative results are as important as the positive ones.
  • Mining data is a big scientific driver.
  • Chemistry data is: scribblings in a book, the process of mixing stuff, analysis and characterisation of compound using instruments and computers, images, molecules, spectra, all the raw data coming out of instruments. And data ranges from highly structured to difficult to describe.
  • In chemistry publications, the abstract contains complicated information that is difficult to catalogue and understand. The experimental section has reams of coded text providing the recipes for what was done. The supplementary information also has pages of text.
  • There is a problem with information loss, for example when an author chooses one point from a complete spectrum to report in the text.
  • With structured data the problem is largely solved. The problem is with unstructured data.
  • My Lab Notebook provides an on-line research diary to capture what you're doing when you're doing it. This allows a stripped down paper to be written, containing links to the notebook.
Christopher Gutteridge (University of Southampton) - "Publishing open data"

  • Christopher's remit is to find open data at the University of Southampton and publish it in a joined up way. Or, in other words "Allow the bearer to publish any non-confidential data in our realm without let or hindrance".
  • His job title is "architect", but he thinks that "gardener" might be more appropriate.
  • Working on the principle that "we'd be smarter if we knew what we knew".
  • He started working with buildings on the grounds that they (usually) stay in the same place, and aren't known for suing.
  • There is a danger with this sort of work, in that the temptation is strong to start by opening up and designing a new system/data model, instead of first seeing what's already out there.
  • Best practice is to simply list what's available (buildings... people...) and what key is used for them.
  • He showed an example page about a building, with information about the building, a map of where it is, and a picture of what it looks like, all of which make it a lot easier for visitors and students to find the building. A PhD student put a load of information about the buildings into some code to generate a map of the university buildings. This took a lot of effort to build, but is easy to maintain.
  • Linked data on the web should have some text with it, explaining what it is, for the use of a random person who has just been dumped on the page courtesy of Google.
  • If we want to link between research facilities and papers, then each facility needs a unique ID. There is value for research data in linking it with the infrastructure that produced it (a minimal linked-data sketch follows this list).
  • Most of the links of value to an organisation are internal.
  • Homework for the forum attendees: think about the vocabulary we need to specify data management requirements.
  • Further details at http://blogs.ecs.soton.ac.uk/data/
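As a rough sketch of the ideas above (stable URIs for buildings and facilities, human-readable labels, and links from research data to the infrastructure that produced it), here is a minimal example using the rdflib Python library. All URIs, labels and vocabulary terms are invented for illustration; they are not the University of Southampton's actual identifiers.

```python
# Minimal linked-data sketch using rdflib: give a building and a research
# facility stable URIs, attach human-readable labels, and link a dataset
# to the facility that produced it. All URIs and vocabulary terms here
# are invented for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/id/")  # hypothetical ID namespace
g = Graph()
g.bind("ex", EX)

building = EX["building/32"]
facility = EX["facility/xray-diffractometer-1"]
dataset = URIRef("https://doi.org/10.5061/dryad.example123")  # hypothetical

# Human-readable labels, so "a random person dumped here by Google"
# can tell what the thing is.
g.add((building, RDFS.label, Literal("Building 32")))
g.add((facility, RDFS.label, Literal("X-ray diffractometer")))

# Internal links: the facility is located in the building; the dataset
# was produced using the facility.
g.add((facility, EX.locatedIn, building))
g.add((dataset, EX.producedUsing, facility))

print(g.serialize(format="turtle"))  # rdflib >= 6 returns a str
```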

RDMF8 - Discussions Thurs 29th March


The discussion was lively on the Thursday evening (I think we ran out of steam on the Friday, but it was still an excellent event). Below are the points that were raised:

  • Journals have a significant role in driving the connections between data and publications. The example given was Nature's demand for accession numbers in the 1970s, which was a key driver for setting up data repositories.
  • We've only just started with interactive data in papers, and we really do need to think about what readers need and want. Publishers need to become more aware of how researchers work, and get involved further upstream of paper production.
  • What is the journals' role in the preservation of data? It's not clear that publishers need to get into the data repository business. There is a need to move away from supplementary information and think about how to preserve it. We all have a responsibility to maintain data.
  • Big question: how do we define a trusted repository? Trusted repositories should be "community endorsed". Publishers are driven by the norms in each scientific community. What are sustainable models for repositories?
  • An easy way to get more out of supplementary information would be to support it in more and different formats.
  • What constitutes the version of record for datasets?
  • The peer-review process is unfunded - how would it change with the integration of data? Nature did a survey where they found that a high percentage of respondents wanted peer-review of data, but didn't want to be the ones to actually do the review. 
  • What role should repositories play in the peer-review of data?
  • Data papers might help the peer-review process, as it'd break up the procedure of review. For example, in the publication of protocols, the Royal Society of Chemistry checks the data to ensure it is internally consistent, a process separate from peer-review. Could this be part of a new role for technical editors?
  • There is a CrossRef initiative (CrossMark) in the works which will allow users to see what version a section of a paper is by hovering over it - allowing users to be aware of post publication changes.
  • The UK Data Archive has a system of high-impact and low-impact changes to determine when/if changes in a dataset trigger a new DOI (a sketch of this idea follows this list).
  • Where should data citations be put? In the text? Footnotes? There is concern about things being in the reference list which aren't peer-reviewed, and dual citations. Some publications limit their reference lists.
  • UKDA are approaching publishers to suggest methods of citation for the social sciences.
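A sketch of the high-impact/low-impact versioning idea mentioned above: low-impact changes update the existing record, while high-impact changes mint a new DOI so that the previously cited version remains the version of record. The classification of changes here is invented for illustration; it is not the UK Data Archive's actual scheme.

```python
# Sketch of high-impact/low-impact dataset versioning: low-impact changes
# keep the existing DOI, high-impact changes mint a new one so the old
# version of record stays citable. The change classification is invented,
# not the UK Data Archive's actual rules.

HIGH_IMPACT = {"values corrected", "cases added", "variables removed"}
LOW_IMPACT = {"typo in documentation", "metadata clarified"}

def next_version(doi, version, change):
    """Return the (possibly new) DOI and version after a change."""
    if change in HIGH_IMPACT:
        # New DOI, e.g. 10.1234/example.v1 -> 10.1234/example.v2
        new_doi = doi.rsplit(".v", 1)[0] + f".v{version + 1}"
        return new_doi, version + 1
    return doi, version  # low impact: same citable object

doi, version = "10.1234/example.v1", 1  # hypothetical DOI
doi, version = next_version(doi, version, "values corrected")
print(doi, version)  # 10.1234/example.v2 2
```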

Notes from RDMF8: Engaging with the Publishers - Integrating data and publications talk, Thurs 29th March, Southampton

Men with printing press, circa 1930s by Seattle Municipal Archives, on Flickr


Ruth Wilson (Nature Publishing Group) set the scene for us with an excellent keynote talk, which led into some very spirited discussion both after the talk and down in the bar before dinner. I scribbled down 3 1/2 pages of notes, so I'm not going to transcribe them all (that would be silly) but instead will aim to get the key points as I understood them. If it's a case of tl;dr, then skip down to the end to the talk's conclusions, and you'll get the gist.
  • NPG's main driving factors for their interest in data publication are: ensuring the transparency of the scientific process, and to speed up the scientific process.
  • Data needs to be: available, findable, interpretable, re-usable and citable.
  • Increasing amounts of information are integral to the article (and even more are supplementary). How can we link to data with no serving repository?
  • Interactive data is becoming important - things like 3D structures, re-graphing information, adding/removing traces, downloading the data behind graphs/figures, and geospatial data on maps. These are all being pulled together in things like Elsevier's article of the future.
  • Supplementary data has become "a limitless bag of stuff!", often with the data locked in PDF. Supplementary information is adversely affecting the review process, in that it puts extra pressure on authors, reviewers and readers. There was a 65% increase in supplementary information between 2008 and 2011. Sometimes it's only tenuously linked to the article; sometimes it's integral to the article but relegated to supplementary information due to journals' stringent space restrictions.
  • Nature Neuroscience will be trialling a new type of paper from April 2012, where the authors will submit one seamless article, putting all of the essential information into it. Editors will then work with the referees and the authors to determine what elements should stay in the paper, and what should be considered supplementary. The plan is that it will make people think what's integral to the paper and ensure all the information submitted is peer-reviewed.
  • Nature are also investigating an extended on-line version of articles (in HTML and PDF) in which up to 14 extra figures or tables can be included.
  • Nature Chemistry was shown as an example: they publish a lot of compounds, where the synthetic procedure for the compounds is in the supplementary information, and gets pulled through to the on-line article in an interactive way.
  • Linking and integration between journals and data repositories is important. NPG are looking for bidirectional linking between article and data, and are seeking more serious, interactive integration.
  • NPG has a condition that "authors are required to make materials, data and associated protocols promptly available to others without undue qualifications". It also "strongly recommends" data deposit in subject repositories.
  • Regarding data publications, the call for data to be a first-class scientific object was acknowledged, along with the interest publishers now have in data (as shown by the increasing number of fledgling data publications).
  • Data papers were described as a detailed descriptor of the dataset, with no conclusions, instead focussing on increasing interoperability and reuse. The data should be held in a trusted repository (definition of "trusted" still to be agreed!), with linking and integration between the paper and the data. Citation would give credit to data producers, and would also provide attribution and credit for data managers, who might not qualify for authorship of a traditional paper.
The conclusions:
  • Linking publications and data strengthens the scientific record and improves transparency
  • Funders' policies are a key driver for integrating data and publications
  • Journals can and do influence data deposition
  • Not a situation of one size fits all!
  • Partnerships are important (institutions, repositories, publishers, researchers, funders), but the roles are not well established, and business models need to be determined.