Tuesday, 3 April 2012

RDMF8 - Notes from presentations, Fri 30th March.

[Photo: Lecture Notes, by Editor_Tupp on Flickr]

David Tempest (Elsevier) - "Journals and data publishing: enhancing, linking and mining"

  • David's role is to work on strategy and policy for all the ways people access Elsevier's data, including open access, access mechanisms and access to data.
  • Elsevier's approach: interconnections between data and publications are important. Scientists who create data need to have their effort recognised and valued. When journals add value and/or incur significant cost, their contributions also need to be recognised and valued.
  • There are many potential new roles - Elsevier want to embrace an active test-and-learn approach, be sensitive to different practices in different domains, and work in collaboration with others. Key is sustainability - ensuring that information is available for the long term.
  • Publishing Research Consortium survey: data sets/models/algorithms shown as being important yet difficult to access.
  • Paradox in data availability and access: researchers give positive reasons for sharing data (increased collaboration, reduced duplication of effort, improved efficiencies), but also reasons against (fear of being scooped, no career rewards for sharing, the effort needed to make data suitable for sharing). Embargo periods allow researchers to maximise the use of their own data.
  • Researchers should be certifying data, not the publisher.
  • Articles on ScienceDirect can link to many external sources - Elsevier are continuing to work on external and reciprocal linking (at the moment there are 40 different linking agreements). Example: linking with Pangaea (a toy sketch of DOI-based linking follows this list).
  • Article of the Future: tabbed so the user can move from one part of the article to another very quickly, incorporating all the different data elements into the text but also in the tabs. Elsevier are rolling it out across all their journals (alongside the traditional view).
  • Supplementary data: options for expandable boxes containing supplementary information in the text.
  • Content mining: researchers want to do this, so Elsevier are doing a lot to enhance and enable content mining wherever they can. An analogy was shown with physical mining workflows (and some nice pictures too).
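
As an aside on mechanism: article-to-data links like the Pangaea example are typically built on dataset DOIs, and a DOI registered with DataCite can be resolved to machine-readable metadata through standard content negotiation. Below is a minimal Python sketch of that general mechanism (not Elsevier's internal linking system); it assumes the requests library, and the DOI suffix is an invented placeholder.

import requests

# Resolve a dataset DOI to machine-readable metadata via standard DOI
# content negotiation (supported by the DataCite and CrossRef resolvers).
# The suffix below is invented; real Pangaea DOIs use the 10.1594 prefix.
doi = "10.1594/PANGAEA.000000"  # hypothetical example, will not resolve
response = requests.get(
    "https://doi.org/" + doi,
    headers={"Accept": "application/vnd.datacite.datacite+json"},
    timeout=30,
)
response.raise_for_status()
metadata = response.json()
print(metadata.get("titles"), metadata.get("creators"))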


Brian McMahon (International Union of Crystallography) - "Research data archiving and publication in a well-defined physical science discipline"

  • Challenge for publishers engaging with data is the diversity of data. 
  • The IUCr is unusual among international unions in that it publishes its own journals. Two of these publish crystal structure reports - the most structured and disciplined of its publications - and handling them had to be integrated within more general publishing workflows.
  • Brian gave a very nice description of a crystallographic (x-ray diffraction) experiment, handily explaining what's involved for all the non-crystallographers in the audience.
  • Data can mean any or all of: raw measurements from an experiment, processed numerical observations, derived structural information, variable parameters in the experimental set-up or numerical modelling and interpretation, and bibliographic and linking information. Make no distinction between data and metadata - metadata are data that are of secondary interest to the current focus of attention.
  • Crystallographic Information Framework (CIF): human readable and easily machine parseable, with a simple tag-and-value structure (a toy sketch follows this list); XML representations of CIF data also exist. CIF can be used as a vehicle for article submission: within the CIF file, the abstract and other text are carried as data fields, and the file can be reformatted to produce a more standard paper format.
  • CIF standard is documented and open.
  • "Standards are great: everyone should have one!" Important to get started - whatever you can standardise you can leverage. 
  • There is a web service called checkCIF, which runs the same programs used to check data on submission of a paper; authors are encouraged to use it before submission. The author uploads a CIF file and the programs generate a report flagging outlying values. If an anomaly is detected, the paper will not be passed on through the publishing process unless the author addresses it. The reviewer sees the outlier flag and the response, and makes a judgement about them.
  • Why publish data? Reproducibility, verification, safeguard against error/fraud, expansion of research, example materials for teaching/learning, long-term preservation, systematic collection for comparative studies. Each community has to assess the cost-benefit of each of these reasons for themselves.
  • IUCr policies: Derived data made freely available. Working on a formal policy for primary data. 
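
To make the tag-and-value structure concrete, here is a toy Python sketch that parses a few core CIF cell-parameter tags and flags implausible values, loosely in the spirit of checkCIF's alerts. The tags are genuine core CIF dictionary names, but the values, threshold and parsing logic are simplified inventions - the real checkCIF runs a large suite of crystallographic tests.

# A minimal CIF fragment: simple tag-and-value pairs in a data block.
# The third cell length is deliberately implausible, to trigger the alert.
CIF_TEXT = """\
data_example
_cell_length_a    10.234
_cell_length_b    11.567
_cell_length_c     0.089
_cell_angle_alpha 90.00
"""

def parse_cif_pairs(text):
    """Parse one-line tag/value pairs from a CIF data block (ignores loop_ blocks)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_"):
            tag, _, value = line.partition(" ")
            values[tag] = value.strip()
    return values

# Toy checkCIF-style sanity check: flag cell lengths outside a plausible range.
cif = parse_cif_pairs(CIF_TEXT)
for tag in ("_cell_length_a", "_cell_length_b", "_cell_length_c"):
    length = float(cif[tag])
    if not 2.0 < length < 1000.0:
        print(f"ALERT: {tag} = {length} looks anomalous")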


Rebecca Lawrence (F1000) - "Data publishing: peer review, shared standards and collaboration"

  • F1000's core service is post-publication peer review in biology and medicine: 1,500 new evaluations per month, >120k total so far. New: F1000 Posters and F1000 Research (F1000R, launching later this year).
  • F1000R addresses alternatives to current scholarly publishing approaches: speed (immediate publication), peer review (open, post-publication peer review), dissemination of findings (wide variety of formats, e.g. submitting as a poster), and sharing of primary data (sharing, publication and refereeing of datasets). Gold open access, CC-BY.
  • F1000R - post publication formal open refereeing process. Submission to publication lag is days versus months for traditional journals.
  • Majority of big journals don't see data papers as prior publications.
  • Key areas requiring stakeholder collaboration for data publication: workflows, cross-linking, data centre accreditation, data peer review.
  • Datasets: issues of common/mineable formats (DCXL); deposit in relevant subject repositories where possible, otherwise in a stable general data host (Dryad, FigShare, institutional data repository); what counts as an "approved repository"; what level of permanency guarantees?
  • Protocol info: enough for reuse, with the ultimate aim of being computer-mineable. MIBBI standards are too extreme; F1000R is looking at the ISA framework with Oxford and Harvard groups.
  • Authors want it to be quick and simple: minimal effort, maximal reuse of metadata capture, smooth workflow between article, institutional repositories, data centres
  • Incentives to share data: show view/download statistics (higher than researchers think!), impact measures to show value to funders, encourage data citation in main article references (need to agree a standard data citation approach - an illustrative format follows this list).
  • Refereeing data: the time required to view many data files; how does a reviewer know the data are OK without repeating the experiment or analysing the data themselves? Showed example guidelines from ESSD, Pensoft and BMC Research Notes.
  • Community discussion for peer-review: is the method appropriate? Is there enough information for replication? Appropriate controls? Usual format/structure? Data limitations described? Does data "look" ok?
  • F1000R sanity check will pick up: format and suitable basic structure, standard basic protocol structure adhered to, data stored in appropriate stable location.
  • F1000R focus on whether work is scientifically sound, not novelty/interest. Encourage author-referee discussion.
  • Challenges: referee incentives, author revision incentives, clarity on referee status, knowledge of referee status away from the site (CrossMark), management of versions (what and how to cite).
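
On the open question of a standard data citation approach: one widely discussed pattern is the DataCite-recommended form (creator, publication year, title, repository, identifier). The details below are invented purely for illustration:

    Smith J, Jones A (2012): Example phenotype dataset. [Name of data repository]. doi:10.xxxx/xxxxx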

Ubiquity Press have guidelines for how they choose and suggest data repositories.

Todd Vision (Dryad/National Evolutionary Synthesis Center) - "Coupling data and manuscript submission: some lessons from Dryad"

  • Roles for all stakeholders in linking/archiving data/publications
  • The basic idea of packing information into one thing (the paper) is not threatened by enhanced publications, nanopublications or data papers.
  • Requesting data from researchers after publication doesn't work very well.
  • There is a logical point in time to archive data associated with publications: during the publication process. That's when researchers are motivated to clean up their data and make it available.
  • Joint Data Archiving Policy - start of Dryad. Grass-roots effort, rolled out slowly, in the knowledge that there wasn't the infrastructure to handle the long tail data. Response to this policy has been very positive. Embargo on data for a year after publication key to community acceptance.
  • Dryad requirements (handed down from on high): Less than 15 minutes to complete the deposit through repository interface (once files etc. had been completed). Long term preservation important.
  • Paper provides large amounts of rich metadata associated with dataset. Orphan data, as long as one has the paper associated with it, can still be valuable. Long-tail data very information rich.
  • Journals refer authors to Dryad or other suitable repositories.
  • Curation is the most expensive part of the process. The data DOI (assigned by Dryad) is put into the article, in whatever version of the article (an example statement follows this list).
  • Dryad also has authors submitting data outside the submission systems integrated with specific journals.
  • Data made available through CC0. About 1/3 of the files get embargoed. Some journals disallow the embargo.
  • Dryad have handshaking set up with specialised repositories: working with TreeBASE, and trying to make progress with GenBank. This will require a lot of community effort on standards.
  • Adding new journals, ~1/month. Getting closer to financial sustainability all the time.
  • Legacy data being added. Data being added as a result of it being challenged in the press.
  • Incentives - in some cases data has a different author list from article author list - providing credit for dataset authors.
  • Sustainability - deposit cost covered up front. Governed by a membership non-profit organization. Gold Open Access funding model, with different options: Journal subscriptions, pre-purchase of deposits, retrospective purchase of deposits, pay-per-deposit (paid by authors). Deposit fees ~£30/$50!
  • "Perfect is the enemy of the good" for long tailed data. Repository governance should be community initiative. Lot of room for education about how to prepare data for re-use, how to make data citations actually count. Do we have enough researcher incentives, or are publisher mandates and citation enough?
  • Limit of 10 GB for dataset. Curation costs for lots of files/complicated metadata drive the costs of deposit.
  • Reuse of Dryad data: median 12 downloads in a year after deposit. Leader has ~2,000 downloads. Room for improvement in tracking how people use the downloaded data.
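
For reference, the Dryad data DOI mentioned above typically appears in the article as a short data availability statement, something like the following (the suffix is an invented placeholder):

    Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.xxxxx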

Simon Coles (University of Southampton) - "Making the link from laboratory to article"

  • Talk focussed on the researcher perspective: doing research and writing papers!
  • We don't think a lot about researchers' notebooks - how they actually work, or how they record what they're doing.
  • Faraday's notebooks are a great example. He recorded over 30,000 experiments and devised a metadata scheme for indexing and tagging experiments.
  • The notion that we're drowning in a sea of data is true and important to researchers.
  • Researchers manage and discuss data in relative isolation.
  • At some level, academics really do want to share, but they want recognition. There's also "how do I get my PhD student to share with me?"
  • Data puts a large burden on the journals, and it's not clear what the benefits are for the journals.
  • Example shown of Dial-a-molecule, an EPSRC grand challenge, where information about molecules is provided very efficiently and quickly, all predicated on informatics.
  • We need to understand all the experiments ever done and the negative results are as important as the positive ones.
  • Mining data is a big scientific driver.
  • Chemistry data is: scribblings in a book, the process of mixing stuff, analysis and characterisation of compounds using instruments and computers, images, molecules, spectra, and all the raw data coming out of instruments. And data ranges from highly structured to difficult to describe.
  • In chemistry publications, the abstract contains complicated information that is difficult to catalogue and understand; the experimental section has reams of coded text providing the recipes for what was done; supplementary information also runs to pages of text.
  • There is a problem with information loss, for example, when an author chooses one point from a complete spectrum to report on in the text.
  • With structured data the problem is largely solved. The problem is with unstructured data.
  • My Lab Notebook provides an online research diary to capture what you're doing when you're doing it. This allows a stripped-down paper to be written, containing links to the notebook.

Christopher Gutteridge (University of Southampton) - "Publishing open data"

  • Christopher's remit is to find open data at the University of Southampton and publish it in a joined up way. Or, in other words "Allow the bearer to publish any non-confidential data in our realm without let or hindrance".
  • His job title is "architect", but he thinks that "gardener" might be more appropriate.
  • Working on the principle that "we'd be smarter if we knew what we knew".
  • He started working with buildings on the grounds that they (usually) stay in the same place, and aren't known for suing.
  • There is a danger with this sort of work, in that the temptation is strong to start by designing a new system/data model, instead of seeing what's already out there first.
  • Best practice is to simply list what's available (buildings... people...) and what key is used for them (a small linked-data sketch follows this list).
  • He showed an example page about a building, with information about the building, a map of where it is, and a picture of what it looks like, all of which make it a lot easier for visitors and students to find the building. A PhD student put a load of information about the buildings into some code to generate a map of the university buildings. This took a lot of effort to build, but is easy to maintain.
  • Linked data on the web should have some text with it, explaining what it is, for the use of a random person who has just been dumped on the page courtesy of Google.
  • If we want to link between research facilities and papers, then each facility needs a unique id. There is value for research data in linking it with the infrastructure that produced it.
  • Most of the links of value to an organisation are internal.
  • Homework for the forum attendees: think about the vocabulary we need to specify data management requirements.
  • Further details at http://blogs.ecs.soton.ac.uk/data/
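
To make the "list what's available and what key is used" idea concrete, here is a small sketch using Python's rdflib library that publishes a building as linked data: a stable URI, a human-readable label (for that random person arriving from Google), and a location. The URI scheme and details are invented placeholders, not Southampton's actual identifiers.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Invented URI scheme for illustration; a real deployment would mint
# stable identifiers under the organisation's own domain.
ORG = Namespace("http://data.example.ac.uk/id/")
GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")

g = Graph()
building = ORG["building/32"]
g.add((building, RDF.type, ORG["Building"]))
g.add((building, RDFS.label, Literal("Example Building 32")))
g.add((building, GEO["lat"], Literal("50.9371")))
g.add((building, GEO["long"], Literal("-1.3957")))

# Serialise as Turtle; the same graph could equally back a human-friendly
# HTML page with a label, map and photo, as in the building-page example.
print(g.serialize(format="turtle"))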
