Sand castles by experts in Copenhagen
This is the last one of these posts, as it's the end of my notes from the Tallinn/Copenhagen trip. Unfortunately it wasn't the last of the meetings I had to go to; the final one was a report-drafting meeting of the CODATA working group on data citation, which doesn't have any presentation notes, but meant I missed the second half of the DataCite meeting.
Anyway, notes from the DataCite Summer Meeting presentations I did get to see are below:
Wed 13th June - Presentation of DataCite and its annual meeting
DataCite - Adam Farquhar
Data gets analysed, synthesised and interpreted to become Information, which when published becomes Knowledge. Published knowledge is accessible - have a mature infrastructure for distributing knowledge. Information can be traceable, but data can get lost.
Finding and citing data should be easy, but it isn't.
If we cited data:
* higher visibility
* re-use and verification of data
* improved reputation for data producers
DataCite is addressing these challenges. DataCite founded in 2009. 2005 TIB pioneers DOIs for data. 2010 - pilot with data centres. 2011 >1million DOIs registered. 2012 - things consolidating, improved production infrastructure.
DataCite is a not-for-profit association. IDF (International DOI Foundation) member. DOI Registration Agency. Managing agent at TIB. DataCite has member institutions (e.g. British Library) which have data clients (e.g. BADC).
Also affiliate members who are organisations interested in this problem, but use other mechanisms for assigning DOIs, or aren't interested in DOIs.
Global association with local representation.
Framework of support:
* 1.3 M DOIs registered
* engagement with EU for Horizon 2020
* joint statement with STM on linking and citing data
* DOI standard approved ISO 26324:2012
* metadata consultation schema.datacite.org
* metadata storage mds.datacite.org
* CNRI hosts DOI resolver
* OAI harvesting oai.datacite.org
* content negotiation data.datacite.org - resolve a DOI to persistent RDF triples. Bringing together DOIs for data and linked data.
* citation formatter (RIS, BibTeX, RDF)
* beta stats capability (allocation, resolution)
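As a rough illustration of the content negotiation item above: the client asks the resolver for a DOI and uses the HTTP Accept header to say which representation it wants. The sketch below only builds the request and does not send it. The DOI is made up, and the resolver URL and media types are my assumptions rather than anything stated in the talk, so check the resolver's own documentation before relying on them.

```python
from urllib.request import Request

# Media types assumed for illustration; the resolver's documentation is the
# authority on which formats it actually serves.
FORMATS = {
    "bibtex": "application/x-bibtex",
    "ris": "application/x-research-info-systems",
    "rdf": "application/rdf+xml",
}


def build_citation_request(doi: str, fmt: str,
                           resolver: str = "https://doi.org/") -> Request:
    """Build (but do not send) a request asking for a DOI in a given format."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    # The Accept header is what drives content negotiation.
    return Request(resolver + doi, headers={"Accept": FORMATS[fmt]})


# Hypothetical DOI, used only to show the shape of the request.
req = build_citation_request("10.1234/example-dataset", "bibtex")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would actually fetch the metadata; that step is left out so the sketch runs without network access.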
Trends in member highlights:
* establishing contractual relationships with data centres
* active DOI allocation
* growing communities of practice
* increasing awareness of stakeholder community
* increasing role in national infrastructure
Member highlights (a selection)
* figshare data to have DataCite DOIs
* open source Excel add-in under development
* JISC funded workshop series on data citation (UK)
BGI E. coli example. May - outbreak. 2 June - BGI released the sequence with a DOI.
BGI blog - GigaBlog: "Notes from an E. coli 'tweenome'"
14 June 2012 "DIGITAL RESEARCH DATA IN PRACTICE: solutions for improving discovery, access and use"
Keynote: The science of science
Dr Jonathan Grant, President, RAND Europe
Impact - don't mean writing a paper in Nature or getting a paper cited, as this is academic impact. Looking at impact beyond the academic system. People largely funded by the taxpayer - what they're doing to make the world a better place.
RAND Europe: independent, not-for-profit public policy research institute. Part of the global RAND Corporation. Work across the breadth and depth of government. Fundamentally a provider of evidence. Provides evidence for evidence-based policy.
Why evaluate research?
* advocacy - make the case for research funding. Worry that it's sometimes too driven by anecdote.
* accountability - to taxpayers, donors etc. Demonstrate that research provides value for money.
* analysis - what works in research funding. We have very little understanding of this in research. Don't know if it's best to fund research through people or projects, best to have pure research or mixed research and teaching.
* allocation - what to fund (institution, field, people) UK have really concentrated on this in the past few years. 20% of university funding determined by REF. Creating new pressures on universities.
How do you evaluate research?
* No one tool is perfect! Any sensible evaluation will use multiple tools.
* benchmarking - not in the context of league tables, rather as a way of driving learning
* logic models - how to capture non-linear effects of research
* bibliometrics - how do we measure citations on other types of documents, e.g. clinical guidelines, policy documents
* case studies - allow you to get to the texture of the research process
* peer review
Group why and how on a matrix.
* know why you are measuring research
* what is the objective of the evaluation
* use a multi-method, multi-dimensional approach
* don't rely on one method
* evaluation isn't easy
* no funder has the answer
* need to move from advocacy to accountability
* need "science of science" to understand what works
* practical evidence base for science policy
* need to "walk the talk"
Case Study: Estimating the economic returns from research "Medical research, what's it worth?"
* US "exceptional returns" 2003 - 20 times greater
* Australian (Access Economics) 2003 - return of 500%
To calculate ROI made 4 key estimates:
1. how much was spent
2. how long does it take (time lag between input and output)
3. how much health gain
4. how much spillover (e.g. spin-out companies)
Focussed on cardiovascular and mental health research
1975-1992 £2 billion in funding in the UK
Good data for cardiovascular, mental health was the stress test of the models!
1985-2005 net cardiovascular health gains totalled about £53 billion.
QALYs (Quality Adjusted Life Year - an additional year of quality life) Each QALY was worth £25k (number used by NHS to determine cost effectiveness of medication)
2.8M QALYs - net total of £53 billion in health gains. Time lag estimated to be about 17 years. Mean age of paper cited in clinical guidelines is 12.5 years.
4 other studies with other methods come up with same 17 year time lag.
Spillover really hard to estimate!! Did lit review and took best estimate of around 30%. Problems with this, estimate is old, US based, agriculture derived.
Total return of 39%.
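The figures in my notes can be cross-checked with some quick arithmetic. This is my own back-of-envelope reading, not the study's method. It assumes the gap between the gross QALY value and the net £53 billion is the health service cost that was netted off, and that the total return is simply the health-gain return plus the spillover return.

```python
# Figures as quoted in the notes above (all rounded in the talk).
QALYS_GAINED = 2_800_000
VALUE_PER_QALY_GBP = 25_000
NET_HEALTH_GAIN_GBP = 53_000_000_000
TOTAL_RETURN_PCT = 39
SPILLOVER_RETURN_PCT = 30

# Gross monetary value of the health gain.
gross_value_gbp = QALYS_GAINED * VALUE_PER_QALY_GBP

# Assumption: gross minus net is the cost of delivering the care.
implied_costs_gbp = gross_value_gbp - NET_HEALTH_GAIN_GBP

# Assumption: total return = health-gain return + spillover return.
implied_health_return_pct = TOTAL_RETURN_PCT - SPILLOVER_RETURN_PCT

print(f"gross QALY value: £{gross_value_gbp / 1e9:.0f} billion")
print(f"implied costs netted off: £{implied_costs_gbp / 1e9:.0f} billion")
print(f"implied return from health gains: {implied_health_return_pct}%")
```

On these numbers most of the 39% comes from the spillover estimate, which is the figure the speaker flagged as the hardest to pin down.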
US and Australia studies took top down approach to look at overall gains - not linked to interventions. Attributed half gains to R&D. Assumed instantaneous benefits. Didn't net off health service costs. Used high "willingness-to-pay" value for putting a cost on life.
Impact of study:
* used to make case for research
* cited in parliamentary debates
* used as evidence for prep of spending review
* "genuine attempt to objectively assess the economic returns of research" David Willetts
"Mapping the impact: exploring the payback of arthritis research"
Work done for Arthritis Research UK
End of grant reports
* completed at end of grants so many outputs have not happened
* burdensome on researchers
* relatively unstructured narrative - hard to aggregate and analyse
* should be applicable to all areas of ARC research
* applied to every grant
* minimise burden on researchers and admin
* collect only what can be analysed
Worked with 40 ARC researchers to develop a web-based questionnaire based on yes-no questions. 187 simple concrete questions. Now 220 yes-no questions. 70% fill it in within 30 minutes
Categorising research impacts:
* knowledge production
* research targeting and capacity building
* informing policy
* health and health sector benefit
Use questions to build an impact array - build a database of multiple research grants and impacts. Still in development but will be able to use it to do statistics.
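To make the impact array idea concrete, here is a minimal sketch of how yes-no answers could be turned into a grants-by-questions grid and then counted. The grant names and question labels are invented for illustration; the real questionnaire and its database were not shown in any detail.

```python
# Hypothetical yes/no answers, one dict per grant.
responses = {
    "grant_A": {"published_paper": True, "trained_phd": True,
                "cited_in_guideline": False},
    "grant_B": {"published_paper": True, "trained_phd": False,
                "cited_in_guideline": False},
    "grant_C": {"published_paper": True, "trained_phd": True,
                "cited_in_guideline": True},
}

# Fix a column order so every grant's row lines up with the same questions.
questions = sorted({q for answers in responses.values() for q in answers})

# The impact array: one row per grant, one 0/1 cell per question.
impact_array = {
    grant: [int(answers.get(q, False)) for q in questions]
    for grant, answers in responses.items()
}

# Simple statistic: how many grants reported each kind of impact.
grants_reporting = {
    q: sum(row[i] for row in impact_array.values())
    for i, q in enumerate(questions)
}

for grant, row in impact_array.items():
    print(grant, row)
print(grants_reporting)
```

Because every cell is a 0 or 1, the same grid can be summed by row (breadth of impact per grant) or by column (how common each impact is).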
England's National Institute for Health Research now using this questionnaire.
Allows assessment of impact occurring in time.
Work in progress - way of disaggregating impact into a series of data points. MRC does something similar based on this work.
Case study really useful for analysis of research success factors. Develop case studies over a period of 20 years - historical. Bring together a panel of peer-reviewers to rate case studies on different impact factors.
* impact of collaboration - the more teams collaborate, the greater the impact of their research
* tentatively - practice among smaller research funders of taking on what the large funders just missed out on funding - associated with lower impact
Researchers are having to write case studies of the impact of their research for REF, which will be peer-reviewed. Would be great to be able to data mine and analyse these case studies! 5,000 ranked impact statements.
Good evaluation is about weaving multiple evidence together in a coherent tapestry.
Session 1: Discovery: It’s all about the metadata? Or is it?
Vishwas Chavan, GBIF
Towards next generation (data inclusive) publishing
Global Biodiversity Information Facility (GBIF) established 2001. Works on principle of free and open access to biodiversity data. Global infrastructure for the sharing of biodiversity data.
327 million data records through GBIF portal. 14k data sets.
Biases in data accessibility to western world.
Lessons learned from 10 years of GBIF:
* technology and infrastructure is not a barrier
* challenging to work on capacity building, changing policies, social and cultural change
Trends in scientific publishing
* scholarly publishing
* grey literature
* data publishing - small fraction of the community!
Why do we prefer scholarly publishing to data publishing?
* lack of comprehensive data publishing framework that can satisfy individual ego and institutional pride. All about recognition.
* lack of incentives to invest time, money, energy, in assembling and publishing the data
GBIF data publishing framework task group - recommendations published last year in 2011
* Persistent ids at dataset level, but also data record
* publish data papers
* data citation practices that in some ways replicate print citations, but take advantage of digital methods
* data usage index - track impact of publishing data, assess data curation
Persistent identifier for data records, but also to physical specimens, gene sequences, taxon names, authoritative taxonomies, scholarly publications, legacy literature, multimedia artwork, people
Started to deal with permanent ids for data sets and records, and linking physical specimens to data records. Natural History Museum in London assigning DOIs to their specimen records.
Everyone feels it's too much of a burden to write good metadata. Publish metadata as a scholarly article.
* promote and publicise the existence of data
* provides credit
* describes data in a structured human-readable form
* peer-review of datasets - 1st quarter next year: ~70 data papers in 6 different journals
Chavan and Penev 2011 "Data paper: a mechanism to incentivise publishing in biodiversity science" BMC Bioinformatics
ZooKeys and PhytoKeys
Possible to have statistical analysis etc in data paper.
Data citation - difficult to identify who collected, created or added value to the data using current citation metrics.
Needed in a citation: publisher, dataset title or identification, contributor and contributor role, release, updates, volume and how to access it.
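As a sketch of what carrying those elements around might look like, the snippet below holds them in a small record and renders one citation string. The field names, the ordering of the output and the example values are all my own invention; the talk did not prescribe a format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contributor:
    name: str
    role: str  # e.g. collector, creator, curator


@dataclass
class DataCitation:
    publisher: str
    title: str
    release: str   # version or release date of the dataset
    volume: str    # size of the dataset, e.g. number of records
    access: str    # how to get at it, e.g. a URL or DOI
    contributors: List[Contributor] = field(default_factory=list)

    def format(self) -> str:
        """Render the elements as one human-readable citation string."""
        people = "; ".join(f"{c.name} ({c.role})" for c in self.contributors)
        return (f"{people}. {self.title}, release {self.release}, "
                f"{self.volume}. {self.publisher}. {self.access}")


# Entirely hypothetical dataset, for illustration only.
citation = DataCitation(
    publisher="Example Museum",
    title="Beetle occurrence records",
    release="2012-06",
    volume="1200 records",
    access="doi:10.1234/example-dataset",
    contributors=[Contributor("A. Collector", "collector"),
                  Contributor("B. Curator", "curator")],
)
print(citation.format())
```

Keeping the role next to each name is the part that current citation metrics lose, which was the point made in the talk.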
Recommended citation practice:
1. citation given by publisher "please cite data as..."
2. query based citations
[citation contains both url and doi - redundant?]
Mechanism to assess impact of data management: data usage index - similar to impact factor
Measure of the impact of published data being accessed and used by the scientific community.
Computed on 14 biodiversity data usage indicators: Ingwersen and Chavan 2011, BMC Bioinformatics "Indicators for Data Usage Index..."
Want to move towards a unified publishing index - combination of impact factor, citation impact indices, data usage index and other indices. Is this even possible?
Introduction to CODATA task group:
* create awareness of data citation and current practices.
* developing a white paper on current practices, with recommendations for best practice next year.
Andrew Treloar, Director, Technology, Australian National Data Service
Seeking Serendipity: repurposing DataCite metadata to augment ANDS discovery
More researchers using more data more often
Data as a first class object
ANDS transforming data to structured collections.
- trying to complement existing disciplinary resources.
- go to where the users are - data descriptions discoverable through what resources people use to find things (Google)
- provide context for discovery (make it easy to find people/project and follow links to organisations, data) assessing value
Research Data Australia
ANDS discovery service supports serendipity (find things they haven't deliberately searched for) Provide suggested links, start small and add functionality.
1. Internal suggestions (other data collections with matching subjects)
2. draw on DataCite search API (not yet in production) Uses existing RDA title as search probe against DataCite metadata. Search in real time, aim for best possible match. Suggested links start with internal records but also include DataCite records. Can go directly to original source
3. still thinking about stage 3 - National Library of Australia, NARCIS? Atlas of Living Australia
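The title-as-probe idea in stages 1 and 2 can be sketched with a toy matcher. This is not the DataCite search API or the ANDS implementation: it scores candidate titles by word overlap (Jaccard similarity, my choice) against in-memory lists of made-up records, and lists internal matches before external ones as described above.

```python
import re


def tokens(text):
    """Lower-case word set for a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two titles, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def suggest_links(probe_title, internal, external, limit=3):
    """Rank each source by similarity to the probe, internal records first."""
    def ranked(records):
        scored = [(jaccard(probe_title, r["title"]), r) for r in records]
        scored = [(s, r) for s, r in scored if s > 0]  # drop non-matches
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [r for _, r in scored]
    return (ranked(internal) + ranked(external))[:limit]


# Hypothetical records standing in for RDA and DataCite metadata.
internal = [
    {"title": "Coral reef survey data 2010", "source": "RDA"},
    {"title": "Murray River flow gauges", "source": "RDA"},
]
external = [
    {"title": "Global coral reef monitoring dataset", "source": "DataCite"},
    {"title": "Arctic sea ice extent", "source": "DataCite"},
]

for link in suggest_links("Coral reef temperature survey", internal, external):
    print(link["source"], "-", link["title"])
```

A real service would query the remote index rather than score titles locally, which is where the scaling questions raised later in the talk come in.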
Possible enhancements of DataCite usage:
* tweak search rankings to highlight ANDS data
* See suggested links for the whole set of records instead of individually
* use path followed to get to current page to rerank terms in search
* use RDA subject description
* same or similar spatial/temporal coverage
* links to co-authors' collections
* shared keywords
Issues for the future
* does DataCite care that going straight to the resolved record bypasses its own traffic?
* how will this scale?
* How will UI work? Won't really work with 20/100 external sources
* does the user care? richness vs complexity tradeoffs
Eefke Smit, STM Association
Data and Publications; and how they belong together
Famous Nature paper - DNA structure - 1953: 1 page, 2 authors, no data, 1 figure
2001 Human genome: 62 pages, 49 figures, 27 tables - issue has foldouts - pushing the boundaries of possibility of print
Human genome at 10 - 2010: an iPad edition. More than 1000 genomes described, raw data included. Main part of the paper looks traditional but a lot of extra digital content
Utopia document - interactive PDFs - can click through from figure to underlying data. Biochemical Journal, Portland Press.
Elsevier offers gene and protein viewers from within the article. Data can be stored in archives outside Elsevier.
Sheer volume of data is a problem - depositions of datasets in archives continue to grow, surpassing journal articles in biomedical research.
Some publishers struggling with data problem. Journal of Neuroscience no longer accepts supplementary files as the supp material would outgrow the article volume. Journal Cell - editors suspect researchers treat supplements as data dumping grounds. In general publishers cannot guarantee proper preservation and future accessibility of suppl. files.
Authors have too few alternative places to put their data except for journals.
PARSE.Insight survey 2009 on where researchers store their data. Majority is in computer at work. Digital archives get very little. Digital archives are top choice for where scientists would be willing to submit their research data.
Opportunities for Data Exchange (ODE)
Data publication pyramid image.
Data in a publication is very processed, condensed. No need to have the data and publication in one place, if linking is available. Data centres have more expertise in looking after data than publishers.
Likely short term reality: Estimates are that at least 75% of research data is never made openly available. Too many disciplines lack a community-endorsed data archive. Risk that supplements to articles turn into data dumping places.
Ideally: Data archives get a much bigger share, supplements will shrink (only if data can't be integrated in the article and only relevant extra explanations), more integration of text and data with more seamless links to data sources.
Publishers can help make things better:
* stricter editorial policies about availability of data
* recommend trustworthy data repositories
* [lots more I missed]
STM and DataCite joint statement issued today.