Citing Bytes - Adventures in Data Citation: OpenAIREplus workshop notes

The Black Diamond from the water - not a bad little conference venue!

“Linking Open Access publications to data – policy development and implementation”

The next stop in my marathon run of conferences/workshops/meetings was Copenhagen, and the Black Diamond, the home of the Royal Library of Denmark, for the OpenAIREplus workshop and the Nordbib conference (more on that in a later post).

This post contains my notes from the presentations given as part of the OpenAIREplus workshop, and boy, they crammed a lot in there! Some really fascinating talks, and my only complaint would be that we didn't have enough time in the breakout sessions to really get into the meat of things. But given that we'd started at 8.30am, I'm not sure where we could have found any more time!

(Insert usual disclaimer about notes here...)

“OpenAIREplus – an overview of activities” – Najla Rettberg

The EU has an open access mandate, SC39, which means that it's mandatory for certain EU funded projects to put documentation into the OpenAIRE repository.

OpenAIREplus wants to link publications to datasets and funding info, and extend to projects beyond FP7.

If a scientist doesn't have access to an institutional repository for data, they can put it in the OpenAIRE orphan repository.

OpenAIREplus has 3 main strands: Technical, Outreach and Services.

Technical group:
* Data being gathered:
* Publication metadata - DRIVER, OpenAIRE
* Project/funding - EU, national metadata
* Licenses
* Datasets - data repositories
* working on producing guidelines for data providers
* have guidelines for repositories already written
* building a cross-discipline infrastructure
* prototypes available at the end of this year.

Outreach:
* Network of EU Open Access Knowledge
* Champion of Open Access
* Helpdesk - already exists, but will be expanded in OpenAIREplus
* Looking to collaborate with other researchers.
* 4 different regions: Portugal (south region), Ghent (west region), Denmark (north region) and Ukraine (east region)
* Results from a data survey done: 16% of respondents have an institutional repository and institutional data policy. 24% have a data policy at funding level.
* Training and interactions with researchers - issues raised about sharing, openness and relevance to scientists.

Services:
* OpenAIRE portal exists
* user can search for publications, get statistics, deposit data and metadata.
* Researchers will be able to make links between data and publications in the future.

OpenAIREplus is on Twitter: @openaire_eu and also on Facebook and LinkedIn.

“The research data landscape – an overview” - Oya Rieger, Cornell University

Showed the scholarly communication cycle:

- Analysis/Interpretation - Author presenting - Sharing/networking - publishing/dissemination - archiving/preservation - research data collection -

There are lots of reasons to provide access to research data.

There is an NSF request to document the research process - tracking how data is being created and managed.

Cornell University's Research Data Management Service Group offers a range of services including metadata, storage, collaboration tools, etc.

Research data includes text, databases, images, code, spreadsheets. Researchers were asked what prevents them from sharing data - most reasons were due to information policy. Researchers were also uncertain whether datasets were up to standard (lack of community standards for data).

The Loon project (http://www1.chapman.edu/~wpiper/) is a good example of extending science to the public - but note that it's on the scientist's personal website! Site contains:
* hundreds of papers in different locations
* data in excel spreadsheets with metadata - used ecological metadata language
* files in proprietary software
* sound files for songs of each bird, video files

It took more than an hour to pull all the presented examples together - there were no obvious ways of linking things together.

Issues:

Technical
* scalable, flexible systems
* interoperability
* metadata standards

Socio-cultural
* listen to scientists' concerns/needs
* incentives and rewards
* community based standards
* different access provisions

Information policies:
* IPR, privacy, confidentiality, security, institutional ownership, access limitations, retention and de-accessioning

Organisational infrastructure:
* business and sustainability plans
* governance models
* recognition and engagement of stakeholders
* collaboration strategies
* communication and marketing

Usability:
* data quality standards
* ease of deposit to encourage end users (versus completeness of deposit)
* tools for analytics, mining, integration, visualisation
* digital identity to persistently locate things
* citation standards
* metrics to track and communicate impact

“Enhanced publications – an introduction” - Arjan Hogenaar, DANS

DANS is promoting trusted digital repositories, specifically the Data Seal of Approval, and sustainable access to data

Enhanced publications are defined as (traditional) publications enhanced with things like datasets, video, audio, images, information on author/organisation, information to clarify context. Not all of these need to be included in an enhanced publication.

There are 2 fundamental ways to compose enhanced publications:
1. "machine-based composition", where objects brought together share 1 or more properties. E.g. ARVODI (Netherlands), OpenAIREplus
2. "man-made composition", where it's up to the researcher to add things to the enhanced publication. It's not always clear why a typical object has been related to an enhanced publication

OAI-ORE for enhanced publications
- resource map to describe the enhanced publication as a whole (in OAI-ORE terms: the aggregation)
- components of the enhanced publication (in OAI-ORE terms: aggregated resources)
- aggregated resources may be documents, data, metadata...
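
As a rough illustration of that structure (my own sketch, not something shown in the talk), here is how a resource map, its aggregation and the aggregated resources could be modelled in Python. The class names and all the example.org URIs are made up for illustration; only the relationship names `ore:describes` and `ore:aggregates` come from the OAI-ORE vocabulary.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Aggregation:
    """The enhanced publication as a whole: a set of aggregated resources."""
    uri: str
    aggregated_resources: List[str] = field(default_factory=list)

    def add(self, resource_uri: str) -> None:
        # Each component (paper, dataset, metadata record...) is listed once.
        if resource_uri not in self.aggregated_resources:
            self.aggregated_resources.append(resource_uri)


@dataclass
class ResourceMap:
    """The document that describes one aggregation."""
    uri: str
    aggregation: Aggregation

    def triples(self) -> List[Tuple[str, str, str]]:
        # One ore:describes statement, then one ore:aggregates per component.
        result = [(self.uri, "ore:describes", self.aggregation.uri)]
        for resource_uri in self.aggregation.aggregated_resources:
            result.append((self.aggregation.uri, "ore:aggregates", resource_uri))
        return result


def build_example() -> ResourceMap:
    # All URIs below are hypothetical placeholders.
    aggregation = Aggregation("http://example.org/ep/1#aggregation")
    aggregation.add("http://example.org/papers/article.pdf")
    aggregation.add("http://example.org/data/dataset.csv")
    aggregation.add("http://example.org/metadata/record.xml")
    return ResourceMap("http://example.org/ep/1", aggregation)


if __name__ == "__main__":
    for subject, predicate, obj in build_example().triples():
        print(subject, predicate, obj)
```

A real resource map would be serialised as RDF/XML or Atom rather than printed as tuples; the point here is only the shape: one map, one aggregation, many components.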

Advantages of enhanced publications: background information is easier to find, therefore conclusions can be verified and information presented in context.

Additional advantages of man-made enhanced publications: authors may add comments as to why things are being linked and authors may allow other researchers to add components. This would make an enhanced publication no longer a static document, hence a need for version control.

Demo of the NARCIS portal: Zijdeman, R.L. 2009. Like my father before me: intergenerational occupational status transfer during industrialization (Zeeland, 1811-1915) (2012) (http://www.narcis.nl/vpub/RecordID/escape-demo%3Arem%3A2679/id/1/Language/EN)

Digital Author Identifier (DAI) - ensures no doubt on identity of author. Centralised system in the Netherlands. Now looking to include DAI in ORCID.

http://datasealofapproval.org/ - guidelines for data producer, archive, consumer. Accreditation is done by a process of self-assessment with peer-review by the DSA board. Trust is crucial!

NWO, the Netherlands science funder, has an open access policy, and data produced by NWO grant has to be managed properly.

Other challenges:
* combining components from different sources - issue that sustainable access is not guaranteed from all sources
* copyright issues

“Literature-data integration in the life sciences” – Jo McEntyre, EMBL-EBI

EBI has a number of big data repositories
Primary databases for deposition, and curated databases.

History of working with journals to link data and papers, e.g. Nucleic Acids Research in 1988 requiring an accession number from the EMBL Data Library, with the identifier cited in the article.

Data in thematic databases - all public. Nucleotide data production is outstripping storage capacity!

2 core lit databases, CiteXplore and UK PubMed Central.

Can easily make links with databases at EBI, and can text mine articles.

UK PMC is a full text database. Releasing web service of this in the next couple of weeks.

UK PMC is supplemented by abstracts in CiteXplore.

UK PMC can link between articles and grant information.

20% of articles in UK PMC are open access (~450k articles out of 2.2 million articles). Number of articles submitted increasing year by year.

Making literature-data connections:
* links by the author on submission as metadata (primary databases)
* by database curators - info and links from literature
* expensive, slow, high quality

Text mining:
* algorithms that use terminologies (can be subject to lag)
* post publication - can find new associations
* variable quality, but high throughput

Text mining in UKPMC looking at semantic types: gene/protein, GO terms, organism, disease, accession no., chemical.
Thousands of unique terms, articles and annotations.
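
To make the terminology-based approach a bit more concrete, here's a toy sketch in Python (mine, not UKPMC's actual pipeline). The four-entry terminology and the accession number pattern are simplified assumptions for illustration; real systems use curated vocabularies with many thousands of entries and far more careful matching.

```python
import re

# Hypothetical mini-terminology: term -> semantic type.
TERMINOLOGY = {
    "BRCA1": "gene/protein",
    "Escherichia coli": "organism",
    "colon cancer": "disease",
    "DNA repair": "GO term",
}

# Assumed, simplified accession number shape: 1-2 capital letters + 5-6 digits.
ACCESSION_PATTERN = re.compile(r"\b[A-Z]{1,2}\d{5,6}\b")


def annotate(text):
    """Return (offset, matched text, semantic type) tuples, sorted by offset."""
    annotations = []
    # Dictionary lookup: find every known term, ignoring case.
    for term, semantic_type in TERMINOLOGY.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append((match.start(), match.group(0), semantic_type))
    # Pattern lookup: accession numbers are recognised by shape, not by list.
    for match in ACCESSION_PATTERN.finditer(text):
        annotations.append((match.start(), match.group(0), "accession no."))
    return sorted(annotations)


if __name__ == "__main__":
    abstract = (
        "A gene implicated in colon cancer was compared with "
        "Escherichia coli sequences (U12345)."
    )
    for offset, found, semantic_type in annotate(abstract):
        print(offset, found, semantic_type)
```

This also shows where the lag mentioned above comes from: a term that isn't in the terminology yet simply isn't found, however important it is in the article.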

Area of cross-disciplinary integration!

Case study: phylogenetic tree of life - divergence from common ancestor 3.9 billion years ago. Some elements of DNA are the same across all living organisms.

E. coli meets humans - gene implicated in human colon cancer sequenced and submitted to sequence database. Gene sequence in the database compared with other sequences in the database using tool called BLAST. Link to paper on the role of DNA in gene repair. Gave pointers on where to go next with experiments.

Example text mining of paper abstract, with extra info on citations, related articles and bioentities. Can query how many articles cite a given article.

Half of PubMed cited 0 times.

Algorithm can find similar structures in the database - can find other articles that describe similar structure that are open.
Data driven science - need to learn from physicists to deal with huge author lists.

Hard decisions about value of keeping complete data sets.

Unstructured data - how do we reuse it? 1 in 3 articles now submitted with supplementary data. Mess of formats!

Usability really important! Need to apply solutions in the context of the science that people do.

ukpmc.ac.uk

“Data in the research process - a funder's perspective” – Mark Thorley, Natural Environment Research Council

Leads RCUK's discussions about Open Access.

Importance of research data. Research funders' view on data.

Wordle developed from NERC science information strategy.

NERC defines environmental data as individual items or records (both digital and analogue) usually obtained by measurement, observation or modelling of the natural world and the impact of humans upon it. This includes data generated through complex systems, such as info retrieval algorithms, data assimilation techniques and application of models.

Different research councils have different definitions, want to standardise!

RCUK values data because it's an integral part of the research record (helps robustness, integrity and transparency of research record). Reuse and repurposing - aka sharing (enabling others to do new things with the data, not just other researchers)

Other benefits of using data to drive innovation and growth.

data.gov.uk - making public data freely and openly available for other people to do stuff with.

What instruments do funders have to achieve their aims?
* Policy - if you take our money, we expect you to do "stuff" related to data
* funding - deliver "stuff" related to data
* infrastructure - provide and support a data infrastructure

Data policy for RCUK - data generated through RCUK funded research should generally be accessible for reuse and repurposing - though protections and constraints are in place (consent and confidentiality in medical research, embargoes etc.)

Key message - can be no more constrained than the legislation allows (Freedom of information)

RCUK common principles on Data Policy (rcuk.ac.uk/research/Pages/DataPolicy.aspx)
Another driver - research integrity. Draft RCUK policy on open access - research papers must contain a statement on how underlying data may be accessed.

Codes of practice on research behaviour. Unacceptable conduct includes mismanagement of the underlying data. Data should be preserved and accessible for 10 years, but 20 years or longer for projects of clinical or major social, environmental or policy importance.

NERC and ESRC have data policy in place for 20 years. EPSRC only released data policy last year. EPSRC puts onus on research institution to take responsibility for long term management of data.

NERC places onus of responsibility on researcher. Tends to become an institutional responsibility.

NERC supports long term management of data and will supply it for free (apart from a few special cases). NERC requires that all environmental data of long-term value generated through NERC funding must be submitted to NERC data centres.

Policy differences:
* discipline differences - e.g. how open data can be (medical, informed consent)
* responsibility - individual or institution
* infrastructure - centrally funded provision vs "grant" funded

Funding to support data management activities:
Differentiate between within project (include appropriate resources within grant application) and post-project (varies with funder depending on infrastructure)

NERC working on building common web services infrastructure across all 7 NERC data centres. Talking with others on how to extend this to other domains (social science...)

Pointless just holding and managing data without making it available.

Implementing policy - given a big enough carrot you can hit someone with it!

Pointless having a policy if you can't make it work. Looking at ways of monitoring what people do and rewarding/censuring them.

Outline and full data management plans.
Outline at grant submission

Full data management plan
* contract between PI and data centre
* key data management activities - who/what/where/when
* identifies datasets of long term value
* data value checklist (can't manage everything forever)

NERC take in data of long term value. Data value checklist a way of deciding what is of long term value. Still being worked on!

Future:
* data publication - incentivising scientists to submit data
* role of publishers, including commercial publishers
* clarify role of repositories vs data centres
* CODATA "Agenda for data" - try to get big international programmes to operate common principles around data which articulate data's value.

“Research Data Management Policies - the tale of one institute’s journey to ratification” – Mary McDerby, Manchester University

Research data management at Manchester.

Manchester: 5000+ research staff, 20 schools, 4 faculties, 3500+ postgrads, £279.4m external research funding (all figures from 2010)

2009 funded by JISC for project called MaDAM - raised institutional awareness of the issues of managing research data (funder mandates, wasted resources, reputational damage, publication policies, risk of data loss)

Inconsistent ad hoc data management solutions available within the user community. Multiple copies of data, difficult to track down the right version. Fragmented and decentralised storage. No backup policies. Limited means of disseminating data. No archiving policies to support long term curation.

MaDAM created a simple software solution to manage research data. Only for life sciences.

MiSS - transitions MaDAM (pilot) into a sustainable service. Will cost more in the future to support it as a service.

One of main outputs of MiSS is institutional data policy. Don't want to have your policy contradicting other existing policies!

Academic champion - person to go out and get buy-in for the policy. Policy should be clear and simple and easy for people to understand.

Policy - clear ownership and responsibilities. Manchester has taken shared approach to responsibility for research data, both institution and researcher.

Policy is not fixed, will be reviewed in light of changes to funders' policies.

Multi-partner collaboration - PI's responsibility to state IPR ownership etc in advance.

PI is responsible for preparing data management plan (with university support)

Data must have sufficient metadata to allow other researchers to understand how it was created or acquired, and to discover it and assess its reuse potential.

Openness and publishing: make data open and available to others, possibly with a limited period of privileged access.

PI is responsible for compliance with legal and ethical requirements.

University data management policy also gives a gentle reminder of the University's code of good research conduct.

Policy was ratified on the 16th May.

Lessons learned:
* during consultation, there's likely to be a communication path breakdown, so need to have a back-up plan for communication.

Manchester still need to write procedures and guidance to tie in with the policy, which won't be public until September.

Monday, 25 June 2012

OpenAIREplus workshop notes - 11th June 2012, Copenhagen
