Wednesday, 27 June 2012

NordBib conference notes, Copenhagen, June 2012

Inside the lecture theatre at the Black Diamond

The NordBib conference was all about Structural Frameworks for Open Digital Research
- Strategy, Policy & Infrastructure. I kind of fell into attending it by accident, as I was in Copenhagen for the OpenAIREplus workshop before it, and the DataCite meeting after it, so it seemed sensible to go to this one too.

It was an interesting conference, on one hand very high level, EU strategising, while on the other, the audience seemed to mainly consist of librarians and people interested in data without that much by way of actual concrete experience in data management. So I wound up having lots of conversations with lots of people, all interested in finding out what we in the UK and the NERC data centres have been up to.

All the presentations from the conference are available here (which kind of makes my notes redundant, but never mind!)

Monday, June 11th - Prologue: Setting the scene

Video presentation by EU commissioner Neelie Kroes, commissioner for the Digital agenda for Europe

Dr. Carl-Christian Buhr, cabinet member and advisor to Ms. Kroes
"The Digital Agenda for Europe and Horizon 2020"

Why is EC interested in open access at all?
 * Policy maker - launches policy debates, proposes legislation, invites member states to take action
 * funding agency - research and innovation, access policies for funding research
 * Infrastructure builder - funds research infrastructures, funds related research, supports networking activities

27 Commissioners
Digital Agenda: Neelie Kroes
R&I, ERA: Máire Geoghegan-Quinn

Horizon 2020
 * for 2014-2020
 * integrated R&D and innovation projects (making things happen)
 * 80bn euro
 * Strong increase for e-Infrastructures
 * Open Access to research results: for all publications mandatory, to data where appropriate

Start discussion about openness in data.

Bias towards openness, but need to do things in a participatory manner.

Proposal for Horizon 2020 is now being debated in Parliament and Council.

Status quo on open access:
 * OA for publications covers 20% FP7 budget. Best effort mandate. Embargo (6/12 months). Costs reimbursable.
 * OA to data: some projects but not systematic.

Horizon 2020 - general principle: an open access mandate for publications.

Communication and recommendation on scientific information - recommendation for member state actions. Likely adoption Q3/2012

In Oct will have a better idea of how Horizon 2020 will work.

Important EU strategy documents
 * published Oct 2010
 * published Oct 2011

"the data themselves become the infrastructure"

Need to have a coordinated effort of stakeholders to build and sustain an underlying seamless and trusted infrastructure.

Interoperability is important - don't want to create island solutions that we can't connect.

What next for open science?
 * Policy: who will rule infrastructures? Common approach
 * Funding: infrastructures, tools and services, Big Data
 * community building: e.g. digital humanities, Europeana (connecting Europe Facility - high speed broadband for all in Europe)

Prof. dr. Tony Hey, corporate vice president Microsoft Research Connections 
"Open Access, Open Data and Open Science : Fourth Paradigm of Data-Intensive Scientific Discovery"

Big Data making headlines in the popular press.

Need scientists with new types of skills, including managing, sharing, publishing, querying data. Interest by industry in mining data.

Fourth research paradigm - data-intensive science. (3rd paradigm: computational science; 2nd: theoretical science; 1st: experimental science.)

Data intensive science - scientists overwhelmed with data sets from many different sources (instruments, simulations, sensor networks...)

eScience is the set of tools and technologies to support data federation and collaboration for analysis and data mining, data visualisation and exploration, for communication and dissemination.

Data may not be co-located. Sharing data essential in many areas of science.

"The Fourth Paradigm" book - available on-line for free. (I downloaded a version for Kindle)

Same techniques using machine learning to target viruses and spam. SharePoint site for carbon-climate synthesis. Communal field science. Social problem as much as a technical problem - changes how research is done.

Digital watersheds - need to do scientific data mashup as data comes from multiple sources (rain from NOAA, river run off from US Geological Service)

Bill Gates' vision for a new era of research reporting - reproducible research, dynamic documents, interactive data, star-rating...

Can now make a perfect digital copy of a document and use the web to disseminate - changes the publication game.

Subscription costs for journals increasing dramatically out of proportion with consumer prices index.

Inevitability of open access repositories: something every University can do right now - make university staff put peer-reviewed final drafts in institutional repositories.

Publishing model is broken - need new compact. Publishers need to provide services that users want at affordable price.

Webometrics "scholar" ranking - impact of institutional repositories on this metric.

Future of research repositories:
 * library should be guardian of intellectual output of the university
 * will also contain data, images and software

COAR - federated repositories that can be searched across.

Berlin declaration 2003 includes data, metadata

NSF data sharing policy 2010 - all future grant proposals require a 2 page data management plan. Investigators are expected to share data.

UK government committed to free and open access to taxpayer-funded research. EPSRC putting the onus on institutions re data management policy.

Can remove inefficiencies and increase scientific productivity by making data/information open and linked.

2 papers a minute deposited in PubMed - how do we deal with this?

Openness is needed to reduce the time to impact.
1. need to cooperate on standards for data provenance, curation and preservation
2. default expectation of data sharing
3. publication processes and social behaviours more flexible
4. Data recognised as core assets

New tools to visualise big data (TED talk, Roy Gould and Curtis Wong)

Layerscape - open community tool for layering data and visualising models - tool for big history. Understand all aspects of history - zoomable timescale platform. Correlations of historical and climatic/geologic events.

PLANETS - long term preservation and access to Europe's cultural and scientific heritage.

The Cloud: many different types: dedicated, private, public. Need to experiment to find out which are best for which users.

Microsoft cloud is purpose built data centres to host containers at large scale. More energy efficient, more cost effective.

EU VENUS-C project - interoperability, non-lock-in. Pilot studies for different applications.

Data interoperability - ways of combining data need to be as simple as possible. Public discussion in OASIS - adds to HTML micro-formats that a website manager can use to specify what their site is about (Casablanca the movie or the place?)

Want computers to understand/anticipate what you're looking for.

Semantic chemistry add-in for Word - understands that when you type CH4 you mean methane.

InnerEye: semantic understanding of medical images - training a computer to identify bits of the body. Machine learning as a service - give users the ability to mine their data. Attempt to make machine learning easy for non-experts to use.

Dr. Jill Cousins, CEO Europeana
"Europeana : achieving interoperability"

Really need to work to avoid future mistakes - no more re-inventing the wheel.

Cultural and research data should become accessible to researchers.

European frameworks of research and culture should work more closely together.

Europeana brings together museums, archives etc. from all across Europe. Bringing things together so the user can find them.

23 million objects accessible in the repository, 2200 participating institutions. 17 million records covered by the data exchange agreement. Huge battle to get institutions to release metadata (not content!) under CC0.

Want to do new things with data.

68 requests for APIs were denied because they were semi-commercial and not covered by licensing.

Interoperable systems
- end user has many ways to access research and cultural heritage - private (Google), community (Wikipedia), public (national), public (research), public (culture)

From user point of view, where do they go to find things? We should be able to link the public research and public culture fields.

We've invested in aggregation, infrastructure, standards (supply of information)

Time to look at the demand - what the user wants to get out of it.

Try and target different types of users, serviced by a Europeana backend. Users can use an API to extract the information they want. A researcher user might want to pull in information from other sources - needs other sources to be interoperable.

For general public backend feeds into series of things which aggregate, enrich, distribute, market. This value chain will be different depending on the user - so they can get what they want when they want.

Research user will have different set of tools and services. Working on Europeana Research to come up with the basic services for researcher. Need interoperable systems and data.

Cloud could help store information - not creating separate silos might save money/time.
Interoperable and shared research and cultural heritage infrastructure would lead to better use of scarce resource, better sustainability, better user experience.

Put data out there in a format that links to other data.

We have interoperability in cables and computers, networks, links and documents. We have to work on the open content licenses and standardised licenses.

Europeana Data Exchange Agreement, published 1 Jan 2012. Baseline for access, allowing metadata to be released under CC0.

Tuesday, June 12th - Segment 1: Infrastructure and research input & output

Prof. dr. Søren Brunak, Denmark's Technical University, Dept. of Systems Biology
"Infrastructure for bioinformatics, systems biology and medical informatics in the post-genome era"

Paradigm shift in biology and health from single gene analysis to all genes, to entire genome, to entire population.

Cost per megabase of DNA sequence dropping significantly out of proportion to Moore's law. Rapid drop in 2007 - technology leap with next generation sequencing.

DNA sequence data has been doubling every 6-8 months over the last 3 years and looks to continue doing so for this decade. Computer speed, bandwidth and storage double every 18-24 months.
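
The gap implied by those doubling times can be sketched with a quick calculation (illustrative numbers only: I've assumed 7 months for sequence data and 21 months for compute, picking midpoints of the ranges quoted):

```python
# Rough illustration of the divergence between sequence-data growth and
# compute/storage growth, using the doubling times quoted in the talk.
# Assumed doubling times: 7 months (data), 21 months (compute).

def growth_factor(months, doubling_months):
    """How many-fold something grows in `months` given its doubling time."""
    return 2 ** (months / doubling_months)

years = 10
months = years * 12
data_growth = growth_factor(months, 7)      # sequence data
compute_growth = growth_factor(months, 21)  # compute speed / storage

print(f"Over {years} years: data ~{data_growth:,.0f}x, "
      f"compute ~{compute_growth:,.0f}x, "
      f"gap ~{data_growth / compute_growth:,.0f}x")
```

Even with generous assumptions for compute, the data side outruns it by three orders of magnitude over a decade, which is why the costs shift from sequencing to analysis.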

Sequencing is becoming a disruptive technology. Costs aren't in sequencing, but are in analysis.

ELIXIR: sustainable European infrastructure for bioinformatics, supporting life science research and translation to society, industries, environment, medicine.

ELIXIR - concept is a distributed pan-European infrastructure.

bottom-up vs top-down bioinformatics infrastructure: US caBIG infrastructure was terminated because it wasn't successful in getting community engagement. 2004-2010, total cost $350 million.

Want to focus on how the infrastructure will be adopted, rather than just designing it and mandating it (doesn't work)

ELIXIR costs: hub capital £74m, operating costs 2012-2016: 28.5 MEuro. Costs for nodes will be determined by the node coordinators.

ELIXIR - part of the innovation chain - share data in the pre-competitive phase. Feeds into academia and also industry.

We are generating a lot of detail on molecular mechanisms and descriptions. Still working with quite crude categories of diseases. Need to expand the phenotypic classification all the way to the level of individuals.

Electronic patient records give the fine details of patients' diseases/conditions. Denmark has an opt-out system. Danish citizens can log into their personal health records the same way as they can do on-line banking.

Can take text and use controlled vocabs in the hospital sector to text mine. Approach is international - same controlled vocabularies used internationally. Meta-analysis across countries. Can spot how diseases correlate - can then go and hunt for the genes that might be in common for the diseases.
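
The controlled-vocabulary approach can be caricatured in a few lines: scan free text for known terms and map them to standard codes, so records from different hospitals (or countries) become comparable. The vocabulary entries below are invented for illustration, not a real coding system:

```python
# Toy sketch of controlled-vocabulary text mining over record text:
# free text is scanned for known terms, which map to standard codes.
# Synonyms map to the same code, which is what makes records comparable
# across institutions and languages. Vocabulary here is invented.

vocab = {
    "myocardial infarction": "D001",
    "heart attack": "D001",          # synonym, same code
    "type 2 diabetes": "D002",
}

def extract_codes(text):
    """Return the sorted set of vocabulary codes whose terms occur in `text`."""
    text = text.lower()
    return sorted({code for term, code in vocab.items() if term in text})

record = "Patient admitted after heart attack; history of type 2 diabetes."
print(extract_codes(record))  # -> ['D001', 'D002']
```

Real systems use far richer matching (stemming, negation detection, multilingual vocabularies), but the principle of the international meta-analysis described above is the same: shared codes, not shared prose.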

Idea of work is to link to personalised medicine.

Can see how the diseases in Denmark correlate with each other.

Should focus on how to link the genes to the diseases. ELIXIR eNewsletter

Large part of ELIXIR is interoperability.

Prof. dr. Martin Mueller, Northwestern University, Chicago IL
"Towards a cultural genome: curation and exploration of large-scale historical text corpora"

Scalable reading: idea of "close reading" - looking very closely at the words. "Distant reading" is when you just look at the titles of books etc.

"How to talk intelligently about books you haven't read" - book

Scalable reading - integrating things from different perspectives.

People usually talk about the things they find difficult to do - data sharing is one.

Interested in a book of English - very large, growing, public domain corpus of English text - something like a cultural genome - new ways of looking at the data become possible.

Philology and evolutionary biology are related. Data integrated into database can be contextualised.

Corpus of Latin Inscriptions - grew from 1850 into the 1950s. Study of administrative habits of Romans changed by this, as the data was there in a time space grid.

What was more important? Gathering the data or analysis of it?

Over course of 20th century data moved from being looked after by scholars in the department - shift from editing to higher analysis. Researchers lived off the capital of a century of data acquisition.

Digital age - who's going to edit the data for the digital age?

Shift from academic departments to libraries. Who decides when the data is good enough?

Humanities are v. sceptical of digital data - primary data circulating in a digital space.

When are data good enough? Changes with inquiry researcher has.

Text miner can live with a lot of noise in the data. For a close reader, noise becomes very distracting very quickly - "yuck factor" even if the reader can make sense of the words.

"How bad is good enough?"

Massive problem of data curation. Only good way to do it is engage the user communities to take charge of their data. Problem of data in the humanities - a lot of problems, not very hard, but just so many of them!

How do you make sure data curation meets standards? How do you review the work of crowd-sourcers?

We need a good system of crowd-sourcing!

Often have to do things to data before you can do things with them. People who want to do things with data have to decide what has to be done to it.
Critical piece of human labour takes a second (correct a letter), but surrounded by other stuff (finding the page, recording the change). Can we automate the other stuff to make it easier?
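
One way to automate the "other stuff" is to wrap each one-letter fix in a helper that records the location and provenance automatically, so the human only supplies the actual correction. A minimal sketch (the data model and function names here are hypothetical, not from the talk):

```python
# Hypothetical sketch: apply a single-character correction to a transcribed
# page and automatically record who changed what, where, and when. The human
# contributes only the fix itself; the bookkeeping is automated.

import datetime

def apply_correction(page_text, position, new_char, editor, log):
    """Replace one character and append a provenance record to `log`."""
    old_char = page_text[position]
    corrected = page_text[:position] + new_char + page_text[position + 1:]
    log.append({
        "position": position,
        "old": old_char,
        "new": new_char,
        "editor": editor,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return corrected

log = []
text = apply_correction("Cafl me Ishmael.", 2, "l", "volunteer42", log)
print(text)           # -> Call me Ishmael.
print(log[0]["old"])  # -> f
```

The provenance log also gives you a hook for reviewing crowd-sourced work: each record can be checked, reverted, or credited to its editor.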

Temporal distance adds value to each data point in the vanishing past. The further back in time you go, the less there is, the more important what's there.

Segment 2: Organisations and collaboration

Dr. Stefan Winkler-Nees, programme officer Deutsche Forschungsgemeinschaft (DFG)
"Knowledge Exchange : Approaching the challenges of the Digital Agenda"

Knowledge Exchange is a cooperative effort to "make a layer of scholarly and scientific content openly available on the Internet"

Fields of activities:
 * Interoperability of digital repositories
 * Licensing
 * Open Access
 * Research Data
 * Virtual Research Environments

Working group on research data

Data should be securely stored, professionally managed, available for reuse.

Square Kilometre Array is a good example of big collaboration for science and with infrastructure.

Knowledge Exchange has commissioned a report on legal status of research data in 4 partner countries.

 * Researchers (incentives and training)
 * technical infrastructure (funding)

Incentives: re-use and recognition, principles of science (reflected in rules and codes of conduct), requirements by funders, journal policies.

Disincentives: risks of publishing data sets (possible abuse, ethical, legal issues), effort involved in sharing data

Shared data increases citation rate (Heather Piwowar)

Data skills should be made a core academic competency and should be in the curriculum. Specialist roles for data scientists.

ANU data management manual (Australian National University) available on line.

Ecosystem of infrastructures might describe how the infrastructure landscape will develop.

Interoperability most important issue for realising an ecosystem of repositories.

Don't have a good idea of how much things cost, esp. running costs. 10-15% of total research costs - a guess!

Part of changing culture of science - having a repository signifies good science by the institution.

"A surfboard for riding the wave"

Next steps:
 * standardise data citation and develop metrics
 * more data journals (good idea!)
 * codes of conduct for data sharing
 * find agreements on data management plans
 * data management training courses and curricula for data librarians and data scientists
 * coordinate implementation of infrastructures
 * studies on benefits and costs of re-use, publishing and archiving of datasets

Value of scientific data has to change. Data management is too abstract - a shame.

Information literacy, legal aspects - these topics need working on.

Highlight enormous potential for science!

Listen to the researchers - they're the users, producers of data.

Dr. Andrew Treloar, director of technology Australian National Data Service
"Collaboration, Competition, and Consumption : the Changing Ecology of Research Data Publishing"

ANDS enables transformation of:
Data that are unmanaged, disconnected, invisible, single use, into structured collections that are managed, connected, findable, reusable, so that Australian researchers can easily publish, discover, access and use research data. Competitive advantage for the whole country.

Way of thinking about the role of data within the scholarly communication space.

Scholarly communication isn't a watch (a mechanism), but a jungle (an ecology)

Value of an ecological approach - information ecology.
Thinking about people, practices, values, technologies. A richer way of thinking about the scholarly communication space.

Ecology elements:
 * systems that evolve over time
 * environmental factors (constraints, forcing)
 * selection pressures
 * biodiversity
 * species and individuals
 * niches for colonisation/exploitation
 * resources
 * interactions
 * species co-evolution/co-adaptation

Research data ecology elements
 * researchers
 * institutions
 * funders
 * data centres (institutional, disciplinary, national, international)
 * disciplines
 * research facilities
 * libraries
 * publishers

Relationships between these elements. Relatively small number of ways in which elements of an ecology can interact.

 * Predator-prey
 * Competition
 * Parasitism
 * Symbiosis

Co-evolution isn't necessarily good. Systems co-evolve, but you can move from one stable state to a new stable state that isn't necessarily more desirable. E.g. print journals to electronic journals. Form and access arrangements have largely not changed. Open access is gaining momentum, but the form is changing more slowly.

New niches allow for new possibilities - internet was a new niche for journals. Internet could have offered new possibilities for scholarly communications.

Research data can be a new niche for librarians. New roles within institutions. New ways to engage with a wider range of clients. New application of existing skills.

Selection pressures in research data driving change.
 * Increasing volume, variety, velocity (Gartner, 2001)
 * Increasing importance of data relative to publications
 * mixed messages from journal publishers
 * outcomes currently unclear

Role of publishers: Is relationship between publishers and producers of research symbiotic or parasitic? How will rise of data intensive research change this? More and more instances where data has more value than the publication.

Symbiotic relationships are often better for both parties than either competition or predator-prey.

Conclusions: an ecological approach provides a richer way of thinking about scholarly communication than mechanics. Research data is a new niche - undergoing great change! Look for symbiotic relationships. Critically examine the roles of other players in the ecosystem.

Nardi and O'Day First Monday, 1999

Wednesday, June 13th - Segment 3: Strategy, policy and funding 

Video presentation by EU commissioner Máire Geoghegan-Quinn, commissioner for Research, Innovation & Science

Director Octavio Quintana Trias, European Research Area, European Commission 
"Horizon 2020 and the European Research Area"

 * public policies - make research more efficient, allow access to the outcomes of research and the data
 * Society - better integration of science into society, contribution to knowledge based society
 * Data - key for research validation and innovation process, ethical, commercial, technical concerns.

 * develop and implement open access policy
 * encourage member states to take on board open access policies
 * Discussion with the stakeholders, funders, data centres, researchers, publishers (business model of publishing has to change)

Lead by example: FP7 experimented with Green OA and covered OA publishing costs. Horizon 2020 makes OA the general principle for all projects.

Data - want a pilot in next research program, based on "best efforts". Encourage researchers to submit data to databases.

Encourage national initiatives and OA on publication and data. Define and implement policies for preservation of scientific information. Further develop e-infrastructures and develop synergies between national and European organisations.

European Research Area - improve quality and efficiency of European research by opening research systems. Optimal circulation of scientific knowledge. Member states define and coordinate their policies. Research stakeholder organisations adopt and implement them. EU policy and supporting initiatives.

 * promote OA for efficiency and fairness
 * lead by example in Horizon 2020
 * encourage member states and stakeholders to implement policies

Aim: 2016 - 60% OA publications

Prof. dr. Sverker Holmgren, eScience programme director for NordForsk
"eScience with a Nordic perspective"

Swedish Research Council:
Open access (2010) - researchers getting grants must publish in OA publications or put the article in an open archive. (Not sure researchers know what this means)
Open data (2012) - a data publication plan is required when submitting a grant proposal. Don't know how this will work yet.
Data services - 2008 SND (humanities, social sciences, medicine). Also a data centre for climate.

June 1 2012 - joint statement with lots of partners (inc. EC and Science Europe): "partners will take timely concrete steps ... optimal circulation and transfer of scientific knowledge" i.e. open access

"Empowering the Global Science Commons" - Nordic eScience Action plan.

NeGI and NeIC characteristics:
 * coordinated effort in e-Science research
 * out-ward looking activities
 * governance inc national funding agencies
 * common pot funding
 * open access to research outputs

Nordic grand challenge research programme on eScience in climate and environmental research - requirement: open access to publications and data

eNORIA action plan:
 * training researchers in eScience tools and methods
 * research into grand challenges
 * data infrastructure

Open Access to National Data Repositories
Proposal: Establish Nordic Centre for Data collaboration within the Health Sciences (NCDH)

Challenges: slow progress! The Human Genome Project in 1996 was an open data project.

2012 Nordic council of ministers report on Nordic collaboration on data. Big list of problems!

Academia is conservative - changing community consensus is a slow process. Driving force is publications and references to publications. Needed: a generally accepted reward system for producing data and making data available!

[from discussion after the talk]
Infrastructure for a reward system is emerging. Opportunities not fully used today. Research community needs to take responsibility for awarding recognition to producers of data. Funders could push this. Has to be attacked at many different levels.

Workshop : the policy-strategy-funding connection
Suggested themes: Translating reports and recommendations into action. From European programmes to institutional policies. Of metrics and incentives. Bringing about the wanted results. From FP7 to FP8.

Track 1: National and institutional measures

Infrastructure to discover the existence of data, even if it's not open - on a national, international level.

Incentives for researchers?

Institutions need to provide support for data management.

DCC gathering all the information they can about correlation between data in an archive/shared and increased citation to research. Fund more research into this.

Reward the re-use of data - examples?

In UK, research programme only open to people reusing shared data.

Create data policies and make them visible to others - very helpful to the community!

In proposals, capture the intent to create data and use this as a starting point to find out what data exists and where.

Epilogue: a letter to the Commission

Summing up the workshops & madness sessions

Semantics workshop

 * delegates found it hard to talk about solutions and good stories - tendency to talk about problems!
 * tendency to regress to the issues of getting data from researchers in the first place.
 * Data is not like books. Doesn't tell you what it is itself, needs context.

 * Naming problem:
  * controlled vocabularies don't exist
  * solution: national data centres and collaboration between them
  * solution: establish the institutional repository as de facto standard

 * Researchers:
  * languages are made and evolve in social groups
  * tools are not the problem any more
  * need more user studies about how people search for data and what they do when they find/don't find it

 * Data policies:
  * need to be explicit, understandable, harmonised
  * legal report by Knowledge Exchange
  * data is fact - facts are not intellectual property
  * major exception - the form facts are presented in might be intellectual property. EU database directive

Technical workshop

 * Building on success stories: DataCite, Dryad, Figshare, ANDS
 * what's already possible?
 * what improvements can be made internally, what has to be done across institutions
  * Institutional data systems embedded into the workflows that researchers use
  * collect examples of good practice
  * Need a new actor (trusted third party) who can allow users to work with confidential data
  * discovery in areas with no discipline solution
  * Licensing for data, clearer statements about what can be done with data
  * public should have access to research they funded
  * getting people to describe their data for other people outside their discipline to find and re-use it
  * Integration of better data management in researchers' workflow, not only metadata but data as such
  * new career path for data scientists/data support professionals

Low-hanging fruit:
 * easy access to data and tools in the cloud/easy storage
 * common metadata standards for discovery
 * list of projects/initiatives/standards and keeping this up to date
 * trust-ratings for repositories, range of quality measures
 * single sign-on
 * data provenance - track data from source to use
 * incentives for managing and sharing data

EU and global measures:

 * major role - coordinate!
 * include scientists in this discussion, meeting their demands, address disciplinary societies
 * changing academic culture, develop a new way of measuring the research output using quantitative and qualitative indicators. Needs to be developed carefully
 * convince institutions to implement data management into routine research workflows and include an element of competitiveness.
 * follow up the "ecosystem" idea and develop means to support different approaches, allow failure within rules, monitor the ecosystem, innovation drives success
 * support training efforts on project and institutional level
 * Europe has a global role to play in order to attract collaboration
 * EU funds should be used for a dedicated ESFRI program for information infrastructures
 * work on minimising admin barriers (legal, tax regulations, develop and implement licenses for research data etc.)

National and institutional level:

Starting points:
 * infrastructure isn't just supercolliders - human infrastructure important
 * open access is fine goal - but start with open discovery

Actions - national - funders:
 * make it mandatory to search for earlier literature and data in proposals
 * capture intent to produce data (cf. clinical trials). A register where you pre-register your data set (and you can register interest)? Register of data management/data publication plans
 * monitor compliance; reward good behaviour
 * funding tied to data reuse?
 * ask for evidence of data policy from institutions
 * make infrastructure sharing financially attractive
 * clarity on grant funding for data management activity.

Actions - institutions:
 * rewards again - appraisal and tenure should recognise good data behaviour and achievement
 * discover what you have - audit/inventory - even when the results are frightening!
 * then build simple registries for discovery -  needs national, international, domain cooperation
 * support researchers - one stop shop
 * educate researchers in data management

Actions - all levels:
 * need to raise awareness at researcher level and institutional level
  * incentives and disincentives
 * coordinate the demands from research funders - asking for the same things in subtly different ways - causes problems with needing different systems
