Citing Bytes - Adventures in Data Citation

Wednesday, 25 January 2023

Looking to the future of research practice

It’s still January, and still close enough to the new year, so it’s almost obligatory to dust off the crystal balls and have a squint into the future, and who am I to break tradition? So I'm going to do some pondering about the future of research practice.

(As for why I'm doing this? I started a new job back in October last year, which is all about research practice, and I've been doing a lot of thinking.)

Fig 1. The many and varied concepts involved in research practice, both now and about 10 years in the future.

Figure 1 is a lot of words and a lot of pictures to illustrate the words and concepts that support and surround research practice. I’m going to address each of those things on a case by case basis.

Firstly, and most importantly, we cannot have research without researchers. It’s the people who make research happen, whether singly or in groups. Modern research is increasingly group-focussed, where researchers come together to achieve large aims and extensive projects, and hence groups can come from almost everywhere, including in-institution, cross-institution, international collaboration, cross- and inter-disciplinary. It’s often said that the real innovative research happens at the boundaries where disciplines meet, and we need inter-disciplinary teams to take advantage of that.

As a former scientific project manager, I can tell you that teams need a lot of support, both from within the team and from the institution they are based in. That’s why I’m so pleased to be part of the University of Oxford’s Research Culture programme, which acknowledges that “Beyond resources and facilities, an environment that enables the highest quality research must also be sustained by a positive working culture.”

This researcher support is backed up by training, at all levels and career stages, and the promotion of best practice by all members of the research team and the supporting institution. This is where principles become practice, and where we want to ensure that those who are doing research on a daily basis are enabled to do research effectively and easily, while maintaining exceptional standards of ethics and integrity.

If we want to make people do the right thing, we need to make it easy for them to do the right thing, and support them in doing it.

But how do we define what the right thing is? That is something that a single institution can’t do in isolation, and instead requires international discussion and agreement. There are a great many committees and working groups, each addressing key issues in the research environment, and coming up with agreements and concordats for best practice in research. These include standards on ethics and integrity, inclusion and diversity, technological and scientific standards, and promotion (and tenure).

It’s not just the universities and research institutions that have to consider these things – we need to work together with funders and academic publishers to capture the full range of incentives (and disincentives) that researchers face in the course of their work.

Of course, different issues will become more pressing at different times for the researcher. At the beginnings of a project, ensuring you have the right resources (e.g. lab space, access to archives, high performance computing resources etc.) and the right approvals (e.g. for working with humans, or health and safety for chemicals, etc.) is crucial, and researchers often need help and support in getting these pieces in play.

It’s not enough to simply do research and collect results – if we are to progress from this research, we need to be able to communicate what has been learnt to other researchers. The traditional way is through academic publication and conference presentation, but research can have a far wider impact on the world, including via policy and industrial spin-offs. The past trends towards open access and open research have been good ones, and I am very hopeful that such openness will become the norm in the future, as this is good for reproducibility and verifiability of research.

Transparency in research requires more than making the resulting journal articles open access. It also needs all the components of the research to be managed and archived, whether that’s data, software code, protocols, workflows, physical samples, etc. This all requires infrastructure to support it, and that infrastructure has to be maintained and kept for potentially long periods. It is unrealistic to expect a researcher working on a short term contract to be able to manage and archive multiple terabytes of data – this is where the institution and funders need to step in to provide that long-lived infrastructure. Many institutions already have such things in place, as part of their libraries and archives, but these will need to be extended to look after those assets which are “born digital”.

The amount of data humanity is creating is increasing every day, which provides us with the ability to understand things about ourselves and our world at an unprecedented level of detail. This “data deluge” does have its downside, in that managing it is a challenge, and is only going to get more so. I want to highlight the FAIR principles and the CARE principles for Indigenous Data Governance here (figure 2), as these are excellent ways of thinking about data and other born digital objects.

Fig 2. Be FAIR (Findable, Accessible, Interoperable, Reusable) and CARE (Collective Benefit, Authority to Control, Responsibility, Ethics).

The increase in data-driven discovery, and the use of new tools such as AI and machine learning, have also turned research problems that were once considered difficult (e.g. how to differentiate between different objects in photos) into tools that can be used (and misused) by a wide variety of actors (e.g. governments using face ID to track citizens, or hate groups using algorithms to target specific races or genders). We need to face up to these new technologies and seriously consider the impact they will have, not only from a socio-political standpoint, but also from an environmental one. Big data models use a lot of computing and a lot of energy, with a correspondingly large carbon footprint.

Last, but not least, it is helpful if we can evaluate the outputs of the research, and understand what impact it has made, both within the field of study and externally. This has historically been notoriously difficult to determine, and can only be done accurately many, many years after the fact. Instead we turn to proxy measurements and metrics, which are easier to calculate, but don’t actually measure what we want them to. A much-discussed metric is citation counts, working on the principle that the more citations an article has, the more useful it has been to the community. Unfortunately, it’s easy to see that this premise is flawed, simply by looking at the citations that retracted papers get, even after retraction.

Fundamentally, I think the thing that will change the most about research practice in the next few years is going to be the increase in data-driven discovery, and the use of new AI tools and services. These have a huge amount of potential, though I really do think that we need to be aware of the hype, and also of the potential harm (socially and environmentally) these systems can cause.

We also need to remember that without our researchers, research cannot happen. Humans are curious, innovative and creative, and we need to support that in our researchers, and in our general lives. Remembering that people are not just numbers is crucial too – yes, there will always be a push for efficiency and increased speed in doing things, but this should not come at the cost of our humanity, our dignity, or our creativity.

There are many challenges ahead of us, but I do believe that with support and collaboration, we can face them all.

Friday, 20 April 2018

Hippocratic Oath for Research Data

(a not entirely serious output of International Data Week 2016*, see also Should there be an Oath for Scientists and Engineers? and A Hippocratic Oath for life scientists (based on the Modern Hippocratic Oath, written in 1964 by Louis Lasagna, Academic Dean of the School of Medicine at Tufts University))

I swear to fulfill, to the best of my ability and judgment, this covenant:...

I will respect the hard-won research gains of those researchers in whose steps I walk, and gladly share such knowledge as is mine with those who are to follow.

I will make no assertion without evidence.

I will apply, for the benefit of all, all measures which are required to preserve and make usable research data, avoiding those twin traps of data hoarding and unhelpful data description.

I will remember that there is art and craft to data management as well as science, and that humans as well as machines need to be able to interpret and use the data, now and in the future.

I will welcome opportunities to say "I know not," never failing to call in my colleagues when the skills of another are needed to assist in data sharing, dissemination or management.

I will respect the privacy of those who provide personal or sensitive data to me, for their problems are not disclosed to me that the world may know. Most especially must I tread with care in matters of life and death, not only of humanity, but also of the global ecosystem. Above all, I must not play at God (even if I create a data management system or infrastructure that allows me to do so).

I will remember that I do not manage a stream of bytes, but a whole story of data collection, analysis and interpretation. My responsibility includes these related research objects (such as software, workflows, project plans, etc.), if I am to care adequately for the data and the results and conclusions resulting from it.

I will prepare for data management in advance whenever I can, simply because it will make my life, and others’, easier.

I will remember that I remain a member of society, with special obligations to all my fellow human beings, as well as the research record.

If I do not violate this oath, may I enjoy life and art, respected while I live and remembered with affection thereafter. May I always act so as to preserve the finest traditions of my calling and may I long experience the joy of research and help make the world a better place.

_________________________

* yes, it's been a while since I wrote this, or have blogged, for that matter! But I've decided to pick this blog up again and figure that this bit of fluff is a good place to start.

Monday, 22 May 2017

Link roundup - academic publishing edition

"Should we cite preprints?" - Green Tea and Velociraptors

Agrees with my "cite what you use" rule of thumb

"Preprints won’t just publish themselves: Why we need centralized services for preprints" - Collaborative Knowledge Foundation

Neylon C, Pattinson D, Bilder G and Lin J. On the origin of nonequivalent states: How we can talk about preprints [version 1; referees: 1 approved]. F1000Research 2017, 6:608 (doi: 10.12688/f1000research.11408.1)

Really interesting article that proposes a model that distinguishes the characteristics of the object, its “state” (the external, objectively determinable, characteristics), from the subjective “standing” (the position, status, or reputation) granted to it by different communities.

Baldwin, Melinda, "In referees we trust?", Physics Today 70, 2, 44 (2017); doi: http://dx.doi.org/10.1063/PT.3.3463

Fascinating article about the history of academic journal peer review, and the societal pressures that have made peer review the "gold standard" of academic credibility, with some discussion of how it's creaking at the seams.

"Does It Matter Whose Name Appears After the © When Using Creative Commons?" - Todd Carpenter (The Scholarly Kitchen)

"Citation Performance Indicators — A Very Short Introduction" - Phil Davis (The Scholarly Kitchen)

"Satire in Scholarly Publishing" - COPE

A satirical article made it into a serious review article - COPE (Committee on Publication Ethics) give their judgement on the case. TL;DR - always fully read the papers you're citing!

"Journal accepts bogus paper requesting removal from mailing list" - The Guardian

A tale of a predatory open access journal accepting a paper (with lovely diagrams) which just repeated the words: "Get me off your ******* mailing list"

Tuesday, 2 May 2017

RDA Plenary 9, Barcelona, April 2017

Two little aliens stowed away for this trip, and were very pleased that the venue was all space themed.

RDA Plenary 9 was held in Barcelona, in April 2017. I made my usual bunch of scrappy notes, which I've tidied up and added links and commentary (in italics) for those who are interested.

Opening plenary session

Ideas spreadsheet for suggestions on how to coordinate and communicate across RDA groups

has to be an actionable suggestion - no moaning!
closed for suggestions now, but you can see what was proposed

WG RDA/WDS Scholarly Link Exchange
Interesting stuff and presenting things that are approaching maturity and could be useful and usable systems in the future.

All about linking research objects
Scholix information model:

mandatory: for link information package: publication date, link publisher. For source and target object: identifier and object type
other optional metadata includes link provider, relationship type, license URL of link information package (for link information package), title, creator, publication date, publisher (for source/target objects)
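To make the mandatory/optional split above a bit more concrete, here is a minimal Python sketch of a link information package. The class and field names are my own shorthand for the items in my notes, not the official Scholix schema, and the identifiers in the usage example are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LinkedObject:
    """A source or target object in a link (illustrative names only)."""
    # Mandatory for source and target objects, per the notes above
    identifier: str
    object_type: str
    # Optional descriptive metadata
    title: Optional[str] = None
    creator: Optional[str] = None
    publication_date: Optional[str] = None
    publisher: Optional[str] = None


@dataclass
class LinkInfoPackage:
    """A link information package (hypothetical shape, not the real schema)."""
    # Mandatory for the link information package
    publication_date: str
    link_publisher: str
    source: LinkedObject
    target: LinkedObject
    # Optional metadata
    link_provider: Optional[str] = None
    relationship_type: Optional[str] = None
    license_url: Optional[str] = None

    def missing_mandatory(self) -> List[str]:
        """Return the names of mandatory fields that are empty."""
        missing = []
        if not self.publication_date:
            missing.append("publication_date")
        if not self.link_publisher:
            missing.append("link_publisher")
        for role, obj in (("source", self.source), ("target", self.target)):
            if not obj.identifier:
                missing.append(f"{role}.identifier")
            if not obj.object_type:
                missing.append(f"{role}.object_type")
        return missing


# Usage with made-up values: the dataset identifier has been left blank
article = LinkedObject(identifier="doi:10.1234/example-article", object_type="literature")
dataset = LinkedObject(identifier="", object_type="dataset")
link = LinkInfoPackage("2017-04-05", "Example Data Centre", article, dataset)
print(link.missing_mandatory())
```

The point is simply that a link can be very sparse and still be valid, as long as the handful of mandatory fields are filled in.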

DLI service available as a prototype

automatically picks up stuff from DataCite
Scopus using the DLI system to find links to data

information available for preview users
wishlist for Scopus includes: clearer information on where data is stored, ability to retrieve richer metadata...

Scopus planning on doing data citation counts in the future

Scholix plans on collecting every link possible, not just citations
information about datasets in the text of papers, needs to be mined out and extracted - some publishers doing this
community focus groups within the WG - working on documents to answer the main questions "why?" "how?" FAQs - hoping to have them produced in the next 3 months or so
use cases - how data centres can contribute article links to DataCite = use "relatedIdentifier" property in DataCite metadata schema
Scholix doesn't say whether the dataset or the article is open, or about the licensing of the objects being linked
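For the data centre use case above, a rough Python sketch of what a "relatedIdentifier" entry might look like when built as XML. The DOI is made up, the function names are mine, and I've assumed "IsCitedBy" as the relation; check the attribute values against the version of the DataCite schema you actually deposit with.

```python
import xml.etree.ElementTree as ET


def related_identifier(article_doi, relation_type="IsCitedBy"):
    """Build one relatedIdentifier element pointing from a dataset to an article."""
    element = ET.Element(
        "relatedIdentifier",
        {"relatedIdentifierType": "DOI", "relationType": relation_type},
    )
    element.text = article_doi
    return element


def related_identifiers_xml(article_dois, relation_type="IsCitedBy"):
    """Wrap several article links in a relatedIdentifiers block and serialise it."""
    parent = ET.Element("relatedIdentifiers")
    for doi in article_dois:
        parent.append(related_identifier(doi, relation_type))
    return ET.tostring(parent, encoding="unicode")


# Usage with a hypothetical article DOI
print(related_identifiers_xml(["10.1234/example-article"]))
```

This fragment would sit inside the dataset's full metadata record; it is not a complete DataCite record on its own.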

How to give credit to scientists for their involvement in making data & samples available for sharing
Unfortunately seemed to spend too much time rehashing old data citation, data publication and data metrics arguments.

BRIF - Bioresource Research Impact Factor
Data metrics and reward systems - table 3 in report
Analysis of metadata records in DataCite reveals that not all records are complete.

Consensus and standardisation of metadata needed

Top data creator in DataCite is a mycologist
WG RDA / TD Metadata Standards for attribution of physical and digital collections stewardship already exists. Research Data Provenance IG already exists.
Focussing very much on data publication as a method for giving credit - too much overlap with existing WG/IGs
CoBRA short checklist for citation of bioresources in scientific journal articles
IGSN is now in DataCite metadata schema as relatedIdentifierType

IG RDA/WDS Certification of Digital Repositories
Started with presentations, then we broke out into groups to discuss certain questions and responses in the self certification process. I also got photographed by the official photographer.

Core Trustworthy Data Repository Requirements include:

explicit mission, licenses, continuity plan, disciplinary and ethical norms, adequate funding (3-5 years) and qualified staff, expert guidance, integrity and authenticity of the data, relevance and understandability, documented processes and procedure, long-term preservation

IG RDA/WDS Publishing Data
A key topic of this session was trying to figure out the next direction the IG should take... unfortunately still to be determined

WG on Data Fitness for use - just starting - see below
OECD-GSF CODATA project: business models for sustainable research data repositories
NISO recommendation on assessment of scholarly research - non traditional metrics
Where to take the IG?

think about where scholarly publishing is going in the future. New publishing models - preprint repositories, open peer-review...

IG Data policy standardisation and implementation
Came from a BoF last plenary, but now an official IG - this meeting primarily about what already exists

UK Concordat on open research data
IG primary objective - define a common framework for research data policy allowing for different requirements, different levels of commitment and acknowledging disciplinary differences
Journal research data policy registry
Complying with funder policy is what researchers give as their motivation to share data, but researchers find it hard to comply with policy
Springer Nature Research Data Policy framework
A Data Citation Roadmap for Publishers
Do studies of quantitative results of the impact of data sharing exist? Citation benefits?

doing studies, but insufficient evidence as yet.

The Open Data Citation Advantage

Suggestion that the Belmont Forum is bringing together people for standardising policies...?

Software Source Code focus group
Good discussion in this BoF - though mainly asking questions rather than providing answers

Statement of the problem clear - need software for scientific reproducibility. But don't have suitable repositories/ontologies for source code.

differences between scientific software and open source? Can we learn from open source developers?
is RDA a suitable venue for this work? Anything else going on in this area?

Mailing lists and bug tracking chains are important sources of information about the code
Software as knowledge, versus software as an instrument in the process

Docker - focussing on re-run-ability

Archives do throw things out - so saving all the commits might not be possible/practical

Open Source software - don't know when it starts what it will turn into - often safer to archive everything and then throw things out later.

Distinction between code as knowledge and reproducibility
Cost of storage, curation and maintenance of the metadata

Reproducibility IG working on this a bit

Difference between replicability and reproducibility

Docker image not enough for reproducibility - as we need to be able to modify the source code

Don't get the chance to read a scientific article's first five drafts. People don't want to share their first drafts. Might put people off sharing.

first drafts of literature don't usually get shared, until the person writing them becomes famous, in which case people are interested

Rely on top layers overlaying archival? e.g. overlay journal
Work being done on software citation - in/out of scope? Connected to metadata
Notes from the session

WG RDA/WDS Assessment of Data Fitness for Use
New WG - meeting primarily about the criteria that can be used to assess data fitness for (re)use.

Looking at individual data sets
Needs to be efficient, high impact and visibility
Data quality: "degree to which a set of characteristics of data fulfills requirements" (ISO 9000)

any data are usable as long as they fit the requirements

Criteria 1

inherent properties: objectively verifiable/measurable e.g. validity of used methodologies, completeness of metadata
non-inherent properties: subjective assessments

Criteria 2: properties directly related to data objects/ data accessibility/ data management processes
FAIR data principles

FAIRness Index - a collection of metrics to assess adherence to the FAIR principles

DANS FAIR badge scheme - going through testing at the moment

reusability as the resultant of the other 3 (F+A+I)/3=R
scores for F,A,I as 1 to 5
publish number of user reviews, archivist assessments, downloads
mapping of reusable criteria to other F/A/I criteria
examples of star values criteria for each F/A/I
Online questionnaire system developed for reviewers of datasets
planning on creating a neutral website to assess datasets FAIRDAT.org (DAT = data assessment tool)
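As a worked example of the R = (F+A+I)/3 idea above, here is a small Python sketch. The function name and the range check are my own; the only things taken from the session are the averaging formula and the 1 to 5 scale.

```python
def reusability_score(findable, accessible, interoperable):
    """Average the F, A and I scores (each 1 to 5) to get the R score."""
    scores = {
        "findable": findable,
        "accessible": accessible,
        "interoperable": interoperable,
    }
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} score must be between 1 and 5, got {score}")
    return (findable + accessible + interoperable) / 3


# A dataset scoring F=5, A=4, I=3 gets R = (5 + 4 + 3) / 3 = 4.0
print(reusability_score(5, 4, 3))
```

Note that with a straight average, a strong score on one principle can mask a weak score on another, which may or may not be what you want from a badge.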

Issues with assessing multi-file datasets (with files in different formats), quality of metadata (how to evaluate when metadata is insufficient versus rich), how to define use of standard vocabularies

Friday, 26 August 2016

Standing on the Digits of Giants: Research data, preservation and innovation - ALPSP seminar, London, 8 March 2016

I was asked to present at an Association of Learned and Professional Society Publishers seminar, back in March this year. You can find my presentation slides here, and the audio of my presentation here.

I've info-dumped my notes on the various talks below, but to sum up, it was a very interesting seminar that seemed to go down well with an audience of primarily publishers, many of whom were getting to grips with this whole data thing for the first time.

William Kilbride, Digital Preservation Coalition

* "Access is not an event, it's a process"
* Standing on someone's shoulders is quite precarious! We need a stable and secure platform - but how do we make one?
* Solutions for digital preservation need to be put in place at the beginning of the lifecycle
* Discussions with publishers can get bogged down in Open Access issues
* Small publishers hold the content that's most at risk
* We need action on Open Access! We've talked about it lots already
* International profile is important

Mark Thorley, NERC

* The digital, networked world is a real game changer. People want on-line access now and for free. And anyone can "publish" anything on the web
* Open research is not an admin overhead
* The data revolution is replaying the printing revolution established by Gutenberg's mechanical, moveable type
* ICSU's report "Open Data in a Big Data World"
* Open research costs money - we have to learn to live with that
* Technology is the "easy bit" - people are complicated!

Robert Gurney, University of Reading

* The cloud approach is developing fast in environmental data - visualisation of data (especially large quantities of data) is very important
* Infrastructure as a service provides easy access to resources
* Problems in Big Data - volume, variety, veracity
* The Belmont Forum
  * is set up to allow common cross-national calls. Their data policy and principles are published on the web
  * is establishing a data and e-Infrastructure coordination office
  * creating a common enhanced data plan
  * planning scoping workshops and international calls for case studies and to share infrastructure and develop best practice
* NERC are leading the effort on cross-disciplinary training curriculum to expand human capacity. This will involve the UN training agency, and there will be an open call for a training champion
* The Belmont Forum implementation plan is published

Phil Jones, Digital Science

* We are moving from cottage industry to industrial scale science, but funding structures are more set up to support cottage industry science.
* Valen, Blanchat, figshare, 2015 - Survey of data policies for funders across the UK and USA
* Open Academic Tidal Wave is moving from recommendations to enforcement
* Data repositories have different approaches - structured versus unstructured
* Publishers only have a limited window of time to engage with researchers during the research workstream - but new tools are coming out to allow publishers to interact with researchers across a greater time
* If we want compliance, the simpler we can make the tools to do it, the better

Peter Burnhill, EDINA

* Increasingly more references to the wild web, not just back to other articles
* Scholarly record always has a fuzzy edge
* Libraries no longer have e-collections, only e-connections
* Mostly big publisher content being archived - but we don't know if the small stuff is being archived. Research libraries archiving stuff aren't going for the long tail of stuff published by small publishers
* Reference rot = link rot + content drift
* analysed ~ 1 million URI links - tested if URIs still worked, is there a "memento" of that reference in the "archived web"
* ~75% not archived within 14 days of publication
* Klein 2014, PLOS One - "Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot"
* rotten references mean defective articles!
* author workflow - note taking software, working with Zotero
* Publishers should accept robust links in cited reference, avoid reference rot by triggering archiving of snapshots and inserting Hiberlinks/robust links at the point of ingest into submission system.

Mike Taylor, Elsevier

* Research data metrics - interest has exploded in past few years
* NISO - data metrics recommendations - set up 3 working groups
* "metrics for non-traditional outputs" group
* recommending report dataset download usage by using COUNTER compliant formulations, and that funders support repositories to do this
* Elsevier is adapting its research infrastructure to deal with research data
* much easier to set up new products than adapt existing systems!
* Ambitions for next year:
  * most Elsevier journals promoting data publishing with data policies
  * submission system to support data citations and data submissions
  * communicate what's being done
* Data metrics part of the value loop encouraging researchers to make their data available. (Also including data)
* Metrics based on data citation will be happening in the near future, as soon as the infrastructure is built
* Not just one metric!
* article level metrics
* journal level metrics
* the more metrics, the harder it is to hide things - multiple metrics give multiple points of view

Josh Brown, ORCID

* CRediT schema - update ORCID schema to include other research roles e.g. data etc.
* Contributor type badges
* project-thor.eu
* need PIDs for organisations
* issues with versioning, identifier equivalence, granularity, changes over time, making cultural changes mainstream
* all research activities need to be taken into account
* we can't reward it if we don't recognise it
* we won't recognise it if we can't agree on what it is

Matthew Addis, Arkivum

* direct benefit to researchers in getting involved with digital preservation
* tools and services exist now that allow researchers to get on and do digital preservation
* 44% of links to Astronomy data broken after 10 years
* Researchers only really get judged on how much grant money they bring in, and how many publications - digital preservation will help with both these
* Lots of tools and models out there, but not particularly helpful for most researchers. Too much choice!
* do the bare minimum to get benefits from digital preservation - parsimonious preservation
* know what you have - understand the formats, catalogue the data
* put it somewhere safe
* link rot - how to address it?
* Droid - file format identification tool, can generate xml/pdf reports. Metadata includes links to PRONOM - technical registry for file formats
* checksums - useful to establish if data has been lost/corrupted. Tools e.g. exactly - creates BagIt manifest of files
* ADMIRe survey at Nottingham
* make lots of copies to keep stuff safe - put them in places like institutional repositories...
* links are important. DOIs are dependent on URLs, which are as brittle as any URLs - lots of links compensate for reference rot
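The checksum point in the list above is easy to try for yourself. Here is a minimal Python sketch that records a SHA-256 checksum for every file in a folder and later reports anything missing or changed. It is not a BagIt implementation and not how any of the tools mentioned work internally; the function names and the choice of SHA-256 are my own assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return a dict of files that are missing or whose contents have changed."""
    root = Path(root)
    problems = {}
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file():
            problems[relative_path] = "missing"
        elif sha256_of(target) != expected:
            problems[relative_path] = "changed"
    return problems
```

In use, you would call build_manifest when you deposit the data, keep the result somewhere safe (and separate from the data), and run verify_manifest against each copy from time to time. An empty result means every listed file is still there and unchanged.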

Wendy White, University of Southampton

* PIs as change agents - collaboration with academic leadership to enact changes
* collaboration - e.g. capturing information about equipment and facilities
* Risk of garbage in and pretty visualisations out
* Quick wins - embedding DOIs, CC0 metadata
* Zika initiative - engage with lots of other smaller initiatives as well e.g. greynet.org
* Networks of repositories - institutional repositories working with international and national disciplinary repositories
* Not making enough of theses data - encourage more theses to have data made available
* Library triaged research data services - consultancy, engagement with editors, advice, workshops
* Different training models - pick and mix, intense and seasonal, integrated pathways (what we want!), emergency boost (help panicking people)
* Southampton reviewing curricula - modules on data analysis, ethics and research methods are good areas to discuss data management
* PhD students are great agents for change - passionate advocates
* Embedded librarians inside research teams iutility.ac.uk
* Research data - more than management!
* An archive isn't a thing, it's a strategy

Peter Doorn - DANS

* Lots of different types of data journals and data papers
* Data paper describes the research context of a dataset
* Presentation of a data paper should look attractive - more user-friendly than the view of the dataset in the archive
* Variety of interactive data visualisation - make the data more alive
* publishing data in Mendeley data - Elsevier aren't making it obligatory to publish data in Mendeley Data

Friday, 1 July 2016

COPE Seminar: An Introduction to Publication Ethics, 13th May 2016, Oxford

Old books in my local second hand bookshop

The COPE (Committee on Publication Ethics) Seminar: An Introduction to Publication Ethics, was held on Friday 13th May 2016, in Oxford.

Being fairly new to this being an editor business, and the workshop being so local, I took the opportunity to go, and found it all really useful. Not only from my perspective as someone in charge of a journal, but also from the data management and publication point of view. A lot of the issues raised during the workshop, like attribution, authorship, plagiarism etc. are just as easily applied to datasets as they are to journal articles.

The workshop was a mixture of talks and discussion sessions, where we were given examples of actual cases that COPE had been told about, and we had to discuss and decide what the best course of action was. Then we were told what the response from the COPE members was in those particular cases - reassuringly we were pretty much in agreement in all cases!

Key notes that I jotted down during the day include:

Retractions of papers are growing at a rate faster than publications
An emerging area of concern is the growth of fake peer reviewers
Ethical guidelines for peer reviewers are available on the COPE website, along with other guidelines
Similarly, there are flowcharts on the COPE site to guide you through what to do if you suspect an ethical problem
Report for the Nuffield Council on Bioethics on the culture of scientific research
Academy of Medical Sciences - Reproducibility and reliability of biomedical research
Some authors will put in white quotation marks around text to get around plagiarism detection software

The main take home message for me was that COPE have a lot of resources on their website, all free to use.

Data visualisation and the future of academic publishing, Oxford, 10 June 2016

Astrolabes at the Museum of the History of Science, Oxford

Once again wearing my Editor-in-Chief hat, I was invited to the "Data visualisation and the future of academic publishing" workshop, hosted by University of Oxford and Oxford University Press on Friday 10th June 2016.

It was a pretty standard workshop format - lots of talks, but there was a wide variety of speakers, coming from a wide spread of backgrounds, which really helped make people think about the issues involved in data visualisation. I particularly enjoyed the interactive demonstrations from the speakers from the BBC and the Financial Times - both saying things that seem really obvious in retrospect, but are worth remembering when doing your own data visualisations (like keep it simple, and self contained, and make sure it tells a story).

For those who are interested, I've copied my (slightly edited) notes from the workshop below. Hopefully they'll make sense!

Richard O’Beirne (Digital Strategy Group, Oxford University Press)

What is a figure? A scientific result converted into a collection of pixels
Steep growth in "data visualisation" in Web of Science, PubMed
Data visualisation in Review: Summary, Canada 2012
Infographics tell a story about datasets
Preservation of visualisations is an issue
OUP got funding to identify suitable datasets to create visualisations (using 3rd party tools) and embed them in papers

Mark Hahnel (figshare)

Consistency of how you get to files on the internet is key
Institutional instances of figshare now happening globally e.g. ir.stedwards.edu / stedwards.figshare.com
Making files available on the internet allows the creation of a story
How do you get credit? Citation counts? Not being done yet
Files on the internet -> context -> visualisation
Data FAIRport initiative - to join and support existing communities that try to realise and enable a situation where valuable scientific data is ‘FAIR’ in the sense of being Findable, Accessible, Interoperable and Reusable
Hard to make visualisations scale!
Open data and APIs make it easier to understand the context behind the stories
Whose responsibility is it to look after these data visualisations?
Need to make files human and machine readable - add sufficient metadata!
Making things FAIR just allows people to build on stuff that has gone before - but it's easy to break if people don't share
How to deal with long-tail data? Standardisation...
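
As an aside on the "human and machine readable" note above, here is a minimal sketch of what adding sufficient metadata might look like in practice: writing a small JSON "sidecar" file next to a data file. This is my own illustration rather than anything shown at the workshop or offered by figshare; the function name write_sidecar, the field names and the ".metadata.json" suffix are all assumptions, and a real repository will have its own metadata schema.

```python
import json
from pathlib import Path


def write_sidecar(data_path, title, creator, licence, description):
    """Write a JSON metadata file next to data_path and return its path.

    The field names here are illustrative assumptions, not a standard schema.
    """
    data_path = Path(data_path)
    metadata = {
        "title": title,
        "creator": creator,
        "licence": licence,
        "description": description,
        # Recorded so a machine can check it is looking at the right file
        "file_name": data_path.name,
        "size_bytes": data_path.stat().st_size,
    }
    # e.g. rainfall.csv -> rainfall.csv.metadata.json
    sidecar = data_path.with_name(data_path.name + ".metadata.json")
    # indent=2 keeps the file readable for humans as well as machines
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar
```

Calling write_sidecar("rainfall.csv", ...) on an existing file would leave a rainfall.csv.metadata.json beside it, which a person can read in a text editor and a script can load with json.loads.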

John Walton (Senior Broadcast Journalist, BBC News)

Example of data visualisation of number of civilians killed by month in Syria
Visualisation has to make things clear - the layer of annotation around a dataset is really important
Most interactive visualisations are bespoke
It's helpful to keep things simple and clear!
Explain the facts behind things with data visualisation, but not just to people who like hard numbers - also include human stories
Lots of BBC web users are on mobile devices - need to take that into account
Big driver for BBC content is sharing on social media - BBC spend time making the content rigorous and collaborating with academia
Jihadism: tracking a month of deadly attacks - during the month there were about 600 deaths and ~700 attacks around the world
Digest the information for your audience
Keep interaction simple - remember different devices are used to access content

Rowan Wilson (Research Technology Specialist, University of Oxford)

Creating cross walks for common types of research data to get it into Blender
People aren't that used to navigating around 3-dimensional data - example imported into Minecraft (as a sizeable proportion of the population are comfortable with navigating around that environment)
Issues with confidentiality and data protection, data ownership, copyright and database rights. Open licenses are good for data, but should consider waiving hard requirement for attribution, as cumbersome attribution lists will put people off using data
Meshlab - tool to convert scientific data into Blender format

Felix Krawatzek (Department of Politics and International Relations, University of Oxford)

Visualising 150 years of correspondence between the US and Germany
Letters (handwritten/typed) need significant resource and time to process them before they can be used
Software produced to systematically correct OCR mistakes
Visualise the temporal dynamics of the letters
Visualisation of political attitudes
Can correlate geographic data from the corpus with census data
Always questions about availability of time or resources
Crowdsourcing projects that tend to work are those that appeal to people's sense of wonder, or their human interest. You get more richly annotated data if you can harness the power of crowds.
Zooniverse created a byline to give the public credit for their work in Zooniverse projects

Andrea Rota (Technical Lead and Data Scientist, Pattrn)

Origin of the platform: the Gaza platform - documenting atrocities of war, humanitarian and environmental crises

"improving the global understanding of human evil"

Not a data analysis tool - for visualisation and exploration
Data in google sheets (no setup needed)
Web-based editor to submit/approve new event data
Information and computational politics - Actor Network Theory - network of human and non-human actors - how to cope with loss
Pattrn platform for sharing of knowledge, data, tools and research, not for profit
Computational agency - what are we trading in exchange for short term convenience?
"How to protect the future web from its founders' own frailty" Cory Doctorow 2016
Issues with private data backends e.g. dependency on cloud proprietary systems
Computational capacity - where do we run code? Computation is cheap, managing computation isn't easy

Alan Smith (Data Visualisation Editor, Financial Times)

Gave a lovely example of a bad chart published in the Times, and how it should have been presented
Visuals need to carry the story
Avoid chart junk!
Good example of taking an academic chart and reformatting it to make the story clearer
Graphics have impact on accompanying copy
Opportunity to "start with the chart"
Self-contained = good for social media sharing
Fewer charts, but better
Content should adapt to different platforms
The Chart Doctor - monthly column in the FT
Visualisation has a grammar and a vocabulary, it needs to be read, like written text

Scott Hale (Data Scientist, Oxford Internet Institute, University of Oxford)

Making existing tools easy to use, online interfaces to move from data file to visualisation
Key: make it easy
Plugin to Gephi to export data as javascript plugin for website
L2project.org - compiles straight to javascript - write code once - attach tables/plot to html element. Interactive environment that can go straight into html page

Alejandra Gonzalez-Beltran (Research Lecturer, Oxford e-Research Centre)

All about Scientific Data journal
Paper on survey about reproducibility - "More than 70% of researchers have tried and failed to reproduce another scientist's experiments, and more than half have failed to reproduce their own experiments."
FAIR principles
isaexplorer to find and filter data descriptor documents

Philippa Matthews (Honorary Research Fellow, Nuffield Department of Medicine)

Work is accessible if you know where to look
Lots of researcher profiles in lots of different places - LinkedIn, ResearchFish, ORCID,...
Times for publication are long
Spotted minor error with data in a supplementary data file - couldn't correct it
Want to be able to share things better - especially entering dialogue with patients and research participants
Want to publish a database of HBV epitopes - publish as a peer-reviewed journal article, but journals wary of publishing a live resource

My response to this was to query the underlying assumption that a database needs to be published like a paper - again a casualty of the "papers are the only true academic output" meme.

Public engagement - dynamic and engaging rather than static images e.g. Tropical medicine sketchbook

About the Author

I'm Sarah Callaghan and I am the Research Practice Manager for the University of Oxford.

Previously, I was Editor-in-Chief for Patterns, a data science journal from Cell Press.

Before then I worked for the Centre for Environmental Data Analysis as a data scientist and programme manager attempting to make sense of this data citation and publication thing.

Before that I worked for the Radio Communications Research Unit (now the Chilbolton Group at STFC - Rutherford Appleton Laboratory) where I studied radio propagation at frequencies above 10 GHz (and in the process created a number of large datasets).

Needless to say, all opinions are my own, and nothing to do with my employer.

My official biography can be found here.