Citing Bytes - Adventures in Data Citation: 2015

Friday, 2 October 2015

RDA Plenary 6, DataCite and EPIC, and e-Infrastructures - Paris, September 2015

La Tour Eiffel

Last week was the 6th Plenary of the Research Data Alliance, held in Paris, France. It officially started on the Wednesday, but I was there from the Monday to take advantage of the other co-located events.

DataCite and EPIC - Persistent Identifiers: Enabling Services for Data Intensive Research

Monday, September 21, 2015

This workshop consisted of a quick-fire selection of presentations (12 of them!) all in the space of one afternoon, covering such topics as busting DOI myths; persistent identifiers other than DOIs; persistent identifiers for people (ORCIDs and ISNI - including showing Brian May's ISNI account, which links his research with his music); persistent identifiers for use in climate science; the International Geo Sample Number (IGSN) - persistent identifiers for physical samples; the THOR project - all about establishing seamless integration between articles, data, and researchers across the research lifecycle; and Making Data Count - a project to develop data level metrics.

(I also learned that DOIs are assigned to movies, as part of their supply chain management.)

Questions were collected via Google doc during the course of the workshop, and have all since been answered, which is very helpful! I understand that the slides presented at the workshop will also be collected and made available soon.

e-Infrastructures & RDA for data intensive science

Tuesday, 22 September, 2015

This was a day-long event featuring several parallel streams. Of course, I went to the stream on Research Data infrastructures for Environmental related Societal Challenges, though I had to miss the afternoon session because I needed to be at the RDA co-chairs meeting (providing an update on my Working Group and also discussing important processes, like, what exactly happens when a Working Group finishes?). Thankfully, all the slides presented in that stream are available on the programme page.

Unsurprisingly, a lot of the presentations at this workshop dealt with the importance of e-infrastructures to address the big changes we'll need to face as a result of things like climate change. There was also talk about the importance of de-fragmenting the infrastructure, across geographical, technological and domain boundaries (RDA being a key part of these efforts).

A common feature of this, and the other RDA meetings, was the analogy between data infrastructures and other infrastructures, like those for water or electricity. Users aren't worried about how the water or power gets to them, or how the pipes, agreements and standards are generated. They just want to be able to get water when they turn the tap, and electricity when they flick a switch. Another interesting point was that there's a false dichotomy between social and technical solutions; what we really have is a technical solution with a social choice attached to it.

Common themes across the presentations were the sheer complexity of the data we're managing now, whether it's from climate science, oceanography or agriculture, and the need to standardise and fill in the gaps in infrastructure that exist now.

RDA 6th Plenary

Wednesday 23 to Friday 25 September, 2015

As ever, the RDA plenaries are a glorious festival of data, with many, many parallel streams, and even more interesting people to talk to! It's impossible to capture the whole event, even with my pages of notes.

If I can pick out a few themes though, these are them:

Data is important to lots of people, and the RDA is a key part of keeping things going in the right direction.
Infrastructures that exist aren't always interoperable - this needs to be changed for the vast quantities of data we'll be getting in the future.
The RDA is all about building bridges, connecting people and creating solutions with people, not for them.
Uncertainty is the enemy of investment – shared information reduces uncertainty

Axelle Lemaire, Minister of State for Digital Technology, French Ministry of Economy, Industry and Digital Technology, said that people say that data are the oil of the 21st century, but this isn't such a good comparison – better to compare it to light – the more light gets diffused, the better it is, and the more the curtains are open the more light gets in. She is launching a public consultation on a digital bill she's preparing and is looking for views from people outside of France - the RDA will distribute the information about this consultation at a later date.

It's interesting now that the RDA has matured to the point that several working groups are either finished, or will be finished by the next plenary (though there is still some uncertainty about what "finished" actually means). The 18 month lifespan of the working groups is enough time to build/develop something, but the actual time to get the community to adopt those outputs will be a lot longer. So there was plenty of discussion about what outputs could/should be, and how the adoption phase could be handled. I suspect that, even with all our discussions, no definite solution was found, so we'll have another phase of seeing what the working groups decide to do over the next few months.

This is of particular relevance to me, as my working group on Bibliometrics for Data is due to finish before the next plenary in March. We had a packed meeting room (standing room only!) which was great, and we achieved my main aim for the session, which was to decide what the final group outputs would be, and how to achieve them. Now we have a plan - hopefully the plan will work for us!

A key part of that plan is collecting information about what metrics data repositories already collect - if you are part of a library/repository, please take a look at this spreadsheet and add things we might have missed!

I went to the following working group and Birds of a Feather meetings:

WG RDA/WDS Publishing Data Workflows

All the collected workflows are now available as reference models
Recommendations for newcomers: follow standards, start small and build components, document roles, workflows and services, trusted repository development

WG RDA/WDS Publishing Data Bibliometrics

We have a plan for our group outputs, which will basically map the landscape for data bibliometrics as it stands - identifying what needs to be done, and the other groups that are addressing aspects of this problem (which is a big one!)

Joint meeting of IG RDA/WDS Publishing Data Cost Recovery for Data Centres

We did a SWOT analysis of different methods for funding repositories

BoF on Data Cultures, Practices, and Ethics

Interesting stuff this, anthropologists and social scientists looking at how we deal with data as humans. Not directly relevant to me, but I think I'll keep half an eye on it purely out of personal interest.

BoF on Earth System Science data management

This meeting was mostly presentations (as detailed in the agenda) but with a bit of time to discuss maybe setting up an interest group, though I don't think anything was formally decided.

WG RDA/WDS Publishing Data Services

There were a few demonstrations made, which showed off how far the group has come in developing a potential new service.
Obviously, when ingesting links from several places, standards and interfaces are needed!

Supporting RDA women networking breakfast

An interesting meeting, despite it being held in a corner of the main marquee, so it was really difficult to get a proper conversation going. RDA is about 1/3 female, which is good, but given that more than 50% of Internet users are female, we need to be careful of the human aspect of our work. It was also very good to see several male RDA members in attendance too - this is not just a women's issue!

Joint meeting of IG Domain Repositories, IG Libraries for Research Data, IG Long tail of research data & IG National Data Services: Building Connections between Libraries, Discipline Repositories and Data Services

Again, the fragmented landscape of repositories came up - we'll need to help people navigate it and find the best places for their data
There was some discussion about commercial data repositories, and the threat they pose to domain/institutional ones. My thoughts (as part of a domain repository) - I'd rather have the data with minimal metadata in a commercial repository than lost on a CD in a drawer somewhere. And the commercial companies are putting pressure on us to up our game. If we're losing researchers to them because it's easier to put data in the commercial repositories, then we either have to make it easier to put data into ours, or really explain why the pain is worth it!

Joint meeting of IG RDA/WDS Publishing Data, WG RDA/WDS Publishing Data Services, WG Data Description Registry Interoperability (DDRI), WG RDA/WDS Publishing Data Bibliometrics & WG RDA/WDS Publishing Data Workflows

We had a lot of discussion about the structure of the Publishing Data Interest Group, now that most of the Working Groups under its umbrella are coming to an end. Personally, I think there's still a lot that this group can do - we haven't touched on issues like peer review of data for example, plus implementation and adoption of the working group outputs is going to take a while. But having a refresh of the group is probably a good thing too.

So, that was RDA Plenary 6. Next plenary will be held in Tokyo, Japan from the 1st to the 3rd of March 2016. In the meantime, we've got work to be getting on with!

Friday, 31 July 2015

Just because we can measure something...

What are you trying to tell me? - Day 138, Year 2

So, I recently finished a 100 day challenge, where I gave up chocolate, cake, biscuits, sweets, etc., and attempted to be healthier about my eating and to exercise as often as I could. This was to see if I could keep off the sugar for 100 days, and also in the hope that I'd lose some weight.

At the end of my 100 days, I stood on the bathroom scales, and I'd lost a grand total of... wait for it... 0 lb. Bum.

And my brain being what it is, I instantly thought "well, that was a waste of time, wasn't it? Why did I even bother?"

Then my inner physicist kicked in with: "I like not this metric! Bring me another!" (So I found more metrics about how many km I'd run in the hundred days, and how many personal bests had been achieved, and I felt better.)

But that all got me thinking about metrics, and about how easy it is to doom good work, simply because it doesn't meet expectations with regard to one number. Currently, research stands or falls by its citation count - and we're trying to apply this single metric to even more things.

And that got me thinking. What we want to know is: "how useful is our research?" But an awful lot of metrics come at it from another angle: "what can we measure and what does that mean?"

So, citations. We are counting the number of times a paper (which is a proxy for a large amount of research work) is mentioned in other papers. That is all. We are assuming that those mentions actually mean something (and to be fair, they often do) but what that meaning is, isn't necessarily clear. Is the paper being cited because it's good, or because it's rubbish? Does the citer agree with the paper, or do they refute it? This is the sort of information we don't get when we count how many times a paper has been cited, though there are moves towards quantifying a bit better what a citation actually means. See CiTO, the Citation Typing Ontology, for example.
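
To make that concrete, here's a minimal sketch in Python of what a bare count hides. The citation records are entirely made up for illustration, and the type labels are borrowed from CiTO property names (agreesWith, disagreesWith, refutes) as an assumption about how typed citations might be recorded - it isn't how any real citation index stores things.

```python
from collections import Counter

# Hypothetical records: (citing paper, cited paper, citation type).
# The type labels mirror CiTO property names, used here only for illustration.
citations = [
    ("paper-A", "paper-X", "agreesWith"),
    ("paper-B", "paper-X", "refutes"),
    ("paper-C", "paper-X", "refutes"),
    ("paper-D", "paper-X", "disagreesWith"),
]


def citation_count(records, cited):
    """The usual metric: how many times a paper is mentioned."""
    return sum(1 for _, target, _ in records if target == cited)


def citation_breakdown(records, cited):
    """The same mentions, grouped by what the citer meant."""
    return Counter(kind for _, target, kind in records if target == cited)


total = citation_count(citations, "paper-X")
breakdown = citation_breakdown(citations, "paper-X")
print(total)
print(dict(breakdown))
```

In this toy example the count says "four citations", while the breakdown shows that three of the four take issue with the paper.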

Similarly for Twitter, we can count the number of tweets that something gets, but figuring out what that number actually means is the hard part. I've been told that tweets don't correlate with citations, but then that begs the question, is that what we want to use tweet counts for? I'm not sure we do.

We can count citations, tweets, mentions in social media, bookmarks in reference managers, downloads, etc., etc., etc. But are they actually helping us figure out the fundamental question: "how useful is our research?" I don't think they are.

If we take it back to that question, "how useful is my research?" then that makes us rethink things. The question then becomes: "how useful is my research to my scientific community?", or "to industry?", or "to education?". And once we start asking those questions, we can then think of metrics to answer those questions.

It might be the case that for the research community, citation counts are a good indicator of how useful a piece of research is. It's definitely not going to work like that for education or industry! But if those sectors of society are important consumers of research, then we need to figure out how to quantify that usefulness.

This being just a blog post, I don't have any answers. But maybe, looking at metrics from the point of view of "what we want to measure" rather than simply "what can we measure and what does it mean?" could get us thinking in a different way.

(Now, if you'll excuse me, I have an important meeting with a piece of chocolate!)

Thursday, 30 April 2015

Data, Metadata and Cake

From http://epicgraphic.com/data-cake/

I saw this analogy and thought it was a good one - because of course you need to consume the information before it can become knowledge (and because cake - does anyone need another reason?)

And then, thinking about it a bit more, I developed the analogy further:

If we consider that the raw data, straight out of the instrument/wherever, are the raw ingredients, then obviously there's a bit of processing to be done to turn them into something consumable, like this cake.

Sponge cake picture by nettle1234 from http://allrecipes.co.uk/recipe/12122/basic-plain-sponge-cake.aspx

This dataset/cake looks very nice. Someone's obviously taken care with it, it's nice and level and not burned or anything. But it still looks a bit dry, and would definitely need something to go with it, a nice cup of tea, perhaps.

Now, if we consider adding a layer of metadata/icing around the outside of the dataset/cake...

Victoria sponge from https://gollygoshgirl.wordpress.com/2013/06/05/a-little-twist-on-the-classic-victoria-sponge/

Doesn't that look so much more appealing? (Or it does to me anyway - you might be someone who doesn't like chocolate, or strawberries, or cream...but the analogy still works for your preferred cake topping!)

Metadata makes your dataset easier to consume, and makes it more appealing too.

Of course, you get good metadata, that adds to the dataset, makes it look gorgeous and yummy and delicious...

From Sweet Bakes

And then there's the bad metadata, which, er... doesn't.

From Cake Wrecks

And the moral of my analogy? Your dataset might be tasty enough for people to consume without metadata, but adding a bit of metadata can make it even yummier!

(mmmm....cake....)

Thursday, 26 February 2015

Just why is citation important anyway?

The four capital mistakes of open source by opensource.com, on Flickr

I recently had it hammered home to me about just how important citations are in scientific research. This came about as the result of me reviewing a document*.

Me being me, the first thing I did was turn to the back to look at the bibliography**. It was a mess, but I can understand how citation strings get all mucked up. I remember when I was writing my PhD, I had to copy and paste, or even retype, all my citations into the files that were my thesis chapters (files - multiple, because Word couldn't cope with having all the chapters in the one file). Nowadays I have discovered the wonder that is Mendeley, and citations are so much easier to deal with - they even do data citations!

Then I read the document, and at one point I said to myself, "Self, this equation looks a bit funny to me. Oh look, here's the citation for the paper it comes from - let's look at the original source to make sure that there are no copying errors in the equation." So verily, I looked up the cited paper, and yay! It was open and accessible. But could I find the quoted equation in the cited paper? Er, no.

There was another moment, where one of my publications was cited as the source for a particular figure. I looked at the figure, and at my name in the caption next to it, and went and checked the cited document. Again, this figure was not contained in the cited publication.

These were the only examples of mis-citation that I caught, but I did find myself scrawling [citation needed] repeatedly in various places throughout the whole work. And every time I did so, my confidence in the research being presented waned a little bit more.

(Unfortunately, it goes without saying that none of the data presented in this work was cited properly either...)

Yes, all researchers stand on the shoulders of giants, and use work that has been published before to support their arguments. But it's important to not rely on unsupported statements of fact being "stuff everyone knows". Yes, the report might be written for a specialist audience who do indeed know all that, and know the citations you'd use to support the statement, but they're not your only audience. And providing citations demonstrates that you've done your due diligence, and can back up your assertions properly.

At the end of the day, when I read a paper or report, I can't check everything that the author(s) have done, so I have to take a certain amount on trust. This trust can be damaged seriously by some silly little things, like too many typos or unreadable graphs (curves all printed in similar shades of grey), and by some serious things, like mis-citations, or no citations at all.

So, citations. They're not just for helping reproducibility, or assigning credit - they also act as a marker that the author(s) knows their background and pays attention to those tricky details that can easily catch you out in science. Honestly - citations are the easy part, but if you don't have the energy to care about them (even though they're annoying) then how can your reader be sure you've applied the same care to the "more important" bits of your research?

____________

* I'm not going to give any names or details about the document, because that's not fair, and not the point of this post.

** Yes, I am a pedant!

My biography

Dr Sarah Callaghan is Research Practice Manager at the University of Oxford, dedicated to supporting researchers through policy and training to ensure that the integrity of research is preserved, and that research excellence is underpinned by the principles of honesty, rigour, impartiality, collegiality, trust, transparency, and accountability.

She was formerly Editor-in-Chief for Patterns - a gold open access, multidisciplinary journal of data science launched by Cell Press. She came to Patterns from a twenty-year career in creating, managing and analysing scientific data. Her research started as a combination of radio propagation engineering and meteorological modelling, then moved into data citation and publication, data sharing, visualisation, metadata, and data management for the Centre for Environmental Data Analysis. She was Editor-in-Chief of the Data Science Journal for 4 years, and has over 100 publications. In her first degree she taught a neural network the difference between Bach and Stravinsky. Her personal experience means she understands and sympathises with the frustrations that researchers can have with data!

Her publication list can be found here.

(last updated 9th January 2023)

About the Author

I'm Sarah Callaghan and I am the Research Practice Manager for the University of Oxford.

Previously, I was Editor-in-Chief for Patterns, a data science journal from Cell Press.

Before then I worked for the Centre for Environmental Data Analysis as a data scientist and programme manager attempting to make sense of this data citation and publication thing.

Before that I worked for the Radio Communications Research Unit (now the Chilbolton Group at STFC - Rutherford Appleton Laboratory) where I studied radio propagation at frequencies above 10 GHz (and in the process created a number of large datasets).

Needless to say, all opinions are my own, and nothing to do with my employer.

My official biography can be found here.