Friday, 2 May 2014

Who owns the data?


So, who does own a dataset, anyway?

Is it the researcher who sets up the instrument and makes the measurement?
Is it the company that built the instrument?
Is it the organisation that operates the instrument (from whom the researcher has bought instrument time)?
Is it the researcher's institution, who employs the researcher to make measurements?
Is it the institution's data repository, who publish the data, or restrict access to it?
Is it the funder whose grant pays the institution for the researcher to make the measurement?
Is it the government, who provides the funder with the budget to hand out grants?
Is it the taxpayer, whose taxes fund the government?*

Like so many things in life, the answers to these questions are "well, it depends..."

Ownership is a social construct. I own a car because I have a document in my filing cabinet that gives the car's make and model and says that I do. The car is also registered with a national agency (the DVLA) as mine. The car itself sits outside my house, and I have the key, which means I can use it, and other people can't without my express permission. If the car gets stolen, it's uniquely registered, so there's a good chance that (barring an experience with a quick respray and fake plates) it'll still be identifiable as mine.

I also have many books. These are mine, because I bought them. But they're not uniquely identified - most don't even have my name written on them, and I don't have a register of them, not even one independently verified by an external body. If a desperate book thief were to come and nick one of my books, well, I'd be very unlikely to get exactly that same volume back again. Yet I still own them, and feel possessive about them.

[Edited to add: my better half points out that if someone steals a book from me, they take away my ability to read that book. If someone steals a digital object, like a dataset, they're stealing a copy, and unless they destroy the original, then it's still available for use by the original owner.]

And that feeling of possession is key to how people react to data. The person who feels the most strongly about the data is the researcher who created it (part of the IKEA effect, which leads people to value things that they assemble, customize or build themselves more highly than premade, finished goods**). But an owner of something can have no feelings for it at all, as witnessed by all those paintings locked in a vault somewhere until their value improves.

That's why I think ownership is not a helpful thing to think about when it comes to data. Ownership focuses on possession - who has the data now. With it being so easy to make copies of datasets, many people can be "owners" - i.e. have the dataset in their possession. Ownership for data then becomes about who holds the "one, true dataset", and can then assert rights based on this***. 

As for the responsibilities of owners, well, I may be having a failure of imagination here, but I can't really think of any. I am perfectly within my rights to burn my book without asking anyone's permission (though causing a nuisance to the neighbours with the smoke wouldn't be good). And if someone nicks my car and goes joyriding, I'm not responsible for the damage they do. If I own a dataset, I can delete it, change it, whatever. Other people might want to use it, but tough. I own it. I get to decide what to do with it.

It's better, then, to think more about the other roles involved in data, the roles that have responsibilities as well as rights. Roles like the data creator (the researcher who made the measurement), who is responsible for the contents of the dataset and the supporting information around it, and deserves credit for their work. Roles like the data publisher (the data repository and/or library), who is responsible for releasing the data to defined subsets of the population. Roles like the data licenser, the party responsible for determining which parts of the population are allowed access to the dataset, and under what conditions. Roles like the data archiver, who decides whether a dataset should still be kept or should be deleted as it's no longer useful.

These roles don't have to be carried out by individuals; institutions are capable of doing them as well. For example, the Unseen University could act as the licenser, corporate author and publisher of data that it holds. Corporate authorship is particularly useful for datasets with large numbers of creators, as it enables credit while keeping the number of names in the citation string to meaningful levels (see as an example the list of volunteers for Galaxy Zoo at - note that the URL for the list names them all as authors!).

So, when discussing data, especially with the people who have put weeks, months and years of their life into the datasets they've created, it's a good idea to think about more than ownership of the data. Think and talk about those other roles and responsibilities. That way it becomes less about asserting rights and possessiveness, and more about the data itself.

And, in the future, as data becomes more open, and the mechanisms exist for giving the data creators (and their employers, funders and support staff) the credit they deserve, then hopefully the issue of ownership won't be so much of a problem.

* This happens to be my personal opinion. The results of publicly funded research should be made available for the benefit of all. In other words, open, unless there's a damn good reason not to.
** The proper link to the paper publishing this study is , but it's paywalled.
*** I'm sure I'm missing out all sorts of technical, legal stuff here...