Friday, 24 January 2014

Cite what you use

Poster for "Screen as Landscape" Exhibition at the Stanley Picker Gallery, Kingston University, December 2011, and "Screen as Landscape", Dan Hays, PhD thesis, 2012, Kingston University. (From

What do you cite, the dataset or the data article? Or should it be both?

There's a lot of confusion about this, mainly stemming from the whole notion that the data article is a direct citation substitute (or proxy) for the dataset it describes (which, to be fair, it can be). Citing both the dataset and the data article gives rise to accusations of "salami-slicing" and double counting, whereas citing only the dataset could be seen as taking citations away from the article (or vice versa).

The way I see it, the dataset and its corresponding data article are two separate, though related, things. It's time for another analogy!

Consider the Fine Arts. If you were wanting to do a PhD in the Fine Arts, you would need to produce a Work of Art (or possibly several, depending on your chosen form of Art) and you would also need to write a thesis about that Work, providing information about how you created the Work, why you did it the way you did, the context and reasoning behind it, and all that sort of important background information.

Now, if I was wanting to write a critique of your Work of Art, I could do so without ever reading your thesis. In that case it'd be entirely appropriate to cite the Work, but I'd have no need to cite the thesis.

If, on the other hand, I was wanting to write an article about the history and practice of a technique you used to create your Work of Art, and I read and used information from your thesis to support my argument, then I'd definitely need to cite your thesis. (I could choose to cite the Work of Art as well, in passing, but might not need to. After all, anyone wanting to find out about the Work can read the thesis I've cited and get to it that way. And I'm not actually discussing the Work itself.)

With me so far?

Ok, so the Work of Art is the dataset, and the thesis is the data article. It starts getting a bit murky in the data world, because often there isn't enough contextualising information in the dataset itself to allow it to be used/critiqued/whatever easily, and that information is captured and published in the data article (which is one of the main reasons for having data articles - to make that sort of important information and metadata available!).

Historically, in many disciplines (in the dark days before data citation), important datasets were cited by proxy - i.e. the authors of the dataset published a paper about it, and then others cited that paper as a stand-in for the dataset. The citation counts for that paper then became the citation counts for the dataset, which had the virtue of being simple enough and a valid work-around to the problem of the lack of a common practice of data citation.

But now we have the situation where a dataset can be cited independently from its data article. And we have the following situations:
  1. Both dataset and article are cited. Data creator is very happy (two citations!). Data publisher is happy (citation!). Data article publisher is happy (citation!). Reader of the citing article may not be happy (potential accusations of double counting of citations and salami-slicing...). Publisher of citing article might not be happy (not enough space in reference lists, potentially two citations that look like they're for the same thing).
  2. Only the dataset is cited. Data creator is happy (citation!). Data publisher is happy (citation!). Data article publisher is not happy (though might be mollified by the fact that there are links from the dataset back to the data article). Reader of the citing article may not be happy (may want more info about the dataset that is only provided in the data article). Publisher of citing article is probably not bothered one way or another (depending on journal policies for citing data).
  3. Only the data article is cited. Data creator is happy (citation!). Data publisher is not so happy (but probably resigned, no citation, but link from data article to dataset, so not as bad as old days with no link to the data at all). Data article publisher is happy (citation!). Reader of the citing article may not be happy (may want a direct link to the data). Publisher of citing article is content (situation normal).
It's a balancing act!

Honestly? I do think cultural norms will evolve within the different research domains over time. We should be prepared to give them a gentle nudge if they look like they're going completely haywire, but for the most part I'd say let them grow.

And for me, when asked "But what should I cite?!?", my default answer will be "Cite what you use".

  • If you use a data article to understand and make use of a dataset, cite them both.
  • If you use a dataset, but don't use any of the extra information given in the data article, cite the dataset.
  • If you use a data article, but don't do anything with the dataset, cite the article.
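
For anyone who prefers their rules of thumb in code, here's a minimal sketch of the three bullets above in Python. The function name and its two flags are made up purely for illustration (they don't come from any citation tool or standard), and real cases will be messier than two yes/no questions.

```python
from typing import List


def what_to_cite(used_dataset: bool, used_article: bool) -> List[str]:
    """Hypothetical helper: list what to cite under "cite what you use".

    used_dataset -- you did something with the data itself
    used_article -- you relied on information from the data article
    """
    citations = []
    if used_dataset:
        citations.append("dataset")
    if used_article:
        citations.append("data article")
    return citations


# The three cases from the list above
print(what_to_cite(used_dataset=True, used_article=True))
print(what_to_cite(used_dataset=True, used_article=False))
print(what_to_cite(used_dataset=False, used_article=True))
```

The point of writing it this way is that the two questions are independent: each thing gets cited only if you actually used it, and neither one stands in as a proxy for the other.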

Cite what you use!