Citing Bytes - Adventures in Data Citation: November 2012

Tuesday, 13 November 2012

When science and stories collide

I'm back in the office today after a wonderfully intense couple of days at SpotOn London 2012 - which I'll be blogging more about in another post.

But first, I want to talk about the Story Collider - the fringe event that kicked off the whole conference for me. It was held on the Saturday night in the upstairs room of a pub in Camden (not the usual location for scientific shenanigans, to be fair!).

I'm still not entirely sure how I wound up there, hiding at the back of the room, frantically reading and re-reading my notes. Well, yes, I do know how I wound up there. When the email came around to the registered conference attendees asking for storytellers, I took a look at it and thought "that could be interesting - I wonder if my story is appropriate?" And it went from there. The organisers liked the sound of my story outline, and that was it. I was on the list to tell my tale.

(I was, of course, blithely ignoring the fact that I was due to vanish into the wilds of West Wales the weekend before the show. Oh, and the fact that my voice was somewhat on the croaky side, and not showing any signs of coming back...)

Anyway, the Story Collider is part stand-up comedy, part confessional, and aims to bring people together to listen to and tell stories about science in their lives. Its format is simple: half a dozen storytellers, each talking for about ten minutes, standing alone on a stage in front of a microphone.

I think I can safely say it was one of the scariest experiences I've had in a long time. I'm no stranger to the stage, but there's a big difference between presenting research (where you can hide behind PowerPoint slides and acronyms) or singing songs (where the words are already written and you know them by heart), and standing in front of strangers, telling them about something that actually, really happened to you, and how it made you feel. (The feelings part was the hardest!)

Be that as it may - I did it. I was shaking like a leaf when I got off that stage, but I did it!

The audience was lovely - only a science crowd would have given me a cheer when I told them how I was finally going to get my dataset published. And I got a lot of laughs, and a lot of really nice comments afterwards too - the ones that stuck in my mind were the ones that said how nice it was to hear a story about the actual trials and tribulations of doing science.

The whole event was recorded, so I'm hoping there'll be podcasts of the show coming out in the not-too-distant future. I'd really like to listen again to the other stories told that night, as being second last in the running order meant that I was too nervous to give them my full attention!

Many thanks to all the Story Collider organisers for giving me the chance to tell my story, and my fellow story-tellers and the audience for being so supportive, and for laughing and cheering! If you get the chance to go to a Story Collider event, or even talk at one, go for it!

One theme that kept coming back in the discussions at SpotOn London was how much we scientists need to get better at telling stories and talking to people. The Story Collider provides an excellent way of doing just that.

Citing Sensitive Data - workshop report

"Burned" DVD, microwaved to ensure total elimination of private data, by NightRStar

On the 29th October, I went to the British Library for a workshop on the topic of managing and citing sensitive data, one of a series of workshops all about data citation.

I won't go into the detail of what was said during the presentations as all the slides are available on-line here, and there's a good blog post summarising the workshop here.

I will take the opportunity to re-iterate what I said in my previous post about how citation doesn't equal open. Though I will expand on it further and say that there need to be extremely good reasons for keeping data closed when public money has funded its collection (reasons along the lines of patient confidentiality, saving endangered species, etc., not "but I need extra time to write a paper!").

After all the presentations, we were split up into groups and made to do some work, it being a workshop and all. First of all, we had to come up with some example scenarios for how to cite data given certain access conditions or embargoes, and then we had to swap these with another group and try to solve them. This turned out to be a lot of fun, though I did somehow manage to wind up in the group that was threatening to fire people left, right and centre if they didn't behave!

The Yellow group were looking at access conditions for a study where different participants had given different levels of consent. The solutions they came up with were: 1) have an umbrella DOI for the whole dataset, with multiple DOIs for the subsets with different access conditions; 2) have a hierarchical DOI; or 3) have an umbrella DOI linking to subsets. The trade-off here was clarity versus nuance, and it was generally agreed that communities in different disciplines would have to decide the best approach. We also can't draw an inference on a subset of the data without taking the whole dataset into account.
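To make the first option a bit more concrete, here's a minimal sketch of how an umbrella record and its subset records might point at each other. This is my own illustration, not something the group produced: the DatasetRecord class, the link_umbrella and subset_dois_with_access helpers, the access labels and the DOIs are all made up for the example. The "HasPart" and "IsPartOf" labels are borrowed from the relation types in the DataCite metadata schema.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical record for one citable thing (whole dataset or subset)."""
    doi: str
    title: str
    access: str  # illustrative labels: "open", "restricted", "mixed"
    related: list = field(default_factory=list)  # (relation_type, doi) pairs


def link_umbrella(umbrella, subsets):
    """Link an umbrella record to its subsets, in both directions."""
    for subset in subsets:
        umbrella.related.append(("HasPart", subset.doi))
        subset.related.append(("IsPartOf", umbrella.doi))


def subset_dois_with_access(umbrella, subsets, access):
    """Return the DOIs of the umbrella's subsets that have the given access level."""
    parts = {doi for relation, doi in umbrella.related if relation == "HasPart"}
    return [s.doi for s in subsets if s.doi in parts and s.access == access]


# Made-up DOIs, for illustration only.
umbrella = DatasetRecord("10.5072/example-study", "Whole study", "mixed")
subsets = [
    DatasetRecord("10.5072/example-study.open", "Full consent", "open"),
    DatasetRecord("10.5072/example-study.restricted", "Limited consent", "restricted"),
]
link_umbrella(umbrella, subsets)
```

Someone citing the whole study would use the umbrella DOI, while someone who only had access to the fully consented data could look up and cite just that subset with subset_dois_with_access(umbrella, subsets, "open").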

The Red group were looking at embargoed data. First up was "researchers want to gain more research credit". Suggestions included: early deposit, while the embargo is still in play; access by request during the embargo; a DOI minted on deposit; an open landing page in the repository (so people know the data exists, even if they can't access it yet) showing the end-of-embargo date; and metadata specified on deposit too.
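As a rough sketch of the open landing page idea, assuming a repository stores an optional end-of-embargo date alongside each DOI: the LandingPage class, the access_status function, the wording of the status message and the DOI below are all hypothetical, and a real repository would have its own way of doing this.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LandingPage:
    """Hypothetical landing page: always public, even if the data is not."""
    doi: str
    title: str
    embargo_ends: Optional[date]  # None means the data was never embargoed


def access_status(page, today):
    """Say what a visitor to the landing page can do with the data today."""
    if page.embargo_ends is not None and today < page.embargo_ends:
        return f"Embargoed until {page.embargo_ends.isoformat()}; access by request"
    return "Open"


# Made-up DOI and dates, for illustration only.
page = LandingPage("10.5072/example-embargoed", "Example dataset", date(2013, 6, 1))
```

The point is that the DOI and the landing page exist from the day of deposit, so the data can be cited and found straight away, and only the status message changes once the embargo date has passed.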

Next the Red group looked at the situation of longitudinal cohort studies which may change and have multi-layered embargoes. Access to variables could be dependent on layers of the dataset, with access to layers potentially increasing in time. The suggestion was to have multiple DOIs for multiple layers, with links between the landing pages to show how the layers fit together.

The Green group also looked at embargoes - specifically the situation where there was retrospective withdrawal of permission for a dataset and the data was embargoed while an investigation took place. (The assumption was that the DOI had already been minted for the dataset.) Suggested action was: retain the same landing page, but add text to it detailing the embargo and the expected date when the investigations would end (compliant with the institution's investigations policy). A user option to register to get notified when the dataset becomes un-embargoed would be a nice thing to have. When the investigation is complete, update the metadata depending on the results. And, at the beginning of the data collection, make sure that the permissions and data policy are set out clearly!

The Blue group were looking at access criteria, in two cases. The first was "White rhino numbers and GPS tracking information". The suggestions were: assign a DOI to the analysed rather than the raw data, and apply access conditions to the raw data so that user credentials can be verified. The format of the public dataset could be varied, e.g. releasing it as snapshots instead of time series, or delaying the release of the dataset until the death of the tagged rhinos. Some of the rich descriptive data might also be kept back from the DataCite metadata store in order to protect the subjects.

The second scenario the Blue group looked at was animal experiments - medical testing on guinea pigs with photos and survival times. This one was noted as being difficult - though there was agreement that releasing data should be guided by funders and ethics committees. The metadata should not name individuals, and the possibility of embargoing data, or publishing subsets (without photos?) should be investigated.

In the general discussion afterwards it was (quite rightly!) pointed out that it's OK to cite and make available different levels of data (raw/processed), as raw data might well be completely incomprehensible to non-experts. We also had a lot of discussion about those two favourite topics in data citation - granularity and versioning. Happily enough, they'll be the subject of the next workshop, booked for Mon 3rd Dec.

About the Author

I'm Sarah Callaghan and I am the Research Practice Manager for the University of Oxford.

Previously, I was Editor-in-Chief for Patterns, a data science journal from Cell Press.

Before then I worked for the Centre for Environmental Data Analysis as a data scientist and programme manager attempting to make sense of this data citation and publication thing.

Before that I worked for the Radio Communications Research Unit (now the Chilbolton Group at STFC - Rutherford Appleton Laboratory) where I studied radio propagation at frequencies above 10 GHz (and in the process created a number of large datasets).

Needless to say, all opinions are my own, and nothing to do with my employer.

My official biography can be found here.