Scientific Data and Freedom of Information

So last week I was given a bit of a roasting by guest blogger Sarah over a remark I made on Twitter, where I said that I couldn't see why academic data should be covered by the Freedom of Information Act. Unfortunately, Twitter is not a good place to put things in context, so here's a post clarifying my views.

This was in response to a story concerning an ecologist in Northern Ireland, Professor Mike Baillie, who works in dendrochronology, the field of using tree ring data to date things. He's been ordered by the UK Information Commissioner's Office, under the Freedom of Information Act, to hand over 40 years' worth of data to a city banker, Doug Keenan, who likes to rant about climate conspiracies on the internet.

Sarah's dissection of my Twitter comments is, while a bit unfair, spot on in terms of the philosophy that data should be open. Where sensible it should be, and my discomfort with this story isn't because I'm against Open Access. I'm all for it, which is why I'm not that thrilled by this story.

Let's step back and look at the problem here, the issue that we're really trying to solve. Essentially, this is about access to research, and that comes in two parts - the publication itself, and the supporting data. Ideally I'd like both to be easily available on the internet. Why?

In her post, Sarah says:

"...what reason could there be for the public not to have access to publicly-funded academic research? When research is funded from the public coffers, surely it's automatically relevant to public interest?"

I'm actually not persuaded by the "we paid for it so we should have it" argument, because I think it skips over a key issue - we actually haven't paid for it. This is a point which is important, but which I think not enough people grasp.

The public pay for research to be done. They don't pay for peer review, or publication, or data archiving, or indeed any sort of public dissemination of information except where it's explicitly set out in the funded proposal. Peer review and publication are basically privatized, with publishers paying for peer review and recouping the money through subscription fees. Data archiving often just isn't paid for, period. Full-economic costings of research projects are very carefully worked out - if it's not budgeted for in the research proposal, we haven't paid for it.

You might think these costs are trivial, but they're not. Data archiving for a department can be a complex process involving all sorts of agreements, checks, curation, dealing with enquiries and making sure that conditions of use are adhered to, not to mention the problem of storing different formats and making them available. In a typical university department, archiving of the sort that we really want probably represents one full time job. Peer review meanwhile may cost several hundred pounds per paper (see an interesting-if-dated discussion here), so again the costs swiftly mount. Across the science budget, we're talking tens of millions of pounds.

The real argument, for me, is that open data makes life easier for everyone - scientists, researchers, curious members of the public. It means never having to deal with another paywall, and it means being able to test the idea I've just had on some data that I can start downloading in the time it takes me to do a quick Google search. Free access oils the wheels of science.

With all that in mind, let's look at the application of the FOIA. Generally I'm all for it, and I've used it on many occasions to get hold of information about government, but let's compare it to the problem here. We're trying to come up with a system that facilitates the free exchange of research and supporting data, but the FOIA:

  • Can only be used by British citizens.
  • Involves writing a formal request.
  • Involves knowing what to request (no list of available data or links to related useful data).
  • Responses can take a month or more.
  • Papers themselves may be subject to copyright.
  • Researchers or institutions are left to respond to requests in an ad hoc manner.
  • No provision made for archiving or searching information.

 

In fact, all the FOIA really achieves is to dump the problem onto individual departments, without actually doing anything to solve it. So while I'm in favour of open access, I really don't think this is the way to go about it.

What we need is a government who understand the issue, not just one small facet of it but the whole network of problems.

What we need, in my opinion, is a national strategy for scientific information that creates a proper infrastructure for data and research sharing, pays for proper archiving and peer review, and provides instant access to as wide a group of people as possible.

A good start would be to build a central archive tied to the Research Councils, with a requirement to publish the data supporting publications built into new research projects and properly funded as part of their full economic costing.

This would immediately provide a one-stop access point for academic data, and make FOIA requests obsolete, while ensuring that academics with little training in information management don't have to spend time or resources dealing with potentially vexatious direct requests.

That, to me, is a far more sensible solution for everybody. If you want open data, do it properly.

__________________

Martin is the editor of layscience.net.

Mike (not verified) on Tue, 04/27/2010 - 00:01

Clearly this is not a black and white issue, and I don't pretend to know much about the issues behind it, not working in a scientific field. Would it be at all feasible to allow journals to put their peer reviewed publications behind a paywall for a certain amount of time to get back their costs, and then make them available to the public afterwards?

Martin on Tue, 04/27/2010 - 00:12

The problem is that however the journals recoup the costs, the science budget pays them, since academics are probably the biggest source of subscription income. The best way around it would be if top journals adopted the same sort of business model as PLoS, where researchers pay the journal for peer review up front, and papers are then published openly online; but that would rely on getting the big publishers on board, and I'm not sure how likely that is to happen.

Grumpy Bob (not verified) on Tue, 04/27/2010 - 05:39

Research Councils now require a data-sharing policy statement as part of a grant proposal.

The bigger issue relating to the particular case that's initiated this blog article is that much of the data pre-date the FOIA, that the individual requesting the data is probably unable to interpret raw data, and in any case would cherry-pick from it.

Can such an FOI request be denied on the grounds that publications are still to be written?

Sean Haffey (not verified) on Tue, 04/27/2010 - 14:06

The effort involved in archiving data is largely one of computing, and solved by current tools. At the simplest, if you put data on the Internet, Google and other tools will automatically build a copy.

Nor is the amount of data an issue. In The Economist in February, one article discussed the astronomical amount of data captured by the Sloan Digital Sky Survey: a staggering 140 terabytes. Except that 140 terabytes is not much more than an entry-level storage system today.

There may be grounds for withholding raw data before the relevant paper is published, but not after.

Indeed Tim Berners-Lee is actively encouraging government to make available all its data. See here http://news.bbc.co.uk/1/hi/8470797.stm for example.

Yes, some may cherry-pick data. You don't fix that by hiding the data but by using the data to show what they have done and why it is wrong.

Sarah on Tue, 04/27/2010 - 21:15

Your points are reasonable and they reflect the cultural differences that exist between scientific subjects. In astronomy, grants for large observational projects can cover costs for archiving and curation. The international community has invested substantially in the development of pretty advanced cross-archive data discovery tools (the Virtual Observatory). The new generation of data centres and tools are approaching petabyte storage capability.

Other disciplines have obviously not seen these coordinated efforts. I agree that the Research Councils would do well to lead/drive(/pay for!) data curation efforts. It's well worth the investment.

As this has been a high-profile issue in astronomy for some years now, I guess we're used to having these solutions available and have embraced archives into our way of working. So that's why I was surprised to see such an aversion to data sharing (that's how your tweets sounded, despite what you say here), particularly in a subject as economically and sociologically relevant as climate change.

The same goes for open access to publications. Here in the Netherlands the research council NWO has just made available a 2.5M euro fund for its grantholders, to pay for open access publication costs. As soon as other funding bodies start adopting similar policies, we can start moving systematically to open access publications. In the meantime, there's always arXiv.

Martin Budden (not verified) on Thu, 04/29/2010 - 20:33

I agree that the "we paid for it so we should have it" is not necessarily the most persuasive argument for public access to scientific data, but not for the reason you state: "we actually haven't paid for it". You say "The public pay for research to be done. They don't pay for peer review, or publication, or data archiving, or indeed any sort of public dissemination of information except where it's explicitly set out in the funded proposal." This isn't strictly true - peer review and publication are at least partly paid for by the public - I imagine a large proportion of the subscriptions to scientific journals are paid by publicly funded bodies. Peer review and publication may be "basically privatized", but much of their funding still comes indirectly from our taxes.

On the subject of data archiving your statements are inconsistent. Firstly you say: "if it's not budgeted for in the research proposal, we haven't paid for it." then you say: "in a typical university department, archiving of the sort that we really want probably represents one full time job." Well, that university department and that job are largely funded by the taxpayer.

I agree with your point that data archiving is expensive. But it's an essential part of the activities of a research department. I have worked in the software industry for more than 25 years, and in that industry source code and document management is a similar activity, and I know from experience how costly it is. But it would be much more expensive not to do it. What's more, I don't think it is much more expensive to publicly archive your data than to privately archive that data. There are also opportunities to save money with publicly archived data, since it is possible to share tools and other costs with researchers in other institutions. Having said that, moving privately archived data onto a public archive is a costly process.

As you say, the real argument for open data is that science progresses more quickly with open data. Closed data is a form of friction that slows everything down. I agree with your sentiments that the FOIA is not a very good tool to solve the problem. But the FOIA is not a very good tool to solve the problem of closed government data either. In both cases the FOIA is a tool of last resort - the way for government departments and research departments to avoid the hassle and costs of complying with the FOIA is to systematically publish their data, so that FOIA requests become unnecessary. (Of course the usual caveats of patient confidentiality, licensing issues, data protection etc. apply.)

There are some very good examples of open scientific data already. In biology there are, for example, all the genomic databases (eg protein sequence and nucleotide sequence databases). Indeed I believe that sequence submission to GenBank, EMBL or DDBJ is a precondition for publication in many journals. In astronomy we also see lots of publicly available data.

Personally I don't like your idea for a central archive and one stop access point for academic data. I think a distributed model, similar to the approaches taken by biologists, astronomers and the software industry, is a much better approach. The archiving needs of different scientific disciplines differ vastly - a central archive would not meet these differing needs and would be costly. I do agree with the presumed implication that there should be additional funding to support data archiving and the infrastructure required for that archiving.

In computer software we know we have to do documentation, test code and configuration management - it's part of the job. It makes our own lives easier and makes the development process quicker and cheaper, even in the short to medium term. I don't see why this shouldn't also be the case in scientific research - proper organisation and archival of data may be expensive, but it's cheaper than not doing it. Making those archives publicly accessible is not only desirable, in the long term it is also cheaper.

By the way, I have also written to you at layscience at googlemail and have not received a reply - I assume your spam filter has eaten my mail.

Rodney (not verified) on Tue, 05/18/2010 - 15:54

FOIA (and FOISA, which is slightly different) is not really intended for casual data sharing but it is useful.

It's not just for British citizens - anyone in any country can use it.

It requires a formal request (and unless environmental, needs to be a written request, which can include e-mails). But the request can be very general. The University staff are supposed to 'advise and assist' requesters and this ought to be substantial.

This part of the FOI process is really only getting started and it will be interesting to see how it develops.

Rodney

IanH (not verified) on Wed, 05/19/2010 - 12:37

Martin

Either your email spam filters are blocking my emails or you are deliberately ignoring me. I shall stop sending emails about the 1023 teaching materials.

