Open data horror stories: notes for an #ODCamp pitch/session

Post: 20 October 2017

Are there risks attached to open data publication, beyond those that arise from normal publication of information on the web? What’s the risk assessment and governance framework for open data?

Or to put it more bluntly: how badly can open data go wrong? Let’s have a discussion and crowdsource some examples.

Below are a few that are already on my radar.

Open publication of spending data by councils has led to a series of data protection incidents:

In 2014 DVLA’s new online Vehicle Tax Check service had to be amended after activists pointed out that people could use it to find out if their neighbours received tax concessions based on mobility-related disability. Not exactly open data, but illustrative; imagine if that had been a bulk release under an open licence.

In May 2016 an NHS trust was fined £185,000 after it mistakenly published data about 6,574 members of staff in a spreadsheet of equality and diversity metrics.

In May 2017 a “routine security review” discovered that a file containing names, email addresses and hashed passwords of registered Data.gov.uk users had been publicly accessible online since July 2015.

Earlier this year Land Registry finally admitted that its widely used Price Paid Data, released as open data in 2013, contains non-open address data. This followed an earlier admission that ICO had advised Land Registry in 2012 that house prices were probably personal data. What’s the potential fallout from that for long-term re-users of Price Paid Data?

And a couple of examples from the US:

Most of the above examples relate to inadvertent exposure of personal data, and arguably the same risks arise from access to information releases under FOI/EIR. But can open data publication exacerbate those risks?

Open data is bulk, perpetual, and specifically intended for distribution and re-use. By design there’s no way to practically track down and notify all the recipients of an open dataset. (Is Article 19 of GDPR relevant here?)

Some open data is potentially viral. What are the liability implications of unrecognised releases of not only personal data but also data that could undermine environmental protection, breach confidentiality, or assist criminal acts?

If these are real concerns, what’s our best practice for managing and risk assessing open data publication? And what is the responsibility of publishers to re-users of open data, who may not have sufficient information about how the data was produced to judge those risks themselves?