How far can we trust open data?

Posted: 29 November 2014

This post is also published on the Royal Statistical Society’s StatsLife site.

How far can we trust open data?

This is a trick question, of course, with no right answer.

When it comes to data, “open” is mainly a licensing approach. Open release can amplify the utility of a dataset, but it tells us nothing about the quality of the data or the processes that went into its production. Those are technical features of the dataset, unaffected by the legal conditions for re-use.

The trustworthiness of open data depends on the particulars of the individual dataset and publisher. Some open data is robust, and some is rubbish. That doesn’t mean there’s anything wrong with open data as a concept. The same broad statement can be made about data that is available only on commercial terms.

But there is a risk attached to open data that does not usually attach to commercial data.

Open datasets are very easy to pick up and use right away, because the licensing is so permissive and because publishers strive to make things as simple as possible for re-users. There is no negotiation between parties and no formal procurement process. That means there is less motivation (and less opportunity) to inquire deeply into the background of the dataset.

After all, open data is free. Why look a gift horse in the mouth?

This is a risk particularly for developers and development-led startups, who may be embracing open data without the discipline of previous experience with commercial licensing. Entrepreneurial developers are often excited by the potential of what they can make with data and less interested in the niceties of what they are allowed to do legally or should do ethically.

It’s tempting to treat the dataset itself as a consumable, so long as it’s properly formatted and the code runs.

Commercial data publishers struggle with the problem of protecting their revenue against “grey use”: when data is supplied for an agreed business purpose but then re-used for wider purposes by their licensees. Open data publishers have less of an incentive to monitor and police how their data is used. Re-users of open data are usually left to their own devices.

Three of the central issues that open data re-users should watch out for are: data quality, third-party rights, and personal data.

Land Registry’s Price Paid Data

Price Paid Data is a large dataset containing address-level records of residential property transactions in England and Wales from 1995 to the present. The dataset is published by Land Registry under the Open Government Licence and updated monthly.

It’s not my intention to single out this dataset for criticism. Price Paid Data is one of the most useful and economically significant datasets released as open data under the current Government. However due to recent events Price Paid Data provides a useful illustration of some pitfalls for open data re-users.

[Image: a street in Loughborough]

Data Quality

Land Registry is empowered under statute to require the registration of property transactions, to record address details and sale prices, and to make that information available in a public register.

Price Paid Data is the processed output of data collected and submitted through conveyancing and registration systems. Land Registry does not routinely verify individual transactions. However the quality of Price Paid Data is high, particularly considering the size of the dataset and the fact that it is basically crowdsourced. A small percentage of records contain obvious typos and other errors, but no more than one would reasonably expect.

This week Land Registry revealed in a blog post that, following a report from a customer, it had discovered more than 48,000 duplicate entries in Price Paid Data for 2003-05. An investigation traced this problem to an internal error in a process that was changed in early 2005. (A corrections file for the duplicate entries is now available.)

The number of entries affected is small as a percentage of the whole Price Paid dataset but large in absolute terms. It’s difficult to judge without knowing more about the internal error, but usually it should be possible to detect duplicate entries with routine checks on data quality. 
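As an illustration of what a routine duplicate check might look like, here is a minimal sketch in Python. It is not Land Registry's process, and we don't know what form the 2003-05 duplicates actually took. The column names (`id`, `price`, `date`, `postcode`, `paon`, `saon`, `street`) and the sample rows are invented for the example; the real file layout should be checked against the publisher's documentation. The assumption is that a duplicate is a record that matches another on price, date and address even though its identifier differs.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names, chosen for this example only.
KEY_FIELDS = ("price", "date", "postcode", "paon", "saon", "street")


def find_duplicates(rows, key_fields=KEY_FIELDS):
    """Group rows by a normalised key and return the groups with more than one row."""
    groups = defaultdict(list)
    for row in rows:
        # Trim whitespace and ignore case so trivial differences do not hide a match.
        key = tuple(row[field].strip().upper() for field in key_fields)
        groups[key].append(row)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Invented sample rows, not taken from the real dataset.
SAMPLE = """id,price,date,postcode,paon,saon,street
A1,125000,2004-03-12,LE11 1AA,12,,HIGH STREET
A2,125000,2004-03-12,LE11 1AA,12,,High Street
A3,98000,2004-05-01,LE11 2BB,7,FLAT 2,PARK ROAD
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
duplicates = find_duplicates(rows)
for key, group in duplicates.items():
    print(key, [row["id"] for row in group])
```

A match on these fields is only a candidate duplicate. The same property can legitimately appear more than once in the data, so flagged groups would still need a human look before anything is deleted.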

There are a couple of learning points here. Re-users should bear in mind that the standards for open release of historical data may be lower than the standards for commercial productisation of new data. In this case Land Registry had no reason to take any additional steps to quality-assure its older data prior to its release as open data last year.

I am not arguing that Land Registry should have taken such steps. Public sector organisations must assess the general suitability of their data for open release, but I don’t want them to get the idea that they need to invest significant resources. I support the principle of “publish early even if imperfect”.

That does mean though that re-users should get into the habit of performing their own data quality tests on open datasets – perhaps with more attention than they might give to commercial data products.
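A first pass at such tests need not be elaborate. The sketch below is one possible shape, assuming records have already been loaded as dictionaries with hypothetical `price`, `date` and `postcode` fields. The postcode pattern is a loose check on shape only, not a full validation, and the sample records are invented.

```python
import re
from datetime import datetime

# Loose shape check for UK postcodes; an assumption for this sketch, not a full validator.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}$")


def check_record(record):
    """Return a list of problems found in one record (empty if none were found)."""
    problems = []

    price = record.get("price", "")
    if not price.isdecimal() or int(price) <= 0:
        problems.append("price is not a positive whole number")

    try:
        datetime.strptime(record.get("date", ""), "%Y-%m-%d")
    except ValueError:
        problems.append("date is not in YYYY-MM-DD form")

    # A blank postcode is tolerated here; only a malformed one is reported.
    postcode = record.get("postcode", "").strip().upper()
    if postcode and not POSTCODE_RE.match(postcode):
        problems.append("postcode has an unexpected shape")

    return problems


# Invented records for illustration only.
records = [
    {"id": "B1", "price": "125000", "date": "2004-03-12", "postcode": "LE11 1AA"},
    {"id": "B2", "price": "0", "date": "2004-13-01", "postcode": "LE11"},
]
report = {record["id"]: check_record(record) for record in records}
for record_id, problems in report.items():
    print(record_id, problems or "ok")
```

Which checks matter depends on what the re-user intends to do with the data. The point is to make them a habit rather than an afterthought.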

We don’t know which Land Registry customer reported the above issue. However it’s worth noting that Price Paid Data was licensed to more than 30 commercial subscribers (such as Zoopla) for a long time before it became open data, and the duplicate entries from 2003-05 were found only recently.

That is the silver lining for quality-assurance of open data: open licences maximise re-use, which means more users and re-users, which increases the likelihood that errors will be detected and reported back to the publisher.

Third Party Intellectual Property Rights (IPR)

Recently some people involved with the Open Addresses project have been investigating the working methods behind a number of existing open public datasets that contain UK address data.

This is what Land Registry has to say about addresses in Price Paid Data (from an FOI response to Peter Wells, following up on an FOI response to Stuart Harrison):

If LR can match a customer submitted address to the OS AddressBase data (which includes Royal Mail PAF data) we will use that address. If we cannot match the address with the AddressBase data, we will use the customer submitted address as the basis of a new address record in our database. However that address may be validated against other addresses in the locality, e.g. we may add a locality that matches other localities in the same street to ensure we have a consistent dataset. In such cases, some of the addresses used to validate may have themselves been validated using AddressBase.

and:

Our view is that only a very small percentage of addresses in the Price Paid Data would not have been created or validated using AddressBase.

Anyone who has been following discussions around Ordnance Survey derived data as a barrier to open data release, and the campaign for release of an open national address dataset, will appreciate the significance of the above information.

Basically re-users cannot be certain who owns the address records in Price Paid Data. Land Registry has released the dataset under the Open Government Licence, but OGL exempts “third party rights the Information Provider is not authorised to license”. Ordnance Survey has not authorised open re-use of any content derived from AddressBase. (And more importantly Royal Mail has not authorised open re-use of the PAF data included in AddressBase.)

The Open Addresses team seem to have concluded they cannot use Price Paid Data for their project. That may be a prudent decision. OS and Royal Mail are unlikely to look the other way if Open Addresses launches an address product in competition with theirs that contains their data.

Alternatively, perhaps Open Addresses is being overly cautious. I may be missing something but it looks to me as if the exceptions to copyright in s47 of the Copyright, Designs and Patents Act 1988 and to database right in Schedule 1 of the Database Regulations (SI 1997/3032) should cover broad re-use of Price Paid Data, even if AddressBase or PAF have been used to correct the addresses.

If the Open Addresses interpretation is correct, the practical implications for other re-users of Price Paid Data are unclear. It would probably be impolitic for OS or Royal Mail to make trouble over any re-use that does not directly threaten their markets. To do so would only underline their effective monopoly over national address data and the barriers that presents to re-use of many other location datasets. There is also an argument that OS and Royal Mail have acquiesced to open re-use of their derived data in Price Paid Data, by not insisting that Land Registry make their rights clear to re-users. However neither point creates firm legal ground for re-users.

The learning points for re-users are that open licences don’t provide any protection from third-party liability and – more importantly – they don’t create any obligation on open data publishers to make sure re-users are aware of any such potential liability.

Personal Data

Usually I’m keen to emphasise the bright line between open data and personal data. Open datasets are almost invariably non-personal data. It’s important to distinguish between open data and data sharing, a wider subject that often involves personal data and is sometimes controversial. The public reputation of open data can be damaged if it is confused with sharing of personal data.

However publication of open data does sometimes require an understanding of personal data and data protection issues. Open data delivery can become a vector for inadvertent release of personal data. Publishers may not realise that their datasets contain personal data, or that analysis of a public release can expose information about individuals.

An open data licence does not relieve re-users of their obligation to comply with the Data Protection Act, if they discover a dataset contains personal data. Normally this consideration should not arise. But open data based on public registers can present special problems. Public registers are an exception to the normal rules for processing of personal data. However that exception may only extend to publication of the register itself, not to onward re-use of the data.

According to Land Registry:

Price Paid Data is not personal information about individuals but property related information.

In a Privacy Impact Assessment Review conducted prior to the open data release, Land Registry is slightly more cautious:

We also spoke to the Information Commissioner’s Office (‘ICO’) and confirmed the steps and ongoing evaluation we have undertaken over the last twelve months. We confirmed our view that monthly PPI was not personal data but property related information remained unchanged.

and:

Our evaluation confirms our earlier views that PPI is not biographical in nature as the focus of the information is on the property and not on the person who owned or sold the property.

Under the DPA, data is likely to be personal if it has “biographical significance” in relation to the individual. It’s generally accepted that residential addresses by themselves, i.e. without the names of the occupants, are not personal data. But isn’t the price paid for a property a significant biographical fact about both the seller and the buyer?

Not everyone realises that “exchanged at” prices become public knowledge. In most US states the situation is the reverse of the UK; land ownership records are widely available but prices paid for residential properties are not generally disclosed.

There is no question that Land Registry is legally entitled to publish this information in the public register. But Land Registry also has an incentive to argue that prices paid are not personal data for purposes of re-use beyond the purposes of the register, because it was supplying Price Paid Data to commercial licensees for years before the dataset was released as open data.

As Land Registry says, there have been no formal legal challenges to re-use of Price Paid Data. My own preference is that prices paid, as well as most other property data, should be treated as non-personal. But I am not entirely convinced by Land Registry’s view.

I am mindful, for example, that Environment Agency treats as personal data the fact of whether an individual address is registered with its Flood Warnings Direct service, and also that DECC redacted address information from its recent release of record-level National Energy Efficiency Data. That seems remarkably inconsistent. Why are those facts about a property more personal than the amount paid to acquire the same property?

In practice data protection is unlikely to be a barrier to re-use of Price Paid Data, because the public is accustomed to the information being available. But re-users should be aware of subjectivities around the definition of personal data, and that whether data is personal can sometimes change depending on the context in which it is processed.

Due diligence and better documentation

There are two main lessons here, one for publishers and one for re-users of open data.

Publishers should make a reasonable effort to think beyond their own use of their datasets and inform re-users about weaknesses in data quality, the provenance of the data, third-party dependencies and other issues – to the extent that they are aware of them.

In my experience open data publishers are generally more forthcoming than commercial data providers about the methods used to produce their datasets. They are less likely to be concerned about confidentiality or that they will lose the sale.

In a recent post Harvey Lewis of Deloitte highlighted Tim Berners-Lee’s five-star scale and the Open Data Institute’s certification scheme as two mechanisms publishers can use to reassure re-users that their open data meets common standards.

But relying on information from the data publisher is not sufficient by itself. Open data licences do not provide re-users with any recourse if that information is inadequate or in error. Re-users also have an obligation to exercise due diligence by making their own investigations, asking the right questions about the sources of data, and where necessary carrying out their own tests to make sure the quality of the data is suitable for their purposes.

Thanks to John Murray, Peter Wells and GeoLytix for Twitter discussions that informed some parts of this post.

Image credit: A photo of a street in Loughborough by Duncharris (GFDL / CC BY-SA 3.0)