Hiding from Google: open data, data discovery and robots.txt

Post: 14 December 2016

Discovery is one of the key barriers to re-use of UK open data. I’ve written about this before.

Experienced users build familiarity with sources of data: who publishes what, where to find it, who to ask if they can’t find it.

But there are a lot of datasets and a lot of publishers. For the general public, discovering data is often hard. And the corollary problem, establishing definitively that data on a particular subject doesn’t exist or isn’t available (and why not), is even harder.

According to recent research by the Government Digital Service (GDS), users usually begin their search for government data on Google.

Search engine optimisation (SEO) is important if publishers want to encourage re-use of public data. Usually this doesn’t take much: government sites rarely have to compete for search rank.

It’s a surprise then to find that some UK government organisations are actually hiding their open data downloads from search engines.

[Image: robots.txt felt robot]

Many government sites use the disallow directive in their robots.txt files to prevent Googlebot and other search bots from indexing certain folders and pages. There are often legitimate reasons for this. Search engines like accessible content but there’s usually no need for them to index cached search results, pages with admin logins, etc.
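For illustration, a robots.txt along these lines (the paths are made up, not taken from any site mentioned here) keeps crawlers out of cached search results and admin pages while leaving everything else indexable:

```
User-agent: *
Disallow: /search/
Disallow: /admin/
Disallow: /cgi-bin/
```

Anything not matched by a Disallow line remains crawlable by default.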

Organisations also often disallow access to folders that were indexed previously but no longer exist. The Met Office’s robots.txt file is a good example:

http://www.metoffice.gov.uk/robots.txt

But some organisations are also using robots.txt to make public data effectively harder to find. Google indexes the page URLs but not the content, so if the pages appear in search results, all users see is a message:

A description for this result is not available because of this site’s robots.txt
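The effect of a blanket disallow can be sketched with Python’s standard-library robots.txt parser. The rules and paths below are illustrative, not copied from any of the sites discussed here:

```python
# Sketch: how a wildcard disallow hides data pages from crawlers.
# The rules and paths are hypothetical, for illustration only.
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that blocks everything except the home page.
rules = """\
User-agent: *
Disallow: /data-and-resources/
Disallow: /about/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot falls under the wildcard group, so the data pages are blocked...
print(parser.can_fetch("Googlebot", "/data-and-resources/electorate-data.csv"))  # False
# ...while the home page itself remains crawlable.
print(parser.can_fetch("Googlebot", "/"))  # True
```

Disallow matching is prefix-based, so a single rule is enough to hide a whole directory of downloads.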

Below are examples.

The Boundary Commission for England

The Boundary Commission’s Data & Resources page provides access to base electoral datasets used in the ongoing 2018 review of constituency boundaries, as well as historical electorate data, spatial boundary data, and information about the electoral register.

http://boundarycommissionforengland.independent.gov.uk/data-and-resources/

All of this is hidden from search engines by the Commission’s robots.txt file, which in fact disallows indexing of everything other than the Boundary Commission’s home page.

http://boundarycommissionforengland.independent.gov.uk/robots.txt

Update, 19 December 2016: the Boundary Commission’s robots.txt file has now been updated to allow indexing of the site. I understand indexing was previously disallowed in error.

Natural England

Natural England has for many years maintained a highly useful download site for its GIS datasets:

http://www.gis.naturalengland.org.uk/

The site is regularly updated but obfuscated by its robots.txt file. Google knows Natural England has GIS data available but does not, for example, index the names of the datasets listed on the landing page.

http://www.gis.naturalengland.org.uk/robots.txt

UK-AIR (Defra)

Defra’s air pollution unit maintains an extensive public data archive, including both observational data from 1,500 sites around the UK and modelled air quality datasets.

https://uk-air.defra.gov.uk/data/

The landing page for the data archive is indexable by search engines, but some other data pages (including downloads of background mapping for local authorities and the UK Ambient Air Quality Interactive Map) have been disallowed (“to stop bots hammering the databases”).

https://uk-air.defra.gov.uk/robots.txt
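If load on back-end databases is the real concern, a more targeted file can block only the dynamic endpoints while explicitly leaving the bulk downloads crawlable. A sketch, with made-up paths:

```
User-agent: *
# Keep crawlers off the dynamic map and query endpoints...
Disallow: /interactive-map/
Disallow: /data-query/
# ...but let them index the static bulk downloads.
Allow: /data/
```

Allow is respected by the major search engines, and some crawlers (though not Googlebot) also honour a Crawl-delay directive for rate-limiting.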

VCA’s Car and Van Fuel Data sites

The Vehicle Certification Agency (VCA) maintains two websites that provide data on fuel consumption and emissions for cars and vans, by model:

http://carfueldata.dft.gov.uk/

http://vanfueldata.dft.gov.uk/

Both sites provide bulk open data downloads. But Google doesn’t know that because the robots.txt files for both sites disallow indexing:

http://carfueldata.dft.gov.uk/robots.txt

http://vanfueldata.dft.gov.uk/robots.txt

The Charity Commission

The Charity Commission maintains a search facility for registered charities in England and Wales:

http://apps.charitycommission.gov.uk/Showcharity/RegisterOfCharities/registerhomepage.aspx

Indexing by search bots is disallowed:

http://apps.charitycommission.gov.uk/robots.txt

(The bulk data download page is on a different subdomain that has no robots.txt file, so is indexed by Google.)

(Where is GDS in all this?)

GDS provides some guidance on the use of robots.txt on GOV.UK, advising organisations to “ask search engines not to index pages on your domain”. Ostensibly this guidance applies only to service.gov.uk domains; there is none for other domains.

Even for services alone, this advice strikes me as thin. GDS’s ideal of service design tends towards minimalism, but even so it’s not reasonable to expect services to put all their explanatory information on the start page. And the GDS design approach sometimes involves turning what should be straightforward informational content into a “service”. For example:

DfE’s Compare school and college performance service

The Department for Education (DfE) now presents its detailed statistics on the performance of schools and colleges in England on a new service.gov.uk site. The download-data directory:

https://www.compare-school-performance.service.gov.uk/download-data

is disallowed by the robots.txt file:

https://www.compare-school-performance.service.gov.uk/robots.txt

Users must navigate a kind of choose-your-own-adventure game to construct a URL that provides access to bulk downloads.

GDS registers

This final one surprised me a bit more than the others. GDS’s registers programme oversees the design of selected “canonical” open data infrastructure within the public sector. So far the programme has produced three beta registers. Here’s the download page for one of them:

https://country.register.gov.uk/download

And here’s the robots.txt file:

https://country.register.gov.uk/robots.txt

The robots.txt files for the other two beta registers are the same.

In conclusion

Stop using robots.txt to hide useful public data.

Image credit: robots.txt felt robot by Anne Helmond (CC BY-NC-ND 2.0)