Signposting open data: discovery guides and other bright ideas

Post: 2 June 2016

For re-users of open data, discovery of existing datasets and resources is a priority need, second only to the core problem of unlocking public data in the first place. In this post I would like to suggest some strategies that open data publishers can consider to improve signposting and encourage re-use of their datasets.

Data discovery in the broad sense is a well-understood concept in the research and commercial sectors, and publishers have a range of options available: data catalogues, hubs and repositories, standards of curation, documentation and metadata, and so on.

However, UK open data presents additional challenges for potential re-users. Simply identifying and locating datasets of value is a significant barrier, particularly for the general public and for re-users who do not identify as members of the open data community.

[Image: a Dorset fingerpost (credit at the end of this post)]

In many ways this is an awkward and inhospitable moment for discovery of UK open data. The previous Government’s ambition to collate data inventories from across the public sector and produce a coherent “national information infrastructure” has been quietly abandoned. Data.gov.uk remains under-resourced and the quality of its data catalogue is highly variable. Public sector information in general has been culled and atomised to fit the transactional priorities of GDS’s digital transformation of government.

At the same time public authorities are pushing out an increasing number of datasets, driven by unexpired policy objectives, DCLG’s transparency code, the availability of Data.gov.uk’s “harvesting” technology and, in some cases, even a genuine commitment to openness and transparency.

Realistically, it’s too early to judge the full economic and social implications of UK open data, given that so many core public data assets remain closed. However, with Cabinet Office paying less attention to open data policy, some of the early hype-driven momentum has begun to fade. Some public authorities face internal pressure to justify their open data programmes by highlighting examples of re-use and demonstrable benefits.

In that context it is vital that we recognise the difference between lack of demand for open data, and lack of awareness and recognition of its potential. This is why it’s important to make sure significant open data resources are properly signposted.

Although I’m keen to encourage better signposting and discovery of open data across government, I am particularly interested in how Environment Agency approaches these problems. EA is midway through its journey from commercial data provider to open data publisher, and has also been central to the year-long #OpenDefra project to release 8,000 open datasets across the Defra Network.

EA currently has quite a complex ecology of platforms for publishing data, as set out in this recent post by Miles Gabriel. A high proportion of EA datasets are inter-related and of a technical nature, which means that to maximise re-use EA may need to signpost research and sources of underpinning knowledge as well as the data itself.

If EA can develop a successful strategy for promoting the availability of its open data this could serve as a template for other parts of government. Key to this effort is a recognition that we cannot usually expect public authorities to invest heavily in support for re-use of open data. We need to build consensus on the division of responsibilities between the data publisher, open data user groups, sector-specific interests, and data intermediaries.

Discovery guides

I recently wrote an “Open Data Discovery Guide” on the theme of UK Flood Risk and Flood Management, as a talking point for the Environment Agency Data Advisory Group. (It’s a PDF but try to contain your revulsion for a moment and hear me out.)

The discovery guide loosely follows the format of a series of short briefings published by the House of Commons Library, e.g.: Source of statistics: School level data. I like this format because it’s portable and can be distributed and stored locally. This is an approach to information exchange that works well within the culture of large organisations and supply chains, perhaps because the document metaphor creates an illusion of substance that does not translate well into web-native formats.

I think the trick to writing a discovery guide (or a “source guide” or “resource guide” or handbook or whatever) is to make it long enough to cover the key datasets but short enough to be digestible. The discovery guide only needs to engage interest and suggest potential for re-use; it need not be a substitute for documentation or metadata.

A discovery guide could be based around a subject area or theme, as in this example, or alternatively it could be based on the needs of a target audience or a problem to be solved. For example, a guide to flood data for local planning authorities would cover different material from a guide to flood data for community action groups.

Although in my example most of the datasets are published by Environment Agency, a discovery guide will generally need to cover data from a range of different publishers if it is to serve the interests of potential re-users. However, data publishers can support the development and accuracy of discovery guides by making sure their metadata and dataset documentation are readily available.
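
One practical way to make that metadata work harder is via an open API. Data.gov.uk is built on CKAN, so a guide author could in principle pull dataset titles and descriptions straight from its standard search endpoint rather than copying them by hand. The Python sketch below illustrates the idea; the query term, row limit and choice of fields are my own assumptions rather than anything EA or data.gov.uk prescribes.

```python
# Sketch: pull candidate datasets for a discovery guide from data.gov.uk's
# CKAN search API. The query term, row limit and fields shown here are
# illustrative assumptions, not a documented workflow.
import json
import urllib.parse
import urllib.request

CKAN_SEARCH = "https://data.gov.uk/api/3/action/package_search"

def find_datasets(query, rows=10):
    """Return a list of CKAN package records matching the query."""
    url = "%s?q=%s&rows=%d" % (CKAN_SEARCH, urllib.parse.quote(query), rows)
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    # CKAN wraps results as {"success": true, "result": {"results": [...]}}
    return payload["result"]["results"]

if __name__ == "__main__":
    for dataset in find_datasets("flood risk"):
        title = dataset.get("title", "(untitled)")
        notes = (dataset.get("notes") or "").splitlines()
        summary = notes[0] if notes else ""
        print("- %s: %s" % (title, summary))
```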

Of course if a discovery guide is too narrowly focused, for example on a particular business need or idea, then we get into the traditional territory of a data researcher or consultant and this may reasonably be left as a task for the intermediary market.

Similar ideas in other formats

The Open Data Challenge Series, a partnership between NESTA and the Open Data Institute, produced an excellent series of open data guides based on their challenge themes. These are quite broad but provide a useful template for guides that could be produced either as preparation for, or to consolidate the lessons from, a data dive, hackathon/userthon or similar event.

Natural England publishes a spreadsheet of its datasets, with information on themes, licensing and file formats. This could work well for signposting purposes, particularly if it were expanded with a bit more description of the individual datasets.
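
As a rough illustration (not a description of Natural England’s actual spreadsheet), a register like this could be exported as CSV and turned into a theme-grouped listing for a discovery guide with a few lines of script. The file name and column headings below are hypothetical.

```python
# Sketch: turn a publisher's dataset register (exported as CSV) into a
# theme-grouped listing for a discovery guide. The file name and column
# headings ("Title", "Theme", "Licence", "Format") are hypothetical.
import csv
from collections import defaultdict

def group_by_theme(register_path):
    themes = defaultdict(list)
    with open(register_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            themes[row["Theme"]].append(row)
    return themes

if __name__ == "__main__":
    for theme, datasets in sorted(group_by_theme("dataset-register.csv").items()):
        print("\n%s" % theme)
        for d in datasets:
            print("- %s (%s, %s)" % (d["Title"], d["Format"], d["Licence"]))
```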

Tutorials and instructables

One of the interesting developments that followed from EA’s open release of LiDAR data last year has been the spate of tutorials and instructables produced by users as they figure out how to work with the data. For example Brendan Stone’s post on extracting building heights and Kit Wallace’s post on 3D printing.

This is an approach that data publishers could try themselves, as they will already have in-house experience working with the data. Posts that help new users get to grips with the data will make it easier for them to see the potential for re-use.
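
As a rough sketch of what such a “getting started” post might contain, the snippet below shows the basic calculation behind the building-heights example: subtracting a terrain model from the corresponding surface model. It assumes a matching pair of EA LiDAR tiles in ESRI ASCII grid format; the file names are placeholders and this is illustrative rather than a recommended workflow.

```python
# Sketch: estimate feature heights from a matching pair of EA LiDAR tiles by
# subtracting the Digital Terrain Model (bare earth) from the Digital Surface
# Model (tops of buildings, trees, etc.). Assumes ESRI ASCII grid tiles with
# the usual 6-line header; file names and the NODATA value are placeholders.
import numpy as np

NODATA = -9999  # value commonly used in the tile header to mark missing cells

def read_ascii_grid(path):
    # Skip the header lines (ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value)
    grid = np.loadtxt(path, skiprows=6)
    return np.where(grid == NODATA, np.nan, grid)

dsm = read_ascii_grid("tile_DSM.asc")  # surface heights
dtm = read_ascii_grid("tile_DTM.asc")  # ground heights

heights = dsm - dtm  # height above ground, in metres
print("Tallest feature in this tile: %.1f m" % np.nanmax(heights))
```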

While few data publishers will have the resources to invest in their open data programmes to the same extent as Ordnance Survey, some ideas, such as the Getting Started with OS Open Data video series, may be transferable.

Visualisations and media coverage

In February an ODI-sponsored data visualisation garnered a substantial amount of media coverage for Defra’s National Food Survey statistics. The visualisation was timed to coincide with a new Defra release of related open data, but the statistics themselves had been available (and largely ignored by journalists) for some time. This illustrates the potential of data visualisations to engage not just the media but also the wider public.

This type of coverage may raise the profile of an open data programme, but it is debatable whether it can be an effective approach to signposting datasets for further re-use. Mainstream journalists tend to look for a newsworthy angle or hook that may not adequately convey the potential of the dataset, and are also generally reluctant to highlight primary sources.

One approach some public authorities (though perhaps not Environment Agency) could consider is actually releasing data via a newspaper website, either with the cooperation of a data journalist (as DCLG did in 2013 with its green belt dataset) or in the form of an advertising supplement.

Websites and data portals

Environment Agency has been noticeably reluctant to integrate its open data programme into the public-facing areas of its website, even when those areas make direct use of the same data. There is no simple route by which users can navigate from EA’s interactive maps, or water situation reports, or flood warnings, or the flood information service, to the underlying open datasets.

This is such an obvious wasted opportunity that I can only assume some parts of EA remain ambivalent about the benefits of committing to open data.

At the moment signposting of EA’s open data is undermined by the limitations of the Data.gov.uk platform. EA’s metadata and documentation are of a generally high standard, but new users are likely to be bewildered by the dataset names and the lack of thematic context.

Community wikis

I’ve had several conversations recently with others in the open data community that point toward the need for some kind of collaborative approach to documentation of useful datasets and resources (including but not limited to environmental data).

Data.gov.uk is no longer adequate by itself but remains useful, selectively, as a source of data – as do the numerous more specialised open data portals across government. But we need a means by which we can curate material provided by data publishers and link it to underpinning knowledge, case studies, examples, tutorials, and judgements as to the value and re-use potential of specific datasets.

A wiki might be sufficient. My early thoughts are that a project of this nature would need:

Image credit: Between Marnhull and Fifehead Magdalen: DCC fingerpost (Dorset) by Michael Day (CC BY-NC 2.0)