Written by 510 Data Responsibility Expert Kamal Ahmed.
The 510 team works with data from many different sources. We often face challenges of completeness, validity and reliability. Therefore we took a deep dive into best practices for data verification. This blog briefly describes why data verification is important and how we address it.
One of the weaknesses of “disaster data” is the lack of standardised methodologies and definitions. One example is the category of people “affected” by a disaster. Much of the data is retrieved from a variety of public sources: aid agencies, newspapers, insurance reports, etc. Even if the organisation compiling the data uses specific definitions and a standardised methodology, the contributing suppliers of information may not.
Fortunately, due to increased pressure for accountability from various sources, many donors, development agencies and humanitarian relief organisations have started to prioritise data collection and its methodologies. But this has not yet resulted in a recognized, accepted and effective international system for disaster-data gathering, verification and storage. In this blog we describe our current process for data verification and data storage.
Data verification process
Data verification is the activity of checking and confirming information. We apply data verification prior to data analysis, data visualisation, model training, model building and model validation. We use the verification process proposed in the Verification Handbook (2014).
The verification process focuses on the following four aspects:
- Provenance: is the data authentic?
- Source: who published the data, and where is it stored?
- Date: when was the data published?
- Location: where was the data uploaded from, for example when digital cameras or smartphones are used?
While addressing these items we triangulate and challenge the datasets and data collection methods, for example by cross-checking similar or related datasets provided by other organisations shortly after a disaster took place, and then again several weeks later. We reach out to the content publisher or researcher for further details if a dataset reports values without mentioning how they were obtained or what definition was used for a particular parameter, if we need unprocessed (raw) data rather than aggregated data, or if we simply cannot access the data for security reasons. This ensures that we understand the dataset, its limitations, any ambiguity and its relevance before we use it in any of the subsequent steps.
We also realize that in some cases thorough data verification is not possible, not practical, or not required. Census data may be out of date or unavailable, so we may decide to use historic census data instead. Satellite imagery of weather conditions, geographical data and hydrological data are examples of specialized datasets for which we rely on the data gathering methodologies used by specialized agencies in the areas of engineering, space and meteorology.
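The cross-check between two sources can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not 510's actual tooling: the function name `flag_discrepancies`, the area codes, the figures and the 20% tolerance are all invented for the example.

```python
def flag_discrepancies(source_a, source_b, tolerance=0.2):
    """Compare two {area_code: reported_value} mappings.

    Returns a dict of area codes whose values differ by more than
    `tolerance` (relative to the larger value), plus area codes that
    appear in only one source. Names and threshold are hypothetical.
    """
    flagged = {}
    for code in sorted(set(source_a) | set(source_b)):
        a = source_a.get(code)
        b = source_b.get(code)
        if a is None or b is None:
            flagged[code] = "missing in one source"
            continue
        largest = max(abs(a), abs(b))
        if largest == 0:
            continue  # both sources report zero, nothing to compare
        if abs(a - b) / largest > tolerance:
            flagged[code] = f"values differ: {a} vs {b}"
    return flagged


# Illustrative figures only: people "affected" per area, from two sources.
agency_report = {"AREA01": 12000, "AREA02": 5400, "AREA03": 800}
news_report = {"AREA01": 11500, "AREA02": 9000}

for code, reason in flag_discrepancies(agency_report, news_report).items():
    print(code, "->", reason)
```

Flagged areas are candidates for a follow-up with the publisher rather than proof of an error, since the sources may simply use different definitions.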
An example: collecting data from satellite imagery
In the Missing Maps Project we work with thousands of volunteers to digitize satellite imagery. To improve the quality and completeness of this effort we agree upon a methodology with all parties involved. The methodology includes tools and training or instructions for data contributors. It involves facilitation by people who know how to properly classify data. A two-step process of data collection by one person and validation by another person helps us to improve the quality and completeness of the data. We know how accurate the data is (or isn’t) and we therefore know what kind of analysis we can do on that data, and how reliable the outcome of that analysis is.
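As a rough sketch of how a two-step process can be summarised, the Python below counts only tasks that were reviewed by someone other than the original contributor. The record layout (`mapper`, `validator`, `accepted`) and the function name are hypothetical and do not reflect the schema of any Missing Maps tool.

```python
def summarise_validation(tasks):
    """Summarise a two-step collect-then-validate review.

    Each task is a dict with hypothetical keys: 'mapper', 'validator'
    (None if not yet reviewed) and 'accepted' (True/False, only
    meaningful once the task has been reviewed).
    """
    # Only count reviews done by a different person than the mapper.
    reviewed = [
        t for t in tasks
        if t["validator"] is not None and t["validator"] != t["mapper"]
    ]
    accepted = sum(1 for t in reviewed if t["accepted"])
    return {
        "total": len(tasks),
        "independently_reviewed": len(reviewed),
        "accepted": accepted,
        "acceptance_rate": accepted / len(reviewed) if reviewed else None,
    }


# Illustrative records only.
tasks = [
    {"mapper": "ann", "validator": "ben", "accepted": True},
    {"mapper": "ann", "validator": "ann", "accepted": True},   # self-review, not counted
    {"mapper": "cem", "validator": None, "accepted": False},   # not yet reviewed
    {"mapper": "dee", "validator": "ben", "accepted": False},
]

print(summarise_validation(tasks))
```

The acceptance rate among independently reviewed tasks gives one simple indication of how far the data can be trusted for later analysis.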
In July 2014, after extensive research, OCHA launched the Humanitarian Data Exchange (HDX). This is a new data sharing platform that encompasses the best standards in data collection, offering access to useful and accurate data. We publish data on HDX in two ways:
- Private datasets: datasets that are input to our models, datasets that are still under review, and datasets that we do not have permission to publish or for which the licensing is unclear.
- Public datasets: end results of our work, or datasets for which we have permission to publish on HDX by the original data provider.
We always provide the metadata to allow others to verify how the data was generated and to judge if and how they want to use the data. An example dataset can be seen here. A properly formatted dataset includes:
- A clean dataset that contains only relevant and new data and uses unique identifiers (such as Pcodes) wherever possible, so that it can be integrated with other datasets
- A description of the input data that was used, including links to these sources
- Attribution to the contributors of the dataset
- The location (country) the dataset is about and the time the data was collected or generated
- The license under which the data can be used
- An explanation of caveats in the data, data collection or methodology
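A checklist like this lends itself to an automated completeness check before publishing. The Python sketch below is illustrative only: the field names are invented for this example and are not HDX's metadata schema or API.

```python
# Hypothetical field names mirroring the checklist above (not an HDX schema).
REQUIRED_FIELDS = (
    "input_sources",  # description of the input data, with links
    "contributors",   # attribution
    "location",       # country or area the dataset is about
    "time_period",    # when the data was collected or generated
    "license",        # terms under which the data can be used
    "caveats",        # known limitations in the data or methodology
)


def missing_metadata(metadata):
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


# Illustrative record with placeholder values only.
example = {
    "input_sources": ["<link to source dataset>"],
    "contributors": ["<organisation name>"],
    "location": "<country>",
    "time_period": "",
    "license": "<license name>",
}

print(missing_metadata(example))  # ['time_period', 'caveats']
```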
We encourage users dealing with humanitarian data to use the HDX platform for sharing with the wider community.
We would value a qualitative and quantitative review of the HDX platform, to better understand the limitations, incentives and opportunities for humanitarian organisations to share data there in a standardised way, and how the data is being used by both information managers and decision makers.
In an upcoming update to this blog we will explain how we:
- Verify some of the data sources for which the methodology is unknown.
- Find outliers in datasets and check them with other sources.