510 TEXT MINING EXPLORATION FOR IMPACT DATA

At 510, we use impact-based forecasting (IbF) to predict the impact of impending disasters on vulnerable people living in disaster-prone areas. To build a model that can make this prediction, we also need detailed information on past disasters: when did they take place? Where exactly did they strike? And what was their impact? This kind of data is often scarce, incomplete or even non-existent in databases. A group of dedicated volunteers developed the ‘510 news scraper’ to bridge this data gap.

WHAT IT IS

The news scraper is a tool that can ‘scrape’ online newspapers and retrieve relevant impact data we can use to improve humanitarian aid. The system is based on keywords provided by the user, such as ‘floods’ or ‘Uganda.’

WHERE WE WORK ON IT

While the news scraper was originally developed for English sources and hazards in Uganda (since May 2018), we recently extended its range to French sources as well, so it could retrieve data on disasters in Mali. Another aim of the recent improvements was to increase the ease of use and flexibility of the system.

HOW IT WORKS

The news scraper is a 3-step program, written in Python by our volunteers:

  1. It scrapes articles for relevant data.
  2. It determines the relevance of the articles.
  3. It outputs an impact database in a .csv file.

1. SCRAPE ARTICLES

The software retrieves online newspapers and selects those articles that have the provided keyword in the title. It also saves the article and the publication date.
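
The title filter described above can be sketched roughly as follows, assuming the articles have already been downloaded. The `Article` class, its field names and `filter_by_title` are hypothetical names chosen for illustration, and the sample articles are invented; this is not the scraper's actual code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    # Hypothetical container for one scraped article.
    title: str
    text: str
    published: date


def filter_by_title(articles, keywords):
    """Keep articles whose title contains at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [a for a in articles if any(k in a.title.lower() for k in lowered)]


# Invented sample articles, for illustration only.
articles = [
    Article("Floods hit western Uganda", "Sample text.", date(2018, 5, 10)),
    Article("Parliament debates new budget", "Sample text.", date(2018, 5, 11)),
]
selected = filter_by_title(articles, ["floods"])
```

Matching on the title alone keeps the first pass cheap; the stricter relevance check happens in the next step.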

2. TAG RELEVANCE

In a previous version of the software, this step still had to be done manually in an interactive interface. Now, the user can specify keywords to automate (part of) the topicality evaluation. All relevant articles are automatically saved in a separate text file.
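
One simple way to automate such a check is to count how many user-provided keywords occur in the article text. The sketch below assumes that approach; the function name `is_relevant`, the `min_hits` threshold and the sample texts are hypothetical, and the real tool may apply different rules.

```python
def is_relevant(text, keywords, min_hits=1):
    """Return True if at least `min_hits` of the keywords occur in the text."""
    lowered = text.lower()
    hits = sum(1 for k in keywords if k.lower() in lowered)
    return hits >= min_hits


# Invented sample texts: both mention floods, but only the first reports impact.
texts = [
    "Heavy rain caused floods that destroyed 20 houses.",
    "The film 'Floods of Joy' opens in cinemas this week.",
]
keywords = ["destroyed", "houses", "killed", "displaced"]
relevant = [t for t in texts if is_relevant(t, keywords)]
```

Raising `min_hits` makes the filter stricter, at the cost of discarding short articles that mention only one impact term.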

3. GET IMPACT DATA

This step makes use of the Natural Language Processing (NLP) Python package spaCy. The package breaks the text of the article into sentences, splits each sentence into tokens (words) and determines what ‘role’ each token fulfills in the sentence, and finally creates a lexical tree: a tree model of the sentence structure that captures how the individual words relate to one another. For example, the words ‘victims’ and ‘3’ are connected to one another in the same tree branch, meaning we know that there were 3 victims. The news scraper applies the package in the following way:

- Define the article location: Which locations are mentioned in the sentence? Which one is the main location of the disaster?
- Find numbers in the article: Any numbers mentioned in the article can be useful.
- Assign impact and location: ‘Impact words’ such as ‘killed’, ‘affected’, or ‘victims’ are linked to the numbers using the lexical tree. Then, these impacts are linked to the article locations defined before.
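
To illustrate how a number can be linked to an impact word through the tree, here is a minimal sketch that uses a hand-written toy tree instead of real spaCy output. The sentence, the tree and the function `link_numbers_to_impacts` are all assumptions made for illustration; in the actual tool the tree would come from spaCy's parser.

```python
IMPACT_WORDS = {"killed", "affected", "victims"}

# Hand-written toy tree for "The floods left 3 victims in the town".
# Each entry is (word, index of its head word); the root points to itself.
TREE = [
    ("The", 1), ("floods", 2), ("left", 2), ("3", 4),
    ("victims", 2), ("in", 2), ("the", 7), ("town", 5),
]


def link_numbers_to_impacts(tree, impact_words):
    """For each number, climb towards the root until an impact word is met."""
    links = []
    for index, (word, _head) in enumerate(tree):
        if not word.isdigit():
            continue
        current = index
        while True:
            parent = tree[current][1]
            if parent == current:  # reached the root without finding a match
                break
            current = parent
            if tree[current][0].lower() in impact_words:
                links.append((int(word), tree[current][0]))
                break
    return links


links = link_numbers_to_impacts(TREE, IMPACT_WORDS)
```

Here ‘3’ hangs directly under ‘victims’, so the pair is found after a single step up the tree. Numbers with no impact word among their ancestors are simply skipped.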

The resulting output is then saved in a separate file, including the number of houses, people, and infrastructure affected. This results in a database with significantly more datapoints than the ‘traditional’ databases used in IbF so far.
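
Writing such records to a .csv file can be sketched with Python's standard `csv` module. The column names and the sample rows below are hypothetical; the real database may use a different layout.

```python
import csv
import io

# Hypothetical column layout; the real database may use other fields.
FIELDS = ["date", "location", "impact_type", "number"]

# Invented sample rows, for illustration only.
rows = [
    {"date": "2018-05-10", "location": "Example Town", "impact_type": "houses", "number": 20},
    {"date": "2018-05-10", "location": "Example Town", "impact_type": "people", "number": 3},
]


def write_impact_csv(rows, handle):
    """Write a header line followed by one line per impact record."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example free of side effects.
buffer = io.StringIO()
write_impact_csv(rows, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the buffer.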

DO WE NEED IT?

As mentioned before, the news scraper provides an alternative or complementary source of data when existing databases are insufficient or don’t exist. This can be helpful in locations where historical data is scarce, but where there is still sufficient time to trigger early warning early action. Damage, suffering and the cost of emergency aid are reduced when communities are capable of responding proactively to a disaster.

Yet the scraper also has some shortcomings. As it is still being developed and refined, the results can contain inaccuracies, for example due to the double-counting of events, a bias towards larger cities and regions, and the limitations of the French spaCy model. It is also very difficult to validate the model, because there are few alternative sources to compare it against. In the future, we at 510 plan to address these issues and improve the news scraper even more!

By Bonnie van Vuure, Lone Mokkenstorm, Monica Turner and Wessel de Jong
