WHO WAS INVOLVED
This is a guest post by Yemi Ogoundele, Florian Perdreau, Francisco Simoes and Kotryna Valečkaitė. They recently participated in a two-day Healthcare Hackathon organized by CarePay, The PharmAccess Foundation, 510 Data & Digital team Netherlands Red Cross and CorrelAid Nederland. For their project, they took part in the Mosquitoes challenge created by The Netherlands Red Cross, with the task of predicting the prevalence of mosquitoes in the Philippines in order to anticipate Dengue outbreaks.
WHAT IS DENGUE
Dengue is an infection spread via mosquitoes which causes flu-like symptoms that can have lethal complications when severe. It is found in about 100 countries, mainly in tropical and sub-tropical climates. The World Health Organization estimates that about 390 million people are infected every year and that 50% of the world’s population is potentially at risk given an increasing number of cases over the last 5 years (see here). Although there is no specific treatment for Dengue fever, fatalities can be dramatically reduced with prevention and control measures such as monitoring of vectors.
WHY IS THIS WORK NEEDED
Being able to predict Dengue outbreaks in specific regions (for instance, by detecting spikes in the mosquito population) could enable NGOs and governments to apply these measures more efficiently and to provide access to medical care at earlier stages of an outbreak. One way to monitor the mosquito population in a region is to install so-called ovitraps. These are containers filled with water and covered with a mesh, in which mosquitoes can lay eggs. Once the eggs develop into mosquitoes, the mesh keeps them trapped inside the ovitrap.
WHAT DID THE TEAM DO
We were given the number of mosquitoes captured in these ovitraps at many different schools in the Philippines, with measurements made once a month. We were also provided with the climatic conditions on those days.
Figure 1: Data from ovitraps like these were used to measure mosquito distribution
WHICH DATA WAS USED
In addition to the ovitrap measures, we were given two other time-series datasets. In total we received the following datasets:
- Raw ovitrap measurements. This dataset includes:
  - date of the measurement
  - ovitrap ID
  - latitude and longitude of the ovitrap
  - name of the school where the ovitrap is located
  - value measured by the ovitrap
- Ovitrap measurements aggregated by province and by month. This dataset includes:
  - count of individual ovitrap measurements
  - average ovitrap index and its associated standard deviation
  - relative ovitrap error (standard deviation of the ovitrap measurements taken over a specific time window, divided by the average ovitrap value for the same time window)
- Weather conditions aggregated by province and by month. This dataset includes:
  - amount of rainfall
  - land surface temperature at night and during the day
  - soil moisture
  - soil temperature
  - air temperature
  - wind speed
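The relative ovitrap error in the list above is a ratio of two summary statistics, which can be sketched as follows. This is a minimal illustration only: the function name, the sample values and the choice of the population standard deviation are assumptions, since the dataset description does not say which variant was used.

```python
from statistics import mean, pstdev


def relative_ovitrap_error(values):
    """Standard deviation of a time window divided by its average.

    Assumption: population standard deviation (pstdev). The source does not
    say whether the population or the sample formula was used.
    """
    avg = mean(values)
    if avg == 0:
        return None  # the ratio is undefined when the average index is zero
    return pstdev(values) / avg


# Made-up ovitrap values for one province and one month.
window = [10.0, 20.0, 30.0]
rel_err = relative_ovitrap_error(window)
```

A value close to zero means the traps in that window agree with each other. Larger values indicate that the monthly average hides a lot of spread.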
WHAT WERE THE DATA GAPS
Figure 2: Sample ovitrap time series
After a quick exploratory data analysis, we realized that we were dealing with a lot of missing values: most provinces’ time series had about 60-70% of their ovitrap measurements missing (no measurement made on specific dates or over a certain time period). Looking at the provinces’ time series (see Figure 2 for some examples), we could see that for most provinces the ovitrap measurements covered only a short period compared to the total duration of the experiment.
Additionally, the disaggregated ovitrap data showed spatial inconsistencies: some of the measurement locations lay outside the research scope (see Figure 3).
Figure 3: Measurement locations
In order to come up with actionable insights by the end of the hackathon and to showcase the predictive power of our data, we reduced the scope of the project and focused only on Pangasinan, the province that had the most complete time series.
WHAT WAS THE MODELLING APPROACH
Figure 4: Modelling approach
As a first step, we decided to ignore exogenous predictors (see Figure 4), such as weather or topographical data, and to focus instead on forecasting the ovitrap measurements based solely on previously observed measurements.
Creating a baseline model
We tried a few simple methods in order to select a benchmark against which to compare more complex models. The task of these models was to accurately predict future ovitrap measurements given past measurements. Our models were trained on a portion of the ovitrap measurement time series and were then evaluated on their ability to predict future ovitrap measurements that were unseen during training. We assessed each model’s accuracy by comparing its predicted ovitrap measurements to the observed measurements and computing the root mean square error (RMSE) as the metric. This metric expresses the model error in the same unit as the dependent variable (ovitrap measurement). Thus, the lower this metric, the better.
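The evaluation procedure described above can be sketched in a few lines of plain Python. The function names and the example numbers are illustrative assumptions, not the code or data used during the hackathon.

```python
import math


def train_test_split_chronological(series, n_test):
    """Keep the last n_test observations as the unseen test set."""
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    return series[:-n_test], series[-n_test:]


def rmse(observed, predicted):
    """Root mean square error, in the same unit as the observations."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up monthly ovitrap values, oldest first.
monthly_index = [12.0, 15.0, 30.0, 42.0, 25.0, 14.0, 11.0, 16.0]
train, test = train_test_split_chronological(monthly_index, n_test=2)

# Errors of 2 and 0 on two points give sqrt((4 + 0) / 2) = sqrt(2).
example_error = rmse([3.0, 5.0], [1.0, 5.0])
```

The split is chronological rather than random, so that the model is never evaluated on months that precede its training data.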
METHOD: Naive average
Figure 5: Naive average
In this simple model, the forecasts of all future values are equal to the average of the training data.
This model proved to be surprisingly effective when applied to our case, with an RMSE of 7.7 on the test set (see Figure 5).
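A minimal sketch of the naive average method, with a made-up training series (the function name is hypothetical):

```python
from statistics import mean


def naive_average_forecast(train, horizon):
    """Forecast every future step with the average of the training data."""
    level = mean(train)
    return [level] * horizon


# Made-up training values; the forecast is a flat line at their average.
train = [10.0, 20.0, 30.0, 40.0]
forecast = naive_average_forecast(train, horizon=3)
```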
METHOD: SES Simple Exponential Smoothing
Roughly speaking, an exponential smoothing method forecasts using a weighted average whose weights decrease exponentially as the corresponding observation gets older (see reference 3).
There are different exponential smoothing methods, the simplest being SES (Simple Exponential Smoothing).
SES fits the training data using a weighted average, forecasts a single point after that, and uses that same value to forecast all the remaining future times, resulting in a flat-line forecast. It performs slightly better than the naive average method in our case, with an RMSE of 7.3.
However, the purpose of this model is to detect spikes in the mosquito population in order to implement prevention and control measures. As a consequence, a flat line, although somewhat accurate, is not useful in our case. We therefore used it as a baseline model to be compared with more complex models that might capture variations better.
We also tried other exponential smoothing methods (most notably Holt-Winters’ seasonal methods), which can capture seasonal effects, but all performed worse than the simpler SES.
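The flat-line behaviour of SES can be seen in a small hand-written sketch. It assumes a fixed smoothing parameter `alpha` and initialises the level with the first observation. In practice both are usually estimated from the data by a library, so this is an illustration rather than the implementation used for the results above.

```python
def ses_forecast(train, alpha, horizon):
    """Simple exponential smoothing with a fixed smoothing parameter.

    Each update gives weight alpha to the newest observation, so the weight
    of older observations decays exponentially.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = train[0]  # assumption: start the level at the first observation
    for value in train[1:]:
        level = alpha * value + (1 - alpha) * level
    # Every future step receives the last smoothed level: a flat line.
    return [level] * horizon


train = [10.0, 20.0, 30.0]
# Level: 10 -> 0.5 * 20 + 0.5 * 10 = 15 -> 0.5 * 30 + 0.5 * 15 = 22.5
forecast = ses_forecast(train, alpha=0.5, horizon=3)
```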
METHOD: Seasonal Naive
Figure 6: Seasonal naive model
An obvious way to try to incorporate seasonality in a baseline model is by using the seasonal naive method.
In this method, we set each forecast to be equal to the last observed value from the same month of the year. Using this method, we reached an RMSE of 6.0 (Figure 6).
Since it performed better than both the naive average method and the exponential smoothing methods, we kept the seasonal naive method as our baseline model.
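A minimal sketch of the seasonal naive method for monthly data, assuming a season length of 12 and a gap-free training series (the function name and the values are made up):

```python
def seasonal_naive_forecast(train, horizon, season_length=12):
    """Repeat the last observed season for as long as the horizon requires."""
    if len(train) < season_length:
        raise ValueError("need at least one full season of training data")
    last_season = train[-season_length:]
    # Step h after the end of training reuses the value observed for the
    # same month in the most recent full year.
    return [last_season[h % season_length] for h in range(horizon)]


# Two made-up years of monthly values: 1.0, 2.0, ..., 24.0.
train = [float(v) for v in range(1, 25)]
forecast = seasonal_naive_forecast(train, horizon=14)
```

With a horizon longer than one year, the forecast simply cycles through the last observed year again.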
As a second step, we picked a model that captured the heavy seasonality of the data. The seasonal ARIMA model (SARIMAX in the statsmodels.tsa module in Python) matched those requirements.
MODEL: Seasonal ARIMA with no exogenous variables
AutoRegressive Integrated Moving Average (ARIMA) models are applied in cases where data shows evidence of non-stationarity and where an initial differencing step can be applied one or more times to eliminate the non-stationarity. A stationary time series is one whose properties do not depend on the time at which the series is observed. In our case, the time series is clearly non-stationary.
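The differencing step can be illustrated with a short sketch on a made-up series. The helper name is hypothetical, and the `lag` argument is included only to show that the same operation also covers seasonal differencing.

```python
def difference(series, lag=1):
    """Replace each value by its change relative to the value `lag` steps earlier."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# A made-up series with a linear trend is non-stationary: its mean changes
# over time. After one round of differencing the result is constant.
trend = [2.0, 4.0, 6.0, 8.0, 10.0]
first_diff = difference(trend)
```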
As per Wikipedia:
The AR part of ARIMA indicates that the evolving variable of interest is regressed on its own lagged (i.e., prior) values. The MA part indicates that the regression error is actually a linear combination of error terms whose values occurred contemporaneously and at various times in the past. The I (for “integrated”) indicates that the data values have been replaced with the difference between their values and the previous values (and this differencing process may have been performed more than once).
Using Box-Jenkins model identification, we selected the model orders and fit an ARIMA(11,1,0)(1,0,0)m=12 model, where (11,1,0) are the autoregressive, differencing and moving-average orders of the non-seasonal part and (1,0,0) are the corresponding orders of the seasonal part, with a seasonal period of m=12 months.
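For reference, an ARIMA(11,1,0)(1,0,0) model with m=12 can be written out with the backshift operator $B$ (defined by $B y_t = y_{t-1}$). The equation assumes the standard multiplicative seasonal ARIMA form without a constant term. The symbols $\phi_i$, $\Phi_1$ and $\varepsilon_t$ are notation introduced here for illustration, not fitted values from our model.

```latex
\left(1 - \sum_{i=1}^{11} \phi_i B^{i}\right)
\left(1 - \Phi_1 B^{12}\right)
\left(1 - B\right) y_t = \varepsilon_t
```

Here $y_t$ is the ovitrap measurement in month $t$, the $\phi_i$ are the 11 non-seasonal autoregressive coefficients, $\Phi_1$ is the seasonal autoregressive coefficient at a lag of 12 months, $(1 - B)$ is the single differencing step and $\varepsilon_t$ is white noise. There are no moving-average terms because both moving-average orders are zero.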
MODEL: Seasonal ARIMA with exogenous variables
Figure 7: Seasonal ARIMA with exogenous variables
Finally, we tried to capture the effects of exogenous variables on mosquito proliferation by adding them to the seasonal ARIMA model. This model builds on the seasonal ARIMA described above and additionally uses the exogenous variables (soil moisture, humidity, etc.) displayed in the causal diagram above as predictors for the ovitrap measurements.
We grid-searched through all possible subsets of exogenous variables to find the one that gave us the greatest accuracy. We ended up adding soil moisture, rainfall and land surface temperature as extra regressors. This enabled us to reach an RMSE of 5.08 (Figure 7).
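An exhaustive search over subsets of regressors can be sketched as follows. The variable names and the scoring function are placeholders: in the real workflow the score would be the test RMSE of a seasonal ARIMA model fitted with the given regressors, whereas here a made-up toy score stands in so that the example runs on its own.

```python
from itertools import combinations


def all_subsets(candidates):
    """Yield every subset of the candidates, from the empty set to the full set."""
    for size in range(len(candidates) + 1):
        yield from combinations(candidates, size)


def best_subset(candidates, score):
    """Return the subset with the lowest score (for example, the lowest RMSE)."""
    return min(all_subsets(candidates), key=score)


# Hypothetical candidate regressors and a made-up toy score. These numbers
# are NOT results from the hackathon.
CANDIDATES = ["rainfall", "soil_moisture", "wind_speed"]
TOY_GAIN = {"rainfall": 2, "soil_moisture": 3, "wind_speed": -1}


def toy_score(subset):
    # Stand-in for "fit the model with these regressors, return the test RMSE".
    return 10 - sum(TOY_GAIN[name] for name in subset)


n_subsets = sum(1 for _ in all_subsets(CANDIDATES))
best = best_subset(CANDIDATES, toy_score)
```

The number of subsets doubles with every additional candidate, so this brute-force approach is only practical for a handful of variables.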
WHAT IS STILL NEEDED
Our modeling effort focused on predicting the ovitrap measurements for a single hand-picked region, due to the time constraints of the two-day hackathon. In doing so, we were able to showcase the predictive power of the data we were provided with: it is possible to forecast the growth of the mosquito population in a specific region based on historical data with acceptable accuracy. However, to be useful in the real world, our model should be generalizable to other regions as well. This would require addressing the issue of missing data.
Moreover, our model only used data aggregated at the province and month level. Thus, we did not exploit the finer granularity of the dataset, which includes raw observations made at the school level on a daily basis. Including these data in our model could be greatly beneficial for capturing local characteristics that could be predictors of our dependent variable. For instance, previous research (e.g., Schmidt et al., 2011) suggests that several exogenous factors could contribute to the growth of the mosquito population, such as local topographical elevation (altitude), population density, urbanization, vegetation coverage, etc. But these factors cannot be assumed to be uniformly distributed across a specific region.
Data imputation and enrichment
We developed a feature engineering solution that could make our model generalizable to other regions as well and could solve both missing data imputation and data enrichment issues.
First, we had to map the administration names (used in the aggregated datasets) to actual spatial coordinates (latitude and longitude). To do so, we used an external public dataset providing this mapping (https://simplemaps.com/data/world-cities).
Then, we used a technique called spatial join, which merges different datasets based on their coordinates. This join can either be exact (exact match of coordinates) or fuzzy (finding pairs whose distance is shorter than some threshold, or that minimize a distance function). A fuzzy join is needed here because, for instance, schools’ coordinates will not exactly match those of the administrative level they belong to. Spatial join works best for purely spatial data, for instance for merging static information (population density, field elevation, etc.) that does not change over time or that can be assumed approximately constant over the considered time period. However, given that our datasets contain both spatial and temporal data (e.g., ovitrap measurements, weather conditions), we had to use a spatio-temporal join, where a pair of records from different datasets is matched by minimizing both their spatial and temporal distances under some constraints (e.g., spatial and temporal distances should be less than some thresholds).
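A fuzzy spatio-temporal join of this kind can be sketched in plain Python. Everything specific in this sketch is an assumption made for illustration: the record layout, the coordinates and dates, the thresholds, and in particular the way the two distances are combined (a sum of distances normalised by their thresholds). It is not the code written during the hackathon.

```python
import math
from datetime import date


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def spatio_temporal_join(left, right, max_km, max_days):
    """Pair each left record with its closest right record, or with None.

    Thresholds must be positive. Candidates outside either threshold are
    discarded; the rest are ranked by the sum of normalised distances.
    """
    pairs = []
    for rec in left:
        best, best_cost = None, None
        for cand in right:
            dist = haversine_km(rec["lat"], rec["lon"], cand["lat"], cand["lon"])
            gap = abs((rec["date"] - cand["date"]).days)
            if dist > max_km or gap > max_days:
                continue
            cost = dist / max_km + gap / max_days
            if best_cost is None or cost < best_cost:
                best, best_cost = cand, cost
        pairs.append((rec, best))
    return pairs


# Made-up ovitrap observations and weather records.
ovitraps = [
    {"lat": 16.00, "lon": 120.30, "date": date(2019, 6, 10), "value": 30.0},
    {"lat": 16.00, "lon": 120.30, "date": date(2019, 12, 1), "value": 12.0},
]
weather = [
    {"lat": 16.05, "lon": 120.35, "date": date(2019, 6, 12), "rainfall": 120.0},
    {"lat": 16.01, "lon": 120.30, "date": date(2019, 8, 1), "rainfall": 300.0},
    {"lat": 18.00, "lon": 121.00, "date": date(2019, 6, 10), "rainfall": 80.0},
]
matches = spatio_temporal_join(ovitraps, weather, max_km=25, max_days=15)
```

The nested loop compares every pair of records, which is fine for a sketch but would need a spatial index for large datasets.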
Once all of our datasets were merged at the observation level (ovitrap measurements), a similar technique could be used to complement missing information of raw observations with statistics (e.g. average) of other observations falling within the same spatio-temporal region (spatial and temporal distance smaller than some thresholds).
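The imputation idea can be sketched in the same spirit: a missing value is replaced by the average of the observed values that fall within a spatial and a temporal threshold of it. The record layout, thresholds and sample values below are assumptions for illustration. As noted in the next paragraph, no model was built on data imputed this way.

```python
import math
from datetime import date
from statistics import mean


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def impute_from_neighbours(target, observations, max_km, max_days):
    """Average the observed values close to the target in space and time.

    Returns None when no observation falls inside both thresholds.
    """
    values = []
    for obs in observations:
        if obs["value"] is None:
            continue  # skip other missing measurements
        near = haversine_km(target["lat"], target["lon"], obs["lat"], obs["lon"]) <= max_km
        recent = abs((target["date"] - obs["date"]).days) <= max_days
        if near and recent:
            values.append(obs["value"])
    return mean(values) if values else None


# Made-up records: two usable neighbours, one missing, one too far, one too old.
target = {"lat": 16.00, "lon": 120.30, "date": date(2019, 6, 15), "value": None}
observations = [
    {"lat": 16.02, "lon": 120.31, "date": date(2019, 6, 10), "value": 20.0},
    {"lat": 16.01, "lon": 120.28, "date": date(2019, 6, 20), "value": 40.0},
    {"lat": 16.00, "lon": 120.30, "date": date(2019, 6, 15), "value": None},
    {"lat": 18.00, "lon": 121.00, "date": date(2019, 6, 15), "value": 100.0},
    {"lat": 16.00, "lon": 120.30, "date": date(2019, 1, 1), "value": 90.0},
]
imputed = impute_from_neighbours(target, observations, max_km=10, max_days=30)
```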
Unfortunately, given the limited time available during this hackathon, we did not manage to develop new models based on this new dataset.
WHAT ARE THE CONCLUSIONS
- After comparing several statistical time-series models, we found that the Seasonal ARIMA model including exogenous variables, such as soil moisture, yielded the best accuracy (RMSE: 5.08).
- This demonstrates that the ovitrap measurements have some predictive power that could be used to forecast peaks in mosquito population growth, which could be a sign of a potential Dengue outbreak.
- This power can be further increased by accounting for additional contextual information (for example rainfall, soil moisture and land surface temperature). According to the research literature, it can be expected that adding more exogenous variables (population density, vegetation index, etc.) could yield further improvements.
- We proposed a feature engineering solution that can increase both the amount and the spatio-temporal resolution of the data used by our models. This would allow our model to capture local predictors of mosquito population growth and thus increase its prediction accuracy at a more local level. However, we were not able to evaluate this approach given the time limit of the hackathon.
Projects and Partners
Read more about the Hackathon here on CarePay’s Medium site.
CarePay is developing smart mobile health wallets to provide access to healthcare in Kenya and Nigeria. They provide data on medical claims. The aim of this project is to (1) detect patterns of fraud and use those to build predictive models, (2) assess how repeat visits can be used to determine the quality of care, (3) cluster care providers to build a benchmark to compare providers across Kenya.
510 (an initiative of The Netherlands Red Cross) helps local Red Cross organizations improve the speed, quality, and cost-effectiveness of humanitarian aid. They provide data on mosquito abundance and Dengue epidemics. The aim of this project is to predict the diffusion of mosquitoes, which may carry the Dengue virus, from meteorological data in the Philippines. They have also prepared another project on classifying building damage in post-disaster satellite images.
Medical Credit Fund helps private healthcare clinics access affordable financing and support to improve the quality of healthcare they deliver. They provide data on hospital transactions and loan performance history. The aim of this project is to develop a better credit scoring system.
MomCare is a program by PharmAccess Foundation that aims to improve the quality of pregnancy care in African countries by evaluating and incentivizing good patient journeys. They provide data from 1,000 mothers’ journeys. The aim of this project is to understand (1) patient behaviour in seeking care, (2) how fatal outcomes are mitigated by the actions of the care provider, (3) the features of reduced mother and child mortality.
SafeCare is part of PharmAccess. They provide time-series data on hospital safety screenings. The aim of this project is largely exploratory; one could for example study factors that determine safety improvements or the risk of losing a high safety rating.