Philippines - Typhoon Haima - Priority Index

Blog post about this prediction can be found here:

The predicted priority index of Typhoon Haima is produced by a machine learning algorithm that was trained on four past typhoons: Haiyan, Melor, Hagupit and Rammasun. It uses baseline data for the whole country, combined with impact data on wind speeds and rainfall, and was trained on Philippine government counts of people affected and houses damaged.

First run: The Priority Index is a 1-5 classification that can be used to identify the worst-hit areas: those that need to be visited first for further assessments or support.

Second run: The model now predicts two things:

  • a weighted index of partially and completely damaged houses, where a partially damaged house counts as 25% of a completely damaged one. This weighting has proven to give the highest accuracy.
  • the percentage of total damage (damaged houses versus all houses)
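
The two prediction targets can be illustrated with a minimal sketch. The function names and the example counts are hypothetical, and counting both partially and completely damaged houses in the percentage is an assumption; the description above does not say exactly how that target is computed.

```python
def weighted_damage(completely_damaged, partially_damaged, partial_weight=0.25):
    """Weighted index: a partially damaged house counts as 25% of a completely damaged one."""
    return completely_damaged + partial_weight * partially_damaged


def damage_percentage(completely_damaged, partially_damaged, total_houses):
    """Damaged houses as a percentage of all houses (assumes both damage types count)."""
    if total_houses <= 0:
        raise ValueError("total_houses must be positive")
    return 100.0 * (completely_damaged + partially_damaged) / total_houses


# Made-up municipality: 120 completely and 400 partially damaged houses out of 5000.
index = weighted_damage(120, 400)
share = damage_percentage(120, 400, 5000)
print(index, share)
```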

The absolute number of houses damaged / people affected is insufficiently validated at the moment and should be used only for further training and ground-truthing.

Data sources:

  • Administrative boundaries (P_Codes) - Philippines Government; Published by GADM and UN OCHA (HDX)
  • Census 2015 (population) - Philippine Statistics Authority; received from UN OCHA (HDX)
  • Avg. wind speed (km/h) - University College London
  • Typhoon path - University College London
  • Houses damaged - NDRRMC
  • Rainfall - GPM
  • Poverty - Pantawid Pamilyang Pilipino Program (aggregated)

For the second run of the algorithm we also included:

  • Roof and wall materials
  • New geographical features

The results of the different models can be found in the file 'Typhoon Haima - performance of different models - second run.csv'. A note on how to interpret its columns:

  • date - run date
  • alg_date - same as date
  • alg_model - name of the algorithm used
  • alg_predict_on - name of the learning variable
  • alg_use_log - whether the learning variable is log-transformed
  • code_version - version of the code

Each column prefixed with feat_ gives the importance of that feature; if a column is not present, that feature was not used.

  • learn_matrix - name of the learning matrix with the 5 typhoons
  • run_name - unique run name (the pickle and csv files for this model carry this name)
  • typhoon_to_predict - name of the new typhoon to predict
  • val_accuracy - accuracy based on 10 categories of damage (0%, 10%, 20%, …)
  • val_perc_down - percentage of underpredicted categories
  • val_perc_up - percentage of overpredicted categories
  • Val_best_score - best r2 score

  • Val_stdev_best_score - error on the best score, based on the cross-validation (CV)
  • Val_score_test - r2 score on the test set (this should be within about ±5% of the previous number, otherwise the model is overfitting)

  • Val_mean_error_num_houses - mean error on the number of houses
  • val_median_error_num_houses - median error on the number of houses
  • val_std_error_num_houses - standard deviation of the errors (lower is better)
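
As a rough illustration of the category-based columns, the sketch below bins damage percentages into 10 categories of 10% each and counts exact, under- and overpredicted categories. The binning rule, the function names and the sample numbers are assumptions made for this example; the actual code may define the categories differently or report percentages rather than fractions.

```python
def damage_category(percent_damaged, width=10.0):
    """Map a damage percentage to a bin: 0-10% -> 0, 10-20% -> 1, ..., 90-100% -> 9."""
    clamped = min(max(percent_damaged, 0.0), 100.0)
    # Exactly 100% goes into the top bin instead of opening an 11th one.
    return min(int(clamped // width), int(100 // width) - 1)


def category_scores(actual, predicted):
    """Fractions of municipalities predicted in the right, a lower, or a higher category."""
    pairs = [(damage_category(a), damage_category(p)) for a, p in zip(actual, predicted)]
    n = len(pairs)
    return {
        "val_accuracy": sum(a == p for a, p in pairs) / n,
        "val_perc_down": sum(p < a for a, p in pairs) / n,
        "val_perc_up": sum(p > a for a, p in pairs) / n,
    }


# Made-up damage percentages for four municipalities.
actual = [5.0, 12.0, 37.0, 60.0]
predicted = [8.0, 25.0, 31.0, 48.0]
scores = category_scores(actual, predicted)
print(scores)
```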

Algorithm developed by the data innovation initiative of the Netherlands Red Cross.

  • This dataset updates: Never


Source: Netherlands Red Cross
Date of Dataset: Nov 22, 2016
Expected Update Frequency: Never

Data integration: Data was collected for Haima and for 4 previous typhoons (Haiyan, Rammasun, Hagupit and Melor), including wind speed, rainfall, population, poverty and the number of damaged houses (the latter was not collected for Haima, since it is the quantity to be predicted). All data was calculated at the municipality level of the Philippines. For example, rainfall was averaged over the whole area of each municipality. Finally, all information per municipality was joined using the PCODE system, which assigns a unique identifier to each administrative area in the Philippines.
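
The aggregation and join described above might look roughly like the following sketch. It uses plain dictionaries keyed by a municipality code; the codes, column names and values are invented, and the unweighted mean over grid cells is a simplification of a true area average.

```python
def municipality_mean(cell_values):
    """Average gridded values (e.g. rainfall cells) per municipality code."""
    totals = {}
    for pcode, value in cell_values:
        running_sum, count = totals.get(pcode, (0.0, 0))
        totals[pcode] = (running_sum + value, count + 1)
    return {pcode: s / c for pcode, (s, c) in totals.items()}


def join_on_pcode(**tables):
    """Inner join: keep only municipalities that appear in every table."""
    common = set.intersection(*(set(table) for table in tables.values()))
    return {
        pcode: {name: table[pcode] for name, table in tables.items()}
        for pcode in sorted(common)
    }


# Made-up inputs; "MUN_A" etc. stand in for real P-codes.
rainfall_cells = [("MUN_A", 100.0), ("MUN_A", 140.0), ("MUN_B", 80.0)]
population = {"MUN_A": 25000, "MUN_B": 12000, "MUN_C": 9000}
wind = {"MUN_A": 150.0, "MUN_B": 95.0}

rainfall = municipality_mean(rainfall_cells)
joined = join_on_pcode(rainfall_mm=rainfall, population=population, wind_kmh=wind)
print(joined)
```

MUN_C is dropped from the result because it has no rainfall or wind record; an inner join like this keeps only municipalities with complete data.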

The prediction model was built using a machine learning method called Random Forest Regressor. Like all regression methods, it uses historical data and attempts to learn the relationship between the input data and their impact on the output.

Random Forest Regressor was selected because it outperformed another method that is usually also very successful, Gradient Boosting Regressor. Its power comes from building multiple predictors (decision trees) and averaging their outputs. Each tree is built in a slightly different way, using a different subset of the historical data and randomly selecting different variables while the tree is built. This strategy yields a model that handles multidimensional data well and gives good estimates of the importance of each input variable. The method is highly configurable, so several experiments were run to select the parameters that produce the best results on the training data.
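
The averaging idea can be shown with a toy sketch. This is not the model used for the prediction and not a full Random Forest: each "tree" here is a single split (a stump) fitted on a bootstrap sample with one randomly chosen variable, which is enough to illustrate the two sources of randomness and the averaging of outputs. All names and data are invented for the example.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Find the split on one feature that minimizes the squared error."""
    best = None
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # The feature is constant in this sample: always predict the sample mean.
        overall = mean(targets)
        return (feature, float("inf"), overall, overall)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # draw with replacement
        feature = rng.randrange(len(rows[0]))  # random variable for this tree
        forest.append(fit_stump([rows[i] for i in sample],
                                [targets[i] for i in sample], feature))
    return forest


def predict(forest, row):
    """Average the outputs of all stumps."""
    return mean([left if row[f] <= thr else right for f, thr, left, right in forest])


# Made-up data: (wind speed, rainfall) -> houses damaged.
rows = [(60.0, 50.0), (80.0, 200.0), (100.0, 90.0),
        (120.0, 180.0), (140.0, 120.0), (160.0, 160.0)]
targets = [100.0, 300.0, 700.0, 1200.0, 2000.0, 2600.0]
forest = fit_forest(rows, targets)
print(predict(forest, (150.0, 130.0)))
```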

Performance: The evaluation lets us estimate the average prediction error to be close to 1800 and the median error close to 1300 (half of the predictions should be closer to the real values than 1300). Errors were estimated using cross-validation: averaging the errors made by a model trained on all but one typhoon, which was left out for testing. The procedure was repeated 4 times, each time leaving a different typhoon out for testing.
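
A leave-one-typhoon-out evaluation of this kind can be sketched as follows. The stand-in model simply predicts the mean of the training targets, the damage numbers are invented, and pooling the absolute errors of all folds before taking the mean and median is an assumption; the original evaluation may have averaged per fold instead.

```python
from statistics import mean, median


def leave_one_typhoon_out(records, fit, predict):
    """records: list of (typhoon, features, houses_damaged) tuples."""
    errors = []
    for held_out in sorted({typhoon for typhoon, _, _ in records}):
        train = [(x, y) for t, x, y in records if t != held_out]
        test = [(x, y) for t, x, y in records if t == held_out]
        model = fit(train)  # trained without the held-out typhoon
        errors.extend(abs(predict(model, x) - y) for x, y in test)
    return mean(errors), median(errors)


def fit_mean(train):
    """Stand-in model: remember the mean damage of the training records."""
    return mean([y for _, y in train])


def predict_mean(model, features):
    """The stand-in model ignores the features and returns the stored mean."""
    return model


# Made-up records: (typhoon, wind speed, houses damaged).
records = [
    ("Haiyan", 200.0, 900.0), ("Haiyan", 210.0, 1500.0),
    ("Melor", 120.0, 200.0), ("Melor", 130.0, 400.0),
    ("Hagupit", 140.0, 500.0), ("Hagupit", 150.0, 700.0),
    ("Rammasun", 160.0, 800.0), ("Rammasun", 170.0, 1000.0),
]
mean_error, median_error = leave_one_typhoon_out(records, fit_mean, predict_mean)
print(mean_error, median_error)
```

Any model with the same fit and predict shape, such as a Random Forest, could be passed in place of the stand-in.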

Application: The model was applied to predict the number of houses damaged by Haima, the most recent typhoon. The predicted numbers were used to prioritize municipalities on a scale from 1 to 5 (1 for the lowest predicted damage, 5 for the highest). The categories were assigned using the following percentiles: 0%, 35%, 65%, 85%, 95%. That is, the 35% of municipalities with the lowest predicted damage were assigned priority 1, the next 30% priority 2, the next 20% priority 3, the next 10% priority 4, and the 5% of municipalities with the highest predicted numbers priority 5. This selection of thresholds was rather arbitrary, chosen to emphasize the places with the highest damage. The chosen percentiles translate to the following absolute numbers:

  • Priority 1 - predicted number of houses damaged below 183
  • Priority 2 - predicted number of houses damaged in [183, 345)
  • Priority 3 - predicted number of houses damaged in [345, 1205)
  • Priority 4 - predicted number of houses damaged in [1205, 2498)
  • Priority 5 - predicted number of houses damaged over 2498
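
The thresholds listed above can be turned into a small lookup. This is a sketch with a hypothetical function name; it treats each interval as half-open, so a prediction of exactly 2498 falls into priority 5.

```python
from bisect import bisect_right

# Lower bounds of priorities 2 to 5, taken from the list above.
THRESHOLDS = [183, 345, 1205, 2498]


def priority(predicted_houses_damaged):
    """Return the priority (1-5) for a predicted number of houses damaged."""
    # bisect_right counts how many thresholds are <= the prediction.
    return bisect_right(THRESHOLDS, predicted_houses_damaged) + 1


for value in (50, 183, 900, 2000, 3100):
    print(value, priority(value))
```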

Future steps: The model will be improved by adding more input variables related to geographical information and by extending the training data. When more data is available, the choice of algorithm will also be reconsidered if simulation results motivate it.

Caveats / Comments

Accuracy: NDRRMC publishes official data, but only after it has been validated by its local counterparts (LGU or regional officials).