ADSA stands for Advanced Data Systems Analysis and is an algorithm that was originally developed by Remi Mollicone, Ing and Giovanni Feverati, PhD.
Here, we demonstrate the effectiveness of Advanced Data Systems Analysis in reproducing the ranking of countries by number of natural disasters from a set of environmental and socio-economic parameters, while also providing the contribution of each parameter to the rank. We show the relative similarity or dissimilarity of countries in the face of natural disasters by different methods, as dendrograms and interaction networks. The case study is based on the dataset provided by Université catholique de Louvain (see references).
Societies and insurance companies face a marked growth in the expenses needed to cover damage from natural disasters. While this is partly a consequence of more complete reporting, the evolution of costs and risks still affects national economies and the insurance business.
We apply our methods to a dataset containing time series of several variables taken from various public data banks (see references). With these variables, we can reproduce the ranking of countries by number of natural disasters with a precision of 82%.
ADSA-Weighting and Ranking Extraction (WRE)
Below, different ranking methods are compared and evaluated against a benchmark. The benchmark is the actual ranking of countries by number of natural disasters, as obtained from the EM-DAT data bank.
- We have ranked the countries with principal component analysis.
- We have implemented the Savage score method.
- ADSA topological is our advanced data system method.
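As an illustration of the Savage score method listed above, here is a minimal sketch. It relies on the standard definition of Savage scores (for n items, the item at rank i, with rank 1 at the top, receives the sum of 1/j for j from i to n), which are the expected order statistics of an exponential sample. How the scores are combined across variables is an assumption made here for illustration (a plain sum per item); the function names and the small table are hypothetical, and ties are not handled.

```python
def savage_scores(n):
    """Savage scores for ranks 1..n (rank 1 = top item): S_i = sum_{j=i}^{n} 1/j."""
    scores = [0.0] * n
    running = 0.0
    for i in range(n, 0, -1):
        running += 1.0 / i
        scores[i - 1] = running
    return scores


def savage_rank(table):
    """Order items from highest to lowest total Savage score.

    table maps an item name to its list of variable values.
    Assumption: larger values push an item toward the top of the rank,
    and per-variable scores are simply summed.
    """
    items = list(table)
    n_vars = len(next(iter(table.values())))
    scores = savage_scores(len(items))
    totals = {item: 0.0 for item in items}
    for v in range(n_vars):
        ordered = sorted(items, key=lambda it: table[it][v], reverse=True)
        for position, item in enumerate(ordered):
            totals[item] += scores[position]
    return sorted(items, key=lambda it: totals[it], reverse=True)


# Hypothetical items and values, for illustration only.
example = {"X": [10.0, 3.0], "Y": [5.0, 2.0], "Z": [1.0, 1.0]}
order = savage_rank(example)
```

Because the scores decrease quickly from the top rank downward, this approach emphasises agreement among the highest-ranked items.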
Table 1 shows the local weights specific to each item (only the first few countries are shown, but the local weights are available for all the others). Values have been rounded to two decimal digits to simplify the presentation. Positive values indicate a positive contribution to the ranking, negative ones a reduction. Notice that the absolute values of the entries in each row sum to 1 (except for rounding effects): the weights are relative to each item and tell how the variables distribute their effects to produce the ranking.
The correlation coefficient (Spearman score) of these results with the benchmark has been computed and is given in the following table.
From the table, it appears that two methods achieve very high and similar scores against the benchmark. In particular, the Savage score rank achieves the same score as ADSA. However, there is an important difference: the Savage score rank implicitly assumes an exponential distribution of the variables, while ADSA makes no such assumption. So, when the variables are approximately exponentially distributed the Savage score method works well; otherwise it may produce large deviations. A second advance brought by ADSA is the weight analysis, which provides the impact of each variable by computing local and global weights for each variable and item.
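The Spearman score used for the comparison above is the Pearson correlation computed on ranks. A minimal sketch, using only the standard library, is given below; the two rank lists are hypothetical and do not reproduce the values of the study.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


benchmark = [1, 2, 3, 4, 5]  # hypothetical observed rank
predicted = [2, 1, 3, 4, 5]  # hypothetical rank produced by a method
rho = spearman(benchmark, predicted)  # 0.9: only the top two items are swapped
```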
The local weights
The ADSA rank has been obtained after evaluating the overall and specific contribution of each variable to every single item. The local weights, specific to each item, indicate how much each variable contributes out of a total of 1. A negative value indicates that the variable, for the specific item, tends to lower the rank, while a positive value indicates that the variable tends to raise it.
It may happen that the weight corresponding to a given item and variable vanishes. This means that the variable does not change the rank of that specific item. The influence of this variable then occurs only through an indirect action on all the other items, such that the specific item is not affected. Note that ranking is a global process in which all the items are interconnected: if a fluctuation in the data makes one item's rank go up by 1, there is certainly another item whose rank goes down by 1. This is one of the reasons why some variables have no influence on certain items, and why their weights vanish there.
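The normalisation property described above (signed weights whose absolute values sum to 1 for each item) can be sketched as follows. This shows only the normalisation step, not how ADSA derives the contributions; the function name and the raw values are hypothetical.

```python
def normalize_local_weights(contributions):
    """Scale signed contributions so their absolute values sum to 1; signs are kept."""
    total = sum(abs(c) for c in contributions)
    if total == 0:
        return [0.0] * len(contributions)
    return [c / total for c in contributions]


# Hypothetical raw contributions of four variables to one item's rank.
raw = [3.0, -1.0, 0.0, 4.0]
weights = normalize_local_weights(raw)  # [0.375, -0.125, 0.0, 0.5]
```

In this example the second variable lowers the rank of the item, and the third has a vanishing weight.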
The global and dimensional weights
The global weights apply to the dataset as a whole and indicate the global contribution of each variable. As before, these values sum to 1, except for rounding effects. If the variables were perfectly uniform, they would contribute equally, as given in the Equal weighting line.
Table 3 shows the global weights for each variable. Here, the variable that contributes the least is hdi, while landsurf has the largest contribution overall. Note that the weights were extracted from the data; we did not assume an equal weight for each variable.
The rank stability
The problem addressed here is the consequence, for the rank, of errors or fluctuations in the data. The data may evolve with time, following the evolution of the phenomena, or because they were uncertain and are re-evaluated or re-measured from time to time. The effect of small fluctuations on the rank may be very small, in which case the rank is stable, or very large, in which case the rank is unstable. This depends to a limited extent on the method used to obtain the rank and to a great extent on the dataset itself. The rank for very regular data, like rows of trees in a planted forest, is usually very stable against fluctuations, while very complex and irregular data are more likely to be unstable. If an unstable dataset undergoes some changes, the rank needs to be recalculated. The value produced by the program is a decimal number from zero to one: 1 corresponds to a perfectly stable dataset, while 0 corresponds to a fully unstable dataset.
The concept of stability is distinct from the quality of the rank, namely its accuracy. The predicted rank is of high quality if it is very close to the observed rank. When making predictions, the observed rank is unknown (otherwise there would be no need for a prediction) and the quality cannot be evaluated. On the other hand, in a very unstable dataset the rank quality is probably low (this cannot be known for sure), as small uncertainties in the data may produce large effects on the rank. In most cases of interest, the high complexity of the phenomena makes the stability quite small.
For the present dataset, the stability is 0.67, closer to full stability (1) than full instability (0).
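To make the idea of rank stability concrete, here is a rough sketch of one possible way to estimate it by perturbation. This is not the ADSA stability measure, whose definition is not given here; it is an assumed stand-in that adds small random noise to the data, recomputes a rank, and reports the fraction of trials in which the rank is unchanged. The ranking function (sum of variables), the noise level and the data are all hypothetical.

```python
import random


def rank_by_sum(rows):
    """Stand-in ranking: order item indices by the sum of their variables, largest first."""
    return sorted(range(len(rows)), key=lambda i: sum(rows[i]), reverse=True)


def stability(rows, noise=0.05, trials=200, seed=0):
    """Fraction of noisy trials in which the ranking is unchanged (1 = stable, 0 = unstable)."""
    rng = random.Random(seed)
    baseline = rank_by_sum(rows)
    unchanged = 0
    for _ in range(trials):
        # Perturb every value by a relative amount drawn uniformly in [-noise, +noise].
        noisy = [[v * (1 + rng.uniform(-noise, noise)) for v in row] for row in rows]
        if rank_by_sum(noisy) == baseline:
            unchanged += 1
    return unchanged / trials


regular = [[1.0, 1.0], [10.0, 10.0], [100.0, 100.0]]   # well separated items
irregular = [[1.00, 1.00], [1.01, 1.00], [1.02, 1.00]]  # nearly tied items
stable_score = stability(regular)
unstable_score = stability(irregular)
```

Well separated items keep their order under small noise, while nearly tied items are reordered in many trials, which lowers the score.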
ADSA-Interaction Analysis for Clusters (IAC)
This module creates clusters of items and evaluates the inter-cluster interactions or similarity. The ADSA clustering is very detailed, as it gives an accurate description of the similarity (or dissimilarity) between different countries with regard to natural catastrophes. For example, the cluster formed by Lithuania, Slovakia, Hungary, Croatia and Portugal, with linkage value 6, is very compact and uniform: the overall impact of natural disasters, if any, is expected to be similar. The United Arab Emirates and Mauritius join this cluster at linkage value 32, indicating that these countries, albeit comparable with the other five, show important differences. The other clusters are farther away still, showing increasingly relevant differences in the origins and impact of natural disasters. Please notice that we show here just an extract of the full (and large) dendrogram, for reasons of space and readability.
The first few clusters are indicated in Table 3:
The construction of the clusters is hierarchical: it starts with each item as a separate cluster and then aggregates the clusters at each step until only one cluster is left. A hierarchical tree diagram is called a dendrogram. It shows the linkage points, which quantify the differences between countries, namely the amount of dissimilarity. The clusters are linked at increasing levels of dissimilarity. This example shows the items B and C being combined at the linkage distance of 1, and BC with D at 2. Finally, BCD aggregates with A at 4. Thus, objects are joined together into successively larger clusters, from bottom to top, until a single cluster is formed.
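The A, B, C, D example above can be reproduced with a small agglomerative clustering sketch. The linkage criterion used by ADSA is not stated here, so single linkage (distance between the closest members of two clusters) is assumed, and the pairwise distances are hypothetical values chosen to give linkage levels 1, 2 and 4.

```python
from itertools import combinations


def single_linkage(labels, dist):
    """Agglomerative clustering with single linkage (assumed criterion).

    dist maps frozenset({x, y}) to the distance between items x and y.
    Returns the merges as (cluster_a, cluster_b, linkage_distance), in order.
    """
    clusters = [frozenset([label]) for label in labels]
    merges = []

    def cluster_distance(a, b):
        return min(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        # Merge the two closest clusters at each step.
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((sorted(a), sorted(b), cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical pairwise distances, for illustration only.
distances = {
    frozenset("BC"): 1, frozenset("CD"): 2, frozenset("BD"): 3,
    frozenset("AB"): 4, frozenset("AC"): 5, frozenset("AD"): 6,
}
merges = single_linkage(["A", "B", "C", "D"], distances)
linkage_levels = [level for _, _, level in merges]  # [1, 2, 4]
```

The list of merges is the information a dendrogram draws: each merge is a linkage point, placed at the height given by its linkage distance.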
Interaction distance of clusters
This table shows the distance at which two clusters start to interact. The higher the distance, the farther apart the clusters are. The smaller the value, the stronger the interaction.
The figure below shows the clusters and their relative interactions up to 0.15. The scale ranges from 0 to more than 20, so a value of 0.15 means that the clusters connected by a green edge are very close compared to the rest. The figure is fully scalable, so it can be zoomed in and browsed to read every detail: the clusters contain the country names and the edges carry the interaction distance. Please see figure 3 for more detail.
This figure shows just the clusters interacting below the cutoff 0.15.
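Selecting the interactions shown in such a network amounts to keeping the cluster pairs whose interaction distance lies below the cutoff. A minimal sketch follows; the cluster labels and distances are hypothetical placeholders, not values from the study, and only the cutoff of 0.15 comes from the text.

```python
def edges_below_cutoff(distances, cutoff):
    """Keep cluster pairs whose interaction distance is below the cutoff, closest first."""
    kept = [(pair, d) for pair, d in distances.items() if d < cutoff]
    return sorted(kept, key=lambda edge: edge[1])


# Hypothetical cluster labels and interaction distances, for illustration only.
interaction_distances = {
    ("cluster_1", "cluster_2"): 0.04,
    ("cluster_1", "cluster_3"): 0.12,
    ("cluster_2", "cluster_3"): 0.90,
    ("cluster_3", "cluster_4"): 21.5,
}
edges = edges_below_cutoff(interaction_distances, cutoff=0.15)
```

Here only the first two pairs survive the cutoff, so they would be the only edges drawn between clusters.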
EM-DAT: The Emergency Events Database – Université catholique de Louvain (UCL) – CRED, D. Guha-Sapir – http://www.emdat.be Brussels, Belgium.