A new approach for processing climate missing databases applied to daily rainfall data in Soummam watershed, Algeria

(1)

A new approach for processing

climate missing databases

applied to daily rainfall data in

Soummam watershed, Algeria

Amir Aieba,b,∗∗, Khodir Madania, Marco Scarpac, Brunella Bonaccorsoc,

Khalef Lefsiha,∗

a_{Laboratoire de Biomath}_{ematiques, Biophysique, Biochimie, et Scientometrie (L3BS), Universite de Bejaia, 06000} Bejaia, Algerie

b_{Department of Computer Science, Faculty of Exact Science, Abderrahmane Mira University, Bejaïa 06000, Algeria} c_{Department of Engineering, University of Messina, Italy}

∗_{Corresponding author.} ∗∗_{Corresponding author.}

E-mail addresses:amir18informatique@gmail.com(A. Aieb),klefsih@yahoo.fr(K. Lefsih).

Abstract

Missing data is a very frequent problem in climatology, it influences on the quality of results that will afford in hydrological studies, as well as water resources management. This paper proposes a new imputation algorithm, based on the optimization of some regression methods, which are hot deck, k-nearest-neighbors imputation, weighted k-nearest-neighbors imputation, multiple imputation, linear regression and simple average method. The choice of these methods was justified by qualitative and quantitative statistical tests analysis. However, the reliability of obtained results depends mainly on percentage of missing data, choice of neighboring stations and data missingness mechanism which should be missing at random. During the study it was found that the most of stations in Soummam watershed don’t have a good correlation because the large loss in rainfall data or the geology of watershed which gives a relationship between station position and rainfall variability. For this case, principal component analysis is applied on a set of stations; it showed a positive impact of altitude, latitude and longitude on correlation index between Revised:

16 December 2018 Accepted: 13 February 2019

Cite as: Amir Aieb, Khodir Madani, Marco Scarpa, Brunella Bonaccorso, Khalef Lefsih. A new approach for processing climate missing databases applied to daily rainfall data in Soummam watershed, Algeria.

Heliyon 5 (2019) e01247.

doi: 10.1016/j.heliyon.2019. e01247

(2)

selected stations. The graphical analysis of the normal law on RMSE values, which were obtained by applying the proposed technique in several random cases of missingness, that are 4%, 8%, 12% and 16% respectively, it conﬁrmed the validity and the performance of this approach.

Keywords: Atmospheric science, Environmental science, Hydrology

1. Introduction

The precipitation is one of variables commonly used to study climate variability, flow estimation and even to understand floods and landslides (Hong et al., 2007; Medlin et al., 2007). These studies require complete and reliable records of rainfall data. For this case, data was also represented by satellite maps, deduced by Algo-rithms to overcome spatial coverage limitations of rainfall and to optimize rainfall measurement (Adler et al., 2000;Pettazzi and Salson, 2012;Vila et al., 2009). The availability of climatic terrestrial networks is one of factors to obtain the best esti-mation of precipitation for remote sensing communities (Sharifi et al., 2016), un-fortunately in practice the obtained databases contain gaps due to systematic errors or other source of malfunctioning such as the absence of the observer, the destruction of gauging devices, the power failure and the elimination of incorrect data (Sharifi et al., 2016).

The breaks in data acquisition of recording systems are more prevalent in Mediter-ranean countries (Gyau-Boakye and Schultz, 1994), it is an interesting topic for me-teorologists, hydrologists and environmental managers toﬁll these gaps (Khosravi

et al., 2015), however the problem is toﬁnd the real data while it has been judged

that it is diﬃcult to ﬁnd a perfect method.

In literature, it canfind two possibilities for filling gaps. Firstly, to exclude matrix rows containing even a single MD (missing data elimination techniques), this tech-nique deletes the cells of matrix that contain MD, it is widely used because of its simplicity; however, it does not give the most efficient utilization of data and it can incur a bias, just only if the values are not missing completely at random. Conse-quently, it can be used only in case where the gaps are very low (Song et al., 2008). Secondly, to estimate gaps of data series by replacing MD with values (imputed missing data techniques) (Li et al., 2007), the choice of thesefilling techniques refers to case where the strategy is the best reliable. There are many methods which can be applied to complete dataset; these procedures are designed to reduce number of gaps by replacing MD with nearest values, this process called completion or substitution. The most significant advantages of these procedures are the conservation of data dimension, therefore, the strength of statistical data processing. In a more or less broad measurement, all procedures of replacement are not important to be used if

(3)

there is not random distribution of missing values. However, the replacement of MD is reliable when correlations between variables are very strong. On the other hand, it was shown that there is no ideal technique of estimation could exist (Presti et al., 2010). The eﬀectiveness of each technique depends on a number of factors, such as the percentage of MD, the mechanism of data loss and characteristics of the var-iable under consideration.

In this regard, there are two types of imputations to be distinguished: simple impu-tation and multiple impuimpu-tation. Simple impuimpu-tation is the one in which the missing value is replaced by the average of all measured observations in the same series, knowing that the amount of MD must be less than 5%; we can also use simple regres-sion methods which starts by studying correlation between observed data of neigh-boring stations and that of reference stations, this technique is applied when data are missing randomly and the amount of the lack must be between 5% and 10% (Johnson, 2003). In this context, there are two types of regression methods, logistic regression that is used to treat discrete variables, and the linear regression applied for treating continuous variables. Multiple imputation is a treatment that uses several similar databases. It will give many replacements for the same unobserved value. Finally, the subsequent inference obtained by combining all imputed values (Sovilj et al., 2016). The most of these techniques suﬀer from problem of obtained results reliability when the data missed randomly, the case that let statistical studies based on the other hand to evaluate models which ignored the existence of MD, for example: Principal Component Analysis (PCA) model building (Folch-Fortuny et al., 2015;Nelson et al., 1996) and Wavelet method which is used to study climate variability on discontinuous temporal series that contains missing observations (Turki et al., 2016).

This paper introduces new approach that uses hybrid methods to solve problem of reliability about obtained results of the ﬁlling, it was applied in missing climate data series. The proposed technique based on optimization of some imputation methods detailed in the content of this article.

2. Materials & methods

Usually, the choice of eachfilling method will be done according to missingness mechanisms (different ways in which data are missing). So to manage matrices char-acterized by incomplete data series properly, thefirst requirement is to identify the mechanism responsible for data losing (Little and Rubin, 1987). There are three different types of processes to be considered (Little and Rubin, 1987), the simplest case is when data loss occurs absolutely at random, which are indicated with the acronym MCAR (missing completely at random). It occurs when the probability

(4)

of missing value depends neither on the variable itself nor of another variable of the database; in mathematical term, it is written:

PðrjXobs; XmisÞ ¼ PðrÞ ð1Þ

where Xobsis observed data, Xmis is MD and (r) is distribution condition of MD.

On the contrary, when the probability that a value is missing depends on the value of other variables, and only on them, the condition is called MAR (missing at random) (Schafer, 1997). That is:

PðrjXobs; XmisÞ ¼ PðrjXobsÞ ð2Þ

Finally, the third and the last case arises when the loss of data does not occur randomly at all (NMAR), in this case, the probability that a value is missing depends on the missing value itself. That is:

PðrjXobs; XmisÞ ¼ PðrjXmisÞ ð3Þ

In general, whether the missingness mechanism is related to study variables or not, it is very signiﬁcant as to determine how it is diﬃcult to handle MD at the same time (Little and Rubin, 1987).

2.1. Methods

There are many imputation methods in the literature based on different ap-proaches, used in specific domains or even for specific datasets transportation

(Tang et al., 2015), meteorology (Junger and De Leon, 2015), and others

(Folguera et al., 2015). Although, the proposed methodology can be considered

when the MAR hypothesis is assumed (Gomez-Carracedo et al., 2014), it should be borne in mind that it can be applied for MCAR case, which are more random than MAR. The first step is to pre-process the time series, it means to identify gaps distribution, then to select a set of neighboring data series (e.g. neighboring weather stations) to those affected by missing values (target station), respecting the temporal behavior of variables under consideration. In climatology, we can use data recoding from weather station in the same watershed as neighboring time series if there is similarity between data using Spearman coefficient and mean absolute error. The following methods used to fill MD in this paper pre-sented in different descriptions.

2.1.1. k-nearest-neighbors imputation (KNNI)

This method is based on the k observed values of the most similar time series, then all values are used into a single estimate using approaches such as the average

(5)

methods or kernel function (Amiri and Jensen, 2016). If we canﬁnd single nearest neighbor data series (k¼ 1), in this case it called hot deck (Batista and Monard, 2003). These methods must be appropriate when data are MAR, meaning that the missing value is correlated with other observed variables.

2.1.1.1. Hot-deck

This method performs the estimation of MD of incomplete records Pxusing values of

similar complete records_Pi, belonging to the same data set when there is a large sim-ilarity between data, that means (k¼ 1) (Rahman et al., 2015). In our case study, the nearest station that has the greatest correlation between its values and that of the reference station which allows us to take the same value of this similar station in the same time.

Px¼ Pi ð4Þ

2.1.1.2. Arithmetic average method

When the number of similar data series is equal two or more, a simple average method can estimate MD. In our case study, when rainfall values of each similar station have diﬀerence less than 10%, comparing to other measurement of record stations, we can apply KNNI shown inEq. (5). But, when this diﬀerence is very large, a normal-ratio method was recommended. We can point out that the selec-tion of neighboring precipitaselec-tion staselec-tions for estimating MD must be based on meteorological judgment (Teegavarapu and Chandramouli, 2005).

Px¼

Pk i¼1

Pi

k ð5Þ

where Pirepresents daily rainfall data of similar stations, k is the number of similar

stations.

2.1.2. Weighted-nearest-neighbors imputation (WKNNI)

It is another estimation method based on the weighting coefficient of similar time series. In climate data series we used Euclidean distance between similarity stations and reference station to calculate this coefficient, in order to obtain the best estima-tion of MD then, thefinal result is determined by a weighted average of all neigh-boring data (Teegavarapu and Chandramouli, 2005; Troyanskaya et al., 2001). It is given by the following equation:

(6)

Px¼ Pk i¼1ðPi*wiÞ Pk i¼1 wi Such as; Wi¼ dK_xi ð6Þ

where Pxis the MD observed in reference station x, k is the number of similar

sta-tions; Pi is imputed data, that means the observed rainfall data on neighboring

station,wiis the weighting coeﬃcient that equaldK_xi ,dxiis the Euclidean distance be-tween location of neighboring station i and the reference station x; and K is referred to as friction distance (Vieux, 2001), which ranges from 1.0 to 6.0.

2.1.3. Multiple imputation (MI)

Multiple Imputation is aﬁlling method that provides valid statistical inferences under MAR condition. This method processes MD sets by using the standard procedures of regression, then to make combination between imputing results from these analyses for obtainingﬁnal result, as shown inEq. (7)(Rubin, 1987;Schafer, 1997;Troyanskaya

et al., 2001). The multiple imputation has been implemented in software such as SAS

(Jerez et al., 2010) and Amelia II package (Honaker et al., 2011). To apply this method, it must follow the following steps: (i) Find k similar databases for each missing value; then the observed values will be used to impute MD. (ii) For each MD (Px),we use data

imputations Piby applying regression methods to obtain k diﬀerent estimate results Ii

(Px,Pi). (iii) Theﬁnal result Pxwill be obtained by combining all imputations results Ii

using average of all the k complete data values, that is:

Px¼ Pk i¼1 IiðPx; PiÞ k ð7Þ

2.1.4. Linear regression (LR)

The linear regression is a very fundamental computational procedure which forms the basis of many elaborate algorithms, like the alternating least squares algorithms (ALS) (Wang et al., 2003), it also assumes that values are missing at random. This method requires two steps, initially to estimate the relationship between predictors and missing values, then to use a trend equation forﬁlling the gaps (Bardossy and

Pegram, 2014), and we can express it byEq. (8):

PxðtÞ ¼ a þ b:PiðtÞ ð8Þ

The value of a and b can be estimated by using respectively Eqs.(9)and(10):

(7)

b¼ Pn i¼1 xy Pn i¼1 xP n i¼1 y n Pn i¼1x 2 Pn i¼1 x xÞ2 n ð10Þ

Where_{y and x are mean values of the data series, respectively, the reference and} similarity stations (Bardossy and Pegram, 2014;Khosravi et al., 2015).

2.1.5. Simple average method

According to (Johnson, 2003), for an amount of MD less than 5%, we can use any ﬁlling method; in our case study, this percentage represent just single missing value in one month. He reports that replacing MD with total average of all observations of the same month gives a good result, only if there is low correlation between variables (Presti et al., 2010), we can write that by:

Px¼

Pn i¼1

Pi

n ð11Þ

Where Pi is the observed rainfall data, n is the number of days for each month

which can be (28, 29, 30 or 31) days.

2.2. Study area

This study focuses on thefilling MD of daily precipitation measurements at Be-jaia airport weather station of Algeria (Fig. 1), over 32 years. This station is located in the Soummam watershed with an area surface of 9200 km2, which is a part of eastern Algeria. It is bounded on the north by the mountains of Djurdjura (Lala Khadija with an altitude equal 2308 m), on the east by the plateau of Setif and Babors mountain chains (with an altitude equal 2004 m), as well as on the West by Bouira. Soummam consists of 10 sub-watersheds, equipped with 34 meteorological stations. This area offers a diverse climate and different morphological zones.

In this paper, the proposedﬁlling methodology was applied to daily precipitation time series from January 1st, 1982 to December 31th, 2014, which were obtained from National Agency of Hydraulic Resources A.N.R.H and National Centers for Environmental Information (NOAA), https://www.ncdc.noaa.gov/cdo-web/. The dataset was aﬀected by 3.47% of MD during the entire time of the study. The amount missing was detailed for each month inFig. 2.

(8)

2.3. Testing of data distribution

As already indicated above the ultimate objective of this paper is to apply diﬀerent imputation methods forﬁlling gaps in rainfall time series, in this case it can only be achieved if the mechanism producing data loss is not NMAR (Presti et al., 2010). To exclude this hypothesis, it is possible to test from empirical knowledge that the rain-fall has a random distribution (MAR). To this aim, it was assumed that the loss of data was due to failure assessment, which is caused in particular by intense rainy Fig. 1.Geographical map of Soummam watershed borders, followed by its position on north of algerian (medallion map).

Fig. 2.Seasonal Percentage of missing data distribution per month over 32 year observation at the Bejaia Airport station and the related Standard entropy. *: The percentage of missing data for each month.

(9)

events. If this hypothesis is correct, the following statements should be veriﬁable: The amount of MD should be aﬀected by obvious seasonal behavior, which usually results in autumn and winter. A positive correlation between the amount of MD and the elevation of any station should exist; this means that rainfall tends to increase as function of altitude.

In this context, it easy to verify the randomness of MD distribution by using stan-dardized entropy, showed inEq. (12)(Shannon, 1948).

H¼

Pk

m¼1½ln pðmÞ*pðmÞ

lnðKÞ ð12Þ

where m represents month, k is the number of months in one year which equal 12 and p(m) is the percentage of MD in each month.

In our case, the standardized entropy coeﬃcient was applied on inter-monthly rain-fall data in all 32 years of study, to verify MD distribution during diﬀerent seasons.

Fig. 2shows that no relationship exists between the amount of MD and the

sea-sonality data distribution, showing that the greatest percentage of MD measured in summer is 10.2% but in winter it is only 7.3 %. On the other hand, the en-tropy value is evidently equal 0.98, it is close to 1, allowing us to reject theﬁrst hypothesis, therefore rejecting NMAR mechanism. To exclude any dependence of MD on rainfall measurements, MAR and MCAR mechanisms should be tested.

An important consideration should be made to choose between these last assump-tions, having to accept the influence of other variables on rainfall measurement, it can be reasonably affirmed that there are some influence also on instrumental failures that caused data losses. Accepting this last statement is equivalent to rejecting the MCAR mechanism. For this purpose (Rubin, 1976;Scheffer, 2002), point out that “MD is very rarely MCAR and they proved that daily rainfall distribution law is not Gaussian but gamma.

2.4. Station similarity

In order to determine the similarities between meteorological stations used to solve problem of MD observed in Bejaia-Airport weather station, the Spearman coeﬃcients and residual average were compared, by coupling the time series values of the target station with the nearby stations values. To determine the sim-ilarity in index values, the whole statistical series was considered by including the “dry” days; characterized by no rainfall. This choice was proposed by respecting the following considerations: MD can also include no rainy days, as it was hy-pothesized that they are missed at random. To use only rainy days could bias

(10)

the similarity index values as it would not allow for consideration of those events where a zero value was registered in the target station, whilst a rainfall event occurred in a neighboring station (Presti et al., 2010).

Data represented relating to set of stations by PCA graph (Fig. 3) for three years that were selected randomly, thisﬁgure gives information about criteria to choose similar climatological stations to that of Bejaia-Airport, in order toﬁll MD observed in this station.

Fig. 3A shows the PCA graph, as a function of two main components F1 and F2,

which are respectively coefficient of correlation (r) and mean absolute error (MAE). They are represented from low to strong, reading respectively; from left to right and from bottom to top. Thisfigure shows three clusters, the choice is mainly based on the degree of correlation, knowing that F1 gives 96% of information rate. The PCA graph (Fig. 3A) shows that Tichy and Taza are the two nearest stations to Bejaia-Airport, representing thefirst cluster, they are very similar to each other ac-cording to the information obtained by Dendrogram (Fig. 4). However, it is prefer-able to choose Tichi owing to the availability of data. It provides a strong correlation index (0.98) (Table 1).

Ziama, Tigzirt and Azeﬀoun stations belong to the same cluster, however Ziama and Tigzirt are more similar than Azeﬀoun (Fig. 4), and the latter also shows a strong correlation (Table 1).

Annaba and Boumerdes have the same degree of similarity, the both of them belong to the second cluster, shown inFig. 3A; on the other hand Dar Sghir is a station of the third cluster which gives a lower similarity compared to all stations (Fig. 4).

Fig. 3. B represents the PCA graph as a function of two principal components (F1

and F2), which are respectively Euclidean distance between similar stations and

Fig. 3.Principal Component Analysis (PCA) score plot show the similarity between nearby stations se-tof Bejaia, airport station.

(11)

Bejaia-Airport station and Altitude. Theﬁrst component gives an information rate equal to 74%, which represents more information, giving that the Euclidean distance for each station is obtained by using latitude and longitude. This graph shows the same clusters that were obtained inFig. 3A.

A relationship between the geographical positions of stations and the similarity of stations can be noticed. Such as, all stations of theﬁrst cluster have small Euclidean distances compared to others stations and altitude measurements between (0.9 and 1.1) which are near to Bejaia-Airport station geographical position that have an alti-tude equals 1m (Table 2). On the other hand, compared to the information obtained

fromFig. 3A, we can deduce that when Euclidean distance decreases, the correlation

index increases, as well as, when the altitude of the station increases or decreases the MAE error also increased.

We can confirm this information after interpreting the results of the second and third clusters. Boumerdes and Annaba are two stations belonging to different watersheds, that have the same Euclidean distance (Fig. 3B), and the correlation also gives the same information (Fig. 3A). However it is preferable to take Annaba station for filling gaps in data, because the MAE value of Boumerdes is larger than that of An-naba, we can also find the same meaning to this information using the altitude Fig. 4.Dendrogram plot show different clusters of nearest stations of Bejaia, airport weather station.

Table 1.Statistical parameters relating to weathers stations selected for ﬁlling missing data of Bejaia-Airport station.

Station Altitude Latitude Longitude Min Max X s r ER* R2

Tichy 1 36.67 5.16 0 33.0 4.37 7.71 0.98 0.93X1 0.97

Ziama 1 36.67 5.48 0 33.2 4.41 7.74 0.96 0.97X2þ0.44 0.94

Azeﬀoun 1 36.79 4.42 0 33.0 4.49 7.74 0.88 0.74X3þ0.75 0.77

Annaba 1 36.83 7.76 0 26.9 4.03 6.87 0.74 0.40X4þ1.78 0.55

*_{: Equation of Regression.}_{r : Correlation; X: Average, s: Standard deviation, R}2

coeﬃcient of determination.

(12)

variable (Fig. 3B), knowing that altitude of Annaba is closer to that of Bejaia-Airport compared to Boumerdes.

With these results we can also conﬁrm the previous hypothesis that rainfall is missing at random; this it means that it has a deterministic role on precipitation se-ries, which explain the relationship between geographic coordinates and correlation index.

Table 2.Daily rainfall datasets obtained in February 2005, used for Example 1. Day Original dataset Dataset with MD* Station similarity

Px1 _Px1 _Pn1 _Pn2 _Pn3 _Pn4 1 12.1 - 11.8 11.1 10.3 3 2 2.1 2.1 2.4 2.8 3.7 5.6 3 17.2 - - 17 17.4 0.5 4 3.1 3.1 3.4 3 3 4.7 5 4.1 - - 4 1 -6 0 0 0 0 0 3 7 0 0 1 0.5 0 1 8 0 0 0 1 3 0 9 0.5 0.5 0.5 0.5 0.4 13.9 10 0 0 0.2 2.8 0 5.9 11 4 - - 0.6 - 7 12 0 0 0.1 0.5 0 0.5 13 0 0 0.3 3.7 1.3 4.2 14 7.2 7.2 7.6 6.8 9.1 1 15 22.3 22.3 20 25.2 20.7 26.9 16 24.1 - - 23.4 - 16 17 24.1 24.1 24.1 23.8 24 13.9 18 19.2 19.2 16.7 19 4 5.5 19 6.2 - - - 5.8 -20 3.1 3.1 2.8 3.7 4 4.2 21 1 1 2.1 0.5 0 3 22 9 9 9.3 3.5 6.5 7.1 23 0.5 0.5 1.3 0.5 0.5 0 24 0 0 0 0.5 2.3 0 25 0 0 0 1 1 2.5 26 1 1 1 1.5 1 3 27 1 1 0.5 0 1 3 28 3.1 - - - - -*_{MD (Missing Data). x}1

: Bejaia airport station. n1: Tichy statio,n2: Ziama station,n3: Azeﬀoun station,n4: Annaba station.

(13)

The dendrogram plot shown inFig. 4gives information regarding the best choice of the nearest station, according to the amount of MD recorded at each station. The graph shows that the most similar stations are respectively (Tichy, Ziama), (Azef-foun, Tigzirt) and (Annaba, Boumerdes). But in data processing, one station for each cluster can be taken, depending on theﬁlling method used, in this case the following can be selected (Tichy, Ziama, Azeﬀoun and Annaba).

2.5. Algorithm description

The various methods used tofill the missing climate data series showed before were summarized in theflowchart of the new algorithm (Fig. 5). The proposed methodol-ogy was applied on spaceetime continuous data (i.e. rainfall time series referred to different monitoring stations belonging to the same network). The data was arranged in matrices with the following structure: each matrix represents one year of each climatological station, which has a row that it uniquely associated to the date of the observation and each column is uniquely associated to month.

Theﬁrst step of this process began with pretreatment of the dataset and tested to ﬁnd similarities between stations by using the geographical coordinates of each station

(Fig. 2) then to calculate the percentage of MD. The chosen method shown in the

second step forfilling MD is determined as a function of similarity indices between stations, proximity of values representing toleration between imputation values of similarity stations, in addition to the availability of neighboring stations. In the third step, the percentage of MD after every step of thefilling must be calculated, in order to check the rate of the gaps remaining after the second step, so that it can befilled either using a simple regression method (LR) or an arithmetic average (SAM). The processing is done iteratively until amount of MD is null.

Fig. 5.Flowchart summarizing theﬁlling daily rain fall dataset. Pretreatment of climate data bases (Step1),ﬁlling missingdata with the appropriate method (Step2), checking the percentage of missing data (Step3).

(14)

2.6. Example

Table 2 shows daily precipitation measurements at Bejaia-airport station during

February 2005. The eﬀectiveness of this new approach is shown by using 25% of MD that is chosen randomly throughout these series. In this table, stations noted Pn1, Pn2, Pn3and Pn4are respectively, Tichy, Ziama, Azeﬀoun and Annaba which are used as neighboring stations to process the MD in this example.Table 1shows statistic parameters of the neighboring stations and large correlation indexes can also be seen, between (0.74 and 0.98), meaning that the similarity between selected sta-tions and the reference station is already existent.

In this example MD are noted by’-’ and ’K’ means the number of neighboring sta-tions;’D’ is the toleration between the rainfall data imputed values of all similar sta-tions in the same time. The calculation steps are the following:

Step1(Find the K nearest stations):

According toTable 1and the result obtained in (Fig. 4), Tichy station can be used to fill gaps by hot deck method, the neighboring stations which are respectively, Ziama, Azeffoun and Annaba have been used for filling data by KNNI, WKNNI, LR and MI methods.

Step2(Filling gaps):

In the 1stday we have Pxis MD values and K is the number of similar stations that

equal 4, but in this case it is better to choose Tichy station (Table 1) forﬁlling gaps Pxby using Hot-deck method:

Px¼ Hot deck ðPn1Þ ¼ 11:8

For the 3rdday: Pxis missing and the number of similar stations K is 3, which are

Ziama, Azeﬀoun and Annaba. So, according toFig. 5andTable 1, we can choose between KNNI or WKNNI methods, this implies calculating the diﬀerence between imputed values (D):

DPn2; Pn3¼ ð17:4 17Þ*100=17:4 ¼ 2:29 < 10% DPn2; Pn4¼ ð17 0:5Þ*100=17 ¼ 97:05 > 10%

According to these results, it is better to use only Ziama (Pn2) and Azeﬀoun (Pn3) stations forﬁlling data by applying KNNI method:

Px¼ KNNI

Pn2; Pn3¼ ð17 þ 17:1Þ=2 ¼ 17:05

For 5thday: in this case, there are just two similar stations (Ziama and Azeﬀoun), so to calculate MD (Px), we need to choose between WKNNI and MI methods.

(15)

DPn3_{; Pn}4 _{¼ ð4 1Þ*100=4 ¼ 75 > 10%}

D Value is more than 10% andTable 1shows that the correlation index of rainfall values between “Ziama and Bejaia-Airport” and also between “Annaba and Be-jaia-Airport” are 96% and 88%, respectively. So it is preferable to apply WKNNI method: Px¼ WKNNI Pn3_{; Pn}4 _¼ _Pn3_*w 3þ Pn4*w4 ðw3þ w4Þ w3¼ 1=ðdn3Þ ¼ 1=ð58:4Þ ¼ 0:017 w4¼ 1=ðdn4Þ ¼ 1=ð122:2Þ ¼ 0:008 px¼ ð4*0:017 þ 1*0:008Þ=ð0:017 þ 0:008Þ ¼ 3:04

For 11thday: we have two similar stations that are Ziama and Annaba. DPn3; Pn4 ¼ ð7 0:6Þ*100=7 ¼ 91 > 10%

One can see inTable 1that correlation index, between Annaba and Bejaia-Airport station is 74%. In this case, according toFig. 5, we canﬁll the datum using MI method:

Px¼ MI

Pn2_{; Pn}4_{¼ AverageðImput1; Imput2Þ}

To use this method, we must calculate the regression equations, shown on (Table 1) that will be applied in each imputation, which are:

Y2¼ 0:97X2þ 0:44

Y4¼ 0:40X4þ 1:78

Imput1¼ 0:97*0:6 þ 0:44 ¼ 1:02 Imput2 ¼ 0:40*7 þ 1:78 ¼ 4:58 Px¼ MIð1:02; 4:58Þ ¼ 2:8

For 16thday: the number of similar stations K is two (Ziama and Annaba), they have correlation index of 96% and 74%, respectively (Table 1). In this caseD equal 31%, so we have to use only WKNNI (Fig. 5).

Px¼ WKNNI

Pn2; Pn4¼ 21:24

For 19th day: the percentage of MD obtained on Bejaia-Airport station series is 7.14%, so it canﬁll these gaps using LR method (Fig. 5):

(16)

Y3¼ 0:74X3þ 0:75

Px¼ LR

Pn3¼ 5:04

For 28thday: the percentage of MD in Bejaia-airport station series is 3.57%, it is less than 5% and it doesn’t have at least one similar station, so in this case we can apply only SAM toﬁll the gaps (Fig. 5):

Px¼ SAMðPxobsÞ ¼ 5:08

2.7. Experimental

In this section, experimental work was done to compare all methods used in the pro-posed technique ofﬁlling missing climate data. The validation criteria are applied on daily rainfall data for three years which were randomly selected (1997, 2003 and 2013). The calculations applied toﬁll lack of data in this part were obtained by the implementation of the proposed algorithm on the Delphi language platform (version XE2).

We have chosen four cases where data was missing, which are respectively (4%, 8%, 12% and16%) for showing the reliability of these methods respecting to the same conditions proposed by (Presti et al., 2010). To validateﬁlling data results, it is necessary to compare the obtained rainfall data with actual mea-surements of the same station using a coeﬃcient of determination R2 showed

in Eq. (13) and an adjusted coeﬃcient of determination R2Adj showed in Eq.

(14). The Error measurements must also be accepted to ensure their reliability, for this case we have used others parameters, such as root mean squared error

RMSE, mean absolute error MAE, which are given by Eqs. (15) and (16),

Knowing that the lower RMSE value gives the best imputation (Chai and

Draxler, 2014). R2_¼ PN i¼1 Xmodel Xobs PN i¼1 Xobs Xobs ð13Þ R2Adj ð1 R2_{ÞðN 1Þ} ðN K’ 1Þ ð14Þ RMSE¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Pn i¼1ðXobs XmodelÞ 2 n v u u t ð15Þ

(17)

MAE¼1 n

Xn i¼1

jXobs Xmodelj ð16Þ

Where‘n’ is the total number of observations, Xobsare the observed values and X mo-delare the modeled values. N is the number of points in our data sample. K0is the

number of independent repressors, i.e. the number of variables in our model excluding the constant.

3. Results & discussion

In this comparison, different cases of missing values amount were applied to justify the best use of thefilling techniques. The number of similar stations considered in this estimation is three and the nearest station, which are Ziama, Azeffoun, Annaba and Tichy respectively.

Fig. 6show Residual Analysis plots followed by linear regression graphs between

imputation results and real rainfall measurements for each ﬁlling method when data missed 4%, 8%, 12% and 16%. These graphs show that in the case of 4%, all methods are acceptable and this result conﬁrmed the proposition of Presti about the choice of methods (Presti et al., 2010), we can take into consideration the best choice according toTable.3which shows that Hot-deck, KNNI, WKNNI and MI methods perform better than LR and SAM.

When the global rate of MD is 8%, the residual values corresponding to SAM graph show that most of the values are negative,ﬂowing with a high ﬁt of data which is remarked on the regression plot (Fig. 6), this method is not usable in this case of study, as it noted a bad correlation (Table 3). The graph shown inFig. 7proves that Hot deck and KNNI methods perform the best. WKNNI and MI are similar, hav-ing a correlation index that equals 0.82 and 0.84 respectively, however the histogram graph represented byFig. 8shows that RMSE values obtained when using MI are higher compared to WINNI.

For a case where the missing data is 12%, the graph shown inFig. 6pointed out that the results obtained when LR method was used is less reliable in comparison with all previous cases.

Table.3shows that hot deck, KNNNI, WKNNI are the methods that perform the

most having a strong correlation which are more than 95%, on the other hand the ﬁlling by using MI method is still feasible and giving regression index of 81%.

In the last case, the results are given inTable 3. When the percentage of MD was 16%, suggesting that the hot-deck method always provides the best performance, based on both of RMSE and MAR parameters, it can also be found that KNNI and WKNNI provided good values of R2, which are respectively; 0.997 and 0.923.

(18)

However, the MI method shows a very high performance, compared to WKNNI (Table.3), as most of the imputation cases toﬁll the data gap in this study use the data observed at the Annaba station. The latter has a degree of correlation of 74%

(Table.1). The regression equations used under MI that were obtained by the

cor-relation study between the observed data of each station are respectively:

Yb¼ 1:001xZiþ 0:017 ð17Þ

Fig. 6.Residual curves obtained from linear regression graphs (medallion graph) between predicted and experimental values of rainfall data when the amount of missing data equal 4%, 8%, 12% and 16%. (EV) Experimental values, (PV) predicted values, (*): Missing Data.

(19)

Yb¼ 1:006xAzþ 0:113 ð18Þ

Yb¼ 0:621xAnþ 0:813 ð19Þ

Where Ybis daily rainfall data imputation in Bejaia-Airport station, xziis daily

rain-fall data observed in Ziama station, xazis daily rainfall data observed in Azeﬀoun

station, xAn is daily rainfall data observed in Annaba station.

A bad linear relationship and a great deviation of residual values which was observed between experimental and imputing values when LR and SAM was used (showing inFig. 6), demonstrated that it should be rejected.

After testing all cases of MD, thefilling final results explained that the hot-deck method outperforms in all slices filling (Fig. 6). The validation tests relative to KNNI and WKNNI shown inFig. 7andFig. 8, highlight that there is no signif-icant difference in relationship with the respecting amount of missing values. Table 3.Performance indicators for rainfall missing data when amount are equal to 4, 8, 12 and 16%. Determination coefficient (R2), adjusted determination co-efficient (R2adj), root mean square error (RMSE), mean absolute error MAE.

Hot-Deck KNNI WKNNI MI LR SAM

1st_{case: 4 % of MD}_* R2 _0.999 _0.930 _0.853 _0.924 _0.725 _0.562 R2Adj 0.999 0.930 0.852 0.924 0.723 0.56 RMSE 0.009 0.069 0.011 0.113 0.254 0.269 MAE 0.001 0.001 0.007 0.015 0.042 0.024 2ndcase: 8 % of MD* R2 _0.995 _0.918 _0.826 _0.847 _0.712 _0.341 R2 Adj 0.995 0.918 0.826 0.847 0.711 0.338 RMSE 0.013 0.137 0.045 0.241 0.396 0.709 MAE 0.002 0.004 0.018 0.04 0.086 0.039 3rd_{case: 12 % of MD}_* R2 _0.999 _0.994 _0.956 _0.817 _0.679 _0.261 R2Adj 0.999 0.994 0.956 0.816 0.678 0.258 RMSE 0.035 0.094 0.08 0.637 0.959 1.077 MAE 0.003 0.003 0.042 0.136 0.246 0.156 4thcase: 16 % of MD* R2 0.999 0.997 0.983 0.952 0.397 0.265 R2 Adj 0.999 0.997 0.983 0.952 0.394 0.246 RMSE 0.128 0.218 0.548 0.978 3.891 4.327 MAE 0.004 0.002 0.086 0.197 1.029 1.083 *_{MD (Missing Data).}

(20)

InTable 4we can see that in all cases where the distance is varying, the choice of K equal 1 gives the bestfilling results when WKNNI method is used. But when this distance is very large, the similarity indexes of neighboring stations are less than 75%, it can be seen that MI method performs better because the small variation on rainfall intensity can introduce significant changes in the runoff values generated from distributed rain-runoff models (Vieux, 2001). Considering the amounts of Fig. 7.Radar graph of the adjusted determination coefficientR2_Adjobtained from differentmissing rainfall data imputation methods with different amount of missing data at Bejaia airport station.

Fig. 8.Histogram of root mean square errorRMSE obtained from diﬀerent missing rainfall data imputa-tion methods with 4, 8, 12 and 16% of missing data at Bejaia airport staimputa-tion.

(21)

missing values, it is obvious that the results are quite similar to what is obtained for 5% of MD. When this percentage rises, the RMSE raises slightly, showing inFig. 7. All the errors of obtained measurements and the goodﬁt criterion point conclude that hot-deck, KNNI and WKNNI are better methods compared to regression methods such as MI and LR.

Fig. 9represents a simulation of RMSE results obtained over the same period of

study after toﬁll (4%, 8%, 12% and 16%) of MD by the proposed algorithm. In each case, RMSE was calculated 50 times randomly for all climatic seasons, in or-der to study density of error distribution relating to this technical applied on daily rainfall dataset. All graphs show that the distribution of this random variable fol-lows normal law, knowing that in this case the representation of RMSE values were done in class.

Table 4.Weighting coeﬃcients and adjustment quality parameters obtained from WKINNI and MI methods taking into account the neighboring stations and their separating distances (the amount of missing data is equal to 16%).

Stations similarity (Xi) EDðKmÞ Method K R2 R2Adj RMSE MAE

Ziama 41 WKNNI 1 0.9989 0.9989 0.0276 0.0427 Azeﬀoun 58.4 2 0.9986 0.9986 0.0312 0.0441 3 0.9983 0.9983 0.0343 0.0454 4 0.998 0.9980 0.037 0.0465 5 0.9978 0.9977 0.0392 0.0474 6 0.9975 0.9974 0.0409 0.0481 MI - 0.997 0.9969 0.0447 0.0774 Taza 42.5 WKNNI 1 0.8001 0.7955 0.3837 1.1604 Tigzirt 86.4 2 0.7772 0.7721 0.4305 1.2729 3 0.7599 0.7544 0.468 1.3750 4 0.7495 0.7437 0.4915 1.4378 5 0.7438 0.7379 0.5045 1.4784 6 0.7227 0.7163 0.435 1.3303 MI - 0.8117 0.8074 0.3658 0.0862 Annaba 122.2 WKNNI 1 0.6976 0.6906 0.4641 1.1433 Dar sghir 98.5 2 0.6859 0.6787 0.4746 1.1664 3 0.6723 0.6648 0.4858 1.2037 4 0.6576 0.6497 0.4972 1.2387 5 0.6272 0.6184 0.5083 1.2707 6 0.6272 0.6184 0.5189 1.2995 MI - 0.7435 0.7377 0.4139 0.8448

Euclidean distance between the neighboring station and Bejaia Airport station(ED), coefficient using to calculate weighting coefficient (K), determination coefficientðR2_{Þ, adjusted determination} coef-ficientðR2

(22)

When the missing data is 4%, the graph ofFig. 9A shows a symmetric distribu-tion recorded in autumn, spring and summer, which for each case gives an average (U ¼ 5) and median (med ¼ 5), the variability changed from weak to strong, respectively, from the driest to the wettest season. On the other hand, the ﬁlling data in winter graphically shows an asymmetric distribution (right skewed distribution) that gives positive errors justiﬁed by a variability index, which equal 0.038.

When this amount of MD increases to 8%, 12% or 16%, it is found that RMSE distribution is always symmetrical in summer (Fig. 9. B, C, D), only the variability index (Ϭi) increases relatively to the amount of MD; this means that the ﬁlling data in this season gives a large proportion of RMSE which always turns around the average. Fig. 9. B, C, and D show that the ﬁlling data in autumn and winter when the percentage of MD increases is marked by positive distribution, the vari-ability increases as function of this amount. Whereas the density of RMSE re-corded in spring is right skewed distribution only when the lack of data equals 12% and 16%.

After all of these cases, it can be deduced that theﬁlling of data using the proposed algorithm respects the seasonality of the same rainfall series. The density of errors is always positive, which increases relatively with the amount of MD and converges to small values.

Fig. 9.Simulation of RMSE density obtained by generating randomly 50 times of missing data of each season, using 4% (A), 8% (B), 12% (C) and 16% (D) of missing data.

(23)

4. Conclusion

Climate databases suﬀer from the problem of MD, however the best choice of each method is still on studying missingness mechanism and percentage of MD. The main contribution of this article is to propose a new hybrid technique ofﬁlling gaps for estimating missing climate databases more reliable, using climatological network data.

The work shows two main results: (I) explaining the best choice between several of methods, according to the results of comparison using statistical tests on different cases of MD amount; the results shows that the Hot-Deck, KNNI and WKNNI methods give the best filling. It does not depend on the percentage of data. It can be seen that when Tichy, Jijel and Azeffoun stations were chosen forfilling 4%, 8%, 12% and 16% of MD, the RMSEs values are always between 0.001 and 0.548. On the other hand, the choice of Annaba and Dar Sghir stations show that the MI method performs better than WKNNI, which have RMSEs of 0.4139 and 0.4641 respectively; these results depend on a degree of similarity be-tween stations, which is less than 75% (II) It shows the influence of altitude, lati-tude and longilati-tude on the correlation index and the residual average between neighboring stations and reference station. This variation affects the imputation re-sults obtained by MI, LR and SAM. The proposed filling methodology can be also applied to estimate data sets of other case studies when the missing distribu-tion is random.

5. Related work

We want to propose as a future work of this article, the implementation of ourﬁlling technique in form of new software that will serve not only climate data bases but we will try to include other methods to generalize the model ofﬁlling data. The algo-rithm will verify the reliability of the results and even the asymptotic complexity of the algorithm comparing to other software and scripts like (Amelia II script for R) and (MDI toolbox for Matlab) and also SPSS software.

Declarations

Author contribution statement

Amir Aieb: Performed the experiments; Analyzed and interpreted the data; Wrote the paper.

Khodir Mandani: Conceived and designed the experiments; Analyzed and inter-preted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.

(24)

Marco Scarpa; Brunella Bonaccorso: Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Khalef Lefsih: Conceived and designed the experiments; Analyzed and interpreted the data; Wrote the paper.

Funding statement

This research did not receive any speciﬁc grant from funding agencies in the public, commercial, or not-for-proﬁt sectors.

Competing interest statement

The authors declare no conﬂict of interest.

Additional information

No additional information is available for this paper.

Acknowledgements

We wish to thank the staﬀ of biomathematics, biophysics biochemistry and scien-tometry Laboratory of Algeria (BBBS) for their precious help about climatology and for giving us opportunity to use their subset data. We also thank the staﬀ of Mo-bile and Distributed Systems Laboratory in Italy (MDSLab) for their comments and advices on earlier drafts of the paper.

References

Adler, R.F., Huﬀman, G.J., Bolvin, D.T., Curtis, S., Nelkin, E.J., 2000. Tropical

rainfall distributions determined using TRMM combined with other satellite and

rain gauge information. J. Appl. Meteorol. 39 (12), 2007e2023.

Amiri, M., Jensen, R., 2016. Missing data imputation using fuzzy-rough methods.

Neurocomputing 205, 152e164.

Bardossy, A., Pegram, G., 2014. Inﬁlling missing precipitation recordseA

compar-ison of a new copula-based method with other techniques. J. Hydrol. 519,

1162e1170.

Batista, G.E., Monard, M.C., 2003. An analysis of four missing data treatment

methods for supervised learning. Appl. Artif. Intell. 17 (5-6), 519e533.

Chai, T., Draxler, R.R., 2014. Root mean square error (RMSE) or mean absolute

error (MAE)?eArguments against avoiding RMSE in the literature. Geosci. Model

(25)

Folch-Fortuny, A., Arteaga, F., Ferrer, A., 2015. PCA model building with missing data: new proposals and a comparative study. Chemometr. Intell. Lab. Syst. 146, 77e88.

Folguera, L., Zupan, J., Cicerone, D., Magallanes, J.F., 2015. Self-organizing maps for imputation of missing data in incomplete data matrices. Chemometr. Intell. Lab. Syst. 143, 146e151.

Gomez-Carracedo, M., Andrade, J., Lopez-Mahía, P., Muniategui, S., Prada, D.,

2014. A practical comparison of single and multiple imputation methods to handle complex missing data in air quality datasets. Chemometr. Intell. Lab. Syst. 134, 23e33.

Gyau-Boakye, P., Schultz, G., 1994. Filling gaps in runoﬀ time series in West

Af-rica. Hydrol. Sci. J. 39 (6), 621e636.

Honaker, J., King, G., Blackwell, M., 2011. Amelia II: a program for missing data. J. Stat. Software 45 (7), 1e47.

Hong, Y., Adler, R.F., Negri, A., Huﬀman, G.J., 2007. Flood and landslide

appli-cations of near real-time satellite rainfall products. Nat. Hazards 43 (2), 285e294. Jerez, J.M., et al., 2010. Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif. Intell. Med. 50 (2), 105e115.

Johnson, M., 2003. Lose Something? Ways to Find Your Missing Data. Houston Center for Quality of Care and Utilization. Studies Professional Development Se-ries: 17-09. http://www.hsrd.houston.med.va.gov/Documents/MJ%20Missing%

20Data%20PDS%20091703.ppt.

Junger, W., De Leon, A.P., 2015. Imputation of missing data in time series for air

pollutants. Atmos. Environ. 102, 96e104.

Khosravi, G., Nafarzadegan, A.R., Nohegar, A., Fathizadeh, H., Malekian, A.,

2015. A modiﬁed distance-weighted approach for ﬁlling annual precipitation

gaps: application to diﬀerent climates of Iran. Theor. Appl. Climatol. 119 (1-2),

33e42.

Li, J., Ruhe, G., Al-Emran, A., Richter, M.M., 2007. Aﬂexible method for software

eﬀort estimation by analogy. Empir. Softw Eng. 12 (1), 65e106.

Little, R.J., Rubin, D.B., 1987. Statistical Analysis with Missing Data. J. Auﬂ., New York et al.https://www.wiley.com/en-us.

(26)

Medlin, J.M., Kimball, S.K., Blackwell, K.G., 2007. Radar and rain gauge analysis

of the extreme rainfall during Hurricane Danny’s (1997) landfall. Mon. Weather

Rev. 135 (5), 1869e1888.

Nelson, P.R., Taylor, P.A., MacGregor, J.F., 1996. Missing data methods in PCA and PLS: score calculations with incomplete observations. Chemometr. Intell. Lab. Syst. 35 (1), 45e65.

Pettazzi, A., Salson, S., 2012. Combining radar and rain gauges rainfall estimates using conditional merging: a case study. In: The Seventh European Conference on Radar in Meteorology and Hydrology, MeteoGalicia. Galician Weather Service Santiago de Compostela, Spain, p. 5. http://www.meteo.fr/cic/meetings/2012/

ERAD/extended_abs/QPE_302_ext_abs.pdf.

Presti, R.L., Barca, E., Passarella, G., 2010. A methodology for treating missing data applied to daily rainfall data in the Candelaro River Basin (Italy). Environ.

Monit. Assess. 160 (1-4), 1.

Rahman, S.A., Huang, Y., Claassen, J., Heintzman, N., Kleinberg, S., 2015. Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data. J. Biomed. Inf. 58, 198e207.

Rubin, D., 1987. Multiple Imputations for Non Responses in Surveys, vol. 2.

Wi-ley, New York.

Rubin, D.B., 1976. Inference and missing data. Biometrika 63 (3), 581e592.

Schafer, J.L., 1997. Analysis of Incomplete Multivariate Data. Chapman and Hall/

CRC.

Scheﬀer, J., 2002. Dealing with missing data. Res. Lett. Inf. Math. Sci. 3, 153e160. http://hdl.handle.net/10179/4355.

Shannon, C.E., 1948. A mathematical theory of communication. Bell Sys. Tech. J. 27 (3), 379e423.

Shariﬁ, E., Steinacker, R., Saghaﬁan, B., 2016. Assessment of GPM-IMERG and

other precipitation products against gauge data under diﬀerent topographic and

cli-matic conditions in Iran: preliminary results. Rem. Sens. 8 (2), 135.

Song, Q., Shepperd, M., Chen, X., Liu, J., 2008. Can k-NN imputation improve the performance of C4. 5 with small software project data sets? A comparative

evalu-ation. J. Syst. Software 81 (12), 2361e2370.

Sovilj, D., et al., 2016. Extreme learning machine for missing data using multiple

(27)

Tang, J., Zhang, G., Wang, Y., Wang, H., Liu, F., 2015. A hybrid approach to inte-grate fuzzy C-means based imputation method with genetic algorithm for missing

traﬃc volume data estimation. Transport. Res. C Emerg. Technol. 51, 29e40.

Teegavarapu, R.S., Chandramouli, V., 2005. Improved weighting methods, deter-ministic and stochastic data-driven models for estimation of missing precipitation records. J. Hydrol. 312 (1-4), 191e206.

Troyanskaya, O., et al., 2001. Missing value estimation methods for DNA

micro-arrays. Bioinformatics 17 (6), 520e525.

Turki, I., et al., 2016. Hydrological variability of the Soummam watershed

(North-eastern Algeria) and the possible links to climateﬂuctuations. Arabian J. Geosci. 9

(6), 477.

Vieux, B.E., 2001. Distributed hydrologic modeling using GIS. In: Distributed

Hy-drologic Modeling Using GIS. Springer, pp. 1e17.

Vila, D.A., De Goncalves, L.G.G., Toll, D.L., Rozante, J.R., 2009. Statistical eval-uation of combined daily gauge observations and rainfall satellite estimates over

continental South America. J. Hydrometeorol. 10 (2), 533e543.

Wang, J.-H., Hopke, P.K., Hancewicz, T.M., Zhang, S.L., 2003. Application of

modiﬁed alternating least squares regression to spectroscopic image analysis.