Indices for calibration data selection of the rainfall runoff model


WATER RESOURCES RESEARCH, VOL. 46, doi:10.1029/2009WR008668, 2010

Jia Liu and Dawei Han

Received 20 September 2009; revised 18 November 2009; accepted 1 December 2009; published 27 April 2010.

[1] The identification of rainfall runoff models requires selection of appropriate data for model calibration. Traditionally, hydrologists use rules of thumb to select a certain period of hydrological data to calibrate the models (e.g., 6 years of data). There are no numerical indices to help hydrologists select the calibration data quantitatively. Two questions arise: how long should the calibration data be (e.g., 6 months), and from which period should the data be selected (e.g., which 6 month window)? In this study, indices for the selection of calibration data with adequate lengths and appropriate durations are proposed by examining the spectral properties of data sequences before the calibration work. With the validation data determined beforehand, we assume that the more similarity the calibration data set bears to the validation set, the better the performance of the rainfall runoff model should be after calibration. Three approaches are applied to reveal the similarity between the validation and calibration data sets: the flow duration curve, the Fourier transform, and wavelet analysis. Data sets used for calibration are generated by designing three scenario groups with fixed lengths of 6, 12, and 24 months, respectively, from 8 year continuous observations in the Brue catchment of the United Kingdom. Scenarios in each group have different starting times and thus various durations with specific hydrological characteristics. With a predetermined 18 month validation set and the probability distributed model chosen as the rainfall runoff model, useful indices are produced for certain scenario groups by all three approaches.
The information cost function, an entropy like function based on the decomposition results of the discrete wavelet transform, is found to be the most effective index for the calibration data selection. The study demonstrates that the information content of the calibration data is more important than the data length; thus 6 month data may provide more useful information than longer data series. This is important for hydrological modelers, since shorter but more informative data help hydrologists build models more efficiently and effectively. The idea presented in this paper has also shown potential for enhancing the efficiency of calibration data utilization, especially for data limited catchments.

Citation: Liu, J., and D. Han (2010), Indices for calibration data selection of the rainfall runoff model, Water Resour. Res., 46, doi:10.1029/2009WR008668.

Water and Environmental Management Research Centre, Department of Civil Engineering, University of Bristol, Bristol, UK.
Copyright 2010 by the American Geophysical Union.

1. Introduction

[2] Mathematical rainfall runoff models are powerful tools which have been increasingly used in solving practical water resources engineering problems, ranging from online flood forecasting to land use change evaluations and the design of hydraulic structures. The confidence in a rainfall runoff model depends on the model uncertainty remaining after calibration [Yapo et al., 1996]. Besides the automatic optimization related issues on which many researchers have focused during the past two decades, the appropriate selection of calibration data has recently been gaining more and more attention as a means of obtaining a robust and reliable calibration procedure. In general, modelers tend to use as large a data set as possible to obtain a calibration data set representative of the various phenomena experienced by the watershed.
However, it is not the length of the data but the quality of the information contained in it that matters more in deciding the calibrated model performance, and the use of additional data beyond a certain amount will only marginally improve the parameter estimates [Sorooshian et al., 1983]. Gupta and Sorooshian [1985a, 1985b] provided a theoretical analysis indicating that data sequences containing greater hydrologic variability are more likely to result in reliable parameter estimates and thus enhance the performance of the calibrated model.

[3] Many researchers have focused on searching for the most adequate calibration data length and confirmed that using longer data for calibration does not necessarily produce better model performance. Different data lengths, ranging from 3 months to 10 years, were recommended for calibration with different models and optimization methods [Harlin, 1991; Yapo et al., 1996; Gan and Biftu, 1996; Gan et al., 1997; Anctil et al., 2004; Brath et al., 2004; Butts et al., 2004; Xia et al., 2004; Boughton, 2006; Perrin et al., 2007]. As early as the beginning of the 1990s, Harlin [1991] developed a process oriented calibration scheme for the automatic calibration of the Hydrologiska Byråns Vattenbalansavdelning model, and a calibration length between 2 and 6 years was found to be sufficient for optimal parameters in the test basins. Later, important contributions were made by Yapo et al. [1996] and Gan et al. [1997], both using the shuffled complex evolution algorithm for the automatic calibration of lumped models operated at a daily time scale. Their results suggested minimal period lengths of 8 years and 1 year, respectively, of continuous daily data to obtain reliable calibrations that are relatively insensitive to the period selected. Brath et al. [2004] found that there is also an optimal length of calibration data for spatially distributed hydrological models and showed how reducing the length of the calibration period below 3 months in their case study significantly degraded the model performances.

[4] However, the conclusions about appropriate calibration lengths from those studies all depended on the characteristics of the case studies and the types of rainfall runoff models used. The increasing attention given to the selection of the most representative calibration data of adequate length has also put a heavy burden on the calibration work, and this problem will worsen as more observed data are collected by modern telemetry systems. There is a lack of a simple but effective approach for selecting the proper data for calibration. Is it possible that the most appropriate set of calibration data could be decided before the calibration work takes place? Besides the model performance, which is unknown until the completion of the whole calibration procedure, are there other criteria indicating the right selection of the calibration data?
Assuming that the data used for model validation are ascertained, the problem can be simplified to finding criteria that evaluate the similarity between the validation set and different calibration data sets. We can assume that this similarity is in accordance with the model performance after calibration, which means that the more similar the calibration set is to the validation set, the better the performance of the model calibrated with that set should be. The main purpose of this study is to search for simple indices representing the similarity between the calibration and validation data and then to verify the consistency of the similarity shown by the indices with the model performance after calibration.

[5] In this study, the discrete wavelet transform (DWT) is applied, and an entropy like indicator named the information cost function (ICF) is constructed based on the wavelet analysis to evaluate the spectral characteristics and the similarity between the validation and calibration data sets using the observed flow data. Before the DWT, two basic approaches, the flow duration curve (FDC) and the fast Fourier transform (FFT), are adopted to investigate their potential for representing the hydrological and spectral similarity between the validation and calibration data. All the results are verified by the model performances after the calibration of a conceptual rainfall runoff model, the probability distributed model (PDM). In order to eliminate the differences caused by using different automatic calibration methods, three optimization algorithms, particle swarm optimization (PSO), the genetic algorithm (GA), and sequential quadratic programming (SQP), are used to generate averaged and stable calibration results. An 8 year rainfall runoff data set is split into three scenario groups, with the length of each scenario being 6, 12, and 24 months, respectively.
All the analyses performed with the FDC, FFT, and DWT explore whether the similarities identified by the chosen indices are consistent with the model performances obtained after calibration with the different calibration data sets in the three scenario groups.

2. Methodology

2.1. Flow Duration Curve and Fourier Transform

[6] A flow duration curve provides the percentage of time (duration) a flow with a certain time interval is exceeded over a historical period for a particular river basin [Vogel and Fennessey, 1994]. It may also be viewed as the complement of the cumulative distribution function of the considered flows [LeBoutillier and Waylen, 1993]. The empirical FDC can easily be constructed from streamflow observations using the standardized nonparametric procedures described by Vogel and Fennessey [1994].

[7] The Fourier transform is a frequency domain technique that decomposes a periodic signal into a linear superposition of sinusoids of different frequencies [Newland, 1993]. For discrete data (although streamflow is a continuous signal, the digital recording device measures the flow at a prefixed time interval, so the resultant data are in discrete format), the discrete Fourier transform (DFT) is often applied [Robin et al., 1993]. Two indices commonly used to visualize and analyze the results of the DFT are the absolute amplitude, A_{p+1}, and the power, P_{p+1}:

A_{p+1} = |X_{p+1}| / n,  (1)

P_{p+1} = |X_{p+1}|^2 / n,  (2)

where X_{p+1} is the transformed Fourier series with a length of n and the index p runs from 0 to n − 1. Cooley and Tukey [1965] proposed the fast Fourier transform (FFT), which can compute the DFT more efficiently, with complexity O(n log n) instead of O(n^2).
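As a minimal illustration, the amplitude and power indices above can be computed directly from the FFT output. The sketch below assumes NumPy; the function name is ours, not from the paper:

```python
import numpy as np

def fft_amplitude_power(x):
    """Absolute amplitude and power of a discrete signal from its DFT."""
    n = len(x)
    X = np.fft.fft(x)             # transformed Fourier series, p = 0 .. n-1
    amplitude = np.abs(X) / n     # A = |X| / n
    power = np.abs(X) ** 2 / n    # P = |X|^2 / n
    return amplitude, power
```

By Parseval's theorem, the total power summed over all frequency components equals the sum of the squared signal values, which is the total power quantity later compared between the validation and calibration sets.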
In this paper, the amplitude and total power of the Fourier series after the FFT, together with the flow duration curves, are applied first for the comparison between the validation and calibration data sets.

2.2. Wavelet Analysis and the Information Cost Function (ICF)

[8] The wavelet transform is a strong mathematical tool that provides a time frequency representation of an analyzed signal [Daubechies, 1990; Polikar, 1999]. It is a more efficient approach than the Fourier transform for studying nonstationary time series. In recent years, there has been increasing interest in the use of wavelet analysis in a wide range of fields in water resources and meteorology. Besides its successful application in the characterization and periodic analysis of climatic and hydrological data [Smith et al., 1998; Torrence and Compo, 1998; Park and Mann, 2000; Penalba and Vargas, 2004; Partal and Kahya, 2006], wavelet analysis is also a powerful tool for determining the relationships between different climatic or hydrological elements through analyzing and synthesizing their variable structures in the frequency domain [Nakken, 1999; Drago and Boxall, 2002; Taleb and Druyan, 2003; Kulkarni, 2000; Li et al., 2009]. Results from recent studies have demonstrated the feasibility of wavelet analysis in locating the irregularly distributed multiscale features of hydrometeorological data and in quantitatively correlating different observation series through their wavelet based expressions.

[9] Basic ideas about wavelets and how the different wavelet transforms (including both the continuous wavelet transform (CWT) and the DWT) are performed can be found in the work of Meyer [1993]. An efficient way to implement the DWT was devised by Mallat [1989] as the Mallat decomposition algorithm, which utilizes a number of successive filtering steps with which the original signal f can be decomposed into a series of approximations and details as follows:

S^0_n = f[n], n ∈ N,  (3)

S^j_k = Σ_{n=0}^{L−1} h[n] S^{j−1}_{n+2k}, j = 1, 2, …, J,  (4)

C^j_k = Σ_{n=0}^{L−1} g[n] S^{j−1}_{n+2k}, j = 1, 2, …, J,  (5)

where S^j_k and C^j_k represent the approximation and detail coefficients, respectively, N is the total number of data points in the signal f, h[n] and g[n] are the impulse responses of the low pass filter H and the high pass filter G, respectively, L is the number of nonzero impulse responses in h[n] and g[n], and J is the maximum possible scale of the Mallat decomposition algorithm, with J ≤ [log_2(N/L)] + 1 [Li et al., 1997]. The original signal f is first decomposed into an approximation and an accompanying detail. The approximation coefficients S^j_k are obtained by convolving the signal with the decomposition low pass filter H, while the detail coefficients C^j_k are obtained with the high pass filter G. The decomposition process is then iterated, with successive approximations being decomposed in turn, so that the original signal is broken down into many lower resolution components.
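A compact sketch of the Mallat decomposition above, together with the level energies and the entropy of their distribution used for the ICF, is given below. For brevity it uses the simple Haar filter pair (the decomposition applied later in this paper uses a Daubechies wavelet of order 10); the function names are ours:

```python
import numpy as np

def haar_dwt_energies(f, J):
    """Mallat decomposition with the Haar filter pair, returning the detail
    energies E_j = sum_k (C^j_k)^2 for j = 1..J and the final approximation energy."""
    h = np.array([1.0, 1.0]) / np.sqrt(2)   # low-pass filter H
    g = np.array([1.0, -1.0]) / np.sqrt(2)  # high-pass filter G
    s = np.asarray(f, dtype=float)
    energies = []
    for _ in range(J):
        # convolve with the reversed filters and downsample by 2
        c = np.convolve(s, g[::-1])[1::2]   # detail coefficients at this level
        s = np.convolve(s, h[::-1])[1::2]   # approximation passed to next level
        energies.append(np.sum(c ** 2))
    return np.array(energies), np.sum(s ** 2)

def icf(energies):
    """Entropy of the energy distribution across levels (information cost function)."""
    p = energies / energies.sum()  # P_j = E_j / E_tot
    p = p[p > 0]                   # terms with P_j = 0 contribute zero
    return -np.sum(p * np.log(p))
```

Because the Haar filters form an orthonormal pair, the detail energies plus the final approximation energy reproduce the total energy of the input signal, which provides a quick sanity check on the decomposition.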
As a result, the approximations are the high scale, low frequency components of the signal, while the details are the low scale, high frequency components.

[10] With the wavelet coefficients C^j_k and S^j_k, the sum E_j = Σ_k (C^j_k)^2 (or E_j = Σ_k (S^j_k)^2) gives the energy of the details (or approximations) of the signal f at level j. If the total energy is denoted as E_tot = Σ_j E_j, the corresponding percentile energy at level j is

P_j = E_j / E_tot.  (6)

The level j is associated with a frequency band ΔF obtained in the following way:

2^{−(j+1)} F_s ≤ F ≤ 2^{−j} F_s,  (7)

where F_s is the sampling frequency and j = 1, 2, …, J.

[11] The sequence P_j gives the probability distribution of the energy over the levels j. This distribution has a Shannon entropy that is defined as the information cost function [Blanco et al., 1998], which essentially measures the order inside the system:

ICF = −Σ_j P_j ln P_j,  (8)

where any term with P_j = 0 is interpreted as zero. The ICF is an entropy like function that is easy to calculate and gives a good estimate of the degree of disorder of a system [Figliola and Serrano, 1997]. In this study, the ICF and the energy distribution described by the total energy of the wavelet coefficients of both details and approximations are regarded as indicators of similarity in the comparisons between the validation and calibration sets in the frequency domain.

3. Rainfall Runoff Model, Optimization Methods, and Data Description

3.1. Probability Distributed Model

[12] The rainfall runoff model used in this study is the PDM developed by Moore [1985]. The PDM has been widely applied in various catchments in the United Kingdom and can be viewed as representative of the conceptual saturation excess hydrological models used for runoff simulation in humid and semihumid regions. It is developed based on the scheme of the Xinanjiang model, with a soil moisture storage capacity that varies over the catchment, described by a simple Pareto distribution.
There are 13 parameters in the PDM to be calibrated, including f_c (the rainfall factor), c_min and c_max (the minimum and maximum storage capacities, respectively), b (exponent of the Pareto distribution controlling the spatial variability of the store capacity), b_e (exponent in the actual evaporation function), k_g (groundwater recharge time constant), b_g (exponent of the recharge function), S_t (soil tension storage capacity), k_1 and k_2 (time constants of a cascade of two linear reservoirs in the surface routing system), k_b (base flow time constant of the groundwater routing system), q_c (constant flow representing returns or abstractions), and t_d (time delay). Moore [2007] gives an extensive description of the parameters and the model structure. The design of the surface and groundwater storage routing models can be found in the work of O'Connor [1982] and Dooge [1973].

3.2. Calibration Methods

[13] The calibration of a rainfall runoff model is normally performed either manually or using computer based automatic procedures. An integration of automatic optimization with visually interactive parameter estimation is recommended for the PDM [Moore, 2007]. Because of the number of calibration sets in this study, and in order to reduce subjective decisions in the calibration work, automatic calibration of all 13 parameters is chosen here.

[14] Recent research into global search methods has led to the use of population evolution based optimization algorithms [Gupta et al., 1998] such as the GA [Wang, 1991, 1997], the shuffled complex evolution (SCE) algorithm [Duan et al., 1992, 1994], and simulated annealing (SA) [Sumner et al., 1997], which have proven to be both effective and relatively efficient in dealing with water resources systems [Zakermoshfegh et al., 2008]. Nowadays a newer evolutionary technique, PSO, has gained much attention and wide application in different fields [Eberhart and Shi, 2001]. It is a population based stochastic optimization technique developed in 1995, inspired by the simulation of social behavior [Eberhart and Kennedy, 1995]. PSO shares many similarities with the population evolution based techniques, especially the genetic algorithms, but it has shown many attractive characteristics, such as a simple concept, easy implementation, and quick convergence [Liu et al., 2005]. Recently the PSO method has been successfully used in many parameter optimization cases of rainfall runoff models [Chau, 2006, 2007; Gill et al., 2006; Goswami and O'Connor, 2007; Reddy and Kumar, 2007; Zakermoshfegh et al., 2008].

[15] For the various calibration data scenarios in this study, although single optimization methods could provide similar calibration results, they are not stable for all the cases, and sometimes the results of certain scenarios can be quite different. In order to exclude the influence of the choice of optimization method on the calibration results and to emphasize the selection of the calibration data, a more stable combined approach, in which the PSO method is used together with the genetic algorithm and another nonlinear optimization algorithm, SQP [Biggs, 1975; Han, 1977; Powell, 1978a, 1978b], is chosen to perform the automatic calibration of the PDM. It takes more computation time than a single optimization approach, but the improved stability helps to derive more reliable results.
The objective function is chosen as the Nash Sutcliffe efficiency coefficient (NSE) [Nash and Sutcliffe, 1970].

3.3. Catchment and Data Description

[16] Data used in this study are from the Brue catchment, which is located in Somerset, United Kingdom ( N, 2.58 W), with a drainage area of 135 km^2. It is a predominantly rural catchment of modest relief, with spring fed headwaters rising in the Mendip Hills and Salisbury Plain. The rain gauge network consists of 49 Casella 0.2 mm tipping bucket rain gauges. An automatic weather station and an automatic soil water station are located in the catchment and record the global solar radiation, net radiation, and other weather parameters, such as wind speed, wet and dry bulb temperatures, and atmospheric pressure, at hourly intervals.

[17] The Natural Environment Research Council funded the Hydrological Radar Experiment project, which ran from May 1993 to April 1997 in the Brue catchment (its data collection was extended to 2000). Eight years of 15 min rainfall runoff data obtained from this project are used in this study. Because there was a gap due to a failure of data collection from July to November 1998 during the project, that gap is taken as the division between the calibration and validation data sets. Observed data before the gap are used for calibration. Starting at the same time, three sets of calibration data are first made with lengths of 6, 12, and 24 months. Shifted by a 1 month sliding window, the three sets then form three scenario groups of calibration sets with respective data lengths of 6, 12, and 24 months. This results in a total of 135 calibration sets, that is, 53 sets in the 6 month scenario group, 47 sets in the 12 month group, and 35 sets in the 24 month group. In the Brue catchment, the wet period normally lasts from November to the following April and the dry period from May to October, which divides the year into two 6 month periods.
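The sliding window construction of the scenario groups can be sketched as follows. The helper below is hypothetical (not from the paper); a pre-gap record of 58 months is implied by the reported group sizes:

```python
def calibration_windows(n_months, lengths=(6, 12, 24)):
    """Enumerate 1-month-shifted calibration windows of each fixed length
    over a record of n_months, as (start_month, end_month) index pairs."""
    return {L: [(start, start + L) for start in range(n_months - L + 1)]
            for L in lengths}

windows = calibration_windows(58)  # 53 + 47 + 35 = 135 scenarios
```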
A period of 12 months covers a year with four seasons, which also fits the concept of the hydrological year. That is why the period of 6 months and its two integral multiples, 12 and 24 months, are chosen in this study as the lengths of the three groups of calibration scenarios. The remaining 18 months of observed data after the gap are used for validation. It should be noted that although the selection of validation data is also of great importance to the evaluation of the calibrated model, in order to fully investigate how the starting time and duration of the calibration data influence the calibration results, the validation data set is fixed in this study, and the validation results are considered the evaluation criteria for the performances of the models calibrated using the different calibration scenarios. Figure 1 shows the hydrographs and the rainfall variations of the validation and calibration data sets in the three scenario groups. Daily potential evaporation data are obtained from the Met Office Surface Exchange Scheme and are split into 15 min data in accordance with the rainfall runoff data before being processed in the PDM.

4. Results

4.1. Model Performances of Different Calibration Scenarios

[18] Calibration runs are conducted for the 135 calibration scenarios with fixed lengths of 6, 12, and 24 months and various starting times. For each scenario, three optimization runs are performed using the three algorithms PSO, GA, and SQP. The optimization results of the three algorithms are very similar, and to smooth out the fluctuating outcomes of the individual algorithms, the average results are adopted for analysis. All the calibrated models of the 135 scenarios are then validated against the 18 month validation data set.
Besides the NSE, several other statistics are explored to evaluate the model performance based on both the validation and calibration results, including the root mean square error, the mean absolute error, the mean bias error, and the correlation coefficient. Because all the evaluation statistics give consistent results, the NSE is chosen as the only assessment of model performance in the following sections of analysis. Figure 2 shows the changes in model performance due to variations of the calibration data with different starting times and durations.

[19] By comparing the three series of model performances of the 6, 12, and 24 month scenario groups in Figures 2a and 2b, it can be noted that although the validation results are slightly poorer than the calibration results, similar trends exist in the two series. Better calibrated models produce better validation results, while poor model performance is often caused by poor calibration scenarios. This tendency is most obvious for the 6 month group. The overfitting phenomenon that normally happens with numerical models cannot be found here. This may

Figure 1. Rainfall and runoff of the validation and calibration data sets for the three scenario groups, where the x axis values are the indices and starting times of the calibration sets, with the interval representing 1 month. For example, calibration set 25 of the 6 month group can be found in Figure 1b with x values ranging from 25 to 31, and calibration set 25 in the 12 and 24 month groups corresponds to the sections with x values ranging from 25 to 37 and from 25 to 49, respectively.

Figure 2. Average results of (a) calibration and (b) validation using the three optimization methods. The x axis values are the indices of the scenarios in the 6, 12, and 24 month groups, and the y axis values are the model performance of each calibration scenario indicated by the Nash Sutcliffe efficiency coefficient (NSE).

be because the largest calibration set length in this study (24 months) is still appropriate for calibrating the PDM.

[20] The empirical cumulative distribution functions (CDFs) of the NSE statistic representing the model performance are constructed for both the calibration and validation results (Figure 3). The CDF of each scenario group indicates the chance of obtaining an NSE of magnitude less than a specific value if a calibration data set in that group is selected at random. In both Figures 3a and 3b, the CDFs become less steep and wider ranging as we progress from the 24 month group to the 6 month group. Increasing steepness indicates a reducing sensitivity of model performance to the selection of the calibration data set [Yapo et al., 1996], which means that the 12 and 24 month scenario groups can produce more stable model performances than the 6 month group. Examining the NSE statistics of the three groups in Table 1 gives similar results. The validation results of the 12 and 24 month groups have relatively higher average NSE values, although the 6 month group can achieve better model performance after calibration for some data selections. Both the maximum and minimum values of NSE are yielded by scenarios in the 6 month group, and as the length increases, the range of NSE clearly shrinks, with a decreasing standard deviation.

[21] From another viewpoint, although in general the 6 month group gives less stable results, it does perform better in some cases than the 12 and 24 month groups.
If the underlying relationship between the model performance and the starting time and duration of the calibration data can be found, we can pick out the best scenario (or at least the better ones) easily, without the trade off between data length and stable model performance.

[22] A further insight into the model performances in Figure 2, together with the hydrographs in Figure 1, helps to reveal the reason for the several lowest NSE values in the 6 month group.

Figure 3. Empirical cumulative distribution functions (CDFs) of the NSE of (a) calibration and (b) validation results of the 6, 12, and 24 month groups.

It can easily be noticed that in those scenarios, most of the calibration periods are occupied by the dry months, which normally occur during May to October in the study catchment and whose duration is no more than 6 months. Typical examples are scenarios 8, 20, and 33 in the 6 month group. That is easy to understand and can be avoided by direct experience when choosing the calibration data. In contrast, for the 12 and 24 month groups, it is tricky to identify the relatively poor scenarios before modeling, because dry months take up at least half of the whole period for all the scenarios. At the same time, the good scenarios in all three groups cannot easily be found by a simple visualization of the hydrographs beforehand. However, from the case of the 6 month group, we can assume that the calibration data sets of the good scenarios may have a higher similarity with the validation data, while those of the poor ones have the least similarity. In the following sections, the exploration of the flow similarity between the calibration and validation data sets is carried out using the flow duration curve and two spectral analysis tools, the Fourier transform and the wavelet analysis.

4.2. Similarity Identified by the Flow Duration Curve

[23] The flow duration curves of the validation and calibration sets of all 135 scenarios are constructed using daily observed flow data. We can assume that, when plotted together, the calibration curves of scenarios with good model performances should be close to the validation curve, while scenarios with poor model performances have relatively distant calibration curves. In order to quantify the similarity between the validation and different calibration curves, the Nash Sutcliffe efficiency coefficient is calculated based on the data series of the validation and calibration curves.
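The two ingredients of this comparison, an empirical flow duration curve and the NSE between two curves evaluated on a common exceedance probability grid, can be sketched as below. The function names and the Weibull plotting positions are our choices, not specified by the paper:

```python
import numpy as np

def flow_duration_curve(q, probs):
    """Empirical FDC: flow exceeded with each probability in probs."""
    q_sorted = np.sort(q)[::-1]                        # flows in descending order
    exceed = np.arange(1, len(q) + 1) / (len(q) + 1)   # Weibull plotting positions
    return np.interp(probs, exceed, q_sorted)

def nse(obs, sim):
    """Nash-Sutcliffe efficiency between two series."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

# similarity of a calibration FDC to the validation FDC on a common grid:
# probs = np.linspace(0.01, 0.99, 99)
# fdc_similarity = nse(flow_duration_curve(q_val, probs),
#                      flow_duration_curve(q_cal, probs))
```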
Figure 4 plots these NSE values against the corresponding model performances. The fitted regression line in Figure 4a shows a high correlation between the flow duration curve similarity and the model performance in the 6 month group, which means that the more similar the calibration set is to the validation set in terms of the flow duration curve, the better the model performance it can produce. Unfortunately, the regression lines for the 24 month group (Figure 4b) are nearly flat, and the results are not as clear as for the 6 month group, so they can hardly reflect the same correlation between the curve similarity and the model performance. The results of the 12 month group show an average tendency between the 6 and 24 month groups; they are not displayed here to keep the paper concise.

Figure 4. Relationship between the model performance and the similarity of the flow duration curves of the validation and calibration data sets in the 6 and 24 month groups.

4.3. Flow Similarity in the Frequency Domain Using the Fourier Transform

[24] The flow duration curve works quite well for the 6 month group but less effectively for the 12 and 24 month groups. In this section, the Fourier transform is explored to check whether a better indicator revealing the relationship between the model performance and the similarity of the calibration and validation data can be found.

[25] The Fourier transform can help to study the spectral characteristics of a signal in the frequency domain. For feasible comparisons of signals with different data lengths, before being transformed from the time domain to the frequency domain, the 135 calibration sets in the three scenario groups together with the validation set are replicated different numbers of times to generate new validation and calibration data sets of the same length, calculated as their least common multiple (this is also for the convenience of the fast Fourier transform computations). Replication does not change the signal amplitude after transforming, so the spectral characteristics of the signal remain the same. The scatterplots showing the total power of each calibration set in the 6 and 24 month groups against the model performances are displayed in Figure 5. The vertical lines indicate the value of the total power of the validation set on the x axis. The total power reflects the amount of energy contained in a signal, obtained by adding together the powers at each frequency component in the Fourier series. Again, we assume that the closer the total power of a calibration set is to that of the validation set, the more similar the two sets are in the frequency domain and thus the better the calibrated model performs using that calibration data set. A consistent result for the 6 month group can be found in Figure 5a. Although a majority of points gather in the middle with a wide range of model performances, the tendency is quite clear at the left and right ends of the scatter. For the results of the 24 month group shown in Figure 5b, the tendencies are not as clear as for the 6 month group. The 12 month group gives an average performance between those of the 6 and 24 month groups, whose tendency is also too weak to identify the similarity.

Table 1. Nash Sutcliffe Efficiency Coefficient Statistics of the Calibration and Validation Results for the Comparison of the Average Model Performances Produced by the 6, 12, and 24 Month Groups (for each group, both the calibration and validation results are summarized by the average value, standard deviation, maximum (MAX), upper quartile (P75), median (P50), lower quartile (P25), and minimum (MIN)).
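The replication step described above, tiling two series to the least common multiple of their lengths so their FFT spectra can be compared bin for bin, can be sketched as follows (the function name is ours):

```python
import numpy as np
from math import lcm

def replicate_to_common_length(a, b):
    """Tile two series to their least-common-multiple length so that their
    Fourier spectra share a common frequency grid."""
    n = lcm(len(a), len(b))
    return np.tile(a, n // len(a)), np.tile(b, n // len(b))
```

Tiling a series r times multiplies both the nonzero DFT magnitudes and the length n by r, so the amplitude A = |X| / n at the corresponding frequencies is unchanged, which is why replication preserves the spectral characteristics being compared.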
[Figure 5. Relationship between the model performance and the total power of the calibration data sets in the 6 and 24 month groups after the Fourier transform.]

[26] The amplitude of the Fourier series after transformation can also be compared between the validation and calibration data sets to further explore the spectral similarities and how they relate to the model performances. The comparison results are shown in Figure 6. As with the flow duration curve, the similarity of the amplitude is evaluated by the Nash Sutcliffe efficiency coefficient between the validation and the calibration data sets. To eliminate data noise in the high frequency domain, all the transformed Fourier series are processed with the moving average method using a window size of 50. The results are similar to those using the flow duration curve and the total power: the regression lines show a good correlation between the model performance and the amplitude similarity for the 6 month group in Figure 6a but nearly random results for both the 24 month group (Figure 6b) and the 12 month group.

[Figure 6. Relationship between the model performance and the similarity of the amplitude of the validation and calibration data sets in the 6 and 24 month groups after the Fourier transform.]

4.4. Flow Similarity Described by Wavelet Analysis and ICF

[27] The relationship between the model performance and the spectrum similarity of the validation and calibration data sets can be investigated further by means of the DWT, which gives a more detailed subdivision of the frequency domain. The DWT in this paper is carried out by decomposing the validation and calibration sets into six levels of details (d1 d6) and approximations (a1 a6), with a basic Daubechies wavelet of order 10 chosen for the decomposition. More details about the Daubechies wavelets can be found in the work of Daubechies [1990].

[28] The details, containing the high frequency information, represent the flavor and nuance of a signal; they are regarded as more important than the approximations and are thus more frequently used in wavelet analysis when comparing signals. In contrast, the approximations are the low frequency components, giving the identity of a signal: as the wavelet decomposition proceeds, the approximation becomes an increasingly abstract representation of the original signal.
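Before turning to the wavelet energies, the amplitude comparison of section 4.3 can be sketched as follows (an illustration with simulated, equal-length records, not the paper's code): the one-sided Fourier magnitudes are smoothed with the 50-point moving average mentioned above and then scored with the Nash Sutcliffe efficiency.

```python
import numpy as np

def smoothed_amplitude(x, window=50):
    """One-sided amplitude spectrum of x, smoothed with a moving average
    (window size 50, as in the paper) to suppress high-frequency noise."""
    amp = np.abs(np.fft.rfft(np.asarray(x, dtype=float)))
    kernel = np.ones(window) / window
    return np.convolve(amp, kernel, mode="valid")

def nse(obs, sim):
    """Nash-Sutcliffe efficiency between two series (1 = perfect match)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

# Hypothetical records, already replicated to a common length
# (see the least common multiple step described in section 4.3)
rng = np.random.default_rng(3)
n = 1080
t = np.arange(n)
validation = 10 + 5 * np.sin(2 * np.pi * t / 360) + rng.normal(0, 1, n)
calibration = 10 + 4 * np.sin(2 * np.pi * t / 360) + rng.normal(0, 1, n)

similarity = nse(smoothed_amplitude(validation), smoothed_amplitude(calibration))
```

As with the flow duration curve, a similarity closer to one suggests a more promising calibration period.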
In this study, both the total energy and the energy distribution of the details and approximations on different decomposition levels are examined in order to find a better representation of the spectrum similarity between the validation and calibration sets.

[29] The total energy of the wavelet coefficients on a given decomposition level is the amount of energy distributed in the corresponding frequency band, as described by equation (7). Because the approximation is an abstract of the original signal, the total energies of the approximations on all six levels give almost the same results as the FFT; therefore, only the total energies based on the details are presented here. The total energies of the details on different decomposition levels for the calibration sets in the 6 and 24 month groups are plotted against the corresponding model performances in Figures 7 and 8. The 12 month group shows trends similar to those of the 24 month group, so only the results of the 24 month group are presented. The vertical lines indicate the total energy of the validation data set on each of the six levels. For all three groups with different calibration data lengths, the results are highly consistent with the assumption made in section 4.3 on the Fourier analysis, namely that the closer the total energy of the calibration data set is to that of the validation set, the better the performance of the calibrated model. The assumption is verified particularly well by detail d5 (Figures 7e and 8e) on decomposition level 5 and to a lesser extent by the details on the other levels.

[30] The percentile energy, indicating the relative amount of energy distributed on a certain decomposition level, can also be considered a useful indicator for assessing the spectral similarity of the validation and calibration data sets. Figure 9 shows the percentile energies of details on different levels for the calibration sets in the 6 month group.
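The per-level energies of equation (7) and the percentile energies described above can be sketched as follows. This is an illustrative, dependency-free version: an orthonormal Haar filter pair replaces the paper's Daubechies wavelet of order 10 (with PyWavelets one would call `pywt.wavedec(x, 'db10', level=6)` instead), and the flow record is simulated.

```python
import numpy as np

def dwt_level_energies(x, levels=6):
    """Total energy of the detail coefficients d1..d6 of a discrete wavelet
    transform. An orthonormal Haar filter pair is used here only to keep the
    sketch free of dependencies; the paper itself uses a Daubechies wavelet
    of order 10."""
    a = np.asarray(x, dtype=float)
    energies = []
    for _ in range(levels):
        if a.size % 2:                                 # pad to even length
            a = np.append(a, a[-1])
        detail = (a[0::2] - a[1::2]) / np.sqrt(2.0)    # high-pass branch
        a = (a[0::2] + a[1::2]) / np.sqrt(2.0)         # low-pass branch
        energies.append(np.sum(detail ** 2))           # energy on this level
    return np.array(energies)                          # [e(d1), ..., e(d6)]

def percentile_energies(energies):
    """Relative (percentile) energy distribution over the levels."""
    e = np.asarray(energies, dtype=float)
    return e / e.sum()

# Hypothetical 6-month hourly flow record
rng = np.random.default_rng(2)
flow = rng.gamma(2.0, 5.0, size=4380)

total = dwt_level_energies(flow)        # total energy per detail level
relative = percentile_energies(total)   # fractions summing to one
```

Comparing `total` (or `relative`) level by level between a candidate calibration set and the validation set reproduces, in outline, the comparisons shown in Figures 7 to 10.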
The results of details d3 d6 are in strong agreement with the assumption that the greater the data similarity, the better the model performance; among them, detail d4 gives the most evident results. The results of d1 and d2 are not as good as the others, which may result from noise in the high frequency domains of the original signals.

[Figure 7. Total energies of details on different wavelet decomposition levels for the calibration data sets in the 6 month scenario group.]

[Figure 8. Total energies of details on different wavelet decomposition levels for the calibration data sets in the 24 month scenario group.]

[31] The results of the percentile energy based on details are not ideal for the 12 and 24 month groups, showing no obvious trends on any of the six levels and appearing as nearly random series when ranked by the values of the model performance. The reason may lie in the relatively low variances of the model performances in these two groups compared with the evident differences between the poor and good scenarios in the 6 month group. Details, which reveal the subtle differences in the high frequency domain, may not be sensitive enough for the comparison of similar signals (i.e., the comparison of the calibration sets in the 12 and 24 month groups with the validation set). Therefore, the approximations on the six decomposition levels of each scenario in those two groups are explored instead to calculate the percentile energies on the six levels. Beyond expectation, the results based on approximations are remarkably good for the 12 and 24 month groups. The 24 month group results are shown in Figure 10. For the results on levels 1 4 (Figures 10a 10d), when the percentile energies calculated on the approximations of the calibration sets are less than that of the validation set (the vertical line), the model performance rises markedly as the distance between the validation and calibration values decreases. Conversely, when the percentile energies of the calibration sets exceed the value of the validation set (Figure 10f), the model performance decreases as that distance increases. The 12 month group gives quite similar results to the 24 month group, and hence they are not presented in this paper.

[Figure 9. Percentile energies of details on different wavelet decomposition levels for the calibration data sets in the 6 month scenario group.]

[Figure 10. Percentile energies of approximations on different wavelet decomposition levels for the calibration data sets in the 24 month scenario group.]

[32] An entropy-like indicator, the ICF, is found to be a simple but efficient evaluation of the overall energy distribution on the different wavelet decomposition levels. It is defined from the degree of uniformity of the energy distribution, and by comparing it the spectral similarity of the validation and calibration sets can be easily assessed. The results are plotted in Figures 11 and 12 for the 6 and 24 month scenario groups, respectively. (The 12 month group shows nearly the same tendency as the 24 month group, and its results are not presented in this paper.) Figures 11a and 12a show the ICF values of the calibration data sets versus the respective model performances; the vertical lines indicate the ICF value of the validation set. Following the analysis of the percentile energy, the better results are chosen from the calculations based on either details or approximations for the three scenario groups. Figures 11b and 12b present another approach to describing the similarity of the overall energy distribution between the validation and calibration sets: the model performances are plotted versus the Nash Sutcliffe efficiency coefficients calculated between the percentile energy series {P_j, j = 1, 2, ..., 6} of the validation set and those of the calibration sets in the different scenario groups.

[Figure 11. Relationship between the model performance and the similarity of (a) information cost function (ICF) values and (b) percentile energy series {P_j, j = 1, 2, ..., 6} between the validation and calibration sets in the 6 month group, calculated on the details on different decomposition levels.]

[Figure 12. Relationship between the model performance and the similarity of (a) ICF values and (b) percentile energy series {P_j, j = 1, 2, ..., 6} between the validation and calibration sets in the 24 month scenario group, calculated on the approximations on different decomposition levels.]

[33] For the 6 month group, the better results come from the calculations on the details, shown in Figure 11. Although the ICF and Nash Sutcliffe efficiency coefficient indices fail to pick out the best of the calibration sets (because of the scatter in the high model performance area in Figures 11a and 11b), they perform effectively in identifying the worst scenarios for the 6 month group. As for the 12 and 24 month groups, the scenarios are more sensitive to the calculations based on the approximations, which show evident trends consistent with the assumption that the greater the similarity, the better the model performance. As shown in Figure 12, both the ICF and the NSE calculated on the percentile energy series can be viewed as good indicators for the selection of calibration sets in the 24 month group, and the same holds for the 12 month group.

5. Discussion

[34] Calibration results of the PDM model presented in this paper demonstrate the importance of selecting calibration data with the most appropriate length and duration. For the three scenario groups containing calibration

data with different starting times and fixed lengths of 6, 12, and 24 months, some scenarios in the 6 month group can even perform better than those in the 12 and 24 month groups, although on average the 12 and 24 month groups provide more stable and better model performances with less variance. These results are in line with the statement that it is not the length but the quality of information of the calibration data that matters more in deciding the model performance. From this point of view, the information contained in the good scenarios of the 6 month group is of better quality than that of the scenarios with calibration data lengths of 12 or 24 months. As the evaluation criterion is chosen to be the validation result of the calibrated model on an 18 month data set, it can be deduced that better quality of information in the calibration set implies, to some extent, an underlying similarity to the validation data set.

[35] If the modeler can evaluate the similarity or the information quality of all the possible calibration data sets of given durations beforehand, the calibration work needed to search for the most appropriate set can be reduced dramatically. However, it is difficult to compare the similarity of the validation and calibration sets visually, or to judge the quality of information in the calibration data by direct experience, except for several poorly performing scenarios in the 6 month group that consist almost entirely of dry months within a year. That is why the flow duration curve, the Fourier transform, and finally the wavelet analysis are applied in this study to help assess the similarity between the validation and calibration sets in the three scenario groups.
[36] The similarities described using the flow duration curve and the Fourier transform yield comparable results: the flow duration curve and the total power and amplitude of the Fourier series all show a good correlation between the model performance and the similarity of the validation and calibration sets for the scenarios in the 6 month group, while for the other two groups, with calibration data lengths of 12 and 24 months, there are no obvious trends. For that reason, the wavelet transform, which can reveal more detailed spectral properties of a signal in more specific frequency bands, is applied to search for better indices whose similarity measures agree more generally with the model performance. Comparisons of both the total and percentile energies of the validation and calibration data sets after the discrete wavelet transform provide more consistent results for all three scenario groups, showing an improvement in the model performance with increasing similarity, especially on particular decomposition levels that represent specific ranges of the frequency domain. On the basis of the wavelet results, the entropy-like function ICF, which efficiently evaluates the overall energy distribution of a signal across the decomposition levels, was constructed and appears to be the most suitable index of the similarity between the validation and calibration sets. The ICF provides evident results for all three groups, in high accordance with the previous assumption. It is interesting to note that the ICF performs better with the details for the 6 month group but with the approximations for the 12 and 24 month groups. The results of the ICF are confirmed by the comparisons of the percentile energy series, which are another means of describing the similarity of the overall energy distribution between the validation and calibration sets.
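As an illustration of how the ICF comparison singled out here might be computed, a sketch follows. The entropy-like form assumed is that of Blanco et al. [1998], cited in the references; the percentile-energy values are purely hypothetical.

```python
import numpy as np

def icf(level_energies):
    """Information cost function: a Shannon-entropy-like measure of how
    uniformly the signal energy is spread over the decomposition levels
    (the form of Blanco et al. [1998], assumed here to match the paper)."""
    p = np.asarray(level_energies, dtype=float)
    p = p / p.sum()                # normalise to percentile energies
    p = p[p > 0]                   # convention: 0 * log(0) = 0
    return -np.sum(p * np.log(p))

# Hypothetical percentile-energy series for a validation and a calibration set
e_val = np.array([0.05, 0.10, 0.20, 0.35, 0.20, 0.10])
e_cal = np.array([0.06, 0.12, 0.22, 0.30, 0.20, 0.10])

# The closer the ICF of a calibration set is to that of the validation set,
# the more similar their energy distributions; candidate calibration periods
# can be ranked by this distance before any calibration run is made.
distance = abs(icf(e_val) - icf(e_cal))
```

The ICF is maximal (log 6 for six levels) when the energy is spread uniformly and zero when it is concentrated on a single level, which is what makes it a compact one-number summary of the percentile energy series.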
[37] It should be mentioned that all three methods are also applied to the observed rainfall data, in addition to the flow observations, in this study (the flow duration curve methodology can be applied to rainfall data as well). Except for the results of the 6 month group using the duration curve, which show a trend similar to that obtained with the flow data, poor results are produced by the other two methods. However, this is only the case for the Brue catchment; the rainfall data may still be worth trying for other catchments and other case studies with differently designed calibration scenarios.

6. Conclusions

[38] Selection of the calibration data is an important task for hydrologists in building hydrological models. Despite the publication of numerous studies on model development, there is a lack of guidance on how to select adequate and appropriate calibration data. The traditional rule of thumb is based mainly on the data length (e.g., 6 years of data) and is inadequate for different catchment characteristics. It has gradually been recognized by modelers that it is not the length but the information quality of the data used for calibration that is the most significant factor affecting the performance of the calibrated rainfall runoff model. The selection of calibration data with an adequate length and an appropriate duration is becoming more and more important, especially as increasingly more observed data with high resolution are collected by modern telemetry systems. This study has provided several practical indices for the calibration data selection of rainfall runoff models. With the validation data determined, it is assumed that the more similarity a calibration data set bears to the validation set, the better the performance of the model calibrated using that calibration data set.
The three methods presented in this paper, the flow duration curve, the Fourier transform, and the wavelet analysis, are all found to produce good indices describing the similarities between the validation and calibration data sets for certain scenario groups in the case study, among which the ICF appears to be the most appropriate and efficient one for calibration data selection. It is interesting to note that some models calibrated using 6 month data in this study performed better than those using longer data lengths. This again verifies that the information content of the calibration data is more important than the data length. The idea presented in this paper has also shown its potential for enhancing the efficiency of data utilization, particularly when the modeler faces the problem of data-limited catchments.

[39] Clearly, the outcomes of this paper depend to some extent on the characteristics of the case study, for example, the design of the scenario groups, the catchment, and the rainfall runoff model that have been used. More research is needed to explore the applicability of the indices under other catchment conditions and with different choices of rainfall runoff models, in particular spatially distributed models, which have more complicated input requirements. One limitation of this study is that the validation data are determined beforehand. It is true that validation data are a

dominating factor in evaluating the performance of the calibrated model. In practice, the selection of the validation data is associated with the purpose of applying the rainfall runoff model, such as real time flood forecasting or hydraulic structure design, which in normal cases can be decided appropriately before the selection of the calibration data.

[40] For completeness, it should be noted that the calibration data sets selected by the suggested indices in the case study are expected to be the relatively best ones among the calibration sets in all the scenarios rather than the absolute best ones. There are no large differences between the model performances of the good scenarios in the 6 month group and those in the 12 and 24 month groups, although the average performance increases from the 6 to the 24 month group, which could lead to a general conclusion of improved performance with increasing calibration data length. Searching for the optimal length of the calibration data remains an unsolved and attractive issue for the future. We therefore hope this study will stimulate further research into the related calibration issues so that some generalizations on the selection of the optimal calibration data and more applicable indices may be found.

References

Anctil, F., C. Perrin, and V. Andréassian (2004), Impact of the length of observed records on the performance of ANN and of conceptual parsimonious rainfall runoff forecasting models, Environ. Modell. Software, 19(4), , doi: /s (03)00135-x.
Biggs, M. C. (1975), Constrained minimization using recursive quadratic programming, in Towards Global Optimization, edited by L. C. W. Dixon and G. P. Szegö, pp. , North Holland, Amsterdam, Netherlands.
Blanco, S., A. Figliola, R. Quian Quiroga, O. A. Rosso, and E. Serrano (1998), Time frequency analysis of electroencephalogram series. III. Wavelet packets and information cost function, Phys. Rev. E, 57(1), .
Boughton, W.
(2006), Calibrations of a daily rainfall runoff model with poor quality data, Environ. Modell. Software, 21(8), , doi: /j.envsoft.
Brath, A., A. Montanari, and E. Toth (2004), Analysis of the effects of different scenarios of historical data availability on the calibration of a spatially distributed hydrological model, J. Hydrol. Amsterdam, 291, , doi: /j.jhydrol.
Butts, M. B., J. T. Payne, M. Kristensen, and H. Madsen (2004), An evaluation of the impact of model structure on hydrological modelling uncertainty for streamflow simulation, J. Hydrol. Amsterdam, 298, , doi: /j.jhydrol.
Chau, K. W. (2006), Particle swarm optimization training algorithm for ANNs in stage prediction of Shing Mun River, J. Hydrol. Amsterdam, 329, , doi: /j.jhydrol.
Chau, K. W. (2007), A split step particle swarm optimization algorithm in river stage forecasting, J. Hydrol. Amsterdam, 346, , doi: /j.jhydrol.
Cooley, J. W., and J. W. Tukey (1965), An algorithm for the machine calculation of complex Fourier series, Math. Comput., 19(90), , doi: / .
Daubechies, I. (1990), The wavelet transform, time frequency localization and signal analysis, IEEE Trans. Inf. Theory, 36(5), , doi: / .
Dooge, J. C. I. (1973), Linear theory of hydrologic systems, Technical Bulletin 1468, U.S. Dept. of Agric., Washington, D. C.
Drago, A. F., and S. R. Boxall (2002), Use of the wavelet transform on hydro meteorological data, Phys. Chem. Earth, 27(32 34), , doi: /s (02).
Duan, Q., S. Sorooshian, and V. Gupta (1992), Effective and efficient global optimization for conceptual rainfall runoff models, Water Resour. Res., 28, , doi: /91wr.
Duan, Q., S. Sorooshian, and V. Gupta (1994), Optimal use of the SCE-UA global optimization method for calibrating watershed models, J. Hydrol. Amsterdam, 158, , doi: / (94).
Eberhart, R. C., and J. Kennedy (1995), A new optimizer using particle swarm theory, in Proceedings of the 6th International Symposium on Micro Machine and Human Science, pp. , IEEE, Piscataway, N. J.
Eberhart, R. C., and Y.
Shi (2001), Particle swarm optimization: Developments, applications and resources, in Proceedings of the 2001 Congress on Evolutionary Computation, pp. , IEEE, Piscataway, N. J.
Figliola, A., and E. Serrano (1997), Analysis of physiological time series using wavelet transforms, IEEE Eng. Med. Biol., 16(3), 74 79, doi: / .
Gan, T. Y., and G. F. Biftu (1996), Automatic calibration of conceptual rainfall runoff models: Optimization algorithms, catchment conditions, and model structure, Water Resour. Res., 32, , doi: /95WR.
Gan, T. Y., E. M. Dlamini, and G. F. Biftu (1997), Effects of model complexity and structure, data quality, and objective functions on hydrologic modelling, J. Hydrol. Amsterdam, 192, , doi: /s (96).
Gill, M. K., Y. H. Kaheil, A. Khalil, M. McKee, and L. Bastidas (2006), Multiobjective particle swarm optimization for parameter estimation in hydrology, Water Resour. Res., 42, W07417, doi: /2005wr.
Goswami, M., and K. M. O'Connor (2007), Comparative assessment of six automatic optimization techniques for calibration of a conceptual rainfall runoff model, Hydrol. Sci. J., 52(3), , doi: /hysj.
Gupta, H. V., S. Sorooshian, and P. O. Yapo (1998), Toward improved calibration of hydrological models: Multiple and noncommensurable measures of information, Water Resour. Res., 34, , doi: /97WR.
Gupta, V. K., and S. Sorooshian (1985a), The relationship between data and the precision of estimated parameters, J. Hydrol. Amsterdam, 85, 57 77, doi: / (85).
Gupta, V. K., and S. Sorooshian (1985b), The automatic calibration of conceptual catchment models using derivative based optimization algorithms, Water Resour. Res., 21, , doi: /wr021i004p.
Han, S. P. (1977), A globally convergent method for nonlinear programming, J. Optim. Theory Appl., 22, , doi: /bf.
Harlin, J. (1991), Development of a process oriented calibration scheme for the HBV hydrological model, Nord. Hydrol., 22, .
Kulkarni, J. R.
(2000), Wavelet analysis of the association between the Southern Oscillation and the Indian Summer Monsoon, Int. J. Climatol., 20(1), , doi: /(sici) (200001)20:1<89::aid-JOC458>3.0.CO;2-W.
LeBoutillier, D. W., and P. R. Waylen (1993), A stochastic model of flow duration curves, Water Resour. Res., 29, , doi: /93WR.
Li, C. H., Z. F. Yang, G. H. Huang, and Y. P. Li (2009), Identification of relationship between sunspots and natural runoff in the Yellow River based on discrete wavelet analysis, Expert Syst. Appl., 36(2), , doi: /j.eswa.
Li, X. B., H. Q. Li, F. Q. Wang, and J. Ding (1997), A remark on the Mallat pyramidal algorithm of wavelet analysis, Commun. Nonlinear Sci. Numer. Simul., 2(4), , doi: /s (97).
Liu, B., L. Wang, Y. H. Jin, F. Tang, and D. X. Huang (2005), Improved particle swarm optimization combined with chaos, Chaos Solitons Fractals, 25(5), , doi: /j.chaos.
Mallat, S. (1989), A theory for multiresolution signal decomposition: The wavelet representation, IEEE Trans. Pattern Anal. Mach. Intell., 11(7), , doi: / .
Meyer, Y. (1993), Wavelets: Algorithms and Applications, Society for Industrial and Applied Mathematics, Philadelphia, Pa.
Moore, R. J. (1985), The probability distributed principle and runoff production at point and basin scales, Hydrol. Sci. J., 30(2), .
Moore, R. J. (2007), The PDM rainfall runoff model, Hydrol. Earth Syst. Sci., 11(1), .
Nakken, M. (1999), Wavelet analysis of rainfall runoff variability isolating climatic from anthropogenic patterns, Environ. Modell. Software, 14(4), , doi: /s (98).
Nash, J. E., and J. V. Sutcliffe (1970), River flow forecasting through conceptual models: Part 1. A discussion of principles, J. Hydrol. Amsterdam, 10, .
Newland, D. E. (1993), An Introduction to Random Vibrations, Spectral and Wavelet Analysis, 477 pp., Addison Wesley Longman, Harlow, Essex, U. K.


More information

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc.

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc. C: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc. C is one of many capability metrics that are available. When capability metrics are used, organizations typically provide

More information

Solar Radiation Data Modeling with a Novel Surface Fitting Approach

Solar Radiation Data Modeling with a Novel Surface Fitting Approach Solar Radiation Data Modeling with a Novel Surface Fitting Approach F. Onur Hocao glu, Ömer Nezih Gerek, Mehmet Kurban Anadolu University, Dept. of Electrical and Electronics Eng., Eskisehir, Turkey {fohocaoglu,ongerek,mkurban}

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

Automatic calibration of the MIKE SHE integrated hydrological modelling system

Automatic calibration of the MIKE SHE integrated hydrological modelling system Automatic calibration of the MIKE SHE integrated hydrological modelling system Henrik Madsen and Torsten Jacobsen DHI Water & Environment Abstract In this paper, automatic calibration of an integrated

More information

Descriptive Statistics, Standard Deviation and Standard Error

Descriptive Statistics, Standard Deviation and Standard Error AP Biology Calculations: Descriptive Statistics, Standard Deviation and Standard Error SBI4UP The Scientific Method & Experimental Design Scientific method is used to explore observations and answer questions.

More information

Table of Contents (As covered from textbook)

Table of Contents (As covered from textbook) Table of Contents (As covered from textbook) Ch 1 Data and Decisions Ch 2 Displaying and Describing Categorical Data Ch 3 Displaying and Describing Quantitative Data Ch 4 Correlation and Linear Regression

More information

ENV3104 Hydraulics II 2017 Assignment 1. Gradually Varied Flow Profiles and Numerical Solution of the Kinematic Equations:

ENV3104 Hydraulics II 2017 Assignment 1. Gradually Varied Flow Profiles and Numerical Solution of the Kinematic Equations: ENV3104 Hydraulics II 2017 Assignment 1 Assignment 1 Gradually Varied Flow Profiles and Numerical Solution of the Kinematic Equations: Examiner: Jahangir Alam Due Date: 27 Apr 2017 Weighting: 1% Objectives

More information

Constraining Rainfall Replicates on Remote Sensed and In-Situ Measurements

Constraining Rainfall Replicates on Remote Sensed and In-Situ Measurements Constraining Rainfall Replicates on Remote Sensed and In-Situ Measurements Seyed Hamed Alemohammad, Dara Entekhabi, Dennis McLaughlin Ralph M. Parsons Laboratory for Environmental Science and Engineering

More information

On Automatic Calibration of the SWMM Model

On Automatic Calibration of the SWMM Model On Automatic Calibration of the SWMM Model Van-Thanh-Van Nguyen, Hamed Javaheri and Shie-Yui Liong Conceptual urban runoff (CUR) models, such as the U.S. Environmental Protection Agency Storm Water Management

More information

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini Metaheuristic Development Methodology Fall 2009 Instructor: Dr. Masoud Yaghini Phases and Steps Phases and Steps Phase 1: Understanding Problem Step 1: State the Problem Step 2: Review of Existing Solution

More information

Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi

Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi 1. Introduction The choice of a particular transform in a given application depends on the amount of

More information

Prediction of traffic flow based on the EMD and wavelet neural network Teng Feng 1,a,Xiaohong Wang 1,b,Yunlai He 1,c

Prediction of traffic flow based on the EMD and wavelet neural network Teng Feng 1,a,Xiaohong Wang 1,b,Yunlai He 1,c 2nd International Conference on Electrical, Computer Engineering and Electronics (ICECEE 215) Prediction of traffic flow based on the EMD and wavelet neural network Teng Feng 1,a,Xiaohong Wang 1,b,Yunlai

More information

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation Unit 5 SIMULATION THEORY Lesson 39 Learning objective: To learn random number generation. Methods of simulation. Monte Carlo method of simulation You ve already read basics of simulation now I will be

More information

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding e Scientific World Journal, Article ID 746260, 8 pages http://dx.doi.org/10.1155/2014/746260 Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding Ming-Yi

More information

Levenberg-Marquardt minimisation in ROPP

Levenberg-Marquardt minimisation in ROPP Ref: SAF/GRAS/METO/REP/GSR/006 Web: www.grassaf.org Date: 4 February 2008 GRAS SAF Report 06 Levenberg-Marquardt minimisation in ROPP Huw Lewis Met Office, UK Lewis:Levenberg-Marquardt in ROPP GRAS SAF

More information

Incorporating Likelihood information into Multiobjective Calibration of Conceptual Rainfall- Runoff Models

Incorporating Likelihood information into Multiobjective Calibration of Conceptual Rainfall- Runoff Models International Congress on Environmental Modelling and Software Brigham Young University BYU ScholarsArchive th International Congress on Environmental Modelling and Software - Barcelona, Catalonia, Spain

More information

Use of evaporation and streamflow data in hydrological model calibration

Use of evaporation and streamflow data in hydrological model calibration Use of evaporation and streamflow data in hydrological model calibration Jeewanthi Sirisena* Assoc./Prof. S. Maskey Prof. R. Ranasinghe IHE-Delft Institute for Water Education, The Netherlands Background

More information

Using Excel for Graphical Analysis of Data

Using Excel for Graphical Analysis of Data EXERCISE Using Excel for Graphical Analysis of Data Introduction In several upcoming experiments, a primary goal will be to determine the mathematical relationship between two variable physical parameters.

More information

WELCOME! Lecture 3 Thommy Perlinger

WELCOME! Lecture 3 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important

More information

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 23 CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 3.1 DESIGN OF EXPERIMENTS Design of experiments is a systematic approach for investigation of a system or process. A series

More information

Introduction to Exploratory Data Analysis

Introduction to Exploratory Data Analysis Introduction to Exploratory Data Analysis Ref: NIST/SEMATECH e-handbook of Statistical Methods http://www.itl.nist.gov/div898/handbook/index.htm The original work in Exploratory Data Analysis (EDA) was

More information

Comparison of multiple point and single point calibration performance for the Saginaw River Watershed

Comparison of multiple point and single point calibration performance for the Saginaw River Watershed Comparison of multiple point and single point calibration performance for the Saginaw River Watershed Fariborz Daneshvar, A 1. Pouyan Nejadhashemi 1, Matthew R. Herman 1 1 Department of Biosystems and

More information

Experimental Study on Bound Handling Techniques for Multi-Objective Particle Swarm Optimization

Experimental Study on Bound Handling Techniques for Multi-Objective Particle Swarm Optimization Experimental Study on Bound Handling Techniques for Multi-Objective Particle Swarm Optimization adfa, p. 1, 2011. Springer-Verlag Berlin Heidelberg 2011 Devang Agarwal and Deepak Sharma Department of Mechanical

More information

Bootstrapping Method for 14 June 2016 R. Russell Rhinehart. Bootstrapping

Bootstrapping Method for  14 June 2016 R. Russell Rhinehart. Bootstrapping Bootstrapping Method for www.r3eda.com 14 June 2016 R. Russell Rhinehart Bootstrapping This is extracted from the book, Nonlinear Regression Modeling for Engineering Applications: Modeling, Model Validation,

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Aero-engine PID parameters Optimization based on Adaptive Genetic Algorithm. Yinling Wang, Huacong Li

Aero-engine PID parameters Optimization based on Adaptive Genetic Algorithm. Yinling Wang, Huacong Li International Conference on Applied Science and Engineering Innovation (ASEI 215) Aero-engine PID parameters Optimization based on Adaptive Genetic Algorithm Yinling Wang, Huacong Li School of Power and

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Underwater Acoustics Session 2aUW: Wave Propagation in a Random Medium

More information

NCAR SUMMER COLLOQUIUM: July 24-Aug.6, 2011 Boulder, Colorado, USA. General Large Area Crop Model (GLAM) TUTORIAL

NCAR SUMMER COLLOQUIUM: July 24-Aug.6, 2011 Boulder, Colorado, USA. General Large Area Crop Model (GLAM) TUTORIAL NCAR SUMMER COLLOQUIUM: July 24-Aug.6, 2011 Boulder, Colorado, USA General Large Area Crop Model (GLAM) TUTORIAL Gizaw Mengistu, Dept. of Physics, Addis Ababa University, Ethiopia This document provides

More information

Angela Ball, Richard Hill and Peter Jenkinson

Angela Ball, Richard Hill and Peter Jenkinson EVALUATION OF METHODS FOR INTEGRATING MONITORING AND MODELLING DATA FOR REGULATORY AIR QUALITY ASSESSMENTS Angela Ball, Richard Hill and Peter Jenkinson Westlakes Scientific Consulting Ltd, The Princess

More information

Supplementary Figure 1. Decoding results broken down for different ROIs

Supplementary Figure 1. Decoding results broken down for different ROIs Supplementary Figure 1 Decoding results broken down for different ROIs Decoding results for areas V1, V2, V3, and V1 V3 combined. (a) Decoded and presented orientations are strongly correlated in areas

More information

Development and Implementation of International and Regional Flash Flood Guidance (FFG) and Early Warning Systems. Project Brief

Development and Implementation of International and Regional Flash Flood Guidance (FFG) and Early Warning Systems. Project Brief Development and Implementation of International and Regional Flash Flood Guidance (FFG) and Early Warning Systems Project Brief 1 SUMMARY The purpose of this project is the development and implementation

More information

WAVELET USE FOR IMAGE CLASSIFICATION. Andrea Gavlasová, Aleš Procházka, and Martina Mudrová

WAVELET USE FOR IMAGE CLASSIFICATION. Andrea Gavlasová, Aleš Procházka, and Martina Mudrová WAVELET USE FOR IMAGE CLASSIFICATION Andrea Gavlasová, Aleš Procházka, and Martina Mudrová Prague Institute of Chemical Technology Department of Computing and Control Engineering Technická, Prague, Czech

More information

Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation

Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation November 2010 Nelson Shaw njd50@uclive.ac.nz Department of Computer Science and Software Engineering University of Canterbury,

More information

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable. 5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table

More information

IMAGE DE-NOISING IN WAVELET DOMAIN

IMAGE DE-NOISING IN WAVELET DOMAIN IMAGE DE-NOISING IN WAVELET DOMAIN Aaditya Verma a, Shrey Agarwal a a Department of Civil Engineering, Indian Institute of Technology, Kanpur, India - (aaditya, ashrey)@iitk.ac.in KEY WORDS: Wavelets,

More information

A *69>H>N6 #DJGC6A DG C<>C::G>C<,8>:C8:H /DA 'D 2:6G, ()-"&"3 -"(' ( +-" " " % '.+ % ' -0(+$,

A *69>H>N6 #DJGC6A DG C<>C::G>C<,8>:C8:H /DA 'D 2:6G, ()-&3 -(' ( +-   % '.+ % ' -0(+$, The structure is a very important aspect in neural network design, it is not only impossible to determine an optimal structure for a given problem, it is even impossible to prove that a given structure

More information

Tracking Changing Extrema with Particle Swarm Optimizer

Tracking Changing Extrema with Particle Swarm Optimizer Tracking Changing Extrema with Particle Swarm Optimizer Anthony Carlisle Department of Mathematical and Computer Sciences, Huntingdon College antho@huntingdon.edu Abstract The modification of the Particle

More information

Introduction to Geospatial Analysis

Introduction to Geospatial Analysis Introduction to Geospatial Analysis Introduction to Geospatial Analysis 1 Descriptive Statistics Descriptive statistics. 2 What and Why? Descriptive Statistics Quantitative description of data Why? Allow

More information

Towards an objective method of verifying the bend radius of HDD installations. Otto Ballintijn, CEO Reduct NV

Towards an objective method of verifying the bend radius of HDD installations. Otto Ballintijn, CEO Reduct NV International No-Dig 2010 28th International Conference and Exhibition Singapore 8-10 November 2010 Paper 001 Towards an objective method of verifying the bend radius of HDD installations Otto Ballintijn,

More information

Journal of Asian Scientific Research FEATURES COMPOSITION FOR PROFICIENT AND REAL TIME RETRIEVAL IN CBIR SYSTEM. Tohid Sedghi

Journal of Asian Scientific Research FEATURES COMPOSITION FOR PROFICIENT AND REAL TIME RETRIEVAL IN CBIR SYSTEM. Tohid Sedghi Journal of Asian Scientific Research, 013, 3(1):68-74 Journal of Asian Scientific Research journal homepage: http://aessweb.com/journal-detail.php?id=5003 FEATURES COMPOSTON FOR PROFCENT AND REAL TME RETREVAL

More information

Simulation Supported POD Methodology and Validation for Automated Eddy Current Procedures

Simulation Supported POD Methodology and Validation for Automated Eddy Current Procedures 4th International Symposium on NDT in Aerospace 2012 - Th.1.A.1 Simulation Supported POD Methodology and Validation for Automated Eddy Current Procedures Anders ROSELL, Gert PERSSON Volvo Aero Corporation,

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

Adaptive Fingerprint Image Enhancement Techniques and Performance Evaluations

Adaptive Fingerprint Image Enhancement Techniques and Performance Evaluations Adaptive Fingerprint Image Enhancement Techniques and Performance Evaluations Kanpariya Nilam [1], Rahul Joshi [2] [1] PG Student, PIET, WAGHODIYA [2] Assistant Professor, PIET WAGHODIYA ABSTRACT: Image

More information

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,

More information

TERTIARY INSTITUTIONS SERVICE CENTRE (Incorporated in Western Australia)

TERTIARY INSTITUTIONS SERVICE CENTRE (Incorporated in Western Australia) TERTIARY INSTITUTIONS SERVICE CENTRE (Incorporated in Western Australia) Royal Street East Perth, Western Australia 6004 Telephone (08) 9318 8000 Facsimile (08) 9225 7050 http://www.tisc.edu.au/ THE AUSTRALIAN

More information

A Framework for Incorporating Uncertainty Sources in SWAT Modeling

A Framework for Incorporating Uncertainty Sources in SWAT Modeling A Framework for Incorporating Uncertainty Sources in SWAT Modeling Haw Yen X. Wang, D. G. Fontane, R. D. Harmel, M. Arabi July 30, 2014 2014 International SWAT Conference Pernambuco, Brazil Outline Overview

More information

Hydro Office Software for Water Sciences. TS Editor 3.0. White paper. HydroOffice.org

Hydro Office Software for Water Sciences. TS Editor 3.0. White paper. HydroOffice.org Hydro Office Software for Water Sciences TS Editor 3.0 White paper HydroOffice.org White paper for HydroOffice tool TS Editor 3.0 Miloš Gregor, PhD. / milos.gregor@hydrooffice.org HydroOffice.org software

More information

Hydrologic modelling at a continuous permafrost site using MESH. S. Pohl, P. Marsh, and S. Endrizzi

Hydrologic modelling at a continuous permafrost site using MESH. S. Pohl, P. Marsh, and S. Endrizzi Hydrologic modelling at a continuous permafrost site using MESH S. Pohl, P. Marsh, and S. Endrizzi Purpose of Study Test the latest version of MESH at a continuous permafrost site Model performance will

More information

An Adaptive Color Image Visible Watermark Algorithm Supporting for Interested Area and its Application System Based on Internet

An Adaptive Color Image Visible Watermark Algorithm Supporting for Interested Area and its Application System Based on Internet MATEC Web of Conferences 25, 0301 8 ( 2015) DOI: 10.1051/ matecconf/ 20152 503018 C Owned by the authors, published by EDP Sciences, 2015 An Adaptive Color Image Visible Watermark Algorithm Supporting

More information

17. SEISMIC ANALYSIS MODELING TO SATISFY BUILDING CODES

17. SEISMIC ANALYSIS MODELING TO SATISFY BUILDING CODES 17. SEISMIC ANALYSIS MODELING TO SATISFY BUILDING CODES The Current Building Codes Use the Terminology: Principal Direction without a Unique Definition 17.1 INTRODUCTION { XE "Building Codes" }Currently

More information

Overview of Model Calibration General Strategy & Optimization

Overview of Model Calibration General Strategy & Optimization Overview of Model Calibration General Strategy & Optimization Logan Karsten National Center for Atmospheric Research General Strategy Simple enough. Right?... 2 General Strategy Traditional NWS lumped

More information

The Bootstrap and Jackknife

The Bootstrap and Jackknife The Bootstrap and Jackknife Summer 2017 Summer Institutes 249 Bootstrap & Jackknife Motivation In scientific research Interest often focuses upon the estimation of some unknown parameter, θ. The parameter

More information

Optimizing Pharmaceutical Production Processes Using Quality by Design Methods

Optimizing Pharmaceutical Production Processes Using Quality by Design Methods Optimizing Pharmaceutical Production Processes Using Quality by Design Methods Bernd Heinen, SAS WHITE PAPER SAS White Paper Table of Contents Abstract.... The situation... Case study and database... Step

More information

Curve Fit: a pixel level raster regression tool

Curve Fit: a pixel level raster regression tool a pixel level raster regression tool Timothy Fox, Nathan De Jager, Jason Rohweder* USGS La Crosse, WI a pixel level raster regression tool Working with multiple raster datasets that share a common theme

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information

Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity

Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity Wendy Foslien, Honeywell Labs Valerie Guralnik, Honeywell Labs Steve Harp, Honeywell Labs William Koran, Honeywell Atrium

More information

DESIGN OF EXPERIMENTS and ROBUST DESIGN

DESIGN OF EXPERIMENTS and ROBUST DESIGN DESIGN OF EXPERIMENTS and ROBUST DESIGN Problems in design and production environments often require experiments to find a solution. Design of experiments are a collection of statistical methods that,

More information

High Resolution Geomodeling, Ranking and Flow Simulation at SAGD Pad Scale

High Resolution Geomodeling, Ranking and Flow Simulation at SAGD Pad Scale High Resolution Geomodeling, Ranking and Flow Simulation at SAGD Pad Scale Chad T. Neufeld, Clayton V. Deutsch, C. Palmgren and T. B. Boyle Increasing computer power and improved reservoir simulation software

More information

Evolved Multi-resolution Transforms for Optimized Image Compression and Reconstruction under Quantization

Evolved Multi-resolution Transforms for Optimized Image Compression and Reconstruction under Quantization Evolved Multi-resolution Transforms for Optimized Image Compression and Reconstruction under Quantization FRANK W. MOORE Mathematical Sciences Department University of Alaska Anchorage CAS 154, 3211 Providence

More information

Building Better Parametric Cost Models

Building Better Parametric Cost Models Building Better Parametric Cost Models Based on the PMI PMBOK Guide Fourth Edition 37 IPDI has been reviewed and approved as a provider of project management training by the Project Management Institute

More information

Automated Thiessen polygon generation

Automated Thiessen polygon generation WATER RESOURCES RESEARCH, VOL. 42,, doi:10.1029/2005wr004365, 2006 Automated Thiessen polygon generation D. Han 1 and M. Bray 1 Received 17 June 2005; revised 20 July 2006; accepted 2 August 2006; published

More information

Resolution Improvement Processing of Post-stack Seismic Data and Example Analysis - Taking NEB Gas Field As an Example in Indonesia

Resolution Improvement Processing of Post-stack Seismic Data and Example Analysis - Taking NEB Gas Field As an Example in Indonesia International Forum on Energy, Environment Science and Materials (IFEESM 2017) Resolution Improvement Processing of Post-stack Seismic Data and Example Analysis - Taking NEB Gas Field As an Example in

More information

An improved PID neural network controller for long time delay systems using particle swarm optimization algorithm

An improved PID neural network controller for long time delay systems using particle swarm optimization algorithm An improved PID neural network controller for long time delay systems using particle swarm optimization algorithm A. Lari, A. Khosravi and A. Alfi Faculty of Electrical and Computer Engineering, Noushirvani

More information

Example Applications of A Stochastic Ground Motion Simulation Methodology in Structural Engineering

Example Applications of A Stochastic Ground Motion Simulation Methodology in Structural Engineering Example Applications of A Stochastic Ground Motion Simulation Methodology in Structural Engineering S. Rezaeian & N. Luco U.S. Geological Survey, Golden, CO, USA ABSTRACT: Example engineering applications

More information

Research on the New Image De-Noising Methodology Based on Neural Network and HMM-Hidden Markov Models

Research on the New Image De-Noising Methodology Based on Neural Network and HMM-Hidden Markov Models Research on the New Image De-Noising Methodology Based on Neural Network and HMM-Hidden Markov Models Wenzhun Huang 1, a and Xinxin Xie 1, b 1 School of Information Engineering, Xijing University, Xi an

More information

CREATING THE DISTRIBUTION ANALYSIS

CREATING THE DISTRIBUTION ANALYSIS Chapter 12 Examining Distributions Chapter Table of Contents CREATING THE DISTRIBUTION ANALYSIS...176 BoxPlot...178 Histogram...180 Moments and Quantiles Tables...... 183 ADDING DENSITY ESTIMATES...184

More information

A new multiscale routing framework and its evaluation for land surface modeling applications

A new multiscale routing framework and its evaluation for land surface modeling applications WATER RESOURCES RESEARCH, VOL. 48,, doi:10.1029/2011wr011337, 2012 A new multiscale routing framework and its evaluation for land surface modeling applications Zhiqun Wen, 1,2,4 Xu Liang, 2,3 and Shengtian

More information

Spatial and temporal rainfall approximation using additive models

Spatial and temporal rainfall approximation using additive models ANZIAM J. 42 (E) ppc1599 C1611, 2000 C1599 Spatial and temporal rainfall approximation using additive models C. Zoppou S. Roberts M. Hegland (Received 7 August 2000) Abstract We investigate the approximation

More information

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation 10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode

More information

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation 10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode

More information

Transducers and Transducer Calibration GENERAL MEASUREMENT SYSTEM

Transducers and Transducer Calibration GENERAL MEASUREMENT SYSTEM Transducers and Transducer Calibration Abstracted from: Figliola, R.S. and Beasley, D. S., 1991, Theory and Design for Mechanical Measurements GENERAL MEASUREMENT SYSTEM Assigning a specific value to a

More information

EF5 Overview. University of Oklahoma/HyDROS Module 1.3

EF5 Overview. University of Oklahoma/HyDROS Module 1.3 EF5 Overview University of Oklahoma/HyDROS Module 1.3 Outline Day 1 WELCOME INTRODUCTION TO HYDROLOGICAL MODELS EF5 OVERVIEW Features of EF5 Model structure Control file options Warm-up and model states

More information

Introduction to and calibration of a conceptual LUTI model based on neural networks

Introduction to and calibration of a conceptual LUTI model based on neural networks Urban Transport 591 Introduction to and calibration of a conceptual LUTI model based on neural networks F. Tillema & M. F. A. M. van Maarseveen Centre for transport studies, Civil Engineering, University

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

Fourier Transformation Methods in the Field of Gamma Spectrometry

Fourier Transformation Methods in the Field of Gamma Spectrometry International Journal of Pure and Applied Physics ISSN 0973-1776 Volume 3 Number 1 (2007) pp. 132 141 Research India Publications http://www.ripublication.com/ijpap.htm Fourier Transformation Methods in

More information