Input and Structure Selection for k-NN Approximator

Antti Sorjamaa, Nima Reyhani and Amaury Lendasse
Neural Network Research Centre, Helsinki University of Technology, P.O. Box 5400, Espoo, Finland

Abstract. This paper presents k-NN as an approximator for time series prediction problems. The main advantage of this approximator is its simplicity. Despite the simplicity, k-NN can be used to perform input selection for nonlinear models and it also provides accurate approximations. Three model structure selection methods are presented: Leave-one-out, Bootstrap and Bootstrap 632. We will show that both Bootstraps provide a good estimate of the number of neighbors k, where Leave-one-out fails.  Results of the methods are presented with the Electric load from Poland data set.

Keywords: k-NN, Time Series Prediction, Bootstrap, Leave-one-out and Model Structure Selection.

1 Introduction

In any function approximation, system identification, classification or prediction task, one usually wants to find the best possible model and the best possible parameters in order to obtain good performance. The selected model must generalize well enough while still preserving accuracy and reliability, without unnecessary complexity, which increases the computational load and thus the calculation time. Optimal parameters must be determined for every model in order to rank the models according to their performances. Furthermore, in order to select the best model or model class, one also needs to determine the best set of inputs to use in the determination of the best parameters for each structure. When too many inputs are selected, it is possible to get even worse results than with a smaller input set containing more accurate and valid information. Vice versa, with only a few inputs the accuracy of the model may not be sufficient, and the results are poor and unreliable.

The problems mentioned above occur simultaneously, so it is difficult to find the right combination of correct attributes, and it can be a very tedious and time-consuming procedure to go through every possible structure to select the best one. In this paper we focus on finding the optimal structure for the k-Nearest Neighbors (k-NN) approximator. At the same time, we consider the problem of selecting the most necessary and optimal inputs for the k-NN as well as selecting the structure of the approximator. The k-NN method is presented in Section 2; Section 3 describes the selection of the inputs with an exhaustive search and Section 4 the selection of the structure using Leave-one-out (LOO) and the Bootstraps. In Section 5 we show some experimental results with an electric load time series, and in Section 6 we draw conclusions from the results.

J. Cabestany, A. Prieto and F. Sandoval (Eds.): IWANN 2005, LNCS 3512, pp. 985-992. Springer-Verlag, Berlin Heidelberg (2005).

2 k-Nearest Neighbors

The k-Nearest Neighbors approximation method is a very simple but powerful method. It has been used in many different applications, particularly in classification tasks [1]. The key idea behind the k-NN is that similar input data vectors have similar output values. One has to look for a certain number of nearest neighbors, according to the Euclidean distance [1], and their corresponding output values to get the output approximation. The estimate of the output is calculated as the average of the outputs of the neighbors in the neighborhood. If the pairs (x_i, y_i) represent the data, with x_i an n-dimensional input vector and y_i a scalar output value, the k-NN approximation is

$$\hat{y}_i = \frac{1}{k} \sum_{j=1}^{k} y_{P(j)}, \qquad (1)$$

where ŷ_i represents the output estimate, P(j) is the index of the j-th nearest neighbor of the input x_i and k is the number of neighbors that are used. We use the same neighborhood size for every data point, so we use a global k, which must be determined.

3 Input Selection

In order to select the best set of inputs, all 2^n - 1 possible input sets (n being the maximum number of inputs) are built and evaluated. For each input set, the globally optimal number of neighbors is determined and the generalization error estimate (defined in Section 4) of the set is calculated as the mean of the errors over all data points. In this way it is possible to compare all the different input sets and to take the best one for the final k-NN approximation. Adding one input doubles the required calculation time, so we have to make a compromise between the maximum number of inputs to use and the calculation time available. This kind of exhaustive search for the best inputs is usually not preferred because of its huge computational load; with k-NN, however, the computations can be performed in a reasonable time thanks to the simplicity of the k-NN.
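For concreteness, here is a minimal sketch of the approximator in Eq. (1) in Python with NumPy. The function name `knn_approximate` and the array layout are illustrative choices, not code from the paper.

```python
import numpy as np

def knn_approximate(X_train, y_train, x, k):
    """Estimate the output for the input vector x as the average of the
    outputs of its k nearest neighbors, cf. Eq. (1)."""
    # Euclidean distances from x to every stored training input
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices P(1), ..., P(k) of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # average of the neighbors' outputs
    return y_train[nearest].mean()
```

Since "training" amounts to storing the data, evaluating a candidate k for a data point costs a single distance sort, which is what makes the exhaustive procedures of Sections 3 and 4 affordable.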

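The exhaustive search of Section 3 can then be sketched as follows, assuming NumPy arrays and a generic `estimate_gen_error(X, y, k)` stand-in for whichever generalization error estimator of Section 4 is used; all names are hypothetical.

```python
from itertools import combinations

def select_inputs(X, y, k_max, estimate_gen_error):
    """Score every non-empty subset of the n candidate inputs together
    with every neighborhood size k, and return the best combination."""
    n = X.shape[1]
    best_err, best_subset, best_k = float("inf"), None, None
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            for k in range(1, k_max + 1):
                err = estimate_gen_error(X[:, subset], y, k)
                if err < best_err:
                    best_err, best_subset, best_k = err, subset, k
    return best_subset, best_k, best_err
```

The loop structure makes the 2^n - 1 growth explicit: each additional candidate input doubles the number of subsets to score.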
4 Model Structure Selection

We consider the problem of determining the model which approximates an unknown function g(.) as accurately as possible. This approximation is chosen among a set of several possible models, denoted here by

$$h_q(\mathbf{x}, \theta(q)), \qquad (2)$$

where q represents the q-th model in the set, θ(q) are the parameters of the q-th model and x is an n-dimensional input vector. The parameters that define a set of possible models are called hyper-parameters; they are not estimated by the learning algorithm but by some external procedure [4]. In a typical learning procedure the parameters θ(q) are optimized to minimize the approximation error on the learning set; the structure is determined by minimizing the generalization error, defined as

$$E_{\mathrm{gen}}(q) = \lim_{M \to \infty} \frac{1}{M} \sum_{i=1}^{M} \bigl(h_q(\mathbf{x}_i, \theta(q)) - y_i\bigr)^2, \qquad (3)$$

where the x_i are n-dimensional input vectors to the model and the y_i the corresponding scalar expected outputs. According to definition (3), the generalization error is the mean square error of the model computed on an infinite-sized test set. Such a set is not available in practice, so we must approximate the generalization error. The best model structure q is the structure that minimizes this approximation of the generalization error. In our case, selecting the model structure consists in selecting the number of neighbors to use in the k-NN approximation. We have used two different methods to select the globally optimal number of neighbors, Leave-one-out (LOO) and Bootstrap 632. Leave-one-out is a common method used for many statistical evaluation purposes, and we want to show that Bootstrap 632 is better than LOO in this case, even though [3] claims that Bootstrap 632 does not work for selecting the number of neighbors.

4.1 Leave-One-Out

Leave-one-out [3] is a special case of the k-fold cross-validation resampling method. In k-fold cross-validation the training data is divided into k approximately equal-sized sets; the model is then trained using all but one of the sets, and the left-out set is used for validation. The generalization error estimate of k-fold cross-validation is the mean of the k validation results. The LOO procedure is the same as k-fold cross-validation with k equal to the size of the training set. So, for each neighborhood size, the LOO procedure calculates the generalization error estimate by removing one data point at a time from the training set, building the model with the rest of the training data and computing the validation error on the point that was taken out. This is done for every data point in the training set, and the estimate of the generalization error is the mean of all N validation errors,

$$\hat{E}_{\mathrm{gen}}^{\mathrm{LOO}}(q) = \frac{1}{N} \sum_{i=1}^{N} \bigl(h_q(\mathbf{x}_i, \theta_i^*(q)) - y_i\bigr)^2, \qquad (4)$$

where x_i is the i-th input vector from the training set, y_i is the corresponding output, N is the size of the training set and θ_i^*(q) denotes the model parameters obtained without using (x_i, y_i) in the training.
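A sketch of the LOO estimate (4) for the k-NN approximator: since a k-NN model has no fitted parameters beyond the stored data, leaving a point out simply means excluding it from the neighbor search. Names are illustrative.

```python
import numpy as np

def loo_error(X, y, k):
    """Leave-one-out estimate (4) of the generalization error for one k."""
    n = len(X)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                       # leave (x_i, y_i) out
        dists = np.linalg.norm(X[keep] - X[i], axis=1)
        nearest = np.argsort(dists)[:k]
        errors[i] = (y[keep][nearest].mean() - y[i]) ** 2
    return errors.mean()

# the global k is the one minimizing the LOO estimate:
# k_best = min(range(1, k_max + 1), key=lambda k: loo_error(X, y, k))
```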

Because we want to use the globally optimal size of the neighborhood, we have to calculate the LOO error for each data point and for each size of the neighborhood. After that, we can take the mean over all data points to find out which number of neighbors is globally optimal, and we select the number of neighbors that gives the smallest generalization error estimate.

4.2 Bootstrap and Bootstrap 632

The Bootstrap [4] is a resampling technique developed to estimate statistical parameters (like the mean of a population, its variance, etc.). In the case of model structure selection, the parameter to be estimated is the generalization error. When using the bootstrap, the generalization error is not estimated directly. Rather, the bootstrap estimates the difference between the generalization error and the training error, or apparent error according to Efron [2]. This difference is called the optimism. The estimate of the generalization error is then the sum of the training error and the estimated optimism. The training error is computed using all the available data in the training set,

$$E(\theta^*(q), I) = \frac{1}{N} \sum_{i=1}^{N} \bigl(h_q(\mathbf{x}_i^I, \theta^*(q)) - y_i^I\bigr)^2, \qquad (5)$$

where h_q is the q-th model, I denotes the training set, θ*(q) are the model parameters after learning, x_i^I is the i-th input vector from the training set, y_i^I is the corresponding output and N is the number of elements in the training set.

The optimism is estimated using a resampling technique based on drawing with replacement from the training set. Each bootstrap set is as large as the training set, with its elements drawn randomly from the training set. Each model is trained using a bootstrap set, and the optimism is calculated as the difference between the learning error (6) and the validation error (7). The learning error is calculated on the bootstrap set with the model trained on that same bootstrap set,

$$E(\theta_j^*(q), A_j) = \frac{1}{N} \sum_{i=1}^{N} \bigl(h_q(\mathbf{x}_i^{A_j}, \theta_j^*(q)) - y_i^{A_j}\bigr)^2, \qquad (6)$$

where A_j is the j-th bootstrap set, θ_j^*(q) are the model parameters after learning on A_j, x_i^{A_j} is the i-th input vector from the bootstrap set and y_i^{A_j} is the corresponding output. The validation error is calculated on the initial training set with the model trained on the same bootstrap set as for the learning error,

$$E(\theta_j^*(q), I) = \frac{1}{N} \sum_{i=1}^{N} \bigl(h_q(\mathbf{x}_i^{I}, \theta_j^*(q)) - y_i^I\bigr)^2. \qquad (7)$$

The optimism calculation described above is repeated for as many rounds as possible, considering the linearly increasing computation time.
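A hedged sketch of this bootstrap procedure for the k-NN case, with the optimism and final estimate of Eqs. (8) and (9) just below marked in the comments; the helper `knn_error`, which fits on one set and returns the mean squared error on another, is an illustrative construction, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_error(X_fit, y_fit, X_eval, y_eval, k):
    """Mean squared error on (X_eval, y_eval) of the k-NN approximator
    that uses (X_fit, y_fit) as its stored training data."""
    preds = np.array([y_fit[np.argsort(np.linalg.norm(X_fit - x, axis=1))[:k]].mean()
                      for x in X_eval])
    return float(np.mean((preds - y_eval) ** 2))

def bootstrap_gen_error(X, y, k, J=100):
    """Bootstrap estimate of the generalization error, Eqs. (5), (8), (9)."""
    n = len(X)
    training_error = knn_error(X, y, X, y, k)        # training error, Eq. (5)
    optimism = 0.0
    for _ in range(J):
        idx = rng.integers(0, n, size=n)             # draw with replacement
        Xb, yb = X[idx], y[idx]                      # bootstrap set A_j
        learning = knn_error(Xb, yb, Xb, yb, k)      # learning error, Eq. (6)
        validation = knn_error(Xb, yb, X, y, k)      # validation error, Eq. (7)
        optimism += validation - learning            # summand of Eq. (8)
    return training_error + optimism / J             # Eqs. (8) and (9)
```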

The optimism of a model is then calculated as the mean of the differences between the two error functions described above,

$$\widehat{\mathrm{optimism}}(q) = \frac{1}{J} \sum_{j=1}^{J} \Bigl( E(\theta_j^*(q), I) - E(\theta_j^*(q), A_j) \Bigr), \qquad (8)$$

where J is the number of bootstrap rounds done. The final generalization error estimate is the sum of the training error and the optimism,

$$\hat{E}_{\mathrm{gen}}(q) = E(\theta^*(q), I) + \widehat{\mathrm{optimism}}(q). \qquad (9)$$

Bootstrap 632 [5] is a modified version of the original Bootstrap. Where the original Bootstrap gives a biased estimate of the generalization error (3), Bootstrap 632 is not biased [4] and is thus more comparable with other methods for estimating the generalization error. Bootstrap 632 converges towards the correct generalization error in a reasonable amount of calculation time. The main difference between the standard Bootstrap and Bootstrap 632 is the estimation of the optimism. In the original Bootstrap the optimism is calculated as the difference between the errors on two sets, the initial training set and the randomly drawn bootstrap set (8). Bootstrap 632 estimates the optimism using the data points not drawn into the bootstrap set (10): the model is trained using this set of unselected data points and its error is evaluated on the bootstrap set,

$$\widehat{\mathrm{optimism}}_{632}(q) = \frac{1}{J} \sum_{j=1}^{J} E(\theta_{\bar{A}_j}^*(q), A_j), \qquad (10)$$

where the error function is the same as the learning error (6), except that the model is trained using Ā_j, the complement of the bootstrap set A_j. The generalization error estimate of Bootstrap 632 is calculated as a weighted sum of the training error and the new optimism (10),

$$\hat{E}_{\mathrm{gen}}^{632}(q) = 0.368\, E(\theta^*(q), I) + 0.632\, \widehat{\mathrm{optimism}}_{632}(q). \qquad (11)$$

From equation (11) it is quite clear where the name of Bootstrap 632 comes from: the weighting coefficient 0.632 of the optimism term is the probability of a given sample being drawn from the training set into a bootstrap set [2, 5], and the training error term is weighted by the complementary 0.368 = 1 - 0.632.
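Bootstrap 632 then differs only in how the optimism is formed. A sketch under the same assumptions, reusing the hypothetical `knn_error` helper from the previous fragment:

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap632_gen_error(X, y, k, J=100):
    """Bootstrap 632 estimate of the generalization error, Eqs. (10)-(11)."""
    n = len(X)
    training_error = knn_error(X, y, X, y, k)         # training error, Eq. (5)
    optimism = 0.0
    for _ in range(J):
        idx = rng.integers(0, n, size=n)              # bootstrap set A_j
        out = np.setdiff1d(np.arange(n), idx)         # complement of A_j
        # fit on the points left out of A_j, evaluate on A_j
        optimism += knn_error(X[out], y[out], X[idx], y[idx], k)
    optimism /= J                                     # Eq. (10)
    return 0.368 * training_error + 0.632 * optimism  # Eq. (11)
```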

5 Experimental Results

5.1 Time Series Prediction

Time series forecasting [6] is a challenge in many fields. In finance, one forecasts stock exchange rates or stock market indices; data processing specialists forecast the flow of information on their networks; producers of electricity forecast the load of the following day. The common point of these problems is the following: how can one analyse and use the past to predict the future? The next section demonstrates how k-NN can be used to predict future values of a time series.

5.2 Results on the Electric Load Data Set

The data set used in the experiments is a benchmark in the field of time series prediction: the Poland Electricity Dataset. It represents the electric load of Poland during 1500 days in the 1990s (Fig. 1). In our experiments we used the first half of the data set as a training set and the other half as a test set, in order to compare the performance of the models selected by the different methods.

Fig. 1. Electric load time series from Poland

Figures 2 and 3 show the generalization error estimates of the different methods according to the number of neighbors, when using the best selected input set.

Fig. 2. The generalization error estimate using Leave-one-out according to the number of neighbors

Table 1 shows the results of the experiments with the three methods for selecting the inputs and the number of neighbors. We have used n = 8 as the maximum number of inputs and J = 100 bootstrap rounds for both Bootstrap and Bootstrap 632.
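A sketch of how such an experiment can be set up, assuming the load series is available as a one-dimensional NumPy array; the file name and the helper below are hypothetical. The n = 8 candidate inputs of each target value are the 8 values of the series preceding it.

```python
import numpy as np

def build_regressors(series, n_lags=8):
    """Turn a scalar series into input rows of the n_lags previous values
    and the corresponding next value as output."""
    X = np.column_stack([series[i:len(series) - n_lags + i]
                         for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

# load = np.loadtxt("poland_electricity.txt")   # hypothetical file name
# X, y = build_regressors(load)
# half = len(X) // 2                            # first half for training,
# X_train, y_train = X[:half], y[:half]         # second half for testing
# X_test, y_test = X[half:], y[half:]
```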

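The error-versus-k curves of Figures 2 and 3 can then be traced by evaluating each estimator over a range of neighborhood sizes. A usage sketch, assuming the `loo_error`, `bootstrap_gen_error` and `bootstrap632_gen_error` fragments above and the training half (X_train, y_train) from the data-preparation sketch:

```python
import numpy as np

# evaluate each generalization error estimator for a range of k
k_values = np.arange(1, 31)                 # illustrative range of k
loo_curve = [loo_error(X_train, y_train, k) for k in k_values]
boot_curve = [bootstrap_gen_error(X_train, y_train, k) for k in k_values]
b632_curve = [bootstrap632_gen_error(X_train, y_train, k) for k in k_values]

# each method selects the k that minimizes its own estimate
k_loo = int(k_values[np.argmin(loo_curve)])
k_boot = int(k_values[np.argmin(boot_curve)])
k_b632 = int(k_values[np.argmin(b632_curve)])
```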
Fig. 3. The generalization error estimate using Bootstrap (solid line) and Bootstrap 632 (dashed line) according to the number of neighbors

Table 1. The results using the Electric load data set and the methods described in this paper

Method          Selected inputs       k     Ê_gen    Test error
LOO             t - {…, 5, 7, 8}      …     …        …
Bootstrap       t - {…, 5, 7, 8}      …     …        …
Bootstrap 632   t - {…, 5, 7, 8}      …     …        …

All three methods select the same inputs but a different number of neighbors. According to the test error, both Bootstrap and Bootstrap 632 select the best k. On the other hand, Bootstrap and Bootstrap 632 use J times more computation time than LOO. In Fig. 4 we have used the model selected by the bootstrap to predict the first 300 values of the test set.

Fig. 4. Test set from the Electric load data, with the first 300 values predicted and plotted. The solid line represents the real values and the dashed line the prediction

6 Conclusion

We have shown that all the methods, Leave-one-out and the Bootstraps, select the same inputs, but that the number of neighbors is selected more efficiently by the Bootstraps according to the test error, even though some studies have claimed otherwise [3].

It has also been shown that k-NN is a good approximator for time series. We have tested k-NN on a couple of other time series as well and obtained the same results. In conclusion, we suggest using Leave-one-out for input selection and Bootstrap or Bootstrap 632 for the selection of k.

Acknowledgements

Part of the work of A. Sorjamaa, N. Reyhani and A. Lendasse is supported by the project New Information Processing Principles of the Academy of Finland.

References

1. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press (1995).
2. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. Chapman & Hall (1993).
3. Kohavi, R.: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: Proc. of the 14th Int. Joint Conf. on Artificial Intelligence, Montréal (1995).
4. Lendasse, A., Wertz, V., Verleysen, M.: Model Selection with Cross-Validations and Bootstraps - Application to Time Series Prediction with RBF Models. In: Kaynak, O., Alpaydin, E., Oja, E., Xu, L. (eds.): Artificial Neural Networks and Neural Information Processing - ICANN/ICONIP 2003. Lecture Notes in Computer Science 2714, Springer-Verlag, Berlin (2003).
5. Efron, B., Tibshirani, R.J.: Improvements on Cross-Validation: The .632+ Bootstrap Method. J. Amer. Statist. Assoc. 92 (1997).
6. Weigend, A.S., Gershenfeld, N.A.: Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, Reading, MA (1994).
