FLEXIBLE AND OPTIMAL M5 MODEL TREES WITH APPLICATIONS TO FLOW PREDICTIONS
6th International Conference on Hydroinformatics - Liong, Phoon & Babovic (eds), 2004 World Scientific Publishing Company, ISBN

DIMITRI P. SOLOMATINE
UNESCO-IHE Institute for Water Education, P.O. Box 3015, Delft, The Netherlands

MICHAEL BASKARA L. A. SIEK
UNESCO-IHE Institute for Water Education, P.O. Box 3015, Delft, The Netherlands

M5 is a method developed by Quinlan [10] for inducing trees of linear regression models (model trees). This paper addresses flexibility and optimality in M5 model trees by proposing two new algorithms, M5flex and M5opt. The M5flex algorithm brings in domain knowledge by enabling the user to choose split attributes and split values for important nodes in a model tree, so that the resulting model is more accurate, reliable and appropriate for practical applications. M5opt is a semi-non-greedy algorithm with a number of improvements over M5. Six hydrological data sets and five benchmark data sets were used in the experiments, with the M5 and ANN algorithms employed for comparison. Overall, M5flex was the most accurate, followed by M5opt, M5 and ANN.

INTRODUCTION

Data-driven modelling (Solomatine [14]), based on the advances of machine learning and computational intelligence, has proved to be a powerful approach to a number of problems in the hydroinformatics context. One of the most frequently and successfully used techniques in this respect is the artificial neural network (ANN). It has been demonstrated, however, that there is a whole set of other methods that can be at least as accurate and have additional advantages (Solomatine & Dulal [15]). One such numerical prediction (regression) method, which we found to be practically unknown to practitioners, is the so-called M5 model tree of Quinlan [10].
It is based on the ideas of a popular classification method, the decision tree, which follows the principle of recursive partitioning of the input space using entropy-based measures, finally assigning class labels to the resulting subsets. In the regression context, if a leaf is associated with the average output value of the instances sorted down to it (a zero-order model), then the overall approach is called a regression tree, introduced by Breiman et al. [4]. If the tree has in its leaves more complex regression functions of the input variables, then the overall approach is called a model tree. The two notable approaches are M5 model trees (Quinlan [11]; Wang & Witten [18]) and multiple adaptive regression splines (MARS) by Friedman [8].
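The difference between the two leaf types can be shown with a toy Python sketch (an illustration, not the authors' code): a regression-tree leaf predicts the mean of the instances sorted down to it, while a model-tree leaf fits a linear model of the inputs.

```python
import numpy as np

# Toy example: the instances sorted down to one leaf of the tree
X_leaf = np.array([[1.0], [2.0], [3.0], [4.0]])
y_leaf = np.array([1.1, 1.9, 3.2, 3.8])

# Regression tree (Breiman et al. [4]): a zero-order model --
# the leaf predicts the average output of its instances.
zero_order = y_leaf.mean()

# Model tree (M5): the leaf holds a linear regression model of the inputs,
# so the prediction varies within the leaf.
slope, intercept = np.polyfit(X_leaf[:, 0], y_leaf, 1)
linear = slope * 2.5 + intercept   # prediction for a new instance x = 2.5
```

For this symmetric toy data both leaves predict the same value at x = 2.5; away from the centre of the leaf, the linear model tracks the local trend while the zero-order leaf stays constant.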
The advantages of M5 model trees (Solomatine & Dulal [15]; Solomatine [16]) are that they are more accurate than regression trees, more understandable than, for example, ANNs, easy to use and train, robust when dealing with missing data, and able to handle a large number of attributes and high dimensions. The paper describes new implementations of the M5 model tree method, namely the M5flex and M5opt algorithms, together with their applications.

M5 MODEL TREES

The M5 algorithm splits the input space progressively. The set T of examples is either associated with a leaf, or some test is chosen that splits T into subsets corresponding to the test outcomes, and the same process is applied recursively to the subsets. Splits are based on minimizing the intra-subset variation in the output values down each branch. In each node, the standard deviation of the output values of the examples reaching the node is taken as a measure of the error of this node, and the expected reduction in error is calculated for each candidate attribute and all possible split values. The attribute that maximizes the expected error reduction is chosen. The standard deviation reduction (SDR) is calculated as

SDR = sd(T) - Σ_i (|T_i| / |T|) · sd(T_i)    (1)

where T is the set of examples that reach the node and T_1, T_2, ... are the sets that result from splitting the node according to the chosen attribute (in the case of a multiple split). The splitting process terminates if the output values of all the instances that reach the node vary only slightly, or if only a few instances remain. Figure 1 presents an example. Tree-like regression models are built following the assumption that the functional dependency varies across the domain and should therefore be approximated by a number of local models (in the case of M5 trees, linear ones); this makes an M5 model tree a piecewise linear function.
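Equation (1) and the greedy attribute choice can be sketched in a few lines of Python (a minimal illustration for a binary split; standard M5 also handles multi-way splits and the stopping criteria, not shown here):

```python
import numpy as np

def sdr(y, mask):
    """Standard deviation reduction, Eq. (1), for a binary split.
    y    -- output values of the examples reaching the node
    mask -- boolean array marking the examples sent down the left branch"""
    subsets = (y[mask], y[~mask])
    return np.std(y) - sum(len(t) / len(y) * np.std(t) for t in subsets)

def best_split(X, y):
    """Greedy choice: try every attribute and every split value,
    keep the pair maximizing the expected error reduction."""
    best = (None, None, -np.inf)           # (attribute, value, SDR)
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j])[:-1]:  # candidate thresholds
            gain = sdr(y, X[:, j] <= v)
            if gain > best[2]:
                best = (j, v, gain)
    return best
```

For example, with X = [[1], [2], [3], [4]] and y = [0, 0, 10, 10], best_split returns the split x <= 2 with an SDR equal to the full standard deviation of y, since both resulting subsets become constant.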
After the initial tree has been grown, several further steps are taken: calculation of error estimates, generation of linear models, simplification of linear models, pruning and smoothing.

M5flex MODEL TREE ALGORITHM: INCLUSION OF DOMAIN EXPERT

Some approaches give the user the opportunity to choose the split attribute and value for each node. Ankerst et al. [1] introduced a visual approach to decision tree construction (based on the CART, C4, CLOUDS and SPRINT algorithms) by visualizing multi-dimensional data with a class label such that the degree of impurity with respect to class membership can be easily perceived by the user. Ware et al. [19] introduced visual decision tree (C4.5) construction using 2D polygons. Techniques for interactively building model trees, however, appear to be missing.
The challenge is to integrate background knowledge into a machine learning algorithm by allowing the user to determine some important structural properties of the model based on physical insight, while leaving the more tedious tasks to machine learning. The proposed M5flex method enables the user to determine split attributes and values in some of the important (top-most) nodes, after which the M5 machine learning algorithm takes care of the remainder of the model tree building. Typically the domain expert defines the split parameters for the nodes of the two levels at the top of the tree. The splits in these nodes are important since they affect the splitting below them and influence the performance of the whole model. User-defined splits at subsequent levels are possible as well; however, our experience shows that this becomes more complex for the user and is often less accurate than the automatic splits made by M5. In the context of flood prediction, for example, the expert user can instruct M5flex to separate the low-flow and high-flow conditions so that they are modelled separately. Hence, M5flex model trees can be more suitable for hydrological applications and operating strategies than ANNs or standard M5 model trees.

M5opt MODEL TREE ALGORITHM: OPTIMIZATION

A number of researchers have aimed at improving the predictive accuracy of tree-based models; however, they dealt mostly with decision trees: Utgoff et al. [17], Caruana & Freitag [5], Freund & Mason [7], Pfahringer et al. [9], Frank & Witten [6] (who introduced multi-way splits), Sikonja & Kononenko [13] (pruning regression trees using the minimum description length principle), and Quinlan [11] (combining regression and model trees and ANNs with instance-based learning). A notable non-greedy approach using iterative linear programming to construct globally optimal decision trees was proposed by Bennett [2]. However, we have not found publications on optimal model trees for regression.
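The division of labour that M5flex introduces in the section above, expert-chosen splits for the top-most nodes with automatic induction below them, can be sketched as follows. This is a hypothetical skeleton, not the authors' implementation; build_m5 stands in for the standard greedy M5 induction, and for brevity the same expert split is applied across a whole level:

```python
import numpy as np

def build_m5flex(X, y, expert_splits, build_m5):
    """expert_splits -- (attribute index, split value) pairs chosen by the
    domain expert, one per top level (e.g. runoff <= 300 m3/s to separate
    low and high flows); build_m5 -- the automatic induction used below
    the expert-defined levels."""
    if not expert_splits:
        return build_m5(X, y)              # hand over to standard M5
    (attr, value), rest = expert_splits[0], expert_splits[1:]
    mask = X[:, attr] <= value
    return {"split": (attr, value),
            "left":  build_m5flex(X[mask],  y[mask],  rest, build_m5),
            "right": build_m5flex(X[~mask], y[~mask], rest, build_m5)}
```

With expert_splits = [(0, 300.0)] the root separates low from high flows, and each side is then grown by the ordinary greedy procedure.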
Standard M5 adopts a greedy algorithm which constructs a model tree with a non-fixed structure, using a certain stopping criterion. M5 minimizes the error at each interior node, one node at a time. This process starts at the root and is repeated recursively until all, or almost all, of the instances are correctly classified. In constructing this initial tree M5 is greedy, and this can be improved. In principle it is possible to build a fully non-greedy algorithm; however, the computational cost of such an approach would be too high. In M5opt, a compromise combining greedy and non-greedy approaches was adopted. M5opt enables the user to define the level of the tree down to which the non-greedy algorithm is applied, starting from the root. If full exhaustive search is employed at this stage, all tree structures and all possible attributes and split values are tried; an alternative is to employ a randomized search, for example a genetic algorithm. The levels below are constructed by the greedy M5 algorithm. This principle still complies with the way the terms of the linear models at the leaves of the model tree are obtained, before pruning, from the split attributes of the interior nodes below these leaves. M5opt has a number of other attractive additional features: initial approximation (M5 builds the initial model tree in a way similar to regression trees, where the split is performed based on the averaged output values of the instances that reach a node; the M5opt algorithm builds linear models directly in the initial model tree) and compacting the tree (an improvement to the pruning method of M5).

EXPERIMENTS

Three hydrological data sets of the Sieve catchment (Italy) with hourly rainfall and runoff (Solomatine and Dulal [15]), three hydrological data sets of the Bagmati catchment (Nepal) with daily rainfall and runoff, and five widely used benchmark data sets (Autompg, Bodyfat, CPU, Friedman and Housing; Blake & Mertz [3]) were employed. Four methods were used: M5, M5flex, M5opt and ANN (MLP); M5flex was applied only to the six hydrological data sets. The problem associated with the hydrological data sets is to predict runoff (Qt+i) several hours ahead on the basis of previous runoff (Qt-) and effective rainfall (REt-). Before building a prediction model, it was necessary to analyze the physical characteristics of the catchment and then to select the input and output variables by analyzing the interdependencies between variables and the lags using correlation analysis. Finally, the following three models were built:

Qt+1 = f (REt, REt-1, REt-2, REt-3, REt-4, REt-5, Qt, Qt-1, Qt-2)
Qt+3 = f (REt, REt-1, REt-2, REt-3, Qt, Qt-1)
Qt+6 = f (REt, Qt, )

The model for the Bagmati case was set to be: Qt+1 = f (REt, REt-1, REt-2, Qt, Qt-1). In the Bagmati case study the data set was also separated into high flows and low flows, with 300 m3/s as the division point, and two additional models were built. M5 model trees were built with default parameter values: pruning factor 2.0 and the smoothing option; the same settings were also used in the M5flex and M5opt experiments.

M5flex model trees.
The user could modify the split attributes and values of the nodes in the first and second levels of the model tree only; this limitation was introduced simply to reduce the complexity that the domain expert would face. The split values used in the experiments were points around the extreme values (minimum and maximum) and the mean, and some trials were needed to find the best model tree.

M5opt model trees. There is a large number of parameter combinations that can be set in M5opt; we used twelve of these (Siek [12]).

RESULTS AND DISCUSSION

The overall experimental results are summarized in Table 1 and Table 2. In the comparison over the eleven data sets, M5opt model trees were generally the most accurate; for the Bagmati-High, Bagmati-Low, Autompg and Friedman data sets, however, the best accuracy was given by the ANN models.
The experiments with all algorithms on the six Sieve and Bagmati data sets (Table 2) showed that M5flex model trees gave the best accuracy on most of these data sets, except the Sieve Qt+6 data set, where the M5opt model was better. To compare the algorithms' performance, a scoring matrix proposed by D.L. Shrestha was used; it is a square matrix whose diagonal elements are zero and whose other elements are the averages of the relative performance of one algorithm compared to another with respect to all data sets used. The element SM_i,j of the scoring matrix should be read as the average performance of the ith algorithm over the jth algorithm and is calculated as follows:

SM_i,j = (1/N) Σ_{k=1..N} (RMSE_k,j - RMSE_k,i) / max(RMSE_k,j, RMSE_k,i),   i ≠ j
SM_i,j = 0,   i = j    (2)

where N is the number of data sets. By summing up all element values column-wise one can determine the overall score of each algorithm.

Table 1. The best performance of M5', M5opt and ANN for each data set (RMSE). Columns: Train and Verif RMSE for each of ANN, M5' and M5opt. Rows: Sieve Qt+1, Qt+3, Qt+6; Bagmati All, High, Low; and the benchmark sets Autompg, Bodyfat, CPU, Friedman, Housing.

Table 2. The best performance of M5', M5flex, M5opt and ANN for the hydrological data sets (RMSE). Columns: Train and Verif RMSE for each of ANN, M5', M5flex and M5opt. Rows: Sieve Qt+1, Qt+3, Qt+6; Bagmati All, High, Low.
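Equation (2) translates directly into code. A sketch, assuming the verification RMSE values are arranged as an array with one row per data set and one column per algorithm:

```python
import numpy as np

def scoring_matrix(rmse):
    """Eq. (2): rmse[k, i] is the verification RMSE of algorithm i on data
    set k. SM[i, j] is the average relative performance of algorithm i over
    algorithm j, expressed here in %; the diagonal is zero."""
    n_sets, n_alg = rmse.shape
    sm = np.zeros((n_alg, n_alg))
    for i in range(n_alg):
        for j in range(n_alg):
            if i != j:
                sm[i, j] = 100.0 * np.mean(
                    (rmse[:, j] - rmse[:, i])
                    / np.maximum(rmse[:, j], rmse[:, i]))
    return sm

def overall_scores(sm):
    """Overall score of each algorithm: the sum of its pairwise elements."""
    return sm.sum(axis=1)
```

A positive SM[i, j] means algorithm i had the lower RMSE than algorithm j on average, so the algorithm with the largest overall score is the best one.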
For all eleven data sets, comparing three algorithms (M5, M5opt and ANN), M5opt has the highest score, 27.8 (Table 3). The best models for eight of the eleven data sets were obtained using exhaustive search.

Table 3. Scoring matrix for all 11 verification data sets (in %). Rows and columns: ANN, M5', M5opt; the bottom row gives the Total score of each algorithm.

Table 4. Scoring matrix for the 6 verification hydrological data sets (in %). Rows and columns: ANN, M5', M5flex, M5opt; the bottom row gives the Total score of each algorithm.

The comparison of the four algorithms (M5, M5flex, M5opt and ANN) on the hydrological problems (Sieve and Bagmati) can be seen in Table 4. The score of M5opt (16.0) is lower than that of M5flex (37.4). The reason for the high performance of M5flex is that it uses additional domain knowledge to determine the best split attributes and values. Also, M5flex and M5opt could predict the peak values where the other algorithms could not. The use of compacting in M5opt makes the resulting model tree simpler (as simple as the user wants) and more balanced; this is desirable for practical applications. To see the effect of optimization and compacting, compare the model trees built for one of the case studies (Sieve Qt+6): the M5 tree (Figure 1) has 7 rules with RMSE , but the M5opt tree (Figure 2) has only 2 rules with RMSE .

Qt <= 37 :
|   REt <= : LM1 (879/3.51%)
|   REt > : LM2 (221/41.3%)
Qt > 37 :
|   REt <= :
|   |   Qt <= 70.2 : LM3 (356/24.4%)
|   |   Qt > 70.2 : LM4 (225/33.7%)
|   REt > :
|   |   REt <= 2.04 : LM5 (135/160%)
|   |   REt > 2.04 :
|   |   |   Qt <= 342 : LM6 (30/392%)
|   |   |   Qt > 342 : LM7 (8/144%)

Models at the leaves:
LM1: Qt+6 = REt Qt
LM2: Qt+6 = REt Qt
LM3: Qt+6 = REt Qt
LM4: Qt+6 = REt Qt
LM5: Qt+6 = REt Qt
LM6: Qt+6 = REt Qt
LM7: Qt+6 = REt + 0.4Qt

Number of Rules: 7
Root mean squared error

Figure 1. Model tree (M5'), Qt+6 data set

Qt <= 37 : LM1 (1100/19.3%)
Qt > 37 : LM2 (754/116%)

Models at the leaves:
LM1: Qt+6 = REt Qt
LM2: Qt+6 = REt Qt

Number of Rules: 2
Root mean squared error

Figure 2. Model tree (M5opt) with compacting of the tree until level 1, Qt+6 data set

CONCLUSION

The following can be concluded. The M5 model tree family is an accurate data-driven modelling approach leading to transparent models that can be easily understood by decision makers. The approach taken in the M5opt algorithm makes it possible to construct models more accurate than standard M5 (or M5') ones. The additional computational costs can be high, but they can be controlled by the user through selecting the tree level down to which the exhaustive search is executed. The M5flex algorithm allows domain knowledge to be brought into the process of data-driven modelling, and in a number of cases it outperforms M5opt and ANN. It does require the involvement of a domain expert, but we see this more as a strength. Further research will be oriented towards reducing the computational time of M5opt and refining the procedure of building the M5 trees. The immediate plan of the authors is to improve the M5opt algorithm by introducing a choice of optimization approaches, and to try to combine M5opt with M5flex.

REFERENCES

[1] Ankerst, M., Elsen, C., Ester, M. and Kriegel, H., Visual classification: an interactive approach to decision tree construction, In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, ACM Press, (1999), pp.
[2] Bennett, K.P., Global tree optimization: a non-greedy decision tree algorithm, Journal of Computing Science and Statistics, Vol. 26, (1994), pp.
[3] Blake, C.L. and Mertz, C.J., UCI Repository of machine learning databases, Univ. of California, (1998).
[4] Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J., Classification and Regression Trees, Wadsworth, Belmont, CA, (1984).
[5] Caruana, R. and Freitag, D., Greedy attribute selection, International Conference on Machine Learning, (1994), pp.
[6] Frank, E. and Witten, I.H., Selecting multiway splits in decision trees, Working paper 96/31, Dept. of Computer Science, University of Waikato, (1996).
[7] Freund, Y. and Mason, L., The alternating decision tree learning algorithm, Proc. 16th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, (1999), pp.
[8] Friedman, J.H., Multivariate adaptive regression splines, Annals of Statistics, Vol. 19, (1991), pp.
[9] Pfahringer, B., Holmes, G. and Kirkby, R., Optimizing the induction of alternating decision trees, Proceedings of the Fifth Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining, (2001).
[10] Quinlan, J.R., Learning with continuous classes, Proc. AI'92, 5th Australian Joint Conference on Artificial Intelligence, Adams & Sterling (eds.), World Scientific, Singapore, (1992), pp.
[11] Quinlan, J.R., Combining instance-based and model-based learning, In Proceedings ML'93 (Utgoff, ed.), Morgan Kaufmann, (1993).
[12] Siek, M.B.L.A., Flexibility and optimality in model tree learning with application to water-related problems, MSc Thesis Report, IHE Delft, (2003).
[13] Sikonja, M.R. and Kononenko, I., Pruning regression trees with MDL, 13th European Conference on Artificial Intelligence (ECAI 98), (1998).
[14] Solomatine, D.P., Data-driven modelling: paradigm, methods, experiences, Proc. 5th International Conference on Hydroinformatics, Cardiff, UK, (2002).
[15] Solomatine, D.P. and Dulal, K.N., Model tree as an alternative to neural network in rainfall-runoff modelling, Hydrological Sc. J., Vol. 48(3), (2003), pp.
[16] Solomatine, D.P., Mixtures of simple models vs ANNs in hydrological modelling, Proc. Int. Conference on Hybrid Intelligent Systems (HIS'03), Melbourne, (2003).
[17] Utgoff, P.E., Berkman, N.C. and Clouse, J.A., Decision tree induction based on efficient tree restructuring, J. of Machine Learning, Vol. 29(1), (1997), pp.
[18] Wang, Y. and Witten, I.H., Induction of model trees for predicting continuous classes, Proc. European Conf. on Machine Learning, Prague, (1997), pp.
[19] Ware, M., Frank, E., Holmes, G., Hall, M. and Witten, I.H., Interactive machine learning: letting users build classifiers, Int. J. on Human-Computer Studies, (2000).
More informationLecture 7: Decision Trees
Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...
More informationBiology Project 1
Biology 6317 Project 1 Data and illustrations courtesy of Professor Tony Frankino, Department of Biology/Biochemistry 1. Background The data set www.math.uh.edu/~charles/wing_xy.dat has measurements related
More informationImproving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets
Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Md Nasim Adnan and Md Zahidul Islam Centre for Research in Complex Systems (CRiCS)
More informationAn Empirical Study on feature selection for Data Classification
An Empirical Study on feature selection for Data Classification S.Rajarajeswari 1, K.Somasundaram 2 Department of Computer Science, M.S.Ramaiah Institute of Technology, Bangalore, India 1 Department of
More informationAlgorithms: Decision Trees
Algorithms: Decision Trees A small dataset: Miles Per Gallon Suppose we want to predict MPG From the UCI repository A Decision Stump Recursion Step Records in which cylinders = 4 Records in which cylinders
More informationDECISION TREE INDUCTION USING ROUGH SET THEORY COMPARATIVE STUDY
DECISION TREE INDUCTION USING ROUGH SET THEORY COMPARATIVE STUDY Ramadevi Yellasiri, C.R.Rao 2,Vivekchan Reddy Dept. of CSE, Chaitanya Bharathi Institute of Technology, Hyderabad, INDIA. 2 DCIS, School
More informationGenetic Programming for Data Classification: Partitioning the Search Space
Genetic Programming for Data Classification: Partitioning the Search Space Jeroen Eggermont jeggermo@liacs.nl Joost N. Kok joost@liacs.nl Walter A. Kosters kosters@liacs.nl ABSTRACT When Genetic Programming
More informationSoftening Splits in Decision Trees Using Simulated Annealing
Softening Splits in Decision Trees Using Simulated Annealing Jakub Dvořák and Petr Savický Institute of Computer Science, Academy of Sciences of the Czech Republic {dvorak,savicky}@cs.cas.cz Abstract.
More informationBusiness Club. Decision Trees
Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building
More informationCloNI: clustering of JN -interval discretization
CloNI: clustering of JN -interval discretization C. Ratanamahatana Department of Computer Science, University of California, Riverside, USA Abstract It is known that the naive Bayesian classifier typically
More informationImplementation of Classification Rules using Oracle PL/SQL
1 Implementation of Classification Rules using Oracle PL/SQL David Taniar 1 Gillian D cruz 1 J. Wenny Rahayu 2 1 School of Business Systems, Monash University, Australia Email: David.Taniar@infotech.monash.edu.au
More informationRank Measures for Ordering
Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many
More informationDecision Tree CE-717 : Machine Learning Sharif University of Technology
Decision Tree CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adapted from: Prof. Tom Mitchell Decision tree Approximating functions of usually discrete
More informationNoise-based Feature Perturbation as a Selection Method for Microarray Data
Noise-based Feature Perturbation as a Selection Method for Microarray Data Li Chen 1, Dmitry B. Goldgof 1, Lawrence O. Hall 1, and Steven A. Eschrich 2 1 Department of Computer Science and Engineering
More informationarxiv: v1 [stat.ml] 25 Jan 2018
arxiv:1801.08310v1 [stat.ml] 25 Jan 2018 Information gain ratio correction: Improving prediction with more balanced decision tree splits Antonin Leroux 1, Matthieu Boussard 1, and Remi Dès 1 1 craft ai
More informationCOMP 465: Data Mining Classification Basics
Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised
More informationLecture 2 :: Decision Trees Learning
Lecture 2 :: Decision Trees Learning 1 / 62 Designing a learning system What to learn? Learning setting. Learning mechanism. Evaluation. 2 / 62 Prediction task Figure 1: Prediction task :: Supervised learning
More informationDESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES
EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset
More informationNetwork. Department of Statistics. University of California, Berkeley. January, Abstract
Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,
More informationFuzzy-Rough Feature Significance for Fuzzy Decision Trees
Fuzzy-Rough Feature Significance for Fuzzy Decision Trees Richard Jensen and Qiang Shen Department of Computer Science, The University of Wales, Aberystwyth {rkj,qqs}@aber.ac.uk Abstract Crisp decision
More informationUSING REGRESSION TREES IN PREDICTIVE MODELLING
Production Systems and Information Engineering Volume 4 (2006), pp. 115-124 115 USING REGRESSION TREES IN PREDICTIVE MODELLING TAMÁS FEHÉR University of Miskolc, Hungary Department of Information Engineering
More informationUsing Turning Point Detection to Obtain Better Regression Trees
Using Turning Point Detection to Obtain Better Regression Trees Paul K. Amalaman, Christoph F. Eick and Nouhad Rizk pkamalam@uh.edu, ceick@uh.edu, nrizk@uh.edu Department of Computer Science, University
More informationA Cloud Framework for Big Data Analytics Workflows on Azure
A Cloud Framework for Big Data Analytics Workflows on Azure Fabrizio MAROZZO a, Domenico TALIA a,b and Paolo TRUNFIO a a DIMES, University of Calabria, Rende (CS), Italy b ICAR-CNR, Rende (CS), Italy Abstract.
More informationInternational Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X
Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,
More informationAdaptive Building of Decision Trees by Reinforcement Learning
Proceedings of the 7th WSEAS International Conference on Applied Informatics and Communications, Athens, Greece, August 24-26, 2007 34 Adaptive Building of Decision Trees by Reinforcement Learning MIRCEA
More informationC-NBC: Neighborhood-Based Clustering with Constraints
C-NBC: Neighborhood-Based Clustering with Constraints Piotr Lasek Chair of Computer Science, University of Rzeszów ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland lasek@ur.edu.pl Abstract. Clustering is
More informationPattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition
Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant
More informationA Comparative Study of Reliable Error Estimators for Pruning Regression Trees
A Comparative Study of Reliable Error Estimators for Pruning Regression Trees Luís Torgo LIACC/FEP University of Porto R. Campo Alegre, 823, 2º - 4150 PORTO - PORTUGAL Phone : (+351) 2 607 8830 Fax : (+351)
More informationDecision Tree Induction from Distributed Heterogeneous Autonomous Data Sources
Decision Tree Induction from Distributed Heterogeneous Autonomous Data Sources Doina Caragea, Adrian Silvescu, and Vasant Honavar Artificial Intelligence Research Laboratory, Computer Science Department,
More informationPerformance analysis of a MLP weight initialization algorithm
Performance analysis of a MLP weight initialization algorithm Mohamed Karouia (1,2), Régis Lengellé (1) and Thierry Denœux (1) (1) Université de Compiègne U.R.A. CNRS 817 Heudiasyc BP 49 - F-2 Compiègne
More informationA Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995) Department of Information, Operations and Management Sciences Stern School of Business, NYU padamopo@stern.nyu.edu
More informationMachine Learning. A. Supervised Learning A.7. Decision Trees. Lars Schmidt-Thieme
Machine Learning A. Supervised Learning A.7. Decision Trees Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany 1 /
More informationGraph Propositionalization for Random Forests
Graph Propositionalization for Random Forests Thashmee Karunaratne Dept. of Computer and Systems Sciences, Stockholm University Forum 100, SE-164 40 Kista, Sweden si-thk@dsv.su.se Henrik Boström Dept.
More informationTowards an Effective Cooperation of the User and the Computer for Classification
ACM SIGKDD Int. Conf. on Knowledge Discovery & Data Mining (KDD- 2000), Boston, MA. Towards an Effective Cooperation of the User and the Computer for Classification Mihael Ankerst, Martin Ester, Hans-Peter
More informationCS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008
CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008 Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute NAME: Prof. Ruiz Problem
More informationBOAI: Fast Alternating Decision Tree Induction based on Bottom-up Evaluation
: Fast Alternating Decision Tree Induction based on Bottom-up Evaluation Bishan Yang, Tengjiao Wang, Dongqing Yang, and Lei Chang Key Laboratory of High Confidence Software Technologies (Peking University),
More informationDecision Trees Dr. G. Bharadwaja Kumar VIT Chennai
Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target
More informationMySQL Data Mining: Extending MySQL to support data mining primitives (demo)
MySQL Data Mining: Extending MySQL to support data mining primitives (demo) Alfredo Ferro, Rosalba Giugno, Piera Laura Puglisi, and Alfredo Pulvirenti Dept. of Mathematics and Computer Sciences, University
More informationLOAD BALANCING IN MOBILE INTELLIGENT AGENTS FRAMEWORK USING DATA MINING CLASSIFICATION TECHNIQUES
8 th International Conference on DEVELOPMENT AND APPLICATION SYSTEMS S u c e a v a, R o m a n i a, M a y 25 27, 2 0 0 6 LOAD BALANCING IN MOBILE INTELLIGENT AGENTS FRAMEWORK USING DATA MINING CLASSIFICATION
More informationCustomer Clustering using RFM analysis
Customer Clustering using RFM analysis VASILIS AGGELIS WINBANK PIRAEUS BANK Athens GREECE AggelisV@winbank.gr DIMITRIS CHRISTODOULAKIS Computer Engineering and Informatics Department University of Patras
More informationChapter 12 Feature Selection
Chapter 12 Feature Selection Xiaogang Su Department of Statistics University of Central Florida - 1 - Outline Why Feature Selection? Categorization of Feature Selection Methods Filter Methods Wrapper Methods
More informationApplication of Multivariate Adaptive Regression Splines to Evaporation Losses in Reservoirs
Open access e-journal Earth Science India, eissn: 0974 8350 Vol. 4(I), January, 20, pp.5-20 http://www.earthscienceindia.info/ Application of Multivariate Adaptive Regression Splines to Evaporation Losses
More informationNotes based on: Data Mining for Business Intelligence
Chapter 9 Classification and Regression Trees Roger Bohn April 2017 Notes based on: Data Mining for Business Intelligence 1 Shmueli, Patel & Bruce 2 3 II. Results and Interpretation There are 1183 auction
More informationA Comparative Study of Selected Classification Algorithms of Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220
More informationEvaluating the Replicability of Significance Tests for Comparing Learning Algorithms
Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms Remco R. Bouckaert 1,2 and Eibe Frank 2 1 Xtal Mountain Information Technology 215 Three Oaks Drive, Dairy Flat, Auckland,
More informationCLASSIFICATION FOR SCALING METHODS IN DATA MINING
CLASSIFICATION FOR SCALING METHODS IN DATA MINING Eric Kyper, College of Business Administration, University of Rhode Island, Kingston, RI 02881 (401) 874-7563, ekyper@mail.uri.edu Lutz Hamel, Department
More information