FLEXIBLE AND OPTIMAL M5 MODEL TREES WITH APPLICATIONS TO FLOW PREDICTIONS
6th International Conference on Hydroinformatics - Liong, Phoon & Babovic (eds), 2004 World Scientific Publishing Company, ISBN

DIMITRI P. SOLOMATINE
UNESCO-IHE Institute for Water Education, P.O. Box 3015, Delft, The Netherlands

MICHAEL BASKARA L. A. SIEK
UNESCO-IHE Institute for Water Education, P.O. Box 3015, Delft, The Netherlands

M5 is a method developed by Quinlan [10] for inducing trees of linear regression models (model trees). This paper addresses flexibility and optimality in M5 model trees by proposing two new algorithms, M5flex and M5opt. The M5flex algorithm brings in domain knowledge by enabling the user to choose split attributes and split values for important nodes in a model tree, so that the resulting model is more accurate, reliable and appropriate for practical applications. M5opt is a semi-non-greedy algorithm with a number of improvements over M5. Six hydrological data sets and five benchmark data sets were used in the experiments, with the M5 and ANN algorithms employed for comparison. Overall, M5flex was the most accurate, followed by M5opt, M5 and ANN.

INTRODUCTION

Data-driven modelling (Solomatine [14]), based on the advances of machine learning and computational intelligence, has proved to be a powerful approach to a number of problems in the hydroinformatics context. One of the most frequently and successfully used techniques in this respect is the artificial neural network (ANN). It has been demonstrated, however, that there is a whole set of other methods that can be at least as accurate and have additional advantages (Solomatine & Dulal [15]). One such numerical prediction (regression) method, which we found to be practically unknown to practitioners, is the so-called M5 model tree of Quinlan [10].
It is based on the ideas of a popular classification method, the decision tree, which follows the principle of recursive partitioning of the input space using entropy-based measures, finally assigning class labels to the resulting subsets. In the regression context, if a leaf is associated with the average output value of the instances sorted down to it (a zero-order model), then the overall approach is called a regression tree, introduced by Breiman et al. [4]. If the tree has in its leaves more complex regression functions of the input variables, then the overall approach is called a model tree. The two notable approaches are M5 model trees (Quinlan [11]; Wang & Witten [18]) and multiple adaptive regression splines (MARS) by Friedman [8].
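The difference between the two leaf types can be shown with a toy Python sketch (an illustration, not the authors' code): a regression-tree leaf predicts the mean of the instances sorted down to it, while a model-tree leaf fits a linear model of the inputs.

```python
import numpy as np

# Toy example: the instances sorted down to one leaf of the tree
X_leaf = np.array([[1.0], [2.0], [3.0], [4.0]])
y_leaf = np.array([1.1, 1.9, 3.2, 3.8])

# Regression tree (Breiman et al. [4]): a zero-order model --
# the leaf predicts the average output of its instances.
zero_order = y_leaf.mean()

# Model tree (M5): the leaf holds a linear regression model of the inputs,
# so the prediction varies within the leaf.
slope, intercept = np.polyfit(X_leaf[:, 0], y_leaf, 1)
linear = slope * 2.5 + intercept   # prediction for a new instance x = 2.5
```

For this symmetric toy data both leaves predict the same value at x = 2.5; away from the centre of the leaf, the linear model tracks the local trend while the zero-order leaf stays constant.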
The advantages of M5 model trees (Solomatine & Dulal [15]; Solomatine [16]) are that they are more accurate than regression trees, more understandable than, for example, ANNs, easy to use and train, robust when dealing with missing data, and able to handle a large number of attributes and high dimensions. The paper describes new implementations of the M5 model tree method, namely the M5flex and M5opt algorithms, together with their applications.

M5 MODEL TREES

The M5 algorithm splits the input space progressively. The set T of examples is either associated with a leaf, or some test is chosen that splits T into subsets corresponding to the test outcomes, and the same process is applied recursively to the subsets. Splits are based on minimizing the intra-subset variation in the output values down each branch. In each node, the standard deviation of the output values of the examples reaching the node is taken as a measure of the error of this node, and the expected reduction in error is calculated for each candidate attribute and all possible split values. The attribute that maximizes the expected error reduction is chosen. The standard deviation reduction (SDR) is calculated as

SDR = sd(T) - Σ_i (|T_i| / |T|) · sd(T_i)    (1)

where T is the set of examples that reach the node and T_1, T_2, ... are the sets that result from splitting the node according to the chosen attribute (in the case of a multiple split). The splitting process terminates if the output values of all the instances that reach the node vary only slightly, or if only a few instances remain. Figure 1 presents an example. Tree-like regression models are built following the assumption that the functional dependency varies across the domain and should therefore be approximated by a number of local models (in the case of M5 trees, linear ones); this makes an M5 model tree a piecewise linear function.
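Equation (1) and the greedy attribute choice can be sketched in a few lines of Python (a minimal illustration for a binary split; standard M5 also handles multi-way splits and the stopping criteria, not shown here):

```python
import numpy as np

def sdr(y, mask):
    """Standard deviation reduction, Eq. (1), for a binary split.
    y    -- output values of the examples reaching the node
    mask -- boolean array marking the examples sent down the left branch"""
    subsets = (y[mask], y[~mask])
    return np.std(y) - sum(len(t) / len(y) * np.std(t) for t in subsets)

def best_split(X, y):
    """Greedy choice: try every attribute and every split value,
    keep the pair maximizing the expected error reduction."""
    best = (None, None, -np.inf)           # (attribute, value, SDR)
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j])[:-1]:  # candidate thresholds
            gain = sdr(y, X[:, j] <= v)
            if gain > best[2]:
                best = (j, v, gain)
    return best
```

For example, with X = [[1], [2], [3], [4]] and y = [0, 0, 10, 10], best_split returns the split x <= 2 with an SDR equal to the full standard deviation of y, since both resulting subsets become constant.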
After the initial tree has been grown, several further steps are taken: calculation of error estimates, generation of linear models, simplification of linear models, pruning and smoothing.

M5flex MODEL TREE ALGORITHM: INCLUSION OF DOMAIN EXPERT

Some approaches give the user the opportunity to choose the split attribute and value for each node. Ankerst et al. [1] introduced a visual approach to decision tree construction (based on the CART, C4, CLOUDS and SPRINT algorithms) by visualizing multi-dimensional data with a class label such that the degree of impurity with respect to class membership can be easily perceived by the user. Ware et al. [19] introduced visual decision tree (C4.5) construction using 2D polygons. Techniques for interactively building model trees, however, appear to be missing.
The challenge is to integrate background knowledge into a machine learning algorithm by allowing the user to determine some important structural properties of the model based on physical insight, while leaving the more tedious tasks to machine learning. The proposed M5flex method enables the user to determine split attributes and values in some of the important (top-most) nodes, after which the M5 machine learning algorithm takes care of the remainder of the model tree building. Typically the domain expert defines the split parameters for the nodes of the two levels at the top of the tree. The splits in these nodes are important since they affect the splitting below them and influence the performance of the whole model. User-defined splits at subsequent levels are possible as well; however, our experience shows that this becomes more complex for the user and is often less accurate than the automatic splits made by M5. In the context of flood prediction, for example, the expert user can instruct M5flex to separate the low-flow and high-flow conditions so that they are modelled separately. Hence, M5flex model trees can be more suitable for hydrological applications and operating strategies than ANNs or standard M5 model trees.

M5opt MODEL TREE ALGORITHM: OPTIMIZATION

A number of researchers have aimed at improving the predictive accuracy of tree-based models; however, they dealt mostly with decision trees: Utgoff et al. [17], Caruana & Freitag [5], Freund & Mason [7], Pfahringer et al. [9], Frank & Witten [6] (who introduced multi-way splits), Sikonja & Kononenko [13] (pruning regression trees using the minimum description length principle), and Quinlan [11] (combining regression and model trees and ANNs with instance-based learning). A notable non-greedy approach using iterative linear programming to construct globally optimal decision trees was proposed by Bennett [2]. However, we have not found publications on optimal model trees for regression.
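The division of labour that M5flex introduces in the section above, expert-chosen splits for the top-most nodes with automatic induction below them, can be sketched as follows. This is a hypothetical skeleton, not the authors' implementation; build_m5 stands in for the standard greedy M5 induction, and for brevity the same expert split is applied across a whole level:

```python
import numpy as np

def build_m5flex(X, y, expert_splits, build_m5):
    """expert_splits -- (attribute index, split value) pairs chosen by the
    domain expert, one per top level (e.g. runoff <= 300 m3/s to separate
    low and high flows); build_m5 -- the automatic induction used below
    the expert-defined levels."""
    if not expert_splits:
        return build_m5(X, y)              # hand over to standard M5
    (attr, value), rest = expert_splits[0], expert_splits[1:]
    mask = X[:, attr] <= value
    return {"split": (attr, value),
            "left":  build_m5flex(X[mask],  y[mask],  rest, build_m5),
            "right": build_m5flex(X[~mask], y[~mask], rest, build_m5)}
```

With expert_splits = [(0, 300.0)] the root separates low from high flows, and each side is then grown by the ordinary greedy procedure.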
Standard M5 adopts a greedy algorithm which constructs a model tree with a non-fixed structure, using a certain stopping criterion. M5 minimizes the error at each interior node, one node at a time. This process starts at the root and is repeated recursively until all, or almost all, of the instances are correctly classified. In constructing this initial tree M5 is greedy, and this can be improved. In principle it is possible to build a fully non-greedy algorithm; however, the computational cost of such an approach would be too high. In M5opt, a compromise combining greedy and non-greedy approaches was adopted. M5opt enables the user to define the level of the tree down to which the non-greedy algorithm is applied, starting from the root. If full exhaustive search is employed at this stage, all tree structures and all possible attributes and split values are tried; an alternative is to employ a randomized search, for example a genetic algorithm. The levels below are constructed by the greedy M5 algorithm. This principle still complies with the way the terms of the linear models at the leaves of the model tree are obtained, before pruning, from the split attributes of the interior nodes below these leaves. M5opt has a number of other attractive additional features: initial approximation (M5 builds the initial model tree in a way similar to regression trees, where the split is performed based on the averaged output values of the instances that reach a node; the M5opt algorithm builds linear models directly in the initial model tree) and compacting the tree (an improvement to the pruning method of M5).

EXPERIMENTS

Three hydrological data sets of the Sieve catchment (Italy) with hourly rainfall and runoff (Solomatine and Dulal [15]), three hydrological data sets of the Bagmati catchment (Nepal) with daily rainfall and runoff, and five widely used benchmark data sets (Autompg, Bodyfat, CPU, Friedman and Housing; Blake & Mertz [3]) were employed. Four methods were used: M5, M5flex, M5opt and ANN (MLP); M5flex was applied only to the six hydrological data sets. The problem associated with the hydrological data sets is to predict runoff (Qt+i) several hours ahead on the basis of previous runoff (Qt-) and effective rainfall (REt-). Before building a prediction model, it was necessary to analyze the physical characteristics of the catchment and then to select the input and output variables by analyzing the interdependencies between variables and the lags using correlation analysis. Finally, the following three models were built:

Qt+1 = f (REt, REt-1, REt-2, REt-3, REt-4, REt-5, Qt, Qt-1, Qt-2)
Qt+3 = f (REt, REt-1, REt-2, REt-3, Qt, Qt-1)
Qt+6 = f (REt, Qt, )

The model for the Bagmati case was set to be: Qt+1 = f (REt, REt-1, REt-2, Qt, Qt-1). In the Bagmati case study the data set was also separated into high flows and low flows, with 300 m3/s as the division point, and two additional models were built. M5 model trees were built with default parameter values: pruning factor 2.0 and the smoothing option; the same settings were also used in the M5flex and M5opt experiments.

M5flex model trees.
The user could modify the split attributes and values of the nodes in the first and second levels of the model tree only; this limitation was introduced simply to reduce the complexity that the domain expert would face. The split values used in the experiments were points around the extreme values (minimum and maximum) and the mean, and some trials were needed to find the best model tree.

M5opt model trees. There is a large number of parameter combinations that can be set in M5opt; we used twelve of these (Siek [12]).

RESULTS AND DISCUSSION

The overall experimental results are summarized in Table 1 and Table 2. In the comparison over the eleven data sets, M5opt model trees were generally the most accurate; for the Bagmati-High, Bagmati-Low, Autompg and Friedman data sets, however, the best accuracy was given by the ANN models.
The experiments with all algorithms on the six Sieve and Bagmati data sets (Table 2) showed that M5flex model trees gave the best accuracy on most of these data sets, except the Sieve Qt+6 data set, where the M5opt model was better. To compare the algorithms' performance, a scoring matrix proposed by D.L. Shrestha was used; it is a square matrix whose diagonal elements are zero and whose other elements are the averages of the relative performance of one algorithm compared to another with respect to all data sets used. The element SM_i,j of the scoring matrix should be read as the average performance of the ith algorithm over the jth algorithm and is calculated as follows:

SM_i,j = (1/N) Σ_{k=1..N} (RMSE_k,j - RMSE_k,i) / max(RMSE_k,j, RMSE_k,i),   i ≠ j
SM_i,j = 0,   i = j    (2)

where N is the number of data sets. By summing up all element values column-wise one can determine the overall score of each algorithm.

Table 1. The best performance of M5', M5opt and ANN for each data set (RMSE). Columns: Train and Verif RMSE for each of ANN, M5' and M5opt. Rows: Sieve Qt+1, Qt+3, Qt+6; Bagmati All, High, Low; and the benchmark sets Autompg, Bodyfat, CPU, Friedman, Housing.

Table 2. The best performance of M5', M5flex, M5opt and ANN for the hydrological data sets (RMSE). Columns: Train and Verif RMSE for each of ANN, M5', M5flex and M5opt. Rows: Sieve Qt+1, Qt+3, Qt+6; Bagmati All, High, Low.
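Equation (2) translates directly into code. A sketch, assuming the verification RMSE values are arranged as an array with one row per data set and one column per algorithm:

```python
import numpy as np

def scoring_matrix(rmse):
    """Eq. (2): rmse[k, i] is the verification RMSE of algorithm i on data
    set k. SM[i, j] is the average relative performance of algorithm i over
    algorithm j, expressed here in %; the diagonal is zero."""
    n_sets, n_alg = rmse.shape
    sm = np.zeros((n_alg, n_alg))
    for i in range(n_alg):
        for j in range(n_alg):
            if i != j:
                sm[i, j] = 100.0 * np.mean(
                    (rmse[:, j] - rmse[:, i])
                    / np.maximum(rmse[:, j], rmse[:, i]))
    return sm

def overall_scores(sm):
    """Overall score of each algorithm: the sum of its pairwise elements."""
    return sm.sum(axis=1)
```

A positive SM[i, j] means algorithm i had the lower RMSE than algorithm j on average, so the algorithm with the largest overall score is the best one.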
For all eleven data sets, comparing three algorithms (M5, M5opt and ANN), M5opt has the highest score, 27.8 (Table 3). The best models for eight of the eleven data sets were obtained using exhaustive search.

Table 3. Scoring matrix for all 11 verification data sets (in %). Rows and columns: ANN, M5', M5opt; the bottom row gives the Total score of each algorithm.

Table 4. Scoring matrix for the 6 verification hydrological data sets (in %). Rows and columns: ANN, M5', M5flex, M5opt; the bottom row gives the Total score of each algorithm.

The comparison of the four algorithms (M5, M5flex, M5opt and ANN) on the hydrological problems (Sieve and Bagmati) can be seen in Table 4. The score of M5opt (16.0) is lower than that of M5flex (37.4). The reason for the high performance of M5flex is that it uses additional domain knowledge to determine the best split attributes and values. Also, M5flex and M5opt could predict the peak values where the other algorithms could not. The use of compacting in M5opt makes the resulting model tree simpler (as simple as the user wants) and more balanced; this is desirable for practical applications. To see the effect of optimization and compacting, compare the model trees built for one of the case studies (Sieve Qt+6): the M5 tree (Figure 1) has 7 rules with RMSE , but the M5opt tree (Figure 2) has only 2 rules with RMSE .

Qt <= 37 :
|   REt <= : LM1 (879/3.51%)
|   REt > : LM2 (221/41.3%)
Qt > 37 :
|   REt <= :
|   |   Qt <= 70.2 : LM3 (356/24.4%)
|   |   Qt > 70.2 : LM4 (225/33.7%)
|   REt > :
|   |   REt <= 2.04 : LM5 (135/160%)
|   |   REt > 2.04 :
|   |   |   Qt <= 342 : LM6 (30/392%)
|   |   |   Qt > 342 : LM7 (8/144%)

Models at the leaves:
LM1: Qt+6 = REt Qt
LM2: Qt+6 = REt Qt
LM3: Qt+6 = REt Qt
LM4: Qt+6 = REt Qt
LM5: Qt+6 = REt Qt
LM6: Qt+6 = REt Qt
LM7: Qt+6 = REt + 0.4Qt

Number of Rules: 7
Root mean squared error

Figure 1. Model tree (M5'), Qt+6 data set

Qt <= 37 : LM1 (1100/19.3%)
Qt > 37 : LM2 (754/116%)

Models at the leaves:
LM1: Qt+6 = REt Qt
LM2: Qt+6 = REt Qt

Number of Rules: 2
Root mean squared error

Figure 2. Model tree (M5opt) with compacting of the tree until level 1, Qt+6 data set

CONCLUSION

The following can be concluded. The M5 model tree family is an accurate data-driven modelling approach leading to transparent models that can be easily understood by decision makers. The approach taken in the M5opt algorithm makes it possible to construct models more accurate than standard M5 (or M5') ones. The additional computational costs can be high, but they can be controlled by the user through selecting the tree level down to which the exhaustive search is executed. The M5flex algorithm allows domain knowledge to be brought into the process of data-driven modelling, and in a number of cases it outperforms M5opt and ANN. It does require the involvement of a domain expert, but we see this more as a strength. Further research will be oriented towards reducing the computational time of M5opt and refining the procedure of building the M5 trees. The immediate plan of the authors is to improve the M5opt algorithm by introducing a choice of optimization approaches, and to try to combine M5opt with M5flex.

REFERENCES

[1] Ankerst, M., Elsen, C., Ester, M. and Kriegel, H., Visual classification: an interactive approach to decision tree construction, In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, ACM Press, (1999), pp.
[2] Bennett, K.P., Global tree optimization: a non-greedy decision tree algorithm, Journal of Computing Science and Statistics, Vol. 26, (1994), pp.
[3] Blake, C.L. and Mertz, C.J., UCI Repository of machine learning databases, Univ. of California, (1998).
[4] Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J., Classification and Regression Trees, Wadsworth, Belmont, CA, (1984).
[5] Caruana, R. and Freitag, D., Greedy attribute selection, International Conference on Machine Learning, (1994), pp.
[6] Frank, E. and Witten, I.H., Selecting multiway splits in decision trees, Working paper 96/31, Dept. of Computer Science, University of Waikato, (1996).
[7] Freund, Y. and Mason, L., The alternating decision tree learning algorithm, Proc. 16th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, (1999), pp.
[8] Friedman, J.H., Multivariate adaptive regression splines, Annals of Statistics, Vol. 19, (1991), pp.
[9] Pfahringer, B., Holmes, G. and Kirkby, R., Optimizing the induction of alternating decision trees, Proceedings of the Fifth Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining, (2001).
[10] Quinlan, J.R., Learning with continuous classes, Proc. AI'92, 5th Australian Joint Conference on Artificial Intelligence, Adams & Sterling (eds.), World Scientific, Singapore, (1992), pp.
[11] Quinlan, J.R., Combining instance-based and model-based learning, In Proceedings ML'93 (Utgoff, ed.), Morgan Kaufmann, (1993).
[12] Siek, M.B.L.A., Flexibility and optimality in model tree learning with application to water-related problems, MSc Thesis Report, IHE Delft, (2003).
[13] Sikonja, M.R. and Kononenko, I., Pruning regression trees with MDL, 13th European Conference on Artificial Intelligence (ECAI 98), (1998).
[14] Solomatine, D.P., Data-driven modelling: paradigm, methods, experiences, Proc. 5th International Conference on Hydroinformatics, Cardiff, UK, (2002).
[15] Solomatine, D.P. and Dulal, K.N., Model tree as an alternative to neural network in rainfall-runoff modelling, Hydrological Sc. J., Vol. 48(3), (2003), pp.
[16] Solomatine, D.P., Mixtures of simple models vs ANNs in hydrological modelling, Proc. Int. Conference on Hybrid Intelligent Systems (HIS'03), Melbourne, (2003).
[17] Utgoff, P.E., Berkman, N.C. and Clouse, J.A., Decision tree induction based on efficient tree restructuring, J. of Machine Learning, Vol. 29(1), (1997), pp.
[18] Wang, Y. and Witten, I.H., Induction of model trees for predicting continuous classes, Proc. European Conf. on Machine Learning, Prague, (1997), pp.
[19] Ware, M., Frank, E., Holmes, G., Hall, M. and Witten, I.H., Interactive machine learning: letting users build classifiers, Int. J. on Human-Computer Studies, (2000).
More informationLecture 7: Decision Trees
Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...
More informationBiology Project 1
Biology 6317 Project 1 Data and illustrations courtesy of Professor Tony Frankino, Department of Biology/Biochemistry 1. Background The data set www.math.uh.edu/~charles/wing_xy.dat has measurements related
More informationImproving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets
Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Md Nasim Adnan and Md Zahidul Islam Centre for Research in Complex Systems (CRiCS)
More informationAn Empirical Study on feature selection for Data Classification
An Empirical Study on feature selection for Data Classification S.Rajarajeswari 1, K.Somasundaram 2 Department of Computer Science, M.S.Ramaiah Institute of Technology, Bangalore, India 1 Department of
More informationAlgorithms: Decision Trees
Algorithms: Decision Trees A small dataset: Miles Per Gallon Suppose we want to predict MPG From the UCI repository A Decision Stump Recursion Step Records in which cylinders = 4 Records in which cylinders
More informationDECISION TREE INDUCTION USING ROUGH SET THEORY COMPARATIVE STUDY
DECISION TREE INDUCTION USING ROUGH SET THEORY COMPARATIVE STUDY Ramadevi Yellasiri, C.R.Rao 2,Vivekchan Reddy Dept. of CSE, Chaitanya Bharathi Institute of Technology, Hyderabad, INDIA. 2 DCIS, School
More informationGenetic Programming for Data Classification: Partitioning the Search Space
Genetic Programming for Data Classification: Partitioning the Search Space Jeroen Eggermont jeggermo@liacs.nl Joost N. Kok joost@liacs.nl Walter A. Kosters kosters@liacs.nl ABSTRACT When Genetic Programming
More informationSoftening Splits in Decision Trees Using Simulated Annealing
Softening Splits in Decision Trees Using Simulated Annealing Jakub Dvořák and Petr Savický Institute of Computer Science, Academy of Sciences of the Czech Republic {dvorak,savicky}@cs.cas.cz Abstract.
More informationBusiness Club. Decision Trees
Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building
More informationCloNI: clustering of JN -interval discretization
CloNI: clustering of JN -interval discretization C. Ratanamahatana Department of Computer Science, University of California, Riverside, USA Abstract It is known that the naive Bayesian classifier typically
More informationImplementation of Classification Rules using Oracle PL/SQL
1 Implementation of Classification Rules using Oracle PL/SQL David Taniar 1 Gillian D cruz 1 J. Wenny Rahayu 2 1 School of Business Systems, Monash University, Australia Email: David.Taniar@infotech.monash.edu.au
More informationRank Measures for Ordering
Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many
More informationDecision Tree CE-717 : Machine Learning Sharif University of Technology
Decision Tree CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adapted from: Prof. Tom Mitchell Decision tree Approximating functions of usually discrete
More informationNoise-based Feature Perturbation as a Selection Method for Microarray Data
Noise-based Feature Perturbation as a Selection Method for Microarray Data Li Chen 1, Dmitry B. Goldgof 1, Lawrence O. Hall 1, and Steven A. Eschrich 2 1 Department of Computer Science and Engineering
More informationarxiv: v1 [stat.ml] 25 Jan 2018
arxiv:1801.08310v1 [stat.ml] 25 Jan 2018 Information gain ratio correction: Improving prediction with more balanced decision tree splits Antonin Leroux 1, Matthieu Boussard 1, and Remi Dès 1 1 craft ai
More informationCOMP 465: Data Mining Classification Basics
Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised
More informationLecture 2 :: Decision Trees Learning
Lecture 2 :: Decision Trees Learning 1 / 62 Designing a learning system What to learn? Learning setting. Learning mechanism. Evaluation. 2 / 62 Prediction task Figure 1: Prediction task :: Supervised learning
More informationDESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES
EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset
More informationNetwork. Department of Statistics. University of California, Berkeley. January, Abstract
Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,
More informationFuzzy-Rough Feature Significance for Fuzzy Decision Trees
Fuzzy-Rough Feature Significance for Fuzzy Decision Trees Richard Jensen and Qiang Shen Department of Computer Science, The University of Wales, Aberystwyth {rkj,qqs}@aber.ac.uk Abstract Crisp decision
More informationUSING REGRESSION TREES IN PREDICTIVE MODELLING
Production Systems and Information Engineering Volume 4 (2006), pp. 115-124 115 USING REGRESSION TREES IN PREDICTIVE MODELLING TAMÁS FEHÉR University of Miskolc, Hungary Department of Information Engineering
More informationUsing Turning Point Detection to Obtain Better Regression Trees
Using Turning Point Detection to Obtain Better Regression Trees Paul K. Amalaman, Christoph F. Eick and Nouhad Rizk pkamalam@uh.edu, ceick@uh.edu, nrizk@uh.edu Department of Computer Science, University
More informationA Cloud Framework for Big Data Analytics Workflows on Azure
A Cloud Framework for Big Data Analytics Workflows on Azure Fabrizio MAROZZO a, Domenico TALIA a,b and Paolo TRUNFIO a a DIMES, University of Calabria, Rende (CS), Italy b ICAR-CNR, Rende (CS), Italy Abstract.
More informationInternational Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X
Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,
More informationAdaptive Building of Decision Trees by Reinforcement Learning
Proceedings of the 7th WSEAS International Conference on Applied Informatics and Communications, Athens, Greece, August 24-26, 2007 34 Adaptive Building of Decision Trees by Reinforcement Learning MIRCEA
More informationC-NBC: Neighborhood-Based Clustering with Constraints
C-NBC: Neighborhood-Based Clustering with Constraints Piotr Lasek Chair of Computer Science, University of Rzeszów ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland lasek@ur.edu.pl Abstract. Clustering is
More informationPattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition
Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant
More informationA Comparative Study of Reliable Error Estimators for Pruning Regression Trees
A Comparative Study of Reliable Error Estimators for Pruning Regression Trees Luís Torgo LIACC/FEP University of Porto R. Campo Alegre, 823, 2º - 4150 PORTO - PORTUGAL Phone : (+351) 2 607 8830 Fax : (+351)
More informationDecision Tree Induction from Distributed Heterogeneous Autonomous Data Sources
Decision Tree Induction from Distributed Heterogeneous Autonomous Data Sources Doina Caragea, Adrian Silvescu, and Vasant Honavar Artificial Intelligence Research Laboratory, Computer Science Department,
More informationPerformance analysis of a MLP weight initialization algorithm
Performance analysis of a MLP weight initialization algorithm Mohamed Karouia (1,2), Régis Lengellé (1) and Thierry Denœux (1) (1) Université de Compiègne U.R.A. CNRS 817 Heudiasyc BP 49 - F-2 Compiègne
More informationA Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995) Department of Information, Operations and Management Sciences Stern School of Business, NYU padamopo@stern.nyu.edu
More informationMachine Learning. A. Supervised Learning A.7. Decision Trees. Lars Schmidt-Thieme
Machine Learning A. Supervised Learning A.7. Decision Trees Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany 1 /
More informationGraph Propositionalization for Random Forests
Graph Propositionalization for Random Forests Thashmee Karunaratne Dept. of Computer and Systems Sciences, Stockholm University Forum 100, SE-164 40 Kista, Sweden si-thk@dsv.su.se Henrik Boström Dept.
More informationTowards an Effective Cooperation of the User and the Computer for Classification
ACM SIGKDD Int. Conf. on Knowledge Discovery & Data Mining (KDD- 2000), Boston, MA. Towards an Effective Cooperation of the User and the Computer for Classification Mihael Ankerst, Martin Ester, Hans-Peter
More informationCS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008
CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008 Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute NAME: Prof. Ruiz Problem
More informationBOAI: Fast Alternating Decision Tree Induction based on Bottom-up Evaluation
: Fast Alternating Decision Tree Induction based on Bottom-up Evaluation Bishan Yang, Tengjiao Wang, Dongqing Yang, and Lei Chang Key Laboratory of High Confidence Software Technologies (Peking University),
More informationDecision Trees Dr. G. Bharadwaja Kumar VIT Chennai
Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target
More informationMySQL Data Mining: Extending MySQL to support data mining primitives (demo)
MySQL Data Mining: Extending MySQL to support data mining primitives (demo) Alfredo Ferro, Rosalba Giugno, Piera Laura Puglisi, and Alfredo Pulvirenti Dept. of Mathematics and Computer Sciences, University
More informationLOAD BALANCING IN MOBILE INTELLIGENT AGENTS FRAMEWORK USING DATA MINING CLASSIFICATION TECHNIQUES
8 th International Conference on DEVELOPMENT AND APPLICATION SYSTEMS S u c e a v a, R o m a n i a, M a y 25 27, 2 0 0 6 LOAD BALANCING IN MOBILE INTELLIGENT AGENTS FRAMEWORK USING DATA MINING CLASSIFICATION
More informationCustomer Clustering using RFM analysis
Customer Clustering using RFM analysis VASILIS AGGELIS WINBANK PIRAEUS BANK Athens GREECE AggelisV@winbank.gr DIMITRIS CHRISTODOULAKIS Computer Engineering and Informatics Department University of Patras
More informationChapter 12 Feature Selection
Chapter 12 Feature Selection Xiaogang Su Department of Statistics University of Central Florida - 1 - Outline Why Feature Selection? Categorization of Feature Selection Methods Filter Methods Wrapper Methods
More informationApplication of Multivariate Adaptive Regression Splines to Evaporation Losses in Reservoirs
Open access e-journal Earth Science India, eissn: 0974 8350 Vol. 4(I), January, 20, pp.5-20 http://www.earthscienceindia.info/ Application of Multivariate Adaptive Regression Splines to Evaporation Losses
More informationNotes based on: Data Mining for Business Intelligence
Chapter 9 Classification and Regression Trees Roger Bohn April 2017 Notes based on: Data Mining for Business Intelligence 1 Shmueli, Patel & Bruce 2 3 II. Results and Interpretation There are 1183 auction
More informationA Comparative Study of Selected Classification Algorithms of Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220
More informationEvaluating the Replicability of Significance Tests for Comparing Learning Algorithms
Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms Remco R. Bouckaert 1,2 and Eibe Frank 2 1 Xtal Mountain Information Technology 215 Three Oaks Drive, Dairy Flat, Auckland,
More informationCLASSIFICATION FOR SCALING METHODS IN DATA MINING
CLASSIFICATION FOR SCALING METHODS IN DATA MINING Eric Kyper, College of Business Administration, University of Rhode Island, Kingston, RI 02881 (401) 874-7563, ekyper@mail.uri.edu Lutz Hamel, Department
More information