Logistic Model Tree With Modified AIC

Mitesh J. Thakkar, Student of M.E.I.T., S.S. Engineering College, Bhavnagar, mitesh.thakkar11@gmail.com
Neha J. Thakkar, Asst. Prof., Computer Dept., Indus Engineering College, Ahmedabad, neha28842003@gmail.com
Dr. J. S. Shah, Prof. & Head, Computer Dept., L.D. College of Engineering, Ahmedabad, jssld@yahoo.com

Abstract

Logistic Model Trees have been shown to be very accurate and compact classifiers. Their greatest disadvantage is the computational complexity of inducing the logistic regression models in the tree. This issue is addressed by using a modified AIC criterion instead of cross-validation to prevent overfitting of these models. In addition, missing values are filled class-wise, using the mean for numeric attributes and the mode for nominal attributes. We compare the training time and accuracy of the new induction process with those of the original process on various datasets and show that the training time often decreases while the classification accuracy increases slightly.

Keywords: Model tree, logistic regression.

1. Introduction

Two popular methods for classification are linear logistic regression and tree induction, which have somewhat complementary advantages and disadvantages. The former fits a simple (linear) model to the data, and the process of model fitting is quite stable, resulting in low variance but potentially high bias. The latter exhibits low bias but often high variance: it searches a less restricted space of models, allowing it to capture nonlinear patterns in the data, but making it less stable and prone to overfitting. It is therefore not surprising that neither of the two methods is superior in general [7]. Logistic model trees (LMTs) combine the two approaches by building a tree whose nodes carry logistic regression models [1].

The main drawback of LMTs, however, is the time needed to build them. This is due mostly to the cost of building the logistic regression models at the nodes: the LogitBoost algorithm is repeatedly called for a fixed number of iterations, determined by a five-fold cross-validation. The AIC criterion has previously been used to overcome this problem [2]. In this paper we investigate whether a modified AIC criterion can be used without loss of accuracy. In addition, to fill missing values LMT uses a simple global imputation method; instead, we compute the means and modes of the respective attributes class-wise.

The rest of this paper is organized as follows. In Section 2 we give a brief overview of the original LMT induction algorithm. Section 3 describes the modifications made to various parts of the algorithm. In Section 4 we evaluate the modified algorithm and discuss the results, and in Section 5 we draw some conclusions.

2. Logistic Model Tree Induction

The original LMT induction algorithm can be found in [1]. We give a brief overview of the process here, focusing on the aspects where we have made improvements. We begin this section with an overview of the underlying foundations and conclude it with a synopsis of the original LMT induction algorithm.

2.1 Logistic Regression

Fitting a logistic regression model means estimating the parameter vectors beta_j. The standard procedure in statistics is to look for the maximum likelihood estimate: choose the parameters that maximize the probability of the observed data points [5]. For the logistic regression model there are no closed-form solutions for these estimates. Instead, we have to use numeric optimization algorithms that approach the maximum likelihood solution iteratively and reach it in the limit.
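For concreteness, the linear model being fitted can be written as follows; this is a standard formulation of multinomial logistic regression, given here for the reader's convenience rather than reproduced from the paper's own equations:

\[ P(G = j \mid x) = \frac{\exp(\beta_j^{\top} x)}{\sum_{k=1}^{J} \exp(\beta_k^{\top} x)}, \qquad j = 1, \dots, J, \]

\[ \ell(\beta_1, \dots, \beta_J) = \sum_{i=1}^{n} \log P(G = y_i \mid x_i). \]

The maximum likelihood estimate chooses the beta_j that maximize the log-likelihood l, typically by an iterative procedure such as Newton's method.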

The additive logistic regression models fitted by LogitBoost are a generalization of the (linear) logistic regression models. Generally, they have the form

\[ F_j(x) = \sum_{m=1}^{M} f_{mj}(x), \qquad p_j(x) = \frac{\exp(F_j(x))}{\sum_{k=1}^{J} \exp(F_k(x))}, \]

where the f_mj are (not necessarily linear) functions of the input variables. Figure 1.1 gives the pseudocode for the algorithm. The variables y*_ij encode the observed class membership probabilities for instance x_i, i.e.

\[ y^{*}_{ij} = \begin{cases} 1 & \text{if } y_i = j, \\ 0 & \text{otherwise}, \end{cases} \qquad (1) \]

where y_i is the class label of example x_i. The p_j(x) are the estimates of the class probabilities for an instance x given by the model fit so far.

Figure 1.1  The LogitBoost algorithm (J classes)

1. Start with weights w_ij = 1/n (i = 1, ..., n; j = 1, ..., J), F_j(x) = 0 and p_j(x) = 1/J for all j.
2. Repeat for m = 1, ..., M:
   (a) Repeat for j = 1, ..., J:
       i.  Compute working responses and weights in the j-th class:
           z_ij = (y*_ij - p_j(x_i)) / (p_j(x_i) (1 - p_j(x_i))),   w_ij = p_j(x_i) (1 - p_j(x_i)).
       ii. Fit the function f_mj(x) by a weighted least-squares regression of z_ij on x_i with weights w_ij.
   (b) Set f_mj(x) <- ((J - 1)/J) (f_mj(x) - (1/J) sum_k f_mk(x)) and F_j(x) <- F_j(x) + f_mj(x).
   (c) Update p_j(x) = exp(F_j(x)) / sum_k exp(F_k(x)).
3. Output the classifier argmax_j F_j(x).

LogitBoost performs forward stagewise fitting: in every iteration it computes response variables z_ij that encode the error of the currently fitted model on the training examples (in terms of probability estimates), and then tries to improve the model by adding a function f_mj to the committee F_j, fit to the response by least-squares error [6]. As shown in (Friedman et al., 2000), this amounts to performing a quasi-Newton step in every iteration.

2.2 Tree Induction

The goal of supervised learning is to find a subdivision of the instance space into regions labeled with one of the target classes. Top-down tree induction finds this subdivision by recursively splitting the instance space, stopping when the regions of the subdivision are reasonably pure in the sense that they contain examples with mostly identical class labels. The regions are labeled with the majority class of the examples they contain.

Important advantages of tree models (with axis-parallel splits) are that they can be constructed efficiently and are easy to interpret. A path in a decision tree basically corresponds to a conjunction of Boolean expressions of the form "attribute = value" (for nominal attributes) or "attribute <= value" (for numeric attributes), so a tree can be seen as a set of rules that say how to classify instances.

The goal of tree induction is to find a subdivision that is fine enough to capture the structure in the underlying domain but does not fit random patterns in the training data. The usual approach to finding the best number of splits is to first perform many splits (build a large tree) and afterwards use a pruning scheme to undo some of these splits. Different pruning schemes have been proposed: for example, C4.5 uses a statistically motivated estimate of the true error given the error on the training data, while CART (Breiman et al., 1984) cross-validates a cost-complexity parameter that assigns a penalty to large trees. A small sketch of a C4.5-style split evaluation is given below.
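The following is a minimal, hypothetical Java sketch of the gain-ratio measure that underlies C4.5-style splitting on a nominal attribute. The class and method names (GainRatioSketch, entropy, gainRatio) are invented for this illustration and are not taken from C4.5 or from the paper's implementation; labels and attribute values are assumed to be given as int arrays.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a C4.5-style gain-ratio computation for one nominal attribute. */
public class GainRatioSketch {

    /** Entropy (in bits) of a vector of class labels. */
    static double entropy(int[] labels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int y : labels) counts.merge(y, 1, Integer::sum);
        double h = 0.0, n = labels.length;
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** Gain ratio of splitting 'labels' on the nominal attribute 'values'. */
    static double gainRatio(int[] values, int[] labels) {
        double n = labels.length;
        // Group instance indices by attribute value.
        Map<Integer, List<Integer>> groups = new HashMap<>();
        for (int i = 0; i < values.length; i++)
            groups.computeIfAbsent(values[i], k -> new ArrayList<>()).add(i);

        double condEntropy = 0.0, splitInfo = 0.0;
        for (List<Integer> idx : groups.values()) {
            int[] sub = new int[idx.size()];
            for (int j = 0; j < sub.length; j++) sub[j] = labels[idx.get(j)];
            double frac = sub.length / n;
            condEntropy += frac * entropy(sub);   // weighted entropy of the branch
            splitInfo   -= frac * (Math.log(frac) / Math.log(2));
        }
        double infoGain = entropy(labels) - condEntropy;
        return splitInfo == 0.0 ? 0.0 : infoGain / splitInfo;
    }

    public static void main(String[] args) {
        // Toy data: one nominal attribute with values {0,1} and binary class labels.
        int[] attr   = {0, 0, 1, 1, 0, 1};
        int[] labels = {0, 0, 1, 1, 0, 0};
        System.out.println("gain ratio = " + gainRatio(attr, labels));
    }
}

C4.5 evaluates such a gain ratio for every candidate attribute (and for candidate thresholds on numeric attributes) and selects the split with the highest value.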
2.3 Logistic Model Tree

For the details of the LMT induction algorithm, the reader should consult [1]. A brief summary of the algorithm follows:

- First, LogitBoost is run on all the data to build a logistic regression model for the root node. The number of iterations to use is determined by a five-fold cross-validation: in each fold, LogitBoost is run on the training set up to a maximum number of iterations (200), and the number of iterations that produces the lowest sum of errors on the test sets over all five folds is used to produce the model for the root node. The same number is also used to build the logistic regression models at all other nodes in the tree.
- The data is split using the C4.5 splitting criterion [9]. Logistic regression models are then built at the child nodes on the corresponding subsets of the data using LogitBoost; however, the algorithm starts with the committee F_j(x), weights w_ij and probability estimates p_ij inherited from the parent.
- As long as at least 15 instances are present at a node and a useful split is found (as defined in the C4.5 splitting scheme), splitting and model building continue in the same fashion.
- Finally, the CART cross-validation-based pruning algorithm is applied to the tree [8].

2.4 Issues in the Existing Algorithm

Several issues provide directions for future work. Probably the most important drawback of logistic model tree induction is its high computational complexity compared to simple tree induction. Although the asymptotic complexity of LMT is acceptable compared to other methods, the algorithm is quite slow in practice. Most of the time is spent fitting the logistic regression models at the nodes with the LogitBoost algorithm, so it would be worthwhile to look for a faster way of fitting the logistic models that achieves the same performance.

A further issue is the handling of missing values. At present, LMT uses a simple global imputation scheme for filling in missing values; a more sophisticated scheme might improve accuracy on domains where missing values occur frequently.

These issues in the Logistic Model Tree (LMT) [1] are addressed in [2] by using the AIC criterion instead of cross-validation to prevent overfitting of the logistic models; in addition, a weight-trimming heuristic is used, which produces a significant speedup. In that approach, if the AIC does not reach a minimum within 50 iterations, a default of 50 iterations is taken. That fixed number of iterations may not suit large and high-dimensional datasets, and accuracy may decrease.

3. Modifications to the Existing Algorithm

We have made two modifications to the existing algorithm, as explained below.

3.1 Automatic Iteration Termination

A common alternative to cross-validation for model selection is the use of an in-sample estimate of the generalization error, such as Akaike's Information Criterion (AIC) [3], as explained in [2]. The authors of [2] investigated its usefulness in selecting the optimum number of LogitBoost iterations and found it to be a viable alternative to cross-validation in terms of classification accuracy and far superior in training time. We use a modified AIC [4] instead of the AIC. In the existing algorithm, the AIC is computed for up to 50 iterations, and if it has not reached a minimum by iteration 50, the default of 50 iterations is used. There may, however, be situations in which the AIC reaches its minimum before 50 iterations; our algorithm terminates the boosting at that point.

3.2 Class-wise Mean and Mode Imputation

The existing algorithm uses a global imputation method when computing the mean and mode. Instead, we compute the mean of each numeric attribute separately for each class, so that every attribute has one mean value per class; the same method is used to compute per-class mode values for nominal attributes. When an instance has a missing value, we look at the class of that instance and fill the missing value with the mean (for a numeric attribute) or the mode (for a nominal attribute) of that attribute for that class. This can increase accuracy.

3.3 Steps of the Algorithm

The overall procedure is as follows; a small code sketch of the two modifications appears after these steps.

1. Fill the missing values with class-wise means and modes.
2. Find the root node using the C4.5 split criterion [9].
3. Num_iterations = AIC_Iteration(Examples, InitialLinearModel).
4. Initialize the probabilities/weights for the LogitBoost algorithm.
5. For i = 1 to Num_iterations: update the probabilities and weights using the LogitBoost algorithm and add the regression function to the linear model.
6. Find a split using the C4.5 splitting criterion.
7. For each split, repeat steps 4 and 5 until the stopping criterion is met.
8. Return the logistic regression function.
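For illustration only, here is a minimal, hypothetical Java sketch of the two modifications: stopping the boosting at the first minimum of an AIC-style score, and class-wise mean imputation for a numeric attribute. The names (LmtMaicSketch, selectIterationsByAIC, imputeClassWise) and the aicScore callback are invented for this sketch and do not correspond to the paper's actual implementation; the AIC computation itself is assumed to be supplied by the caller.

import java.util.HashMap;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch of the two modifications; not the paper's actual code. */
public class LmtMaicSketch {

    /**
     * Run up to maxIterations boosting steps and return the iteration at which the
     * supplied AIC-style score reaches its first minimum (early stopping). The
     * aicScore callback is assumed to fit one more LogitBoost iteration and return
     * the modified AIC of the model after that iteration.
     */
    static int selectIterationsByAIC(IntToDoubleFunction aicScore, int maxIterations) {
        double best = Double.POSITIVE_INFINITY;
        int bestIteration = maxIterations;
        for (int m = 1; m <= maxIterations; m++) {
            double aic = aicScore.applyAsDouble(m);
            if (aic < best) {
                best = aic;
                bestIteration = m;
            } else {
                // AIC started to increase: stop at the previous (minimal) iteration.
                return bestIteration;
            }
        }
        return bestIteration;
    }

    /**
     * Class-wise mean imputation for one numeric attribute. 'values' may contain
     * Double.NaN for missing entries; 'classOf' gives the class label of each
     * instance. Missing entries are replaced in place by the mean of the attribute
     * over the instances of the same class.
     */
    static void imputeClassWise(double[] values, int[] classOf) {
        Map<Integer, Double> sum = new HashMap<>();
        Map<Integer, Integer> count = new HashMap<>();
        for (int i = 0; i < values.length; i++) {
            if (!Double.isNaN(values[i])) {
                sum.merge(classOf[i], values[i], Double::sum);
                count.merge(classOf[i], 1, Integer::sum);
            }
        }
        for (int i = 0; i < values.length; i++) {
            if (Double.isNaN(values[i]) && count.containsKey(classOf[i])) {
                values[i] = sum.get(classOf[i]) / count.get(classOf[i]);
            }
        }
    }

    public static void main(String[] args) {
        // Toy AIC curve 100/m + 2m has its first minimum near m = 7.
        int iters = selectIterationsByAIC(m -> 100.0 / m + 2.0 * m, 50);
        System.out.println("selected iterations = " + iters);

        double[] attr = {1.0, Double.NaN, 3.0, 10.0, Double.NaN};
        int[] cls     = {0,   0,          0,   1,    1};
        imputeClassWise(attr, cls);
        System.out.println(java.util.Arrays.toString(attr)); // NaNs filled class-wise
    }
}

Mode imputation for nominal attributes follows the same pattern, using per-class value counts instead of sums.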
4. Experiments

This section evaluates the modified LMT induction algorithm. For the evaluation we measure training times and classification accuracy. All experiments are ten runs of ten-fold stratified cross-validation. The proposed algorithm is implemented in Java using the NetBeans editor, and the data is read from files in the ARFF (Attribute-Relation File Format) format. The algorithm has been tested on an Intel-based machine with a 2.20 GHz processor and 3 GB of main memory. We mainly use two datasets to study the handling of missing values. A cross-validation evaluation of this kind can be driven with the Weka API, as sketched below.
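The following minimal sketch shows how such an evaluation could be run on an ARFF file with the Weka API. Weka's stock LMT classifier is used here as a stand-in, since the modified LMT_MAIC classifier described in this paper is not part of Weka, and the ARFF file path is a placeholder.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.LMT;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch: ten-fold cross-validation of an LMT classifier on an ARFF dataset. */
public class EvaluateLmt {
    public static void main(String[] args) throws Exception {
        // Load the dataset (placeholder path) and use the last attribute as the class.
        Instances data = new DataSource("echocardiogram.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        LMT lmt = new LMT();  // stock Weka LMT, stand-in for the modified LMT_MAIC

        long start = System.currentTimeMillis();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(lmt, data, 10, new Random(1));
        long elapsed = System.currentTimeMillis() - start;

        System.out.println("Accuracy (%): " + eval.pctCorrect());
        System.out.println("Time (s): " + elapsed / 1000.0);
    }
}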

4.1 Scale-up Performance

Table 4.1 shows the scale-up experiment, which has been carried out on the echocardiogram dataset at different sizes. The total number of records affects the performance of the algorithm. The parameter values used to generate the dataset are as follows: the total number of attributes is 13, there are two class values for the class variable, and it contains numeric data only.

Table 4.1: Scale-up performance (training time in seconds for the three compared methods)

Records   Time (sec)   Time (sec)   Time (sec)
132       0.59         0.27         0.36
264       0.64         0.59         0.61
1000      8.14         2.43         2.03
3000      11.19        12.36        8.33
5000      69.71        50.13        44.12

Figure 4.1: Scale-up performance.

4.2 Performance Study on Classifier Accuracy with Missing Values in Numeric Attributes

Figure 4.2 shows the accuracy study performed on our classification model. The model is derived from the echocardiogram [10] dataset at different sizes. The accuracy of the model has been tested for three different methods. In this dataset all attributes are numeric. The experiment shows that the accuracy is maintained even after the model is updated.

Table 4.2: Accuracy performance (numeric attributes contain missing values)

Records   Accuracy (%)   Accuracy (%)   Accuracy (%)
132       93.12          86.29          94.65
264       95.41          97.30          95.80
1000      99             98.42          98.48
3000      99             99.02          99.26
5000      99             98             98.50

Figure 4.2: Accuracy performance (numeric attributes contain missing values).

4.3 Performance Study on Classifier Accuracy with Missing Values in Nominal Attributes

Figure 4.3 shows the accuracy study performed on our classification model. The model is derived from the mushroom [10] dataset at different sizes. The accuracy of the model has been tested for three different methods. In this dataset all attributes are nominal. The experiment shows that the accuracy is maintained even after the model is updated.

Table 4.3: Accuracy performance (nominal attributes contain missing values)

Records   Accuracy (%)   Accuracy (%)   Accuracy (%)
1000      99.8           99.9           99.9
3862      100            100            100
10000     99             99             99
20000     98             98             98.9

Figure 4.3: Accuracy performance (nominal attributes contain missing values).

5. Conclusion

The proposed algorithm, Logistic Model Tree with modified AIC (LMT_MAIC), uses the modified AIC to minimize the number of LogitBoost iterations, thereby speeding up the algorithm. It also handles missing values by using the attribute mean (or mode) over all samples belonging to the same class as the given tuple, which is more accurate than the basic LMT algorithm; accuracy therefore increases where missing values occur frequently in numeric attributes. When the missing values occur in nominal attributes, the accuracy does not change, but the model takes less time to build.

6. Future Extension

There are several issues that provide directions for future work. The proposed algorithm uses the modified AIC to minimize the number of LogitBoost iterations and fills missing values class-wise, but future research might yield a more principled criterion than AIC that also increases accuracy. The algorithm could also be implemented in an incremental manner.

7. References

[1] Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine Learning, 59(1/2):161-205, 2005.
[2] Marc Sumner, Eibe Frank, and Mark Hall. Speeding up logistic model tree induction.
[3] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, pages 267-281, 1973.
[4] Fjo De Ridder, Rik Pintelon, Johan Schoukens, and David Paul Gillikin. Modified AIC and MDL model selection criteria for short data records. IEEE.
[5] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting.
[6] Marcel Dettling and Peter Buhlmann. Boosting for tumor classification with gene expression data. September 5, 2002.
[7] Claudia Perlich, Foster Provost, and Jeffrey S. Simonoff. Tree induction vs. logistic regression: a learning-curve analysis.
[8] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
[9] R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[10] C. Blake and C. Merz. UCI Repository of machine learning databases, 1998. [www.ics.uci.edu/mlearn/mlrepository.html].