Logistic Model Tree With Modified AIC

Mitesh J. Thakkar (Student of M.E.I.T., S.S. Engineering College, Bhavnagar, mitesh.thakkar11@gmail.com)
Neha J. Thakkar (Asst. Prof., Computer Dept., Indus Engineering College, Ahmedabad, neha28842003@gmail.com)
Dr. J. S. Shah (Prof. & Head, Computer Dept., L.D. College of Engineering, Ahmedabad, jssld@yahoo.com)

Abstract

Logistic Model Trees have been shown to be very accurate and compact classifiers. Their greatest disadvantage is the computational complexity of inducing the logistic regression models in the tree. We address this issue by using a modified AIC criterion instead of cross-validation to prevent overfitting these models. In addition, missing values are filled class wise, using the mean for numeric attributes and the mode for nominal attributes. We compare the training time and accuracy of the new induction process with the original one on various datasets and show that the training time often decreases while the classification accuracy increases slightly.

Keywords: Model tree, logistic regression.

1. Introduction

Two popular methods for classification are linear logistic regression and tree induction, which have somewhat complementary advantages and disadvantages. The former fits a simple (linear) model to the data, and the process of model fitting is quite stable, resulting in low variance but potentially high bias. The latter, on the other hand, exhibits low bias but often high variance: it searches a less restricted space of models, allowing it to capture nonlinear patterns in the data, but making it less stable and prone to overfitting. So it is not surprising that neither of the two methods is superior in general [7]. Logistic model trees (LMTs) combine the two approaches by placing logistic regression models at the nodes of a tree [1]. However, the main drawback of LMTs is the time needed to build them. This is due mostly to the cost of building the logistic regression models at the nodes: the LogitBoost algorithm is repeatedly called for a fixed number of iterations, determined by a five-fold cross-validation.
The AIC criterion has been used to overcome this problem [2]. In this paper we investigate whether we can use a modified AIC criterion without loss of accuracy. In addition, LMT fills missing values with a simple global imputation method; instead, we compute the means/modes of the respective attributes class wise. The rest of this paper is organized as follows. In Section 2 we give a brief overview of the original LMT induction algorithm. Section 3 describes the modifications made to various parts of the algorithm. In Section 4, we evaluate the modified algorithm and discuss the results, and in Section 5 we draw some conclusions.

2. Logistic Model Tree Induction

The original LMT induction algorithm can be found in [1]. We give a brief overview of the process here, focusing on the aspects where we have made improvements. We begin this section with an overview of the underlying foundations and conclude it with a synopsis of the original LMT induction algorithm.

2.1 Logistic Regression

Fitting a logistic regression model means estimating the parameter vectors B_j. The standard procedure in statistics is to look for the maximum likelihood estimate: choose the parameters that maximize the probability of the observed data points [5]. For the logistic regression model, there are no closed-form solutions for these estimates. Instead, we have to use numeric optimization algorithms that approach the maximum likelihood solution iteratively and reach it in the limit. One such algorithm is LogitBoost [5], which fits additive logistic regression models; these models are a generalization of the (linear) logistic regression models.
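Before turning to the additive models themselves, the iterative numeric optimization mentioned above can be illustrated in a toy binary, single-attribute setting. The sketch below uses plain gradient ascent on the log-likelihood rather than the Newton-style steps used in practice; the class and method names are ours and purely illustrative.

```java
// Minimal sketch: fitting a binary logistic regression by gradient ascent
// on the log-likelihood. A toy stand-in for the iterative optimizers
// discussed in the text; all names here are illustrative.
public class LogisticFit {
    // p(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x)))
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Returns {b0, b1} after a fixed number of gradient steps.
    static double[] fit(double[] x, int[] y, int steps, double lr) {
        double b0 = 0, b1 = 0;
        for (int s = 0; s < steps; s++) {
            double g0 = 0, g1 = 0;
            for (int i = 0; i < x.length; i++) {
                double err = y[i] - sigmoid(b0 + b1 * x[i]); // residual
                g0 += err;
                g1 += err * x[i];
            }
            b0 += lr * g0; // move parameters uphill on the likelihood
            b1 += lr * g1;
        }
        return new double[]{b0, b1};
    }

    public static void main(String[] args) {
        // Toy data: class 1 for large x, class 0 for small x.
        double[] x = {-2, -1, -0.5, 0.5, 1, 2};
        int[] y =    { 0,  0,  0,   1,  1, 1};
        double[] b = fit(x, y, 2000, 0.1);
        System.out.println(sigmoid(b[0] + b[1] * 2) > 0.9);  // prints true
        System.out.println(sigmoid(b[0] + b[1] * -2) < 0.1); // prints true
    }
}
```

Each step moves the parameters in the direction of the likelihood gradient, approaching the maximum likelihood solution only in the limit, which is exactly why a stopping rule for the iterations matters.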
Generally, these models have the form

p_j(x) = e^{F_j(x)} / SUM_{k=1..J} e^{F_k(x)},  where  F_j(x) = SUM_m f_mj(x)

and the f_mj are (not necessarily linear) functions of the input variables. Figure 1.1 gives the pseudocode for the algorithm. The variables y*_ij encode the observed class membership probabilities for instance x_i, i.e.

y*_ij = 1 if y_i = j, and y*_ij = 0 otherwise,   ...(1)

where y_i is the class label of example x_i. The p_j(x) are the estimates of the class probabilities for an instance x given by the model fit so far.

1. Start with weights w_ij = 1/n, i = 1,...,n, j = 1,...,J, F_j(x) = 0 and p_j(x) = 1/J for all j.
2. Repeat for m = 1,...,M:
   (a) Repeat for j = 1,...,J:
       i. Compute working responses and weights in the j-th class:
          z_ij = (y*_ij - p_j(x_i)) / (p_j(x_i)(1 - p_j(x_i))),
          w_ij = p_j(x_i)(1 - p_j(x_i)).
       ii. Fit the function f_mj(x) by a weighted least-squares regression of z_ij to x_i with weights w_ij.
   (b) Set f_mj(x) <- ((J-1)/J)(f_mj(x) - (1/J) SUM_{k=1..J} f_mk(x)) and F_j(x) <- F_j(x) + f_mj(x).
   (c) Update p_j(x) = e^{F_j(x)} / SUM_{k=1..J} e^{F_k(x)}.
3. Output the classifier argmax_j F_j(x).

Figure 1.1: LogitBoost algorithm (J classes)

LogitBoost performs forward stagewise fitting: in every iteration, it computes response variables z_ij that encode the error of the currently fit model on the training examples (in terms of probability estimates), and then tries to improve the model by adding a function f_mj to the committee F_j, fit to the response by least-squares error [6]. As shown in (Friedman et al., 2000) [5], this amounts to performing a quasi-Newton step in every iteration.

2.2 Tree Induction

The goal of supervised learning is to find a subdivision of the instance space into regions labeled with one of the target classes. Top-down tree induction finds this subdivision by recursively splitting the instance space, stopping when the regions of the subdivision are reasonably pure in the sense that they contain examples with mostly identical class labels. The regions are labeled with the majority class of the examples in that region. Important advantages of tree models (with axis-parallel splits) are that they can be constructed efficiently and are easy to interpret.
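As a concrete (and much simplified) illustration of the LogitBoost procedure of Figure 1.1, the sketch below runs the J-class loop on a single numeric attribute, using a weighted simple linear regression as the base learner. All class and method names are ours, not part of any actual LMT implementation.

```java
// Simplified sketch of the J-class LogitBoost loop of Figure 1.1 on one
// numeric attribute, with weighted simple linear regression z ~ a + b*x
// as the base learner. Illustrative only.
public class LogitBoostSketch {
    // Weighted least squares of z on x with weights w; returns {a, b}.
    static double[] wls(double[] x, double[] z, double[] w) {
        double sw = 0, sx = 0, sz = 0;
        for (int i = 0; i < x.length; i++) { sw += w[i]; sx += w[i] * x[i]; sz += w[i] * z[i]; }
        double mx = sx / sw, mz = sz / sw, num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += w[i] * (x[i] - mx) * (z[i] - mz);
            den += w[i] * (x[i] - mx) * (x[i] - mx);
        }
        double b = den > 1e-12 ? num / den : 0;
        return new double[]{mz - b * mx, b};
    }

    // Runs M LogitBoost iterations; returns per-class committees {a_j, b_j}
    // so that F_j(x) = a_j + b_j * x.
    static double[][] fit(double[] x, int[] y, int J, int M) {
        int n = x.length;
        double[][] F = new double[J][2];
        double[][] p = new double[n][J];
        for (double[] row : p) java.util.Arrays.fill(row, 1.0 / J);
        for (int m = 0; m < M; m++) {
            double[][] f = new double[J][2];
            for (int j = 0; j < J; j++) {
                double[] z = new double[n], w = new double[n];
                for (int i = 0; i < n; i++) {
                    double pij = Math.min(Math.max(p[i][j], 1e-5), 1 - 1e-5);
                    double ystar = (y[i] == j) ? 1 : 0;
                    w[i] = pij * (1 - pij);      // working weights
                    z[i] = (ystar - pij) / w[i]; // working responses
                }
                f[j] = wls(x, z, w);             // step (a)ii of Figure 1.1
            }
            for (int c = 0; c < 2; c++) {        // step (b): center and scale
                double avg = 0;
                for (int j = 0; j < J; j++) avg += f[j][c] / J;
                for (int j = 0; j < J; j++) F[j][c] += (J - 1.0) / J * (f[j][c] - avg);
            }
            for (int i = 0; i < n; i++) {        // step (c): update p_j(x)
                double sum = 0;
                for (int j = 0; j < J; j++) sum += Math.exp(F[j][0] + F[j][1] * x[i]);
                for (int j = 0; j < J; j++) p[i][j] = Math.exp(F[j][0] + F[j][1] * x[i]) / sum;
            }
        }
        return F;
    }

    // Classify by argmax_j F_j(x), as in step 3 of Figure 1.1.
    static int predict(double[][] F, double x) {
        int best = 0;
        for (int j = 1; j < F.length; j++)
            if (F[j][0] + F[j][1] * x > F[best][0] + F[best][1] * x) best = j;
        return best;
    }

    public static void main(String[] args) {
        double[] x = {-3, -2, -1, 1, 2, 3};
        int[] y =    { 0,  0,  0, 1, 1, 1};
        double[][] F = fit(x, y, 2, 10);
        System.out.println(predict(F, -2.5)); // expect class 0
        System.out.println(predict(F, 2.5));  // expect class 1
    }
}
```

Clamping the probabilities away from 0 and 1 keeps the working responses finite; the real implementation uses additional safeguards such as weight trimming.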
A path in a decision tree basically corresponds to a conjunction of Boolean expressions of the form "attribute = value" (for nominal attributes) or "attribute <= value" (for numeric attributes), so a tree can be seen as a set of rules that say how to classify instances. The goal of tree induction is to find a subdivision that is fine enough to capture the structure in the underlying domain but does not fit random patterns in the training data. The usual approach to the problem of finding the best number of splits is to first perform many splits (build a large tree) and afterwards use a pruning scheme to undo some of these splits. Different pruning schemes have been proposed. For example, C4.5 uses a statistically motivated estimate for the true error given the error on the training data, while the CART method (Breiman et al., 1984) cross-validates a cost-complexity parameter that assigns a penalty to large trees.

2.3 Logistic Model Tree

For the details of the LMT induction algorithm, the reader should consult [1]. Here is a brief summary of the algorithm:

- First, LogitBoost is run on all the data to build a logistic regression model for the root node. The number of iterations to use is determined by a five-fold cross-validation. In each fold, LogitBoost is run on the training set up to a maximum number of iterations (200). The number of iterations that produces the lowest sum of errors on the test set over all five folds is used in LogitBoost on all the data to produce the model for the root node, and is also used to build the logistic regression models at all other nodes in the tree.
- The data is split using the C4.5 splitting criterion [9]. Logistic regression models are then built at the child nodes on the corresponding subsets of the data using LogitBoost. However, the algorithm starts with the committee F_j(x), weights w_ij and probability estimates p_ij inherited from the parent.
- As long as at least 15 instances are present at a node and a useful split is found (as defined in the C4.5 splitting scheme), splitting and model building continue in the same fashion.
- The CART cross-validation-based pruning algorithm is applied to the tree [8].

2.4 Issues in the Existing Algorithm

There are several issues that provide directions for future work. Probably the most important drawback of logistic model tree induction is the high computational complexity compared to simple tree induction. Although the asymptotic complexity of LMT is acceptable compared to other methods, the algorithm is quite slow in practice. Most of the time is spent fitting the logistic regression models at the nodes with the LogitBoost algorithm. It would be worthwhile looking for a faster way of fitting the logistic models that achieves the same performance. A further issue is the handling of missing values. At present, LMT uses a simple global imputation scheme for filling in missing values. A more sophisticated scheme for handling them might improve accuracy for domains where missing values occur frequently.

These issues in the Logistic Model Tree (LMT) algorithm [1] are addressed in [2] by using the AIC criterion instead of cross-validation to prevent overfitting the models. In addition, a weight trimming heuristic is used, which produces a significant speedup. However, if the AIC is not minimized within 50 iterations, a default of 50 iterations is taken. That fixed number of iterations may not work for large and high-dimensional datasets, and accuracy may decrease.

3. Modification to the Existing Algorithm

We have made two modifications to the existing algorithm, as explained below:

3.1 Automatic Iteration Termination

A common alternative to cross-validation for model selection is the use of an in-sample estimate of the generalization error, such as Akaike's Information Criterion (AIC) [3], as explained in [2].
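As a sketch of this idea, the helper below picks the LogitBoost iteration with the smallest AIC = -2 log L + 2k over a scan capped at 50 iterations, stopping the scan early once the criterion has stopped improving. The names, the five-iteration patience window, and the toy numbers are illustrative assumptions, not the exact rule of [2] or of our implementation.

```java
// Illustrative sketch of AIC-based iteration selection: choose the
// iteration minimizing AIC = -2*logLik + 2*k, capped at 50 iterations,
// with an early stop once AIC has stopped improving. Names are ours.
public class AicTermination {
    static final int MAX_ITERATIONS = 50;

    // logLik[m] is the model log-likelihood after iteration m+1 and
    // params[m] the number of fitted parameters at that point.
    static int selectIterations(double[] logLik, int[] params) {
        int limit = Math.min(logLik.length, MAX_ITERATIONS);
        int best = 0;
        double bestAic = Double.POSITIVE_INFINITY;
        for (int m = 0; m < limit; m++) {
            double aic = -2.0 * logLik[m] + 2.0 * params[m];
            if (aic < bestAic) { bestAic = aic; best = m; }
            else if (m - best >= 5) break; // no improvement for 5 iterations: stop early
        }
        return best + 1; // number of iterations to run on the full data
    }

    public static void main(String[] args) {
        // Toy trace: the likelihood improves quickly, then the 2*k penalty dominates.
        double[] ll = {-100, -60, -40, -35, -34, -33.8, -33.7, -33.65, -33.6, -33.58, -33.57};
        int[] k = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22};
        System.out.println(selectIterations(ll, k)); // prints 4
    }
}
```

Because the criterion is computed in-sample, no cross-validation folds are needed, which is where the training-time savings come from.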
Sumner et al. [2] investigated its usefulness in selecting the optimum number of LogitBoost iterations and found it to be a viable alternative to cross-validation in terms of classification accuracy and far superior in training time. We have used the modified AIC [4] instead of AIC. In the existing algorithm, the AIC is calculated for up to 50 iterations; if it is not minimized within 50 iterations, 50 is taken by default. However, the AIC may reach its minimum before 50 iterations, and our algorithm terminates as soon as it does.

3.2 Find Mean and Mode

The existing algorithm uses a global imputation method to find the mean and mode. Instead, we compute the mean of each numeric attribute class wise, so that each attribute has a mean value for each class. The same method is used to compute mode values for nominal attributes. When we find a missing value in an instance, we check the class of that instance, and the mean value of that class for the particular attribute is used to fill the missing value. The same method is used for nominal values, with the mode in place of the mean. As a result, accuracy can be increased.

3.3 Steps of Algorithm

1. Fill the missing values by finding the mean and mode class wise.
2. Find the root node using the C4.5 split criterion [9].
3. Num_iteration = AIC_Iteration(Example, Initial linear model)
4. Initialize the probabilities/weights for the LogitBoost algorithm.
5. For i = 1 to Num_iteration: update the probabilities and weights using the LogitBoost algorithm and add the regression function to the linear model.
6. Find a split using the C4.5 splitting criterion.
7. For each split, repeat steps 4 and 5 until the stopping criteria are met.
8. Return the logistic regression function.

4. Experiments

This section evaluates the modified LMT induction algorithm. For the evaluation we measure training times and classification accuracy. All experiments are ten runs of a tenfold stratified cross-validation. The proposed algorithm is implemented in Java using the NetBeans editor.
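Step 1 of the algorithm, the class-wise imputation of Section 3.2, might look as follows for a numeric attribute. Missing values are encoded here as NaN, and all names are illustrative, not taken from our implementation.

```java
// Sketch of class-wise imputation (Section 3.2): a missing numeric value
// (Double.NaN here) is replaced by the mean of that attribute over the
// instances of the same class. For nominal attributes, the per-class mode
// would be used in the same way. Names are illustrative.
import java.util.*;

public class ClassWiseImputer {
    // Per-class mean of one numeric attribute, ignoring missing entries.
    static Map<String, Double> classMeans(double[] values, String[] classes) {
        Map<String, double[]> acc = new HashMap<>(); // class -> {sum, count}
        for (int i = 0; i < values.length; i++) {
            if (Double.isNaN(values[i])) continue;
            double[] a = acc.computeIfAbsent(classes[i], c -> new double[2]);
            a[0] += values[i];
            a[1] += 1;
        }
        Map<String, Double> means = new HashMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet())
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        return means;
    }

    // Fill each missing value with the mean for that instance's own class.
    static double[] impute(double[] values, String[] classes) {
        Map<String, Double> means = classMeans(values, classes);
        double[] out = values.clone();
        for (int i = 0; i < out.length; i++)
            if (Double.isNaN(out[i])) out[i] = means.get(classes[i]);
        return out;
    }

    public static void main(String[] args) {
        double[] attr = {1.0, 3.0, Double.NaN, 10.0, Double.NaN};
        String[] cls = {"yes", "yes", "yes", "no", "no"};
        double[] filled = impute(attr, cls);
        System.out.println(filled[2]); // 2.0: mean of the "yes" values {1, 3}
        System.out.println(filled[4]); // 10.0: mean of the single "no" value
    }
}
```

Unlike global imputation, the filled-in value respects the class of the instance, which is why accuracy can improve when missing values are frequent.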
We fetch the data from files in the ARFF (Attribute-Relation File Format) format. To study performance, the algorithm has been tested on an Intel-based machine with a 2.20 GHz processor and 3 GB of main memory. We use two datasets to evaluate the handling of missing values.

4.1 Scale-up Performance

Table 4.1 shows the scale-up experiment, carried out on the echocardiogram dataset at different sizes. The total number of records affects the performance of the algorithm. The parameter values used to generate the datasets are as follows: there are 13 attributes in total, the class variable takes two values, and the data is numeric only.
Records | Time (sec) | Time (sec) | Time (sec)
132   | 0.59  | 0.27  | 0.36
264   | 0.64  | 0.59  | 0.61
1000  | 8.14  | 2.43  | 2.03
3000  | 11.19 | 12.36 | 8.33
5000  | 69.71 | 50.13 | 44.12

[Table 4.1: Scale-up performance]

[Figure 4.1: Scale-up performance]

4.2 Performance Study on Classifier Accuracy with Missing Values in Numeric Attributes

Figure 4.2 shows the accuracy study performed on our classification model. The model is derived from the echocardiogram [10] dataset at different sizes, and its accuracy has been tested for three different methods. In this dataset all attributes are numeric. The experiment shows that the accuracy is maintained even after the model is updated.

Records | Accuracy (%) | Accuracy (%) | Accuracy (%)
132   | 93.12 | 86.29 | 94.65
264   | 95.41 | 97.30 | 95.80
1000  | 99    | 98.42 | 98.48
3000  | 99    | 99.02 | 99.26
5000  | 99    | 98    | 98.50

[Table 4.2: Accuracy performance (numeric attributes contain missing values)]

[Figure 4.2: Accuracy performance (numeric attributes contain missing values)]

4.3 Performance Study on Classifier Accuracy with Missing Values in Nominal Attributes

Figure 4.3 shows the accuracy study performed on our classification model. The model is derived from the mushroom [10] dataset at different sizes, and its accuracy has been tested for three different methods. In this dataset all attributes are nominal. The experiment shows that the accuracy is maintained even after the model is updated.

Records | Accuracy (%) | Accuracy (%) | Accuracy (%)
1000  | 99.8 | 99.9 | 99.9
3862  | 100  | 100  | 100
10000 | 99   | 99   | 99
20000 | 98   | 98   | 98.9

[Table 4.3: Accuracy performance (nominal attributes contain missing values)]

[Figure 4.3: Accuracy performance (nominal attributes contain missing values)]
5. Conclusion

The proposed algorithm, Logistic Model Tree with modified AIC (LMT_MAIC), uses the modified AIC to minimize the number of LogitBoost iterations, which speeds up the algorithm. It also handles missing values by using the attribute mean over all samples belonging to the same class as the given tuple, which is more accurate than the basic LMT algorithm; accuracy therefore increases where missing values occur frequently in numeric attributes. When missing values occur in nominal attributes, the accuracy does not change, but the model takes less time to build.

6. Future Extension

There are several issues that provide directions for future work. The proposed algorithm uses the modified AIC to minimize the number of LogitBoost iterations and fills missing values class wise, but future research might yield a more principled criterion than AIC that further increases accuracy. The algorithm could also be implemented in an incremental manner.

7. References

[1] Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine Learning, 59(1/2):161-205, 2005.
[2] Marc Sumner, Eibe Frank, and Mark Hall. Speeding up logistic model tree induction. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2005.
[3] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, pages 267-281, 1973.
[4] Fjo De Ridder, Rik Pintelon, Johan Schoukens, and David Paul Gillikin. Modified AIC and MDL model selection criteria for short data records. IEEE Transactions on Instrumentation and Measurement.
[5] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337-407, 2000.
[6] Marcel Dettling and Peter Buhlmann. Boosting for tumor classification with gene expression data. September 5, 2002.
[7] Claudia Perlich, Foster Provost, and Jeffrey S. Simonoff. Tree induction vs. logistic regression: a learning-curve analysis. Journal of Machine Learning Research, 4:211-255, 2003.
[8] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
[9] R. Quinlan. C4.5: Programs for Machine Learning.
Morgan Kaufmann, 1993.
[10] Blake, C. and C. Merz: 1998, UCI Repository of machine learning databases. [www.ics.uci.edu/mlearn/mlrepository.html].