SSV Criterion Based Discretization for Naive Bayes Classifiers

Krzysztof Grąbczewski
kgrabcze@phys.uni.torun.pl
Department of Informatics, Nicolaus Copernicus University, ul. Grudziądzka 5, 87-100 Toruń, Poland
http://www.phys.uni.torun.pl/kmk

Abstract. Decision tree algorithms deal with continuous variables by finding split points which provide the best separation of objects belonging to different classes. Such criteria can also be used to augment methods which require or prefer symbolic data. A tool for continuous data discretization based on the SSV criterion (originally designed for decision trees) has been constructed. It significantly improves the performance of the Naive Bayes Classifier. The combination of the two methods has been tested on 15 datasets from the UCI repository and compared with similar approaches. The comparison confirms the robustness of the system.

1 Introduction

There are many different kinds of algorithms used in computational intelligence for data classification purposes. Different methods have different capabilities and are successful in different applications. For some datasets, really accurate models can be obtained only with a combination of different kinds of methodologies.

Decision tree algorithms have successfully used different criteria to separate different classes of objects. In the case of continuous variables in the feature space, they find cut points which determine data subsets for further steps of the hierarchical analysis. The split points can also be used for discretization of the features, so that other methods become applicable to the domain or improve their performance.

The aim of the work described here was to examine the advantages of using the Separability of Split Value (SSV) criterion for discretization tasks. The new method has been analyzed in the context of the Naive Bayes Classifier results.

2 Discretization

Many different approaches to discretization can be found in the literature. There are both supervised and unsupervised methods. Some of them are very simple, others quite complex. The simplest techniques (and probably the most commonly used) divide the range of feature values observed in the training data into several intervals of equal width or of equal data frequency. Both methods are unsupervised: they operate without regard to the labels assigned to the data vectors.
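As a point of reference, the following is a minimal sketch (Python with NumPy; not part of the original paper, and the function names are illustrative only) of the two unsupervised schemes just mentioned, each returning the interior cut points that define the intervals:

    import numpy as np

    def equal_width_cuts(values, n_bins=10):
        """Equal-width discretization: split the observed range of a feature
        into n_bins intervals of identical width; returns interior cut points."""
        lo, hi = float(np.min(values)), float(np.max(values))
        return np.linspace(lo, hi, n_bins + 1)[1:-1]

    def equal_frequency_cuts(values, n_bins=10):
        """Equal-frequency discretization: choose cut points so that each
        interval holds roughly the same number of training samples."""
        probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
        return np.quantile(values, probs)

    def discretize(values, cut_points):
        """Map continuous values to interval indices 0 .. len(cut_points)."""
        return np.searchsorted(cut_points, values, side="right")

Neither function looks at the class labels, which is exactly what makes these schemes usable as plain preprocessing, but also limits how well the resulting intervals separate the classes.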

The supervised methods are more likely to give high accuracies, because the analysis of data labels may lead to a much better separation of the classes. However, such methods should be treated as parts of more complex classification algorithms, not as means of data preprocessing. When a cross-validation test is performed and a supervised discretization is done in advance for the whole dataset, the CV results may be strongly affected. Only performing the discretization on the training part of the data separately for each fold of the CV guarantees reliable results.

One of the basic supervised methods is discretization by histogram analysis. The histograms for all the classes displayed in a single plot provide a visual tool for manual cut point selection (though this technique can also be automated). Some of the more advanced methods are based on statistical tests: ChiMerge [6], Chi2 [7]. There are also supervised discretization schemes based on information theory [3]. Decision trees which are capable of dealing with continuous data can easily be used to recursively partition the range of a feature into as many intervals as needed, but it is not simple to specify the needs.

The same problems with adjusting parameters concern all discretization methods. Some methods use statistical tests to decide when to stop further splitting, others rely on the Minimum Description Length principle. The way features should be discretized can also depend on the classifier which is to deal with the converted data. Therefore it is most advisable to use wrapper approaches to determine the optimal parameter values for particular subsystems. Obviously, wrappers suffer from their complexity, but in the case of fast discretization methods and fast final classifiers, they can be efficiently applied to medium-sized datasets (in particular to the UCI repository datasets [8]).

There are already some comparisons of different discretization methods available in the literature. Two broad studies of the algorithms can be found in [1] and [9]. They are good reference points for the results presented here.

3 SSV criterion

The SSV criterion is one of the most efficient criteria used for decision tree construction [5]. Its basic advantage is that it can be applied to both continuous and discrete features. The split value (or cut-off point) is defined differently for continuous and symbolic features: for continuous features it is a real number, and for symbolic ones it is a subset of the set of alternative values of the feature. The left side (LS) and right side (RS) of a split value s of feature f for a given dataset D are defined as:

    LS(s, f, D) = \begin{cases} \{ x \in D : f(x) < s \} & \text{if } f \text{ is continuous} \\ \{ x \in D : f(x) \in s \} & \text{otherwise} \end{cases}
    RS(s, f, D) = D \setminus LS(s, f, D)                                                      (1)

where f(x) is the value of feature f for the data vector x. The separability of a split value s is defined as:

    SSV(s) = 2 \sum_{c \in C} |LS(s, f, D_c)| \cdot |RS(s, f, D \setminus D_c)| - \sum_{c \in C} \min\big(|LS(s, f, D_c)|, |RS(s, f, D_c)|\big)      (2)

where C is the set of classes, D_c is the set of data vectors from D assigned to class c \in C, and |.| denotes the number of elements of a set.
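For continuous features, formula (2) translates almost directly into code. The sketch below (Python/NumPy; an illustration rather than the original implementation, with illustrative names) scores a candidate split and searches the midpoints between consecutive distinct values for the best one:

    import numpy as np

    def ssv_score(split, x, y):
        """Separability of Split Value, Eq. (2), for a continuous feature x
        with class labels y (both NumPy arrays); larger is better."""
        left = x < split                    # LS(s, f, D)
        right = ~left                       # RS(s, f, D)
        separated_pairs = 0
        tied = 0
        for c in np.unique(y):
            in_c = (y == c)
            ls_c = int(np.sum(left & in_c))          # |LS(s, f, D_c)|
            rs_c = int(np.sum(right & in_c))         # |RS(s, f, D_c)|
            rs_not_c = int(np.sum(right & ~in_c))    # |RS(s, f, D \ D_c)|
            separated_pairs += ls_c * rs_not_c
            tied += min(ls_c, rs_c)
        return 2 * separated_pairs - tied

    def best_ssv_split(x, y):
        """Evaluate SSV at midpoints between consecutive distinct values of x
        and return the best cut point together with its score."""
        xs = np.unique(x)
        if len(xs) < 2:
            return None, -np.inf            # nothing to split
        candidates = (xs[:-1] + xs[1:]) / 2.0
        scores = np.array([ssv_score(s, x, y) for s in candidates])
        i = int(np.argmax(scores))
        return float(candidates[i]), float(scores[i])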

Decision trees are constructed recursively by searching for the best splits (those with the largest SSV value) among all the splits for all the features. At each stage, when the best split is found and the subsets of data resulting from the split are not completely pure (i.e. they contain data belonging to more than one class), each of the subsets is analyzed in the same way as the whole data. The decision tree built this way gives the maximal possible accuracy (100% if there are no contradictory examples in the data), which usually means that the created model overfits the data. To remedy this, a cross-validation training is performed to find the optimal parameters for pruning the tree. The optimal pruning produces a tree capable of good generalization of the patterns used in the tree construction process.

3.1 SSV based discretization

The SSV criterion has proved to be very efficient not only as a decision tree construction tool, but also in other tasks such as feature selection [2] or discrete-to-continuous data conversion [4]. The discretization algorithm based on the SSV criterion operates on each of the continuous features separately. For a given training dataset D, a selected continuous feature f and a target number n of splits, it acts as follows:

1. Build an SSV decision tree for the dataset obtained by projecting D onto the one-dimensional space consisting of the feature f.
2. Prune the tree to no more than n splits. This is achieved by pruning the outermost splits (i.e. the splits with two leaves) one by one, in order of increasing reduction of the training-data classification error provided by the node.
3. Discretize the feature using all the split points occurring in the pruned tree.

Obviously, a discretization with n split points leads to n + 1 intervals, so the sensible values of the parameter are integers greater than zero. The selection of the parameter n is not straightforward. It depends on the data being examined and the method which is to be applied to the discretized data. It can be selected arbitrarily, as is usually done in the case of the equal-width or equal-frequency intervals methods. Because the SSV discretization is supervised, the number of intervals does not need to be as large as in the case of the unsupervised methods. If we want to look for the optimal value of n, we can use a wrapper approach which, of course, requires more computation than a single transformation, but can better fit n to the needs of the particular symbolic data classifier.
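A literal implementation would grow a one-feature SSV tree and then prune its outermost splits back to n, as in steps 1 and 2. The sketch below (Python/NumPy, reusing ssv_score and best_ssv_split from the previous listing) takes a shortcut: it greedily applies the best SSV split to the current segments until n cut points have been collected, which approximates the procedure but is not the author's exact prune-back algorithm.

    import numpy as np

    def ssv_cut_points(x, y, n_splits):
        """Greedy single-feature SSV discretization: repeatedly take the
        highest-scoring SSV split over all current segments of the feature
        until n_splits cut points are gathered (approximation of steps 1-3)."""
        segments = [(x, y)]
        cuts = []
        while len(cuts) < n_splits:
            best = None                     # (score, segment index, cut point)
            for i, (xs, ys) in enumerate(segments):
                if len(np.unique(xs)) < 2 or len(np.unique(ys)) < 2:
                    continue                # nothing left to separate here
                s, score = best_ssv_split(xs, ys)
                if best is None or score > best[0]:
                    best = (score, i, s)
            if best is None:
                break                       # no further useful splits exist
            _, i, s = best
            xs, ys = segments.pop(i)
            cuts.append(float(s))
            segments.append((xs[xs < s], ys[xs < s]))
            segments.append((xs[xs >= s], ys[xs >= s]))
        return sorted(cuts)

The returned cut points are then used exactly like the equal-width ones, e.g. discretize(x, ssv_cut_points(x, y, 4)); the wrapper variant mentioned above simply repeats this for several values of n and keeps the one with the best internal cross-validation score.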

4 Results

To test the advantages of the presented discretization algorithm, it has been applied to 15 datasets available in the UCI repository [8]. Some information about the data is presented in Table 1. These are datasets defined in spaces with continuous features, which makes them suitable for testing discretization methods. The same sets were used in the comparison presented in [1]. Although Dougherty et al. present many details about their way of testing, there is still some uncertainty, which suggests caution in comparisons: for example, the way of treating unknown values is not reported. One should also note that the results collected in [1] seem to be single CV scores. The results presented here ignore the missing values when estimating probabilities, i.e. for a given interval of a given feature, only the vectors with known values are taken into account. When classifying a vector with missing values, the corresponding features are excluded from the calculations.

For each of the datasets a number of tests has been performed and the results are collected in Table 2. All the results are obtained with the Naive Bayes Classifier run on differently prepared data. The classifier function is \arg\max_{c \in C} P(c) \prod_i P(x_i \mid c), where x_i is the value of the i-th feature of the vector x being classified. P(c) and P(x_i \mid c) are estimated on the basis of the training dataset (no Laplace correction or m-estimate was applied). For continuous features, P(x_i \mid c) is replaced by the value of the Gaussian density function with mean and variance estimated from the training samples.

In Table 2, the "Raw" column reflects the results obtained with no data preprocessing, "10EW" means that discretization into 10 equal-width intervals was used, "SSV4" reflects the SSV criterion based discretization with 4 split points per feature, and "SSVO" corresponds to the SSV criterion based discretization with a search for the optimal number of split points. The best value of the parameter was searched for in the range from 1 to 25, according to the results of a 5-fold cross-validation test conducted within the training data. The table presents classification accuracies followed by their standard deviations, calculated for 30 repetitions of 5-fold cross-validation (the left value concerns the 30 means and the right one is the average internal deviation of the 5-fold CVs); the only exception is the last column, which, because of its higher time demands, shows the results of 10 repetitions of the CV test. The "-P" suffix means that the discretization was done at the data preprocessing level, and the "-I" suffix that the transformations were done internally in each of the CV folds.

Using supervised methods for data preprocessing is not reasonable: it offends the principle of testing on unseen data. Especially when dealing with a small number of vectors and a large number of features, there is a danger of overfitting, and attractive-looking results may turn out to be misleading and completely useless. A confirmation of these remarks can be seen in the lower part of Table 3. Statistical significance of the differences was examined with the paired t-test.
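To make the evaluation setup concrete, here is a minimal sketch (Python/NumPy; an illustration consistent with the description above, not the code used in the experiments) of the Naive Bayes decision rule on already-discretized features, with frequency estimates, no Laplace correction, and missing values simply skipped. X and y are assumed to be NumPy arrays, and missing values are assumed to be encoded as -1 after discretization:

    import numpy as np

    class DiscreteNaiveBayes:
        """argmax_c P(c) * prod_i P(x_i | c) on integer bin indices.
        Missing values (encoded as -1 here, by assumption) are ignored both
        when estimating probabilities and when classifying."""

        def fit(self, X, y):
            self.classes_, counts = np.unique(y, return_counts=True)
            self.prior_ = dict(zip(self.classes_, counts / len(y)))
            self.cond_ = {}                  # (feature, bin, class) -> P(bin | class)
            for j in range(X.shape[1]):
                for c in self.classes_:
                    col = X[y == c, j]
                    col = col[col >= 0]      # drop missing values
                    if len(col) == 0:
                        continue
                    bins, freq = np.unique(col, return_counts=True)
                    for b, k in zip(bins, freq):
                        self.cond_[(j, b, c)] = k / len(col)
            return self

        def predict_one(self, x):
            best_c, best_p = None, -1.0
            for c in self.classes_:
                p = self.prior_[c]
                for j, b in enumerate(x):
                    if b < 0:
                        continue             # missing feature: exclude it
                    p *= self.cond_.get((j, b, c), 0.0)
                if p > best_p:
                    best_c, best_p = c, p
            return best_c

In the "-I" protocol this classifier and the discretizer are fitted from scratch on the training part of every CV fold; in the "-P" protocol the cut points are computed once on the whole dataset, which is precisely the information leak that Table 3 exposes.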

Dataset                          Vectors  Classes  Continuous  Discrete  Missing  Percent
Annealing                        898      5        6           12        5907     36.54
Australian credit approval       690      2        6           8         0        0
Wisconsin breast cancer          699      2        9           0         16       0.25
Cleveland heart disease          303      2        5           8         7        0.18
Japanese credit screening (crx)  690      2        6           9         67       0.65
Pima Indian diabetes             768      2        8           0         0        0
German credit data               1000     2        24          0         0        0
Glass identification             214      6        9           0         0        0
Statlog heart                    270      2        13          0         0        0
Hepatitis                        155      2        6           13        167      5.67
Horse colic                      368      2        7           15        1927     23.80
Hypothyroid                      3163     2        7           18        5329     6.74
Iris                             150      3        4           0         0        0
Sick-euthyroid                   3163     2        7           18        5329     6.74
Vehicle silhouettes              846      4        18          0         0        0

Table 1. Datasets used for the comparisons. The Continuous and Discrete columns give the numbers of features of each type; Missing and Percent give the number and percentage of missing values.

Dataset                  Raw                10EW-P             10EW-I             SSV4-P             SSV4-I             SSVO-P             SSVO-I
Annealing                90.99 (0.83/2.45)  96.06 (0.29/1.53)  95.97 (0.39/1.39)  97.58 (0.22/1.06)  97.64 (0.30/1.13)  97.69 (0.28/1.10)  97.54 (0.33/0.92)
Australian credit        77.16 (0.33/3.34)  84.43 (0.48/2.80)  84.71 (0.62/2.74)  85.86 (0.40/2.63)  85.86 (0.40/2.51)  85.99 (0.48/2.85)  85.68 (0.47/3.12)
Wisconsin breast cancer  95.97 (0.11/1.45)  96.62 (0.14/1.41)  96.54 (0.15/1.58)  96.54 (0.21/1.37)  96.54 (0.24/1.35)  97.41 (0.05/1.31)  97.30 (0.11/1.32)
Cleveland heart disease  83.62 (0.57/4.78)  81.28 (0.96/4.50)  81.79 (0.68/4.39)  83.77 (0.57/4.65)  83.26 (0.68/4.64)  83.92 (0.86/4.33)  83.30 (1.24/3.67)
Japanese credit (crx)    77.39 (0.33/2.97)  84.18 (0.48/2.94)  83.92 (0.55/3.16)  85.90 (0.50/2.55)  85.64 (0.47/2.68)  85.90 (0.50/2.55)  85.67 (0.66/3.07)
Pima Indian diabetes     75.56 (0.47/3.32)  75.70 (0.74/3.49)  75.85 (0.74/3.17)  75.34 (0.41/3.45)  75.07 (0.78/3.09)  76.35 (0.36/2.90)  74.91 (0.75/3.60)
German credit data       72.41 (0.61/3.12)  75.08 (0.51/2.78)  74.94 (0.58/2.75)  76.20 (0.48/2.77)  76.04 (0.44/2.84)  76.20 (0.48/2.77)  75.53 (0.58/2.88)
Glass identification     46.02 (2.14/6.88)  57.84 (1.44/6.22)  60.73 (2.45/7.17)  69.59 (1.45/6.75)  66.09 (2.30/7.40)  68.97 (1.84/6.42)  64.40 (1.76/6.63)
Statlog heart            84.22 (0.85/4.21)  80.94 (1.02/4.89)  81.60 (0.95/4.41)  83.28 (0.69/4.68)  82.70 (0.84/4.84)  84.06 (0.57/5.47)  83.07 (0.61/5.59)
Hepatitis                85.05 (0.85/5.49)  86.52 (1.02/5.71)  86.75 (1.85/5.11)  86.65 (1.00/5.11)  85.01 (1.12/5.80)  86.83 (0.87/5.51)  85.10 (0.88/5.62)
Horse colic              79.35 (0.73/4.48)  79.44 (0.74/4.83)  79.23 (0.93/4.95)  78.82 (0.69/4.58)  78.39 (0.94/4.46)  79.00 (0.66/5.11)  78.86 (0.50/5.43)
Hypothyroid              97.74 (0.08/0.43)  96.49 (0.15/0.51)  96.55 (0.20/0.70)  98.25 (0.08/0.38)  98.17 (0.13/0.37)  98.49 (0.09/0.48)  98.26 (0.10/0.45)
Iris                     95.44 (0.47/3.52)  92.76 (1.05/4.56)  92.36 (1.36/4.77)  93.44 (0.84/4.00)  93.13 (0.93/4.40)  93.78 (1.16/3.85)  92.93 (1.00/3.98)
Sick-euthyroid           85.52 (0.36/2.04)  91.53 (0.19/1.05)  90.62 (0.69/2.05)  95.10 (0.17/0.68)  94.86 (0.21/0.67)  95.65 (0.12/0.69)  95.51 (0.24/0.72)
Vehicle silhouettes      45.65 (0.85/3.27)  61.99 (1.00/2.88)  61.71 (1.05/3.26)  59.00 (0.62/3.54)  57.61 (0.71/3.96)  62.49 (1.04/3.44)  61.64 (0.97/3.75)
Average                  79.47              82.72              82.88              84.35              83.73              84.85              83.98

Table 2. Comparison of the classification results: accuracy followed, in parentheses, by two standard deviations (the deviation over the 30 CV means / the average internal deviation of the 5-fold CVs).

Methods             Nominally better/same/worse  Significantly better/same/worse
10EW-I vs Raw       10 / 0 / 5                   8 / 3 / 4
SSV4-I vs Raw        9 / 0 / 6                   9 / 5 / 1
SSVO-I vs Raw       12 / 0 / 3                   9 / 6 / 0
SSV4-I vs 10EW-I    10 / 1 / 4                   4 / 10 / 1
SSVO-I vs 10EW-I    11 / 0 / 4                   5 / 10 / 0
10EW-P vs 10EW-I     8 / 0 / 7                   0 / 15 / 0
SSV4-P vs SSV4-I    12 / 2 / 1                   0 / 15 / 0
SSVO-P vs SSVO-I    15 / 0 / 0                   3 / 12 / 0

Table 3. Statistical significance of the differences between results: the number of datasets on which the first method is nominally and significantly (paired t-test) better than, the same as, or worse than the second.

5 Conclusions

The discretization method presented here significantly improves the Naive Bayes Classifier's results. It brings a statistically significant improvement not only over the basic NBC formulation, but also in relation to the 10 equal-width bins technique and the other methods examined in [1]. The results presented in Table 2 also facilitate a comparison of the proposed algorithm with the methods pursued in [9]. Calculating average accuracies over the 11 datasets common to both studies, for the methods tested by Yang et al. and for the SSV based algorithm, shows that SSV offers a lower classification error rate than all the others but one: the Lazy Discretization (LD), which is extremely computationally expensive. Moreover, the high result of the LD method is a consequence of a 13% lower error for the glass data, and it seems that the two approaches used different data (the reported numbers of classes are different). The 13% difference is also a bit suspicious, since it concerns all the other methods too and is the only one of that magnitude.

There is still some room for future work: the most interesting result would be a criterion able to determine the optimal number of split points without the application of a wrapper technique. Moreover, to simplify the search, a single n currently determines the number of intervals for all the continuous features. It could be more accurate to treat the features independently and, if necessary, use different numbers of splits for different features.

References

1. J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Proceedings of the ICML, pages 194-202, 1995.
2. W. Duch, T. Winiarski, J. Biesiada, and A. Kachel. Feature ranking, selection and discretization. In Proceedings of the ICANN 2003, 2003.
3. U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022-1027. Morgan Kaufmann Publishers, 1993.
4. K. Grąbczewski and N. Jankowski. Transformations of symbolic data for continuous data oriented models. In Proceedings of the ICANN 2003. Springer, 2003.
5. K. Grąbczewski and W. Duch. The Separability of Split Value criterion. In Proceedings of the 5th Conference on Neural Networks and Their Applications, pages 201-208, Zakopane, Poland, June 2000.
6. R. Kerber. ChiMerge: Discretization for numeric attributes. In National Conference on Artificial Intelligence, pages 123-128. AAAI Press, 1992.
7. H. Liu and R. Setiono. Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, 1995.
8. C. J. Merz and P. M. Murphy. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html
9. Y. Yang and G. I. Webb. A comparative study of discretization methods for naive-Bayes classifiers. In Proceedings of PKAW 2002: The 2002 Pacific Rim Knowledge Acquisition Workshop, pages 159-173, 2002.