Some questions of consensus building using co-association

VITALIY TAYANOV
Polish-Japanese High School of Computer Technics
Aleja Legionow, 4190, Bytom
POLAND
vtayanov@yahoo.com

Abstract: In this paper the co-association matrix is applied to divide the whole set of objects into several functional groups. The composition of each group depends on how difficult a given object is to classify. This is done to address two main problems in pattern recognition and machine learning: reducing the recognition error and the overtraining value.

Key Words: Consensus, Co-association matrix, Hamming distance, Dissimilarity

1 Introduction

In the general case, classifier construction, like any recognition algorithm, aims at insensitivity to the irregularity of the data set or sample. If such an algorithm is built by learning, sample irregularity leads to errors at test time even when there was no error during the learning period. Support Vector Machines (SVM) are a case in point. This algorithm constructs a linear hyperplane in some feature space that separates the classes in that space with zero error. This holds only for the learning set, however; on another set obtained from the same source, but under slightly different conditions, the algorithm will exhibit some error. The hyperplane is fixed once and for all and cannot take into account the probabilistic character of new objects belonging to the class set. The value of overtraining thus shows the quality of the learning process, which is why a good estimate of this value is so important. Vapnik-Chervonenkis (VC) theory tells us that to make the difference between the error probabilities during learning and testing small enough, tens or hundreds of thousands of objects are needed, which is often impossible to obtain. This difference is what is called overtraining.
These estimates are strongly overrated and are built for worst cases of the classification problem that almost never occur. That is why, during the last ten years, the theory has been developed towards determining the factors that cause the overrating of these estimates [2]. Thanks to this research, the classical VC estimates have been improved considerably. On the other hand, it is very interesting to investigate how to build algorithms on which the influence of sample irregularity is minimal. This can be done by dividing the general set into functional groups depending on the data complexity and on the results of the classifiers. The mathematical mechanism realizing this division is based on co-association matrices and belongs to the consensus approach to classification and clustering.

2 Co-association matrices with respect to classification algorithms

The idea of the proposed approach consists in grouping (combining) classification results that are identical across a group (ensemble) of classifiers or decision algorithms. The approach concerns the construction of hierarchical classifiers or clustering algorithms. Here we consider classification into N classes. Let I be the number of objects in the set and P the number of classification results. Every classification p (p = 1, ..., P) associates every object k of the sample with one and only one class. The elementary co-association matrix A^k records which algorithms u and v reach consensus on some class for object k:

    A^k_{u,v} = 1 if u ∼ v, 0 otherwise,   (1)

where ∼ denotes consensus between algorithms
ISBN: 978-1-61804-068-8 61
u and v. Because u ∼ v is the same as v ∼ u, A^k_{u,v} is a symmetric binary matrix. If n is the number of algorithms used for consensus building, then the size of the matrix A^k is n × n. The number of different compositions of algorithms that can be created equals P = n(n − 1). Suppose the algorithm space is restricted to a finite number P of compositions that can be made from these algorithms (p = 1, ..., P). From this set of consensus algorithms one needs to select the two that are maximally dissimilar. Formally, the dissimilarity of a pair of algorithms can be defined as the Hamming distance between their classification results over all objects k, represented as binary sequences of zeros and ones. The number of zeros and ones in such a sequence equals the number of objects I: if the algorithm votes that object k belongs to class c (c = 1, ..., C), one puts a 1 at the position corresponding to object k, otherwise a 0. The only remaining task is to find the pair of algorithms with the maximal Hamming distance. The appropriate indices can be determined as

    {i, j} = arg min_{u,v} Σ_{k=1}^{I} A^k_{u,v}.   (2)

After this pair of algorithms is determined on the basis of some learning set, one estimates the frequency with which an object falls into the group of objects on which there is no consensus. Because the pair of the most dissimilar algorithms is estimated on a learning set, the solution is approximate. In general, for classification into an arbitrary number of classes, these consensus algorithms divide the set into three functional groups: a group of objects on which the consensus of the two algorithms is reached and is correct, a group of objects on which consensus is not reached, and a group of objects on which consensus is reached but is incorrect. The number of objects in the third group cannot be reduced at all.
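As a concrete illustration, the elementary co-association matrices of eq. (1) and the pair selection of eq. (2) can be sketched in Python; this is a minimal sketch with illustrative function names, not code from the paper:

```python
import numpy as np

def elementary_coassociation(preds):
    """preds: (n_algorithms, I) array of class labels.
    Returns A of shape (I, n, n), where A[k, u, v] = 1 when algorithms
    u and v assign object k to the same class (consensus), eq. (1)."""
    same = (preds[:, None, :] == preds[None, :, :])   # (n, n, I)
    return same.transpose(2, 0, 1).astype(int)

def most_dissimilar_pair(A):
    """Pick {i, j} = argmin_{u,v} sum_k A[k, u, v], eq. (2): the pair
    with the fewest per-object consensuses, i.e. the pair whose voting
    sequences have the maximal Hamming distance."""
    S = A.sum(axis=0).astype(float)    # (n, n) consensus counts
    np.fill_diagonal(S, np.inf)        # exclude the trivial pair u = v
    i, j = np.unravel_index(np.argmin(S), S.shape)
    return i, j
```

For example, three algorithms voting over four objects where the third algorithm always disagrees with the first two yield that third algorithm as one member of the most dissimilar pair.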
So the minimal probability of classification error in this case is determined by the third group of objects and cannot be less than the probability that an object of the set I belongs to this group. The most interesting group from the research point of view is the second one, on which the most dissimilar algorithms have no consensus. Reclassification of the objects of this set moves some of them into the first and third groups; the more objects fall into the first group, the better the specialized algorithm that performs the reclassification. If one denotes the probabilities that an object belongs to each of the three groups as P_1, P_2 and P_3, then reclassification by the fifty-fifty principle gives the general classification error

    P_e = P_3 + 0.5 P_2.   (3)

This probability has the sense of an upper bound on the classification error, so the error probability lies in the interval [P_3; P_3 + 0.5 P_2]. This is explained by the fact that error probabilities above 0.5 are not considered: the worst acceptable classification algorithm, the fifty-fifty principle, has an error probability approaching 0.5. If two algorithms are characterized by approximately equal probabilities P_3, then the better one is the algorithm with the lower probability P_2, under the constraint P_2 > P_3. This is determined by the risk involved in obtaining correct classifications when reclassifying objects of the second group. Thus one obtains a fast, approximate estimate of the reliability of algorithms.

3 Some properties of the obtained groups

It is also important to study the peculiarities of the test objects. The objects of the first and third groups are not as interesting as those of the second one.
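The three-group split and the error bound of eq. (3) can be sketched as follows; the helper names are illustrative, not from the paper:

```python
import numpy as np

def functional_groups(pred_i, pred_j, y_true):
    """Split objects into the three functional groups:
    group 1: consensus reached and correct,
    group 2: no consensus,
    group 3: consensus reached but incorrect."""
    consensus = (pred_i == pred_j)
    g1 = consensus & (pred_i == y_true)
    g2 = ~consensus
    g3 = consensus & (pred_i != y_true)
    return g1, g2, g3

def error_bound(g2, g3):
    """Upper bound on the classification error, eq. (3):
    P_e = P_3 + 0.5 * P_2 (fifty-fifty reclassification of group 2)."""
    return g3.mean() + 0.5 * g2.mean()
```

With P_2 = P_3 = 0.25 this gives the bound 0.25 + 0.5 * 0.25 = 0.375.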
For the objects of the first and third groups, as well as the second, one can investigate the structure of these groups. This can be done using Gaussian Mixture Models (GMM), where the number of mixture components characterizes the data complexity within a given group of objects. The first and third groups, however, cannot be reclassified. For research purposes it is the second set of objects that is interesting: some objects are separated from the others into their own group, and reclassifying them makes it possible to reduce the general classification error. The principal task in studying the peculiarities of the second group is to analyze its objects in order to build specialized classifiers that correctly classify as many objects of this set as possible. First of all, let us examine the symmetry of the objects of the second group. The symmetry test is performed relative to the algorithm that lies exactly in the middle between the two most dissimilar algorithms, meaning that the Hamming distance from this algorithm to the other two is the same. If we have three algorithms
X, Y and Z, then d_h(X, Y) = d_h(Y, Z). All estimates of the Hamming distance are made on the basis of the learning set. The third algorithm t can be found from the matrix A^k_{u,v} in the following way. Let us denote the Hamming distance matrix by D^H_{u,v} = d_h(u, v). Using the matrix A^k_{u,v} and normalizing the Hamming distance, one obtains

    d_h(u, v) = (1/I) Σ_{k=1}^{I} (1 − A^k_{u,v}).   (4)

First we have to find the element of the Hamming distance matrix D^H_{u,v} whose value is as close as possible to ½ max(D^H_{u,v}). Then we have to find an algorithm t with the property

    d_h(u, t) = d_h(t, v) = ½ max(D^H_{u,v}).   (5)

One then searches for the minimal element in column j (see eq. (2)) of the new distance matrix; the row number of this element is the index of the sought algorithm t.

4 Experimental results

Figures 1-6 show graphical dependencies of the consensus results for problems taken from the UCI repository, which was created at the University of California. The structure of the test tasks in this repository is as follows. Each task is stored as a text file in which the columns are attributes of an object and each row lists the attribute values of one object; thus the number of rows corresponds to the number of objects and the number of columns to the number of attributes per object. A separate column contains the class labels marking each object. Much of the data in this repository is related to biology and medicine. Table 1 gives the error probabilities obtained on the test data for different classifiers and classifier compositions (committees of algorithms). All these algorithms were verified on two tasks that are difficult enough from the classification point of view.
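The normalized distances of eq. (4), the search for the middle algorithm t of eq. (5), and the symmetry check it enables can be sketched as follows; this is a minimal sketch under the paper's definitions, with illustrative function names:

```python
import numpy as np

def hamming_matrix(A):
    """Normalized Hamming distances, eq. (4):
    d_h(u, v) = (1/I) * sum_k (1 - A[k, u, v]),
    where A has shape (I, n, n) as in eq. (1)."""
    return 1.0 - A.mean(axis=0)

def middle_algorithm(D, i, j):
    """Find the algorithm t lying 'in the middle' between the most
    dissimilar pair (i, j): d_h(i, t) and d_h(t, j) should both be
    as close as possible to 0.5 * D[i, j], eq. (5)."""
    target = 0.5 * D[i, j]
    best, best_dev = None, np.inf
    for t in range(D.shape[0]):
        if t in (i, j):
            continue
        dev = abs(D[i, t] - target) + abs(D[t, j] - target)
        if dev < best_dev:
            best, best_dev = t, dev
    return best

def symmetry_gap(D, i, t, j):
    """Deviation from consensus symmetry: zero means t is exactly
    halfway (in Hamming distance) between algorithms i and j."""
    return abs(D[i, t] - D[t, j])
```

A perfectly symmetric middle algorithm gives symmetry_gap equal to zero; the experiments below suggest this is not the case on real data.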
For the proposed algorithm, the minimal and maximal errors obtainable on the given test data are reported.

Figure 1: Task pima from the UCI repository: nonparametric function of the correct consensus of two algorithms

Table 1: Classification error for different algorithms.

method / task                            bupa        pima
Monotone (SVM)                           0.313       0.36
Monotone (Parzen)                        0.37        0.30
AdaBoost (SVM)                           0.307       0.7
AdaBoost (Parzen)                        0.33        0.90
SVM                                      0.4         0.30
Parzen                                   0.338       0.307
RVM                                      0.333       -
Proposed algorithm (min/max, Q = 200)    0.040/0.1   0.041/0.03

In Table 1 the minimal error of the proposed algorithm equals the consensus error. The maximal error is computed as the sum of the minimal error and half of the relative number of objects on which there is no consensus (the fifty-fifty principle). As seen from the table, this maximal error is much less than the smallest error of all the other algorithms on the two UCI tasks. Compared with some of the algorithms in the table, the minimal error of the proposed algorithm is approximately 10 times smaller. The proposed algorithms are also characterized by a much more stable classification error than the other algorithms, as the error comparison on the two UCI tasks shows. Tables 2-3 give, for each UCI task, the estimated probabilities that an object belongs to each of the three functional groups. In this case the objects on which
Figure 2: Task pima from the UCI repository: nonparametric function of the incorrect consensus of two algorithms

Figure 4: Task bupa from the UCI repository: nonparametric function of the correct consensus of two algorithms

Table 2: Task pima from the UCI repository.

         Q=200             Q=30
         µ        σ        µ        σ
P_c      0.635    0.04     0.611    0.064
P_e      0.041    0.006    0.046    0.013
P_nc     0.34     0.019    0.344    0.05

Figure 3: Task pima from the UCI repository: nonparametric function of no consensus between two algorithms

the consensus of the most dissimilar algorithms exists (P_c) belong to the class of so-called easy objects. The objects on which both consenting algorithms make errors (P_e) belong to the class of objects that cause an irreducible error; this error cannot be reduced at all. The last class consists of the objects on which the most dissimilar algorithms have no consensus (P_nc); this group also belongs to the class of border objects. The tables also give the variances of the corresponding probabilities. The minimal size of the blocks on which the estimates are built by cross-validation varies from 30 to 200. Furthermore, the distribution of objects in every group can be described as a mixture of Gaussians, i.e. a GMM; the Expectation-Maximization (EM) algorithm is useful here for determining the moments of the Gaussians. This means that the objects do not have a homogeneous structure but form compact structures, i.e. clusters, which may overlap. The objects of the second group, which interest us most, also form clusters. This structure of the second group allows one to build algorithms that can classify the part of the cluster objects forming the Gaussian mixture. For this, the algorithm must be able to sort out objects within some boundary around an averaged object formed mostly by objects of one class.
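The GMM analysis of a group can be sketched with a plain EM iteration; this is a minimal one-dimensional sketch (the paper does not prescribe an implementation), where the component count K plays the role of the group's complexity measure:

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=200):
    """Plain EM for a one-dimensional Gaussian mixture.
    Returns weights, means and variances of the K components."""
    x = np.asarray(x, dtype=float)
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))   # spread initial means
    var = np.full(K, x.var())
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] of component k for point i
        d2 = (x[:, None] - mu[None, :]) ** 2
        p = w * np.exp(-0.5 * d2 / var) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        Nk = r.sum(axis=0)
        w = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk + 1e-9
    return w, mu, var
```

On two well-separated clusters with K = 2, the recovered means land close to the true cluster centers, matching the "compact, possibly overlapping clusters" picture above.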
So one should develop a specialized algorithm that picks out the objects of every compact cluster and then reclassifies them, correctly for the majority of objects of every cluster. The total classification error will then approach the minimum as the algorithms improve. If the compact clusters (the mixture components) consist of objects of different classes, static models cannot be used. In this case one should use dynamic models (e.g. graphical models such as Bayesian networks, Markov Models (MM) or Hidden Markov Models (HMM)), and detection will be based on the object's behavior, or the regularity of its behavior, in some state space.
Figure 5: Task bupa from the UCI repository: nonparametric function of the incorrect consensus of two algorithms

Figure 6: Task bupa from the UCI repository: nonparametric function of no consensus between two algorithms

Table 3: Task bupa from the UCI repository.

         Q=200             Q=30
         µ        σ        µ        σ
P_c      0.616    0.008    0.599    0.030
P_e      0.040    0.00     0.048    0.016
P_nc     0.344    0.008    0.353    0.017

Figure 7: Task pima from the UCI repository: nonparametric function of the consensus symmetry problem

So the study of the peculiarities of the objects of the second group is interesting because it reveals the rules according to which the specialized algorithms should be built; these algorithms are developed only for reclassifying objects of the second group. As Figs. 7 and 8 show, there is no symmetry in the algorithms, which was verified on both tasks: both mean and variance differ. This means that the Hamming distance is, in general, not linear, which is conditioned by the data. Such nonlinearity makes the task and the algorithms very data-dependent; on the other hand, it makes the classification results difficult to predict. All this confirms the usefulness of a consensus approach that uses the most dissimilar algorithms, which reflects only the statistical changes of the classification results. Such an approach makes the selection of algorithms an easy, unambiguous and non-empirical task.

5 Conclusion

In this paper we estimated the probability that each object belongs to one of three groups: a group of easy objects, on which the correct consensus of two algorithms is reached; a group of objects on which the two most dissimilar algorithms have an incorrect consensus; and a group of objects on which no consensus is achieved. The analysis shows that the probability distributions of the data can be presented as multicomponent models, including GMMs.
All this makes it possible to analyze the proposed algorithms by means of mathematical statistics and probability theory.

Figure 8: Task bupa from the UCI repository: nonparametric function of the consensus symmetry problem

From the figures and tables one can see that the probability estimates obtained by cross-validation with averaged blocks of at least 30 and 200 elements [3] differ little from each other. This allows one to conclude that this method of consensus building, where the consensus is formed by the most dissimilar algorithms, is quite regular and does not show the sensitivity to the sample that other trained algorithms exhibit. As seen from the corresponding tables, the minimal classification error is almost an order of magnitude smaller than the error of the best existing algorithms, and the maximal error is 1.5 to 2 times smaller than that of the other algorithms. The corresponding errors are also much more stable, both with respect to the task on which the algorithm is tested and across the series of given algorithms, whose error values have significantly larger variance. Moreover, since the minimal error is small and stable, it guarantees stable, correct classification of the objects on which the most dissimilar algorithms reach consensus. For the other algorithms such confidence cannot be achieved.

On the other hand, the proposed algorithm makes it possible to evaluate and analyse other algorithms, for example SVM and RVM (Relevance Vector Machines). For SVM, if we regard it as a symmetric consensus problem, the third algorithm t corresponds to the separating hyperplane, while the two hyperplanes passing through the support vectors correspond to the initial dissimilar algorithms. Because SVM and RVM are trained, the position of the hyperplane is only approximate. A change of the hyperplane direction (due to overtraining, caused by a different learning set) then leads to results that depend strongly on the changed direction because of the symmetry problem. This makes SVM unstable with respect to the learning set, which is confirmed by much research.
Indeed, an error of 30-40% (as compared to 4%) gives no confidence in the classification results. Probability estimates based on average values and on the corresponding maxima of the probability distributions (for maximum likelihood estimation, MLE) differ little, which gives an additional guarantee for the corresponding probability estimates. The significance of the obtained consensus estimates (the probabilities of correct consensus, of incorrect consensus, and of consensus not being achieved) provides an estimate of the classification complexity. Problems and algorithms for estimating the complexity of a classification task are discussed in [4]. The mathematical analysis of building committees of algorithms is considered in detail in [5].

References:
[1] V. Vapnik, The nature of statistical learning theory, 2nd ed., Springer Verlag, New York, 2000.
[2] K. Vorontsov, On the influence of similarity of classifiers on the probability of overfitting, in Proc. of the Ninth International Conference on Pattern Recognition and Image Analysis: New Information Technologies (PRIA-9), Nizhni Novgorod, Russian Federation, vol. 2, 2008, pp. 303-306.
[3] S. Gurov, The reliability estimation of classification algorithms, Publishing department of the Computational Mathematics and Cybernetics faculty of Moscow State University, Moscow, 2003 (in Russian).
[4] M. Basu, T. Ho, Data complexity in pattern recognition, Springer, London, 2006.
[5] J. Zhuravlev, On the algebraic approach to solving recognition and classification problems, Problems of Cybernetics, vol. 33, 1978, pp. 5-68 (in Russian).