Some questions of consensus building using co-association

VITALIY TAYANOV
Polish-Japanese High School of Computer Technics
Aleja Legionow, 4190, Bytom
POLAND
vtayanov@yahoo.com

Abstract: In this paper the co-association matrix is applied to divide the whole set of objects into several functional groups. The composition of each group depends on how difficult it is to classify a given object. This is done in order to address two main problems in pattern recognition and machine learning: reducing the recognition error and the overtraining value.

Key Words: Consensus, Co-association matrix, Hamming distance, Dissimilarity

1 Introduction

In the general case, classifier building, like any recognition algorithm, aims at insensitivity to the irregularity of the data set or sample. If learning is used to build such algorithms, the irregularity of the sample leads to an error during testing even when there was no error during the learning period. Support Vector Machines (SVM) can be mentioned as an example. With this algorithm one can construct a linear hyperplane in some feature space that separates the classes in this space with zero error. But this holds only for the learning set; for another set obtained from the same source, but under slightly different conditions, the algorithm will show some error. In this case the hyperplane is fixed and cannot take into account the probabilistic character of the new objects belonging to some class. So the value of overtraining shows the quality of the learning process, which is why it is important to have a good estimate of this value. Vapnik-Chervonenkis (VC) theory tells us that, to achieve a small enough difference between the error probabilities during learning and testing, there should be tens or hundreds of thousands of objects, which is often impossible to have. This difference is what is called overtraining. The VC estimates are strongly overrated and are built for the worst cases of the classification problem, which are almost never met. That is why during the last ten years the theory has been developed towards determining the factors that cause the overrating of these estimates [2]. Thanks to this research the classical VC estimates have been improved considerably. On the other hand, it is very interesting to investigate how to build algorithms in which the influence of the sample irregularity is minimal. This can be done by dividing the general set into functional groups depending on the data complexity and on the results of the classifiers. The mathematical mechanism that realizes this division is based on co-association matrices and belongs to the consensus approach to classification and clustering.

2 Co-association matrices with respect to the classification algorithms

The idea of the proposed approach consists in grouping (combining) classification results that are identical for a group (ensemble) of classifiers or decision algorithms. The proposed approach concerns the construction of hierarchical classifiers or clustering algorithms. Here classification into N classes is considered. Let I be the number of objects in the set and P the number of classification results. Every classification p (p = 1, ..., P) associates every object k of the sample with one and only one class.
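For concreteness, such a set of classification results can be stored as an integer array with one row per algorithm and one column per object; the array name and the values below are purely illustrative and are not part of the original experiments.

```python
import numpy as np

# n = 3 algorithms (rows) assign each of I = 5 objects (columns) to one of the
# classes 0, 1, 2; the concrete values are illustrative only.
labels = np.array([[0, 1, 2, 1, 0],
                   [0, 1, 1, 1, 0],
                   [2, 1, 2, 0, 0]])
n, I = labels.shape
```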
The elementary co-association matrix A^k contains the information about which algorithms u and v have a consensus with respect to some class for object k:

    A^k_{u,v} = 1 if u ≡ v, and A^k_{u,v} = 0 otherwise,    (1)

where ≡ denotes consensus between algorithms u and v.
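A minimal sketch of how the elementary co-association matrices of Eq. (1) could be computed from such a label array (the helper name is an assumption of this illustration, not taken from the paper):

```python
import numpy as np

def co_association(labels: np.ndarray) -> np.ndarray:
    """Elementary co-association matrices A^k of Eq. (1).

    labels[u, k] is the class assigned to object k by algorithm u.
    Returns an array A of shape (I, n, n) with A[k, u, v] = 1 iff
    algorithms u and v have a consensus on object k.
    """
    # Compare every pair of algorithms on every object by broadcasting.
    agree = labels[:, None, :] == labels[None, :, :]          # shape (n, n, I)
    return np.transpose(agree, (2, 0, 1)).astype(np.uint8)    # shape (I, n, n)
```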

Because u ≡ v is the same as v ≡ u, A^k_{u,v} is a symmetric binary matrix. If n is the number of algorithms used for consensus building, then the size of the matrix A^k is n × n. The number of possible different compositions of algorithms that can be created is P = n(n−1)/2. Suppose the algorithm space is limited to a finite number P of compositions that can be made from these algorithms (p = 1, ..., P). From this set of consensus algorithms one needs to take the two that are maximally dissimilar. Formally, the dissimilarity of a pair of algorithms can be defined as the Hamming distance between their classification results, represented as binary sequences of zeros and ones. The number of zeros and ones in such a sequence is equal to the number of objects I: if the algorithm votes that an object belongs to class c (c = 1, ..., C), one puts 1 in the sequence at the position corresponding to object k, otherwise one puts 0. So the only task left is to find the pair of algorithms with the maximal Hamming distance. The appropriate indices can be determined as

    {i, j} = arg min_{u,v} Σ_{k=1}^{I} A^k_{u,v}.    (2)

After determining this pair of algorithms on the basis of some learning set, one estimates the frequency with which an object belongs to the group of objects that have no consensus. Because the pair of the most dissimilar algorithms is estimated on a learning set, the solution is approximate. In general, for classification into an arbitrary number of classes, these consensus algorithms divide the set into three functional groups: a group of objects on which the consensus of the two algorithms is reached and is correct, a group of objects on which consensus is not reached, and a group of objects on which consensus is reached but is incorrect. The number of objects in the third group cannot be reduced at all, so the minimal probability of classification error is conditioned by the third group and cannot be less than the probability that an object from the set I belongs to it. The most interesting group from the research point of view is the second one, on which there is no consensus of the most dissimilar algorithms. Reclassification of the objects from this group yields some objects that move to the first and the third groups; the more objects fall into the first group, the better the specialized algorithm that performs the reclassification. If one denotes the probabilities that an object belongs to each of the three groups by P_1, P_2 and P_3, then reclassification by the fifty-fifty principle gives a general classification error equal to

    P_e = P_3 + 0.5 P_2.    (3)

This probability has the sense of an upper bound for the classification error, so the probability of classification error lies in the interval [P_3; P_3 + 0.5 P_2]. This is explained by the fact that an error probability greater than 0.5 is not considered, because the worst acceptable classification algorithm, working by the fifty-fifty principle, has an error probability approaching 0.5.
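Taken together, Eqs. (2) and (3) amount to the following computation; this is a sketch under the same illustrative conventions as above, where y_true denotes the known class labels of the learning set and is needed only to split the consensus objects into the correct and incorrect groups:

```python
import numpy as np

def most_dissimilar_pair(A: np.ndarray) -> tuple[int, int]:
    """Eq. (2): the pair of algorithms (i, j) with the fewest agreements."""
    agreements = A.sum(axis=0).astype(float)      # (n, n) consensus counts over objects
    np.fill_diagonal(agreements, np.inf)          # exclude u == v
    i, j = np.unravel_index(np.argmin(agreements), agreements.shape)
    return int(i), int(j)

def group_probabilities(labels, y_true, i, j):
    """Three functional groups and the error bound of Eq. (3)."""
    consensus = labels[i] == labels[j]
    P1 = np.mean(consensus & (labels[i] == y_true))   # correct consensus
    P2 = np.mean(~consensus)                          # no consensus
    P3 = np.mean(consensus & (labels[i] != y_true))   # incorrect consensus
    P_e = P3 + 0.5 * P2                               # fifty-fifty upper bound
    return P1, P2, P3, P_e
```

With the illustrative array above, the bound of Eq. (3) would be obtained as group_probabilities(labels, y_true, *most_dissimilar_pair(co_association(labels))).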
If two algorithms are characterised by approximately equal probabilities P_3, then the better algorithm is the one with the lower probability P_2, under the constraint P_2 > P_3. This is determined by the risk of obtaining a correct classification when reclassifying the objects from the second group. Thus one can obtain a fast, approximate estimate of the reliability of the algorithms.

3 Some properties of obtained groups

It is also important to investigate the peculiarities of the test objects. The objects of the first and the third groups are not as interesting as the objects from the second one. For the objects from the first and the third groups, as well as for the second one, the structure of these groups can be studied. This can be done using Gaussian Mixture Models (GMM), where the number of mixture components is a characteristic of the data complexity in the corresponding group of objects. However, these two groups cannot be reclassified. For research purposes it is the second set of objects that is interesting, first of all because some objects are separated from the others into a distinct group, and reclassification of these objects makes it possible to reduce the general classification error. The principal task in studying the peculiarities of the objects from the second group is to analyse them in order to build specialised classifiers that could correctly classify as many objects from this set as possible.

First of all, let us study the symmetry of the objects of the second group. The symmetry test is carried out with respect to the algorithm that lies exactly in the middle between the two most dissimilar algorithms; this means that the Hamming distance from this algorithm to the two others is the same.

If we have three algorithms X, Y and Z, then d_h(X, Y) = d_h(Y, Z) and d_h(X, Z) = 2 d_h(Y, Z) = 2 d_h(X, Y). All estimates of the Hamming distance are made on the basis of the learning set. The third algorithm t can be found from the matrix A^k_{u,v} in the following way. Let us denote the Hamming distance matrix by D^H_{u,v} = d_h(u, v). Using the matrix A^k_{u,v} and normalizing the Hamming distance, one obtains

    d_h(u, v) = 1 − (1/I) Σ_{k=1}^{I} A^k_{u,v}.    (4)

First one has to find the element of the Hamming distance matrix D^H_{u,v} whose value is as close as possible to half of max(D^H_{u,v}). Then one has to find an algorithm t with the property

    d_h(u, t) = d_h(t, v) = 0.5 max(D^H_{u,v}).    (5)

To do so, one searches for the minimal element in column j (see Eq. (2)) of the new distance matrix |D^H_{u,v} − 0.5 max(D^H_{u,v})|; the number of the row in which this element is located determines the index of the searched algorithm t.
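A sketch of this search for the middle algorithm t; Eq. (4) is computed from the co-association matrices, the target value of half the maximal distance follows the symmetry argument above, and the function names are assumptions of this illustration:

```python
import numpy as np

def hamming_matrix(A: np.ndarray) -> np.ndarray:
    """Eq. (4): normalized Hamming distances D^H between all pairs of algorithms."""
    I = A.shape[0]
    return 1.0 - A.sum(axis=0) / I            # (n, n), zeros on the diagonal

def middle_algorithm(D: np.ndarray, i: int, j: int) -> int:
    """Eq. (5): the algorithm t lying (approximately) midway between i and j."""
    target = 0.5 * D[i, j]                    # D[i, j] = max(D^H) for the most dissimilar pair
    deviation = np.abs(D[:, j] - target)      # column j of the new distance matrix
    deviation[[i, j]] = np.inf                # exclude the dissimilar pair itself
    return int(np.argmin(deviation))          # row of the minimal element -> index of t
```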

4 Experimental results

Figures 1-6 show graphical dependencies of the consensus results for problems taken from the UCI repository, which was created at the University of California. The data structure of the test tasks from this repository is as follows. Each task is written as a text file in which rows correspond to objects and columns to their attributes; a separate column contains the class labels of the objects. Much of the data in this repository is related to biology and medicine.

Figure 1: Task pima from UCI repository: nonparametric function of the correct consensus of two algorithms
Figure 2: Task pima from UCI repository: nonparametric function of the incorrect consensus of two algorithms
Figure 3: Task pima from UCI repository: nonparametric function of no consensus between two algorithms
Figure 4: Task bupa from UCI repository: nonparametric function of the correct consensus of two algorithms

Table 1 gives the probabilities of errors obtained on the test data for different classifiers or classifier compositions (committees of algorithms). All of these algorithms were verified on two tasks that are difficult enough from the classification point of view. For the proposed algorithm the minimal and maximal errors obtainable on the given test data are reported.

Table 1: Error of classification for different algorithms.

  Method / Task                            bupa         pima
  Monotone (SVM)                           0.313        0.36
  Monotone (Parzen)                        0.37         0.30
  AdaBoost (SVM)                           0.307        0.7
  AdaBoost (Parzen)                        0.33         0.90
  SVM                                      0.4          0.30
  Parzen                                   0.338        0.307
  RVM                                      0.333        -
  Proposed algorithm (min/max, Q = 200)    0.040/0.1    0.041/0.03

In Table 1 the minimal error equals the consensus error of the proposed algorithm. The maximal error has been calculated as the sum of the minimal error and half of the relative amount of objects on which there is no consensus (the fifty-fifty principle). As seen from the table, this maximal error is much smaller than the smallest error of all the other algorithms on the two UCI tasks. In comparison with some of the algorithms in the table, the minimal error of the proposed algorithm is approximately 10 times smaller. The proposed algorithms are also characterized by a much more stable classification error than the other algorithms, as the corresponding error comparison on the two UCI tasks shows.

Tables 2-3 give the estimates of the probability that each object of a UCI task belongs to each of the three functional groups. The objects on which consensus of the most dissimilar algorithms exists (P_c) belong to the class of so-called easy objects. The objects on which both algorithms that are in consensus make errors (P_e) belong to the class of objects that cause an error that cannot be reduced at all. The last class consists of the objects on which there is no consensus of the most dissimilar algorithms (P_c̄); this group also belongs to the class of border objects. The tables also give the variances of the corresponding probabilities. The minimal size of the blocks on which the estimates are built by cross-validation varies from 30 to 200.

Table 2: Task pima from UCI repository.

           Q = 200           Q = 30
           µ       σ         µ       σ
  P_c      0.635   0.04      0.611   0.064
  P_e      0.041   0.006     0.046   0.013
  P_c̄      0.34    0.019     0.344   0.05

Every distribution of objects in each group can also be described as a mixture of Gaussians, i.e. a GMM. For this it is useful to use the Expectation-Maximization (EM) algorithm to determine the moments of the Gaussians. This means that the objects do not have a homogeneous structure but form compact structures, i.e. clusters that may overlap. The objects from the second group, which are the most interesting for us, also form clusters. Such a structure of the second group allows building algorithms that can classify part of the objects of the clusters forming the mixture of Gaussians. For this it is necessary that the algorithm can sort out the objects within some boundary around an averaged object formed mostly by objects of one class. So one should develop a specialized algorithm that picks out the objects of every compact cluster and then reclassifies the majority of the objects of every cluster correctly; then the total classification error will approach the minimal one as the algorithms improve. If the compact clusters (the individual mixture components) consist of objects of different classes, then it is impossible to use static models. In this case one should use dynamic models (e.g. graphical models such as Bayesian networks, Markov Models (MM) or Hidden Markov Models (HMM)), and detection will be made on the basis of object behavior, or the regularity of object behavior, in some state space.
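As a sketch of this kind of analysis, the GMM describing one functional group could be fitted with the EM algorithm as implemented in scikit-learn's GaussianMixture. Selecting the number of components by BIC is an assumption of this illustration (the paper only states that the number of mixture components characterizes the data complexity), and X_group stands for the feature vectors of the objects of one group:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def describe_group(X_group: np.ndarray, max_components: int = 5) -> GaussianMixture:
    """Fit GMMs with 1..max_components components by EM and keep the best by BIC.

    The selected number of components serves as a rough complexity measure of the
    group; the fitted means, covariances and weights describe its (possibly
    overlapping) clusters.
    """
    best, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=0).fit(X_group)
        bic = gmm.bic(X_group)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best
```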

Figure 5: Task bupa from UCI repository: nonparametric function of the incorrect consensus of two algorithms
Figure 6: Task bupa from UCI repository: nonparametric function of no consensus between two algorithms

Table 3: Task bupa from UCI repository.

           Q = 200           Q = 30
           µ       σ         µ       σ
  P_c      0.616   0.008     0.599   0.030
  P_e      0.040   0.00      0.048   0.016
  P_c̄      0.344   0.008     0.353   0.017

Figure 7: Task pima from UCI repository: nonparametric function of the consensus symmetry problem
Figure 8: Task bupa from UCI repository: nonparametric function of the consensus symmetry problem

The study of the peculiarities of the objects from the second group is therefore interesting because it makes it possible to understand the rules according to which the specialized algorithms should be built; these algorithms are developed only for reclassifying objects from the second group. As can be seen from Figs. 7 and 8, there is no symmetry in the algorithms, which was verified on both tasks: both the mean and the variance are different. This means that the Hamming distance is in general not linear, which is conditioned by the data. Such nonlinearity makes the task and the algorithms very data-dependent; on the other hand, it makes the classification results difficult to predict. All this confirms the usefulness of a consensus approach that uses the most dissimilar algorithms and is subject only to statistical variations of the classification results. Using such an approach makes the selection of algorithms an easy, unambiguous and non-empirical task.

5 Conclusion

In this paper the probability of every object belonging to each of three groups of objects has been estimated: a group of easy objects on which the correct consensus of two algorithms is reached, a group of objects on which the two most dissimilar algorithms have an incorrect consensus, and a group of objects on which consensus is not achieved. The analysis shows that there are probability distributions of the data that can be represented as multicomponent models, including GMM. All this makes it possible to analyze the proposed algorithms by means of mathematical statistics and probability theory.

From the figures and tables one can see that the probability estimates obtained by cross-validation with averaged blocks of minimum 30 and 200 elements [3] differ little among themselves, which makes it possible to conclude that this method of consensus building, where the consensus is built from the most dissimilar algorithms, is quite regular and does not have the sensitivity to the samples that other algorithms using training have. As seen from the corresponding tables, the minimal classification error is almost an order of magnitude smaller than the error of the best existing algorithms. The maximal error is 1.5 to 2 times smaller in comparison with the other algorithms. The corresponding errors are also much more stable, both with respect to the task on which the algorithm is tested and across the series of given algorithms, where the error value has a significantly larger variance. Moreover, since the minimal value of the error is quite small and stable, it guarantees stable obtaining of correct classification results on the objects on which consensus is reached by the most dissimilar algorithms. For the other algorithms such confidence cannot be achieved: an error of 30-40% (as compared to 4%) gives no confidence in the classification results. Estimates of the probabilities based on average values and those based on the maxima of the corresponding probability distributions (i.e. maximum likelihood estimates (MLE)) do not differ much, which gives an additional guarantee for the corresponding probability estimates. The significance of the obtained consensus estimates of the probabilities of correct consensus, incorrect consensus and no consensus provides an estimate of the classification complexity. Problems and algorithms for estimating the complexity of a classification task are discussed in [4]. The mathematical analysis of building committees of algorithms is considered in detail in [5].

On the other hand, the proposed algorithm makes it possible to evaluate and analyse other algorithms, for example SVM and RVM (Relevance Vector Machines). For SVM, if we consider it as a symmetric consensus problem, the third algorithm t corresponds to the separating hyperplane, and the other two hyperplanes, which pass through the support vectors, correspond to the initial dissimilar algorithms. Because training is used in SVM and RVM, the position of a hyperplane is only approximate. A change in the direction of the hyperplane (due to overtraining) will then lead to results that depend heavily on the changed direction (caused by a different learning set) because of the symmetry problem. This makes the SVM unstable with respect to the learning set, which has been confirmed by a lot of research.

References:

[1] V. Vapnik, The Nature of Statistical Learning Theory, 2nd ed., Springer Verlag, New York, 2000.
[2] K. Vorontsov, On the influence of similarity of classifiers on the probability of overfitting, in Proc. of the Ninth International Conference on Pattern Recognition and Image Analysis: New Information Technologies (PRIA-9), Nizhni Novgorod, Russian Federation, vol. 2, 2008, pp. 303-306.
[3] S. Gurov, The Reliability Estimation of Classification Algorithms, Publishing department of the Computational Mathematics and Cybernetics faculty of Moscow State University, Moscow, 2003 (in Russian).
[4] M. Basu, T. Ho, Data Complexity in Pattern Recognition, Springer, London, 2006.
[5] J. Zhuravlev, About the algebraic approach to recognition or classification tasks solution, Problems of Cybernetics, vol. 33, 1978, pp. 5-68 (in Russian).