Combined Weak Classifiers


Chuanyi Ji and Sheng Ma
Department of Electrical, Computer and System Engineering
Rensselaer Polytechnic Institute, Troy, NY 12180
chuanyi@ecse.rpi.edu, shengm@ecse.rpi.edu

Abstract

To obtain classification systems with both good generalization performance and efficiency in space and time, we propose a learning method based on combinations of weak classifiers, where the weak classifiers are linear classifiers (perceptrons) that can do a little better than making random guesses. A randomized algorithm is proposed to find the weak classifiers, which are then combined through a majority vote. As demonstrated through systematic experiments, the method is able to obtain combinations of weak classifiers with good generalization performance and a fast training time on a variety of test problems and real applications.

1 Introduction

The problem we investigate in this work is how to develop a classifier with both good generalization performance and efficiency in space and time in a supervised learning environment. Generalization performance is measured by the probability of classification error of a classifier. A classifier is said to be efficient if its size and the (average) time needed to develop it scale nicely (polynomially) with the dimension of the feature vectors and with the other parameters of the training algorithm. The method we propose to tackle this problem is based on combinations of weak classifiers [8][6], where weak classifiers are classifiers that can do a little better than random guessing. It has been shown by Schapire and Freund [8][4] that the computational power of weak classifiers is equivalent to that of a well-trained classifier, and an algorithm has been given to boost the performance of weak classifiers. What has not been investigated is what type of weak classifiers can be used and how to find them. In practice, these ideas have been applied with success in hand-written character recognition to boost the performance of an already well-trained classifier, but the original idea of combining a large number of weak classifiers has not been used in solving real problems.

An independent work by Kleinberg [6] suggests that, in addition to good generalization performance, combinations of weak classifiers also provide advantages in computation time, since weak classifiers are computationally easier to obtain than well-trained classifiers. However, since that method is based on an assumption which is difficult to realize, discrepancies have been found between the theory and the experimental results [7]. The recent work by Breiman [1][2] also suggests that combinations of classifiers can be computationally efficient, especially when used to learn large data sets.

The focus of this work is to investigate the following problems: (1) how to find weak classifiers, (2) what are the performance and efficiency of combinations of weak classifiers, and (3) what are the advantages of using combined weak classifiers compared with other pattern classification methods? We develop a randomized algorithm to obtain weak classifiers, and then provide simulation results on both synthetic and real problems to show the capabilities and efficiency of combined weak classifiers. The extended version of this work, with some of the theoretical analysis, can be found in [5].

2 Weak Classifiers

In the present work, we choose linear classifiers (perceptrons) as weak classifiers. Let 1/2 - 1/ν be the required generalization error of a weak classifier, where ν (ν ≥ 2) is called the weakness factor and is used to characterize the strength of a classifier: the larger ν, the weaker the weak classifier. A set of weak classifiers is combined through a simple majority vote.

3 Algorithm

Our algorithm for combinations of weak classifiers consists of two steps: (1) generating individual weak classifiers through a simple randomized algorithm; and (2) combining a collection of weak classifiers through a simple majority vote. Three parameters need to be chosen a priori for the algorithm: the weakness factor ν; a number θ (1/2 ≤ θ < 1) which is used as a threshold to partition the training set; and the number of weak classifiers 2L + 1 to be generated, where L is a positive integer.

3.1 Partitioning the Training Set

The method we use to partition a training set is motivated by that given in [4]. Suppose a combined classifier already consists of K (K ≥ 1) weak classifiers. In order to generate a (new) weak classifier, the entire training set of N training samples is partitioned into two subsets: a set of M1 samples which contains all the samples misclassified by the existing combined classifier plus a small fraction of the correctly classified samples, and the remaining N - M1 training samples. The M1 samples are called the "cares", since they will be used to select a new weak classifier, while the rest of the samples are the "don't-cares". The threshold θ is used to determine which samples are assigned as cares. For the n-th training sample (1 ≤ n ≤ N), a performance index α(n) is recorded, where α(n) is the fraction of the weak classifiers in the existing combined classifier which classify the n-th sample correctly. If α(n) < θ, the sample is assigned to the cares; otherwise it is a don't-care. This is done for all N samples. Through partitioning a training set in this way, a newly generated weak classifier is forced to learn the samples which have not been learned by the existing weak classifiers. In the meantime, a properly chosen θ ensures that enough samples are used to obtain each weak classifier.
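The partitioning step can be summarized in a short sketch. This is a hypothetical NumPy illustration, not the authors' code; the function name partition_cares, the (w, b) representation of a perceptron voting sign(w·x + b), and the ±1 label convention are assumptions made for the example.

```python
import numpy as np

def partition_cares(X, y, classifiers, theta):
    """Split a training set into 'cares' and 'don't-cares' (sketch of Sec. 3.1).

    X : (N, d) feature matrix; y : (N,) labels in {-1, +1}
    classifiers : list of (w, b) pairs, each voting sign(X @ w + b)
    theta : threshold with 1/2 <= theta < 1
    Returns index arrays (cares, dont_cares).
    """
    N = len(y)
    if not classifiers:                        # no weak classifiers yet: every sample is a care
        return np.arange(N), np.array([], dtype=int)
    # alpha(n) = fraction of existing weak classifiers that classify sample n correctly
    votes = np.stack([np.sign(X @ w + b) for (w, b) in classifiers])   # shape (K, N)
    alpha = (votes == y).mean(axis=0)
    cares = np.where(alpha < theta)[0]         # misclassified samples plus weakly covered ones
    dont_cares = np.where(alpha >= theta)[0]
    return cares, dont_cares
```

Since θ ≥ 1/2, any sample misclassified by the majority vote has α(n) < 1/2 ≤ θ and therefore always lands among the cares, matching the description above.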

3.2 Random Sampling

To achieve a fast training time, we obtain a weak classifier by randomly sampling the space of all possible linear classifiers. Assume that a feature vector x ∈ R^d is distributed over a compact region D. The direction of the hyperplane characterized by a linear classifier is first generated by randomly selecting the elements of its weight vector from a uniform distribution over (-1, 1)^d. The threshold of the hyperplane is then determined by randomly picking an x ∈ D and letting the hyperplane pass through x. This generates random hyperplanes which pass through the region D and whose directions are distributed randomly over all directions. Such a randomly selected classifier is then tested on all the cares. If it misclassifies a fraction of the cares no larger than 1/2 - 1/ν - ε (for some small ε > 0), the classifier is kept and used in the combination. Otherwise, it is discarded, and the process is repeated until a weak classifier is obtained. A newly generated weak classifier is then combined with the existing ones through a simple majority vote. The entire training set is then tested on the combined classifier to produce a new set of cares and don't-cares. The whole process is repeated until the total number of 2L + 1 weak classifiers has been generated. The algorithm can easily be extended to multiple classes; details can be found in [5].
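The sketch below fills in the random-sampling step and the overall training loop, reusing partition_cares from the previous sketch. It is illustrative code under assumed conventions, not the authors' implementation: sample_weak_classifier, majority_vote, train_cw, eps, max_tries and n_classifiers are names and parameters chosen here. The small demo at the end mimics the overlapping-Gaussian benchmark of Section 4.1 with ν and θ close to the Gaussian column of Table 3 (1/2 + 1/ν = 0.51, θ = 0.51), but it combines far fewer than the 2000 weak classifiers used in the paper, so the printed error is only indicative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weak_classifier(X_cares, y_cares, nu, eps=0.005, max_tries=10_000):
    """Randomly sample hyperplanes until one misclassifies at most a
    fraction 1/2 - 1/nu - eps of the cares (sketch of Sec. 3.2)."""
    d = X_cares.shape[1]
    target = 0.5 - 1.0 / nu - eps
    for _ in range(max_tries):
        w = rng.uniform(-1.0, 1.0, size=d)          # random direction over (-1, 1)^d
        x0 = X_cares[rng.integers(len(X_cares))]    # hyperplane passes through a random point
        b = -w @ x0
        err = np.mean(np.sign(X_cares @ w + b) != y_cares)
        if err <= target:
            return w, b
    raise RuntimeError("no acceptable weak classifier found; try a larger nu")

def majority_vote(X, classifiers):
    votes = np.stack([np.sign(X @ w + b) for (w, b) in classifiers])
    return np.sign(votes.sum(axis=0))               # simple majority of the +/-1 votes

def train_cw(X, y, nu=100.0, theta=0.51, n_classifiers=201):
    classifiers = []
    for _ in range(n_classifiers):
        cares, _ = partition_cares(X, y, classifiers, theta)
        if len(cares) == 0:                         # every sample is already well covered
            break
        classifiers.append(sample_weak_classifier(X[cares], y[cares], nu))
    return classifiers

# Stand-in for the synthetic benchmark: two Gaussian classes with the same
# mean but different variances (here d = 8, 2000 samples per class).
d, n = 8, 2000
X = np.vstack([rng.normal(0.0, 1.0, size=(n, d)), rng.normal(0.0, 2.0, size=(n, d))])
y = np.concatenate([np.full(n, -1), np.full(n, +1)])
cw = train_cw(X, y)
print("training error of the majority vote:", np.mean(majority_vote(X, cw) != y))
```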

4 Experimental Results

Extensive simulations have been carried out on both synthetic and real problems using our algorithm. One synthetic problem is chosen to test the efficiency of our method. Real applications from standard databases are selected to compare the generalization performance of combinations of weak classifiers (CW) with that of other methods such as k-nearest-neighbor classifiers (k-NN; the best result over different values of k is reported), artificial neural networks (ANN), combinations of neural networks (CNN), and stochastic discrimination (SD).

4.1 A Synthetic Problem: Two Overlapping Gaussians

To test the scaling properties of combinations of weak classifiers, a non-linearly separable problem is chosen from the standard ELENA database (available as /pub/neural-nets/ELENA/databases/Benchmarks.ps.Z on ftp.dice.ucl.ac.be). It is a two-class classification problem in which the samples of both classes follow multivariate Gaussian distributions with the same mean but a different variance for each independent variable. There is a considerable amount of overlap between the samples of the two classes, and the problem is not linearly separable. The average generalization error and the standard deviations are given in Figure 1 for our algorithm, based on 20 runs, and for the other classifiers. The Bayes error is also given to show the theoretical limit. The results show that the performance of k-NN degrades very quickly as the dimension d of the feature vectors grows. The performance of ANN is better than that of k-NN but still deviates more and more from the Bayes error as d gets large. The combination of weak classifiers continues to follow the trend of the Bayes error.

Figure 1: Performance versus the dimension of the feature vectors.

Table 1: Performance on Card1, Diabetes1 and Gene1 (% error / standard deviation σ).

Algorithm                  | Card1        | Diabetes1    | Gene1
Combined Weak Classifiers  | 11.3 / 0.85  | 22.70 / 0.70 | 11.80 / 0.52
k-Nearest Neighbor         | 15.67        | 25.8         | 22.87
Neural Networks            | 13.64 / 0.85 | 23.52 / 0.72 | 13.47 / 0.44
Combined Neural Networks   | 13.02 / 0.33 | 22.79 / 0.57 | 12.08 / 0.23

4.2 Proben1 Data Sets

Three data sets, Card1, Diabetes1 and Gene1, were selected to test our algorithm from the Proben1 databases, which contain data sets from real applications (available by anonymous ftp from ftp.ira.uka.de as /pub/papers/techreports/1994/1994-21.ps.Z). The Card1 data set concerns determining whether a customer's credit-card application can be approved, based on 51-dimensional feature vectors; 345 out of 690 examples are used for training and the rest for testing. The Diabetes1 data set concerns determining whether diabetes is present, based on 8-dimensional input patterns; 384 examples are used for training and the same number for testing. The Gene1 data set concerns deciding whether a DNA sequence is from a donor, an acceptor or neither, based on 120-dimensional binary feature vectors; 1588 out of a total of 3175 samples were used for training and the rest for testing. The average generalization error as well as the standard deviations are reported in Table 1. The results for combinations of weak classifiers are based on 25 runs; the results for neural networks and combinations of well-trained neural networks are from the database. As the results demonstrate, combinations of weak classifiers achieve generalization performance comparable to or better than that of combinations of well-trained neural networks.

4.3 Hand-written Digit Recognition

Hand-written digit recognition is chosen to test our algorithm, since one of the previously developed methods based on combinations of weak classifiers (stochastic discrimination [6]) was applied to this problem.

For the purpose of comparison, the same set of data as used in [6] (from the NIST database) is used to train and test our algorithm. The data set contains 10000 digits written by different people, each represented by 16 x 16 black-and-white pixels. The first 4997 digits form the training set, and the rest are used for testing. The performance of our algorithm, k-NN, neural networks, and stochastic discrimination is given in Table 2. The results for our method are based on 5 runs, while the results for the other methods are from [6]. The results show that the performance of our algorithm is slightly worse (by 0.3%) than that of stochastic discrimination, which uses a different method for multi-class classification [6].

Table 2: Performance on hand-written digit recognition (% error / standard deviation σ).

Algorithm                  | % Error / σ
Combined Weak Classifiers  | 4.23 / 0.1
k-Nearest Neighbor         | 4.84
Neural Networks            | 5.33
Stochastic Discrimination  | 3.92

4.4 Effects of the Weakness Factor

Experiments are done to test the effects of ν on the problem of two 8-dimensional overlapping Gaussians. The performance and the average training time (CPU time on a Sun Sparc-10) of combined weak classifiers, based on 10 runs, are given for different values of ν in Figures 2 and 3, respectively. The results indicate that as ν increases an individual weak classifier is obtained more quickly, but more weak classifiers are needed to achieve good performance. When a proper ν is chosen, a nice scaling property can be observed in the training time. The parameters used in all the experiments on real applications are recorded in Table 3. The average tries, i.e., the average number of times the classifier space has to be sampled to obtain an acceptable weak classifier, are also given in the table to characterize the training time for these problems.

Figure 2: Performance versus the number of weak classifiers for different ν.

Figure 3: Training time versus the number of weak classifiers for different ν.

Table 3: Parameters used in our experiments.

Parameter      | Gaussians | Card1 | Diabetes1 | Gene1 | Digits
1/2 + 1/ν      | 0.51      | 0.51  | 0.51      | 0.55  | 0.54
θ              | 0.51      | 0.51  | 0.54      | 0.54  | 0.53
2L + 1         | 2000      | 1000  | 1000      | 4000  | 20000
Average tries  | 2         | 3     | 7         | 4     | 2

4.5 Training Time

To compare the learning time with off-line back-propagation (BP), a feedforward two-layer neural network with 10 sigmoidal hidden units is trained by gradient descent on the problem of the two 8-dimensional overlapping Gaussians, using 2500 training samples. The performance versus CPU time (both algorithms were run on a Sun Sparc-10 workstation) is plotted for both our algorithm and BP in Figure 4. For our algorithm, 2000 weak classifiers are combined; for BP, 1000 epochs are used. The figure shows that our algorithm is much faster than the BP algorithm.

Moreover, when several well-trained neural networks are combined to achieve better performance, the cost in training time becomes even higher. Therefore, compared to combinations of well-trained neural networks, combining weak classifiers is computationally much cheaper.

Figure 4: Performance versus CPU time (training and test curves for BP and CW).

5 Discussions

From the experimental results, we observe that the performance of the combined weak classifiers is comparable to or even better than that of combinations of well-trained classifiers, and out-performs individual neural network classifiers and k-nearest-neighbor classifiers. In the meantime, whereas k-nearest-neighbor classifiers suffer from the curse of dimensionality, a nice scaling property in terms of the dimension of the feature vectors has been observed for combined weak classifiers.

Another important observation from the experiments is that the weakness factor directly impacts the size of a combined classifier and the training time. Therefore, the choice of the weakness factor is important for obtaining efficient combined weak classifiers. Our theoretical analysis on learning an underlying perceptron [5] shows that ν should be at least as large as O(d ln d) to obtain a polynomial training time, and the price paid to accomplish this is a space complexity which is also polynomial in d. This cost can be observed in our experimental results as the need for a large number of weak classifiers.

Acknowledgement

Special thanks are due to Tin Kam Ho for providing the NIST data, related references and helpful discussions. Support from the National Science Foundation (ECS-9312594 and (CAREER) IRI-9502518) is gratefully acknowledged.

References

[1] L. Breiman, "Bias, Variance and Arcing Classifiers," Technical Report TR-460, Department of Statistics, University of California, Berkeley, April 1996.

[2] L. Breiman, "Pasting Bites Together for Prediction in Large Data Sets and On-Line," ftp.stat.berkeley.edu/users/breiman, 1996.

[3] H. Drucker, R. Schapire and P. Simard, "Improving Performance in Neural Networks Using a Boosting Algorithm," Advances in Neural Information Processing Systems, 42-49, 1993.

[4] Y. Freund and R. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," http://www.research.att.com/orgs/ssr/people/yoav or schapire.

[5] C. Ji and S. Ma, "Combinations of Weak Classifiers," IEEE Trans. Neural Networks, Special Issue on Neural Networks and Pattern Recognition, vol. 8, 32-42, Jan. 1997.

[6] E. M. Kleinberg, "Stochastic Discrimination," Annals of Mathematics and Artificial Intelligence, vol. 1, 207-239, 1990.

[7] E. M. Kleinberg and T. Ho, "Pattern Recognition by Stochastic Modeling," Proceedings of the Third International Workshop on Frontiers in Handwriting Recognition, 175-183, Buffalo, May 1993.

[8] R. E. Schapire, "The Strength of Weak Learnability," Machine Learning, vol. 5, 197-227, 1990.