Large Margin Classification Using the Perceptron Algorithm


1 Large Margin Classification Using the Perceptron Algorithm
Yoav Freund, Robert E. Schapire
Presented by Amit Bose, March 23, 2006

2 Goals of the Paper
- Enhance Rosenblatt's Perceptron algorithm so that it can make use of a large margin
- Analyze the error bounds of such an algorithm
- Verify the hypotheses using experimental results

3 But Why Do It?
- Achieving a large margin is desirable
- Non-linear mapping to higher-dimensional spaces is possible
  - Improves separability
  - Improves the chances of widening the margin
- Kernel functions allow computational tractability despite the high dimensionality
- The good old Perceptron doesn't care about margins
  - It is just happy to reduce the training error as much as it can

4 Haven't You Heard of SVMs?
- Yeah, sure. But we love Perceptrons!
- SVM
  - Upsides: linear, maximal margin, kernel compatible
  - Downside: optimization involves solving a large quadratic program
- Enter the voted-perceptron
  - Perceptron-based linear classifier
  - Retains simplicity
  - Takes advantage of any margin that can be achieved

5 Perceptron Revisited
- Given: a sequence of m training samples (x_i, y_i); each x_i is an n-dimensional real vector and each y_i is either +1 or -1
- Maintain a prediction vector w, initialized to 0
- For each sample x presented, predict y(pred) = sign(w · x)
- If y(pred) matches y, do nothing; else update w: w = w + yx
- Modification to get the voted-perceptron:
  - Don't discard intermediate prediction vectors
  - Instead, maintain weights on the prediction vectors themselves, so that different predictions can be combined
  - Weight = number of samples that the predictor survives without making a mistake
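The training and voting steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the choice to treat sign(0) as 0 (so a zero score always counts as a mistake), and the tie-break toward +1 in the final vote are all assumptions made for the example.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(z):
    # Assumption: sign(0) is 0, so a zero score never matches a +1/-1 label.
    return (z > 0) - (z < 0)

def train_voted_perceptron(samples, epochs=1):
    """Return a list of (prediction_vector, weight) pairs.

    samples: list of (x, y), x a list of floats, y in {+1, -1}.
    """
    w = [0.0] * len(samples[0][0])   # prediction vector, initialized to 0
    c = 0                            # samples survived by the current vector
    voters = []
    for _ in range(epochs):
        for x, y in samples:
            if sign(dot(w, x)) == y:
                c += 1                           # correct: the vector survives
            else:
                voters.append((w, c))            # keep the retired vector
                w = [wi + y * xi for wi, xi in zip(w, x)]   # w = w + y x
                c = 1
    voters.append((w, c))
    return voters

def predict_voted(voters, x):
    # Each stored vector votes with its survival count as weight.
    total = sum(c * sign(dot(w, x)) for w, c in voters)
    return 1 if total >= 0 else -1   # assumption: ties go to +1
```

Calling `train_voted_perceptron` with `epochs=1` matches the single-pass setting used in the analysis slides; the only extra state compared with the plain Perceptron is the list of retired vectors and their counts.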

6 From Online to Batch
- The Perceptron is naturally an online algorithm
- Several ways to convert an online algorithm to batch:
  - Cycle through the data for a pre-defined number of epochs or until the algorithm converges
  - Pocket algorithm: track which intermediate vector has the longest run of correct predictions
  - Test all generated prediction vectors against a validation set and pick the best
  - Leave-one-out
- The voted-perceptron uses the deterministic version of leave-one-out

7 Leave-One-Out Conversion
- Train only once, on a subset of the training samples, and make a prediction on a test instance
- Two ways of choosing subsets:
  - Randomized: pick a subset size r at random and train on the first r samples
  - Deterministic: for every possible subset size r, train separately on the first r samples to get a set of classifiers; the prediction is made by the majority-wins rule
- The Perceptron with the modifications suggested earlier, when run for exactly one epoch, is actually deterministic leave-one-out
  - Each presentation of a sample to the perceptron is like training the perceptron on a different subset size
  - Maintaining weights on the prediction vectors is like aggregating the votes of the predictors in leave-one-out
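The claim that a single pass reproduces deterministic leave-one-out can be checked on a toy example. The sketch below, with hypothetical function names, compares a naive version (retrain from scratch on the first r samples for every r from 0 to m) against a one-pass version that casts one vote after each sample is presented. It assumes sign(0) is 0, so the untrained zero vector abstains, and that ties in the vote go to +1.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(z):
    return (z > 0) - (z < 0)   # assumption: sign(0) is 0

def perceptron_step(w, x, y):
    """Return the prediction vector after presenting one sample."""
    if sign(dot(w, x)) != y:
        return [wi + y * xi for wi, xi in zip(w, x)]
    return w

def loo_predict_naive(samples, x):
    # Train separately on the first r samples for r = 0..m, then majority vote.
    votes = 0
    for r in range(len(samples) + 1):
        w = [0.0] * len(x)
        for xi, yi in samples[:r]:
            w = perceptron_step(w, xi, yi)
        votes += sign(dot(w, x))
    return 1 if votes >= 0 else -1

def loo_predict_one_pass(samples, x):
    # Same votes collected during a single epoch of the online algorithm.
    w = [0.0] * len(x)
    votes = sign(dot(w, x))            # r = 0: the zero vector abstains
    for xi, yi in samples:
        w = perceptron_step(w, xi, yi)
        votes += sign(dot(w, x))       # classifier trained on the first r samples
    return 1 if votes >= 0 else -1
```

The naive version costs on the order of m² updates, while the one-pass version costs m; because the online Perceptron is deterministic, both see exactly the same sequence of prediction vectors.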

8 General Comments
- The algorithm is exceedingly simple: the original algorithm remains untouched except for saving intermediate predictors and combining them (compare this to quadratic programming)
- The modus operandi is similar in many respects to that of boosting
  - It doesn't make an explicit attempt to maximize any margin, yet, like boosting, it enjoys the benefits of a large margin
- Enhancements of the Perceptron that do maximize the margin exist, notably the AdaTron
  - Converges asymptotically to the correct solution
  - Rate of convergence follows an exponential law in the number of iterations

9 Analysis: Leave-One-Out
- Suppose the online algorithm is expected to make P mistakes when given m+1 samples drawn i.i.d. from a distribution
- Now convert the online algorithm to batch using leave-one-out and provide it with m random training samples
- The expected probability of making an error on a randomly selected test sample is upper bounded by:
  - P/(m+1) for the randomized version
  - 2P/(m+1) for the deterministic version
- This is indeed a generalization error bound, but it is slightly different from the bounds we have seen; it is called instantaneous error

10 Analysis: Perceptron
- Need to find P for the online Perceptron
- In the separable case, the number of errors at any stage is ≤ (R/γ)^2, where R bounds the norm of every sample and γ is the margin
- For the inseparable case, define slack variables
  - Add dimensions so as to reduce the problem to the separable case in higher dimensions
  - Upper bound on the number of errors: ((R+D)/γ)^2, where D is the square root of the sum of the squared slack variables
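To make the bound concrete, the sketch below computes R, D and ((R+D)/γ)^2 for a chosen unit vector u and target margin γ, and counts the mistakes of one online pass for comparison. It assumes the usual definition of the slack for sample i, max(0, γ - y_i (u · x_i)), which the slide does not spell out; the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def mistake_bound(samples, u, gamma):
    """((R + D) / gamma)^2 for a unit vector u and a margin gamma > 0."""
    # R: largest sample norm.
    R = max(math.sqrt(dot(x, x)) for x, _ in samples)
    # Assumed slack d_i = max(0, gamma - y_i * (u . x_i)); D = sqrt(sum d_i^2).
    D = math.sqrt(sum(max(0.0, gamma - y * dot(u, x)) ** 2 for x, y in samples))
    return ((R + D) / gamma) ** 2

def count_online_mistakes(samples):
    """Mistakes made by one pass of the online Perceptron."""
    w = [0.0] * len(samples[0][0])
    mistakes = 0
    for x, y in samples:
        if y * dot(w, x) <= 0:          # a zero score counts as a mistake
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    return mistakes
```

When every sample already has margin at least γ along u, all slacks are zero, D vanishes, and the expression reduces to the separable bound (R/γ)^2.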

11 Analysis: Putting It Together
- Given m samples, the probability of instantaneous error of the voted-perceptron is at most

  $$\frac{2}{m+1}\, E\left[ \inf_{\|u\|=1;\ \gamma>0} \left( \frac{R + D_{u,\gamma}}{\gamma} \right)^{2} \right]$$

- Notice the inverse dependence on the margin γ
- Here the expectation E[·] is taken over all possible sets of m+1 samples that can be drawn from the underlying distribution
- A stronger statement of the bound follows, where R and D are determined only for those samples that are misclassified

12 Using Kernels
- Kernels are useful for calculating inner products in high-dimensional mapped spaces
- The Perceptron algorithm can be formulated so that all computations involve inner products between samples
  - Training and prediction involve inner products between samples and prediction vectors
  - Because of the update rule, the prediction vector itself is a sum of (misclassified) samples
  - The inner products used can therefore be expanded into sums of inner products between samples: kernel compatible
- Calculation of the final prediction vector can also be done in linear time
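A kernelized Perceptron following this idea might look like the sketch below. The prediction vector is never formed explicitly; it is represented by the list of samples on which a mistake was made, and every score is a signed sum of kernel evaluations against that list. The kernel form (1 + a · b)^d, the function names, and the tie-break toward +1 are assumptions for illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(a, b, degree=2):
    # Assumed polynomial kernel form: (1 + a . b)^degree.
    return (1.0 + dot(a, b)) ** degree

def kernel_score(mistakes, kernel, x):
    # Implicit w . phi(x), with w = sum of y_j * phi(x_j) over past mistakes.
    return sum(yj * kernel(xj, x) for xj, yj in mistakes)

def train_kernel_perceptron(samples, kernel, epochs=1):
    """Return the (x, y) pairs on which the Perceptron made a mistake."""
    mistakes = []
    for _ in range(epochs):
        for x, y in samples:
            if y * kernel_score(mistakes, kernel, x) <= 0:
                mistakes.append((x, y))   # same as w = w + y * phi(x)
    return mistakes

def predict_kernel(mistakes, kernel, x):
    # Assumption: ties go to +1.
    return 1 if kernel_score(mistakes, kernel, x) >= 0 else -1
```

On XOR-style data, which no linear classifier through the origin separates, a degree-2 kernel lets this sketch fit all four points after a few epochs; the cost of each prediction grows with the number of stored mistakes.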

13 Experimental Results
- NIST OCR database: classification of hand-written digits
- Multi-class problem: reduce to several one-vs-rest 2-class problems; go with the class that has the highest prediction score
- Score calculated in 4 different ways:
  - Using the final prediction vector (normalized and unnormalized)
  - Using the voted prediction vector
  - Using the averaged prediction vector (normalized and unnormalized)
  - Using a randomized prediction vector (normalized and unnormalized)
- Polynomial kernels of degree up to 6
- Observations:
  - Moving to higher dimensions causes a large drop in error
  - Voting and averaging prediction vectors do better than the traditional perceptron
  - They are significantly better when fewer epochs are run
  - Random vectors performed worst in all cases
  - The algorithm ran slowly for low-degree polynomial kernels
  - Accuracy is inferior, but comparable, to that of SVMs
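The one-vs-rest reduction mentioned on this slide can be sketched as follows. For brevity the example scores each class with the final, unnormalized prediction vector of a plain linear Perceptron, which is only one of the four scoring methods listed; the data layout and function names are hypothetical and unrelated to the NIST experiments.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_perceptron(samples, epochs=1):
    """Plain Perceptron; returns the final prediction vector."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, y in samples:
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def train_one_vs_rest(samples, epochs=1):
    """samples: list of (x, class_label). Returns {class_label: vector}."""
    classes = sorted({label for _, label in samples})
    models = {}
    for c in classes:
        # Relabel: the chosen class is +1, every other class is -1.
        binary = [(x, 1 if label == c else -1) for x, label in samples]
        models[c] = train_perceptron(binary, epochs)
    return models

def predict_class(models, x):
    # Go with the class whose 2-class predictor gives the highest score.
    return max(models, key=lambda c: dot(models[c], x))
```

Swapping `train_perceptron` for a voted or averaged variant changes only how the per-class score is computed; the argmax over classes stays the same.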

14 Summary
- The paper described a simple extension to the Perceptron
- The extension allows the Perceptron to utilize any margin that may be achievable
- A kernel formulation can reduce computation
- The Perceptron is now ready for classification in non-linearly mapped higher dimensions
- Experiments verify this proposition

15 The (Kernel) AdaTron Algorithm
