Improvements to SMO Algorithm for SVM Regression

S.K. Shevade, S.S. Keerthi, C. Bhattacharyya & K.R.K. Murthy
shirish@csa.iisc.ernet.in  mpessk@guppy.mpe.nus.edu.sg  cbchiru@csa.iisc.ernet.in  murthy@csa.iisc.ernet.in

Author for correspondence: Prof. S.S. Keerthi, Dept. of Mechanical and Production Engineering, National University of Singapore, Singapore 119260.

Abstract

This paper points out an important source of inefficiency in Smola and Scholkopf's Sequential Minimal Optimization (SMO) algorithm for SVM regression that is caused by the use of a single threshold value. Using clues from the KKT conditions for the dual problem, two threshold parameters are employed to derive modifications of SMO for regression. These modified algorithms perform significantly faster than the original SMO on the datasets tried.

1 Introduction

Support Vector Machine (SVM) is an elegant tool for solving pattern-recognition and regression problems. Over the past few years, it has attracted a lot of researchers from the neural network and mathematical programming community, the main reason for this being its ability to provide excellent generalization performance. SVMs have also been demonstrated to be valuable for several real-world applications.

In this paper, we address the SVM regression problem. Recently, Smola and Scholkopf [7, 8] proposed an iterative algorithm called Sequential Minimal Optimization (SMO) for solving the regression problem using SVM. This algorithm is an extension of the SMO algorithm proposed by Platt [5] for SVM classifier design. The remarkable feature of the SMO algorithms is that they are fast as well as very easy to implement. In a recent paper [4] we suggested some improvements to Platt's SMO algorithm for SVM classifier design. In this paper, we extend those ideas to Smola and Scholkopf's SMO algorithm for regression. The improvements that we suggest in this paper enhance the value of SMO for regression even further. In particular, we point out an important source of inefficiency caused by the way SMO maintains and updates a single threshold value. Getting clues from optimality criteria associated with the Karush-Kuhn-Tucker (KKT) conditions for the dual problem, we suggest the use of two threshold parameters and devise two modified versions of SMO for regression that are more efficient than the original SMO. Computational comparison on datasets shows that the modifications perform significantly better than the original SMO.

The paper is organized as follows. In section 2 we briefly discuss the SVM regression problem formulation, the dual problem and the associated KKT optimality conditions. We also point out how these conditions lead to proper criteria for terminating algorithms for designing SVMs for regression. Section 3 gives a brief overview of Smola and Scholkopf's SMO algorithm for regression. In section 4 we point out the inefficiency associated with the way SMO uses a single threshold value and describe the modified algorithms in section 5. Computational comparison is done in section 6.

2 The SVM Regression Problem and Optimality Conditions

The basic problem addressed in this paper is the regression problem. The tutorial by Smola and Scholkopf [7] gives a good overview of the solution of this problem using SVMs. Throughout the paper we will use $x$ to denote the input vector of the support vector machine and $z$ to denote the feature space vector, which is related to $x$ by a transformation, $z = \phi(x)$. Let the training set, $\{x_i, d_i\}$, consist of $m$ data points, where $x_i$ is the $i$-th input pattern and $d_i$ is the corresponding target value, $d_i \in R$. The goal of SVM regression is to estimate a function $f(x)$ that is as "close" as possible to the target values $d_i$ for every $x_i$ and, at the same time, is as "flat" as possible for good generalization. The function $f$ is represented using a linear function in the feature space:

$f(x) = w \cdot \phi(x) + b$

where $b$ denotes the bias. As in all SVM designs, we define the kernel function $k(x, \hat{x}) = \phi(x) \cdot \phi(\hat{x})$, where "$\cdot$" denotes inner product in the $z$ space. Thus, all computations will be done using only the kernel function. This inner-product kernel helps in taking the dot product of two vectors in the feature space without having to construct the feature space explicitly. Mercer's theorem [2] tells the conditions under which this kernel operator is useful for SVM designs.

For SVM regression purposes, Vapnik [9] suggested the use of the $\epsilon$-insensitive loss function, where the error is not penalized as long as it is less than $\epsilon$. It is assumed here that $\epsilon$ is known a priori. Using this error function together with a regularizing term, and letting $z_i = \phi(x_i)$, the optimization problem solved by the support vector machine can be formulated as:

$\min \;\; \frac{1}{2}\|w\|^2 + C \sum_i (\xi_i + \xi_i^*)$
$\text{s.t.} \;\; d_i - w \cdot z_i - b \le \epsilon + \xi_i, \;\;\; w \cdot z_i + b - d_i \le \epsilon + \xi_i^*, \;\;\; \xi_i, \xi_i^* \ge 0 \;\; \forall i$     (P)

The above problem is referred to as the primal problem. The constant $C > 0$ determines the trade-off between the smoothness of $f$ and the amount up to which deviations larger than $\epsilon$ are tolerated.
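
To make the role of the $\epsilon$-insensitive loss concrete, here is a minimal sketch in Python/NumPy (our own illustration, not the paper's pseudo-code; an explicit feature matrix Z with rows $z_i$ is assumed purely for simplicity) of the penalty term that enters (P):

    import numpy as np

    def eps_insensitive_loss(f, d, eps):
        # Zero inside the epsilon-tube around the target, linear outside it;
        # equals xi_i + xi_i* at the optimal choice of the slack variables.
        return np.maximum(0.0, np.abs(f - d) - eps)

    def primal_objective(w, b, Z, d, C, eps):
        # (P): 0.5*||w||^2 + C * sum_i (xi_i + xi_i*), with f(x) = w.z + b
        # evaluated here on an explicit feature matrix Z (rows are z_i).
        f = Z @ w + b
        return 0.5 * float(w @ w) + C * float(np.sum(eps_insensitive_loss(f, d, eps)))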

Let us define $w(\alpha, \alpha^*) = \sum_i (\alpha_i - \alpha_i^*) z_i$. We will refer to the $\alpha_i^{(*)}$'s as Lagrange multipliers. Using Wolfe duality theory, it can be shown that the $\alpha_i^{(*)}$'s are obtained by solving the following dual problem:

$\max \;\; \sum_i d_i (\alpha_i - \alpha_i^*) - \epsilon \sum_i (\alpha_i + \alpha_i^*) - \frac{1}{2}\, w(\alpha, \alpha^*) \cdot w(\alpha, \alpha^*)$
$\text{s.t.} \;\; \sum_i (\alpha_i - \alpha_i^*) = 0, \;\;\; \alpha_i, \alpha_i^* \in [0, C] \;\; \forall i$     (D)

Once the $\alpha_i$'s and $\alpha_i^*$'s are obtained, the primal variables $w$, $b$, $\xi_i$ and $\xi_i^*$ can be easily determined by using the KKT conditions. The feature space (and hence $w$) can be infinite dimensional. This makes it computationally difficult to solve the primal problem (P). The numerical approach in SVM design is to solve the dual problem since it is a finite-dimensional optimization problem. (Note that $w(\alpha, \alpha^*) \cdot w(\alpha, \alpha^*) = \sum_i \sum_j (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) k(x_i, x_j)$.)

To derive proper stopping conditions for algorithms which solve the dual, it is important to write down the optimality conditions for the dual. The Lagrangian for the dual is:

$L_D = \frac{1}{2}\, w(\alpha, \alpha^*) \cdot w(\alpha, \alpha^*) - \sum_i d_i (\alpha_i - \alpha_i^*) + \epsilon \sum_i (\alpha_i + \alpha_i^*) + \beta \sum_i (\alpha_i - \alpha_i^*) - \sum_i \delta_i \alpha_i - \sum_i \delta_i^* \alpha_i^* - \sum_i \mu_i (C - \alpha_i) - \sum_i \mu_i^* (C - \alpha_i^*)$

Let $F_i = d_i - w(\alpha, \alpha^*) \cdot z_i$. The KKT conditions for the dual problem are:

$\partial L_D / \partial \alpha_i = -F_i + \epsilon + \beta - \delta_i + \mu_i = 0$
$\partial L_D / \partial \alpha_i^* = F_i + \epsilon - \beta - \delta_i^* + \mu_i^* = 0$
$\delta_i \ge 0, \;\; \delta_i \alpha_i = 0, \;\; \mu_i \ge 0, \;\; \mu_i (C - \alpha_i) = 0, \;\; 0 \le \alpha_i \le C$
$\delta_i^* \ge 0, \;\; \delta_i^* \alpha_i^* = 0, \;\; \mu_i^* \ge 0, \;\; \mu_i^* (C - \alpha_i^*) = 0, \;\; 0 \le \alpha_i^* \le C$
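
As a cross-check on the algebra, here is a small sketch of the dual objective in (D), written directly in terms of a precomputed kernel matrix K with K[i, j] = k(x_i, x_j) (the helper name and the explicit matrix are our own, not from the paper; the equality and box constraints are not checked here since the algorithm maintains them itself):

    import numpy as np

    def dual_objective(alpha, alpha_star, d, K, eps):
        # (D): sum_i d_i (a_i - a_i*) - eps * sum_i (a_i + a_i*)
        #      - 0.5 * sum_{i,j} (a_i - a_i*)(a_j - a_j*) k(x_i, x_j)
        coef = alpha - alpha_star
        return float(d @ coef
                     - eps * np.sum(alpha + alpha_star)
                     - 0.5 * coef @ (K @ coef))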

These conditions (the KKT conditions are both necessary and sufficient for optimality; hereafter we will simply refer to them as optimality conditions) can be simplified by considering the following five cases. It is easy to check that at optimality, for every $i$, $\alpha_i$ and $\alpha_i^*$ cannot both be non-zero at the same time. Hence cases corresponding to $\alpha_i \alpha_i^* \ne 0$ have been left out. (It is worth noting here that in the SMO regression algorithm and its modifications discussed in this paper, the condition $\alpha_i \alpha_i^* = 0 \;\; \forall i$ is maintained throughout.)

Case 1: $\alpha_i = \alpha_i^* = 0$ :  $-\epsilon \le (F_i - \beta) \le \epsilon$     (1a)
Case 2: $\alpha_i = C$ :  $F_i - \beta \ge \epsilon$     (1b)
Case 3: $\alpha_i^* = C$ :  $F_i - \beta \le -\epsilon$     (1c)
Case 4: $\alpha_i \in (0, C)$ :  $F_i - \beta = \epsilon$     (1d)
Case 5: $\alpha_i^* \in (0, C)$ :  $F_i - \beta = -\epsilon$     (1e)

Define the following index sets at a given $(\alpha, \alpha^*)$: $I_{0a} = \{i : 0 < \alpha_i < C\}$; $I_{0b} = \{i : 0 < \alpha_i^* < C\}$; $I_1 = \{i : \alpha_i = 0, \alpha_i^* = 0\}$; $I_2 = \{i : \alpha_i = 0, \alpha_i^* = C\}$; $I_3 = \{i : \alpha_i = C, \alpha_i^* = 0\}$. Also, let $I_0 = I_{0a} \cup I_{0b}$. Let us also define $\tilde{F}_i$ and $\hat{F}_i$ as

$\tilde{F}_i = F_i + \epsilon$ if $i \in I_{0b} \cup I_2$;  $\tilde{F}_i = F_i - \epsilon$ if $i \in I_{0a} \cup I_1$

and

$\hat{F}_i = F_i - \epsilon$ if $i \in I_{0a} \cup I_3$;  $\hat{F}_i = F_i + \epsilon$ if $i \in I_{0b} \cup I_1$.

Using these definitions we can rewrite the necessary conditions mentioned in (1a)-(1e) as

$\tilde{F}_i \le \beta \;\; \forall i \in I_0 \cup I_1 \cup I_2, \;\;\;\; \hat{F}_i \ge \beta \;\; \forall i \in I_0 \cup I_1 \cup I_3.$     (2)

Let us define

$b_{up} = \min\{\hat{F}_i : i \in I_0 \cup I_1 \cup I_3\}$ and $b_{low} = \max\{\tilde{F}_i : i \in I_0 \cup I_1 \cup I_2\}$     (3)
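
The index sets, $\tilde{F}_i$, $\hat{F}_i$ and the two thresholds in (3) translate directly into code. A minimal sketch under the same definitions (the naming is ours; F is assumed to be a NumPy array holding $F_i = d_i - w(\alpha, \alpha^*) \cdot z_i$, and a real implementation would compare against 0 and C with a small numerical tolerance):

    import numpy as np

    def compute_thresholds(alpha, alpha_star, F, C, eps):
        # Returns (b_low, i_low, b_up, i_up) as in (3):
        #   b_low = max{ Ftilde_i : i in I0 u I1 u I2 }
        #   b_up  = min{ Fhat_i  : i in I0 u I1 u I3 }
        b_low, b_up = -np.inf, np.inf
        i_low = i_up = -1
        for i in range(len(F)):
            in_I0a = 0.0 < alpha[i] < C
            in_I0b = 0.0 < alpha_star[i] < C
            in_I1 = (alpha[i] == 0.0) and (alpha_star[i] == 0.0)
            in_I2 = (alpha[i] == 0.0) and (alpha_star[i] == C)
            in_I3 = (alpha[i] == C) and (alpha_star[i] == 0.0)
            if in_I0a or in_I0b or in_I1 or in_I2:      # Ftilde_i is defined here
                f_tilde = F[i] + eps if (in_I0b or in_I2) else F[i] - eps
                if f_tilde > b_low:
                    b_low, i_low = f_tilde, i
            if in_I0a or in_I0b or in_I1 or in_I3:      # Fhat_i is defined here
                f_hat = F[i] - eps if (in_I0a or in_I3) else F[i] + eps
                if f_hat < b_up:
                    b_up, i_up = f_hat, i
        return b_low, i_low, b_up, i_up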

Then the optimality conditions will hold at a given $(\alpha, \alpha^*)$ iff

$b_{low} \le b_{up}$     (4)

It is easy to see the close relationship between the threshold parameter $b$ in the primal problem and the multiplier $\beta$. In particular, at optimality, $\beta$ and $b$ are identical. Therefore, in the rest of the paper, $\beta$ and $b$ will denote one and the same quantity.

We will say that an index pair $(i, j)$ defines a violation at $(\alpha, \alpha^*)$ if one of the following two sets of conditions holds:

$i \in I_0 \cup I_1 \cup I_2, \;\; j \in I_0 \cup I_1 \cup I_3$ and $\tilde{F}_i > \hat{F}_j$     (5a)
$i \in I_0 \cup I_1 \cup I_3, \;\; j \in I_0 \cup I_1 \cup I_2$ and $\hat{F}_i < \tilde{F}_j$     (5b)

Note that the optimality condition will hold at $(\alpha, \alpha^*)$ iff there does not exist any index pair $(i, j)$ that defines a violation.

Since, in numerical solution, it is usually not possible to achieve optimality exactly, there is a need to define approximate optimality conditions. The condition (4) can be replaced by

$b_{low} \le b_{up} + 2\tau$     (6)

where $\tau$ is a positive tolerance parameter. (In the pseudo-codes given in the appendix of this paper, this parameter is referred to as tol.) Correspondingly, the definition of violation can be altered by replacing (5a) and (5b) respectively by:

$i \in I_0 \cup I_1 \cup I_2, \;\; j \in I_0 \cup I_1 \cup I_3$ and $\tilde{F}_i > \hat{F}_j + 2\tau$     (7a)
$i \in I_0 \cup I_1 \cup I_3, \;\; j \in I_0 \cup I_1 \cup I_2$ and $\hat{F}_i < \tilde{F}_j - 2\tau$     (7b)

Hereafter in the paper, when optimality is mentioned it will mean approximate optimality.

Let $E_i = F_i - b$. Using (1) it is easy to check that optimality holds iff there exists a $b$ such that the following hold for every $i$:

$\alpha_i > 0 \;\Rightarrow\; E_i \ge \epsilon - \tau$     (8a)
$\alpha_i < C \;\Rightarrow\; E_i \le \epsilon + \tau$     (8b)
$\alpha_i^* > 0 \;\Rightarrow\; E_i \le -\epsilon + \tau$     (8c)
$\alpha_i^* < C \;\Rightarrow\; E_i \ge -\epsilon - \tau$     (8d)
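
Conditions (6), (7a) and (7b) become one-line tests once $b_{low}$ and $b_{up}$ are at hand, and this is also the form in which the modifications of section 5 use them. A sketch (the helpers and their names are ours; tau plays the role of tol, and the worst partner index for a violating $i$ is $i_{up}$ for (7a) and $i_{low}$ for (7b)):

    def reached_optimality(b_low, b_up, tau):
        # Approximate optimality (6): b_low <= b_up + 2*tau.
        return b_low <= b_up + 2.0 * tau

    def violates(f_tilde_i, f_hat_i, b_low, b_up, tau):
        # (7a): an index with Ftilde_i defined violates (against j = i_up)
        #       if Ftilde_i > b_up + 2*tau.
        # (7b): an index with Fhat_i defined violates (against j = i_low)
        #       if Fhat_i < b_low - 2*tau.
        # Pass None for whichever of Ftilde_i / Fhat_i is undefined for this index.
        if f_tilde_i is not None and f_tilde_i > b_up + 2.0 * tau:
            return True
        if f_hat_i is not None and f_hat_i < b_low - 2.0 * tau:
            return True
        return False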

These conditions are used in [7, 8], together with a special choice of $b$, to check if an example violates the KKT conditions. However, unless the choice of $b$ turns out to be right, using the above conditions for checking optimality can be incorrect. We will say more about this in section 4, after a brief discussion of Smola and Scholkopf's SMO algorithm in the next section.

3 Smola and Scholkopf's SMO Algorithm for Regression

A number of algorithms have been suggested for solving the dual problem. Smola and Scholkopf [7, 8] give a detailed view of these algorithms and their implementations. Traditional quadratic programming algorithms such as interior point algorithms are not suitable for large size problems because of the following reasons. First, they require that the kernel matrix $k(x_i, x_j)$ be computed and stored in memory. This requires extremely large memory. Second, these methods involve expensive matrix operations such as Cholesky decomposition of a large sub-matrix of the kernel matrix. Third, coding of these algorithms is difficult. Attempts have been made to develop methods that overcome some or all of these problems. One such method is chunking. The idea here is to operate on a fixed size subset of the training set at a time. This subset is called the working set; the optimization subproblem is solved with respect to the variables corresponding to the examples in the working set, and a set of support vectors for the current working set is found. These current support vectors are then used to determine the new working set, the data on which the current estimator would make errors. The new optimization subproblem is solved and this process is repeated until the KKT conditions are satisfied for all the examples.

Platt [5] proposed an algorithm, called Sequential Minimal Optimization (SMO), for SVM classifier design. This algorithm takes chunking to the extreme by iteratively selecting working sets of size two and optimizing the target function with respect to them. One advantage of using working sets of size 2 is that the optimization subproblem can be solved analytically. The chunking process is repeated till all the training examples satisfy the KKT conditions. Smola and Scholkopf [7, 8] extended these ideas for solving the regression problem using SVMs. We describe this algorithm very briefly below. The details, together with a pseudo-code, can be found in [7, 8]. We assume that the reader is familiar with them. To convey our ideas compactly we employ the notations used in [7, 8].

The basic step in the SMO algorithm consists of choosing a pair of indices $(i_1, i_2)$ and optimizing the dual objective function in (D) by varying the Lagrange multipliers corresponding to $i_1$ and $i_2$ only.

We make one important comment here on the role of the threshold parameter $\beta$. As in [7, 8], define the output error on the $i$-th pattern as $E_i = F_i - \beta$. Let us call the indices of the two multipliers chosen for joint optimization in one step $i_2$ and $i_1$. To take a step by varying the Lagrange multipliers of examples $i_1$ and $i_2$, we only need to know $E_{i_1} - E_{i_2} = F_{i_1} - F_{i_2}$. Therefore a knowledge of the value of $\beta$ is not needed to take a step.

The method followed to choose $i_1$ and $i_2$ at each step is crucial for finding the solution of the problem efficiently. The SMO algorithm employs a two-loop approach: the outer loop chooses $i_2$, and, for a chosen $i_2$, the inner loop chooses $i_1$. The outer loop iterates over all patterns violating the optimality conditions, first only over those with Lagrange multipliers neither on the upper nor on the lower boundary (in Smola and Scholkopf's pseudo-code this looping is indicated by examineAll = 0), and, once all of them are satisfied, over all patterns violating the optimality conditions (examineAll = 1) to ensure that the problem has indeed been solved. For efficient implementation a cache for $E_i$ is maintained and updated for those indices corresponding to non-boundary Lagrange multipliers. The remaining $E_i$ are computed as and when needed.

Let us now see how the SMO algorithm chooses $i_1$. The aim is to make a large increase in the objective function. Since it is expensive to try out all possible choices of $i_1$ and choose the one that gives the best increase in the objective function, the index $i_1$ is chosen to maximize $|E_{i_2} - E_{i_1}|$ or $|E_{i_2} - E_{i_1} \pm 2\epsilon|$, depending on the multipliers of $i_1$ and $i_2$. Since $E_i$ is available in cache for non-boundary multiplier indices, only such indices are initially used in the above choice of $i_1$. If such a choice of $i_1$ does not yield sufficient progress, then the following steps are taken. Starting from a randomly chosen index, all indices corresponding to non-bound multipliers are tried as a choice for $i_1$, one by one. If sufficient progress is still not possible, all indices are tried as choices for $i_1$, one by one, again starting from a randomly chosen index. Thus the choice of the random seed affects the running time of SMO.
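
A sketch of this second-choice heuristic (our own helper; for brevity it ignores the $\pm 2\epsilon$ variants that depend on which multipliers of the two indices are active):

    def choose_second_index(i2, E, non_bound_indices):
        # Original SMO heuristic: among the indices whose E values are cached
        # (the non-boundary multipliers), pick i1 maximizing |E[i2] - E[i1]|.
        best_i1, best_gap = -1, -1.0
        for i1 in non_bound_indices:
            if i1 == i2:
                continue
            gap = abs(E[i2] - E[i1])
            if gap > best_gap:
                best_gap, best_i1 = gap, i1
        return best_i1

If the step taken with the index returned here does not make sufficient progress, SMO falls back to the random-start sweeps described above.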

Although a value of $\beta$ is not needed to take a step, it is needed if (8a)-(8d) are employed for checking optimality. In the SMO algorithm $\beta$ is updated after each step. A value of $\beta$ is chosen so as to satisfy (1) for $i \in \{i_1, i_2\}$. If, after a step involving $(i_1, i_2)$, one of the Lagrange multipliers (or both) takes a non-boundary value then (1d) or (1e) is exploited to update the value of $\beta$. In the rare case that this does not happen, there exists a whole interval, say $[\beta_{low}, \beta_{up}]$, of admissible thresholds. In this situation SMO simply chooses $\beta$ to be the mid-point of this interval.

4 Inefficiency of the SMO algorithm

The SMO algorithm for regression, discussed above, is very simple and easy to implement. However, it can become inefficient, typically near a solution point, because of its way of computing and maintaining a single threshold value. At any instant, the SMO algorithm fixes $b$ based on the current two indices used for joint optimization. However, while checking whether the remaining examples violate optimality or not, it is quite possible that a different, shifted choice of $b$ may do a better job. So, in the SMO algorithm it is quite possible that, even though $(\alpha, \alpha^*)$ has reached a value where optimality is satisfied (i.e., (6) holds), SMO has not detected this because it has not identified the correct choice of $b$. It is also quite possible that a particular index may appear to violate the optimality conditions because (8) is employed using an "incorrect" value of $b$, although this index may not be able to pair with another index to make progress in the objective function. In such a situation the SMO algorithm does an expensive and wasteful search looking for a second index so as to take a step. We believe that this is a major source of inefficiency in the SMO algorithm.

There is one simple alternate way of choosing $b$ that involves all indices. By duality theory, the objective function value in (P) of a primal feasible solution is greater than or equal to the objective function value in (D) of a dual feasible solution. The difference between these two values is referred to as the duality gap. The duality gap is zero only at optimality. Suppose $(\alpha, \alpha^*)$ is given and $w = w(\alpha, \alpha^*)$. The slack terms $\xi_i$, $\xi_i^*$ can be chosen optimally (as functions of $b$). The result is that the duality gap is expressed as a function of $b$ only. One possible way of improving the SMO algorithm is to always choose $b$ so as to minimize the duality gap. This corresponds to the subproblem

$\min_b \;\; \sum_i \max(0, \; F_i - b - \epsilon, \; -F_i + b - \epsilon)$

Let $m$ denote the number of examples. In an increasing order arrangement of $\{F_i - \epsilon\}$ and $\{F_i + \epsilon\}$, let $f_m$ and $f_{m+1}$ be the $m$-th and $(m+1)$-th values. Then any $b$ in the interval $[f_m, f_{m+1}]$ is a minimizer. The determination of $f_m$ and $f_{m+1}$ can be done efficiently using a "median-finding" technique.
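
For completeness, a sketch of this duality-gap rule for choosing $b$ (the function is ours; a full sort is used here in place of the median-finding technique, and F is assumed to be a NumPy array of the $F_i$ values):

    import numpy as np

    def duality_gap_threshold(F, eps):
        # Any b in [f_m, f_{m+1}] minimizes sum_i max(0, F_i - b - eps, -F_i + b - eps),
        # where f_m and f_{m+1} are the m-th and (m+1)-th smallest of the 2m values
        # {F_i - eps} u {F_i + eps}.  The midpoint of that interval is returned.
        m = len(F)
        vals = np.sort(np.concatenate([F - eps, F + eps]))
        f_m, f_m1 = vals[m - 1], vals[m]     # m-th and (m+1)-th values (1-based)
        return 0.5 * (f_m + f_m1)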

Since all $F_i$ are not typically available at a given stage of the algorithm, it is appropriate to apply the above idea to that subset of indices for which $F_i$ are available. This set is nothing but $I_0$. We implemented this idea and tested it on some benchmark problems, but it did not fare well. See section 6 for the performance of this algorithm.

5 Modifications of the SMO Algorithm

In this section, we suggest two modified versions of the SMO algorithm for regression, each of which overcomes the problems mentioned in the last section. As we will see in the computational evaluation of section 6, these modifications are always better than the original SMO algorithm for regression and, in most situations, they also give quite a remarkable improvement in efficiency. In short, the modifications avoid the use of a single threshold value $b$ and the use of (8) for checking optimality. Instead, two threshold parameters, $b_{up}$ and $b_{low}$, are maintained and (6) (or (7)) is employed for checking optimality. Assuming that the reader is familiar with [7] and the pseudo-codes for SMO given there, we only give a set of pointers that describe the changes that are made to Smola and Scholkopf's SMO algorithm for regression. Pseudo-codes that fully describe these can be found in [6].

1. Suppose, at any instant, $F_i$ is available for all $i$. Let $i_{low}$ and $i_{up}$ be indices such that

$\tilde{F}_{i_{low}} = b_{low} = \max\{\tilde{F}_i : i \in I_0 \cup I_1 \cup I_2\}$     (9a)

and

$\hat{F}_{i_{up}} = b_{up} = \min\{\hat{F}_i : i \in I_0 \cup I_1 \cup I_3\}$     (9b)

Then checking a particular $i$ for optimality is easy. For example, suppose $i \in I_3$. We only have to check if $\hat{F}_i < b_{low} - 2\tau$. If this condition holds, then there is a violation and in that case SMO's takeStep procedure can be applied to the index pair $(i, i_{low})$. Similar steps can be given for indices in the other sets. Thus, in our approach, the checking of optimality of the first index, $i_2$, and the choice of the second index, $i_1$, go hand in hand, unlike in the original SMO algorithm. As we will see below, we compute and use $(i_{low}, b_{low})$ and $(i_{up}, b_{up})$ via an efficient updating process.

2. To be efficient, we would, as in the SMO algorithm, spend much of the effort altering $\alpha_i$, $\alpha_i^*$, $i \in I_0$; caches for $F_i$, $i \in I_0$, are maintained and updated to do this efficiently. Only when optimality holds for all $i \in I_0$ are all indices examined for optimality.

3. The procedure takeStep is modified. After a successful step using a pair of indices $(i_2, i_1)$, let $\hat{I} = I_0 \cup \{i_1, i_2\}$. We compute, partially, $(i_{low}, b_{low})$ and $(i_{up}, b_{up})$ using $\hat{I}$ only (i.e., use only $i \in \hat{I}$ in (9)). Note that these extra steps are inexpensive because the cache for $\{F_i, i \in I_0\}$ is available and updates of $F_{i_1}$, $F_{i_2}$ are easily done. A careful look shows that, since $i_2$ and $i_1$ have just been involved in a successful step, each of the two sets, $\hat{I} \cap (I_0 \cup I_1 \cup I_2)$ and $\hat{I} \cap (I_0 \cup I_1 \cup I_3)$, is non-empty; hence the partially computed $(i_{low}, b_{low})$ and $(i_{up}, b_{up})$ will not be null elements. Since $i_{low}$ and $i_{up}$ could take values from $\{i_2, i_1\}$ and they are used as choices for $i_1$ in the subsequent step (see item 1 above), we keep the values of $F_{i_1}$ and $F_{i_2}$ also in cache.

4. When working with only $\alpha_i$, $\alpha_i^*$, $i \in I_0$, i.e., a loop with examineAll = 0, one should note that, if (6) holds at some point, then it implies that optimality holds as far as $I_0$ is concerned. (This is because, as mentioned in item 3 above, the choices of $b_{low}$ and $b_{up}$ are influenced by all indices in $I_0$.) This gives an easy way of exiting this loop.

5. There are two ways of implementing the loop involving indices in $I_0$ only (examineAll = 0).

Method 1. This is similar to what is done in SMO. Loop through all $i_2 \in I_0$. For each $i_2$, check optimality and, if violated, choose $i_1$ appropriately. For example, if $\hat{F}_{i_2} < b_{low} - 2\tau$ then there is a violation and in that case choose $i_1 = i_{low}$.

Method 2. Always work with the worst violating pair, i.e., choose $i_2 = i_{low}$ and $i_1 = i_{up}$.

Depending on which of these methods is used, we call the resulting overall modification of SMO either SMO-Modification 1 or SMO-Modification 2. SMO and SMO-Modification 1 are identical except in the way the bias is maintained and optimality is tested. On the other hand, SMO-Modification 2 can be thought of as a further improvement of SMO-Modification 1 where the cache is effectively used to choose the violating pair when examineAll = 0.

6. When optimality on $I_0$ holds, as already said, we come back to check optimality on all indices (examineAll = 1). Here we loop through all indices, one by one. Since $(b_{low}, i_{low})$ and $(b_{up}, i_{up})$ have been partially computed using $I_0$ only, we update these quantities as each $i$ is examined. For a given $i$, $F_i$ is computed first and optimality is checked using the current $(b_{low}, i_{low})$ and $(b_{up}, i_{up})$; if there is no violation, $F_i$ is used to update these quantities. For example, if $i \in I_3$ and $\hat{F}_i < b_{low} - 2\tau$, then there is a violation, in which case we take a step using $(i, i_{low})$. On the other hand, if there is no violation, then $(i_{up}, b_{up})$ is modified using $\hat{F}_i$, i.e., if $\hat{F}_i < b_{up}$ then we do: $i_{up} := i$ and $b_{up} := \hat{F}_i$.

7. Suppose we do as described above. What happens if there is no violation for any $i$ in a loop having examineAll = 1? Can we conclude that optimality holds for all $i$? The answer is: YES. This is easy to see from the following argument. Suppose, by contradiction, there does exist one $(i, j)$ pair such that they define a violation, i.e., they satisfy (7). Let us say $i < j$. Then $j$ would not have satisfied the optimality check in the above described implementation, because either $\hat{F}_i$ or $\tilde{F}_i$ would have, before $j$ is seen, affected the $b_{up}$ and/or $b_{low}$ settings. In other words, even if $i$ is mistakenly taken as having satisfied optimality earlier in the loop, $j$ will be detected as violating optimality when it is analysed. Only when (6) holds is it possible for all indices to satisfy the optimality checks. Furthermore, when (6) holds and the loop over all indices has been completed, the true values of $b_{up}$ and $b_{low}$, as defined in (3), would have been computed, since all indices have been encountered. As a final choice of $b$ (for later use in doing inference) it is appropriate to set $b = 0.5(b_{up} + b_{low})$.

6 Computational Comparison

In this section we compare the performance of our modifications against Smola and Scholkopf's SMO algorithm for regression on three datasets. We implemented all these methods in C and ran them using gcc on a P3 450 MHz Linux machine. The value $\tau = 0.001$ was used for all experiments.

The first dataset is a toy dataset where the function to be approximated is the cubic polynomial $0.2x^3 + 0.5x^2 - x$. The domain of this function was fixed to $[-1, 1]$. Gaussian noise of mean zero and variance 1 was added to the training set outputs. A hundred training samples were chosen randomly. The performance of the four algorithms with the polynomial kernel $k(x_i, x_j) = (1 + x_i \cdot x_j)^p$, where $p$ was chosen to be 3, is shown in Fig. 1.

The second dataset is the Boston housing dataset, which is a standard benchmark for testing regression algorithms. This dataset is available at the UCI Repository [1]. The dimension of the input is 13. We used a training set of size 406. Gaussian noise of mean zero and standard deviation 6 was added to the training data. $\epsilon = 0.56$ was used in this case. Fig. 2 shows the performance of the four algorithms on this dataset. For this as well as the third dataset the Gaussian kernel $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$ was used and the value of $\sigma^2$ employed was 5.0.
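
The two kernels used in the experiments are straightforward to write down. A vectorized sketch (the helper names are ours; the parameter values shown are the ones quoted above):

    import numpy as np

    def polynomial_kernel(X1, X2, p=3):
        # k(x_i, x_j) = (1 + x_i . x_j)^p, used for the toy dataset.
        return (1.0 + X1 @ X2.T) ** p

    def gaussian_kernel(X1, X2, sigma2=5.0):
        # k(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2),
        # used for the Boston housing and Comp-Activ datasets.
        sq = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
              - 2.0 * X1 @ X2.T)
        return np.exp(-sq / sigma2)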

25 2 "smo" "smo_dualty_gap" "smo_mod_1" "smo_mod_2" CPU Tme(s) 15 1 5 5 1 15 2 25 3 35 4 45 5 C Fgure 1: Toy data: CPU Tme (n seconds) shown as a functon of C. 6 5 4 "smo" "smo_dualty_gap" "smo_mod_1" "smo_mod_2" CPU Tme(s) 3 2 1 5 1 15 2 25 3 35 4 45 5 C Fgure 2: Boston Housng data: CPU Tme (n seconds) shown as a functon of C. was used and the value employed was 5:. The thrd dataset, Comp-Actv, s avalable at the Delve webste[3]. Ths dataset contans 8192 data ponts of whch we used 5764. We mplemented the \cpusmall" prototask, whch nvolves usng 12 attrbutes to predct the fracton of tme (n percentage) the CPU runs n user mode. Gaussan nose of mean zero and standard devaton 1 was added to ths tranng set. We used = :48 for ths dataset. The performance of the four algorthms on ths dataset s shown n Fg. 3. It s very clear that both modcatons outperform the orgnal SMO algorthm. In many stuatons the mprovement n ecency s remarkable. In partcular, at large values of C the mprovement s by an order of magntude. Between the two modcatons, t s dcult to say 12

We have not reported a comparison of the generalization abilities of the three methods, since all three methods apply to the same problem formulation, are terminated at the same training set accuracy, and hence give very close generalization performance.

7 Conclusion

In this paper we have pointed out an important source of inefficiency in Smola and Scholkopf's SMO algorithm that is caused by its operation with a single threshold value. We have suggested two modifications of the SMO algorithm that overcome the problem by efficiently maintaining and updating two threshold parameters. Our computational experiments show that these modifications speed up the SMO algorithm significantly in most situations.

References

[1] C.L. Blake and C.J. Merz, UCI repository of machine learning databases, University of California, Department of Information and Computer Science, Irvine, CA, USA, 1998. See: http://www.ics.uci.edu/~mlearn/MLRepository.html

[2] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2(2), 1998.

[3] Delve: Data for evaluating learning in valid experiments. See: http://www.cs.utoronto.ca/~delve

[4] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya and K.R.K. Murthy, Improvements to Platt's SMO algorithm for SVM classifier design, Technical Report CD-99-14, Control Division, Dept. of Mechanical and Production Engineering, National University of Singapore, Singapore, August 1999. See: http://guppy.mpe.nus.edu.sg/~mpessk

[5] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in B. Scholkopf, C. Burges and A. Smola (Eds.), Advances in Kernel Methods: Support Vector Machines, MIT Press, Cambridge, MA, December 1998.

[6] S.K. Shevade, S.S. Keerthi, C. Bhattacharyya and K.R.K. Murthy, Improvements to SMO algorithm for regression, Technical Report CD-99-16, Control Division, Dept. of Mechanical and Production Engineering, National University of Singapore, Singapore, August 1999. See: http://guppy.mpe.nus.edu.sg/~mpessk

[7] A.J. Smola, Learning with Kernels, PhD Thesis, GMD, Birlinghoven, Germany, 1998.

[8] A.J. Smola and B. Scholkopf, A tutorial on support vector regression, NeuroCOLT Technical Report TR 1998-030, Royal Holloway College, London, UK, 1998.

[9] V. Vapnik, The Nature of Statistical Learning Theory, Springer, NY, USA, 1995.