Combine the PA Algorithm with a Proximal Classifier

Combine the Passive and Aggressive Algorithm with a Proximal Classifier
Yuh-Jye Lee (joint work with Y.-C. Tseng)
Dept. of Computer Science & Information Engineering, TaiwanTech
Dept. of Statistics, NCKU
Sept. 29, 2011

Outline
1. Machine Learning vs. Optimization
2. Perceptron Algorithm; Passive and Aggressive (PA) Algorithm

Supervised Learning Problems
Assumption: training instances are drawn independently from an unknown but fixed probability distribution P(x, y).
Our learning task: given a training set T = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, construct a rule f(x) that correctly predicts the label y for an unseen x.
If f(x) ≠ y we incur some loss or penalty, for example l(f(x), y) = (1/2)|f(x) − y|.
Learning examples: classification, regression, and sequence labeling.
If y is drawn from a finite set, it is a classification problem; the simplest case, y ∈ {−1, +1}, is called the binary classification problem.
If y is a real number, it is a regression problem.
More generally, y can be a vector whose elements are drawn from a finite set; this is the sequence labeling problem.

Binary Classification Problem
Given a training set T = {(x_j, y_j) | x_j ∈ R^d, y_j ∈ {−1, 1}, j = 1, ..., m}.
Main goal: x_j ∈ P means y_j = 1 and x_j ∈ N means y_j = −1; predict the unseen class label for new data.
Find a function f : R^d → R by learning from the data such that f(x) ≥ 0 ⇒ x ∈ P and f(x) < 0 ⇒ x ∈ N.
The simplest such function is linear: f(x) = ⟨w, x⟩ + b.
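As a concrete illustration, here is a minimal Python sketch of such a linear decision rule; the weight vector, bias, and toy input below are made up for the example and are not part of the talk.

```python
import numpy as np

def predict(w, b, x):
    """Return +1 if f(x) = <w, x> + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Toy usage with made-up parameters.
w = np.array([0.5, -1.0])
b = 0.2
print(predict(w, b, np.array([1.0, 0.3])))   # f(x) = 0.4, so the prediction is +1
```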

Expected Risk vs. Empirical Risk
Assumption: training instances are drawn independently from an unknown but fixed probability distribution P(x, y).
Ideally, we would like the optimal rule f* that minimizes the expected risk E(f) = ∫ l(f(x), y) dP(x, y) over all functions.
Unfortunately, we cannot do this: P(x, y) is unknown, and we have to restrict ourselves to a certain hypothesis space F.
How about computing f_m ∈ F that minimizes the empirical risk E_m(f) = (1/m) Σ_j l(f(x_j), y_j)?
Only minimizing the empirical risk runs the danger of overfitting.

Approximation Optimization Approach
Most learning algorithms can be formulated as an optimization problem.
The objective function consists of two parts: E_m(f) plus a term that controls the VC-error bound.
Controlling the VC-error bound avoids the risk of overfitting; this can be achieved by adding a regularization term to the objective function.
Note that we have already made many approximations when formulating a learning task as an optimization problem.
Why bother to find the optimal solution of that problem? One could stop the optimization iterations before convergence.

Gradient Descent: Batch Learning
For the optimization problem min_w f(w) = min_w r(w) + (1/m) Σ_{j=1}^m l(w; (x_j, y_j)),
GD tries to find a direction and a learning rate that decrease the objective function value:
w^{t+1} = w^t − η ∇f(w^t)
where η is the learning rate and −∇f(w^t) is the steepest descent direction, with
∇f(w^t) = ∇r(w^t) + (1/m) Σ_{j=1}^m ∇l(w^t; (x_j, y_j)).
When m is large, computing Σ_{j=1}^m ∇l(w^t; (x_j, y_j)) may take a long time.

Stochastic Gradient Descent: Online Learning
In GD, we compute the gradient using the entire training set.
In stochastic gradient descent (SGD), we use the gradient of l(w^t; (x_t, y_t)) instead of (1/m) Σ_{j=1}^m ∇l(w^t; (x_j, y_j)), so that
∇f(w^t) = ∇r(w^t) + ∇l(w^t; (x_t, y_t)).
SGD computes the gradient using only one instance. In experiments, SGD is significantly faster than GD when m is large.
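To make the contrast concrete, here is a small Python sketch of one batch GD step versus one SGD step; the regularizer, squared loss, and random data are illustrative choices of mine, not the talk's exact setup.

```python
import numpy as np

# r(w) = 0.5 * lam * ||w||^2, l(w; (x, y)) = 0.5 * (<w, x> - y)^2  (illustrative)
def batch_gradient(w, X, y, lam):
    residual = X @ w - y                          # shape (m,)
    return lam * w + X.T @ residual / len(y)      # grad r + average grad l over all m

def stochastic_gradient(w, x_t, y_t, lam):
    return lam * w + (np.dot(w, x_t) - y_t) * x_t  # grad on a single instance

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w, eta, lam = np.zeros(5), 0.1, 0.01

w_gd = w - eta * batch_gradient(w, X, y, lam)              # uses all m instances
w_sgd = w - eta * stochastic_gradient(w, X[0], y[0], lam)  # uses one instance
```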

Large-Scale Problems
Collecting and distributing large-scale datasets are no longer expensive tasks.
Two definitions of large-scale problems:
It consists of problems where the main computational constraint is the amount of time available, rather than the number of instances [Bottou, 2008].
The training set may not fit in a modern computer's memory [Langford, 2008].
We need learning algorithms that scale linearly with the size of the dataset.
The performance of such algorithms should be better than processing a random subset of the data with a conventional learning algorithm.

Online Learning
Definition of online learning: given a set of new training data, an online learner can update its model without re-reading the old data while still improving its performance.
In contrast, an offline learner must combine the old and new data and restart the learning from scratch, otherwise its performance will suffer.
Online learning is considered a solution for large-scale learning tasks.
It usually requires several passes (or epochs) through the training instances.
We need to keep all instances unless we run the algorithm for only a single pass.

Perceptron Algorithm [Rosenblatt, 1956]
An online learning algorithm and a mistake-driven procedure: the current classifier is updated whenever a newly arriving instance is misclassified.
Initialization: k = 0, R = max_{1≤j≤m} ||x_j||_2
repeat
  for t = 1 : m
    if y_t(⟨w^k, x_t⟩ + b^k) ≤ 0
      w^{k+1} = w^k + η y_t x_t
      b^{k+1} = b^k + η y_t R²
      k = k + 1
    end
  end
until no mistake is made within the for-loop
Here k is the number of mistakes and η > 0 is the learning rate.
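A minimal Python sketch of the update above, following the slide's mistake test and the R² factor in the bias update; the function name, default learning rate, and epoch cap are my own choices.

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Primal Perceptron with bias; returns (w, b, number of mistakes k)."""
    m, d = X.shape
    w, b, k = np.zeros(d), 0.0, 0
    R2 = np.max(np.linalg.norm(X, axis=1)) ** 2    # R^2 from the slide
    for _ in range(max_epochs):
        mistakes = 0
        for t in range(m):
            if y[t] * (np.dot(w, X[t]) + b) <= 0:  # misclassified (or on the boundary)
                w += eta * y[t] * X[t]
                b += eta * y[t] * R2
                k += 1
                mistakes += 1
        if mistakes == 0:                          # converged (separable case)
            break
    return w, b, k
```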

Perceptron Algorithm [Rosenblatt, 1956] (cont.)
The Perceptron can be viewed as an SGD method. The underlying optimization problem of the algorithm is
min_{(w,b) ∈ R^{d+1}} Σ_{j=1}^m ( −y_j(⟨w, x_j⟩ + b) )_+
In the linearly separable case, the Perceptron algorithm terminates in finitely many steps no matter what learning rate is chosen.
In the nonseparable case, it is very difficult to decide on a learning rate that will lead to the fewest mistakes.
The learning rate can be any nonnegative number; more generally, it can be a positive definite matrix.

Key Idea of the PA Algorithm [Crammer, 2006]
An online learning algorithm and a mistake-driven procedure.
The PA algorithm suggests that the new classifier should not only classify the newly arriving instance correctly but also stay as close to the current classifier as possible.
The problem can be formulated as follows:
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w − w^t||_2²   (1)
s.t. l(w; (x_t, y_t)) = 0
where l(w; (x_t, y_t)) is the hinge loss function.

Simplify the PA Algorithm
We can rewrite Eq. (1) as
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w||_2² − ⟨w, w^t⟩   (2)
s.t. l(w; (x_t, y_t)) = 0
In Eq. (2), the PA algorithm implicitly minimizes the regularization term (1/2)||w||_2².
Minimizing −⟨w, w^t⟩ (i.e., maximizing ⟨w, w^t⟩) expresses that the updated classifier must be similar to the current classifier.

Notes on the PA Algorithm
If (x_t, y_t) is correctly classified, then w^t is already the solution of the optimization problem; this is called the passive phase.
If (x_t, y_t) is misclassified, then w^t is updated so that (x_t, y_t) is correctly classified; this is called the aggressive phase.
We may treat the classifier learned by the PA algorithm as a hard-margin classifier.
However, if an instance comes with noise, the current classifier may be pulled severely in the wrong direction.

PA-1 and PA-2
Based on the concept of the soft-margin classifier, a non-negative slack variable ξ is introduced into the PA algorithm in two different ways.
Linear scale in ξ (called PA-1):
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w − w^t||_2² + Cξ
s.t. l(w; (x_t, y_t)) ≤ ξ and ξ ≥ 0
Square scale in ξ (called PA-2):
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w − w^t||_2² + Cξ²
s.t. l(w; (x_t, y_t)) ≤ ξ

PA Algorithm: Closed-Form Updates
It seems that we would have to solve an optimization problem for each instance. Fortunately, PA, PA-1 and PA-2 come with closed-form update schemes.
They share the same closed form
w^{t+1} = w^t + τ_t y_t x_t
where τ_t > 0 is defined as
τ_t = l(w^t; (x_t, y_t)) / ||x_t||_2²                      (PA)
τ_t = min{ C, l(w^t; (x_t, y_t)) / ||x_t||_2² }            (PA-1)
τ_t = l(w^t; (x_t, y_t)) / ( ||x_t||_2² + 1/(2C) )         (PA-2)
This replaces the fixed learning rate of the Perceptron algorithm with a dynamic learning rate that depends on the current instance.
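The closed-form updates translate directly into code. Below is a small Python sketch of a single PA-family update with the hinge loss; the function and parameter names are mine.

```python
import numpy as np

def pa_update(w, x, y, variant="PA-1", C=1.0):
    """One PA / PA-1 / PA-2 update with hinge loss l = max(0, 1 - y * <w, x>)."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:                      # passive phase: no update needed
        return w
    sq_norm = np.dot(x, x)
    if variant == "PA":
        tau = loss / sq_norm
    elif variant == "PA-1":
        tau = min(C, loss / sq_norm)
    else:                                # "PA-2"
        tau = loss / (sq_norm + 1.0 / (2.0 * C))
    return w + tau * y * x               # aggressive phase
```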

Motivation for a Proximal Classifier
In the online learning framework, we do not keep previous instances in memory.
This lack of memory of previous instances might hurt learning efficiency: instances that were already classified correctly may be misclassified again.
We therefore keep some simple statistical information (mean and variance) to summarize the previous instances.
We have to take the cost into account: a fixed, small memory footprint and linear CPU time.

Illustrations of Simple Proximal Models
If the dataset is easy to separate, the difference of the class means suggests a good proximal classifier (left figure).
For a more complicated dataset, the LDA direction performs better (right figure).
The proximal classifier provides a good suggestion for updating our classifier.

Proximal Classifier: Quasi-LDA
However, computing the LDA direction is expensive when the input space is high dimensional.
We propose the quasi-LDA direction as our proximal classifier:
(w_p^t)_i = ( (m_+^t)_i − (m_−^t)_i ) / ( (s_+^t)_i + (s_−^t)_i ),  i = 1, 2, ..., d
where
w_p^t : the proximal classifier on round t
m_+^t / m_−^t : mean vector of the positive/negative class on round t
s_+^t / s_−^t : variance vector of the positive/negative class on round t
It is very cheap in both CPU time and memory usage.

The Details of the Proximal Classifier
Small trick: Var(X) = E((X − E(X))²) = E(X²) − (E(X))²
The statistical information we maintain:
P_t = { j | y_j positive, j = 1, 2, ..., t },  N_t = { j | y_j negative, j = 1, 2, ..., t }
(m_+^t)_i = (1/|P_t|) Σ_{j ∈ P_t} (x_j)_i,  i = 1, 2, ..., d
(m_−^t)_i = (1/|N_t|) Σ_{j ∈ N_t} (x_j)_i,  i = 1, 2, ..., d
(s_+^t)_i = (1/(|P_t| − 1)) Σ_{j ∈ P_t} (x_j)_i² − (|P_t|/(|P_t| − 1)) ((m_+^t)_i)²,  i = 1, 2, ..., d
(s_−^t)_i = (1/(|N_t| − 1)) Σ_{j ∈ N_t} (x_j)_i² − (|N_t|/(|N_t| − 1)) ((m_−^t)_i)²,  i = 1, 2, ..., d
All we need to maintain in online fashion are (m_+^t)_i, (m_−^t)_i, Σ_{j ∈ P_t} (x_j)_i² and Σ_{j ∈ N_t} (x_j)_i² for each attribute.
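A small Python sketch of this bookkeeping: per-class running sums of x and x², from which the means, unbiased variances, and the quasi-LDA direction are recovered. The class and function names, and the eps guard against a zero denominator, are my additions.

```python
import numpy as np

class ClassStats:
    """Running per-class statistics: count, sum of x, sum of x^2."""
    def __init__(self, d):
        self.n, self.s1, self.s2 = 0, np.zeros(d), np.zeros(d)

    def update(self, x):
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def mean(self):
        return self.s1 / self.n

    def var(self):
        # Unbiased variance via the E[X^2] - E[X]^2 trick; needs n >= 2.
        m = self.mean()
        return self.s2 / (self.n - 1) - self.n / (self.n - 1) * m * m

def quasi_lda_direction(pos, neg, eps=1e-12):
    # (w_p)_i = (m+_i - m-_i) / (s+_i + s-_i)
    return (pos.mean() - neg.mean()) / (pos.var() + neg.var() + eps)
```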

Key Idea of Our Proposed Method
How do we combine the PA algorithm with the proximal classifier?
Recall that in the PA algorithm, ⟨w, w^t⟩ expresses the similarity between w and w^t.
Therefore, we formulate our idea by adding γ⟨w, w_p^t⟩ to Eq. (2); we call the result PAm:
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w||_2² − ⟨w, w^t⟩ − γ⟨w, w_p^t⟩
s.t. l(w; (x_t, y_t)) = 0
Minimizing −⟨w, w^t⟩ − γ⟨w, w_p^t⟩ keeps the new classifier w close to both w^t and w_p^t.
The term γ⟨w, w_p^t⟩ can also be introduced into PA-1 and PA-2, giving PAm-1 and PAm-2, respectively.

Closed-Form Update Rule of the PAm Algorithm
We derive closed-form update rules for PAm, PAm-1 and PAm-2; they also share the same closed form
w^{t+1} = w^t + γ w_p^t + α_t y_t x_t
where α_t > 0 is defined as
α_t = ( l(w^t; (x_t, y_t)) − γ y_t ⟨w_p^t, x_t⟩ ) / ||x_t||_2²                    (PAm)
α_t = min{ C, ( l(w^t; (x_t, y_t)) − γ y_t ⟨w_p^t, x_t⟩ ) / ||x_t||_2² }          (PAm-1)
α_t = ( l(w^t; (x_t, y_t)) − γ y_t ⟨w_p^t, x_t⟩ ) / ( ||x_t||_2² + 1/(2C) )       (PAm-2)

PAm Algorithm (Single Pass)
Given a training set T = {(x_t, y_t) | x_t ∈ R^d, y_t ∈ {−1, 1}, t = 1, 2, ..., m}
for t = 1, 2, ..., m
  Update the statistical information
  if 1 − y_t⟨w^t, x_t⟩ > 0
    Obtain the proximal classifier w_p^t
    Compute the learning rate α_t
    w^{t+1} = w^t + γ w_p^t + α_t y_t x_t
  end
end
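Putting the pieces together, here is a single-pass PAm sketch in Python using the PAm-1 learning rate. The helper names, the eps term, and the guards for an empty class or a negative α_t are my own additions, not part of the original algorithm.

```python
import numpy as np

def pam_single_pass(X, y, C=1.0, gamma=0.1, eps=1e-12):
    """One pass of PAm with the PAm-1 rate; y entries are in {-1, +1}."""
    m, d = X.shape
    w = np.zeros(d)
    n_pos = n_neg = 0
    sum_pos, sumsq_pos = np.zeros(d), np.zeros(d)
    sum_neg, sumsq_neg = np.zeros(d), np.zeros(d)
    for t in range(m):
        x, label = X[t], y[t]
        # Update the per-class running statistics.
        if label > 0:
            n_pos += 1; sum_pos += x; sumsq_pos += x * x
        else:
            n_neg += 1; sum_neg += x; sumsq_neg += x * x
        loss = max(0.0, 1.0 - label * np.dot(w, x))
        if loss > 0.0 and n_pos > 1 and n_neg > 1:   # aggressive phase
            m_pos, m_neg = sum_pos / n_pos, sum_neg / n_neg
            s_pos = sumsq_pos / (n_pos - 1) - n_pos / (n_pos - 1) * m_pos ** 2
            s_neg = sumsq_neg / (n_neg - 1) - n_neg / (n_neg - 1) * m_neg ** 2
            w_p = (m_pos - m_neg) / (s_pos + s_neg + eps)   # quasi-LDA direction
            alpha = min(C, (loss - gamma * label * np.dot(w_p, x)) / np.dot(x, x))
            alpha = max(alpha, 0.0)                  # the slides assume alpha_t > 0
            w = w + gamma * w_p + alpha * label * x
    return w
```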

PAm Algorithm (cont.)
After a single pass, we no longer update the statistical information, and the update scheme becomes very similar to that of the PA algorithm.
The order of the input data has less effect on the final classifier after a single pass.
In our experiments, we show that the classifiers from two consecutive passes are very similar when the PAm algorithm is applied.

Extension of PAm to a Sliding Window of Previous Classifiers
In each update, the PA algorithm considers only one previous classifier. We suggest that taking more than one previous classifier into account will be more robust.
We maintain a queue that saves the previous k classifiers, where k is small, and formulate the problem as
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w||_2² − ⟨w, w̄^t⟩ − γ⟨w, w_p^t⟩ + Cξ
s.t. l(w; (x_t, y_t)) ≤ ξ and ξ ≥ 0
where w̄^t is the average of the k previous classifiers on round t.
We call this variant PAmQ-1.
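A small Python sketch of the sliding-window bookkeeping that PAmQ-1 relies on: keep the last k classifiers in a queue and use their average as w̄^t. The class name and the choice of k are illustrative.

```python
from collections import deque
import numpy as np

class ClassifierWindow:
    """Keeps the most recent k classifiers and returns their average."""
    def __init__(self, k):
        self.buf = deque(maxlen=k)

    def push(self, w):
        self.buf.append(np.copy(w))

    def average(self):
        return np.mean(list(self.buf), axis=0)   # \bar{w}_t over the stored classifiers

window = ClassifierWindow(k=5)
window.push(np.array([0.1, -0.2]))
window.push(np.array([0.3,  0.0]))
w_bar = window.average()                         # array([0.2, -0.1])
```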

Global and Local Information
We can also derive the formulations and closed-form update rules for PAmQ and PAmQ-2; all we need to do is replace w^t with w̄^t.
The PAm algorithm takes both global and local information into account within the online learning framework.
The statistical information we keep can be treated as the global information.
The previous classifier w^t (or w̄^t) can be treated as the local information.

Numerical Experiments: Setting
In the first part of our experiments, we investigate the effect of the input order of the instances.
We run a single epoch of the PA algorithm and the PAm algorithm 10 times, each time with a different input order.
In the second part of our experiments, we observe the convergence behavior of our proposed methods.
How good can the classifier be if we run only a single pass of the PAm family of algorithms?

Benchmark Datasets Used in the Numerical Experiments
Sources of the benchmark datasets: 3 medium-size datasets from the LIBSVM library site and 3 large-scale datasets from the Pascal Large Scale Learning Challenge in 2008.

Dataset      Training   Testing   Features
Svmguide1       3089       4000          4
Ijcnn          49990      91701         22
Cod-rna        59535     271617          8
Alpha         250000     250000        500
Delta         250000     250000        500
Gamma         250000     250000        500

Comparison in Training (Single Pass)

Training Error Rate and Standard Deviation (%)
Dataset     PA          PAm        PA-1       PAm-1       PA-2       PAm-2      PAmQ-1      LDA
Svmguide1   7.5(3.2)    5.0(1.0)   6.9(3.0)   4.9(0.3)    6.4(1.1)   4.9(0.3)   5.0(0.4)    12.2
Ijcnn       10.1(2.2)   9.5(1.0)   8.9(0.4)   8.5(0.1)    9.0(0.3)   9.0(0.1)   8.7(0.2)    13.9
Cod-rna     10.5(10.5)  7.7(1.7)   6.8(0.5)   6.3(0.1)    7.0(0.5)   6.9(0.6)   6.2(0.1)    6.4
Alpha       31.7(3.9)   26.9(1.7)  24.3(1.4)  26.3(0.7)   24.2(1.8)  26.1(0.6)  26.4(0.3)   21.7
Delta       35.0(0.7)   25.2(0.3)  30.7(0.2)  21.5(0.01)  33.5(0.5)  24.4(0.2)  21.5(0.02)  21.5
Gamma       32.2(1.0)   23.8(0.4)  25.8(0.7)  19.9(0.01)  30.0(1.0)  22.3(0.2)  19.9(0.01)  19.9

Min/Max Training Error Rate (%)
Dataset     PA          PAm        PA-1       PAm-1      PA-2       PAm-2      PAmQ-1     LDA
Svmguide1   5.0/15.7    4.2/7.7    4.9/14.2   4.5/5.2    5.4/8.8    4.6/5.4    4.6/6.1    12.2
Ijcnn       8.8/15.7    8.7/12.0   8.1/9.4    8.4/8.7    8.6/9.4    8.8/9.1    8.3/8.9    13.9
Cod-rna     6.5/25.9    6.4/12.2   6.4/7.8    6.2/6.6    6.5/7.9    6.3/8.2    6.1/6.4    6.4
Alpha       27.6/39.7   24.2/29.5  22.5/26.3  25.6/27.3  22.5/28.0  25.5/27.6  25.9/27.1  21.7
Delta       33.7/36.0   24.8/25.6  30.3/31.1  21.5/21.5  32.4/34.3  24.0/24.7  21.5/21.6  21.5
Gamma       30.8/34.4   23.0/24.4  24.7/27.0  19.9/19.9  28.9/31.7  22.0/22.7  19.9/19.9  19.9

Comparison in Testing

Test Error Rate and Standard Deviation (%)
Dataset     PA          PAm        PA-1       PAm-1       PA-2       PAm-2       PAmQ-1      LDA
Svmguide1   7.4(3.2)    5.1(0.6)   7.2(4.5)   4.4(0.2)    6.6(2.0)   4.9(0.6)    4.8(0.3)    11.8
Ijcnn       9.7(1.8)    11.0(2.9)  8.8(0.3)   8.7(0.2)    8.7(0.4)   9.3(0.4)    8.9(0.2)    18.6
Cod-rna     9.2(6.9)    6.3(1.7)   5.4(0.7)   4.7(0.2)    5.8(0.7)   5.3(0.67)   4.6(0.1)    4.7
Alpha       31.7(4.9)   29.1(3.0)  23.8(1.0)  23.7(1.3)   23.6(0.7)  24.0(0.9)   23.7(0.4)   21.8
Delta       35.0(0.8)   25.3(0.3)  29.7(0.7)  21.7(0.01)  33.0(0.5)  24.5(0.2)   21.7(0.01)  21.7
Gamma       32.2(1.0)   23.9(0.4)  25.8(0.6)  20.0(0.01)  30.0(1.1)  22.3(0.2)   20.0(0.01)  20.0

Min/Max Test Error Rate (%)
Dataset     PA          PAm        PA-1       PAm-1      PA-2       PAm-2      PAmQ-1     LDA
Svmguide1   4.7/13.5    4.4/6.1    4.7/19.6   4.3/4.9    5.5/11.7   4.4/6.0    4.5/5.5    11.8
Ijcnn       8.4/13.6    8.9/18.8   8.5/9.5    8.4/9.1    8.4/9.3    8.9/10.1   8.4/9.1    18.6
Cod-rna     4.8/24.5    4.7/10.3   4.7/6.8    4.5/5.2    5.1/6.9    4.5/6.5    4.5/4.9    4.7
Alpha       26.8/42.4   25.5/34.2  22.5/25.5  22.6/27.3  22.5/24.6  22.7/25.7  23.3/24.6  21.8
Delta       33.7/36.0   24.9/25.7  28.0/30.7  21.6/21.7  32.2/33.7  24.1/24.7  21.6/21.7  21.7
Gamma       30.9/34.4   23.0/24.5  24.8/26.8  19.9/20.0  28.9/32.0  22.1/22.7  19.9/20.0  20.0

Comparison of Average Running Time over 10 Runs (seconds)
Dataset     PA      PAm     PA-1    PAm-1   PA-2    PAm-2   PAmQ-1  LDA
Svmguide1   0.06    0.05    0.07    0.06    0.09    0.07    0.08    0.01
Ijcnn       0.66    0.57    0.69    0.72    1.04    0.71    0.91    0.19
Cod-rna     1.01    1.06    1.47    1.25    1.91    1.24    1.61    0.08
Alpha       15.43   15.48   17.70   17.64   21.15   21.14   22.76   72.6
Delta       23.20   16.75   24.82   15.51   24.58   16.18   16.48   68.1
Gamma       22.64   16.21   24.26   15.03   24.39   15.73   16.09   68.7

The running times of the PA family and the PAm family are comparable; keeping the simple statistical information does not add a significant CPU-time burden.

Comparison of Average Number of Updates
Dataset     PA        PAm       PA-1      PAm-1     PA-2      PAm-2     PAmQ-1
Svmguide1   571.0     308.4     943.7     539.4     1849.1    655.2     805.8
Ijcnn       6101.8    3990.3    8474.0    6790.8    21227.0   6852.3    9866.8
Cod-rna     10415.9   8924.0    22093.6   13215.4   46350.1   12645.1   18032.7
Alpha       147157.5  147150.1  168715.0  168691.2  238931.7  229004.0  202854.3
Delta       98457.4   64608.9   120450.8  55567.6   114984.4  62475.5   57279.0
Gamma       88334.6   60291.7   109457.6  51949.1   111612.6  58205.7   52872.7

Our method performs fewer updates than the conventional PA algorithm; fewer updates means fewer aggressive phases in an epoch.

Observation of Convergence Behavior
We run 10 epochs with the same input order for each method and record the classifier at the end of each epoch.
The proximal classifier is not changed once the first epoch is finished.
We compute the cosine similarity between the classifiers of two consecutive epochs.
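For reference, the epoch-to-epoch comparison boils down to a cosine similarity between weight vectors, as in this small sketch (the argument names are placeholders).

```python
import numpy as np

def cosine_similarity(w_prev, w_curr):
    """Cosine similarity between the classifiers of two consecutive epochs."""
    return np.dot(w_prev, w_curr) / (np.linalg.norm(w_prev) * np.linalg.norm(w_curr))
```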

Test Error Rate at Each Epoch: the test error rate becomes constant after 2 to 3 epochs.

Number of Updates over the 10 Epochs: the number of updates becomes constant after 2 to 4 epochs.

Conclusion
We used the online learning framework to solve large-scale binary classification problems.
As with the PA algorithm, we derived a set of closed-form update schemes for PAm, PAm-1 and PAm-2.
Compared with the conventional PA algorithm, using the quasi-LDA direction as a proximal classifier gives:
less sensitivity to the input order,
fewer updates in a single pass, and
higher similarity between the classifiers of two consecutive epochs.
We can obtain a near-optimal classifier in a single pass.

Future Work
Explore the statistical information collected in a single pass to build a better proximal classifier.
In PAmQ, use a weighted average instead of a simple average, with the weights determined by the performance of each classifier on a validation set.
Collect multiple misclassified instances and use them to approximate the Hessian inverse in the update rule.
Extend the linear classifier to a nonlinear classifier via the reduced kernel trick.
Find an efficient update scheme for small batches of input instances, analogous to SMO vs. decomposition methods for SVMs.

References
Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20:161-168, 2008.
John Langford. Concerns about the Large Scale Learning Challenge, 2008.
K. Crammer et al. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, 2006.
K. Hiraoka et al. Convergence analysis of online linear discriminant analysis. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, 3:387-391, Como, Italy, 2000.
L. I. Kuncheva and C. O. Plumpton. Adaptive learning rate for online linear discriminant classifiers. Structural, Syntactic, and Statistical Pattern Recognition, 5342:510-519, 2008.