Combine the PA Algorithm with a Proximal Classifier

Combine the Passive and Aggressive Algorithm with a Proximal Classifier
Yuh-Jye Lee (joint work with Y.-C. Tseng)
Dept. of Computer Science & Information Engineering, TaiwanTech
Dept. of Statistics, NCKU
Sept. 29, 2011

Outline
1. Machine Learning vs. Optimization
2. Perceptron Algorithm; Passive and Aggressive (PA) Algorithm

Supervised Learning Problems
Assumption: training instances are drawn independently from an unknown but fixed probability distribution P(x, y).
Our learning task: given a training set T = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, construct a rule f(x) that correctly predicts the label y for an unseen x.
If f(x) ≠ y we incur some loss or penalty, for example l(f(x), y) = (1/2)|f(x) − y|.
Learning examples: classification, regression, and sequence labeling.
If y is drawn from a finite set, it is a classification problem; the simplest case, y ∈ {−1, +1}, is called the binary classification problem.
If y is a real number, it is a regression problem.
More generally, y can be a vector whose elements are drawn from a finite set; this is the sequence labeling problem.

Binary Classification Problem
Given a training set T = {(x_j, y_j) | x_j ∈ R^d, y_j ∈ {−1, 1}, j = 1, ..., m}.
Main goal: x_j ∈ P means y_j = 1 and x_j ∈ N means y_j = −1; predict the unseen class label for new data.
Find a function f : R^d → R by learning from the data such that f(x) ≥ 0 ⇒ x ∈ P and f(x) < 0 ⇒ x ∈ N.
The simplest such function is linear: f(x) = ⟨w, x⟩ + b.
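As a concrete illustration, here is a minimal Python sketch of such a linear decision rule; the weight vector, bias, and toy input below are made up for the example and are not part of the talk.

```python
import numpy as np

def predict(w, b, x):
    """Return +1 if f(x) = <w, x> + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Toy usage with made-up parameters.
w = np.array([0.5, -1.0])
b = 0.2
print(predict(w, b, np.array([1.0, 0.3])))   # f(x) = 0.4, so the prediction is +1
```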

Expected Risk vs. Empirical Risk
Assumption: training instances are drawn independently from an unknown but fixed probability distribution P(x, y).
Ideally, we would like the optimal rule f* that minimizes the expected risk E(f) = ∫ l(f(x), y) dP(x, y) over all functions.
Unfortunately, we cannot do this: P(x, y) is unknown, and we have to restrict ourselves to a certain hypothesis space F.
How about computing f_m ∈ F that minimizes the empirical risk E_m(f) = (1/m) Σ_j l(f(x_j), y_j)?
Only minimizing the empirical risk runs the danger of overfitting.

Approximation Optimization Approach
Most learning algorithms can be formulated as an optimization problem.
The objective function consists of two parts: E_m(f) plus a term that controls the VC-error bound.
Controlling the VC-error bound avoids the risk of overfitting; this can be achieved by adding a regularization term to the objective function.
Note that we have already made many approximations when formulating a learning task as an optimization problem.
Why bother to find the optimal solution of that problem? One could stop the optimization iterations before convergence.

Gradient Descent: Batch Learning
For the optimization problem min_w f(w) = min_w r(w) + (1/m) Σ_{j=1}^m l(w; (x_j, y_j)),
GD tries to find a direction and a learning rate that decrease the objective function value:
w^{t+1} = w^t − η ∇f(w^t)
where η is the learning rate and −∇f(w^t) is the steepest descent direction, with
∇f(w^t) = ∇r(w^t) + (1/m) Σ_{j=1}^m ∇l(w^t; (x_j, y_j)).
When m is large, computing Σ_{j=1}^m ∇l(w^t; (x_j, y_j)) may take a long time.

Stochastic Gradient Descent: Online Learning
In GD, we compute the gradient using the entire training set.
In stochastic gradient descent (SGD), we use the gradient of l(w^t; (x_t, y_t)) instead of (1/m) Σ_{j=1}^m ∇l(w^t; (x_j, y_j)), so that
∇f(w^t) = ∇r(w^t) + ∇l(w^t; (x_t, y_t)).
SGD computes the gradient using only one instance. In experiments, SGD is significantly faster than GD when m is large.
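To make the contrast concrete, here is a small Python sketch of one batch GD step versus one SGD step; the regularizer, squared loss, and random data are illustrative choices of mine, not the talk's exact setup.

```python
import numpy as np

# r(w) = 0.5 * lam * ||w||^2, l(w; (x, y)) = 0.5 * (<w, x> - y)^2  (illustrative)
def batch_gradient(w, X, y, lam):
    residual = X @ w - y                          # shape (m,)
    return lam * w + X.T @ residual / len(y)      # grad r + average grad l over all m

def stochastic_gradient(w, x_t, y_t, lam):
    return lam * w + (np.dot(w, x_t) - y_t) * x_t  # grad on a single instance

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w, eta, lam = np.zeros(5), 0.1, 0.01

w_gd = w - eta * batch_gradient(w, X, y, lam)              # uses all m instances
w_sgd = w - eta * stochastic_gradient(w, X[0], y[0], lam)  # uses one instance
```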

Large-Scale Problems
Collecting and distributing large-scale datasets are no longer expensive tasks.
Two definitions of large-scale problems:
It consists of problems where the main computational constraint is the amount of time available, rather than the number of instances [Bottou, 2008].
The training set may not fit in a modern computer's memory [Langford, 2008].
We need learning algorithms that scale linearly with the size of the dataset.
The performance of such algorithms should be better than processing a random subset of the data with a conventional learning algorithm.

Online Learning
Definition of online learning: given a set of new training data, an online learner can update its model without re-reading the old data while still improving its performance.
In contrast, an offline learner must combine the old and new data and restart the learning from scratch, otherwise its performance will suffer.
Online learning is considered a solution for large-scale learning tasks.
It usually requires several passes (or epochs) through the training instances.
We need to keep all instances unless we run the algorithm for only a single pass.

Perceptron Algorithm [Rosenblatt, 1956]
An online learning algorithm and a mistake-driven procedure: the current classifier is updated whenever a newly arriving instance is misclassified.
Initialization: k = 0, R = max_{1≤j≤m} ||x_j||_2
repeat
  for t = 1 : m
    if y_t(⟨w^k, x_t⟩ + b^k) ≤ 0
      w^{k+1} = w^k + η y_t x_t
      b^{k+1} = b^k + η y_t R²
      k = k + 1
    end
  end
until no mistake is made within the for-loop
Here k is the number of mistakes and η > 0 is the learning rate.
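A minimal Python sketch of the update above, following the slide's mistake test and the R² factor in the bias update; the function name, default learning rate, and epoch cap are my own choices.

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Primal Perceptron with bias; returns (w, b, number of mistakes k)."""
    m, d = X.shape
    w, b, k = np.zeros(d), 0.0, 0
    R2 = np.max(np.linalg.norm(X, axis=1)) ** 2    # R^2 from the slide
    for _ in range(max_epochs):
        mistakes = 0
        for t in range(m):
            if y[t] * (np.dot(w, X[t]) + b) <= 0:  # misclassified (or on the boundary)
                w += eta * y[t] * X[t]
                b += eta * y[t] * R2
                k += 1
                mistakes += 1
        if mistakes == 0:                          # converged (separable case)
            break
    return w, b, k
```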

Perceptron Algorithm [Rosenblatt, 1956] (cont.)
The Perceptron can be viewed as an SGD method. The underlying optimization problem of the algorithm is
min_{(w,b) ∈ R^{d+1}} Σ_{j=1}^m ( −y_j(⟨w, x_j⟩ + b) )_+
In the linearly separable case, the Perceptron algorithm terminates in finitely many steps no matter what learning rate is chosen.
In the nonseparable case, it is very difficult to decide on a learning rate that will lead to the fewest mistakes.
The learning rate can be any nonnegative number; more generally, it can be a positive definite matrix.

Key Idea of the PA Algorithm [Crammer, 2006]
An online learning algorithm and a mistake-driven procedure.
The PA algorithm suggests that the new classifier should not only classify the newly arriving instance correctly but also stay as close to the current classifier as possible.
The problem can be formulated as follows:
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w − w^t||_2²   (1)
s.t. l(w; (x_t, y_t)) = 0
where l(w; (x_t, y_t)) is the hinge loss function.

Simplify the PA Algorithm
We can rewrite Eq. (1) as
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w||_2² − ⟨w, w^t⟩   (2)
s.t. l(w; (x_t, y_t)) = 0
In Eq. (2), the PA algorithm implicitly minimizes the regularization term (1/2)||w||_2².
Minimizing −⟨w, w^t⟩ (i.e., maximizing ⟨w, w^t⟩) expresses that the updated classifier must be similar to the current classifier.

Notes on the PA Algorithm
If (x_t, y_t) is correctly classified, then w^t is already the solution of the optimization problem; this is called the passive phase.
If (x_t, y_t) is misclassified, then w^t is updated so that (x_t, y_t) is correctly classified; this is called the aggressive phase.
We may treat the classifier learned by the PA algorithm as a hard-margin classifier.
However, if an instance comes with noise, the current classifier may be pulled severely in the wrong direction.

PA-1 and PA-2
Based on the concept of the soft-margin classifier, a non-negative slack variable ξ is introduced into the PA algorithm in two different ways.
Linear scale in ξ (called PA-1):
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w − w^t||_2² + Cξ
s.t. l(w; (x_t, y_t)) ≤ ξ and ξ ≥ 0
Square scale in ξ (called PA-2):
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w − w^t||_2² + Cξ²
s.t. l(w; (x_t, y_t)) ≤ ξ

PA Algorithm: Closed-Form Updates
It seems that we would have to solve an optimization problem for each instance. Fortunately, PA, PA-1 and PA-2 come with closed-form update schemes.
They share the same closed form
w^{t+1} = w^t + τ_t y_t x_t
where τ_t > 0 is defined as
τ_t = l(w^t; (x_t, y_t)) / ||x_t||_2²                      (PA)
τ_t = min{ C, l(w^t; (x_t, y_t)) / ||x_t||_2² }            (PA-1)
τ_t = l(w^t; (x_t, y_t)) / ( ||x_t||_2² + 1/(2C) )         (PA-2)
This replaces the fixed learning rate of the Perceptron algorithm with a dynamic learning rate that depends on the current instance.
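The closed-form updates translate directly into code. Below is a small Python sketch of a single PA-family update with the hinge loss; the function and parameter names are mine.

```python
import numpy as np

def pa_update(w, x, y, variant="PA-1", C=1.0):
    """One PA / PA-1 / PA-2 update with hinge loss l = max(0, 1 - y * <w, x>)."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:                      # passive phase: no update needed
        return w
    sq_norm = np.dot(x, x)
    if variant == "PA":
        tau = loss / sq_norm
    elif variant == "PA-1":
        tau = min(C, loss / sq_norm)
    else:                                # "PA-2"
        tau = loss / (sq_norm + 1.0 / (2.0 * C))
    return w + tau * y * x               # aggressive phase
```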

Motivation for a Proximal Classifier
In the online learning framework, we do not keep previous instances in memory.
This lack of memory of previous instances might hurt learning efficiency: instances that were already classified correctly may be misclassified again.
We therefore keep some simple statistical information (mean and variance) to summarize the previous instances.
We have to take the cost into account: a fixed, small memory footprint and linear CPU time.

Illustrations of Simple Proximal Models
If the dataset is easy to separate, the difference of the class means suggests a good proximal classifier (left figure).
For a more complicated dataset, the LDA direction performs better (right figure).
The proximal classifier provides a good suggestion for updating our classifier.

Proximal Classifier: Quasi-LDA
However, computing the LDA direction is expensive when the input space is high dimensional.
We propose the quasi-LDA direction as our proximal classifier:
(w_p^t)_i = ( (m_+^t)_i − (m_−^t)_i ) / ( (s_+^t)_i + (s_−^t)_i ),  i = 1, 2, ..., d
where
w_p^t : the proximal classifier on round t
m_+^t / m_−^t : mean vector of the positive/negative class on round t
s_+^t / s_−^t : variance vector of the positive/negative class on round t
It is very cheap in both CPU time and memory usage.

The Details of the Proximal Classifier
Small trick: Var(X) = E((X − E(X))²) = E(X²) − (E(X))²
The statistical information we maintain:
P_t = { j | y_j positive, j = 1, 2, ..., t },  N_t = { j | y_j negative, j = 1, 2, ..., t }
(m_+^t)_i = (1/|P_t|) Σ_{j ∈ P_t} (x_j)_i,  i = 1, 2, ..., d
(m_−^t)_i = (1/|N_t|) Σ_{j ∈ N_t} (x_j)_i,  i = 1, 2, ..., d
(s_+^t)_i = (1/(|P_t| − 1)) Σ_{j ∈ P_t} (x_j)_i² − (|P_t|/(|P_t| − 1)) ((m_+^t)_i)²,  i = 1, 2, ..., d
(s_−^t)_i = (1/(|N_t| − 1)) Σ_{j ∈ N_t} (x_j)_i² − (|N_t|/(|N_t| − 1)) ((m_−^t)_i)²,  i = 1, 2, ..., d
All we need to maintain in online fashion are (m_+^t)_i, (m_−^t)_i, Σ_{j ∈ P_t} (x_j)_i² and Σ_{j ∈ N_t} (x_j)_i² for each attribute.
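A small Python sketch of this bookkeeping: per-class running sums of x and x², from which the means, unbiased variances, and the quasi-LDA direction are recovered. The class and function names, and the eps guard against a zero denominator, are my additions.

```python
import numpy as np

class ClassStats:
    """Running per-class statistics: count, sum of x, sum of x^2."""
    def __init__(self, d):
        self.n, self.s1, self.s2 = 0, np.zeros(d), np.zeros(d)

    def update(self, x):
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def mean(self):
        return self.s1 / self.n

    def var(self):
        # Unbiased variance via the E[X^2] - E[X]^2 trick; needs n >= 2.
        m = self.mean()
        return self.s2 / (self.n - 1) - self.n / (self.n - 1) * m * m

def quasi_lda_direction(pos, neg, eps=1e-12):
    # (w_p)_i = (m+_i - m-_i) / (s+_i + s-_i)
    return (pos.mean() - neg.mean()) / (pos.var() + neg.var() + eps)
```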

Key Idea of Our Proposed Method
How do we combine the PA algorithm with the proximal classifier?
Recall that in the PA algorithm, ⟨w, w^t⟩ expresses the similarity between w and w^t.
Therefore, we formulate our idea by adding γ⟨w, w_p^t⟩ to Eq. (2); we call the result PAm:
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w||_2² − ⟨w, w^t⟩ − γ⟨w, w_p^t⟩
s.t. l(w; (x_t, y_t)) = 0
Minimizing −⟨w, w^t⟩ − γ⟨w, w_p^t⟩ keeps the new classifier w close to both w^t and w_p^t.
The term γ⟨w, w_p^t⟩ can also be introduced into PA-1 and PA-2, giving PAm-1 and PAm-2, respectively.

Closed-Form Update Rule of the PAm Algorithm
We derive closed-form update rules for PAm, PAm-1 and PAm-2; they also share the same closed form
w^{t+1} = w^t + γ w_p^t + α_t y_t x_t
where α_t > 0 is defined as
α_t = ( l(w^t; (x_t, y_t)) − γ y_t ⟨w_p^t, x_t⟩ ) / ||x_t||_2²                    (PAm)
α_t = min{ C, ( l(w^t; (x_t, y_t)) − γ y_t ⟨w_p^t, x_t⟩ ) / ||x_t||_2² }          (PAm-1)
α_t = ( l(w^t; (x_t, y_t)) − γ y_t ⟨w_p^t, x_t⟩ ) / ( ||x_t||_2² + 1/(2C) )       (PAm-2)

PAm Algorithm (Single Pass)
Given a training set T = {(x_t, y_t) | x_t ∈ R^d, y_t ∈ {−1, 1}, t = 1, 2, ..., m}
for t = 1, 2, ..., m
  Update the statistical information
  if 1 − y_t⟨w^t, x_t⟩ > 0
    Obtain the proximal classifier w_p^t
    Compute the learning rate α_t
    w^{t+1} = w^t + γ w_p^t + α_t y_t x_t
  end
end
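Putting the pieces together, here is a single-pass PAm sketch in Python using the PAm-1 learning rate. The helper names, the eps term, and the guards for an empty class or a negative α_t are my own additions, not part of the original algorithm.

```python
import numpy as np

def pam_single_pass(X, y, C=1.0, gamma=0.1, eps=1e-12):
    """One pass of PAm with the PAm-1 rate; y entries are in {-1, +1}."""
    m, d = X.shape
    w = np.zeros(d)
    n_pos = n_neg = 0
    sum_pos, sumsq_pos = np.zeros(d), np.zeros(d)
    sum_neg, sumsq_neg = np.zeros(d), np.zeros(d)
    for t in range(m):
        x, label = X[t], y[t]
        # Update the per-class running statistics.
        if label > 0:
            n_pos += 1; sum_pos += x; sumsq_pos += x * x
        else:
            n_neg += 1; sum_neg += x; sumsq_neg += x * x
        loss = max(0.0, 1.0 - label * np.dot(w, x))
        if loss > 0.0 and n_pos > 1 and n_neg > 1:   # aggressive phase
            m_pos, m_neg = sum_pos / n_pos, sum_neg / n_neg
            s_pos = sumsq_pos / (n_pos - 1) - n_pos / (n_pos - 1) * m_pos ** 2
            s_neg = sumsq_neg / (n_neg - 1) - n_neg / (n_neg - 1) * m_neg ** 2
            w_p = (m_pos - m_neg) / (s_pos + s_neg + eps)   # quasi-LDA direction
            alpha = min(C, (loss - gamma * label * np.dot(w_p, x)) / np.dot(x, x))
            alpha = max(alpha, 0.0)                  # the slides assume alpha_t > 0
            w = w + gamma * w_p + alpha * label * x
    return w
```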

PAm Algorithm (cont.)
After a single pass, we no longer update the statistical information, and the update scheme becomes very similar to that of the PA algorithm.
The order of the input data has less effect on the final classifier after a single pass.
In our experiments, we show that the classifiers from two consecutive passes are very similar when the PAm algorithm is applied.

Extension of PAm to a Sliding Window of Previous Classifiers
In each update, the PA algorithm considers only one previous classifier. We suggest that taking more than one previous classifier into account will be more robust.
We maintain a queue that saves the previous k classifiers, where k is small, and formulate the problem as
w^{t+1} = argmin_{w ∈ R^d} (1/2)||w||_2² − ⟨w, w̄^t⟩ − γ⟨w, w_p^t⟩ + Cξ
s.t. l(w; (x_t, y_t)) ≤ ξ and ξ ≥ 0
where w̄^t is the average of the k previous classifiers on round t.
We call this variant PAmQ-1.
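A small Python sketch of the sliding-window bookkeeping that PAmQ-1 relies on: keep the last k classifiers in a queue and use their average as w̄^t. The class name and the choice of k are illustrative.

```python
from collections import deque
import numpy as np

class ClassifierWindow:
    """Keeps the most recent k classifiers and returns their average."""
    def __init__(self, k):
        self.buf = deque(maxlen=k)

    def push(self, w):
        self.buf.append(np.copy(w))

    def average(self):
        return np.mean(list(self.buf), axis=0)   # \bar{w}_t over the stored classifiers

window = ClassifierWindow(k=5)
window.push(np.array([0.1, -0.2]))
window.push(np.array([0.3,  0.0]))
w_bar = window.average()                         # array([0.2, -0.1])
```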

Global and Local Information
We can also derive the formulations and closed-form update rules for PAmQ and PAmQ-2; all we need to do is replace w^t with w̄^t.
The PAm algorithm takes both global and local information into account within the online learning framework.
The statistical information we keep can be treated as the global information.
The previous classifier w^t (or w̄^t) can be treated as the local information.

Numerical Experiments: Setting
In the first part of our experiments, we investigate the effect of the input order of the instances.
We run a single epoch of the PA algorithm and the PAm algorithm 10 times, each time with a different input order.
In the second part of our experiments, we observe the convergence behavior of our proposed methods.
How good can the classifier be if we run only a single pass of the PAm family of algorithms?

Benchmark Datasets Used in the Numerical Experiments
Sources of the benchmark datasets: 3 medium-size datasets from the LIBSVM library site and 3 large-scale datasets from the Pascal Large Scale Learning Challenge in 2008.

Dataset      Training   Testing   Features
Svmguide1       3089       4000          4
Ijcnn          49990      91701         22
Cod-rna        59535     271617          8
Alpha         250000     250000        500
Delta         250000     250000        500
Gamma         250000     250000        500

Comparison in Training (Single Pass)

Training Error Rate and Standard Deviation (%)
Dataset     PA          PAm        PA-1       PAm-1       PA-2       PAm-2      PAmQ-1      LDA
Svmguide1   7.5(3.2)    5.0(1.0)   6.9(3.0)   4.9(0.3)    6.4(1.1)   4.9(0.3)   5.0(0.4)    12.2
Ijcnn       10.1(2.2)   9.5(1.0)   8.9(0.4)   8.5(0.1)    9.0(0.3)   9.0(0.1)   8.7(0.2)    13.9
Cod-rna     10.5(10.5)  7.7(1.7)   6.8(0.5)   6.3(0.1)    7.0(0.5)   6.9(0.6)   6.2(0.1)    6.4
Alpha       31.7(3.9)   26.9(1.7)  24.3(1.4)  26.3(0.7)   24.2(1.8)  26.1(0.6)  26.4(0.3)   21.7
Delta       35.0(0.7)   25.2(0.3)  30.7(0.2)  21.5(0.01)  33.5(0.5)  24.4(0.2)  21.5(0.02)  21.5
Gamma       32.2(1.0)   23.8(0.4)  25.8(0.7)  19.9(0.01)  30.0(1.0)  22.3(0.2)  19.9(0.01)  19.9

Min/Max Training Error Rate (%)
Dataset     PA          PAm        PA-1       PAm-1      PA-2       PAm-2      PAmQ-1     LDA
Svmguide1   5.0/15.7    4.2/7.7    4.9/14.2   4.5/5.2    5.4/8.8    4.6/5.4    4.6/6.1    12.2
Ijcnn       8.8/15.7    8.7/12.0   8.1/9.4    8.4/8.7    8.6/9.4    8.8/9.1    8.3/8.9    13.9
Cod-rna     6.5/25.9    6.4/12.2   6.4/7.8    6.2/6.6    6.5/7.9    6.3/8.2    6.1/6.4    6.4
Alpha       27.6/39.7   24.2/29.5  22.5/26.3  25.6/27.3  22.5/28.0  25.5/27.6  25.9/27.1  21.7
Delta       33.7/36.0   24.8/25.6  30.3/31.1  21.5/21.5  32.4/34.3  24.0/24.7  21.5/21.6  21.5
Gamma       30.8/34.4   23.0/24.4  24.7/27.0  19.9/19.9  28.9/31.7  22.0/22.7  19.9/19.9  19.9

Comparison in Testing

Test Error Rate and Standard Deviation (%)
Dataset     PA          PAm        PA-1       PAm-1       PA-2       PAm-2       PAmQ-1      LDA
Svmguide1   7.4(3.2)    5.1(0.6)   7.2(4.5)   4.4(0.2)    6.6(2.0)   4.9(0.6)    4.8(0.3)    11.8
Ijcnn       9.7(1.8)    11.0(2.9)  8.8(0.3)   8.7(0.2)    8.7(0.4)   9.3(0.4)    8.9(0.2)    18.6
Cod-rna     9.2(6.9)    6.3(1.7)   5.4(0.7)   4.7(0.2)    5.8(0.7)   5.3(0.67)   4.6(0.1)    4.7
Alpha       31.7(4.9)   29.1(3.0)  23.8(1.0)  23.7(1.3)   23.6(0.7)  24.0(0.9)   23.7(0.4)   21.8
Delta       35.0(0.8)   25.3(0.3)  29.7(0.7)  21.7(0.01)  33.0(0.5)  24.5(0.2)   21.7(0.01)  21.7
Gamma       32.2(1.0)   23.9(0.4)  25.8(0.6)  20.0(0.01)  30.0(1.1)  22.3(0.2)   20.0(0.01)  20.0

Min/Max Test Error Rate (%)
Dataset     PA          PAm        PA-1       PAm-1      PA-2       PAm-2      PAmQ-1     LDA
Svmguide1   4.7/13.5    4.4/6.1    4.7/19.6   4.3/4.9    5.5/11.7   4.4/6.0    4.5/5.5    11.8
Ijcnn       8.4/13.6    8.9/18.8   8.5/9.5    8.4/9.1    8.4/9.3    8.9/10.1   8.4/9.1    18.6
Cod-rna     4.8/24.5    4.7/10.3   4.7/6.8    4.5/5.2    5.1/6.9    4.5/6.5    4.5/4.9    4.7
Alpha       26.8/42.4   25.5/34.2  22.5/25.5  22.6/27.3  22.5/24.6  22.7/25.7  23.3/24.6  21.8
Delta       33.7/36.0   24.9/25.7  28.0/30.7  21.6/21.7  32.2/33.7  24.1/24.7  21.6/21.7  21.7
Gamma       30.9/34.4   23.0/24.5  24.8/26.8  19.9/20.0  28.9/32.0  22.1/22.7  19.9/20.0  20.0

Comparison of Average Running Time over 10 Runs (seconds)
Dataset     PA      PAm     PA-1    PAm-1   PA-2    PAm-2   PAmQ-1  LDA
Svmguide1   0.06    0.05    0.07    0.06    0.09    0.07    0.08    0.01
Ijcnn       0.66    0.57    0.69    0.72    1.04    0.71    0.91    0.19
Cod-rna     1.01    1.06    1.47    1.25    1.91    1.24    1.61    0.08
Alpha       15.43   15.48   17.70   17.64   21.15   21.14   22.76   72.6
Delta       23.20   16.75   24.82   15.51   24.58   16.18   16.48   68.1
Gamma       22.64   16.21   24.26   15.03   24.39   15.73   16.09   68.7

The running times of the PA family and the PAm family are comparable; keeping the simple statistical information does not add a significant CPU-time burden.

Comparison of Average Number of Updates
Dataset     PA        PAm       PA-1      PAm-1     PA-2      PAm-2     PAmQ-1
Svmguide1   571.0     308.4     943.7     539.4     1849.1    655.2     805.8
Ijcnn       6101.8    3990.3    8474.0    6790.8    21227.0   6852.3    9866.8
Cod-rna     10415.9   8924.0    22093.6   13215.4   46350.1   12645.1   18032.7
Alpha       147157.5  147150.1  168715.0  168691.2  238931.7  229004.0  202854.3
Delta       98457.4   64608.9   120450.8  55567.6   114984.4  62475.5   57279.0
Gamma       88334.6   60291.7   109457.6  51949.1   111612.6  58205.7   52872.7

Our method performs fewer updates than the conventional PA algorithm; fewer updates means fewer aggressive phases in an epoch.

Observation of Convergence Behavior
We run 10 epochs with the same input order for each method and record the classifier at the end of each epoch.
The proximal classifier is not changed once the first epoch is finished.
We compute the cosine similarity between the classifiers of two consecutive epochs.
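For reference, the epoch-to-epoch comparison boils down to a cosine similarity between weight vectors, as in this small sketch (the argument names are placeholders).

```python
import numpy as np

def cosine_similarity(w_prev, w_curr):
    """Cosine similarity between the classifiers of two consecutive epochs."""
    return np.dot(w_prev, w_curr) / (np.linalg.norm(w_prev) * np.linalg.norm(w_curr))
```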

Test Error Rate at Each Epoch: the test error rate becomes constant after 2 to 3 epochs.

Number of Updates over the 10 Epochs: the number of updates becomes constant after 2 to 4 epochs.

Conclusion
We used the online learning framework to solve large-scale binary classification problems.
As with the PA algorithm, we derived a set of closed-form update schemes for PAm, PAm-1 and PAm-2.
Compared with the conventional PA algorithm, using the quasi-LDA direction as a proximal classifier gives:
less sensitivity to the input order,
fewer updates in a single pass, and
higher similarity between the classifiers of two consecutive epochs.
We can obtain a near-optimal classifier in a single pass.

Future Work
Explore the statistical information collected in a single pass to build a better proximal classifier.
In PAmQ, use a weighted average instead of a simple average, with the weights determined by the performance of each classifier on a validation set.
Collect multiple misclassified instances and use them to approximate the Hessian inverse in the update rule.
Extend the linear classifier to a nonlinear classifier via the reduced kernel trick.
Find an efficient update scheme for small batches of input instances, analogous to SMO vs. decomposition methods for SVMs.

References
Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20:161-168, 2008.
John Langford. Concerns about the Large Scale Learning Challenge, 2008.
K. Crammer et al. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, 2006.
K. Hiraoka et al. Convergence analysis of online linear discriminant analysis. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, 3:387-391, Como, Italy, 2000.
L. I. Kuncheva and C. O. Plumpton. Adaptive learning rate for online linear discriminant classifiers. Structural, Syntactic, and Statistical Pattern Recognition, 5342:510-519, 2008.