Combine the PA Algorithm with a Proximal Classifier

1 Combine the Passive and Aggressive Algorithm with a Proximal Classifier Yuh-Jye Lee Joint work with Y.-C. Tseng Dept. of Computer Science & Information Engineering TaiwanTech. Dept. of Statistics@NCKU Sept. 29, 2011

2 Outline 1 Machine Learning vs. Optimization 2 Perceptron Algorithm Passive and Aggressive (PA) Algorithm 3 4 5

4 Supervised Learning Problems
Assumption: training instances are drawn independently from an unknown but fixed probability distribution P(x, y).
Our learning task: given a training set T = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, we would like to construct a rule f(x) that can correctly predict the label y of an unseen x. If f(x) ≠ y then we incur some loss or penalty, for example ℓ(f(x), y) = (1/2)|f(x) − y|.
Learning examples: classification, regression and sequence labeling. If y is drawn from a finite set, it is a classification problem; the simplest case, y ∈ {−1, +1}, is called the binary classification problem. If y is a real number, it becomes a regression problem. In the more general case, y can be a vector whose elements are drawn from a finite set; this is the sequence labeling problem.

5 Binary Classification Problem
Given a training set T = {(x_j, y_j) | x_j ∈ R^d, y_j ∈ {−1, 1}, j = 1, ..., m}, where x_j ∈ P ⟺ y_j = 1 and x_j ∈ N ⟺ y_j = −1.
Main goal: predict the class label of unseen data. Find a function f : R^d → R by learning from the data such that f(x) ≥ 0 ⟹ x ∈ P and f(x) < 0 ⟹ x ∈ N.
The simplest such function is linear: f(x) = ⟨w, x⟩ + b.
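
The linear rule can be coded in a few lines. A minimal sketch (NumPy; the weight vector, bias and test points below are made up for illustration only):

```python
import numpy as np

def predict(w, b, x):
    """Linear classifier: +1 if <w, x> + b >= 0, otherwise -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Toy usage with illustrative numbers.
w, b = np.array([0.5, -1.0]), 0.2
print(predict(w, b, np.array([2.0, 0.5])))   # -> 1  (positive side)
print(predict(w, b, np.array([-1.0, 1.0])))  # -> -1 (negative side)
```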

6 Expected Risk vs. Empirical Risk
Assumption: training instances are drawn independently from an unknown but fixed probability distribution P(x, y).
Ideally, we would like the optimal rule f that minimizes the expected risk E(f) = ∫ ℓ(f(x), y) dP(x, y) among all functions. Unfortunately, we cannot do it: P(x, y) is unknown, and we have to restrict ourselves to a certain hypothesis space F.
How about computing f_m ∈ F that minimizes the empirical risk E_m(f) = (1/m) Σ_{j=1}^m ℓ(f(x_j), y_j)?
Only minimizing the empirical risk runs the danger of overfitting.
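
The empirical risk is a one-line computation once the loss is fixed. A small sketch, assuming the 0/1-style loss from the previous slide and an arbitrary illustrative rule:

```python
import numpy as np

def empirical_risk(f, loss, X, y):
    """E_m(f) = (1/m) * sum_j loss(f(x_j), y_j) over the training set."""
    return float(np.mean([loss(f(xj), yj) for xj, yj in zip(X, y)]))

# Example: loss(f(x), y) = 0.5 * |f(x) - y| on labels in {-1, +1}.
loss = lambda fx, yy: 0.5 * abs(fx - yy)
f = lambda x: 1 if x.sum() >= 0 else -1          # an arbitrary rule, not a learned one
X, y = np.array([[1.0, 2.0], [-3.0, 0.5]]), np.array([1, -1])
print(empirical_risk(f, loss, X, y))             # -> 0.0, both instances are correct
```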

7 Approximation Optimization Approach
Most learning algorithms can be formulated as an optimization problem. The objective function consists of two parts: E_m(f) plus a term that controls the VC error bound. Controlling the VC error bound avoids the risk of overfitting; it can be achieved by adding a regularization term to the objective function.
Note that we have made lots of approximations when formulating a learning task as an optimization problem. Why bother to find the optimal solution for the problem? One could stop the optimization iteration before its convergence.

13 Gradient Descent: Batch Learning
For an optimization problem min_w f(w) = min_w r(w) + (1/m) Σ_{j=1}^m ℓ(w; (x_j, y_j)), GD tries to find a direction and a learning rate that decrease the objective function value:
w^{t+1} = w^t − η ∇f(w^t)
where η is the learning rate and −∇f(w^t) is the steepest descent direction, with
∇f(w^t) = ∇r(w^t) + (1/m) Σ_{j=1}^m ∇ℓ(w^t; (x_j, y_j)).
When m is large, computing Σ_{j=1}^m ∇ℓ(w^t; (x_j, y_j)) may cost much time.

14 Stochastic Gradient Descent: Online Learning
In GD, we compute the gradient using the entire training set. In stochastic gradient descent (SGD), we use ∇ℓ(w^t; (x_t, y_t)) instead of (1/m) Σ_{j=1}^m ∇ℓ(w^t; (x_j, y_j)), so the (stochastic) gradient of f(w^t) becomes
∇f(w^t) = ∇r(w^t) + ∇ℓ(w^t; (x_t, y_t)).
SGD computes the gradient using only one instance. In experiments, SGD is significantly faster than GD when m is large.
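
To make the contrast concrete, here is a minimal sketch of one batch GD step versus one SGD step. The choice of hinge loss and L2 regularizer below is only illustrative; the slides leave r and ℓ generic:

```python
import numpy as np

def grad_hinge(w, x, y):
    """Subgradient of the hinge loss l(w; (x, y)) = max(0, 1 - y<w, x>)."""
    return -y * x if 1.0 - y * np.dot(w, x) > 0 else np.zeros_like(w)

def gd_step(w, X, Y, eta, lam):
    """Batch step: averages the gradient over all m instances (costly for large m)."""
    g = lam * w + np.mean([grad_hinge(w, x, y) for x, y in zip(X, Y)], axis=0)
    return w - eta * g

def sgd_step(w, x, y, eta, lam):
    """Stochastic step: uses the single instance (x_t, y_t) only."""
    return w - eta * (lam * w + grad_hinge(w, x, y))
```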

15 Large-Scale Problems
Collecting and distributing large-scale datasets are no longer expensive tasks. Two definitions of large-scale problems: (1) problems where the main computational constraint is the amount of time available, rather than the number of instances [Bottou, 2008]; (2) problems whose training set may not fit in a modern computer's memory [Langford, 2008].
We need learning algorithms that scale linearly with the size of the dataset, and their performance should be better than processing a random subset of the data with conventional learning algorithms.

17 Outline 1 Machine Learning vs. Optimization 2 Perceptron Algorithm Passive and Aggressive (PA) Algorithm 3 4 5

18 Online Learning
Definition of online learning: given a set of new training data, an online learner can update its model without reading the old data while improving its performance. In contrast, an off-line learner must combine old and new data and start the learning all over again, otherwise the performance will suffer.
Online learning is considered a solution for large learning tasks. It usually requires several passes (or epochs) through the training instances, and we need to keep all instances unless we run the algorithm for only a single pass.

19 Outline 1 Machine Learning vs. Optimization 2 Perceptron Algorithm Passive and Aggressive (PA) Algorithm 3 4 5

20 Perceptron Algorithm [Rosenblatt, 1956]
An online learning algorithm and a mistake-driven procedure: the current classifier is updated whenever the newly arriving instance is misclassified.
Initialization: k = 0, R = max_{1≤j≤m} ||x_j||_2
repeat
  for t = 1 : m
    if y_t(⟨w_k, x_t⟩ + b_k) ≤ 0
      w_{k+1} = w_k + η y_t x_t
      b_{k+1} = b_k + η y_t R^2
      k = k + 1
    end
  end
until no mistake is made within the for-loop
Here k is the number of mistakes and η > 0 is the learning rate.
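
A direct Python rendering of the procedure above (a sketch; a max_epochs safeguard is added for the nonseparable case, which the original pseudocode does not have):

```python
import numpy as np

def perceptron(X, Y, eta=1.0, max_epochs=100):
    """Rosenblatt's Perceptron: update (w, b) whenever an instance is misclassified."""
    w, b, k = np.zeros(X.shape[1]), 0.0, 0
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(X, Y):
            if y * (np.dot(w, x) + b) <= 0:     # mistake-driven update
                w = w + eta * y * x
                b = b + eta * y * R ** 2
                k += 1
                mistakes += 1
        if mistakes == 0:                       # separable case: terminate
            break
    return w, b, k
```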

21 Perceptron Algorithm [Rosenblatt, 1956]
The Perceptron can be viewed as an SGD method. The underlying optimization problem of the algorithm is
min_{(w,b) ∈ R^{d+1}} Σ_{j=1}^m (−y_j(⟨w, x_j⟩ + b))_+
In the linearly separable case, the Perceptron algorithm terminates in finitely many steps no matter what learning rate is chosen. In the nonseparable case, it is very difficult to decide the appropriate learning rate that makes the fewest mistakes.
The learning rate can be a nonnegative number; in the more general case, it can be a positive definite matrix.

22 Outline 1 Machine Learning vs. Optimization 2 Perceptron Algorithm Passive and Aggressive (PA) Algorithm 3 4 5

23 Key Idea of PA Algorithm [Crammer, 2006]
An online learning algorithm and a mistake-driven procedure. The PA algorithm suggests that the new classifier should not only classify the newly arriving data correctly but also stay as close to the current classifier as possible. The problem can be formulated as follows:
w^{t+1} = argmin_{w ∈ R^d} (1/2) ||w − w^t||_2^2    (1)
s.t. ℓ(w; (x_t, y_t)) = 0
where ℓ(w; (x_t, y_t)) is a hinge loss function.

24 Simplify the PA Algorithm
We can rewrite Eq. (1) as follows:
w^{t+1} = argmin_{w ∈ R^d} (1/2) ||w||_2^2 − ⟨w, w^t⟩    (2)
s.t. ℓ(w; (x_t, y_t)) = 0
In Eq. (2), the PA algorithm implicitly minimizes the regularization term (1/2) ||w||_2^2, and minimizing −⟨w, w^t⟩ (i.e., maximizing ⟨w, w^t⟩) expresses that the newly updated classifier must be similar to the current classifier.

25 Notes on PA Algorithm
If (x_t, y_t) is correctly classified, then w^t is the solution of the optimization problem; this is called the passive phase. If (x_t, y_t) is misclassified, then w^t is updated to correctly classify (x_t, y_t); this is called the aggressive phase.
We may treat the classifier learned by the PA algorithm as a hard-margin classifier. However, if an instance comes with noise, the current classifier may be pulled severely in the wrong direction.

26 PA-1 and PA-2
Based on the concept of the soft-margin classifier, a non-negative slack variable ξ was introduced into the PA algorithm in two different ways.
Linear scale with ξ (called PA-1):
w^{t+1} = argmin_{w ∈ R^d} (1/2) ||w − w^t||_2^2 + Cξ
s.t. ℓ(w; (x_t, y_t)) ≤ ξ and ξ ≥ 0
Square scale with ξ (called PA-2):
w^{t+1} = argmin_{w ∈ R^d} (1/2) ||w − w^t||_2^2 + Cξ^2
s.t. ℓ(w; (x_t, y_t)) ≤ ξ

27 PA Algorithm: Closed Form Updating
It seems that we have to solve an optimization problem for each instance. Fortunately, PA, PA-1 and PA-2 come with closed-form updating schemes. They share the same closed form
w^{t+1} = w^t + τ_t y_t x_t
where τ_t > 0 is defined as
τ_t = ℓ(w^t; (x_t, y_t)) / ||x_t||_2^2    (PA)
τ_t = min{ C, ℓ(w^t; (x_t, y_t)) / ||x_t||_2^2 }    (PA-1)
τ_t = ℓ(w^t; (x_t, y_t)) / (||x_t||_2^2 + 1/(2C))    (PA-2)
This replaces the fixed learning rate of the Perceptron algorithm with a dynamic learning rate that depends on the current instance.
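
The three rates translate directly into code. A sketch of one update (hinge loss as in the slides; the variant names are just strings used for dispatch):

```python
import numpy as np

def hinge(w, x, y):
    return max(0.0, 1.0 - y * np.dot(w, x))

def pa_update(w, x, y, C=1.0, variant="PA-1"):
    """One passive-aggressive round: w <- w + tau_t * y_t * x_t."""
    loss, sq_norm = hinge(w, x, y), np.dot(x, x)
    if loss == 0.0:                           # passive phase: keep w unchanged
        return w
    if variant == "PA":
        tau = loss / sq_norm
    elif variant == "PA-1":
        tau = min(C, loss / sq_norm)
    else:                                     # PA-2
        tau = loss / (sq_norm + 1.0 / (2.0 * C))
    return w + tau * y * x
```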

28 Outline 1 Machine Learning vs. Optimization 2 Perceptron Algorithm Passive and Aggressive (PA) Algorithm 3 4 5

29 Motivation for a Proximal Classifier
In the online learning framework, we do not keep the previous instances in memory. This lack of memory for previous instances might hurt the learning efficiency: instances which have been classified correctly before may be misclassified again.
We therefore keep some simple statistical information (mean and variance) to summarize the previous instances. The cost has to be taken into account: fixed, small-size memory and linear CPU time.

30 Illustrations of Simple Proximal Models
If the dataset is easy to separate, the difference of the class means suggests a good proximal classifier (left figure). For a more complicated dataset, the LDA direction performs better (right figure). The proximal classifier would provide a good suggestion for our classifier updating.

31 Proximal Classifier: Quasi-LDA
However, computing the LDA direction is expensive when the input space is high dimensional. We proposed the quasi-LDA direction as our proximal classifier:
(w_p^t)_i = ((m_+^t)_i − (m_−^t)_i) / ((s_+^t)_i + (s_−^t)_i),  i = 1, 2, ..., d
where w_p^t is the proximal classifier on round t, m_+^t / m_−^t are the mean vectors of the positive/negative class on round t, and s_+^t / s_−^t are the variance vectors of the positive/negative class on round t.
It is very cheap both in CPU time and in memory usage.

32 The Details of the Proximal Classifier
Small trick: Var(X) = E((X − E(X))^2) = E(X^2) − (E(X))^2.
The statistical information we maintain, with P_t = {j : y_j positive, j = 1, 2, ..., t} and N_t = {j : y_j negative, j = 1, 2, ..., t}:
(m_+^t)_i = (1/|P_t|) Σ_{j ∈ P_t} (x_j)_i,  i = 1, 2, ..., d
(m_−^t)_i = (1/|N_t|) Σ_{j ∈ N_t} (x_j)_i,  i = 1, 2, ..., d
(s_+^t)_i = (1/(|P_t| − 1)) Σ_{j ∈ P_t} (x_j)_i^2 − (|P_t|/(|P_t| − 1)) (m_+^t)_i^2,  i = 1, 2, ..., d
(s_−^t)_i = (1/(|N_t| − 1)) Σ_{j ∈ N_t} (x_j)_i^2 − (|N_t|/(|N_t| − 1)) (m_−^t)_i^2,  i = 1, 2, ..., d
All we need to maintain in an online fashion are (m_+^t)_i, (m_−^t)_i, Σ_{j ∈ P_t} (x_j)_i^2 and Σ_{j ∈ N_t} (x_j)_i^2 for each attribute.
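
A sketch of this bookkeeping: per class we keep only a count, a running sum and a running sum of squares per attribute, which is enough to recover the mean and the (unbiased) variance at any round and to form the quasi-LDA direction from the previous slide:

```python
import numpy as np

class ClassStats:
    """Online per-attribute mean and variance for one class."""
    def __init__(self, d):
        self.n, self.s, self.s2 = 0, np.zeros(d), np.zeros(d)

    def update(self, x):
        self.n += 1
        self.s += x
        self.s2 += x ** 2

    def mean(self):
        return self.s / self.n

    def var(self):
        # unbiased: (sum x^2 - n * mean^2) / (n - 1); needs n >= 2
        return (self.s2 - self.n * self.mean() ** 2) / (self.n - 1)

def quasi_lda(pos, neg):
    """Proximal classifier w_p: (m+ - m-) / (s+ + s-), element-wise."""
    return (pos.mean() - neg.mean()) / (pos.var() + neg.var())
```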

33 Key Idea of Our Proposed Method
How do we combine the PA algorithm with the proximal classifier? Remember that in the PA algorithm, ⟨w, w^t⟩ expresses the similarity between w and w^t. Therefore, we can formulate our idea by adding −γ⟨w, w_p^t⟩ to Eq. (2), called PAm:
w^{t+1} = argmin_{w ∈ R^d} (1/2) ||w||_2^2 − ⟨w, w^t⟩ − γ⟨w, w_p^t⟩
s.t. ℓ(w; (x_t, y_t)) = 0
Minimizing −⟨w, w^t⟩ − γ⟨w, w_p^t⟩ keeps the new classifier w close to both w^t and w_p^t. The term γ⟨w, w_p^t⟩ can also be introduced into PA-1 and PA-2, called PAm-1 and PAm-2, respectively.

34 Closed Form Updating Rule of the PAm Algorithm
We derive the closed-form updating rules for PAm, PAm-1 and PAm-2 as follows; they also share the same closed form
w^{t+1} = w^t + γ w_p^t + α_t y_t x_t
where α_t > 0 is defined as
α_t = (ℓ(w^t; (x_t, y_t)) − γ y_t ⟨w_p^t, x_t⟩) / ||x_t||_2^2    (PAm)
α_t = min{ C, (ℓ(w^t; (x_t, y_t)) − γ y_t ⟨w_p^t, x_t⟩) / ||x_t||_2^2 }    (PAm-1)
α_t = (ℓ(w^t; (x_t, y_t)) − γ y_t ⟨w_p^t, x_t⟩) / (||x_t||_2^2 + 1/(2C))    (PAm-2)

35 PAm Algorithm (Single Pass)
Given a training set T = {(x_t, y_t) | x_t ∈ R^d, y_t ∈ {−1, 1}, t = 1, 2, ..., m}
for t = 1, 2, ..., m
  Update the statistical information
  if 1 − y_t⟨w^t, x_t⟩ > 0
    Obtain the proximal classifier w_p^t
    Compute the learning rate α_t
    w^{t+1} = w^t + γ w_p^t + α_t y_t x_t
  end
end
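
Putting the pieces together, a single-pass PAm-1 sketch using the ClassStats and quasi_lda helpers sketched earlier; γ and C are hyperparameters, and the warm-up guard (at least two instances per class before forming w_p) is an added assumption so that the variance is defined:

```python
import numpy as np

def pam1_single_pass(X, Y, gamma=0.1, C=1.0):
    """One pass of PAm-1: a PA-1 update pulled toward the quasi-LDA proximal classifier."""
    d = X.shape[1]
    w = np.zeros(d)
    pos, neg = ClassStats(d), ClassStats(d)
    for x, y in zip(X, Y):
        (pos if y > 0 else neg).update(x)              # update the statistical information
        loss = max(0.0, 1.0 - y * np.dot(w, x))
        if loss > 0 and pos.n > 1 and neg.n > 1:       # aggressive phase
            w_p = quasi_lda(pos, neg)                  # proximal classifier w_p^t
            alpha = min(C, (loss - gamma * y * np.dot(w_p, x)) / np.dot(x, x))
            w = w + gamma * w_p + alpha * y * x
    return w
```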

36 PAm Algorithm
After a single pass, we no longer update the statistical information, and the updating scheme becomes very similar to the PA algorithm. The final classifier is less affected by the order of the input data after a single pass. In our experiments, we show that the classifiers of two consecutive passes are very similar when the PAm algorithm is applied.

37 Extension of PAm to a Sliding Window of Previous Classifiers
In each update, the PA algorithm considers only one previous classifier. We suggest that taking more than one previous classifier into account will be more robust. We maintain a queue that saves the previous k classifiers, with k small, and formulate:
w^{t+1} = argmin_{w ∈ R^d} (1/2) ||w||_2^2 − ⟨w, w̄^t⟩ − γ⟨w, w_p^t⟩ + Cξ
s.t. ℓ(w; (x_t, y_t)) ≤ ξ and ξ ≥ 0
where w̄^t is the average of the k previous classifiers on round t. We call it PAmQ-1.
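
A sketch of the extra bookkeeping PAmQ-1 needs: a small fixed-size queue of the most recent classifiers whose average plays the role of w̄^t (the queue length k is the only new hyperparameter):

```python
from collections import deque
import numpy as np

class ClassifierQueue:
    """Keeps the last k classifiers and exposes their average (w-bar)."""
    def __init__(self, k):
        self.q = deque(maxlen=k)

    def push(self, w):
        self.q.append(np.array(w, copy=True))

    def average(self):
        return np.mean(np.stack(self.q), axis=0)
```

In the PAm-1 loop sketched above, the similarity term would be formed with queue.average() instead of the single previous classifier, and the newly updated w pushed onto the queue after each aggressive step.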

38 Global and Local Information
We can also derive the formulations and closed-form updating rules for PAmQ and PAmQ-2; all we need to do is replace w^t with w̄^t.
The PAm algorithm takes both global and local information into account in the online learning framework: the statistical information we keep can be treated as the global information, while the previous classifier w^t (or w̄^t) can be treated as the local information.

39 Outline 1 Machine Learning vs. Optimization 2 Perceptron Algorithm Passive and Aggressive (PA) Algorithm 3 4 5

40 Numerical Experiments Setting
In the first part of our experiments, we investigate the effect of the input order of the instances: we run a single epoch of the PA algorithm and the PAm algorithm 10 times, each time with a different input order. In the second part, we observe the convergence behavior of our proposed methods: how good can the classifier be if we run only a single pass of the PAm family of algorithms?

41 Benchmark Datasets Used in Numerical Experiments
The sources of the benchmark datasets: 3 medium-size datasets (Svmguide1, Ijcnn, Cod-rna) from the LIBSVM library site and 3 large-scale datasets (Alpha, Delta, Gamma) from the Pascal Large Scale Learning Challenge in 2008. (The training/testing sizes and numbers of features listed on the slide were not preserved in the transcription.)

42 Comparison in the Training (Single Pass)
Training Error Rate and Standard Deviation (%)
Dataset     PA          PAm        PA-1       PAm-1       PA-2       PAm-2      PAmQ-1      LDA
Svmguide1   7.5(3.2)    5.0(1.0)   6.9(3.0)   4.9(0.3)    6.4(1.1)   4.9(0.3)   5.0(0.4)    12.2
Ijcnn       10.1(2.2)   9.5(1.0)   8.9(0.4)   8.5(0.1)    9.0(0.3)   9.0(0.1)   8.7(0.2)    13.9
Cod-rna     10.5(10.5)  7.7(1.7)   6.8(0.5)   6.3(0.1)    7.0(0.5)   6.9(0.6)   6.2(0.1)    6.4
Alpha       31.7(3.9)   26.9(1.7)  24.3(1.4)  26.3(0.7)   24.2(1.8)  26.1(0.6)  26.4(0.3)   21.7
Delta       35.0(0.7)   25.2(0.3)  30.7(0.2)  21.5(0.01)  33.5(0.5)  24.4(0.2)  21.5(0.02)  21.5
Gamma       32.2(1.0)   23.8(0.4)  25.8(0.7)  19.9(0.01)  30.0(1.0)  22.3(0.2)  19.9(0.01)  19.9
Min/Max Training Error Rate (%): the individual min/max values were not preserved in the transcription.

43 Comparison in the Testing
Test Error Rate and Standard Deviation (%)
Dataset     PA          PAm        PA-1       PAm-1       PA-2       PAm-2       PAmQ-1      LDA
Svmguide1   7.4(3.2)    5.1(0.6)   7.2(4.5)   4.4(0.2)    6.6(2.0)   4.9(0.6)    4.8(0.3)    11.8
Ijcnn       9.7(1.8)    11.0(2.9)  8.8(0.3)   8.7(0.2)    8.7(0.4)   9.3(0.4)    8.9(0.2)    18.6
Cod-rna     9.2(6.9)    6.3(1.7)   5.4(0.7)   4.7(0.2)    5.8(0.7)   5.3(0.67)   4.6(0.1)    4.7
Alpha       31.7(4.9)   29.1(3.0)  23.8(1.0)  23.7(1.3)   23.6(0.7)  24.0(0.9)   23.7(0.4)   21.8
Delta       35.0(0.8)   25.3(0.3)  29.7(0.7)  21.7(0.01)  33.0(0.5)  24.5(0.2)   21.7(0.01)  21.7
Gamma       32.2(1.0)   23.9(0.4)  25.8(0.6)  20.0(0.01)  30.0(1.1)  22.3(0.2)   20.0(0.01)  20.0
Min/Max Test Error Rate (%): the individual min/max values were not preserved in the transcription.

44 Comparison of Average Running Time of 10 Runs
(Running times in seconds for PA, PAm, PA-1, PAm-1, PA-2, PAm-2, PAmQ-1 and LDA on Svmguide1, Ijcnn, Cod-rna, Alpha, Delta and Gamma; the individual values were not preserved in the transcription.)
The running times of the PA family and the PAm family are comparable: keeping the simple statistical information may not produce an extra CPU time burden.

45 Comparison of Average Number of Updates
(Numbers of updates for PA, PAm, PA-1, PAm-1, PA-2, PAm-2 and PAmQ-1 on Svmguide1, Ijcnn, Cod-rna, Alpha, Delta and Gamma; the individual values were not preserved in the transcription.)
Our method needs fewer updates than the conventional PA algorithm; fewer updates indicate fewer aggressive phases in an epoch.

46 Observation on Convergence Behavior
We run 10 epochs with the same input order for each method and record the classifier when an epoch is completed. The proximal classifier is not changed once the first epoch is finished. We compute the cosine similarity between the classifiers of two consecutive epochs.

47 Test Error Rate of each Epoch It becomes a constant after 2 to 3 epochs

48 Number of Updating in these 10 Epochs It becomes a constant after 2 to 4 epochs

49 Outline 1 Machine Learning vs. Optimization 2 Perceptron Algorithm Passive and Aggressive (PA) Algorithm 3 4 5

50 Conclusion
We used the online learning framework to solve large-scale binary classification problems. Similar to the PA algorithm, we derived a set of closed-form update schemes for PAm, PAm-1 and PAm-2. Compared to the conventional PA algorithm, utilizing the quasi-LDA direction as a proximal classifier gives lower sensitivity to the input order, fewer updates in a single pass, and higher similarity between the classifiers of two consecutive epochs. We can obtain a near-optimal classifier in a single pass.

57 Future Work
Try to explore the statistical information in a single pass to build a better proximal classifier. In PAmQ, we can use a weighted average instead of a simple average, with the weights determined by the performance of each classifier on a certain validation set. We might collect multiple misclassified instances and use them to incorporate an approximation of the Hessian inverse into the updating rule. Extend the linear classifier to a nonlinear classifier via the reduced kernel trick. Find an efficient updating scheme for small-batch input instances, similar to SMO vs. the decomposition method for SVMs.

58 References
Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20, 2008.
John Langford. Concerns about the Large Scale Learning Challenge, 2008.
K. Crammer et al. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7, 2006.
K. Hiraoka et al. Convergence analysis of online linear discriminant analysis. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, 3, Como, Italy.
L. I. Kuncheva and C. O. Plumpton. Adaptive learning rate for online linear discriminant classifiers. Structural, Syntactic, and Statistical Pattern Recognition, 5342, 2008.
