Linear models. Subhransu Maji. CMPSCI 689: Machine Learning. 24 February February 2015

Size: px

Start display at page:

Download "Linear models. Subhransu Maji. CMPSCI 689: Machine Learning. 24 February February 2015"

Thomasine Cummings
5 years ago
Views:

1 Linear models Subhransu Maji CMPSCI 689: Machine Learning 24 February February 2015

2 Overvie Linear models Perceptron: model and learning algorithm combined as one Is there a better ay to learn linear models? We ill separate models and learning algorithms Learning as optimization } Surrogate loss function model design Regularization Gradient descent } Batch and online gradients optimization Subgradient descent Support vector machines 2/29

3 Learning as optimization min X n 1[y n T x n < 0] + R() feest mistakes The perceptron algorithm ill find an optimal if the data is separable efficiency depends on the margin and norm of the data Hoever, if the data is not separable, optimizing this is NP-hard i.e., there is no efficient ay to minimize this unless P=NP 3/29

4 Learning as optimization min X n hyperparameter 1[y n T x n < 0] + R() feest mistakes simpler model In addition to minimizing training error, e ant a simpler model Remember our goal is to minimize generalization error Recall the bias and variance tradeoff for learners We can add a regularization term R() that prefers simpler models For example e may prefer decision trees of shallo depth Here λ is a hyperparameter of optimization problem 4/29

5 Learning as optimization min X n hyperparameter 1[y n T x n < 0] + R() feest mistakes simpler model The questions that remain are: What are good ays to adjust the optimization problem so that there are efficient algorithms for solving it? What are good regularizations R() for hyperplanes? Assuming that the optimization problem can be adjusted appropriately, hat algorithms exist for solving the regularized optimization problem? 5/29

6 Convex surrogate loss functions Zero/one loss is hard to optimize Small changes in can cause large changes in the loss Surrogate loss: replace Zero/one loss by a smooth function Easier to optimize if the surrogate loss is convex Examples: Zero/one Hinge Logistic Exponential Squared y =+1 ŷ T x concave convex /29

7 Weight regularization What are good regularization functions R() for hyperplanes? We ould like the eights To be small Change in the features cause small change to the score Robustness to noise To be sparse Use as fe features as possible Similar to controlling the depth of a decision tree This is a form of inductive bias 7/29

8 Weight regularization Just like the surrogate loss function, e ould like R() to be convex Small eights regularization Sparsity regularization R (norm) () = s X d 2 d Family of p-norm regularization R (sqrd) () = X d 2 d R (count) X () = 1[ d > 0] not convex d R (p-norm) () = X d d p 1/p 8/29

9 Contours of p-norms convex for p 1 9/29

10 Contours of p-norms not convex for 0 apple p<1 p = 2 3 Counting non-zeros: p =0 R (count) () = X d 1[ d > 0] 10/29

11 General optimization frameork hyperparameter min X n ` y n, T x n + R() surrogate loss regularization Select a suitable: convex surrogate loss convex regularization Select the hyperparameter λ Minimize the regularized objective ith respect to This frameork for optimization is called Tikhonov regularization or generally Structural Risk Minimization (SRM) 11/29

12 Optimization by gradient descent Convex function step size p1 1 p2 2 p3 3 p4 p5 p6 g (k) r p F (p) pk compute gradient at the current location p k+1 p k k g (k) take a step don the gradient Non-convex function local optima = global optima local optima global optima 12/29

Choice of step size The step size is important too small: slo convergence too large: no convergence A strategy is to use large step sizes initially and small step sizes later: t 0 /(t 0 + t) There

13 Choice of step size The step size is important too small: slo convergence too large: no convergence A strategy is to use large step sizes initially and small step sizes later: t 0 /(t 0 + t) There are methods that converge faster by adapting step size to the curvature of the function Field of convex optimization p1 p2 1 Good step size p3 p4 p 5 p6 Bad step size p1 p3 p5 1 p2 p4 p6 13/29

14 Example: Exponential loss L() = X n exp( y n T x n )+ 2 2 objective dl X d = n X n loss term y n x n exp( y n T x n )+ gradient y n x n exp( y n T x n )+ update regularization term + cy n x n (1 ) high for misclassified points shrinks eights toards zero similar to the perceptron update rule 14/29

15 Batch and online gradients L() = X n L n () objective dl d gradient descent batch gradient X n dl n d online gradient dln d sum of n gradients update eight after you see all points gradient at n th point update eights after you see each point Online gradients are the default method for multi-layer perceptrons 15/29

16 Subgradient `(hinge) (y, T x) = max(0, 1 y T x) z 1 z The hinge loss is not differentiable at z=1 Subgradient is any direction that is belo the function For the hinge loss a possible subgradient is: d`hinge d = yx 0 if y T x > 1 otherise subgradient 16/29

17 Example: Hinge loss L() = X n max(0, 1 y n T x n )+ 2 2 objective dl 1[y n T x n apple 1]y n x n + subgradient X d = n X n 1[y n T x n apple 1]y n x n + update loss term + y n x n regularization term (1 ) only for points y n T x n apple 1 shrinks eights toards zero perceptron update y n T x n apple 0 17/29

18 Example: Squared loss L() = X n y n T x n objective matrix notation equivalent loss 18/29

19 Example: Squared loss objective gradient At optima the gradient=0 exact closed-form solution 19/29

20 Matrix inversion vs. gradient descent Assume, e have D features and N points Overall time via matrix inversion The closed form solution involves computing: Total time is O(D 2 N + D 3 + DN), assuming O(D 3 ) matrix inversion If N > D, then total time is O(D 2 N) Overall time via gradient descent Gradient: dl d = X 2(y n T x n )x n + n Each iteration: O(ND); T iterations: O(TND) Which one is faster? Small problems D < 100: probably faster to run matrix inversion Large problems D > 10,000: probably faster to run gradient descent 20/29

21 Picking a good hyperplane Which hyperplane is the best? 21/29

22 Support Vector Machines (SVMs) Maximize the distance to the nearest point (margin), hile correctly classifying all the points margin () 22/29

23 Optimization for SVMs Separable case: hard margin SVM min 1 () maximize margin subject to: y n T x n 1, 8n separate by a non-trivial margin Non-separable case: soft margin SVM min 1 () + C X n maximize margin n minimize slack subject to: y n T x n 1 n, 8n allo some slack n 0 23/29

24 Margin of a classifier T x +1=0 T x 1=0 min () = 1 1 () min maximizing margin = minimizing norm margin () 24/29

25 Equivalent optimization for SVMs Separable case: hard margin SVM min maximize margin subject to: y n T x n 1, 8n squaring and half for convenience separate by a non-trivial margin Non-separable case: soft margin SVM min C X n maximize margin n minimize slack subject to: y n T x n 1 n, 8n allo some slack n 0 25/29

Slack variables 1 X min 2 2 + C n subject to: y n T n x n 1 n, 8n soft margin SVM n 0 Suppose I tell you hat is, but forgot to give you the slack variables Can you derive the optimal slack for the n

26 Slack variables 1 X min C n subject to: y n T n x n 1 n, 8n soft margin SVM n 0 Suppose I tell you hat is, but forgot to give you the slack variables Can you derive the optimal slack for the n th example? y n T x n n = 0.8, =? = -1, n =? y n T x n y n T x n n = 2.5, =? n = 0 yn T x n 1 1 y n T x n otherise min C X n max(0, 1 y n T x n ) Same as hinge loss ith squared norm regularization 26/29

27 Optimization for linear models Under suitable conditions*, provided you pick the step sizes appropriately, the convergence rate of gradient descent is O(1/N) i.e., if you ant a solution ithin of the optimal you have to run the gradient descent for N=1000 iterations. For linear models (hinge/logistic/exponential loss) and squared-norm regularization there are off-the-shelf solvers that are fast in practice: SVM perf, LIBLINEAR, PEGASOS SVM perf, LIBLINEAR use a different optimization method * the function is strongly convex: 27/29

28 Slides credit Figures of various p-norms are from Wikipedia Some of the slides are based on CIML book by Hal Daume III 28/29

Appendix: code for surrogateloss 9 8 7 6 Zero/one Hinge Logistic Exponential Squared Output Prediction 5 4 3 2 % Code to plot various loss functions y1=1; y2=linspace( 2,3,500); zerooneloss = y1*y2

29 Appendix: code for surrogateloss Zero/one Hinge Logistic Exponential Squared Output Prediction % Code to plot various loss functions y1=1; y2=linspace( 2,3,500); zerooneloss = y1*y2 <=0; hingeloss = max(0, 1 y1*y2); logisticloss = log(1+exp( y1*y2))/log(2); exploss = exp( y1*y2); squaredloss = (y1 y2).^2; Loss % Plot them figure(1); clf; hold on; plot(y2, zerooneloss, k, LineWidth,1); plot(y2, hingeloss, b, LineWidth,1); plot(y2, logisticloss, r, LineWidth,1); plot(y2, exploss, g, LineWidth,1); plot(y2, squaredloss, m, LineWidth,1); ylabel( Prediction, FontSize,16); xlabel( Loss, FontSize,16); legend({ Zero/one, Hinge, Logistic, Exponential, Squared }, Location, NorthEast, FontSize,16); box on; Matlab code 29/29

CPSC 340: Machine Learning and Data Mining. Logistic Regression Fall 2016

CPSC 340: Machine Learning and Data Mining Logistic Regression Fall 2016 Admin Assignment 1: Marks visible on UBC Connect. Assignment 2: Solution posted after class. Assignment 3: Due Wednesday (at any