Lecture 5: Linear Classification


1 Lecture 5: Linear Classification
CS , Fall 2011
Laurent El Ghaoui, EECS Department, UC Berkeley
September 8, 2011

2 Outline


4 Data
We are given a training data set:
- Feature vectors: data points $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$.
- Labels: $y_i \in \{-1, 1\}$, $i = 1, \ldots, n$.

Examples:

Feature vectors              Labels
Companies' corporate info    default / no default
Stock price data             price up / down
News data                    price up / down
News data                    sentiment (positive / negative)
s                            presence of a keyword
Genetic measures             presence of disease

5 Linear classification
Using the training data set $\{(x_i, y_i)\}_{i=1}^n$, our goal is to find a classification rule $\hat{y} = f(x)$ that predicts the label $\hat{y}$ of a new data point $x$.

Linear classification rule: assumes $f$ is the composition of the sign function and a linear (in fact, affine) function:
$$\hat{y} = \mathrm{sign}(w^T x + b),$$
where $w \in \mathbb{R}^p$, $b \in \mathbb{R}$ are given.

The goal of a linear classification algorithm is to find $w$, $b$ using the training data.
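As a minimal sketch of the rule above, the following pure-Python snippet evaluates $\mathrm{sign}(w^T x + b)$ for a single point. The function names, the tie-breaking choice (a score of exactly zero is mapped to +1), and the numeric values are assumptions made for illustration; none of them come from the lecture.

```python
def sign(z):
    """Return +1 for z >= 0 and -1 otherwise (tie-breaking at 0 is an assumption)."""
    return 1 if z >= 0 else -1


def predict(w, b, x):
    """Linear classification rule: y_hat = sign(w^T x + b)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sign(score)


# Illustrative (made-up) classifier parameters.
w = [1.0, -2.0]
b = 0.5

print(predict(w, b, [3.0, 1.0]))  # score = 3 - 2 + 0.5 = 1.5
print(predict(w, b, [0.0, 1.0]))  # score = 0 - 2 + 0.5 = -1.5
```

Only the sign of the affine score matters for the prediction, so scaling $(w, b)$ by any positive constant leaves the classifier unchanged.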

6 Separable data
The data is linearly separable if there exists a linear classification rule that makes no error on the training set. This is a set of linear inequality constraints on $(w, b)$:
$$y_i (w^T x_i + b) \ge 0, \quad i = 1, \ldots, n.$$
Strict separability corresponds to the same conditions, but with strict inequalities.
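The separability conditions can be checked directly for a candidate $(w, b)$. The sketch below is one way to do it, assuming a tiny made-up data set; the helper name `separates` and its `strict` flag are hypothetical.

```python
def separates(w, b, X, y, strict=False):
    """Check y_i (w^T x_i + b) >= 0 for all i (or > 0 when strict=True)."""
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        if margin < 0 or (strict and margin == 0):
            return False
    return True


# Made-up 2D training set: two positive and two negative points.
X = [[2.0, 0.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -2.0]]
y = [1, 1, -1, -1]

print(separates([1.0, 1.0], 0.0, X, y, strict=True))   # all margins are positive
print(separates([0.0, 1.0], 0.0, X, y))                 # some margins are exactly zero
print(separates([0.0, 1.0], 0.0, X, y, strict=True))
```

Note that the non-strict conditions are trivially satisfied by $w = 0$, $b = 0$, which is one reason the strict version is the more useful notion.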

7 Geometry
Geometrically: the hyperplane $\{x : w^T x + b = 0\}$ perfectly separates the positive and negative data points.

8 Linear algebra flashback: hyperplanes
Geometrically, a hyperplane $H = \{x : a^T x = b\}$, with (WLOG) $\|a\|_2 = 1$, is a translation of the set of vectors orthogonal to $a$. The direction of the translation is determined by $a$, and the amount by $b$.

9 Geometry (cont'd)
Assuming strict separability, we can always rescale $(w, b)$ and work with
$$y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, n.$$
This amounts to making sure that the negative (resp. positive) class is contained in the half-space $w^T x + b \le -1$ (resp. $w^T x + b \ge 1$).

The distance between the two $\pm 1$ boundaries turns out to be equal to $2/\|w\|_2$. Thus $\|w\|_2$ is a measure of how well the hyperplane separates the data: the smaller $\|w\|_2$, the wider the margin.

[Figure 1: The optimum separation hyperplane (OSH), the linear classifier with the maximum margin between the hyperplanes $H_1$ and $H_2$.]
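The $2/\|w\|_2$ formula can be checked numerically: a point on each of the two $\pm 1$ boundaries lies at distance $1/\|w\|_2$ from the separating hyperplane. The sketch below uses made-up values of $w$ and $b$, and the function names are hypothetical.

```python
import math


def margin_width(w):
    """Distance between the hyperplanes w^T x + b = +1 and w^T x + b = -1."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


def distance_to_hyperplane(w, b, x):
    """Euclidean distance from x to the hyperplane {x : w^T x + b = 0}."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return abs(sum(wj * xj for wj, xj in zip(w, x)) + b) / norm


# Illustrative values: ||w||_2 = 5.
w = [3.0, 4.0]
b = -5.0

x_plus = [2.0, 0.0]   # lies on w^T x + b = +1  (6 - 5 = 1)
x_minus = [0.0, 1.0]  # lies on w^T x + b = -1  (4 - 5 = -1)

print(margin_width(w))
print(distance_to_hyperplane(w, b, x_plus) + distance_to_hyperplane(w, b, x_minus))
```

Both printed values agree, since each boundary contributes half of the total width.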

10 Non-separable data
Separability constraints are homogeneous, so WLOG we can work with
$$y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, n.$$
If the above is infeasible, we try to minimize the slacks:
$$\min_{w, b, s} \; \sum_{i=1}^n s_i \; : \; s \ge 0, \quad y_i (w^T x_i + b) \ge 1 - s_i, \quad i = 1, \ldots, n.$$
The above can be solved as a linear programming problem (in variables $w$, $b$, $s$).

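11 Hinge loss function
The previous LP can be interpreted as minimizing the hinge loss function
$$L(w, b) := \sum_{i=1}^n \max(1 - y_i (w^T x_i + b), 0).$$
This serves as an approximation to the number of errors made on the training set. It is in fact an upper bound on that number, since every misclassified point has $y_i (w^T x_i + b) \le 0$ and therefore contributes at least 1 to the sum.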
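The link between the LP and the hinge loss is that, for fixed $(w, b)$, the smallest feasible slack for point $i$ is $s_i = \max(1 - y_i (w^T x_i + b), 0)$. The sketch below computes these slacks, their sum (the hinge loss), and the number of training errors on a made-up data set; the function names and the convention that a margin of exactly zero counts as an error are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def optimal_slacks(w, b, X, y):
    """Smallest s_i >= 0 with y_i (w^T x_i + b) >= 1 - s_i, for fixed (w, b)."""
    return [max(1 - yi * (dot(w, xi) + b), 0) for xi, yi in zip(X, y)]


def hinge_loss(w, b, X, y):
    """L(w, b) = sum_i max(1 - y_i (w^T x_i + b), 0)."""
    return sum(optimal_slacks(w, b, X, y))


def training_errors(w, b, X, y):
    """Count points with y_i (w^T x_i + b) <= 0 (zero margin counted as an error)."""
    return sum(1 for xi, yi in zip(X, y) if yi * (dot(w, xi) + b) <= 0)


# Made-up data: the last point is on the wrong side of the hyperplane.
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 0.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0

print(optimal_slacks(w, b, X, y))   # margins are 2, 0.5, 1, -0.5
print(hinge_loss(w, b, X, y))
print(training_errors(w, b, X, y))
```

The second point is classified correctly but still incurs a loss of 0.5 because it falls inside the margin, which is why the hinge loss over-counts relative to the raw error count.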

12 Outline

13 Regularization
The solution might not be unique, so we add a regularization term $\|w\|_2^2$:
$$\min_{w, b} \; \frac{1}{2} \|w\|_2^2 + C \, L(w, b),$$
where $C > 0$ controls the trade-off between the accuracy on the training set and the prediction error (more on why later). This makes the solution unique.

The above model is called the Support Vector Machine. It can be reliably solved using special fast algorithms that exploit its structure.

If $C$ is large and the data is separable, the problem reduces to
$$\min_{w, b} \; \frac{1}{2} \|w\|_2^2 \; : \; y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, n.$$
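To make the regularized objective concrete, here is a minimal sketch that minimizes it with plain subgradient descent. This is a deliberately simple stand-in and not one of the specialized fast algorithms the lecture refers to; the step size, iteration count, data set, and function names are all assumptions chosen for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def svm_objective(w, b, X, y, C):
    """(1/2) ||w||_2^2 + C * sum_i max(1 - y_i (w^T x_i + b), 0)."""
    hinge = sum(max(1 - yi * (dot(w, xi) + b), 0.0) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * hinge


def svm_subgradient_descent(X, y, C=1.0, step=0.01, iters=500):
    """Fixed-step subgradient descent on the SVM objective, starting from zero."""
    p = len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(iters):
        # Subgradient of the regularizer is w; each point with margin < 1
        # adds -C * y_i * x_i (for w) and -C * y_i (for b).
        gw, gb = list(w), 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:
                for j in range(p):
                    gw[j] -= C * yi * xi[j]
                gb -= C * yi
        w = [wj - step * gj for wj, gj in zip(w, gw)]
        b -= step * gb
    return w, b


# Made-up, linearly separable 2D data.
X = [[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]]
y = [1, 1, -1, -1]

w, b = svm_subgradient_descent(X, y, C=1.0)
print(w, b)
print(svm_objective(w, b, X, y, 1.0))
```

With a fixed step the iterates only hover near the minimizer rather than converging exactly; a decreasing step size or a dedicated solver would be the natural next refinement.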

14 interpretation
Return to separable data. The set of constraints
$$y_i (w^T x_i + b) \ge 0, \quad i = 1, \ldots, n,$$
has many possible solutions $(w, b)$. We will select a solution based on the idea of robustness (to changes in data points).

15 Maximally robust separating hyperplane
Spherical uncertainty model: assume that the data points are actually unknown, but bounded:
$$x_i \in S_i := \{\hat{x}_i + u_i : \|u_i\|_2 \le \rho\},$$
where the $\hat{x}_i$'s are known, $\rho > 0$ is a given measure of uncertainty, and $u_i$ is unknown.

Robust counterpart: we now ask that the separating hyperplane separates the spheres (and not just the points):
$$\forall x_i \in S_i : \; y_i (w^T x_i + b) \ge 0, \quad i = 1, \ldots, n.$$
For separable data we can try to separate spheres around the given points. We'll grow the spheres' radius until sphere separation becomes impossible.

16 Robust classification
We obtain the equivalent condition
$$y_i (w^T \hat{x}_i + b) \ge \rho \|w\|_2, \quad i = 1, \ldots, n.$$
Now we seek $(w, b)$ which maximize $\rho$ subject to the above. By homogeneity we can always set $\rho \|w\|_2 = 1$, so that the problem reduces to
$$\min_{w, b} \; \|w\|_2 \; : \; y_i (w^T \hat{x}_i + b) \ge 1, \quad i = 1, \ldots, n.$$
This is exactly the same problem as the SVM in the separable case.
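The equivalent condition follows from minimizing the left-hand side of the robust constraint over the sphere. A short derivation, using the Cauchy-Schwarz inequality and the fact that $y_i \in \{-1, 1\}$, and assuming $w \ne 0$ for the stated minimizer:

```latex
\begin{align*}
\min_{\|u_i\|_2 \le \rho} \; y_i \bigl( w^T (\hat{x}_i + u_i) + b \bigr)
  &= y_i (w^T \hat{x}_i + b) + \min_{\|u_i\|_2 \le \rho} \; y_i \, w^T u_i \\
  &= y_i (w^T \hat{x}_i + b) - \rho \|w\|_2 ,
\end{align*}
% the minimum is attained at u_i = -\rho \, y_i \, w / \|w\|_2.
```

Requiring this worst-case value to be nonnegative for every $i$ gives exactly $y_i (w^T \hat{x}_i + b) \ge \rho \|w\|_2$.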

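17 Dual problem
Denote by $C_+$ (resp. $C_-$) the set of points $x_i$ with $y_i = +1$ (resp. $-1$). It can be shown that the SVM problem can be expressed as:
$$\min_{x_+, x_-} \; \|x_+ - x_-\|_2 \; : \; x_+ \in \mathbf{Co}\, C_+, \quad x_- \in \mathbf{Co}\, C_-,$$
where $\mathbf{Co}\, C$ denotes the convex hull of the set $C$, that is:
$$\mathbf{Co}\, C = \left\{ \sum_{i=1}^q \lambda_i x_i \; : \; x_i \in C, \; \lambda \ge 0, \; \sum_{i=1}^q \lambda_i = 1 \right\}.$$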

18 Geometry
The dual problem amounts to finding the smallest distance between the two classes, each represented by the convex hull of its points. The optimal hyperplane sits at the middle of the line segment joining the two closest points.

19 Separating boxes instead of spheres
We can use a box uncertainty model:
$$x_i \in B_i := \{\hat{x}_i + u_i : \|u_i\|_\infty \le \rho\}.$$
This leads to
$$\min_{w, b} \; \|w\|_1 \; : \; y_i (w^T \hat{x}_i + b) \ge 1, \quad i = 1, \ldots, n.$$
Classifiers found that way tend to be sparse. In 2D, the boundary line tends to be vertical or horizontal.
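The $\ell_1$ norm appears because the worst case of $y_i (w^T x_i + b)$ over the box equals $y_i (w^T \hat{x}_i + b) - \rho \|w\|_1$. The sketch below checks this numerically by brute force over the corners of the box, which is enough since a linear function attains its minimum over a box at a corner. All names and numeric values are made up for illustration.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def worst_case_margin_brute_force(w, b, x_hat, y, rho):
    """Minimum of y (w^T x + b) over the 2^p corners of the box around x_hat."""
    margins = []
    for signs in product((-1.0, 1.0), repeat=len(w)):
        x = [xh + rho * s for xh, s in zip(x_hat, signs)]
        margins.append(y * (dot(w, x) + b))
    return min(margins)


def worst_case_margin_closed_form(w, b, x_hat, y, rho):
    """y (w^T x_hat + b) - rho * ||w||_1."""
    return y * (dot(w, x_hat) + b) - rho * sum(abs(wj) for wj in w)


# Illustrative values: nominal margin 2.25, ||w||_1 = 3.5, rho = 0.1.
w, b = [1.0, -2.0, 0.5], 0.25
x_hat, y, rho = [1.0, 0.0, 2.0], 1, 0.1

print(worst_case_margin_brute_force(w, b, x_hat, y, rho))
print(worst_case_margin_closed_form(w, b, x_hat, y, rho))
```

The two values agree up to floating-point rounding. Setting $\rho \|w\|_1 = 1$ by homogeneity, as in the spherical case, then gives the $\ell_1$-norm problem above.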
