Instructor: Dr. Benjamin Thompson Lecture 20: 31 March 2009

1 Instructor: Dr. Benjamin Thompson Lecture 20: 31 March 2009

2 Announcements
MIDTERMS ARE GRADED! Yay! I will return them AFTER the lecture, so that you pay attention to the lecture. I'm sadistic like that.
Some statistics:
- Mean score: 134 (out of 160 total points)
- Median:
- Standard deviation: 36.3 points
- Skew:
- Kurtosis:
Oh, and: NO CLASS ON THURSDAY

3 Last Time
You were taking the midterm, remember? I hope you do. Maybe you've blocked it out. Painful memories are like that.

4 Today
Chapter 6: Support Vector Machines
- Motivation: Optimal Separation of Classes
- SVMs and You
- Constrained Optimization
- The Method of Lagrange Multipliers
- The Optimal Hyperplane for Nonseparable Patterns
Midterm Debriefing

5

6 After 11 a.m. versus never is a pretty optimal separation of classes for many of us

7 Rosenblatt Refresher
Recall that, for the Rosenblatt Perceptron, the goal was linear hyperplane separation of two classes.
The RP was a "good enough" approach:
- The learning process terminated once all training patterns were successfully classified
- Many possible solutions (see next slide)
- No measure of how one hyperplane is better/worse than any other, so long as all patterns are properly classified
In other words: the RP is a sufficient classifier, but not an optimal classifier.

8 Non-Uniqueness of the RP Hyperplane
How much do these points contribute to the overall position/orientation of the hyperplane? How about these? That's a hint about where we're going...

9 So You Want Optimality?
So we want a hyperplane that is somehow better than all the other possible hyperplanes. Subject to it correctly classifying all the points, of course.
"Better" is a loaded term. Any suggestions?
We'll need a few tools to develop a clear idea of the "best" hyperplane.

10 Our Dear Old Friend
Recall the definition of the separating (linear) hyperplane: $w \cdot x + b = 0$
Now, let's assume we only have two classes we're interested in separating, $C_1$ and $C_2$. Further, let's assume fixed outputs for each class:
- Each member of class $C_1$ should yield a +1
- Each member of class $C_2$ should yield a -1
Putting those together, we have:
$w \cdot x_i + b \geq 0$ for $d_i = +1$
$w \cdot x_i + b < 0$ for $d_i = -1$
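
A minimal numeric illustration of that decision rule (a sketch in NumPy; the weight vector, bias, and points are made-up values, not from the lecture):

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = 0.5                     # hypothetical bias

def classify(x):
    """Return +1 (class C1) or -1 (class C2) per the sign convention above."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([1.0, 1.0])))    # w.x + b = 1.5  -> +1 (C1)
print(classify(np.array([-1.0, 1.0])))   # w.x + b = -2.5 -> -1 (C2)
```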

11 On the Margin
Consider a point $x_j$ very close to the hyperplane. In this case, $w \cdot x_j + b \approx 0$.
Suppose the approximation is exact: that is, suppose the point yields an output that is exactly zero. This will classify as class $C_1$, but only because we defined it that way.
Now extend that to something that is only slightly non-zero. It's clear from the hyperplane equation that a correspondingly small change in $b$ or $w$ would fudge that result as well.
In other words: points very close to the hyperplane may result in confusion if there is any noise in the system.
So: if points close to the decision boundary cause us problems, we should construct the hyperplane so that all the points in either class are as far away from the hyperplane as possible!

12 A Cautionary Tale.

13 Who Knew?
The function defined by $g(x_j) = w \cdot x_j + b$ is known as the discriminant function, and is a measure of the algebraic distance from the point $x_j$ to the hyperplane defined by $w$.
To understand this, first we need to take a look at some arbitrary vector $x$ and break it up into two components:
- $x_p$, the normal projection of $x$ onto the hyperplane defined by $w$
- $x_r$, the component of $x$ perpendicular to the hyperplane (i.e., along $w$), starting from $x_p$

14 In Pictures
This is easiest to picture when $b = 0$. [Figure: the hyperplane $w \cdot x = 0$ with a point $x_j$, its projection $x_{j,p}$ onto the hyperplane, and the perpendicular component $x_{j,r}$.]
So it should be clear that $x_r$ is somehow a measure of what we're looking for: or, more specifically, the length of $x_r$!

15 In Pictures
But this changes in non-obvious ways when $b \neq 0$. [Figure: the shifted hyperplane $w \cdot x + b = 0$ with $x_j$, $x_{j,p}$, and $x_{j,r}$.]
Note: since I just shifted everything up by some value $b$, $x_p$ changed, but $x_r$ did not!

16 Insight Into the Inner Product
Recall that $w \cdot x$ is an inner product between the two vectors $w$ and $x$.
Suppose that one of the vectors (let's assume $w$) has unit length. That is, $\|w\| = 1$. This is known as a unit vector.
When this condition holds, the inner product of $x$ with that unit vector tells you exactly how much of $x$ lies along the direction of $w$.
Note: if neither vector is a unit vector, you can make it a unit vector by setting: $w_u = \frac{w}{\|w\|}$
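
A quick sketch of that projection property in NumPy (the vectors here are arbitrary examples):

```python
import numpy as np

w = np.array([3.0, 4.0])
x = np.array([2.0, 1.0])

w_u = w / np.linalg.norm(w)   # unit vector: ||w_u|| = 1
along_w = np.dot(x, w_u)      # how much of x lies along the direction of w

print(np.linalg.norm(w_u))    # 1.0
print(along_w)                # (2*3 + 1*4) / 5 = 2.0
```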

17 A Little Math Never Hurt Anybody
Because $x_p$ lies on the hyperplane defined by $w$, $g(x_p)$ is, by definition, zero.
Additionally, the direction of a vector normal to (perpendicular to) some plane defined by $w$ is just $w$!
So we may rewrite an arbitrary vector $x$ as $x = x_p + x_r$, and we may rewrite $x_r$ as some vector of length $r$ pointing in the direction of $w$, or:
$x = x_p + r \frac{w}{\|w\|}$
Remember, the direction of $w$ is perpendicular to the plane defined by $w$.

18 A Little More Math Never Hurt Anybody, Except Small Woodland Creatures and the Very Young
Just for laughs, let's evaluate $g(x)$ using $x = x_p + r\frac{w}{\|w\|}$ and $g(x_p) = 0$:
$g(x) = w \cdot x + b = w \cdot x_p + b + r\frac{w \cdot w}{\|w\|} = g(x_p) + r\|w\| = r\|w\|$
which just simplifies to:
$r = \frac{g(x)}{\|w\|}$
In other words: the perpendicular distance from the hyperplane defined by $w$ to any point $x$ is just the discriminant function $g(x)$ divided by the norm of the weight vector $w$!
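
A small numeric check of that distance formula (NumPy; $w$, $b$, and $x$ are invented values):

```python
import numpy as np

w = np.array([3.0, 4.0])
b = -5.0
x = np.array([2.0, 3.0])

g = np.dot(w, x) + b               # discriminant function g(x)
r = g / np.linalg.norm(w)          # signed perpendicular distance to the hyperplane

# Sanity check: stepping back by r along w/||w|| should land on the hyperplane
x_p = x - r * w / np.linalg.norm(w)
print(r)                           # (6 + 12 - 5) / 5 = 2.6
print(np.dot(w, x_p) + b)          # ~0.0
```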

19 At Last, a Goal!
So now the goal becomes clear: we want to find a hyperplane defined by some $w$ and $b$ that maximizes this value $r$, thus maximizing the margin of separation $\rho_o$, over all the patterns to be classified!
Note: $\rho_o$ is actually $2r$, since $\rho_o$ is defined as the worst-case minimum distance between any two data points from different classes.
In other words, suppose we have some $\{x_j\}$ that are closest to the hyperplane defined by $w$: we want to find the $w$ and $b$ that maximize the distance $\rho_o$ for the minimally separated patterns $\{x_j\}$.
From now on, we will refer to these best parameters as $\{w_o, b_o\}$.

20 A Bit of a Trick
We may cleverly (or cheaply, depending on your mood) phrase the conditions of the previous slide as a pair of constraints on our choice of $\{w_o, b_o\}$:
$w_o \cdot x_i + b_o \geq +1$ for $d_i = +1$
$w_o \cdot x_i + b_o \leq -1$ for $d_i = -1$
Some discussion of these equations on the next slide.
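
Note that the pair collapses into the single condition $d_i (w_o \cdot x_i + b_o) \geq 1$. A minimal vectorized check of that combined form (NumPy; the toy data set is invented):

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0
X = np.array([[3.0, 2.0],    # class +1
              [4.0, 1.0],    # class +1
              [1.0, 1.0],    # class -1
              [0.0, 2.0]])   # class -1
d = np.array([1, 1, -1, -1])

margins = d * (X @ w + b)     # d_i * (w.x_i + b) for every pattern
print(margins)                # [2. 2. 1. 1.]
print(np.all(margins >= 1))   # True: both constraints hold for all patterns
```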

21 Remember When?
Recall that the hyperplane of separation is defined by the equation $w_o \cdot x + b_o = 0$.
So if we multiply both sides of that equation by an arbitrary constant, the dividing plane doesn't change at all, while $w_o$ and $b_o$ are both scaled by that constant.
- Yes, this does imply that the optimal solution is still non-unique
- But the separating plane defined by that optimal solution is unchanged!
So the constraints are really saying: we require that our classifier always produce some positive constant (or greater) for Class 1 inputs, and some negative constant (or less) for Class 2 inputs. Then we just scale the optimal parameters so that the positive and negative constants become exactly +1 and -1.

22 A Veritable Dichotomy!
Given the constraints
$w_o \cdot x_i + b_o \geq +1$ for $d_i = +1$
$w_o \cdot x_i + b_o \leq -1$ for $d_i = -1$
there are two types of vectors in our data set $\{x_i, d_i\}$:
- Those that fulfill the strict inequality
- Those for which the equality holds
The vectors that fulfill the equality above are those that lie closest to the separating boundary. Indeed, these are the points that define the boundary, given the above constraints.

23 Support Vectors, Ahoy!
As our previous demonstration hints at, the vectors that fulfill the strict inequality are irrelevant! All that matters (to define our boundary) is the vectors that fit the exact equality condition:
$w_o \cdot x^{(s)} + b_o = +1$ for $d^{(s)} = +1$
$w_o \cdot x^{(s)} + b_o = -1$ for $d^{(s)} = -1$
The data points for which this holds are known as support vectors and are denoted as $x^{(s)}$.

24 Immediate Ramifications
Because of the equality in the constraint, for any support vector our discriminant function becomes:
$g(x^{(s)}) = w_o \cdot x^{(s)} + b_o = \pm 1$
so we may write:
$r = \frac{g(x^{(s)})}{\|w_o\|} = \pm\frac{1}{\|w_o\|}$
and therefore
$\rho_o = 2r = \frac{2}{\|w_o\|}$
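
To see support vectors and this margin come out of an off-the-shelf solver, here is a sketch using scikit-learn's SVC (not part of the lecture; a large C value approximates the hard-margin problem, and the toy data is invented):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 2.0], [4.0, 1.0], [1.0, 1.0], [0.0, 2.0]])
d = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, d)   # large C ~ hard margin
w_o, b_o = clf.coef_[0], clf.intercept_[0]

print(clf.support_vectors_)       # the x^(s) lying on the margin
print(2 / np.linalg.norm(w_o))    # rho_o = 2 / ||w_o||
```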

25 The Consequence!
Remember, the goal was to maximize the margin of separation,
$\rho_o = \frac{2}{\|w_o\|}$
Clearly, we can maximize $\rho_o$ by minimizing the norm of the parameter vector $w_o$!
Of course, this must be subject to the constraints
$w_o \cdot x^{(s)} + b_o = +1$ for $d^{(s)} = +1$
$w_o \cdot x^{(s)} + b_o = -1$ for $d^{(s)} = -1$
which of course leads us to...

26 Like unconstrained optimization, only less prone to violent outbursts.

27 Optimization We've Seen
So far, the only optimization problems we've seen have been unconstrained optimization problems. That is: given some objective function $E(w)$, find the minimum (or maximum) of that function with respect to every possible choice of $w$.
Sometimes, the choice of $w$ which truly optimizes the objective function may not be a useful/realistic/allowable value. That is, there are constraints on the choices of $w$ that we may use to optimize our cost function.

28 Optimization with Constraints
Simple example: minimize the cost function
$E(w) = w \cdot w$
subject to the constraint $w \cdot \alpha \geq \beta$.
Here, the unconstrained optimization of $E(w)$ is trivial: this is just a scaled $\|w\|^2$, which takes a minimum value of zero at $w = 0$. However, the boundary of the constraint, $w \cdot \alpha = \beta$, defines a plane (this time parameterized by the vector $\alpha$ and scalar $\beta$).
Next slide: visualizing this for a 2-D parameter vector.
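
A hedged sketch of this toy problem with SciPy ($\alpha$ and $\beta$ are arbitrary example values; for this quadratic cost with one active linear constraint, the analytic answer is the projection $w = \beta \alpha / \|\alpha\|^2$):

```python
import numpy as np
from scipy.optimize import minimize

alpha, beta = np.array([1.0, 2.0]), 4.0

E = lambda w: np.dot(w, w)    # cost function E(w) = w.w
cons = {"type": "ineq",       # w.alpha - beta >= 0, i.e. w.alpha >= beta
        "fun": lambda w: np.dot(w, alpha) - beta}

res = minimize(E, x0=np.zeros(2), constraints=[cons])
print(res.x)                                # ~[0.8, 1.6]
print(beta * alpha / np.dot(alpha, alpha))  # analytic solution: [0.8, 1.6]
```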

29 Constrained Optimization in Pictures
- Where is the solution if it is a minimization problem subject to a "greater than" constraint?
- Where is the solution if it is a minimization problem subject to a "less than" constraint?
- Where is the solution if it is a maximization problem subject to a "greater than" constraint?
- Where is the solution if it is a maximization problem subject to a "less than" constraint?
