CSE 417T: Introduction to Machine Learning. Lecture 22: The Kernel Trick. Henry Chai 11/15/18

Size: px

Start display at page:

Download "CSE 417T: Introduction to Machine Learning. Lecture 22: The Kernel Trick. Henry Chai 11/15/18"

Barrie Hudson
5 years ago
Views:

1 CSE 417T: Introduction to Machine Learning Lecture 22: The Kernel Trick Henry Chai 11/15/18

2 Linearly Inseparable Data What can we do if the data is not linearly separable? Accept some non-zero in-sample error How much in-sample error should we tolerate? Apply a non-linear transformation that shifts the data into a space where it is linearly separable How can we pick a non-linear transformation? 2

3 minimize 1 2,-, subject to * +, , , * +! Hard-Margin SVMs When! is not linearly separable, there are no feasible solutions to this optimization problem 3

4 minimize 1 2 )* ) + K 2 5 "34 subject to ( " ) * + " + ) - 1! " + ", ( " B! " subject to! " 0 _ # 1,, E Soft-Margin SVMs! " is the soft error on the # $% training 5 2 "34 If! " > 1, then ( " ) * + " + ) - < 0 + ", ( " is incorrectly classified If 0 <! " < 1, then ( " ) * + " + ) - > 0 + ", ( " is correctly classified but inside the margin! " is the soft in-sample error 4

5 minimize 1 2 CA C subject to / * C * + C E *, / * F Primal Primal-Dual Optimization minimize 1 2 ) *+, subject to ). * / * = 0 - *+, A ). *.? / * )?+, *+, subject to. * 0 4 1,, 9. * Dual 5

6 Primal Directly returns the hyperplane,! ",! Support vectors are all % &, ' & ) s.t. ' &! * % & +! " = 1 Primal-Dual Optimization Dual Returns the vector, / 4! = 0 / 1 ' 1 % 1 123! " = ' &! * % & where / & > 0 Support vectors are all % &, ' & ) s.t. / 1 > 0 6

7 Primal 5 0 = sign Primal-Dual Optimization Dual 5 0 = sign = sign & ' ) *, -. ' / ' 0 '

8 minimize 1 2 CA C subject to / * C * + C E *, / * F Primal Primal-Dual Optimization minimize 1 2 ) *+, subject to ). * / * = 0 - *+, A ). *.? / * )?+, *+, subject to. * 0 4 1,, 9. * Dual 8

9 minimize 1 2 +, + + B C F *DE subject to ) * +, - * + + / 1 3 * - *, ) * 7 subject to 3 * 0 _ : 1,, < 3 * Soft-Margin SVMs 9

10 Primal-Dual Soft-Margin SVMs minimize 1 2 +, + + B C F *DE subject to ) * +, - * + + / 1 3 * - *, ) * 7 subject to 3 * 0 _ : 1,, < minimize 1 2 C *DE subject to C G * ) * = 0 F *DE 3 * F F F, C G * G J ) * ) J - * -J C JDE *DE 0 G * B : 1,, < G * Primal Dual 10

11 minimize 1 2 '- ' + / ) 1 subject to 1 ) 1 > 1 ' -? 1 + ' ( 0? 1, > 1 D Soft-Margin SVMs subject to ) 1 0 _ F 1,, H maximize O, & 0 minimize ', ' (, ) " $, &, ', ' (, ) = 1 2 '- ' + / 0 4 " O, &, ', ' (, ) " $, &, ', ' (, ) + 0 O 1 1 ) 1 > 1 ' -? 1 + ' ( 0 & 1 ) )

12 Minimizing the Lagrangian minimize &, & (, * +,, -, &, & (, * +,, -, &, & (, * = 1 2 &1 & ,, -, &, & (, * * 5 ; 5 & 1 < 5 + & ( 4-5 * 5 =+,, -, &, & (, * =& * 5 = & ; 5 < 5 & = ; 5 < =+,, -, &, & (, * =& ( = ; ; 5 = =+,, -, &, & (, * =* 5 = = B

13 Minimizing the Lagrangian (! = $ ) % * % + % ( %&' $ ) % * % = 0 %&' - % =. ) % 1 ( ( 2 4, -,!,! 6, 7 = 1 2! :! +. $ 7 % $ - % 7 % %&' %&' ( 2 4, -,!,! 6, 7 + $ ) % 1 7 % * %! : + % +! 6 %&' ( ( 2 4, -,!,! 6, 7 = 1 2! :! +. $ 7 % $ (. ) % )7 % %&' %&' ( 2 4, -,!,! 6, 7 + $ ) % 1 7 % * %! : + % +! 6 %&' ( 2 4, -,!,! 6, 7 = 1 2! :! + $ ) % 1 * %! : + % +! 6 %&' 13

14 Maximizing the Minimum maximize 1 2 ) *+, subject to ). * / * = 0 - *+, - - F ). *. D / * / D E * ED + ) D+, *+, subject to. * 0 _ 5 1,, : subject to ; * 0 5 1,, : subject to ; * = <. * 5 1,, : subject to ). * / * = 0 - *+, - maximize 1 2 ) *+, subject to 0. * < 5 1,, : - F ). *. D / * / D E * ED + ) D+, - - *+,. *. * 14

15 Primal Directly returns the hyperplane,! ",! Support vectors are all % &, ' & ) s.t. ' &! * % & +! " = 1 Primal-Dual Soft-Margin SVMs Dual Returns the vector, / 4! = 0 / 1 ' 1 % 1 123! " =??? Support vectors are all % &, ' & ) s.t. / 1 > 0 15

16 minimize 1 2 /0 / + C D G -EF, - subject to 1, -. - / / ,. - 9 Complementary Slackness subject to, - 0 _ ; 1,, = maximize K, L 0 N O, L, /, / 3,, minimize /, / 3,, N K, L, /, / 3,, G = 1 2 /0 / + C D G -EF N O, L, /, / 3,, + D K - 1, -. - / / 3 -EF, - G D L -, - -EF 16

17 minimize 1 2 () ( + 4 J M "KL & " subject to 1 & " ' " ( ) * " + (, 0 * ", ' " 2 Complementary Slackness subject to & " 0 _ B 1,, D maximize!, 3 0 minimize (, (,, & R!, 3, (, (,, & Theorem:! " 1 & " ' " ( ) * " + (, = 0 * ", ' " 2 and 3 " & " = 4! " & " = 0 * ", ' " 2 If 0 <! 6, then 1 & 6 ' 6 ( ) * 6 + (, = 0 If! 6 < 4, then & 6 = 0 If 0 <! 6 < 4, then 1 ' 6 ( ) * 6 + (, = 0 17

18 minimize 1 2 () ( + 4 J M "KL & " subject to 1 & " ' " ( ) * " + (, 0 * ", ' " 2 Complementary Slackness subject to & " 0 _ B 1,, D maximize!, 3 0 minimize (, (,, & R!, 3, (, (,, & Theorem:! " 1 & " ' " ( ) * " + (, = 0 * ", ' " 2 and 3 " & " = 4! " & " = 0 * ", ' " 2 If 0 <! 6, then 1 & 6 ' 6 ( ) * 6 + (, = 0 If! 6 < 4, then & 6 = 0 If 0 <! 6 < 4, then (, = ' 6 ( ) * 6 18

19 Primal-Dual Soft-Margin SVMs Primal Directly returns the hyperplane,! ",! Support vectors are all % &, ' & ) s.t. ' &! * % & +! " = 1 Dual Returns the vector, / 4! = 0 / 1 ' 1 % 1 123! " = ' &! * % & where 0 < / & < 8 Support vectors are all % &, ' & ) s.t. 0 < / & < 8 If / & = 8, then 9 & > 0 % &, ' & is inside the margin or misclassified 19

20 Decide on a transformation Φ: # % Recall: Nonlinear Transforms Given & = ( ), + ),, ( -, + -, learn a hypothesis,./ 1, using 2& = Φ ( ) = 1 ), + ),, Φ ( - = 1 -, + - Return the corresponding predictor in the original space: / ( =./ Φ ( 20

21 Decide on a transformation Φ: # % Find a maximal-margin separating hyperplane in the transformed space, &', &' *, by solving the QP: Nonlinear SVMs minimize 1 2 &'2 &' subject to : ; &' 2 Φ < ; + >' * 1 < ;, : ; B Return the corresponding predictor in the original space: C < = sign &' 2 Φ < + &' * 21

22 Decide on a transformation Φ: # % Nonlinear Dual SVMs Find a maximal-margin separating hyperplane in the transformed space, &', &' *, by solving the QP: 6 minimize subject to : 3 = 0 subject to : 3 : 7 Φ ; < 3 Φ ; I 1,, L Return the corresponding predictor in the original space: M ; = sign : 3 Φ ; < 3 Φ ; + &' * 3 Q R S * 6 22

23 Perceptrons Low-Dimensional Input Space High-Dimensional Input Space! "# High Low Generalization Good Bad SVMs Low-Dimensional Input Space High-Dimensional Input Space! "# High Low Generalization Good Okay $ %& H = ) + 1 vs. $ %& H / min ),

24 Depending on the transformation Φ and the dimensionality of the original input space ", computing Φ $ can be computationally expensive Computing Φ % $ = $ ', $ %,, $ *, $ ' %, $ ' $ %,, $ * % Efficiency requires +, =, + * % +, = *. /0* % Computing Φ '2 $ requires 1, '2 time = 1, % time Tradeoff: High-dimensional transformations can result in good hypotheses (as long as they don t overfit) High-dimensional transformations are expensive 24

25 Nonlinear Dual SVMs Insight: the transformation Φ only appears as the inner product between two transformed points Find a maximal-margin separating hyperplane in the transformed space, "#, "# &, by solving the QP: minimize 1 2. /01 subject to. subject to 2 / / 6 / = 0 45 / / 6 3 Φ 7 / 8 Φ / 0 E 1,, H 2 /01 Return the corresponding predictor in the original space: 45 / I 7 = sign. / M N O & 45 / 6 / Φ 7 / 8 Φ 7 + "# & 25

26 Nonlinear Dual SVMs Insight: the transformation Φ only appears as the inner product between two transformed points Find a maximal-margin separating hyperplane in the transformed space, "#, "# &, by solving the QP: minimize 1 2. /01 subject to. subject to 2 / / 6 / = 0 45 / / 6 3 Φ 7 / 8 Φ / 0 E 1,, H 2 /01 Return the corresponding predictor in the original space: 45 / I 7 = sign. / M N O & 45 / 6 / Φ 7 / 8 Φ 7 + "# & 26

27 Approach: instead of computing Φ #, find a function $ % s.t. $ % #, # ' = Φ # ) Φ # ' #, # ', $ % #, # ' should be cheaper to compute than Φ # The Kernel Trick Example: ' Φ - # = #., # -,, # 0, # - -., 2#. # -,, 2# 02. # 0, # ' Φ - # ) ' Φ - # ' = 3 # 4 # ' # - '- 4 # # 4 # ' ' 4 # 7 # ' Φ - # ) ' Φ - # ' = 3 # 4 # ' ' # 4 # 4 = # ) # ' + # ) # ' $ %9 : #, # ' = # ) # ' + # ) # '

28 Approach: instead of computing Φ #, find a function $ % s.t. $ % #, # ' = Φ # ) Φ # ' #, # ', $ % #, # ' should be cheaper to compute than Φ # The Kernel Trick Example: ' Φ - # = #., # -,, # 0, # - -., 2#. # -,, 2# 02. # 0, # 0 ' Φ - # ) ' Φ - # ' = # ) # ' + # ) # ' - = $ %4 5 #, # ' ' Computing Φ - # ) ' Φ - # ' requires time whereas computing $ 5 %4 #, # ' only takes

29 ! "# $ &, & ( = & * & ( + & * & (, Implied feature transformation: Φ, ( & = &., &,,, & 0, &.,, 2&. &,,, 2& 02. & 0, & 0, Implied dimensionality: 34 = 0# 560, Common Kernels! "# 7,8 &, & ( = 9 + : & * & (, 9, <,= Implied feature transformation: Φ, & = 29:&.,, 29:& 0, :&,,., :&. &,,, :& 0 Implied dimensionality: 34 = 0# 560, : and 9 affect the geometry of the transform: changing them changes the support vectors / decision boundary Set using validation 29

30 Polynomial Kernel:! "# $,& (, ( ) = (. ( ) / + / Common Kernels Implied dimensionality: 12 = 3 2 / - and + affect the geometry of the transform: changing them changes the support vectors / decision boundary Set using validation Gaussian-RBF Kernel:! "4 (, ( ) = : :4 Implied feature transformation: Φ < ( = = : 5 67? :4,, 5 67 A : :4, : 5 67? :4 B? : : :4 C!<?,, 567A B A : : :4 C!<?, 567? B : : :? :4 E!< :,, 567A B A : : E!< :, > 30

31 Polynomial Kernel:! "# $,& (, ( ) = (. ( ) / + / Common Kernels Implied dimensionality: 12 = 3 2 / - and + affect the geometry of the transform: changing them changes the support vectors / decision boundary Set using validation Gaussian-RBF Kernel:! "4 (, ( ) = : :4 Implied feature transformation: Φ < ( = : 5 67 = :4 > =? : : 567C > =? Implied dimensionality: 12 =! Set I using validation : E N 31

32 Any function! is a valid kernel if and only if: a transformation Φ s.t.! %, % ' = Φ % ) Φ % ' %, % ' Valid Kernels the Gram matrix! % -, % -! % -, %.! % -, % 0 Κ =! %., % -! %., %.! %., % 0! % 0, % -! % 0, %.! % 0, % 0 is positive semi-definite sets % -, %.,, % 0 32

33 Decide on a (valid) kernel function! " Nonlinear Dual SVMs Find a maximal-margin separating hyperplane in the transformed space, #$, #$ ', by solving the QP: 3 minimize 1 2 / subject to / = 0 subject to / ! " 8 0, 8 4 / E 1,, H Return the corresponding predictor in the original space: 3 I 8 = sign / 0 M N O ' ! " 8 0, 8 + #$ ' 33

34 Gaussian- RBF Kernel 2

35 Gaussian- RBF Kernel 35

36 Decide on a (valid) kernel function! " Nonlinear Soft-Margin Dual SVMs Find a maximal-margin separating hyperplane in the transformed space, #$, #$ ', by solving the QP: 3 minimize 1 2 / subject to / = 0 subject to / ! " 8 0, 8 4 / D F 1,, I Return the corresponding predictor in the original space: 3 J 8 = sign / 0 N O P ' ! " 8 0, 8 + #$ ' 36

37 Smaller $ Larger $ Hard Margin 2 "# -Degree Polynomial Kernel $ is a tradeoff parameter (much like the tradeoff parameter in regularization) Use validation! 37

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from

LECTURE 5: DUAL PROBLEMS AND KERNELS * Most of the slides in this lecture are from http://www.robots.ox.ac.uk/~az/lectures/ml Optimization Loss function Loss functions SVM review PRIMAL-DUAL PROBLEM Max-min