CSE 417T: Introduction to Machine Learning. Lecture 6: Bias-Variance Trade-off. Henry Chai 09/13/18

1 CSE 417T: Introduction to Machine Learning Lecture 6: Bias-Variance Trade-off Henry Chai 09/13/18

2 Recall
Let $B(N, k)$ = the maximum number of dichotomies on $N$ points s.t. no subset of $k$ points is shattered.
If $k$ is a break point for $\mathcal{H}$, then $m_{\mathcal{H}}(N) \leq B(N, k)$.
If $B(N, k)$ is bounded by a polynomial in $N$ and $m_{\mathcal{H}}(N)$ is bounded by $B(N, k)$, then $m_{\mathcal{H}}(N)$ is bounded by a polynomial in $N$.

3 Recall
$B(N, k) \leq \sum_{i=0}^{k-1} \binom{N}{i} = \sum_{i=0}^{k-1} \frac{N!}{(N-i)!\, i!} \leq N^{k-1} + 1$

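4 Bounding $m_{\mathcal{H}}(N)$
If $k$ is a break point for $\mathcal{H}$, then $m_{\mathcal{H}}(N) \leq B(N, k)$.
$B(N, k) \leq N^{k-1} + 1$
If $k$ is a break point for $\mathcal{H}$, then $m_{\mathcal{H}}(N) \leq N^{k-1} + 1$.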
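The two bounds recalled above can be compared numerically. The following is a minimal sketch; the function names `binomial_sum` and `polynomial_bound` are illustrative and do not come from the lecture.

```python
from math import comb


def binomial_sum(n: int, k: int) -> int:
    """Sum of C(n, i) for i = 0..k-1, the bound on B(n, k)."""
    return sum(comb(n, i) for i in range(k))


def polynomial_bound(n: int, k: int) -> int:
    """The looser polynomial bound n^(k-1) + 1."""
    return n ** (k - 1) + 1


# Spot-check that the binomial sum never exceeds the polynomial bound.
for n in (1, 5, 10, 50):
    for k in (2, 3, 4):
        assert binomial_sum(n, k) <= polynomial_bound(n, k)

print(binomial_sum(10, 3), polynomial_bound(10, 3))
```

The polynomial bound is much looser than the binomial sum for larger $k$, but being a polynomial in $N$ is all the argument needs.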

5-9 Growth Function: Examples
For 1D positive rays, $k = 2$ is a break point and $m_{\mathcal{H}}(N) = N + 1 \leq N^{k-1} + 1 = N + 1$.
For 1D positive intervals, $k = 3$ is a break point and $m_{\mathcal{H}}(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1 \leq N^{k-1} + 1 = N^2 + 1$.
For 2D linear separators, $k = 4$ is a break point, so $m_{\mathcal{H}}(3) = 8 \leq 3^3 + 1 = 28$ and $m_{\mathcal{H}}(4) = 14 \leq 4^3 + 1 = 65$.
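The growth functions for positive rays and positive intervals can be checked by brute-force enumeration on a small point set. This is a sketch under the assumption that the points are distinct; the helper names are hypothetical, and one threshold per gap between sorted points is enough to generate every dichotomy.

```python
def cut_points(points):
    """One threshold per gap: left of all points, between neighbours, right of all."""
    xs = sorted(points)
    return [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]


def ray_dichotomies(points):
    """Labelings produced by positive rays: h(x) = +1 if x > a, else -1."""
    return {tuple(1 if x > a else -1 for x in points) for a in cut_points(points)}


def interval_dichotomies(points):
    """Labelings produced by positive intervals: h(x) = +1 if lo < x < hi, else -1."""
    cuts = cut_points(points)
    return {
        tuple(1 if lo < x < hi else -1 for x in points)
        for i, lo in enumerate(cuts)
        for hi in cuts[i:]  # hi == lo gives the empty interval (all -1)
    }


points = [0.5, 1.7, 3.2, 4.0]
n = len(points)
print(len(ray_dichotomies(points)), n + 1)
print(len(interval_dichotomies(points)), n * (n + 1) // 2 + 1)
```

Each printed pair compares the enumerated count with the closed-form growth function for the same $N$.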

10-11 VC-Dimension
$d_{vc}(\mathcal{H})$ = the largest value of $N$ s.t. $m_{\mathcal{H}}(N) = 2^N$
The VC-dimension is the greatest number of points that can be shattered by $\mathcal{H}$.
If $k$ is the smallest break point for $\mathcal{H}$, then $d_{vc}(\mathcal{H}) = k - 1$.
$m_{\mathcal{H}}(N) \leq N^{d_{vc}} + 1 = O(N^{d_{vc}})$

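12 Sample Complexity
How many samples do we need in our training data to say that the generalization error is less than $\epsilon$ with probability at least $1 - \delta$?
Set $\sqrt{\frac{8}{N} \log \frac{4\left((2N)^{d_{vc}} + 1\right)}{\delta}} \leq \epsilon$
Conclude that we need $N \geq \frac{8}{\epsilon^2} \log \frac{4\left((2N)^{d_{vc}} + 1\right)}{\delta}$
Practical rule of thumb: $N \geq 10\, d_{vc}$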
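Because $N$ appears on both sides of the sample-complexity condition, it is usually solved by fixed-point iteration. The sketch below assumes the natural logarithm and the polynomial bound $(2N)^{d_{vc}} + 1$ on the growth function; the function name, starting guess, and tolerance are illustrative choices, not part of the lecture.

```python
import math


def sample_complexity(d_vc, epsilon, delta, n_start=1000.0, tol=1e-6, max_iter=1000):
    """Iterate N <- (8 / eps^2) * ln(4 * ((2N)^d_vc + 1) / delta) until it stabilises."""
    n = n_start
    for _ in range(max_iter):
        n_next = 8 / epsilon**2 * math.log(4 * ((2 * n) ** d_vc + 1) / delta)
        if abs(n_next - n) < tol * n:
            return math.ceil(n_next)
        n = n_next
    return math.ceil(n)


# Example: d_vc = 3, generalization error at most 0.1 with probability at least 0.9.
print(sample_complexity(d_vc=3, epsilon=0.1, delta=0.1))
```

The result is far larger than the practical rule of thumb would suggest, which reflects how loose the bound is.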

13 Penalty for Model Complexity
Given $N$ samples, how good can we say our learned hypothesis will do with confidence at least $1 - \delta$?
Conclude that $E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{8}{N} \log \frac{4\left((2N)^{d_{vc}} + 1\right)}{\delta}}$
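The penalty term can be evaluated directly to see how it behaves as $N$ and $d_{vc}$ change. This is a small sketch assuming the natural logarithm; `vc_penalty` is a hypothetical name.

```python
import math


def vc_penalty(n, d_vc, delta):
    """sqrt((8 / n) * ln(4 * ((2n)^d_vc + 1) / delta)), the model-complexity penalty."""
    return math.sqrt(8 / n * math.log(4 * ((2 * n) ** d_vc + 1) / delta))


# The penalty shrinks with more data and grows with a richer hypothesis set.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(vc_penalty(n, d_vc=3, delta=0.1), 3))
```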

14-16 Approximation-Generalization Tradeoff
$E_{out}(g) \leq E_{in}(g) + O\left(\sqrt{\frac{d_{vc} \log N}{N}}\right)$
$E_{in}(g)$: How well does $g$ approximate $f$? Decreases as $d_{vc}$ increases.
$O\left(\sqrt{\frac{d_{vc} \log N}{N}}\right)$: How well does $g$ generalize? Increases as $d_{vc}$ increases.

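17 Bias-Variance Tradeoff
Regression with squared error: $\mathcal{Y} = \mathbb{R}$ and the error of $h$ at $\mathbf{x}$ is $\left(h(\mathbf{x}) - f(\mathbf{x})\right)^2$
$g^{(\mathcal{D})} \in \mathcal{H}$ = the hypothesis returned by $\mathcal{A}$ when the input training data is $\mathcal{D}$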

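18-19 Bias-Variance Tradeoff
$E_{out}(g^{(\mathcal{D})}) = \mathbb{E}_{\mathbf{x} \sim P}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]$
$\mathbb{E}_{\mathcal{D}}\left[E_{out}(g^{(\mathcal{D})})\right] = \mathbb{E}_{\mathcal{D}}\left[\mathbb{E}_{\mathbf{x}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]\right]$
$= \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]\right]$
$= \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})^2 - 2 g^{(\mathcal{D})}(\mathbf{x}) f(\mathbf{x}) + f(\mathbf{x})^2\right]\right]$
$= \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})^2\right] - 2 \bar{g}(\mathbf{x}) f(\mathbf{x}) + f(\mathbf{x})^2\right]$
where $\bar{g}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})\right] \approx \frac{1}{K} \sum_{k=1}^{K} g^{(\mathcal{D}_k)}(\mathbf{x})$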

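20 Bias-Variance Tradeoff
$\mathbb{E}_{\mathcal{D}}\left[E_{out}(g^{(\mathcal{D})})\right] = \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})^2\right] - 2 \bar{g}(\mathbf{x}) f(\mathbf{x}) + f(\mathbf{x})^2\right]$
$= \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})^2\right] - \bar{g}(\mathbf{x})^2 + \bar{g}(\mathbf{x})^2 - 2 \bar{g}(\mathbf{x}) f(\mathbf{x}) + f(\mathbf{x})^2\right]$
$= \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)^2\right] + \left(\bar{g}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]$
$= \mathbb{E}_{\mathbf{x}}\left[\text{Variance of } g^{(\mathcal{D})}(\mathbf{x}) + \text{Bias of } \bar{g}(\mathbf{x})\right]$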
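The step that turns the first two terms into a variance may be worth spelling out. Writing $g^{(\mathcal{D})}$ for the hypothesis learned from dataset $\mathcal{D}$ and $\bar{g}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}[g^{(\mathcal{D})}(\mathbf{x})]$ for the average hypothesis, expanding the square and using the definition of $\bar{g}$ gives:

```latex
\begin{aligned}
\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)^2\right]
&= \mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})^2\right]
   - 2\,\bar{g}(\mathbf{x})\,\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})\right]
   + \bar{g}(\mathbf{x})^2 \\
&= \mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})^2\right] - \bar{g}(\mathbf{x})^2
\end{aligned}
```

The remaining terms $\bar{g}(\mathbf{x})^2 - 2\bar{g}(\mathbf{x}) f(\mathbf{x}) + f(\mathbf{x})^2$ form the perfect square $\left(\bar{g}(\mathbf{x}) - f(\mathbf{x})\right)^2$, which is the bias.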

21-25 Bias-Variance Tradeoff
$\mathbb{E}_{\mathcal{D}}\left[E_{out}(g^{(\mathcal{D})})\right] = \mathbb{E}_{\mathbf{x}}\left[\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)^2\right] + \left(\bar{g}(\mathbf{x}) - f(\mathbf{x})\right)^2\right]$
Variance term: How variable is $g$? How well could $g$ approximate anything? How well could $g$ approximate noise? Increases as $\mathcal{H}$ becomes more complex.
Bias term: How well, on average, does $g$ approximate $f$? Decreases as $\mathcal{H}$ becomes more complex.

26 Bias-Variance Tradeoff (Example)
$\mathcal{X} = \mathbb{R}$ and $P = \text{Uniform}(0, 2\pi)$
$f(x) = \sin(x)$
$\mathcal{D} = \{(x_1, \sin(x_1)), (x_2, \sin(x_2))\}$
$\mathcal{H}_0 = \{h : h(x) = b\}$ and $\mathcal{H}_1 = \{h : h(x) = ax + b\}$
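The example can be simulated to estimate the bias and variance of each hypothesis set. The sketch below assumes the learning algorithm is least squares on the two training points (the midpoint of the two labels for the constant model, the line through both points for the linear model), which the slide does not state explicitly. The function names, grid size, and number of datasets are illustrative.

```python
import math
import random


def fit_constant(x1, y1, x2, y2):
    """Least-squares constant h(x) = b through two points."""
    b = (y1 + y2) / 2
    return lambda x: b


def fit_line(x1, y1, x2, y2):
    """Line h(x) = a*x + b passing through both points."""
    a = (y2 - y1) / (x2 - x1)
    b = y1 - a * x1
    return lambda x: a * x + b


def bias_variance(fit, n_datasets=2000, n_grid=100, seed=0):
    """Monte Carlo estimate of (bias, variance) for f(x) = sin(x), x ~ Uniform(0, 2*pi)."""
    rng = random.Random(seed)
    grid = [2 * math.pi * (i + 0.5) / n_grid for i in range(n_grid)]
    sums = [0.0] * n_grid
    sq_sums = [0.0] * n_grid
    for _ in range(n_datasets):
        x1 = rng.uniform(0, 2 * math.pi)
        x2 = rng.uniform(0, 2 * math.pi)
        while x2 == x1:  # guard against a degenerate dataset
            x2 = rng.uniform(0, 2 * math.pi)
        g = fit(x1, math.sin(x1), x2, math.sin(x2))
        for i, x in enumerate(grid):
            value = g(x)
            sums[i] += value
            sq_sums[i] += value * value
    bias = 0.0
    variance = 0.0
    for i, x in enumerate(grid):
        g_bar = sums[i] / n_datasets  # average hypothesis at x
        bias += (g_bar - math.sin(x)) ** 2
        variance += sq_sums[i] / n_datasets - g_bar**2
    return bias / n_grid, variance / n_grid


if __name__ == "__main__":
    for name, fit in (("H_0", fit_constant), ("H_1", fit_line)):
        bias, variance = bias_variance(fit)
        print(name, "bias:", round(bias, 3), "variance:", round(variance, 3))
```

For the constant model a direct calculation gives bias $1/2$ and variance $1/4$, which the simulation should approximately reproduce. The linear model trades a lower bias for a much larger variance when only two training points are available.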

27-31 Bias-Variance Tradeoff (Example)
[Figure: side-by-side panels for $\mathcal{H}_0$ and $\mathcal{H}_1$ showing the learned hypotheses, the average hypothesis $\bar{g}$, and for each panel the bias of $\bar{g}$, the variance of $g$, and the expected $E_{out}(g)$]

32 [Figure: learning curves for a simple model and a complex model; x-axis: number of training points, $N$; y-axis: expected error; curves: $E_{out}$ and $E_{in}$]

33 [Figure: the same learning curve under two analyses; x-axis: number of training points, $N$; y-axis: expected error; curves: $E_{out}$ and $E_{in}$. VC analysis panel: in-sample error and generalization error. Bias-Variance analysis panel: bias and variance]

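34 Vapnik-Chervonenkis (VC) Bound
$\mathbb{P}\left[\left|E_{in}(g) - E_{out}(g)\right| > \epsilon\right] \leq 4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8} \epsilon^2 N}$
Or $E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{8}{N} \log \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$ with probability at least $1 - \delta$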

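35-36 Why $2N$?
$\mathbb{P}\left[\left|E_{in}(g) - E_{out}(g)\right| > \epsilon\right] \leq 4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8} \epsilon^2 N}$
Intuition: $E_{out}(g)$ is difficult to reason about.
Replace $E_{out}(g)$ with $E'_{in}(g)$, the error on a second dataset of size $N$ not used in the training process.
$\mathbb{P}\left[\left|E_{in}(g) - E_{out}(g)\right| > \epsilon\right] \lesssim 2\, \mathbb{P}\left[\left|E_{in}(g) - E'_{in}(g)\right| > \frac{\epsilon}{2}\right]$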

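37 Test Sets
Instead of bounding $E_{out}(g)$ using $E_{in}(g)$, estimate $E_{out}(g)$ using the error on some test dataset $\mathcal{D}_{test}$.
$E_{test}(g)$ = error on the test dataset
If $\mathcal{D}_{test}$ is not involved in the training process, then we are validating $g$ using $\mathcal{D}_{test}$.
Therefore, Hoeffding's bound applies! Even better, Hoeffding's bound applies with $M = |\mathcal{H}| = 1$:
$\mathbb{P}\left[\left|E_{test}(g) - E_{out}(g)\right| > \epsilon\right] \leq 2 e^{-2 \epsilon^2 N_{test}}$ where $N_{test} = |\mathcal{D}_{test}|$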
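Hoeffding's bound for a single hypothesis can be inverted to give an error bar on the test estimate. The sketch below solves $2e^{-2\epsilon^2 N_{test}} = \delta$ for $\epsilon$; the function names are hypothetical.

```python
import math


def hoeffding_probability(n_test, epsilon):
    """Upper bound 2 * exp(-2 * eps^2 * n_test) on P[|E_test - E_out| > eps]."""
    return 2 * math.exp(-2 * epsilon**2 * n_test)


def hoeffding_epsilon(n_test, delta):
    """Smallest eps for which the bound above is at most delta."""
    return math.sqrt(math.log(2 / delta) / (2 * n_test))


# Error bar on E_test at 95% confidence for a few test-set sizes.
for n_test in (100, 1_000, 10_000):
    print(n_test, round(hoeffding_epsilon(n_test, delta=0.05), 4))
```

The error bar shrinks like $1 / \sqrt{N_{test}}$, so quadrupling the test set halves it.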

38 Test Sets
But at what cost? We are given a finite pool of data.
Carving out a test dataset to bound $E_{out}(g)$ leaves fewer data points to train with.
A smaller training dataset generally means the learned $g$ is worse, i.e., $E_{test}(g)$ is large.
Practical rule of thumb: 70-80% training, 20-30% testing
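A split following the rule of thumb can be sketched with the standard library alone. The function name `train_test_split` and its defaults are illustrative (libraries such as scikit-learn offer a similarly named utility, but this sketch does not use it).

```python
import random


def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and hold out `test_fraction` of it for testing."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(100), test_fraction=0.2)
print(len(train), len(test))
```

The test portion must stay untouched during training for Hoeffding's bound with a single hypothesis to apply.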
