MLCC 2018 Local Methods and Bias Variance Trade-Off. Lorenzo Rosasco UNIGE-MIT-IIT
1 MLCC 2018 Local Methods and Bias Variance Trade-Off Lorenzo Rosasco UNIGE-MIT-IIT
2 About this class 1. Introduce a basic class of learning methods, namely local methods. 2. Discuss the fundamental concept of bias-variance trade-off to understand parameter tuning (a.k.a. model selection) MLCC
3 Outline MLCC
4 The problem What is the price of one house given its area? MLCC
5 The problem What is the price of one house given its area? Start from data... MLCC
6 The problem What is the price of one house given its area? Start from data... Area (m²), Price (€): x_1 = 62, y_1 = 99,200; x_2 = 64, y_2 = 135,700; x_3 = 65, y_3 = 93,300; x_4 = 66, y_4 = 114,…; ... Let S be the houses example dataset (n = 100): S = {(x_1, y_1), ..., (x_n, y_n)} MLCC
7 The problem What is the price of one house given its area? Start from data... Area (m²), Price (€): x_1 = 62, y_1 = 99,200; x_2 = 64, y_2 = 135,700; x_3 = 65, y_3 = 93,300; x_4 = 66, y_4 = 114,…; ... Let S be the houses example dataset (n = 100): S = {(x_1, y_1), ..., (x_n, y_n)}. Given a new point x*, we want to predict y* by means of S. MLCC
8 Example Let x* be a 300 m² house. MLCC
9 Example Let x* be a 300 m² house. Area (m²), Price (€): ..., x_93 = 255, y_93 = 274,600; x_94 = 264, y_94 = 324,900; x_95 = 310, y_95 = 311,200; x_96 = 480, y_96 = 515,… MLCC
10 Example Let x* be a 300 m² house. Area (m²), Price (€): ..., x_93 = 255, y_93 = 274,600; x_94 = 264, y_94 = 324,900; x_95 = 310, y_95 = 311,200; x_96 = 480, y_96 = 515,… What is its price? MLCC
11 Nearest Neighbors Nearest Neighbor: y* is the value of the closest point to x* in S. y* = 311,200. Area (m²), Price (€): ..., x_93 = 255, y_93 = 274,600; x_94 = 264, y_94 = 324,900; x_95 = 310, y_95 = 311,200; x_96 = 480, y_96 = 515,… MLCC
12 Nearest Neighbors Nearest Neighbor: y* is the value of the closest point to x* in S. y* = 311,200. Area (m²), Price (€): ..., x_93 = 255, y_93 = 274,600; x_94 = 264, y_94 = 324,900; x_95 = 310, y_95 = 311,200; x_96 = 480, y_96 = 515,… MLCC
13 Nearest Neighbors S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R MLCC
14 Nearest Neighbors S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R; x* the new point, x* ∈ R^D MLCC
15 Nearest Neighbors S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R; x* the new point, x* ∈ R^D; y* the predicted output, y* = f̂(x*), where y* = y_j, j = argmin_{i=1,...,n} ‖x* − x_i‖ MLCC
16 Nearest Neighbors S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R; x* the new point, x* ∈ R^D; y* the predicted output, y* = f̂(x*), where y* = y_j, j = argmin_{i=1,...,n} ‖x* − x_i‖. Computational cost O(nD): we compute n times the distance ‖x* − x_i‖, each costing O(D) MLCC
17 Nearest Neighbors S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R; x* the new point, x* ∈ R^D; y* the predicted output, y* = f̂(x*), where y* = y_j, j = argmin_{i=1,...,n} ‖x* − x_i‖. Computational cost O(nD): we compute n times the distance ‖x* − x_i‖, each costing O(D). In general, let d : R^D × R^D → R be a distance on the input space; then f̂(x) = y_j, j = argmin_{i=1,...,n} d(x, x_i) MLCC
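A minimal sketch of this nearest-neighbor rule (illustrative code, not from the slides; it assumes the training set is stored as NumPy arrays X of shape (n, D) and y of shape (n,)):

```python
import numpy as np

def nn_predict(X, y, x_new):
    """1-Nearest Neighbor: return y_j with j = argmin_i ||x_new - x_i||."""
    dists = np.linalg.norm(X - x_new, axis=1)  # n distances, each O(D): total O(nD)
    j = np.argmin(dists)
    return y[j]
```

For the 300 m² house of the example, this returns the price of the closest house in S (the 310 m² one), i.e. 311,200, as on slide 11.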
18 Extensions Nearest Neighbor takes y* to be the value of the closest point to x* in S. Area (m²), Price (€): ..., x_93 = 255, y_93 = 274,600; x_94 = 264, y_94 = 324,900; x_95 = 310, y_95 = 311,200; x_96 = 480, y_96 = 515,… MLCC
19 Extensions Nearest Neighbor takes y* to be the value of the closest point to x* in S. Area (m²), Price (€): ..., x_93 = 255, y_93 = 274,600; x_94 = 264, y_94 = 324,900; x_95 = 310, y_95 = 311,200; x_96 = 480, y_96 = 515,… Can we do better? (for example using more points) MLCC
20 K-Nearest Neighbors K-Nearest Neighbor: y* is the mean of the values of the K closest points to x* in S. If K = 3 we have y* = (274,600 + 324,900 + 311,200)/3 ≈ 303,567. Area (m²), Price (€): ..., x_93 = 255, y_93 = 274,600; x_94 = 264, y_94 = 324,900; x_95 = 310, y_95 = 311,200; x_96 = 480, y_96 = 515,… MLCC
21 K-Nearest Neighbors K-Nearest Neighbor: y* is the mean of the values of the K closest points to x* in S. If K = 3 we have y* = (274,600 + 324,900 + 311,200)/3 ≈ 303,567. Area (m²), Price (€): ..., x_93 = 255, y_93 = 274,600; x_94 = 264, y_94 = 324,900; x_95 = 310, y_95 = 311,200; x_96 = 480, y_96 = 515,… MLCC
22 K-Nearest Neighbors S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R; x* the new point, x* ∈ R^D MLCC
23 K-Nearest Neighbors S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R; x* the new point, x* ∈ R^D. Let K be an integer, K ≪ n MLCC
24 K-Nearest Neighbors S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R; x* the new point, x* ∈ R^D. Let K be an integer, K ≪ n, and j_1, ..., j_K defined as j_1 = argmin_{i ∈ {1,...,n}} ‖x* − x_i‖ and j_t = argmin_{i ∈ {1,...,n}\{j_1,...,j_{t−1}}} ‖x* − x_i‖ for t ∈ {2, ..., K} MLCC
25 K-Nearest Neighbors S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R; x* the new point, x* ∈ R^D. Let K be an integer, K ≪ n, and j_1, ..., j_K defined as j_1 = argmin_{i ∈ {1,...,n}} ‖x* − x_i‖ and j_t = argmin_{i ∈ {1,...,n}\{j_1,...,j_{t−1}}} ‖x* − x_i‖ for t ∈ {2, ..., K}; predicted output y* = (1/K) Σ_{i ∈ {j_1,...,j_K}} y_i MLCC
26 K-Nearest Neighbors (cont.) f̂(x) = (1/K) Σ_{i=1}^K y_{j_i} MLCC
27 K-Nearest Neighbors (cont.) f̂(x) = (1/K) Σ_{i=1}^K y_{j_i}. Computational cost O(nD + n log n): compute the n distances ‖x − x_i‖ for i ∈ {1, ..., n} (each costs O(D)), then order them, O(n log n). MLCC
28 K-Nearest Neighbors (cont.) f̂(x) = (1/K) Σ_{i=1}^K y_{j_i}. Computational cost O(nD + n log n): compute the n distances ‖x − x_i‖ for i ∈ {1, ..., n} (each costs O(D)), then order them, O(n log n). General metric d: f̂ is the same, but j_1, ..., j_K are defined as j_1 = argmin_{i ∈ {1,...,n}} d(x, x_i) and j_t = argmin_{i ∈ {1,...,n}\{j_1,...,j_{t−1}}} d(x, x_i) for t ∈ {2, ..., K} MLCC
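A corresponding K-NN sketch, with an optional user-supplied metric d (again illustrative, under the same NumPy-array assumptions; the Euclidean distance is the default):

```python
import numpy as np

def knn_predict(X, y, x_new, K=3, d=None):
    """K-Nearest Neighbors: average the labels of the K points closest to x_new."""
    if d is None:
        dists = np.linalg.norm(X - x_new, axis=1)       # Euclidean distance, O(nD)
    else:
        dists = np.array([d(x_new, x_i) for x_i in X])  # general metric d
    nearest = np.argsort(dists)[:K]                      # ordering: O(n log n)
    return y[nearest].mean()                             # (1/K) * sum of the K nearest labels
```

With K = 3 on the house example this corresponds to the ≈ 303,567 prediction computed above.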
29 Parzen Windows K-NN puts equal weights on the values of the selected points. MLCC
30 Parzen Windows K-NN puts equal weights on the values of the selected points. Can we generalize it? MLCC
31 Parzen Windows K-NN puts equal weights on the values of the selected points. Can we generalize it? Points closer to x should influence its value more MLCC
32 Parzen Windows K-NN puts equal weights on the values of the selected points. Can we generalize it? Points closer to x should influence its value more. PARZEN WINDOWS: f̂(x) = Σ_{i=1}^n y_i k(x, x_i) / Σ_{i=1}^n k(x, x_i), where k is a similarity function: k(x, x′) ≥ 0 for all x, x′ ∈ R^D; k(x, x′) → 1 when ‖x − x′‖ → 0; k(x, x′) → 0 when ‖x − x′‖ → ∞ MLCC
33 Parzen Windows Examples of k: k_1(x, x′) = sign(1 − ‖x − x′‖/σ)_+ with σ > 0; k_2(x, x′) = (1 − ‖x − x′‖/σ)_+ with σ > 0; k_3(x, x′) = (1 − ‖x − x′‖²/σ²)_+ with σ > 0; k_4(x, x′) = e^{−‖x − x′‖²/(2σ²)} with σ > 0; k_5(x, x′) = e^{−‖x − x′‖/σ} with σ > 0 MLCC
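A sketch of the Parzen-window estimator with the Gaussian similarity k_4 (illustrative; here σ plays the role of a bandwidth hyperparameter, to be tuned just like K):

```python
import numpy as np

def parzen_predict(X, y, x_new, sigma=1.0):
    """Parzen windows: f_hat(x) = sum_i y_i k(x, x_i) / sum_i k(x, x_i)."""
    sq_dists = np.sum((X - x_new) ** 2, axis=1)
    k = np.exp(-sq_dists / (2.0 * sigma ** 2))  # k_4(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.sum(y * k) / np.sum(k)            # closer points receive larger weights
```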
34–40 K-NN example K-Nearest Neighbor depends on K. [Slides 34–40 each show a plot of the K-NN fit for a different value of K; the plots and the specific K values are not reproduced in this transcription.] MLCC
41 K-NN example K-Nearest Neighbor depends on K. Changing K, the result changes a lot! How to select K? MLCC
42 Outline MLCC
43 Optimal choice for the Hyper-parameters S = {(x_i, y_i)}_{i=1}^n training set. Name Y = (y_1, ..., y_n) and X = (x_1, ..., x_n). K ∈ N hyperparameter of the learning algorithm; f̂_{S,K} learned function (depends on S and K) MLCC
44 Optimal choice for the Hyper-parameters S = {(x_i, y_i)}_{i=1}^n training set. Name Y = (y_1, ..., y_n) and X = (x_1, ..., x_n). K ∈ N hyperparameter of the learning algorithm; f̂_{S,K} learned function (depends on S and K). The expected loss E_K is E_K = E_S E_{x,y} (y − f̂_{S,K}(x))² MLCC
45 Optimal choice for the Hyper-parameters S = {(x_i, y_i)}_{i=1}^n training set. Name Y = (y_1, ..., y_n) and X = (x_1, ..., x_n). K ∈ N hyperparameter of the learning algorithm; f̂_{S,K} learned function (depends on S and K). The expected loss E_K is E_K = E_S E_{x,y} (y − f̂_{S,K}(x))². Optimal hyperparameter K* should minimize E_K MLCC
46 Optimal choice for the Hyper-parameters S = {(x_i, y_i)}_{i=1}^n training set. Name Y = (y_1, ..., y_n) and X = (x_1, ..., x_n). K ∈ N hyperparameter of the learning algorithm; f̂_{S,K} learned function (depends on S and K). The expected loss E_K is E_K = E_S E_{x,y} (y − f̂_{S,K}(x))². Optimal hyperparameter K* should minimize E_K: K* = argmin_K E_K MLCC
47 Optimal choice for the Hyper-parameters (cont.) Optimal hyperparameter K* should minimize E_K MLCC
48 Optimal choice for the Hyper-parameters (cont.) Optimal hyperparameter K* should minimize E_K: K* = argmin_K E_K MLCC
49 Optimal choice for the Hyper-parameters (cont.) Optimal hyperparameter K* should minimize E_K: K* = argmin_K E_K. Ideally! (In practice we don't have access to the distribution.) We can still try to understand the above minimization problem: does a solution exist? What does it depend on? Yet, ultimately, we need something we can compute! MLCC
50 Example: regression problem Define the pointwise expected loss E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))² MLCC
51 Example: regression problem Define the pointwise expected loss E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))². By definition E_K = E_x E_K(x). MLCC
52 Example: regression problem Define the pointwise expected loss E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))². By definition E_K = E_x E_K(x). Regression setting: regression model y = f*(x) + δ MLCC
53 Example: regression problem Define the pointwise expected loss E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))². By definition E_K = E_x E_K(x). Regression setting: regression model y = f*(x) + δ, with Eδ = 0, Eδ² = σ² MLCC
54 Example: regression problem Define the pointwise expected loss E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))². By definition E_K = E_x E_K(x). Regression setting: regression model y = f*(x) + δ, with Eδ = 0, Eδ² = σ². Now E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))² MLCC
55 Example: regression problem Define the pointwise expected loss E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))². By definition E_K = E_x E_K(x). Regression setting: regression model y = f*(x) + δ, with Eδ = 0, Eδ² = σ². Now E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))² = E_S E_{y|x} (f*(x) + δ − f̂_{S,K}(x))² MLCC
56 Example: regression problem Define the pointwise expected loss E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))². By definition E_K = E_x E_K(x). Regression setting: regression model y = f*(x) + δ, with Eδ = 0, Eδ² = σ². Now E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))² = E_S E_{y|x} (f*(x) + δ − f̂_{S,K}(x))², that is, E_K(x) = E_S (f*(x) − f̂_{S,K}(x))² + σ² ... MLCC
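The last equality uses Eδ = 0, Eδ² = σ² and the fact that the noise δ at the test point is independent of the training set S; spelled out as an added step:

```latex
\mathbb{E}_S \mathbb{E}_{y|x}\big(f_*(x) + \delta - \hat f_{S,K}(x)\big)^2
  = \mathbb{E}_S\Big[\big(f_*(x) - \hat f_{S,K}(x)\big)^2
    + 2\,\mathbb{E}[\delta]\big(f_*(x) - \hat f_{S,K}(x)\big)
    + \mathbb{E}[\delta^2]\Big]
  = \mathbb{E}_S\big(f_*(x) - \hat f_{S,K}(x)\big)^2 + \sigma^2 .
```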
57 Bias Variance trade-off for K-NN Define the noiseless K-NN (it is ideal!) f̄_{S,K}(x) = (1/K) Σ_{l ∈ K_x} f*(x_l), where K_x indexes the K nearest neighbors of x MLCC
58 Bias Variance trade-off for K-NN Define the noiseless K-NN (it is ideal!) f̄_{S,K}(x) = (1/K) Σ_{l ∈ K_x} f*(x_l), where K_x indexes the K nearest neighbors of x. Note that f̄_{S,K}(x) = E_{Y|X} f̂_{S,K}(x) (the expectation of the K-NN estimator over the training labels Y, given the training inputs X). MLCC
59 Bias Variance trade-off for K-NN Define the noiseless K-NN (it is ideal!) f̄_{S,K}(x) = (1/K) Σ_{l ∈ K_x} f*(x_l). Note that f̄_{S,K}(x) = E_{Y|X} f̂_{S,K}(x). Consider E_K(x) = (f*(x) − E_X f̄_{S,K}(x))² [bias] + E_S (f̄_{S,K}(x) − f̂_{S,K}(x))² [variance] + σ² ... MLCC
60 Bias Variance trade-off for K-NN Define the noiseless K-NN (it is ideal!) f̄_{S,K}(x) = (1/K) Σ_{l ∈ K_x} f*(x_l). Note that f̄_{S,K}(x) = E_{Y|X} f̂_{S,K}(x). Consider E_K(x) = (f*(x) − E_X f̄_{S,K}(x))² [bias] + (1/K²) Σ_{l ∈ K_x} E_X E_{y_l|x_l} (y_l − f*(x_l))² [variance] + σ² ... MLCC
61 Bias Variance trade-off for K-NN Define the noiseless K-NN (it is ideal!) f̄_{S,K}(x) = (1/K) Σ_{l ∈ K_x} f*(x_l). Note that f̄_{S,K}(x) = E_{Y|X} f̂_{S,K}(x). Consider E_K(x) = (f*(x) − E_X f̄_{S,K}(x))² [bias] + σ²/K [variance] + σ² ... MLCC
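The σ²/K variance term follows from the previous slide because f̂_{S,K}(x) − f̄_{S,K}(x) = (1/K) Σ_{l ∈ K_x} (y_l − f*(x_l)) and the noise terms at distinct training points are independent with mean zero, so the cross terms vanish (added check):

```latex
\mathbb{E}_S\big(\bar f_{S,K}(x) - \hat f_{S,K}(x)\big)^2
  = \frac{1}{K^2} \sum_{l \in K_x} \mathbb{E}_X\, \mathbb{E}_{y_l|x_l}\big(y_l - f_*(x_l)\big)^2
  = \frac{1}{K^2}\, K\,\sigma^2
  = \frac{\sigma^2}{K}.
```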
62 Bias Variance trade-off [Plot: error as a function of K; the variance term decreases with K while the bias term increases, giving the trade-off.] MLCC
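A toy simulation in the spirit of this plot (entirely illustrative: a made-up 1-D target f*(x) = sin(x), synthetic noisy training sets, and the squared bias and variance of the K-NN prediction at a single test point estimated over repeated draws):

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = np.sin          # made-up target function
sigma = 0.3              # noise standard deviation
n, n_trials = 200, 500   # training-set size, number of repeated draws
x_test = 1.0             # test point

def knn_1d(X, y, x_new, K):
    nearest = np.argsort(np.abs(X - x_new))[:K]
    return y[nearest].mean()

for K in [1, 5, 20, 50, 100]:
    preds = []
    for _ in range(n_trials):
        X = rng.uniform(0.0, np.pi, n)              # training inputs
        y = f_star(X) + sigma * rng.normal(size=n)  # noisy labels y = f*(x) + delta
        preds.append(knn_1d(X, y, x_test, K))
    preds = np.array(preds)
    bias2 = (preds.mean() - f_star(x_test)) ** 2    # squared bias at x_test
    var = preds.var()                               # variance, roughly sigma^2 / K
    print(f"K = {K:3d}   bias^2 = {bias2:.4f}   variance = {var:.4f}")
```

As K grows the variance shrinks roughly like σ²/K while the squared bias grows, which is the trade-off the plot illustrates.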
63 How to choose the hyper-parameters Bias-Variance trade-off is theoretical, but shows that: MLCC
64 How to choose the hyper-parameters Bias-Variance trade-off is theoretical, but shows that: an optimal parameter exists and MLCC
65 How to choose the hyper-parameters Bias-Variance trade-off is theoretical, but shows that: an optimal parameter exists and it depends on the noise and the unknown target function. MLCC
66 How to choose the hyper-parameters Bias-Variance trade-off is theoretical, but shows that: an optimal parameter exists and it depends on the noise and the unknown target function. How to choose K in practice? MLCC
67 How to choose the hyper-parameters Bias-Variance trade-off is theoretical, but shows that: an optimal parameter exists and it depends on the noise and the unknown target function. How to choose K in practice? Idea: train on some data and validate the parameter on new unseen data as a proxy for the ideal case. MLCC
68 Hold-out Cross-validation For each K MLCC
69 Hold-out Cross-validation For each K 1. shuffle and split S in T (training) and V (validation) MLCC
70 Hold-out Cross-validation For each K 1. shuffle and split S in T (training) and V (validation) 2. train the algorithm on T and compute the empirical loss on V: Ê_K = (1/|V|) Σ_{(x,y) ∈ V} (y − f̂_{T,K}(x))² MLCC
71 Hold-out Cross-validation For each K 1. shuffle and split S in T (training) and V (validation) 2. train the algorithm on T and compute the empirical loss on V: Ê_K = (1/|V|) Σ_{(x,y) ∈ V} (y − f̂_{T,K}(x))² MLCC
72 Hold-out Cross-validation For each K 1. shuffle and split S in T (training) and V (validation) 2. train the algorithm on T and compute the empirical loss on V: Ê_K = (1/|V|) Σ_{(x,y) ∈ V} (y − f̂_{T,K}(x))² 3. Select the K̂ that minimizes Ê_K. MLCC
73 Hold-out Cross-validation For each K 1. shuffle and split S in T (training) and V (validation) 2. train the algorithm on T and compute the empirical loss on V: Ê_K = (1/|V|) Σ_{(x,y) ∈ V} (y − f̂_{T,K}(x))² 3. Select the K̂ that minimizes Ê_K. The above procedure can be repeated to improve stability, with K selected to minimize the error over trials. MLCC
74 Hold-out Cross-validation For each K 1. shuffle and split S in T (training) and V (validation) 2. train the algorithm on T and compute the empirical loss on V: Ê_K = (1/|V|) Σ_{(x,y) ∈ V} (y − f̂_{T,K}(x))² 3. Select the K̂ that minimizes Ê_K. The above procedure can be repeated to improve stability, with K selected to minimize the error over trials. There are other related parameter selection methods (k-fold cross-validation, leave-one-out, ...). MLCC
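A sketch of the hold-out procedure for choosing K (illustrative; it reuses the knn_predict function sketched earlier, and the split fraction is a free choice):

```python
import numpy as np

def holdout_select_K(X, y, K_grid, frac_train=0.5, seed=0):
    """Hold-out cross-validation: train on T, evaluate each K on V, pick the best K."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))                  # 1. shuffle ...
    n_train = int(frac_train * len(y))
    T, V = perm[:n_train], perm[n_train:]           #    ... and split S into T and V
    errors = {}
    for K in K_grid:                                # 2. "train" on T, empirical loss on V
        preds = np.array([knn_predict(X[T], y[T], x, K) for x in X[V]])
        errors[K] = np.mean((y[V] - preds) ** 2)
    K_hat = min(errors, key=errors.get)             # 3. K_hat minimizing the hold-out error
    return K_hat, errors
```

Repeating this over several random splits and averaging the errors per K gives the more stable variant mentioned above.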
75 Training and Validation Error behavior [Plot: training error and validation error as functions of K, with T = 50% of S and n = 200.] MLCC
76 Training and Validation Error behavior [Same plot, with the selected K̂ marked at the minimum of the validation error.] MLCC
77 Wrapping up In this class we made our first encounter with learning algorithms (local methods) and the problem of tuning their parameters (via bias-variance trade-off and cross-validation) to avoid overfitting and achieve generalization. MLCC
78 Next Class High Dimensions: Beyond local methods! MLCC