MLCC 2018
Local Methods and Bias-Variance Trade-Off
Lorenzo Rosasco, UNIGE-MIT-IIT
About this class
1. Introduce a basic class of learning methods, namely local methods.
2. Discuss the fundamental concept of the bias-variance trade-off to understand parameter tuning (a.k.a. model selection).
Outline
1. Local methods
2. Bias-variance trade-off and model selection
The problem
What is the price of one house given its area? Start from data...

Area (m²)   Price (€)
x_1 = 62    y_1 = 99,200
x_2 = 64    y_2 = 135,700
x_3 = 65    y_3 = 93,300
x_4 = 66    y_4 = 114,000
...

[Figure: scatter plot of the dataset, price (×10^5 €) vs. area (m²)]

Let $S$ be the houses example dataset ($n = 100$):

$S = \{(x_1, y_1), \dots, (x_n, y_n)\}$

Given a new point $\bar{x}$, we want to predict its price $\bar{y}$ by means of $S$.
Example
Let $\bar{x}$ be a 300 m² house.

Area (m²)    Price (€)
...
x_93 = 255   y_93 = 274,600
x_94 = 264   y_94 = 324,900
x_95 = 310   y_95 = 311,200
x_96 = 480   y_96 = 515,400
...

[Figure: scatter plot of the dataset]

What is its price?
Nearest Neighbors
Nearest Neighbor: $\bar{y}$ is the value of the closest point to $\bar{x}$ in $S$. Here the closest house to 300 m² is $x_{95} = 310$, so

$\bar{y} = y_{95}$ = 311,200 €

[Figure: scatter plot of the dataset, with the nearest neighbor of the query point highlighted]
Nearest Neighbors
$S = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^D$, $y_i \in \mathbb{R}$.
$\bar{x} \in \mathbb{R}^D$ the new point.
$\bar{y}$ the predicted output, $\bar{y} = \hat{f}(\bar{x})$, where

$\bar{y} = y_j, \qquad j = \arg\min_{i = 1, \dots, n} \|\bar{x} - x_i\|$

Computational cost $O(nD)$: we compute the $n$ distances $\|\bar{x} - x_i\|$, each costing $O(D)$.

In general, let $d : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ be a distance on the input space; then

$\hat{f}(\bar{x}) = y_j, \qquad j = \arg\min_{i = 1, \dots, n} d(\bar{x}, x_i)$
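A minimal NumPy sketch of this rule, using a few values from the toy houses table above (the function and variable names are illustrative, not from the course material):

```python
import numpy as np

def nearest_neighbor(X, Y, x_new):
    """Predict the output at x_new as the output of the closest training point.

    X: (n, D) array of inputs; Y: (n,) array of outputs; x_new: (D,) query.
    Cost is O(nD): n Euclidean distances, each computed in O(D).
    """
    dists = np.linalg.norm(X - x_new, axis=1)  # ||x_new - x_i|| for every i
    j = np.argmin(dists)                       # index of the nearest neighbor
    return Y[j]

# Toy usage on four of the house examples (areas in m^2, prices in euros)
X = np.array([[255.0], [264.0], [310.0], [480.0]])
Y = np.array([274_600.0, 324_900.0, 311_200.0, 515_400.0])
print(nearest_neighbor(X, Y, np.array([300.0])))  # 311200.0, since 310 is closest
```

Swapping the Euclidean norm for any other metric gives the general-distance variant above.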
Extensions
Nearest Neighbor takes $\bar{y}$ to be the value of the closest point to $\bar{x}$ in $S$.

[Figure: scatter plot of the dataset]

Can we do better? (For example, using more points.)
K-Nearest Neighbors
K-Nearest Neighbor: $\bar{y}$ is the mean of the values of the $K$ closest points to $\bar{x}$ in $S$. If $K = 3$ we have

$\bar{y} = \frac{274{,}600 + 324{,}900 + 311{,}200}{3} \approx 303{,}600$

[Figure: scatter plot of the dataset, with the three nearest neighbors of the query point highlighted]
K-Nearest Neighbors
$S = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^D$, $y_i \in \mathbb{R}$.
$\bar{x} \in \mathbb{R}^D$ the new point.
Let $K$ be an integer, $K \ll n$, and let $j_1, \dots, j_K$ be defined as

$j_1 = \arg\min_{i \in \{1, \dots, n\}} \|\bar{x} - x_i\|$
$j_t = \arg\min_{i \in \{1, \dots, n\} \setminus \{j_1, \dots, j_{t-1}\}} \|\bar{x} - x_i\|, \quad t \in \{2, \dots, K\}$

The predicted output is

$\bar{y} = \frac{1}{K} \sum_{i \in \{j_1, \dots, j_K\}} y_i$
K-Nearest Neighbors (cont.)

$\hat{f}(\bar{x}) = \frac{1}{K} \sum_{i=1}^{K} y_{j_i}$

Computational cost $O(nD + n \log n)$: compute the $n$ distances $\|\bar{x} - x_i\|$ for $i \in \{1, \dots, n\}$ (each costs $O(D)$), then sort them in $O(n \log n)$.

General metric $d$: $\hat{f}$ is the same, but $j_1, \dots, j_K$ are defined as $j_1 = \arg\min_{i \in \{1, \dots, n\}} d(\bar{x}, x_i)$ and $j_t = \arg\min_{i \in \{1, \dots, n\} \setminus \{j_1, \dots, j_{t-1}\}} d(\bar{x}, x_i)$ for $t \in \{2, \dots, K\}$.
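A sketch of K-NN prediction along the same lines (again a toy illustration, not the course's reference code). A full sort is the $O(n \log n)$ route from the slide; np.argpartition would find the $K$ smallest distances in $O(n)$ instead:

```python
import numpy as np

def knn_predict(X, Y, x_new, K=3):
    """K-nearest-neighbor prediction: mean of the K closest outputs."""
    dists = np.linalg.norm(X - x_new, axis=1)  # O(nD) distance computations
    idx = np.argsort(dists)[:K]                # j_1, ..., j_K via an O(n log n) sort
    return Y[idx].mean()

X = np.array([[255.0], [264.0], [310.0], [480.0]])
Y = np.array([274_600.0, 324_900.0, 311_200.0, 515_400.0])
print(knn_predict(X, Y, np.array([300.0]), K=3))  # (274600+324900+311200)/3 ≈ 303566.7
```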
Parzen Windows
K-NN puts equal weights on the values of the selected points. Can we generalize it? Points closer to $\bar{x}$ should influence its value more.

Parzen windows:

$\hat{f}(\bar{x}) = \frac{\sum_{i=1}^n y_i\, k(\bar{x}, x_i)}{\sum_{i=1}^n k(\bar{x}, x_i)}$

where $k$ is a similarity function:
- $k(x, x') \geq 0$ for all $x, x' \in \mathbb{R}^D$
- $k(x, x') \to 1$ when $\|x - x'\| \to 0$
- $k(x, x') \to 0$ when $\|x - x'\| \to \infty$
Parzen Windows
Examples of $k$ (each with a $\sigma > 0$):

$k_1(x, x') = \mathrm{sign}\!\left(1 - \frac{\|x - x'\|}{\sigma}\right)_+$
$k_2(x, x') = \left(1 - \frac{\|x - x'\|}{\sigma}\right)_+$
$k_3(x, x') = \left(1 - \frac{\|x - x'\|^2}{\sigma^2}\right)_+$
$k_4(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}$
$k_5(x, x') = e^{-\frac{\|x - x'\|}{\sigma}}$
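A sketch of a Parzen-window predictor using the Gaussian similarity $k_4$ (the bandwidth value and the names are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(x_new, X, sigma):
    """k_4: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), for x_new against all rows of X."""
    sq_dists = np.sum((X - x_new) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def parzen_predict(X, Y, x_new, sigma):
    """Similarity-weighted average: f(x) = sum_i y_i k(x, x_i) / sum_i k(x, x_i)."""
    w = gaussian_kernel(x_new, X, sigma)  # weight of each training point
    return np.dot(w, Y) / np.sum(w)

X = np.array([[255.0], [264.0], [310.0], [480.0]])
Y = np.array([274_600.0, 324_900.0, 311_200.0, 515_400.0])
# sigma plays the role K played for K-NN: it sets how local the estimator is
print(parzen_predict(X, Y, np.array([300.0]), sigma=30.0))
```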
K-NN example
K-Nearest Neighbor depends on $K$.

[Figures: K-NN estimates on the same two-dimensional toy data for $K = 1, 2, 3, 4, 5, 9, 15$]

Changing $K$, the result changes a lot! How do we select $K$?
Outline
2. Bias-variance trade-off and model selection
Optimal choice for the hyperparameters
$S = \{(x_i, y_i)\}_{i=1}^n$ training set. Write $Y = (y_1, \dots, y_n)$ and $X = (x_1, \dots, x_n)$.
$K \in \mathbb{N}$ hyperparameter of the learning algorithm.
$\hat{f}_{S,K}$ learned function (depends on $S$ and $K$).

The expected loss $\mathcal{E}_K$ is

$\mathcal{E}_K = \mathbb{E}_S \mathbb{E}_{x,y} (y - \hat{f}_{S,K}(x))^2$

The optimal hyperparameter $K^*$ should minimize $\mathcal{E}_K$:

$K^* = \arg\min_K \mathcal{E}_K$
Optimal choice for the hyperparameters (cont.)
The optimal hyperparameter $K^*$ should minimize $\mathcal{E}_K$:

$K^* = \arg\min_K \mathcal{E}_K$

Ideally! (In practice we don't have access to the distribution.) We can still try to understand the above minimization problem: does a solution exist? What does it depend on? Yet, ultimately, we need something we can compute!
Example: regression problem
Define the pointwise expected loss

$\mathcal{E}_K(x) = \mathbb{E}_S \mathbb{E}_{y|x} (y - \hat{f}_{S,K}(x))^2$

By definition, $\mathcal{E}_K = \mathbb{E}_x \mathcal{E}_K(x)$.

Regression setting:
- regression model $y = f^*(x) + \delta$
- $\mathbb{E}\delta = 0$, $\mathbb{E}\delta^2 = \sigma^2$

Now

$\mathcal{E}_K(x) = \mathbb{E}_S \mathbb{E}_{y|x} (y - \hat{f}_{S,K}(x))^2 = \mathbb{E}_S \mathbb{E}_{y|x} (f^*(x) + \delta - \hat{f}_{S,K}(x))^2$

Expanding the square, the cross term vanishes because $\mathbb{E}\delta = 0$ and the test noise $\delta$ is independent of $S$; that is,

$\mathcal{E}_K(x) = \mathbb{E}_S (f^*(x) - \hat{f}_{S,K}(x))^2 + \sigma^2$ ...
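For completeness, a sketch of that expansion (my notation, following the slide's definitions):

```latex
\mathbb{E}_S \mathbb{E}_{y|x}\,\big(f^*(x) + \delta - \hat f_{S,K}(x)\big)^2
  = \mathbb{E}_S \big(f^*(x) - \hat f_{S,K}(x)\big)^2
  + 2\,\underbrace{\mathbb{E}\delta}_{=\,0}\;
      \mathbb{E}_S \big(f^*(x) - \hat f_{S,K}(x)\big)
  + \underbrace{\mathbb{E}\delta^2}_{=\,\sigma^2}
```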
Bias-Variance trade-off for K-NN
Define the noiseless K-NN (it is ideal!)

$\tilde{f}_{S,K}(x) = \frac{1}{K} \sum_{l \in K_x} f^*(x_l)$

where $K_x$ denotes the set of indices of the $K$ training points nearest to $x$. Note that $\tilde{f}_{S,K}(x) = \mathbb{E}_{Y|X}\, \hat{f}_{S,K}(x)$.

Consider

$\mathcal{E}_K(x) = \underbrace{(f^*(x) - \mathbb{E}_X \tilde{f}_{S,K}(x))^2}_{\text{bias}} + \underbrace{\mathbb{E}_S (\tilde{f}_{S,K}(x) - \hat{f}_{S,K}(x))^2}_{\text{variance}} + \sigma^2$

The variance term simplifies:

$\mathbb{E}_S (\tilde{f}_{S,K}(x) - \hat{f}_{S,K}(x))^2 = \frac{1}{K^2}\, \mathbb{E}_X \sum_{l \in K_x} \mathbb{E}_{y_l|x_l} (y_l - f^*(x_l))^2 = \frac{\sigma^2}{K}$

so that

$\mathcal{E}_K(x) = \underbrace{(f^*(x) - \mathbb{E}_X \tilde{f}_{S,K}(x))^2}_{\text{bias}} + \underbrace{\frac{\sigma^2}{K}}_{\text{variance}} + \sigma^2$ ...
Bias-Variance trade-off

[Figure: error terms as functions of $K$ — the variance $\sigma^2/K$ decreases with $K$ while the bias increases]
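The trade-off can also be seen numerically. Below is a small Monte-Carlo sketch estimating the bias and variance terms for 1D K-NN regression; the target function, noise level, sample sizes, and $K$ grid are all illustrative assumptions, and knn_fit_predict is a hypothetical helper, not course code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.3
f_star = np.sin  # assumed target function f*

def knn_fit_predict(x_tr, y_tr, x_te, K):
    """1D K-NN regression: average the K nearest training outputs at each test point."""
    d = np.abs(x_te[:, None] - x_tr[None, :])  # pairwise |x_te - x_tr| distances
    idx = np.argsort(d, axis=1)[:, :K]         # indices of the K nearest neighbors
    return y_tr[idx].mean(axis=1)

x_te = np.linspace(0.0, 2.0 * np.pi, 50)
for K in (1, 5, 20, 100):
    preds = []
    for _ in range(200):  # draw many training sets S to approximate E_S
        x_tr = rng.uniform(0.0, 2.0 * np.pi, n)
        y_tr = f_star(x_tr) + sigma * rng.normal(size=n)
        preds.append(knn_fit_predict(x_tr, y_tr, x_te, K))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_star(x_te)) ** 2)
    var = np.mean(preds.var(axis=0))           # roughly sigma^2 / K for small K
    print(f"K={K:3d}  bias^2={bias2:.4f}  variance={var:.4f}")
```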
How to choose the hyperparameters
The bias-variance trade-off is theoretical, but it shows that:
- an optimal parameter exists, and
- it depends on the noise and on the unknown target function.

How to choose $K$ in practice?

Idea: train on some data and validate the parameter on new, unseen data, as a proxy for the ideal case.
Hold-out Cross-validation
For each $K$:
1. shuffle and split $S$ into $T$ (training) and $V$ (validation)
2. train the algorithm on $T$ and compute the empirical loss on $V$:

$\hat{\mathcal{E}}_K = \frac{1}{|V|} \sum_{(x,y) \in V} (y - \hat{f}_{T,K}(x))^2$

3. select the $\hat{K}$ that minimizes $\hat{\mathcal{E}}_K$.

The above procedure can be repeated to improve stability, selecting $K$ to minimize the average error over trials. There are other related parameter selection methods (k-fold cross-validation, leave-one-out, ...).
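A sketch of the hold-out procedure for K-NN (the split fraction, grid, and function names are arbitrary choices of mine):

```python
import numpy as np

def knn_predict(X, Y, x_new, K):
    """K-NN prediction, as in the earlier sketch."""
    dists = np.linalg.norm(X - x_new, axis=1)
    return Y[np.argsort(dists)[:K]].mean()

def holdout_cv(X, Y, K_grid, train_frac=0.5, seed=0):
    """Hold-out CV for K-NN: return the K with the smallest validation error."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(Y))              # 1. shuffle and split S into T and V
    n_tr = int(train_frac * len(Y))
    tr, va = perm[:n_tr], perm[n_tr:]
    errors = {}
    for K in K_grid:                            # 2. train on T, evaluate on V
        preds = np.array([knn_predict(X[tr], Y[tr], x, K) for x in X[va]])
        errors[K] = np.mean((Y[va] - preds) ** 2)
    return min(errors, key=errors.get), errors  # 3. K minimizing the empirical loss

# Usage (with X, Y as NumPy arrays):
# K_hat, errs = holdout_cv(X, Y, K_grid=range(1, 21))
```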
Training and Validation Error behavior

[Figure: training and validation error as functions of $K$, with $T$ = 50% of $S$ and $n = 200$; the validation error is minimized at $\hat{K} = 8$]
Wrapping up
In this class we made our first encounter with learning algorithms (local methods) and the problem of tuning their parameters (via bias-variance trade-off and cross-validation) to avoid overfitting and achieve generalization.
Next Class
High Dimensions: Beyond local methods!