MLCC 2018
Local Methods and Bias-Variance Trade-Off
Lorenzo Rosasco, UNIGE-MIT-IIT
About this class
1. Introduce a basic class of learning methods, namely local methods.
2. Discuss the fundamental concept of the bias-variance trade-off to understand parameter tuning (a.k.a. model selection).
Outline
1. Local methods
2. Bias-variance trade-off and model selection
The problem
What is the price of one house given its area? Start from data...

Area (m²)   Price (€)
x_1 = 62    y_1 = 99,200
x_2 = 64    y_2 = 135,700
x_3 = 65    y_3 = 93,300
x_4 = 66    y_4 = 114,000
...

[Figure: scatter plot of the dataset, price (×10^5 €) vs. area (m²)]

Let $S$ be the houses example dataset ($n = 100$):

$S = \{(x_1, y_1), \dots, (x_n, y_n)\}$

Given a new point $\bar{x}$, we want to predict its price $\bar{y}$ by means of $S$.
Example
Let $\bar{x}$ be a 300 m² house.

Area (m²)    Price (€)
...
x_93 = 255   y_93 = 274,600
x_94 = 264   y_94 = 324,900
x_95 = 310   y_95 = 311,200
x_96 = 480   y_96 = 515,400
...

[Figure: scatter plot of the dataset]

What is its price?
Nearest Neighbors
Nearest Neighbor: $\bar{y}$ is the value of the closest point to $\bar{x}$ in $S$. Here the closest house to 300 m² is $x_{95} = 310$, so

$\bar{y} = y_{95}$ = 311,200 €

[Figure: scatter plot of the dataset, with the nearest neighbor of the query point highlighted]
Nearest Neighbors
$S = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^D$, $y_i \in \mathbb{R}$.
$\bar{x} \in \mathbb{R}^D$ the new point.
$\bar{y}$ the predicted output, $\bar{y} = \hat{f}(\bar{x})$, where

$\bar{y} = y_j, \qquad j = \arg\min_{i = 1, \dots, n} \|\bar{x} - x_i\|$

Computational cost $O(nD)$: we compute the $n$ distances $\|\bar{x} - x_i\|$, each costing $O(D)$.

In general, let $d : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ be a distance on the input space; then

$\hat{f}(\bar{x}) = y_j, \qquad j = \arg\min_{i = 1, \dots, n} d(\bar{x}, x_i)$
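A minimal NumPy sketch of this rule, using a few values from the toy houses table above (the function and variable names are illustrative, not from the course material):

```python
import numpy as np

def nearest_neighbor(X, Y, x_new):
    """Predict the output at x_new as the output of the closest training point.

    X: (n, D) array of inputs; Y: (n,) array of outputs; x_new: (D,) query.
    Cost is O(nD): n Euclidean distances, each computed in O(D).
    """
    dists = np.linalg.norm(X - x_new, axis=1)  # ||x_new - x_i|| for every i
    j = np.argmin(dists)                       # index of the nearest neighbor
    return Y[j]

# Toy usage on four of the house examples (areas in m^2, prices in euros)
X = np.array([[255.0], [264.0], [310.0], [480.0]])
Y = np.array([274_600.0, 324_900.0, 311_200.0, 515_400.0])
print(nearest_neighbor(X, Y, np.array([300.0])))  # 311200.0, since 310 is closest
```

Swapping the Euclidean norm for any other metric gives the general-distance variant above.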
Extensions
Nearest Neighbor takes $\bar{y}$ to be the value of the closest point to $\bar{x}$ in $S$.

[Figure: scatter plot of the dataset]

Can we do better? (For example, using more points.)
K-Nearest Neighbors
K-Nearest Neighbor: $\bar{y}$ is the mean of the values of the $K$ closest points to $\bar{x}$ in $S$. If $K = 3$ we have

$\bar{y} = \frac{274{,}600 + 324{,}900 + 311{,}200}{3} \approx 303{,}600$

[Figure: scatter plot of the dataset, with the three nearest neighbors of the query point highlighted]
K-Nearest Neighbors
$S = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^D$, $y_i \in \mathbb{R}$.
$\bar{x} \in \mathbb{R}^D$ the new point.
Let $K$ be an integer, $K \ll n$, and let $j_1, \dots, j_K$ be defined as

$j_1 = \arg\min_{i \in \{1, \dots, n\}} \|\bar{x} - x_i\|$
$j_t = \arg\min_{i \in \{1, \dots, n\} \setminus \{j_1, \dots, j_{t-1}\}} \|\bar{x} - x_i\|, \quad t \in \{2, \dots, K\}$

The predicted output is

$\bar{y} = \frac{1}{K} \sum_{i \in \{j_1, \dots, j_K\}} y_i$
K-Nearest Neighbors (cont.)

$\hat{f}(\bar{x}) = \frac{1}{K} \sum_{i=1}^{K} y_{j_i}$

Computational cost $O(nD + n \log n)$: compute the $n$ distances $\|\bar{x} - x_i\|$ for $i \in \{1, \dots, n\}$ (each costs $O(D)$), then sort them in $O(n \log n)$.

General metric $d$: $\hat{f}$ is the same, but $j_1, \dots, j_K$ are defined as $j_1 = \arg\min_{i \in \{1, \dots, n\}} d(\bar{x}, x_i)$ and $j_t = \arg\min_{i \in \{1, \dots, n\} \setminus \{j_1, \dots, j_{t-1}\}} d(\bar{x}, x_i)$ for $t \in \{2, \dots, K\}$.
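A sketch of K-NN prediction along the same lines (again a toy illustration, not the course's reference code). A full sort is the $O(n \log n)$ route from the slide; np.argpartition would find the $K$ smallest distances in $O(n)$ instead:

```python
import numpy as np

def knn_predict(X, Y, x_new, K=3):
    """K-nearest-neighbor prediction: mean of the K closest outputs."""
    dists = np.linalg.norm(X - x_new, axis=1)  # O(nD) distance computations
    idx = np.argsort(dists)[:K]                # j_1, ..., j_K via an O(n log n) sort
    return Y[idx].mean()

X = np.array([[255.0], [264.0], [310.0], [480.0]])
Y = np.array([274_600.0, 324_900.0, 311_200.0, 515_400.0])
print(knn_predict(X, Y, np.array([300.0]), K=3))  # (274600+324900+311200)/3 ≈ 303566.7
```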
Parzen Windows
K-NN puts equal weights on the values of the selected points. Can we generalize it? Points closer to $\bar{x}$ should influence its value more.

Parzen windows:

$\hat{f}(\bar{x}) = \frac{\sum_{i=1}^n y_i\, k(\bar{x}, x_i)}{\sum_{i=1}^n k(\bar{x}, x_i)}$

where $k$ is a similarity function:
- $k(x, x') \geq 0$ for all $x, x' \in \mathbb{R}^D$
- $k(x, x') \to 1$ when $\|x - x'\| \to 0$
- $k(x, x') \to 0$ when $\|x - x'\| \to \infty$
Parzen Windows
Examples of $k$ (each with a $\sigma > 0$):

$k_1(x, x') = \mathrm{sign}\!\left(1 - \frac{\|x - x'\|}{\sigma}\right)_+$
$k_2(x, x') = \left(1 - \frac{\|x - x'\|}{\sigma}\right)_+$
$k_3(x, x') = \left(1 - \frac{\|x - x'\|^2}{\sigma^2}\right)_+$
$k_4(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}$
$k_5(x, x') = e^{-\frac{\|x - x'\|}{\sigma}}$
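A sketch of a Parzen-window predictor using the Gaussian similarity $k_4$ (the bandwidth value and the names are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(x_new, X, sigma):
    """k_4: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), for x_new against all rows of X."""
    sq_dists = np.sum((X - x_new) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def parzen_predict(X, Y, x_new, sigma):
    """Similarity-weighted average: f(x) = sum_i y_i k(x, x_i) / sum_i k(x, x_i)."""
    w = gaussian_kernel(x_new, X, sigma)  # weight of each training point
    return np.dot(w, Y) / np.sum(w)

X = np.array([[255.0], [264.0], [310.0], [480.0]])
Y = np.array([274_600.0, 324_900.0, 311_200.0, 515_400.0])
# sigma plays the role K played for K-NN: it sets how local the estimator is
print(parzen_predict(X, Y, np.array([300.0]), sigma=30.0))
```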
K-NN example
K-Nearest Neighbor depends on $K$.

[Figures: K-NN estimates on the same two-dimensional toy data for $K = 1, 2, 3, 4, 5, 9, 15$]

Changing $K$, the result changes a lot! How do we select $K$?
Outline
2. Bias-variance trade-off and model selection
Optimal choice for the hyperparameters
$S = \{(x_i, y_i)\}_{i=1}^n$ training set. Write $Y = (y_1, \dots, y_n)$ and $X = (x_1, \dots, x_n)$.
$K \in \mathbb{N}$ hyperparameter of the learning algorithm.
$\hat{f}_{S,K}$ learned function (depends on $S$ and $K$).

The expected loss $\mathcal{E}_K$ is

$\mathcal{E}_K = \mathbb{E}_S \mathbb{E}_{x,y} (y - \hat{f}_{S,K}(x))^2$

The optimal hyperparameter $K^*$ should minimize $\mathcal{E}_K$:

$K^* = \arg\min_K \mathcal{E}_K$
Optimal choice for the hyperparameters (cont.)
The optimal hyperparameter $K^*$ should minimize $\mathcal{E}_K$:

$K^* = \arg\min_K \mathcal{E}_K$

Ideally! (In practice we don't have access to the distribution.) We can still try to understand the above minimization problem: does a solution exist? What does it depend on? Yet, ultimately, we need something we can compute!
Example: regression problem
Define the pointwise expected loss

$\mathcal{E}_K(x) = \mathbb{E}_S \mathbb{E}_{y|x} (y - \hat{f}_{S,K}(x))^2$

By definition, $\mathcal{E}_K = \mathbb{E}_x \mathcal{E}_K(x)$.

Regression setting:
- regression model $y = f^*(x) + \delta$
- $\mathbb{E}\delta = 0$, $\mathbb{E}\delta^2 = \sigma^2$

Now

$\mathcal{E}_K(x) = \mathbb{E}_S \mathbb{E}_{y|x} (y - \hat{f}_{S,K}(x))^2 = \mathbb{E}_S \mathbb{E}_{y|x} (f^*(x) + \delta - \hat{f}_{S,K}(x))^2$

Expanding the square, the cross term vanishes because $\mathbb{E}\delta = 0$ and the test noise $\delta$ is independent of $S$; that is,

$\mathcal{E}_K(x) = \mathbb{E}_S (f^*(x) - \hat{f}_{S,K}(x))^2 + \sigma^2$ ...
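For completeness, a sketch of that expansion (my notation, following the slide's definitions):

```latex
\mathbb{E}_S \mathbb{E}_{y|x}\,\big(f^*(x) + \delta - \hat f_{S,K}(x)\big)^2
  = \mathbb{E}_S \big(f^*(x) - \hat f_{S,K}(x)\big)^2
  + 2\,\underbrace{\mathbb{E}\delta}_{=\,0}\;
      \mathbb{E}_S \big(f^*(x) - \hat f_{S,K}(x)\big)
  + \underbrace{\mathbb{E}\delta^2}_{=\,\sigma^2}
```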
Bias-Variance trade-off for K-NN
Define the noiseless K-NN (it is ideal!)

$\tilde{f}_{S,K}(x) = \frac{1}{K} \sum_{l \in K_x} f^*(x_l)$

where $K_x$ denotes the set of indices of the $K$ training points nearest to $x$. Note that $\tilde{f}_{S,K}(x) = \mathbb{E}_{Y|X}\, \hat{f}_{S,K}(x)$.

Consider

$\mathcal{E}_K(x) = \underbrace{(f^*(x) - \mathbb{E}_X \tilde{f}_{S,K}(x))^2}_{\text{bias}} + \underbrace{\mathbb{E}_S (\tilde{f}_{S,K}(x) - \hat{f}_{S,K}(x))^2}_{\text{variance}} + \sigma^2$

The variance term simplifies:

$\mathbb{E}_S (\tilde{f}_{S,K}(x) - \hat{f}_{S,K}(x))^2 = \frac{1}{K^2}\, \mathbb{E}_X \sum_{l \in K_x} \mathbb{E}_{y_l|x_l} (y_l - f^*(x_l))^2 = \frac{\sigma^2}{K}$

so that

$\mathcal{E}_K(x) = \underbrace{(f^*(x) - \mathbb{E}_X \tilde{f}_{S,K}(x))^2}_{\text{bias}} + \underbrace{\frac{\sigma^2}{K}}_{\text{variance}} + \sigma^2$ ...
Bias-Variance trade-off

[Figure: error terms as functions of $K$ — the variance $\sigma^2/K$ decreases with $K$ while the bias increases]
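The trade-off can also be seen numerically. Below is a small Monte-Carlo sketch estimating the bias and variance terms for 1D K-NN regression; the target function, noise level, sample sizes, and $K$ grid are all illustrative assumptions, and knn_fit_predict is a hypothetical helper, not course code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.3
f_star = np.sin  # assumed target function f*

def knn_fit_predict(x_tr, y_tr, x_te, K):
    """1D K-NN regression: average the K nearest training outputs at each test point."""
    d = np.abs(x_te[:, None] - x_tr[None, :])  # pairwise |x_te - x_tr| distances
    idx = np.argsort(d, axis=1)[:, :K]         # indices of the K nearest neighbors
    return y_tr[idx].mean(axis=1)

x_te = np.linspace(0.0, 2.0 * np.pi, 50)
for K in (1, 5, 20, 100):
    preds = []
    for _ in range(200):  # draw many training sets S to approximate E_S
        x_tr = rng.uniform(0.0, 2.0 * np.pi, n)
        y_tr = f_star(x_tr) + sigma * rng.normal(size=n)
        preds.append(knn_fit_predict(x_tr, y_tr, x_te, K))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_star(x_te)) ** 2)
    var = np.mean(preds.var(axis=0))           # roughly sigma^2 / K for small K
    print(f"K={K:3d}  bias^2={bias2:.4f}  variance={var:.4f}")
```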
How to choose the hyperparameters
The bias-variance trade-off is theoretical, but it shows that:
- an optimal parameter exists, and
- it depends on the noise and on the unknown target function.

How to choose $K$ in practice?

Idea: train on some data and validate the parameter on new, unseen data, as a proxy for the ideal case.
Hold-out Cross-validation
For each $K$:
1. shuffle and split $S$ into $T$ (training) and $V$ (validation)
2. train the algorithm on $T$ and compute the empirical loss on $V$:

$\hat{\mathcal{E}}_K = \frac{1}{|V|} \sum_{(x,y) \in V} (y - \hat{f}_{T,K}(x))^2$

3. select the $\hat{K}$ that minimizes $\hat{\mathcal{E}}_K$.

The above procedure can be repeated to improve stability, selecting $K$ to minimize the average error over trials. There are other related parameter selection methods (k-fold cross-validation, leave-one-out, ...).
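A sketch of the hold-out procedure for K-NN (the split fraction, grid, and function names are arbitrary choices of mine):

```python
import numpy as np

def knn_predict(X, Y, x_new, K):
    """K-NN prediction, as in the earlier sketch."""
    dists = np.linalg.norm(X - x_new, axis=1)
    return Y[np.argsort(dists)[:K]].mean()

def holdout_cv(X, Y, K_grid, train_frac=0.5, seed=0):
    """Hold-out CV for K-NN: return the K with the smallest validation error."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(Y))              # 1. shuffle and split S into T and V
    n_tr = int(train_frac * len(Y))
    tr, va = perm[:n_tr], perm[n_tr:]
    errors = {}
    for K in K_grid:                            # 2. train on T, evaluate on V
        preds = np.array([knn_predict(X[tr], Y[tr], x, K) for x in X[va]])
        errors[K] = np.mean((Y[va] - preds) ** 2)
    return min(errors, key=errors.get), errors  # 3. K minimizing the empirical loss

# Usage (with X, Y as NumPy arrays):
# K_hat, errs = holdout_cv(X, Y, K_grid=range(1, 21))
```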
Training and Validation Error behavior

[Figure: training and validation error as functions of $K$, with $T$ = 50% of $S$ and $n = 200$; the validation error is minimized at $\hat{K} = 8$]
Wrapping up
In this class we made our first encounter with learning algorithms (local methods) and the problem of tuning their parameters (via bias-variance trade-off and cross-validation) to avoid overfitting and achieve generalization.
Next Class
High Dimensions: Beyond local methods!