MLCC 2018 Local Methods and Bias Variance Trade-Off. Lorenzo Rosasco UNIGE-MIT-IIT


About this class
1. Introduce a basic class of learning methods, namely local methods.
2. Discuss the fundamental concept of the bias-variance trade-off, to understand parameter tuning (a.k.a. model selection).

Outline

The problem
What is the price of one house given its area? Start from data...

Area (m²)     Price (€)
x_1 = 62      y_1 = 99,200
x_2 = 64      y_2 = 135,700
x_3 = 65      y_3 = 93,300
x_4 = 66      y_4 = 114,000
...           ...

[Scatter plot of the dataset: area in m² on the x-axis, price in € (×10^5) on the y-axis.]

Let S be the houses example dataset (n = 100), $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Given a new point $x$ we want to predict $y$ by means of S.

Example
Let $x$ be a 300 m² house. What is its price?

Area (m²)     Price (€)
...           ...
x_93 = 255    y_93 = 274,600
x_94 = 264    y_94 = 324,900
x_95 = 310    y_95 = 311,200
x_96 = 480    y_96 = 515,400
...           ...

[Scatter plot of the dataset, with the new 300 m² house marked.]

Nearest Neighbors
Nearest Neighbor: $y$ is the value of the closest point to $x$ in S. The closest point to the 300 m² house is x_95 = 310, so y = 311,200.

Nearest Neighbors
$S = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^D$, $y_i \in \mathbb{R}$; $x$ the new point, $x \in \mathbb{R}^D$.
The predicted output is $y = \hat f(x)$, where
$$y = y_j, \qquad j = \arg\min_{i=1,\ldots,n} \|x - x_i\|.$$
Computational cost $O(nD)$: we compute $n$ times the distance $\|x - x_i\|$, each costing $O(D)$.
In general, let $d : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ be a distance on the input space; then
$$\hat f(x) = y_j, \qquad j = \arg\min_{i=1,\ldots,n} d(x, x_i).$$
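The nearest-neighbor rule translates directly into a few lines of code. Below is a minimal sketch (not part of the original slides), assuming NumPy and the Euclidean distance; the function name and the toy data are illustrative.

```python
import numpy as np

def nearest_neighbor_predict(X, Y, x_new):
    """1-NN prediction: return the output of the training point closest to x_new.

    X: (n, D) training inputs, Y: (n,) training outputs, x_new: (D,) new input.
    Cost is O(nD), one distance per training point.
    """
    distances = np.linalg.norm(X - x_new, axis=1)  # ||x_new - x_i|| for all i
    j = np.argmin(distances)                       # index of the closest point
    return Y[j]

# Toy usage on the house-price example (areas in m^2, prices in euros).
X = np.array([[62.0], [64.0], [255.0], [264.0], [310.0], [480.0]])
Y = np.array([99_200, 135_700, 274_600, 324_900, 311_200, 515_400])
print(nearest_neighbor_predict(X, Y, np.array([300.0])))  # -> 311200
```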

Extensions
Nearest Neighbor takes $y$ to be the value of the closest point to $x$ in S. Can we do better? (for example, using more points)

K-Nearest Neighbors
K-Nearest Neighbor: $y$ is the mean of the values of the $K$ closest points to $x$ in S. If $K = 3$ we have
$$y = \frac{274{,}600 + 324{,}900 + 311{,}200}{3} \approx 303{,}600.$$

K-Nearest Neighbors
$S = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^D$, $y_i \in \mathbb{R}$; $x$ the new point, $x \in \mathbb{R}^D$.
Let $K$ be an integer with $K \ll n$, and define $j_1, \ldots, j_K$ by
$$j_1 = \arg\min_{i \in \{1,\ldots,n\}} \|x - x_i\|, \qquad j_t = \arg\min_{i \in \{1,\ldots,n\} \setminus \{j_1,\ldots,j_{t-1}\}} \|x - x_i\| \ \text{ for } t \in \{2,\ldots,K\}.$$
The predicted output is
$$y = \frac{1}{K} \sum_{i \in \{j_1,\ldots,j_K\}} y_i.$$

K-Nearest Neighbors (cont.)
$$\hat f(x) = \frac{1}{K} \sum_{i=1}^{K} y_{j_i}$$
Computational cost $O(nD + n \log n)$: compute the $n$ distances $\|x - x_i\|$ for $i \in \{1,\ldots,n\}$ (each costs $O(D)$), then order them in $O(n \log n)$.
General metric $d$: $\hat f$ is the same, but $j_1, \ldots, j_K$ are defined as $j_1 = \arg\min_{i \in \{1,\ldots,n\}} d(x, x_i)$ and $j_t = \arg\min_{i \in \{1,\ldots,n\} \setminus \{j_1,\ldots,j_{t-1}\}} d(x, x_i)$ for $t \in \{2,\ldots,K\}$.
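A corresponding sketch for the K-NN estimator, again with illustrative names; sorting all n distances with `np.argsort` matches the O(nD + n log n) cost above (a partial sort would suffice in practice).

```python
import numpy as np

def knn_predict(X, Y, x_new, K=3):
    """K-NN regression: average the outputs of the K training points closest to x_new."""
    distances = np.linalg.norm(X - x_new, axis=1)  # n distances, O(nD)
    neighbors = np.argsort(distances)[:K]          # indices j_1, ..., j_K, O(n log n)
    return Y[neighbors].mean()

# With K = 3 on the house data, the prediction is the mean of the three
# closest prices, roughly 303,600 as in the slides.
X = np.array([[255.0], [264.0], [310.0], [480.0]])
Y = np.array([274_600, 324_900, 311_200, 515_400])
print(knn_predict(X, Y, np.array([300.0]), K=3))
```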

Parzen Windows
K-NN puts equal weights on the values of the selected points. Can we generalize it? Closer points to $x$ should influence its value more.
PARZEN WINDOWS:
$$\hat f(x) = \frac{\sum_{i=1}^n y_i\, k(x, x_i)}{\sum_{i=1}^n k(x, x_i)},$$
where $k$ is a similarity function:
$k(x, x') \geq 0$ for all $x, x' \in \mathbb{R}^D$;
$k(x, x') \to 1$ when $\|x - x'\| \to 0$;
$k(x, x') \to 0$ when $\|x - x'\| \to \infty$.

Parzen Windows
Examples of $k$ (each with a $\sigma > 0$):
$$k_1(x, x') = \operatorname{sign}\!\left(1 - \frac{\|x - x'\|}{\sigma}\right)_+ \qquad
k_2(x, x') = \left(1 - \frac{\|x - x'\|}{\sigma}\right)_+ \qquad
k_3(x, x') = \left(1 - \frac{\|x - x'\|^2}{\sigma^2}\right)_+$$
$$k_4(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}} \qquad
k_5(x, x') = e^{-\frac{\|x - x'\|}{\sigma}}$$
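A sketch of the Parzen-window estimator using the Gaussian window $k_4$; the bandwidth value and the function name are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def parzen_predict(X, Y, x_new, sigma=20.0):
    """Parzen-window regression: weighted average of the outputs, with weights
    k_4(x_new, x_i) = exp(-||x_new - x_i||^2 / (2 sigma^2)),
    so closer points influence the prediction more."""
    sq_dists = np.sum((X - x_new) ** 2, axis=1)
    weights = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.sum(weights * Y) / np.sum(weights)

X = np.array([[255.0], [264.0], [310.0], [480.0]])
Y = np.array([274_600, 324_900, 311_200, 515_400])
print(parzen_predict(X, Y, np.array([300.0]), sigma=20.0))
```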

K-NN example
K-Nearest Neighbor depends on $K$.

[Plots of the K-NN estimator on the same sample dataset for K = 1, 2, 3, 4, 5, 9, 15.]

Changing $K$, the result changes a lot! How do we select $K$?

Outline

Optimal choice for the Hyper-parameters
$S = \{(x_i, y_i)\}_{i=1}^n$ training set. Name $Y = (y_1, \ldots, y_n)$ and $X = (x_1, \ldots, x_n)$.
$K \in \mathbb{N}$ hyper-parameter of the learning algorithm.
$\hat f_{S,K}$ learned function (depends on $S$ and $K$).
The expected loss $\mathcal{E}_K$ is
$$\mathcal{E}_K = \mathbb{E}_S\, \mathbb{E}_{x,y} (y - \hat f_{S,K}(x))^2.$$
The optimal hyper-parameter $K^*$ should minimize $\mathcal{E}_K$:
$$K^* = \arg\min_{K \in \mathcal{K}} \mathcal{E}_K.$$

Optimal choice for the Hyper-parameters (cont.)
The optimal hyper-parameter $K^*$ should minimize $\mathcal{E}_K$:
$$K^* = \arg\min_{K \in \mathcal{K}} \mathcal{E}_K.$$
Ideally! In practice we don't have access to the distribution. We can still try to understand the above minimization problem: does a solution exist? What does it depend on? Yet, ultimately, we need something we can compute!

Example: regression problem
Define the pointwise expected loss
$$\mathcal{E}_K(x) = \mathbb{E}_S\, \mathbb{E}_{y|x} (y - \hat f_{S,K}(x))^2.$$
By definition $\mathcal{E}_K = \mathbb{E}_x\, \mathcal{E}_K(x)$.
Regression setting: regression model $y = f^*(x) + \delta$, with $\mathbb{E}\delta = 0$ and $\mathbb{E}\delta^2 = \sigma^2$.
Now
$$\mathcal{E}_K(x) = \mathbb{E}_S\, \mathbb{E}_{y|x} (y - \hat f_{S,K}(x))^2 = \mathbb{E}_S\, \mathbb{E}_{y|x} (f^*(x) + \delta - \hat f_{S,K}(x))^2,$$
that is,
$$\mathcal{E}_K(x) = \mathbb{E}_S (f^*(x) - \hat f_{S,K}(x))^2 + \sigma^2 \ldots$$
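The last step uses the fact that the noise $\delta$ on the new point has zero mean, variance $\sigma^2$, and is independent of the training set $S$; a short expansion, in the same notation as above, makes it explicit.

```latex
\begin{aligned}
\mathcal{E}_K(x)
&= \mathbb{E}_S\, \mathbb{E}_{y|x}\bigl(f^*(x) + \delta - \hat f_{S,K}(x)\bigr)^2 \\
&= \mathbb{E}_S\, \mathbb{E}_{y|x}\Bigl[\bigl(f^*(x) - \hat f_{S,K}(x)\bigr)^2
   + 2\,\delta\bigl(f^*(x) - \hat f_{S,K}(x)\bigr) + \delta^2\Bigr] \\
&= \mathbb{E}_S\bigl(f^*(x) - \hat f_{S,K}(x)\bigr)^2 + \sigma^2,
\end{aligned}
```

since $\mathbb{E}_{y|x}\,\delta = 0$ makes the cross term vanish and $\mathbb{E}_{y|x}\,\delta^2 = \sigma^2$.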

Bias Variance trade-off for K-NN
Define the noiseless K-NN (it is ideal!)
$$f^*_{S,K}(x) = \frac{1}{K} \sum_{\ell \in K_x} f^*(x_\ell),$$
where $K_x$ is the set of indices of the $K$ nearest neighbors of $x$ in $S$. Note that $f^*_{S,K}(x) = \mathbb{E}_{Y|X}\, \hat f_{S,K}(x)$.
Consider
$$\mathcal{E}_K(x) = \underbrace{(f^*(x) - \mathbb{E}_X f^*_{S,K}(x))^2}_{\text{bias}} + \underbrace{\mathbb{E}_S (f^*_{S,K}(x) - \hat f_{S,K}(x))^2}_{\text{variance}} + \sigma^2.$$
The variance term equals
$$\frac{1}{K^2}\, \mathbb{E}_X \sum_{\ell \in K_x} \mathbb{E}_{y_\ell | x_\ell} (y_\ell - f^*(x_\ell))^2 = \frac{\sigma^2}{K},$$
so that
$$\mathcal{E}_K(x) = \underbrace{(f^*(x) - \mathbb{E}_X f^*_{S,K}(x))^2}_{\text{bias}} + \underbrace{\frac{\sigma^2}{K}}_{\text{variance}} + \sigma^2 \ldots$$

Bias Variance trade-off

[Plot: bias and variance contributions to the error as functions of K; the variance decreases with K while the bias increases, so their sum is minimized at an intermediate K.]
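This kind of curve can be reproduced numerically. The sketch below (illustrative; the target $f^* = \sin$, the noise level, and the test point are assumptions, not from the slides) draws many training sets from $y = f^*(x) + \delta$ and estimates the squared bias and the variance of the K-NN prediction at a fixed test point for several values of $K$.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = np.sin                 # assumed target function
sigma, n, x_test, n_trials = 0.3, 200, 0.5, 500

def knn_fit(X, Y, x, K):
    """K-NN regression prediction at a single 1-D point x."""
    idx = np.argsort(np.abs(X - x))[:K]
    return Y[idx].mean()

for K in (1, 5, 20, 80):
    preds = np.empty(n_trials)
    for t in range(n_trials):
        X = rng.uniform(0.0, 1.0, n)                 # training inputs
        Y = f_star(X) + sigma * rng.normal(size=n)   # noisy training outputs
        preds[t] = knn_fit(X, Y, x_test, K)
    bias2 = (preds.mean() - f_star(x_test)) ** 2     # squared bias
    variance = preds.var()                           # variance over training sets
    print(f"K={K:3d}  bias^2={bias2:.4f}  variance={variance:.4f}")
```

As K grows, the variance shrinks roughly like $\sigma^2/K$ while the squared bias increases, matching the trade-off above.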

How to choose the hyper-parameters
The bias-variance trade-off is theoretical, but it shows that an optimal parameter exists, and that it depends on the noise and the unknown target function.
How do we choose K in practice? Idea: train on some data and validate the parameter on new, unseen data, as a proxy for the ideal case.

Hold-out Cross-validation
For each K:
1. shuffle and split S into T (training) and V (validation);
2. train the algorithm on T and compute the empirical loss on V,
$$\hat{\mathcal{E}}_K = \frac{1}{|V|} \sum_{(x,y) \in V} (y - \hat f_{T,K}(x))^2;$$
3. select the $\hat K$ that minimizes $\hat{\mathcal{E}}_K$ (a code sketch of this procedure follows below).
The above procedure can be repeated to improve stability, with K selected to minimize the error over trials. There are other related parameter selection methods (k-fold cross-validation, leave-one-out, ...).
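A minimal sketch of the hold-out procedure for selecting K; it assumes the `knn_predict` function from the earlier sketch, and the split ratio and candidate grid are illustrative.

```python
import numpy as np

def holdout_select_K(X, Y, candidate_Ks, val_fraction=0.5, seed=0):
    """Hold-out cross-validation: split S into T and V, return the K with
    the smallest empirical loss on V, together with all validation losses."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))                   # shuffle S
    n_val = int(val_fraction * len(X))
    val_idx, train_idx = perm[:n_val], perm[n_val:]  # V and T
    X_tr, Y_tr, X_val, Y_val = X[train_idx], Y[train_idx], X[val_idx], Y[val_idx]

    errors = {}
    for K in candidate_Ks:
        preds = np.array([knn_predict(X_tr, Y_tr, x, K) for x in X_val])
        errors[K] = np.mean((Y_val - preds) ** 2)    # empirical loss on V
    K_hat = min(errors, key=errors.get)
    return K_hat, errors

# Example usage: K_hat, errs = holdout_select_K(X, Y, candidate_Ks=range(1, 21))
```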

Training and Validation Error behavior

[Plot: training and validation error as functions of K (K = 1, ..., 20), for a hold-out split with T = 50% of S and n = 200; the validation error is minimized at K̂ = 8.]

Wrapping up
In this class we made our first encounter with learning algorithms (local methods) and with the problem of tuning their parameters (via the bias-variance trade-off and cross-validation) to avoid overfitting and achieve generalization.

Next Class
High Dimensions: Beyond local methods!