UVA CS 6316/4501 Fall 2016 Machine Learning. Lecture 15: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

Size: px

Start display at page:

Download "UVA CS 6316/4501 Fall 2016 Machine Learning. Lecture 15: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia"

Dinah Byrd
5 years ago
Views:

1 UVA CS 6316/4501 Fall 2016 Machine Learning Lecture 15: K-nearest-neighbor Classifier / Bias-Variance Tradeoff Dr. Yanjun Qi University of Virginia Department of Computer Science 11/9/16 1

2 Rough Plan HW5 is due on Sat HW6 will have two coding Q (image + audio) HW7 will be sample final quesrons Final will be most contents ater midterm Midterm grade will be released this evening Mean 78 / Median 79 / Max 95 Final will be in-class / close note Dec5th 11/9/16 2

3 Where are we? è Five major secrons of this course q Regression (supervised) q ClassificaRon (supervised) q Unsupervised models q Learning theory q Graphical models 11/9/16 3

4 Where are we? è Three major secrons for classificaron We can divide the large variety of classification approaches into roughly three major types 1. Discriminative - directly estimate a decision rule/boundary - e.g., logistic regression, support vector machine, decisiontree 2. Generative: - build a generative statistical model - e.g., naïve bayes classifier, Bayesian networks 3. Instance based classifiers - Use observation directly (no models) - e.g. K nearest neighbors 11/9/16 4

5 Today : ü K-nearest neighbor ü Model SelecRon / Bias Variance Tradeoff 11/9/16 5

6 Nearest neighbor classifiers Basic idea: If it walks like a duck, quacks like a duck, then it s probably a duck compute distance test sample training samples choose k of the nearest samples 11/9/16 6

7 Nearest neighbor classifiers Unknown record Requires three inputs: 1. The set of stored training samples 2. Distance metric to compute distance between samples 3. The value of k, i.e., the number of nearest neighbors to retrieve 11/9/16 7

8 Nearest neighbor classifiers Unknown record To classify unknown sample: 1. Compute distance to other training records 2. IdenRfy k nearest neighbors 3. Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote) 11/9/16 8

9 Definition of nearest neighbor X X X (a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor k-nearest neighbors of a sample x are datapoints that have the k smallest distances to x 11/9/16 9

10 1-nearest neighbor Voronoi diagram: partitioning of a plane into regions based on distance to points in a specific subset of the plane. 11/9/16 10

11 Nearest neighbor classification Compute distance between two points: For instance, Euclidean distance d( x, y) ( x i y i Options for determining the class from nearest neighbor list Take majority vote of class labels among the k-nearest neighbors Weight the votes according to distance example: weight factor w = 1 / d 2 = i 2 ) 11/9/16 11

12 Nearest neighbor classification Choosing the value of k: If k is too small, sensitive to noise points If k is too large, neighborhood may include points from other classes X 11/9/16 12

13 Nearest neighbor classification Choosing the value of k: If k is too small, sensitive to noise points If k is too large, neighborhood may include points from other classes X 11/9/16 13

14 Nearest neighbor classification Scaling issues Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes Example: height of a person may vary from 1.5 m to 1.8 m weight of a person may vary from 90 lb to 300 lb income of a person may vary from $10K to $1M 11/9/16 14

15 Nearest neighbor classification k-nearest neighbor classifier is a lazy learner Does not build model explicitly. Classifying unknown samples is relatively expensive. k-nearest neighbor classifier is a local model, vs. global model of linear classifiers. 11/9/16 15

16 Nearest neighbor classification k-nearest neighbor classifier is a lazy learner Does not build model explicitly. Classifying unknown samples is relatively expensive. k-nearest neighbor classifier is a local model, vs. global model of linear classifiers. 11/9/16 16

17 K-Nearest Neighbor Task classification Representation Local Smoothness Score Function NA Search/Optimization NA Models, Parameters Training Samples 11/9/16 17

18 Decision boundaries in global vs. local models K=15 K=1 linear regression 15-nearest neighbor 1-nearest neighbor global stable can be inaccurate K acts as a smoother local accurate unstable What ultimately matters: GENERALIZATION 11/9/16 18

19 Dr. Yanjun Qi / UVA 19 CS 6316 / f16 Vs. Locally weighted regression aka locally weighted regression, locally linear regression, LOESS, K λ (x i, x 0 ) linear_func(x)->y è could represent only the neighbor region of x_0 11/9/16 Use RBF funcron to pick out/emphasize the neighbor region of x_0 è K λ (x i, x 0 )

20 Vs. Locally weighted Dr. Yanjun Qi / UVA CS 6316 / f16 regression Separate weighted least squares at each target point x 0 : x 0 min α (x 0 ),β (x 0 ) N i=1 K λ (x i, x 0 )[y i α(x 0 ) β(x 0 )x i ] 2 fˆ ( x ˆ( ˆ( 0 ) =α x0) + β x0) x 0 " K τ (x i, x0) = exp (x i $ x0)2 # 2τ 2 % ' & 11/9/16 20

21 Today : ü K-nearest neighbor ü Model SelecRon / Bias Variance Tradeoff ü EPE ü DecomposiRon of MSE ü Bias-Variance tradeoff ü High bias? High variance? How to respond? 11/9/16 21

22 e.g. Training Error from KNN, Lesson Learned Dr. Yanjun Qi / UVA CS 6316 / f16 When k = 1, No misclassifications (on training): Overtraining 1-nearest neighbor averaging Minimizing training error is not always good (e.g., 1- NN) ˆ > 0.5 Y Y ˆ = 0.5 Y ˆ < 0. 5 X 1 11/9/16 22 X 2

EPE( f ) = E(L(Y, f (X))) = L(y, f (x)) Pr(dx,dy) e.g.

23 Statistical Decision Theory Random input vector: X Random output variable:y Joint distribution: Pr(X,Y ) Loss function L(Y, f(x)) Expected prediction error (EPE): EPE( f ) = E(L(Y, f (X))) = L(y, f (x)) Pr(dx,dy) e.g. = (y f (x)) 2 Pr(dx,dy) e.g. Squared error loss (also called L2 loss ) Consider population distribution 11/9/16 23

= (y f (x)) 2 Pr(dx, dy) under L2 loss, best estimator for EPE (Theoretically) is : Conditional mean!

24 Expected prediction error (EPE) EPE( f ) = E(L(Y, f (X))) = L(y, f (x)) Pr(dx, dy) Consider joint distribution For L2 loss: e.g. = (y f (x)) 2 Pr(dx, dy) under L2 loss, best estimator for EPE (Theoretically) is : Conditional mean! ˆf (x)= E(Y X = x) e.g. KNN NN methods are the direct implementation (approximation ) For 0-1 loss: L(k, l) = 1-d kl Bayes Classifier ˆf (X)=C k!if! Pr(C k X = x)= maxpr(g X = x) g C! 11/9/16 24

25 Dr. Yanjun Qi / 25 UVA CS 6316 / f16 KNN FOR MINIMIZING EPE We know under L2 loss, best estimator for EPE (Theoretically) is : Conditional mean f ( x) = E( Y X = x ) Nearest neighbors assumes that f(x) is well approximated by a locally constant function. 11/9/16

26 Review : WHEN EPE USES DIFFERENT LOSS Loss Function Estimator L 2 L (Bayes classifier / MAP) 11/9/16 Dr. Yanjun Qi / UVA CS 6316 / f16 26

27 Today : ü K-nearest neighbor ü Model SelecRon / Bias Variance Tradeoff ü EPE ü DecomposiRon of MSE ü Bias-Variance tradeoff ü High bias? High variance? How to respond? 11/9/16 27

28 Dr. Yanjun Qi / 28 UVA CS 6316 / f16 Decomposition of EPE When additive error model: Notations Output random variable: Prediction function: Prediction estimator: 11/9/16 Irreducible / Bayes error MSE component of f- hat in estimating f

29 Bias-Variance Trade-off for EPE: EPE (x_0) = noise 2 + bias 2 + variance Unavoidable error Error due to incorrect assumptions Error due to variance of training samples 11/9/16 29 Slide credit: D. Hoiem

30 BIAS AND VARIANCE TRADE-OFF for MSE (more general setting!!! ): Dr. Yanjun Qi / 30 UVA CS 6316 / f16 Bias measures accuracy or quality of the estimator low bias implies on average we will accurately estimate true parameter or func from training data Variance 11/9/16 Measures precision or specificity of the estimator Low variance implies the estimator does not change much as the training set varies

31 BIAS AND VARIANCE TRADE-OFF for MSE of parameter estimation: Error due to variance of training samples Error due to incorrect assumptions 11/9/16 31

32 Today : ü K-nearest neighbor ü Model SelecRon / Bias Variance Tradeoff ü EPE ü DecomposiRon of MSE ü Bias-Variance tradeoff ü High bias? High variance? How to respond? 11/9/16 32

33 Bias-Variance Tradeoff / Model Selection underfit region overfit region 11/9/16 33

34 (1) Training vs Test Error Training error can always be reduced when increasing model complexity, Expected Test Error Expected Test error and CV error è good approximaron of EPE Expected Training Error 11/9/16 34

35 (2) Bias-Variance Trade-off Models with too few parameters are inaccurate because of a large bias (not enough flexibility). Models with too many parameters are inaccurate because of a large variance (too much sensirvity to the sample randomness). 11/9/16 35 Slide credit: D. Hoiem

Medium Bias Medium Variance Model complexity = medium Smallest Bias

36 (2.1) Regression: Complexity versus Goodness of Fit Highest Bias Lowest variance Model complexity = low Low Variance / High Bias Medium Bias Medium Variance Model complexity = medium Smallest Bias Highest variance Model complexity = high Low Bias / High Variance 11/9/16 36

37 (2.2) Classification, Decision boundaries in global vs. local models Low Variance / High Bias linear regression global stable can be inaccurate 15-nearest neighbor Low Variance / High Bias KNN local accurate unstable 1-nearest neighbor Low Bias / High Variance What ultimately matters: GENERALIZATION 11/9/16 37

38 Bias-Variance Tradeoff / Model Selection underfit region overfit region 11/9/16 38

39 Middle RED: TRUE function Model bias & Model variance Dr. Yanjun Qi / UVA CS 6316 / f16 Error due to bias: How far off in general from the middle red Error due to variance: How wildly the blue points spread 11/9/16 39

40 need to make assumprons Dr. Yanjun Qi that / UVA CS 6316 / f16 are able to generalize Components of generalization error Bias: how much the average model over all training sets differ from the true model? Error due to inaccurate assumptions/simplifications made by the model Variance: how much models estimated from different training sets differ from each other Underfitting: model is too simple to represent all the relevant class characteristics High bias and low variance High training error and high test error Overfitting: model is too complex and fits irrelevant characteristics (noise) in the data Low bias and high variance Low training error and high test error 11/9/16 40 Slide credit: L. Lazebnik

41 Today : ü K-nearest neighbor ü Model SelecRon / Bias Variance Tradeoff ü EPE ü DecomposiRon of MSE ü Bias-Variance tradeoff ü High bias? High variance? How to respond? 11/9/16 41

42 (1) High variance Could also use CrossV Error Test error srll decreasing as m increases. Suggests larger training set will help. Large gap between training and test error. Low training error and high test error 11/9/16 42 Slide credit: A. Ng

43 How to reduce variance? Choose a simpler classifier Regularize the parameters Get more training data Try smaller set of features 11/9/16 43 Slide credit: D. Hoiem

44 (2) High bias Could also use CrossV Error Even training error is unacceptably high. Small gap between training and test error. High training error and high test error 11/9/16 44 Slide credit: A. Ng

45 How to reduce Bias? E.g. - Get addironal features - Try adding basis expansions, e.g. polynomial - Try more complex learner 11/9/16 45

46 (3) For instance, if trying to solve spam detecron using (Extra) L2 - If performance is not as desired 11/9/16 46 Slide credit: A. Ng

47 (4) Model SelecRon and Assessment Model SelecRon EsRmaRng performances of different models to choose the best one Model Assessment Having chosen a model, esrmarng the predicron error on new data 11/9/16 47

48 Model SelecRon and Assessment (Extra) Data Rich Scenario: Split the dataset Train Validation Test Dr. Yanjun Qi / UVA CS 6316 / f16 Model Selection Model assessment Insufficient data to split into 3 parts Approximate validaron step analyrcally AIC, BIC, MDL, SRM Efficient reuse of samples Cross validaron, bootstrap 11/9/16 48

49 Today Recap: ü K-nearest neighbor ü Model SelecRon / Bias Variance Tradeoff ü EPE ü DecomposiRon of MSE ü Bias-Variance tradeoff ü High bias? High variance? How to respond? 11/9/16 49

50 References q Prof. Tan, Steinbach, Kumar s IntroducRon to Data Mining slide q Prof. Andrew Moore s slides q Prof. Eric Xing s slides q HasRe, Trevor, et al. The elements of sta.s.cal learning. Vol. 2. No. 1. New York: Springer, /9/16 50

UVA CS 4501: Machine Learning. Lecture 10: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

UVA CS 4501: Machine Learning. Lecture 10: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia UVA CS 4501: Machine Learning Lecture 10: K-nearest-neighbor Classifier / Bias-Variance Tradeoff Dr. Yanjun Qi University of Virginia Department of Computer Science 1 Where are we? è Five major secfons