UVA CS 4501: Machine Learning. Lecture 10: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

1 UVA CS 4501: Machine Learning Lecture 10: K-nearest-neighbor Classifier / Bias-Variance Tradeoff Dr. Yanjun Qi University of Virginia Department of Computer Science 1

2 Where are we? Five major sections of this course: Regression (supervised); Classification (supervised); Unsupervised models; Learning theory; Graphical models 2

3 Three major sections for classification We can divide the large variety of classification approaches into roughly three major types: 1. Discriminative - directly estimate a decision rule/boundary - e.g., support vector machine, decision tree, logistic regression - e.g., neural networks (NN), deep NN 2. Generative - build a generative statistical model - e.g., Bayesian networks, naïve Bayes classifier 3. Instance-based classifiers - use the observations directly (no model) - e.g., K nearest neighbors 3

4 Today: K-nearest neighbor; Model Selection / Bias-Variance Tradeoff; Bias-Variance tradeoff; High bias? High variance? How to respond? 4

5 Nearest neighbor classifiers Classifying an unknown record requires three inputs: 1. The set of stored training samples 2. A distance metric to compute the distance between samples 3. The value of k, i.e., the number of nearest neighbors to retrieve 5

6 Nearest neighbor classifiers To classify an unknown record: 1. Compute its distance to the training records 2. Identify the k nearest neighbors 3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote) 6

7 Definition of nearest neighbor (figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor) The k-nearest neighbors of a sample x are the data points that have the k smallest distances to x 7

8 1-nearest neighbor Voronoi diagram: partitioning of a plane into regions based on distance to points in a specific subset of the plane. 8

9 Nearest neighbor classification Compute the distance between two points, for instance the Euclidean distance d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}. Options for determining the class from the nearest neighbor list: take a majority vote of the class labels among the k nearest neighbors, or weight the votes according to distance, e.g., with weight factor w = 1 / d^2. 9
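
The following is a minimal sketch of the k-NN rule just described, written with NumPy; the function name knn_classify, the data layout, and the tie-breaking behavior are illustrative assumptions rather than anything specified in the lecture.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3, weighted=False):
    # 1. Compute the Euclidean distance from the query to every stored training record.
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # 2. Identify the k nearest neighbors.
    nn_idx = np.argsort(dists)[:k]
    # 3. Combine the neighbors' labels: plain majority vote, or a distance-weighted
    #    vote with weight factor w = 1 / d^2 as on the slide.
    if not weighted:
        return Counter(y_train[nn_idx]).most_common(1)[0][0]
    votes = {}
    for i in nn_idx:
        w = 1.0 / (dists[i] ** 2 + 1e-12)   # small constant avoids division by zero
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

With k = 1 and no weighting this reduces to the 1-nearest-neighbor rule, whose decision regions form the Voronoi diagram shown on the previous slide.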

10 Nearest neighbor classification Choosing the value of k: If k is too small, sensitive to noise points If k is too large, neighborhood may include (many) points from other classes X 10

11 Nearest neighbor classification Choosing the value of k: If k is too small, sensitive to noise points If k is too large, neighborhood may include points from other classes X 11

12 Nearest neighbor classification Scaling issues Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes Example: height of a person may vary from 1.5 m to 1.8 m weight of a person may vary from 90 lb to 300 lb income of a person may vary from $10K to $1M 12
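
As a hedged illustration of the scaling fix suggested above (not code from the course), one common choice is min-max scaling, with the ranges estimated on the training set only:

import numpy as np

def min_max_scale(X_train, X_test):
    # Rescale every attribute to [0, 1] using the training-set range,
    # so no single attribute (e.g., income in dollars) dominates the distance.
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (X_train - lo) / span, (X_test - lo) / span

Standardizing to zero mean and unit variance (z-scoring) is an equally common alternative; the point is simply that height in meters, weight in pounds, and income in dollars end up on comparable scales before distances are computed.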

13 Nearest neighbor classification The k-nearest neighbor classifier is a lazy learner: it does not build a model explicitly, so classifying unknown samples is relatively expensive. The k-nearest neighbor classifier is a local model, in contrast to global models such as linear classifiers. 13

14 Computational Time Cost 14

15 K-Nearest Neighbor - Task: regression / classification; Representation: local smoothness; Score function: NA; Search/Optimization: NA; Models, Parameters: the training samples 15

16 Decision boundaries in global vs. local models (figure: linear classification; 15-nearest neighbor (K=15); 1-nearest neighbor (K=1)) The global model is stable but can be inaccurate; the local model is accurate but unstable. K acts as a smoother. What ultimately matters: GENERALIZATION 16

17 Today: K-nearest neighbor; Model Selection / Bias-Variance Tradeoff; Bias-Variance tradeoff; High bias? High variance? How to respond? 17

18 e.g., Training Error from KNN - Lesson Learned When k = 1 there are no misclassifications on the training set: overtraining. (figure: 1-nearest-neighbor averaging over features X1, X2, with regions ŷ > 0.5, ŷ = 0.5, ŷ < 0.5) Minimizing training error is not always good (e.g., 1-NN). 18
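
A quick illustrative check of this point (my own sketch, with made-up random data, not material from the slides): a 1-NN classifier memorizes the training set, so its training error is zero even when the labels are pure noise.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = rng.integers(0, 2, size=50)          # even random labels get memorized

# 1-NN prediction for each training point: its nearest training neighbor is itself
# (distance zero), so the predicted label equals the true label for every point.
d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
pred = y[np.argmin(d, axis=1)]
print("1-NN training error:", (pred != y).mean())   # 0.0

Zero training error here tells us nothing about generalization, which is exactly why training error alone is a poor criterion for choosing k.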

19 Review: Mean and Variance of a Random Variable (RV) Mean (Expectation): for discrete RVs, E(X) = \mu = \sum_{v_i} v_i P(X = v_i); for continuous RVs, E(X) = \int_{-\infty}^{+\infty} x \, p(x) \, dx 19 Adapted from Carlos's probability tutorial

20 Review: Mean and Variance of a Random Variable (RV) Mean (Expectation): for discrete RVs, E(X) = \mu = \sum_{v_i} v_i P(X = v_i) and E(g(X)) = \sum_{v_i} g(v_i) P(X = v_i); for continuous RVs, E(X) = \int_{-\infty}^{+\infty} x \, p(x) \, dx and E(g(X)) = \int_{-\infty}^{+\infty} g(x) \, p(x) \, dx 20 Adapted from Carlos's probability tutorial

21 Review: Mean and Variance of RV Variance: Var(X) = E((X - \mu)^2); for discrete RVs, Var(X) = \sum_{v_i} (v_i - \mu)^2 P(X = v_i); for continuous RVs, Var(X) = \int_{-\infty}^{+\infty} (x - \mu)^2 p(x) \, dx 21 Adapted from Carlos's probability tutorial
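
A tiny numerical sanity check of these definitions (the values 1, 2, 3 and their probabilities are arbitrary illustrative choices, not from the slides):

import numpy as np

# Discrete RV taking values v_i with probabilities P(X = v_i)
values = np.array([1.0, 2.0, 3.0])
probs  = np.array([0.2, 0.5, 0.3])

mean = np.sum(values * probs)                 # E[X] = sum_i v_i P(X = v_i)
var  = np.sum((values - mean) ** 2 * probs)   # Var(X) = E[(X - mu)^2]
print(mean, var)                              # 2.1 and 0.49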

22 BIAS AND VARIANCE TRADE-OFF Bias measures the accuracy or quality of the estimator: low bias implies that, on average, we accurately estimate the true parameter from the training data. Variance measures the precision or specificity of the estimator: low variance implies the estimator does not change much as the training set varies. 22

23 BIAS AND VARIANCE TRADE-OFF for the mean squared error of parameter estimation: E[(\hat{\theta} - \theta)^2] = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2, where the variance term is the error due to the variance of the training samples and the squared bias is the error due to incorrect assumptions. 23
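
For completeness, here is the standard short derivation of this decomposition (a textbook identity, spelled out here rather than taken from the slide), written in LaTeX:

\begin{align*}
E\big[(\hat\theta - \theta)^2\big]
  &= E\big[(\hat\theta - E[\hat\theta] + E[\hat\theta] - \theta)^2\big] \\
  &= E\big[(\hat\theta - E[\hat\theta])^2\big]
     + 2\,(E[\hat\theta] - \theta)\,E\big[\hat\theta - E[\hat\theta]\big]
     + (E[\hat\theta] - \theta)^2 \\
  &= \underbrace{\mathrm{Var}(\hat\theta)}_{\text{variance}}
     + \underbrace{(E[\hat\theta] - \theta)^2}_{\text{bias}^2}.
\end{align*}

The cross term vanishes because E[\hat\theta - E[\hat\theta]] = 0.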

24 Bias-Variance Tradeoff / Model Selection (figure: underfit region vs. overfit region) 24

25 (1) Training vs Test Error (figure: expected test error and expected training error as functions of model complexity) 25

26 (1) Training vs Test Error Training error can always be reduced by increasing model complexity. (figure: expected test error and expected training error as functions of model complexity) 26

27 (1) Training vs Test Error Training error can always be reduced by increasing model complexity; the expected test error and the CV error are a good approximation of the EPE. (figure: expected training error and expected test error as functions of model complexity) 27

28 Statistical Decision Theory (Extra) Random input vector: X. Random output variable: Y. Joint distribution: Pr(X, Y). Loss function: L(Y, f(X)). Expected prediction error (EPE): EPE(f) = E(L(Y, f(X))) = \int L(y, f(x)) \Pr(dx, dy); e.g., with squared error loss (also called L2 loss), EPE(f) = \int (y - f(x))^2 \Pr(dx, dy). One way to consider generalization: by considering the joint population distribution. 28

29 (2) Bias-Variance Trade-off Models with too few parameters are inaccurate because of a large bias (not enough flexibility). Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample randomness). 29 Slide credit: D. Hoiem

30 (2.1) Regression: Complexity versus Goodness of Fit (figure panels: model complexity = low: highest bias, lowest variance (low variance / high bias); model complexity = medium: medium bias, medium variance; model complexity = high: smallest bias, highest variance (low bias / high variance)) 30

31 (2.2) Classification: Decision boundaries in global vs. local models (figure: linear regression is global, stable, and can be inaccurate - low variance / high bias; 15-nearest-neighbor and 1-nearest-neighbor KNN are local, accurate, and unstable - low bias / high variance) What ultimately matters: GENERALIZATION 31

32 Bias-Variance Tradeoff / Model Selection (figure: underfit region vs. overfit region) 32

33 Model bias & Model variance Middle RED: TRUE function Error due to bias: How far off in general from the middle red Error due to variance: How wildly the blue points spread 33

34 Model bias & Model variance Middle RED: TRUE function Error due to bias: How far off in general from the middle red Error due to variance: How wildly the blue points spread 34

35 Randomness of Train Set => Variance of Models, e.g. (figure) 35

36 Randomness of Train Set => Variance of Models, e.g. (figure) 36
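
To make this concrete, here is an illustrative simulation (my own sketch; the sine "true" function, the noise level, and the polynomial degrees are all assumptions): fit the same model class to many independently drawn training sets and look at how much the fitted predictions vary.

import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)   # assumed "true" function for this toy example

def fit_poly_and_predict(degree, n=20, x_eval=0.5):
    # Draw a fresh random training set, fit a degree-d polynomial by least squares,
    # and return its prediction at the probe point x_eval.
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, 0.3, n)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_eval)

for degree in (1, 7):
    preds = np.array([fit_poly_and_predict(degree) for _ in range(200)])
    print(f"degree {degree}: spread of predictions across training sets = {preds.std():.3f}")

The flexible high-degree fit swings much more from one random training set to the next, which is exactly the model variance pictured on these slides.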

37 We need to make assumptions that are able to generalize. Components: Bias - how much does the average model over all training sets differ from the true model? This is error due to inaccurate assumptions/simplifications made by the model. Variance - how much do models estimated from different training sets differ from each other? 37 Slide credit: L. Lazebnik

38 Today: K-nearest neighbor; Model Selection / Bias-Variance Tradeoff; Bias-Variance tradeoff; High bias? High variance? How to respond? 38

39 We need to make assumptions that are able to generalize. Underfitting: the model is too simple to represent all the relevant class characteristics - high bias and low variance; high training error and high test error. Overfitting: the model is too complex and fits irrelevant characteristics (noise) in the data - low bias and high variance; low training error and high test error. 39 Slide credit: L. Lazebnik

40 (1) High variance (could also use the cross-validation error) Test error is still decreasing as m (the training set size) increases, which suggests a larger training set will help. Large gap between training and test error: low training error and high test error. 40 Slide credit: A. Ng

41 How to reduce variance? Choose a simpler classifier; regularize the parameters; get more training data; try a smaller set of features. 41 Slide credit: D. Hoiem

42 (2) High bias (could also use the cross-validation error) Even the training error is unacceptably high. Small gap between training and test error: high training error and high test error. 42 Slide credit: A. Ng

43 How to reduce bias? E.g., get additional features; try adding basis expansions, e.g., polynomial; try a more complex learner. 43
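
A hedged sketch of the "basis expansion" idea (illustrative only; the cosine target and the chosen degrees are assumptions): expand a one-dimensional input into polynomial features so that an otherwise linear least-squares fit can represent a curved, lower-bias function.

import numpy as np

def polynomial_expand(x, degree):
    # Map a 1-D input x into the basis [1, x, x^2, ..., x^degree];
    # a linear model on these features can fit curved functions.
    return np.vstack([x ** d for d in range(degree + 1)]).T

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 100)
y = np.cos(3 * x) + rng.normal(0, 0.1, 100)          # illustrative nonlinear target

for degree in (1, 5):
    Phi = polynomial_expand(x, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # ordinary least squares
    mse = np.mean((Phi @ w - y) ** 2)
    print(f"degree {degree}: training MSE {mse:.4f}")  # higher degree => lower bias

The same linear solver is used in both cases; only the richer representation lowers the bias (and, as the earlier slides warn, it raises the variance).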

44 (3) For instance, if trying to solve spam detection using L2 (Extra) - what to do if performance is not as desired 44 Slide credit: A. Ng

45 (4) Model Selection and Assessment Model selection: estimating the performance of different models in order to choose the best one. Model assessment: having chosen a model, estimating the prediction error on new data. 45

46 Model Selection and Assessment (Extra) In a data-rich scenario: split the dataset into Train / Validation / Test, using the validation set for model selection and the test set for model assessment. When there is insufficient data to split into three parts: approximate the validation step analytically (AIC, BIC, MDL, SRM), or reuse the samples efficiently (cross-validation, bootstrap). 46
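
The following is a minimal sketch of K-fold cross-validation used to pick k for a k-NN classifier (my own illustration, not the course's code; it assumes NumPy arrays X, y with integer class labels 0..C-1).

import numpy as np

def kfold_cv_error(X, y, k_neighbors, n_folds=5, seed=0):
    # Average the held-out classification error of a k-NN classifier over n_folds folds.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    errors = []
    for f in range(n_folds):
        val = folds[f]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != f])
        preds = []
        for x in X[val]:
            d = np.sqrt(((X[tr] - x) ** 2).sum(axis=1))
            nn = tr[np.argsort(d)[:k_neighbors]]      # indices of the k nearest training points
            preds.append(np.argmax(np.bincount(y[nn])))  # majority vote
        errors.append(np.mean(np.array(preds) != y[val]))
    return float(np.mean(errors))

# Choose k by the lowest cross-validated error, e.g.:
# best_k = min(range(1, 16, 2), key=lambda k: kfold_cv_error(X, y, k))

Each candidate k is scored by its average held-out error, and the k with the lowest cross-validated error is selected; this is the "efficient reuse of samples" route mentioned on the slide.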

47 Today's Recap: K-nearest neighbor; Model Selection / Bias-Variance Tradeoff; Bias-Variance tradeoff; High bias? High variance? How to respond? 47

48 References: Prof. Tan, Steinbach, and Kumar's Introduction to Data Mining slides; Prof. Andrew Moore's slides; Prof. Eric Xing's slides; Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2. No. 1. New York: Springer. 48

49 Statistical Decision Theory (Extra) Random input vector: X. Random output variable: Y. Joint distribution: Pr(X, Y). Loss function: L(Y, f(X)). Expected prediction error (EPE): EPE(f) = E(L(Y, f(X))) = \int L(y, f(x)) \Pr(dx, dy); e.g., with squared error loss (also called L2 loss), EPE(f) = \int (y - f(x))^2 \Pr(dx, dy). Consider the population distribution. 49

50 Expected prediction error (EPE) EPE(f) = E(L(Y, f(X))) = \int L(y, f(x)) \Pr(dx, dy) (consider the joint distribution). For L2 loss, EPE(f) = \int (y - f(x))^2 \Pr(dx, dy), and the best estimator for EPE (theoretically) is the conditional mean, \hat{f}(x) = E(Y \mid X = x); KNN methods are a direct implementation (approximation) of it. For 0-1 loss, L(k, l) = 1 - \delta_{kl}, the best estimator is the Bayes classifier: \hat{f}(X) = C_k if \Pr(C_k \mid X = x) = \max_{g \in C} \Pr(g \mid X = x). 50

51 EXPECTED PREDICTION ERROR for L2 Loss Expected prediction error (EPE) for L2 loss: EPE(f) = E(Y - f(X))^2 = \int (y - f(x))^2 \Pr(dx, dy). Since \Pr(X, Y) = \Pr(Y \mid X)\Pr(X), EPE can also be written as EPE(f) = E_X E_{Y \mid X}([Y - f(X)]^2 \mid X). Thus it suffices to minimize EPE pointwise: f(x) = \arg\min_c E_{Y \mid X}([Y - c]^2 \mid X = x). The best estimator under L2 loss is the conditional expectation (conditional mean), f(x) = E(Y \mid X = x), which is the solution targeted by both regression and kNN. 51
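
As a hedged sketch of how kNN approximates this conditional mean (not the course's code; the function name and the choice of Euclidean distance are assumptions), kNN regression simply averages the targets of the nearest training points:

import numpy as np

def knn_regress(X_train, y_train, x_query, k=10):
    # kNN regression: approximate the conditional mean E(Y | X = x) by averaging
    # the targets of the k nearest training points (a locally constant fit).
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    return y_train[np.argsort(d)[:k]].mean()

Larger k averages over a wider neighborhood, giving a smoother (lower-variance, higher-bias) approximation of E(Y | X = x); smaller k does the opposite.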

52 KNN FOR MINIMIZING EPE We know that under L2 loss the best estimator for EPE (theoretically) is the conditional mean, f(x) = E(Y \mid X = x). Nearest neighbors assumes that f(x) is well approximated by a locally constant function. 52

53 Review: WHEN EPE USES DIFFERENT LOSS Loss function -> estimator: L2 loss -> conditional mean; 0-1 loss -> Bayes classifier (MAP). 53

54 Decomposition of EPE When we assume an additive error model, Y = f(X) + \epsilon. Notation: output random variable Y; prediction function f; prediction estimator \hat{f}. The EPE decomposes into the irreducible / Bayes error (the noise in \epsilon) plus the MSE component of \hat{f} in estimating f. 54

55 Bias-Variance Trade-off for EPE: EPE(x_0) = noise^2 + bias^2 + variance, where noise^2 is the unavoidable error, bias^2 is the error due to incorrect assumptions, and variance is the error due to the variance of the training samples. 55 Slide credit: D. Hoiem
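
A small Monte-Carlo illustration of this decomposition at a single point (entirely my own sketch: the sine "true" function, the noise level sigma, the sample size, and the probe point x0 are assumptions), comparing a small and a large k for kNN regression:

import numpy as np

rng = np.random.default_rng(3)
true_f = lambda x: np.sin(2 * np.pi * x)     # assumed "true" function
sigma, n_train, x0 = 0.3, 50, 0.25           # assumed noise level, sample size, probe point

def knn_prediction_at_x0(k):
    # Draw a fresh training set and return the kNN regression estimate at x0.
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, sigma, n_train)
    nn = np.argsort(np.abs(x - x0))[:k]
    return y[nn].mean()

for k in (1, 25):
    preds = np.array([knn_prediction_at_x0(k) for _ in range(2000)])
    bias2 = (preds.mean() - true_f(x0)) ** 2
    print(f"k={k}: noise^2={sigma**2:.2f}, bias^2={bias2:.3f}, variance={preds.var():.3f}")

The printed numbers typically show the tradeoff from the slide: k = 1 has negligible bias but variance close to noise^2, while k = 25 shrinks the variance at the cost of a visible squared bias.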

56 http://alex.smola.org/teaching/ /homework/hw3_sol.pdf 56

57 http://alex.smola.org/teaching/ /homework/hw3_sol.pdf 57

58 MSE of Model, a.k.a. Risk 58

59 Cross Validation and Variance Estimation 59

