Announcements:$Rough$Plan$$

Size: px

Start display at page:

Download "Announcements:$Rough$Plan$$"

Edmund Robinson
5 years ago
Views:

1 UVACS6316 Fall2015Graduate: MachineLearning 10/27/15 Dr.YanjunQi UniversityofVirginia Departmentof ComputerScience 1 Announcements:RoughPlan HW3:dueonNov.8thmidnight MidphaseProjectReport:dueonNov.4 th LateMidterm: Opennote/Openlecture Nov.18 th /conflictswithmanystudents conferencetrips Nov.23 rd????/conflicts??? HW4: 20samplesquesWonsforthepreparaWonforexam DuedependingonLateYMidtermDate;If23 rd,dueon20 th 10/27/15 2

2 Wherearewe?! FivemajorsecWonsofthiscourse " Regression(supervised) " ClassificaWon(supervised) " Unsupervisedmodels " Learningtheory " Graphicalmodels 10/27/15 3 Wherearewe?! ThreemajorsecWonsforclassificaWon We can divide the large variety of classification approaches into roughly three major types 1. Discriminative - directly estimate a decision rule/boundary - e.g., logistic regression, support vector machine, decisiontree 2. Generative: - build a generative statistical model - e.g., naïve bayes classifier, Bayesian networks 3. Instance based classifiers - Use observation directly (no models) - e.g. K nearest neighbors 10/27/15 4

3 Today: # KYnearestneighbor # ModelSelecWon/BiasVarianceTradeoff 10/27/15 5 Nearest neighbor classifiers Basic idea: If it walks like a duck, quacks like a duck, then it s probably a duck compute distance test sample training samples choose k of the nearest samples 10/27/15 6

4 Nearest neighbor classifiers Unknown record Requiresthreeinputs: 1. Thesetofstored trainingsamples 2. Distancemetricto computedistance betweensamples 3. Thevalueofk,i.e.,the numberofnearest neighborstoretrieve 10/27/15 7 Nearest neighbor classifiers Unknown record Toclassifyunknownsample: 1. Computedistanceto othertrainingrecords 2. IdenWfyknearest neighbors 3. Useclasslabelsofnearest neighborstodetermine theclasslabelofunknown record(e.g.,bytaking majorityvote) 10/27/15 8

5 Definition of nearest neighbor X X X (a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor kynearestneighborsofasamplexaredatapointsthat havetheksmallestdistancestox 10/27/ nearest neighbor Voronoidiagram: partitioning of a plane into regions based on distance to points in a specific subset of the plane. 10/27/15 10

6 Nearest neighbor classification Compute distance between two points: For instance, Euclidean distance d( x, y) = ( x i y i ) Options for determining the class from nearest neighbor list Take majority vote of class labels among the k-nearest neighbors Weight the votes according to distance example: weight factor w = 1 / d 2 i 2 10/27/15 11 Nearest neighbor classification Choosing the value of k: If k is too small, sensitive to noise points If k is too large, neighborhood may include points from other classes X 10/27/15 12

7 Nearest neighbor classification Scaling issues Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes Example: height of a person may vary from 1.5 m to 1.8 m weight of a person may vary from 90 lb to 300 lb income of a person may vary from 10K to 1M 10/27/15 13 Nearest neighbor classification Problem with Euclidean measure: High dimensional data curse of dimensionality Can produce counter-intuitive results vs d = d = onesoluwon:normalizethevectorstounitlength 10/27/15 14

8 Nearest neighbor classification k-nearest neighbor classifier is a lazy learner Does not build model explicitly. Classifying unknown samples is relatively expensive. k-nearest neighbor classifier is a local model, vs. global model of linear classifiers. 10/27/15 15 K-Nearest Neighbor Task classification Representation Local Smoothness Score Function NA Search/Optimization NA Models, Parameters Training Samples 10/27/15 16

local accurate unstable What ultimately matters: GENERALIZATION 10/27/15 17 KYNearestYNeighboursfor

9 Decision boundaries in global vs. local models linear regression 15-nearest neighbor 1-nearest neighbor global stable can be inaccurate local accurate unstable What ultimately matters: GENERALIZATION 10/27/15 17 KYNearestYNeighboursfor ClassificaWon(Extra) Kactsasasmother For,theerrorrateofthe1YnearestYneighbourclassifierisnevermorethan twicetheopwmalerror(obtainedfromthetruecondiwonalclassdistribuwons). 10/27/15 18

KNN METHODS IN HIGH DIMENSIONS (Extra) In high dimensions, all sample points are close to the edge of the sample N data points uniformly distributed in a p-dimensional unit ball centered at the

10 KNN METHODS IN HIGH DIMENSIONS (Extra) In high dimensions, all sample points are close to the edge of the sample N data points uniformly distributed in a p-dimensional unit ball centered at the origin Median distance from the closest point to the origin 1 d( p, N) = 1 2 d(10,500) = 0.52 More than half the way to the boundary (unit ball s boundary edge is 1 distance to the origin) 1/ N 1/ p 10/27/ Vs. Locally weighted regression aka locally weighted regression, locally linear regression, LOESS, K λ (x i, x 0 ) linear_func(x)y>y! couldrepresent onlytheneighbor regionofx_0 10/27/15 UseRBFfuncWonto pickout/emphasize theneighborregion ofx_0! K λ (x i, x 0 )

11 Vs. Locally weighted regression Separate weighted least squares at each target point x 0 : x 0 min N α (x 0 ),β (x 0 ) i=1 f ˆ ( x ˆ( K λ (x i, x 0 )[y i α(x 0 ) β(x 0 )x i ] 2 ˆ( 0 ) =α x0) + β x0) x 0 " K τ (x i, x0) = exp (x % i x0)2 ' # 2τ 2 & 10/27/15 Today: # KYnearestneighbor # ModelSelecWon/BiasVarianceTradeoff # EPE # DecomposiWonofMSE # BiasYVariancetradeoff # Highbias?Highvariance?Howtorespond? 10/27/15 22

e.g. Training Error from KNN, Lesson Learned When k = 1, No misclassifications (on training):

variable:y Joint distribution: Pr(X,Y ) Loss function L(Y, f(x)) Expected prediction error

12 e.g. Training Error from KNN, Lesson Learned When k = 1, No misclassifications (on training): Overtraining 1-nearest neighbor averaging Minimizing training error is not always good (e.g., 1- NN) ˆ > 0.5 Y Y ˆ = 0.5 Y ˆ < 0. 5 X 1 10/27/15 23 X 2 Statistical Decision Theory Random input vector: X Random output variable:y Joint distribution: Pr(X,Y ) Loss function L(Y, f(x)) Expected prediction error (EPE): EPE( f ) = E(L(Y, f (X))) = L(y, f (x)) Pr(dx,dy) e.g. = (y f (x)) 2 Pr(dx,dy) e.g. Squared error loss (also called L2 loss ) Consider population distribution 10/27/15 24

KNN NN methods are the direct implementation (approximation ) For 0-1 loss: L(k, ) = 1- kl Bayes Classifier ˆf (X)=C k!if! Pr(C k X = x)= maxpr(g X = x) g C! 10/27/15 25 26 LR VS.

13 Expected prediction error (EPE) EPE( f ) = E(L(Y, f (X))) = L(y, f (x)) Pr(dx, dy) Consider joint distribution For L2 loss: e.g. = (y f (x)) 2 Pr(dx, dy) under L2 loss, best estimator for EPE (Theoretically) is : Conditional mean! ˆf (x)= E(Y X = x) e.g. KNN NN methods are the direct implementation (approximation ) For 0-1 loss: L(k, ) = 1- kl Bayes Classifier ˆf (X)=C k!if! Pr(C k X = x)= maxpr(g X = x) g C! 10/27/ LR VS. KNN FOR MINIMIZING EPE We know under L2 loss, best estimator for EPE (Theoretically) is : Conditional mean f ( x) = E( Y X = x ) Two simple approaches using different approximations: Least squares assumes that f(x) is well approximated by a globally linear function Nearest neighbors assumes that f(x) is well approximated by a locally constant function. 10/27/15

14 Review : WHEN EPE USES DIFFERENT LOSS Loss Function Estimator L 2 L (Bayes classifier / MAP) 10/27/15 27 Today: # KYnearestneighbor # ModelSelecWon/BiasVarianceTradeoff # EPE # DecomposiWonofMSE # BiasYVariancetradeoff # Highbias?Highvariance?Howtorespond? 10/27/15 28

Decomposition of EPE When additive error model: Notations Output

10/27/15 Irreducible / Bayes error MSE component of f- hat in

bias 2 + variance Unavoidable error Error due to incorrect

15 Decomposition of EPE When additive error model: Notations Output random variable: Prediction function: Prediction estimator: 29 10/27/15 Irreducible / Bayes error MSE component of f- hat in estimating f BiasYVarianceTradeYoffforEPE: EPE (x_0) = noise 2 + bias 2 + variance Unavoidable error Error due to incorrect assumptions Error due to variance of training samples 10/27/15 30 Slidecredit:D.Hoiem

or func from training data Variance Measures precision or specificity of the estimator 10/27/15 Low variance implies the

16 BIAS AND VARIANCE TRADE-OFF for MSE (more general setting!!! ): Bias measures accuracy or quality of the estimator low bias implies on average we will accurately estimate true parameter or func from training data Variance Measures precision or specificity of the estimator 10/27/15 Low variance implies the estimator does not change much as the training set varies 32 BIAS AND VARIANCE TRADE-OFF for MSE of parameter estimation: Error due to variance of training samples Error due to incorrect assumptions 10/27/15

# DecomposiWonofMSE # BiasYVariancetradeoff

17 e.g., BIAS AND VARIANCE IN KNN (Extra) Prediction 33 Bias Variance When under data model: 10/27/15 Today: # KYnearestneighbor # ModelSelecWon/BiasVarianceTradeoff # EPE # DecomposiWonofMSE # BiasYVariancetradeoff # Highbias?Highvariance?Howtorespond? 10/27/15 34

18 Bias-Variance Tradeoff / Model Selection underfit region overfit region 10/27/15 35 Whichisthebest? HighestBias Lowestvariance Modelcomplexity=low MediumBias MediumVariance Modelcomplexity=medium SmallestBias Highestvariance Modelcomplexity=high Whynotchoosethemethodwiththe bestfittothedata? Howwellareyougoingtopredictfuturedata? 10/27/15 36

Modelswithtoomany parametersareinaccurate becauseofalargevariance (toomuchsensiwvitytothe

19 BiasYVarianceTradeYoff Modelswithtoofew parametersareinaccurate becauseofalargebias(not enoughflexibility). Modelswithtoomany parametersareinaccurate becauseofalargevariance (toomuchsensiwvitytothe samplerandomness). 10/27/15 37 Slidecredit:D.Hoiem TrainingvsTestError Trainingerror canalwaysbe reducedwhen increasing model complexity, ExpectedTestError ButrisksoverY finngand generalize poorly. ExpectedTrainingError 38

Regression: Complexity versus Goodness of Fit y Training data y Too simple?

y LowVariance/ HighBias Meanofthe LowBias /HighVariance x What ultimately matters:

local models LowVariance/ HighBias linear regression global stable can be inaccurate

20 Regression: Complexity versus Goodness of Fit y Training data y Too simple? y x x Too complex? About right? y LowVariance/ HighBias Meanofthe LowBias /HighVariance x What ultimately matters: GENERALIZATION 10/27/15 39 x Classification, Decision boundaries in global vs. local models LowVariance/ HighBias linear regression global stable can be inaccurate 15-nearest neighbor LowVariance/ HighBias KNN local accurate unstable 1-nearest neighbor LowBias /HighVariance What ultimately matters: GENERALIZATION 10/27/15 40

Middle RED: TRUE function Model bias & Model variance Error due to bias: How far off in general from the middle red Error due to variance: How wildly the blue points spread 10/27/15 41

21 Middle RED: TRUE function Model bias & Model variance Error due to bias: How far off in general from the middle red Error due to variance: How wildly the blue points spread 10/27/15 41 needtomakeassumpwonsthat areabletogeneralize Components of generalization error Bias: how much the average model over all training sets differ from the true model? Error due to inaccurate assumptions/simplifications made by the model Variance: how much models estimated from different training sets differ from each other Underfitting: model is too simple to represent all the relevant class characteristics High bias and low variance High training error and high test error Overfitting: model is too complex and fits irrelevant characteristics (noise) in the data Low bias and high variance Low training error and high test error 10/27/15 42 Slidecredit:L.Lazebnik

22 MODEL COMPLEXITY CONTROL, EXAMPLES (Extra) Regularization (Bayesian) Kernel methods and local regression Basis functions 10/27/15 43 Today: # KYnearestneighbor # ModelSelecWon/BiasVarianceTradeoff # EPE # DecomposiWonofMSE # BiasYVariancetradeoff # Highbias?Highvariance?Howtorespond? 10/27/15 44

23 Highvariance Couldalsouse CrossVError TesterrorsWlldecreasingasmincreases.Suggestslargertrainingsetwillhelp. Largegapbetweentrainingandtesterror. Low training error and high test error 10/27/15 45 Slidecredit:A.Ng Howtoreducevariance? Chooseasimplerclassifier Regularizetheparameters Getmoretrainingdata Trysmallersetoffeatures 10/27/15 46 Slidecredit:D.Hoiem

24 Highbias Couldalsouse CrossVError Eventrainingerrorisunacceptablyhigh. Smallgapbetweentrainingandtesterror. High training error and high test error 10/27/15 47 Slidecredit:A.Ng HowtoreduceBias? E.g. Y GetaddiWonalfeatures Y Tryaddingbasisexpansions,e.g.polynomial Y Trymorecomplexlearner 10/27/15 48

25 Forinstance,iftryingtosolve spam detecwon using(extra) L2Y Ifperformanceisnotasdesired 10/27/15 49 Slidecredit:A.Ng ModelSelecWonandAssessment ModelSelecWon EsWmaWngperformancesofdifferentmodelstochoose thebestone ModelAssessment Havingchosenamodel,esWmaWngthepredicWonerror onnewdata 10/27/15 50

26 ModelSelecWonandAssessment (Extra) DataRichScenario:Splitthedataset Train Validation Test Model Selection Model assessment Insufficientdatatosplitinto3parts ApproximatevalidaWonstepanalyWcally AIC,BIC,MDL,SRM Efficientreuseofsamples CrossvalidaWon,bootstrap 10/27/15 51 TodayRecap: # KYnearestneighbor # ModelSelecWon/BiasVarianceTradeoff # EPE # DecomposiWonofMSE # BiasYVariancetradeoff # Highbias?Highvariance?Howtorespond? 10/27/15 52

27 References " Prof.Tan,Steinbach,Kumar s IntroducWon todatamining slide " Prof.AndrewMoore sslides " Prof.EricXing sslides " HasWe,Trevor,etal.The%elements%of% sta.s.cal%learning.vol.2.no.1.newyork: Springer, /27/15 53

UVA CS 6316/4501 Fall 2016 Machine Learning. Lecture 15: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

UVA CS 6316/4501 Fall 2016 Machine Learning Lecture 15: K-nearest-neighbor Classifier / Bias-Variance Tradeoff Dr. Yanjun Qi University of Virginia Department of Computer Science 11/9/16 1 Rough Plan HW5