Nearest Neighbor Classification


Nearest Neighbor Classification. Professor Ameet Talwalkar. CS260 Machine Learning Algorithms. January 11, 2017.

Outline
1 Administration
2 First learning algorithm: Nearest neighbor classifier
3 Deeper understanding of NNC
4 Some practical aspects of NNC
5 What we have learned

Registration / PTEs. The course is currently full, and I am not giving out PTEs until after the HW1 submission date. I expect several students will drop the course: ML is very popular, so many students are interested, but many don't have the mathematical maturity for graduate-level material, and last year's attrition rate was much higher than typical for a grad-level class. If you're not registered, I'd encourage you to stay patient; I am confident that all qualified students will be able to enroll.

Math Quiz. The quiz is representative of math concepts you are expected to know. Grades will be available next week at office hours. It is graded to assess your background (but is not part of the final grade); we may contact students who perform poorly. The Math Quiz is a requirement: we will not grade your HW if you don't take it, and you can take it in office hours if you haven't already. Be honest and realistic with yourself about your background; it's better for you, me, and your classmates to drop the course now rather than a month from now.

Homework 1. Available online, and due next Wednesday at the beginning of class. It is also representative of math concepts you are expected to know: basic math questions, a MATLAB coding assignment, and the Academic Integrity Form. Look on the course website for details about getting access to MATLAB; submission details are included in the assignment. Join Piazza! (Students are asking questions already.)

Outline
1 Administration
2 First learning algorithm: Nearest neighbor classifier (intuitive example; general setup for classification; algorithm)
3 Deeper understanding of NNC
4 Some practical aspects of NNC
5 What we have learned

Recognizing flowers. Types of Iris: setosa, versicolor, and virginica.

Measuring the properties of the flowers. Features: the widths and lengths of sepal and petal.

Often, data is conveniently organized as a table. Ex: the Iris data, with 4 features and 3 classes.

Pairwise scatter plots of 131 flower specimens. Visualization of data helps to identify the right learning model to use. Each colored point is a flower specimen: setosa, versicolor, or virginica. (Scatter-plot matrix over the four features: sepal length, sepal width, petal length, petal width.)

Different types seem well-clustered and separable using two features: petal width and sepal length.

Labeling an unknown flower type? It lies closer to the red cluster, so we label it setosa.

Multi-class classification: classify data into one of multiple categories.
Input (feature vector): $x \in \mathbb{R}^D$. Output (label): $y \in [C] = \{1, 2, \dots, C\}$. Learning goal: $y = f(x)$.
Special case, binary classification: number of classes $C = 2$, with labels $\{0, 1\}$ or $\{-1, +1\}$.

More terminology.
Training data (set): $N$ samples/instances $\mathcal{D}^{\text{train}} = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, used for learning $f(\cdot)$.
Test (evaluation) data: $M$ samples/instances $\mathcal{D}^{\text{test}} = \{(x_1, y_1), (x_2, y_2), \dots, (x_M, y_M)\}$, used for assessing how well $f(\cdot)$ will do in predicting an unseen $x \notin \mathcal{D}^{\text{train}}$.
Training data and test data should not overlap: $\mathcal{D}^{\text{train}} \cap \mathcal{D}^{\text{test}} = \emptyset$.

Nearest neighbor classification (NNC).
Nearest neighbor: $x(1) = x_{nn(x)}$, where $nn(x) \in [N] = \{1, 2, \dots, N\}$ is the index of one of the training instances:
$nn(x) = \arg\min_{n \in [N]} \|x - x_n\|_2^2 = \arg\min_{n \in [N]} \sum_{d=1}^{D} (x_d - x_{nd})^2$
Classification rule: $y = f(x) = y_{nn(x)}$
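As a concrete illustration, the 1-NN rule can be sketched in a few lines. This is not code from the lecture (the homework uses MATLAAB-free math, and the course's coding is in MATLAB); it is a minimal Python sketch with made-up toy data:

```python
# Minimal sketch of the 1-NN rule: find the index nn(x) minimizing the
# squared Euclidean distance, then copy that training point's label.
# The data and labels below are illustrative, not from the lecture.

def squared_euclidean(x, z):
    """sum_d (x_d - z_d)^2 -- squared Euclidean distance."""
    return sum((xd - zd) ** 2 for xd, zd in zip(x, z))

def nnc_predict(x, train_X, train_y):
    """Return y_{nn(x)}: the label of the nearest training instance."""
    nn_index = min(range(len(train_X)),
                   key=lambda n: squared_euclidean(x, train_X[n]))
    return train_y[nn_index]

train_X = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
train_y = ["red", "red", "blue"]
print(nnc_predict((0.4, 0.2), train_X, train_y))  # nearest point is (0, 0) -> red
```

Note that minimizing the squared distance gives the same neighbor as minimizing the distance itself, so the square root can be skipped.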

Visual example. In this 2-dimensional example, the nearest point to $x$ is a red training instance; thus, $x$ will be labeled as red.

Example: classify Iris with two features.
Training data:
ID (n) | petal width ($x_1$) | sepal length ($x_2$) | category ($y$)
1 | 0.2 | 5.1 | setosa
2 | 1.4 | 7.0 | versicolor
3 | 2.5 | 6.7 | virginica
Flower with unknown category: petal width = 1.8 and sepal length = 6.4.
Calculating the distance $\sqrt{(x_1 - x_{n1})^2 + (x_2 - x_{n2})^2}$ to each training instance:
ID | distance
1 | 2.06
2 | 0.72
3 | 0.76
Thus, the predicted category is versicolor (the real category is virginica).
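The three-point example above is easy to check by hand or in code; a short Python sketch of the same computation (note that with the table's numbers, the first distance comes out to about 2.06):

```python
# Reproduce the slide's three-flower example: Euclidean distance from the
# query (petal width 1.8, sepal length 6.4) to each training flower, then
# predict the category of the closest one.
import math

train = [((0.2, 5.1), "setosa"),
         ((1.4, 7.0), "versicolor"),
         ((2.5, 6.7), "virginica")]
query = (1.8, 6.4)

distances = [math.hypot(query[0] - x1, query[1] - x2) for (x1, x2), _ in train]
print([round(d, 2) for d in distances])   # [2.06, 0.72, 0.76]
prediction = train[distances.index(min(distances))][1]
print(prediction)                         # versicolor
```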

How to measure nearness with other distances?
Previously, we used the Euclidean distance: $nn(x) = \arg\min_{n \in [N]} \|x - x_n\|_2^2$.
We can also use alternative distances, e.g., the $L_1$ distance (i.e., city block distance, or Manhattan distance):
$nn(x) = \arg\min_{n \in [N]} \|x - x_n\|_1 = \arg\min_{n \in [N]} \sum_{d=1}^{D} |x_d - x_{nd}|$
(In the figure, the green line is the Euclidean distance; the red, blue, and yellow lines are all $L_1$ distances.)

Decision boundary. For every point in the space, we can determine its label using the NNC rule. This gives rise to a decision boundary that partitions the space into different regions.

K-nearest neighbor (KNN) classification. What if we increase the number of nearest neighbors to use?
1-nearest neighbor: $nn_1(x) = \arg\min_{n \in [N]} \|x - x_n\|_2^2$
2nd-nearest neighbor: $nn_2(x) = \arg\min_{n \in [N] \setminus \{nn_1(x)\}} \|x - x_n\|_2^2$
3rd-nearest neighbor: $nn_3(x) = \arg\min_{n \in [N] \setminus \{nn_1(x), nn_2(x)\}} \|x - x_n\|_2^2$
The set of K nearest neighbors: let $x(k) = x_{nn_k(x)}$; then $knn(x) = \{nn_1(x), nn_2(x), \dots, nn_K(x)\}$ with
$\|x - x(1)\|_2^2 \le \|x - x(2)\|_2^2 \le \dots \le \|x - x(K)\|_2^2$

How to classify with K neighbors? Classification rule: every neighbor votes. Suppose $y_n$ (the true label) for $x_n$ is $c$; then $x_n$ casts a vote of 1 for $c$ and 0 for any $c' \ne c$. We use the indicator function $\mathbb{I}(y_n == c)$ to represent this. Aggregate everyone's votes:
$v_c = \sum_{n \in knn(x)} \mathbb{I}(y_n == c), \quad c \in [C]$
Label with the majority: $y = f(x) = \arg\max_{c \in [C]} v_c$
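The voting rule above can be sketched as follows; this is an illustrative Python implementation with made-up data, not code from the lecture:

```python
# Sketch of the K-NN voting rule: each of the K nearest training points
# casts one vote I(y_n == c) for its own label c; predict argmax_c v_c.
from collections import Counter

def knn_predict(x, train_X, train_y, K=3):
    dist = lambda z: sum((xd - zd) ** 2 for xd, zd in zip(x, z))
    # Indices of the K nearest training instances (knn(x) in the slides).
    neighbors = sorted(range(len(train_X)), key=lambda n: dist(train_X[n]))[:K]
    votes = Counter(train_y[n] for n in neighbors)   # v_c for each class c
    return votes.most_common(1)[0][0]                # argmax_c v_c

train_X = [(0, 0), (0, 1), (1, 0), (3, 3), (3, 4)]
train_y = ["red", "red", "blue", "blue", "blue"]
print(knn_predict((0.2, 0.2), train_X, train_y, K=3))  # red (2 red vs 1 blue)
```

As in the slide's example, the prediction can change with K: with K=5 here all points vote, and "blue" wins 3 to 2.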

Example. For the same query point: with K=1 the label is red; with K=3 the label is red; with K=5 the label is blue.

How to choose an optimal K? (Decision-boundary plots for increasing values of K.) When K increases, the decision boundary becomes smoother.

Mini-summary.
Advantages of NNC: computationally simple and easy to implement (just compute distances); has strong theoretical guarantees that it is doing the right thing.
Disadvantages of NNC: computationally intensive for large-scale problems, $O(ND)$ for labeling a single data point; we need to carry the training data around, since without it we cannot do classification (this type of method is called nonparametric); and choosing the right distance measure and K can be involved.

Outline
1 Administration
2 First learning algorithm: Nearest neighbor classifier
3 Deeper understanding of NNC (measuring performance; the ideal classifier; comparing NNC to the ideal classifier)
4 Some practical aspects of NNC
5 What we have learned

Is NNC too simple to do the right thing? To answer this question, we proceed in 3 steps:
1 We define a performance metric for a classifier/algorithm.
2 We then propose an ideal classifier.
3 We then compare our simple NNC classifier to the ideal one and show that it performs nearly as well.

How to measure performance of a classifier? Intuition: we should compute the accuracy (the percentage of data points being correctly classified) or the error rate (the percentage of data points being incorrectly classified). There are two versions: which one should we use?
Defined on the training data set:
$A^{\text{train}} = \frac{1}{N} \sum_n \mathbb{I}[f(x_n) == y_n], \quad \varepsilon^{\text{train}} = \frac{1}{N} \sum_n \mathbb{I}[f(x_n) \ne y_n]$
Defined on the test (evaluation) data set:
$A^{\text{test}} = \frac{1}{M} \sum_m \mathbb{I}[f(x_m) == y_m], \quad \varepsilon^{\text{test}} = \frac{1}{M} \sum_m \mathbb{I}[f(x_m) \ne y_m]$
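Both metrics are one-liners in code; a small sketch with made-up predictions and labels (not from the lecture):

```python
# The two metrics from the slide: accuracy is the fraction of points
# where f(x) == y, and the error rate is its complement.

def accuracy(predictions, labels):
    """A = (1/N) * sum_n I[f(x_n) == y_n]."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def error_rate(predictions, labels):
    """eps = (1/N) * sum_n I[f(x_n) != y_n] = 1 - A."""
    return 1.0 - accuracy(predictions, labels)

preds = ["setosa", "versicolor", "virginica", "setosa"]
truth = ["setosa", "versicolor", "setosa",    "setosa"]
print(accuracy(preds, truth), error_rate(preds, truth))  # 0.75 0.25
```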

Example. On the training data, what are $A^{\text{train}}$ and $\varepsilon^{\text{train}}$? For NNC, $A^{\text{train}} = 100\%$ and $\varepsilon^{\text{train}} = 0\%$, since every training point is its own nearest neighbor. On the test data shown, what are $A^{\text{test}}$ and $\varepsilon^{\text{test}}$? Here $A^{\text{test}} = 0\%$ and $\varepsilon^{\text{test}} = 100\%$.

Leave-one-out (LOO). Idea: for each training instance $x_n$, take it out of the training set and then label it. For NNC, $x_n$'s nearest neighbor will no longer be itself, so the error rate will not necessarily be 0. For the training data shown, the LOO versions are $A^{\text{train}} = 66.67\%$ (i.e., 4/6) and $\varepsilon^{\text{train}} = 33.33\%$ (i.e., 2/6).
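The LOO idea for 1-NN can be sketched directly: to label $x_n$, search for its nearest neighbor among all the other training points. The toy 1-D data below is made up (not the lecture's dataset), chosen so that exactly one point is misclassified:

```python
# Leave-one-out error of 1-NN: each training point is labeled by its
# nearest neighbor among the remaining points.

def loo_error(train_X, train_y):
    mistakes = 0
    for i, (x, y) in enumerate(zip(train_X, train_y)):
        others = [n for n in range(len(train_X)) if n != i]
        nn = min(others, key=lambda n: abs(train_X[n] - x))
        mistakes += (train_y[nn] != y)
    return mistakes / len(train_X)

X = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
y = ["a", "a", "b", "b", "b", "b"]
print(loo_error(X, y))  # 1/6: only x = 2.0 ("b") has an "a" as nearest neighbor
```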

Drawback of these metrics: they are dataset-specific. Given a different training (or test) dataset, $A^{\text{train}}$ (or $A^{\text{test}}$) will change. Thus, if we get a dataset randomly, these quantities are random:
$A^{\text{test}}_{\mathcal{D}_1}, A^{\text{test}}_{\mathcal{D}_2}, \dots, A^{\text{test}}_{\mathcal{D}_q}$
These are called empirical accuracies (or errors). Can we understand the algorithm itself in a more certain nature, by removing the uncertainty caused by the datasets?

Expected mistakes. Setup: assume our data $(x, y)$ is drawn from the joint and unknown distribution $p(x, y)$. Define a classification mistake on a single data point $x$ with ground-truth label $y$ as:
$L(f(x), y) = \begin{cases} 0 & \text{if } f(x) = y \\ 1 & \text{if } f(x) \ne y \end{cases}$
Expected classification mistake on a single data point $x$:
$R(f, x) = \mathbb{E}_{y \sim p(y|x)} L(f(x), y)$
The average classification mistake by the classifier itself:
$R(f) = \mathbb{E}_{x \sim p(x)} R(f, x) = \mathbb{E}_{(x,y) \sim p(x,y)} L(f(x), y)$
(by the law of iterated expectations, also known as the tower property or smoothing).

Terminology. $L(f(x), y)$ is called the 0/1 loss function; many other forms of loss functions exist for different learning problems.
Expected conditional risk: $R(f, x) = \mathbb{E}_{y \sim p(y|x)} L(f(x), y)$
Expected risk: $R(f) = \mathbb{E}_{(x,y) \sim p(x,y)} L(f(x), y)$
Empirical risk: $R^{\mathcal{D}}(f) = \frac{1}{N} \sum_n L(f(x_n), y_n)$ (this is our empirical error from earlier).

Ex: binary classification. Expected conditional risk of a single data point $x$:
$R(f, x) = \mathbb{E}_{y \sim p(y|x)} L(f(x), y) = P(y = 1 \mid x)\,\mathbb{I}[f(x) = 0] + P(y = 0 \mid x)\,\mathbb{I}[f(x) = 1]$
Let $\eta(x) = P(y = 1 \mid x)$; we have
$R(f, x) = \eta(x)\,\mathbb{I}[f(x) = 0] + (1 - \eta(x))\,\mathbb{I}[f(x) = 1] = 1 - \{\eta(x)\,\mathbb{I}[f(x) = 1] + (1 - \eta(x))\,\mathbb{I}[f(x) = 0]\}$
where the term in braces is the expected conditional accuracy. Exercise: please verify the last equality.

Bayes optimal classifier. Imagine we had access to the posterior probability $\eta(x) = P(y = 1 \mid x)$:
$f^*(x) = \begin{cases} 1 & \text{if } \eta(x) \ge 1/2 \\ 0 & \text{if } \eta(x) < 1/2 \end{cases}$
or, equivalently,
$f^*(x) = \begin{cases} 1 & \text{if } p(y = 1 \mid x) \ge p(y = 0 \mid x) \\ 0 & \text{if } p(y = 1 \mid x) < p(y = 0 \mid x) \end{cases}$
Theorem: for any labeling function $f(\cdot)$, $R(f^*, x) \le R(f, x)$. Similarly, $R(f^*) \le R(f)$. Namely, $f^*(\cdot)$ is optimal.
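Numerically, the Bayes rule is trivial once $\eta(x)$ is handed to us; a sketch (assuming, hypothetically, that the posterior is known) that also shows why its conditional risk $\min(\eta(x), 1 - \eta(x))$ cannot be beaten:

```python
# Bayes optimal rule for binary labels, assuming we are given the
# posterior eta(x) = P(y = 1 | x): predict 1 iff eta(x) >= 1/2.

def bayes_predict(eta_x):
    return 1 if eta_x >= 0.5 else 0

def conditional_risk(prediction, eta_x):
    """R(f, x) = eta(x) I[f(x)=0] + (1 - eta(x)) I[f(x)=1]."""
    return eta_x if prediction == 0 else 1.0 - eta_x

eta = 0.75                             # hypothetical posterior at some x
f_star = bayes_predict(eta)            # predicts 1
print(conditional_risk(f_star, eta))   # 0.25; the other choice would risk 0.75
```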

Proof. By the definition of expected conditional risk,
$R(f, x) = 1 - \{\eta(x)\,\mathbb{I}[f(x) = 1] + (1 - \eta(x))\,\mathbb{I}[f(x) = 0]\}$
$R(f^*, x) = 1 - \{\eta(x)\,\mathbb{I}[f^*(x) = 1] + (1 - \eta(x))\,\mathbb{I}[f^*(x) = 0]\}$
Since $\eta(x) = P(y = 1 \mid x)$, we have:
$R(f, x) - R(f^*, x) = \eta(x)\,\{\mathbb{I}[f^*(x) = 1] - \mathbb{I}[f(x) = 1]\} + (1 - \eta(x))\,\{\mathbb{I}[f^*(x) = 0] - \mathbb{I}[f(x) = 0]\}$
$= \eta(x)\,\{\mathbb{I}[f^*(x) = 1] - \mathbb{I}[f(x) = 1]\} + (1 - \eta(x))\,\{\mathbb{I}[f(x) = 1] - \mathbb{I}[f^*(x) = 1]\}$
$= (2\eta(x) - 1)\,\{\mathbb{I}[f^*(x) = 1] - \mathbb{I}[f(x) = 1]\} \ge 0$
The last inequality holds because both factors are nonnegative when $\eta(x) \ge 1/2$ (where $f^*(x) = 1$) and both are nonpositive when $\eta(x) < 1/2$ (where $f^*(x) = 0$).

Bayes optimal classifier in general form. For the multi-class classification problem,
$f^*(x) = \arg\max_{c \in [C]} p(y = c \mid x)$
When $C = 2$, this reduces to detecting whether or not $\eta(x) = p(y = 1 \mid x)$ is greater than $1/2$.
Remarks: the Bayes optimal classifier is generally not computable, as it assumes knowledge of $p(x, y)$ or $p(y \mid x)$. However, it is useful as a conceptual tool to formalize how well a classifier can do without knowing the joint distribution.

Comparing NNC to the Bayes optimal classifier: how well does our NNC do?
Theorem (Cover-Hart inequality). For the NNC rule $f^{\text{nnc}}$ for binary classification, we have
$R(f^*) \le R(f^{\text{nnc}}) \le 2R(f^*)(1 - R(f^*)) \le 2R(f^*)$
Namely, the expected risk of the NNC is at worst twice that of the Bayes optimal classifier. In short, NNC is doing a reasonable thing.

Outline
1 Administration
2 First learning algorithm: Nearest neighbor classifier
3 Deeper understanding of NNC
4 Some practical aspects of NNC (how to tune to get the best out of it; preprocessing data)
5 What we have learned

Hyperparameters in NNC. Two practical issues about NNC: choosing $K$, i.e., the number of nearest neighbors (default is 1), and choosing the right distance measure (default is the Euclidean distance), for example from the following generalized distance measure for $p \ge 1$:
$\|x - x_n\|_p = \Big( \sum_d |x_d - x_{nd}|^p \Big)^{1/p}$
These are not specified by the algorithm itself; resolving them requires empirical studies and is task/dataset-specific.
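The generalized $L_p$ distance is easy to write down in code; a sketch, where $p = 1$ recovers the Manhattan distance and $p = 2$ the Euclidean distance:

```python
# The generalized L_p distance for p >= 1:
# ||x - z||_p = (sum_d |x_d - z_d|^p)^(1/p).

def lp_distance(x, z, p=2):
    return sum(abs(xd - zd) ** p for xd, zd in zip(x, z)) ** (1.0 / p)

x, z = (0.0, 0.0), (3.0, 4.0)
print(lp_distance(x, z, p=1))  # 7.0  (Manhattan / city-block)
print(lp_distance(x, z, p=2))  # 5.0  (Euclidean)
```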

Tuning by using a validation dataset.
Training data: $N$ samples/instances $\mathcal{D}^{\text{train}} = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, used for learning $f(\cdot)$.
Test data: $M$ samples/instances $\mathcal{D}^{\text{test}} = \{(x_1, y_1), (x_2, y_2), \dots, (x_M, y_M)\}$, used for assessing how well $f(\cdot)$ will do in predicting an unseen $x \notin \mathcal{D}^{\text{train}}$.
Validation data: $L$ samples/instances $\mathcal{D}^{\text{val}} = \{(x_1, y_1), (x_2, y_2), \dots, (x_L, y_L)\}$, used to optimize hyperparameter(s).
Training, validation, and test data should not overlap!

Recipe:
For each possible value of the hyperparameter (say $K = 1, 3, \dots, 100$): train a model using $\mathcal{D}^{\text{train}}$, and evaluate its performance on $\mathcal{D}^{\text{val}}$.
Choose the model with the best performance on $\mathcal{D}^{\text{val}}$.
Evaluate that model on $\mathcal{D}^{\text{test}}$.
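The recipe is a loop over candidate hyperparameter values. The sketch below is schematic: `evaluate` is a hypothetical callback (not from the lecture) that trains on one set and returns accuracy on another, and the stand-in version here just looks scores up in a table:

```python
# The tuning recipe as a loop: pick the K with the best validation
# accuracy, then report performance once on the untouched test set.

def tune_K(candidate_Ks, evaluate, D_train, D_val, D_test):
    best_K = max(candidate_Ks, key=lambda K: evaluate(K, D_train, D_val))
    return best_K, evaluate(best_K, D_train, D_test)

# Stand-in evaluator: pretend validation accuracy peaks at K = 3.
scores = {1: 0.80, 3: 0.90, 5: 0.85}
fake_eval = lambda K, train, held_out: scores[K]
print(tune_K([1, 3, 5], fake_eval, None, None, None))  # (3, 0.9)
```

The key design point is that the test set is consulted only once, after the choice of K has been frozen.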

Cross-validation. What if we do not have validation data? We split the training data into $S$ equal parts; we use each part in turn as a validation dataset and use the others as a training dataset. We then choose the hyperparameter such that the model performs the best (based on average, variance, etc.). With $S = 5$, this is 5-fold cross-validation. Special case: when $S = N$, this is leave-one-out.
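The splitting step can be sketched as an index partition; a minimal version assuming, for simplicity, that $S$ divides $N$ evenly:

```python
# S-fold cross-validation splitting: partition the N training indices
# into S parts, then use each part once as the validation fold.
# Assumes S divides N; real implementations also handle remainders.

def cv_folds(N, S):
    """Yield (train_indices, val_indices) pairs for S-fold CV."""
    indices = list(range(N))
    fold_size = N // S
    for s in range(S):
        val = indices[s * fold_size : (s + 1) * fold_size]
        train = [i for i in indices if i not in val]
        yield train, val

for train_idx, val_idx in cv_folds(N=10, S=5):
    print(val_idx)   # [0, 1], [2, 3], [4, 5], [6, 7], [8, 9]
```

Setting S = N makes each validation fold a single point, i.e., leave-one-out.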

Yet another practical issue with NNC: distances depend on the units of the features!

Preprocess data: normalize the data to have zero mean and unit standard deviation in each dimension. Compute the mean and standard deviation of each feature:
$\bar{x}_d = \frac{1}{N} \sum_n x_{nd}, \quad s_d^2 = \frac{1}{N-1} \sum_n (x_{nd} - \bar{x}_d)^2$
then scale the feature accordingly:
$x_{nd} \leftarrow \frac{x_{nd} - \bar{x}_d}{s_d}$
There are many other ways of normalizing data; you would need/want to try different ones and pick among them using (cross-)validation.
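The standardization step above, applied to one feature column (illustrative numbers):

```python
# Per-feature standardization: subtract the mean and divide by the
# standard deviation, using the N-1 denominator shown on the slide.
import math

def standardize(column):
    N = len(column)
    mean = sum(column) / N
    var = sum((v - mean) ** 2 for v in column) / (N - 1)
    return [(v - mean) / math.sqrt(var) for v in column]

print(standardize([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
```

After this transform, each feature contributes on a comparable scale, so no single feature's units dominate the distance.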

Outline
1 Administration
2 First learning algorithm: Nearest neighbor classifier
3 Deeper understanding of NNC
4 Some practical aspects of NNC
5 What we have learned

Summary so far. We described a simple learning algorithm that is used intensively in practical applications (you will get a taste of it in your homework); discussed a few practical aspects, such as tuning hyperparameters with (cross-)validation; and briefly studied its theoretical properties: the concepts of loss functions, risks, and the Bayes optimal classifier, and the theoretical guarantees explaining why NNC would work.

Administration summary. The course is full, but I'm confident we can accommodate all students. The Math Quiz and HW1 will give you a sense of the mathematical rigor of the course. If you're not registered, be patient; I am not giving out PTEs until after HW1. HW1 is online, due next Wednesday at the beginning of class, and submission details are included in the assignment. Join Piazza if you haven't already.