Nearest neighbors classifiers

Nearest neighbors classifiers James McInerney Adapted from slides by Daniel Hsu Sept 11, 2017 1 / 25

Housekeeping We received 167 HW0 submissions on Gradescope before midnight Sept 10th. From a random sample, most look well done and are using the question assignment mechanism on Gradescope correctly. Example assignment submission. I received a few late submission emails. This one time only, I will turn the Gradescope submission page on again after class until 8pm EST. We will go over the trickiest parts of the homework at the beginning of the talk this Wednesday. Therefore I have pushed the first office hours to Wednesday. Here are the dates of the two exams for this course: Exam 1: Wednesday October 18th, 2017 Exam 2: Monday December 11th, 2017 2 / 25

Example: OCR for digits 1. Classify images of handwritten digits by the actual digits they represent. 2. Classification problem: Y = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} (a discrete set). 3 / 25

Nearest neighbor (NN) classifier
Given: labeled examples $D := \{(x_i, y_i)\}_{i=1}^n$.
Predictor: $\hat{f}_D : \mathcal{X} \to \mathcal{Y}$. On input $x$:
1. Find the point $x_i$ among $\{x_i\}_{i=1}^n$ that is closest to $x$ (the nearest neighbor).
2. Return $y_i$.
4 / 25

How to measure distance?
A default choice for distance between points in $\mathbb{R}^d$ is the Euclidean distance (also called $\ell_2$ distance):
$$\|u - v\|_2 := \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}$$
(where $u = (u_1, u_2, \ldots, u_d)$ and $v = (v_1, v_2, \ldots, v_d)$).
Grayscale $28 \times 28$ pixel images: treat them as vectors (of 784 real-valued features) that live in $\mathbb{R}^{784}$.
5 / 25
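To make the predictor concrete, here is a minimal sketch (not from the slides) of 1-NN prediction with the Euclidean distance above; the names nn_predict, X_train, y_train are made up for illustration, and the training points are assumed to be rows of a NumPy array (e.g., flattened digit images of shape (n, 784)).

```python
import numpy as np

def nn_predict(X_train, y_train, x):
    """1-NN prediction: return the label of the training point closest to x
    under Euclidean (l2) distance."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    return y_train[np.argmin(dists)]              # label of the nearest neighbor
```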

Example: OCR for digits with NN classifier
Classify images of handwritten digits by the digits they depict.
$\mathcal{X} = \mathbb{R}^{784}$, $\mathcal{Y} = \{0, 1, \ldots, 9\}$.
Given: labeled examples $D := \{(x_i, y_i)\}_{i=1}^n \subseteq \mathcal{X} \times \mathcal{Y}$.
Construct the NN classifier $\hat{f}_D$ using $D$.
Question: Is this classifier any good?
6 / 25

Error rate
Error rate of classifier $f$ on a set of labeled examples $D$:
$$\mathrm{err}_D(f) := \frac{\#\{(x, y) \in D : f(x) \neq y\}}{|D|}$$
(i.e., the fraction of $D$ on which $f$ disagrees with the paired label).
Sometimes, we'll write this as $\mathrm{err}(f, D)$.
Question: What is $\mathrm{err}_D(\hat{f}_D)$?
7 / 25
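A possible way to compute this quantity in code, assuming the dataset is given as an array of feature vectors X with matching labels y (a sketch, not the course's implementation):

```python
import numpy as np

def error_rate(f, X, y):
    """Fraction of examples (x, y) in the dataset on which f(x) != y."""
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != y)
```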

A better way to evaluate the classifier
Split the labeled examples $\{(x_i, y_i)\}_{i=1}^n$ into two sets (randomly):
Training data $S$.
Test data $T$.
Only use the training data $S$ to construct the NN classifier $\hat{f}_S$.
Training error rate of $\hat{f}_S$: $\mathrm{err}_S(\hat{f}_S) = 0\%$.
Use the test data $T$ to evaluate the accuracy of $\hat{f}_S$.
Test error rate of $\hat{f}_S$: $\mathrm{err}_T(\hat{f}_S) = 3.09\%$.
Is this good?
8 / 25
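A minimal sketch of this evaluation protocol, reusing the hypothetical nn_predict and error_rate helpers from the earlier sketches; the split fraction and seed are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Randomly split the labeled examples (X, y) into training and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_fraction * len(X))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Construct the classifier from S only, then estimate its error rate on the held-out T.
X_S, y_S, X_T, y_T = train_test_split(X, y)
test_error = error_rate(lambda x: nn_predict(X_S, y_S, x), X_T, y_T)
```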

Diagnostics
Some mistakes made by the NN classifier (test point in $T$, nearest neighbor in $S$):
The first mistake (correct label is "2") could have been avoided by looking at the three nearest neighbors (whose labels are "8", "2", and "2").
[Figure: a misclassified test point shown next to its three nearest neighbors.]
9 / 25

k-nearest neighbors classifier
Given: labeled examples $D := \{(x_i, y_i)\}_{i=1}^n$.
Predictor: $\hat{f}_{D,k} : \mathcal{X} \to \mathcal{Y}$. On input $x$:
1. Find the $k$ points $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$ among $\{x_i\}_{i=1}^n$ closest to $x$ (the $k$ nearest neighbors).
2. Return the plurality label among $y_{i_1}, y_{i_2}, \ldots, y_{i_k}$.
(Break ties in both steps arbitrarily.)
10 / 25
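A sketch of the k-NN predictor under the same assumptions as before (hypothetical names, not from the slides); ties in the vote are broken arbitrarily by Counter.most_common, matching the slide's convention.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """k-NN prediction: plurality vote among the k training points closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)       # count labels among the neighbors
    return votes.most_common(1)[0][0]                  # plurality label (ties broken arbitrarily)
```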

Effect of k
Smaller k: smaller training error rate.
Larger k: higher training error rate, but predictions are more stable due to voting.
OCR digits classification:
k                1        3        5        7        9
Test error rate  0.0309   0.0295   0.0312   0.0306   0.0341
11 / 25

Choosing k: the hold-out set approach
1. Pick a subset $V \subset S$ (hold-out set, a.k.a. validation set).
2. For each $k \in \{1, 3, 5, \ldots\}$:
   Construct the k-NN classifier $\hat{f}_{S \setminus V, k}$ using $S \setminus V$.
   Compute the error rate of $\hat{f}_{S \setminus V, k}$ on $V$ (the "hold-out error rate").
3. Pick the $k$ that gives the smallest hold-out error rate.
12 / 25
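One possible implementation of this hold-out procedure, reusing the hypothetical knn_predict sketch from above; the candidate values of k, the hold-out fraction, and the seed are illustrative defaults, not prescribed by the slides.

```python
import numpy as np

def choose_k(X_S, y_S, candidate_ks=(1, 3, 5, 7, 9), val_fraction=0.2, seed=0):
    """Pick k by minimizing the error rate on a hold-out (validation) subset V of S."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_S))
    n_val = int(val_fraction * len(X_S))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    X_train, y_train = X_S[train_idx], y_S[train_idx]
    X_val, y_val = X_S[val_idx], y_S[val_idx]

    def holdout_error(k):
        preds = np.array([knn_predict(X_train, y_train, x, k) for x in X_val])
        return np.mean(preds != y_val)

    return min(candidate_ks, key=holdout_error)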

Other distance functions
$\ell_p$ norm: $\mathrm{dist}(u, v) = \|u - v\|_p$, where
$$\|x\|_p := \left( |x_1|^p + |x_2|^p + \cdots + |x_d|^p \right)^{1/p}.$$
OCR digits classification:
Distance         l2       l3
Test error rate  3.09%    2.83%
Manhattan distance: $\mathrm{dist}(u, v)$ = distance on a grid between $u$ and $v$ along a strictly horizontal/vertical path.
String edit distance: $\mathrm{dist}(u, v)$ = number of insertions/deletions/mutations needed to change $u$ into $v$.
13 / 25
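Sketches of these three distance functions (illustrative implementations, not from the slides); the edit distance below is the standard dynamic-programming (Levenshtein) computation.

```python
import numpy as np

def lp_distance(u, v, p=2):
    """l_p distance between vectors u and v: ||u - v||_p."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

def manhattan_distance(u, v):
    """l_1 ("Manhattan") distance: sum of coordinate-wise absolute differences."""
    return np.sum(np.abs(u - v))

def edit_distance(u, v):
    """Number of insertions, deletions, and mutations needed to change string u into v."""
    d = np.zeros((len(u) + 1, len(v) + 1), dtype=int)
    d[:, 0] = np.arange(len(u) + 1)
    d[0, :] = np.arange(len(v) + 1)
    for i in range(1, len(u) + 1):
        for j in range(1, len(v) + 1):
            cost = 0 if u[i - 1] == v[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # mutation (substitution)
    return d[len(u), len(v)]
```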

Bad features
Caution: the nearest neighbor classifier can be broken by bad/noisy features!
[Figure: scatter plots of classes y = 0 and y = 1, one against Feature 1 and Feature 2, one against Feature 1 alone.]
14 / 25

Questions of interest 1. How good is the classifier learned using NN on your problem? 2. Is NN a good learning method in general? 15 / 25

Statistical learning theory
Basic assumption (main idea): labeled examples $\{(x_i, y_i)\}_{i=1}^n$ come from the same source as future examples.
[Diagram: past labeled examples from P feed a learning algorithm, which outputs a learned predictor; a new (unlabeled) example from P is fed to the learned predictor, which outputs a predicted label.]
More formally: $\{(x_i, y_i)\}_{i=1}^n$ is an i.i.d. sample from a probability distribution $P$ over $\mathcal{X} \times \mathcal{Y}$.
16 / 25

Prediction error rate
Define the (true) error rate of a classifier $f : \mathcal{X} \to \mathcal{Y}$ w.r.t. $P$ to be
$$\mathrm{err}_P(f) := \mathbb{P}(f(X) \neq Y)$$
where $(X, Y)$ is a pair of random variables with joint distribution $P$ (i.e., $(X, Y) \sim P$).
Let $\hat{f}_S$ be the classifier trained using labeled examples $S$.
The true error rate of $\hat{f}_S$ is $\mathrm{err}_P(\hat{f}_S) := \mathbb{P}(\hat{f}_S(X) \neq Y)$.
We cannot compute this without knowing $P$.
17 / 25

Estimating the true error rate
Suppose $\{(x_i, y_i)\}_{i=1}^n$ (assumed to be an i.i.d. sample from $P$) is randomly split into $S$ and $T$, and $\hat{f}_S$ is based only on $S$.
Then $\hat{f}_S$ and $T$ are independent, and the test error rate of $\hat{f}_S$ is an unbiased estimate of the true error rate of $\hat{f}_S$.
If $|T| = m$, then the test error rate $\mathrm{err}_T(\hat{f}_S)$ (conditional on $S$) is a binomial random variable scaled by $1/m$:
$$m \cdot \mathrm{err}_T(\hat{f}_S) \mid S \;\sim\; \mathrm{Bin}\big(m, \mathrm{err}_P(\hat{f}_S)\big).$$
The expected value of $\mathrm{err}_T(\hat{f}_S)$ is $\mathrm{err}_P(\hat{f}_S)$. (This means that $\mathrm{err}_T(\hat{f}_S)$ is an unbiased estimator of $\mathrm{err}_P(\hat{f}_S)$.)
18 / 25
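A small synthetic simulation of this claim (the setup and numbers are illustrative, not from the slides): for a fixed classifier with true error rate 3.09%, the number of mistakes on a fresh test set of size m is Bin(m, 0.0309), so averaging the test error rate over many independent test sets recovers the true error rate.

```python
import numpy as np

rng = np.random.default_rng(0)
true_error, m, trials = 0.0309, 1000, 10_000

# Each trial draws a fresh test set of size m; each example is misclassified
# independently with probability true_error.
mistakes = rng.binomial(n=m, p=true_error, size=trials)
test_error_rates = mistakes / m

print(test_error_rates.mean())   # close to 0.0309 (unbiasedness)
print(test_error_rates.std())    # roughly sqrt(p * (1 - p) / m)
```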

Limits of prediction
Binary classification: $\mathcal{Y} = \{0, 1\}$.
Probability distribution $P$ over $\mathcal{X} \times \{0, 1\}$; let $(X, Y) \sim P$.
Think of $P$ as comprising two parts:
1. The marginal distribution $\mu$ of $X$ (a distribution over $\mathcal{X}$).
2. The conditional distribution of $Y$ given $X = x$, for each $x \in \mathcal{X}$: $\eta(x) := \mathbb{P}(Y = 1 \mid X = x)$.
If $\eta(x)$ is 0 or 1 for all $x \in \mathcal{X}$ where $\mu(x) > 0$, then the optimal error rate is zero (i.e., $\min_f \mathrm{err}_P(f) = 0$). Otherwise it is non-zero.
19 / 25
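For example (an illustrative case, not from the slides): if $\eta(x) = 0.9$ at some $x$ with $\mu(x) > 0$, then even the best possible classifier, which predicts 1 at $x$, still errs with probability $0.1$ whenever $X = x$; the label is genuinely random there, so perfect prediction is impossible.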

Bayes optimality
What is the classifier with smallest true error rate?
$$f^*(x) := \begin{cases} 0 & \text{if } \eta(x) \leq 1/2; \\ 1 & \text{if } \eta(x) > 1/2. \end{cases}$$
(Do you see why?)
$f^*$ is called the Bayes (optimal) classifier, and
$$\mathrm{err}_P(f^*) = \min_f \mathrm{err}_P(f) = \mathbb{E}\left[\min\{\eta(X),\, 1 - \eta(X)\}\right],$$
which is called the Bayes (optimal) error rate.
Question: How far from optimal is the classifier produced by the NN learning method?
20 / 25
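One standard way to answer the "(Do you see why?)" prompt, sketched here for completeness (the argument is not spelled out on the slides): condition on $X = x$ and compare the two possible predictions.
$$\mathbb{P}\big(f(X) \neq Y \mid X = x\big) = \begin{cases} \eta(x) & \text{if } f(x) = 0, \\ 1 - \eta(x) & \text{if } f(x) = 1, \end{cases}$$
so the pointwise-best prediction at $x$ is $1$ exactly when $1 - \eta(x) < \eta(x)$, i.e., when $\eta(x) > 1/2$, and the best achievable conditional error is $\min\{\eta(x), 1 - \eta(x)\}$. Averaging over $X \sim \mu$ gives $\mathrm{err}_P(f^*) = \mathbb{E}[\min\{\eta(X), 1 - \eta(X)\}]$.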

Consistency of k-nn We say that a learning algorithm A is consistent if [ lim E err P ( ˆf ] n n) = err(f ), where ˆf n is the classifier learned using A on an i.i.d. sample of size n. Theorem (e.g., Cover and Hart 1967) Assume η is continuous. Then: 1-NN is consistent if min f err P (f) = 0. k-nn is consistent, provided that k := k n is chosen as an increasing but sublinear function of n: k n lim kn =, lim n n n = 0. 21 / 25

Key takeaways 1. k-nn learning procedure; role of k, distance functions, features. 2. Training and test error rates. 3. Framework of statistical learning theory; estimating the true error rate; Bayes optimality; high-level idea of consistency. 22 / 25

Recap: definitions
We denote our classifier as $f$. What does $f(x)$ mean?
We denote the dataset as $D$. What is $\mathrm{err}_D(f)$?
$$\mathrm{err}_D(f) = \frac{\#\{(x, y) \in D : f(x) \neq y\}}{|D|}$$
What does $P(X, Y)$ mean?
What is $\mathrm{err}_P(f)$?
$$\mathrm{err}_P(f) = \mathbb{E}_P\left[\mathbb{I}[f(X) \neq Y]\right] = \mathbb{P}(f(X) \neq Y)$$
23 / 25

Recap: unbiased estimator
What is an estimator?
Using the definitions from the previous slide, what is $\mathrm{err}_D(f)$ an estimator of?
$\mathrm{err}_P(f)$.
What is an unbiased estimator?
$\mathbb{E}[\mathrm{err}_D(f)] = \mathrm{err}_P(f)$.
What on earth does that mean?
If we repeatedly draw datasets of size $n$, $D_n \sim P(X, Y)$, and calculate $\mathrm{err}_{D_n}(f)$, the average will converge to $\mathrm{err}_P(f)$.
Why should we care?
24 / 25

Recap: consistent algorithm
Let $\hat{f}_n$ denote a classifier trained using $n$ examples.
What is the property $\lim_{n \to \infty} \mathbb{E}\left[\mathrm{err}_P(\hat{f}_n)\right] = \mathrm{err}_P(f^*)$ called?
Consistency.
What is the ideal classifier $f^*$ also known as?
The Bayes optimal classifier.
How can we define $f^*$?
$$\mathrm{err}_P(f^*) = \min_f \mathrm{err}_P(f) = \mathbb{E}\left[\min\{\eta(X),\, 1 - \eta(X)\}\right]$$
25 / 25