CSCI567 Machine Learning (Fall 2018)


1 CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo, University of Southern California, August 22, 2018

2 Outline 1 About this course 2 Overview of machine learning 3 Nearest Neighbor Classifier (NNC) 4 Some theory on NNC August 22, / 63

3 About this course Outline 1 About this course 2 Overview of machine learning 3 Nearest Neighbor Classifier (NNC) 4 Some theory on NNC August 22, / 63

4 About this course Enrollment There was unfortunately some delay; enrollment closes by the end of today. Two offerings: on-campus and DEN. You only need to attend one. Two sections: a lecture and a discussion section. You need to attend both. The discussion section starts next week.

5 About this course Teaching staff Instructors: Lecture: Haipeng Luo; Discussion: Dr. Victor Adamchik and Dr. Kim Peters, with a lot of help from Dr. Michael Shindler. TAs: Chin-Cheng Hsu, Shamim Samadi, Chi Zhang, Ke Zhang, and more, plus more course producers and graders. Office hours: TBA

6 About this course Online platforms A course website: general information (course schedule, homework, etc.) August 22, / 63

7 About this course Online platforms A course website: general information (course schedule, homework, etc.) Piazza: main discussion forum everyone has to enroll August 22, / 63

8 About this course Online platforms A course website: general information (course schedule, homework, etc.) Piazza: main discussion forum everyone has to enroll D2L: lecture videos submit written assignments grade posting August 22, / 63

9 About this course Online platforms A course website: general information (course schedule, homework, etc.) Piazza: main discussion forum everyone has to enroll D2L: lecture videos submit written assignments grade posting GitHub: submit programming assignments everyone needs to have a GitHub account August 22, / 63

10 About this course Required preparation Undergraduate courses in probability and statistics, linear algebra, multivariate calculus Programming: Python and necessary packages, git August 22, / 63

11 About this course Required preparation Undergraduate courses in probability and statistics, linear algebra, multivariate calculus Programming: Python and necessary packages, git This is not an intro-level CS course; there is no training in basic programming skills.

12 About this course Slides and readings Lectures Lecture slides will be posted before or soon after class. August 22, / 63

13 About this course Slides and readings Lectures Lecture slides will be posted before or soon after class. Readings No required textbooks Main recommended readings: Machine Learning: A Probabilistic Perspective by Kevin Murphy Elements of Statistical Learning by Hastie, Tibshirani and Friedman More: see course website August 22, / 63

14 About this course Grade 15%: 5 written assignments 25%: 5 programming assignments 60%: 2 exams August 22, / 63

15 About this course Homework assignments Five, each consists of a problem set (3%): submit one PDF to D2L (scanned copy or typeset with LaTeX etc.), graded based on effort

16 About this course Homework assignments Five, each consists of a problem set (3%): submit one PDF to D2L (scanned copy or typeset with LaTeX etc.), graded based on effort, and programming tasks (5%): submit through GitHub, graded by scripts

17 About this course Policy Collaboration: allowed, but only at a high level. Each student has to make a separate submission and state clearly who you collaborated with (or obtained help from).

18 About this course Policy Collaboration: allowed, but only at a high level. Each student has to make a separate submission and state clearly who you collaborated with (or obtained help from). Late submissions: a total of 5 grace days for the semester; fill out a form to apply within 24 hours of the due time; grace days are in place of excused late submissions, not in addition to them; there is no grace period, and late submissions without using grace days will be scored as 0; it is your responsibility to keep track.

19 About this course Exams One midterm: 10/03, 5:00-7:00 PM. One final: 11/28, 5:00-7:00 PM. Location: TBA. Individual effort, closed-book and closed-notes. Requests for a different date/time must be submitted within the first two weeks, or as soon as possible in case of emergency.

20 About this course Academic honesty and integrity Plagiarism and other unacceptable violations Neither ethical nor in your self-interest Zero-tolerance August 22, / 63

21 About this course Teaching philosophy The nature of this course Describe basic concepts and tools Describe algorithms and their development with intuition and rigor August 22, / 63

22 About this course Teaching philosophy The nature of this course Describe basic concepts and tools Describe algorithms and their development with intuition and rigor Expectations of you Hone your skills in grasping abstract concepts and thinking critically to solve problems with machine learning techniques Solidify your knowledge with hands-on programming assignments Prepare for studying advanced machine learning techniques

23 About this course Teaching philosophy The nature of this course Describe basic concepts and tools Describe algorithms and their development with intuition and rigor Expectations of you Hone your skills in grasping abstract concepts and thinking critically to solve problems with machine learning techniques Solidify your knowledge with hands-on programming assignments Prepare for studying advanced machine learning techniques Feel free to interrupt and ask questions!

24 About this course Important things for you to do Take a look at the course website Enroll in Piazza Get a GitHub account Brush up your Python skills August 22, / 63

25 About this course Important things for you to do Take a look at the course website Enroll in Piazza Get a GitHub account Brush up your Python skills Questions? August 22, / 63

26 Overview of machine learning Outline 1 About this course 2 Overview of machine learning 3 Nearest Neighbor Classifier (NNC) 4 Some theory on NNC August 22, / 63

27 Overview of machine learning What is machine learning? One possible definition (cf. Murphy's book): a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty

28 Overview of machine learning Example: detect patterns How has the temperature been changing?

29 Overview of machine learning Example: detect patterns How has the temperature been changing? Patterns It seems to be going up, with repeated periods of going up and down.

30 Overview of machine learning How do we describe the pattern? Build a model: fit the data with a polynomial function. The model is not accurate for individual years, but collectively it captures the major trend. Still, it is not able to model the pattern of the repeated ups and downs.
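As a rough illustration of this modeling step (not from the slides), the sketch below fits a low-degree polynomial to made-up yearly temperature data with NumPy; all numbers are invented for illustration.

```python
import numpy as np

# Made-up yearly data: a slow upward trend plus noise.
rng = np.random.default_rng(0)
years = np.arange(1900, 2010)
temps = 14.0 + 0.01 * (years - 1900) + rng.normal(0.0, 0.3, years.size)

# Fit a degree-3 polynomial: inaccurate for individual years,
# but it captures the major trend collectively.
model = np.poly1d(np.polyfit(years, temps, deg=3))

# Use the fitted model to predict a future year.
print("Predicted temperature for 2010:", model(2010))
```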

31 Overview of machine learning Predicting the future What is the temperature of 2010? Again, the model is not accurate for that specific year, but it is close to the actual one.

32 Overview of machine learning What have we learned from this example? Key ingredients in machine learning Data: collected from past observation (we often call them training data)

33 Overview of machine learning What have we learned from this example? Key ingredients in machine learning Data: collected from past observation (we often call them training data) Modeling: devised to capture the patterns in the data. The model does not have to be true: "All models are wrong, but some are useful" (George Box).

34 Overview of machine learning What have we learned from this example? Key ingredients in machine learning Data: collected from past observation (we often call them training data) Modeling: devised to capture the patterns in the data. The model does not have to be true: "All models are wrong, but some are useful" (George Box). Prediction: apply the model to forecast what is going to happen in the future

35 Overview of machine learning A rich history of applying statistical learning methods Recognizing flowers (by R. Fisher, 1936) Types of Iris: setosa, versicolor, and virginica August 22, / 63

36 Overview of machine learning Huge success 20 years ago Recognizing handwritten zipcodes (AT&T Labs, late 1990s) [figure: sample digit images, each labeled with its true class]

37 Overview of machine learning More modern ones, in your social life Recognizing your friends on Facebook August 22, / 63

38 Overview of machine learning It might be possible to know more about you than you know yourself Recommending what you might like

39 Overview of machine learning Why is machine learning so hot? Tons of consumer applications: speech recognition, information retrieval and search, document classification, stock price prediction, object recognition, biometrics, etc. Highly desirable expertise in industry: Google, Facebook, Microsoft, Uber, Twitter, IBM, LinkedIn, Amazon, etc.

40 Overview of machine learning Why is machine learning so hot? Tons of consumer applications: speech recognition, information retrieval and search, document classification, stock price prediction, object recognition, biometrics, etc. Highly desirable expertise in industry: Google, Facebook, Microsoft, Uber, Twitter, IBM, LinkedIn, Amazon, etc. Enabling scientific breakthroughs: Climate science: understand the causes and effects of global warming. Biology and genetics: identify disease-causing genes and gene networks. Social science: social network analysis; social media analysis. Business and finance: marketing, operations research. Emerging ones: healthcare, energy, etc.

41 Overview of machine learning What is in machine learning? Different flavors of learning problems Supervised learning Aim to predict (as in previous examples) August 22, / 63

42 Overview of machine learning What is in machine learning? Different flavors of learning problems Supervised learning Aim to predict (as in previous examples) Unsupervised learning Aim to discover hidden and latent patterns and explore data August 22, / 63

43 Overview of machine learning What is in machine learning? Different flavors of learning problems Supervised learning Aim to predict (as in previous examples) Unsupervised learning Aim to discover hidden and latent patterns and explore data Reinforcement learning Aim to act optimally under uncertainty August 22, / 63

44 Overview of machine learning What is in machine learning? Different flavors of learning problems Supervised learning Aim to predict (as in previous examples) Unsupervised learning Aim to discover hidden and latent patterns and explore data Reinforcement learning Aim to act optimally under uncertainty Many other paradigms August 22, / 63

45 Overview of machine learning What is in machine learning? Different flavors of learning problems Supervised learning Aim to predict (as in previous examples) Unsupervised learning Aim to discover hidden and latent patterns and explore data Reinforcement learning Aim to act optimally under uncertainty Many other paradigms The focus and goal of this course Supervised learning (before midterm) Unsupervised learning (after midterm) August 22, / 63

46 Nearest Neighbor Classifier (NNC) Outline 1 About this course 2 Overview of machine learning 3 Nearest Neighbor Classifier (NNC) Intuitive example General setup for classification Algorithm How to measure performance Variants, Parameters, and Tuning Summary 4 Some theory on NNC August 22, / 63

47 Nearest Neighbor Classifier (NNC) Intuitive example Recognizing flowers Types of Iris: setosa, versicolor, and virginica August 22, / 63

48 Nearest Neighbor Classifier (NNC) Intuitive example Measuring the properties of the flowers Features and attributes: the widths and lengths of sepal and petal August 22, / 63

49 Nearest Neighbor Classifier (NNC) Intuitive example Pairwise scatter plots of 131 flower specimens Visualization of data helps identify the right learning model to use. Each colored point is a flower specimen: setosa, versicolor, virginica. [figure: pairwise scatter plots over sepal length, sepal width, petal length, petal width]

50 Nearest Neighbor Classifier (NNC) Intuitive example Different types seem well-clustered and separable Using two features: petal width and sepal length. [figure: pairwise scatter plots, highlighting the petal width vs. sepal length panel]

51 Nearest Neighbor Classifier (NNC) Intuitive example Labeling an unknown flower type Closer to the red cluster: so label it as setosa? [figure: scatter plot with the unknown flower marked]

52 Nearest Neighbor Classifier (NNC) General setup for classification General setup for multi-class classification Training data (set): N samples/instances: D^train = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}

53 Nearest Neighbor Classifier (NNC) General setup for classification General setup for multi-class classification Training data (set): N samples/instances: D^train = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} Each x_n ∈ R^D is called a feature vector.

54 Nearest Neighbor Classifier (NNC) General setup for classification General setup for multi-class classification Training data (set): N samples/instances: D^train = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} Each x_n ∈ R^D is called a feature vector. Each y_n ∈ [C] = {1, 2, ..., C} is called a label/class/category.

55 Nearest Neighbor Classifier (NNC) General setup for classification General setup for multi-class classification Training data (set): N samples/instances: D^train = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} Each x_n ∈ R^D is called a feature vector. Each y_n ∈ [C] = {1, 2, ..., C} is called a label/class/category. They are used for learning f: R^D → [C] for future prediction.

56 Nearest Neighbor Classifier (NNC) General setup for classification General setup for multi-class classification Training data (set): N samples/instances: D^train = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} Each x_n ∈ R^D is called a feature vector. Each y_n ∈ [C] = {1, 2, ..., C} is called a label/class/category. They are used for learning f: R^D → [C] for future prediction. Special case: binary classification. Number of classes: C = 2. Conventional labels: {0, 1} or {-1, +1}

57 Nearest Neighbor Classifier (NNC) General setup for classification Often, data is conveniently organized as a table. Ex: the Iris data: 4 features, 3 classes

58 Nearest Neighbor Classifier (NNC) Algorithm Nearest neighbor classification (NNC) Nearest neighbor: x(1) = x_nn(x), where nn(x) ∈ [N] = {1, 2, ..., N}, i.e., the index of one of the training instances: nn(x) = arg min_{n ∈ [N]} ||x - x_n||_2 = arg min_{n ∈ [N]} sqrt(Σ_{d=1}^D (x_d - x_nd)^2), where ||·||_2 is the L2/Euclidean distance.

59 Nearest Neighbor Classifier (NNC) Algorithm Nearest neighbor classification (NNC) Nearest neighbor: x(1) = x_nn(x), where nn(x) ∈ [N] = {1, 2, ..., N}, i.e., the index of one of the training instances: nn(x) = arg min_{n ∈ [N]} ||x - x_n||_2 = arg min_{n ∈ [N]} sqrt(Σ_{d=1}^D (x_d - x_nd)^2), where ||·||_2 is the L2/Euclidean distance. Classification rule: y = f(x) = y_nn(x)
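A minimal NumPy sketch of this rule (my own illustration, not the course's reference implementation): given a training matrix X_train with one feature vector per row and a label array y_train, the query x gets the label of its nearest neighbor under the Euclidean distance.

```python
import numpy as np

def nn_index(x, X_train):
    """nn(x): index of the training point closest to x in L2 distance."""
    dists = np.linalg.norm(X_train - x, axis=1)   # ||x - x_n||_2 for all n
    return int(np.argmin(dists))

def nnc_predict(x, X_train, y_train):
    """1-nearest-neighbor classification rule: y = f(x) = y_{nn(x)}."""
    return y_train[nn_index(x, X_train)]
```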

60 Nearest Neighbor Classifier (NNC) Algorithm Visual example In this 2-dimensional example, the nearest point to x is a red training instance; thus, x will be labeled as red. [figure: 2D scatter plot with the query point x]

61 Nearest Neighbor Classifier (NNC) Algorithm Example: classify Iris with two features Training data [table: ID (n), petal width (x_1), sepal length (x_2), category (y), with categories setosa, versicolor, virginica]

62 Nearest Neighbor Classifier (NNC) Algorithm Example: classify Iris with two features Training data [table: ID (n), petal width (x_1), sepal length (x_2), category (y), with categories setosa, versicolor, virginica] Flower with unknown category: petal width = 1.8 and sepal length = 6.4 (i.e. x = (1.8, 6.4)). Calculating the distance to each training point: ||x - x_n||_2 = sqrt((x_1 - x_n1)^2 + (x_2 - x_n2)^2) [table: ID, distance]. The nearest neighbor is a versicolor, so the category is versicolor.
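The distance column of the table can be reproduced with a few lines of NumPy; the training feature values below are placeholders for illustration, since the numeric entries of the slide's table are not shown here.

```python
import numpy as np

x = np.array([1.8, 6.4])            # unknown flower: (petal width, sepal length)
X_train = np.array([[0.2, 5.1],     # placeholder training rows (illustrative values only)
                    [1.9, 6.3],
                    [2.5, 7.0]])

# ||x - x_n||_2 = sqrt((x_1 - x_n1)^2 + (x_2 - x_n2)^2) for each training row
dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
print(dists)                        # the smallest entry identifies the nearest neighbor
```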

63 Nearest Neighbor Classifier (NNC) Algorithm Decision boundary For every point in the space, we can determine its label using the NNC rule. This gives rise to a decision boundary that partitions the space into different regions. [figure: decision boundary in 2D]

64 Nearest Neighbor Classifier (NNC) How to measure performance Is NNC doing the right thing for us? Intuition: we should compute the accuracy, the percentage of data points being correctly classified, or the error rate, the percentage of data points being incorrectly classified (accuracy + error rate = 1).

65 Nearest Neighbor Classifier (NNC) How to measure performance Is NNC doing the right thing for us? Intuition: we should compute the accuracy, the percentage of data points being correctly classified, or the error rate, the percentage of data points being incorrectly classified (accuracy + error rate = 1). Defined on the training data set: A^train = (1/N) Σ_n I[f(x_n) == y_n] and ε^train = (1/N) Σ_n I[f(x_n) ≠ y_n], where I[·] is the indicator function.

66 Nearest Neighbor Classifier (NNC) How to measure performance Is NNC doing the right thing for us? Intuition: we should compute the accuracy, the percentage of data points being correctly classified, or the error rate, the percentage of data points being incorrectly classified (accuracy + error rate = 1). Defined on the training data set: A^train = (1/N) Σ_n I[f(x_n) == y_n] and ε^train = (1/N) Σ_n I[f(x_n) ≠ y_n], where I[·] is the indicator function. Is this the right measure?
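These two quantities in a short NumPy sketch (my own; it assumes the predictions f(x_n) have already been computed):

```python
import numpy as np

def accuracy_and_error(y_pred, y_true):
    """A = (1/N) sum_n I[f(x_n) == y_n]; error rate = 1 - A."""
    correct = (y_pred == y_true)        # the indicator I[f(x_n) == y_n] for each n
    acc = float(correct.mean())
    return acc, 1.0 - acc

# Made-up predictions and labels, just to show the call.
print(accuracy_and_error(np.array([1, 2, 2, 3, 1]), np.array([1, 2, 3, 3, 1])))  # (0.8, 0.2)
```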

67 Nearest Neighbor Classifier (NNC) How to measure performance Example Training data: what are A^train and ε^train?

68 Nearest Neighbor Classifier (NNC) How to measure performance Example Training data: what are A^train and ε^train? A^train = 100%, ε^train = 0%: for every training data point, its nearest neighbor is itself.

69 Nearest Neighbor Classifier (NNC) How to measure performance Test Error Does it mean nearest neighbor is a very good algorithm? August 22, / 63

70 Nearest Neighbor Classifier (NNC) How to measure performance Test Error Does it mean nearest neighbor is a very good algorithm? Not really, having zero training error is simple! August 22, / 63

71 Nearest Neighbor Classifier (NNC) How to measure performance Test Error Does it mean nearest neighbor is a very good algorithm? Not really, having zero training error is simple! We should care about accuracy when predicting unseen data August 22, / 63

72 Nearest Neighbor Classifier (NNC) How to measure performance Test Error Does it mean nearest neighbor is a very good algorithm? Not really, having zero training error is simple! We should care about accuracy when predicting unseen data. Test/Evaluation data: D^test = {(x_1, y_1), (x_2, y_2), ..., (x_M, y_M)}, a fresh dataset that does not overlap with the training set.

73 Nearest Neighbor Classifier (NNC) How to measure performance Test Error Does it mean nearest neighbor is a very good algorithm? Not really, having zero training error is simple! We should care about accuracy when predicting unseen data. Test/Evaluation data: D^test = {(x_1, y_1), (x_2, y_2), ..., (x_M, y_M)}, a fresh dataset that does not overlap with the training set. Test accuracy and test error: A^test = (1/M) Σ_m I[f(x_m) == y_m] and ε^test = (1/M) Σ_m I[f(x_m) ≠ y_m]

74 Nearest Neighbor Classifier (NNC) How to measure performance Test Error Does it mean nearest neighbor is a very good algorithm? Not really, having zero training error is simple! We should care about accuracy when predicting unseen data. Test/Evaluation data: D^test = {(x_1, y_1), (x_2, y_2), ..., (x_M, y_M)}, a fresh dataset that does not overlap with the training set. Test accuracy and test error: A^test = (1/M) Σ_m I[f(x_m) == y_m] and ε^test = (1/M) Σ_m I[f(x_m) ≠ y_m] Good measurement of a classifier's performance

75 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Variant 1: measure nearness with other distances Previously, we used the Euclidean distance: nn(x) = arg min_{n ∈ [N]} ||x - x_n||_2

76 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Variant 1: measure nearness with other distances Previously, we used the Euclidean distance: nn(x) = arg min_{n ∈ [N]} ||x - x_n||_2 Many alternative distances exist, e.g., the L1 distance (i.e., city block distance, or Manhattan distance): ||x - x_n||_1 = Σ_{d=1}^D |x_d - x_nd| [figure: the green line is the Euclidean distance; the red, blue, and yellow lines are L1 paths]

77 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Variant 1: measure nearness with other distances Previously, we used the Euclidean distance: nn(x) = arg min_{n ∈ [N]} ||x - x_n||_2 Many alternative distances exist, e.g., the L1 distance (i.e., city block distance, or Manhattan distance): ||x - x_n||_1 = Σ_{d=1}^D |x_d - x_nd| More generally, the L_p distance (for p ≥ 1): ||x - x_n||_p = (Σ_d |x_d - x_nd|^p)^{1/p} [figure: the green line is the Euclidean distance; the red, blue, and yellow lines are L1 paths]
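A quick sketch of the L_p distance defined above (assuming p >= 1); p = 2 recovers the Euclidean distance and p = 1 the city block distance.

```python
import numpy as np

def lp_distance(x, z, p=2):
    """||x - z||_p = (sum_d |x_d - z_d|^p)^(1/p), for p >= 1."""
    return float((np.abs(x - z) ** p).sum() ** (1.0 / p))

x, z = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(lp_distance(x, z, p=2))   # 5.0 (Euclidean)
print(lp_distance(x, z, p=1))   # 7.0 (city block / Manhattan)
```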

78 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Variant 2: K-nearest neighbor (KNN) Increase the number of nearest neighbors to use? 1-nearest neighbor: nn_1(x) = arg min_{n ∈ [N]} ||x - x_n||_2. 2-nearest neighbor: nn_2(x) = arg min_{n ∈ [N] \ {nn_1(x)}} ||x - x_n||_2. 3-nearest neighbor: nn_3(x) = arg min_{n ∈ [N] \ {nn_1(x), nn_2(x)}} ||x - x_n||_2

79 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Variant 2: K-nearest neighbor (KNN) Increase the number of nearest neighbors to use? 1-nearest neighbor: nn_1(x) = arg min_{n ∈ [N]} ||x - x_n||_2. 2-nearest neighbor: nn_2(x) = arg min_{n ∈ [N] \ {nn_1(x)}} ||x - x_n||_2. 3-nearest neighbor: nn_3(x) = arg min_{n ∈ [N] \ {nn_1(x), nn_2(x)}} ||x - x_n||_2 The set of K nearest neighbors: knn(x) = {nn_1(x), nn_2(x), ..., nn_K(x)}

80 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Variant 2: K-nearest neighbor (KNN) Increase the number of nearest neighbors to use? 1-nearest neighbor: nn_1(x) = arg min_{n ∈ [N]} ||x - x_n||_2. 2-nearest neighbor: nn_2(x) = arg min_{n ∈ [N] \ {nn_1(x)}} ||x - x_n||_2. 3-nearest neighbor: nn_3(x) = arg min_{n ∈ [N] \ {nn_1(x), nn_2(x)}} ||x - x_n||_2 The set of K nearest neighbors: knn(x) = {nn_1(x), nn_2(x), ..., nn_K(x)} Note: with x(k) = x_{nn_k(x)}, we have ||x - x(1)||_2^2 ≤ ||x - x(2)||_2^2 ≤ ... ≤ ||x - x(K)||_2^2
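A small sketch (my own) of collecting the set knn(x): sort the training points by distance and take the first K indices.

```python
import numpy as np

def knn_indices(x, X_train, K):
    """Return the indices nn_1(x), ..., nn_K(x), ordered by increasing distance."""
    dists = np.linalg.norm(X_train - x, axis=1)   # ||x - x_n||_2 for every n
    return np.argsort(dists)[:K]
```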

81 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning How to classify with K neighbors? Classification rule Every neighbor votes: naturally, x_n votes for its own label y_n.

82 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning How to classify with K neighbors? Classification rule Every neighbor votes: naturally, x_n votes for its own label y_n. Aggregate everyone's vote on a class label c: v_c = Σ_{n ∈ knn(x)} I(y_n == c), for all c ∈ [C]

83 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning How to classify with K neighbors? Classification rule Every neighbor votes: naturally, x_n votes for its own label y_n. Aggregate everyone's vote on a class label c: v_c = Σ_{n ∈ knn(x)} I(y_n == c), for all c ∈ [C] Predict with the majority: y = f(x) = arg max_{c ∈ [C]} v_c
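A minimal, self-contained sketch of the voting rule (my own, assuming labels are nonnegative integers):

```python
import numpy as np

def knn_predict(x, X_train, y_train, K=3):
    """y = f(x) = argmax_c v_c, where v_c counts votes among the K nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors = np.argsort(dists)[:K]            # knn(x)
    votes = np.bincount(y_train[neighbors])      # v_c for c = 0, 1, 2, ...
    return int(np.argmax(votes))                 # majority label (ties go to the smaller label)
```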

84 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Example [figure: three panels of the same query point; K=1 gives label red, K=3 gives label red, K=5 gives label blue]

85 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Decision boundary [figure: decision boundaries for K = 1, K = 3, and a larger K] When K increases, the decision boundary becomes smoother.

86 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Decision boundary [figure: decision boundaries for K = 1, K = 3, and a larger K] When K increases, the decision boundary becomes smoother. What happens when K = N?

87 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Variant 3: Preprocessing data One issue of NNC: distances depend on units of the features! August 22, / 63

88 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Variant 3: Preprocessing data One issue of NNC: distances depend on units of the features! One solution: preprocess data so it looks more normalized. August 22, / 63

89 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Variant 3: Preprocessing data One issue of NNC: distances depend on the units of the features! One solution: preprocess the data so it looks more normalized. Example: compute the mean and standard deviation of each feature, x̄_d = (1/N) Σ_n x_nd and s_d^2 = (1/(N-1)) Σ_n (x_nd - x̄_d)^2, and scale the feature accordingly: x_nd ← (x_nd - x̄_d) / s_d

90 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Variant 3: Preprocessing data One issue of NNC: distances depend on the units of the features! One solution: preprocess the data so it looks more normalized. Example: compute the mean and standard deviation of each feature, x̄_d = (1/N) Σ_n x_nd and s_d^2 = (1/(N-1)) Σ_n (x_nd - x̄_d)^2, and scale the feature accordingly: x_nd ← (x_nd - x̄_d) / s_d Many other ways of normalizing data exist.
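A short sketch of this standardization (my own): the per-feature mean and standard deviation are computed on the training data and then applied to any other split with the same statistics.

```python
import numpy as np

def standardize(X_train, X_other):
    """Scale each feature to zero mean and unit variance using training statistics."""
    mean = X_train.mean(axis=0)            # x_bar_d for each feature d
    std = X_train.std(axis=0, ddof=1)      # s_d, using the 1/(N-1) convention above
    return (X_train - mean) / std, (X_other - mean) / std
```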

91 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Which variants should we use? Hyper-parameters in NNC: the distance measure (e.g. the parameter p for the L_p norm), K (i.e. how many nearest neighbors?), and different ways of preprocessing

92 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Which variants should we use? Hyper-parameters in NNC: the distance measure (e.g. the parameter p for the L_p norm), K (i.e. how many nearest neighbors?), and different ways of preprocessing Most algorithms have hyper-parameters. Tuning them is a significant part of applying an algorithm.

93 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Tuning via a development dataset Training data: N samples/instances: D^train = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, used for learning f(·). Test data: M samples/instances: D^test = {(x_1, y_1), (x_2, y_2), ..., (x_M, y_M)}, used for assessing how well f(·) will do.

94 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Tuning via a development dataset Training data: N samples/instances: D^train = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, used for learning f(·). Test data: M samples/instances: D^test = {(x_1, y_1), (x_2, y_2), ..., (x_M, y_M)}, used for assessing how well f(·) will do. Development/Validation data: L samples/instances: D^dev = {(x_1, y_1), (x_2, y_2), ..., (x_L, y_L)}, used to optimize hyper-parameter(s).

95 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Tuning via a development dataset Training data: N samples/instances: D^train = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, used for learning f(·). Test data: M samples/instances: D^test = {(x_1, y_1), (x_2, y_2), ..., (x_M, y_M)}, used for assessing how well f(·) will do. Development/Validation data: L samples/instances: D^dev = {(x_1, y_1), (x_2, y_2), ..., (x_L, y_L)}, used to optimize hyper-parameter(s). These three sets should not overlap!

96 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Recipe For each possible value of the hyper-parameter (e.g. K = 1, 3, ...): train a model using D^train and evaluate the performance of the model on D^dev.

97 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Recipe For each possible value of the hyper-parameter (e.g. K = 1, 3, ...): train a model using D^train and evaluate the performance of the model on D^dev. Choose the model with the best performance on D^dev.

98 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Recipe For each possible value of the hyper-parameter (e.g. K = 1, 3, ...): train a model using D^train and evaluate the performance of the model on D^dev. Choose the model with the best performance on D^dev. Evaluate the model on D^test.
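The recipe above as a short, self-contained loop (my own sketch; the data here is randomly generated just to make it runnable, and in practice the splits would come from your own loading code):

```python
import numpy as np

def knn_predict(x, X, y, K):
    """Predict the label of x by majority vote among its K nearest training points."""
    neighbors = np.argsort(np.linalg.norm(X - x, axis=1))[:K]
    return int(np.argmax(np.bincount(y[neighbors])))

def evaluate(X, y, X_eval, y_eval, K):
    """Accuracy of K-NN (trained on X, y) when predicting an evaluation set."""
    preds = np.array([knn_predict(x, X, y, K) for x in X_eval])
    return float((preds == y_eval).mean())

# Hypothetical D_train and D_dev.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(80, 2)), rng.integers(0, 2, 80)
X_dev, y_dev = rng.normal(size=(20, 2)), rng.integers(0, 2, 20)

# Try each value of the hyper-parameter, keep the one that does best on D_dev.
best_K = max([1, 3, 5, 7], key=lambda K: evaluate(X_train, y_train, X_dev, y_dev, K))
print("Chosen K:", best_K)
```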

99 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning S-fold Cross-validation What if we do not have a development set? Split the training data into S equal parts. S = 5: 5-fold cross validation August 22, / 63

100 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning S-fold Cross-validation What if we do not have a development set? Split the training data into S equal parts. Use each part in turn as a development dataset and use the others as a training dataset. S = 5: 5-fold cross validation August 22, / 63

101 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning S-fold Cross-validation What if we do not have a development set? Split the training data into S equal parts. Use each part in turn as a development dataset and use the others as a training dataset. Choose the hyper-parameter leading to best average performance. S = 5: 5-fold cross validation August 22, / 63

102 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning S-fold Cross-validation What if we do not have a development set? Split the training data into S equal parts. Use each part in turn as a development dataset and use the others as a training dataset. Choose the hyper-parameter leading to best average performance. S = 5: 5-fold cross validation Special case: S = N, called leave-one-out. August 22, / 63

103 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Cross-validation recipe Split the training data into S equal parts. Denote each part as D^train_s.

104 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Cross-validation recipe Split the training data into S equal parts. Denote each part as D^train_s. For each possible value of the hyper-parameter (e.g. K = 1, 3, ...): for every s ∈ [S], train a model using D^train_\s = D^train - D^train_s, evaluate the performance of the model on D^train_s, and average the S performance metrics.

105 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Cross-validation recipe Split the training data into S equal parts. Denote each part as D^train_s. For each possible value of the hyper-parameter (e.g. K = 1, 3, ...): for every s ∈ [S], train a model using D^train_\s = D^train - D^train_s, evaluate the performance of the model on D^train_s, and average the S performance metrics. Choose the hyper-parameter with the best averaged performance.

106 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Cross-validation recipe Split the training data into S equal parts. Denote each part as D^train_s. For each possible value of the hyper-parameter (e.g. K = 1, 3, ...): for every s ∈ [S], train a model using D^train_\s = D^train - D^train_s, evaluate the performance of the model on D^train_s, and average the S performance metrics. Choose the hyper-parameter with the best averaged performance. Use the best hyper-parameter to train a model using all of D^train.

107 Nearest Neighbor Classifier (NNC) Variants, Parameters, and Tuning Cross-validation recipe Split the training data into S equal parts. Denote each part as D^train_s. For each possible value of the hyper-parameter (e.g. K = 1, 3, ...): for every s ∈ [S], train a model using D^train_\s = D^train - D^train_s, evaluate the performance of the model on D^train_s, and average the S performance metrics. Choose the hyper-parameter with the best averaged performance. Use the best hyper-parameter to train a model using all of D^train. Evaluate the model on D^test.
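A sketch of this recipe (my own illustration): it takes the training arrays plus a prediction function such as the knn_predict sketched earlier, uses a simple contiguous split into S parts, and returns the hyper-parameter value with the best averaged dev performance.

```python
import numpy as np

def cross_validate(X, y, K_values, predict, S=5):
    """S-fold CV: average dev accuracy of each hyper-parameter value over the S folds."""
    folds = np.array_split(np.arange(len(X)), S)         # indices of the parts D_1, ..., D_S
    avg_scores = {}
    for K in K_values:
        accs = []
        for s in range(S):
            dev_idx = folds[s]                           # held-out part D_s
            tr_idx = np.concatenate([folds[t] for t in range(S) if t != s])
            preds = np.array([predict(x, X[tr_idx], y[tr_idx], K) for x in X[dev_idx]])
            accs.append(float((preds == y[dev_idx]).mean()))
        avg_scores[K] = float(np.mean(accs))             # averaged performance for this K
    best_K = max(avg_scores, key=avg_scores.get)
    return best_K, avg_scores

# Usage (assuming knn_predict from the earlier sketch and arrays X_train, y_train):
# best_K, scores = cross_validate(X_train, y_train, [1, 3, 5], predict=knn_predict)
```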

108 Nearest Neighbor Classifier (NNC) Summary Mini-summary Advantages of NNC: simple, easy to implement (widely used in practice)

109 Nearest Neighbor Classifier (NNC) Summary Mini-summary Advantages of NNC: simple, easy to implement (widely used in practice) Disadvantages of NNC: computationally intensive for large-scale problems: O(ND) for each prediction naively.

110 Nearest Neighbor Classifier (NNC) Summary Mini-summary Advantages of NNC: simple, easy to implement (widely used in practice) Disadvantages of NNC: computationally intensive for large-scale problems: O(ND) for each prediction naively. Need to carry the training data around. This type of method is called nonparametric.

111 Nearest Neighbor Classifier (NNC) Summary Mini-summary Advantages of NNC: simple, easy to implement (widely used in practice) Disadvantages of NNC: computationally intensive for large-scale problems: O(ND) for each prediction naively. Need to carry the training data around. This type of method is called nonparametric. Choosing the right hyper-parameters can be involved.

112 Nearest Neighbor Classifier (NNC) Summary Mini-summary Typical steps of developing a machine learning system: Collect data, split into training, development, and test sets. Train a model with a machine learning algorithm. Most often we apply cross-validation to tune hyper-parameters. Evaluate using the test data and report performance. Use the model to predict future/make decisions. August 22, / 63

113 Some theory on NNC Outline 1 About this course 2 Overview of machine learning 3 Nearest Neighbor Classifier (NNC) 4 Some theory on NNC Step 1: Expected risk August 22, / 63

114 Some theory on NNC How good is NNC really? To answer this question, we proceed in 3 steps August 22, / 63

115 Some theory on NNC How good is NNC really? To answer this question, we proceed in 3 steps 1 Define more carefully a performance metric for a classifier. August 22, / 63

116 Some theory on NNC How good is NNC really? To answer this question, we proceed in 3 steps 1 Define more carefully a performance metric for a classifier. 2 Hypothesize an ideal classifier - the best possible one. August 22, / 63

117 Some theory on NNC How good is NNC really? To answer this question, we proceed in 3 steps 1 Define more carefully a performance metric for a classifier. 2 Hypothesize an ideal classifier - the best possible one. 3 Compare NNC to the ideal one. August 22, / 63

118 Some theory on NNC Step 1: Expected risk Why does test error make sense? Test error makes sense only when the training set and the test set are correlated.

119 Some theory on NNC Step 1: Expected risk Why does test error make sense? Test error makes sense only when the training set and the test set are correlated. Most standard assumption: every data point (x, y) (from D^train, D^dev, or D^test) is an independent and identically distributed (i.i.d.) sample of an unknown joint distribution P, often written as (x, y) ∼ P i.i.d.

120 Some theory on NNC Step 1: Expected risk Why does test error make sense? Test error makes sense only when the training set and the test set are correlated. Most standard assumption: every data point (x, y) (from D^train, D^dev, or D^test) is an independent and identically distributed (i.i.d.) sample of an unknown joint distribution P, often written as (x, y) ∼ P i.i.d. The test error of a fixed classifier is therefore a random variable.

121 Some theory on NNC Step 1: Expected risk Why does test error make sense? Test error makes sense only when the training set and the test set are correlated. Most standard assumption: every data point (x, y) (from D^train, D^dev, or D^test) is an independent and identically distributed (i.i.d.) sample of an unknown joint distribution P, often written as (x, y) ∼ P i.i.d. The test error of a fixed classifier is therefore a random variable. We need a more certain measure of performance (so it is easy to compare different classifiers, for example).

122 Some theory on NNC Step 1: Expected risk Expected error What about the expectation of this random variable? E[ε^test]

123 Some theory on NNC Step 1: Expected risk Expected error What about the expectation of this random variable? E[ε^test] = (1/M) Σ_{m=1}^M E_{(x_m, y_m) ∼ P} I[f(x_m) ≠ y_m]

124 Some theory on NNC Step 1: Expected risk Expected error What about the expectation of this random variable? E[ε^test] = (1/M) Σ_{m=1}^M E_{(x_m, y_m) ∼ P} I[f(x_m) ≠ y_m] = E_{(x, y) ∼ P} I[f(x) ≠ y]

125 Some theory on NNC Step 1: Expected risk Expected error What about the expectation of this random variable? E[ε^test] = (1/M) Σ_{m=1}^M E_{(x_m, y_m) ∼ P} I[f(x_m) ≠ y_m] = E_{(x, y) ∼ P} I[f(x) ≠ y], i.e. the expected error/mistake of f

126 Some theory on NNC Step 1: Expected risk Expected error What about the expectation of this random variable? E[ε^test] = (1/M) Σ_{m=1}^M E_{(x_m, y_m) ∼ P} I[f(x_m) ≠ y_m] = E_{(x, y) ∼ P} I[f(x) ≠ y], i.e. the expected error/mistake of f Test error is a proxy of the expected error. The larger the test set, the better the approximation.

127 Some theory on NNC Step 1: Expected risk Expected error What about the expectation of this random variable? E[ε^test] = (1/M) Σ_{m=1}^M E_{(x_m, y_m) ∼ P} I[f(x_m) ≠ y_m] = E_{(x, y) ∼ P} I[f(x) ≠ y], i.e. the expected error/mistake of f Test error is a proxy of the expected error. The larger the test set, the better the approximation. What about the expectation of the training error? Is training error a good proxy of the expected error?

128 Some theory on NNC Step 1: Expected risk Expected risk More generally, for a loss function L(y', y), e.g. L(y', y) = I[y' ≠ y], called the 0-1 loss (many other losses exist, as we will see),

129 Some theory on NNC Step 1: Expected risk Expected risk More generally, for a loss function L(y', y), e.g. L(y', y) = I[y' ≠ y], called the 0-1 loss (many other losses exist, as we will see), the expected risk of f is defined as R(f) = E_{(x, y) ∼ P} L(f(x), y)

130 Some theory on NNC Step 1: Expected risk Expected risk More generally, for a loss function L(y', y), e.g. L(y', y) = I[y' ≠ y], called the 0-1 loss (the default; many other losses exist, as we will see), the expected risk of f is defined as R(f) = E_{(x, y) ∼ P} L(f(x), y)

131 Some theory on NNC Step 2: The ideal classifier Bayes optimal classifier What should we predict for x, knowing P(y | x)?

132 Some theory on NNC Step 2: The ideal classifier Bayes optimal classifier What should we predict for x, knowing P(y | x)? Bayes optimal classifier: f*(x) = arg max_{c ∈ [C]} P(c | x).

133 Some theory on NNC Step 2: The ideal classifier Bayes optimal classifier What should we predict for x, knowing P(y | x)? Bayes optimal classifier: f*(x) = arg max_{c ∈ [C]} P(c | x). The optimal risk: R(f*) = E_{x ∼ P_x}[1 - max_{c ∈ [C]} P(c | x)], where P_x is the marginal distribution of x.

134 Some theory on NNC Step 2: The ideal classifier Bayes optimal classifier What should we predict for x, knowing P(y | x)? Bayes optimal classifier: f*(x) = arg max_{c ∈ [C]} P(c | x). The optimal risk: R(f*) = E_{x ∼ P_x}[1 - max_{c ∈ [C]} P(c | x)], where P_x is the marginal distribution of x. It is easy to show R(f*) ≤ R(f) for any f.

135 Some theory on NNC Step 2: The ideal classifier Bayes optimal classifier What should we predict for x, knowing P(y | x)? Bayes optimal classifier: f*(x) = arg max_{c ∈ [C]} P(c | x). The optimal risk: R(f*) = E_{x ∼ P_x}[1 - max_{c ∈ [C]} P(c | x)], where P_x is the marginal distribution of x. It is easy to show R(f*) ≤ R(f) for any f. For the special case C = 2, let η(x) = P(0 | x); then R(f*) = E_{x ∼ P_x}[min{η(x), 1 - η(x)}].

136 Some theory on NNC Step 3: Comparing NNC to the ideal classifier Comparing NNC to Bayes optimal classifier Come back to the question: how good is NNC? August 22, / 63

137 Some theory on NNC Step 3: Comparing NNC to the ideal classifier Comparing NNC to Bayes optimal classifier Come back to the question: how good is NNC? Theorem (Cover and Hart, 1967): Let f_N be the 1-nearest-neighbor binary classifier using N training data points. Then (under mild conditions) R(f*) ≤ lim_{N→∞} E[R(f_N)] ≤ 2 R(f*), i.e., the expected risk of NNC in the limit is at most twice that of the best possible classifier.

138 Some theory on NNC Step 3: Comparing NNC to the ideal classifier Comparing NNC to Bayes optimal classifier Come back to the question: how good is NNC? Theorem (Cover and Hart, 1967): Let f_N be the 1-nearest-neighbor binary classifier using N training data points. Then (under mild conditions) R(f*) ≤ lim_{N→∞} E[R(f_N)] ≤ 2 R(f*), i.e., the expected risk of NNC in the limit is at most twice that of the best possible classifier. A pretty strong guarantee. In particular, R(f*) = 0 implies E[R(f_N)] → 0.
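A tiny simulation (entirely my own construction, not part of the lecture) that illustrates the flavor of this guarantee on a one-dimensional toy distribution: the measured 1-NN test error lands between R(f*) and 2R(f*).

```python
import numpy as np

rng = np.random.default_rng(0)

def p1(x):
    """P(y = 1 | x): 0.2 on the left half of [0, 1] and 0.8 on the right half."""
    return np.where(x > 0.5, 0.8, 0.2)

def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    y = (rng.uniform(0.0, 1.0, n) < p1(x)).astype(int)
    return x, y

# Bayes risk here is R(f*) = E[min(p1, 1 - p1)] = 0.2, so the theorem's bound is 0.4.
x_tr, y_tr = sample(1000)       # training set for the 1-NN classifier
x_te, y_te = sample(5000)       # a large test set approximates the expected risk
nearest = np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)   # 1-NN indices in 1D
print("1-NN test error:", np.mean(y_tr[nearest] != y_te))        # between 0.2 and 0.4
```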

139 Some theory on NNC Step 3: Comparing NNC to the ideal classifier Proof sketch Fact: x(1) → x with probability 1.

140 Some theory on NNC Step 3: Comparing NNC to the ideal classifier Proof sketch Fact: x(1) → x with probability 1. E[R(f_N)] = E[E_{(x, y) ∼ P} I[f_N(x) ≠ y]]

141 Some theory on NNC Step 3: Comparing NNC to the ideal classifier Proof sketch Fact: x(1) → x with probability 1. E[R(f_N)] = E[E_{(x, y) ∼ P} I[f_N(x) ≠ y]] → E_{x ∼ P_x} E_{y, y' ∼ P(·|x) i.i.d.} [I[y ≠ y']] (as N → ∞)

142 Some theory on NNC Step 3: Comparing NNC to the ideal classifier Proof sketch Fact: x(1) → x with probability 1. E[R(f_N)] = E[E_{(x, y) ∼ P} I[f_N(x) ≠ y]] → E_{x ∼ P_x} E_{y, y' ∼ P(·|x) i.i.d.} [I[y ≠ y']] (as N → ∞) = E_{P_x} E_{y, y' ∼ P(·|x) i.i.d.} [I[y = 0 and y' = 1] + I[y = 1 and y' = 0]]

143 Some theory on NNC Step 3: Comparing NNC to the ideal classifier Proof sketch Fact: x(1) → x with probability 1. E[R(f_N)] = E[E_{(x, y) ∼ P} I[f_N(x) ≠ y]] → E_{x ∼ P_x} E_{y, y' ∼ P(·|x) i.i.d.} [I[y ≠ y']] (as N → ∞) = E_{P_x} E_{y, y' ∼ P(·|x) i.i.d.} [I[y = 0 and y' = 1] + I[y = 1 and y' = 0]] = E_{P_x}[η(x)(1 - η(x)) + (1 - η(x))η(x)]

144 Some theory on NNC Step 3: Comparing NNC to the ideal classifier Proof sketch Fact: x(1) → x with probability 1. E[R(f_N)] = E[E_{(x, y) ∼ P} I[f_N(x) ≠ y]] → E_{x ∼ P_x} E_{y, y' ∼ P(·|x) i.i.d.} [I[y ≠ y']] (as N → ∞) = E_{P_x} E_{y, y' ∼ P(·|x) i.i.d.} [I[y = 0 and y' = 1] + I[y = 1 and y' = 0]] = E_{P_x}[η(x)(1 - η(x)) + (1 - η(x))η(x)] = 2 E_{P_x}[η(x)(1 - η(x))]

145 Some theory on NNC Step 3: Comparing NNC to the ideal classifier Proof sketch Fact: x(1) → x with probability 1. E[R(f_N)] = E[E_{(x, y) ∼ P} I[f_N(x) ≠ y]] → E_{x ∼ P_x} E_{y, y' ∼ P(·|x) i.i.d.} [I[y ≠ y']] (as N → ∞) = E_{P_x} E_{y, y' ∼ P(·|x) i.i.d.} [I[y = 0 and y' = 1] + I[y = 1 and y' = 0]] = E_{P_x}[η(x)(1 - η(x)) + (1 - η(x))η(x)] = 2 E_{P_x}[η(x)(1 - η(x))] ≤ 2 E_{P_x}[min{η(x), 1 - η(x)}]

146 Some theory on NNC Step 3: Comparing NNC to the ideal classifier Proof sketch Fact: x(1) → x with probability 1. E[R(f_N)] = E[E_{(x, y) ∼ P} I[f_N(x) ≠ y]] → E_{x ∼ P_x} E_{y, y' ∼ P(·|x) i.i.d.} [I[y ≠ y']] (as N → ∞) = E_{P_x} E_{y, y' ∼ P(·|x) i.i.d.} [I[y = 0 and y' = 1] + I[y = 1 and y' = 0]] = E_{P_x}[η(x)(1 - η(x)) + (1 - η(x))η(x)] = 2 E_{P_x}[η(x)(1 - η(x))] ≤ 2 E_{P_x}[min{η(x), 1 - η(x)}] = 2 R(f*)


CS639: Data Management for Data Science. Lecture 1: Intro to Data Science and Course Overview. Theodoros Rekatsinas CS639: Data Management for Data Science Lecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 Big science is data driven. 3 Increasingly many companies see themselves as data driven.

More information

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners Data Mining 3.5 (Instance-Based Learners) Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction k-nearest-neighbor Classifiers References Introduction Introduction Lazy vs. eager learning Eager

More information

Homework Assignment #3

Homework Assignment #3 CS 540-2: Introduction to Artificial Intelligence Homework Assignment #3 Assigned: Monday, February 20 Due: Saturday, March 4 Hand-In Instructions This assignment includes written problems and programming

More information

We use non-bold capital letters for all random variables in these notes, whether they are scalar-, vector-, matrix-, or whatever-valued.

We use non-bold capital letters for all random variables in these notes, whether they are scalar-, vector-, matrix-, or whatever-valued. The Bayes Classifier We have been starting to look at the supervised classification problem: we are given data (x i, y i ) for i = 1,..., n, where x i R d, and y i {1,..., K}. In this section, we suppose

More information

Nearest Neighbor Classification. Machine Learning Fall 2017

Nearest Neighbor Classification. Machine Learning Fall 2017 Nearest Neighbor Classification Machine Learning Fall 2017 1 This lecture K-nearest neighbor classification The basic algorithm Different distance measures Some practical aspects Voronoi Diagrams and Decision

More information

Python With Data Science

Python With Data Science Course Overview This course covers theoretical and technical aspects of using Python in Applied Data Science projects and Data Logistics use cases. Who Should Attend Data Scientists, Software Developers,

More information

Machine Learning. Classification

Machine Learning. Classification 10-701 Machine Learning Classification Inputs Inputs Inputs Where we are Density Estimator Probability Classifier Predict category Today Regressor Predict real no. Later Classification Assume we want to

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule. CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit

More information

ADVANCED CLASSIFICATION TECHNIQUES

ADVANCED CLASSIFICATION TECHNIQUES Admin ML lab next Monday Project proposals: Sunday at 11:59pm ADVANCED CLASSIFICATION TECHNIQUES David Kauchak CS 159 Fall 2014 Project proposal presentations Machine Learning: A Geometric View 1 Apples

More information

CS 170 Algorithms Fall 2014 David Wagner HW12. Due Dec. 5, 6:00pm

CS 170 Algorithms Fall 2014 David Wagner HW12. Due Dec. 5, 6:00pm CS 170 Algorithms Fall 2014 David Wagner HW12 Due Dec. 5, 6:00pm Instructions. This homework is due Friday, December 5, at 6:00pm electronically via glookup. This homework assignment is a programming assignment

More information

Data Mining - Data. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - Data 1 / 47

Data Mining - Data. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - Data 1 / 47 Data Mining - Data Dr. Jean-Michel RICHER 2018 jean-michel.richer@univ-angers.fr Dr. Jean-Michel RICHER Data Mining - Data 1 / 47 Outline 1. Introduction 2. Data preprocessing 3. CPA with R 4. Exercise

More information

Machine Learning (CSE 446): Perceptron

Machine Learning (CSE 446): Perceptron Machine Learning (CSE 446): Perceptron Sham M Kakade c 2018 University of Washington cse446-staff@cs.washington.edu 1 / 14 Announcements HW due this week. See detailed instructions in the hw. One pdf file.

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

CSE 446 Bias-Variance & Naïve Bayes

CSE 446 Bias-Variance & Naïve Bayes CSE 446 Bias-Variance & Naïve Bayes Administrative Homework 1 due next week on Friday Good to finish early Homework 2 is out on Monday Check the course calendar Start early (midterm is right before Homework

More information

CSCI544, Fall 2016: Assignment 1

CSCI544, Fall 2016: Assignment 1 CSCI544, Fall 2016: Assignment 1 Due Date: September 23 rd, 4pm. Introduction The goal of this assignment is to get some experience implementing the simple but effective machine learning technique, Naïve

More information

BRACE: A Paradigm For the Discretization of Continuously Valued Data

BRACE: A Paradigm For the Discretization of Continuously Valued Data Proceedings of the Seventh Florida Artificial Intelligence Research Symposium, pp. 7-2, 994 BRACE: A Paradigm For the Discretization of Continuously Valued Data Dan Ventura Tony R. Martinez Computer Science

More information

MS1b Statistical Data Mining Part 3: Supervised Learning Nonparametric Methods

MS1b Statistical Data Mining Part 3: Supervised Learning Nonparametric Methods MS1b Statistical Data Mining Part 3: Supervised Learning Nonparametric Methods Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Supervised Learning: Nonparametric

More information

Object Recognition. Lecture 11, April 21 st, Lexing Xie. EE4830 Digital Image Processing

Object Recognition. Lecture 11, April 21 st, Lexing Xie. EE4830 Digital Image Processing Object Recognition Lecture 11, April 21 st, 2008 Lexing Xie EE4830 Digital Image Processing http://www.ee.columbia.edu/~xlx/ee4830/ 1 Announcements 2 HW#5 due today HW#6 last HW of the semester Due May

More information

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University DS 4400 Machine Learning and Data Mining I Alina Oprea Associate Professor, CCIS Northeastern University January 24 2019 Logistics HW 1 is due on Friday 01/25 Project proposal: due Feb 21 1 page description

More information

Machine Learning in Biology

Machine Learning in Biology Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant

More information

CS 343: Artificial Intelligence

CS 343: Artificial Intelligence CS 343: Artificial Intelligence Kernels and Clustering Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley.

More information

Introduction to Machine Learning Prof. Anirban Santara Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Introduction to Machine Learning Prof. Anirban Santara Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Introduction to Machine Learning Prof. Anirban Santara Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture 14 Python Exercise on knn and PCA Hello everyone,

More information

CS 241 Data Organization using C

CS 241 Data Organization using C CS 241 Data Organization using C Fall 2018 Instructor Name: Dr. Marie Vasek Contact: Private message me on the course Piazza page. Office: Farris 2120 Office Hours: Tuesday 2-4pm and Thursday 9:30-11am

More information

Large Scale Data Analysis Using Deep Learning

Large Scale Data Analysis Using Deep Learning Large Scale Data Analysis Using Deep Learning Machine Learning Basics - 1 U Kang Seoul National University U Kang 1 In This Lecture Overview of Machine Learning Capacity, overfitting, and underfitting

More information

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Thomas Giraud Simon Chabot October 12, 2013 Contents 1 Discriminant analysis 3 1.1 Main idea................................

More information

USC Viterbi School of Engineering

USC Viterbi School of Engineering Introduction to Computational Thinking and Data Science USC Viterbi School of Engineering http://www.datascience4all.org Term: Fall 2016 Time: Tues- Thur 10am- 11:50am Location: Allan Hancock Foundation

More information

Nearest Neighbor with KD Trees

Nearest Neighbor with KD Trees Case Study 2: Document Retrieval Finding Similar Documents Using Nearest Neighbors Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox January 22 nd, 2013 1 Nearest

More information

15-411/ Compiler Design

15-411/ Compiler Design 15-411/15-611 Compiler Design Jan Hoffmann Fall 2016 http://www.cs.cmu.edu/~janh/courses/411/16 Teaching Staff Instructor: Jan Hoffmann Office hours: Tue 10:30am-noon Thu 1:00pm-2:30pm at GHC 9105 Teaching

More information

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016 CPSC 340: Machine Learning and Data Mining Non-Parametric Models Fall 2016 Assignment 0: Admin 1 late day to hand it in tonight, 2 late days for Wednesday. Assignment 1 is out: Due Friday of next week.

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

CSCE 441 Computer Graphics Fall 2018

CSCE 441 Computer Graphics Fall 2018 CSCE 441 Computer Graphics Fall 2018 Meetings: Monday, Wednesday, Friday 9:10-10:00 a.m. Location: HRBB 113 Instructor: Dr. John Keyser Office: 527C, H.R. Bright Building Phone: 458-0167 Email: keyser@cse.tamu.edu

More information

Summary. Machine Learning: Introduction. Marcin Sydow

Summary. Machine Learning: Introduction. Marcin Sydow Outline of this Lecture Data Motivation for Data Mining and Learning Idea of Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication and Regression Examples Data:

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Data Mining: Exploring Data

Data Mining: Exploring Data Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar But we start with a brief discussion of the Friedman article and the relationship between Data

More information

CS 340 Lec. 4: K-Nearest Neighbors

CS 340 Lec. 4: K-Nearest Neighbors CS 340 Lec. 4: K-Nearest Neighbors AD January 2011 AD () CS 340 Lec. 4: K-Nearest Neighbors January 2011 1 / 23 K-Nearest Neighbors Introduction Choice of Metric Overfitting and Underfitting Selection

More information

CSC 172 Data Structures and Algorithms. Fall 2017 TuTh 3:25 pm 4:40 pm Aug 30- Dec 22 Hoyt Auditorium

CSC 172 Data Structures and Algorithms. Fall 2017 TuTh 3:25 pm 4:40 pm Aug 30- Dec 22 Hoyt Auditorium CSC 172 Data Structures and Algorithms Fall 2017 TuTh 3:25 pm 4:40 pm Aug 30- Dec 22 Hoyt Auditorium Agenda Administrative aspects Brief overview of the course Hello world in Java CSC 172, Fall 2017, UR

More information

Classification Algorithms in Data Mining

Classification Algorithms in Data Mining August 9th, 2016 Suhas Mallesh Yash Thakkar Ashok Choudhary CIS660 Data Mining and Big Data Processing -Dr. Sunnie S. Chung Classification Algorithms in Data Mining Deciding on the classification algorithms

More information

CS489/698 Lecture 2: January 8 th, 2018

CS489/698 Lecture 2: January 8 th, 2018 CS489/698 Lecture 2: January 8 th, 2018 Nearest Neighbour [RN] Sec. 18.8.1, [HTF] Sec. 2.3.2, [D] Chapt. 3, [B] Sec. 2.5.2, [M] Sec. 1.4.2 CS489/698 (c) 2018 P. Poupart 1 Inductive Learning (recap) Induction

More information

Predicting Popular Xbox games based on Search Queries of Users

Predicting Popular Xbox games based on Search Queries of Users 1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which

More information

INFS 2150 (Section A) Fall 2018

INFS 2150 (Section A) Fall 2018 INFS 2150 (Section A) Fall 2018 Introduction to Web Development Class meets TUE & THU: 12:30am-1:45pm: in Wheatley 114 Instructor: Peter Y. Wu Office: Wheatley 309 Office Hours: Tuesday 9:00 am-12:00 noon;

More information

Overview of machine learning

Overview of machine learning Overview of machine learning Kevin P. Murphy Last updated November 26, 2007 1 Introduction In this Chapter, we provide a brief overview of the most commonly studied problems and solution methods within

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #1: Course Introduction U Kang Seoul National University U Kang 1 In This Lecture Motivation to study data mining Administrative information for this course U Kang 2

More information

Distributed Systems Intro and Course Overview

Distributed Systems Intro and Course Overview Distributed Systems Intro and Course Overview COS 418: Distributed Systems Lecture 1 Wyatt Lloyd Distributed Systems, What? 1) Multiple computers 2) Connected by a network 3) Doing something together Distributed

More information

A Taxonomy of Semi-Supervised Learning Algorithms

A Taxonomy of Semi-Supervised Learning Algorithms A Taxonomy of Semi-Supervised Learning Algorithms Olivier Chapelle Max Planck Institute for Biological Cybernetics December 2005 Outline 1 Introduction 2 Generative models 3 Low density separation 4 Graph

More information

Prof. David Yarowsky

Prof. David Yarowsky DATABASES (600315 and 600415) Prof David Yarowsky Department of Computer Science Johns Hopkins University yarowsky@gmailcom August 28, 2014 600315/415 - DATABASES Instructor: Prof David Yarowsky TAs: Hackerman

More information

Statistical Methods in AI

Statistical Methods in AI Statistical Methods in AI Distance Based and Linear Classifiers Shrenik Lad, 200901097 INTRODUCTION : The aim of the project was to understand different types of classification algorithms by implementing

More information

ITP489 In-Memory DBMS for Real Time Analytics

ITP489 In-Memory DBMS for Real Time Analytics ITP489 In-Memory DBMS for Real Time Analytics Instructor: Richard W. Vawter ITP 489, Spring 2015 Office: OHE 530B Location: OHE 540 E-Mail: vawter@usc.edu Wed. 2:00-4:50 p.m. Phone: (213) 740-9541 Office

More information

Distribution-free Predictive Approaches

Distribution-free Predictive Approaches Distribution-free Predictive Approaches The methods discussed in the previous sections are essentially model-based. Model-free approaches such as tree-based classification also exist and are popular for

More information

Overview. Non-Parametrics Models Definitions KNN. Ensemble Methods Definitions, Examples Random Forests. Clustering. k-means Clustering 2 / 8

Overview. Non-Parametrics Models Definitions KNN. Ensemble Methods Definitions, Examples Random Forests. Clustering. k-means Clustering 2 / 8 Tutorial 3 1 / 8 Overview Non-Parametrics Models Definitions KNN Ensemble Methods Definitions, Examples Random Forests Clustering Definitions, Examples k-means Clustering 2 / 8 Non-Parametrics Models Definitions

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Fundamentals of learning (continued) and the k-nearest neighbours classifier Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart.

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information