K-Nearest Neighbors (KNN) and Predictive Accuracy
Dr. Ammar Mohammed, Associate Professor of Computer Science, ISSR, Cairo University
PhD in CS (Uni. Koblenz-Landau, Germany)
Contact: Ammar@cu.edu.eg, Drammarcu@gmail.com
Spring 2019
Nearest Neighbors
[Diagram: a set of stored cases with attributes Atr1 ... AtrN and a class label, plus an unseen case with the same attributes]
- Store the training samples
- Use the training samples to predict the class label of unseen samples
This is called instance-based learning.
Instance-Based Learning
- Approximates real-valued or discrete-valued target functions
- Learning consists simply of storing the presented training data (no model is generated)
- When a new query instance (unseen data) is encountered, a set of similar related instances is retrieved from memory and used to classify the new query instance
- Disadvantage of instance-based methods: the cost of classifying new instances can be high
- Nearly all computation takes place at classification time rather than at learning time
K-Nearest Neighbors (KNN)
- The most basic instance-based method; a supervised learning algorithm
- Basic idea: classify an object based on the closest training examples in the training data
- "If it walks like a duck and quacks like a duck, then it is probably a duck"
- To classify a test sample: compute its distance to the training samples, then choose the k nearest samples
Nearest Neighbors
Classifying an unknown record requires three inputs:
1. The set of stored samples (training data)
2. A distance metric to compute the distance between samples
3. The value of k, the number of nearest neighbors to retrieve
Nearest-neighbor classifiers are lazy learners: no model is pre-constructed for classification.
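The three inputs above can be sketched in a minimal classifier. This is an illustrative implementation, not library code; the function names `euclidean` and `knn_classify` and the toy 2-D data are assumptions for the example.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Distance metric (input 2): Euclidean distance between feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(query, training, k):
    """Classify `query` by majority vote among the k nearest training samples.

    `training` (input 1) is a list of (features, label) pairs; `k` is input 3.
    """
    nearest = sorted(training, key=lambda s: euclidean(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data with two classes
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify((1.1, 0.9), train, k=3))  # "A"
```

Note that all the work happens inside `knn_classify` at query time, which is exactly the "lazy learner" behavior described above: there is no training step at all.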
Nearest Neighbors Example

#  Food (3)   Chat (2)  Fast (2)  Price (3)  Bar (2)  BigTip
1  great      yes       yes       normal     no       yes
2  great      no        yes       normal     no       yes
3  mediocre   yes       no        high       no       no
4  great      yes       yes       normal     yes      yes

Similarity metric: number of matching attributes (k = 2)

New example 1: (great, no, no, normal, no)
- Most similar: number 2 (1 mismatch, 4 matches) → yes
- Second most similar: number 1 (2 mismatches, 3 matches) → yes
- Both neighbors vote yes → classified as yes

New example 2: (mediocre, yes, no, normal, no)
- Most similar: number 3 (1 mismatch, 4 matches) → no
- Second most similar: number 1 (2 mismatches, 3 matches) → yes
- The two votes are split → yes/no tie
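The matching-attributes similarity from this slide can be reproduced in a few lines. This is a sketch of the slide's example; the helper name `matches` is an assumption.

```python
def matches(a, b):
    """Similarity = number of attribute values the two examples share."""
    return sum(1 for x, y in zip(a, b) if x == y)

# The restaurant table from the slide: (Food, Chat, Fast, Price, Bar) -> BigTip
data = [
    (("great",    "yes", "yes", "normal", "no"),  "yes"),  # 1
    (("great",    "no",  "yes", "normal", "no"),  "yes"),  # 2
    (("mediocre", "yes", "no",  "high",   "no"),  "no"),   # 3
    (("great",    "yes", "yes", "normal", "yes"), "yes"),  # 4
]

# New example 1 from the slide
query = ("great", "no", "no", "normal", "no")
ranked = sorted(data, key=lambda d: matches(query, d[0]), reverse=True)
top2 = [label for _, label in ranked[:2]]
print(top2)  # ['yes', 'yes'] -- both of the k=2 neighbors vote yes
```

Running the same ranking for the second query, (mediocre, yes, no, normal, no), yields one "no" and one "yes" vote, reproducing the tie shown on the slide.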
Nearest Neighbors
Compute the distance between two points x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n).
Euclidean distance:
    d(x, y) = sqrt( Σ_i (x_i − y_i)² )
To determine the class from the nearest-neighbor list, take a majority vote of the class labels among the k nearest neighbors.
Example
In 5-nearest neighbors, what is the class of the new instance x = (9.1, 11.0)?
    d(x, y) = sqrt((9.1 − 0.8)² + (11.0 − 6.3)²) ≈ 9.5
    d(x, y) = sqrt((9.1 − 1.4)² + (11.0 − 8.1)²) ≈ 8.2
    ...
    d(x, y) = sqrt((9.1 − 19.6)² + (11.0 − 11.1)²) ≈ 10.5
Select the 5 instances having the minimum distance. You will find 3 instances classified as + and 2 instances classified as −, so we conclude that x = (9.1, 11.0) is classified as +.
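The three distances shown can be recomputed directly. Only three of the training points appear on the slide, so the sketch below uses just those; the full training set is not given.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = (9.1, 11.0)
# The three training points visible on the slide (the rest are elided)
for point in [(0.8, 6.3), (1.4, 8.1), (19.6, 11.1)]:
    print(round(euclidean(x, point), 1))  # 9.5, 8.2, 10.5
```

With the full training set, sorting all such distances and taking the 5 smallest would give the neighbor list used for the vote.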
Feature Normalization
Scaling issues: attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes. Example:
- height of a person may vary from 1.5 m to 1.8 m
- weight of a person may vary from 90 lb to 300 lb
- income of a person may vary from $10K to $1M
Feature Normalization
[Example table omitted: with raw values, the distance is dominated by the attribute Loan, while the attribute Age has no impact.]
How do we solve this problem?
KNN Standardized Distance
Rescale each attribute with min-max normalization:
    X_s = (X − Min) / (Max − Min)
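The min-max formula above maps every attribute into [0, 1], so no single attribute can dominate the distance. A minimal sketch, with hypothetical Age and Loan columns chosen to illustrate the scale gap:

```python
def min_max_scale(values):
    """Rescale a column of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, 35, 45, 20, 35]                      # small-range attribute
loans = [40000, 60000, 80000, 20000, 120000]     # large-range attribute

print(min_max_scale(ages))   # e.g. [0.2, 0.6, 1.0, 0.0, 0.6]
print(min_max_scale(loans))  # e.g. [0.2, 0.4, 0.6, 0.0, 1.0]
```

After scaling, both columns contribute values of comparable magnitude to the Euclidean distance. Note that Min and Max must come from the training data and then be reused for new queries.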
How to Determine a Good Value of k
[Diagram: the same query point is classified differently as k grows]
- k = 1: belongs to the square class?
- k = 3: belongs to the triangle class
- k = 7: belongs to the square class
Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points
- If k is too large, the neighborhood may include points from other classes
- Choose an odd value for k to eliminate ties
How to Determine a Good Value of k
k is determined experimentally:
- Start with k = 1 and use a test set to validate the error rate of the classifier
- Repeat with k = k + 2
- Choose the value of k for which the error rate is minimum
Note: k should be an odd number to avoid ties.
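The experimental procedure above can be sketched as a small search loop. The classifier, the helper name `best_k`, and the toy data are assumptions for the example.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, train, k):
    nearest = sorted(train, key=lambda s: euclidean(query, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def best_k(train, test, max_k=9):
    """Try k = 1, 3, 5, ... and keep the value with the lowest test error rate."""
    best_k_val, best_err = None, float("inf")
    for k in range(1, max_k + 1, 2):  # odd values only, to avoid ties
        err = sum(knn_classify(x, train, k) != y for x, y in test) / len(test)
        if err < best_err:
            best_k_val, best_err = k, err
    return best_k_val

# Hypothetical well-separated data, where even k = 1 classifies perfectly
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
test = [((0.5, 0.5), "A"), ((5.5, 5.5), "B")]
print(best_k(train, test))  # 1
```

On real, noisier data the error rate typically dips at some intermediate k rather than at k = 1.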
Predictive Accuracy
Classification Step 1: Splitting the Data
[Diagram: data from the past with known results (labels + and −) is split into a training set and a testing set]
Classification Step 2: Train and Evaluate
[Diagram: the training set feeds a model builder; the resulting model makes predictions (Y/N) on the testing set, which are evaluated against the known results]
Methods for Evaluation
- Predictive accuracy: the most obvious method for estimating the performance of the classifier
      Accuracy = (number of correct classifications) / (total number of test cases)
- Efficiency: time to construct the model and time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: the understandability and insight provided by the model
Predictive Accuracy
    P = C / N
where P is the accuracy, N the number of instances, and C the number of correctly classified instances.
The available data is split into two parts, a training set and a test set. If the dataset is a single file, it must first be divided into a training set and a test set.
Splitting the Data: Holdout Set
- The available data set D is divided into two disjoint subsets: the training set D_train (for learning a model) and the test set D_test (for testing the model).
- Important: the training set should not be used in testing, and the test set should not be used in learning. An unseen test set provides an unbiased estimate of accuracy.
- The test set is also called the holdout set. (The examples in the original data set D are all labeled with classes.)
- This method is mainly used when the data set D is large.
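A holdout split is a shuffle followed by a cut. A minimal sketch; the function name `holdout_split`, the 70/30 ratio, and the integer stand-ins for labeled examples are assumptions.

```python
import random

def holdout_split(data, test_fraction=0.3, seed=42):
    """Shuffle the labeled data and split it into disjoint train/test sets."""
    rng = random.Random(seed)       # fixed seed for a reproducible split
    shuffled = data[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))              # stand-in for 10 labeled examples
train, test = holdout_split(data)
print(len(train), len(test))        # 7 3
```

Because the two slices come from one shuffled copy, every example lands in exactly one of the two sets, which is the disjointness requirement stated above.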
n-Fold Cross-Validation
- The available data is partitioned into n equal-size disjoint subsets.
- Each subset is used in turn as the test set, with the remaining n−1 subsets combined as the training set to learn a classifier.
- The procedure is run n times, giving n accuracies; the final estimated accuracy is the average of the n accuracies.
- 10-fold and 5-fold cross-validation are commonly used.
- This method is used when the available data is not large.
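The partitioning step can be sketched as a generator of (train, test) pairs; averaging the n per-fold accuracies would then give the final estimate. The function name `n_fold_splits` and the integer stand-ins for examples are assumptions.

```python
def n_fold_splits(data, n):
    """Yield n (train, test) pairs: each fold serves exactly once as the test set."""
    folds = [data[i::n] for i in range(n)]          # n disjoint subsets
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))                              # stand-in for 10 labeled examples
for train, test in n_fold_splits(data, n=5):
    print(len(train), len(test))                    # 8 2 on every fold
```

In practice the data is shuffled before slicing into folds; for each pair you would train a classifier on `train`, measure accuracy on `test`, and report the mean of the five accuracies.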
Accuracy Paradox
- Accuracy is not suitable in some applications: with class imbalance, accuracy alone cannot be trusted to select a well-trained model.
- In text mining, we may only be interested in documents of a particular topic, which are only a small portion of a big document collection.
- In classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are interested only in the minority class. High accuracy does not mean any intrusion is detected: if 1% of traffic is intrusions, a classifier can achieve 99% accuracy by doing nothing.
- The class of interest is commonly called the positive class, and the rest the negative class.
Confusion Matrix
- A confusion matrix describes the breakdown of the errors in predictions: it shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data.
- The matrix is N×N, where N is the number of target values (classes).
- Model performance is commonly evaluated using the data in the matrix. The following table displays a 2×2 confusion matrix for two classes (positive and negative).
Confusion Matrix Metrics
- Accuracy: the proportion of correct classifications out of the overall number of cases.
- Positive predictive value, or precision: the proportion of predicted-positive cases that are truly positive.
- Negative predictive value: the proportion of predicted-negative cases that are truly negative.
- Sensitivity, or recall: the proportion of actual positive cases that are correctly identified.
- Specificity: the proportion of actual negative cases that are correctly identified.
Precision and Recall
Used in information retrieval and text classification; we use a confusion matrix to introduce them:
- TP (true positive): the number of correct classifications of positive examples
- FN (false negative): the number of incorrect classifications of positive examples
- FP (false positive): the number of incorrect classifications of negative examples
- TN (true negative): the number of correct classifications of negative examples
Precision and Recall Measures
    p = TP / (TP + FP)        r = TP / (TP + FN)
- Precision p: the number of true positives divided by the number of true positives plus false positives, i.e., the number of correctly classified positive examples divided by the total number of examples classified as positive.
- Recall r: the number of true positives divided by the number of true positives plus false negatives, i.e., the number of correctly classified positive examples divided by the total number of actual positive examples in the test set.
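The two formulas translate directly into code. A minimal sketch; the function name and the hypothetical counts (one caught positive, 99 missed, none falsely flagged) are assumptions chosen to show how the two measures can diverge.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts: 1 positive caught, 99 positives missed, 0 false alarms
p, r = precision_recall(tp=1, fp=0, fn=99)
print(p, r)  # 1.0 0.01
```

Here precision is perfect because every flagged example really is positive, yet recall is only 1% because almost all positives were missed, the same pathology as the accuracy paradox above.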
Example
This confusion matrix gives precision p = 100% and recall r = 1%, because we classified only one positive example correctly and no negative examples wrongly.
Note: precision and recall only measure classification on the positive class.
F-Score (F1-Score)
- It is hard to compare two classifiers using two measures; the F1 score combines precision and recall into one measure:
      F1 = 2pr / (p + r)
- F1 is the harmonic mean of p and r, and the harmonic mean of two numbers tends to be closer to the smaller of the two.
- For the F1 value to be large, both p and r must be large.
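The harmonic-mean behavior is easy to verify numerically. A minimal sketch; the function name `f1_score` is an assumption.

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall: F1 = 2pr / (p + r)."""
    return 2 * p * r / (p + r)

# The skewed example above: p = 1.0, r = 0.01 -> F1 is dragged toward r
print(round(f1_score(1.0, 0.01), 4))  # 0.0198

# Balanced precision and recall give the same value back
print(f1_score(0.5, 0.5))  # 0.5
```

So a classifier with perfect precision but near-zero recall scores almost zero on F1, which is exactly why F1 is preferred over plain accuracy on imbalanced data.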