Lecture 6: K-Nearest Neighbors (KNN) and Predictive Accuracy

1 Lecture 6: K-Nearest Neighbors (KNN) and Predictive Accuracy
Machine Learning
Dr. Ammar Mohammed

2 Nearest Neighbors
Store the training samples.
Use the training samples to predict the class label of unseen samples.
This is called instance-based learning.
[Figure: a set of stored cases with attributes Atr1 ... AtrN and a class label (A, B, or C), next to an unseen case with attributes Atr1 ... AtrN.]

3 Nearest Neighbors
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
[Figure: compute the distance from the test sample to the training samples, then choose the k nearest samples.]

4 Nearest Neighbors
Requires three inputs:
1. The set of stored samples
2. A distance metric to compute the distance between samples
3. The value of k, the number of nearest neighbors to retrieve

5 Nearest Neighbors
To classify an unknown record:
1. Compute its distance to every training record
2. Identify the k nearest neighbors
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

6 Nearest Neighbors
Compute the distance between two points x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).
Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Option for determining the class from the nearest-neighbor list: take the majority vote of the class labels among the k nearest neighbors.
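The three steps (compute distances, pick the k nearest, take a majority vote) can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own implementation: the function names `euclidean_distance` and `knn_classify` and the toy data are assumptions made for this example, and ties in the vote are resolved by whichever label was counted first.

```python
import math
from collections import Counter

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training sample.
    distances = [(euclidean_distance(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Step 2: keep the k samples with the smallest distance.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data (not the data set used on the slides).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels = ["-", "-", "-", "+", "+", "+"]
prediction = knn_classify(points, labels, (8.0, 9.0), k=3)
print(prediction)
```

Sorting all distances is the simplest approach and is adequate for small data sets; larger collections usually call for a spatial index instead.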

7 Example
In 5-nearest neighbors, what is the class of the new instance x = (9.1, 11.0)?
Compute the distance from x to each training instance: 4.7, 9.6, ..., 10.5.
Select the 5 instances with the minimum distance.
You will find 3 instances classified as + and 2 instances classified as -.
We conclude that x = (9.1, 11.0) is classified as +.

8 Features Normalization
Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
Example:
- the height of a person may vary from 1.5 m to 1.8 m
- the weight of a person may vary from 90 lb to 300 lb
- the income of a person may vary from $10K to $1M

9 Features Normalization
The distance is dominated by the attribute Loan, while the attribute Age has almost no impact.
How can this problem be solved?

10 KNN Standardized Distance
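One common way to standardize attributes before computing distances is min-max scaling, which maps every attribute to the range [0, 1]. The sketch below assumes that technique; the slide's own formula is not preserved in this transcript, and z-score standardization is another frequent choice. The helper name `min_max_scale` and the Age and Loan values are hypothetical.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant attribute carries no distance information.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical attribute values, for illustration only.
ages = [25, 35, 45, 60]
loans = [40000, 60000, 80000, 200000]

scaled_ages = min_max_scale(ages)
scaled_loans = min_max_scale(loans)
print(scaled_ages)
print(scaled_loans)
```

After scaling, a difference of one full range in Age counts as much as a difference of one full range in Loan, so neither attribute dominates the Euclidean distance.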

11 Predictive Accuracy

12 Classification Step 1: Splitting Data
[Figure: data with known results (the past) is split into a training set and a testing set.]

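13 Classification Step 2: Train and Evaluate
[Figure: the training set (data with known results) is fed to a model builder; the resulting model makes predictions (+ or -) on the testing set, and the predictions are evaluated (Y/N) against the known results.]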

14 Methods for Evaluation
Predictive accuracy: the most obvious method for estimating the performance of the classifier.
$\text{Accuracy} = \frac{\text{Number of correct classifications}}{\text{Total number of test cases}}$
Efficiency: the time to construct the model and the time to use the model.
Robustness: handling noise and missing values.
Scalability: efficiency in disk-resident databases.
Interpretability: the understanding and insight provided by the model.

15 Predictive Accuracy
$P = C / N$, where P is the accuracy, N is the number of instances, and C is the number of correctly classified instances.
- The available data is split into two parts, called the training set and the test set.
- If the data set is a single file, we need to divide it into a training set and a test set before using method 1.

16 Splitting the Data
Holdout set: the available data set D is divided into two disjoint subsets:
- the training set Dtrain (for learning a model)
- the test set Dtest (for testing the model)
Important: the training set should not be used in testing, and the test set should not be used in learning.
An unseen test set provides an unbiased estimate of accuracy.
The test set is also called the holdout set. (The examples in the original data set D are all labeled with classes.)
This method is mainly used when the data set D is large.
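A holdout split and the accuracy measure P = C / N can be sketched as follows. This is an illustrative example under stated assumptions: the helper names `holdout_split` and `accuracy`, the 25% test fraction, the fixed seed, and the toy data are all choices made here, not taken from the lecture.

```python
import random

def holdout_split(samples, test_fraction=0.25, seed=0):
    """Shuffle the labeled samples and split them into disjoint training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled samples form D_test, the rest form D_train.
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(true_labels, predicted_labels):
    """P = C / N: correctly classified test cases over all test cases."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical labeled data set D: (features, class label) pairs.
data = [((i, i + 1), "+" if i % 2 else "-") for i in range(20)]
d_train, d_test = holdout_split(data)
print(len(d_train), len(d_test))
```

Shuffling before the split guards against any ordering in the original file, and the fixed seed makes the split reproducible.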

17 Splitting Data Using n-fold Cross-Validation
- The available data is split into two parts, called the training set and the test set.
- If the data set is a single file, we need to divide it into a training set and a test set before using method 1.

18 n-fold Cross-Validation
n-fold cross-validation: the available data is partitioned into n equal-size disjoint subsets.
Use each subset in turn as the test set and combine the remaining n-1 subsets as the training set to learn a classifier.
The procedure is run n times, which gives n accuracies.
The final estimated accuracy of learning is the average of the n accuracies.
10-fold and 5-fold cross-validation are commonly used.
This method is used when the available data is not large.
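The procedure can be sketched in plain Python as below. It is a minimal illustration, not the lecture's code: `n_fold_indices`, `cross_validate`, and the placeholder learner `majority_class_learner` are hypothetical names, the data is made up, and the folds are built without shuffling to keep the example deterministic.

```python
def n_fold_indices(n_samples, n_folds):
    """Partition indices 0..n_samples-1 into n_folds disjoint, near-equal folds."""
    return [list(range(i, n_samples, n_folds)) for i in range(n_folds)]

def cross_validate(samples, labels, n_folds, train_and_predict):
    """Return the mean accuracy over n_folds train/test rounds."""
    accuracies = []
    for test_idx in n_fold_indices(len(samples), n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        # Learn on the other n-1 folds, predict on the held-out fold.
        predictions = train_and_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

def majority_class_learner(train_x, train_y, test_x):
    """Placeholder learner: always predicts the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common for _ in test_x]

# Hypothetical data: 8 positive and 2 negative examples.
xs = list(range(10))
ys = ["+"] * 8 + ["-"] * 2
mean_accuracy = cross_validate(xs, ys, 5, majority_class_learner)
print(mean_accuracy)
```

Any classifier can be plugged in through `train_and_predict`, as long as it takes training samples, training labels, and test samples and returns one predicted label per test sample.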

19 Accuracy Paradox
Accuracy is not suitable in some applications. With class imbalance, accuracy alone cannot be trusted to select a well-trained model.
In text mining, we may be interested only in the documents of a particular topic, which are only a small portion of a big document collection.
In classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are interested only in the minority class.
High accuracy does not mean that any intrusion is detected. E.g., with 1% intrusions, a classifier achieves 99% accuracy by doing nothing (labeling every case as normal).
The class of interest is commonly called the positive class, and the rest are the negative classes.
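The 1% intrusion case is easy to reproduce. The snippet below is a small sketch on a made-up test set of 1000 cases, with a "classifier" that labels everything as negative; the variable names and counts are assumptions for illustration.

```python
# Hypothetical test set: 1% intrusions ("+"), 99% normal traffic ("-").
true_labels = ["+"] * 10 + ["-"] * 990

# A "do nothing" classifier that predicts the negative class for every case.
predictions = ["-"] * len(true_labels)

accuracy = sum(t == p for t, p in zip(true_labels, predictions)) / len(true_labels)
detected = sum(t == "+" and p == "+" for t, p in zip(true_labels, predictions))

print(accuracy)   # fraction of correct classifications
print(detected)   # number of intrusions actually caught
```

The accuracy is high only because the negative class dominates; not a single positive case is found.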

20 Confusion Matrix Performance
A confusion matrix is a way of describing the breakdown of the errors in predictions. It shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data.
The matrix is NxN, where N is the number of target values (classes). The performance of such models is commonly evaluated using the data in the matrix.
The following table displays a 2x2 confusion matrix for two classes (Positive and Negative).

21 Confusion Matrix Performance
- Accuracy: the proportion of correct classifications out of the overall number of cases.
- Positive Predictive Value or Precision: the proportion of cases predicted as positive that are actually positive.
- Negative Predictive Value: the proportion of cases predicted as negative that are actually negative.
- Sensitivity or Recall: the proportion of actual positive cases that are correctly identified.
- Specificity: the proportion of actual negative cases that are correctly identified.

22 Precision and Recall Measures
Used in information retrieval and text classification. We use a confusion matrix to introduce them.
- TP (True Positive): the number of correct classifications of the positive examples.
- FN (False Negative): the number of incorrect classifications of the positive examples.
- FP (False Positive): the number of incorrect classifications of the negative examples.
- TN (True Negative): the number of correct classifications of the negative examples.
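The four counts can be tallied directly from the actual and predicted labels. The sketch below is illustrative: the function name `confusion_counts`, the use of "+" as the positive class, and the example labels are assumptions made here.

```python
def confusion_counts(true_labels, predicted_labels, positive="+"):
    """Return (TP, FN, FP, TN) for a two-class problem."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(true_labels, predicted_labels):
        if actual == positive and predicted == positive:
            tp += 1   # positive example classified correctly
        elif actual == positive:
            fn += 1   # positive example classified as negative
        elif predicted == positive:
            fp += 1   # negative example classified as positive
        else:
            tn += 1   # negative example classified correctly
    return tp, fn, fp, tn

# Hypothetical labels, for illustration only.
actual = ["+", "+", "+", "-", "-", "-", "-", "-"]
predicted = ["+", "+", "-", "+", "-", "-", "-", "-"]
tp, fn, fp, tn = confusion_counts(actual, predicted)
print(tp, fn, fp, tn)
```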

23 Precision and Recall Measures
Precision p: the number of true positives divided by the sum of true positives and false positives, $p = \frac{TP}{TP + FP}$. Equivalently, it is the number of correctly classified positive examples divided by the total number of examples that are classified as positive.
Recall r: the number of true positives divided by the sum of true positives and false negatives, $r = \frac{TP}{TP + FN}$. Equivalently, it is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set.
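Given the four confusion-matrix counts, the measures discussed so far follow from simple ratios. The sketch below uses hypothetical counts and a hypothetical function name, `confusion_metrics`; returning 0.0 when a denominator is zero is a convention chosen for this example, not something the lecture specifies.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute common performance measures from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Assumption: report 0.0 when the measure is undefined.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "precision": ratio(tp, tp + fp),      # positive predictive value
        "npv": ratio(tn, tn + fn),            # negative predictive value
        "recall": ratio(tp, tp + fn),         # sensitivity
        "specificity": ratio(tn, tn + fp),
    }

# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 3))
```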

24 Example
This confusion matrix gives precision p = 100% and recall r = 1%, because we classified only one positive example correctly and no negative examples wrongly.
Note: precision and recall only measure classification on the positive class.

25 F-Score (F1-Score)
It is hard to compare two classifiers using two measures. The F1-score combines precision and recall into one measure, their harmonic mean: $F_1 = \frac{2pr}{p + r}$
The harmonic mean of two numbers tends to be closer to the smaller of the two.
For the F1-value to be large, both p and r must be large.
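To see how strongly the harmonic mean is pulled toward the smaller value, the F1-score can be computed for the earlier example with p = 100% and r = 1%. The function name `f1_score` is a choice made for this sketch, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when both measures are 0
    return 2 * precision * recall / (precision + recall)

# Precision and recall from the example with p = 100% and r = 1%.
f1 = f1_score(1.0, 0.01)
arithmetic_mean = (1.0 + 0.01) / 2

print(round(f1, 4))
print(round(arithmetic_mean, 4))
```

The arithmetic mean is about 0.5, which would make this classifier look mediocre rather than poor, while the F1-score stays close to the 1% recall.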

Contact: Ammar@cu.edu.eg, Drammarcu@gmail.com
Dr. Ammar Mohammed, Associate Professor of Computer Science, ISSR, Cairo University. PhD of CS ( Uni.