K-Nearest Neighbors (KNN) and Predictive Accuracy


Dr. Ammar Mohammed, Associate Professor of Computer Science, ISSR, Cairo University
PhD in CS (Uni. Koblenz-Landau, Germany)
Contact: Ammar@cu.edu.eg, Drammarcu@gmail.com
Spring 2019

Nearest Neighbors
- Store the training samples: a set of stored cases with attributes Atr1, ..., AtrN and a class label.
- Use the stored training samples to predict the class label of an unseen case.
- This approach is called instance-based learning.

Instance-Based Learning
- Approximates real-valued or discrete-valued target functions.
- Learning consists of storing the presented training data (no model is generated).
- When a new query instance (unseen data) is encountered, a set of similar related instances is retrieved from memory and used to classify the new query instance.
- A disadvantage of instance-based methods is that the cost of classifying new instances can be high: nearly all computation takes place at classification time rather than at learning time.

K-Nearest Neighbors (KNN)
- The most basic instance-based method; a supervised learning technique.
- Basic idea: classify an object based on the closest training examples in the training data ("If it walks like a duck and quacks like a duck, then it is probably a duck").
- Compute the distance between the test sample and the training samples, then choose the k nearest samples.

Nearest Neighbors
Classifying an unknown record requires three inputs:
1. The set of stored samples (training data)
2. A distance metric to compute the distance between samples
3. The value of k, the number of nearest neighbors to retrieve
Nearest neighbor classifiers are lazy learners: no pre-constructed model is used for classification.

Nearest Neighbors Example

  #  Food (3)   Chat (2)  Fast (2)  Price (3)  Bar (2)  | BigTip
  1  great      yes       yes       normal     no       | yes
  2  great      no        yes       normal     no       | yes
  3  mediocre   yes       no        high       no       | no
  4  great      yes       yes       normal     yes      | yes

Similarity metric: number of matching attributes (k = 2)

New examples:
- Example 1: (great, no, no, normal, no)
  Most similar: number 2 (1 mismatch, 4 matches) -> yes
  Second most similar: number 1 (2 mismatches, 3 matches) -> yes
  Prediction: yes
- Example 2: (mediocre, yes, no, normal, no)
  Most similar: number 3 (1 mismatch, 4 matches) -> no
  Second most similar: number 1 (2 mismatches, 3 matches) -> yes
  Prediction: yes/no (tie)

Nearest Neighbors
Compute the distance between two points x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n).
Euclidean distance: d(x, y) = sqrt( Σ_i (x_i − y_i)^2 )
Option for determining the class from the nearest neighbor list: take a majority vote of the class labels among the k nearest neighbors.
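To make the procedure concrete, here is a minimal sketch of a k-NN classifier in Python using only the standard library. The function names and the toy data are illustrative, not from the lecture.

```python
import math
from collections import Counter

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train_X, train_y, query, k=5):
    # Distance from the query point to every stored training sample.
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k nearest samples and take a majority vote of their class labels.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy usage with two numeric attributes and classes 'A' and 'B'.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8)]
train_y = ['A', 'A', 'A', 'B', 'B']
print(knn_predict(train_X, train_y, query=(1.1, 0.9), k=3))  # -> 'A'
```

Note that nothing happens at "training" time beyond storing train_X and train_y, which is exactly the lazy-learner behaviour described above.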

Example: Using 5-nearest neighbors, what is the class of the new instance x = (9.1, 11.0)?
Compute the distance from x to every training instance, e.g.
  d(x, (0.8, 6.3))   = sqrt((9.1 − 0.8)^2 + (11.0 − 6.3)^2)   ≈ 9.5
  d(x, (1.4, 8.1))   = sqrt((9.1 − 1.4)^2 + (11.0 − 8.1)^2)   ≈ 8.2
  ...
  d(x, (19.6, 11.1)) = sqrt((9.1 − 19.6)^2 + (11.0 − 11.1)^2) ≈ 10.5
Select the 5 instances having minimum distance. You will find 3 instances classified as + and 2 instances classified as −, so we conclude that x = (9.1, 11.0) is classified as +.

Feature Normalization
Scaling issues: attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes. Example:
- height of a person may vary from 1.5 m to 1.8 m
- weight of a person may vary from 90 lb to 300 lb
- income of a person may vary from $10K to $1M

Feature Normalization
In the slide's example, the distance is dominated by the attribute Loan, while the attribute Age has no impact. How can this problem be solved?

KNN Standardized Distance
Rescale each attribute X before computing distances: X_s = (X − Min) / (Max − Min), where Min and Max are the minimum and maximum values of that attribute.
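As a rough illustration (not from the slides), min-max scaling of each attribute might look like the sketch below; the minima and maxima are taken from the training data, and the helper name is made up.

```python
def min_max_scale(train_X, X):
    """Scale each attribute of X to [0, 1] using Min/Max computed on the training data."""
    n_attrs = len(train_X[0])
    mins = [min(row[j] for row in train_X) for j in range(n_attrs)]
    maxs = [max(row[j] for row in train_X) for j in range(n_attrs)]
    scaled = []
    for row in X:
        scaled.append(tuple(
            (row[j] - mins[j]) / (maxs[j] - mins[j]) if maxs[j] > mins[j] else 0.0
            for j in range(n_attrs)
        ))
    return scaled

# Age varies over a small range, Loan over a huge one; after scaling both lie in [0, 1],
# so neither attribute dominates the Euclidean distance.
train = [(25, 40000), (35, 60000), (45, 100000), (20, 20000)]
print(min_max_scale(train, train))
```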

How to Determine a Good Value of k
In the slide's figure, the query point belongs to the square class for k = 1, to the triangle class for k = 3, and to the square class for k = 7.
Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points.
- If k is too large, the neighborhood may include points from other classes.
- Choose an odd value for k to eliminate ties.

How to Determine a Good Value of k
The value of k is determined experimentally (see the sketch below):
- Start with k = 1 and use a test set to estimate the error rate of the classifier.
- Repeat with k = k + 2.
- Choose the value of k for which the error rate is minimum.
Note: k should be an odd number to avoid ties.
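A minimal sketch of that search, assuming the hypothetical knn_predict helper defined earlier and an already held-out test set:

```python
def error_rate(train_X, train_y, test_X, test_y, k):
    # Fraction of test samples whose predicted label differs from the true label.
    wrong = sum(knn_predict(train_X, train_y, x, k) != y for x, y in zip(test_X, test_y))
    return wrong / len(test_y)

def choose_k(train_X, train_y, test_X, test_y, max_k=15):
    # Try k = 1, 3, 5, ... and keep the value with the lowest test error rate.
    best_k, best_err = 1, float("inf")
    for k in range(1, max_k + 1, 2):
        err = error_rate(train_X, train_y, test_X, test_y, k)
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```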

Predictive Accuracy

Classification Step 1: Splitting the Data
The historical data ("the past"), whose results (class labels) are known, is split into a training set and a testing set.

Classification Step 2: Train and Evaluate
The training set is fed to the model builder; the resulting model then makes predictions on the testing set, and these predictions are evaluated against the known results.

Methods for Evaluation
- Predictive accuracy: the most obvious method for estimating the performance of the classifier,
  Accuracy = (Number of correct classifications) / (Total number of test cases)
- Efficiency: time to construct the model and time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency on disk-resident databases
- Interpretability: the understanding and insight provided by the model

Predictive Accuracy
P = C / N, where P is the accuracy, C is the number of correctly classified instances, and N is the total number of instances.
- The available data is split into two parts, called the training set and the test set.
- If the dataset is a single file, we need to divide it into a training set and a test set ourselves before applying the evaluation method.
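A tiny illustrative helper (not from the lecture) that computes P = C / N from true and predicted labels:

```python
def accuracy(y_true, y_pred):
    # P = C / N: correctly classified instances over total instances.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy(['+', '-', '+', '+'], ['+', '-', '-', '+']))  # 0.75
```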

Splitting the Data: Holdout Set
The available data set D is divided into two disjoint subsets:
- the training set D_train (for learning a model)
- the test set D_test (for testing the model)
Important: the training set should not be used in testing, and the test set should not be used in learning. An unseen test set provides an unbiased estimate of accuracy. The test set is also called the holdout set. (The examples in the original data set D are all labeled with classes.) This method is mainly used when the data set D is large.
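A minimal, illustrative random holdout split (names and the 30% test fraction are assumptions, not from the slides); D_train and D_test remain disjoint:

```python
import random

def holdout_split(X, y, test_fraction=0.3, seed=42):
    # Randomly assign each labeled example to either D_train or D_test (disjoint subsets).
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx])
```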

Splitting the Data Using n-Fold Cross-Validation
When the available data cannot comfortably be split once into a training set and a test set, the split can be repeated systematically, as described next.

n-Fold Cross-Validation
The available data is partitioned into n equal-size disjoint subsets. Each subset in turn is used as the test set, with the remaining n − 1 subsets combined as the training set to learn a classifier. The procedure is run n times, giving n accuracies; the final estimated accuracy is the average of the n accuracies. 10-fold and 5-fold cross-validation are commonly used. This method is used when the available data is not large.
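A rough sketch of n-fold cross-validation under the assumptions above, reusing the hypothetical knn_predict and accuracy helpers from the earlier sketches:

```python
import random

def cross_validate(X, y, n_folds=5, k=5, seed=0):
    # Shuffle the indices, then partition them into n (roughly) equal-size disjoint folds.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    accs = []
    for i in range(n_folds):
        test_idx = set(folds[i])
        train_X = [X[j] for j in idx if j not in test_idx]
        train_y = [y[j] for j in idx if j not in test_idx]
        test_X = [X[j] for j in folds[i]]
        test_y = [y[j] for j in folds[i]]
        preds = [knn_predict(train_X, train_y, x, k) for x in test_X]
        accs.append(accuracy(test_y, preds))
    # The final estimate is the average of the n fold accuracies.
    return sum(accs) / n_folds
```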

Accuracy Paradox
Accuracy is not suitable in some applications. With class imbalance, accuracy alone cannot be trusted to select a well-trained model.
- In text mining, we may only be interested in documents of a particular topic, which form only a small portion of a big document collection.
- In classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are interested only in the minority class. High accuracy does not mean any intrusion is detected: if only 1% of the traffic is intrusions, a classifier can achieve 99% accuracy by doing nothing.
The class of interest is commonly called the positive class, and the rest the negative class(es).
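A small numeric illustration of the 1%-intrusion scenario (hypothetical counts, reusing the accuracy helper from above):

```python
# 1000 connections, only 10 of which are intrusions ('+'); the rest are normal ('-').
y_true = ['+'] * 10 + ['-'] * 990
# A "do nothing" classifier that always predicts the majority (negative) class.
y_pred = ['-'] * 1000
print(accuracy(y_true, y_pred))  # 0.99 -> 99% accuracy, yet no intrusion is detected
```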

Confusion Matrix
A confusion matrix is a way of describing the breakdown of the errors in predictions. It shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data. The matrix is N×N, where N is the number of target values (classes). The performance of a model is commonly evaluated using the data in the matrix. For two classes (Positive and Negative), the 2×2 confusion matrix has the form:

                    Predicted Positive   Predicted Negative
  Actual Positive   TP                   FN
  Actual Negative   FP                   TN

Confusion Matrix Performance
- Accuracy: the proportion of correct classifications out of the overall number of cases.
- Positive Predictive Value (Precision): the proportion of cases predicted positive that were correctly identified.
- Negative Predictive Value: the proportion of cases predicted negative that were correctly identified.
- Sensitivity (Recall): the proportion of actual positive cases that are correctly identified.
- Specificity: the proportion of actual negative cases that are correctly identified.

Precision and Recall Measures
Used in information retrieval and text classification. We use a confusion matrix to introduce them:
- TP (True Positive): the number of correct classifications of positive examples.
- FN (False Negative): the number of incorrect classifications of positive examples.
- FP (False Positive): the number of incorrect classifications of negative examples.
- TN (True Negative): the number of correct classifications of negative examples.

Precision and Recall Measures
p = TP / (TP + FP)        r = TP / (TP + FN)
Precision p is the number of true positives divided by the number of true positives plus false positives; that is, the number of correctly classified positive examples divided by the total number of examples classified as positive.
Recall r is the number of true positives divided by the number of true positives plus false negatives; that is, the number of correctly classified positive examples divided by the total number of actual positive examples in the test set.
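An illustrative computation of these measures from raw label lists (the helper names are made up for this sketch):

```python
def confusion_counts(y_true, y_pred, positive='+'):
    # Tally TP, FP, TN, FN for the chosen positive class.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

def precision_recall(y_true, y_pred, positive='+'):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    p = tp / (tp + fp) if tp + fp else 0.0   # precision: TP / (TP + FP)
    r = tp / (tp + fn) if tp + fn else 0.0   # recall:    TP / (TP + FN)
    return p, r
```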

Example
Consider a confusion matrix in which only one positive example (out of many actual positives) is classified correctly and no negative examples are classified wrongly. This gives precision p = 100% and recall r = 1%. Note: precision and recall only measure classification performance on the positive class.

F-Score (F1-Score)
It is hard to compare two classifiers using two measures, so the F1 score combines precision and recall into one measure: their harmonic mean, F1 = 2pr / (p + r). The harmonic mean of two numbers tends to be closer to the smaller of the two, so for the F1 value to be large, both p and r must be large.
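A last illustrative snippet computing the F1 score from a precision and recall like those in the example above:

```python
def f1_score(p, r):
    # Harmonic mean of precision and recall: F1 = 2pr / (p + r).
    return 2 * p * r / (p + r) if p + r else 0.0

print(f1_score(1.0, 0.01))  # ~0.0198: perfect precision cannot compensate for tiny recall
```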