k-Nearest Neighbor (kNN) Sept. 2015 Youn-Hee Han http://link.koreatech.ac.kr


Eager vs. Lazy Learning

• Eager learners: given a set of training data, construct a generalization model before receiving new (e.g., test) data to classify
  - Common examples: classification by decision tree induction, linear regression and logistic regression, Support Vector Machines (SVM)
• Lazy learners: simply store the training data (or do only minor processing) and wait until given a test tuple
  - Less time in training, but more time in predicting
  - Examples: kNN, locally weighted regression

Eager vs. Lazy Learning

• Accuracy
  - Lazy learners effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation
  - Eager learners must commit to a single hypothesis that covers the entire instance space
• Instance-based methods (lazy learners)
  - Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
  - Example: kNN

k-Nearest Neighbor Algorithm

• Algorithm (1/2)
  - All training data instances correspond to points in the n-dimensional (Euclidean) space
  - The nearest neighbors are defined in terms of the Euclidean distance, d_ij = sqrt(sum_k (x_ik - x_jk)^2)

k-Nearest Neighbor Algorithm

• Algorithm (2/2): for an unknown case y
  - If k = 1, select the nearest neighbor
  - If k > 1:
    - For classification, select the most frequent neighbor class:
      - ranking yields the k nearest feature vectors; store the set of their k class labels
      - pick the class label that is most common in this set (the "vote")
      - classify y as belonging to this class
    - For regression, calculate the average of the k neighbors' target values
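As a concrete illustration of the algorithm above, here is a minimal from-scratch sketch in Python/NumPy (the toy dataset and function names are illustrative, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances ("ranking yields k feature vectors")
    nearest = np.argsort(dists)[:k]
    if k == 1:
        return y_train[nearest[0]]     # k = 1: the single nearest neighbor
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]  # k > 1: the most common label (the "vote")

# Toy two-class data (illustrative)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> A
```

For regression, the final step would return the mean of y_train[nearest] instead of a vote, as in the regression sketch later in this section.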

The k-Nearest Neighbor Algorithm

• Example: k = 5

kNN Classification

[Scatter plot: Loan ($0 to $250,000) vs. Age (0 to 80), with points labeled Non-Default and Default]

kNN Classification: Distance

Distances from the query case (Age 48, Loan $142,000), computed on the raw attribute values with D = sqrt((x1 - x2)^2 + (y1 - y2)^2):

Age   Loan       Default   Distance
25    $40,000    N         102000
35    $60,000    N          82000
45    $80,000    N          62000
20    $20,000    N         122000
35    $120,000   N          22000
52    $18,000    N         124000
23    $95,000    Y          47000
40    $62,000    Y          80000
60    $100,000   Y          42000
48    $220,000   Y          78000
33    $150,000   Y           8000
48    $142,000   ?

Note that the Loan attribute dominates these raw distances (each distance is essentially the loan difference), which is why the next slide standardizes the attributes first.

kNN Classification: Standardized Distance

Each attribute is first standardized with X_s = (X - Min) / (Max - Min), and the distances to the query are recomputed:

Age     Loan   Default   Distance
0.125   0.11   N         0.7652
0.375   0.21   N         0.5200
0.625   0.31   N         0.3160
0       0.01   N         0.9245
0.375   0.50   N         0.3428
0.8     0.00   N         0.6220
0.075   0.38   Y         0.6669
0.5     0.22   Y         0.4437
1       0.41   Y         0.3650
0.7     1.00   Y         0.3861
0.325   0.65   Y         0.3771
0.7     0.61   ?
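As a sanity check on the table, the following sketch (a rough reconstruction; variable names are illustrative) applies the min-max formula to the raw Age/Loan values from the previous slide and recomputes the standardized distances to the query case:

```python
import numpy as np

# Raw (Age, Loan) values from the previous slide
X = np.array([
    [25,  40000], [35,  60000], [45,  80000], [20,  20000],
    [35, 120000], [52,  18000], [23,  95000], [40,  62000],
    [60, 100000], [48, 220000], [33, 150000],
], dtype=float)
query = np.array([48, 142000], dtype=float)

# Min-max standardization per attribute: X_s = (X - Min) / (Max - Min)
mins, maxs = X.min(axis=0), X.max(axis=0)
X_s = (X - mins) / (maxs - mins)
q_s = (query - mins) / (maxs - mins)

# Standardized Euclidean distances to the query
dists = np.sqrt(((X_s - q_s) ** 2).sum(axis=1))
print(np.round(dists, 4))  # first entry ~0.7652, matching the table
```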

kNN Features

• For two-class problems, if we choose k to be odd (i.e., k = 1, 3, 5, ...), then there will never be any ties
• Learning is simple (no learning at all!)
  - Training is trivial for the kNN classifier: we just use the training set as a lookup table when we want to classify a new feature vector
• kNN is conceptually simple, yet able to solve complex problems
• Can work with relatively little information
• Memory and CPU cost at prediction time
• Sensitive to feature representation

Value of k

• How can we determine the value of k, the number of neighbors?
  - In general, the larger the number of training tuples, the larger the value of k should be
• Theoretical considerations
  - As k increases, we average over more neighbors, so the effective decision boundary is smoother
  - As N (the size of the training set) increases, the optimal value of k tends to increase in proportion to log N
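In practice, k is usually tuned empirically rather than set from the log N rule alone. A minimal cross-validation sketch using scikit-learn (the dataset and candidate k values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of k (odd k avoids ties in two-class problems)
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k = {k:2d}, mean CV accuracy = {scores.mean():.3f}")
# Choose the k with the best cross-validated accuracy
```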

kNN Regression

• Real-valued prediction for a given unknown case: return the mean value of the k nearest neighbors

Distances from the query case (Age 48, Loan $142,000), computed with D = sqrt((x1 - x2)^2 + (y1 - y2)^2):

Age   Loan       House Price Index   Distance
25    $40,000    135                 102000
35    $60,000    256                  82000
45    $80,000    231                  62000
20    $20,000    267                 122000
35    $120,000   139                  22000
52    $18,000    150                 124000
23    $95,000    127                  47000
40    $62,000    216                  80000
60    $100,000   139                  42000
48    $220,000   250                  78000
33    $150,000   264                   8000
48    $142,000   ?

kNN Regression: Standardized Distance

With X_s = (X - Min) / (Max - Min):

Age     Loan   House Price Index   Distance
0.125   0.11   135                 0.7652
0.375   0.21   256                 0.5200
0.625   0.31   231                 0.3160
0       0.01   267                 0.9245
0.375   0.50   139                 0.3428
0.8     0.00   150                 0.6220
0.075   0.38   127                 0.6669
0.5     0.22   216                 0.4437
1       0.41   139                 0.3650
0.7     1.00   250                 0.3861
0.325   0.65   264                 0.3771
0.7     0.61   ?
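kNN regression differs from classification only in the last step: instead of voting, it averages the k neighbors' target values. A minimal sketch using the standardized table above (k = 3 is an illustrative choice, not specified on the slide):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict the mean target value of the k nearest neighbors."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Standardized (Age, Loan) pairs and House Price Index from the table above
X_s = np.array([
    [0.125, 0.11], [0.375, 0.21], [0.625, 0.31], [0.0, 0.01],
    [0.375, 0.50], [0.8, 0.00], [0.075, 0.38], [0.5, 0.22],
    [1.0, 0.41], [0.7, 1.00], [0.325, 0.65],
])
hpi = np.array([135, 256, 231, 267, 139, 150, 127, 216, 139, 250, 264])

# Query case (Age 48, Loan $142,000) standardized to (0.7, 0.61)
print(knn_regress(X_s, hpi, np.array([0.7, 0.61]), k=3))
# -> ~169.7, the mean of the 3 nearest HPI values (231, 139, 139)
```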

kNN Variations

• Distance-weighted nearest neighbor
  - Weight the contribution of each of the k neighbors according to its distance to the unknown case y
  - Give greater weight to closer neighbors
• Weighted features
  - If some features are more important, give them greater weight
  - If some features are irrelevant, give them less weight
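A minimal sketch of the distance-weighted variant (the inverse-square weighting used here is one common choice, not prescribed by the slide): each of the k neighbors votes with weight 1/d^2, so closer neighbors contribute more.

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_query, k=3, eps=1e-9):
    """Classify x_query by a distance-weighted vote among its k nearest neighbors."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        # Inverse-square weight: closer neighbors get a greater say
        # (eps guards against division by zero for exact matches)
        votes[y_train[i]] += 1.0 / (dists[i] ** 2 + eps)
    return max(votes, key=votes.get)
```

Feature weighting can be folded into the same distance computation by multiplying each attribute by a per-feature weight before computing dists.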

kNN: Different Names

• k-Nearest Neighbors
• Memory-Based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Case-Based Reasoning
• Lazy Learning

kNN Examples

• Speech recognition
  - Training data: words recorded and labeled in a database
  - Test data: words recorded from new speakers, in new locations
• Zipcode recognition
  - Training data: zipcodes manually selected, scanned, and labeled
  - Test data: actual letters being scanned in a post office
• Credit scoring
  - Training data: historical database of loan applications with payment history or the decision at that time
  - Test data: you