Data Mining
Classification: the k-NN Classifier
Piotr Paszek

Plan of the lecture
1. Lazy learners
2. k-nearest neighbor classifier
   1. Distance (metric)
   2. How to determine the value of k
3. Case-based reasoning (CBR)

Lazy vs. Eager Learning
1. Eager learning (e.g., decision trees): given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify.
   - Does a lot of work on the training data.
   - Does less work when test tuples are presented.
2. Lazy learning (e.g., instance-based learning): simply stores the training data (with at most minor processing) and waits until it is given a test tuple.
   - Does little work on the training data.
   - Does more work when test tuples are presented.

Lazy Learners: Instance-Based Methods
Instance-based learning: store the training examples and delay the processing (lazy evaluation) until a new instance must be classified.
Typical approaches:
- k-nearest neighbor (k-NN): instances are represented as points in a Euclidean space.
- Case-based reasoning: uses symbolic representations and knowledge-based inference.

k-Nearest Neighbor Classifier
- Nearest-neighbor classifiers compare a given test tuple with training tuples that are similar to it.
- Training tuples are described by n attributes, i.e., points in an n-dimensional space.
- Find the k tuples from the training set that are nearest to the unknown tuple.
- k-NN classifies an unknown example with the most common class among its k closest examples (nearest neighbors).
- Closeness between tuples is defined in terms of a distance metric (e.g., Euclidean distance).
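To make the procedure concrete, here is a minimal Python sketch of such a classifier (illustrative code, not from the slides; the names knn_classify and euclidean are my own), assuming numeric attributes and Euclidean distance:

```python
from collections import Counter
import math

def euclidean(x, y):
    """Euclidean distance between two numeric tuples of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(train, query, k=3):
    """Assign `query` the most common class among its k nearest training tuples.

    `train` is a list of (attributes, class_label) pairs.
    """
    # Keep the k training tuples closest to the query.
    neighbors = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    # Majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy example: two classes in a 2-dimensional attribute space.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))  # -> A
```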

Distance Metric
Let d be a two-argument function (e.g., the distance between two objects). d is a metric if:
1. $d(x, y) \geq 0$;
2. $d(x, y) = 0$ if and only if $x = y$;
3. $d(x, y) = d(y, x)$;
4. $d(x, z) \leq d(x, y) + d(y, z)$.
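Whether a candidate distance function satisfies these axioms on given data can be sanity-checked by brute force; the Python sketch below (illustrative, not part of the slides) tests all four conditions on a small finite sample of points:

```python
import itertools

def is_metric_on(points, d, tol=1e-9):
    """Brute-force check of the four metric axioms for `d` on a finite sample."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < -tol:                      # 1. non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):        # 2. d(x, y) = 0 iff x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # 3. symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:   # 4. triangle inequality
            return False
    return True

manhattan = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
print(is_metric_on([(0, 0), (1, 2), (3, 1)], manhattan))  # -> True
```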

Distance (numeric attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two points in n-dimensional (Euclidean) space.

Euclidean distance:
$$d_e(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Manhattan distance (taxicab metric):
$$d_m(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Distance (numeric attributes)
Minkowski distance:
$$L_q(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q},$$
where q is a positive natural number. For q = 1 this is the Manhattan distance; for q = 2, the Euclidean distance.

Max distance:
$$d_{\max}(x, y) = \max_{i=1}^{n} |x_i - y_i|.$$
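These numeric distances translate directly from the formulas above into code; a short Python sketch (illustrative function names), where minkowski(x, y, q) covers the Manhattan and Euclidean cases as well:

```python
def minkowski(x, y, q=2):
    """Minkowski distance L_q; q=1 is Manhattan, q=2 is Euclidean."""
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)

def max_distance(x, y):
    """Max distance: the largest coordinate-wise absolute difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

x, y = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
print(minkowski(x, y, q=1))  # Manhattan: 5.0
print(minkowski(x, y, q=2))  # Euclidean: ~3.61
print(max_distance(x, y))    # Max: 3.0
```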

Distance (nominal or categorical attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two vectors whose components $x_i$ are nominal attributes.
$$d(x, y) = \sum_{i=1}^{n} \delta(x_i, y_i), \qquad \delta(x_i, y_i) = \begin{cases} 0 & x_i = y_i \\ 1 & x_i \neq y_i \end{cases}$$
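The δ-based distance above simply counts mismatching attribute values; in Python this is a one-liner (illustrative name):

```python
def nominal_distance(x, y):
    """Count the positions at which two nominal vectors differ."""
    return sum(0 if xi == yi else 1 for xi, yi in zip(x, y))

print(nominal_distance(("red", "small", "round"), ("red", "large", "round")))  # -> 1
```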

Normalization
To improve the performance of the k-NN algorithm, a commonly used technique is to normalize the data in the training set. As a result, all dimensions over which the distance is computed carry the same level of significance.

Normalization: min-max normalization
Linear transformation of the original data onto the [0, 1] interval by the formula
$$V' = \frac{V - \min}{\max - \min},$$
where V is the old value, V' is the new value, and [min, max] is the old interval.

Normalization: z-score normalization
Linearly scale to mean 0 and variance 1 according to the formula
$$V' = \frac{V - \bar{x}}{\sigma},$$
where V is the old value, V' is the new value, $\bar{x}$ is the mean, and $\sigma^2$ is the variance.
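Both normalizations can be applied attribute by attribute before distances are computed; a small Python sketch (illustrative names, assumes each attribute column is non-constant):

```python
import statistics

def min_max_normalize(values):
    """Map values linearly onto [0, 1] using the observed min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo

def z_score_normalize(values):
    """Scale values to mean 0 and (sample) standard deviation 1."""
    mean, std = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / std for v in values]

ages = [18, 25, 40, 60]
print(min_max_normalize(ages))  # [0.0, 0.1666..., 0.5238..., 1.0]
print(z_score_normalize(ages))
```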

k-NN Classifiers
Classification:
- The unknown tuple is assigned the most common class among its k nearest neighbors.
- When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it.
- Asymptotically, the 1-NN scheme has a misclassification probability that is no worse than twice that of the case where we know the precise probability density of each class (the Bayes error rate).
Prediction:
- Nearest-neighbor classifiers can also be used for prediction, i.e., to return a real-valued prediction for a given unknown tuple.
- The classifier returns the average of the real-valued labels associated with the k nearest neighbors of the unknown tuple.
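The prediction variant changes only the final aggregation step: instead of a majority vote, the real-valued labels of the k nearest neighbors are averaged. A self-contained Python sketch (illustrative names, not from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two numeric tuples of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train, query, k=3):
    """Predict a real value for `query` as the mean label of its k nearest tuples.

    `train` is a list of (attributes, real_valued_label) pairs.
    """
    neighbors = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    return sum(label for _, label in neighbors) / k

train = [((1.0,), 10.0), ((2.0,), 12.0), ((3.0,), 14.0), ((8.0,), 40.0)]
print(knn_predict(train, (2.5,), k=3))  # -> 12.0 (mean of 10, 12 and 14)
```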

How to Determine the Value of k
- A larger k may lead to better performance.
- But if we set k too large, we may end up looking at samples that are not true neighbors (they are far away from the query).
- We can use a validation set to find the best k.
- A rule of thumb is k < sqrt(n), where n is the number of training examples.

We can use validation to find k
- Start with k = 1.
- Use a validation (held-out) set to estimate the error rate of the classifier.
- Increment k and estimate the error rate for the new k.
- Choose the k value that gives the minimum error rate (a sketch of this loop follows below).
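A Python sketch of that search loop (illustrative; `classify` stands for any k-NN classifier such as the knn_classify function sketched earlier, and the train/validation split is assumed to be given):

```python
def error_rate(classify, train, validation, k):
    """Fraction of validation tuples that the k-NN classifier gets wrong."""
    wrong = sum(1 for x, label in validation if classify(train, x, k) != label)
    return wrong / len(validation)

def choose_k(classify, train, validation, k_max):
    """Return the k in 1..k_max with the lowest validation error rate."""
    return min(range(1, k_max + 1),
               key=lambda k: error_rate(classify, train, validation, k))

# Following the rule of thumb k < sqrt(n):
# best_k = choose_k(knn_classify, train, validation, k_max=int(len(train) ** 0.5))
```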

How to Determine the Value of k
- A larger k produces a smoother decision boundary and reduces the impact of class-label noise.

Shortcomings of k-NN Algorithms
First: no time is required to estimate parameters from the training data, but the time needed to find the nearest neighbors can be prohibitive.
Some ideas to overcome this problem:
- Reduce the time taken to compute distances by working in a reduced number of dimensions (e.g., using PCA).
- Use sophisticated data structures, such as search trees, to speed up the identification of the nearest neighbors.
- Edit the training data to remove redundant observations, e.g., observations that have no effect on the classification because they are surrounded by observations that all belong to the same class.
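As a concrete illustration of the tree-based speed-up, scikit-learn's KNeighborsClassifier can be asked to build a KD-tree index rather than compare the query against every training tuple; this sketch assumes scikit-learn and NumPy are available and is not part of the original slides:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: 2-D numeric attributes, two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y_train = np.array(["A", "A", "B", "B"])

# algorithm="kd_tree" builds a space-partitioning index so the neighbor
# search does not have to scan the whole training set for every query.
clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
clf.fit(X_train, y_train)
print(clf.predict([[1.1, 0.9]]))  # -> ['A']
```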

Shortcomings of k-NN Algorithms
Second: the curse of dimensionality. Let p be the number of dimensions; the expected distance to the nearest neighbor goes up dramatically with p unless the size of the training set increases exponentially with p.
Some ideas to overcome this problem:
- Reduce the dimensionality of the attribute space.
- Select subsets of the predictor variables, or combine them using methods such as principal component analysis, singular value decomposition, or factor analysis.
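One way to realize the dimensionality-reduction idea is to project the attributes onto a few principal components before running k-NN; the sketch below uses scikit-learn's PCA and pipeline utilities on random toy data (an illustration only, not from the slides):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 50))    # 100 tuples, 50 attributes (toy data)
y_train = rng.integers(0, 2, size=100)  # two classes

# Project onto the first 5 principal components, then classify with k-NN there.
model = make_pipeline(PCA(n_components=5), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(model.predict(X_train[:3]))
```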

k-NN Classifiers: Summary
Advantages:
- Can be applied to data from any distribution; for example, the data does not have to be separable by a linear boundary.
- Very simple and intuitive.
- Good classification if the number of samples is large enough.
Disadvantages:
- Choosing k may be tricky.
- The test stage is computationally expensive: there is no training stage, so all the work is done during the test stage. This is the opposite of what we usually want; we can typically afford a long training step, but we want a fast test step.
- A large number of samples is needed for accuracy.

Case-Based Reasoning (CBR)
CBR uses a database of problem solutions to solve new problems; it stores symbolic descriptions (tuples or cases).
Applications: customer service (help desks), legal rulings.
Methodology:
- Instances are represented by rich symbolic descriptions (e.g., function graphs).
- Search for similar cases; multiple retrieved cases may be combined.
- Tight coupling between case retrieval, knowledge-based reasoning, and problem solving.
Challenges:
- Finding a good similarity metric.
- Indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases.