Data Mining
Piotr Paszek
Classification: k-NN Classifier
Plan of the lecture
1. Lazy Learner
2. k-Nearest Neighbor Classifier
   2.1 Distance (metric)
   2.2 How to Determine the Value of k
3. Case-Based Reasoning (CBR)
Lazy vs. Eager Learning
1. Eager learning (e.g., decision trees)
   - Given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
   - Does a lot of work on the training data
   - Does less work when test tuples are presented
2. Lazy learning (e.g., instance-based learning)
   - Simply stores the training data (or does only minor processing) and waits until it is given a test tuple
   - Does less work on the training data
   - Does more work when test tuples are presented
Lazy Learner: Instance-Based Methods
Instance-based learning: store training examples and delay the processing (lazy evaluation) until a new instance must be classified.
Typical approaches:
- k-nearest neighbor approach (k-NN): instances are represented as points in a Euclidean space
- Case-based reasoning: uses symbolic representations and knowledge-based inference
k-Nearest Neighbor Classifier
- Nearest-neighbor classifiers compare a given test tuple with training tuples that are similar to it
- Training tuples are described by n attributes (points in an n-dimensional space)
- Find the k training tuples nearest to the unknown tuple
- k-NN classifies an unknown example with the most common class among its k closest examples (nearest neighbors)
- Closeness between tuples is defined in terms of a distance metric (e.g., Euclidean distance)
Distance Metric
Let d be a two-argument function (e.g., the distance between two objects). d is a metric if:
1. $d(x, y) \ge 0$;
2. $d(x, y) = 0$ if and only if $x = y$;
3. $d(x, y) = d(y, x)$;
4. $d(x, z) \le d(x, y) + d(y, z)$.
Distance (numeric attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two points in an n-dimensional (Euclidean) space.
Euclidean distance:
$$d_e(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Manhattan distance (taxicab metric):
$$d_m(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
Distance (numeric attributes)
Minkowski distance:
$$L_q(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q},$$
where q is a positive natural number. For $q = 1$ it is the Manhattan distance, for $q = 2$ the Euclidean distance.
Max distance:
$$d_\infty(x, y) = \max_{i=1}^{n} |x_i - y_i|$$
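A minimal Python sketch of these distance functions (the names minkowski and max_distance are illustrative, not part of the lecture):

```python
def minkowski(x, y, q=2):
    """Minkowski distance L_q between two numeric vectors.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    """
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)


def max_distance(x, y):
    """Max distance: the largest coordinate-wise absolute difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))


# Example: two points in a 3-dimensional space
x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, q=1))   # Manhattan: 5.0
print(minkowski(x, y, q=2))   # Euclidean: ~3.606
print(max_distance(x, y))     # Max: 3.0
```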
Distance (nominal or categorical attributes)
Let $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ be two vectors whose attributes $x_i$ are nominal.
$$d(x, y) = \sum_{i=1}^{n} \delta(x_i, y_i), \qquad \delta(x_i, y_i) = \begin{cases} 0 & x_i = y_i \\ 1 & x_i \ne y_i \end{cases}$$
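A sketch of the same idea in Python, counting mismatching attribute values (the tuples below are invented examples):

```python
def nominal_distance(x, y):
    """Count the attributes on which two tuples of nominal values disagree."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)


# Example: tuples described by three categorical attributes
print(nominal_distance(["red", "small", "round"],
                       ["red", "large", "round"]))  # 1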
Normalization
To improve the performance of the k-NN algorithm, a commonly used technique is to normalize the data from the training set. As a result, all dimensions over which the distance is calculated carry the same level of significance.
Normalization
Min-max normalization: a linear transformation of the original data onto the interval [0, 1], given by the formula
$$V' = \frac{V - \min}{\max - \min},$$
where V is the old value, V' is the new value, and [min, max] is the old interval.
Normalization
Z-score normalization: linearly scale to mean 0 and variance 1 according to the formula
$$V' = \frac{V - \bar{x}}{\sigma},$$
where V is the old value, V' is the new value, $\bar{x}$ is the mean value, and $\sigma^2$ is the variance.
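A minimal sketch of both normalization schemes applied to one numeric attribute (the ages column is an invented example):

```python
import statistics


def min_max_normalize(values):
    """Min-max normalization: map each value linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Z-score normalization: scale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mean) / sigma for v in values]


ages = [23, 35, 47, 59, 71]
print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score_normalize(ages))
```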
k-NN Classifiers
Classification:
- The unknown tuple is assigned the most common class among its k nearest neighbors
- When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it
- The 1-NN scheme has a misclassification probability no worse than twice that of the situation where we know the precise probability density of each class
Prediction:
- Nearest-neighbor classifiers can also be used for prediction, i.e., to return a real-valued prediction for a given unknown tuple
- The classifier returns the average of the real-valued labels associated with the k nearest neighbors of the unknown tuple
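A minimal sketch of both uses, assuming Euclidean distance and a training set stored as (point, label) pairs (all names and the sample data are illustrative):

```python
import math
from collections import Counter


def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_classify(train, query, k=3):
    """Majority vote among the k training tuples closest to the query."""
    neighbors = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def knn_predict(train, query, k=3):
    """Prediction: average of the real-valued labels of the k nearest neighbors."""
    neighbors = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    return sum(label for _, label in neighbors) / k


train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([5.0, 5.0], "B"), ([5.2, 4.9], "B")]
print(knn_classify(train, [1.1, 1.0], k=3))  # "A"
```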
How to Determine the Value of k
- A larger k may lead to better performance
- But if we set k too large, we may end up looking at samples that are not true neighbors (they are far away from the query)
- We can use a test (validation) set to find the best k
- A rule of thumb is $k < \sqrt{n}$, where n is the number of training examples
We can use validation to find k
- Start with k = 1
- Use a validation set to estimate the error rate of the classifier
- Increment k and estimate the error rate for the new k
- Choose the k value that gives the minimum error rate
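A sketch of this validation loop; it takes the classification function as a parameter (e.g., the hypothetical knn_classify from the earlier sketch), and max_k is an arbitrary upper bound:

```python
def error_rate(classify, train, validation, k):
    """Fraction of validation tuples that `classify` gets wrong for this k."""
    errors = sum(1 for point, label in validation
                 if classify(train, point, k) != label)
    return errors / len(validation)


def best_k(classify, train, validation, max_k=15):
    """Try k = 1, 2, ..., max_k and return the k with the lowest validation error."""
    return min(range(1, max_k + 1),
               key=lambda k: error_rate(classify, train, validation, k))


# Usage (with the knn_classify sketch from the earlier slide):
# k = best_k(knn_classify, train_set, validation_set)
```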
How to Determine the Value of k
A larger k produces a smoother decision boundary and reduces the impact of class-label noise.
Shortcomings of k-NN Algorithms
First: no time is required to estimate parameters from the training data, but the time to find the nearest neighbors can be prohibitive.
Some ideas to overcome this problem:
- Reduce the time taken to compute distances by working in a reduced dimension (use PCA)
- Use a sophisticated data structure, such as a tree, to speed up the identification of the nearest neighbors (see the sketch below)
- Edit the training data to remove redundant observations, e.g., remove observations that have no effect on the classification because they are surrounded by observations that all belong to the same class
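One way to use a tree structure for faster neighbor search, here with scikit-learn's KDTree (the data is randomly generated purely for illustration):

```python
import numpy as np
from sklearn.neighbors import KDTree

# Hypothetical training data: 1000 tuples with 4 numeric attributes
rng = np.random.default_rng(0)
X_train = rng.random((1000, 4))

tree = KDTree(X_train)              # build the tree once over the training data
query = rng.random((1, 4))
dist, ind = tree.query(query, k=3)  # distances and indices of the 3 nearest tuples
print(ind[0], dist[0])
```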
Shortcomings of k-NN Algorithms
Second: the curse of dimensionality.
- Let p be the number of dimensions
- The expected distance to the nearest neighbor goes up dramatically with p, unless the size of the training data set increases exponentially with p
Some ideas to overcome this problem:
- Reduce the dimensionality of the attribute space
- Select subsets of the predictor variables, or combine them using methods such as principal components, singular value decomposition, and factor analysis (a PCA example follows below)
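A sketch of combining dimensionality reduction with k-NN using scikit-learn; the data, the number of components, and k are placeholders to be tuned on validation data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 200 tuples with 30 attributes and a binary class label
rng = np.random.default_rng(1)
X = rng.random((200, 30))
y = rng.integers(0, 2, size=200)

# Normalize, project onto the first 5 principal components, then classify with 3-NN
model = make_pipeline(MinMaxScaler(),
                      PCA(n_components=5),
                      KNeighborsClassifier(n_neighbors=3))
model.fit(X[:150], y[:150])
print(model.score(X[150:], y[150:]))  # accuracy on the held-out tuples
```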
k-NN Classifiers: Summary
Advantages:
- Can be applied to data from any distribution; for example, the data does not have to be separable by a linear boundary
- Very simple and intuitive
- Gives good classification if the number of samples is large enough
Disadvantages:
- Choosing k may be tricky
- The test stage is computationally expensive: there is no training stage, so all the work is done at test time. This is the opposite of what we usually want; we can typically afford a long training step, but we want a fast test step
- Needs a large number of samples for accuracy
Case-Based Reasoning (CBR)
CBR uses a database of problem solutions to solve new problems.
- Stores symbolic descriptions (tuples or cases)
- Applications: customer service, legal rulings
Methodology:
- Instances are represented by rich symbolic descriptions (e.g., function graphs)
- Search for similar cases; multiple retrieved cases may be combined
- Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Challenges:
- Finding a good similarity metric
- Indexing based on a syntactic similarity measure and, when this fails, backtracking and adapting to additional cases