MIST.6060 Business Intelligence and Data Mining, 10/5/2017

Nearest Neighbors

Distance Measures

In a p-dimensional space, the Euclidean distance between two records, $a = (a_1, a_2, \ldots, a_p)$ and $b = (b_1, b_2, \ldots, b_p)$, is defined as:

$$d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_p - b_p)^2}$$

It is not necessary to perform the square root operation if the purpose is to compare distances. The Euclidean distance is typically calculated based on normalized values.

The Euclidean distance measure implicitly assumes data are numeric. When applied to categorical data, the difference between two categorical values is defined as zero if they are the same, and one otherwise.

Example: Consider two customer records, a and b, in a customer dataset. Suppose attribute 1 is gender. If $a_1 = b_1$ (both customers are male or both are female), then $a_1 - b_1 = 0$; otherwise (one is male and the other is female), $a_1 - b_1 = 1$ [also, $(a_1 - b_1)^2 = 1$].

K-Nearest Neighbors (k-NN) Method

0. Input: an integer value k.
1. To classify a new record, find the nearest k neighboring records in the training set, based on a distance measure (e.g., normalized Euclidean distance).
2. Classify the record as a member of the majority class of the k neighbors. If the problem is to predict a numeric value of an outcome variable, then take the average outcome value of the k neighbors as the predicted value.

Drawback of K-Nearest Neighbors

The k-NN method does not provide explicit structures or models for classification or learning.
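The distance rule and the k-NN steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function names `sq_dist` and `knn_predict`, the customer records, and the `bought` and `spend` outcomes are all made up for the example, and the numeric attribute `income` is assumed to be normalized already.

```r
# Squared Euclidean distance between two records stored as named lists.
# Numeric attributes contribute (a_i - b_i)^2; categorical attributes
# contribute 0 if the two values match and 1 otherwise.
sq_dist <- function(a, b) {
  total <- 0
  for (name in names(a)) {
    if (is.numeric(a[[name]])) {
      total <- total + (a[[name]] - b[[name]])^2
    } else {
      total <- total + as.numeric(a[[name]] != b[[name]])
    }
  }
  total
}

# k-NN prediction for one new record.
# A non-numeric outcome is classified by majority vote (ties go to the
# class that sorts first); a numeric outcome is predicted by the average.
knn_predict <- function(train, outcome, new_record, k) {
  d <- vapply(train, function(r) sq_dist(r, new_record), numeric(1))
  nearest <- order(d)[seq_len(k)]  # squared distances rank like distances
  if (is.numeric(outcome)) {
    mean(outcome[nearest])
  } else {
    votes <- table(outcome[nearest])
    names(votes)[which.max(votes)]
  }
}

# Hypothetical customer records (illustrative values only).
customers <- list(
  list(gender = "male",   income = 0.20),
  list(gender = "female", income = 0.25),
  list(gender = "male",   income = 0.90),
  list(gender = "female", income = 0.30)
)
bought <- c("no", "yes", "no", "yes")  # hypothetical class outcome
spend  <- c(10, 40, 15, 50)            # hypothetical numeric outcome
new_customer <- list(gender = "female", income = 0.22)

knn_predict(customers, bought, new_customer, k = 3)  # "yes"
knn_predict(customers, spend,  new_customer, k = 3)  # mean of 40, 50, 10
```

With k = 3 the neighbors are customers 2, 4, and 1, so the vote is two "yes" against one "no". Skipping the square root is safe here because it does not change the ordering of the distances.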

An Illustrative Example: College Admission

The dataset includes 24 college application records (rows 2-25 below), with 2 predictors, GPA and SAT, and a class attribute, Accept? (with 2 classes: yes, no). Let us use the record with {GPA = 3.32, SAT = 2060} (highlighted) as a validation record. That is, the validation set has only this record while the training set includes the other 23 records.

We first calculate the Euclidean distances (without taking the square root) between this validation record and each of the 23 training records, based on the normalized GPA and SAT values. Normalized values are shown in columns D and E, and the distances are shown in column F. Their Excel calculations are shown in the Formula sheet below.

If k = 1, the nearest neighbor is the record right below the validation record (the one with the smallest normalized distance), whose class is "no". Therefore, the validation record is classified as "no". But we know the actual class of this record is "yes", so it is misclassified by 1-NN.

If k = 3, the 3 nearest neighbors are indicated in column H; they include 2 "yes" records and 1 "no" record. By the majority rule, the validation record is classified as "yes". Therefore, 3-NN correctly classifies the record.
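The calculation described above can be reproduced in base R as a rough check. The sketch assumes min-max normalization over all 24 records, which is what the R session later in these notes uses. The Excel formulas themselves are not reproduced here, so that choice is an assumption. The names `admission`, `min_max`, and `classify` are illustrative.

```r
# The 24 admission records from the notes, with the validation record
# (GPA = 3.32, SAT = 2060) placed last.
admission <- data.frame(
  GPA = c(2.83, 3.43, 2.94, 2.87, 3.46, 4.00, 3.95, 3.36, 3.04, 3.60, 2.62,
          3.18, 2.66, 2.94, 2.44, 3.39, 2.58, 2.82, 2.97, 2.54, 2.20, 2.62,
          2.90, 3.32),
  SAT = c(1910, 1760, 2210, 2140, 2400, 1990, 1840, 2290, 2060, 2140, 2250,
          2030, 2140, 1800, 2100, 1840, 1840, 1690, 1910, 1730, 1950, 1500,
          1580, 2060),
  Accept = c(rep("yes", 11), rep("no", 12), "yes")
)

# Min-max normalization to [0, 1] (assumed normalization method).
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
norm <- data.frame(GPA = min_max(admission$GPA), SAT = min_max(admission$SAT))

train <- norm[1:23, ]
valid <- norm[24, ]

# Squared Euclidean distance from the validation record to each training
# record (no square root, as in the text).
sq_dist <- (train$GPA - valid$GPA)^2 + (train$SAT - valid$SAT)^2

# Majority vote among the k nearest training records.
classify <- function(k) {
  nearest <- order(sq_dist)[seq_len(k)]
  votes <- table(admission$Accept[nearest])
  names(votes)[which.max(votes)]
}

classify(1)                       # "no"
classify(3)                       # "yes"
admission[order(sq_dist)[1:3], ]  # the three nearest training records
```

The nearest record is {GPA = 3.18, SAT = 2030, no}, followed by {3.04, 2060, yes} and {3.60, 2140, yes}, which agrees with the 1-NN and 3-NN results in the text.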


The scatter plot shows graphically how k-NN works in this example. The solid-lined loop shows the 1-NN result, and the dash-lined loop shows the 3-NN result. Note that although the chart is plotted using the original values, the axis scales of the chart are adjusted so that the plot area is very close to a square. In this sense, the values are approximately normalized.

The k-NN method works in the same way when applied to a new record. The only difference is, of course, that the true class of the new record is unknown.

Choice of k

Run k-NN multiple times, each using a different k value. Choose the k with the lowest validation error rate for future classification of any new record.

The above procedure is computationally very expensive and is practically prohibitive for large data. There are many approaches to reduce the computational cost; see the WFHP book for more detail (not required).

K-Nearest Neighbors in Weka

The Admission.arff file:

% relation name assumed from the file name
@relation Admission
% numeric attribute specification
@attribute GPA numeric
% numeric attribute specification
@attribute SAT numeric
% categorical attribute
@attribute Accept {yes,no}
@data
2.83, 1910, yes
3.43, 1760, yes
2.94, 2210, yes
2.87, 2140, yes
3.46, 2400, yes
4.00, 1990, yes
3.95, 1840, yes
3.36, 2290, yes
3.04, 2060, yes
3.60, 2140, yes
2.62, 2250, yes
3.32, 2060, yes
3.18, 2030, no
2.66, 2140, no
2.94, 1800, no
2.44, 2100, no
3.39, 1840, no
2.58, 1840, no
2.82, 1690, no
2.97, 1910, no
2.54, 1730, no
2.20, 1950, no
2.62, 1500, no
2.90, 1580, no

1. Click Open file, then find and open the Admission.arff file. By default, the last attribute is the class attribute.
2. Click Classify / Choose / lazy / IBk. The default is 1-NN. Click Start. The output results show that the total validation error rate is 9/24 = 37.5%.


3. Next, let's try k = 3. Click the long horizontal box on the right of the Choose button. A pop-up window, weka.gui.GenericObjectEditor, appears. Enter 3 in the KNN box. Click OK.
4. Click Start to get the results. The total validation error rate with 3-NN is 6/24 = 25%.
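Comparing the error rates for k = 1 and k = 3 is an instance of the "Choice of k" procedure: run k-NN for several k values and keep the one with the lowest validation error. A minimal base R sketch of that loop follows. It assumes leave-one-out validation and min-max normalization over all 24 records. Both are assumptions rather than Weka's evaluation settings, which are not stated in these notes, so the error rates it prints need not equal 9/24 and 6/24. The names `loo_error`, `ks`, and `best_k` are illustrative.

```r
# The 24 admission records from the notes.
admission <- data.frame(
  GPA = c(2.83, 3.43, 2.94, 2.87, 3.46, 4.00, 3.95, 3.36, 3.04, 3.60, 2.62,
          3.18, 2.66, 2.94, 2.44, 3.39, 2.58, 2.82, 2.97, 2.54, 2.20, 2.62,
          2.90, 3.32),
  SAT = c(1910, 1760, 2210, 2140, 2400, 1990, 1840, 2290, 2060, 2140, 2250,
          2030, 2140, 1800, 2100, 1840, 1840, 1690, 1910, 1730, 1950, 1500,
          1580, 2060),
  Accept = c(rep("yes", 11), rep("no", 12), "yes")
)

min_max <- function(v) (v - min(v)) / (max(v) - min(v))
x <- cbind(min_max(admission$GPA), min_max(admission$SAT))
y <- admission$Accept

# Leave-one-out error rate for a given k: each record in turn is held out
# and classified by majority vote of its k nearest remaining records.
loo_error <- function(k) {
  wrong <- 0
  for (i in seq_len(nrow(x))) {
    d <- rowSums(sweep(x, 2, x[i, ])^2)  # squared distances to record i
    d[i] <- Inf                          # exclude the held-out record
    nearest <- order(d)[seq_len(k)]
    votes <- table(y[nearest])
    pred <- names(votes)[which.max(votes)]
    if (pred != y[i]) wrong <- wrong + 1
  }
  wrong / nrow(x)
}

ks <- c(1, 3, 5, 7)  # odd values avoid tied votes with two classes
errors <- vapply(ks, loo_error, numeric(1))
names(errors) <- ks
errors                            # validation error rate for each k
best_k <- ks[which.min(errors)]   # smallest k with the lowest error rate
best_k
```

Each call to `loo_error` computes 24 x 24 distances, which illustrates why repeating the procedure for many k values becomes expensive on large data.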

K-Nearest Neighbors in R

R commands:

> data <- read.table("c:/courses/mist.6060(63.755)/datasets/admission.csv", sep=',', header=TRUE)
> x <- data[, 1:2]
> y <- data[1:23, 3]
> normgpa <- (x[,1] - min(x[,1])) / (max(x[,1]) - min(x[,1]))
> normsat <- (x[,2] - min(x[,2])) / (max(x[,2]) - min(x[,2]))
> normx <- cbind(normgpa, normsat)
> trainx <- normx[1:23, ]
> testx <- normx[24, ]
> library(class)
> knn(trainx, testx, y, k=1)
> knn(trainx, testx, y, k=3)

R commands with results:

> data <- read.table("c:/courses/mist.6060(63.755)/datasets/admission.csv", sep=',', header=TRUE)
> data
[output: the 24 records with columns GPA, SAT, and Accept; Accept is yes for records 1-11, no for records 12-23, and yes for record 24, the validation record]

> x <- data[, 1:2]
> x
[output: the GPA and SAT columns of the 24 records]
> y <- data[1:23, 3]
> y
[1] yes yes yes yes yes yes yes yes yes yes yes no no no no no no no no no no no no
Levels: no yes
> normgpa <- (x[,1] - min(x[,1])) / (max(x[,1]) - min(x[,1]))
> normgpa
[output: the 24 normalized GPA values]
> normsat <- (x[,2] - min(x[,2])) / (max(x[,2]) - min(x[,2]))
> normsat
[output: the 24 normalized SAT values]

> normx <- cbind(normgpa, normsat)
> normx
[output: a 24-row matrix with columns normgpa and normsat]
> trainx <- normx[1:23, ]
> trainx
[output: rows 1-23 of normx]
> testx <- normx[24, ]
> testx
[output: the normgpa and normsat values of record 24]
> library(class)
> knn(trainx, testx, y, k=1)
[1] no
Levels: no yes
> knn(trainx, testx, y, k=3)
[1] yes
Levels: no yes
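The session above reads a CSV file from a local path. For readers without that file, the following self-contained variant enters the 24 records inline, with the validation record last, as the session implies. It is a sketch that assumes the `class` package is installed. The `prob = TRUE` argument of `knn()` is an addition not used in the original session; it returns the share of neighbor votes for the winning class.

```r
library(class)  # provides knn(); assumed to be installed

# The 24 admission records from the notes, validation record last.
admission <- data.frame(
  GPA = c(2.83, 3.43, 2.94, 2.87, 3.46, 4.00, 3.95, 3.36, 3.04, 3.60, 2.62,
          3.18, 2.66, 2.94, 2.44, 3.39, 2.58, 2.82, 2.97, 2.54, 2.20, 2.62,
          2.90, 3.32),
  SAT = c(1910, 1760, 2210, 2140, 2400, 1990, 1840, 2290, 2060, 2140, 2250,
          2030, 2140, 1800, 2100, 1840, 1840, 1690, 1910, 1730, 1950, 1500,
          1580, 2060),
  Accept = c(rep("yes", 11), rep("no", 12), "yes")
)

# Min-max normalization over all 24 records, as in the session above.
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
normx <- cbind(normgpa = min_max(admission$GPA),
               normsat = min_max(admission$SAT))

trainx <- normx[1:23, , drop = FALSE]
testx  <- normx[24, , drop = FALSE]   # keep the test record as a 1-row matrix
y <- factor(admission$Accept[1:23])

pred1 <- knn(trainx, testx, y, k = 1)
pred3 <- knn(trainx, testx, y, k = 3, prob = TRUE)

as.character(pred1)   # "no"
as.character(pred3)   # "yes"
attr(pred3, "prob")   # share of the 3 neighbors voting for the predicted class
```

For k = 3 the vote share is 2/3, matching the 2 "yes" and 1 "no" neighbors found in the illustrative example.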
