THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY
Department of Computer Science
COMP328: Machine Learning, Fall 2006
Assignment 2

Due Date and Time: November 14 (Tue) 2006, 1:00pm.

NOTE: Your grade will be based on correctness, efficiency and clarity.

Task: Nearest Neighbor Classification

In this assignment, you are required to implement a nearest-neighbor classifier and run it on a 2-dimensional toy dataset. You have to experiment with different distance measures, different values of k, and the use of a local fitting scheme, and observe their influence on the classification performance.

1  2D Toy Data

The training data has two classes (red and blue) and is shown in Figure 1. The red moon-shaped class contains 2,152 points, while the blue class contains 2,444 samples. Obviously, the two classes are not linearly separable. The test data is drawn from a regular grid on the 2D plane, {(x, y) : 0.5 ≤ x ≤ 2.5, -1 ≤ y ≤ 0}, with a grid width of 0.02 (Figure 2).

[Figure 1: The 2D training set.]

[Figure 2: The test data.]

The coordinates of the training samples are stored in the file data.txt, and those of the test samples in test.txt, with the following format:

    line 1 ... I:   [class label (+1/-1)]  [x coordinate (double)]  [y coordinate (double)]
    line I+1:       -1   (this is the end marker)

Here, I is the number of training samples.
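
To make the format concrete, the following minimal C++ sketch reads a file in this format. The helper name, the vector containers, and treating a line that contains only -1 as the end marker are assumptions of this sketch, not requirements of the assignment.

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Minimal sketch (assumed helper): reads "label x y" lines and stops at a
    // line containing only the -1 end marker. A -1 that is followed by two
    // coordinates is treated as an ordinary negative-class sample.
    bool read_points(const char* filename,
                     std::vector<double>& xs, std::vector<double>& ys,
                     std::vector<int>& labels) {
        std::ifstream in(filename);
        if (!in) return false;
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream ss(line);
            int label;
            double x, y;
            if (!(ss >> label)) continue;        // skip blank lines
            if (!(ss >> x >> y)) {
                if (label == -1) break;          // bare -1: end marker
                return false;                    // otherwise the file is malformed
            }
            labels.push_back(label);
            xs.push_back(x);
            ys.push_back(y);
        }
        return true;
    }

The class interface in Section 3.3 stores coordinates in double** arrays, so data read this way still has to be copied into (or read directly into) that layout.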

2  Nearest-Neighbor Classification

Recall that there are four ingredients in a memory-based classifier:

1. Distance metric (here, $[x_1, x_2, \ldots, x_d]$ and $[y_1, y_2, \ldots, y_d]$ are two points in $\mathbb{R}^d$; a minimal sketch of all three distances is given after this list):

   (a) Euclidean distance: $\sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$;

   (b) $L_1$ distance: $\sum_{i=1}^{d} |x_i - y_i|$;

   (c) $L_\infty$ distance: $\max_{i=1,2,\ldots,d} |x_i - y_i|$.

2. Number of nearest neighbors:

   (a) one (leading to the one-nearest-neighbor classifier);

   (b) k (leading to the k-nearest-neighbor classifier).

3. Weighting function:

   (a) uniform;

   (b) Gaussian: $w_i = \exp(-\mathrm{distance}^2(x_i, \mathrm{query}) / K_w^2)$, where $K_w$ is the kernel width.

4. Fitting of the local points (a training sketch for the local adaline is given after this list):

   (a) predict with the weighted vote;

   (b) predict with a local adaline: if all the neighbors have the same class label (which is trivially the case when k = 1), then predict that the query has this class label. Otherwise, train an adaline using this set of neighbors and predict the label of the query with the trained adaline. The adaline is of the form $f((x_1, x_2)) = w_0 + w_1 x_1 + w_2 x_2$, and you should use the Adaline rule for training. The weights are initialized as $w_0 = w_1 = w_2 = 0.1$, the learning rate $\eta$ is specified by the user, and for simplicity the training stops after 100 iterations.
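
For concreteness, here is a minimal sketch of the three distance measures of Section 2 and of the Gaussian weight above, for the 2-dimensional points used in this assignment. The functions are shown free-standing for brevity (the class in Section 3.3 declares them as members), and the hard-coded dimension and the gaussian_weight helper are assumptions of this sketch.

    #include <algorithm>
    #include <cmath>

    // Sketch only: the points in this assignment are 2-dimensional.
    static const int DIM = 2;

    double dis_e(double* x, double* y) {                  // Euclidean distance
        double s = 0.0;
        for (int i = 0; i < DIM; ++i) s += (x[i] - y[i]) * (x[i] - y[i]);
        return std::sqrt(s);
    }

    double dis_l(double* x, double* y) {                  // L1 distance
        double s = 0.0;
        for (int i = 0; i < DIM; ++i) s += std::fabs(x[i] - y[i]);
        return s;
    }

    double dis_i(double* x, double* y) {                  // L-infinity distance
        double m = 0.0;
        for (int i = 0; i < DIM; ++i) m = std::max(m, std::fabs(x[i] - y[i]));
        return m;
    }

    // Gaussian weight of neighbor xi with respect to the query point
    // (illustrative helper, not part of the required interface).
    double gaussian_weight(double* xi, double* query, double Kw,
                           double (*dist)(double*, double*)) {
        double d = dist(xi, query);
        return std::exp(-(d * d) / (Kw * Kw));
    }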

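The local adaline fit of 4(b) might look like the following minimal sketch, which uses the standard adaline (delta) rule with the initialization, learning rate and iteration count stated above. The function and argument names are illustrative only; in the required class this logic belongs in fit_p(), where k is a class member.

    // neighbors: k x 2 array of coordinates, labels: +1/-1 targets,
    // query: the test point, eta: the user-specified learning rate.
    int fit_adaline(double** neighbors, int* labels, int k,
                    double* query, double eta) {
        bool all_same = true;
        for (int j = 1; j < k; ++j)
            if (labels[j] != labels[0]) { all_same = false; break; }
        if (all_same) return labels[0];            // trivial case, e.g. k = 1

        double w0 = 0.1, w1 = 0.1, w2 = 0.1;       // initialization given above
        for (int iter = 0; iter < 100; ++iter) {   // stop after 100 iterations
            for (int j = 0; j < k; ++j) {
                double out = w0 + w1 * neighbors[j][0] + w2 * neighbors[j][1];
                double err = labels[j] - out;      // adaline (delta) rule on the linear output
                w0 += eta * err;
                w1 += eta * err * neighbors[j][0];
                w2 += eta * err * neighbors[j][1];
            }
        }
        double f = w0 + w1 * query[0] + w2 * query[1];
        return (f >= 0.0) ? +1 : -1;               // threshold the trained adaline
    }
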
3  Programming Specification

3.1  Command Syntax

The program should be coded in C++ and run under UNIX. The source file must be named classifier.cpp, and the command syntax is (square brackets mark parameters that are not always needed):

    training_file test_file output_file distance_type k weighting_type [K_w] fitting_type [eta]

- training_file, test_file, and output_file: the file names for the training data, the test data, and the output, respectively.

- distance_type: -e for the Euclidean distance; -l for the $L_1$ distance; -i for the $L_\infty$ distance.

- k: the number of neighbors used. 1 gives the 1-nearest-neighbor classifier; an integer k > 1 gives the k-nearest-neighbor classifier.

- weighting_type: -u for uniform weighting; -g for Gaussian weighting, in which case you also need to specify the kernel width $K_w$.

- fitting_type: -l for predicting with the weighted vote; -p for fitting a local adaline, in which case you also need to specify the learning rate $\eta$.

Examples:

    train.txt test.txt output.txt -e 1
    train.txt test.txt output.txt -l 1
    train.txt test.txt output.txt -i 1
    train.txt test.txt output.txt -e k -u -l
    train.txt test.txt output.txt -e k -g K_w -l
    train.txt test.txt output.txt -e k -u -p eta
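
A minimal sketch of mapping this command line onto the classifier configuration is given below. The option strings follow Section 3.1, while the positional parsing and the local variable names are assumptions of this sketch; bounds checking and error reporting are largely omitted for brevity.

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main(int argc, char* argv[]) {
        if (argc < 8) {        // program + 3 files + distance + k + weighting + fitting
            std::fprintf(stderr, "usage: %s training_file test_file output_file "
                                 "distance_type k weighting_type [K_w] fitting_type [eta]\n",
                         argv[0]);
            return 1;
        }
        const char* train_file  = argv[1];
        const char* test_file   = argv[2];
        const char* output_file = argv[3];
        const char* dist_flag   = argv[4];                // -e, -l or -i
        int k                   = std::atoi(argv[5]);

        int pos = 6;
        const char* wht_flag = argv[pos++];               // -u or -g
        double Kw = 0.0;
        if (std::strcmp(wht_flag, "-g") == 0)
            Kw = std::atof(argv[pos++]);                  // kernel width follows -g

        const char* fit_flag = argv[pos++];               // -l or -p
        double eta = 0.0;
        if (std::strcmp(fit_flag, "-p") == 0)
            eta = std::atof(argv[pos++]);                 // learning rate follows -p

        // ... read the files, hand the choices to a classifier object through the
        //     get_* functions of Section 3.3, then call knn() and output().
        return 0;
    }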

3.2  Output Format

The output file stores the test results of each classification task. The format of output_file is:

    line 1:          [testing accuracy (double)]
    line 2 ... I+1:  [class labels of the test samples, as predicted by the classifier (+1/-1)]
    line I+2:        -1   (this is the end marker)

Here, I is the number of test samples.
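
As an illustration, here is a minimal sketch of writing this format. The helper name and arguments are assumptions of this sketch; the accuracy is simply the fraction of test samples whose predicted label matches the label stored in test.txt.

    #include <cstdio>

    void write_output(const char* output_file, int* predicted, int* truth, int n_test) {
        int correct = 0;
        for (int i = 0; i < n_test; ++i)
            if (predicted[i] == truth[i]) ++correct;

        FILE* fp = std::fopen(output_file, "w");
        if (fp == NULL) return;
        std::fprintf(fp, "%f\n", (double)correct / n_test);   // line 1: testing accuracy
        for (int i = 0; i < n_test; ++i)
            std::fprintf(fp, "%d\n", predicted[i]);           // lines 2 ... I+1: predicted labels
        std::fprintf(fp, "-1\n");                              // line I+2: end marker
        std::fclose(fp);
    }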

3.3  Program Structure

You should design the class classifier according to the following:

    typedef double (*dist_fun_ptr)(double*, double*);
    typedef void   (*wht_fun_ptr)(double*, double**, int*);
    typedef int    (*fit_fun_ptr)(double*, double**, int*);

    class classifier {
    private:
        double**     train_data;
        int*         train_labl;
        double**     test_data;
        int*         test_labl;
        dist_fun_ptr p_df;

    public:
        wht_fun_ptr  p_wf;
        fit_fun_ptr  p_ff;
        int          k;
        double*      center;
        double**     local_set;
        int*         local_set_lbl;
        char*        output_file;

        void get_train(double** train_set, int* train_lbl);
        void get_test(double** test_set, int* test_lbl);
        void get_k(int k);
        void get_dis_ptr(dist_fun_ptr ptr);
        void get_wht_ptr(wht_fun_ptr ptr);
        void get_fit_ptr(fit_fun_ptr ptr);
        void get_output(char* output_file_name);
        void output();
        void knn();
        classifier();
        ~classifier();

        double dis_e(double* x, double* y);
        double dis_l(double* x, double* y);
        double dis_i(double* x, double* y);
        void   wht_u(double* center, double** local_set, int* local_set_lbl);
        void   wht_g(double* center, double** local_set, int* local_set_lbl);
        int    fit_l(double* center, double** local_set, int* local_set_lbl);
        int    fit_p(double* center, double** local_set, int* local_set_lbl);
    };

Specifications of the variables and functions are as follows:

- double (*dist_fun_ptr)(double*, double*): format of the distance function. The input arguments are two 1D double arrays (the coordinates of two samples), and the return value is the distance (double).

- void (*wht_fun_ptr)(double* center, double** local_set, int* local_set_lbl): format of the weighting function. The input arguments are a 1D double array (the center point), a 2D double array (the k nearest neighbors), and a 1D int array (the labels of the k neighbors). The function applies the specified weighting scheme by modifying the labels of the set of neighbors, local_set_lbl.

- int (*fit_fun_ptr)(double* center, double** local_set, int* local_set_lbl): format of the fitting function. The input arguments are a 1D double array (the center point), a 2D double array (the k nearest neighbors), and a 1D int array (the labels of the nearest neighbors after weighting). The function returns the predicted label of the center point using the weighted labels and the specified fitting scheme (direct combination, or fitting a local adaline).

- train_data: 2D array that stores the training data.

- train_labl: 1D array that stores the training labels.

- test_data: 2D array that stores the test data.

- test_labl: 1D array that stores the test labels.

- p_df: function pointer of type dist_fun_ptr, which determines the distance measure used in the function knn().

- p_wf: function pointer of type wht_fun_ptr, which determines the weighting scheme used in the function knn().

- p_ff: function pointer of type fit_fun_ptr, which determines the fitting scheme used in the function knn().

- get_train(double** train_set, int* train_lbl): reads the training data. The two arguments are passed on to the class members train_data and train_labl.

- get_test(double** test_set, int* test_lbl): reads the test data. The two arguments are passed on to the class members test_data and test_labl.

- get_k(int k): obtains the user-specified number of nearest neighbors.

- get_dis_ptr(dist_fun_ptr ptr): obtains the pointer to the distance function (dis_e, dis_l, or dis_i) and passes it to the class member p_df.

- get_wht_ptr(wht_fun_ptr ptr): obtains the pointer to the weighting function (wht_u or wht_g) and passes it to the class member p_wf.

- get_fit_ptr(fit_fun_ptr ptr): obtains the pointer to the local fitting function (fit_l or fit_p) and passes it to the class member p_ff.

- get_output(char* output_file_name): obtains the user-specified output file name, which is passed on to the class member output_file.

- center, local_set, local_set_lbl: the center point, the set of local nearest neighbors, and their labels. To predict the label of each test pattern (center), you need to find its k nearest neighbors (local_set) and the neighbors' labels (local_set_lbl), and then apply the chosen weighting and local fitting schemes to make the prediction. Therefore, you should maintain these three class members for each test point.

- knn(): performs classification on the test data. For every test pattern (center), construct the set of local neighbors (local_set) and their labels (local_set_lbl) using the user-specified distance measure (dis_e(), dis_l(), dis_i()) and k. Then call the weighting function (wht_u() or wht_g()) to re-weight the labels. Finally, call the fitting function (fit_l() or fit_p()) to combine the weighted labels and predict the label of the test point (a minimal sketch of this loop is given after this list).

- output(): writes the classification results to output_file.

- dis_e(), dis_l(), dis_i(): distance functions using the Euclidean, $L_1$, and $L_\infty$ distances.

- wht_u(), wht_g(): weighting functions using uniform and Gaussian weighting.

- fit_l(), fit_p(): fitting functions using direct combination of the local labels, or fitting a local adaline (a weighted-vote sketch also follows this list).
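
The knn() loop described above might be organised as in the following minimal sketch. It is written as a free function so that it is self-contained; in the assignment the same logic lives in classifier::knn() and uses the class members instead of the parameters assumed here. The k nearest neighbors are found with a simple O(k*n) scan, which is adequate for the data sizes in this assignment.

    #include <vector>

    void knn_predict(double** train_data, int* train_labl, int n_train,
                     double** test_data, int n_test, int k,
                     double (*p_df)(double*, double*),
                     void (*p_wf)(double*, double**, int*),
                     int (*p_ff)(double*, double**, int*),
                     int* pred_labl) {
        std::vector<double*> local_set(k);
        std::vector<int> local_set_lbl(k);
        for (int t = 0; t < n_test; ++t) {
            double* center = test_data[t];
            std::vector<bool> used(n_train, false);

            // Pick the k nearest training points (simple repeated minimum search).
            for (int j = 0; j < k; ++j) {
                int best = -1;
                double best_d = 0.0;
                for (int i = 0; i < n_train; ++i) {
                    if (used[i]) continue;                    // already selected
                    double d = p_df(train_data[i], center);   // user-chosen distance
                    if (best < 0 || d < best_d) { best = i; best_d = d; }
                }
                used[best] = true;
                local_set[j]     = train_data[best];
                local_set_lbl[j] = train_labl[best];
            }

            p_wf(center, &local_set[0], &local_set_lbl[0]);                // re-weight the labels
            pred_labl[t] = p_ff(center, &local_set[0], &local_set_lbl[0]); // predict the label
        }
    }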

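For the weighted vote, a possible fitting function in the spirit of fit_l() is sketched below. How wht_g() encodes its Gaussian weights into the int label array is left open by the handout, so this sketch simply sums the (possibly re-weighted) labels and takes the sign. It is shown free-standing; in the class, k is a member rather than a parameter.

    int fit_vote(double* center, double** local_set, int* local_set_lbl, int k) {
        (void)center;                      // the center itself is not needed for a vote
        (void)local_set;
        int sum = 0;
        for (int j = 0; j < k; ++j)
            sum += local_set_lbl[j];       // +1/-1 labels, possibly re-weighted
        return (sum >= 0) ? +1 : -1;       // sign of the vote; ties broken as +1
    }
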
4  Experiment Design and Output

You are required to perform the following tasks:

1. 1-NN classification using the Euclidean distance;
2. 1-NN classification using the $L_1$ distance;
3. 1-NN classification using the $L_\infty$ distance;
4. k-NN classification with k = 5, uniform weighting, weighted vote;
5. k-NN classification with k = 9, uniform weighting, weighted vote;
6. k-NN classification with k = 31, uniform weighting, weighted vote;
7. k-NN classification with k = 31, Gaussian weighting (with $K_w = 0.05$), weighted vote;
8. k-NN classification with k = 31, uniform weighting, local fitting by adaline (with $\eta = 0.1$).

You have to show the classification results on the test data (test.txt), using red for the positive class and blue for the negative class. Since the test grid is very dense, the classification boundary should be easy to see. Altogether, you should provide 8 plots. Note that we will also use other test data to evaluate the correctness of your program.
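
The handout does not prescribe a plotting tool. One possibility (an assumption of this sketch, not a requirement) is to split the predicted test points into two files that gnuplot or any other plotting tool can then draw in red and blue.

    #include <cstdio>

    // Writes the test coordinates into pos.txt / neg.txt according to the
    // predicted labels, so the decision regions can be plotted externally.
    void dump_for_plot(double** test_data, int* predicted, int n_test) {
        FILE* pos = std::fopen("pos.txt", "w");
        FILE* neg = std::fopen("neg.txt", "w");
        if (pos == NULL || neg == NULL) {
            if (pos) std::fclose(pos);
            if (neg) std::fclose(neg);
            return;
        }
        for (int i = 0; i < n_test; ++i) {
            FILE* out = (predicted[i] == 1) ? pos : neg;
            std::fprintf(out, "%f %f\n", test_data[i][0], test_data[i][1]);
        }
        std::fclose(pos);
        std::fclose(neg);
    }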

5  Submission

We will collect your program using CASS. Please submit the following files:

1. Well-documented program source code for classifier.cpp.

2. If you have multiple source files, submit all of them plus a makefile for the compilation.

3. A REPORT file with your name, student ID, email address, and short descriptions of your programs. You should provide the classification results on the 2D plane for all the classification tasks mentioned in Section 4, compare them, and draw some conclusions about the selection of the parameters.

For more detail of the CS UNIX account, please read ...

IMPORTANT NOTE: You should NOT modify any submitted file after the assignment collection deadline.

6  Grading

The basic requirements include program clarity, documentation, and consistency with the assignment specifications. Your program will be compiled under UNIX and tested on a different test data set. Your discussion in the report will also be taken into consideration.

6.1  Late Submission

We accept late submissions, but with the following penalties:

- submit on or before November 15, 1:00pm: deduct 30%;
- submit on or before November 16, 1:00pm: deduct 60%;
- submit after November 16, 1:00pm: deduct 100%.
