Data Mining Final Project on NIPS Competition 2003


Chun F. Hsu, Yao-Hui Yu, Yin-Tzu Lin

June 18, 2006

Abstract

Feature selection is an important issue in data mining; its main purpose is to reduce dimensionality and to filter out noise. In this project we perform classification on data sets preprocessed by feature selection. We focus on two feature selection methods, PCA and a univariate significance test (UST), and combine them with SVM to make predictions. Using feature selection improves both the balanced error rate (BER) and the area under the ROC curve (AUC) of the output predictions. We also briefly discuss the performance of other classifiers, such as KNN.

Contents

1 Introduction
  1.1 Competition Scoring Criteria and Analysis
    1.1.1 AUC
    1.1.2 BER
  1.2 Training Procedure
2 Feature Selection
  2.1 Method
    2.1.1 PCA
    2.1.2 Univariate Significance Test
  2.2 Result
3 Experiment Result
  3.1 Using Libsvm
  3.2 Result without F.S.
  3.3 Final Result
  3.4 Comparison with PCA+KNN
4 Summary
  4.1 Criticism on UST
  4.2 Criticism on PCA
  4.3 Group Contribution and Participation

List of Tables

1 First Observation After Data Feature Selection
2 Result without F.S.
3 Final Result
4 Madelon comparison between SVM and KNN
5 Gisette comparison between SVM and KNN

1 Introduction

1.1 Competition Scoring Criteria and Analysis

The data sets we use come from NIPS 2003, where a competition used five data sets for classification. Our goal is to figure out why the 1st-place entry won and, by tracing its work, to produce an implementation of our own. In NIPS 2003 the organizers performed a series of tests on the submitted results and combined them into a final score. However, we do not have the true labels of the testing set, and we do not know where the organizers inserted the noise features they call probes. So the only measures we can compute are BER and AUC, and we apply them when adjusting our results.

1.1.1 AUC

AUC (Area Under Curve) has been proposed by many researchers for evaluating the predictive ability of learning algorithms. It is the area under the ROC curve. Our implementation of AUC is based on [1]. The formula is

    AUC = (S_0 - n_0(n_0 + 1)/2) / (n_0 n_1),    (1)

where n_0 and n_1 are the numbers of positive and negative examples, S_0 = Σ_i r_i, and r_i is the rank of the i-th positive example in the ranked list.

1.1.2 BER

BER (Balanced Error Rate) is a measure that equalizes the importance of a large class and a small class by giving each a weight of 50%. The formula is

    BER = (FPR + FNR) / 2,    (2)

where FPR = FP/(FP + TN) is the false positive rate, FNR = FN/(FN + TP) is the false negative rate, and FP, TN, FN, TP are the false positive, true negative, false negative, and true positive counts of the prediction result.
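Both measures take only a few lines of code. Below is a minimal NumPy sketch of formulas (1) and (2); the function names are ours for illustration, not the tools we actually built, and labels are assumed to be +1/-1.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC, following Eq. (1). Ties in the scores are
    ignored for brevity; labels are +1/-1, scores are decision values."""
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = smallest score
    pos = labels == 1
    n0, n1 = pos.sum(), (~pos).sum()  # numbers of positive/negative examples
    s0 = ranks[pos].sum()             # S_0: sum of ranks of the positives
    return (s0 - n0 * (n0 + 1) / 2) / (n0 * n1)

def ber(pred, labels):
    """Balanced error rate, Eq. (2): average of FPR and FNR."""
    pred, labels = np.asarray(pred), np.asarray(labels)
    fp = np.sum((pred == 1) & (labels == -1))
    tn = np.sum((pred == -1) & (labels == -1))
    fn = np.sum((pred == -1) & (labels == 1))
    tp = np.sum((pred == 1) & (labels == 1))
    return (fp / (fp + tn) + fn / (fn + tp)) / 2
```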

1.2 Training Procedure

Our training procedure can be separated into several steps:

1. Data preprocessing: Our classifier is libsvm [3], a commonly used SVM tool. We have to combine the class labels with the attributes (features) and transform them into libsvm's sparse format before feeding them to libsvm. In the R [4] environment, the package e1071 provides a function for this, write.matrix.csr. A sketch of the format itself appears after this list.

2. Feature selection: In this stage we split the flow in two directions, one using PCA and the other using the permutation test. In both methods, once we receive the selected features, we use binary search to find the best number of attributes. For example, on Madelon this method tells us that only 5 attributes are needed to achieve the best performance.

3. Run classifiers: After feature selection, we grid-search for the best libsvm parameters, and once we have the parameters in hand we train with probability output (a parameter-search sketch also follows this list). Since we do not have the test labels, we treat the validation data as the test data and use cross-validation on the training data to look for good parameters. Scaling the data into [0, 1] also needs to be considered; it can change the result significantly.

4. Compute BER and AUC: Using our own implementations of these measures, we get a clean evaluation of our results and classifiers.
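For reference, the sparse format itself is simple: one line per instance, of the form "<label> <index>:<value> ...", with 1-based feature indices and zero entries omitted. A minimal Python sketch of a writer for this format (an illustration of the format, not the R write.matrix.csr function; write_libsvm is our name):

```python
def write_libsvm(path, X, y):
    """Write instances in libsvm's sparse text format:
    '<label> <index>:<value> ...', 1-based indices, zeros skipped."""
    with open(path, "w") as f:
        for label, row in zip(y, X):
            feats = " ".join(f"{j + 1}:{v}" for j, v in enumerate(row) if v != 0)
            f.write(f"{label} {feats}\n")
```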

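Step 3 can be sketched as follows. As a stand-in for the libsvm command-line tools we use scikit-learn's SVC, which wraps the same libsvm; the (C, gamma) grid shown is an illustrative assumption, not the exact grid we ran.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

def tune_svm(X_train, y_train):
    # Scale every feature into [0, 1]; as noted above, this can
    # change the result significantly.
    scaler = MinMaxScaler().fit(X_train)
    X_scaled = scaler.transform(X_train)
    # Search (C, gamma) with cross-validation on the training set,
    # since the test labels are not available.
    grid = GridSearchCV(
        SVC(kernel="rbf", probability=True),
        param_grid={"C": 2.0 ** np.arange(-5, 13, 2),
                    "gamma": 2.0 ** np.arange(-15, 3, 2)},
        cv=5,
    )
    grid.fit(X_scaled, y_train)
    return grid.best_estimator_, grid.best_params_, scaler
```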
2 Feature Selection

2.1 Method

2.1.1 PCA

PCA (Principal Component Analysis) is a procedure that computes the most meaningful basis with which to re-express a noisy, garbled data set. It is useful in pattern recognition, image compression, and feature selection. Here we use it to select useful features to improve our learning model. Below are the basic PCA steps used in this project [6][7]; a sketch follows the list.

1. Construct the covariance matrix M of the training data.

2. Compute the eigenvectors of M.

3. Use the first k eigenvectors as the new basis, and transform the training and validation data into this new basis. k needs to be preselected.

4. Use the transformed data for training.
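A minimal NumPy sketch of these four steps (illustrative, assuming dense data; not our actual scripts):

```python
import numpy as np

def pca_transform(X_train, X_valid, k):
    """Steps 1-4 above: covariance eigendecomposition, keep the first
    k eigenvectors, project both data sets onto the new basis."""
    mean = X_train.mean(axis=0)
    M = np.cov(X_train - mean, rowvar=False)   # step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(M)       # step 2: eigenvectors of M
    # step 3: first k eigenvectors, ordered by decreasing eigenvalue
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # step 4: transform training and validation data to the new basis
    return (X_train - mean) @ top, (X_valid - mean) @ top
```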

2.1.2 Univariate Significance Test

A significance test sets up a null hypothesis H_0 and then uses statistical methods to compute a probability estimate (p-value) for H_0; e.g., p = 0.5 means it is 50% likely that H_0 is true.

The winner's method: They assume the relevant variables will be at least somewhat relevant on their own, and only apply the significance test to features that are non-zero in at least 4 training instances. They use the Pearson and Spearman correlations between each attribute and the class label and, finally, a permutation test to estimate the p-value. The formulas for the correlations are

    r_s = 1 - 6 Σ (X_i - Y_i)^2 / (n(n^2 - 1)),    (3)

    r_p = (Σ XY - Σ X Σ Y / n) / sqrt((Σ X^2 - (Σ X)^2/n)(Σ Y^2 - (Σ Y)^2/n)),    (4)

and the p-value is

    p = 2 min( (1/n!) Σ_π I(r_xyπ >= r_xy), (1/n!) Σ_π I(r_xyπ <= r_xy) ),    (5)

where x is a feature, y is the label, n is the total number of instances, and there are n! possible permutations; y_π represents one such permutation. The indicator function is

    I(.) = 1 if the condition is true, 0 otherwise.    (6)

Thus, for the p-value to be large, the counts of (r_xyπ >= r_xy) and (r_xyπ <= r_xy) should be nearly identical.

However, when we tried the method above we ran into a big problem: the number of permutations can be astronomically large. For n = 6000 (the number of instances in Gisette), the number of permutations is on the order of 10^20065, far too many to enumerate. So we looked for an alternative [2]. In that paper, the author uses another form of the permutation test. Let X = [X_1, X_2, X_3, ..., X_k]^T be the features and Y the label, 1 or -1. He assumes the relevance of a feature X is measured by the difference between Pr[X = x | Y = 1] and Pr[X = x | Y = -1], and uses the following four statistics to compute this difference: the difference in sample means (r_M), the symmetric variant of the Kullback-Leibler distance (the J-measure, r_J), information gain (r_IG), and a chi^2-statistic-based measure (r_CHI). He defines D_j^1 = {x_ij : y_i = 1} and D_j^-1 = {x_ij : y_i = -1} as the values of the j-th feature over all instances of class 1 and -1, θ(D_j^1, D_j^-1) as one of the statistics above, and π_j as the p-value under the null hypothesis H_0: Pr[X_j | Y = 1] = Pr[X_j | Y = -1]. The relevance of X_j is inversely proportional to the p-value π_j, which is estimated with a permutation test for each feature j.

Formula 1. Let U(b) and V(b) denote the shuffled values of the j-th feature over the instances of class 1 and -1, where b indexes a permutation. Then

    p = (1/(B + 1)) Σ_b I(θ(U(b), V(b)) > θ(D_j^1, D_j^-1)).

To reduce the heavy computation, only B permutations are performed instead of n!. The author notes that the count above is a binomial random variable b(B, π_j), so to keep the estimation error under 10% he sets the coefficient of variation

    CV = sqrt((1 - π_j) / (π_j B)) = 0.1;

thus, if the number of permutations is reduced to B = 2000, only p-values estimated to be larger than about 0.05 are obtained reliably. Since a small π_j is not reliable for ranking, a Z-score is used instead,

    Z-score = (θ(D_j^1, D_j^-1) - mean(θ(U, V))) / std(θ(U, V)),    (7)

to rank the features whose estimated p-values fall below the threshold. In his conclusion, the difference in sample means (r_M) and the symmetric variant of the Kullback-Leibler distance (the J-measure, r_J) are the better statistics, so we adopted these two instead of all four.

Our implementation of the significance test therefore slightly combines the winner's method and the paper's method:

1. We pick the features that are non-zero in at least 4 instances.

2. Based on the paper's algorithm, we use r_M and r_J [5], whose formulas are

       r_M(X) = |E[X | Y = 1] - E[X | Y = -1]|,    (8)

       r_J(X) = Σ_x (Pr[X = x | Y = 1] - Pr[X = x | Y = -1]) log_2(Pr[X = x | Y = 1] / Pr[X = x | Y = -1]).    (9)

3. We apply shuffles to the class label (the same as the winner).

4. We choose B = 150 to estimate the p-value (for p-values under 0.4, we use the Z-score), because the running time of 2000 permutations is exhaustingly long (a week or more). Since a larger θ(D_j^1, D_j^-1) is better, we consider features with bigger Z-scores to be more important.

5. After computing the p-values, we have two importance orders: one using r_M as θ(D_j^1, D_j^-1), the other using r_J.
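A minimal sketch of this per-feature permutation test, using r_M as the statistic θ (illustrative Python, not our original scripts; B defaults to the 150 we used):

```python
import numpy as np

rng = np.random.default_rng(0)

def r_m(x, y):
    """Difference in sample means, Eq. (8); y is +1/-1."""
    return abs(x[y == 1].mean() - x[y == -1].mean())

def permutation_score(x, y, B=150):
    """Estimate the p-value of Formula 1 with B label shuffles and
    return the Z-score of Eq. (7) used for ranking."""
    observed = r_m(x, y)
    shuffled = np.empty(B)
    for b in range(B):
        y_perm = rng.permutation(y)        # shuffle the class labels
        shuffled[b] = r_m(x, y_perm)
    p = np.sum(shuffled > observed) / (B + 1)              # Formula 1
    z = (observed - shuffled.mean()) / shuffled.std()      # Eq. (7)
    return p, z
```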

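With the features ranked this way, step 2 of the training procedure binary-searches the number of top-ranked features to keep. A sketch under the assumption that validation error is roughly unimodal in the number of features; the evaluate callback, which trains and scores a model on the k best features, is hypothetical.

```python
def best_feature_count(evaluate, lo=1, hi=500):
    """Ternary-style bisection for the number of top-ranked features
    minimizing validation error; evaluate(k) is assumed unimodal in k."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if evaluate(m1) <= evaluate(m2):
            hi = m2   # the minimum lies in [lo, m2]
        else:
            lo = m1   # the minimum lies in [m1, hi]
    # brute-force the few remaining candidates
    return min(range(lo, hi + 1), key=evaluate)
```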
2.2 Result

Table 1 lists the results after feature selection, ranked by error rate. "Sel" stands for the feature selection algorithm, "Ffeat" is the percentage of features selected, "Scale" indicates whether the data were scaled into [0, 1], "Parameter" gives the (c, gamma) suggested by libsvm's grid search, and UST abbreviates Univariate Significance Test.

Table 1: First Observation After Data Feature Selection

Data Set   Sel   Ffeat   Scale   Parameter   CV Error Rate
Arcene     UST   25%     Yes     2048, -     -
Dexter     UST   1.09%   No      2, -        -
Dorothea   UST   NA      NA      NA          NA
Gisette    PCA   1%      Yes     2, -        -
Gisette    UST   NA      NA      NA          NA
Madelon    PCA   1%      Yes     0.5, -      -
Madelon    UST   1%      Yes     32, -       -

3 Experiment Result

3.1 Using Libsvm

We decided to use libsvm as our classification tool; one reason is that it provides almost all the strong, neat utilities needed to run the experiments. Table 2 gives the plain results without any feature selection. libsvm actually ships with a feature selection tool called FScore, but it is not in our project plan; we are supposed to follow the methods we investigated ourselves. The results show that even without feature selection, libsvm performs very well on Arcene, Dexter, and Dorothea. On Madelon the result is somewhat worse than on the data sets above. On Gisette, training takes almost a day to complete; as of this writing, the grid search on Gisette is still running.

3.2 Result without F.S.

First we ran all data sets through libsvm without any feature selection; Table 2 shows the results. "CV Err" is the cross-validation error on the training set, and "Parameter" gives the (c, gamma) of the SVM.

Table 2: Result without F.S.

Data Set   Scale   Parameter   CV Err    Test Err.   AUC
Arcene     Yes     32, -       -         16.0%       93.58%
Dexter     Yes     32, -       -         7.0%        97.81%
Dorothea   No      8, -        -         -           93.95%
Gisette    No      NA          19.55%    NA          NA
Madelon    Yes     32.0, -     -         -           -

3.3 Final Result

The final results are in Table 3, to which we add the two performance measures, AUC and BER. We combined both selection methods and chose the better one for each data set. For Arcene and Dexter, the BER and AUC are better with plain SVM, but we come close to it. Without scaling, Gisette and Madelon would not achieve these results; yet with plain SVM, scaling Madelon makes the result even worse. This is tricky: whether scaling helps on a data set can change after the data set has gone through some transformation. Gisette's result is very good, reaching above 95% accuracy, and its Ffeat score shows the result is indeed inspiring. As of this writing, the UST results for Dorothea have not come out; otherwise our results might be even better with the feature selection factor added.

Table 3: Final Result

Data Set   Selection   Ffeat    Testing Err.   AUC      BER
Arcene     UST         25%      19.00%         90.99%   19.23%
Dexter     UST         1.09%    21.33%         88.10%   19.9%
Dorothea   SVM         -        -              93.95%   13.69%
Gisette    PCA         0.01%    5.3%           99.6%    4.79%
Madelon    PCA         0.01%    10.5%          95.82%   10.49%

3.4 Comparison with PCA+KNN

Our PCA procedure was only run on Madelon and Gisette, so we compare the performance of KNN and SVM on these two data sets.

Table 4: Madelon comparison between SVM and KNN

Classifier                                     Ffeat   Testing Err.
svm                                            -       -
knn (k=5, one of the best among 1NN-100NN)     -       -
knn (avg. of 1NN-100NN)                        -       -

Table 5: Gisette comparison between SVM and KNN

Classifier                                     Ffeat   Testing Err.
svm                                            -       -
knn (k=5, one of the best among 1NN-100NN)     -       -
knn (avg. of 1NN-100NN)                        -       -

In Tables 4 and 5 we can see that after feature selection, both classifiers improve to a similar level of accuracy; there is hardly any difference between them. Before feature selection, however, both have very poor results. This shows that our feature selection direction really improves things substantially compared with doing no feature selection work.
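The KNN side of this comparison amounts to sweeping k from 1 to 100 on the PCA-transformed data. A minimal scikit-learn sketch of that sweep (illustrative only; not our original tooling):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_error_curve(X_train, y_train, X_valid, y_valid, ks=range(1, 101)):
    """Validation error of k-NN for k = 1..100 on the PCA-transformed
    data; errs.min() and errs.mean() correspond to the best and the
    average entries in Tables 4 and 5."""
    errs = []
    for k in ks:
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        errs.append(1.0 - clf.score(X_valid, y_valid))
    return np.array(errs)
```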

4 Summary

The whole project shows that applying feature selection to certain data sets improves performance and lowers the dimensionality, and thus the complexity, of the classification work. To finish the project properly, we will send our results to the competition website and wait to see how far we got.

4.1 Criticism on UST

Because we reduced the number of permutations to 150 to avoid exhaustive computation, a larger error rate is inevitable. Nevertheless, our experimental results suggest that with a more powerful computer and more permutations, the results would be better.

4.2 Criticism on PCA

Although our PCA was only run on two data sets, it showed a significant improvement on both: it helps decrease the error rate and raises the AUC to an acceptable level.

4.3 Group Contribution and Participation

All three of us attended to this project diligently; most ideas came from iterative discussion. Our individual contributions are as follows.

Chun F. Hsu: AUC and file-converter tool implementation, report design, plain SVM testing, data and result organizing, and project coordination.

Yao-Hui Yu: PCA design and execution of the whole PCA process, the PCA+KNN experiment, and the BER tool.

Yin-Tzu Lin: Univariate Significance Test design, paper collecting and surveying, and SVM testing on the UST data.

References

[1] Jin Huang and Charles X. Ling, "Using AUC and Accuracy in Evaluating Learning Algorithms," IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 3, 2005.

[2] P. Radivojac, Z. Obradovic, A. K. Dunker, and S. Vucetic, "Feature selection filters based on the permutation test," European Conference on Machine Learning (ECML 2004), Pisa, Italy, September 2004.

[3] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: a library for support vector machines," software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[4] R: A Language and Environment for Statistical Computing, http://www.r-project.org/.

[5] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Transactions on Information Theory, Vol. 37, No. 1, 1991, pp. 145-151.

[6] Lindsay I. Smith, "A tutorial on Principal Components Analysis," tutorials/principal_components.pdf.

[7] Jon Shlens, "A Tutorial on Principal Component Analysis: Derivation, Discussion and Singular Value Decomposition."
