Unsupervised Feature Selection for Sparse Data

Artur Ferreira (1,3) and Mário Figueiredo (2,3)
1 - Instituto Superior de Engenharia de Lisboa, Lisboa, PORTUGAL
2 - Instituto Superior Técnico, Lisboa, PORTUGAL
3 - Instituto de Telecomunicações, Lisboa, PORTUGAL

Abstract. Feature selection is a well-known problem in machine learning and pattern recognition. Many high-dimensional datasets are sparse, that is, many features have zero value. In some cases, we do not know the class label for some (or even all) patterns in the dataset, leading to semi-supervised or unsupervised learning problems. For instance, in text classification with bag-of-words (BoW) representations, there is usually a large number of features, many of which may be irrelevant (or even detrimental) for categorization tasks. In this paper, we propose an efficient unsupervised feature selection technique for sparse data, suitable for both standard floating point and binary features. The experimental results on standard datasets show that the proposed method yields efficient feature selection, reducing the number of features while simultaneously improving the classification accuracy.

1 Introduction

The need for feature selection (FS) is well known, since it arises in many machine learning and pattern recognition problems [1]. For instance, in text classification based on the bag-of-words (BoW) representation [2], each document is represented by a high-dimensional sparse vector holding the frequencies of a set of terms in that text; we usually have a large number of features, many of which are irrelevant, redundant, or even harmful for classification performance. In this context, the need for FS techniques arises; these techniques may improve the accuracy of a classifier (avoiding the curse of dimensionality) and speed up the training process [1]. There is a vast literature on FS techniques; see, for instance, [1, 3, 4] for detailed analyses and many further references.

1.1 Feature Selection in Text Categorization

In text classification tasks, each document is typically represented by a BoW [2], which is a high-dimensional vector with the relative frequencies of a set of terms in each document. A collection of documents is represented by the term-document (TD) matrix [5], whose columns hold the BoW representation of each document, whereas its rows correspond to the terms in the collection. An alternative representation of a collection of documents is the (binary) term-document incidence (TDI) matrix [5]; this matrix records, for each document, whether a given term (word) is present or absent. Both these matrices usually have a large number of rows (i.e., terms/features); consequently, representing large collections of documents is expensive in terms of memory.

Feature selection contributes to alleviating this problem, in addition to often improving the performance of the learning algorithm being applied [3]. For supervised and unsupervised text categorization tasks, several FS techniques have been proposed (see [6, 7, 8]). The majority of these techniques are applied directly on the TD matrix with floating point BoW representations.

1.2 Our Contribution

In this paper, we propose an efficient unsupervised method for FS on sparse numeric floating point and binary features. We use a filter approach, which makes the method independent of the type of classifier considered.

The remaining text is organized as follows. Section 2 reviews document representations and supervised and unsupervised FS techniques. Section 3 presents the proposed unsupervised method for FS. Section 4 provides the experimental evaluation of our method, compared against other supervised and unsupervised methods. Finally, Section 5 ends the paper with some concluding remarks.

2 Background

This section briefly reviews some background concepts regarding document representations and supervised and unsupervised FS techniques.

2.1 Document Representation

Let D = {(x_1, c_1), ..., (x_n, c_n)} be a labeled dataset with training and test subsets, where x_i ∈ R^p denotes the i-th (BoW-type) feature vector and c_i is its class label. The BoW representation contains some measure, such as the term frequency (TF) or the term-frequency inverse-document-frequency (TF-IDF) of a term (or word) [2]. Since each document contains only a small subset of the terms, the BoW vector is usually sparse [2].

Let X be the p × n term-document (TD) matrix representing D; each column of X corresponds to a document, whereas each row corresponds to a term (e.g., a word); each column is the BoW representation of a document [2, 5]. Let X_b be the corresponding p × n term-document incidence (TDI) matrix, with binary entries, such that a 1 at row i, column j means that term i occurs in document j, whereas a 0 means that it does not [5]. TDI matrices are an adequate solution for representing very large collections of documents, due to their low memory requirements. The TDI matrix can be trivially obtained from the TD matrix.
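To make the two representations concrete, here is a minimal sketch (ours, not the authors'; it assumes scikit-learn is available, and the toy document collection and variable names are invented for illustration) that builds a TD matrix and the corresponding binary TDI matrix with CountVectorizer.

```python
# Illustrative sketch (not the authors' code): build TD and TDI matrices
# for a toy document collection using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the company reported losses",
]

# Term counts: CountVectorizer returns a (documents x terms) sparse matrix,
# so we transpose it to obtain the p x n term-document (TD) matrix.
tf = CountVectorizer()
X_td = tf.fit_transform(docs).T                 # p x n, term frequencies

# Binary occurrence with the same vocabulary: the term-document incidence
# (TDI) matrix, trivially obtained from the same collection.
binarizer = CountVectorizer(binary=True, vocabulary=tf.vocabulary_)
X_tdi = binarizer.fit_transform(docs).T         # p x n, 0/1 entries

print(X_td.shape, X_tdi.shape)                  # both are (p, n)
print(tf.get_feature_names_out()[:5])           # first few terms (rows)
```

Sharing the vocabulary between the two vectorizers keeps the rows of the TD and TDI matrices aligned, so the same feature index refers to the same term in both representations.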

2.2 Unsupervised and Supervised Feature Selection

A comprehensive listing of FS techniques can be found in [1, 3, 4]. In this subsection, we briefly describe three FS techniques that have proven effective for classification on sparse data. On text categorization problems, these techniques have been applied directly on the TD matrix, achieving considerable dimensionality reduction and improving classification accuracy.

The unsupervised term-variance (TV) method [9] selects features (terms) X_i by their variance, given by

    \mathrm{TV}_i = \mathrm{var}_i = \frac{1}{n} \sum_{j=1}^{n} (X_{ij} - \bar{X}_i)^2 = E[X_i^2] - \bar{X}_i^2,    (1)

where \bar{X}_i is the average value of feature X_i and n is the number of patterns.

The supervised minimum redundancy maximum relevancy (mRMR) method [10] computes both the redundancy and the relevance of each feature. Redundancy is the mutual information (MI) [11] between pairs of features, whereas relevance is measured by the MI between features and the class label.

The supervised Fisher index (FI) of each feature, on binary classification problems, is given by

    \mathrm{FI}_i = \frac{|\mu_i^{(-1)} - \mu_i^{(+1)}|^2}{\mathrm{var}_i^{(-1)} + \mathrm{var}_i^{(+1)}},    (2)

where \mu_i^{(\pm 1)} and \mathrm{var}_i^{(\pm 1)} are the mean and variance of feature i for the patterns of each class. The FI measures how well each feature separates the two (or more, since it can be generalized) classes.

Since these three methods are filter approaches, to perform FS we select the m (≤ p) features with the largest rank.

3 Proposed Unsupervised Method

In this section, we present the proposed method for unsupervised FS, which can be applied to floating point and binary representations (equivalently, the TD or TDI matrices). It relies on the idea that, for sparse data, a feature has an importance/relevance proportional to its dispersion [9]. We use the generic dispersion of the values of each feature, instead of the dispersion around the mean as in the TV method. We propose to measure this dispersion with the arithmetic mean (AM) and the geometric mean (GM). For a given feature X_i observed on n patterns, the AM and the GM are

    \mathrm{AM}_i = \frac{1}{n} \sum_{j=1}^{n} X_{ij} \quad \text{and} \quad \mathrm{GM}_i = \Big( \prod_{j=1}^{n} X_{ij} \Big)^{1/n},    (3)

respectively; it is well known that AM_i ≥ GM_i, with equality holding if and only if X_{i1} = X_{i2} = ... = X_{in}. The relevance criterion for each feature i is R_i = AM_i / GM_i, with R_i ∈ [1, +∞). However, if a given feature has at least one zero occurrence, its GM is zero, making this criterion useless. To overcome this problem, we apply the exponential function to each feature, yielding

    R'_i = \frac{\mathrm{AM}'_i}{\mathrm{GM}'_i} = \underbrace{\frac{1}{n} \sum_{j=1}^{n} \exp(X_{ij})}_{a} \Bigg/ \underbrace{\Big( \prod_{j=1}^{n} \exp(X_{ij}) \Big)^{1/n}}_{g},    (4)

with 0 < g ≤ a. Taking the logarithm of (4) keeps the ranking of the set of features; after some algebraic manipulation and dropping constant terms, we obtain our feature dispersion (FD) criterion

    \mathrm{FD}_i = \log\Big( \sum_{j=1}^{n} \exp(X_{ij}) \Big) - \frac{1}{n} \sum_{j=1}^{n} X_{ij}.    (5)

In many cases, each BoW feature has values close to zero; in this case, using the linear approximation \exp(x) \approx 1 + x and defining, for convenience, S_i = \sum_{j=1}^{n} X_{ij}, the expression in (5) becomes

    \mathrm{FD}_i \approx \log(n + S_i) - \frac{S_i}{n},    (6)

which is quite efficient to compute, since it only requires the sum of each feature. In the special case of TDI matrices, S_i becomes the l_0 norm of feature i (its number of non-zero entries).
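As a concrete illustration of the criterion (our own sketch, not the authors' implementation; the function names and the toy matrix are invented), the code below scores each feature (row) of a small TD matrix with the exact expression (5) and the cheap approximation (6), and keeps the top-ranked features in the same filter fashion used for the criteria of Section 2.2.

```python
# Illustrative sketch (ours, not the authors' code): the feature dispersion
# (FD) criterion of Eqs. (5)-(6), computed row-wise on a p x n TD matrix.
import numpy as np
import scipy.sparse as sp

def fd_exact(X_dense):
    """Eq. (5): FD_i = log(sum_j exp(X_ij)) - (1/n) sum_j X_ij, per row."""
    X_dense = np.asarray(X_dense, dtype=float)
    n = X_dense.shape[1]
    rowmax = X_dense.max(axis=1, keepdims=True)        # stabilized log-sum-exp
    lse = rowmax.ravel() + np.log(np.exp(X_dense - rowmax).sum(axis=1))
    return lse - X_dense.sum(axis=1) / n

def fd_approx(X):
    """Eq. (6): FD_i ~ log(n + S_i) - S_i/n, with S_i the sum of row i.
    Needs only the per-feature sums; for a binary TDI matrix, S_i is the
    number of documents in which term i occurs."""
    n = X.shape[1]
    S = np.asarray(X.sum(axis=1), dtype=float).ravel()
    return np.log(n + S) - S / n

def top_m(scores, m):
    """Filter selection: indices of the m features with the largest score."""
    return np.argsort(scores)[::-1][:m]

# Toy p x n TD matrix: 3 features (rows) observed on 5 documents (columns).
X = sp.csr_matrix([[0., 2., 0., 1., 0.],
                   [1., 1., 1., 1., 1.],
                   [0., 0., 0., 0., 3.]])

print(np.round(fd_exact(X.toarray()), 3))   # exact FD scores, Eq. (5)
print(np.round(fd_approx(X), 3))            # approximation from row sums only
print(top_m(fd_exact(X.toarray()), 2))      # keep the 2 highest-ranked features
```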

4 Experimental Evaluation

This section reports the experimental results of our technique for text classification, based on both standard and binary BoW representations. Our method is evaluated by the test set error rate obtained by a linear support vector machine (SVM) classifier, provided by the PRTools toolbox [12] (http://www.prtools.org/prtools.html). We use the SpamBase and the Dexter datasets from the UCI Repository (http://archive.ics.uci.edu/ml/datasets.html); the SpamBase task is to classify email messages as spam or non-spam, whereas the Dexter task is to classify Reuters articles as being about corporate acquisitions or not.

Table 1: Standard datasets SpamBase and Dexter. p and n are the number of features and patterns, respectively; l_0p and l_0n are the average l_0 norm of each feature and pattern, respectively; (-1, +1) are the numbers of patterns per class.

Dataset    p      l_0p   l_0n   Partition   n      (-1, +1)
SpamBase   54     841.2  9.8    --          4601   (1813, 2788)
Dexter     20000  1.4    94.1   Train       300    (150, 150)
                                Test        2000   (1000, 1000)
                                Valid.      300    (150, 150)

In the SpamBase dataset, we have used the first 54 features, which constitute a BoW. The Dexter dataset has 10053 additional distractor features (independent of the class) at random locations, and was created for the NIPS 2003 FS challenge (http://www.nipsfsc.ecs.soton.ac.uk). We train with a random subset of 200 patterns and evaluate on the validation set, since the labels for the test set are not publicly available; the results on the validation set correlate well with the results on the test set [3].
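For reference, a protocol of this kind can be outlined as follows (an illustrative sketch under our own assumptions, not the authors' PRTools/Matlab setup: scikit-learn's LinearSVC stands in for the linear SVM, random stratified splits replace the fixed Dexter partitions, and fd_approx is the scorer from the previous sketch; X is an n × p pattern matrix and y the label vector).

```python
# Illustrative outline (not the authors' PRTools setup): unsupervised filter
# FS followed by a linear SVM, averaged over random train/test replications.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def error_rate_with_fs(X, y, m, score_fn, n_reps=10, seed=0):
    """Average test error when keeping the m top-ranked features."""
    errs = []
    for r in range(n_reps):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed + r, stratify=y)
        # Unsupervised scores, computed on the training features only
        # (features are rows of the transposed pattern matrix).
        scores = score_fn(X_tr.T)
        keep = np.argsort(scores)[::-1][:m]          # top-m features
        clf = LinearSVC().fit(X_tr[:, keep], y_tr)
        errs.append(1.0 - clf.score(X_te[:, keep], y_te))
    return float(np.mean(errs))

# Hypothetical usage, comparing the baseline (all features) with FD selection:
# err_all = error_rate_with_fs(X, y, m=X.shape[1], score_fn=fd_approx)
# err_fd  = error_rate_with_fs(X, y, m=30, score_fn=fd_approx)
```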

We evaluate the test set error rate on the SpamBase and Dexter datasets, using both the TD and TDI representations. The reported results are averages over ten replications of the training/testing partition.

Fig. 1 shows the average test set error rates of the linear SVM classifier for the SpamBase dataset (TD and TDI matrices) using supervised and unsupervised FS methods, as functions of the number of features m. The horizontal dashed line corresponds to the classifier trained without FS (baseline error).

[Fig. 1: Test set error rates for the SpamBase dataset, with TD and TDI matrices, for 20 ≤ m ≤ 54 (average of ten train/test replications). Left panel: "SpamBase TD with linear SVM"; right panel: "SpamBase TDI with linear SVM".]

We see that the proposed method, FD, is able to perform better than the supervised methods.

Fig. 2 shows the average test set error rates of the linear SVM classifier for the Dexter dataset, on the TD and TDI matrices, respectively. On the TD matrix, our method attains about the same test error rate as the competing methods; on the TDI matrix, it attains a lower test error rate.

[Fig. 2: Test set error rates for the Dexter dataset, with TD and TDI matrices, for 1500 ≤ m ≤ 4500 (average of ten train/test replications). Left panel: "Dexter TD with linear SVM"; right panel: "Dexter TDI with linear SVM".]

5 Conclusions

In this paper, we have proposed an efficient unsupervised method for feature selection on sparse data, based on a simple analysis of the input training data, regardless of the label of each pattern. We compared our method with supervised and unsupervised techniques, on standard datasets with floating point and binary features. The experimental results show that the proposed method works equally well on both types of features, reducing the number of features and improving classification accuracy, in some cases performing better than supervised feature selection methods. The proposed method can be applied to multi-class problems without modification, as we intend to do in future work.

Acknowledgements

This work was partially supported by the Fundação para a Ciência e Tecnologia (FCT), under grants PTDC/EEA-TEL/72572/2006 and SFRH/BD/45176/2008.

References

[1] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2nd edition, 2009.
[2] T. Joachims. Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers, 2001.
[3] I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh (editors). Feature Extraction, Foundations and Applications. Springer, 2006.
[4] F. Escolano, P. Suau, and B. Bonev. Information Theory in Computer Vision and Pattern Recognition. Springer, 2009.
[5] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[6] G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289-1305, 2003.
[7] H. Kim, P. Howland, and H. Park. Dimension reduction in text classification with support vector machines. Journal of Machine Learning Research, 6:37-53, 2005.
[8] K. Torkkola. Discriminative features for text document classification. Pattern Analysis and Applications, 6(4):301-308, 2003.
[9] L. Liu, J. Kang, J. Yu, and Z. Wang. A comparative study on unsupervised feature selection methods for text clustering. In IEEE International Conference on Natural Language Processing and Knowledge Engineering, pages 597-601, 2005.
[10] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, August 2005.
[11] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[12] R. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, D. Tax, and S. Verzakov. PRTools4.1, A Matlab Toolbox for Pattern Recognition. Technical report, Delft University of Technology, 2007.