A Feature Selection Method to Handle Imbalanced Data in Text Classification

Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2
1 School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, 100876, China
2 Computing and Data Processing Center, GRUEEX Ltd., Zagreb, Croatia
haixiang189@163.com

ABSTRACT: The imbalanced data problem is often encountered in applications of text classification. Feature selection, which reduces the dimensionality of the feature space and improves classifier performance, is widely used in text classification. This paper presents a new feature selection method, named NFS, which selects words that carry class information rather than terms with high document frequency. To improve classifier performance further, we combine the feature selection method (NFS) with a data resampling technique to address the imbalanced data problem. Experiments were conducted on the Reuters-21578 collection, and the results show that the NFS method performs better than chi-square statistics and mutual information on the original dataset when the number of selected features is greater than 1000. The maximum Macro-F1 value is 0.7792 when the NFS method is applied to the resampled dataset, an increase of 4.02% in Macro-F1 over the original dataset. Thus, our proposed method effectively improves minority-class performance.

Subject Categories and Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing - Text Analysis; I.5.4 [Pattern Recognition]: Applications - Text Processing

General Terms: Algorithm, Performance

Keywords: Text classification, Feature selection, Imbalanced data

Received: 22 January 2015, Revised 2 March 2015, Accepted 11 March 2015

1. Introduction

The volume of digital documents available online is steadily increasing, and text classification is a key technology for processing and organizing text data. A major problem in text classification is the high dimensionality of the feature space, which can easily reach hundreds of thousands of features. This can degrade classifier performance because of redundant and irrelevant terms. An effective way to address this problem is to reduce the dimensionality of the feature space [1, 2]. Dimension reduction methods include feature extraction and feature selection. In feature extraction, a new feature set is generated by combining or transforming the original features. In feature selection, a subset is chosen from the original set without transforming the feature space. Three kinds of methods are used to perform feature selection: embedded, wrapper, and filter. Embedded and wrapper methods rely on a learning algorithm, whereas filter methods are independent of such algorithms. Because filter methods are simpler and have lower computational complexity than the other two, they have been widely used in text classification. Many filter methods have been proposed, including the improved Gini index, information gain (IG), chi-square statistics (CHI), document frequency (DF), orthogonal centroid feature selection (OCFS), and the DIA association factor [3]. The new feature selection (NFS) method proposed in this paper is a filter method as well.

Problems of imbalanced data are often observed in text classification because the number of positive samples is usually considerably smaller than the number of negative samples. Imbalanced data generally cause classifiers to perform poorly on the minority class.
The goal of text classification on imbalanced datasets is to improve minority-class classification performance without degrading the overall classification performance of the classifier. A number of solutions have been proposed for this problem at both the data and algorithmic levels. Data-level solutions include oversampling the minority-class samples and undersampling the majority-class samples. Algorithmic-level solutions include designing new algorithms or optimizing existing ones to improve classifier performance. Feature selection can be even more important than the classification algorithm in highly imbalanced situations [4]. Selecting words with strong class information can improve classifier performance; for example, the word "football" usually appears in the class "sports" [5].

In this paper, we propose an NFS method that differs from many existing feature selection methods. With this method, we select terms that carry class information; we then combine the NFS method with data resampling to improve classification on imbalanced data. The effectiveness of the proposed method is demonstrated by experimental results.

2. Method

Many feature selection methods have been used extensively in text classification [1-8], such as mutual information (MI), IG, and CHI. CHI and MI are considered in this study and are described as follows. Let t be any word; its presence and absence in class c_i are characterized by four counts: A_i is the number of documents containing word t that belong to class c_i; B_i is the number of documents without word t that belong to class c_i; C_i is the number of documents containing word t that do not belong to class c_i; and D_i is the number of documents without word t that do not belong to class c_i. The CHI and MI metrics for a word t are defined by

CHI(t, c_i) = N (A_i D_i - B_i C_i)^2 / [(A_i + C_i)(A_i + B_i)(B_i + D_i)(C_i + D_i)]    (1)

MI(t, c_i) = log [A_i N / ((A_i + C_i)(A_i + B_i))]    (2)

where N is the total number of documents, N = A_i + B_i + C_i + D_i. The documents that belong to class c_i are called positive documents, whereas those that do not belong to class c_i are called negative documents.

2.1 Feature Selection Method Given Imbalanced Datasets
The proposed feature selection method is based on the following hypotheses:
(1) The more often a word t occurs in documents of class c_i, the better it represents this class.
(2) If a word t appears in most positive documents and rarely occurs in negative documents, the word has strong class-discriminative power.
(3) The higher the inverse class frequency of a word t, the greater the word's contribution to classification. Inverse class frequency is defined as

ICF(t, c) = ln(n / n_t)    (3)

where n is the total number of classes and n_t is the number of classes in which the word t appears.

The NFS score of a word t for class c_i is defined as

NFS(t, c_i) = (A_i / B_i) (A_i / C_i) ln(n / n_t)    (4)
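For illustration, these scores can be computed directly from the four document counts. The following minimal Python sketch implements Equations (1), (2), and (4) as reconstructed above; the smoothing constant eps is our own addition to avoid division by zero and is not part of the paper's formulation.

import math

def chi_mi_nfs(A, B, C, D, n_classes, n_t, eps=1e-12):
    # A: docs in class c_i containing t;  B: docs in c_i without t
    # C: docs outside c_i containing t;   D: docs outside c_i without t
    # n_classes: total number of classes (n); n_t: classes in which t appears
    N = A + B + C + D
    chi = N * (A * D - B * C) ** 2 / ((A + C) * (A + B) * (B + D) * (C + D) + eps)   # Eq. (1)
    mi = math.log(A * N / ((A + C) * (A + B) + eps) + eps)                           # Eq. (2)
    icf = math.log(n_classes / n_t)                                                  # Eq. (3)
    nfs = (A / (B + eps)) * (A / (C + eps)) * icf                                    # Eq. (4)
    return chi, mi, nfs

# Hypothetical counts: a word in 80 of 100 positive and 20 of 900 negative documents,
# appearing in 2 of 10 classes.
print(chi_mi_nfs(A=80, B=20, C=20, D=880, n_classes=10, n_t=2))

In practice such a score is computed for every vocabulary term and the top-ranked terms are kept as the feature set.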
2.2 Data Resampling on Imbalanced Data
To overcome the imbalance problem in text classification and to improve classifier performance, data resampling is applied to the imbalanced data. Options include oversampling and undersampling; the two can also be combined to reduce the degree of class imbalance. Oversampling techniques include random oversampling and the synthetic minority oversampling technique (SMOTE). Undersampling techniques include Tomek links, the edited nearest neighbor rule, and cluster-based undersampling [9]. In the present study, SMOTE is used to overcome the imbalanced data problem. In this approach, the minority class is oversampled by generating synthetic examples as follows: take the difference between the sample under consideration and one of its nearest neighbors, multiply this difference by a random number between 0 and 1, and add the result to the sample under consideration. The algorithm is presented as follows:

Algorithm SMOTE(T, N, k)
Input: Number of minority class samples T; amount of SMOTE N%; number of nearest neighbors k
Output: (N/100) * T synthetic minority class samples
1. (* If N is less than 100%, randomize the minority class samples, as only a random percent of them will be SMOTEd. *)
2. if N < 100
3.    then randomize the T minority class samples
4.         T = (N/100) * T
5.         N = 100
6. endif
7. numattrs = number of attributes
8. Sample[][]: array for the original minority class samples
9. newindex: counts the number of synthetic samples generated, initialized to 0
10. Synthetic[][]: array for the synthetic samples
    (* Compute k nearest neighbors for each minority class sample. *)
11. for i := 1 to T
12.    compute the k nearest neighbors of i and save the indices in nnarray
13.    Populate(N, i, nnarray)
14. endfor

Populate(N, i, nnarray)  (* Function to generate the synthetic samples. *)
15. while N ≠ 0
16.    choose a random number between 1 and k and call it nn; this selects one of the k nearest neighbors of i
17.    for attr := 1 to numattrs
18.       dif = Sample[nnarray[nn]][attr] - Sample[i][attr]
19.       gap = random number between 0 and 1
20.       Synthetic[newindex][attr] = Sample[i][attr] + gap * dif
21.    endfor
22.    newindex++
23.    N = N - 1
24. endwhile
25. return
(* End of Populate. *)

The algorithm is divided into three parts: randomizing the minority class samples if N < 100 (lines 2-6), computing the k nearest neighbors (KNN) for each minority class sample (lines 11-14), and generating the synthetic samples (lines 15-25) (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html). Figure 1 shows synthetic examples generated by SMOTE. In the figure, x_i represents the i-th minority class sample; x_i1, x_i2, x_i3, and x_i4 are the four nearest neighbors of x_i; and s_1, s_2, s_3, and s_4 are synthetic samples.

Figure 1. Synthetic examples
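As a concrete counterpart to the pseudocode, the sketch below generates synthetic minority samples in the same spirit, assuming dense feature vectors, an amount of SMOTE that is an integral multiple of 100%, and scikit-learn's NearestNeighbors for the neighbor search. It is not the authors' implementation; a ready-made alternative is the SMOTE class of the imbalanced-learn package.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(minority, N=200, k=5, rng=np.random.default_rng(0)):
    # Generate (N/100) * len(minority) synthetic samples (N assumed to be a multiple of 100).
    minority = np.asarray(minority, dtype=float)
    per_sample = N // 100
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    # Drop column 0, which is each sample itself.
    neighbors = nn.kneighbors(minority, return_distance=False)[:, 1:]
    synthetic = []
    for i, x in enumerate(minority):
        for _ in range(per_sample):
            neighbor = minority[rng.choice(neighbors[i])]  # one of the k nearest neighbors (line 16)
            gap = rng.random()                             # random number in [0, 1) (line 19)
            synthetic.append(x + gap * (neighbor - x))     # interpolate (lines 18-20)
    return np.vstack(synthetic)

# Toy usage: 200% SMOTE on a minority class of 20 two-dimensional points.
minority = np.random.default_rng(1).normal(size=(20, 2))
print(smote(minority, N=200, k=5).shape)  # (40, 2)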

3. Experiments and Analysis

3.1 Data Collection
To evaluate the performance of the proposed method, we used the Reuters-21578 collection with the ModApte split. The original collection contains 135 categories; in this study we consider only the subset of the top 10 categories.

3.2 Classifier
Many classification algorithms are available for text classification, such as naïve Bayes, KNN, and the support vector machine (SVM). In this experiment the SVM is used because of its simplicity and accuracy for this type of classification. The SVM [10, 11] is a popular machine learning method for data classification. It distinguishes classes by creating a decision boundary, a hyperplane that separates documents of different classes. Training can be cast as the optimization problem

min_{w, b, ξ} (1/2) w^T w + C Σ_{i=1..n} ξ_i
subject to y_i (w^T x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1, ..., n    (5)

where C is a penalty parameter and the ξ_i are slack variables. Additional details are provided by Vapnik (1995, 1998). The LIBSVM toolkit (Chang & Lin, 2001) with a radial basis function (RBF) kernel was used in the current study.
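A minimal sketch of this setup is given below, assuming scikit-learn (whose SVC wraps LIBSVM) and TF-IDF weighting over a pre-selected vocabulary. The documents, the selected terms, and the parameter values are illustrative stand-ins, since the paper does not report the exact preprocessing or SVM parameters.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy stand-ins for the Reuters-21578 ModApte split (top-10 categories).
train_docs = ["wheat exports rise", "crude oil prices fall", "corn harvest delayed"]
train_labels = ["grain", "crude", "grain"]
test_docs = ["oil output increases"]

# Restrict the vectorizer to a fixed vocabulary, e.g. the top-scoring NFS terms.
selected_terms = ["wheat", "crude", "oil", "corn", "harvest", "exports"]
vectorizer = TfidfVectorizer(vocabulary=selected_terms)
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# RBF-kernel SVM, as in the paper; C and gamma here are library defaults, not the paper's values.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, train_labels)
print(clf.predict(X_test))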

3.3 Evaluation Methodology
Micro-F1 and Macro-F1 were employed to evaluate the effectiveness of our method. F1 is computed from precision and recall; for a given class c_i, these are defined as

p(c_i) = TP_i / (TP_i + FP_i)    (6)
r(c_i) = TP_i / (TP_i + FN_i)    (7)

where TP_i is the number of documents correctly classified to class c_i, FP_i is the number of documents incorrectly classified to class c_i, and FN_i is the number of documents belonging to class c_i that are incorrectly classified to other classes.

Micro-F1 is the F1 measure computed over global precision and recall:

P_micro = Σ_{i=1..C} TP_i / (Σ_{i=1..C} TP_i + Σ_{i=1..C} FP_i)    (8)
R_micro = Σ_{i=1..C} TP_i / (Σ_{i=1..C} TP_i + Σ_{i=1..C} FN_i)    (9)
F1_micro = 2 R_micro P_micro / (R_micro + P_micro)    (10)

where C is the total number of classes. Macro-F1 is the average of the per-class F1 values:

F1_macro = (1/C) Σ_{i=1..C} 2 p(c_i) r(c_i) / (p(c_i) + r(c_i))    (11)

Micro-F1 is influenced mainly by classifier performance on common classes; by contrast, Macro-F1 is dominated by the rare classes [12].
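The sketch below computes these measures directly from per-class TP, FP, and FN counts, following Equations (6)-(11); the counts in the example are hypothetical.

def micro_macro_f1(tp, fp, fn):
    # tp, fp, fn: per-class true positive, false positive, and false negative counts
    f1_per_class = []
    for t, f_pos, f_neg in zip(tp, fp, fn):
        p = t / (t + f_pos) if (t + f_pos) else 0.0            # Eq. (6)
        r = t / (t + f_neg) if (t + f_neg) else 0.0            # Eq. (7)
        f1_per_class.append(2 * p * r / (p + r) if (p + r) else 0.0)
    p_micro = sum(tp) / (sum(tp) + sum(fp))                    # Eq. (8)
    r_micro = sum(tp) / (sum(tp) + sum(fn))                    # Eq. (9)
    f1_micro = 2 * r_micro * p_micro / (r_micro + p_micro)     # Eq. (10)
    f1_macro = sum(f1_per_class) / len(f1_per_class)           # Eq. (11)
    return f1_micro, f1_macro

# Hypothetical counts for three classes: one common class and two rare classes.
print(micro_macro_f1(tp=[90, 5, 3], fp=[10, 4, 5], fn=[8, 6, 7]))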

3.4 Experimental Results

3.4.1 Effectiveness of the NFS Method on the Original Dataset
We compare our feature selection method with two typical feature selection methods, CHI and MI. Figure 2 shows the Micro-F1 results on the imbalanced data without resampling. CHI performs well for small feature sets of 500 features or fewer. However, our method becomes the better option as the feature set grows: NFS outperforms both CHI and MI when the number of selected features is greater than 1000, and performance peaks when the number of selected features is 2000.

Figure 2. Micro-F1 results

Table 1 presents the detailed results. The best classifier performance among these feature selection methods is observed with NFS (0.8579).

Table 1. Micro-F1 results

Method   Number of features
         100      500      1000     1500     2000
MI       0.6475   0.6957   0.7151   0.7877   0.8071
CHI      0.7514   0.8441   0.8466   0.8447   0.8466
NFS      0.7513   0.8247   0.8528   0.8569   0.8579

Figure 3 displays the Macro-F1 results. The NFS-based Macro-F1 is especially good for large feature sets: although the method performs poorly on small feature sets, it reaches its maximum value when the number of selected features is 2000. CHI performs well on small feature sets, whereas MI is inferior to both CHI and NFS for every feature set size in the experiment.

Figure 3. Macro-F1 results

Table 2 presents the detailed Macro-F1 results. The best performance, 0.7390, is obtained with NFS, which outperforms CHI when the numbers of selected features are 1000, 1500, and 2000. MI performs poorly, particularly on small feature sets; with 100 selected features its Macro-F1 is only 0.1455.

Table 2. Macro-F1 results

Method   Number of features
         100      500      1000     1500     2000
MI       0.1455   0.2722   0.3175   0.4637   0.5136
CHI      0.5781   0.6820   0.7026   0.6901   0.6979
NFS      0.3246   0.5959   0.7235   0.7367   0.7390

3.4.2 Effectiveness of Data Resampling Technology on Imbalanced Data
To further verify the efficiency of the proposed method, we compare classifier performance on the original dataset and on the resampled dataset (the minority class is oversampled with SMOTE). Only the NFS method is used in this experiment. Table 3 shows the classification performance on the original and resampled datasets when the NFS method is employed.

Table 3. Performance of the SVM classifier when the NFS method is used on the original and resampled datasets

Number of features   Original dataset        Resampled dataset
                     Micro-F1   Macro-F1     Micro-F1   Macro-F1
500                  0.8247     0.5959       0.8395     0.6431
1000                 0.8528     0.7235       0.8677     0.7631
1500                 0.8569     0.7367       0.8700     0.7773
2000                 0.8579     0.7390       0.8721     0.7792

According to Table 3, the Micro-F1 performance of the SVM on the resampled dataset is superior to that on the original dataset for all feature set sizes. The maximum Micro-F1 value, 0.8721, is obtained on the resampled dataset when the number of selected features is 2000. Macro-F1 performance also improves as the number of features increases; the maximum Macro-F1 value, 0.7792, is likewise obtained on the resampled dataset and represents an improvement of 4.02% over the original dataset. Figures 4 and 5 plot all of the Micro-F1 and Macro-F1 values from Table 3 to provide an overall view of the results. Classifier performance improves on the resampled dataset for every feature set size considered in the experiment, particularly in terms of Macro-F1. These comparisons confirm that the data resampling technique used (SMOTE) can mitigate the imbalance problem and improve minority-class performance.

Figure 4. Micro-F1 results

Figure 5. Macro-F1 results

4. Conclusion

An NFS method is proposed in this paper. Unlike general feature selection methods such as MI and CHI, the NFS method chooses class-informative words rather than high-frequency words. The experimental results demonstrate the effectiveness of this method: compared with MI and CHI, the proposed method performs better when the number of selected features is above 1000. To improve classifier performance further, we combine the feature selection method (NFS) with a data resampling technique (SMOTE) to handle the imbalance problem in text classification. The experiments demonstrate the effectiveness of this combination: Macro-F1 on the resampled dataset increases by 4.02% over the original dataset when the NFS method is applied and the number of selected features is 2000.

Acknowledgement

This work was supported by the 111 Project of China under Grant No. B08004, the Key Project of the Ministry of Science and Technology of China under Grant No. 2011ZX03002-005-01, the National Natural Science Foundation of China (61273217), and the Ph.D. Programs Foundation of the Ministry of Education of China (20130005110004).

References

[1] Uğuz, H. (2011). A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowledge-Based Systems, 24 (7) 1024-1032.

[2] Pinheiro, R. H. W., Cavalcanti, G. D. C., Correa, R. F., et al. (2012). A global-ranking local feature selection method for text categorization. Expert Systems with Applications, 39 (17) 12851-12857.

[3] Yang, J., Liu, Y., Zhu, X., et al. (2012). A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization. Information Processing & Management, 48 (4) 741-754.

[4] Li, S. S., Zong, C. Q. (2005). A new approach to feature selection for text categorization. In: 2005 International Conference on Natural Language Processing and Knowledge Engineering, Wuhan, China: IEEE, p. 626-630.

[5] Zheng, Z. H., Wu, X. Y., Srihari, R. (2004). Feature selection for text categorization on imbalanced data. SIGKDD Explorations, 6 (1) 80-89.

[6] Sun, X., Liu, Y. H., Xu, M. T., et al. (2013). Feature selection using dynamic weights for classification. Knowledge-Based Systems, 37 (1) 541-549.

[7] Azam, N., Yao, J. T. (2012). Comparison of term frequency and document frequency based feature selection metrics in text categorization. Expert Systems with Applications, 39 (5) 4760-4768.

[8] Meng, J., Lin, H., Yu, Y. (2011). A two-stage feature selection method for text categorization. Computers & Mathematics with Applications, 62 (7) 2793-2800.

[9] Zhang, Y. Q., Zhang, D. L., Mi, G., et al. (2012). Using ensemble methods to deal with imbalanced data in predicting protein-protein interactions. Computational Biology and Chemistry, 36 (1) 36-41.

[10] Wen, L., Miao, D. Q., Wang, W. L. (2011). Two-level hierarchical combination method for text classification. Expert Systems with Applications, 38 (3) 2030-2039.

[11] Heng, W. C., Hong, L. L., Rajprasad, R., et al. (2012). A hybrid text classification approach with low dependency on parameter by integrating K-nearest neighbor and support vector machine. Expert Systems with Applications, 39 (15) 11880-11888.

[12] Figueiredo, F., Rocha, L., Couto, T., et al. (2011). Word co-occurrence features for text classification. Information Systems, 36 (5) 843-858.