Using Gini-index for Feature Weighting in Text Categorization

Journal of Computational Information Systems 9: 14 (2013) 5819-5826
Available at http://www.jofcis.com

Weidong ZHU 1,*, Yongmin LIN 2

1 School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
2 College of Economics and Management, Hebei Polytechnic University, Tangshan 063009, China

* Corresponding author. Email address: wdzhu@bjtu.edu.cn (Weidong ZHU).
1553-9105 / Copyright 2013 Binary Information Press. DOI: 10.12733/jcis6629. July 15, 2013.

Abstract

With the rapid development of the World Wide Web, text categorization has come to play an important role in organizing and processing large amounts of text data. As a simple, intuitive feature weighting method with high processing speed, TF-IDF is widely used in document classification. However, it simply assumes that low-frequency words are important and high-frequency words are unimportant, which may not reflect the actual usefulness of words and decreases classification precision. In this paper, a Gini-index based feature weighting method is presented that addresses this problem. Experiments show that the TF-GINI method achieves better classification performance.

Keywords: Text Categorization; Feature Selection; Gini-index; Feature Weighting; VSM

1 Introduction

Automatic text classification is a supervised learning task: a classifier learns from a training document set with predefined class labels and assigns class labels to new documents. With the rapid development of network technology and digital libraries, the volume of online documents grows quickly, and automatic text classification has become a key technique for processing and organizing high-volume document data. Existing text classification methods are based on statistical theory and machine learning; well-known methods include Support Vector Machines (SVM), Naive Bayes, k-Nearest Neighbor (kNN), Neural Networks (NNet), Boosting, and Linear Least Squares Fit (LLSF).

In most classifiers, the vector space model is used to describe a document: the document is treated as a vector in the feature space, and the measurement in its coordinate system uses the TF-IDF value proposed by Salton in 1988. Term frequency (TF) is the number of times a word appears in the document, and the inverse document frequency is IDF(t) = log(N / N_t), where t is the word, N is the total number of documents in the training set, and N_t is the number of training documents containing t. Using the product of term frequency (TF) and inverse document frequency (IDF) as the coordinate value has advantages such as simplicity, intuitiveness and fast processing, so it has been widely adopted in document categorization. However, the TF-IDF method [1-3] simply assumes that low-frequency words are important and high-frequency words are unimportant, which may not reflect the actual usefulness of words and decreases classification accuracy.
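As a concrete illustration of the TF-IDF scheme just described, the following is a minimal sketch (not the authors' code) that computes TF-IDF values for a toy tokenized corpus, assuming natural logarithms; the function name and example documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Weight each word of each document by TF * IDF, with IDF(t) = log(N / N_t)."""
    N = len(documents)                 # total number of training documents
    doc_freq = Counter()               # N_t: number of documents containing word t
    for doc in documents:
        doc_freq.update(set(doc))

    weighted = []
    for doc in documents:
        tf = Counter(doc)              # TF: times a word appears in this document
        weighted.append({t: tf[t] * math.log(N / doc_freq[t]) for t in tf})
    return weighted

# Toy corpus: each document is a list of tokens.
docs = [["trade", "oil", "oil"], ["trade", "grain"], ["oil", "ship"]]
print(tf_idf_weights(docs))
```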

This paper presents a different feature weighting solution, TF-GINI, based on the Gini-index. According to the distribution of classes among the documents containing a feature, we calculate the Gini-index of the feature and then weight the feature by the product of its term frequency (TF) and its Gini-index. This fully accounts for the feature's capability to distinguish different classes without increasing the computational complexity. Comparative experiments on the Reuters-21578 corpus and on the Chinese corpus from the International Database Center of the Department of Computer and Information Technology, Fudan University, show that the TF-GINI method achieves better classification performance than TF-IDF without increasing the time complexity of the computation.

The remainder of this paper is organized as follows. Section 2 discusses the defect of the TF-IDF weighting scheme. Section 3 presents the Gini-index based TF-GINI feature weighting scheme. Section 4 reports the experimental results and analysis. The last section concludes and outlines future work.

2 Analysis of the TF-IDF Weighting Solution

The vector space model is currently one of the simplest and most efficient models for describing text. Its basic scheme is as follows. A natural language document is represented as D = D(t_1, w_1; t_2, w_2; ...; t_N, w_N), where t_i is a feature selected from document D, w_i is the weight of that feature, and 1 <= i <= N. To simplify the analysis, the order of the features t_k in the document is usually not considered and the t_k must be distinct, i.e. there is no repetition. The features t_1, t_2, ..., t_N then form an N-dimensional coordinate system, w_1, w_2, ..., w_N are the corresponding coordinate values, and D(w_1, w_2, ..., w_N) is therefore a vector in this N-dimensional coordinate system. The measurement of the coordinate system adopts the TF-IDF value presented by Salton in 1988: term frequency (TF) is the number of times a word appears in the document, and the inverse document frequency is IDF(t) = log(N / N_t), where t is the word, N is the total number of documents in the training set, and N_t is the number of training documents containing t.

LU Yuchang analyzed the two basic assumptions of the TF-IDF scheme in [3]:

(1) If a word appears many times in one document, it also appears many times in other documents of the same class, and vice versa. Thus, using the term frequency TF as part of the measurement to reflect same-class features is acceptable.

(2) If a word appears less frequently, its capability to distinguish between classes is stronger. Thus, the concept of inverse document frequency IDF is introduced, and the product of TF and IDF is used as the measurement of the feature space coordinate system.

The work in [3] showed, from the perspective of word weighting and vector rotation, that the simple construction of IDF cannot effectively reflect the usefulness of a word. It proposed removing the factor P(W) from the Information Gain and Weight of Evidence for Text formulas, weighting words with the result, and demonstrated the effectiveness of this improvement experimentally. In [2], Joachims analyzed, using probability theory, the product of TF and IDF as the measurement of the feature space coordinate system and pointed out that it may not achieve high classification accuracy. He proposed a classification model lying between the traditional TF-IDF and Naive Bayes models. Examining the usefulness of features for classification, we found that TF-IDF weighting may assign a larger weight to a rare word without considering its class distribution characteristics, and such rare words can lead to invalid classification.

A simple case study can reveal this defect of TF-IDF. Consider the following situation: the training set contains 300 documents, of which 100 belong to class A and the remaining 200 belong to class B. Words t_1 and t_2 appear only in class B documents, with N_t1 = 200 and N_t2 = 100. Document D belongs to class B (D in B), both t_1 and t_2 appear in D, and TF(t_1) = TF(t_2). Weighting these two words with TF-IDF gives TF(t_1) * log(N/N_t1) < TF(t_2) * log(N/N_t2), so word t_2 receives the higher TF-IDF value because of its rareness. However, in this situation t_1 obviously has the stronger classification capability and contributes more to the classification. TF-IDF simply uses the inverse document frequency to weight a feature without considering its class distribution; this is the main reason for low classification accuracy after weighting.

3 Gini-index Based Text Feature Weighting Method

The Gini-index is an impurity-based splitting measure. It is applicable to categorical, binary, and continuous numeric fields. It was proposed by Breiman in 1984 and has been widely used in algorithms such as CART, SLIQ, SPRINT and the Intelligent Miner decision tree (IBM's data mining tool), achieving fairly good classification accuracy.

3.1 Gini-index principle

The measure is defined as follows. Suppose that S is a collection of s data samples whose class label attribute has m distinct values, defining m different classes C_i (i = 1, 2, ..., m). According to the value of the class label attribute, S can be divided into m subsets S_i (i = 1, 2, ..., m). If S_i is the subset of samples belonging to class C_i and s_i is the number of samples in S_i, then the Gini-index of the set S is

Gini(S) = 1 - \sum_{i=1}^{m} P_i^2    (1)

where P_i is the probability that an arbitrary sample belongs to C_i, estimated by s_i / s. Gini(S) attains its minimum of 0 when all records in the collection belong to the same class, indicating that the maximum useful information is obtained. When the samples are uniformly distributed over the classes, Gini(S) attains its maximum, indicating that the minimum useful information is obtained.

The original form of the Gini-index measures the impurity of an attribute for categorization: the smaller its value, i.e. the lower the impurity, the better the attribute. Conversely,

Gini'(S) = \sum_{i=1}^{m} P_i^2    (2)

measures the purity of an attribute for categorization: the larger its value, the greater the purity and the better the attribute.
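The two forms in Eqs. (1) and (2) can be sketched as follows; this is an illustrative sketch only, with invented function names and class-count data.

```python
def gini_impurity(class_counts):
    """Eq. (1): Gini(S) = 1 - sum_i P_i^2, where P_i is estimated by s_i / s."""
    s = sum(class_counts)
    return 1.0 - sum((s_i / s) ** 2 for s_i in class_counts)

def gini_purity(class_counts):
    """Eq. (2): the purity form, sum_i P_i^2; larger means purer."""
    s = sum(class_counts)
    return sum((s_i / s) ** 2 for s_i in class_counts)

print(gini_impurity([10, 0, 0]))   # 0.0 -> all samples in one class, maximum useful information
print(gini_impurity([5, 5]))       # 0.5 -> uniform distribution, minimum useful information
print(gini_purity([10, 0, 0]))     # 1.0 -> a perfectly pure set
```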

3.2 Gini-index based text feature weighting solution

The Gini-index is an excellent purity measure for a set, and the usefulness of a feature for classification can be measured by the feature's purity: the feature should be as pure as possible. If all the documents containing a feature belong to the same class, the feature is pure [4]. Thus, we adopt the purity of the feature in place of the inverse document frequency and present the TF-GINI weighting scheme. Specifically, after text feature selection, we estimate the probability of each class among the documents containing the feature t, and then calculate the Gini-index of the feature as

Gini(t) = \sum_{i=1}^{m} P(C_i | t)^2    (3)

The TF-GINI weight of feature t_k in document D_i, after normalization, is

w_{ik} = \frac{tf_{ik} \cdot Gini(t_k)}{\sqrt{\sum_{j=1}^{N} [tf_{ij} \cdot Gini(t_j)]^2}}    (4)

where w_{ik} is the weight of feature t_k in document D_i and tf_{ik} is the frequency with which t_k appears in D_i.

In [5], Shankar discussed text document feature selection and weight adjustment using the Gini-index principle: first, generate the class centroid vectors from all the words in the original feature space according to TF-IDF; then, calculate the Gini-index of every feature based on the class centroid vectors; finally, select the features with the larger Gini-index values according to a predefined number. That discussion, however, is limited to centroid-based classification. The method presented in this paper is quite different: we focus on feature weighting after feature selection, and the weighting scheme is suitable not only for centroid-based classification but also for other existing text classifiers.
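A minimal sketch of Eqs. (3) and (4), assuming the normalization in Eq. (4) is the usual cosine normalization over the features of a document; the toy training data and function names are invented and this is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def feature_gini(labeled_docs):
    """Eq. (3): Gini(t) = sum_i P(C_i | t)^2, estimated over documents containing feature t."""
    docs_with_t = defaultdict(Counter)          # feature -> class -> number of documents
    for tokens, label in labeled_docs:
        for t in set(tokens):
            docs_with_t[t][label] += 1
    gini = {}
    for t, per_class in docs_with_t.items():
        n_t = sum(per_class.values())
        gini[t] = sum((c / n_t) ** 2 for c in per_class.values())
    return gini

def tf_gini_weights(tokens, gini):
    """Eq. (4): w_ik = tf_ik * Gini(t_k), cosine-normalized over the features of the document."""
    tf = Counter(tokens)
    raw = {t: tf[t] * gini.get(t, 0.0) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {t: v / norm for t, v in raw.items()}

# Toy training set: (token list, class label).
train = [(["oil", "trade"], "B"), (["oil", "ship"], "B"), (["grain", "trade"], "A")]
g = feature_gini(train)
print(tf_gini_weights(["oil", "oil", "trade"], g))
```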

4 Experimental Results and Analysis

To further investigate the effect of the algorithm, we implemented it in VC++ 6.0; part of the source code comes from the text classifier source code provided by Li Ronglu of the Department of Computer and Information Technology, Fudan University.

4.1 Data sets

Two corpora were used in our experiments. One is the widely recognized standard English classification corpus Reuters-21578; the other is the Chinese corpus from the International Database Center of the Department of Computer and Information Technology, Fudan University.

The Reuters news corpus is the most widely used corpus in text classification research. The 1987 Reuters-21578 collection contains 21578 documents in total. We used the 10 most common classes, with a training set of 7951 documents and a test set of 2726 documents. After word root recovery and removal of unused words, 23281 words remain. Within this experimental set, the class distribution is uneven: 2875 documents belong to the largest class, 40.764% of the training documents, while only 170 documents belong to the smallest class, just 2.41% of the training documents.

The second corpus is the Chinese corpus from the International Database Center of the Department of Computer and Information Technology, Fudan University. It includes 19637 documents in total, divided into 20 classes. We used 10 of these classes, with a training set of 1882 documents and a test set of 900 documents. After word root recovery and removal of unused words, 35028 words remain. The class distribution of the training set is also uneven: 338 documents belong to the politics class, 17.96% of the training set, while the environment class has only 134 documents, just 7.12% of the training set.

4.2 Classifier

We used the fKNN classifier, whose discriminant function is the FSWF rule presented in [6, 9]:

\mu_j(X) = \frac{\sum_{i=1}^{k} \mu_j(X_i) \, sim(X, X_i) / (1 - sim(X, X_i))^{2/(b-1)}}{\sum_{i=1}^{k} 1 / (1 - sim(X, X_i))^{2/(b-1)}}    (5)

where j = 1, 2, ..., c, and \mu_j(X_i) is the membership value of the known sample X_i in class j: if X_i belongs to class j, then \mu_j(X_i) is 1, otherwise 0. In fKNN, k is determined by parameter training and optimization: for Reuters-21578, k is 40; for the Chinese document set, k is 10. The parameter b is equal to 2.
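A minimal sketch of the fKNN decision rule in Eq. (5), assuming crisp neighbor memberships (mu_j(X_i) is 1 for the neighbor's own class) and b = 2 as in the paper; the neighbor similarities below are invented for illustration.

```python
def fknn_membership(neighbors, num_classes, b=2.0, eps=1e-9):
    """Eq. (5): fuzzy kNN membership of test sample X in each class j.

    neighbors: list of (class_index, similarity) for the k nearest training samples.
    """
    exponent = 2.0 / (b - 1.0)
    numer = [0.0] * num_classes
    denom = 0.0
    for cls, sim in neighbors:
        w = 1.0 / max(1.0 - sim, eps) ** exponent   # 1 / (1 - sim(X, X_i))^(2/(b-1))
        numer[cls] += sim * w                       # mu_j(X_i) * sim(X, X_i) * w, with mu = 1 for cls
        denom += w
    return [n / denom for n in numer]

# k = 3 assumed neighbors: (class index, cosine similarity to X).
print(fknn_membership([(1, 0.92), (1, 0.85), (0, 0.80)], num_classes=2))
```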

4.3 Experimental results and analysis

We first selected text features using Information Gain, Expected Cross Entropy, the Weight of Evidence for Text, and the χ2 statistic [7], and then weighted the features with the two weighting schemes, TF-IDF and TF-GINI. To evaluate the feature weighting methods, we study performance from the following three aspects: (1) classification accuracy, measured by the internationally adopted Macro F1 and Micro F1 indices [8]; (2) the capability to handle uneven class distributions; (3) computational complexity.

Table 1 and Table 2 summarize the classification performance obtained when features are selected by Information Gain, weighted by TF-IDF and TF-GINI respectively, on the 10 most common classes of the Reuters-21578 corpus and on the Chinese document set, and classified with fKNN at different feature dimensionalities. From Tables 1 and 2 we see that, for both the Chinese corpus and the English corpus Reuters-21578, the TF-GINI weighting method outperforms TF-IDF in classification accuracy on both Macro F1 and Micro F1, whatever the number of selected features. For the Chinese corpus, within 2000 feature dimensions, the best Micro F1 of TF-GINI is 1.178% higher than that of TF-IDF, and the best Macro F1 is 1.248% higher. For Reuters-21578, averaged over the different feature dimensionalities, TF-GINI is 1.465% higher than TF-IDF on Micro F1 and 0.903% higher on Macro F1. Moreover, the standard deviation of the classification accuracy on both corpora is much smaller than with the TF-IDF method. This fully demonstrates the effectiveness of our improvement: the TF-GINI method, without increasing the computational complexity, not only increases classification accuracy under uneven class distributions on the English and Chinese corpora, but also reduces the sensitivity to the number of feature dimensions to a certain degree.

Table 1: Classification performance of the two weighting methods on the Chinese training set
(Note: X and S denote the mean and standard deviation of Macro F1 and Micro F1 on the Chinese corpus over the different feature numbers.)

number | TF-IDF maF1 | TF-IDF miF1 | TF-GINI maF1 | TF-GINI miF1
500    | 89.465 | 89.400 | 91.073 | 92.749
1000   | 91.702 | 91.756 | 92.950 | 92.934
2000   | 89.912 | 89.829 | 92.575 | 92.505
3000   | 89.969 | 89.829 | 92.852 | 92.719
4000   | 89.285 | 88.865 | 92.598 | 92.398
5000   | 87.377 | 87.045 | 92.493 | 92.291
6000   | 87.320 | 87.045 | 92.359 | 92.184
8000   | 86.214 | 85.225 | 92.026 | 91.649
10000  | 85.634 | 84.582 | 91.928 | 91.542
X      | 88.554 | 88.175 | 92.317 | 92.330
S      | 2.001  | 2.357  | 0.575  | 0.479

Table 3 and Table 4 summarize the classification performance of the TF-IDF and TF-GINI weighting schemes after three other major feature selection schemes: Expected Cross Entropy, the Weight of Evidence for Text, and the χ2 statistic. After parameter optimization training, 2000 feature dimensions are used for the Chinese corpus and 1000 feature dimensions for Reuters-21578. Combining the information in Table 3 and Table 4, we find that, whichever feature selection method is used, the TF-GINI weighting method always outperforms TF-IDF on both Macro F1 and Micro F1.
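For reference, the Macro F1 and Micro F1 measures reported in the tables can be computed as in the following sketch, using their standard definitions; this is not the authors' evaluation code, and the example labels are invented.

```python
def f1_scores(true_labels, pred_labels, classes):
    """Macro F1 (average of per-class F1) and Micro F1 (from pooled counts)."""
    per_class_f1 = []
    tp_all = fp_all = fn_all = 0
    for c in classes:
        tp = sum(1 for t, p in zip(true_labels, pred_labels) if t == c and p == c)
        fp = sum(1 for t, p in zip(true_labels, pred_labels) if t != c and p == c)
        fn = sum(1 for t, p in zip(true_labels, pred_labels) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class_f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro_f1 = sum(per_class_f1) / len(classes)
    micro_p = tp_all / (tp_all + fp_all) if tp_all + fp_all else 0.0
    micro_r = tp_all / (tp_all + fn_all) if tp_all + fn_all else 0.0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    return macro_f1, micro_f1

print(f1_scores(["A", "B", "B", "A"], ["A", "B", "A", "A"], classes=["A", "B"]))
```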

Table 2: Classification performance of the two weighting methods on the Reuters-21578 set
(Note: X and S denote the mean and standard deviation of Macro F1 and Micro F1 on the 10 most common classes of Reuters-21578 over the different feature numbers.)

number | TF-IDF maF1 | TF-IDF miF1 | TF-GINI maF1 | TF-GINI miF1
500    | 68.726 | 86.867 | 68.887 | 86.949
1000   | 67.354 | 86.207 | 67.792 | 86.610
2000   | 65.958 | 84.483 | 67.190 | 86.244
3000   | 65.379 | 83.969 | 66.852 | 86.060
4000   | 65.341 | 83.712 | 66.421 | 85.547
5000   | 64.797 | 83.529 | 66.308 | 85.400
6000   | 65.232 | 83.456 | 66.470 | 85.437
8000   | 65.709 | 83.419 | 66.146 | 84.996
10000  | 65.776 | 83.125 | 66.332 | 85.003
X      | 66.030 | 84.307 | 66.933 | 85.772
S      | 1.237  | 1.331  | 0.900  | 0.640

Table 3: Performance of the two weighting methods with three feature selection methods on the Chinese corpus

method    | TF-IDF maF1 | TF-IDF miF1 | TF-GINI maF1 | TF-GINI miF1
CroEntTxt | 90.015 | 89.722 | 95.358 | 95.182
χ2        | 90.663 | 90.578 | 93.640 | 93.362
WeiEviTxt | 90.297 | 90.043 | 92.741 | 92.505

Table 4: Performance of the two weighting methods with three feature selection methods on the Reuters-21578 corpus

method    | TF-IDF maF1 | TF-IDF miF1 | TF-GINI maF1 | TF-GINI miF1
CroEntTxt | 66.412 | 85.070 | 67.963 | 86.427
χ2        | 66.412 | 85.070 | 67.963 | 86.427
WeiEviTxt | 67.151 | 85.950 | 67.934 | 86.427

5 Conclusions

In this paper, we have studied text document feature weighting based on the feature's Gini-index. The study includes experiments and analysis comparing the Gini-index based weighting method TF-GINI with the TF-IDF method from three aspects: classification accuracy, the ability to process document sets with unevenly distributed classes, and computational complexity. The results show that the Gini-index based text feature weighting method is a very promising feature weighting method.

References

[1] T. Roelleke, J. Wang. TF-IDF uncovered: a study of theories and probabilities. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), pp. 435-442, 2008.
[2] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of the 14th International Conference on Machine Learning (ICML 97), pp. 143-151, 1997.
[3] Zhang Xiaoyan, Wang Ting, Liang Xiaobo, Ao Feng, Li Yan. A class-based feature weighting method for text classification. Journal of Computational Information Systems, 8(3): 965-972, March 2012. ISSN: 1553-9105.
[4] Shang Wenqian, Huang Houkuan, Liu Yuling, Lin Yongmin. Research on the algorithm of feature selection based on Gini index for text categorization. Computer Research and Development, 43(10): 1688-1694, October 2006.
[5] S. Shankar, G. Karypis. A feature weight adjustment algorithm for document categorization. In: KDD 2000, Boston, 2000.
[6] Lin Yongmin, Zhu Weidong, Shang Wenqian. Improvement of the decision rule of kNN text categorization. Computer Research and Development, 42: 378-382, 2005.
[7] Li Kairong, Chen Guixiang, Cheng Jilin. Research on hidden Markov model-based text categorization process. International Journal of Digital Content Technology and its Applications, 5(6): 244-251, June 2011.
[8] Li Kunlun, Xie Jing, Sun Xue. Multi-class text categorization based on LDA and SVM. Procedia Engineering, 15: 1963-1967, 2011.
[9] Shang Wenqian, Huang Houkuan, Zhu Haibin, Lin Yongmin. A novel feature selection algorithm for text categorization. Expert Systems with Applications, 33(1): 1-5, July 2007.