A Novel Term_Class Relevance Measure for Text Categorization


D S Guru, Mahamad Suhil
Department of Studies in Computer Science, University of Mysore, Mysore, India

Abstract: In this paper, we introduce a new measure called Term_Class relevance to compute the relevancy of a term in classifying a document into a particular class. The proposed measure estimates the degree of relevance of a given term, in placing an unlabeled document as a member of a known class, as a product of Class_Term weight and Class_Term density, where the Class_Term weight is the ratio of the number of documents of the class containing the term to the total number of documents containing the term, and the Class_Term density is the relative density of occurrence of the term in the class to the total occurrence of the term in the entire population. Unlike other existing term weighting schemes such as TF-IDF and its variants, the proposed relevance measure takes into account the degree of relative participation of the term across all documents of the class relative to the entire population. To demonstrate the significance of the proposed measure, experimentation has been conducted on the 20 Newsgroups dataset. Further, the superiority of the novel measure is brought out through a comparative analysis.

Keywords: text categorization, term weight, term-document relevance, term_class relevance.

1. Introduction

For the past few decades, automatic content-based classification of documents from huge collections has been an active area of research, since electronic data over the internet has become unmanageably big and is increasing exponentially day by day. Manual, tag-based classification has lost its significance because of the huge size of the data that needs to be processed and the inability of tags to describe the content of the documents.
Varieties of applications of text classification which are in current demand, such as spam filtering in e-mails, classification of e-books, classification of news documents, classification of text data from social networks, and so on, have also led researchers to explore various ways of analyzing and representing these data so that quick and efficient retrieval and management of this huge data can be achieved.

1.1 A review of the available term weighting schemes

As our work focuses on the proposal of a new term weighting scheme, not on a classification framework, we consider here the literature on different term weighting schemes only. Terms are the basic information units of any text document, so all weighting schemes developed in the literature measure the weight of a term in representing the content of a document [1-5]. Based on whether or not the membership of the document in predefined categories is provided to measure the weight of a term, term weighting schemes are broadly classified into two classes, namely unsupervised term weighting schemes and supervised term weighting schemes. In the following subsections we provide a review of both kinds of weighting schemes along with the techniques which have adopted them.

1.1.1 Unsupervised term weighting schemes

Most of the unsupervised term weighting schemes come from the information retrieval (IR) field. These methods are very useful when the training documents are not labeled with their class labels. The traditional term weighting methods borrowed from IR, such as binary, term frequency (TF), TF-IDF, and its various variants, are unsupervised schemes [2]. The TF-IDF proposed by Jones [6, 7] and its variants are the most widely used term weighting schemes for text classification. Some of the variants of TF are raw term frequency, log(TF), log(TF+1), and log(TF)+1 [1-2]. If n is the number of documents containing the term and N is the number of documents in the collection, then the variants of IDF are 1/n, log(1/n), log(N/n), log(N/n)+1 and log(N/n-1) [1]. In [18], a novel inverse corpus frequency (ICF) based technique is proposed which computes the document representation in linear time.

1.1.2 Supervised term weighting schemes

Supervised term weighting schemes were developed especially for text categorization, owing to the fact that supervised knowledge of the class labels of the training samples is available [1-4]. All supervised term weighting schemes make use of this class information in different ways. Supervised term weighting schemes are further classified into subcategories, based on whether the weight estimates the relevancy of a term in preserving document content or the relevancy of a term in placing a document as a member of a class. So, it is more apt to call the weighting schemes which measure the relevance of a term in preserving the document content term-document relevance measures, and those which measure the relevance of a term in categorizing a document term_class relevance measures.
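For concreteness, the TF and IDF variants listed in Section 1.1.1 can be sketched as follows (a minimal illustration of ours, not code from any of the cited works):

```python
import math

def tf_variants(tf):
    """Common TF variants for a raw in-document frequency tf."""
    return {
        "raw": tf,
        "log_plus_1": math.log(tf) + 1 if tf > 0 else 0.0,
    }

def idf_variants(n, N):
    """IDF variants for a term occurring in n of N documents."""
    return {
        "inverse": 1.0 / n,
        "log": math.log(N / n),
        "log_plus_1": math.log(N / n) + 1,
    }

# TF-IDF weight of a term with tf = 3, occurring in 10 of 1000 documents
w = tf_variants(3)["log_plus_1"] * idf_variants(10, 1000)["log"]
```

Note that a term appearing in every document gets log(N/n) = 0, i.e. no discriminative weight.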
Term-Document Relevance measures: These measures are useful for selecting a discriminating subset of terms for representing a document, by weighting the terms according to their relevance in preserving the content of the document. They are created by replacing the IDF component of the TF-IDF scheme. The techniques most frequently used to replace IDF include the chi-square measure (X2), Information Gain (IG), Gain Ratio, Mutual Information (MI) and Odds Ratio (OR) [1-4, 8-12]. Over the past few years, many researchers have proposed alternative term-document relevance schemes [1, 13-16]. All of these are basically feature selection techniques used in term weighting schemes. In [14], a comparison of corpus-based and class-based keyword selection is presented, using TF-IDF as the weighting scheme. In [4], a class-indexing-based term weighting for automatic text classification is proposed; an inverse class space density frequency (ICSdF) is used along with the TF-IDF method, which provides a positive discrimination on infrequent and frequent terms.

Term_Class Relevance measures: These measures compute the ability of a term to classify a document as a member of a class. To the best of our knowledge, only one work of this category has been proposed, by Isa et al. [20], using the Bayes posterior probability. Though some works make use of the Bayes probability for representation, they have not clearly stated the advantage of the measure in classification [11, 18]. After [20], this measure was extensively used for term weighting [21, 22]. The beauty of this measure lies in the fact that, instead of computing the weight of a term in preserving the content of a document, the relevancy of the term in categorizing the document as a member of a class can be measured directly. It is computed as the Bayes posterior probability P(C_i | t_j) for a class C_i and term t_j, given by

P(C_i | t_j) = P(t_j | C_i) * P(C_i) / P(t_j)

where,

P(C_i) = Total_#_of_Words_in_C_i / Total_#_of_Words_in_Training_Dataset,

P(t_j) = occurrence_of_t_j_in_all_categories / occurrence_of_all_terms_in_all_categories, and

P(t_j | C_i) = occurrence_of_t_j_in_C_i / occurrence_of_all_terms_in_C_i.

To exploit the complete advantage of this relevance measure, Isa et al. [20] also propose a text representation scheme which works with a reduced dimension for each document at the time of representation itself. This work happened to be the very first of its kind in the literature of text classification where a document is represented with a number of dimensions equal only to the number of classes in the corpus, without any dimensionality reduction technique applied. In this representation scheme, first, a matrix F of size m x k is created for every document, where m is the number of terms assumed to be available in the document and k is the number of classes.
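Before continuing with the representation scheme, the Bayes posterior weight P(C_i | t_j) defined above can be sketched from raw per-class term counts (the function and data layout are our own illustration, not code from [20]):

```python
def bayes_posterior(counts, term, cls):
    """P(cls | term) = P(term | cls) * P(cls) / P(term), from per-class term counts."""
    words_in_cls = sum(counts[cls].values())
    total_words = sum(sum(c.values()) for c in counts.values())
    term_total = sum(c.get(term, 0) for c in counts.values())
    if term_total == 0:
        return 0.0
    p_c = words_in_cls / total_words                        # P(C_i)
    p_t = term_total / total_words                          # P(t_j)
    p_t_given_c = counts[cls].get(term, 0) / words_in_cls   # P(t_j | C_i)
    return p_t_given_c * p_c / p_t

counts = {"sport": {"ball": 8, "game": 2},
          "politics": {"vote": 9, "game": 1}}
print(bayes_posterior(counts, "ball", "sport"))  # 1.0: "ball" occurs only in "sport"
```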
Then, every entry F(j, i) of the matrix is filled with the relevancy of the corresponding term t_j in classifying the corresponding document as a member of class C_i. Then, a feature vector f of dimension k is created as a representative for the document, where f(i) is the average of the relevancies of every term to the class C_i. It shall be carefully observed here that a document with any number of terms is represented with a feature vector of dimension equal to the number of classes in the population, which is very small in contrast to the feature vector created in any other vector space representation scheme, where the dimension is equal to the total number of terms due to all documents of the population. Therefore, a great amount of dimensionality reduction is achieved at the time of representation itself, without the application of any dimensionality reduction technique. However, the classification accuracy accomplished is not that high. Motivated by this work, in this paper we propose a novel term_class relevance measure with the following objectives:
- Exploiting the complete advantage of the text representation scheme proposed by Isa et al. [20].
- Comparison of the effectiveness of the proposed term_class relevance measure with that of the Bayes posterior probability based measure.
- Isa et al. [20] make use of SVM as the classifier; so we also investigate the effect of SVM on our proposed relevance measure and compare it with other available classifiers.

The rest of the paper is organized as follows. The proposed term_class relevance measure is presented in Section 2. Section 3 presents the results and a discussion of the experimentation. A comparative analysis of the proposed relevance measure with other contemporary works is given in Section 4. Finally, Section 5 presents the conclusion and future enhancements.

2. A New Term_Class Relevance Measure

In this section, we propose a novel measure called the term_class relevance measure. Term_class relevancy is defined as the ability of a term t_j in classifying a document D as a member of a class C_i. We begin by introducing two new concepts which decide the role of a term in a class, namely Class_Term Weight and Class_Term Density.
Class_Term Weight: It is the relative weight of the term with respect to a class of interest, computed by counting only those documents of the class of interest that contain the term of interest against those of the entire corpus. That is, the class_term weight of a term t_j in the class C_i is computed as the ratio of ClassFrequency(t_j, C_i) to CorpusFrequency(t_j), as given by the equation below.

Class_TermWeight(t_j, C_i) = ClassFrequency(t_j, C_i) / CorpusFrequency(t_j)

where ClassFrequency(t_j, C_i) is the number of documents of C_i containing t_j at least once and CorpusFrequency(t_j) is the number of documents of the entire corpus containing t_j at least once.

If the class_term weight of a term t_j with respect to the class C_i is very high, then the probability that a document D which contains t_j is a member of the class C_i is also high. Therefore, the relevancy of a term, which we call Term_ClassRelevancy(t_j, C_i), in deciding the class of a document is directly proportional to the class_term weight of the term, i.e.,

Term_ClassRelevancy(t_j, C_i) is proportional to Class_TermWeight(t_j, C_i)    (1)

Class_Term Density: It is the relative density of a term of interest with respect to the class of interest, computed as the ratio of the number of occurrences of the term in the class of interest to that in the entire corpus. That is, the class_term density of a term t_j with respect to the class C_i is computed as the ratio of the frequency of t_j in C_i to its frequency in the corpus, as given by the equation below.

Class_TermDensity(t_j, C_i) = TermFrequency(t_j, C_i) / Sum_{i=1..k} TermFrequency(t_j, C_i)

where TermFrequency(t_j, C_i) is the frequency of t_j in the class C_i, computed as the sum of the frequencies of t_j in every document of C_i, as shown by the equation below.

TermFrequency(t_j, C_i) = Sum_{doc=1..d} Frequency(t_j, D_doc)

where Frequency(t_j, D_doc) is the frequency of occurrence of the term t_j in document D_doc and d is the number of documents in the class C_i.

It shall be noticed that if the class_term density of a term t_j in a class C_i is very high, then the probability that a document D which contains t_j is a member of the class C_i is also high. Therefore, the relevancy of a term in deciding the class of a document is directly proportional to the class_term density of the term, i.e.,

Term_ClassRelevancy(t_j, C_i) is proportional to Class_TermDensity(t_j, C_i)    (2)

By combining (1) and (2), the term_class relevancy is directly proportional to the product of the class_term weight and class_term density of the term, i.e.,

Term_ClassRelevancy(t_j, C_i) = c * Class_TermWeight(t_j, C_i) * Class_TermDensity(t_j, C_i)

where c is the proportionality constant, which we decide based on the class weight with respect to the entire population.

Class Weight (c): It is the weight of the i-th class C_i in the corpus, computed as the ratio of the number of documents in C_i, denoted Size_of(C_i), to the total number of documents in the entire corpus, as given by

ClassWeight(C_i) = Size_of(C_i) / Sum_{i=1..k} Size_of(C_i)

where k is the number of classes. If each class has an equal number of documents, the class weight is a constant scaling factor in computing the relevance of a term; otherwise it increases or decreases the relevancy of a term to a class according to whether the size of the class is larger or smaller than that of the other classes.

Therefore, the proposed relevancy measure of a term t_j in placing a document D as a member of a class C_i is given by the product of the three aspects, namely Class weight, Class_Term weight and Class_Term density, as given by the formula below.

Term_ClassRelevancy(t_j, C_i) = ClassWeight(C_i) * Class_TermWeight(t_j, C_i) * Class_TermDensity(t_j, C_i)

The main advantages of the proposed term_class relevancy measure are as follows:
- It directly computes the relevancy of a term with respect to a class of interest, which can itself be used as a clue to identify the possible class to which a document may belong, without the need of a classifier.
- The measure uses class as well as corpus information, as opposed to the conventional TF-IDF scheme, which utilizes the document frequency from the corpus only.
- It shall be observed that the relevancy of a term to a class is high only if all three factors, class_term weight, class_term density and class_weight, are high. This helps in properly deciding the weight of a term without any bias towards a particular class, which in turn helps the classifier in deciding the class.
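The three factors and their product can be implemented directly from the definitions above. The following is a minimal sketch of ours (the `docs` layout, mapping each class name to a list of per-document term-frequency dictionaries, is an assumption for illustration):

```python
def term_class_relevancy(docs, term, cls):
    """Term_ClassRelevancy = ClassWeight * Class_TermWeight * Class_TermDensity."""
    # Class_TermWeight: documents of cls containing term / corpus documents containing term
    class_freq = sum(1 for d in docs[cls] if d.get(term, 0) > 0)
    corpus_freq = sum(1 for ds in docs.values() for d in ds if d.get(term, 0) > 0)
    # Class_TermDensity: occurrences of term in cls / occurrences of term in the corpus
    tf_cls = sum(d.get(term, 0) for d in docs[cls])
    tf_all = sum(d.get(term, 0) for ds in docs.values() for d in ds)
    # ClassWeight: documents in cls / documents in the corpus
    n_cls = len(docs[cls])
    n_all = sum(len(ds) for ds in docs.values())
    if corpus_freq == 0 or tf_all == 0:
        return 0.0
    return (n_cls / n_all) * (class_freq / corpus_freq) * (tf_cls / tf_all)

docs = {
    "sport":    [{"ball": 3, "game": 1}, {"ball": 2}],
    "politics": [{"vote": 4, "game": 2}, {"vote": 1}],
}
```

A term concentrated in one class (such as "ball" above) scores high for that class and zero for the others, which is exactly the bias-free behaviour the three-factor product is designed to give.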
Once the term_class relevance of all terms of the training set of documents is computed with respect to every class present, each training document is represented using the representation scheme proposed by Isa et al. [20], as explained in Section 1.1.2. A document is first represented as a matrix of size m x k, where m is the number of terms assumed to be available in the document and k is the number of classes. Then, every entry of the matrix is filled with the relevancy of the corresponding term t_j with respect to the class C_i. Then, a feature vector f of dimension k is created as a representative for the document, where f(i) is the average relevancy of all terms with respect to the class C_i. The feature matrix of size n x k thus created for the n training documents is used for the learning process. A similar vector of dimension k is created for each test document and given to the learning algorithm or classifier for labeling. The process of training and testing the classifiers is explained in the next section.

3. Classification with SVM and k-NN Classifiers

To evaluate the applicability of the proposed term_class relevance measure, we use SVM as the learning algorithm to perform classification because of its good generalization ability. Moreover, even though the time required for training is directly proportional to the size of the training dataset, the training burden for SVM is very low, because the representative feature vectors have dimension equal only to the number of classes. So, to test the effectiveness of the proposed relevance measure, we have experimented with the SVM classifier with Linear, Gaussian radial basis function (RBF) and Polynomial kernels.

We consider the 20 Newsgroups dataset for our experimentation. It consists of approximately 20,000 newsgroup documents in 20 classes, with each class bearing a nearly equal number of samples. It has become a popular dataset for text classification and clustering applications. Some of the documents are closely related to each other while others are highly unrelated.

We conduct experiments with various proportions of the training set to validate the performance of the proposed relevancy measure. Fig. 1 shows the overall classification accuracy of the system for various percentages of training samples using the SVM classifier with different kernels. Fig. 2 shows the precision of the SVM classifier with different kernels, and the recall is shown in Fig. 3. In Fig. 4, the overall F-measure is presented. It can be observed from Figures 1-4 that the SVM classifier with the RBF kernel works well compared to the other kernels. The results are presented graphically in the figures below.
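The reduced-dimension representation described above can be sketched as follows (our own minimal illustration; `relevance` stands for any term-class measure, such as the one proposed in Section 2, and the sample values are invented):

```python
def represent(doc_terms, classes, relevance):
    """Map a document (list of terms) to a k-dimensional vector: entry i is the
    average relevance of the document's terms to class i."""
    m = len(doc_terms)
    return [sum(relevance(t, c) for t in doc_terms) / m for c in classes]

classes = ["sport", "politics"]
rel = {("ball", "sport"): 0.9, ("ball", "politics"): 0.1,
       ("vote", "sport"): 0.2, ("vote", "politics"): 0.8}
vec = represent(["ball", "vote"], classes, lambda t, c: rel.get((t, c), 0.0))
# vec has one entry per class, regardless of the vocabulary size
```

Vectors of this form, one per training document, are what is fed to the SVM (or, later, k-NN) classifier.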
Figure 1. Overall accuracy of classification with Linear, RBF and Polynomial kernels

Figure 2. Precision of the SVM classifier with Linear, RBF and Polynomial kernels

Figure 3. Recall of the SVM classifier with Linear, RBF and Polynomial kernels

Figure 4. F-measure of the SVM classifier with Linear, RBF and Polynomial kernels

Further, the k-NN classifier is also adopted to test the proposed method because of its simplicity in classification. We performed the experimentation with various values of k from 1 to 20, and the performance of the classifier was highest for k = 10. Table 2 shows the results of the k-NN classifier for k = 10, along with a comparison with the best results of SVM.

Table 2. Results of SVM with RBF kernel and k-NN with k = 10

% of      Accuracy        Precision       Recall          F-measure
Training  k-NN    SVM     k-NN    SVM     k-NN    SVM     k-NN    SVM
10        90.38   86.88   90.12   88.46   90.14   86.67   90.13   87.56
20        91.15   88.57   91.10   89.04   90.72   88.24   90.91   88.64
30        91.78   88.63   91.61   88.59   91.43   88.11   91.52   88.35
40        92.04   89.38   91.99   89.31   91.66   88.99   91.82   89.15
50        91.49   89.57   91.40   89.62   91.18   89.20   91.29   89.41
60        92.26   89.39   92.20   89.48   91.89   88.91   92.04   89.20
70        92.14   89.43   92.23   89.60   91.76   88.84   91.99   89.22
80        93.01   89.07   93.03   89.27   92.66   88.67   92.85   88.97

To compare the class-wise performance of each classifier, we show the variation of F-measure vs. class in Figures 5 and 6. Figure 5 shows the values of F-measure for each class using the k-NN classifier with k = 10 and 10 percent training. It can be noticed that the performance is relatively low for classes 2, 3, 4, 7, 13 and 20. Further, the F-measure of the SVM classifier for each class, with the RBF kernel and 10% training, is shown in Figure 6. Though the results of SVM are poor compared to k-NN, SVM also shows relatively low performance for the same classes as in the case of k-NN.
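The k-NN decision used in these experiments (majority vote among the k nearest training vectors in the k-class representation space) can be sketched as follows (our own minimal illustration, with Euclidean distance assumed):

```python
from collections import Counter

def knn_predict(train, query, k=10):
    """train: list of (vector, label) pairs; returns the majority label among
    the k training vectors closest to query (Euclidean distance)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda vl: dist(vl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy 2-class representation space (vectors are invented, not experiment data)
train = [([0.9, 0.1], "sport")] * 6 + [([0.2, 0.8], "politics")] * 6
print(knn_predict(train, [0.85, 0.15], k=10))  # prints "sport"
```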

Figure 5. Classification performance vs. class for the k-NN classifier with 10% training

Figure 6. Classification performance vs. class for the SVM classifier with RBF kernel and 10% training

4. Comparative Analysis

In this section, we provide a quantitative comparative analysis of the proposed term_class relevance measure against the results of Isa et al. [20] in Table 3. The results corresponding to [20] have been extracted directly from that paper, as the representation scheme is the same in both works and they also provide results on the same 20 Newsgroups dataset using the SVM classifier with different kernels. We can notice from Table 3 that the proposed term_class relevance measure outperforms the measure used by Isa et al. [20]. Along with SVM, we also compare against the results of the k-NN classifier with k = 10. It can also be noticed from Table 3 that the k-NN classifier with k = 10 shows better results than SVM with all kernels, for both relevance measures. So, we recommend using k-NN as the classifier for better classification performance.

Table 3. Comparison of results of the proposed method with the work of Isa et al. [20]

Percentage   Results from [20] with SVM     Results of Proposed Method
of Training  Linear   RBF     Polynomial    SVM Linear  SVM RBF  SVM Polynomial  k-NN
30           83.04    82.97   82.83         85.57       88.63    86.97           91.78
70           88.02    87.88   87.93         85.02       89.43    88.31           92.14

5. Conclusion

In this paper, a novel term_class relevance measure to compute the relevance of a term in classifying an unknown document as a member of a particular class is proposed. The proposed term_class relevance measure is a product of three aspects, namely class_term weight, class_term density and class_weight. Experiments are conducted on the 20 Newsgroups dataset using the SVM and k-NN classifiers. An effective text representation scheme, which allows representation of text documents in a reduced dimension, is adopted to test the proposed term_class relevance measure. The comparative analysis of the results of the proposed work with other contemporary research works shows the superiority of the proposed term_class relevance measure.

References
1. Lan, M., Tan, C. L., Su, J., and Lu, Y. 2009. Supervised and Traditional Term Weighting Methods for Automatic Text Categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31 (4), pp. 721-735.
2. Salton, G. and Buckley, C. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, Vol. 24 (5), pp. 513-523.
3. Debole, F. and Sebastiani, F. 2003. Supervised Term Weighting for Automated Text Categorization. Proceedings of the 2003 ACM Symposium on Applied Computing, pp. 784-788.
4. Ren, F. and Sohrab, M. G. 2013. Class-indexing-based term weighting for automatic text classification. Information Sciences, Vol. 236, pp. 109-125.
5. Harish, B. S., Guru, D. S., and Manjunath, S. 2010. Representation and Classification of Text Documents: A Brief Review. IJCA Special Issue on Recent Trends in Image Processing and Pattern Recognition (RTIPPR), pp. 110-119.
6. Jones, K. S. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, Vol. 28, pp. 11-21.
7. Jones, K. S. 2004. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, Vol. 60, pp. 493-502.
8. Altınçay, H. and Erenel, Z. 2010. Analytical evaluation of term weighting schemes for text categorization. Pattern Recognition Letters, Vol. 31, pp. 1310-1323.
9. Liu, Y., Loh, H. T., and Sun, A. 2009. Imbalanced text classification: A term weighting approach. Expert Systems with Applications, Vol. 36, pp. 690-701.
10. Mladenic, D. and Grobelnik, M. 2003. Feature selection on hierarchy of web documents. Decision Support Systems, Vol. 35 (1), pp. 45-87.
11. Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, Vol. 34 (1), pp. 1-47.

12. Yang, Y. and Pedersen, J. O. 1997. A comparative study on feature selection in text categorization. In: Proc. ICML '97, 14th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, US, pp. 412-420.
13. Liu, H. and Yu, L. 2005. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, Vol. 17 (4), pp. 491-502.
14. Ozgur, A., Ozgur, L., and Gungor, T. 2005. Text categorization with class-based and corpus-based keyword selection. In: Proc. 20th International Symposium on Computer and Information Sciences. Lecture Notes in Computer Science, Vol. 3733, Springer-Verlag, pp. 606-615.
15. Tsai, R. T., Hung, H., Dai, H., Lin, Y., and Hsu, W. 2008. Exploiting likely-positive and unlabeled data to improve the identification of protein-protein interaction articles. BMC Bioinformatics, Vol. 9.
16. Wang, D. and Zhang, H. 2013. Inverse-Category-Frequency Based Supervised Term Weighting Schemes for Text Categorization. Journal of Information Science and Engineering, Vol. 29, pp. 209-225.
17. Reed, J. W., Jiao, Y., Potok, T. E., Klump, B. A., Elmore, M. T., and Hurson, A. R. 2006. TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams. 5th International Conference on Machine Learning and Applications, pp. 258-263. IEEE Computer Society, Washington.
18. Fuhr, N., Hartmann, S., Lustig, G., Schwantner, M., Tzeras, K., Darmstadt, T. H., et al. 1991. AIR/X - A rule-based multistage indexing system for large subject fields. In: Proceedings of RIAO, pp. 606-623.
19. Soucy, P. and Mineau, G. W. 2005. Beyond TFIDF Weighting for Text Categorization in the Vector Space Model. Proc. International Joint Conference on Artificial Intelligence, pp. 1130-1135.
20. Isa, D., Lee, L. H., Kallimani, V. P., and RajKumar, R. 2008. Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering, Vol. 20, pp. 23-31.
21. Isa, D., Kallimani, V. P., and Lee, L. H. 2009. Using the self-organizing map for clustering of text documents. Expert Systems with Applications, Vol. 36, pp. 9584-9591.
22. Guru, D. S., Harish, B. S., and Manjunath, S. 2010. Symbolic representation of text documents. In: Proceedings of Third Annual ACM Bangalore Conference. doi: 10.1145/1754288.1754306.