Chi Square Feature Extraction Based SVMs Arabic Language Text Categorization System


Journal of Computer Science 3 (6), 2007
Science Publications

Abdelwadood Moh'd A MESLEH
Faculty of Information Systems and Technology, Arab Academy for Banking and Financial Sciences, Amman, Jordan.
Computer Engineering Department, Faculty of Engineering Technology, Balqa' Applied University, Amman, Jordan

Abstract: This paper implements a Support Vector Machines (SVMs) based text classification system for Arabic language articles. The classifier uses the CHI square method for feature selection in the pre-processing step of the text classification system design procedure. Compared to other classification methods, our system shows high classification effectiveness on an Arabic data set in terms of F-measure (F = 88.11).

Keywords: Arabic Text Classification, Arabic Text Categorization, CHI Square feature extraction.

INTRODUCTION

Text Classification (TC) is the task of classifying texts into one of a set of predefined categories based on their contents [1]. It is also referred to as text categorization, document categorization, document classification or topic spotting, and it is one of the important research problems in information retrieval (IR), data mining and natural language processing. TC has many applications that are becoming increasingly important, such as document indexing, document organization, text filtering, word sense disambiguation and hierarchical categorization of web pages.

TC research has received much attention [2]. It can be studied as a binary classification approach (a binary classifier is designed for each category of interest); many TC training algorithms have been reported for binary classification, e.g. the Naïve Bayesian method [3], k-nearest neighbours (kNN) [3], support vector machines (SVM) [4,5], etc. On the other hand, it has been studied as a multi-class classification approach, e.g. boosting [6] and multiclass SVM [7]. In this paper, we restrict our study of TC to binary classification methods, and in particular to the Support Vector Machines (SVM) classification method for Arabic language text.
TC Procedure: The TC system design usually comprises three phases: data pre-processing, text classification and performance measures. The data pre-processing phase makes the text documents compact and suitable for training the text classifier. The text classifier, the core TC learning algorithm, is constructed, trained and tuned using the compact form of the Arabic dataset. The text classifier is then evaluated by some performance measures, after which the TC system can perform document classification. The following sections are devoted to these three phases.

Data Pre-processing:

Arabic Data set: Since there is no publicly available Arabic TC corpus to test the proposed classifier, we have used an in-house corpus collected from online Arabic newspaper archives, including Al-Jazeera, Al-Nahar, Al-Hayat, Al-Ahram and Al-Dostor, as well as a few other specialized websites. The collected corpus contains 1445 documents that vary in length. These documents fall into nine classification categories (Table 1) that vary in the number of documents. In this Arabic dataset, each document was saved in a separate file within the corresponding category's directory, i.e. the documents in this dataset are single-labelled.

Representing Arabic dataset Documents: As mentioned before, this representation aims to transform the Arabic text documents into a form that is suitable for the classification algorithm. In this phase, we have followed [8,9] and [10] and processed the Arabic documents according to the following steps:

1. Each article in the Arabic data set is processed to remove the digits and punctuation marks.
2. We have followed [11] in the normalization of some Arabic letters, such as the normalization of (hamza) in all its forms to (alef).

3. All the non-Arabic texts were filtered out.
4. Arabic function words were removed. The Arabic function words (stop words) are the words that are not useful in IR systems, e.g. the Arabic prefixes, pronouns and prepositions.
5. Infrequent terms removal: we have ignored those terms that occur less than 4 times in the training data.

The vector space representation [12] is used to represent the Arabic documents.

Table 1: Arabic data set

    Category                     Number of documents
    Computer                     70
    Economics                    220
    Education                    68
    Engineering                  115
    Law                          97
    Medicine                     232
    Politics                     184
    Religion                     227
    Sports                       232
    Total number of documents    1445

We have not done stemming because it is not always beneficial for text categorization, since many terms may be conflated to the same root form [13]. Based on the vector space model (VSM), each term corresponds to a text feature whose value is the term frequency $tf_{ij} = t_{ij}$, the number of times term $i$ occurs in document $j$. This TF makes the frequent words of a document more important. We have used the inverse document frequency IDF [4] to improve system performance. $DF(i)$, the number of documents that term $i$ occurs in, is used to calculate $IDF(i)$:

$$IDF(i) = \log\left(\frac{N}{DF(i)}\right)$$

where $N$ is the total number of training documents. The product IDF·TF is calculated as the weight of each term feature, and the vectors are then normalized to unit length.

Feature selection: In text categorization we are dealing with a huge feature space, which is why we need a feature selection mechanism. The most popular feature selection methods are document frequency thresholding (DF) [14], the χ² statistic (CHI) [15], term strength (TS) [16], information gain (IG) [14] and mutual information (MI) [14]. The χ² statistic [14] measures the lack of independence between the text feature term t and the text category c, and can be compared to the χ² distribution with one degree of freedom to judge extremeness.
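As a rough illustration of the TF·IDF weighting described earlier in this section, the following Python sketch computes the weight of each term and normalizes every document vector to unit length. The function name `tfidf_vectors` and the toy token lists are assumptions made for illustration; they are not taken from the system described in this paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, normalized to unit length.

    docs is a list of token lists (hypothetical, already pre-processed).
    """
    n_docs = len(docs)
    # DF(i): number of documents in which term i occurs
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # tf_ij: occurrences of term i in document j
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:  # guard: a document may contain only zero-IDF terms
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

# Toy documents (placeholder tokens, not from the Arabic corpus)
docs = [
    ["bank", "loan", "bank", "news"],
    ["match", "goal", "news"],
    ["bank", "goal", "news"],
]
vectors = tfidf_vectors(docs)
```

In this toy input the token "news" occurs in every document, so its IDF is log(1) = 0 and it receives zero weight, which is the intended effect of IDF on uninformative terms.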
Using the two-way contingency table (Table 2) of a term t and a category c, A is the number of times t and c co-occur, B is the number of times t occurs without c, C is the number of times c occurs without t, D is the number of times neither c nor t occurs, and N is the total number of documents.

Table 2: χ² statistic two-way contingency table

    A = #(t, c)      C = #(¬t, c)
    B = #(t, ¬c)     D = #(¬t, ¬c)
    N = A + B + C + D

The term-goodness measure is defined as follows:

$$\chi^2(t, c) = \frac{N \, (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$

The χ² statistic has a natural value of zero if t and c are independent. Among the above feature selection methods, [14] found CHI and IG most effective. Unlike [4], which used IG in its experiments, we have used CHI as the feature selection method for our Arabic TC.

SVMs TC Classifier: Like any classification algorithm, TC algorithms have to be robust and accurate. Many machine learning based methods can be implemented for TC tasks; Support Vector Machines (SVM) [4] and other kernel based methods, e.g. [17] and [18], have shown empirical success in the field of TC. TC empirical results have shown that SVM classifiers perform well because of the following text properties [4]:

High dimensional text space: In text documents we are dealing with a huge number of features. Since SVMs use overfitting protection, which does not necessarily depend on the number of features, SVMs have the potential to handle this large number of features.

Few irrelevant features: One way to avoid these high dimensional input spaces is to assume that most of the features are irrelevant. In text categorization there are only very few irrelevant features.

Document vectors are sparse: For each document, the corresponding document vector contains only a few entries that are not zero.

Most text categorization problems are linearly separable.

This is why SVM based classifiers work well for TC problems. However, other kernel methods have outperformed the SVM linear kernel method, e.g. [18].

Support Vector Machines (SVMs) are binary classifiers, which were originally proposed by [19]. SVMs have achieved high accuracy in various tasks, such as object recognition [20]. Suppose a set of ordered pairs consisting of a feature vector and its label is given:

$$(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l), \qquad \forall i,\; x_i \in R^d,\; y_i \in \{-1, +1\} \tag{1}$$

In SVMs, a separating hyperplane $f(x) = w \cdot x + b$ with the largest margin (the distance between the hyperplane and its nearest vectors, see Figure 1) is constructed on the condition that the hyperplane discriminates all the training examples correctly (this condition is relaxed in the non-separable case). To ensure that all the training examples are classified correctly, $y_i (x_i \cdot w + b) - 1 \ge 0$ must hold for the nearest examples. Two margin-boundary hyperplanes are formed by the nearest positive examples and the nearest negative examples. Let $d$ be the distance between these two margin-boundary hyperplanes, and $x$ be a vector on the margin-boundary hyperplane formed by the nearest negative examples. Then the following equations hold:

$$-1 \times (x \cdot w + b) - 1 = 0$$

$$\left( \left( x + \frac{d\,w}{\|w\|} \right) \cdot w + b \right) - 1 = 0$$

Substituting the first equation into the second gives $d\,\|w\| = 2$. Noting that the margin is half of the distance $d$, it is computed as $d/2 = 1/\|w\|$. It is clear that maximizing the margin is equivalent to minimizing the norm of $w$.

So far, we have shown the general framework for SVMs. The SVM classifier is formulated in two different cases: the separable case and the non-separable case. In the separable case, where the training data is linearly separable, the norm of $w$ is minimized according to equation (2):

$$\min\; \frac{1}{2}\|w\|^2 \qquad \text{s.t.}\;\; \forall i,\; y_i (x_i \cdot w + b) - 1 \ge 0 \tag{2}$$

In the non-separable case, which is the usual situation for real data, the norm is minimized by equation (3):

$$\min\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \qquad \text{s.t.}\;\; \forall i,\; y_i (x_i \cdot w + b) - 1 + \xi_i \ge 0, \quad \forall i,\; \xi_i \ge 0 \tag{3}$$
where $\xi_i$ ($\forall i$) are slack variables, which are introduced to enable non-separable problems to be solved [21]; in this case we allow a few examples to penetrate into the margin or even onto the other side of the hyperplane. Skipping the details of the Lagrangian theory, equations (2) and (3) are converted to the dual problems shown in equations (4) and (5), where $\alpha_i$ is a Lagrange multiplier and $C$ is a user-given constant. Because the dual problems have quadratic forms, they can be solved more easily than the primal optimization problems in equations (2) and (3). They can be solved by any general purpose optimization package, such as the MATLAB optimization toolbox.

$$\max\; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \qquad \text{s.t.}\;\; \sum_i \alpha_i y_i = 0, \quad \forall i,\; \alpha_i \ge 0 \tag{4}$$

$$\max\; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \qquad \text{s.t.}\;\; \sum_i \alpha_i y_i = 0, \quad \forall i,\; 0 \le \alpha_i \le C \tag{5}$$

As a result we obtain equation (6), which is used to classify examples according to its sign, where $\alpha_i^*$ ($\forall i$) and $b^*$ are real numbers:

$$f(x) = \sum_i \alpha_i^* y_i \, x_i \cdot x + b^* \tag{6}$$

Since SVMs are linear classifiers, their separating ability is limited. To compensate for this limitation, the kernel method is usually combined with SVMs [19]. In the kernel method, the dot products in (5) and (6) are replaced with more general inner products $K(x_i, x)$, called the kernel function. The polynomial kernel and the Radial Basis Function (Gaussian) kernel are often used. This means that the feature vectors are mapped into a higher dimensional space and linearly separated there. The significant advantage of this process is that only the general inner products of two vectors are needed, which leads to a relatively small computational overhead. On the other hand, the crucial issues for SVMs are choosing the right kernel function and tuning the parameters.
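To make the decision function of equation (6) and the kernel substitution more concrete, here is a minimal Python sketch. It only evaluates $f(x)$ for multipliers that are already known; it does not solve the dual problem. All names (`decision_function`, `rbf_kernel`, and so on), the kernel parameter defaults and the two toy support vectors are assumptions chosen for illustration. The toy multipliers were picked by hand so that the constraint $\sum_i \alpha_i y_i = 0$ holds and both points lie exactly on the margin boundaries under the linear kernel.

```python
import math

def linear_kernel(u, v):
    """Plain dot product, as used in equation (6)."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """Polynomial kernel; degree and coef0 are illustrative defaults."""
    return (linear_kernel(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative default."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def decision_function(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; the sign gives the class."""
    return sum(
        alpha * y * kernel(sv, x)
        for alpha, y, sv in zip(alphas, labels, support_vectors)
    ) + b

# Hand-made toy solution: one positive and one negative support vector
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [+1, -1]
alphas = [0.25, 0.25]  # satisfies sum_i alpha_i * y_i = 0
b = 0.0

f_pos = decision_function((1.0, 1.0), support_vectors, labels, alphas, b)
f_neg = decision_function((-1.0, -1.0), support_vectors, labels, alphas, b)
f_rbf = decision_function((1.0, 1.0), support_vectors, labels, alphas, b, kernel=rbf_kernel)
```

With the linear kernel the implied weight vector is $w = (0.5, 0.5)$, so the margin $1/\|w\|$ equals the distance from the hyperplane to either support vector. Swapping in `rbf_kernel` changes only the inner product, which is the point of the kernel method.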

Other TC Classifiers: Many other TC classifiers have been investigated in the literature:

k-NN Classifier: The k-NN classifier [1], a generalization of the nearest neighbor rule, uses the k nearest neighbors as the basis for a decision to assign a category to a document. k-nearest neighbor classifiers show very good performance on text categorization tasks for the English language [3]. It is worth pointing out that k-NN uses cosine as a similarity metric.

Naïve Bayes classifier: The main idea of the naïve Bayes classifier [3] is to use a probabilistic model of text. The probabilities of positive and negative examples are computed.

Performance measures: TC performance is always considered in terms of computational efficiency and categorization effectiveness. When categorizing a large number of documents into many categories, the computational efficiency of the TC system shall be considered; this includes the feature selection method and the classifier learning algorithm. TC effectiveness is measured in terms of precision and recall [4]. Precision and recall are defined as follows [3]:

$$\text{recall} = \frac{a}{a + c}, \quad a + c > 0$$

$$\text{precision} = \frac{a}{a + b}, \quad a + b > 0$$

where a counts the assigned and correct cases, b counts the assigned and incorrect cases, c counts the not assigned but incorrect cases and d counts the not assigned and correct cases. A two-way contingency table (Table 3) contains a, b, c and d.

Table 3: A contingency table for measuring performance

                    YES is correct    NO is correct
    Assigned YES    a                 b
    Assigned NO     c                 d

The values of precision and recall often depend on parameter tuning; there is a trade-off between them. This is why we use another measure that combines both precision and recall [22]: the F-measure, which is defined as follows:

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

To evaluate the performance across categories, the F-measure is averaged. There are two kinds of averaged values, namely the micro average and the macro average [3].

RESULTS

In our experiment, we have used the Arabic data set described above for training and testing the TC classifier.
Following the majority of text classification publications, we have removed the Arabic stop words, filtered out the non-Arabic letters and symbols, and removed the digits. As mentioned before, we have not applied a stemming process. We have used one third of the Arabic data set for testing the classifier and two thirds for training it, as shown in Table 4.

Table 4: The categories and their sizes in the Arabic data set

    Category       Training texts    Testing texts
    Computer       47                23
    Economics
    Education      45
    Engineering
    Law            65                32
    Medicine
    Politics
    Religion
    Sports

We have used the TinySVM package as our SVM implementation. The soft-margin parameter C is set to 1.0 (other values of C showed no significant changes in the results). The results of our classifier in terms of precision, recall and F-measure for the nine categories are shown in Table 5.

Table 5: SVMs classifier results for the nine categories

    Category         Precision    Recall    F-measure
    Computer
    Economics
    Education
    Engineering
    Law
    Medicine
    Politics
    Religion
    Sports
    Macro-Average                           88.11
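The per-category measures and the macro average reported in Table 5 can be computed along the following lines. This is a minimal Python sketch of the standard definitions given in the Performance measures section; the function names and the counts in `counts` are invented for illustration and are not the experimental values. Returning 0.0 when a denominator is zero is also an assumption, since the definitions above leave that case undefined.

```python
def precision(a, b):
    """a: assigned and correct, b: assigned and incorrect."""
    return a / (a + b) if (a + b) > 0 else 0.0  # 0.0 fallback is an assumption

def recall(a, c):
    """a: assigned and correct, c: not assigned but should have been."""
    return a / (a + c) if (a + c) > 0 else 0.0  # 0.0 fallback is an assumption

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def macro_f_measure(counts):
    """Average the per-category F-measures, weighting every category equally."""
    scores = [
        f_measure(precision(a, b), recall(a, c))
        for a, b, c in counts.values()
    ]
    return sum(scores) / len(scores)

# Hypothetical (a, b, c) counts per category, for illustration only
counts = {
    "Sports": (8, 2, 2),
    "Law": (5, 5, 0),
}
macro_f = macro_f_measure(counts)
```

The sketch returns fractions between 0 and 1; multiplying by 100 gives values on the scale used in this paper. Because the macro average weights all categories equally, a small category such as Computer influences the result as much as a large one.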

The macro-averaged F-measure is 88.11; our χ² feature extraction based SVM classifier outperforms the Naïve Bayes and k-NN classifiers (which were implemented for result comparison), as shown in Table 6. While conducting many experiments, we have tuned the χ² feature extraction method to achieve the best macro-averaged F-measure. The best results were achieved when extracting the top 16 terms for each classification category. We have noted that increasing the number of terms does not enhance the effectiveness of the TC system; on the other hand, it makes the training process slower. The performance is negatively affected when decreasing the number of terms for each category.

Table 6: F-measure results comparison

    Classifier method                                F-measure
    χ² feature extraction based SVMs classifier      88.11
    Naïve Bayes classifier
    k-NN classifier                                  7.7

While conducting some other experiments using the χ² scores, we tried to tune the number of selected CHI square terms (in this case, an unequal number of terms is selected for each classification category), but we could not achieve better results than those achieved using the 16 terms per classification category mentioned above.

Following [11] in the usage of light stemming to improve the performance of Arabic TC, we have used the stemmer of [25] to remove the suffixes and prefixes from the Arabic index terms. Unfortunately, we have concluded that light stemming does not improve the performance of our CHI square feature extraction based SVMs classifier; the F-measure drops to [...]. As mentioned before, stemming is not always beneficial for text categorization problems [13]. This may explain the slight drop in the averaged F-measure.

CONCLUSION

We have investigated the performance of the CHI statistic as a feature extraction method, and the usage of an SVMs classifier for TC tasks on Arabic language articles. We have achieved practically acceptable results that are comparable to other research results. In regard to χ², we would like to investigate more deeply the relation between the A, B, C and D values in the CHI algorithm when dealing with small categories like Computer.
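To show how the A, B, C and D counts drive the χ² score and the per-category term selection discussed above, here is a minimal Python sketch. The function names `chi_square` and `top_terms`, the tie-breaking rule and the four toy documents are assumptions made for illustration; the paper's own implementation is not shown in the text.

```python
def chi_square(a, b, c, d):
    """Chi-square term-goodness score from the 2x2 contingency counts."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:  # guard: term or category never (or always) occurs
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator

def top_terms(docs, labels, category, k):
    """Return the k terms with the highest chi-square score for one category.

    docs is a list of term sets, labels the matching category names.
    Ties are broken alphabetically (an arbitrary, illustrative choice).
    """
    vocabulary = set().union(*docs)
    scores = {}
    for term in vocabulary:
        a = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab == category)
        b = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab != category)
        c = sum(1 for doc, lab in zip(docs, labels) if term not in doc and lab == category)
        d = sum(1 for doc, lab in zip(docs, labels) if term not in doc and lab != category)
        scores[term] = chi_square(a, b, c, d)
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]

# Toy corpus (placeholder terms, not from the Arabic data set)
docs = [{"goal", "match"}, {"goal", "team"}, {"bank", "loan"}, {"bank", "market"}]
labels = ["Sports", "Sports", "Economics", "Economics"]
selected = top_terms(docs, labels, "Sports", 2)
```

In this toy corpus "bank" scores as high for Sports as "goal" does, because χ² measures dependence in either direction: a term that never occurs in a category is as informative as one that always does. With very few documents in a category the counts A and C are small, so the scores become coarse, which may be related to the difficulty reported for the Computer category.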
For this particular category, we have experimented with the χ² and classifier parameters, but we could not enhance the recall or the precision values. The investigation of other feature selection algorithms remains for future work. Building a bigger Arabic language TC corpus shall be considered as well in our future research.

ACKNOWLEDGMENT

Many thanks to Dr. Ghassan Kannaan (Yarmouk University, Jordan) for providing the TC Arabic dataset, and thanks to Dr. Nevin Darwish (Cairo University, Computer Engineering Dept., Egypt) for emailing me her TC paper [11]. Many thanks to Dr. Tariq Almugrabi for providing many related books, papers and software.

REFERENCES

1. Manning, C., and H. Schütze, 1999. Foundations of Statistical Natural Language Processing. MIT Press.
2. Sebastiani, F., 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1.
3. Yang, Y., and X. Liu, 1999. A re-examination of text categorization methods. In 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99).
4. Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning.
5. Joachims, T., 2002. Learning to classify text using support vector machines, methods, theory and algorithms. Kluwer Academic Publishers.
6. Schapire, R., and Y. Singer, 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning, Vol. 39, No. 2/3.
7. Vapnik, V. N. Statistical Learning Theory. John Wiley & Sons, Inc., N.Y.
8. Benkhalifa, M., A. Mouradi, and H. Bouyakhf, 2001. Integrating WordNet knowledge to supplement training data in semi-supervised agglomerative hierarchical clustering for text categorization. Int. J. Intell. Syst. 16(8).
9. Guo, G., H. Wang, D. Bell, Y. Bi, and K. Greer, 2004. An kNN Model-based Approach and its Application in Text Categorization. Proc. of 5th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing-2004, LNCS 2945, Springer-Verlag.

10. El-Kourdi, M., A. Bensaid, and T. Rachidi, 2004. Automatic Arabic documents categorization based on the naive Bayes algorithm. Workshop on Computational Approaches to Arabic Script-Based Languages (COLING-2004), University of Geneva, Geneva, Switzerland.
11. Samir, A., W. Ata, and N. Darwish, 2005. A New Technique for Automatic Text Categorization for Arabic Documents. 5th IBIMA Conference (The internet & information technology in modern organizations), December 13-15, 2005, Cairo, Egypt.
12. Salton, G., A. Wong, and S. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11).
13. Hofmann, T., 2003. Introduction to Machine Learning, Draft Version 1.1.5, November 10.
14. Yang, Y., and J. Pedersen, 1997. A comparative study on feature selection in text categorization. In J. D. H. Fisher, editor, The Fourteenth International Conference on Machine Learning (ICML'97). Morgan Kaufmann.
15. Schutze, H., D. Hull, and J. Pedersen. A comparison of classifiers and document representations for the routing problem. In International ACM SIGIR Conference on Research and Development in Information Retrieval.
16. Yang, Y., and J. Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5).
17. Hofmann, T., 2000. Learning the similarity of documents: An information geometric approach to document retrieval and categorization. In Advances in Neural Information Processing Systems, 12.
18. Takamura, H., M. Yuji and H. Yamada, 2004. Modeling Category Structures with a Kernel Function. In Proc. of Computational Natural Language Learning (CoNLL).
19. Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag, Berlin.
20. Pontil, M., and A. Verri. Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6).
21. Cristianini, N., and J. Shawe-Taylor, 2000. An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press.
22. Mitchell, T., 1996. Machine Learning. New York, McGraw Hill.
23. Yang, Y. An evaluation of statistical approaches to text categorization. Information Retrieval.
24. Baeza-Yates, R., and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley and ACM Press.
25. Larkey, L., L. Ballesteros, and M. Connell, 2002. Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis. Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval (SIGIR 2002), Tampere, Finland, August 11-15, 2002.
