Experiments in Text Categorization Using Term Selection by Distance to Transition Point

Edgar Moyotl-Hernández, Héctor Jiménez-Salazar
Facultad de Ciencias de la Computación, B. Universidad Autónoma de Puebla, 14 Sur y Av. San Claudio, Edif. 135, Ciudad Universitaria, Puebla, Pue. 72570, México. Tel. (01222) 229 55 00 ext. 7212, Fax (01222) 229 56 72, emoyotl@mail.cs.buap.mx, hjimenez@fcfm.buap.mx

Abstract. This paper presents a novel term selection method called distance to transition point (DTP) that is equally effective for unsupervised and supervised term selection. DTP computes the distance between the frequency of a term and the transition point (TP) and then, using this distance as a criterion, selects the terms closest to TP. Experimental results on Spanish texts show that feature selection by DTP achieves superior performance to document frequency, and comparable performance to information gain and the chi-statistic. Moreover, when DTP is used to select terms under an unsupervised policy, it improves the performance of traditional classification algorithms such as k-NN and Rocchio.

Keywords: distance to transition point, term selection, text categorization.

1 Introduction

The rapid growth in the volume of text documents available electronically has led to an increased interest in developing tools for organizing textual information. Text categorization (TC), the classification of text documents into a set of predefined categories, is an important task for handling and organizing textual information. Since building text classifiers manually is difficult and time consuming, the dominant approach to TC is based on machine learning techniques [10]. Within this approach, a classification learning algorithm automatically builds a text classifier from a set of preclassified documents, a training set. In TC a document d_j is usually represented as a vector of term weights d_j = (w_1j, ..., w_Vj), where V is the number of terms (the vocabulary size) that occur in the training set, and w_ij measures the importance of term t_i for the characterization of document d_j.
However, many classification algorithms are computationally hard, and their computational cost is a function of V [2]. Hence, feature selection (FS)
techniques are used to select a subset of the original term set in order to improve categorization effectiveness and reduce computational complexity. In [12] five FS methods were tested: document frequency, information gain, chi-statistic, mutual information and term strength. The first three were found to be the most effective; for that reason they will be tested in this paper. A widely used approach to FS is filtering, which consists in selecting the terms that score highest according to a criterion that measures the importance of the term for the TC task [4]. There are two main policies for performing term selection: an unsupervised policy, where term scores are determined without using any category information, and a supervised policy, where information on the category membership of training documents is used to determine term scores [5].

In this paper we present a new term selection method called distance to transition point (DTP), which can be used for both unsupervised and supervised term selection. DTP computes the distance between the frequency of a term and the transition point (TP), i.e., the frequency that splits the terms of a text (or a set of texts) into low frequency terms and high frequency terms. In the case of the unsupervised policy, DTP calculates TP using all training documents, whereas in the case of the supervised policy, DTP calculates TP using the training documents belonging to a specific category. We report experimental results obtained on Spanish texts with two classification algorithms, k-NN and Rocchio; three baseline term selection techniques, document frequency (DF), information gain (IG) and chi-statistic (CHI); and both unsupervised and supervised term selection by DTP.

The paper is organized as follows. Section 2 briefly introduces the term selection methods (DF, IG and CHI). Section 3 presents the details of the DTP term selection method for both unsupervised and supervised policies. Section 4 describes the classifiers and data used in the experiments. Section 5 presents our experiments and results. Section 6 concludes.
2 Term Selection Methods

In this section we give a brief introduction to three effective FS techniques: one unsupervised method (document frequency) and two supervised methods (information gain and chi-statistic). These methods assign a score to each term and then select the terms that score highest. In the following, let D be the training set, N the number of documents in D, V the number of terms in D, and C = {c_1, ..., c_M} the set of categories.

Document Frequency (DF). The document frequency of a term t_i is the number of documents in which this term occurs [9]. DF is a traditional term selection method that does not need category information. It is the simplest technique and easily scales to a large data set, with a computational complexity approximately linear in the number of documents N [12].
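As a concrete illustration of the DF criterion, it can be computed in a single pass over the training set. The following sketch is ours, not from the paper, and assumes documents are given as lists of term strings:

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents it occurs in.
    docs: a list of documents, each a list of term strings."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each term counted at most once per document
    return df

def select_by_df(docs, n):
    """Keep the n terms with the highest document frequency."""
    df = document_frequency(docs)
    return [t for t, _ in df.most_common(n)]

# Toy corpus (hypothetical terms, for illustration only)
docs = [["economy", "market", "peso"],
        ["market", "futbol", "goal"],
        ["economy", "market", "inflation"]]
top = select_by_df(docs, 2)  # 'market' (df=3) and 'economy' (df=2)
```

Scoring is a single scan of the corpus, which is what gives DF its near-linear complexity in N.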
Information Gain (IG). The information gain of a term t_i measures the number of bits of information obtained by knowing the presence or absence of t_i in a document. If t_i occurs equally frequently in all categories, then its IG is 0. The information gain of term t_i is defined as

  IG(t_i) = -\sum_{j=1}^{M} P(c_j) \log P(c_j) + P(t_i) \sum_{j=1}^{M} P(c_j|t_i) \log P(c_j|t_i) + P(\bar{t}_i) \sum_{j=1}^{M} P(c_j|\bar{t}_i) \log P(c_j|\bar{t}_i)    (1)

where P(c_j) is the number of documents belonging to category c_j divided by N, P(t_i) is the number of documents containing term t_i divided by N, and P(c_j|t_i) is the number of documents belonging to c_j that contain t_i divided by the total number of documents containing t_i. The computation includes the estimation of the conditional probabilities of a category given a term, and the entropy computations in the definition. The probability estimation has a time complexity of O(N) and the entropy computation has a time complexity of O(VM) [12].

Chi-Statistic (CHI). The chi-statistic measures the lack of independence between a term and a category. If term t_i and category c_j are independent, then CHI is 0. In TC, given a two-way contingency table for each term t_i and category c_j (as represented in Table 1), CHI is calculated as follows

  CHI(t_i, c_j) = \frac{N (ad - cb)^2}{(a + c)(b + d)(a + b)(c + d)}    (2)

where a, b, c and d are the numbers of documents for each combination of c_j, \bar{c}_j and t_i, \bar{t}_i. In order to get a global score CHI(t_i) from the CHI(t_i, c_j) scores relative to the M individual categories, the maximum score CHI_max(t_i) = max_{j=1,...,M} CHI(t_i, c_j) is used. The computation of CHI scores has a quadratic complexity, similar to IG [12].

Table 1. Two-way contingency table

  Category/Term   t_i   not t_i
  c_j              a       b
  not c_j          c       d

Yang and Pedersen [12] have shown that IG and CHI are the most effective FS methods for the k-NN and LLSF classification algorithms. Term selection based on DF had similar performance to the IG and CHI methods. The latter result seems to indicate that the most important terms for categorization are those that occur more frequently in the training set.
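To make Eqs. (1) and (2) concrete, here is a small illustrative sketch (ours, not the authors' code) that computes IG from a labeled corpus and CHI from the contingency counts of Table 1; base-2 logarithms and documents represented as sets of terms are assumptions of the example:

```python
import math

def information_gain(docs_per_cat, term):
    """Eq. (1): category entropy plus the presence/absence conditional terms.
    docs_per_cat: {category: list of documents, each a set of terms}."""
    N = sum(len(ds) for ds in docs_per_cat.values())
    n_t = sum(sum(term in d for d in ds) for ds in docs_per_cat.values())
    p_t = n_t / N
    ig = 0.0
    for ds in docs_per_cat.values():
        p_c = len(ds) / N
        ig -= p_c * math.log2(p_c)            # -P(c_j) log P(c_j)
        with_t = sum(term in d for d in ds)
        if n_t and with_t:
            p = with_t / n_t                   # P(c_j | t_i)
            ig += p_t * p * math.log2(p)
        without = len(ds) - with_t
        if N - n_t and without:
            p = without / (N - n_t)            # P(c_j | not t_i)
            ig += (1 - p_t) * p * math.log2(p)
    return ig

def chi_statistic(a, b, c, d):
    """Eq. (2) from the counts of Table 1."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0
```

As stated above, a term occurring equally frequently in all categories gets IG = 0, and independent term/category counts (ad = cb) give CHI = 0.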
3 Distance to Transition Point

Our term selection method DTP is based on TP. TP is derived from Zipf's Law [1],[11],[14], and is the frequency that splits the terms of a text (or a set of texts) into low frequency terms and high frequency terms. In [11] it was observed that TP indicates the frequency around which the key words of a text are found. In our previous experiments [7] we found that the performance of categorization can be slightly increased if terms that occur more often than TP are disregarded. In this paper TP is used to measure the importance of a term for the categorization task. This measure is an inverse function of the distance between the frequency of a term and TP; when the frequency of a term is identical to TP, the distance will be zero, producing a maximum closeness score. Throughout the rest of this section we describe the computation of TP and the details of DTP for both unsupervised and supervised policies.

The computation of TP is performed as follows. Let T be a text (or a set of texts), and let I_1 be the number of terms with frequency 1. Then, according to [11], the transition point of T is defined as

  TP = (\sqrt{1 + 8 I_1} - 1) / 2    (3)

As we can see, the calculation of TP only requires scanning the vocabulary of T in order to find I_1 (for more details on TP see [11] and [8]).

DTP unsupervised. DTP computes the distance to TP in the unsupervised policy as follows

  DTP(t_i) = |TP - frq(t_i)|    (4)

where frq(t_i) is the frequency of t_i in the training set D and TP is computed on D. The computation has a time complexity of O(V).

DTP supervised. In the case of supervised term selection, DTP uses the category information

  DTP(t_i, c_j) = |TP_j - frq_j(t_i)|    (5)

where frq_j(t_i) is the frequency of t_i in D_j (the set of training documents belonging to a specific category c_j) and TP_j is computed on D_j. As the globalization technique we have chosen DTP_max because, in preliminary experiments [8], it consistently outperformed other globalization techniques. The computation includes the calculation of TP for each category and has a time complexity of O(VM).
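The computations in Eqs. (3) and (4) amount to one frequency count plus an absolute difference. A minimal unsupervised sketch (our illustration, with hypothetical helper names) follows; the supervised variant of Eq. (5) is the same code run over the documents of a single category:

```python
from collections import Counter
import math

def transition_point(freqs):
    """Eq. (3): TP = (sqrt(1 + 8*I1) - 1) / 2, where I1 is the number
    of terms that occur exactly once in the text (or set of texts)."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(1 + 8 * i1) - 1) / 2

def dtp_scores(docs):
    """Unsupervised DTP, Eq. (4): |TP - frq(t_i)| over the whole set.
    Lower scores mark the terms to keep."""
    freqs = Counter(t for doc in docs for t in doc)
    tp = transition_point(freqs)
    return {t: abs(tp - f) for t, f in freqs.items()}

def select_by_dtp(docs, n):
    """Keep the n terms whose frequency lies closest to TP."""
    scores = dtp_scores(docs)
    return sorted(scores, key=scores.get)[:n]
```

For example, a vocabulary with three terms of frequency 1 gives I_1 = 3 and TP = (sqrt(25) - 1)/2 = 2, so terms of frequency 2 score best.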
DTP (whose use as an FS function was first proposed in [8]) selects the terms closest to TP. In FS we measure how close the frequency of a term and TP are to each other. Thus the terms with the highest DTP values are the most distant from TP; since we are interested in the least distant terms, we select the terms for which DTP is
lowest. Our experiments presented in Section 5 show that the performance of traditional classification algorithms (such as k-NN and Rocchio) is improved by term selection with DTP.

4 Classifiers and Data

In order to assess the effectiveness of the FS methods we used two classifiers frequently used as baselines in TC, k-NN [13] and Rocchio [3], both of which treat documents as term vectors.

k-NN is based on the categories assigned to the k nearest training documents to the new document. The categories of these neighbors are weighted using the similarity of each neighbor to the new document, where the similarity is measured by the cosine between the two document vectors. If one category belongs to multiple neighbors, then the sum of the similarity scores of these neighbors is the weight of the category [2],[10],[13].

Rocchio is based on the relevance feedback algorithm originally proposed for information retrieval. The basic idea is to construct a prototype vector for each category using a training set of documents. Given a category, the vectors of documents belonging to this category are given a positive weight, and the vectors of the remaining documents are given a negative weight. By adding these positively and negatively weighted vectors, the prototype vector of the category is obtained. To classify a new document, the cosine between the new document and each prototype vector is computed [6],[10],[13].

The texts used in our experiments are Spanish news downloaded from the Mexican newspaper La Jornada. We preprocessed the texts by removing stopwords, punctuation and numbers, and by stemming the remaining words with a Porter stemmer adapted to Spanish. Term weighting was done by means of the standard tf.idf function [9]. We used a total of 1,449 documents belonging to six different categories (C: Culture, S: Sports, E: Economy, W: World, P: Politics, J: Society & Justice) for training and two testing sets (see Table 2). We used a single-label setting, i.e., each document was assigned to exactly one category.

Table 2. Training and testing data

  Categories            C      S      E      W      P      J
  Training data
    No. of documents    104    114    107    127    93     91
    No. of terms        7,205  4,747  3,855  5,922  4,857  4,458
  Test data set 1
    No. of documents    58     57     69     78     89     56
    No. of terms        5,301  3,333  3,286  4,659  4,708  3,411
  Test data set 2
    No. of documents    83     65     61     51     90     56
    No. of terms        6,420  3,855  2,831  3,661  4,946  3,822
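As an illustration of the Rocchio prototype construction described above, here is a sketch of our own (not the authors' implementation). It assumes sparse term-weight dictionaries, at least one positive and one negative document per category, and default beta/alpha weights of 16 and 4; the cosine function is the same similarity used by k-NN:

```python
import math

def rocchio_prototypes(docs, labels, beta=16.0, alpha=4.0):
    """Build one prototype per category: beta times the centroid of the
    category's documents minus alpha times the centroid of the rest.
    docs: list of {term: weight} dicts; labels: category label per doc."""
    vocab = {t for d in docs for t in d}
    protos = {}
    for c in set(labels):
        pos = [d for d, l in zip(docs, labels) if l == c]
        neg = [d for d, l in zip(docs, labels) if l != c]
        protos[c] = {t: beta * sum(d.get(t, 0.0) for d in pos) / len(pos)
                        - alpha * sum(d.get(t, 0.0) for d in neg) / len(neg)
                     for t in vocab}
    return protos

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(protos, doc):
    """Assign the category whose prototype is most cosine-similar."""
    return max(protos, key=lambda c: cosine(protos[c], doc))

# Toy illustration: two sports ("S") and two economy ("E") documents
docs = [{"goal": 1.0, "match": 1.0}, {"goal": 1.0},
        {"peso": 1.0, "market": 1.0}, {"market": 1.0}]
protos = rocchio_prototypes(docs, ["S", "S", "E", "E"])
```

Terms frequent in a category's documents get large positive prototype weights, while terms frequent elsewhere are pushed negative, which is what lets the cosine separate the categories.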
To evaluate the effectiveness of document classification by each classifier, the standard precision, recall and F_1 measures were used. Precision is the number of documents correctly classified divided by the total number of documents classified. Recall is the number of documents correctly classified divided by the total number of documents that should have been classified. The F_1 measure combines precision (P) and recall (R) as follows: F_1 = 2RP/(R+P). These values can be computed for each individual category first and then averaged over all categories, or they can be computed globally over all categories. These strategies are called macroaveraging and microaveraging, respectively. As in [10], we evaluated microaveraged F_1.

5 Experiments

We performed our FS experiments with both a k-NN classifier (using k = 30) and a Rocchio classifier (with beta = 16 and alpha = 4, as used in [6]). In these experiments we compared three baseline term selection techniques, DF, IG and CHI_max, and two variants of our DTP technique, DTP and DTP_max. Table 3 lists the F_1 values obtained for k-NN and Rocchio with the evaluated FS techniques at different percentages of terms (the vocabulary size in the training set is 14,272).

Table 3. Microaveraged F_1 values for k-NN and Rocchio on the test sets

  Percent             k-NN                              Rocchio
  of terms  DF    IG    CHI_max  DTP   DTP_max   DF    IG    CHI_max  DTP   DTP_max
  1         .627  .716  .720     .676  .667      .611  .723  .712     .681  .668
  3         .697  .769  .758     .759  .756      .701  .756  .749     .742  .739
  5         .754  .780  .779     .760  .786      .750  .767  .760     .774  .780
  10        .782  .806  .803     .791  .797      .775  .783  .787     .807  .788
  15        .802  .811  .801     .807  .799      .782  .801  .793     .811  .791
  20        .807  .811  .804     .811  .804      .799  .806  .806     .820  .803
  25        .804  .824  .813     .815  .806      .799  .806  .811     .815  .804
  50        .809  .813  .803     .814  .806      .807  .807  .815     .829  .811

As seen in Table 3, on both the k-NN and Rocchio tests DTP is superior to DF, and comparable to IG and CHI_max up to about 5% and 3% of terms respectively, becoming superior for percentages higher than those. These results, obtained under both DTP variants, show that the unsupervised policy performs better than its supervised counterpart.
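Microaveraging, as used for Table 3, pools the per-category decision counts before computing P and R once. A small sketch of our own, over hypothetical per-category sets of assigned and correct document ids:

```python
def micro_f1(assigned, correct):
    """Microaveraged F1: sum true/false positives and false negatives
    over all categories, then compute P, R and F1 from the pooled counts.
    assigned, correct: {category: set of document ids}."""
    tp = sum(len(assigned[c] & correct[c]) for c in correct)
    fp = sum(len(assigned[c] - correct[c]) for c in correct)
    fn = sum(len(correct[c] - assigned[c]) for c in correct)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

Macroaveraging would instead compute F_1 per category from each category's own counts and average the results, which weights small categories more heavily.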
Results published in [12] showed that common terms are often informative, and vice versa. Our results with DTP do not contradict this, for only the terms that have an extremely low or high frequency are removed, while the terms with medium
frequency score highest and are preserved. Another interesting result is that unsupervised DTP, while not using category information from the training set, has performance similar to supervised IG and CHI. In addition, DTP is much easier to compute than IG and CHI.

6 Conclusions

In this paper we have presented a novel term selection method for TC: distance to transition point (DTP), which is based on the proximity to the frequency that splits the terms of a text into low and high frequency terms, i.e., the transition point (TP). Experiments performed on Spanish texts with two classifiers (k-NN and Rocchio) showed that feature selection by DTP achieves superior performance to document frequency, and comparable performance to information gain and the chi-statistic, three well-known and effective techniques. Remarkably, DTP is a simple and easy-to-compute method. The degree of enhancement our method brings to TC and its relationship to other methods in the literature are the subject of future investigations by the authors.

References

1. Booth, A.: A Law of Occurrences for Words of Low Frequency, Information and Control, 10(4) (1967) 386-393.
2. Galavotti, L., Sebastiani, F., Simi, M.: Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization, Proc. of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries, (2000) 59-68.
3. Joachims, T.: A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Proc. of ICML-97, 14th Int. Conf. on Machine Learning, (1997) 143-151.
4. John, G.H., Kohavi, R., Pfleger, K.: Irrelevant Features and the Subset Selection Problem, Proc. of ICML-94, 11th Int. Conf. on Machine Learning, (1994) 121-129.
5. Karypis, G., Han, E.H.: Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval & Categorization, Technical Report TR-00-0016, University of Minnesota, (2000).
6. Lewis, D.D., Schapire, R.E., Callan, J.P., Papka, R.: Training Algorithms for Linear Text Classifiers, Proc. of SIGIR-96, 19th ACM Int. Conf.
on Research and Development in Information Retrieval, (1996) 298-306.
7. Moyotl, E., Jiménez, H.: An Analysis on Frequency of Terms for Text Categorization, Proc. of SEPLN-04, (2004) 141-146.
8. Moyotl, E., Jiménez, H.: Distancia al Punto de Transición: Un Nuevo Método de Selección de Términos para Categorización de Textos, Tesis de Licenciatura, Facultad de Ciencias de la Computación, BUAP, Puebla, México, (2004).
9. Salton, G., Wong, A., Yang, C.: A Vector Space Model for Automatic Indexing, Communications of the ACM, 18(11) (1975) 613-620.
10. Sebastiani, F.: Machine Learning in Automated Text Categorization, ACM Computing Surveys, 34(1) (2002) 1-47.
11. Urbizagástegui-Alvarado, R.: Las posibilidades de la ley de Zipf en la indización automática, Reporte de la Universidad de California Riverside, (1999).
12. Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization, Proc. of ICML-97, 14th Int. Conf. on Machine Learning, (1997) 412-420.
13. Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods, Proc. of SIGIR-99, 22nd ACM Int. Conf. on Research and Development in Information Retrieval, (1999) 42-49.
14. Zipf, G.K.: Human Behavior and the Principle of Least Effort, Addison-Wesley, (1949).