Issues and Empirical Results for Improving Text Classification


Issues and Empirical Results for Improving Text Classification

Youngjoong Ko 1 and Jungyun Seo 2

1 Dept. of Computer Engineering, Dong-A University, 840 Hadan 2-dong, Saha-gu, Busan, 604-714, Korea, yko@dau.ac.kr
2 Dept. of Computer Engineering, Sogang University, Sinsu-dong, Mapo-gu, Seoul, 121-742, Korea, seojy@sogang.ac.kr

1. Corresponding author: Jungyun Seo
2. Corresponding address: Dept. of Computer Engineering, Sogang University, Sinsu-dong, Mapo-gu, Seoul 121-742, Korea
3. Corresponding telephone number: 82-2-705-8488
4. Corresponding fax number: 82-2-706-8954
5. Corresponding Email address: seojy@sogang.ac.kr

Issues and Empirical Results for Improving Text Classification

ABSTRACT

Automatic text classification has a long history, and many studies have been conducted in this literature. In particular, many machine learning algorithms and information retrieval techniques have been applied to text classification tasks. Even though much technical progress has been made, there is still room for improvement in text classification. We discuss here which issues remain open for improving text classification. In this paper, three improvement issues are presented, namely automatic training data generation, noisy data treatment, and term weighting and indexing, and four actual studies and their empirical results for those issues are introduced. First, a semi-supervised learning technique is applied to text classification to create training data efficiently. For effective noisy data treatment, a noisy data reduction method and a text classifier robust to noisy data are developed as solutions. Finally, the term weighting and indexing technique is reformed by reflecting the importance of sentences in term weight calculation using summarization techniques.

KEYWORDS

Text Classification, Improvement Issues, Semi-supervised Learning, Noisy Data Reduction, TCFP Classifier, Importance of Sentence, Term Weighting and Indexing

1. INTRODUCTION

In recent years, automatic content-based document management tasks have gained a prominent status in the information systems field, due to the widespread and continuously increasing availability of documents in digital form. In particular, the task of classifying natural language documents into a pre-defined set of semantic categories has become one of the key methods for organizing online information amid the rapid growth of

the World Wide Web [Sebastiani, 2002]. This task is commonly referred to as text classification. Since there has been an explosion of electronic texts from not only the World Wide Web but also various online sources (electronic mail, corporate databases, chat rooms, digital libraries, etc.), one way of organizing this overwhelming amount of data is to classify it into topical categories. Since the machine learning paradigm emerged in the 90s, many machine learning algorithms have been applied to text classification. Within the machine learning paradigm, a general inductive process automatically builds a text classifier by learning from a set of previously classified documents. The advantages of this approach are accuracy comparable to human performance and considerable savings in terms of manpower. In addition, text classification is deeply related to information retrieval, as the foundation of automated content-based document management. Thus, text classification may be seen as the meeting point of machine learning and information retrieval. Information gain and χ2 statistics, among others, have been used for feature selection [Yang, 1997], and Naive Bayes [Ko and Seo, 2000; McCallum and Nigam, 1998], Rocchio [Lewis et al., 1996], Nearest Neighbor (k-NN) [Yang et al., 2002], Support Vector Machine (SVM) [Joachims, 2001], etc. have been broadly employed as text classifiers. Since SVM was applied to text classification, it has dominated other text classifiers in terms of performance. However, text classification still has many points to be improved, such as automatic training data generation, noisy data reduction, a classifier robust to noisy data, and term weighting and indexing, especially depending on the application area. In this paper, we enumerate improvement issues for text classification and introduce improvement attempts and their empirical results as actually applied in classification tasks.
Among the many improvement issues in text classification, we focus on three important ones: automatic training data generation, noisy data treatment (noisy data reduction and a classifier robust to noisy data), and term weighting and indexing. In supervised-learning based text classification, obtaining good-quality training data is very important, whereas labeling training data is a painfully time-consuming process. To reduce the burden of labeling tasks, semi-supervised or unsupervised learning techniques have been applied to text classification [Ko and Seo, 2009; Slonim et al., 2002]. Even though those labeling tasks are

performed by experts over a long time, the resulting training data almost always includes some noisy examples. Removing noisy data is another key improvement issue for text classification. For this, we introduce two different solutions: noisy data reduction in training settings with positive and negative examples [Han et al., 2007], and the development of a classifier robust to noisy data [Ko and Seo, 2004a]. Finally, we pay attention to the term weighting and indexing phase of text classification. Most techniques related to it originated in the information retrieval literature. The problem is that they are applied to text classification with the same methodology as in information retrieval, even though text classification and information retrieval have different properties and application surroundings. The importance of each sentence in a document, measured using text summarization techniques, is used for a new term weighting scheme [Ko and Seo, 2004b]. The rest of this paper enumerates issues for improving text classification and their empirical results step by step, as follows. Section 2 presents the first improvement issue, the labeling task, and one solution based on semi-supervised learning. In Section 3, we explain the necessity of noisy data treatment and two approaches to it: the noisy data reduction method applied to binary text classification and the TCFP classifier, which is robust to noisy data. Section 4 describes the necessity of improved term weighting for text classification and introduces a new term weighting scheme. Finally, we describe conclusions.

2. UTILIZATION OF UNLABELED DATA TO REDUCE THE PAINFUL AMOUNT OF MANUAL LABELING TASKS FOR OBTAINING TRAINING DATA

Generally, supervised-learning based text classification requires a large, often prohibitive, number of labeled training documents for accurate learning. Since the labeling tasks must be done by hand and the application area of text classification has diversified from articles and web pages to emails and blogs, the labeling tasks for each application area are becoming harder and harder.
Thus, most users of a practical system obviously prefer algorithms that achieve high accuracy but do not require a painful amount of manual labeling. To answer this wish, unsupervised learning [Slonim et al., 2002] as well as semi-supervised learning [Ko and Seo, 2009; Nigam et al.,

2006] and active learning [Tong and Koller, 2001] have been applied to text classification; they all attempt to utilize unlabeled data instead of labeled data. While labeled data is hard to obtain, unlabeled data is readily available and plentiful. Therefore, these learning methods are very useful for exploiting unlabeled data in text classification. We advocated a new automatic machine-labeling method using a bootstrapping technique. The proposed method uses only unlabeled documents, and the title word of each category serves as the initial data for learning the text classifier. The input to the bootstrapping process is a large amount of unlabeled documents and a small amount of seed information to tell the learner about the specific task. Here, a title word associated with a category is considered as seed information.

2.1 Bootstrapping Technique to Generate Machine-labeled Data

The bootstrapping process consists of three modules: a module to preprocess unlabeled documents, a module to construct context-clusters for training, and a module to build up the Naive Bayes classifier using context-clusters.

2.1.1 Preprocessing

First of all, the Brill POS tagger is used to extract content words, i.e., words with noun or verb POS tags [Brill, 1995]. Since machine-labeled data has to be created from only a title word, a context is defined as a new unit of meaning and is used to bootstrap the meaning of each category. It is a middle-sized processing unit between a word and a document. A sequence of 60 content words within a document is regarded as the window size for one context. To extract contexts from a document, we use a sliding window technique [Maarek et al., 1991]. The window slides from the first content word to the last content word of the document, with a window size of 60 words and an interval of 30 words between windows.
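The sliding-window context extraction described above can be sketched as follows. This is a minimal illustration; the function name and the handling of documents shorter than one window are my own assumptions, not details given in the paper:

```python
def extract_contexts(content_words, window_size=60, interval=30):
    """Slide a fixed-size window over the content words of a document.

    Each window of `window_size` consecutive content words becomes one
    context; the window advances by `interval` words, so consecutive
    contexts overlap by half a window (60-word windows, 30-word step).
    """
    contexts = []
    for start in range(0, max(len(content_words) - window_size, 0) + 1, interval):
        contexts.append(content_words[start:start + window_size])
    # Assumption: a document shorter than one window still yields one context.
    if not contexts and content_words:
        contexts.append(content_words[:])
    return contexts

# Example: 120 content words -> windows starting at positions 0, 30, 60.
words = [f"w{i}" for i in range(120)]
print([len(c) for c in extract_contexts(words)])  # [60, 60, 60]
```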

2.1.2 Constructing a Context-Cluster as the Training Data of Each Category

First, keywords are automatically generated from the title word of each category using co-occurrence information. Then centroid-contexts are extracted using the title word and keywords. Each centroid-context includes at least one of the title word and keywords, and is regarded as one of the most informative contexts for its category. Furthermore, more information about each category is obtained by assigning the remaining contexts to each context-cluster with a similarity measure technique. Contexts containing a keyword or a title word of any category are selected as centroid-contexts. From the selected contexts, we obtain a set of words in first-order co-occurrence with the centroid-contexts of each category. We next gather second-order co-occurrence information by assigning the remaining contexts to the context-cluster of each category. As the assignment criterion, we calculate similarities between the remaining contexts and the centroid-contexts of each category. For this, the similarity measure algorithm by Karov and Edelman [1998] is reformed and applied to our context-cluster generation algorithm; the remaining contexts are assigned to each context-cluster by this algorithm. In our similarity measure algorithm, words and contexts play complementary roles. Contexts are similar to the extent that they contain similar words, and words are similar to the extent that they appear in similar contexts. This definition is circular, so it is applied iteratively using two matrices, the Word Similarity Matrix (WSM) and the Context Similarity Matrix (CSM); the rows and columns of WSM are labeled by all the content words encountered in the centroid-contexts of each category and the input remaining contexts, while the rows of CSM correspond to the centroid-contexts and the columns to the remaining contexts. Each category has one WSM and one CSM.
In each iteration n, WSM_n, whose cell (i,j) holds a value between 0 and 1, is updated; the value of each cell indicates the extent to which the i-th word is contextually similar to the j-th word. Also, CSM_n, which holds similarities among contexts, is kept and updated. The number of input contexts in the rows and columns of CSM is limited to 200 in consideration of execution time and memory allocation. To estimate the similarities, WSM is initialized to the identity matrix; that is, each word is fully similar (1) to itself and completely dissimilar (0) to other words. The following steps are iterated until the changes in the

similarity values are small enough: update the context similarity matrix CSM_n using the word similarity matrix WSM_n, and update the word similarity matrix WSM_n using the context similarity matrix CSM_n. To simplify the symmetric iterative treatment of similarities between words and contexts, an auxiliary relation between words and contexts, called affinity and written aff_n(W,X) and aff_n(X,W), is defined by formulae (1) and (2) [Karov and Edelman, 1998]:

  aff_n(W, X) = max_{W_i ∈ X} sim_n(W, W_i)    (1)

  aff_n(X, W) = max_{W ∈ X_j} sim_n(X, X_j)    (2)

In the above formulae, n denotes the iteration number, W ∈ X means that a word W belongs to a context X, and the similarity values are given by WSM_n and CSM_n. Every word has some affinity to a context, and the context can be represented by a vector indicating the affinity of each word to it. The similarity of W_1 to W_2 is the average affinity of the contexts that include W_1 to W_2, and the similarity of a context X_1 to X_2 is a weighted average of the affinities of the words in X_1 to X_2. The similarity formulae are defined as follows:

  sim_{n+1}(X_1, X_2) = Σ_{W ∈ X_1} weight(W, X_1) · aff_n(W, X_2)    (3)

  sim_{n+1}(W_1, W_2) = 1                                              if W_1 = W_2
  sim_{n+1}(W_1, W_2) = Σ_{X ∋ W_1} weight(X, W_1) · aff_n(X, W_2)    otherwise    (4)
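The circular definition in formulae (1)-(4) can be sketched as a small fixed-point iteration. This is an illustrative simplification, not the paper's implementation: it uses uniform weights in formulae (3) and (4), and plain dictionaries instead of the 200-context-limited matrices:

```python
def iterate_similarity(contexts, n_iter=5):
    """Sketch of the iterative word/context similarity update of formulas
    (1)-(4), with uniform weights (an assumption of this sketch; the paper
    weights words by global frequency, log-likelihood factor and POS).

    `contexts` is a list of word lists. Returns (wsm, csm): word-pair and
    context-pair similarity dictionaries.
    """
    vocab = sorted({w for c in contexts for w in c})
    n_ctx = len(contexts)
    # WSM initialised to the identity: each word fully similar only to itself.
    wsm = {(w1, w2): 1.0 if w1 == w2 else 0.0 for w1 in vocab for w2 in vocab}
    csm = {(a, b): 0.0 for a in range(n_ctx) for b in range(n_ctx)}
    holders = {w: [i for i, c in enumerate(contexts) if w in c] for w in vocab}
    for _ in range(n_iter):
        # (3) with aff (1): sim(X1, X2) = avg over words W of X1 of
        #                   max_{W' in X2} sim(W, W')
        for a, x1 in enumerate(contexts):
            for b, x2 in enumerate(contexts):
                csm[a, b] = sum(max(wsm[w, w2] for w2 in x2) for w in x1) / len(x1)
        # (4) with aff (2): sim(W1, W2) = avg over contexts X holding W1 of
        #                   max_{X' holding W2} sim(X, X'); 1 if W1 == W2
        new = {}
        for w1 in vocab:
            for w2 in vocab:
                if w1 == w2:
                    new[w1, w2] = 1.0
                else:
                    affs = [max(csm[x, y] for y in holders[w2]) for x in holders[w1]]
                    new[w1, w2] = sum(affs) / len(affs)
        wsm = new
    return wsm, csm
```

For example, two words that always appear in the same contexts converge to similarity 1, while words that never share a context (directly or through similar contexts) stay at 0.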

The weights in formula (3) are calculated as the product of three factors: global frequency, a log-likelihood factor, and part of speech. Since each weight in formula (4) is the reciprocal of the number of contexts that contain W_1, the weights sum to 1. These values are used to update the corresponding entries of WSM_n and CSM_n. The similarity of each remaining context to the centroid-contexts of a category is first estimated, and the similarity values are then averaged. Finally, each remaining context is assigned to the context-cluster of the category with the maximum similarity.

2.1.3 Learning a Naive Bayes Classifier Using Context-Clusters

In the above section, we obtained labeled contexts as training data: the context-clusters. Since the training data is labeled at the context unit, a Naive Bayes classifier is selected to learn from the context-clusters, because it can be built by estimating only the word probabilities in each category. Therefore, the Naive Bayes classifier is constructed by estimating the word distribution in the context-cluster of each category, and it finally classifies unlabeled documents into the categories. The Naive Bayes classifier is built with minor modifications based on Kullback-Leibler Divergence [Craven et al., 2000]. This method makes exactly the same classifications as Naive Bayes but produces classification scores that are less extreme and thus better reflect uncertainty than those produced by Naive Bayes. A document d_i is classified by the following formula:

  P(c_j | d_i; θ̂) = P(c_j | θ̂) P(d_i | c_j; θ̂) / P(d_i | θ̂)
                  ∝ P(c_j | θ̂) Π_{t=1}^{|V|} P(w_t | c_j; θ̂)^{N(w_t, d_i)}
                  ∝ log P(c_j | θ̂) / n + Σ_{t=1}^{|V|} P(w_t | d_i) log ( P(w_t | c_j; θ̂) / P(w_t | d_i) )    (5)

where n is the number of words in document d_i, w_t is the t-th word in the vocabulary, N(w_t, d_i) is the frequency of word w_t in document d_i, and P(w_t | d_i) = N(w_t, d_i) / n.
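Formula (5) can be read as scoring a document by the negative KL divergence between the document's word distribution and the category's word distribution, plus a scaled log prior. A minimal sketch follows; the function name and the toy probabilities are my own, and the category word probabilities are assumed to be pre-smoothed so every document word has nonzero probability:

```python
import math
from collections import Counter

def kl_nb_score(doc_words, prior, word_prob):
    """Score one category for a document with the KL-divergence form of
    Naive Bayes, formula (5):
        log P(c) / n + sum_t P(w_t|d) * log( P(w_t|c) / P(w_t|d) )
    `prior` is P(c); `word_prob` maps word -> P(w|c) (assumed smoothed).
    """
    n = len(doc_words)
    counts = Counter(doc_words)
    score = math.log(prior) / n
    for w, k in counts.items():
        p_wd = k / n                      # P(w_t | d), the in-document frequency
        score += p_wd * math.log(word_prob[w] / p_wd)
    return score

# Toy usage with hypothetical probabilities: the ranking matches plain
# Naive Bayes, but the scores are less extreme.
doc = ["ball", "ball", "game"]
p_sport = {"ball": 0.5, "game": 0.4}
p_politics = {"ball": 0.05, "game": 0.05}
print(kl_nb_score(doc, 0.5, p_sport) > kl_nb_score(doc, 0.5, p_politics))  # True
```

The ranking is identical to plain Naive Bayes because the extra term, minus the document's own entropy, does not depend on the category.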

2.2 Empirical Evaluation

2.2.1 Data Sets and Experimental Settings

To test the proposed method, we used three different kinds of data sets: UseNet newsgroups (Newsgroups), web pages (WebKB), and newswire articles (Reuters 21578). For fair evaluation on Newsgroups and WebKB, we employed five-fold cross-validation: each data set is split into five subsets, and each subset is used once as test data in a particular run while the remaining subsets are used as training data for that run. The split into training and test sets for each run is the same for all classifiers, so all the results of our experiments are averages over five runs. About 25% of the documents from the training data of each data set were selected as a validation set. After all parameter values of our experiments were set on the validation set, we evaluated the proposed method using these parameter values. We applied a statistical feature selection method (χ2 statistics) for each classifier at its preprocessing stage [Yang and Pedersen, 1997]. As performance measures, we followed the standard definitions of recall (r), precision (p), and the F_1 measure (2rp/(r+p)). For evaluating average performance across categories, we used the micro-averaging method, which counts the decisions for all the categories in a joint pool and computes the global recall, precision, and F_1 values for that global pool [Yang, 1999]. Results on Reuters are reported as a precision-recall breakeven point, which is a standard information retrieval measure for binary classification [Joachims, 2001; Yang, 1999].

2.2.2 Experimental Results

Here, we employ a supervised Naive Bayes classifier to compare our method with the supervised method; the supervised Naive Bayes classifier learns from human-labeled documents. Figure 1 and Table 1 report the results on the three data sets and compare the performance differences between the proposed method and the supervised method.
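The micro-averaging procedure described above, pooling all categories' decisions before computing the global measures, can be sketched as follows (the tuple-based interface is my own choice for illustration):

```python
def micro_f1(per_category_counts):
    """Micro-averaged recall, precision and F1.

    `per_category_counts` is a list of (tp, fp, fn) tuples, one per
    category; the decisions are pooled into one joint count before the
    global measures are computed, so frequent categories dominate.
    """
    tp = sum(c[0] for c in per_category_counts)
    fp = sum(c[1] for c in per_category_counts)
    fn = sum(c[2] for c in per_category_counts)
    r = tp / (tp + fn)
    p = tp / (tp + fp)
    f1 = 2 * r * p / (r + p)
    return r, p, f1

# Two categories with (tp, fp, fn) counts: one easy, one hard.
r, p, f1 = micro_f1([(8, 2, 0), (2, 0, 8)])
print(round(r, 3), round(p, 3), round(f1, 3))  # 0.556 0.833 0.667
```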

Figure 1. Performance differences of the best micro-avg. F1 scores or precision-recall break-even points in three data sets: our method vs. supervised NB (bar chart; the values are given in Table 1)

Table 1. Performance differences of the best micro-avg. F1 scores or precision-recall break-even points in three data sets: our method vs. supervised NB

Data Sets     Our method   Supervised NB
Newsgroups    79.36        91.72
WebKB         73.63        85.29
Reuters       88.62        91.64

As shown in Table 1, we obtained a 79.36% micro-average F1 score on the Newsgroups data set, a 73.63% micro-average F1 score on the WebKB data set, and an 88.62% micro-average precision-recall breakeven point on the Reuters data set. The differences between our method and the supervised Naive Bayes classifier are 12.36% on the Newsgroups data set, 11.66% on the WebKB data set, and 3.02% on the Reuters data set. Since we use only unlabeled data and title words, the performance of our method is all the more significant. In particular, the proposed method achieved almost comparable performance to the supervised method on the Reuters data set. As previously noted in [Joachims, 1997], categories like "wheat" and "corn" are known for a strong correspondence between a small set of words (like our title words and keywords) and the categories, while categories like "acq" are known for more complex characteristics. Since categories with narrow definitions attain their best classification with small vocabularies, our method, which depends on title words, can achieve good performance on the Reuters data set. On the Newsgroups and WebKB data sets, we could not attain performance comparable to the supervised method. In fact, the categories of these data sets are somewhat confusable. In the Newsgroups data set, many of the categories fall into confusable clusters: for example, five of them are comp.* discussion groups, and three of them discuss religion. In the WebKB data set, meaningful words of each category also have high frequency in other categories. Worst of all, even the title words (course, professor, faculty, project) have a confusing usage. We think these factors contributed to the comparatively poor performance of our method.

3. NECESSITY OF NOISE REDUCTION AND ITS TWO CLUES: NOISY DATA REDUCTION AND THE TCFP CLASSIFIER WITH ROBUSTNESS FROM NOISY DATA

Effectively dealing with noisy data is another key improvement issue for text classification. There are two different solutions: noisy data removal in binary training settings built by the one-against-the-rest method [Han et al., 2007], and the TCFP classifier, which is robust to noisy data [Ko and Seo, 2004a].

3.1 Improving the One-against-the-rest Method for Removing Noisy Data in Binary Text Classification

In text classification, a binary setting or a multi-class setting is used to organize training examples for learning tasks. As the binary setting consists of only two classes, it is the simplest, yet most important, formulation of the learning problem. The two classes are composed of relevant (positive) and non-relevant (negative) examples in information retrieval applications [Joachims, 2002]. Generally, some classification tasks involve more than two classes. When we reduce a multi-class setting with more than two classes to binary settings, there is a problem: the multi-class setting consists of only positive examples of each category; each category does not have negative examples. To solve this problem, the one-against-the-rest method has been used in many cases [Zadrozny and Elkan, 2001; Zadrozny and Elkan, 2002; Hsu and Lin, 2002]; it reduces a multi-class problem to many binary tasks. That is, while all the documents of a category are generated as positive examples by hand, the documents that do not belong to the category are indirectly regarded as negative examples.
This labeling task concentrates on selecting only positive examples for each category; it does not directly label the

negative examples that have the opposite meaning of the counterpart positive category. Thus the negative data set in the one-against-the-rest method will probably include noisy examples. In addition, because the negative data set consists of positive examples drawn from various other categories with different distributions, it can hardly be considered an exact set of negative examples for each category. These noisy documents can be a major cause of decreased performance in binary text classification. Therefore, classifiers need to handle the noisy training documents efficiently to achieve high performance.

3.1.1 Detecting and Removing Noisy Data from the One-against-the-rest Method

In the one-against-the-rest method, the documents of one category are regarded as positive examples and the documents of the other categories as negative examples. To effectively remove noisy data in the one-against-the-rest training setting, we have to find a boundary area, that is, a region including many noisy documents. First of all, using the initial positive and negative data sets of each category from the one-against-the-rest method, we learn a Naive Bayes (NB) classifier and obtain a prediction score for each document by the following formula (6):

  Prediction_Score(c_i | d_ij) = P(Positive | d_ij) / ( P(Positive | d_ij) + P(Negative | d_ij) )    (6)

where c_i denotes a category and d_ij a document of c_i. P(Positive | d_ij) is the probability that document d_ij is positive in c_i, and P(Negative | d_ij) is the probability that it is negative in c_i. According to the calculated prediction scores, all the documents of each category are sorted in descending order. The probabilities P(Positive | d_ij) and P(Negative | d_ij) in formula (6) are calculated by the Naive Bayes formula as follows [Lewis, 1998; Ko and Seo, 2004b; Craven et al., 2000]:

  P(Positive | d_ij) = P(Positive) P(d_ij | Positive) / P(d_ij)
                     ∝ P(Positive) Π_{i=1}^{|T|} P(t_i | Positive)^{N(t_i, d_ij)}
                     ∝ log P(Positive) / n + Σ_{i=1}^{|T|} P(t_i | d_ij) log ( P(t_i | Positive) / P(t_i | d_ij) )    (7)

where t_i is the i-th word in the vocabulary, |T| is the size of the vocabulary, N(t_i, d_ij) is the frequency of word t_i in document d_ij, and n is the number of words in d_ij. A boundary between positive and negative examples can be detected in the block with the most mixed distribution of positive and negative documents. The sliding window technique is first used to detect this block [Lee et al., 2001]. In this technique, windows of a certain size slide from the top document to the last document in the list ordered by the prediction scores. An entropy value is calculated to estimate the mixed degree of each window as follows [Mitchell, 1997]:

  Entropy(W) = −p_+ log_2 p_+ − p_− log_2 p_−    (8)

where, given a window W, p_+ is the proportion of positive documents in W and p_− is the proportion of negative documents in W. Two windows with the highest entropy value are picked: the first such window detected from the top and the first detected from the bottom. If there is no window or only one window with the highest entropy value, windows with the next highest entropy value become the selected windows. Then maximum (max) and minimum (min) threshold values are found in the selected windows respectively. The max threshold value is the highest prediction score of a negative document in the former window, and the min

threshold value is the lowest prediction score of a positive document in the latter window. We regard the documents between the max and min threshold values as unlabeled documents; these documents are considered to be potentially noisy. Now, three classes of training documents are constructed for each category: definitely positive documents, unlabeled documents, and definitely negative documents. By applying the revised EM algorithm to these three data sets, we can extract the actual noisy documents and remove them. The EM algorithm is used to pick out noisy documents from the unlabeled data and to remove them. The general EM algorithm consists of two steps, the Expectation step and the Maximization step [Dempster et al., 1977]. The algorithm first trains a classifier using the available labeled documents and labels the unlabeled documents by hard classification (the Expectation (E or E') step). It then trains a new classifier using the labels of all the documents (the Maximization (M) step), and iterates until convergence. The Naive Bayes classifier is used in both steps of the EM algorithm. Figure 2 shows how the EM algorithm is revised in our method.

The Revised EM Algorithm

  Every document d_p in P (positive data) is assigned the class label c_p;
  Every document d_n in N (negative data) is assigned the class label c_n;
  Let d_u be a document of U (unlabeled data);
  Build an initial Naive Bayesian classifier, ĉ, from P and N;
  Loop while the classifier parameters (ĉ) are improved:
    (E-step)  P ← {}; N ← {};
              Use the current classifier: for each document d_u ∈ U do
                if P(c_p | d_u) > P(c_n | d_u) then P ← P ∪ {d_u};
                else N ← N ∪ {d_u};
    (E'-step) P ← {}; N ← {};
              Use the current classifier: for each document d_u ∈ U do
                if P(c_p | d_u) > P(c_n | d_u) then continue;   (not P ← P ∪ {d_u})
                else N ← N ∪ {d_u};
    (M-step)  Re-estimate the classifier, ĉ, from P and N;
              Use maximum a posteriori parameter estimation to find ĉ = argmax_c P(c) P(U | c);

Figure 2. The revised EM algorithm
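The entropy measure of formula (8) and the sliding-window boundary search can be sketched as follows; the window size and the '+'/'-' label encoding are my own assumptions for illustration:

```python
import math

def window_entropy(labels):
    """Entropy of one window per formula (8); labels are '+' / '-'."""
    p_pos = labels.count('+') / len(labels)
    p_neg = 1.0 - p_pos
    ent = 0.0
    for p in (p_pos, p_neg):
        if p > 0:
            ent -= p * math.log2(p)
    return ent

def boundary_windows(sorted_labels, size=4):
    """Slide a window over the documents sorted by prediction score and
    return the start indices of the first maximum-entropy window found
    from the top and from the bottom (the window size is a free parameter).
    """
    ents = [window_entropy(sorted_labels[i:i + size])
            for i in range(len(sorted_labels) - size + 1)]
    best = max(ents)
    top = next(i for i, e in enumerate(ents) if e == best)
    bottom = next(i for i in reversed(range(len(ents))) if ents[i] == best)
    return top, bottom

# Documents sorted by descending prediction score: positives first, a
# mixed (boundary) region in the middle, negatives last.
labels = list("++++++-+-+------")
print(boundary_windows(labels))  # (5, 7)
```

Documents between the thresholds derived from these two windows would then be treated as unlabeled, i.e., potentially noisy.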

The E'-step is reformed to effectively remove the noisy documents located in the boundary area. Unlike the original E-step, it does not assign an unlabeled document, d_u, to the positive data set, P, because it regards d_u as a potentially noisy document; since the positive documents are labeled by hand and carry enough information for a category, additional positive documents can decrease performance. Finally, we can learn the text classifiers with the binary training data generated by the revised EM algorithm.

3.1.2 Empirical Evaluation

The experimental results show that the proposed method achieved better performance than the original one-against-the-rest method on all three training data sets and with all four classifiers. To evaluate the effectiveness of the proposed method, we implemented four different text classifiers (k-NN, Naive Bayes (NB), Rocchio, and SVM), and the performance of the original one-against-the-rest method is compared to that of the proposed method on three test data sets (Newsgroups, WebKB, and Reuters). As performance measures, the standard definitions of recall and precision are used, and the micro-averaging and macro-averaging methods are applied to evaluate average performance across categories; in the macro-averaging method, the recall, precision, and F1 measures are first computed for individual categories and then averaged over categories as a global measure of the average performance over all categories [Yang, 2002]. Results are reported as the precision-recall BEP (BreakEven Point), a standard information retrieval measure for binary classification; given a ranking of documents, the precision-recall breakeven point is the value at which precision and recall are equal [Joachims, 1998; Yang, 1999; Ko and Seo, 2004a]. Table 2, Table 3, and Table 4 show the experimental results of each text classifier on the Newsgroups, WebKB, and Reuters data sets, respectively.
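To make the breakeven-point measure concrete, the sketch below (hypothetical ranking; a simple closest-point search, since precision and recall may never be exactly equal at any single cutoff) scans a ranked list and reports the point where precision and recall meet:

```python
def breakeven_point(ranked_labels, n_relevant):
    """Scan the ranking top-down; the precision-recall breakeven point is
    the value where precision and recall are (closest to) equal."""
    best, best_gap = 0.0, float("inf")
    hits = 0
    for k, y in enumerate(ranked_labels, start=1):
        hits += y
        precision = hits / k
        recall = hits / n_relevant
        gap = abs(precision - recall)
        if gap < best_gap:
            best_gap, best = gap, (precision + recall) / 2
    return best

# 4 relevant documents in total; 1 marks a relevant document in the ranking.
ranking = [1, 1, 0, 1, 0, 1, 0, 0]
print(breakeven_point(ranking, n_relevant=4))   # 0.75
```

At rank 4 the classifier has retrieved 3 of 4 relevant documents in 4 tries, so precision = recall = 0.75, which is the reported BEP.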

Table 2. Results on the Newsgroups data set

                      k-NN               NB                Rocchio            SVMs
                 Original Proposed  Original Proposed  Original Proposed  Original Proposed
Micro-avg. BEP    86.07    87.96     83.17    84.86     82.84    84.48     88.34    89.08
Macro-avg. BEP    84.58    87.03     82.87    84.55     81.5     83.57     87.73    89.08

Table 3. Results on the WebKB data set

                      k-NN               NB                Rocchio            SVMs
                 Original Proposed  Original Proposed  Original Proposed  Original Proposed
Micro-avg. BEP    84.97    86.74     85.67    87.21     86.52    88.26     92.12    92.64
Macro-avg. BEP    82.13    85.55     83.58    86.53     83.71    87.03     91.52    92.17

Table 4. Results on the Reuters data set

                      k-NN               NB                Rocchio            SVMs
                 Original Proposed  Original Proposed  Original Proposed  Original Proposed
Micro-avg. BEP    91.47    94.27     90.80    93.86     89.24    91.80     94.66    95.52
Macro-avg. BEP    82.66    85.43     81.26    86.38     77.56    83.55     89.86    90.72

As shown in the above tables, SVM achieved less improvement than the other classifiers. This is explained by the fact that the performance of SVM with the original one-against-the-rest method is already very high on all the data sets; it is more difficult to improve a classifier that starts from higher performance. Overall, the proposed method achieved better performance than the original method over all the classifiers and all the data sets, which is clear evidence that the proposed method is more effective than the original one-against-the-rest method.

3.2 The TCFP Classifier with Robustness to Noisy Data Using the Feature Projection Technique

To handle noisy data effectively, a new text classifier using a feature projection technique was developed, named TCFP [Ko and Seo, 2004a]. By the nature of the feature projection technique, the TCFP classifier is robust to noisy data. In the experimental results, TCFP showed better performance than other conventional classifiers on noisy data.

3.2.1 The TCFP Classifier with Robustness to Noisy Data

In the TCFP classifier, the classification knowledge is represented as a set of projections of the training data onto each feature dimension. The classification of a test document is based on the voting of each feature (word) of the test document. That is, the final prediction score is calculated by accumulating the voting scores of all features. First of all, the voting ratio of each category must be calculated for every feature. Since elements with a high TF-IDF value in the projections of a feature are more useful classification criteria for that feature, only elements with TF-IDF values above the average TF-IDF value are used for voting. The selected elements participate in proportional voting with an importance equal to the TF-IDF value of each element. Thus, the voting ratio of each category c_j in a feature f_m is calculated by the following formula:

$$r(c_j, f_m) = \frac{\sum_{f_m^{(i)} \in V_m} w(f_m, d_i)\, y(c_j, f_m^{(i)})}{\sum_{f_m^{(i)} \in V_m} w(f_m, d_i)} \qquad (9)$$

In formula (9), f_m^{(i)} denotes the projection element for a feature f_m in a document d_i, w(f_m, d_i) is the weight of feature f_m in document d_i, V_m denotes the set of elements selected for the voting of feature f_m, and y(c_j, f_m^{(i)}) ∈ {0, 1} is an indicator function: if the category of element f_m^{(i)} is equal to c_j, the output value is 1; otherwise, the output value is 0.

Next, since each feature votes separately on the feature projections, contextual information is missing. Thus, co-occurrence frequency is used to add contextual information to the proposed classification algorithm. To

calculate a co-occurrence frequency value between any two features f_i and f_j, the number of documents that include both features is counted. The TF-IDF values of two features f_i and f_j in a test document are then modified to reflect the co-occurrence frequency of the two features. That is, terms with a high co-occurrence frequency value and a low category frequency value obtain higher term weights, as in the following formula:

$$fw(f_i, d) = w(f_i, d)\left(1 + \frac{1}{\log(cf + 1)} \cdot \frac{\log(co(f_i, f_j) + 1)}{\log(maxco(f_k, f_l) + 1)}\right) \qquad (10)$$

where fw(f_i, d) denotes the modified term weight assigned to term f_i, cf denotes the category frequency, i.e., the number of categories in which f_i and f_j co-occur, co(f_i, f_j) is the co-occurrence frequency value for f_i and f_j, and maxco(f_k, f_l) is the maximum value among all co-occurrence frequency values. Note that the weight of feature f_j is modified by the same formula, using f_i instead of f_j.

The final voting score of each category c_j in a feature f_m of a test document d is calculated by the following formula:

$$vs(c_j, f_m) = fw(f_m, d) \cdot r(c_j, f_m) \cdot \log(1 + \chi^2_{max}(f_m)) \qquad (11)$$

where fw(f_m, d) denotes the term weight modified by the co-occurrence frequency and χ²_max(f_m) denotes the maximum of the χ² statistic values of f_m over the categories.

The outline of the TCFP classifier is as follows:

Input: test document d = <f_1, f_2, ..., f_n>

Main Process:
  For each feature f_i:
    calculate fw(f_i, d)
  For each feature f_i:
    For each category c_j:
      vote[c_j] = vote[c_j] + vs(c_j, f_i)   (by Formula (11))
  prediction = argmax_{c_j} vote[c_j]

The robustness of the TCFP classifier to noisy data is due to its voting mechanism. The voting mechanism of the TCFP classifier depends on separate voting by each feature, which can reduce the negative effect of noisy data and irrelevant features in text classification. That is, when a document contains irrelevant features or an incorrect label, the document may lie in a wrong location in the vector space; such a document can hurt performance, especially for k-NN and SVM. In the TCFP classifier, on the other hand, an irrelevant feature contributes only to the voting of that feature. Moreover, only feature elements with a TF-IDF weight above the average weight value can take part in the voting mechanism, and this makes the TCFP classifier even more effective at handling irrelevant features.

3.2.2 Experimental Evaluation

We provide empirical evidence that TCFP is a useful text classifier and is robust to noisy data. In our experiments, we used three test data sets (Newsgroups, WebKB, and Reuters) and employed four other classifiers (k-NN, NB, Rocchio, and SVMs) for comparison with the TCFP classifier. As performance measures, we followed the standard definitions of recall (r), precision (p), and the F1 measure (2rp/(r+p)) for the Newsgroups and WebKB data sets, and results on Reuters are reported as precision-recall breakeven points [Yang, 1999].
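A minimal sketch of this voting loop follows (toy numbers; the voting ratios r, modified weights fw, and χ²_max values would come from Formulas (9)-(11) on training data and are simply assumed here):

```python
from collections import defaultdict
import math

def tcfp_predict(doc_features, ratios, chi2_max, fw):
    """TCFP voting loop: every feature of the test document votes for each
    category with vs(c, f) = fw(f, d) * r(c, f) * log(1 + chi2_max(f))."""
    vote = defaultdict(float)
    for f in doc_features:
        for c, r in ratios.get(f, {}).items():
            vote[c] += fw[f] * r * math.log(1.0 + chi2_max.get(f, 0.0))
    return max(vote, key=vote.get)

# Hypothetical classification knowledge for a two-category toy problem.
ratios = {"goal": {"sports": 0.9, "politics": 0.1},
          "vote": {"sports": 0.2, "politics": 0.8}}
chi2_max = {"goal": 5.0, "vote": 3.0}
fw = {"goal": 1.2, "vote": 0.4}
print(tcfp_predict(["goal", "vote"], ratios, chi2_max, fw))   # sports
```

Because each feature votes independently, a single mislabeled or irrelevant feature shifts only its own term in the sum, which is the robustness property argued for above.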

First, Table 5 reports the performance comparison of the classifiers on the three data sets to verify the general usefulness of TCFP.

Table 5. Comparison of performance results for each classifier on each data set

              TCFP    k-NN    NB      Rocchio  SVM
Newsgroups    85.52   85.15   82.51   81.68    87.32
WebKB         88.07   84.83   85.22   85.98    91.75
Reuters       90.01   88.93   88.62   86.47    93.32

The results show that TCFP is superior to the k-NN, Naive Bayes, and Rocchio classifiers. However, TCFP produced lower performance than SVMs, which have been reported as the best-performing classifier in the literature.

In order to verify the superiority of TCFP in robustness to noisy data, we conducted experiments evaluating the robustness of each classifier to noisy data. For this experiment, we generated four data sets from the Newsgroups data set by increasing the proportion of noisy documents from 10% to 40%: these noisy documents were randomly chosen from each category and randomly assigned to other categories. The results of each classifier on each noisy data set are shown in Figure 3 and Table 6. These results were also obtained by a fivefold cross-validation method.
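The noisy data sets above were created by relabeling a fraction of the training documents. A minimal sketch of that corruption step (hypothetical labels; fixed seed for reproducibility):

```python
import random

def inject_label_noise(labels, categories, noise_rate, seed=0):
    """Randomly pick a `noise_rate` fraction of the documents and reassign
    each one to a different, randomly chosen category."""
    rng = random.Random(seed)
    noisy = list(labels)
    k = int(len(labels) * noise_rate)
    for i in rng.sample(range(len(labels)), k):
        wrong = [c for c in categories if c != noisy[i]]
        noisy[i] = rng.choice(wrong)
    return noisy

labels = ["a"] * 50 + ["b"] * 50
noisy = inject_label_noise(labels, ["a", "b"], 0.2)
print(sum(x != y for x, y in zip(labels, noisy)))   # 20 flipped labels
```

Because every sampled document is forced into a *different* category, a noise rate of 20% yields exactly 20% corrupted labels, matching the 10%-40% settings of Table 6.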

Figure 3. Comparison of performance results for each classifier on four noisy data sets

Table 6. Comparison of performance results for each classifier on four noisy data sets

Noisy degree   TCFP    k-NN    NB      Rocchio  SVM
10%            85.13   84.78   82.42   81.56    86.26
20%            84.64   84.04   81.87   81.17    84.49
30%            84.5    83.9    81.09   81.26    81.73
40%            83.3    82.39   79.83   80.9     76.02

As shown in Figure 3 and Table 6, TCFP shows the best performance from the 20% noisy data set onward, and its rate of performance decrease is smaller than that of k-NN and SVMs. In particular, we observed that the performance of SVMs degraded rapidly as the number of noisy documents increased. The experimental results thus show that TCFP performs well and is especially robust to noisy data.

4. IMPROVING THE TERM WEIGHTING SCHEME FOR TEXT CLASSIFICATION

We here focus on the term weighting and indexing scheme of text classification as another improvement issue. The vector space model has been used as a conventional method for text representation [Salton et al., 1988]. This model commonly represents a document as a vector of features using Term Frequency (TF) and Inverse Document Frequency (IDF), and it simply counts TF without considering where the term occurs. However, each sentence in a document has a different importance for identifying the content of the document. Thus, text classification can be improved by assigning to each term a weight that reflects the importance of the sentence it occurs in. For this, we apply text summarization techniques to separate important sentences from unimportant sentences in a document. The importance of each sentence is measured by these text summarization techniques, and the term weights in each sentence are modified in proportion to the calculated sentence importance. To test the proposed method, we used two different newsgroup data sets: one is the well-known Newsgroups data set, and the other was gathered from Korean UseNet discussion groups.

4.1 Measuring the Importance of Sentences

The importance of each sentence is measured by two methods. In the first, sentences that are more similar to the title receive higher weights. In the second, we first measure the importance of terms by their TF, IDF, and χ² statistic values, and then assign higher importance to sentences containing more important terms. Finally, the importance of a sentence is calculated by a combination of the two methods.

4.1.1 Importance of sentences by the title

Generally, a title summarizes the important content of a document [Endres-Niggemeyer, 1998]. Hence, we measure the similarity between the title and each sentence, and then assign higher importance to sentences with higher similarity. The title and each sentence of a document are represented as vectors of content words.
The similarity between them is calculated by the inner product, and the calculated values are

normalized into values between 0 and 1 by the maximum value. The similarity value between the title T and a sentence S in a document d is calculated by the following formula:

$$Sim(S, T) = \frac{S \cdot T}{\max_{S_i \in d}(S_i \cdot T)} \qquad (12)$$

where T denotes the vector of the title and S denotes the vector of a sentence.

4.1.2 Importance of sentences by the importance of terms

Since the title-based method depends on the quality of the title, it can be useless for a document with a meaningless title or no title. Besides, sentences with important terms must also be treated as important even if they are dissimilar to the title. Considering these points, we first measure the importance values of terms by their TF, IDF, and χ² statistic values, and then the sum of the importance values of the terms in each sentence is assigned as the importance value of the sentence. In this method, the importance value of a sentence S in a document d is calculated as follows:

$$Cen(S) = \frac{\sum_{t \in S} tf(t)\, idf(t)\, \chi^2(t)}{\max_{S_i \in d} \sum_{t \in S_i} tf(t)\, idf(t)\, \chi^2(t)} \qquad (13)$$

where tf(t) denotes the term frequency of term t, idf(t) denotes the inverse document frequency, and χ²(t) denotes the χ² statistic value.
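Formulas (12) and (13) can be sketched as follows (toy bag-of-words vectors and precomputed per-term importance values, all hypothetical):

```python
def title_similarity(sentences, title):
    """Formula (12): inner product with the title vector, normalized by
    the maximum inner product over the document's sentences."""
    def dot(u, v):
        return sum(u.get(t, 0.0) * w for t, w in v.items())
    raw = [dot(s, title) for s in sentences]
    m = max(raw)
    return [x / m if m > 0 else 0.0 for x in raw]

def term_centrality(sentences, importance):
    """Formula (13): sum of per-term importance (tf * idf * chi^2,
    precomputed here) per sentence, normalized by the maximum sentence."""
    raw = [sum(importance.get(t, 0.0) for t in s) for s in sentences]
    m = max(raw)
    return [x / m if m > 0 else 0.0 for x in raw]

title = {"tax": 1.0, "reform": 1.0}
sents = [{"tax": 2.0, "reform": 1.0}, {"rain": 2.0}, {"tax": 1.0}]
print(title_similarity(sents, title))        # [1.0, 0.0, 0.333...]
importance = {"tax": 3.0, "reform": 1.5, "rain": 0.2}
print(term_centrality([list(s) for s in sents], importance))
```

The first sentence shares the most words with the title, so it gets the maximum similarity of 1.0; the second method rescues sentence three, which carries an important term despite weak overlap with the title.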

4.1.3 Combination of the two sentence importance values

The two kinds of sentence importance are combined by the following formula:

$$Score(S) = 1.0 + k_1 \cdot Sim(S, T) + k_2 \cdot Cen(S) \qquad (14)$$

In formula (14), k_1 and k_2 are constant weights that control the rates at which the two importance values are reflected. The constant value 1.0 is added to the calculated sentence importance value in order to prevent the modified TF value of formula (15) from being lower than the original TF value.

4.1.4 Indexing process

The importance value of a sentence from formula (14) is used to modify the TF value of a term. That is, since the TF value of a term in a document is the sum of its TF values in each sentence, the modified TF value WTF(d, t) of the term t in the document d is calculated by formula (15):

$$WTF(d, t) = \sum_{S \in d} tf(S, t) \cdot Score(S) \qquad (15)$$

where tf(S, t) denotes the TF of the term t in sentence S. The weight given by formula (15) is used in k-NN, NB, Rocchio, and SVM.

4.2 Empirical Evaluation

To test the proposed system, we used two newsgroup data sets written in two different languages, English and Korean. Each document in both data sets has only one category.
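The indexing step of Formulas (14) and (15) can be sketched as follows (hypothetical k_1, k_2 and made-up Sim/Cen values; each term's modified TF is the sum of its sentence-level TFs scaled by Score(S)):

```python
def modified_tf(doc_sentences, sim, cen, k1=0.5, k2=0.5):
    """Formulas (14) and (15): each sentence's TF contribution is scaled
    by Score(S) = 1.0 + k1*Sim(S,T) + k2*Cen(S), then summed per term."""
    wtf = {}
    for sent, s_sim, s_cen in zip(doc_sentences, sim, cen):
        score = 1.0 + k1 * s_sim + k2 * s_cen
        for term in sent:
            wtf[term] = wtf.get(term, 0.0) + score
    return wtf

# Two-sentence toy document with hypothetical importance values.
sents = [["tax", "reform", "tax"], ["rain", "tax"]]
sim = [1.0, 0.0]    # Formula (12) values
cen = [0.8, 0.2]    # Formula (13) values
wtf = modified_tf(sents, sim, cen)
print(wtf)
```

With these numbers, "tax" in the important first sentence counts 1.9 per occurrence but only 1.1 in the unimportant second one, so its modified TF is 4.9 instead of the plain count of 3; the added 1.0 in Score(S) guarantees every occurrence still counts at least once.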

The 20 Newsgroups data set is the same one used in the previous sections. The second data set was gathered from the Korean UseNet group. This data set contains a total of 10,331 documents and consists of 15 categories; 3,107 documents (30%) are used as test data and the remaining 7,224 documents (70%) as training data. The resulting vocabulary from the training data has 69,793 words. As performance measures, we followed the standard definitions of recall (r), precision (p), and the F1 measure (2rp/(r+p)). For evaluating average performance across categories, we used the micro-averaging and macro-averaging methods [Yang, 1999].

Tables 7 and 8 list the comparison of performance for each classifier using the different indexing schemes on the two newsgroup data sets. Here, the basis method uses the conventional TF for NB and the conventional TF-IDF for the other classifiers.

Table 7. Results on the English Newsgroups data set

                   k-NN             NB              Rocchio          SVMs
               Basis Proposed  Basis Proposed  Basis Proposed  Basis Proposed
Micro-avg. F1   81.2   82.6     83.0   84.1     78.6   79.7     86.0   86.6
Macro-avg. F1   81.4   82.7     83.3   84.4     79.1   80.0     86.1   86.6

Table 8. Results on the Korean Newsgroups data set

                   k-NN             NB              Rocchio          SVMs
               Basis Proposed  Basis Proposed  Basis Proposed  Basis Proposed
Micro-avg. F1   77.4   79.6     78.4   80.8     76.5   78.2     84.5   85.3
Macro-avg. F1   79.9   81.3     79.1   81.3     78.7   80.1     86.0   86.5

On both data sets, the proposed method produced better performance with all of these classifiers. As a result, our proposed method can improve the classification performance of all these classifiers in both English and Korean.
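The difference between the two averaging methods used above can be seen in a small sketch (hypothetical per-category contingency counts; a rare category drags the macro-average down while barely affecting the micro-average):

```python
def f1(r, p):
    return 2 * r * p / (r + p) if r + p else 0.0

def macro_micro_f1(per_category):
    """per_category: list of (tp, fp, fn). Macro-averaging computes F1 per
    category and then averages; micro-averaging pools the counts first."""
    macro = sum(f1(tp / (tp + fn), tp / (tp + fp))
                for tp, fp, fn in per_category) / len(per_category)
    TP = sum(tp for tp, _, _ in per_category)
    FP = sum(fp for _, fp, _ in per_category)
    FN = sum(fn for _, _, fn in per_category)
    micro = f1(TP / (TP + FN), TP / (TP + FP))
    return macro, micro

# A frequent category and a rare one (made-up counts).
macro, micro = macro_micro_f1([(90, 10, 10), (1, 1, 9)])
print(round(macro, 3), round(micro, 3))   # 0.533 0.858
```

The frequent category scores F1 = 0.9 and the rare one 0.17, so the macro average is 0.53, while pooling the counts yields a micro F1 of 0.86; this is why both averages are reported in the tables.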

5. CONCLUSIONS

This paper has presented three important points of improvement and the corresponding empirical results. They can be summarized as follows:

- Training Data Generation: semi-supervised learning or active learning techniques can be applied to text classification. They give us a way to utilize inexpensive and plentiful unlabeled data. In our experiments, we achieved performance comparable to the supervised method even when using only the title word of each category and unlabeled data.

- Noisy Data Reduction: effective noisy data reduction and the development of classifiers robust to noisy data can resolve some noisy data problems. In our experiments, the proposed noisy data reduction method led to higher performance on all three test data sets and with the four different conventional classifiers, and the TCFP classifier, developed as a classifier robust to noisy data, also performed well in environments with much noisy data.

- Term Weighting and Indexing: a new term weighting and indexing scheme is needed because the requirements of text classification differ from those of information retrieval. Thus, we advocated a new term weighting and indexing method for text categorization using two kinds of text summarization techniques: one uses the title and the other uses the importance of terms. In our experiments, the proposed method achieved better performance than the basis system with all these classifiers and in both languages, English and Korean.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (2011-0003780).

REFERENCES

Brill, E. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics 21, 4.

Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. 2000. Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence 118, 1-2, 69-113.

Dempster, A., Laird, N.M., and Rubin, D. 1997. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society Series B 39, 1, 1-38.

Endres-Niggemeyer, B. 1998. Summarizing Information. Springer-Verlag, Berlin Heidelberg, 307-338.

Sebastiani, F. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys 34, 1, 1-47.

Han, H., Ko, Y., and Seo, J. 2007. Using the Revised EM Algorithm to Remove Noisy Data for Improving the One-against-the-rest in Binary Text Classification. Information Processing & Management 43, 5, 1281-1293.

Joachims, T. 1997. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Machine Learning: Proceedings of the Fourteenth International Conference, 143-151.

Joachims, T. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the European Conference on Machine Learning (ECML), Springer, 137-142.

Joachims, T. 2001. Learning to Classify Text Using Support Vector Machines. Ph.D. dissertation.

Joachims, T. 2002. Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers.

Karov, Y. and Edelman, S. 1998. Similarity-based Word Sense Disambiguation. Computational Linguistics 24, 1, 41-60.

Ko, Y. and Seo, J. 2000. Automatic Text Categorization by Unsupervised Learning. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), 453-459.

Ko, Y. and Seo, J. 2004a. Using the Feature Projection Technique Based on a Normalized Voting for Text Classification. Information Processing & Management 40, 2, 191-208.

Ko, Y., Park, J., and Seo, J. 2004b. Improving Text Categorization Using the Importance of Sentences. Information Processing & Management 40, 1, 65-79.

Ko, Y. and Seo, J. 2009. Text Classification from Unlabeled Documents with Bootstrapping and Feature Projection Techniques. Information Processing & Management 45, 1, 70-83.

Lee, C.H., Lin, C.R., and Chen, M.S. 2001. Sliding-window Filtering: An Efficient Algorithm for Incremental Mining. In Proceedings of the Tenth International Conference on Information and Knowledge Management, 263-270.

Lewis, D.D., Schapire, R.E., Callan, J.P., and Papka, R. 1996. Training Algorithms for Linear Text Classifiers. In Proceedings of the 19th International Conference on Research and Development in Information Retrieval (SIGIR '96), 289-297.

Lewis, D.D. 1998. Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval. In Proceedings of the European Conference on Machine Learning, 4-15.

Maarek, Y., Berry, D., and Kaiser, G. 1991. An Information Retrieval Approach for Automatically Constructing Software Libraries. IEEE Transactions on Software Engineering 17, 8, 800-813.

McCallum, A. and Nigam, K. 1998. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI-98 Workshop on Learning for Text Categorization, 41-48.

Mitchell, T. 1997. Machine Learning. New York: McGraw-Hill.

Nigam, K., McCallum, A., and Mitchell, T. 2006. Semi-supervised Text Classification Using EM. In Semi-Supervised Learning. MIT Press: Boston.

Salton, G. and Buckley, C. 1988. Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24, 513-523.

Slonim, N., Friedman, N., and Tishby, N. 2002. Unsupervised Document Classification Using Sequential Information Maximization. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 129-136.

Tong, S. and Koller, D. 2001. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research 2, 45-66.

Yang, Y. and Pedersen, J.P. 1997. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, 412-420.

Yang, Y. 1999. An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval 1, 1/2, 67-88.

Yang, Y., Slattery, S., and Ghani, R. 2002. A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems 18, 2.

Zadrozny, B. and Elkan, C. 2001. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, 609-616.

Zadrozny, B. and Elkan, C. 2002. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD-02), 694-699.