Pruning Training Corpus to Speedup Text Classification

Jihong Guan and Shuigeng Zhou

School of Computer Science, Wuhan University, Wuhan, 430079, China. hguan@wtusm.edu.cn
State Key Lab of Software Engineering, Wuhan University, Wuhan, 43007, China. Zhousg@whu.edu.cn

Abstract: With the rapid growth of online text information, efficient text classification has become one of the key techniques for organizing and processing text repositories. In this paper, an efficient text classification approach is proposed based on pruning the training corpus. With the proposed approach, noisy and superfluous documents in training corpora can be cut off drastically, which leads to a substantial improvement in classification efficiency. An effective algorithm for training corpus pruning is proposed. Experiments over the commonly used Reuters benchmark are carried out, which validate the effectiveness and efficiency of the proposed approach.

Keywords: text classification; fast classification; k-nearest neighbor (kNN); training-corpus pruning.

This work was supported by the Natural Science Foundation of China (NSFC) (No. 607307) and the Provincial Natural Science Foundation of Hubei, China (No. 00ABB050).

R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 831-840, 2002. Springer-Verlag Berlin Heidelberg 2002

1 Introduction

As the amount of on-line textual information increases by leaps and bounds, effective retrieval is difficult without the support of appropriate indexing and summarization of text content. Text classification is one solution to this problem. By placing documents into different classes according to their respective contents, retrieval can be done by first locating a specific class of documents relevant to the query and then searching for the target documents within the selected class, which is significantly more efficient and reliable than searching the whole document repository.

Text classification has been a hot research topic in machine learning and information retrieval, and a number of methods for text classification have been proposed [1, 2]. Among the existing methods, kNN is the simplest strategy: it searches for the k nearest training documents to the test document and uses the classes assigned to those training documents to decide the class of the test document [3, 4, 5, 6]. The kNN classification method is easy to implement because it does not require the classifier-training phase that other classification methods must have. Furthermore, experimental research shows that the kNN method offers promising performance in text classification [2, 6].

However, the kNN method is of low efficiency because it requires a large amount of computational power to evaluate the similarity between a test document and every training document and to sort the similarities. Such a drawback makes it unsuitable for applications where classification efficiency is pressing, for example, on-line text classification, where the classifier has to respond to many documents arriving simultaneously in stream format.

Some researchers in IR have addressed the problem of using representative training documents for text classification to improve classification efficiency. In [7] we proposed an algorithm for selecting representative boundary documents to replace the entire training sets so that classification efficiency can be improved. However, [7] did not provide any criterion on how many boundary documents should be selected, and it could not guarantee the classification performance. Linear classifiers [8] represent a category with a generalized vector that summarizes all training documents in that category; the decision to assign the category can be viewed as considering the similarity between the test document and the generalized vector. Analogously, [9] uses the centroid of each class as the only representative of the entire class: a test document is assigned to the class whose centroid is nearest to that test document. However, these approaches do not do well when the sizes of different classes are quite different and the distribution of training documents in each class is irregular in document space. Combining the traditional kNN and linear classification methods, [5] uses a set of generalized instances to replace the entire training corpus, and classification is based on this set of generalized instances. Experiments show that this approach outperforms both the traditional kNN and linear classification methods.

In this paper, our focus is also on the efficiency of kNN based text classification. We provide a robust and controlled way to prune noisy and superfluous documents so that the training corpus can be significantly condensed while its classification competence is maintained as much as possible, which leads to greatly improved classification efficiency. We design an effective algorithm for training corpus pruning, and carry out experiments over the commonly used Reuters benchmark to validate the effectiveness and efficiency of the proposed approach. Our approach is especially suitable for on-line classification applications.

The rest of this paper is organized as follows. Section 2 introduces the vector space model (VSM), a clustering-based feature selection method and the kNN classification method. Section 3 first presents the concepts and algorithm for pruning noisy and superfluous training documents, and then gives a fast kNN classification approach based on the proposed training-corpus pruning algorithm. Section 4 describes experiments for evaluating the proposed approach. Section 5 concludes the paper.

2 Preliminaries for kNN Based Text Classification

2.1 Documents Representation by Vector Space Model (VSM)

In kNN based text classification, the vector space model (VSM) [10] is used to represent documents. That is, a document corresponds to an n-dimensional document vector.

Each dimension of the document vector corresponds to an important term appearing in the training corpus. These terms are also called document features. Given a document vector, its dimensional components indicate the corresponding terms' weights, which reflect the importance of these terms in that document.

Denote by D a training corpus and by V the set of document features, V = {t_1, t_2, ..., t_n}. A document d in D can be represented in the VSM as

$\vec{d} = (w_1, w_2, \ldots, w_n)$.  (1)

Above, $\vec{d}$ denotes the vector of document d, and $w_i$ ($i = 1, \ldots, n$) is the weight of term $t_i$. Usually, the weight is evaluated by the TF-IDF method. A commonly used formula is

$w_i = \dfrac{tf_i \cdot \log(N/n_i)}{\sqrt{\sum_{j=1}^{n} (tf_j)^2 \, [\log(N/n_j)]^2}}$.  (2)

Here, N is the total number of documents in D, $tf_i$ is the occurrence frequency of $t_i$ in document d, and $n_i$ is the number of documents in which $t_i$ appears. Obviously, document vectors calculated by (2) are unit vectors. Given two documents $d_i$ and $d_j$, the similarity coefficient between them is measured by the inner product of their corresponding document vectors, i.e.,

$Sim(d_i, d_j) = \vec{d_i} \cdot \vec{d_j}$.  (3)
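
As a concrete illustration of Section 2.1, the following Python sketch builds unit-length TF-IDF vectors following Eq. (2) and computes the inner-product similarity of Eq. (3). It assumes documents are already tokenized into lists of terms; the function names are illustrative and not taken from the paper.

```python
import math
from collections import Counter

def document_frequencies(corpus_terms):
    """doc_freq[t]: number of training documents containing term t (n_i in Eq. 2)."""
    df = Counter()
    for terms in corpus_terms:
        df.update(set(terms))
    return df

def tfidf_vector(doc_terms, vocabulary, doc_freq, num_docs):
    """Unit-length TF-IDF vector of one document, following Eq. (2)."""
    tf = Counter(doc_terms)
    raw = [tf[t] * math.log(num_docs / doc_freq[t]) if doc_freq.get(t) else 0.0
           for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm > 0 else raw

def similarity(vec_a, vec_b):
    """Inner product of two document vectors, as in Eq. (3)."""
    return sum(a * b for a, b in zip(vec_a, vec_b))
```

Since the vectors produced by Eq. (2) are normalized, the inner product computed by similarity() equals the cosine of the angle between the two documents.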

2.2 Clustering Based Feature Selection

To calculate document vectors for training documents, the first step is to select a set of proper document features. A number of statistical methods have been used for document feature selection in the literature [11]. In this paper, however, we use a new method, which is referred to as clustering-based feature selection.

From a geometric point of view, every document is a unit vector in document space (an n-dimensional space). Basically, documents belonging to the same class are closer to each other in document space than documents that are not in the same class; that is, they have smaller distance (or larger similarity). Documents in the same class form a dense hyper-cone area in document space, and a training corpus corresponds to a cluster of hyper-cones, each of which corresponds to a class. Certainly, different hyper-cones may overlap with each other. Intuitively, the goal of the feature selection task here is to select a subset of document features such that the overlap among different training classes in the document space is as small as possible.

The basic idea of our clustering-based feature selection method is as follows: treat each training class as a distinct cluster, then use a genetic algorithm to select a subset of document features such that the difference among all clusters is maximized. We define the difference among all clusters as

$Diff = \dfrac{1}{m}\sum_{k=1}^{m}\dfrac{1}{|C_k|(|C_k|-1)}\sum_{d_i, d_j \in C_k,\, i \neq j} sim(d_i, d_j) \;-\; \dfrac{2}{m(m-1)}\sum_{k_1=1}^{m}\sum_{k_2=k_1+1}^{m}\dfrac{1}{|C_{k_1}||C_{k_2}|}\sum_{d_i \in C_{k_1}}\sum_{d_j \in C_{k_2}} sim(d_i, d_j)$.  (4)

Above, m is the number of clusters; the first part is the average intra-cluster similarity, and the second part is the average inter-cluster similarity. Due to space limitations, we omit the details of the clustering-based feature selection algorithm.

2.3 kNN Based Text Classification

The kNN based text classification approach is quite simple [2]: given a test document, the system finds the k nearest neighbors among the training documents in the training corpus, and uses the classes of these k nearest neighbors to weight the class candidates. The similarity score of each nearest-neighbor document to the test document is used as the weight of the classes of that neighbor document. If several of the k nearest neighbors share a class, then the per-neighbor weights of that class are added together, and the resulting weighted sum is used as the likelihood score of that class with respect to the test document. By sorting the scores of the candidate classes, a ranked list is obtained for the test document; by thresholding on these scores, binary class assignments are obtained. Formally, the decision rule in kNN classification can be written as

$score(d, c_i) = \sum_{d_j \in kNN(d)} Sim(d, d_j)\,\delta(d_j, c_i) - b_i$.  (5)

Above, kNN(d) indicates the set of k nearest neighbors of document d; $b_i$ is the class-specific threshold for the binary decisions, which can be learned automatically using cross-validation; and $\delta(d_j, c_i)$ is the classification of document $d_j$ with respect to class $c_i$, that is, $\delta(d_j, c_i) = 1$ if $d_j \in c_i$ and 0 otherwise.

Obviously, for a test document d, the similarity between d and each document in the training corpus must be evaluated before d can be classified. The time complexity of kNN classification is $O(n_t \cdot |D| \cdot \log|D|)$, where $|D|$ and $n_t$ are the size of the training corpus and the number of test documents respectively. A possible way to improve classification efficiency is to reduce $|D|$, which is the goal of this paper.

In this paper, we assume that 1) the class space has a flat structure and all classes are semantically disjoint; 2) each document in the training corpus belongs to only one class; 3) each test document can be classified into only one class. With these assumptions, a test document d should belong to the class that has the highest resulting weighted sum in (5). That is,

$d \in c_i$ only if $score(d, c_i) = \max_{j}\{score(d, c_j)\}$.  (6)
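
A minimal sketch of the decision rule in Eqs. (5) and (6) for the single-label setting assumed above, reusing the similarity helper from the previous sketch; knn_classify is an illustrative name, not the authors' code.

```python
def knn_classify(test_vec, training, k, thresholds=None):
    """Single-label kNN decision following Eqs. (5) and (6).

    training: list of (document_vector, class_label) pairs.
    thresholds: optional per-class b_i; they can be omitted because the
    single-label rule of Eq. (6) only compares scores across classes.
    """
    # k nearest neighbors by the similarity of Eq. (3).
    neighbors = sorted(training,
                       key=lambda pair: similarity(test_vec, pair[0]),
                       reverse=True)[:k]
    scores = {}
    for vec, label in neighbors:
        scores[label] = scores.get(label, 0.0) + similarity(test_vec, vec)
    if thresholds:
        scores = {c: s - thresholds.get(c, 0.0) for c, s in scores.items()}
    # Eq. (6): assign the single class with the highest score.
    return max(scores, key=scores.get)
```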

3 Training-Corpus Pruning for Fast Text Classification

Examining the process of kNN classification, we can see that the outer documents, or boundary documents (those located near the boundary), of each class (or document hyper-cone) play a more decisive role in classification. In contrast, the inner documents, or central documents (those located in the interior area), of each class (or document hyper-cone) are less important as far as kNN classification is concerned, because their contribution to the classification decision can be obtained from the outer documents. In this sense, the inner documents of each class can be seen as superfluous documents: they tell us little about making the classification decision, and the job they do in informing the classification decision can be done by other documents. Besides superfluous documents, there may also be noisy documents in the training corpus, i.e., incorrectly labeled training documents. We seek to discard superfluous and noisy documents to reduce the size of the training corpus so that classification efficiency can be boosted. Meanwhile, we try to guarantee that the pruning of superfluous documents will not cause degradation of classification performance (precision and recall).

In the context of kNN text classification, for a training document d in training corpus D, there are two sets of documents in D that are related to d in different ways. Documents in one of the two sets are critical to the classification decision on d if d were a test document; for documents in the other set, d can contribute to the classification decisions on those documents if they were treated as test documents. Formal definitions of the two document sets are as follows.

Definition 1. Given document d in training corpus D, the set of k nearest documents to d in D constitutes the k-reachability set of d, referred to as k-reachability(d). Formally, k-reachability(d) = {d_i | d_i ∈ D and d_i ∈ kNN(d)}.

Definition 2. Given document d in training corpus D, there is a set of documents in the same class as d such that each document's k-reachability set contains d. We call this set the k-coverage set of d, or simply k-coverage(d). Formally, k-coverage(d) = {d_i | d_i ∈ D and d_i ∈ class(d) and d ∈ k-reachability(d_i)}. Here, class(d) indicates the class to which d belongs.

Note that in Definition 2, k-coverage(d) contains only documents from the class that d belongs to. The reason lies in the fact that our aim is to prune the training corpus while maintaining its classification competence. Obviously, pruning d may negatively impact the classification decisions on documents in the same class as d; however, it can only benefit the classification decisions on documents in the other classes. Hence, we need to take care of only those documents that are in the same class as d and whose k-reachability sets contain d.

Definition 3. Given document d in training corpus D, if d can be correctly classified with k-reachability(d) based on the kNN method, in other words, if d can be implied by k-reachability(d), then d is a superfluous document in D.

Definition 4. Given document d in training corpus D, d is a critical document if one of the following conditions is fulfilled: a) at least one document d_i in k-coverage(d) cannot be implied by its k-reachability(d_i); b) after d is pruned from D, at least one document d_i in k-coverage(d) cannot be implied by its k-reachability(d_i).
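
Under the same assumptions as the earlier sketches (vectors: a dict mapping document ids to unit vectors; labels: a dict mapping ids to class labels), the following hypothetical helpers make Definitions 1-3 concrete. They are written for clarity, not efficiency (k-coverage recomputes k-reachability for every candidate neighbor).

```python
def k_reachability(doc_id, vectors, k):
    """Definition 1: the k documents in the corpus nearest to doc_id."""
    others = sorted((j for j in vectors if j != doc_id),
                    key=lambda j: similarity(vectors[doc_id], vectors[j]),
                    reverse=True)
    return set(others[:k])

def k_coverage(doc_id, vectors, labels, k):
    """Definition 2: same-class documents whose k-reachability set contains doc_id."""
    return {j for j in vectors
            if j != doc_id and labels[j] == labels[doc_id]
            and doc_id in k_reachability(j, vectors, k)}

def implied(doc_id, reach, vectors, labels):
    """Definition 3 test: is doc_id correctly classified by kNN voting over `reach`?"""
    votes = {}
    for j in reach:
        votes[labels[j]] = votes.get(labels[j], 0.0) + similarity(vectors[doc_id], vectors[j])
    return bool(votes) and max(votes, key=votes.get) == labels[doc_id]
```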

Definition 5. Given document d in training corpus D, if d is not a superfluous document and its k-coverage(d) is empty, then d is a noisy document in D.

In summary, a superfluous document is superfluous because its class assignment can be derived from other documents; a critical document is critical to other documents because it can contribute to making correct classification decisions about those documents; and a noisy document is noise as far as classification is concerned because it is incorrectly labeled. In kNN classification, noisy documents must be given up and superfluous documents can be discarded, whereas critical documents must be kept in order to maintain the classification competence of the training corpus. Based on this consideration, we give a rule for training corpus pruning as follows.

Rule 1. The rule for training-document pruning. A document d in training corpus D can be pruned from D if 1) it is a noisy document in D, or 2) it is a superfluous document, but not a critical document, in D.

For the second case in Rule 1, the first constraint is the prerequisite for pruning a document from the training corpus, while the second constraint guarantees that the pruning of that document will not degrade the classification competence of the training corpus.

While pruning superfluous documents, it is worth pointing out that the order of pruning is also critical, because the pruning of one document may affect the decision on whether other documents can be pruned. Intuitively, the inner documents of a class in the training corpus should be pruned before the outer documents. This strategy increases the chance of retaining as many outer documents as possible. Otherwise, if outer documents were pruned before inner documents, a domino effect could occur in which a large number of documents, including outer documents, are pruned from the training corpus, which would greatly degrade the classification competence of the training corpus. Therefore, some rule is necessary to control the order of document pruning.

Generally speaking, the inner documents of a class in the training corpus have some common features: 1) inner documents may have more documents of their own class around them than outer documents have; 2) inner documents are closer to the center of their class than the outer documents are; 3) inner documents are farther from the documents of other classes than the outer documents are. Based on these observations, we give a rule for the pruning priority of superfluous documents. Here, we denote by H-kNN(d) the number of documents in kNN(d) that belong to the class of d; by similarity-c(d) the similarity of document d to the center of its own class; and by similarity-ne(d) the similarity of document d to the nearest document that does not belong to its own class.

Rule 2. The rule for setting the priority of pruning superfluous documents. Given two documents d1 and d2 in a class of the training corpus, where both d1 and d2 are superfluous documents that can be pruned according to Rule 1:
1) if H-kNN(d1) > H-kNN(d2), then prune d1 before d2;
2) if similarity-c(d1) > similarity-c(d2), then prune d1 before d2;
3) if similarity-ne(d1) < similarity-ne(d2), then prune d1 before d2;
4) if they have similar H-kNN, similarity-c and similarity-ne, then either one can be pruned first;
5) the priority of using H-kNN, similarity-c and similarity-ne is: H-kNN > similarity-c > similarity-ne.
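
A hedged sketch of how Rule 2 could be turned into a sort key, reusing the hypothetical k_reachability and similarity helpers above; the centroids argument (class label to class-center vector) is an assumption introduced here for illustration, since the paper does not specify how the class center is computed.

```python
def pruning_priority(doc_id, vectors, labels, centroids, k):
    """Rule 2 sort key: larger tuples mean 'more inner', hence pruned earlier.

    centroids: assumed mapping from class label to a class-center vector
    (e.g. the mean of the class's document vectors, re-normalized).
    """
    reach = k_reachability(doc_id, vectors, k)
    h_knn = sum(1 for j in reach if labels[j] == labels[doc_id])       # H-kNN(d)
    sim_c = similarity(vectors[doc_id], centroids[labels[doc_id]])     # similarity-c(d)
    sim_ne = max((similarity(vectors[doc_id], vectors[j])
                  for j in vectors if labels[j] != labels[doc_id]),
                 default=0.0)                                          # similarity-ne(d)
    # H-kNN dominates similarity-c, which dominates similarity-ne (Rule 2, item 5);
    # similarity-ne is negated because a smaller value marks a more inner document.
    return (h_knn, sim_c, -sim_ne)
```

Sorting the superfluous candidates by this key in descending order gives a pruning order consistent with the inner-documents-first strategy used in Algorithm 1 below.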

Following is an algorithm for training corpus pruning. In Algorithm 1, we assume that there is only one class in the training corpus. If there are multiple classes in the training corpus, the pruning process of Algorithm 1 is simply carried out over one class after another.

Algorithm 1. Pruning-training-corpus (T: training corpus, P: pruned corpus)
1) P = T; S = Ø;
2) for each document d in T
3)   compute k-reachability(d); compute k-coverage(d);
4) for each noisy document d in T
5)   S = S ∪ {d}; T = T − {d}; P = P − {d};
6)   for each document d' in k-coverage(d)
7)     remove d from k-reachability(d') and update k-reachability(d') in T;
8)   for each document d' in k-reachability(d)
9)     remove d from k-coverage(d');
10) for each document d in T but not in S
11)   if d can be pruned and has the highest priority to be pruned, then
12)     S = S ∪ {d}; P = P − {d};
13)     for each document d' in k-coverage(d)
14)       update k-reachability(d') in T;
15) return P.

Based on the technique of training corpus pruning, a fast algorithm for kNN text classification is outlined as follows.

Algorithm 2. Fast kNN classification based on training-document pruning (outline)
1) Select document features with the proposed clustering-based feature selection method; build training document vectors with the selected features;
2) Prune the training corpus by using Algorithm 1;
3) For each test document d, calculate its similarity to each training document in the pruned training corpus;
4) Sort the computed similarities to get kNN(d);
5) Decide d's class based on formulas (5) and (6).
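
The following Python sketch approximates Algorithm 1, reusing the hypothetical helpers from the earlier sketches (k_reachability, k_coverage, implied, pruning_priority, similarity). It is a simplified reading of the pseudocode under stated assumptions, not the authors' implementation: in particular, a full implementation would refill each k-reachability set from the remaining documents after a removal (step 14), whereas this sketch only drops the pruned neighbor.

```python
def prune_corpus(vectors, labels, centroids, k):
    """Approximate Algorithm 1: return the ids kept after pruning noisy and
    superfluous-but-not-critical documents from the training corpus."""
    kept = set(vectors)                                                  # P = T
    reach = {d: k_reachability(d, vectors, k) for d in vectors}
    cover = {d: k_coverage(d, vectors, labels, k) for d in vectors}

    def is_noisy(d):  # Definition 5
        return not implied(d, reach[d], vectors, labels) and not cover[d]

    def is_critical(d):  # Definition 4, conditions a) and b)
        return any(not implied(c, reach[c], vectors, labels) or
                   not implied(c, reach[c] - {d}, vectors, labels)
                   for c in cover[d])

    # Steps 4-9: discard noisy documents first.
    for d in [d for d in vectors if is_noisy(d)]:
        kept.discard(d)
        for c in cover[d]:
            reach[c].discard(d)

    # Steps 10-14: prune superfluous, non-critical documents in priority order.
    candidates = sorted((d for d in kept
                         if implied(d, reach[d], vectors, labels)),      # superfluous
                        key=lambda d: pruning_priority(d, vectors, labels, centroids, k),
                        reverse=True)
    for d in candidates:
        if d in kept and not is_critical(d):
            kept.discard(d)
            for c in cover[d]:
                reach[c].discard(d)
    return kept
```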

4 Experimental Results

We evaluate the proposed approach using the Reuters benchmark compiled by Apte et al. for their evaluation of SWAP-1, obtained by removing all of the unlabeled documents from the original Reuters corpus and restricting the categories to have a training-set frequency of at least two [12]. Usually this corpus is simply referred to as Apte. We do not use the Apte corpus directly; instead, we first remove training and test documents that belong to two or more categories, and then select the top 10 categories to form our own compiled Apte corpus. Statistics of the compiled Apte corpus are listed in Table 1.

Table 1. Our compiled Apte corpus (TC-Apte)

Category        Number of training docs    Number of test docs
Acq             597                         750
Coffee          93                          5
Crude           55                          4
Earn            840                         70
Interest        9                           90
money-fx        5                           00
money-supply    3                           30
Ship                                        38
Sugar           97                          3
Trade           5                           88
Total           5773                        447

We implemented a prototype with VC++ 6.0 under Windows 2000. Experiments were carried out on a PC with a P4.4GHz CPU and 56MHz memory. The goal of the experiments is to evaluate the performance (effectiveness and efficiency) of our approach. For simplicity, we denote the compiled Apte corpus TC-Apte. In the experiments, TC-Apte is first pruned by our pruning algorithm; the resulting pruned corpus is denoted TC-Apte-pruned. Classifiers are trained with TC-Apte and its corresponding pruned corpus TC-Apte-pruned, and the trained classifiers' performances are then measured and compared. Three performance parameters are measured: precision (p), recall (r), and classification speedup (or simply speedup), in which p and r are used for effectiveness measurement, and speedup is used to measure the efficiency improvement of our approach. Here we use the micro-averaging method to evaluate performance averaged across multiple classes. In the context of this paper (i.e., each document, whether for training or for test, belongs to only one category), micro-average p (or simply micro-p) and micro-average r (or simply micro-r) have similar values. In this paper, we use micro-p, which can be evaluated as

$micro\text{-}p = \dfrac{\text{the number of correctly assigned test documents}}{\text{the number of test documents}}$.

We define the efficiency speedup by the formula

$speedup = \dfrac{t_{TC\text{-}Apte}}{t_{TC\text{-}Apte\text{-}pruned}}$.

Above, $t_{TC\text{-}Apte}$ and $t_{TC\text{-}Apte\text{-}pruned}$ are the time costs for classifying a test document (or a set of test documents) based on TC-Apte and TC-Apte-pruned respectively. (The Apte corpus is available at: http://moscow.mt.cs.cmu.edu:808/reuters_450/apte)
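
A hypothetical sketch of how the two reported measures could be computed; classify stands for any kNN classifier over a given corpus (e.g. the knn_classify sketch above), and none of the names here come from the paper.

```python
import time

def micro_p(predicted, gold):
    """Micro-averaged precision in the single-label setting:
    the fraction of test documents assigned their correct class."""
    correct = sum(1 for doc, cls in predicted.items() if gold[doc] == cls)
    return correct / len(gold)

def speedup(classify, test_docs, full_corpus, pruned_corpus):
    """speedup = t_TC-Apte / t_TC-Apte-pruned, timed over the same test set."""
    t0 = time.perf_counter()
    for d in test_docs:
        classify(d, full_corpus)
    t1 = time.perf_counter()
    for d in test_docs:
        classify(d, pruned_corpus)
    t2 = time.perf_counter()
    return (t1 - t0) / (t2 - t1)
```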

Due to space limitations, here we give only partial experimental results. Fig. 1 illustrates the impact of the k value on pruning effectiveness and classification performance over TC-Apte. From Fig. 1, we can see that by using our corpus-pruning technique, classification efficiency is improved by a factor larger than 4, with less than 3% degradation of micro-averaged performance. Obviously, this result is acceptable.

Fig. 1. Impact of k value on pruning effectiveness and classification performance: (a) k value vs. micro-p (pruning vs. no pruning); (b) k value vs. speedup.

5 Conclusion

The rapid growth of available text information raises the requirement for efficient text classification methods. Although kNN based text classification is a good method as far as performance is concerned, it is inefficient because it has to calculate the similarity of the test document to each training document in the training corpus.

In this paper, we propose a training-corpus pruning based approach to speed up the kNN method. By using our approach, the size of the training corpus can be reduced significantly while classification performance is kept at a level close to that obtained without pruning the training documents. Experimental results validate the efficiency and effectiveness of the proposed approach.

References

1. F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1): 1-47, 2002.
2. Y. Yang and X. Liu. A re-examination of text categorization methods. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), 1999.
3. B. Masand, G. Linoff, and D. Waltz. Classifying news stories using memory-based reasoning. Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'92), 1992, 59-65.
4. Y. Yang. Expert network: effective and efficient learning from human decisions in text categorization and retrieval. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'94), 1994, 13-22.
5. W. Lam and C. Y. Ho. Using a generalized instance set for automatic text categorization. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98), 1998, 81-89.
6. S. Zhou and J. Guan. Chinese documents classification based on N-grams. A. Gelbukh (Ed.): Intelligent Text Processing and Computational Linguistics, LNCS, Vol. 2276, Springer-Verlag, 2002, 405-414.
7. S. Zhou. Key Techniques of Chinese Text Database. PhD thesis, Fudan University, China, 2000.
8. D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka. Training algorithms for linear text classifiers. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), 1996, 298-306.
9. E. H. Han and G. Karypis. Centroid-based document classification algorithm: analysis & experimental results. Technical Report TR-00-07, Dept. of CS, Univ. of Minnesota, Minneapolis, 2000. http://www.cs.umn.edu/~karypis
10. G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. In K. S. Jones and P. Willett (Eds.), Readings in Information Retrieval. Morgan Kaufmann, 1997, 273-280.
11. Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), 1997.
12. C. Apte, F. Damerau, and S. Weiss. Text mining with decision rules and decision trees. Proceedings of the Conference on Automated Learning and Discovery, Workshop 6: Learning from Text and the Web, 1998.