A Hybrid Text Classification System Using Sentential Frequent Itemsets


Shizhu Liu, Heping Hu
College of Computer Science, Huazhong University of Science and Technology, Wuhan 430074, China
stoneboo@126.com

Abstract: Text classification techniques mostly rely on single-term analysis of the document data set, while many concepts, especially the specific ones, are usually conveyed by sets of terms. To achieve a more accurate text classifier, more informative features, including frequent words co-occurring in the same sentence and their weights, are particularly important in such scenarios. In this paper we propose a novel approach to text classification using sentential frequent itemsets, a concept from association rule mining, which views a sentence rather than a document as a transaction and uses a variable precision rough set based method to evaluate each sentential frequent itemset's contribution to the classification. Experiments over the Reuters corpus are carried out, which validate the practicability of the proposed system.

Key-Words: text classification, sentential frequent itemsets, variable precision rough set model

1. Introduction

In an effort to keep up with the tremendous growth of the World Wide Web, many research projects have targeted how to organize such information in a way that makes it easier for end users to find the information they want efficiently and accurately. Information on the Web is mostly present in the form of text documents, and that is the reason content-based document management tasks (collectively known as information retrieval, IR) have, in the last 10 years, gained prominent status in the information systems field. Text classification (TC, also known as text categorization, or topic spotting), the activity of labeling natural language texts with thematic categories from a predefined set, is one such task. TC, which became a major subfield of the information systems discipline in the early 90s, is now being applied in many contexts, ranging from document indexing based on a controlled vocabulary to document filtering, automated metadata generation, word sense disambiguation, population of hierarchical catalogues of Web resources, and in general any application requiring selective and adaptive document dispatching.

Recent studies in the data mining community have proposed new methods for classification employing association rule mining [1, 2]. All these current associative classifiers, to the best of our knowledge, exploit document-level co-occurring words, which are groups of words co-occurring frequently in the same document [3, 4]: training documents are modeled as transactions where items are words from the document. Frequent itemsets are then mined from such transactions to capture document semantics and generate IF-THEN rules accordingly. However, while a document is assumed to be the unit representing an entire idea, the basic semantic unit in a document is actually the sentence. Words co-occurring in the same sentence have more or less semantic association, and convey more local information than a set of words scattered across several sentences of a document.

According to the above observations, in this paper we propose a system for text classification based on two key concepts. The first is the document DB model, which treats the sentence rather than the document as the transaction, so as to mine sentential frequent itemsets (SFIs) as the features of a document. The second is a variable precision rough set model based method to evaluate each SFI's contribution to the classification. The system consists of four components:

1. A document restructuring scheme that cleans noisy information in each document and maps the original document into a document DB, in which each sentence is a transaction whose items are the words of the sentence.
2. An SFI generator that uses the Apriori algorithm to mine sentential frequent itemsets from the training documents' DBs; these itemsets serve as the features of the corresponding documents.

3. A topic template generator that prunes the SFIs and uses the remaining ones to construct topic templates.

4. A classifier that scores each SFI's weight in the test document and in the topic templates using our novel weighting scheme, and measures the similarity between them.

The integration of these four components proved superior in performance to traditional text classification methods. Although the whole system performs quite well, each component can be used independently of the others.

The overall system design is illustrated in Fig. 1.

The rest of this paper is organized as follows. Section 2 introduces some preliminary knowledge and states the problem formally. Section 3 presents the steps of data preparation. Section 4 introduces the document DB model and the sentential frequent itemset mining process. Section 5 introduces the SFI pruning method. Section 6 presents our proposed SFI weighting scheme and the SFI-based similarity measure. Section 7 discusses the experimental results. Finally, we conclude and discuss future work in the last section.

Figure 1. Text classification system design (training documents → document constructor → document DB → SFI miner → topic template generator; unlabeled documents → classifier)

2. Preliminaries and Problem Definition

2.1 Text categorization

Text categorization is the task of assigning a Boolean value to each pair ⟨d_j, c_i⟩ ∈ D × C, where D is a domain of documents and C = {c_1, ..., c_|C|} is a set of predefined categories. A value of T assigned to ⟨d_j, c_i⟩ indicates a decision to file d_j under c_i. More formally, the task is to approximate the unknown target function Φ: D × C → {T, F} (which describes how documents ought to be classified) by means of a function Φ′: D × C → {T, F}, called the classifier (also known as a rule, hypothesis, or model), such that Φ and Φ′ coincide as much as possible.

Most research in text categorization comes from the machine learning and information retrieval communities. Rocchio's algorithm [10] is the classical method in information retrieval, used in routing and filtering documents. Researchers have tackled text categorization in many ways. Classifiers based on probabilistic methods have been proposed, starting with the first one presented in the literature by Maron in 1961 and continuing with naïve Bayes [11], which has proved to perform well. ID3 and C4.5 are well-known packages whose cores use decision trees to build automatic classifiers [12, 13, 14]. K-nearest neighbor (k-NN) is another technique used in text categorization [15]. Yet another way to construct a text categorization system is inductive rule learning; this type of classifier is represented by a set of rules in disjunctive normal form that best cover the training set [16, 17, 18]. As reported in [19], the use of bigrams improved text categorization accuracy over the use of unigrams. In addition, in recent decades neural networks and support vector machines (SVMs) have been used in text categorization and have proved to be powerful tools [20, 21, 14].

2.2 Variable precision rough set model

Classification is the core foundation of rough set theory. Pawlak's rough set model has the limitation that a classification is either completely correct or completely wrong; namely, the definitions of the lower and upper approximations are crisp, which is not applicable to some complicated classifications. Based on the majority inclusion relation, Ziarko [7] presented a generalized rough set model, named the variable precision rough set model, to overcome this limitation. Given X, Y ⊆ U, we define the inclusion degree of X in Y, denoted C(X, Y), by:

    C(X, Y) = |X ∩ Y| / |X|  if |X| > 0;  C(X, Y) = 0  if |X| = 0    (1)

IS = ⟨U, A, V, f⟩ is the information system under consideration, where A is the set of attributes, A = {a_1, a_2, ..., a_k}, V is the domain of values of A, and f is an information function f: U × A → V. In text classification, U is the text collection and A is the feature set; V is the domain of the weight values of the features in A. R is an indiscernibility relation defined on U, and U/R = {X_1, X_2, ..., X_N} is the family of R-equivalence classes. For a subset of interest X ⊆ U, we define the α-lower approximation and α-upper approximation by:

    R̲_α X = ∪ { X_i ∈ U/R : C(X_i, X) ≥ α }    (2)
    R̄_α X = ∪ { X_i ∈ U/R : C(X_i, X) > 1 − α }

Accordingly, X's α-boundary region is defined by:

    BND_α X = ∪ { X_i ∈ U/R : 1 − α < C(X_i, X) < α }    (3)

where α ∈ [0.5, 1]. It is easy to show that this model is equivalent to Pawlak's model when α = 1. This generalization smooths the boundary between the lower and upper approximations. In the original rough set model, the classification of the data with respect to the target event is developed using three regions: the positive region, in which an event would occur with certainty; the negative region, in which an event would not occur with certainty; and a boundary region, in which an event might or might not occur. The variable precision rough set model defines the positive and negative regions as areas where approximate classification with respect to the target event is possible with an error frequency less than some predefined level. In other words, the positive region becomes a region where the event occurred most of the time, and the negative region is the region where the event occurred infrequently.
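To make equations (1)-(3) concrete, the following minimal, self-contained Python sketch computes the inclusion degree and the α-approximations; the toy universe, partition, and α value are our own illustrative assumptions, not data from the paper.

def inclusion(X, Y):
    """Inclusion degree C(X, Y) of equation (1)."""
    return len(X & Y) / len(X) if X else 0.0

def lower_approx(classes, X, alpha):
    """Alpha-lower approximation (2): union of classes included in X to degree >= alpha."""
    return set().union(*(c for c in classes if inclusion(c, X) >= alpha))

def upper_approx(classes, X, alpha):
    """Alpha-upper approximation (2): union of classes with inclusion degree > 1 - alpha."""
    return set().union(*(c for c in classes if inclusion(c, X) > 1 - alpha))

def boundary(classes, X, alpha):
    """Alpha-boundary region (3)."""
    return upper_approx(classes, X, alpha) - lower_approx(classes, X, alpha)

# Toy universe of six objects partitioned into three R-equivalence classes.
classes = [{1, 2, 3}, {4, 5}, {6}]
X = {1, 2, 3, 4}
print(lower_approx(classes, X, 0.8))  # {1, 2, 3}: fully included in X
print(upper_approx(classes, X, 0.8))  # {1, 2, 3, 4, 5}: {4, 5} is half inside X
print(boundary(classes, X, 0.8))      # {4, 5}
# With alpha = 1 the three functions reduce to Pawlak's crisp model.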

3. Data Preparation

In our approach, to convert the text of a document into our proposed document DB model, which will be introduced in Section 4.1, some data preprocessing measures have to be applied to each document. A sentence boundary detection algorithm was developed to locate sentence boundaries in the documents. The algorithm is based on a finite-state-machine lexical analyzer with heuristic rules for finding the boundaries. About 97 percent of the actual boundaries are correctly detected; the resulting documents contain very accurate sentence separation, with almost negligible noise. Finally, to weed out those words that contribute little to building the classifier and to reduce the high dimensionality of the data, a document cleaning step is performed to remove stop-words that have no significance and to stem the words using the popular Porter stemmer algorithm [5]. The subsequent phase consists of discovering SFIs from each document DB.

4. Document Database and SFI Mining

4.1 Document DB Model

A sentence is a grammatical unit that is syntactically independent and has a subject that is expressed or understood, and the central meaning of a document is stated by organizing the basic ideas of its sentences. Focusing on mining the local context information in sentences, we propose a document DB model: a word is viewed as an item, a natural sentence is viewed as a transaction, and a document is viewed as a transaction database. The detailed workflow of constructing a document DB is illustrated in Fig. 2. The work presented here thus takes a step further toward an efficient way of mining local context information.

Figure 2. Process of document DB construction (documents → sentence segmenter → stop-words remover → stemmer → encoder → document DB)
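As an illustration of the workflow in Fig. 2, the sketch below builds a document DB in Python. The regex sentence splitter and the tiny stop-word list are simplified stand-ins (the paper's finite-state boundary detector and its stop list are not published), NLTK's PorterStemmer is assumed to be available as an implementation of Porter's algorithm [5], and the final encoder step of Fig. 2 (mapping stems to integer ids) is omitted for readability.

import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to"}
stemmer = PorterStemmer()

def build_document_db(text):
    """Map a document to a transaction DB: one set of stemmed items per sentence."""
    sentences = re.split(r"[.!?]+", text)                 # naive boundary detection
    db = []
    for sent in sentences:
        words = re.findall(r"[a-z]+", sent.lower())       # tokenize
        items = {stemmer.stem(w) for w in words if w not in STOP_WORDS}
        if items:                                         # drop empty transactions
            db.append(items)
    return db

doc = ("Text classification assigns categories to documents. "
       "Frequent itemsets of words in a sentence convey local context.")
for i, transaction in enumerate(build_document_db(doc), 1):
    print(i, sorted(transaction))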

4.2 SFI Mining

After mapping each document to a transaction DB, we employ the Apriori algorithm to extract frequently occurring sets of terms in the sentences of each document and use them as that document's characteristics. Compared to documental frequent co-occurring words, sentential frequent words convey more local context information. The algorithm is described in more detail in Figure 3. In the algorithm, step (2) generates the frequent 1-itemsets. In steps (3)-(13), all the k-frequent itemsets are generated and merged with the category in C. The sentence space is reduced in each iteration by eliminating the transactions that do not contain any of the frequent itemsets; this step is done by the FilterTable(S, F) function. The sentential frequent itemsets discovered in this stage of the process are further processed to build the topic templates.

Figure 3. Algorithm: find sentential frequent itemsets in the given document DB
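Since Figure 3 itself is not reproduced in this transcription, the following sketch is our generic reconstruction of an Apriori pass over a document DB, including a transaction-filtering step analogous to the FilterTable(S, F) function described above; the exact details of the paper's algorithm may differ.

from itertools import combinations

def mine_sfis(db, min_support=2):
    """Mine sentential frequent itemsets from a document DB (list of item sets)."""
    db = [set(t) for t in db]
    level = [frozenset([item]) for item in {i for t in db for i in t}]  # 1-itemsets
    frequent, k = {}, 1
    while level:
        # Count support: number of sentences (transactions) containing the candidate.
        counts = {c: sum(1 for t in db if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # FilterTable analogue: drop sentences containing no frequent itemset.
        db = [t for t in db if any(c <= t for c in survivors)]
        # Join step: merge k-itemsets that differ in one item into (k+1)-candidates.
        level = {a | b for a, b in combinations(survivors, 2) if len(a | b) == k + 1}
        k += 1
    return frequent

# Toy document DB: three sentences, items already stemmed.
db = [{"text", "classif"}, {"text", "classif", "mine"}, {"mine", "rule"}]
for itemset, support in mine_sfis(db).items():
    print(sorted(itemset), support)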

5. Variable Precision Rough Set Model Based SFI Evaluation Method

By merging the SFIs of documents which belong to the same category, we obtain the features of that category spontaneously. We will use these frequent itemsets to construct each category's topic template. Since the number of SFIs associated with a category can be very large, how to calculate each SFI's global weight, i.e., the SFI's contribution to the classification, is the key problem. We propose a weighting scheme based on the variable precision rough set model to evaluate each SFI's global weight, on which we can select the SFIs for each topic template.

Let F = {X_1, X_2, ..., X_|C|} be a partition of D, which is the classification of the training documents according to the set of predefined categories C = {c_1, ..., c_|C|}. An SFI R ⊆ A serves as the condition attribute subset here. According to Pawlak's rough set model, R̲F and R̄F are defined by:

    R̲F = { R̲X_1, R̲X_2, ..., R̲X_n }    (4)
    R̄F = { R̄X_1, R̄X_2, ..., R̄X_n }

Correspondingly, in the variable precision rough set model, R̲_α F and R̄_α F are defined by:

    R̲_α F = { R̲_α X_1, R̲_α X_2, ..., R̲_α X_n }    (5)
    R̄_α F = { R̄_α X_1, R̄_α X_2, ..., R̄_α X_n }

A measure was introduced to calculate the imprecision of this classification, named the approximate classification quality, defined by:

    γ_R(F) = |∪_{i=1..n} R̲_α X_i| / |U|    (6)

The approximate quality denotes the ratio of objects that can be classified into the F-equivalence classes with certainty by the SFI R. In other words, γ_R(F) measures the degree of consistency between the classification by R and F, which may be interpreted as the contribution that the SFI makes to the classification. Once γ_R(F) has been calculated for all SFIs and they have been ordered in ascending order, we can obtain a concise representation of the data by cutting the features whose classification quality value is lower than a threshold that users have predefined.

6. An SFI-Based Similarity Measure

As mentioned earlier, sentential frequent itemsets convey local context information, which is essential for accurately ranking a document's appropriateness to categories. Toward this end, we devised a scheme to calculate the weight of an SFI in the test document and in the topic templates, and the cosine measure based on these weights is used to perform the classification. The SFI weighting scheme is a function of three factors: the length l_j of the SFI, the frequency f_ij of the SFI in the document (or topic template), and the level of significance (global weight) γ_j of the SFI, presented in Section 5:

    w_ij = f_ij · l_j · γ_j    (7)

The frequency of an SFI is an important factor in the measure: the more frequently the SFI appears in the document or the topic template, the more important the information it conveys. Similarly, the longer the SFI, the more specific the information it conveys. The similarity between the test document d_j and the topic c_i is calculated with the cosine measure:

    sim_SFI(c_i, d_j) = Σ_{k=1..N} w_ik · w_jk / ( √(Σ_{k=1..N} w_ik²) · √(Σ_{k=1..N} w_jk²) )    (8)

6.1 Combining Single-Term and SFI Similarities

If the similarity between a document and a topic were based solely on matching frequent itemsets, with no single terms considered at the same time, related documents could be judged as non-similar if they do not share enough SFIs (a typical case). Shared SFIs provide important local context matching, but sometimes similarity based on SFIs alone is not sufficient. To alleviate this problem, and to produce high quality classification, we combine a single-term similarity measure with our itemset-based similarity measure. We used the cosine correlation similarity measure [6], [7], with TF-IDF term weights, as the single-term measure. The cosine measure was chosen due to its wide use in the text classification literature, and since it has been described as being able to capture human categorization behavior well. TF-IDF weighting is also a widely used term weighting scheme. The combination of the term-based and the SFI-based similarity measures is a weighted average of the two quantities, given by (9). The reason for separating single terms and SFIs in the similarity equation, as opposed to treating a single term as a one-word itemset, is to allow evaluating the blending factor between the two quantities and observing the effect of SFIs on similarity as opposed to single terms.

    sim(c_i, d_j) = α · sim_SFI(c_i, d_j) + (1 − α) · sim_t(c_i, d_j)    (9)

where α is a value in the interval [0, 1] which determines the weight of the SFI similarity measure, or, as we call it, the Similarity Blend Factor. According to the experimental results discussed in Section 7, we found that a value between 0.6 and 0.8 for α results in the maximum improvement in classification quality.
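Putting Sections 5 and 6 together, this sketch computes the approximation quality γ of equation (6) for one SFI, the SFI weights of equation (7), and the blended similarity of equations (8) and (9). It assumes the SFI induces two R-equivalence classes (documents containing it and documents not containing it); that reading, along with all data structures, names, and toy values, is our own assumption rather than the paper's published code.

from math import sqrt

def inclusion(X, Y):
    return len(X & Y) / len(X) if X else 0.0

def gamma(sfi_docs, all_docs, categories, alpha=0.8):
    """Approximation quality (6): share of U classified with certainty by the SFI."""
    classes = [sfi_docs, all_docs - sfi_docs]       # R-equivalence classes of the SFI
    certain = set()
    for cat_docs in categories.values():            # F-equivalence classes
        for eq in classes:
            if inclusion(eq, cat_docs) >= alpha:    # eq lies in an alpha-lower approx.
                certain |= eq
    return len(certain) / len(all_docs)

def sfi_weight(freq, length, gamma_value):
    """Equation (7): w = f * l * gamma."""
    return freq * length * gamma_value

def cosine(w1, w2):
    """Equation (8): cosine over SFI weight vectors stored as dicts (SFI -> weight)."""
    num = sum(w * w2.get(s, 0.0) for s, w in w1.items())
    den = sqrt(sum(w * w for w in w1.values())) * sqrt(sum(w * w for w in w2.values()))
    return num / den if den else 0.0

def blended_similarity(sim_sfi, sim_term, blend=0.7):
    """Equation (9); a blend factor in [0.6, 0.8] worked best in the experiments."""
    return blend * sim_sfi + (1 - blend) * sim_term

# Toy data: six training documents, two categories; one SFI occurs in docs 1-3.
U = {1, 2, 3, 4, 5, 6}
F = {"acq": {1, 2, 3}, "earn": {4, 5, 6}}
g = gamma({1, 2, 3}, U, F)                   # 1.0: the SFI separates the two classes
doc = {"oil price": sfi_weight(2, 2, g)}     # a 2-word SFI seen twice in the document
template = {"oil price": sfi_weight(5, 2, g)}
print(blended_similarity(cosine(doc, template), 0.4))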
7. Experimental Results

In order to test the effectiveness of the text classification system, we conducted a set of experiments using our proposed document DB model, the variable precision rough set model based SFI pruning method, the SFI weighting scheme, and the similarity measure.

7.1 Text Corpora

Our evaluation experiments were conducted on the well-known Reuters-21578 collection, which is usually split into two parts: a training set for building the classifier and a testing set for evaluating the effectiveness of the system. There are many splits of the Reuters collection; we selected the ModApte version. This split leads to a corpus of 12,902 documents consisting of 9,603 training documents and 3,299 testing documents. All these documents belong to 135 topics. However, only 93 topics have more than one document in the training set, and 82 topics have less than 100 documents [8]. Obviously, the performance on categories with just a few documents would be very low, especially for those that do not even have a document in the training set. Among the documents there are some that have no topic assigned to them; we chose to ignore such documents since no knowledge can be derived from them. Finally, we selected the ten categories with the largest number of corresponding training documents to test our system. Because other researchers often employ a similar strategy, we can conveniently compare our experimental results with the work of other researchers. There are 6,488 training documents and 2,545 testing documents in these ten retained categories.

7.2 Evaluation Measures

In order to assess the performance of our approach, we adopted some quality measures widely used in the text mining literature for the purpose of text classification. The first two measures are precision and recall. The terms used to express precision and recall are given in the contingency table (Table 1). Estimates of precision and recall with respect to c_i may thus be obtained as:

    P_i = TP_i / (TP_i + FP_i)    (10)
    R_i = TP_i / (TP_i + FN_i)    (11)

For obtaining global estimates of P and R, two different methods are adopted.

Micro-averaging: P and R are obtained by summing over all individual decisions:

    P^µ = Σ_{i=1..|C|} TP_i / Σ_{i=1..|C|} (TP_i + FP_i)    (12)
    R^µ = Σ_{i=1..|C|} TP_i / Σ_{i=1..|C|} (TP_i + FN_i)    (13)

where µ indicates micro-averaging. The global contingency table (Table 2) is thus obtained by summing over the category-specific contingency tables.

Macro-averaging: precision and recall are first evaluated locally for each category, and then globally over the results of the different categories:

    P^M = Σ_{i=1..|C|} P_i / |C|    (14)
    R^M = Σ_{i=1..|C|} R_i / |C|    (15)

where M indicates macro-averaging. Another measure used here is the Break-Even Point (BEP), that is, the value at which precision equals recall.

Table 1. The contingency table for category c_i

                            Classifier judgments
                            YES        NO
  Expert judgments  YES     TP_i       FN_i
                    NO      FP_i       TN_i

Table 2. The global contingency table for C = {c_1, ..., c_|C|}

                            Classifier judgments
                            YES           NO
  Expert judgments  YES     Σ_i TP_i      Σ_i FN_i
                    NO      Σ_i FP_i      Σ_i TN_i
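The short sketch below implements equations (10)-(15); the per-category contingency counts are invented toy numbers, used only to show the calculation.

def precision(tp, fp):
    """Equation (10)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Equation (11)."""
    return tp / (tp + fn) if tp + fn else 0.0

def micro_macro(tables):
    """tables maps category -> (TP, FP, FN); returns micro/macro precision and recall."""
    tp = sum(t[0] for t in tables.values())
    fp = sum(t[1] for t in tables.values())
    fn = sum(t[2] for t in tables.values())
    micro_p, micro_r = precision(tp, fp), recall(tp, fn)                      # (12), (13)
    macro_p = sum(precision(*t[:2]) for t in tables.values()) / len(tables)   # (14)
    macro_r = sum(recall(t[0], t[2]) for t in tables.values()) / len(tables)  # (15)
    return micro_p, micro_r, macro_p, macro_r

# Invented counts for three categories.
tables = {"acq": (90, 10, 20), "earn": (150, 5, 10), "grain": (30, 20, 15)}
print(micro_macro(tables))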

7.3 Experimental Results

In order to better understand the effect of the SFI-based similarity measure on classification quality, we carried out a set of experiments on the text corpora mentioned in Section 7.1 and compared the experimental results with the most well-known methods.

Table 3 (the results for the other classification systems are reported as given in [9]) shows a comparison between our classifier and the other well-known methods. The measures used here are the precision/recall break-even point, micro-average, and macro-average on the ten most populated Reuters categories. Our system proves to outperform most of the conventional methods, although its performance is not very good for three categories, i.e., grain, money-fx, and trade.

Table 3. Precision/recall break-even point on the ten most populated Reuters categories for SFI-BC and the most well-known classifiers

8. Conclusion and Future Work

Text classification is a key text-mining problem, which is useful to a great number of text-based applications. We presented a system composed of four components in an attempt to improve text classification. The first component cleans the data and maps each document into a document DB. The second component uses the Apriori algorithm to mine the sentential frequent itemsets from the document DB and uses them as the features of the corresponding document. The third component is the topic template generator; here we propose a variable precision rough set based method to evaluate each SFI's contribution to the classification. The fourth component is the SFI-based similarity measure. By carefully examining the factors affecting the classification, we devised an SFI-based similarity measure that is capable of accurately calculating the similarity between a test document and a topic template. The merit of such a design is that each component can be utilized independently of the others, but we are confident that the combination of these components leads to better results. The experimental results show that the SFI-based classifier performs well and that its effectiveness is comparable to the most well-known text classifiers.

There are a number of future research directions to extend and improve this work. One direction is to improve the accuracy of SFI-BC; although the current scheme proved more accurate than traditional methods, there is still room for improvement. Another direction is to improve the quality of feature selection. Other feature selection techniques, such as latent semantic analysis, which could give insight into the discriminative features among classes, may complement our strategy. Although the work presented here is aimed at text classification, it could easily be adapted to Web documents as well; however, it would then have to take the semi-structure of Web documents into account. Our intention is to develop a Web document classification system with our approach.

References

[1] W. Li and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In IEEE International Conference on Data Mining (ICDM'01), San Jose, California, November 29-December 2, 2001.

[2] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In ACM Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD'98), pages 80-86, New York City, NY, August 1998.

[3] M. Antonie and O. R. Zaïane. Text document categorization by term association. In Proc. of IEEE Intl. Conf. on Data Mining, pages 19-26, 2002.

[4] D. Meretakis, D. Fragoudis, H. Lu, and S. Likothanassis. Scalable association-based text classification. In Proc. of ACM CIKM, 2000.

[5] M. F. Porter. An algorithm for suffix stripping. Program, vol. 14, no. 3, pp. 130-137, July 1980.

[6] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, vol. 18, no. 11, pp. 613-620, Nov. 1975.

[7] G. Salton. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Reading, MA: Addison Wesley, 1989.

[8] O. R. Zaïane and M. L. Antonie. Classifying text documents by associating terms with text categories. In Thirteenth Australasian Database Conference (ADC'02), pages 215-222, Melbourne, Australia, January 2002.

[9] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In 10th European Conference on Machine Learning (ECML-98), pages 137-142, 1998.

[10] D. A. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In 17th ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR-94), pages 282-291, 1994.

[11] D. Lewis. Naïve (Bayes) at forty: The independence assumption in information retrieval. In 10th European Conference on Machine Learning (ECML-98), pages 4-15, 1998.

[12] W. Cohen and H. Hirsh. Joins that generalize: text classification using WHIRL. In 4th International Conference on Knowledge Discovery and Data Mining (SIGKDD'98), pages 169-173, New York City, USA, 1998.

[13] W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2):141-173, 1999.

[14] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In 10th European Conference on Machine Learning (ECML-98), pages 137-142, 1998.

[15] Y. Yang. An evaluation of statistical approaches to text categorization. Technical Report CMU-CS-97-127, Carnegie Mellon University, April 1997.

[16] I. Moulinier and J.-G. Ganascia. Applying an existing machine learning algorithm to text categorization. In S. Wermter, E. Riloff, and G. Scheler, editors, Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing. Springer Verlag, Heidelberg, Germany, 1996. Lecture Notes in Computer Science series, number 1040.

[17] H. Li and K. Yamanishi. Text classification using ESC-based stochastic decision lists. In 8th ACM International Conference on Information and Knowledge Management (CIKM-99), pages 122-130, Kansas City, USA, 1999.

[18] C. Apte, F. Damerau, and S. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3):233-251, 1994.

[19] C. M. Tan, Y. F. Wang, and C. D. Lee. The use of bigrams to enhance text categorization. Journal of Information Processing and Management, 2002. http://www.cs.ucsb.edu/yfwang/papers/g&m.pdf

[20] M. Ruiz and P. Srinivasan. Neural networks for text categorization. In 22nd ACM SIGIR International Conference on Information Retrieval, pages 282-282, Berkeley, CA, USA, August 1999.

[21] Y. Yang and X. Liu. A re-examination of text categorization methods. In 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR-99), pages 42-49, Berkeley, US, 1999.