Application of k-nn Classifier to Categorizing French Financial News


Huazhong KOU 1, Georges GARDARIN 2, Alain D'heygère 2, Karine Zeitouni 1

1 PRiSM Laboratory, University of Versailles Saint-Quentin, 45 Etats-Unis Road, Versailles, France
{Huazhong.Kou,
2 e-xmlmedia, 3 Avenue du Général Leclerc, Bourg La Reine, France
{Georges.Gardarin, Alain.D'heygère}@e-xmlmedia.fr

Abstract: We have implemented the document categorization system DocCat to automatically organize French financial news for the FirstInvest site. This paper describes the system framework and the main techniques we use. In DocCat, both a relational database and XML are used to organize documents, our CBA algorithm is used to select features, and the k nearest neighbor algorithm is implemented as the categorization model. We use 4000 financial news items to train and evaluate DocCat. The preliminary experimental results show that DocCat produces satisfactory performance. The flexible design allows users to easily adapt DocCat to different application domains.

Keywords: k-nn, document categorization, machine learning, XML.

1. Introduction

Created in 1997, FirstInvest is an Internet financial media outlet that specializes in distributing financial news and experts' opinions on the stock exchange. Today, it is one of the most significant financial sites in France in terms of monthly visitors [1]. To facilitate the distribution of financial news, every day the news is edited and then manually organized into predefined categories by experts at FirstInvest. Manual categorization is expensive and time consuming. This led us to collaborate with FirstInvest to organize French financial news automatically. In this framework, we have proposed and implemented a document categorization system called DocCat. This research is supported by a national RNTL project called CONTEXTE Bourse.

Document categorization is the procedure of assigning one or more predefined category labels to a free text document. A primary application of text categorization is to assign subject categories to documents to support information retrieval or to aid human indexers in assigning such categories. Categorization can also help build a personalized net news filter. In DocCat, we implement the k nearest neighbor (k-nn) categorization algorithm, a classical instance-based machine learning algorithm. Many empirical studies have found that k-nn is one of the top-performing classifiers [2][3].

This paper focuses on the application of DocCat to organizing the financial news at the FirstInvest site. The rest of this paper is organized as follows: Section 2 describes the FirstInvest corpus; Section 3 presents the k-nn categorization model; Section 4 discusses the general system framework, the document organization schema and the system functionality; Section 5 explains the system evaluation measures and experiments; and the conclusion is given in Section 6.

2. FirstInvest Corpus

4000 financial news items were collected from the FirstInvest site between 2001 and early 2002. The news items dated up to and including a cutoff date in January 2002 are used as training documents, while the remaining 500 items are used to evaluate the system. Each news item belongs to exactly one of 30 predefined categories (see Table 1), but the distribution of these categories in the corpus is uneven: for example, Biotechnologie and Telecoms account for different shares of the documents.

Aéronautique/Défense          Immobilier               Medias/TV/Communication
Pharmacie/chimie/gaz          Automobile               Distribution alimentaire
Web Agency                    Marchés financiers       Biotechnologie
SSII                          Assurances               Holdings
MP/Biens d'équipement         Biens de consommation    Courtage en ligne
Agro-alimentaire              FAI/Portail              Construction/BTP
Hotellerie/loisir/transport   Technologiques           Energie/environnement/services
Telecoms                      Editeur de logiciels     Cosmétique/luxe
Marketing/Bases de données    Banques                  Matériaux de construction
Distribution spécialisée      Editeur de jeux vidéos   Pétrolier

Table 1. Financial Categories of FirstInvest

Figure 1 shows the format of an example training news item. Each news item contains one attribute and five elements. To categorize such a news item manually, the indexer must read the content element and analyze it.

    <?xml version="1.0" encoding="iso-8859-1"?>
    <corpus>
      <news newsid="5000">
        <newsdate>0-jan-2002 </newsdate>
        <category>Editeur de logiciels</category>
        <text>
          <title> BVRP affiche un chiffre d'affaires en hausse </title>
          <content>BVRP, éditeur de logiciel de communication, a publié ses chiffres des neuf premiers mois de l'exercice (à la fin avril). Il en ressort un chiffre d'affaires en hausse sur un an de 26,9%, à 28,7 millions d'euros. Si on exclut Lab Production, sa filiale Multimédia, qu'il a cédé récemment, </content>
        </text>
      </news>
    </corpus>

Figure 1. News Format

3. k-nn Categorization Model

According to k-nn, given a new document, the system ranks its neighbors among all training documents by document similarity, and the top k neighbor documents are selected for the following steps. The categories of the k top-ranking neighbors are called candidate categories. A category score is then calculated for each candidate category from the similarities of the selected k documents to the new document. Finally, one or more categories are assigned to the new document by a suitable thresholding strategy [4]. k-nn is a top-performing algorithm, comparable to the most effective support vector machine algorithm reported in [2]. It uses the document vector representation model, under which documents are mapped to points in a high-dimensional concept space [5]. In practice, all document vectors are normalized to unit length.

The values of the document vector elements are calculated by a term weighting algorithm. The tf-idf term weighting model and its variants are often used. In DocCat, the following weighting model is implemented:

$$ w_{ij} = \frac{\log(tf_{ij} + 1.0)\,\log(N/df_i)}{\sqrt{\sum_{l}\left[\log(tf_{lj} + 1.0)\,\log(N/df_l)\right]^{2}}} \qquad (1) $$

where $w_{ij}$ is the weight of the ith term in the jth document, $N$ is the number of training documents, $df_i$ is the number of training documents containing the ith term (document frequency), and $tf_{ij}$ is the number of times the ith term occurs in the jth document (term frequency). A cosine-based similarity is then used to find the neighbors of a given document:

$$ Simil(d_i, d_j) = \cos(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert_2\,\lVert d_j \rVert_2} = \frac{\sum_{l} w_{li}\,w_{lj}}{\sqrt{\sum_{l} w_{li}^{2}}\,\sqrt{\sum_{l} w_{lj}^{2}}} \qquad (2) $$

where $d_i$ and $d_j$ are two document vectors [5].

[Figure 2. The General Framework. Flowchart boxes: Example documents, Preprocessing, Unique terms in documents, Feature Selection, Dictionary, Representation, Document vectors, New document, Unique terms, Document Vector, Similarity Calculation, k top Neighbors, Category Score Calculation, Assigned Category]

4. Implementation of the System

This section presents the general framework of the system and the main techniques we use, including document organization and system functionality.

4.1 General Framework

Figure 2 shows the general framework of DocCat. It is composed of two subsystems: the learning subsystem, linked by thick arrows, and the categorizing subsystem, linked by thin arrows. The goal of the learning subsystem is to determine all system parameters and to create the knowledge database. It proceeds in the following steps:

Preprocessing: we extract all unique words present in each training document and remove stop words, punctuation marks and non-letter characters; the remaining words are folded to lower case and converted to their stems by the Porter stemming algorithm [6]. The final form of a word is called a term. Both term frequency and document frequency are counted for each term. Furthermore, terms with very high or very low document frequency are removed. The resulting terms and their frequencies are stored in database tables as intermediate data.

Feature selection: after preprocessing, the number of remaining terms is still very large, and a subset of terms must be selected by a feature selection algorithm. The χ² test is a well-known feature selection algorithm [7], and our system implements it. The Concept-Based Algorithm (CBA) we proposed [8] is also implemented (see Section 4.2). Both are filter algorithms: first they calculate, at the corpus level, a weight for each term that indicates its power to predict categories; then all terms are ranked in descending order of weight; finally the top terms are selected as feature terms, which make up the indexing dictionary. The indexing dictionary is one of the most important parts of the knowledge database.

Document representation: at the preprocessing step, the unique terms of every document have been identified, and both their document frequency and term frequency have been counted. Given a document, the term weight defined by (1) is then calculated for each dictionary term it contains.
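As an illustration of the document representation step, the weighting model (1) can be sketched in Python. This is a minimal sketch, not DocCat's code: the function name `weight_vector`, the dict-based inputs and the sample frequencies are assumptions made for the example, and the sketch assumes the length-normalized form of (1), log(tf + 1.0) * log(N / df) divided by the Euclidean norm of the document's weights.

```python
import math


def weight_vector(term_freqs, doc_freqs, n_docs):
    """Weights of one document's dictionary terms, following the form of (1).

    term_freqs: {term: occurrences in this document}  (hypothetical layout)
    doc_freqs:  {dictionary term: number of training documents containing it}
    n_docs:     number of training documents N
    """
    raw = {
        term: math.log(tf + 1.0) * math.log(n_docs / doc_freqs[term])
        for term, tf in term_freqs.items()
        if term in doc_freqs  # terms outside the dictionary are ignored
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return raw  # every weight is zero, nothing to normalize
    return {term: w / norm for term, w in raw.items()}


# Made-up frequencies, for illustration only.
doc_freqs = {"logiciel": 50, "hausse": 400, "euro": 900}
vector = weight_vector({"logiciel": 3, "hausse": 1, "bvrp": 2}, doc_freqs, n_docs=1000)
print(vector)
```

A sparse dict keeps only the dictionary terms that occur in the document, which matches the (term, weight) pairs the system stores for each document vector.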
With these weights, the jth document is represented by the following vector:

$$ d_j = (w_{1j}, w_{2j}, w_{3j}, \ldots, w_{Tj}) \in R^{T} \qquad (3) $$

where $w_{ij}$ is the weight of the ith dictionary term in the jth document $d_j$, with $1 \le i \le T$ and $1 \le j \le N$. The document vectors of all training documents constitute the core of the knowledge database in the k-nn categorization system.

The learning phase is followed by the categorizing phase. Categorizing a document begins with preprocessing, whose goal is to identify all dictionary terms present in the document. Its document vector is then created as described above. The remaining steps are:

Similarity calculation: the similarity between the new document vector and every training document vector stored in the knowledge database is calculated by (2).

k nearest neighbors: based on these similarities, all training document vectors are ranked in descending order of similarity, and the top k document vectors are chosen for the category score calculation in the next step.

Category score calculation: the categories to which the k nearest neighbor documents belong are called candidate categories. A score is calculated for each candidate category by a score calculation algorithm, for example by summing the similarities of those k nearest neighbor documents that belong to the category.

Assigning category: the candidate categories are ordered in descending order of their scores, and a thresholding strategy is used to decide which category or categories should be assigned to the new document. Thresholding strategies for text categorization are studied in [4].

DocCat has many system parameters, such as the dictionary size, the value of k and the language. To make the system more flexible, we store all system parameters in a system property file. By modifying this file, users can easily configure DocCat and adapt it to the needs of their application domain.

4.2 Feature Selection Algorithm

We present the Concept-Based Algorithm (CBA) for feature selection.
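The algorithm detailed in this section can be previewed with a short Python sketch: average the document vectors of each category into a concept vector, weight each concept vector by the category's share of the corpus, and sum the results per term. The names `cba_scores` and `select_features`, the dict-based vectors and the toy data are assumptions for illustration, not DocCat's code.

```python
from collections import defaultdict


def cba_scores(doc_vectors, doc_categories):
    """Global weight CW(t) of every term, as described by (4) to (7).

    doc_vectors:    list of {term: weight} dicts, one per training document
    doc_categories: list of category labels, aligned with doc_vectors
    """
    n_docs = len(doc_vectors)
    by_category = defaultdict(list)
    for vector, category in zip(doc_vectors, doc_categories):
        by_category[category].append(vector)

    scores = defaultdict(float)
    for vectors in by_category.values():
        prior = len(vectors) / n_docs      # Pr(C): the category's share of the corpus
        concept = defaultdict(float)       # concept vector: average of the member vectors
        for vector in vectors:
            for term, weight in vector.items():
                concept[term] += weight / len(vectors)
        for term, local_weight in concept.items():
            scores[term] += prior * local_weight
    return dict(scores)


def select_features(scores, size):
    """Top `size` terms in descending order of CW(t)."""
    return sorted(scores, key=scores.get, reverse=True)[:size]


# Toy corpus: three documents in two categories.
docs = [{"a": 0.8, "b": 0.6}, {"a": 1.0}, {"b": 0.6, "c": 0.8}]
labels = ["Banques", "Banques", "Telecoms"]
scores = cba_scores(docs, labels)
print(select_features(scores, 2))
```

In this sketch the category share and the per-category average cancel, so the score of a term equals its mean weight over all training documents; the toy data can be used to check this by hand.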
Under the vector representation model, a document d is represented by (3). One vector is then created for every category by averaging the vectors of the documents belonging to that category. This vector is called the concept vector of the category. The values of its elements characterize the relationship between terms and categories. The concept vector of the category $C_j$ is denoted $C_j^v$ and is calculated by (4):

$$ C_j^v = \frac{1}{|C_j^v|} \sum_{i=1}^{|C_j^v|} d_i^j \qquad (4) $$

where $d_i^j$ is the vector of the ith document in the jth category $C_j$, and $|C_j^v|$ is the number of example documents in the category $C_j$. The lth element $w_{cl}^j$ of $C_j^v$ is calculated by (5):

$$ w_{cl}^j = \frac{1}{|C_j^v|} \sum_{i=1}^{|C_j^v|} w_{li}^j \qquad (5) $$

where $w_{li}^j$ is the weight of the lth term of the ith document in the jth category $C_j$; it is calculated by (1). We use $w_{cl}^j$ to measure the term-goodness between the lth term and the jth category $C_j$. It is a local weight of the lth term with respect to the category $C_j$. Furthermore, we use all local weights of a term to calculate the global weight of the lth term $t_l$ at the corpus level, denoted $CW(t_l)$, by (6):

$$ CW(t_l) = \sum_{j} P_r(C_j)\, w_{cl}^j \qquad (6) $$

where $P_r(C_j)$ is the distribution of the category $C_j$ in the corpus, that is, the proportion of documents in the category $C_j$ among all documents in the corpus. Combining (5) and (6), we have (7):

$$ CW(t_l) = \sum_{j} P_r(C_j)\, \frac{1}{|C_j^v|} \sum_{i=1}^{|C_j^v|} w_{li}^j \qquad (7) $$

All terms found in the corpus are then ranked in descending order of CW(t), and the top terms are selected to constitute the corpus dictionary. We call this algorithm the Concept-Based Algorithm (CBA). For an analysis of CBA, see [8].

4.3 Document Database Schema

Figure 3 shows the main parts of the document data schema in DocCat. The dictionary, the categories and the training document vectors make up the knowledge database of the k-nn categorization system and are stored in relational tables. The Dictionary table has 5 attributes, including term, document frequency and document IDs. The Training Document Vectors table is a vector representation of the original example financial news; it has 4 attributes: document ID, date, document vector and category ID. A document vector is composed of (term, weight) pairs, where each term is a dictionary term found in the document. The category ID indicates the membership between document and category. The Category table contains the names and IDs of the categories used by FirstInvest. The StemWords table keeps the mapping between words and stems.

All original production financial news items are stored in XML documents. One XML document stores at most 2000 news stories.
The corresponding XML schema is as follows:

    <schema xmlns="XMLSchema">
      <element name="news" maxOccurs="2000">
        <attribute name="newsid" type="ID"/>
        <element name="newsdate" type="date"/>
        <element name="text">
          <complexType>
            <element name="title" type="string"/>
            <element name="content" type="string"/>
          </complexType>
        </element>
      </element>
    </schema>

For each new production financial news item, we create an image that stores its category, keywords, vector, etc. These images are stored in XML image documents. One XML image document stores at most 2000 news images. The schema of XML image documents is defined by:

    <schema xmlns="XMLSchema">
      <element name="newsimage" maxOccurs="2000">
        <sequence>
          <attribute name="newsid" type="ID"/>
          <element name="category" type="string"/>
          <element name="keywords" type="string" maxOccurs="20"/>
          <element name="docvector">
            <element name="termweightpair" maxOccurs="unbounded">
              <sequence>
                <element name="term" type="string"/>
                <element name="weight" type="decimal"/>
              </sequence>
            </element>
          </element>
        </sequence>
      </element>
    </schema>

The XML image documents are much shorter than the original financial news, and they are oriented to machine processing. Keyword- and content-based searches are conducted on the XML image documents (see Section 4.4).

We store the document data obtained from the training example documents in relational tables in order to take advantage of database system technology to analyze the corpus. We use XML documents to store the production news and their images so that many Web-oriented technologies, for example XQuery and XSLT, can be used to process and disseminate financial news across the Internet.

4.4 System Functionality

The most basic functionality is to categorize new financial news into a proper category. Besides this, DocCat supports keyword extraction and keyword- and content-based searches.

CATEGORIZING FINANCIAL NEWS

Given a new financial news item, DocCat assigns exactly one category to it. To categorize a news item, DocCat first identifies the dictionary terms present in its content and generates the corresponding document vector, then uses the k-nn classifier to retrieve a suitable category. The document vector and the retrieved category are stored in the XML image document as the values of the docvector element and the category element respectively.

KEYWORD EXTRACTION

In DocCat, keywords do not mean exactly the same thing as traditional library keywords: they are statistical keywords. To extract keywords for a news document, all elements of its document vector are ranked in descending order of weight, and the terms corresponding to the first h elements are selected. If a stemming algorithm has been applied, the selected terms are not really words. In this case, the mapping stored in the StemWords table is used to convert the selected terms from stem form to the real words present in the document. The resulting words are considered keywords and stored in the XML image document as the value of the keywords element.
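The keyword extraction step can be sketched in a few lines of Python. The function name `extract_keywords`, the dict standing in for the StemWords table and the sample stems are assumptions for illustration only.

```python
def extract_keywords(doc_vector, stem_to_word, h):
    """Return the words behind the h highest-weighted terms of a document.

    doc_vector:   {stem: weight} for one news item
    stem_to_word: {stem: word as it appears in the document}; a stand-in for
                  the StemWords table (hypothetical layout)
    h:            number of keywords to keep
    """
    ranked = sorted(doc_vector.items(), key=lambda pair: pair[1], reverse=True)
    top_stems = [stem for stem, _ in ranked[:h]]
    # Fall back to the stem itself when no mapping is recorded.
    return [stem_to_word.get(stem, stem) for stem in top_stems]


# Made-up vector and stem mapping.
vector = {"logiciel": 0.71, "communic": 0.55, "hauss": 0.40, "euro": 0.17}
stems = {"communic": "communication", "hauss": "hausse"}
print(extract_keywords(vector, stems, h=3))
```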
Keyword extraction thus identifies the words that represent the document content and discards the words that are not significant to it.

KEYWORD-BASED SEARCH

In traditional keyword search, the full original text is searched for the keywords entered by users. If a word matching a given keyword is found, the document is returned. This causes three problems. First, searching full text is time consuming. Second, irrelevant documents are returned if words that are not significant to the document content match the keywords given by users. Last, documents containing the relevant concepts wanted by the user are not retrieved if they do not contain the keywords given by the user. In reality, there are usually many ways to express a given concept, so the literal terms in a user's query may not match those of a relevant document. On the other hand, most words have multiple meanings, so the terms in a user's query will literally match terms in documents that are not of interest to the user.

By searching the XML image documents of financial news, the first two problems can be overcome to some extent, because XML image documents are much shorter than the original documents and, in particular, contain only the words that are significant to the document content. Figure 4 shows keyword-based search using XML image documents. Only the text of the keywords element of the XML image document is searched.

[Figure 4. Keyword search using XML image documents. Diagram labels: User, Input words, XML image document, News XML documents, news]

CONTENT-BASED SEARCH

Content-based search means that users can start their query with free text as the query string, for example sentences expressing the desired concepts. DocCat treats the query string as a financial news document and transforms it into a document vector. The similarities between this query vector and all document vectors stored in the XML image documents are then calculated by the similarity model (2).
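A minimal Python sketch of this similarity computation, together with the ranking it feeds, assuming sparse vectors stored as dicts. The names `cosine` and `content_search` and the sample vectors are hypothetical, not DocCat functions or data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def content_search(query_vector, image_vectors, top):
    """Return the `top` (news id, similarity) pairs, most similar first.

    image_vectors: {news id: document vector read from an image document}
    """
    scored = [(news_id, cosine(query_vector, vec))
              for news_id, vec in image_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]


# Made-up image vectors and query.
images = {
    "n1": {"logiciel": 0.8, "hausse": 0.6},
    "n2": {"banque": 1.0},
    "n3": {"logiciel": 0.6, "euro": 0.8},
}
query = {"logiciel": 1.0}
print(content_search(query, images, top=2))
```

When all stored vectors already have unit length, the two norms are 1 and the similarity reduces to the dot product; the sketch keeps the division so that it also works for unnormalized input.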
The financial news items in the news XML documents are then ordered by decreasing similarity, and the top l items are considered content-related documents and returned to the user. Keyword-based search only considers the presence or absence of query words in the documents, while content-based search also distinguishes words by the degree to which they contribute to the document content.

5. Evaluation and Experiments

Some experiments have been carried out on the 4000 financial news items collected from the FirstInvest site. At the preprocessing step, we remove 39 stop-words and convert words to their stems using the Porter stemming algorithm [6]. Finally, 4,428 unique terms are obtained. Note that only the content parts of the news are used in DocCat; the title parts are not involved.

We run different experiments by varying both the number of feature terms (1000, 2000 and 3000) and the value of k (10, 20, 30, 40, 50, 60, 70) for k-nn. The RCut threshold strategy [4] with value 1 is adopted, that is, the category with the highest category score among the candidate categories is assigned to the document. The RCut strategy suits the FirstInvest situation, since FirstInvest classifies a news item into only one category (see Section 2).

To evaluate categorization systems, we use three standard measures: recall (r), precision (p) and $F_1(r, p)$. For a category, recall is the proportion of correctly assigned documents among all documents belonging to the category, and precision is the proportion of correctly assigned documents among all documents assigned to the category. The $F_1(r, p)$ measure combines recall and precision [3] as follows:

$$ F_1(r, p) = \frac{2pr}{p + r} $$

We also check the average performance of a binary classifier over multiple categories, namely the macro-average and the micro-average [3]. The macro-average gives an equal weight to the performance on every category, regardless of how rare or how common the category is. The micro-average, however, gives an equal weight to the performance on every document (category instance), and thus favors the performance on common categories. For details, see [3].

[Table 2. System performance over 10 categories with 1000 features and k=10. Columns: Category, Recall, Precision, F1, Rate. Rows: Holdings, Pharmacie/chimie/gaz, Hotellerie/loisir/transport, Marchés financiers, Telecoms, Aéronautique/Défense, Web Agency, Banques, Biotechnologie, Distribution spécialisée]

Due to limited space, we only present the experimental results over 10 categories in Table 2, where the last column gives the distribution of each category in the corpus. The results in Table 2 are obtained by setting k=10 and selecting 1000 features with the CBA algorithm. With 1000 features, the system achieves its best performance at k=10.
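The experimental protocol of this section can be sketched in Python: categorize each test document with k-nn and RCut at rank 1, then compute recall, precision and F1 with micro- and macro-averaging. The function names, the input layout and the toy labels are assumptions for illustration. The category score is taken to be the sum of neighbor similarities, one of the options mentioned in Section 4.1, and the macro-averaged F1 is taken to be the mean of the per-category F1 values, which is one common convention.

```python
from collections import defaultdict


def knn_assign(neighbors, k):
    """RCut at rank 1: return the best-scoring candidate category.

    neighbors: (similarity, category) pairs, one per training document
    """
    top = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    scores = defaultdict(float)
    for similarity, category in top:
        scores[category] += similarity  # category score: summed similarities
    return max(scores, key=scores.get)


def f1(p, r):
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def evaluate(true_labels, assigned_labels):
    """Return micro- and macro-averaged (recall, precision, F1) triples."""
    categories = sorted(set(true_labels) | set(assigned_labels))
    correct = defaultdict(int)    # correctly assigned documents per category
    assigned = defaultdict(int)   # documents assigned to the category
    belonging = defaultdict(int)  # documents that belong to the category
    for truth, guess in zip(true_labels, assigned_labels):
        belonging[truth] += 1
        assigned[guess] += 1
        if truth == guess:
            correct[truth] += 1

    per_category = []
    for c in categories:
        r = correct[c] / belonging[c] if belonging[c] else 0.0
        p = correct[c] / assigned[c] if assigned[c] else 0.0
        per_category.append((r, p, f1(p, r)))
    # Macro-average: every category counts equally.
    macro = tuple(sum(m[i] for m in per_category) / len(categories) for i in range(3))

    # Micro-average: every document counts equally.
    total_correct = sum(correct.values())
    micro_r = total_correct / sum(belonging.values())
    micro_p = total_correct / sum(assigned.values())
    micro = (micro_r, micro_p, f1(micro_p, micro_r))
    return micro, macro


# Toy data, for illustration only.
neighbors = [(0.9, "Banques"), (0.8, "Telecoms"), (0.7, "Telecoms"), (0.1, "Holdings")]
print(knn_assign(neighbors, k=3))

truth = ["Banques", "Banques", "Telecoms", "Holdings"]
guess = ["Banques", "Telecoms", "Telecoms", "Holdings"]
micro, macro = evaluate(truth, guess)
print(micro, macro)
```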
The micro-averaged recall, precision and F1 are 0.72, 0.67 and 0.70, while the macro-averaged recall and precision are 0.636 and about 0.7. The experimental results indicate that the system achieves good performance over common categories. For the less frequent categories, the performance is not satisfactory. The weak performance over small categories arises from the uneven category distribution.

6. Conclusion

This paper briefly presents the application of the document categorization system DocCat. The goal of DocCat is to automatically categorize French financial news at the financial portal FirstInvest. The general framework and common techniques of categorization are also discussed. In addition, by creating XML image documents for production financial news, we propose two approaches to searching text: keyword-based search and content-based search. Furthermore, the flexibility of the system allows users to easily adapt it to their application domain and requirements.

References

[1]
[2] Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML, 1998.
[3] Yang, Y. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1), pp. 69-90, 1999.
[4] Yang, Y. A study on thresholding strategies for text categorization. Proceedings of ACM SIGIR'01, 2001.
[5] Salton, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, Massachusetts, 1989.
[6] Porter, M. F. An algorithm for suffix stripping. Program, Vol. 14, no. 3, 1980.
[7] Yang, Y. and Pedersen, J. O. A Comparative Study on Feature Selection in Text Categorization. In the 14th ICML, 1997.
[8] Kou, H., Gardarin, G. and Zeitouni, K. Two New Approaches to Feature Selection for Document Categorization. Technical report #2002/9, PRiSM Laboratory, University of Versailles, 2002.
