Web-supported Matching and Classification of Business Opportunities


Jing Bai 1, François Paradis 1,2, Jian-Yun Nie 1
{baijing, paradifr, ...}
1. DIRO, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal, Québec, H3C 3J7, Canada
2. Nstein Technologies, 75, Queen Street, Suite 4400, Montréal, Québec, H3C 2N6, Canada

Abstract

More and more business opportunities are published on the Web; however, it is difficult to collect and process them automatically. This paper describes a tool and techniques to help users discover relevant business opportunities, in particular calls for tenders. The tool includes spidering, information extraction, classification, and a search interface. Our focus in this paper is on classification, which aims to organize calls for tenders into classes so as to facilitate the user's browsing. We describe a new approach to the classification of business opportunities on the Web using a language modeling (LM) approach. This utilization is strongly inspired by the recent success of LM in IR experiments. However, few attempts have been made so far to use LM for text classification. Our goal is to investigate whether LM can bring improvement to text classification. Our experiments are conducted on two corpora: Reuters, containing newswire articles, and FedBizOpps (FBO), containing calls for tenders (CFTs) published on the Web. The experimental results show that LM-based classification can significantly improve classification performance on both test corpora, compared with the traditional Naïve Bayes (NB) classifier. In particular, it seems to have a stronger impact on FBO than on Reuters. This result shows that LM can greatly improve classification on the Web.

1. Introduction

Finding and selecting business opportunities is a crucial activity for businesses, yet they often lack the resources or expertise to commit to this problem. To ease this task, many electronic tendering sites are now available.
They usually follow either a centralizing approach, where information is received directly from the contracting authorities (for example, in the case of TED 1), or an aggregation approach, where documents are collected from other sites (for example, SourceCan 2). Although the centralizing approach allows control over the contents and richness of the information, it is difficult to apply to domains where there is no recognized authority, and it is often limited to one geographic area. Furthermore, additional information which might exist on the Web is ignored. On the other hand, with the aggregation approach it is difficult to extract and categorize relevant information, since documents do not follow a common form or model, and their contents can vary widely. Business-related documents, in particular Calls for Tenders (CFTs), are typically classified according to an industry standard, for example NAICS (North American Industry Classification System) or CPV (Common Procurement Vocabulary, for the European Union). Some CFTs are manually classified with these codes, whereas others are not. A classification algorithm is a natural addition to organize and search CFTs in a browsable directory. It can also provide multi-code classification for conversion between standards or between different versions of standards. However, automated classification is difficult on CFTs, especially when they are taken from the Web, where their contents can vary a lot and there can be a large number of unseen terms. In this paper, we propose to improve the classification of CFTs using a language modeling approach. A language model (LM) refers to a set of probability estimates built on a training corpus. It also uses smoothing to deal with the zero-probability problem of words unseen in the corpus. In a classification context, an LM is used to estimate the probability of a word within a class. We propose to use these estimates within the Naïve Bayes (NB) method. The paper is organized as follows. In Section 2, we briefly describe the MBOI project.
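Organizing CFTs into a browsable directory by classification code, as described above, amounts to grouping documents by a prefix of their hierarchical code (the paper's experiments later truncate NAICS codes to their first three digits). A minimal sketch, with illustrative codes and titles only:

```python
from collections import defaultdict

# Toy CFT records with six-digit NAICS codes (codes and titles are illustrative).
cfts = [
    {"title": "Office Supplies", "naics": "418210"},
    {"title": "Toner Cartridges", "naics": "418210"},
    {"title": "Highway Resurfacing", "naics": "237310"},
]

def browsable_directory(cfts, digits=3):
    """Group CFTs by a NAICS prefix (e.g. the first three digits)."""
    directory = defaultdict(list)
    for cft in cfts:
        directory[cft["naics"][:digits]].append(cft["title"])
    return dict(directory)

tree = browsable_directory(cfts)
```

Because NAICS is hierarchical (one digit per level), truncating to a shorter prefix simply moves up the hierarchy, which is what makes this grouping meaningful.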
In Section 3, we describe our approach to text classification using language models. Section 4 presents the experimental design and results on the Reuters-21578 and FBO data sets, respectively. Finally, Section 5 gives some conclusions.

2. The MBOI Project

The MBOI project (Matching Business Opportunities on the Internet) deals with the discovery of business opportunities on the Internet. In the first phase of the project we have implemented a tool to aid a user in this process. It includes spidering, information extraction, classification, and a search interface. The information relevant to business opportunities comes from various types of documents: press releases, solicitation notices, awards, quarterly reports, etc. We are not so much interested in modeling these documents, however, but rather in extracting and organizing information that will help find CFTs: not only information within the CFT, but also information related to contracting authorities, prior clients, etc. This information is crucial for business decisions. For this reason, we will refer to the documents as evidence, from which the information can be inferred. Figure 1 shows the information inference process. At the core of the model is the CFT synthesis, which combines evidence from various sites. For example, if two sites contain a French and an English version of the same CFT, the synthesis will include the relevant attributes (title and description) in both languages. Other characteristics such as submission and execution dates, classification codes, submission procedure, etc. will also be inferred from the call for tenders notices. Amendments can replace or add to some or all of the elements of the synthesis. Since information can be extracted from several documents, there must be a strategy for the combination of evidence. Even for official documents such as calls for tenders, there can be more than one version, published on the same site or on several sites. Pairing these documents can be difficult if editors create their own solicitation numbers, sometimes without explicit reference to the contracting authority. We thus define a confidence measure on the inferred information. This confidence measures the validity of the inference rules.
It can also reflect the confidence in the source of the information: for example, a contracting authority publishing its own documents can be deemed more trustworthy than an aggregator site. Figure 2 shows a simplified example of a presolicitation notice and its amendment, regarding a contract for the office supplies of the Saskatchewan government. Both documents were fetched from the Merx site. From these documents, the system infers a synthesis with extracted information such as publication and closing dates, title (both French and English), contact, etc. It also classifies the CFT: in this case, to NAICS code 418210 ("Stationery and Office Supplies Wholesaler-Distributors"). The synthesis is stored in an XML format inspired by xCBL (Common Business Language) and UBL (Universal Business Language) [5].

Presolicitation (on Merx):
  Reference Number: CFAB4
  Source ID: PV.MN.SA.2342
  Published: 2003/10/08
  Closing: 2003/10/28 02:00 PM
  Organisation Name: Saskatchewan Government
  Title (English): Office Supplies
  Title (French): Fournitures de Bureau
  Description: The Government of Saskatchewan invites tenders to provide office supplies to its offices in Regina. The supplier is expected to start delivery on December 5, 2003, and enter an agreement of at least 2 years.
  Contact: Bernie Juneau, (306) ...

Amendment (on Merx):
  Reference Number: CFAB4
  Description: The start delivery date has been revised to January 5.

Figure 2: A call for tenders

Figure 1: Information inference

Other information can add to the existing knowledge about contracting authorities and their contacts. This could later be used for business intelligence. Figure 3 shows the MBOI system architecture. There are two main processes: indexing, i.e., creating an index with the information inferred from the Web documents, and querying/browsing, which is the search interface for the user.
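A CFT synthesis like the one in Figure 2 can be sketched as a flat record serialized to XML. This is only an illustration of the idea: the element names below are invented for the example and do not reproduce the actual xCBL/UBL-inspired schema used by MBOI.

```python
import xml.etree.ElementTree as ET

def synthesis_to_xml(cft: dict) -> str:
    """Serialize a CFT synthesis (a flat dict of inferred fields) to XML."""
    root = ET.Element("cft", attrib={"reference": cft["reference"]})
    for field, value in cft.items():
        if field == "reference":
            continue
        if isinstance(value, dict):  # multilingual field, e.g. title in en/fr
            for lang, text in value.items():
                child = ET.SubElement(root, field, attrib={"lang": lang})
                child.text = text
        else:
            ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")

synthesis = {
    "reference": "CFAB4",
    "organisation": "Saskatchewan Government",
    "title": {"en": "Office Supplies", "fr": "Fournitures de Bureau"},
    "closing": "2003/10/28",
}
xml_doc = synthesis_to_xml(synthesis)
```

Keeping both language versions under the same element, distinguished by a `lang` attribute, mirrors the bilingual title/description attributes the synthesis combines from multiple sites.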

The first step of indexing is to collect documents from Web sites. We use a robot that can connect with a username and password (for sites with restricted access), look for URL patterns, fill out forms, and follow links of a given form. The next step is the inference of information, which includes information extraction and classification. Finally, an index is created and organized by fields of information (i.e., corresponding to elements in the CFT synthesis).

Figure 3. System architecture

The front-end to the system allows the user to search for CFTs by topic, date, class code, etc., or with an all-fields free-text query. It also includes functionalities for browsing the class hierarchy, saving the results in topic folders, etc. Figure 4 shows an example of results for a query about economic recovery. This is a saved query, i.e., one that has been defined by the user and is executed on a routine basis. This function is useful for a user who checks for a particular type of business opportunity on a daily basis. The indexing and retrieval processes used in MBOI follow the classical IR vector space model, with some enhancements to deal with the structure of CFTs (e.g., section, title, etc.). We will not describe these processes in detail. Instead, we will concentrate on the classification of CFTs, for which we use a new method based on the statistical language modeling approach.

3. Using Language Models for Text Classification

Language models have been successfully applied in many application areas such as speech recognition and statistical NLP. Recently, a number of studies have confirmed that language modeling is also an effective and attractive approach for information retrieval (IR) [6, 11]. It not only provides an elegant theoretical framework for IR, but also yields effectiveness comparable to the best state-of-the-art systems. This success has triggered great interest in the IR community, and LM has since been applied to other IR-related tasks, such as topic detection and tracking [7].
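An index organized by fields of information, as described above, can be sketched as an inverted index keyed by (field, term) pairs. This is a minimal illustration of the data structure, not the actual MBOI implementation:

```python
from collections import defaultdict

class FieldedIndex:
    """Inverted index keyed by (field, term) -> set of document ids."""
    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        """fields: dict mapping a field name (e.g. 'title') to its text."""
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        """Fielded query: documents containing `term` in `field`."""
        return self.postings.get((field, term.lower()), set())

    def search_all_fields(self, term, fields):
        """All-fields free-text query: union over the given fields."""
        hits = set()
        for field in fields:
            hits |= self.search(field, term)
        return hits

idx = FieldedIndex()
idx.add("cft1", {"title": "Office Supplies",
                 "description": "supplies for Regina offices"})
idx.add("cft2", {"title": "Road Construction",
                 "description": "highway resurfacing"})
```

The per-field postings support both kinds of query offered by the front-end: a restriction to one field (topic, date, class code) and an all-fields free-text search.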
However, until now, few attempts have been made to use language models for text classification, although there is a strong relationship between IR and classification. Text classification aims to assign text documents to one or more predefined classes based on their contents. Many machine learning techniques have been applied to automatic text classification, such as Naïve Bayes (NB), K-Nearest Neighbors and Support Vector Machines (SVM). Indeed, classification shares several common processing steps with IR. It is thus possible that LM can also bring significant improvement to classification. Our goal in applying language models to classification is to investigate whether they can improve classification performance. In particular, we first integrate NB with language models, because we can observe a strong similarity between them.

3.1 Naïve Bayes Classifier

Figure 4. Querying in MBOI

Let us first describe the principle of the Naïve Bayes classifier. Given a document d and a set of predefined classes {c_i}, a Naïve Bayes classifier first computes the posterior probability that the document belongs to each particular class c_i, i.e., P(c_i|d), and then assigns the document to the class(es) with the highest probability value(s). The posterior probability is computed by applying the Bayes rule:

  P(c_i|d) = P(d|c_i) P(c_i) / P(d)    (1)

The denominator P(d) in formula (1) is independent of the classes; therefore, it can be ignored for the purpose of class ranking. Thus:

  P(c_i|d) ∝ P(d|c_i) P(c_i)    (2)

In Naïve Bayes, it is further assumed that words are independent given a class, i.e., for a document d = d_1, ..., d_m:

  P(d|c_i) = ∏_{j=1..m} P(d_j|c_i)

Formula (2) can then be expressed simply as follows:

  P(c_i|d) ∝ P(c_i) ∏_{j=1..m} P(d_j|c_i)    (3)

In formula (3), P(c_i) can be estimated as the percentage of the training examples belonging to class c_i:

  P(c_i) = N_i / N

where N_i is the number of training documents in class c_i, and N is the total number of training documents. P(d_j|c_i) is usually determined by:

  P(d_j|c_i) = (1 + count(d_j, c_i)) / (|V| + |c_i|)

where count(d_j, c_i) is the number of times that term d_j occurs within the training documents of class c_i, |V| is the total number of terms in the vocabulary, and |c_i| is the number of term occurrences in class c_i. This estimation uses Laplace (or add-one) smoothing to solve the zero-probability problem.

3.2 Language Modeling Approach in IR

Language modeling has been applied successfully in information retrieval [6, 11, 12] and in several related applications such as topic detection and tracking [7]. Given a document d and a query q, the basic principle of this approach is to compute the conditional probability P(d|q) as follows:

  P(d|q) = P(q|d) P(d) / P(q) ∝ P(q|d) P(d)

If we assume P(d) to be a constant, then the ranking of a document d for a query q is determined by P(q|d). The calculation of this value is performed as follows: we first construct a statistical language model P(·|d) for the document d, called the document model. Then P(q|d) is estimated as the probability that the query can be generated from the document model. This probability is often calculated by assuming that words are independent (in a unigram model), in a similar way to Naïve Bayes. This means that for a query q = q_1, ..., q_n, we have:

  P(q|d) = ∏_{j=1..n} P(q_j|d)

Previous studies have shown that smoothing is a very important process in building a language model [11]. The effectiveness of a language modeling approach strongly depends on the way the document language model is smoothed.
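The Naïve Bayes scoring described above — formula (3) with Laplace-smoothed term estimates, computed in log space to avoid underflow — can be sketched as follows (toy documents and class labels are illustrative):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, class). Returns priors, per-class term counts, vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)  # class -> term -> count
    vocab = set()
    for tokens, c in docs:
        class_docs[c] += 1
        term_counts[c].update(tokens)
        vocab.update(tokens)
    n = sum(class_docs.values())
    priors = {c: cnt / n for c, cnt in class_docs.items()}  # P(c) = N_i / N
    return priors, term_counts, vocab

def classify(tokens, priors, term_counts, vocab):
    best, best_score = None, -math.inf
    v = len(vocab)
    for c, prior in priors.items():
        size_c = sum(term_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            # Laplace (add-one): P(t|c) = (1 + count(t, c)) / (|V| + |c|)
            score += math.log((1 + term_counts[c][t]) / (v + size_c))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [(["office", "supplies", "paper"], "418"),
        (["road", "construction", "paving"], "237"),
        (["office", "paper", "toner"], "418")]
priors, counts, vocab = train_nb(docs)
label = classify(["paper", "supplies"], priors, counts, vocab)
```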
The primary goal of smoothing is to assign a non-zero probability to unseen words and to improve the maximum likelihood estimation. However, in IR applications, smoothing also allows us to take into account the global distribution of terms in the whole collection, i.e., the IDF factor used in IR [11]. Several smoothing methods, such as Dirichlet, absolute discount, etc., have been applied in language models. In Zhai and Lafferty [11], it was found that retrieval effectiveness is generally sensitive to the smoothing parameters. In our experiments on classification, we observed similar effects.

3.3 Using the Language Modeling Approach for Text Classification

If we compare Naïve Bayes with the general language modeling approach in IR, we can observe a remarkable similarity: the general probabilistic framework is the same, and both use smoothing to solve the zero-probability problem. The difference between them lies in the objects for which a language model is constructed and to which it is applied. In IR, one builds an LM for a document and applies it to a query, whereas in the NB classifier, one builds an LM for a class and applies it to a document. However, we also observe that implementations of NB are usually limited to Laplace smoothing. Few attempts have been made to use more sophisticated smoothing methods. As the experiments in IR showed, the effectiveness of language modeling strongly depends on the smoothing method, and several smoothing methods have proven to be effective. A natural question, then, is whether it is also beneficial in classification to replace Laplace smoothing with other, more sophisticated smoothing methods. In this paper, we focus on this problem. As we will see in our experiments, such a replacement can bring improvements to the Naïve Bayes classifier. Another question we will examine is whether an LM classification approach has a similar impact on different types of documents.

3.3.1 Principle. The basic principle of our approach to text classification using language models is straightforward.
As in Naïve Bayes, the score of a class c_i for a given document d is estimated by formula (3). However, the estimation of P(d_j|c_i) is different: it is estimated from the language modeling perspective. First, we construct a language model for each class, with several smoothing methods. Then P(d_j|c_i) is the probability that the term d_j can be generated from this model. As smoothing turns out to be crucial in IR experiments, it is also necessary to carefully select the smoothing methods. In the next section, we describe those that have been used in several IR experiments.

3.3.2 Smoothing Methods for Estimation. A number of smoothing methods have been developed in statistical natural language processing to estimate the probability of a word or an n-gram. As we mentioned earlier, the primary goal is to attribute a non-zero probability to the words or n-grams that are not seen in a set of training documents. Two basic ideas have been used in smoothing: 1) using a lower-order model to supplement a higher-order model; 2) modifying the frequency of word occurrences. In IR, both ideas have been used. For the first, it is common in IR to use the whole collection of documents to construct a background model. This model is considered a lower-order model with respect to the document model, although both may be unigram models. This solution has been useful for relatively short documents. Although a class usually contains more than one document, and is thus longer than a single document, the same problem of imprecise estimation exists, especially for small classes. Therefore, one can apply the same smoothing approach to classification. The second solution is often used in combination with the first one (i.e., one simultaneously uses the collection model and changes the word counts), as we can see in the smoothing methods described below. Two general formulations are used in smoothing: backoff and interpolation. Both can be expressed in the following general form [12]:

  P(w|c_i) = P_s(w|c_i)           if w is seen in c_i
  P(w|c_i) = α_{c_i} P_u(w|C)     if w is unseen in c_i

That is, for a class c_i, one estimate is made for the words seen in the class, and another estimate is made for the unseen words. In the second case, the estimate for unseen words is based on the entire collection, i.e., the collection model.
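A class language model interpolated with the collection model, scoring a document by the probability of generating its terms, can be sketched as follows. Jelinek-Mercer interpolation is used here as one standard instance of the general scheme above; the token lists are illustrative:

```python
import math
from collections import Counter

class ClassLM:
    """Unigram class model interpolated with the collection model (Jelinek-Mercer)."""
    def __init__(self, class_tokens, collection_tokens, lam=0.3):
        self.lam = lam
        self.class_counts = Counter(class_tokens)
        self.class_size = len(class_tokens)
        self.coll_counts = Counter(collection_tokens)
        self.coll_size = len(collection_tokens)

    def prob(self, w):
        p_ml = self.class_counts[w] / self.class_size       # P_ml(w|c)
        p_coll = self.coll_counts[w] / self.coll_size       # P(w|C)
        # P_JM(w|c) = (1 - lambda) * P_ml(w|c) + lambda * P(w|C)
        return (1 - self.lam) * p_ml + self.lam * p_coll

    def log_score(self, doc_tokens):
        """log P(d|c) under the unigram independence assumption."""
        return sum(math.log(self.prob(w)) for w in doc_tokens)

collection = ["office", "supplies", "road", "construction", "paper", "paving"]
lm_office = ClassLM(["office", "supplies", "paper"], collection)
lm_road = ClassLM(["road", "construction", "paving"], collection)
```

Ranking classes by `log_score` plus the log prior reproduces formula (3), with the Laplace estimate replaced by the smoothed class model.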
The effect of incorporating the collection model is not only to solve the zero-probability problem, but also to produce the same effect as the IDF factor commonly used in IR (as shown in [11]). In our experiments, we tested the following specific smoothing methods. All of them use the collection model.

Jelinek-Mercer (JM) smoothing:

  P_JM(w|c_i) = (1 − λ) P_ml(w|c_i) + λ P(w|C)

which linearly combines the maximum likelihood estimate P_ml(w|c_i) of the class model with an estimate from the collection model.

Dirichlet smoothing:

  P_Dir(w|c_i) = (c(w, c_i) + µ P(w|C)) / (|c_i| + µ)

where c(w, c_i) is the count of word w in c_i, |c_i| is the size of c_i (i.e., the total word count of c_i) and µ is a pseudo-count.

Absolute discount smoothing:

  P_AD(w|c_i) = max(c(w, c_i) − δ, 0) / |c_i| + (δ c_u / |c_i|) P(w|C)

in which the count of each word is reduced by a constant δ ∈ [0, 1], and the discounted probability mass is redistributed over the unseen words proportionally to their probability in the collection model. In the above equation, c_u is the number of unique words in c_i.

Two-Stage (TS) smoothing [12]:

  P_TS(w|c_i) = (1 − λ) (c(w, c_i) + µ P(w|C)) / (|c_i| + µ) + λ P(w|C)

This smoothing method combines Dirichlet smoothing with an interpolation smoothing. In previous IR experiments, Dirichlet and Two-stage smoothing provided very good effectiveness. In our experiments, we test whether these smoothing methods, when applied to text classification, bring a similar impact.

4. EXPERIMENTAL EVALUATION ON CLASSIFICATION

4.1 Corpora

In order to compare with previous results, our experiments were conducted on the benchmark Reuters-21578 corpus, containing Reuters newswire articles. We chose the ModApte split of the Reuters-21578 data set, which is commonly used in text classification research today [9]. There are 135 topic classes, but we used only the 90 for which there exists at least one document in both the training and test sets. We thus obtained 7769 training documents and 3019 test documents. The number of training documents per class varies from 2877 to 1.
The largest 10 classes contain 75% of the documents, and 33% of the classes have fewer than 10 training documents. For our experiments on finding business opportunities on the Web, we created a collection of CFT documents by downloading the daily synopses from the FedBizOpps (FBO) website, covering the period from September 2000 to October. This resulted in 2945 documents, which were split 70% for training and 30% for testing in our experiments. Notice that all the CFTs published on this site are manually classified using NAICS codes. NAICS codes are organized hierarchically, where every digit of a six-digit code corresponds to a level of the hierarchy. In order to reduce the class space, we only consider the first three digits in our current study. Although the class hierarchy is an aspect that makes the classification of CFTs different from the general classification problem with flat classes, we postpone this problem to a later study. That is, our current study considers the set of classes at the same level. After removing the classes that do not include at least one document in both the training and test sets, we obtained 86 classes, 532 training documents and 6627 test documents. The largest 10 classes contain 72% of the documents, and 30% of the classes have fewer than 20 training documents. We can see that the FBO collection has quite a similar distribution to the Reuters collection.

4.2 Performance Measure

For the purpose of comparison with previous work, we evaluate the performance of classification in terms of standard recall, precision and the F1 measure. For evaluating average performance across classes, we used macro-averaging and micro-averaging. Macro-averaging scores are the averages of the scores of each class calculated separately. Micro-averaging scores are calculated by mixing together the documents across all the classes. Macro-averaging gives an equal weight to every class, regardless of how rare or how common a class is. On the other hand, micro-averaging gives an equal weight to every document, thus putting more emphasis on larger classes. In [9], it is claimed that micro-averaging better reflects real classification performance than macro-averaging. Therefore, our observations will be made mainly on micro-averaging F1.

4.3 Naïve Bayes Classifier

To provide comparable classification results on the Reuters corpus, we used the multinomial mixture model of the Naïve Bayes classifier in the Rainbow package, developed by McCallum [3]. In the NB classifier, feature selection is important. The effect of feature selection is to remove meaningless features (words) so that classification can be determined according to meaningful features. Several feature selection methods are commonly used: information gain (IG), chi-square, mutual information, etc. Information gain has been shown to produce good results in [9]. The information gain of a word w is calculated as follows:

  IG(w) = − Σ_k P(c_k) log P(c_k) + P(w) Σ_k P(c_k|w) log P(c_k|w) + P(w̄) Σ_k P(c_k|w̄) log P(c_k|w̄)

where w̄ means the absence of the word w. One can choose a fixed number of features according to their IG, or set a threshold on IG to make the selection. Table 1 shows the classification results of NB on Reuters without feature selection and with a selection of 2000 features according to IG; the number 2000 is suggested in [9]. Table 2 shows the classification results of NB on FBO without feature selection and with a selection of 2000 features according to IG; the number 2000 also produced the best performance on the FBO collection.

Table 1. Performance of NB on the Reuters-21578 collection, with all features and with 2000 features selected by IG (mR: micro-averaging recall; mP: micro-averaging precision; mF1: micro-averaging F1; maF1: macro-averaging F1)

Table 2. Performance of NB on the FBO collection, with all features and with 2000 features selected by IG

4.4 Language Modeling Approach

In the experiments using language models, we used the Lemur toolkit, designed and developed by Carnegie Mellon University and the University of Massachusetts [2]. The system allows us to train a language model for each class using a set of training documents, and to calculate the likelihood of a document according to each class model, i.e., P(d|c_i). The final score of a class can then be computed according to formula (2).

4.4.1 Different Smoothing Methods. In our experiments, we used the four smoothing methods described earlier, varying their parameters. Table 3 shows the results of each method on Reuters; no feature selection is made. The percentages in the table are the relative changes with respect to NB with no feature selection (Table 1).

  Smoothing                      mF1      maF1
  Jelinek-Mercer (λ=0.3)         +1.3%    +53.5%
  Dirichlet (µ=9500)             +0.9%    +6.9%
  Absolute discount (δ=0.83)     +1.8%    +63.3%
  Two-stage (λ=0.86, µ=6000)     +3.9%    +29.3%

Table 3. Performance of LM on Reuters

As we can see, on the Reuters-21578 corpus, the first three smoothing methods only lead to marginal improvements in micro-averaging F1 over NB. On the other hand, Two-stage smoothing produces a larger improvement over NB. The performances of the different LMs on the FBO collection are shown in Table 4.
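The information-gain computation used for feature selection in Section 4.3 can be sketched as follows (the labeled toy documents are illustrative):

```python
import math

def information_gain(docs, word):
    """IG(w) over labeled docs; docs is a list of (set_of_words, class)."""
    n = len(docs)
    classes = {c for _, c in docs}

    def sum_p_log_p(subset):
        # sum_k P(c_k | subset) * log P(c_k | subset)
        total = 0.0
        for c in classes:
            p = sum(1 for _, cc in subset if cc == c) / len(subset)
            if p > 0:
                total += p * math.log(p)
        return total

    with_w = [d for d in docs if word in d[0]]
    without_w = [d for d in docs if word not in d[0]]
    # IG(w) = -sum P(c) log P(c) + P(w) sum P(c|w) log P(c|w)
    #                            + P(w_bar) sum P(c|w_bar) log P(c|w_bar)
    ig = -sum_p_log_p(docs)
    for subset in (with_w, without_w):
        if subset:  # an empty subset contributes nothing (its prior is 0)
            ig += (len(subset) / n) * sum_p_log_p(subset)
    return ig

docs = [({"office", "supplies"}, "418"), ({"office", "paper"}, "418"),
        ({"road", "paving"}, "237"), ({"road", "bridge"}, "237")]
```

A word that perfectly separates the classes ("office" above) attains the maximum IG, the class entropy; a less discriminative word ("supplies") scores lower, so ranking words by IG and keeping the top ones implements the selection.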

  Smoothing                      mF1      maF1
  Jelinek-Mercer (λ=0.05)        +8.9%    +90.8%
  Dirichlet (µ=500)              +2.3%    +72.1%
  Absolute discount (δ=0.05)     +11.7%   +95.9%
  Two-stage (λ=0.05, µ=0)        +8.9%    +90.8%

Table 4. Performance of LM on FBO

If we compare the first three smoothing methods (with their best performances shown in Tables 3 and 4), we can see that Absolute discount smoothing produced better performances than the other two smoothing methods on both corpora. Dirichlet smoothing produced the smallest improvements. Two-stage smoothing produced the largest improvement on Reuters. However, the behavior on the FBO collection is not the same. In the case of Two-stage smoothing on FBO, the best performance is obtained when µ is set to 0, i.e., we are in fact using Jelinek-Mercer smoothing. The differences between the smoothing methods on the two collections show that FBO has different characteristics than newswire articles, and the two may require different classification methods. Globally, our experiments show that using language models can improve classification effectiveness over Naïve Bayes on both corpora. This is true especially for macro-averaging F1, which is much higher than with NB. The improvements in micro-averaging F1 are more evident on the FBO collection than on Reuters-21578. In order to test the statistical significance of the changes in performance, we use the macro t-test [9], which compares paired F1 values obtained for each class. It turns out that all the improvements obtained on both corpora with the four smoothing methods are statistically significant, with p-values below 0.05 (3). The comparison of the improvements in macro- and micro-averaging F1 suggests that language models bring larger improvements to small classes than to large classes. A possible reason is that our smoothing methods also incorporate the collection probabilities, instead of only changing the frequencies of words as in NB (Laplace smoothing). With Laplace smoothing, all the unseen words, whether meaningful or not, are attributed an equal probability.
However, the smoothing methods that use the collection model attribute different probabilities to unseen words according to their global distribution in the collection. These probabilities therefore better reflect the characteristics of the collection and of the language. In our experiments, the addition of the collection model seems to greatly benefit small classes, which have less training data and for which heavy smoothing is required. Another advantage of using the collection to smooth the class model is that meaningless features, which do not allow us to distinguish different classes, are neutralized by the collection model, in such a way that their differences across classes are weakened. This is equivalent to feature selection in the other classification methods. As we will see in Section 4.4.2, it turns out that feature selection is not necessary with LM. This confirms that smoothing has effects similar to feature selection. The absolute level of performance on FBO is lower than on Reuters. This suggests that the classification of CFTs, or more globally the classification of business opportunities on the Web, is a more difficult problem than that of newswire articles. The main difference between them is that a CFT usually contains a very short description of the goods or services (one or a few sentences), which is the object of the call. This insufficient description makes it difficult to obtain a thorough characterization of the goods or services. On the other hand, the remaining parts, which take up an important portion of the CFT, describe elements that are unessential for classification, such as the conditions of submission, the deadline, etc. These are not directly related to classification by domain (although they may be useful for other purposes). With the classical term weighting methods based on term frequency (or inverse document frequency), it is difficult to filter out the unimportant parts of a CFT.

(3) A p-value lower than 0.05 is considered to be statistically significant at the 0.95 significance level.
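The contrast drawn above — Laplace smoothing gives every unseen word the same probability, while collection-model smoothing differentiates unseen words by their global frequency — can be illustrated numerically (the counts below are toy values, for illustration only):

```python
vocab = ["contract", "tender", "the", "zirconium"]
class_counts = {"contract": 5, "tender": 3}   # words seen in the class
coll_counts = {"contract": 50, "tender": 30, "the": 900, "zirconium": 1}
class_size = sum(class_counts.values())
coll_size = sum(coll_counts.values())

def p_laplace(w):
    # Add-one smoothing: every unseen word receives the same probability.
    return (1 + class_counts.get(w, 0)) / (len(vocab) + class_size)

def p_jm(w, lam=0.5):
    # Jelinek-Mercer: unseen words inherit the collection distribution.
    p_ml = class_counts.get(w, 0) / class_size
    return (1 - lam) * p_ml + lam * coll_counts[w] / coll_size

# Laplace: "the" and "zirconium" (both unseen in the class) get equal mass.
# JM: the common word "the" gets far more mass than the rare "zirconium".
```

Under the collection model, a frequent function word such as "the" receives a similar probability in every class model, so it no longer discriminates between classes, which is the feature-selection-like effect described in the text.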
These particularities make the global classification performance on CFTs lower than on newswire articles.

4.4.2 Feature Selection with Language Models. Feature selection has been very useful for the NB classifier. Does it produce a similar effect with language models? In order to answer this question, we conducted a series of experiments using different numbers of features selected according to information gain. Figure 5 shows the results of applying feature selection to the four smoothing methods of Table 3 (micro-averaging F1 as a function of the number of features, for Jelinek-Mercer, Dirichlet, absolute discount and Two-stage smoothing).

Figure 5. The effects of feature selection on Reuters

These results do not show significant performance improvements when we use feature selection, except for Dirichlet smoothing. On the contrary, for absolute discount and Jelinek-Mercer smoothing, the effect of feature selection is rather negative: we obtain lower performance if we select a subset of features. This conclusion seems contradictory to the results with NB, and counter-intuitive at first glance. However, one can explain it by the fact that, as the class model has been massively smoothed by the collection model, the non-discriminative features do not make a significant difference between documents with respect to a class. Therefore, the inclusion of such features in the calculation of the score does not hurt as much as in NB, which does not incorporate the collection model. This suggests that the use of the collection model in smoothing renders feature selection less necessary. Another important advantage of using LMs is therefore that they can avoid the need for explicit feature selection.

5. Conclusion

We have described a tool to help the discovery of business opportunities on the Internet, and proposed a new approach for the classification of such documents. The MBOI tool has been in use for a year and a half by our commercial partners, and deployed in several applications: as an aid for business opportunity watch in the St-Hyacinthe (Quebec) region, as a CFT search facility for Canada's metal industry portal (NetMetal 4), and as an "issue" or "thematic" watch for the Quebec travel industry. All have reported a significant improvement to their activities from using our system. For classification, we used LM to enhance NB. In particular, the Laplace smoothing commonly used in NB is replaced by other smoothing methods that integrate the collection model. Our experiments on the Reuters-21578 and FBO collections have shown significant improvements over NB, especially in macro-averaging F1. In micro-averaging F1, we also observed noticeable improvements, in particular on the FBO collection.
Ths prelmnary study dd show that language models can contrbute n mprovng text classfcaton by NB. Our comparson on two document collectons show that language modelng approaches can be useful for the classfcaton of both newswre artcles and busness opportuntes on the Web, despte the dfferences between these documents. To further mprove the classfcaton performance of busness opportuntes, t wll be necessary to study specfc methods adapted to ths type of data. In partcular, we wll have to deal wth the problem wth very short useful descrpton n Calls for Tenders. We have notced qute a bt of nose n the FBO documents n terms of rrelevant content, for example, pertanng to procedural nstructons rather than the topc of the CFT. Ths s typcal of Web documents, and therefore we thnk that t s qute encouragng that the mprovement usng LM was greater on FBO (a Web corpus) than on Reuters (a controlled test collecton). Our prelmnary study s lmted to the utlzaton of ungram models. We wll nvestgate the ntegraton of bgram language models for text classfcaton n our future work. Other future works nclude: extendng herarchcal classfcaton, ncorporatng LMs nto other classfcaton algorthms, and usng other types of features n classfcaton (e.g., concepts, named enttes as extracted usng Nsten s tools). ACKNOWLEDGMENT Ths work has been carred out wthn a jont research project wth Nsten technologes. We would lke to thank Nsten and NSERC for ther support. REFERENCES [] S. T. Dumas, J. Platt, D. Heckerman and M. Saham (998). Inductve learnng algorthms and representatons for text categorzaton. In Proceedngs of ACM-CIKM98, Nov. 998, pp [2] The lemur toolkt for language modelng and nformaton retreval. [3] A. McCallum (996). Bow: A toolkt for statstcal language modelng, text retreval, classfcaton and clusterng. [4] A. McCallum and K. Ngam (998). A comparson of event models for Naïve Bayes text classfcaton. In Proceedngs of AAAI-98 Workshop, AAAI Press. [5] E. Dumbll. 
High hopes for the universal business language. XML.com, O'Reilly, November.
[6] J. Ponte and W. B. Croft (1998). A language modeling approach to information retrieval. In Proceedings of SIGIR 1998.
[7] M. Spitters and W. Kraaij (2001). TNO at TDT2001: language model-based topic detection. In Proceedings of the Topic Detection and Tracking (TDT) Workshop 2001.
[8] Y. Yang (1999). An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, Vol. 1, No. 1/2.
[9] Y. Yang and X. Liu (1999). A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[10] Y. Yang (2001). A study on thresholding strategies for text categorization. In Proceedings of SIGIR 2001.
[11] C. Zhai and J. Lafferty (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR 2001.
[12] C. Zhai and J. Lafferty (2002). Two-stage language models for information retrieval. In Proceedings of SIGIR 2002.

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

Multiple Frame Motion Inference Using Belief Propagation

Multiple Frame Motion Inference Using Belief Propagation Multple Frame Moton Inference Usng Belef Propagaton Jang Gao Janbo Sh The Robotcs Insttute Department of Computer and Informaton Scence Carnege Mellon Unversty Unversty of Pennsylvana Pttsburgh, PA 53

More information

Rules for Using Multi-Attribute Utility Theory for Estimating a User s Interests

Rules for Using Multi-Attribute Utility Theory for Estimating a User s Interests Rules for Usng Mult-Attrbute Utlty Theory for Estmatng a User s Interests Ralph Schäfer 1 DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken Ralph.Schaefer@dfk.de Abstract. In ths paper, we show that Mult-Attrbute

More information

Object-Based Techniques for Image Retrieval

Object-Based Techniques for Image Retrieval 54 Zhang, Gao, & Luo Chapter VII Object-Based Technques for Image Retreval Y. J. Zhang, Tsnghua Unversty, Chna Y. Y. Gao, Tsnghua Unversty, Chna Y. Luo, Tsnghua Unversty, Chna ABSTRACT To overcome the

More information

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures A Novel Adaptve Descrptor Algorthm for Ternary Pattern Textures Fahuan Hu 1,2, Guopng Lu 1 *, Zengwen Dong 1 1.School of Mechancal & Electrcal Engneerng, Nanchang Unversty, Nanchang, 330031, Chna; 2. School

More information

Using Fuzzy Logic to Enhance the Large Size Remote Sensing Images

Using Fuzzy Logic to Enhance the Large Size Remote Sensing Images Internatonal Journal of Informaton and Electroncs Engneerng Vol. 5 No. 6 November 015 Usng Fuzzy Logc to Enhance the Large Sze Remote Sensng Images Trung Nguyen Tu Huy Ngo Hoang and Thoa Vu Van Abstract

More information

Data Mining: Model Evaluation

Data Mining: Model Evaluation Data Mnng: Model Evaluaton Aprl 16, 2013 1 Issues: Evaluatng Classfcaton Methods Accurac classfer accurac: predctng class label predctor accurac: guessng value of predcted attrbutes Speed tme to construct

More information