Web-supported Matching and Classification of Business Opportunities


Jing Bai 1, François Paradis 1,2, Jian-Yun Nie 1
{baijing, paradifr, ...}
1. DIRO, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal, Québec, H3C 3J7, Canada
2. Nstein Technologies, 75, Queen Street, Suite 4400, Montréal, Québec, H3C 2N6, Canada

Abstract

More and more business opportunities are published on the Web; however, it is difficult to collect and process them automatically. This paper describes a tool and techniques to help users discover relevant business opportunities, in particular calls for tenders. The tool includes spidering, information extraction, classification, and a search interface. Our focus in this paper is on classification, which aims to organize calls for tenders into classes so as to facilitate the user's browsing. We describe a new approach to the classification of business opportunities on the Web using a language modeling (LM) approach. This utilization is strongly inspired by the recent success of LM in IR experiments. However, few attempts have been made so far to use LM for text classification. Our goal is to investigate whether LM can bring improvement to text classification. Our experiments are conducted on two corpora: Reuters, containing newswire articles, and FedBizOpps (FBO), containing calls for tenders (CFTs) published on the Web. The experimental results show that LM-based classification can significantly improve classification performance on both test corpora, compared with the traditional Naïve Bayes (NB) classifier. In particular, it seems to have a stronger impact on FBO than on Reuters. This result shows that LM can greatly improve classification on the Web.

1. Introduction

Finding and selecting business opportunities is a crucial activity for businesses, yet they often lack the resources or expertise to commit to this problem. To ease this task, many electronic tendering sites are now available.
They usually follow either a centralizing approach, where information is received directly from the contracting authorities (for example, in the case of TED 1), or an aggregation approach, where documents are collected from other sites (for example, SourceCan 2). Although the centralizing approach allows control over the contents and richness of the information, it is difficult to apply to domains where there is no recognized authority, and it is often limited to one geographic area. Furthermore, additional information which might exist on the Web is ignored. On the other hand, with the aggregation approach it is difficult to extract and categorize relevant information, since documents do not follow a common form or model, and their contents can vary widely. Business-related documents, in particular Calls for Tenders (CFTs), are typically classified according to an industry standard, for example NAICS (North American Industry Classification System) or CPV (Common Procurement Vocabulary, for the European Union). Some CFTs are manually classified with these codes, whereas others are not. A classification algorithm is a natural addition to organize and search CFTs in a browsable directory. It can also provide multi-code classification for conversion between standards or between different versions of standards. However, automated classification is difficult on CFTs, especially when they are taken from the Web, where their contents can vary a lot and there can be a large number of unseen terms. In this paper, we propose to improve the classification of CFTs using a language modeling approach. A language model (LM) refers to a set of probability estimates built on a training corpus. It also uses smoothing to deal with the zero-probability problem of words unseen in the corpus. In a classification context, an LM is used to estimate the probability of a word within a class. We propose to use these estimates within the Naïve Bayes (NB) method. The paper is organized as follows. In Section 2, we briefly describe the MBOI project.
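Organizing CFTs into a browsable directory by classification code, as described above, amounts to grouping documents by a prefix of their hierarchical code (the paper's experiments later truncate NAICS codes to their first three digits). A minimal sketch, with illustrative codes and titles only:

```python
from collections import defaultdict

# Toy CFT records with six-digit NAICS codes (codes and titles are illustrative).
cfts = [
    {"title": "Office Supplies", "naics": "418210"},
    {"title": "Toner Cartridges", "naics": "418210"},
    {"title": "Highway Resurfacing", "naics": "237310"},
]

def browsable_directory(cfts, digits=3):
    """Group CFTs by a NAICS prefix (e.g. the first three digits)."""
    directory = defaultdict(list)
    for cft in cfts:
        directory[cft["naics"][:digits]].append(cft["title"])
    return dict(directory)

tree = browsable_directory(cfts)
```

Because NAICS is hierarchical (one digit per level), truncating to a shorter prefix simply moves up the hierarchy, which is what makes this grouping meaningful.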
In Section 3, we describe our approach to text classification using language models. Section 4 presents the experimental design and results on the Reuters-21578 and FBO data sets, respectively. Finally, Section 5 gives some conclusions.

2. The MBOI Project

The MBOI project (Matching Business Opportunities on the Internet) deals with the discovery of business opportunities on the Internet. In the first phase of the project we have implemented a tool to aid a user in this process. It includes spidering, information extraction, classification, and a search interface. The information relevant to business opportunities comes from various types of documents: press releases, solicitation notices, awards, quarterly reports, etc. We are not so much interested in modeling these documents, however, but rather in extracting and organizing information that will help find CFTs: not only information within the CFT, but also information related to contracting authorities, prior clients, etc. This information is crucial for business decisions. For this reason, we will refer to the documents as evidence, from which the information can be inferred. Figure 1 shows the information inference process. At the core of the model is the CFT synthesis, which combines evidence from various sites. For example, if two sites contain a French and an English version of the same CFT, the synthesis will include the relevant attributes (title and description) in both languages. Other characteristics such as submission and execution dates, classification codes, submission procedure, etc. will also be inferred from the call for tenders notices. Amendments can replace or add to some or all of the elements of the synthesis. Since information can be extracted from several documents, there must be a strategy for the combination of evidence. Even for official documents such as calls for tenders, there can be more than one version, published on the same site or on several sites. Pairing these documents can be difficult if editors create their own solicitation numbers, sometimes without explicit reference to the contracting authority. We thus define a confidence measure on the inferred information. This confidence measures the validity of the inference rules.
It can also reflect the confidence in the source of the information: for example, a contracting authority publishing its own documents can be deemed more trustworthy than an aggregator site. Figure 2 shows a simplified example of a presolicitation notice and its amendment, regarding a contract for the office supplies of the Saskatchewan government. Both documents were fetched from the Merx site. From these documents, the system infers a synthesis with extracted information such as publication and closing dates, title (both French and English), contact, etc. It also classifies the CFT: in this case, to NAICS code 418210 ("Stationery and Office Supplies Wholesaler-Distributors"). The synthesis is stored in an XML format inspired by xCBL (Common Business Language) and UBL (Universal Business Language) [5].

Presolicitation (on Merx):
  Reference Number: CFAB4
  Source ID: PV.MN.SA.2342
  Published: 2003/10/08
  Closing: 2003/10/28 02:00 PM
  Organisation Name: Saskatchewan Government
  Title (English): Office Supplies
  Title (French): Fournitures de Bureau
  Description: The Government of Saskatchewan invites tenders to provide office supplies to its offices in Regina. The supplier is expected to start delivery on December 5, 2003, and enter an agreement of at least 2 years.
  Contact: Bernie Juneau, (306) ...

Amendment (on Merx):
  Reference Number: CFAB4
  Description: The start delivery date has been revised to January 5.

Figure 2: A call for tenders

Figure 1: Information inference

Other information can add to the existing knowledge about contracting authorities and their contacts. This could later be used for business intelligence. Figure 3 shows the MBOI system architecture. There are two main processes: indexing, i.e., creating an index with the information inferred from the Web documents, and querying/browsing, which is the search interface for the user.
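A CFT synthesis like the one in Figure 2 can be sketched as a flat record serialized to XML. This is only an illustration of the idea: the element names below are invented for the example and do not reproduce the actual xCBL/UBL-inspired schema used by MBOI.

```python
import xml.etree.ElementTree as ET

def synthesis_to_xml(cft: dict) -> str:
    """Serialize a CFT synthesis (a flat dict of inferred fields) to XML."""
    root = ET.Element("cft", attrib={"reference": cft["reference"]})
    for field, value in cft.items():
        if field == "reference":
            continue
        if isinstance(value, dict):  # multilingual field, e.g. title in en/fr
            for lang, text in value.items():
                child = ET.SubElement(root, field, attrib={"lang": lang})
                child.text = text
        else:
            ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")

synthesis = {
    "reference": "CFAB4",
    "organisation": "Saskatchewan Government",
    "title": {"en": "Office Supplies", "fr": "Fournitures de Bureau"},
    "closing": "2003/10/28",
}
xml_doc = synthesis_to_xml(synthesis)
```

Keeping both language versions under the same element, distinguished by a `lang` attribute, mirrors the bilingual title/description attributes the synthesis combines from multiple sites.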

The first step of indexing is to collect documents from Web sites. We use a robot that can connect with a username and password (for sites with restricted access), look for URL patterns, fill out forms, and follow links of a given form. The next step is the inference of information, which includes information extraction and classification. Finally, an index is created and organized by fields of information (i.e., corresponding to elements in the CFT synthesis).

Figure 3. System architecture

The front-end to the system allows the user to search for CFTs by topic, date, class code, etc., or with an all-fields free-text query. It also includes functionalities for browsing the class hierarchy, saving the results in topic folders, etc. Figure 4 shows an example of results for a query about economic recovery. This is a saved query, i.e., one that has been defined by the user and is executed on a routine basis. This function is useful for a user who checks for a particular type of business opportunity on a daily basis. The indexing and retrieval processes used in MBOI follow the classical IR vector space model, with some enhancements to deal with the structure of CFTs (e.g., section, title, etc.). We will not describe these processes in detail. Instead, we will concentrate on the classification of CFTs, for which we use a new method based on the statistical language modeling approach.

3. Using Language Models for Text Classification

Language models have been successfully applied in many application areas such as speech recognition and statistical NLP. Recently, a number of studies have confirmed that language modeling is also an effective and attractive approach for information retrieval (IR) [6, 11]. It not only provides an elegant theoretical framework for IR, but also yields effectiveness comparable to the best state-of-the-art systems. This success has triggered great interest in the IR community, and LM has since been applied to other IR-related tasks, such as topic detection and tracking [7].
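An index organized by fields of information, as described above, can be sketched as an inverted index keyed by (field, term) pairs. This is a minimal illustration of the data structure, not the actual MBOI implementation:

```python
from collections import defaultdict

class FieldedIndex:
    """Inverted index keyed by (field, term) -> set of document ids."""
    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        """fields: dict mapping a field name (e.g. 'title') to its text."""
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        """Fielded query: documents containing `term` in `field`."""
        return self.postings.get((field, term.lower()), set())

    def search_all_fields(self, term, fields):
        """All-fields free-text query: union over the given fields."""
        hits = set()
        for field in fields:
            hits |= self.search(field, term)
        return hits

idx = FieldedIndex()
idx.add("cft1", {"title": "Office Supplies",
                 "description": "supplies for Regina offices"})
idx.add("cft2", {"title": "Road Construction",
                 "description": "highway resurfacing"})
```

The per-field postings support both kinds of query offered by the front-end: a restriction to one field (topic, date, class code) and an all-fields free-text search.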
However, until now, few attempts have been made to use language models for text classification, although there is a strong relationship between IR and classification. Text classification aims to assign text documents to one or more predefined classes based on their contents. Many machine learning techniques have been applied to automatic text classification, such as Naïve Bayes (NB), K-Nearest Neighbors and Support Vector Machines (SVM). Indeed, classification shares several common processing steps with IR. It is thus possible that LM can also bring significant improvement to classification. Our goal in applying language models to classification is to investigate whether they can improve classification performance. In particular, we first integrate NB with language models, because we can observe a strong similarity between them.

3.1 Naïve Bayes Classifier

Figure 4. Querying in MBOI

Let us first describe the principle of the Naïve Bayes classifier. Given a document d and a set of predefined classes {c_i}, a Naïve Bayes classifier first computes the posterior probability that the document belongs to each particular class c_i, i.e., P(c_i|d), and then assigns the document to the class(es) with the highest probability value(s). The posterior probability is computed by applying the Bayes rule:

  P(c_i|d) = P(d|c_i) P(c_i) / P(d)    (1)

The denominator P(d) in formula (1) is independent of the classes; therefore, it can be ignored for the purpose of class ranking. Thus:

  P(c_i|d) ∝ P(d|c_i) P(c_i)    (2)

In Naïve Bayes, it is further assumed that words are independent given a class, i.e., for a document d = d_1, ..., d_m:

  P(d|c_i) = ∏_{j=1..m} P(d_j|c_i)

Formula (2) can then be expressed simply as follows:

  P(c_i|d) ∝ P(c_i) ∏_{j=1..m} P(d_j|c_i)    (3)

In formula (3), P(c_i) can be estimated as the percentage of the training examples belonging to class c_i:

  P(c_i) = N_i / N

where N_i is the number of training documents in class c_i, and N is the total number of training documents. P(d_j|c_i) is usually determined by:

  P(d_j|c_i) = (1 + count(d_j, c_i)) / (|V| + |c_i|)

where count(d_j, c_i) is the number of times that term d_j occurs within the training documents of class c_i, |V| is the total number of terms in the vocabulary, and |c_i| is the number of term occurrences in class c_i. This estimation uses Laplace (or add-one) smoothing to solve the zero-probability problem.

3.2 Language Modeling Approach in IR

Language modeling has been applied successfully in information retrieval [6, 11, 12] and in several related applications such as topic detection and tracking [7]. Given a document d and a query q, the basic principle of this approach is to compute the conditional probability P(d|q) as follows:

  P(d|q) = P(q|d) P(d) / P(q) ∝ P(q|d) P(d)

If we assume P(d) to be a constant, then the ranking of a document d for a query q is determined by P(q|d). The calculation of this value is performed as follows: we first construct a statistical language model P(·|d) for the document d, called the document model. Then P(q|d) is estimated as the probability that the query can be generated from the document model. This probability is often calculated by assuming that words are independent (in a unigram model), in a similar way to Naïve Bayes. This means that for a query q = q_1, ..., q_n, we have:

  P(q|d) = ∏_{j=1..n} P(q_j|d)

Previous studies have shown that smoothing is a very important process in building a language model [11]. The effectiveness of a language modeling approach strongly depends on the way the document language model is smoothed.
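The Naïve Bayes scoring described above — formula (3) with Laplace-smoothed term estimates, computed in log space to avoid underflow — can be sketched as follows (toy documents and class labels are illustrative):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, class). Returns priors, per-class term counts, vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)  # class -> term -> count
    vocab = set()
    for tokens, c in docs:
        class_docs[c] += 1
        term_counts[c].update(tokens)
        vocab.update(tokens)
    n = sum(class_docs.values())
    priors = {c: cnt / n for c, cnt in class_docs.items()}  # P(c) = N_i / N
    return priors, term_counts, vocab

def classify(tokens, priors, term_counts, vocab):
    best, best_score = None, -math.inf
    v = len(vocab)
    for c, prior in priors.items():
        size_c = sum(term_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            # Laplace (add-one): P(t|c) = (1 + count(t, c)) / (|V| + |c|)
            score += math.log((1 + term_counts[c][t]) / (v + size_c))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [(["office", "supplies", "paper"], "418"),
        (["road", "construction", "paving"], "237"),
        (["office", "paper", "toner"], "418")]
priors, counts, vocab = train_nb(docs)
label = classify(["paper", "supplies"], priors, counts, vocab)
```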
The primary goal of smoothing is to assign a non-zero probability to unseen words and to improve the maximum likelihood estimation. However, in IR applications, smoothing also allows us to take into account the global distribution of terms in the whole collection, i.e., the IDF factor used in IR [11]. Several smoothing methods, such as Dirichlet, absolute discount, etc., have been applied in language models. In Zhai and Lafferty [11], it was found that retrieval effectiveness is generally sensitive to the smoothing parameters. In our experiments on classification, we observed similar effects.

3.3 Using the Language Modeling Approach for Text Classification

If we compare Naïve Bayes with the general language modeling approach in IR, we can observe a remarkable similarity: the general probabilistic framework is the same, and both use smoothing to solve the zero-probability problem. The difference between them lies in the objects for which a language model is constructed and to which it is applied. In IR, one builds an LM for a document and applies it to a query, whereas in the NB classifier, one builds an LM for a class and applies it to a document. However, we also observe that implementations of NB are usually limited to Laplace smoothing. Few attempts have been made to use more sophisticated smoothing methods. As the experiments in IR showed, the effectiveness of language modeling strongly depends on the smoothing method, and several smoothing methods have proven to be effective. A natural question, then, is whether it is also beneficial in classification to replace Laplace smoothing with other, more sophisticated smoothing methods. In this paper, we focus on this problem. As we will see in our experiments, such a replacement can bring improvements to the Naïve Bayes classifier. Another question we will examine is whether an LM classification approach has a similar impact on different types of documents.

3.3.1 Principle. The basic principle of our approach to text classification using language models is straightforward.
As in Naïve Bayes, the score of a class c_i for a given document d is estimated by formula (3). However, the estimation of P(d_j|c_i) is different: it is estimated from the language modeling perspective. First, we construct a language model for each class, with several smoothing methods. Then P(d_j|c_i) is the probability that the term d_j can be generated from this model. As smoothing turns out to be crucial in IR experiments, it is also necessary to carefully select the smoothing methods. In the next section, we describe those that have been used in several IR experiments.

3.3.2 Smoothing Methods for Estimation. A number of smoothing methods have been developed in statistical natural language processing to estimate the probability of a word or an n-gram. As we mentioned earlier, the primary goal is to attribute a non-zero probability to the words or n-grams that are not seen in a set of training documents. Two basic ideas have been used in smoothing: 1) using a lower-order model to supplement a higher-order model; 2) modifying the frequency of word occurrences. In IR, both ideas have been used. For the first, it is common in IR to use the whole collection of documents to construct a background model. This model is considered a lower-order model with respect to the document model, although both may be unigram models. This solution has been useful for relatively short documents. Although a class usually contains more than one document, and is thus longer than a single document, the same problem of imprecise estimation exists, especially for small classes. Therefore, one can apply the same smoothing approach to classification. The second solution is often used in combination with the first one (i.e., one simultaneously uses the collection model and changes the word counts), as we can see in the smoothing methods described below. Two general formulations are used in smoothing: backoff and interpolation. Both can be expressed in the following general form [12]:

  P(w|c_i) = P_s(w|c_i)           if w is seen in c_i
  P(w|c_i) = α_{c_i} P_u(w|C)     if w is unseen in c_i

That is, for a class c_i, one estimate is made for the words seen in the class, and another estimate is made for the unseen words. In the second case, the estimate for unseen words is based on the entire collection, i.e., the collection model.
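A class language model interpolated with the collection model, scoring a document by the probability of generating its terms, can be sketched as follows. Jelinek-Mercer interpolation is used here as one standard instance of the general scheme above; the token lists are illustrative:

```python
import math
from collections import Counter

class ClassLM:
    """Unigram class model interpolated with the collection model (Jelinek-Mercer)."""
    def __init__(self, class_tokens, collection_tokens, lam=0.3):
        self.lam = lam
        self.class_counts = Counter(class_tokens)
        self.class_size = len(class_tokens)
        self.coll_counts = Counter(collection_tokens)
        self.coll_size = len(collection_tokens)

    def prob(self, w):
        p_ml = self.class_counts[w] / self.class_size       # P_ml(w|c)
        p_coll = self.coll_counts[w] / self.coll_size       # P(w|C)
        # P_JM(w|c) = (1 - lambda) * P_ml(w|c) + lambda * P(w|C)
        return (1 - self.lam) * p_ml + self.lam * p_coll

    def log_score(self, doc_tokens):
        """log P(d|c) under the unigram independence assumption."""
        return sum(math.log(self.prob(w)) for w in doc_tokens)

collection = ["office", "supplies", "road", "construction", "paper", "paving"]
lm_office = ClassLM(["office", "supplies", "paper"], collection)
lm_road = ClassLM(["road", "construction", "paving"], collection)
```

Ranking classes by `log_score` plus the log prior reproduces formula (3), with the Laplace estimate replaced by the smoothed class model.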
The effect of incorporating the collection model is not only to solve the zero-probability problem, but also to produce the same effect as the IDF factor commonly used in IR (as shown in [11]). In our experiments, we tested the following specific smoothing methods. All of them use the collection model.

Jelinek-Mercer (JM) smoothing:

  P_JM(w|c_i) = (1 − λ) P_ml(w|c_i) + λ P(w|C)

which linearly combines the maximum likelihood estimate P_ml(w|c_i) of the class model with an estimate from the collection model.

Dirichlet smoothing:

  P_Dir(w|c_i) = (c(w, c_i) + µ P(w|C)) / (|c_i| + µ)

where c(w, c_i) is the count of word w in c_i, |c_i| is the size of c_i (i.e., the total word count of c_i) and µ is a pseudo-count.

Absolute discount smoothing:

  P_AD(w|c_i) = max(c(w, c_i) − δ, 0) / |c_i| + (δ c_u / |c_i|) P(w|C)

in which the count of each word is reduced by a constant δ ∈ [0, 1], and the discounted probability mass is redistributed over the unseen words proportionally to their probability in the collection model. In the above equation, c_u is the number of unique words in c_i.

Two-Stage (TS) smoothing [12]:

  P_TS(w|c_i) = (1 − λ) (c(w, c_i) + µ P(w|C)) / (|c_i| + µ) + λ P(w|C)

This smoothing method combines Dirichlet smoothing with an interpolation smoothing. In previous IR experiments, Dirichlet and Two-stage smoothing provided very good effectiveness. In our experiments, we test whether these smoothing methods, when applied to text classification, bring a similar impact.

4. EXPERIMENTAL EVALUATION ON CLASSIFICATION

4.1 Corpora

In order to compare with previous results, our experiments were conducted on the benchmark Reuters-21578 corpus, containing Reuters newswire articles. We chose the ModApte split of the Reuters-21578 data set, which is commonly used in text classification research today [9]. There are 135 topic classes, but we used only the 90 for which there exists at least one document in both the training and test sets. We thus obtained 7769 training documents and 3019 test documents. The number of training documents per class varies from 2877 to 1.
The largest 10 classes contain 75% of the documents, and 33% of the classes have fewer than 10 training documents. For our experiments on finding business opportunities on the Web, we created a collection of CFT documents by downloading the daily synopses from the FedBizOpps (FBO) website, covering the period from September 2000 to October. This resulted in 2945 documents, which were split 70% for training and 30% for testing in our experiments. Notice that all the CFTs published on this site are manually classified using NAICS codes. NAICS codes are organized hierarchically, where every digit of a six-digit code corresponds to a level of the hierarchy. In order to reduce the class space, we only consider the first three digits in our current study. Although the class hierarchy is an aspect that makes the classification of CFTs different from the general classification problem with flat classes, we postpone this problem to a later study. That is, our current study considers the set of classes at the same level. After removing the classes that do not include at least one document in both the training and test sets, we obtained 86 classes, 532 training documents and 6627 test documents. The largest 10 classes contain 72% of the documents, and 30% of the classes have fewer than 20 training documents. We can see that the FBO collection has quite a similar distribution to the Reuters collection.

4.2 Performance Measure

For the purpose of comparison with previous work, we evaluate the performance of classification in terms of standard recall, precision and the F1 measure. For evaluating average performance across classes, we used macro-averaging and micro-averaging. Macro-averaging scores are the averages of the scores of each class calculated separately. Micro-averaging scores are calculated by mixing together the documents across all the classes. Macro-averaging gives an equal weight to every class, regardless of how rare or how common a class is. On the other hand, micro-averaging gives an equal weight to every document, thus putting more emphasis on larger classes. In [9], it is claimed that micro-averaging better reflects real classification performance than macro-averaging. Therefore, our observations will be made mainly on micro-averaging F1.

4.3 Naïve Bayes Classifier

To provide comparable classification results on the Reuters corpus, we used the multinomial mixture model of the Naïve Bayes classifier in the Rainbow package, developed by McCallum [3]. In the NB classifier, feature selection is important. The effect of feature selection is to remove meaningless features (words) so that classification can be determined according to meaningful features. Several feature selection methods are commonly used: information gain (IG), chi-square, mutual information, etc. Information gain has been shown to produce good results in [9]. The information gain of a word w is calculated as follows:

  IG(w) = − Σ_k P(c_k) log P(c_k) + P(w) Σ_k P(c_k|w) log P(c_k|w) + P(w̄) Σ_k P(c_k|w̄) log P(c_k|w̄)

where w̄ means the absence of the word w. One can choose a fixed number of features according to their IG, or set a threshold on IG to make the selection. Table 1 shows the classification results of NB on Reuters without feature selection and with a selection of 2000 features according to IG; the number 2000 is suggested in [9]. Table 2 shows the classification results of NB on FBO without feature selection and with a selection of 2000 features according to IG; the number 2000 also produced the best performance on the FBO collection.

Table 1. Performance of NB on the Reuters-21578 collection, with all features and with 2000 features selected by IG (mR: micro-averaging recall; mP: micro-averaging precision; mF1: micro-averaging F1; maF1: macro-averaging F1)

Table 2. Performance of NB on the FBO collection, with all features and with 2000 features selected by IG

4.4 Language Modeling Approach

In the experiments using language models, we used the Lemur toolkit, designed and developed by Carnegie Mellon University and the University of Massachusetts [2]. The system allows us to train a language model for each class using a set of training documents, and to calculate the likelihood of a document according to each class model, i.e., P(d|c_i). The final score of a class can then be computed according to formula (2).

4.4.1 Different Smoothing Methods. In our experiments, we used the four smoothing methods described earlier, varying their parameters. Table 3 shows the results of each method on Reuters; no feature selection is made. The percentages in the table are the relative changes with respect to NB with no feature selection (Table 1).

  Smoothing                      mF1      maF1
  Jelinek-Mercer (λ=0.3)         +1.3%    +53.5%
  Dirichlet (µ=9500)             +0.9%    +6.9%
  Absolute discount (δ=0.83)     +1.8%    +63.3%
  Two-stage (λ=0.86, µ=6000)     +3.9%    +29.3%

Table 3. Performance of LM on Reuters

As we can see, on the Reuters-21578 corpus, the first three smoothing methods only lead to marginal improvements in micro-averaging F1 over NB. On the other hand, Two-stage smoothing produces a larger improvement over NB. The performances of the different LMs on the FBO collection are shown in Table 4.
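The information-gain computation used for feature selection in Section 4.3 can be sketched as follows (the labeled toy documents are illustrative):

```python
import math

def information_gain(docs, word):
    """IG(w) over labeled docs; docs is a list of (set_of_words, class)."""
    n = len(docs)
    classes = {c for _, c in docs}

    def sum_p_log_p(subset):
        # sum_k P(c_k | subset) * log P(c_k | subset)
        total = 0.0
        for c in classes:
            p = sum(1 for _, cc in subset if cc == c) / len(subset)
            if p > 0:
                total += p * math.log(p)
        return total

    with_w = [d for d in docs if word in d[0]]
    without_w = [d for d in docs if word not in d[0]]
    # IG(w) = -sum P(c) log P(c) + P(w) sum P(c|w) log P(c|w)
    #                            + P(w_bar) sum P(c|w_bar) log P(c|w_bar)
    ig = -sum_p_log_p(docs)
    for subset in (with_w, without_w):
        if subset:  # an empty subset contributes nothing (its prior is 0)
            ig += (len(subset) / n) * sum_p_log_p(subset)
    return ig

docs = [({"office", "supplies"}, "418"), ({"office", "paper"}, "418"),
        ({"road", "paving"}, "237"), ({"road", "bridge"}, "237")]
```

A word that perfectly separates the classes ("office" above) attains the maximum IG, the class entropy; a less discriminative word ("supplies") scores lower, so ranking words by IG and keeping the top ones implements the selection.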

  Smoothing                      mF1      maF1
  Jelinek-Mercer (λ=0.05)        +8.9%    +90.8%
  Dirichlet (µ=500)              +2.3%    +72.1%
  Absolute discount (δ=0.05)     +11.7%   +95.9%
  Two-stage (λ=0.05, µ=0)        +8.9%    +90.8%

Table 4. Performance of LM on FBO

If we compare the first three smoothing methods (with their best performances shown in Tables 3 and 4), we can see that Absolute discount smoothing produced better performances than the other two smoothing methods on both corpora. Dirichlet smoothing produced the smallest improvements. Two-stage smoothing produced the largest improvement on Reuters. However, the behavior on the FBO collection is not the same. In the case of Two-stage smoothing on FBO, the best performance is obtained when µ is set to 0, i.e., we are in fact using Jelinek-Mercer smoothing. The differences between the smoothing methods on the two collections show that FBO has different characteristics than newswire articles, and the two may require different classification methods. Globally, our experiments show that using language models can improve classification effectiveness over Naïve Bayes on both corpora. This is true especially for macro-averaging F1, which is much higher than with NB. The improvements in micro-averaging F1 are more evident on the FBO collection than on Reuters-21578. In order to test the statistical significance of the changes in performance, we use the macro t-test [9], which compares paired F1 values obtained for each class. It turns out that all the improvements obtained on both corpora with the four smoothing methods are statistically significant, with p-values below 0.05 (3). The comparison of the improvements in macro- and micro-averaging F1 suggests that language models bring larger improvements to small classes than to large classes. A possible reason is that our smoothing methods also incorporate the collection probabilities, instead of only changing the frequencies of words as in NB (Laplace smoothing). With Laplace smoothing, all the unseen words, whether meaningful or not, are attributed an equal probability.
However, the smoothing methods that use the collection model attribute different probabilities to unseen words according to their global distribution in the collection. These probabilities therefore better reflect the characteristics of the collection and of the language. In our experiments, the addition of the collection model seems to greatly benefit small classes, which have less training data and for which heavy smoothing is required. Another advantage of using the collection to smooth the class model is that meaningless features, which do not allow us to distinguish different classes, are neutralized by the collection model, in such a way that their differences across classes are weakened. This is equivalent to feature selection in the other classification methods. As we will see in Section 4.4.2, it turns out that feature selection is not necessary with LM. This confirms that smoothing has effects similar to feature selection. The absolute level of performance on FBO is lower than on Reuters. This suggests that the classification of CFTs, or more globally the classification of business opportunities on the Web, is a more difficult problem than that of newswire articles. The main difference between them is that a CFT usually contains a very short description of the goods or services (one or a few sentences), which is the object of the call. This insufficient description makes it difficult to obtain a thorough characterization of the goods or services. On the other hand, the remaining parts, which take up an important portion of the CFT, describe elements that are unessential for classification, such as the conditions of submission, the deadline, etc. These are not directly related to classification by domain (although they may be useful for other purposes). With the classical term weighting methods based on term frequency (or inverse document frequency), it is difficult to filter out the unimportant parts of a CFT.

(3) A p-value lower than 0.05 is considered to be statistically significant at the 0.95 significance level.
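The contrast drawn above — Laplace smoothing gives every unseen word the same probability, while collection-model smoothing differentiates unseen words by their global frequency — can be illustrated numerically (the counts below are toy values, for illustration only):

```python
vocab = ["contract", "tender", "the", "zirconium"]
class_counts = {"contract": 5, "tender": 3}   # words seen in the class
coll_counts = {"contract": 50, "tender": 30, "the": 900, "zirconium": 1}
class_size = sum(class_counts.values())
coll_size = sum(coll_counts.values())

def p_laplace(w):
    # Add-one smoothing: every unseen word receives the same probability.
    return (1 + class_counts.get(w, 0)) / (len(vocab) + class_size)

def p_jm(w, lam=0.5):
    # Jelinek-Mercer: unseen words inherit the collection distribution.
    p_ml = class_counts.get(w, 0) / class_size
    return (1 - lam) * p_ml + lam * coll_counts[w] / coll_size

# Laplace: "the" and "zirconium" (both unseen in the class) get equal mass.
# JM: the common word "the" gets far more mass than the rare "zirconium".
```

Under the collection model, a frequent function word such as "the" receives a similar probability in every class model, so it no longer discriminates between classes, which is the feature-selection-like effect described in the text.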
These particularities make the global classification performance on CFTs lower than on newswire articles.

4.4.2 Feature Selection with Language Models. Feature selection has been very useful for the NB classifier. Does it produce a similar effect with language models? In order to answer this question, we conducted a series of experiments using different numbers of features selected according to information gain. Figure 5 shows the results of applying feature selection to the four smoothing methods of Table 3 (micro-averaging F1 as a function of the number of features, for Jelinek-Mercer, Dirichlet, absolute discount and Two-stage smoothing).

Figure 5. The effects of feature selection on Reuters

These results do not show significant performance improvements when we use feature selection, except for Dirichlet smoothing. On the contrary, for absolute discount and Jelinek-Mercer smoothing, the effect of feature selection is rather negative: we obtain lower performance if we select a subset of features. This conclusion seems contradictory to the results with NB, and counter-intuitive at first glance. However, one can explain it by the fact that, as the class model has been massively smoothed by the collection model, the non-discriminative features do not make a significant difference between documents with respect to a class. Therefore, the inclusion of such features in the calculation of the score does not hurt as much as in NB, which does not incorporate the collection model. This suggests that the use of the collection model in smoothing renders feature selection less necessary. Another important advantage of using LMs is therefore that they can avoid the need for explicit feature selection.

5. Conclusion

We have described a tool to help the discovery of business opportunities on the Internet, and proposed a new approach for the classification of such documents. The MBOI tool has been in use for a year and a half by our commercial partners, and deployed in several applications: as an aid for business opportunity watch in the St-Hyacinthe (Quebec) region, as a CFT search facility for Canada's metal industry portal (NetMetal 4), and as an "issue" or "thematic" watch for the Quebec travel industry. All have reported a significant improvement to their activities from using our system. For classification, we used LM to enhance NB. In particular, the Laplace smoothing commonly used in NB is replaced by other smoothing methods that integrate the collection model. Our experiments on the Reuters-21578 and FBO collections have shown significant improvements over NB, especially in macro-averaging F1. In micro-averaging F1, we also observed noticeable improvements, in particular on the FBO collection.
Ths prelmnary study dd show that language models can contrbute n mprovng text classfcaton by NB. Our comparson on two document collectons show that language modelng approaches can be useful for the classfcaton of both newswre artcles and busness opportuntes on the Web, despte the dfferences between these documents. To further mprove the classfcaton performance of busness opportuntes, t wll be necessary to study specfc methods adapted to ths type of data. In partcular, we wll have to deal wth the problem wth very short useful descrpton n Calls for Tenders. We have notced qute a bt of nose n the FBO documents n terms of rrelevant content, for example, pertanng to procedural nstructons rather than the topc of the CFT. Ths s typcal of Web documents, and therefore we thnk that t s qute encouragng that the mprovement usng LM was greater on FBO (a Web corpus) than on Reuters (a controlled test collecton). Our prelmnary study s lmted to the utlzaton of ungram models. We wll nvestgate the ntegraton of bgram language models for text classfcaton n our future work. Other future works nclude: extendng herarchcal classfcaton, ncorporatng LMs nto other classfcaton algorthms, and usng other types of features n classfcaton (e.g., concepts, named enttes as extracted usng Nsten s tools). ACKNOWLEDGMENT Ths work has been carred out wthn a jont research project wth Nsten technologes. We would lke to thank Nsten and NSERC for ther support. REFERENCES [] S. T. Dumas, J. Platt, D. Heckerman and M. Saham (998). Inductve learnng algorthms and representatons for text categorzaton. In Proceedngs of ACM-CIKM98, Nov. 998, pp [2] The lemur toolkt for language modelng and nformaton retreval. [3] A. McCallum (996). Bow: A toolkt for statstcal language modelng, text retreval, classfcaton and clusterng. [4] A. McCallum and K. Ngam (998). A comparson of event models for Naïve Bayes text classfcaton. In Proceedngs of AAAI-98 Workshop, AAAI Press. [5] E. Dumbll. 
High hopes for the universal business language. XML.com, O'Reilly, November.
[6] J. Ponte and W. B. Croft (1998). A language modeling approach to information retrieval. In Proceedings of SIGIR 1998.
[7] M. Spitters and W. Kraaij (2001). TNO at TDT2001: language model-based topic detection. In Proceedings of the Topic Detection and Tracking (TDT) Workshop 2001.
[8] Y. Yang (1999). An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, Vol. 1, No. 1/2.
[9] Y. Yang and X. Liu (1999). A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[10] Y. Yang (2001). A study on thresholding strategies for text categorization. In Proceedings of SIGIR 2001.
[11] C. Zhai and J. Lafferty (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR 2001.
[12] C. Zhai and J. Lafferty (2002). Two-stage language models for information retrieval. In Proceedings of SIGIR 2002.

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

Multiple Frame Motion Inference Using Belief Propagation

Multiple Frame Motion Inference Using Belief Propagation Multple Frame Moton Inference Usng Belef Propagaton Jang Gao Janbo Sh The Robotcs Insttute Department of Computer and Informaton Scence Carnege Mellon Unversty Unversty of Pennsylvana Pttsburgh, PA 53

More information

Rules for Using Multi-Attribute Utility Theory for Estimating a User s Interests

Rules for Using Multi-Attribute Utility Theory for Estimating a User s Interests Rules for Usng Mult-Attrbute Utlty Theory for Estmatng a User s Interests Ralph Schäfer 1 DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken Ralph.Schaefer@dfk.de Abstract. In ths paper, we show that Mult-Attrbute

More information

Object-Based Techniques for Image Retrieval

Object-Based Techniques for Image Retrieval 54 Zhang, Gao, & Luo Chapter VII Object-Based Technques for Image Retreval Y. J. Zhang, Tsnghua Unversty, Chna Y. Y. Gao, Tsnghua Unversty, Chna Y. Luo, Tsnghua Unversty, Chna ABSTRACT To overcome the

More information

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures A Novel Adaptve Descrptor Algorthm for Ternary Pattern Textures Fahuan Hu 1,2, Guopng Lu 1 *, Zengwen Dong 1 1.School of Mechancal & Electrcal Engneerng, Nanchang Unversty, Nanchang, 330031, Chna; 2. School

More information

Using Fuzzy Logic to Enhance the Large Size Remote Sensing Images

Using Fuzzy Logic to Enhance the Large Size Remote Sensing Images Internatonal Journal of Informaton and Electroncs Engneerng Vol. 5 No. 6 November 015 Usng Fuzzy Logc to Enhance the Large Sze Remote Sensng Images Trung Nguyen Tu Huy Ngo Hoang and Thoa Vu Van Abstract

More information

Data Mining: Model Evaluation

Data Mining: Model Evaluation Data Mnng: Model Evaluaton Aprl 16, 2013 1 Issues: Evaluatng Classfcaton Methods Accurac classfer accurac: predctng class label predctor accurac: guessng value of predcted attrbutes Speed tme to construct

More information