
Issues and Empirical Results for Improving Text Classification

Youngjoong Ko (1) and Jungyun Seo (2)

(1) Dept. of Computer Engineering, Dong-A University, 840 Hadan 2-dong, Saha-gu, Busan, Korea. yko@dau.ac.kr
(2) Dept. of Computer Engineering, Sogang University, Sinsu-dong, Mapo-gu, Seoul, Korea. seoy@sogang.ac.kr

1. Corresponding author: Jungyun Seo
2. Corresponding address: Dept. of Computer Engineering, Sogang University, Sinsu-dong, Mapo-gu, Seoul, Korea
3. Corresponding telephone number:
4. Corresponding fax number:
5. Corresponding e-mail address: seoy@sogang.ac.kr

ABSTRACT

Automatic text classification has a long history, and many studies have been conducted in the literature. In particular, many machine learning algorithms and information retrieval techniques have been applied to text classification tasks. Even though much technical progress has been made, there is still room for improvement in text classification. We discuss here which issues remain open for improving text classification. In this paper, three improvement issues are presented, covering automatic training data generation, noisy data treatment, and term weighting and indexing, and four actual studies and their empirical results for these issues are introduced. First, a semi-supervised learning technique is applied to text classification for efficiently creating training data. For effective noisy data treatment, a noisy data reduction method and a text classifier robust to noisy data are developed as a solution. Finally, the term weighting and indexing technique is reformed by reflecting the importance of sentences in term weight calculation using summarization techniques.

KEYWORDS

Text Classification, Improvement Issues, Semi-supervised Learning, Noisy Data Reduction, TCFP Classifier, Importance of Sentence, Term Weighting and Indexing

1. INTRODUCTION

In recent years, automatic content-based document management tasks have gained a prominent status in the information systems field, due to the widespread and continuously increasing availability of documents in digital form. In particular, the task of classifying natural language documents into a pre-defined set of semantic categories has become one of the key methods for organizing online information in step with the rapid growth of the World Wide Web [Sebastiani, 2002].

This task is commonly referred to as text classification. Since there has recently been an explosion of electronic texts, not only from the World Wide Web but also from various online sources (electronic mail, corporate databases, chat rooms, digital libraries, etc.), one way of organizing this overwhelming amount of data is to classify it into topical categories.

Since the machine learning paradigm emerged in the 1990s, many machine learning algorithms have been applied to text classification. Within the machine learning paradigm, a general inductive process automatically builds a text classifier by learning from a set of previously classified documents. The advantages of this approach are an accuracy comparable to human performance and considerable savings in manpower. In addition, text classification is deeply related to information retrieval, as the foundation of automated content-based document management. Thus, text classification may be seen as the meeting point of machine learning and information retrieval. Information gain, the χ2 statistic, and other measures have been used for feature selection [Yang and Pedersen, 1997], and Naive Bayes [Ko and Seo, 2000; McCallum and Nigam, 1998], Rocchio [Lewis et al., 1996], k-Nearest Neighbor (k-NN) [Yang et al., 2002], Support Vector Machines (SVM) [Joachims, 2001], etc. have been broadly employed as text classifiers. Since SVM was applied to text classification, it has dominated other text classifiers in terms of performance.

However, text classification still has many points to be improved, such as automatic training data generation, noisy data reduction, classifiers robust to noisy data, and term weighting and indexing, especially according to application area. In this paper, we enumerate improvement issues for text classification and introduce improvement attempts and their empirical results as actually applied in classification tasks. Among the many improvement issues in text classification, we focus on three important ones: automatic training data generation, noisy data treatment (noisy data reduction and a classifier robust to noisy data), and term weighting and indexing.

In supervised-learning based text classification, obtaining good quality training data is very important, whereas the labeling task for training data is a painfully time-consuming process. To reduce this burden, semi-supervised or unsupervised learning techniques have been applied to text classification [Ko and Seo, 2009; Slonim et al., 2002]. Even though labeling tasks are performed by experts over a long period of time, training data almost always includes some noisy data.

Removing noisy data is another key improvement issue for text classification. For this, we introduce two different solutions: noisy data reduction in training settings with positive and negative examples [Han et al., 2007], and the development of a classifier robust to noisy data [Ko and Seo, 2004a]. Finally, we pay attention to the term weighting and indexing phase of text classification. Most techniques related to it originated in the information retrieval literature. The problem is that they are applied to text classification with the same methodology as in information retrieval, even though text classification and information retrieval have different properties and application settings. The importance of sentences in a document, measured using text summarization techniques, is used for a new term weighting scheme [Ko et al., 2004b].

The rest of this paper enumerates issues for improving text classification and their empirical results step by step, as follows. Section 2 presents the first improvement issue, the labeling task, and one solution based on semi-supervised learning. In Section 3, we explain the necessity of noisy data treatment and two approaches to it: the noisy data reduction method applied to binary text classification, and the TCFP classifier, which is robust to noisy data. Section 4 describes the necessity of improved term weighting for text classification and introduces a new term weighting scheme. Finally, we present our conclusions.

2. UTILIZATION OF UNLABELED DATA TO REDUCE THE PAINFUL AMOUNT OF MANUAL LABELING TASKS FOR OBTAINING TRAINING DATA

Generally, supervised-learning based text classification requires a large, often prohibitive, number of labeled training documents for accurate learning. Since the labeling task must be done by hand and the application area of text classification has diversified from articles and web pages to e-mails and blogs, the labeling task for each application area becomes harder and harder. Thus, most users of a practical system obviously prefer algorithms that achieve high accuracy but do not require a painful amount of manual labeling. To meet this need, unsupervised learning [Slonim et al., 2002] as well as semi-supervised learning [Ko and Seo, 2009; Nigam et al., 2006] and active learning [Tong and Koller, 2001] have been applied to text classification; they all attempt to utilize unlabeled data in place of labeled data.

While labeled data is hard to obtain, unlabeled data is readily available and plentiful. Therefore, these learning methods are very useful ways to exploit unlabeled data for text classification. We advocated a new automatic machine-labeling method using a bootstrapping technique. The proposed method uses only unlabeled documents, and the title word of each category is used as the initial data for learning the text classifier. The input to the bootstrapping process is a large amount of unlabeled documents and a small amount of seed information to tell the learner about the specific task. Here, a title word associated with a category is considered as the seed information.

2.1 Bootstrapping Technique to Generate Machine-labeled Data

The bootstrapping process consists of three modules: a module to preprocess unlabeled documents, a module to construct context-clusters for training, and a module to build up the Naive Bayes classifier using context-clusters.

2.1.1 Preprocessing

First of all, the Brill POS tagger is used to extract content words, i.e., words with noun or verb POS tags [Brill, 1995]. Since machine-labeled data has to be created from only a title word, a context is defined as a new unit of meaning, and it is used as the unit for bootstrapping the meaning of each category. It is a middle-sized processing unit between a word and a document. A sequence of 60 content words within a document is regarded as the window size for one context. To extract contexts from a document, we use a sliding window technique [Maarek et al., 1991], as sketched below. The window slides from the first content word to the last content word of the document, with a window size of 60 words and an interval of 30 words between windows.
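For concreteness, a minimal sketch of this windowing step follows, assuming the content words of a document have already been extracted by the POS-tagging step; the function name and input format are illustrative, not the authors' implementation.

    # Minimal sketch of sliding-window context extraction (window = 60
    # content words, step = 30), as described above.
    def extract_contexts(content_words, window=60, step=30):
        """Slide a fixed-size window over the content words of one document
        and return each window as one 'context' (a list of words)."""
        contexts = []
        for start in range(0, max(len(content_words) - window + 1, 1), step):
            contexts.append(content_words[start:start + window])
        return contexts

    # Example: a toy "document" of 150 content words yields 4 overlapping
    # contexts of 60 words each.
    doc = [f"w{i}" for i in range(150)]
    print([len(c) for c in extract_contexts(doc)])   # [60, 60, 60, 60]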

2.1.2 Constructing a Context-Cluster as the Training Data of Each Category

First, keywords are automatically generated from the title word of each category using co-occurrence information. Then centroid-contexts are extracted using the title word and keywords. Each centroid-context includes at least one of the title word and keywords, and it is regarded as one of the most informative contexts for its category. Furthermore, more information about each category is obtained by assigning the remaining contexts to each context-cluster using a similarity measure technique.

Contexts containing a keyword or the title word of any category are selected as centroid-contexts. From the selected contexts, we obtain a set of words in first-order co-occurrence with the centroid-contexts of each category. We next gather second-order co-occurrence information by assigning the remaining contexts to the context-cluster of each category. As the assignment criterion, we calculate similarities between the remaining contexts and the centroid-contexts of each category. For this, the similarity measure algorithm by Karov and Edelman [1998] is reformed and applied to our context-cluster generation algorithm; the remaining contexts are assigned to each context-cluster by this algorithm.

In our similarity measure algorithm, words and contexts play complementary roles: contexts are similar to the extent that they contain similar words, and words are similar to the extent that they appear in similar contexts. This definition is circular. Thus it is applied iteratively using two matrices, the Word Similarity Matrix (WSM) and the Context Similarity Matrix (CSM). The rows and columns of WSM are labeled by all the content words encountered in the centroid-contexts of each category and the input remaining contexts, and the rows of CSM correspond to the centroid-contexts and the columns to the remaining contexts. Each category has one WSM and one CSM. In each iteration n, WSM_n, whose cell (i,j) holds a value between 0 and 1, is updated, and the value of each cell indicates the extent to which the i-th word is contextually similar to the j-th word. Likewise, CSM_n, which holds similarities among contexts, is kept and updated. The number of input contexts per row and column in CSM is limited to 200 in consideration of execution time and memory allocation. To estimate the similarities, WSM is initialized to the identity matrix; that is, each word is fully similar (1) to itself and completely dissimilar (0) to other words. The following steps are iterated until the changes in the similarity values are small enough: update the context similarity matrix CSM_n using the word similarity matrix WSM_n, and then update the word similarity matrix WSM_n using the context similarity matrix CSM_n.

To simplify the symmetric iterative treatment of similarities between words and contexts, an auxiliary relation between words and contexts, called affinity, is expressed by formulae (1) and (2) [Karov and Edelman, 1998]:

aff_n(W, X) = max_{W_i ∈ X} sim_n(W, W_i)    (1)

aff_n(X, W) = max_{X_j : W ∈ X_j} sim_n(X, X_j)    (2)

In the above formulae, n denotes the iteration number, W ∈ X means that a word W belongs to a context X, and the similarity values are defined by WSM_n and CSM_n. Every word has some affinity to a context, and the context can be represented by a vector indicating the affinity of each word to it. The similarity of W_1 to W_2 is the average affinity of the contexts that include W_1 to W_2, and the similarity of a context X_1 to X_2 is a weighted average of the affinity of the words in X_1 to X_2. The similarity formulae are defined as follows:

sim_{n+1}(X_1, X_2) = Σ_{W ∈ X_1} weight(W, X_1) · aff_n(W, X_2)    (3)

sim_{n+1}(W_1, W_2) = 1,  if W_1 = W_2;
sim_{n+1}(W_1, W_2) = Σ_{X : W_1 ∈ X} weight(X, W_1) · aff_n(X, W_2),  otherwise.    (4)

The weights in formula (3) are calculated as the product of three factors: global frequency, a log-likelihood factor, and part of speech. Since each weight in formula (4) is the reciprocal of the number of contexts that contain W_1, the weights sum to 1. These values are used to update the corresponding entries of WSM_n and CSM_n. The similarity of each remaining context to the centroid-contexts of a category is first estimated, and the similarity values are then averaged. Finally, each remaining context is assigned to the context-cluster of the category with the maximum similarity.
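The following compact sketch illustrates the iterative computation of formulae (1)-(4). For brevity, the word weights of formula (3) (a product of global frequency, log-likelihood, and part-of-speech factors in the paper) are taken as uniform, similarities are stored in plain dicts rather than matrices, and the 200-context limit is omitted; this is an illustrative simplification, not the authors' code.

    from collections import defaultdict

    def iterate_similarities(contexts, iters=5):
        """contexts: list of word lists (centroid + remaining contexts).
        Returns (word_sim, ctx_sim) after `iters` symmetric updates."""
        vocab = sorted({w for c in contexts for w in c})
        occ = defaultdict(list)            # word -> contexts containing it
        for j, c in enumerate(contexts):
            for w in set(c):
                occ[w].append(j)

        # WSM initialised to the identity: each word fully similar to itself.
        wsim = {w: {w: 1.0} for w in vocab}
        csim = [[0.0] * len(contexts) for _ in contexts]

        def aff_word_ctx(w, ctx):          # formula (1)
            return max(wsim[w].get(wi, 0.0) for wi in ctx)

        def aff_ctx_word(j, w):            # formula (2)
            return max(csim[j][k] for k in occ[w])

        for _ in range(iters):
            # Formula (3): update context similarities (uniform word weights).
            for j1, c1 in enumerate(contexts):
                for j2 in range(len(contexts)):
                    csim[j1][j2] = sum(aff_word_ctx(w, contexts[j2])
                                       for w in c1) / len(c1)
            # Formula (4): update word similarities (reciprocal weights).
            for w1 in vocab:
                row = {}
                for w2 in vocab:
                    row[w2] = 1.0 if w1 == w2 else (
                        sum(aff_ctx_word(j, w2) for j in occ[w1]) / len(occ[w1]))
                wsim[w1] = row
        return wsim, csim

    # Example: three tiny contexts sharing words pairwise.
    ws, cs = iterate_similarities([["stock", "market"], ["market", "price"],
                                   ["price", "rise"]], iters=3)
    print(round(cs[0][2], 3))   # second-order similarity of context 0 to 2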

2.1.3 Learning a Naive Bayes Classifier Using Context-Clusters

In the above section, we obtained labeled training contexts: the context-clusters. Since the training data are labeled at the context level, a Naive Bayes classifier is selected to learn from the context-clusters, because it can be built by estimating only the word probabilities in each category. Therefore, the Naive Bayes classifier is constructed by estimating the word distribution in the context-cluster of each category, and it finally classifies unlabeled documents into the categories. The Naive Bayes classifier is built with minor modifications based on Kullback-Leibler divergence [Craven et al., 2000]. This method makes exactly the same classifications as Naive Bayes but produces classification scores that are less extreme and thus better reflect uncertainty than those produced by Naive Bayes. A document d_i is classified by the following formula:

P(c_j | d_i; θ̂) = P(c_j | θ̂) P(d_i | c_j; θ̂) / P(d_i | θ̂)
             = P(c_j | θ̂) Π_{t=1}^{|V|} P(w_t | c_j; θ̂)^{N(w_t, d_i)} / P(d_i | θ̂)
             ∝ log P(c_j | θ̂) / n + Σ_{t=1}^{|V|} P(w_t | d_i) log [ P(w_t | c_j; θ̂) / P(w_t | d_i) ]    (5)

where n is the number of words in document d_i, w_t is the t-th word in the vocabulary V, and N(w_t, d_i) is the frequency of word w_t in document d_i.
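A small sketch of classification with the Kullback-Leibler form of formula (5) follows; the smoothing constant, parameter names, and data layout are illustrative assumptions.

    import math

    def nb_kl_score(doc_words, prior, word_prob):
        """Score one category for a document via the KL-divergence form of
        Naive Bayes (formula (5)); word_prob maps word -> P(w|c)."""
        n = len(doc_words)
        counts = {}
        for w in doc_words:
            counts[w] = counts.get(w, 0) + 1
        score = math.log(prior) / n
        for w, cnt in counts.items():
            p_wd = cnt / n                     # P(w_t | d_i)
            p_wc = word_prob.get(w, 1e-9)      # smoothed P(w_t | c_j) (assumption)
            score += p_wd * math.log(p_wc / p_wd)
        return score

    def classify(doc_words, priors, word_probs):
        """Pick the category with the highest formula-(5) score."""
        return max(priors, key=lambda c: nb_kl_score(doc_words, priors[c],
                                                     word_probs[c]))

    # Toy example with two categories and hand-set parameters (assumptions):
    priors = {"sports": 0.5, "politics": 0.5}
    word_probs = {"sports": {"goal": 0.08, "vote": 0.01},
                  "politics": {"goal": 0.01, "vote": 0.08}}
    print(classify(["goal", "goal", "vote"], priors, word_probs))  # sports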

2.2 Empirical Evaluation

2.2.1 Data Sets and Experimental Settings

To test the proposed method, we used three different kinds of data sets: UseNet newsgroups (Newsgroups), web pages (WebKB), and newswire articles (Reuters 21578). For fair evaluation on Newsgroups and WebKB, we employed five-fold cross-validation: each data set is split into five subsets, and each subset is used once as test data in a particular run while the remaining subsets are used as training data for that run. The split into training and test sets in each run is the same for all classifiers. Therefore, all the results of our experiments are averages over five runs. About 25% of the documents from the training data of each data set were selected as a validation set. After all parameter values of our experiments were set on the validation set, we evaluated the proposed method using these parameter values. We applied a statistical feature selection method (χ2 statistics) for each classifier at the preprocessing stage [Yang and Pedersen, 1997].

As performance measures, we followed the standard definitions of recall (r), precision (p), and the F1 measure (2rp/(r+p)). For evaluating average performance across categories, we used the micro-averaging method, which counts the decisions for all categories in a joint pool and computes the global recall, precision, and F1 values for that pool [Yang, 1999]. Results on Reuters are reported as the precision-recall breakeven point, a standard information retrieval measure for binary classification [Joachims, 2001; Yang, 1999]. Both measures are sketched in code below.
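The sketch assumes per-document decision sets for micro-averaging and a relevance-ranked list for the breakeven point; the input layouts and the tie-breaking convention for the breakeven point are illustrative choices, not the evaluation code used in the experiments.

    def micro_f1(decisions):
        """Micro-averaged F1: decisions is an iterable of (predicted_categories,
        true_categories) set pairs, one pair per document; TP/FP/FN are pooled
        over all categories before computing global p, r, and F1."""
        tp = fp = fn = 0
        for pred, true in decisions:
            tp += len(pred & true)
            fp += len(pred - true)
            fn += len(true - pred)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def breakeven_point(ranked_relevance):
        """Precision-recall breakeven point: walk down a ranking (best first,
        1 = relevant) and report the point where precision and recall meet;
        taking the cut-off minimising |p - r| is an illustrative convention."""
        total = sum(ranked_relevance)
        hits, best, best_gap = 0, 0.0, float("inf")
        for k, rel in enumerate(ranked_relevance, start=1):
            hits += rel
            p, r = hits / k, hits / total
            if abs(p - r) < best_gap:
                best_gap, best = abs(p - r), (p + r) / 2
        return best

    print(micro_f1([({"acq"}, {"acq"}), ({"corn"}, {"wheat"})]))  # 0.5
    print(breakeven_point([1, 1, 0, 1, 0, 0]))                    # ~0.667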

2.2.2 Experimental Results

Here, we employ a supervised Naive Bayes classifier to compare our method with the supervised method; the supervised Naive Bayes classifier learns from human-labeled documents. Figure 1 and Table 1 report the results on the three data sets and compare the performance differences between the proposed method and the supervised method.

Figure 1. Performance differences of the best micro-avg. F1 scores or precision-recall breakeven points on the three data sets: our method vs. supervised NB

Table 1. Performance differences of the best micro-avg. F1 scores or precision-recall breakeven points on the three data sets: our method vs. supervised NB (supervised values inferred from the differences reported in the text)

Data Set      Our Method   Supervised NB
Newsgroups    79.36        91.72
WebKB         73.63        85.29
Reuters       88.62        91.64

As shown in Table 1, we obtained a 79.36% micro-average F1 score on the Newsgroups data set, a 73.63% micro-average F1 score on the WebKB data set, and an 88.62% micro-average precision-recall breakeven point on the Reuters data set. The differences between our method and the supervised Naive Bayes classifier are 12.36% on the Newsgroups data set, 11.66% on the WebKB data set, and 3.02% on the Reuters data set. Since we use only unlabeled data and title words, the performance of our method is all the more significant. In particular, the proposed method almost achieved performance comparable to the supervised method on the Reuters data set. As previously noted in [Joachims, 1997], categories like 'wheat' and 'corn' are known for a strong correspondence between a small set of words (like our title words and keywords) and the categories, while categories like 'acq' are known for more complex characteristics. Since categories with narrow definitions attain their best classification with small vocabularies, we can achieve good performance on the Reuters data set with our method, which depends on title words. On the Newsgroups and WebKB data sets, we could not attain performance comparable to the supervised method. In fact, the categories of these data sets are somewhat confusable.

In the Newsgroups data set, many of the categories fall into confusable clusters: for example, five of them are comp.* discussion groups, and three of them discuss religion. In the WebKB data set, meaningful words of each category also have high frequency in other categories. Worst of all, even title words (course, professor, faculty, project) have a confusing usage. We think these factors contributed to the comparatively poor performance of our method.

3. NECESSITY OF NOISE REDUCTION AND ITS TWO CLUES: NOISY DATA REDUCTION AND THE TCFP CLASSIFIER WITH ROBUSTNESS FROM NOISY DATA

Effectively dealing with noisy data is another key improvement issue for text classification. There are two different solutions: noisy data removal in binary training settings built by the one-against-the-rest method [Han et al., 2007], and the TCFP classifier, which is robust to noisy data [Ko and Seo, 2004a].

3.1 Improving the One-against-the-rest Method for Removing Noisy Data in Binary Text Classification

In text classification, a binary setting or a multi-class setting is used to organize training examples for learning tasks. As the binary setting consists of only two classes, it is the simplest, yet most important, formulation of the learning problem. The two classes are composed of relevant (positive) and non-relevant (negative) examples in information retrieval applications [Joachims, 2002]. Generally, some classification tasks involve more than two classes. When we apply the binary setting to a multi-class problem, there is a complication: the multi-class setting provides only positive examples for each category; no category has explicitly labeled negative examples. To solve this problem, the one-against-the-rest method has been used in many cases [Zadrozny and Elkan, 2001; Zadrozny and Elkan, 2002; Hsu and Lin, 2002]; it reduces a multi-class problem to many binary tasks. That is, while all the documents of a category are designated as positive examples by hand, documents that do not belong to the category are treated, indirectly, as negative examples. This labeling task concentrates only on selecting positive examples for each category; it does not directly label negative examples carrying the opposite meaning of the counterpart positive category.

Thus the negative data set in the one-against-the-rest method will probably include noisy examples. In addition, because the negative data set consists of positive examples drawn from various other categories with different distributions, it can hardly be considered an exact set of negative examples for each category. These noisy documents can be one of the major causes of decreased performance in binary text classification. Therefore, classifiers need to handle noisy training documents efficiently in order to achieve high performance.

3.1.1 Detecting and Removing Noisy Data from the One-against-the-rest Method

In the one-against-the-rest method, the documents of one category are regarded as positive examples and the documents of the other categories as negative examples. To effectively remove noisy data in the one-against-the-rest training setting, we have to find a boundary area, i.e., a region containing many noisy documents. First of all, using the initial positive and negative data sets of each category from the one-against-the-rest method, we learn a Naive Bayes (NB) classifier and obtain a prediction score for each document by the following formula:

Prediction_Score(c_i | d_j) = P(Positive | d_j) / ( P(Positive | d_j) + P(Negative | d_j) )    (6)

where c_i denotes a category and d_j a document of c_i. P(Positive | d_j) is the probability of document d_j being positive in c_i, and P(Negative | d_j) is the probability of document d_j being negative in c_i. The documents of each category are then sorted in descending order of the calculated prediction scores. The probabilities P(Positive | d_j) and P(Negative | d_j) in formula (6) are calculated by the Naive Bayes formula, as follows [Lewis, 1998; Ko et al., 2004b; Craven et al., 2000]:

P(Positive | d_j) = P(Positive) P(d_j | Positive) / P(d_j)
                ∝ P(Positive) Π_{i=1}^{T} P(t_i | Positive)^{N(t_i, d_j)}
                ∝ log P(Positive) / n + Σ_{i=1}^{T} P(t_i | d_j) log [ P(t_i | Positive) / P(t_i | d_j) ]    (7)

where t_i is the i-th word in the vocabulary, T is the size of the vocabulary, and N(t_i, d_j) is the frequency of word t_i in document d_j.

A boundary between positive and negative examples can be detected in the block with the most mixed distribution of positive and negative documents. The sliding window technique is first used to detect this block [Lee et al., 2001]. In this technique, windows of a certain size slide from the top document to the last document in the list ordered by prediction score. An entropy value is calculated to estimate the mixed degree of each window as follows [Mitchell, 1997]:

Entropy(W) = − p+ log2 p+ − p− log2 p−    (8)

where, given a window W, p+ is the proportion of positive documents in W and p− is the proportion of negative documents in W. The two windows with the highest entropy value are picked: one is the first such window found from the top and the other is the first found from the bottom. If there is no window, or only one window, with the highest entropy value, windows with the next highest entropy value become the targets of selection. Maximum (max) and minimum (min) threshold values are then taken from the selected windows: the max threshold value is the highest prediction score of a negative document in the former window, and the min threshold value is the lowest prediction score of a positive document in the latter window.

We regard the documents between the max and min threshold values as unlabeled documents; these documents are considered potentially noisy. Three classes of training documents are now constructed for each category: definitely positive documents, unlabeled documents, and definitely negative documents. By applying the revised EM algorithm to these three data sets, we can extract the actual noisy documents and remove them.

The EM algorithm is used to pick out noisy documents from the unlabeled data and remove them. The general EM algorithm consists of two steps, the Expectation step and the Maximization step [Dempster et al., 1977]. The algorithm first trains a classifier using the available labeled documents and labels the unlabeled documents by hard classification (Expectation (E or E') step). It then trains a new classifier using the labels of all the documents (Maximization (M) step), and iterates to convergence. The Naive Bayes classifier is used in both steps of the EM algorithm. Figure 2 shows how the EM algorithm is revised in our method.

The Revised EM Algorithm
  Every document d_p of P (positive data) is assigned the class label c_p;
  Every document d_n of N (negative data) is assigned the class label c_n;
  Let d_u be a document of U (unlabeled data);
  Build an initial Naive Bayes classifier, ĉ, from P and N;
  Loop while the classifier parameters (ĉ) improve:
    (E-step, original)
      Use the current classifier:
      for each document d_u ∈ U do
        if P(c_p | d_u) ≥ P(c_n | d_u) then P ← P ∪ {d_u};
        else N ← N ∪ {d_u};
    (E'-step, revised)
      Use the current classifier:
      for each document d_u ∈ U do
        if P(c_p | d_u) ≥ P(c_n | d_u) then continue;   (P ← P ∪ {})
        else N ← N ∪ {d_u};
    (M-step)
      Re-estimate the classifier, ĉ, from P and N;
      Use maximum a posteriori parameter estimation to find ĉ = argmax_c P(c)P(U|c);

Figure 2. The revised EM algorithm (the original E-step is replaced by the E'-step)

The E'-step is reformed to effectively remove the noise documents located in the boundary area. Unlike the original E-step, it does not assign an unlabeled document, d_u, to the positive data set P, because it regards d_u as another potentially noisy document; since the positive documents are labeled by hand and carry enough information for a category, additional positive documents could decrease performance. Finally, we can learn the text classifiers with the binary training data generated by the revised EM algorithm.
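A compact sketch of the pipeline just described follows: entropy-scored sliding windows locate the mixed boundary region on the ranked list (formulae (6)-(8)), and a revised-EM loop in the spirit of Figure 2 grows only the negative set, leaving positively judged unlabeled documents out as noise. The window size, convergence test, and the train_nb/posterior stand-ins are illustrative assumptions, not the authors' implementation.

    import math

    def window_entropy(labels):
        """Entropy of one window (formula (8)); labels holds +1/-1 flags."""
        p_pos = sum(1 for y in labels if y > 0) / len(labels)
        h = 0.0
        for p in (p_pos, 1.0 - p_pos):
            if p > 0:
                h -= p * math.log2(p)
        return h

    def mixed_region(ranked_labels, window=20):
        """ranked_labels: +1/-1 flags sorted by descending prediction score.
        Returns the start indices of the two highest-entropy windows, the
        first found from the top and the first found from the bottom."""
        ents = [window_entropy(ranked_labels[i:i + window])
                for i in range(len(ranked_labels) - window + 1)]
        best = max(ents)
        top = ents.index(best)
        bottom = len(ents) - 1 - ents[::-1].index(best)
        return top, bottom

    def revised_em(P, N, U, train_nb, posterior, max_iter=20):
        """Revised EM (Figure 2): documents the classifier judges positive
        are skipped (never added to P); confidently negative ones move to N.
        train_nb(P, N) -> classifier and posterior(clf, d) -> (p_pos, p_neg)
        are assumed to be supplied by the caller."""
        clf = train_nb(P, N)
        for _ in range(max_iter):
            moved = [d for d in U
                     if posterior(clf, d)[1] > posterior(clf, d)[0]]
            if not moved:
                break                     # nothing re-labelled: converged
            U = [d for d in U if d not in moved]
            N = N + moved
            clf = train_nb(P, N)          # M-step: re-estimate from P and N
        return clf, N, U                  # U now holds the residual noisy docs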

3.1.2 Empirical Evaluation

The experimental results show that the proposed method achieved better performance than the original one-against-the-rest method on all three training data sets and with all four classifiers. To evaluate the effectiveness of the proposed method, we implemented four different text classifiers (k-NN, Naive Bayes (NB), Rocchio, and SVM), and the performance of the original one-against-the-rest method is compared to that of the proposed method on three test data sets (the Newsgroups, WebKB, and Reuters data sets). As performance measures, the standard definitions of recall and precision are used, and the micro-averaging and macro-averaging methods are applied for evaluating average performance across categories; in the macro-averaging method, the recall, precision, and F1 measures are first computed for individual categories and then averaged over categories as a global measure of average performance [Yang, 2002]. Results are reported as the precision-recall BEP (breakeven point), a standard information retrieval measure for binary classification; given a ranking of documents, the precision-recall breakeven point is the value at which precision and recall are equal [Joachims, 1998; Yang, 1999; Ko and Seo, 2004a]. Table 2, Table 3, and Table 4 show the experimental results of each text classifier on the Newsgroups, WebKB, and Reuters data sets, respectively.

Table 2. Results on the Newsgroups data set: micro-avg. and macro-avg. BEP of the original one-against-the-rest method vs. the proposed method for k-NN, NB, Rocchio, and SVMs (numeric values missing in the source transcription)

Table 3. Results on the WebKB data set (same layout as Table 2; numeric values missing in the source transcription)

Table 4. Results on the Reuters data set (same layout as Table 2; numeric values missing in the source transcription)

As shown in the above tables, SVM achieved less improvement than the other classifiers. This is because the performance of SVM with the original one-against-the-rest method is already very high on all the data sets; note that it is more difficult to improve a classifier that already performs well. Overall, the proposed method achieved better performance than the original method over all the classifiers and all the data sets. This is clear evidence that the proposed method is more effective than the original one-against-the-rest method.

3.2 The TCFP Classifier with Robustness to Noisy Data Using the Feature Projection Technique

To handle noisy data effectively, a new text classifier using a feature projection technique was developed; the classifier was named TCFP [Ko and Seo, 2004a]. By the nature of the feature projection technique, the TCFP classifier is robust to noisy data. In our experimental results, TCFP showed better performance than other conventional classifiers on noisy data.

3.2.1 The TCFP Classifier with Robustness to Noisy Data

In the TCFP classifier, the classification knowledge is represented as a set of projections of the training data onto each feature dimension. The classification of a test document is based on the voting of each feature (word) of the test document. That is, the final prediction score is calculated by accumulating the voting scores of all features.

First of all, the voting ratio of each category must be calculated for every feature. Since elements with a high TF-IDF value in the projections of a feature are more useful classification criteria for that feature, only elements with TF-IDF values above the average TF-IDF value are used for voting. The selected elements participate in proportional voting with importance equal to the TF-IDF value of each element. The voting ratio of each category c_i for a feature f_m is calculated by the following formula:

r(c_i, f_m) = [ Σ_{f_m(d_j) ∈ V_m} w(f_m, d_j) · y(c_i, f_m(d_j)) ] / [ Σ_{f_m(d_j) ∈ V_m} w(f_m, d_j) ]    (9)

In formula (9), f_m(d_j) denotes the projection element of feature f_m in document d_j, w(f_m, d_j) is the weight of feature f_m in document d_j, V_m denotes the set of elements selected for the voting of feature f_m, and y(c_i, f_m(d_j)) ∈ {0,1} is an indicator function: if the category of element f_m(d_j) is equal to c_i, the output value is 1; otherwise, it is 0.

Next, since each feature votes separately on feature projections, contextual information is missing. Thus, co-occurrence frequency is used to incorporate contextual information into the proposed classification algorithm.

To calculate a co-occurrence frequency value between any two features f_i and f_j, the number of documents that include both features is counted. The TF-IDF values of two features f_i and f_j in a test document are then modified to reflect the co-occurrence frequency of the two features. That is, terms with a high co-occurrence frequency value and a low category frequency value receive higher term weights, as in the following formula:

fw(f_i, d) = w(f_i, d) · ( 1 + [ 1 / log(cf + 1) ] · [ log(co(f_i, f_j) + 1) / log(maxco(f_k, f_l) + 1) ] )    (10)

where fw(f_i, d) denotes the modified term weight assigned to term f_i, cf denotes the category frequency, i.e., the number of categories in which f_i and f_j co-occur, co(f_i, f_j) is the co-occurrence frequency value of f_i and f_j, and maxco(f_k, f_l) is the maximum value among all co-occurrence frequency values. Note that the weight of feature f_j is modified by the same formula, using f_i in place of f_j.

The final voting score of each category c_i for a feature f_m of a test document d is calculated by the following formula:

vs(c_i, f_m) = fw(f_m, d) · r(c_i, f_m) · log(1 + χ2_max(f_m))    (11)

where fw(f_m, d) denotes the term weight modified by the co-occurrence frequency, and χ2_max(f_m) denotes the maximum of the χ2 statistic values of f_m over the categories.

The outline of the TCFP classifier is as follows:

Input: a test document d = <f_1, f_2, ..., f_n>
Main process:
  for each feature f_i:
      compute fw(f_i, d)
  for each feature f_i:
      for each category c_j:
          vote[c_j] = vote[c_j] + vs(c_j, f_i)   (by formula (11))
  prediction = argmax_{c_j} vote[c_j]

The robustness of the TCFP classifier to noisy data is due to its voting mechanism. The voting mechanism depends on separate voting by each feature, which can reduce the negative effect of possible noisy data and irrelevant features in text classification. That is, when a document contains irrelevant features or an incorrect label, the document may lie in the wrong region of the vector space model. Such a document can hurt performance, especially for k-NN and SVM. In the TCFP classifier, on the other hand, an irrelevant feature contributes only to the voting of that feature. Moreover, only feature elements with a TF-IDF weight above the average weight value can take part in the voting mechanism, and this makes the TCFP classifier more effective at handling irrelevant features.
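To make the voting mechanism concrete, here is a minimal sketch of prediction with formulae (9) and (11); the co-occurrence reweighting of formula (10) is omitted for brevity, and the per-feature projection layout is an illustrative assumption.

    import math
    from collections import defaultdict

    def voting_ratios(projections):
        """projections: feature -> list of (tfidf_weight, category) elements.
        Only elements at or above the feature's average weight vote
        (formula (9))."""
        ratios = {}
        for f, elems in projections.items():
            avg = sum(w for w, _ in elems) / len(elems)
            voters = [(w, c) for w, c in elems if w >= avg]
            total = sum(w for w, _ in voters)
            if total == 0:
                continue
            r = defaultdict(float)
            for w, c in voters:
                r[c] += w / total
            ratios[f] = r
        return ratios

    def tcfp_predict(doc_weights, ratios, chi2_max):
        """doc_weights: feature -> (modified) term weight in the test document.
        chi2_max: feature -> maximum chi-square score over categories."""
        votes = defaultdict(float)
        for f, fw in doc_weights.items():
            for c, r in ratios.get(f, {}).items():
                votes[c] += fw * r * math.log(1 + chi2_max.get(f, 0.0))  # (11)
        return max(votes, key=votes.get) if votes else None

    # Toy example with two features projected onto two categories:
    proj = {"goal": [(0.9, "sports"), (0.2, "politics"), (0.8, "sports")],
            "vote": [(0.7, "politics"), (0.1, "sports")]}
    ratios = voting_ratios(proj)
    print(tcfp_predict({"goal": 1.2, "vote": 0.3}, ratios,
                       {"goal": 5.0, "vote": 4.0}))   # -> 'sports'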

3.2.2 Experimental Evaluation

We provide empirical evidence that TCFP is a useful text classifier and is robust to noisy data. In our experiments, we used three test data sets (the Newsgroups, WebKB, and Reuters data sets) and employed four other classifiers (k-NN, NB, Rocchio, and SVMs) for comparison with the TCFP classifier. As performance measures, we followed the standard definitions of recall (r), precision (p), and the F1 measure (2rp/(r+p)) for the Newsgroups and WebKB data sets; results on Reuters are reported as precision-recall breakeven points [Yang, 1999].

First, Table 5 reports the performance comparison of the classifiers on the three data sets, to verify the general usefulness of TCFP.

Table 5. Comparison of performance results for each classifier (TCFP, k-NN, NB, Rocchio, SVM) on each data set (Newsgroups, WebKB, Reuters; numeric values missing in the source transcription)

The results show that TCFP is superior to the k-NN, Naive Bayes, and Rocchio classifiers. However, TCFP produced lower performance than SVMs, which have been reported as the best-performing classifier in this literature. In order to verify the superiority of TCFP in robustness to noisy data, we conducted experiments evaluating the robustness of each classifier on noisy data. For this experiment, we generated four data sets from the Newsgroups data set, increasing the proportion of noisy documents from 10% to 40%; these noisy documents were randomly chosen from each category and randomly assigned to other categories. The results of each classifier on each noisy data set are shown in Figure 3 and Table 6. These results are also obtained by five-fold cross-validation.

Figure 3. Comparison of performance results for each classifier on the four noisy data sets

Table 6. Comparison of performance results for each classifier (TCFP, k-NN, NB, Rocchio, SVM) at noise levels of 10%, 20%, 30%, and 40% (numeric values missing in the source transcription)

As shown in Figure 3 and Table 6, TCFP shows the best performance from the 20% noisy data set onward, and the rate at which TCFP's performance decreases is lower than that of k-NN and SVMs. In particular, we observed that the performance of SVMs degraded rapidly as the number of noisy documents increased. In summary, the experimental results show that TCFP achieves good performance and has a special characteristic with regard to robustness to noisy data.

4. IMPROVING THE TERM WEIGHTING SCHEME FOR TEXT CLASSIFICATION

Here we focus on the term weighting and indexing scheme of text classification as another improvement issue. The vector space model has been used as the conventional method for text representation [Salton and Buckley, 1988]. This model commonly represents a document as a vector of features using Term Frequency (TF) and Inverse Document Frequency (IDF). It simply counts TF without considering where a term occurs. However, each sentence in a document has a different importance for identifying the content of the document. Thus, text classification can be improved by assigning each term a weight that reflects the importance of the sentences in which it occurs. For this, we apply text summarization techniques to distinguish important sentences from unimportant ones in a document. The importance of each sentence is measured by these text summarization techniques, and the term weights in each sentence are modified in proportion to the calculated sentence importance. To test the proposed method, we used two different newsgroup data sets: one is the well-known Newsgroups data set, and the other was gathered from Korean UseNet discussion groups.

4.1 Measuring the Importance of Sentences

The importance of each sentence is measured by two methods. In the first, sentences that are more similar to the title receive higher weights. In the second, we first measure the importance of terms by their TF, IDF, and χ2 statistic values, and then assign higher importance to sentences containing more important terms. Finally, the importance of a sentence is calculated by combining the two methods.

4.1.1 Importance of Sentences by the Title

Generally, we believe that a title summarizes the important content of a document [Endres-Niggemeyer, 1998]. Hence, we measure the similarity between the title and each sentence, and then assign higher importance to sentences with higher similarity. The title and each sentence of a document are represented as vectors of content words. Their similarity value is calculated by the inner product, and the calculated values are normalized into values between 0 and 1 by the maximum value.

The similarity value between the title T and a sentence S_i in a document d is calculated by the following formula:

Sim(S_i, T) = (S_i · T) / max_{S_j ∈ d} (S_j · T)    (12)

where T denotes the vector of the title and S_i denotes the vector of sentence S_i.

4.1.2 Importance of Sentences by the Importance of Terms

Since the title-based method depends on the quality of the title, it can be useless for a document with a meaningless title or no title. Besides, sentences with important terms must also be treated as important even if they are dissimilar to the title. Considering these points, we first measure the importance values of terms by their TF, IDF, and χ2 statistic values, and then the sum of the importance values of the terms in each sentence is taken as the importance value of the sentence. In this method, the importance value of a sentence S_i in a document d is calculated as follows:

Cen(S_i) = [ Σ_{t ∈ S_i} tf(t) · idf(t) · χ2(t) ] / max_{S_j ∈ d} [ Σ_{t ∈ S_j} tf(t) · idf(t) · χ2(t) ]    (13)

where tf(t) denotes the term frequency of term t, idf(t) denotes the inverse document frequency, and χ2(t) denotes the χ2 statistic value of t.
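A minimal sketch of the two sentence-importance measures of formulae (12) and (13) follows, assuming sentences and the title are given as lists of content words and that per-term idf and χ2 tables are available; the names and data layout are illustrative assumptions.

    from collections import Counter

    def title_similarities(sentences, title):
        """Formula (12): inner-product similarity of each sentence to the
        title, normalised by the maximum value within the document."""
        tvec = Counter(title)
        sims = [sum(cnt * tvec[w] for w, cnt in Counter(s).items())
                for s in sentences]
        m = max(sims) or 1                 # avoid division by zero
        return [v / m for v in sims]

    def term_centralities(sentences, idf, chi2):
        """Formula (13): per-sentence sum of tf * idf * chi2, normalised by
        the maximum over the sentences of the document."""
        raw = []
        for s in sentences:
            tf = Counter(s)
            raw.append(sum(cnt * idf.get(w, 0.0) * chi2.get(w, 0.0)
                           for w, cnt in tf.items()))
        m = max(raw) or 1
        return [v / m for v in raw]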

4.1.3 Combination of the Two Sentence Importance Values

The two kinds of sentence importance are combined by the following formula:

Score(S_i) = 1.0 + k_1 · Sim(S_i, T) + k_2 · Cen(S_i)    (14)

In formula (14), k_1 and k_2 are constant weights that control the proportions in which the two importance values are reflected. The constant 1.0 is added to the calculated sentence importance value to prevent the modified TF value of formula (15) from being lower than the original TF value.

4.1.4 Indexing Process

The importance value of a sentence from formula (14) is used to modify the TF value of a term. That is, since the TF value of a term in a document is the sum of its TF values over the sentences, the modified TF value WTF(d,t) of term t in document d is calculated by formula (15):

WTF(d, t) = Σ_{S_i ∈ d} tf(S_i, t) · Score(S_i)    (15)

where tf(S_i, t) denotes the TF of term t in sentence S_i. The weight given by formula (15) is used in k-NN, NB, Rocchio, and SVM.
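Continuing the sketch above, the combination and indexing steps of formulae (14) and (15) can be written as follows; the choice k1 = k2 = 0.5 is an illustrative assumption.

    from collections import Counter

    def sentence_scores(sims, cens, k1=0.5, k2=0.5):
        """Formula (14): Score(S) = 1.0 + k1*Sim(S,T) + k2*Cen(S)."""
        return [1.0 + k1 * s + k2 * c for s, c in zip(sims, cens)]

    def weighted_tf(sentences, scores):
        """Formula (15): WTF(d,t) = sum over sentences of tf(S,t) * Score(S)."""
        wtf = Counter()
        for sent, score in zip(sentences, scores):
            for t, tf in Counter(sent).items():
                wtf[t] += tf * score
        return wtf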

4.2 Empirical Evaluation

To test the proposed system, we used two newsgroup data sets written in two different languages, English and Korean. Each document in both data sets has only one category. The 20 Newsgroups data set is the same one used in the previous sections. The second data set was gathered from Korean UseNet groups; it contains a total of 10,331 documents and consists of 15 categories. 3,107 documents (30%) are used as test data and the remaining 7,224 documents (70%) as training data. The vocabulary resulting from the training data contains 69,793 words. As performance measures, we followed the standard definitions of recall (r), precision (p), and the F1 measure (2rp/(r+p)). For evaluating average performance across categories, we used the micro-averaging and macro-averaging methods [Yang, 1999]. Tables 7 and 8 compare the performance of each classifier using the different indexing schemes on the two newsgroup data sets. Here, the basis method used the conventional TF for NB and the conventional TF-IDF for the other classifiers.

Table 7. Results on the English Newsgroups data set: micro-avg. and macro-avg. F1 for the basis and proposed indexing schemes with k-NN, NB, Rocchio, and SVMs (numeric values missing in the source transcription)

Table 8. Results on the Korean Newsgroups data set (same layout as Table 7; numeric values missing in the source transcription)

On both data sets, the proposed method produced better performance with all of these classifiers. As a result, our proposed method can improve classification performance with all of these classifiers in both English and Korean.

5. CONCLUSIONS

This paper has presented three important points of improvement and their actual empirical results. They can be summarized as follows:

- Training Data Generation: semi-supervised or active learning techniques can be applied to text classification. They give us a way to utilize inexpensive and plentiful unlabeled data. In our experiments, we achieved performance comparable to the supervised method using only the title word of each category and unlabeled data.

- Noisy Data Reduction: effective noisy data reduction and the development of classifiers robust to noisy data can resolve some noisy data problems. In our experiments, the proposed noisy data reduction method led to higher performance on all three test data sets and with four different conventional classifiers, and the TCFP classifier, developed as a classifier robust to noisy data, also performed well in environments with much noisy data.

- Term Weighting and Indexing: a new term weighting and indexing scheme is needed because the requirements of text classification differ from those of information retrieval. Thus, we advocated a new term weighting and indexing method for text categorization using two kinds of text summarization techniques: one uses the title and the other uses the importance of terms. In our experiments, the proposed method achieved better performance than the basis system with all of these classifiers and in both languages, English and Korean.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology ( ).

REFERENCES

Brill, E. 1995. Transformation-Based Error-driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics 21, 4.

Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. 2000. Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence 118, 1-2.

Dempster, A., Laird, N.M., and Rubin, D. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39, 1.

Endres-Niggemeyer, B. 1998. Summarizing Information. Springer-Verlag, Berlin Heidelberg.

Han, H., Ko, Y., and Seo, J. 2007. Using the Revised EM Algorithm to Remove Noisy Data for Improving the One-against-the-rest in Binary Text Classification. Information Processing & Management 43, 5.

Hsu, C.-W. and Lin, C.-J. 2002. A Comparison of Methods for Multiclass Support Vector Machines. IEEE Transactions on Neural Networks 13, 2.

Joachims, T. 1997. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Machine Learning: Proceedings of the Fourteenth International Conference.

Joachims, T. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the European Conference on Machine Learning (ECML), Springer.

Joachims, T. 2001. Learning to Classify Text Using Support Vector Machines. Dissertation for the degree of Doctor of Philosophy, Kluwer Academic Publishers.

Joachims, T. 2002. Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers.

Karov, Y. and Edelman, S. 1998. Similarity-based Word Sense Disambiguation. Computational Linguistics 24, 1.

Ko, Y. and Seo, J. 2000. Automatic Text Categorization by Unsupervised Learning. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000).

Ko, Y. and Seo, J. 2004a. Using the Feature Projection Technique Based on a Normalized Voting for Text Classification. Information Processing & Management 40, 2.

Ko, Y., Park, J., and Seo, J. 2004b. Improving Text Categorization Using the Importance of Sentences. Information Processing & Management 40, 1.

Ko, Y. and Seo, J. 2009. Text Classification from Unlabeled Documents with Bootstrapping and Feature Projection Techniques. Information Processing & Management 45, 1.

Lee, C.H., Lin, C.R., and Chen, M.S. 2001. Sliding-window Filtering: an Efficient Algorithm for Incremental Mining. In Proceedings of the Tenth International Conference on Information and Knowledge Management.

Lewis, D.D., Schapire, R.E., Callan, J.P., and Papka, R. 1996. Training Algorithms for Linear Text Classifiers. In Proceedings of the 19th International Conference on Research and Development in Information Retrieval (SIGIR '96).

Lewis, D.D. 1998. Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval. In Proceedings of the European Conference on Machine Learning.

Maarek, Y., Berry, D., and Kaiser, G. 1991. An Information Retrieval Approach for Automatically Constructing Software Libraries. IEEE Transactions on Software Engineering 17.

McCallum, A. and Nigam, K. 1998. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI '98 Workshop on Learning for Text Categorization.

Mitchell, T. 1997. Machine Learning. McGraw-Hill, New York.

Nigam, K., McCallum, A., and Mitchell, T. 2006. Semi-supervised Text Classification Using EM. In Semi-Supervised Learning. MIT Press, Boston.

Salton, G. and Buckley, C. 1988. Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24.

Sebastiani, F. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys 34, 1.

Slonim, N., Friedman, N., and Tishby, N. 2002. Unsupervised Document Classification Using Sequential Information Maximization. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Tong, S. and Koller, D. 2001. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research.

Yang, Y. and Pedersen, J.P. 1997. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning.

Yang, Y. 1999. An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval 1, 1/2.

Yang, Y., Slattery, S., and Ghani, R. 2002. A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems 18, 2.

Zadrozny, B. and Elkan, C. 2001. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning.

Zadrozny, B. and Elkan, C. 2002. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD-02).


More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study

Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study Arabc Text Classfcaton Usng N-Gram Frequency Statstcs A Comparatve Study Lala Khresat Dept. of Computer Scence, Math and Physcs Farlegh Dcknson Unversty 285 Madson Ave, Madson NJ 07940 Khresat@fdu.edu

More information

Efficient Text Classification by Weighted Proximal SVM *

Efficient Text Classification by Weighted Proximal SVM * Effcent ext Classfcaton by Weghted Proxmal SVM * Dong Zhuang 1, Benyu Zhang, Qang Yang 3, Jun Yan 4, Zheng Chen, Yng Chen 1 1 Computer Scence and Engneerng, Bejng Insttute of echnology, Bejng 100081, Chna

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines A Modfed Medan Flter for the Removal of Impulse Nose Based on the Support Vector Machnes H. GOMEZ-MORENO, S. MALDONADO-BASCON, F. LOPEZ-FERRERAS, M. UTRILLA- MANSO AND P. GIL-JIMENEZ Departamento de Teoría

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

Semi Supervised Learning using Higher Order Cooccurrence Paths to Overcome the Complexity of Data Representation

Semi Supervised Learning using Higher Order Cooccurrence Paths to Overcome the Complexity of Data Representation Sem Supervsed Learnng usng Hgher Order Cooccurrence Paths to Overcome the Complexty of Data Representaton Murat Can Ganz Computer Engneerng Department, Faculty of Engneerng Marmara Unversty, İstanbul,

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

An Anti-Noise Text Categorization Method based on Support Vector Machines *

An Anti-Noise Text Categorization Method based on Support Vector Machines * An Ant-Nose Text ategorzaton Method based on Support Vector Machnes * hen Ln, Huang Je and Gong Zheng-Hu School of omputer Scence, Natonal Unversty of Defense Technology, hangsha, 410073, hna chenln@nudt.edu.cn,

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Investigating the Performance of Naïve- Bayes Classifiers and K- Nearest Neighbor Classifiers

Investigating the Performance of Naïve- Bayes Classifiers and K- Nearest Neighbor Classifiers Journal of Convergence Informaton Technology Volume 5, Number 2, Aprl 2010 Investgatng the Performance of Naïve- Bayes Classfers and K- Nearest Neghbor Classfers Mohammed J. Islam *, Q. M. Jonathan Wu,

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

A Weighted Method to Improve the Centroid-based Classifier

A Weighted Method to Improve the Centroid-based Classifier 016 Internatonal onference on Electrcal Engneerng and utomaton (IEE 016) ISN: 978-1-60595-407-3 Weghted ethod to Improve the entrod-based lassfer huan LIU, Wen-yong WNG *, Guang-hu TU, Nan-nan LIU and

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Improving Web Image Search using Meta Re-rankers

Improving Web Image Search using Meta Re-rankers VOLUME-1, ISSUE-V (Aug-Sep 2013) IS NOW AVAILABLE AT: www.dcst.com Improvng Web Image Search usng Meta Re-rankers B.Kavtha 1, N. Suata 2 1 Department of Computer Scence and Engneerng, Chtanya Bharath Insttute

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

Data Mining: Model Evaluation

Data Mining: Model Evaluation Data Mnng: Model Evaluaton Aprl 16, 2013 1 Issues: Evaluatng Classfcaton Methods Accurac classfer accurac: predctng class label predctor accurac: guessng value of predcted attrbutes Speed tme to construct

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

A Web Site Classification Approach Based On Its Topological Structure

A Web Site Classification Approach Based On Its Topological Structure Internatonal Journal on Asan Language Processng 20 (2):75-86 75 A Web Ste Classfcaton Approach Based On Its Topologcal Structure J-bn Zhang,Zh-mng Xu,Kun-l Xu,Q-shu Pan School of Computer scence and Technology,Harbn

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Network Intrusion Detection Based on PSO-SVM

Network Intrusion Detection Based on PSO-SVM TELKOMNIKA Indonesan Journal of Electrcal Engneerng Vol.1, No., February 014, pp. 150 ~ 1508 DOI: http://dx.do.org/10.11591/telkomnka.v1.386 150 Network Intruson Detecton Based on PSO-SVM Changsheng Xang*

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

An Improvement to Naive Bayes for Text Classification

An Improvement to Naive Bayes for Text Classification Avalable onlne at www.scencedrect.com Proceda Engneerng 15 (2011) 2160 2164 Advancen Control Engneerngand Informaton Scence An Improvement to Nave Bayes for Text Classfcaton We Zhang a, Feng Gao a, a*

More information

Document Representation and Clustering with WordNet Based Similarity Rough Set Model

Document Representation and Clustering with WordNet Based Similarity Rough Set Model IJCSI Internatonal Journal of Computer Scence Issues, Vol. 8, Issue 5, No 3, September 20 ISSN (Onlne): 694-084 www.ijcsi.org Document Representaton and Clusterng wth WordNet Based Smlarty Rough Set Model

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

Impact of a New Attribute Extraction Algorithm on Web Page Classification

Impact of a New Attribute Extraction Algorithm on Web Page Classification Impact of a New Attrbute Extracton Algorthm on Web Page Classfcaton Gösel Brc, Banu Dr, Yldz Techncal Unversty, Computer Engneerng Department Abstract Ths paper ntroduces a new algorthm for dmensonalty

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Clustering of Words Based on Relative Contribution for Text Categorization

Clustering of Words Based on Relative Contribution for Text Categorization Clusterng of Words Based on Relatve Contrbuton for Text Categorzaton Je-Mng Yang, Zh-Yng Lu, Zhao-Yang Qu Abstract Term clusterng tres to group words based on the smlarty crteron between words, so that

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Feature Selection as an Improving Step for Decision Tree Construction

Feature Selection as an Improving Step for Decision Tree Construction 2009 Internatonal Conference on Machne Learnng and Computng IPCSIT vol.3 (2011) (2011) IACSIT Press, Sngapore Feature Selecton as an Improvng Step for Decson Tree Constructon Mahd Esmael 1, Fazekas Gabor

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques Enhancement of Infrequent Purchased Product Recommendaton Usng Data Mnng Technques Noraswalza Abdullah, Yue Xu, Shlomo Geva, and Mark Loo Dscplne of Computer Scence Faculty of Scence and Technology Queensland

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105

More information

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816

More information

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc.

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 97-735 Volume Issue 9 BoTechnology An Indan Journal FULL PAPER BTAIJ, (9), [333-3] Matlab mult-dmensonal model-based - 3 Chnese football assocaton super league

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

Relevance Feedback Document Retrieval using Non-Relevant Documents

Relevance Feedback Document Retrieval using Non-Relevant Documents Relevance Feedback Document Retreval usng Non-Relevant Documents TAKASHI ONODA, HIROSHI MURATA and SEIJI YAMADA Ths paper reports a new document retreval method usng non-relevant documents. From a large

More information

An Improved Image Segmentation Algorithm Based on the Otsu Method

An Improved Image Segmentation Algorithm Based on the Otsu Method 3th ACIS Internatonal Conference on Software Engneerng, Artfcal Intellgence, Networkng arallel/dstrbuted Computng An Improved Image Segmentaton Algorthm Based on the Otsu Method Mengxng Huang, enjao Yu,

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

CUM: An Efficient Framework for Mining Concept Units

CUM: An Efficient Framework for Mining Concept Units CUM: An Effcent Framework for Mnng Concept Unts P.Santh Thlagam Ananthanarayana V.S Department of Informaton Technology Natonal Insttute of Technology Karnataka - Surathkal Inda 575025 santh_soc@yahoo.co.n,

More information

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 2 Sofa 2016 Prnt ISSN: 1311-9702; Onlne ISSN: 1314-4081 DOI: 10.1515/cat-2016-0017 Hybrdzaton of Expectaton-Maxmzaton

More information

Combining Multiple Resources, Evidence and Criteria for Genomic Information Retrieval

Combining Multiple Resources, Evidence and Criteria for Genomic Information Retrieval Combnng Multple Resources, Evdence and Crtera for Genomc Informaton Retreval Luo S 1, Je Lu 2 and Jame Callan 2 1 Department of Computer Scence, Purdue Unversty, West Lafayette, IN 47907, USA ls@cs.purdue.edu

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

Fast Feature Value Searching for Face Detection

Fast Feature Value Searching for Face Detection Vol., No. 2 Computer and Informaton Scence Fast Feature Value Searchng for Face Detecton Yunyang Yan Department of Computer Engneerng Huayn Insttute of Technology Hua an 22300, Chna E-mal: areyyyke@63.com

More information