Learning to Classify Documents with Only a Small Positive Training Set


Xiao-Li Li 1, Bing Liu 2, and See-Kiong Ng 1

1 Institute for Infocomm Research, Heng Mui Keng Terrace, Singapore
2 Department of Computer Science, University of Illinois at Chicago, IL
xlli@i2r.a-star.edu.sg, liub@cs.uic.edu, skng@i2r.a-star.edu.sg

J.N. Kok et al. (Eds.): ECML 2007, LNAI 4701. Springer-Verlag Berlin Heidelberg 2007

Abstract. Many real-world classification applications fall into the class of positive and unlabeled (PU) learning problems. In many such applications, not only can the negative training examples be missing, the number of positive examples available for learning may also be fairly limited due to the impracticality of hand-labeling a large number of training examples. Current PU learning techniques have focused mostly on identifying reliable negative instances from the unlabeled set U. In this paper, we address the oft-overlooked PU learning problem where the number of training examples in the positive set P is small. We propose a novel technique LPLP (Learning from Probabilistically Labeled Positive examples) and apply the approach to classifying product pages from commercial websites. The experimental results demonstrate that our approach outperforms existing methods significantly, even in the challenging cases where the positive examples in P and the hidden positive examples in U were not drawn from the same distribution.

1 Introduction

Traditional supervised learning techniques typically require a large number of labeled examples to learn an accurate classifier. However, in practice, it can be an expensive and tedious process to obtain the class labels for large sets of training examples. One way to reduce the amount of labeled training data needed is to develop classification algorithms that can learn from a set of labeled positive examples augmented with a set of unlabeled examples. That is, given a set P of positive examples of a particular class and a set U of unlabeled examples (which contains both hidden positive and hidden negative examples), we build a classifier using P and U to classify the data in U as well as future test data. We call this the PU learning problem.

Several innovative techniques (e.g. [1], [2], [3]) have recently been proposed to solve the PU learning problem. All of these techniques have focused on addressing the lack of labeled negative examples in the training data. It was assumed that there was a sufficiently large set of positive training examples, and also that the positive examples in P and the hidden positive examples in U were drawn from the same distribution. However, in practice, obtaining a large number of positive examples can be rather difficult in many real applications. Oftentimes, we have to make do with a fairly small set of positive training data. In fact, the small positive set may not even adequately represent the whole positive class, as it is highly likely that there could be hidden positives in U that are not similar to those few examples in the small positive set P. Moreover, the examples in the positive set P and the hidden positive examples in the unlabeled set U may not even be generated or drawn from the same distribution. A classifier trained merely on the few available examples in P would thus be incompetent in recognizing all the hidden positives in U as well as those in the test sets.

In this work, we consider the problem of learning to classify documents with only a small positive training set. Our work was motivated by a real-life business intelligence application of classifying web pages of product information. The richness of information easily available on the World Wide Web has made it routine for companies to conduct business intelligence by searching the Internet for information on related products. For example, a company that sells computer printers may want to do a product comparison among the various printers currently available in the market. One can first collect sample printer pages by crawling through all product pages from a consolidated e-commerce web site (e.g., amazon.com) and then hand-label those pages containing printer product information to construct the set P of positive examples. Next, to get more product information, one can then crawl through all the product pages from other web sites (e.g., cnet.com) as U. Ideally, PU learning techniques can then be applied to classify all pages in U into printer pages and non-printer pages. However, we found that while the printer product pages from two websites (say, amazon.com and cnet.com) do indeed share many similarities, they can also be quite distinct, as different web sites invariably present their products (even similar ones) in different styles and with different focuses. As such, directly applying existing methods would give very poor results because 1) the small positive set P obtained from one site contains only tens of web pages (usually fewer than 30) and therefore does not adequately represent the whole positive class, and 2) the features of the positive examples in P and of the hidden positive examples in U are not generated from the same distribution because they come from different web sites.

In this paper, we tackle the challenge of constructing a reliable document (web page) classifier based on only a few labeled positive pages from a single web site, and then using it to automatically extract the hidden positive pages from different web sites (i.e. the unlabeled sets). We propose an effective technique called LPLP (Learning from Probabilistically Labeled Positive examples) to perform this task. Our proposed technique LPLP is based on probabilistically labeling training examples from U and on the EM algorithm [4]. The experimental results show that LPLP significantly outperforms existing PU learning methods.

2 Related Works

A theoretical study of PAC learning from positive and unlabeled examples under the statistical query model was first reported in [5]. Muggleton [6] followed by studying the problem in a Bayesian framework where the distribution of functions and examples are assumed known. [1] reported sample complexity results and provided theoretical elaborations on how the problem could be solved. Subsequently, a number of practical PU learning algorithms were proposed [1], [3] and [2].

These PU learning algorithms all conformed to the theoretical results presented in [1] by following a common two-step strategy, namely: (1) identifying a set of reliable negative documents from the unlabeled set; and then (2) building a classifier using EM or SVM iteratively. The specific differences between the various algorithms in these two steps are as follows. The S-EM method proposed in [1] was based on naïve Bayesian classification and the EM algorithm [4]. The main idea was to first use a spying technique to identify some reliable negative documents from the unlabeled set, and then to run EM to build the final classifier. The PEBL method [3] uses a different method (1-DNF) for identifying reliable negative examples and then runs SVM iteratively to build the classifier. More recently, [2] reported a technique called Roc-SVM. In this technique, reliable negative documents are extracted using the information retrieval technique Rocchio [7]. Again, SVM is used in the second step. A classifier selection criterion is also proposed to catch a good classifier from the iterations of SVM. Despite the differences in algorithmic details, the above methods all focus on extracting reliable negative instances from the unlabeled set.

More related to our current work is the recent work by Yu [8], which proposed to estimate the boundary of the positive class. However, the number of positive examples it required was around 30% of the whole data set, which is still too large for many practical applications. In [9], a method called PN-SVM was proposed to deal with the case when the positive set is small. However, it (like all the other existing PU learning algorithms) relies on the assumption that the positive examples in P and the hidden positives in U are generated from the same distribution. Our LPLP method proposed in this paper is the first to address such common weaknesses of current PU learning methods, including handling the challenging cases where the positive examples in P and the hidden positive examples in U were not drawn from the same distribution.

Note that the problem could potentially be modeled as a one-class classification problem. For example, [10] proposed a one-class SVM that uses only positive data to build an SVM classifier. Such approaches differ from our method in that they do not use unlabeled data for training. Since previous results reported in [2] have already shown that they are inferior for text classification, we do not consider them in this work.

3 The Proposed Technique

Figure 1 depicts the general scenario of PU learning with a small positive training set P. Let us denote by ψ the space that represents the whole positive class, which is located above the hyperplane H2. The small positive set P occupies only a relatively small subspace SP in ψ (SP ⊂ ψ), shown as the oval region in the figure. The examples in the unlabelled set U consist of both hidden positive examples (represented by circled '+') and hidden negative examples (represented by '−'). Since P is small, and ψ contains positive examples from different web sites that present similar products in different styles and with different focuses, we may not expect the distributions of the positive examples in P and those of the hidden positive examples in U to be the same. In other words, the set of hidden positive examples that we are trying to detect may have a very small intersection with SP, or may even be disjoint from it.

Fig. 1. PU learning with a small positive training set

If we naively use the small set P as the positive training set and the entire unlabelled set U as the negative set, the resulting classifier (corresponding to hyperplane H1) will obviously perform badly in identifying the hidden positive examples in U. On the other hand, we can attempt to use more sophisticated PU learning methods to address this problem. Instead of merely using the entire unlabelled set U as the negative training data, the first step of PU learning extracts some reliable negatives from U. However, this step is actually rather difficult in our application scenario as depicted in Figure 1. Since the hidden positive examples in U are likely to have different distributions from those captured in the small positive set P, not all the training examples in U that are dissimilar to the examples in P are negative examples. As a result, the so-called reliable negative set that the first step of PU learning extracts based on dissimilarity from P would be a very noisy negative set, and therefore not very useful for building a reliable classifier.

Let us consider the possibility of extracting a set of likely positive examples (LP) from the unlabeled set U to address the problem of P being insufficiently representative of the hidden positive documents in U. Unlike P, the distribution of LP will be similar to that of the other hidden positive examples in U. As such, we can expect that a more accurate classifier can be built with the help of the set LP (together with P). Pictorially, the resulting classifier would correspond to the optimal hyperplane H2 shown in Figure 1. Instead of trying to identify a set of noisy negative documents from the unlabeled set U as existing PU learning techniques do, our proposed technique LPLP therefore focuses on extracting a set of likely positive documents from U.

While the positive documents in P and the hidden positive documents in U were not drawn from the same distribution, they should still be similar in some underlying feature dimensions (or subspaces) as they belong to the same class. For example, the printer pages from two different sites, say amazon.com and cnet.com, would share representative word features such as 'printer', 'inkjet', 'laser', 'ppm', etc., though their respective distributions may be quite different. In particular, the pages from cnet.com, whose target readers are more technically savvy, may mention keyword terms corresponding to the technical specifications of printers more frequently than the pages from amazon.com, whose primary focus is to reach out to the less technically inclined customers. However, we can safely expect the basic keywords (representative word features) that describe computer printers to be present in both cases. In this work, we therefore assume that the representative word features for the documents in P should be similar to those for the hidden positive documents in U. If we can find such a set of representative word features (RW) from the positive set P, then we can use them to extract other hidden positive documents from U.

We are now ready to present the details of our proposed technique LPLP. In Section 3.1, we first introduce a method to select the set of representative word features RW from the given positive set P. Then, in Section 3.2, we extract the likely positive documents from U and probabilistically label them based on the set RW. With the help of the resulting set LP, we employ the EM algorithm with a good initialization to build an accurate classifier to identify the hidden positive examples from U in Section 3.3.

3.1 Selecting a Set of Representative Word Features from P

As mentioned above, we expect the positive examples in P and the hidden positive examples in U to share the same representative word features, as they belong to the same class. We extract a set RW of representative word features from the positive set P containing the top k words with the highest s(w_i) scores. The scoring function s() is based on the TFIDF method [11], which gives high scores to those words that occur frequently in the positive set P but not in the whole corpus P ∪ U, since U contains many other unrelated documents. Figure 2 shows the detailed algorithm to select the keyword set RW.

1.  RW = ∅; F = ∅;
2.  For each word feature w_i ∈ P
3.    If (stopwords(w_i) != true)
4.      F = F ∪ {stemming(w_i)};
5.  total = 0;
6.  For each word feature w_i ∈ F
7.    N(w_i, P) = Σ_{j=1}^{|P|} N(w_i, d_j) / max_k(N(w_k, d_j));
8.    total += N(w_i, P);
9.  For each word feature w_i ∈ F
10.   s(w_i) = (N(w_i, P) / total) * log(|P ∪ U| / (df(w_i, P) + df(w_i, U)));
11.  Rank the word features by s(w_i) from big to small into a list L;
12.  Pr_TOP = the k-th s(w_i) in the list L, w_i ∈ F;
13.  For each w_i ∈ F
14.    If (s(w_i) >= Pr_TOP)
15.      RW = RW ∪ {w_i};

Fig. 2. Selecting a set of representative word features from P

In step 1, we initialize the representative word set RW and the unique feature set F as empty sets. After removing the stop words (step 3) and performing stemming (step 4) [11], all the word features are stored in the feature set F. For each word feature w_i in the positive set P, steps 6 to 8 compute its accumulated word frequency (in each document d_j, the frequency N(w_i, d_j) of word w_i is normalized by the maximal word frequency in d_j in step 7). Steps 9 and 10 then compute the score of each word feature, which considers both w_i's probability of belonging to the positive class and its inverted document frequency, where df(w_i, P) and df(w_i, U) are w_i's document frequencies in P and U respectively. After ranking the scores into the list L in step 11, we store in RW those word features from P with the top k scores in L.
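To make the scoring concrete, here is a minimal Python sketch of the Figure 2 procedure. It is a sketch under stated simplifications: the tokenizer, stopword list, and suffix-stripping stemmer below are crude stand-ins for proper stopword removal and stemming [11], and all function and variable names are ours, not the paper's.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "for", "is", "on", "with"}

def stem(word):
    # Crude suffix stripping; a real implementation would use a Porter-style stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def tokenize(doc):
    return [w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOPWORDS]

def select_representative_words(P, U, k):
    """P, U: lists of raw document strings. Returns the top-k word set RW."""
    p_docs = [[stem(w) for w in tokenize(d)] for d in P]   # steps 1-4: build F from P
    u_docs = [[stem(w) for w in tokenize(d)] for d in U]
    F = {w for doc in p_docs for w in doc}

    # Steps 5-8: accumulated frequency N(w, P), with each document's counts
    # normalized by that document's maximal word frequency.
    N = Counter()
    for doc in p_docs:
        counts = Counter(doc)
        max_f = max(counts.values(), default=1)
        for w, f in counts.items():
            N[w] += f / max_f
    total = sum(N.values())

    # Steps 9-10: s(w) combines the positive-class weight with an inverted
    # document frequency over the whole corpus P u U.
    def df(w, docs):
        return sum(1 for doc in docs if w in doc)
    n_docs = len(P) + len(U)
    s = {w: (N[w] / total) * math.log(n_docs / (df(w, p_docs) + df(w, u_docs)))
         for w in F}

    # Steps 11-15: keep the k highest-scoring words as RW.
    return set(sorted(F, key=lambda w: -s[w])[:k])
```

Since every word in F occurs in at least one document of P, the denominator df(w_i, P) + df(w_i, U) in step 10 is never zero.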

3.2 Identifying LP from U and Probabilistically Labeling the Documents in LP

Once the set RW of representative keywords is determined, we can regard them together as a representative document (rd) of the positive class. We then compare the similarity of each document d_i in U with rd using the cosine similarity metric [11], which automatically produces a set LP of probabilistically labeled documents with Pr(+|d_i) > 0. The algorithm for this step is given in Figure 3.

1.  LP = ∅; RU = ∅;
2.  For each d_i ∈ U
3.    sim(rd, d_i) = Σ_k (w_{rd,k} * w_{d_i,k}) / (√(Σ_k w_{rd,k}²) * √(Σ_k w_{d_i,k}²));
4.  Let m = max_i(sim(rd, d_i)), d_i ∈ U;
5.  For each d_i ∈ U
6.    If (sim(rd, d_i) > 0)
7.      Pr(+|d_i) = sim(rd, d_i) / m;
8.      LP = LP ∪ {d_i};
9.    Else
10.     RU = RU ∪ {d_i};

Fig. 3. Probabilistically labeling a set of documents

In step 1, the likely positive set LP and the remaining unlabelled set RU are both initialized as empty sets. In steps 2 to 3, each unlabeled document d_i in U is compared with rd using the cosine similarity. Step 4 stores the largest similarity value as m. For each document d_i in U, if its cosine similarity sim(rd, d_i) > 0, we assign it a probability Pr(+|d_i) based on the ratio of its similarity to m (step 7) and store it in the set LP in step 8. Otherwise, d_i is included in RU instead (step 10). The documents in RU have zero similarity with rd and can be considered a purer negative set than U. Note that in step 7, the hidden positive examples in LP will be assigned high probabilities while the negative examples in LP will be assigned very low probabilities. This is because the representative features in RW were chosen from those words that occur frequently in P but not in the whole corpus P ∪ U. As such, the hidden positive examples in LP should also contain many of the features in RW, while the negative examples in LP would contain few (if any) of the features in RW.
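Figure 3 translates almost line for line into code. The following minimal sketch assumes the documents in U are already tokenized into word lists and that RW comes from the selection step above; Counter objects serve as sparse term-frequency vectors, and the names are ours, not the paper's.

```python
import math
from collections import Counter

def cosine(vec_a, vec_b):
    # Cosine similarity between two sparse term-frequency vectors (step 3).
    dot = sum(f * vec_b.get(w, 0.0) for w, f in vec_a.items())
    norm_a = math.sqrt(sum(f * f for f in vec_a.values()))
    norm_b = math.sqrt(sum(f * f for f in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def probabilistically_label(RW, U_docs):
    """RW: representative word set; U_docs: list of token lists from U.
    Returns (LP, RU) where LP maps document index -> Pr(+|d_i)."""
    rd = Counter(RW)                      # RW viewed as one representative document
    sims = [cosine(rd, Counter(d)) for d in U_docs]
    m = max(sims, default=0.0)            # step 4: largest similarity value
    LP, RU = {}, []
    for i, sim in enumerate(sims):
        if sim > 0:
            LP[i] = sim / m               # steps 6-8: Pr(+|d_i) = sim(rd, d_i)/m
        else:
            RU.append(i)                  # step 10: zero similarity -> purer negatives
    return LP, RU
```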

3.3 The Classification Algorithm

Next, we employ the naïve Bayesian framework to identify the hidden positives in U. Given a set of training documents D, each document is considered a list of words, and each word in a document is from the vocabulary V = <w_1, w_2, ..., w_|V|>. We also have a set of predefined classes C = {c_1, c_2, ..., c_|C|}. For simplicity, we will consider two-class classification in this discussion, i.e. C = {c_1, c_2}, with c_1 = '+' and c_2 = '−'. To perform classification for a document d_i, we need to compute the posterior probability Pr(c_j|d_i), c_j ∈ {+, −}. Based on the Bayesian probability and the multinomial model [12], we have

  Pr(c_j) = (Σ_{i=1}^{|D|} Pr(c_j|d_i)) / |D|,   (1)

and, with Laplacian smoothing,

  Pr(w_t|c_j) = (1 + Σ_{i=1}^{|D|} N(w_t, d_i) Pr(c_j|d_i)) / (|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} N(w_s, d_i) Pr(c_j|d_i)).   (2)

Here, Pr(c_j|d_i) ∈ {0, 1} depending on the class label of the document. Assuming that the probabilities of words are independent given the class, we obtain the naïve Bayesian classifier:

  Pr(c_j|d_i) = (Pr(c_j) Π_{k=1}^{|d_i|} Pr(w_{d_i,k}|c_j)) / (Σ_{r=1}^{|C|} Pr(c_r) Π_{k=1}^{|d_i|} Pr(w_{d_i,k}|c_r)).   (3)

In the naïve Bayesian classifier, the class with the highest Pr(c_j|d_i) is assigned as the class of the document. The NB method is known to be an effective technique for text classification even when some of its basic assumptions (e.g. class conditional independence) are violated [13], [14], [1].

The Expectation-Maximization (EM) algorithm is a popular class of iterative algorithms for problems with incomplete data. It iterates over two basic steps, the Expectation step and the Maximization step. The Expectation step basically fills in the missing data, while the Maximization step then estimates the parameters. When applying EM here, equation (3) comprises the Expectation step, while equations (1) and (2) are used for the Maximization step. Note that the probability of the class given the document now takes a value in [0, 1] instead of {0, 1}. The ability of EM to work with missing data is exactly what we need here. Let us regard all the positive documents as having the positive class value '+'. We want to know the class value of each document in the unlabeled set U. EM can help to properly assign a probabilistic class label to each document d_i in the unlabeled set, i.e., Pr(+|d_i) or Pr(−|d_i). Theoretically, in the EM algorithm, the probabilities of the documents in U will converge after a number of iterations [4]. However, a good initialization is important in order to find a good maximum of the likelihood function. For example, if we directly use P as the positive class and U as the negative class (initially), then the EM algorithm will not build an accurate classifier, as the negative class would be too noisy (as explained previously). Thus, in our algorithm, after extracting the likely positive set LP, we re-initialize the EM algorithm by treating the probabilistically labeled LP (with/without P) as the positive documents. The resulting classifier is more accurate because 1) LP has distributions similar to those of the other hidden positive documents in U, and 2) the remaining unlabeled set RU is also much purer than U as a negative set.

The detailed LPLP algorithm is given in Figure 4.

1.  For each d_j ∈ RU
2.    Pr(+|d_j) = 0;
3.    Pr(−|d_j) = 1;
4.  PS = LP ∪ P (or LP);
5.  For each d_j ∈ PS
6.    If d_j ∈ P
7.      Pr(+|d_j) = 1;
8.      Pr(−|d_j) = 0;
9.    Else
10.     Pr(+|d_j) = sim(rd, d_j) / m;
11.     Pr(−|d_j) = 0;
12. Build an NB classifier NB-C using PS and RU based on equations (1) and (2);
13. While the classifier parameters change
14.   For each d_i ∈ PS ∪ RU
15.     Compute Pr(+|d_i) and Pr(−|d_i) using NB-C, i.e., equation (3);
16.   Update Pr(c_j) and Pr(w_t|c_j) using equations (1) and (2) with the new probabilities produced in step 15 (a new NB-C is built in the process);

Fig. 4. The detailed LPLP algorithm

The inputs to the algorithm are LP, RU and P. Steps 1 to 3 initialize the probabilities for each document d_j in RU, which are all treated as negative documents initially. Step 4 sets the positive set PS; there are two possible ways to achieve this: we either (1) combine LP and P as PS, or (2) use only LP as PS. We will evaluate the effect of the inclusion of P in PS in the next section. Steps 5 to 11 assign the initial probabilities to the documents in P (if P is used) and in LP. Each document in P is assigned to the positive class, while each document in LP is probabilistically labeled using the algorithm in Figure 3. Using PS and RU, an NB classifier can be built (step 12). This classifier is then applied to the documents in (LP ∪ RU) to obtain the posterior probabilities Pr(+|d_i) and Pr(−|d_i) for each document (step 15). We can then iteratively employ the revised posterior probabilities to build a new (and better) NB classifier (step 16). The EM process continues until the parameters of the NB classifier converge.
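For concreteness, the EM loop of Figure 4 around a multinomial naïve Bayes classifier (equations (1)-(3)) can be sketched as follows. Two simplifications are ours, not the paper's: each document's negative weight is taken as 1 − Pr(+|d) in the standard NB-EM fashion, and a fixed iteration count stands in for the test that the classifier parameters have stopped changing.

```python
import math
from collections import Counter

def train_nb(docs, pr_pos, vocab):
    """M-step: class priors (eq. 1) and Laplacian-smoothed word probabilities (eq. 2).
    docs: list of token lists; pr_pos[i] = Pr(+|d_i)."""
    n = len(docs)
    prior = {"+": sum(pr_pos) / n, "-": sum(1 - p for p in pr_pos) / n}
    wc = {"+": Counter(), "-": Counter()}
    for doc, p in zip(docs, pr_pos):
        for w, f in Counter(doc).items():
            wc["+"][w] += f * p          # word counts weighted by Pr(+|d_i)
            wc["-"][w] += f * (1 - p)    # complementary negative weight (our choice)
    cond = {}
    for c in ("+", "-"):
        denom = len(vocab) + sum(wc[c].values())
        cond[c] = {w: (1 + wc[c][w]) / denom for w in vocab}
    return prior, cond

def posterior_pos(doc, prior, cond):
    """E-step for one document: Pr(+|d) via equation (3), computed in log space."""
    logp = {c: math.log(prior[c]) + sum(math.log(cond[c][w]) for w in doc)
            for c in ("+", "-")}
    m = max(logp.values())
    z = sum(math.exp(v - m) for v in logp.values())
    return math.exp(logp["+"] - m) / z

def lplp_em(P_docs, LP_docs, LP_probs, RU_docs, iters=10):
    """Figure 4 sketch: PS = P plus LP as positives, RU as negatives, then EM."""
    docs = P_docs + LP_docs + RU_docs
    vocab = {w for d in docs for w in d}
    # Steps 1-11: P gets Pr(+|d)=1, LP keeps its probabilistic labels, RU gets 0.
    pr_pos = [1.0] * len(P_docs) + list(LP_probs) + [0.0] * len(RU_docs)
    prior, cond = train_nb(docs, pr_pos, vocab)              # step 12: initial NB-C
    for _ in range(iters):                                   # steps 13-16
        pr_pos = [posterior_pos(d, prior, cond) for d in docs]
        prior, cond = train_nb(docs, pr_pos, vocab)
    return prior, cond
```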

4 Evaluation Experiments

In this section, we evaluate the proposed LPLP technique under different settings and compare it with existing methods, namely Roc-SVM [2] and PEBL [3]. Roc-SVM is available on the Web as a part of the LPU system. We implemented PEBL ourselves as it is not publicly available. The results of S-EM [1] are not included here because its performance was generally very poor due to its reliance on the similarity of the positive documents in P and in U, as expected.

4.1 Datasets

Our evaluation experiments were done using product Web pages from 5 commercial Web sites: Amazon, CNet, PCMag, J&R and ZDnet. These sites contain many introduction/description pages for different types of products. The pages were cleaned using the web page cleaning technique in [15], i.e., navigation links and advertisements were detected and eliminated. The data contained web pages of the following product categories: Notebook, Digital Camera, Mobile Phone, Printer and TV. Table 1 lists the number of pages from each site and the corresponding product categories (or classes). In this work, we treat each page as a text document, and we do not use hyperlinks and images for classification.

Table 1. Number of Web pages and their classes (columns: Amazon, CNet, J&R, PCMag, ZDnet; rows: Notebook, Camera, Mobile, Printer, TV).

Note that we did not use standard text collections such as the Reuters and 20 Newsgroups data in our experiments, as we want to highlight the performance of our approach on data sets that have different positive distributions in P and in U.

4.2 Experiment Settings

As mentioned, the number of available positively labeled documents in practice can be quite small, either because there were few documents to start with or because it is tedious and expensive to hand-label training examples on a large scale. To reflect this constraint, we experimented with different numbers of (randomly selected) positive documents in P, i.e. |P| = 5, 15, 25 and allpos. Here allpos means that all documents of a particular product from a Web site were used. The purpose of these experiments is to investigate the relative effectiveness of our proposed method for both small and large positive sets. We conducted a comprehensive set of experiments using all the possible P and U combinations. That is, we selected every entry (one type of product from each Web site) in Table 1 as the positive set P and used each of the other 4 Web sites as the unlabeled set U. Three products were omitted in our experiments because their |P| < 10, namely Mobile phone in J&R (9 pages), TV in PCMag (no pages), and TV in ZDnet (no pages). A grand total of 88 experiments were conducted using all the possible P and U combinations of the 5 sites. Due to the large number of combinations, the results reported below are the averages over all the combinations. To study the sensitivity of the number of representative word features used in identifying likely positive examples, we also performed a series of experiments using different numbers of representative features, i.e. k = 5, 10, 15 and 20, in our algorithm.

4.3 Experimental Results

Since our task is to identify or retrieve positive documents from the unlabeled set U, it is appropriate to use the F value to evaluate the performance of the final classifier. The F value is the harmonic mean of precision (p) and recall (r), i.e. F = 2pr/(p + r). When either p or r is small, the F value will be small; only when both of them are large will the F value be large. This is suitable for our purpose, as we do not want to identify positive documents with either too small a precision or too small a recall. Note that in our experimental results, the reported F values give the classification (retrieval) results for the positive documents in U, as U is also the test set.
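As a quick reference, the F value computation from true positives (tp), false positives (fp) and false negatives (fn) is only a few lines; this is a generic sketch, not code from the paper.

```python
def f_value(tp, fp, fn):
    """F = 2pr/(p + r), the harmonic mean of precision p and recall r."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```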

Let us first show the results of our proposed technique LPLP under different experimental settings. We will then compare it with the two existing techniques. The bar chart in Figure 5 shows the F values (Y-axis) of LPLP using different numbers of positive documents (X-axis) and different numbers of representative words (4 data series). Recall that we presented two options for constructing the positive set PS in step 4 of the LPLP algorithm (Figure 4). The first option is to add the extracted likely positive documents (LP) to the original set of positive documents P, represented in Figure 5 by "with P". The second option is to use only the extracted likely positive documents as the positive data in learning, i.e., dropping the original positive set P (since it is not representative of the hidden positive documents in U). This option is denoted by "without P" in Figure 5.

Fig. 5. F values of LPLP with different numbers of positive documents

Inclusion of P for constructing PS: If only a small number of positive documents is available (|P| = 5, 15 and 25), we found that using option 1 (with P) to construct the positive set for the classifier is better than using option 2 (without P), as expected. Interestingly, however, if there is a large number of positive documents (allpos in Figure 5), then option 1 is actually inferior to option 2. The reason is that the use of a big positive set P, which is not representative of the positive documents in U, would introduce too much negative influence on the final classifier (many hidden positive examples in U will be classified as the negative class). However, when the given positive set P is small, its potential negative influence is much smaller, and it therefore helps to strengthen the likely positive documents by providing more positive data. This is a subtle and rather unexpected trade-off.

Number of positive documents in P: From Figure 5, we also observe that the number of given positive documents in P does not influence the final results a great deal. The reason for this is that the computed likely positive documents from U are actually more effective positive documents for learning than the original positive documents in P. This is a very compelling advantage of our proposed technique, as it means that the user does not need to label or find a large number of positive examples for effective learning. In fact, as discussed above, we also noticed that even without using any original positive document in P, the results were very good as well.

Number of representative word features: The results in Figure 5 also show that there is no need to use many representative words for detecting positive documents. In general, 5-15 representative words suffice. Including less representative word features beyond the top k most representative ones would introduce unnecessary noise in identifying the likely positive documents in U.

Next, we compare the results of our LPLP technique with those of the two best existing techniques mentioned earlier, namely Roc-SVM [2] and PEBL [3]. Figure 6 shows two series of results. The first series, marked P, shows the classification results of all three methods using all positive documents (allpos), without the use of the likely positive documents LP as suggested in this paper. In other words, learning was done using only P and U. Note that, for the strictest comparison, what we show here are the best possible results for Roc-SVM and PEBL, which may not be obtainable in practice because it is hardly possible to determine which SVM iteration gives the best results in these algorithms (both the Roc-SVM and PEBL algorithms run SVM many times). In fact, their results at convergence were actually much worse. We can see that PEBL performed better than both LPLP and Roc-SVM. However, the absolute F value of PEBL is still very low (0.54). Note also that, because of the use of allpos for training, the LPLP result here was obtained without using the likely positive set LP (it is now the standard EM algorithm), hence it was unable to perform as well as it should have.

The second series in Figure 6 shows the comparative results of using the extracted likely positive documents instead of P for learning. Here, our LPLP algorithm performs dramatically better (F = 0.94) even against the best possible results of PEBL (F = 0.84) and Roc-SVM (F = 0.81). Note that here PEBL and Roc-SVM also use the likely positive documents LP extracted from U by our method (we boosted PEBL and Roc-SVM for the purpose of comparison). The likely positives were identified from U using 10 representative words selected from P. Unlike our LPLP algorithm, both Roc-SVM and PEBL do not take probabilistic labels, only binary labels. As such, for these two algorithms, we chose the likely positive documents from U by requiring each document d to contain at least 5 (out of the 10) selected representative words. All the likely positive documents so identified were then treated as positive documents, i.e., Pr(+|d) = 1. We also tried other numbers of representative words in RW and found that 5 words performed well for these two algorithms with our datasets. We can see that with the use of the likely positives (set LP) identified by our method (instead of P), the classification results of these two existing algorithms also improved dramatically. In fact, by using LP instead of P, the previously weaker Roc-SVM caught up so substantially that the best possible result of PEBL is now only slightly better than that of Roc-SVM.

Fig. 6. F values of LPLP and the best possible results of Roc-SVM and PEBL using all positive documents

Fig. 7. F values of LPLP and the best possible results of Roc-SVM and PEBL using P together with LP

Finally, in Figure 7, we show the comparative results when the number of positive documents is small, which is more often than not the case in practice. Again, we see that our new method LPLP performed much better than the best possible results of the two existing methods Roc-SVM and PEBL (which may not be obtainable in practice, as explained earlier) when there were only 5, 15, or 25 positive documents in P. As explained earlier, including P together with LP (for all three techniques) gave better results when P is small.

In summary, the results in Figures 6 and 7 show that the likely positive documents LP extracted from U can be used to help boost the performance of classification techniques for PU learning problems. In particular, the LPLP algorithm benefited the most and performed the best. This is probably because of its ability to handle probabilistic labels, which makes it better equipped to take advantage of the probabilistic (and hence potentially noisy) LP set than the SVM-based approaches.

5 Conclusions

In many real-world classification applications, it is often the case that not only are the negative training examples hard to come by, but the number of positive examples available for learning can also be fairly limited, as it is often tedious and expensive to hand-label large amounts of training data. To address the lack of negative examples, many PU learning methods have been proposed to learn from a pool of positive data (P) without any negative data but with the help of unlabeled data (U). However, PU learning methods still do not work well when the size of the positive set is small. In this paper, we addressed this oft-overlooked issue for PU learning, where the number of positive examples is quite small. In addition, we considered the challenging case where the positive examples in P and the hidden positive examples in U may not even be drawn from the same distribution. Existing techniques have been found to perform poorly in this setting. We proposed an effective technique called LPLP that can learn effectively from positive and unlabeled examples with a small positive set for document classification. Instead of identifying a set of reliable negative documents from the unlabeled set U as existing PU techniques do, our new method focuses on extracting a set of likely positive documents from U. In this way, the learning relies less on the limitations associated with the original positive set P, such as its limited size and potential distribution differences. Augmented by the extracted probabilistic LP set, our LPLP algorithm can build a much more robust classifier. We reported experimental results on product page classification that confirm that our new technique is indeed much more effective than existing methods in this challenging classification problem. In our future work, we plan to generalize our current approach to similar classification problems beyond document classification.

References

1. Liu, B., Lee, W., Yu, P., Li, X.: Partially Supervised Classification of Text Documents. In: ICML (2002)
2. Li, X., Liu, B.: Learning to Classify Texts Using Positive and Unlabeled Data. In: IJCAI (2003)
3. Yu, H., Han, J., Chang, K.C.-C.: PEBL: Positive Example Based Learning for Web Page Classification Using SVM. In: KDD (2002)
4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society (1977)
5. Denis, F.: PAC Learning from Positive Statistical Queries. In: ALT (1998)
6. Muggleton, S.: Learning from Positive Data. In: Proceedings of the Sixth International Workshop on Inductive Logic Programming. Springer, Heidelberg (1997)
7. Rocchio, J.: Relevance Feedback in Information Retrieval. In: Salton, G. (ed.) The SMART Retrieval System: Experiments in Automatic Document Processing (1971)
8. Yu, H.: General MC: Estimating Boundary of Positive Class from Small Positive Data. In: ICDM (2003)
9. Fung, G.P.C., et al.: Text Classification without Negative Examples Revisit. IEEE Transactions on Knowledge and Data Engineering 18(1), 6-20 (2006)
10. Schölkopf, B., et al.: Estimating the Support of a High-Dimensional Distribution. Neural Computation 13(7) (2001)
11. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval (1986)
12. McCallum, A., Nigam, K.: A Comparison of Event Models for Naïve Bayes Text Classification. In: AAAI Workshop on Learning for Text Categorization (1998)
13. Lewis, D.D.: A Sequential Algorithm for Training Text Classifiers: Corrigendum and Additional Data. SIGIR Forum (1995)
14. Nigam, K., et al.: Text Classification from Labeled and Unlabeled Documents Using EM. Machine Learning 39(2-3) (2000)
15. Yi, L., Liu, B., Li, X.: Eliminating Noisy Information in Web Pages for Data Mining. In: KDD (2003)


More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies

Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies Deep Classfer: Automatcally Categorzng Search Results nto Large-Scale Herarches Dkan Xng 1, Gu-Rong Xue 1, Qang Yang 2, Yong Yu 1 1 Shangha Jao Tong Unversty, Shangha, Chna {xaobao,grxue,yyu}@apex.sjtu.edu.cn

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION 1 THE PUBLISHING HOUSE PROCEEDINGS OF THE ROMANIAN ACADEMY, Seres A, OF THE ROMANIAN ACADEMY Volume 4, Number 2/2003, pp.000-000 A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION Tudor BARBU Insttute

More information

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK L-qng Qu, Yong-quan Lang 2, Jng-Chen 3, 2 College of Informaton Scence and Technology, Shandong Unversty of Scence and Technology,

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Face Detection with Deep Learning

Face Detection with Deep Learning Face Detecton wth Deep Learnng Yu Shen Yus122@ucsd.edu A13227146 Kuan-We Chen kuc010@ucsd.edu A99045121 Yzhou Hao y3hao@ucsd.edu A98017773 Mn Hsuan Wu mhwu@ucsd.edu A92424998 Abstract The project here

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE

ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE Yordzhev K., Kostadnova H. Інформаційні технології в освіті ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE Yordzhev K., Kostadnova H. Some aspects of programmng educaton

More information

Data Mining: Model Evaluation

Data Mining: Model Evaluation Data Mnng: Model Evaluaton Aprl 16, 2013 1 Issues: Evaluatng Classfcaton Methods Accurac classfer accurac: predctng class label predctor accurac: guessng value of predcted attrbutes Speed tme to construct

More information

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816

More information

Pictures at an Exhibition

Pictures at an Exhibition 1 Pctures at an Exhbton Stephane Kwan and Karen Zhu Department of Electrcal Engneerng Stanford Unversty, Stanford, CA 9405 Emal: {skwan1, kyzhu}@stanford.edu Abstract An mage processng algorthm s desgned

More information

An Evaluation of Divide-and-Combine Strategies for Image Categorization by Multi-Class Support Vector Machines

An Evaluation of Divide-and-Combine Strategies for Image Categorization by Multi-Class Support Vector Machines An Evaluaton of Dvde-and-Combne Strateges for Image Categorzaton by Mult-Class Support Vector Machnes C. Demrkesen¹ and H. Cherf¹, ² 1: Insttue of Scence and Engneerng 2: Faculté des Scences Mrande Galatasaray

More information

Multi-Criteria-based Active Learning for Named Entity Recognition

Multi-Criteria-based Active Learning for Named Entity Recognition Mult-Crtera-based Actve Learnng for Named Entty Recognton Dan Shen 1 Je Zhang Jan Su Guodong Zhou Chew-Lm Tan Insttute for Infocomm Technology 21 Heng Mu Keng Terrace Sngapore 119613 Department of Computer

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Fast Feature Value Searching for Face Detection

Fast Feature Value Searching for Face Detection Vol., No. 2 Computer and Informaton Scence Fast Feature Value Searchng for Face Detecton Yunyang Yan Department of Computer Engneerng Huayn Insttute of Technology Hua an 22300, Chna E-mal: areyyyke@63.com

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search

Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search Can We Beat the Prefx Flterng? An Adaptve Framework for Smlarty Jon and Search Jannan Wang Guolang L Janhua Feng Department of Computer Scence and Technology, Tsnghua Natonal Laboratory for Informaton

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study

Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study Arabc Text Classfcaton Usng N-Gram Frequency Statstcs A Comparatve Study Lala Khresat Dept. of Computer Scence, Math and Physcs Farlegh Dcknson Unversty 285 Madson Ave, Madson NJ 07940 Khresat@fdu.edu

More information

A Multi-step Strategy for Shape Similarity Search In Kamon Image Database

A Multi-step Strategy for Shape Similarity Search In Kamon Image Database A Mult-step Strategy for Shape Smlarty Search In Kamon Image Database Paul W.H. Kwan, Kazuo Torach 2, Kesuke Kameyama 2, Junbn Gao 3, Nobuyuk Otsu 4 School of Mathematcs, Statstcs and Computer Scence,

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

A fault tree analysis strategy using binary decision diagrams

A fault tree analysis strategy using binary decision diagrams Loughborough Unversty Insttutonal Repostory A fault tree analyss strategy usng bnary decson dagrams Ths tem was submtted to Loughborough Unversty's Insttutonal Repostory by the/an author. Addtonal Informaton:

More information

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105

More information

Semantic Image Retrieval Using Region Based Inverted File

Semantic Image Retrieval Using Region Based Inverted File Semantc Image Retreval Usng Regon Based Inverted Fle Dengsheng Zhang, Md Monrul Islam, Guoun Lu and Jn Hou 2 Gppsland School of Informaton Technology, Monash Unversty Churchll, VIC 3842, Australa E-mal:

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information