Learning to Classify Documents with Only a Small Positive Training Set
Xiao-Li Li 1, Bing Liu 2, and See-Kiong Ng 1

1 Institute for Infocomm Research, Heng Mui Keng Terrace, Singapore
2 Department of Computer Science, University of Illinois at Chicago, IL
xlli@i2r.a-star.edu.sg, liub@cs.uic.edu, skng@i2r.a-star.edu.sg

Abstract. Many real-world classification applications fall into the class of positive and unlabeled (PU) learning problems. In many such applications, not only could the negative training examples be missing, the number of positive examples available for learning may also be fairly limited due to the impracticality of hand-labeling a large number of training examples. Current PU learning techniques have focused mostly on identifying reliable negative instances from the unlabeled set U. In this paper, we address the oft-overlooked PU learning problem when the number of training examples in the positive set P is small. We propose a novel technique LPLP (Learning from Probabilistically Labeled Positive examples) and apply the approach to classify product pages from commercial websites. The experimental results demonstrate that our approach outperforms existing methods significantly, even in the challenging cases where the positive examples in P and the hidden positive examples in U were not drawn from the same distribution.

1 Introduction

Traditional supervised learning techniques typically require a large number of labeled examples to learn an accurate classifier. However, in practice, it can be an expensive and tedious process to obtain the class labels for large sets of training examples. One way to reduce the amount of labeled training data needed is to develop classification algorithms that can learn from a set of labeled positive examples augmented with a set of unlabeled examples. That is, given a set P of positive examples of a particular class and a set U of unlabeled examples (which contains both hidden positive and hidden negative examples), we build a classifier using P and U to classify the data in U as well as future test data. We call this the PU learning problem.
Several innovative techniques (e.g., [1], [2], [3]) have been proposed to solve the PU learning problem recently. All of these techniques have focused on addressing the lack of labeled negative examples in the training data. It was assumed that there was a sufficiently large set of positive training examples, and also that the positive examples in P and the hidden positive examples in U were drawn from the same distribution. However, in practice, obtaining a large number of positive examples can be rather difficult in many real applications. Oftentimes, we have to make do with a fairly small set of positive training data. In fact, the small positive set may not even adequately represent

J.N. Kok et al. (Eds.): ECML 2007, LNAI 4701. Springer-Verlag Berlin Heidelberg 2007
the whole positive class, as it is highly likely that there could be hidden positives in U that may not be similar to those few examples in the small positive set P. Moreover, the examples in the positive set P and the hidden positive examples in the unlabeled set U may not even be generated or drawn from the same distribution. A classifier trained merely on the few available examples in P would thus be incompetent in recognizing all the hidden positives in U as well as those in the test sets.

In this work, we consider the problem of learning to classify documents with only a small positive training set. Our work was motivated by a real-life business intelligence application of classifying web pages of product information. The richness of information easily available on the World Wide Web has made it routine for companies to conduct business intelligence by searching the Internet for information on related products. For example, a company that sells computer printers may want to do a product comparison among the various printers currently available in the market. One can first collect sample printer pages by crawling through all product pages from a consolidated e-commerce web site (e.g., amazon.com) and then hand-label those pages containing printer product information to construct the set P of positive examples. Next, to get more product information, one can then crawl through all the product pages from other web sites (e.g., cnet.com) as U. Ideally, PU learning techniques can then be applied to classify all pages in U into printer pages and non-printer pages. However, we found that while the printer product pages from two websites (say, amazon.com and cnet.com) do indeed share many similarities, they can also be quite distinct, as the different web sites invariably present their products (even similar ones) in different styles and have different focuses.
As such, directly applying existing methods would give very poor results because 1) the small positive set P obtained from one site contains only tens of web pages (usually fewer than 30) and therefore does not adequately represent the whole positive class, and 2) the features from the positive examples in P and the hidden positive examples in U are not generated from the same distribution because they were from different web sites. In this paper, we tackle the challenge of constructing a reliable document (web page) classifier based on only a few labeled positive pages from a single web site and then using it to automatically extract the hidden positive pages from different web sites (i.e., the unlabeled sets). We propose an effective technique called LPLP (Learning from Probabilistically Labeled Positive examples) to perform this task. Our proposed technique LPLP is based on probabilistically labeling training examples from U and the EM algorithm [4]. The experimental results showed that LPLP significantly outperformed existing PU learning methods.

2 Related Works

A theoretical study of PAC learning from positive and unlabeled examples under the statistical query model was first reported in [5]. Muggleton [6] followed by studying the problem in a Bayesian framework where the distribution of functions and examples are assumed known. [1] reported sample complexity results and provided theoretical elaborations on how the problem could be solved. Subsequently, a number of practical PU learning algorithms were proposed [1], [3] and [2]. These PU learning algorithms all conformed to the theoretical results presented in [1] by following a
common two-step strategy, namely: (1) identifying a set of reliable negative documents from the unlabeled set; and then (2) building a classifier using EM or SVM iteratively. The specific differences between the various algorithms in these two steps are as follows. The S-EM method proposed in [1] was based on naïve Bayesian classification and the EM algorithm [4]. The main idea was to first use a spying technique to identify some reliable negative documents from the unlabeled set, and then to run EM to build the final classifier. The PEBL method [3] uses a different method (1-DNF) for identifying reliable negative examples and then runs SVM iteratively for classifier building. More recently, [2] reported a technique called Roc-SVM. In this technique, reliable negative documents were extracted by using the information retrieval technique Rocchio [7]. Again, SVM is used in the second step. A classifier selection criterion is also proposed to pick a good classifier from the iterations of SVM. Despite the differences in algorithmic details, the above methods all focused on extracting reliable negative instances from the unlabeled set.

More related to our current work was the recent work by Yu [8], which proposed to estimate the boundary for the positive class. However, the number of positive examples it required was around 30% of the whole data, which is still too large for many practical applications. In [9], a method called PN-SVM was proposed to deal with the case when the positive set is small. However, it (like all the other existing algorithms of PU learning) relied on the assumption that the positive examples in P and the hidden positives in U were all generated from the same distribution. Our LPLP method proposed in this paper is the first to address these common weaknesses of current PU learning methods, including handling the challenging cases where the positive examples in P and the hidden positive examples in U were not drawn from the same distribution.
Note that the problem could potentially be modeled as a one-class classification problem. For example, in [10], a one-class SVM that uses only positive data to build an SVM classifier was proposed. Such approaches are different from our method in that they do not use unlabeled data for training. As previous results reported in [2] have already shown that they are inferior for text classification, we do not consider them in this work.

3 The Proposed Technique

Figure 1 depicts the general scenario of PU learning with a small positive training set P. Let us denote by ψ the space that represents the whole positive class, which is located above the hyperplane H2. The small positive set P only occupies a relatively small subspace SP in ψ (SP ⊆ ψ), shown as the oval region in the figure. The examples in the unlabeled set U consist of both hidden positive examples (represented by circled "+") and hidden negative examples (represented by "-"). Since P is small, and ψ contains positive examples from different web sites that present similar products in different styles and with different focuses, we may not expect the distributions of the positive examples in P and those of the hidden positive examples in U to be the same. In other words, the set of hidden positive examples that we are trying to detect may have a very small intersection with SP, or may even be disjoint from it. If we naively use the small set P as the positive training set and the entire unlabeled set U as the negative set, the
resulting classifier (corresponding to hyperplane H1) will obviously perform badly in identifying the hidden positive examples in U. On the other hand, we can attempt to use the more sophisticated PU learning methods to address this problem. Instead of merely using the entire unlabeled set U as the negative training data, the first step of PU learning extracts some reliable negatives from U. However, this step is actually rather difficult in our application scenario as depicted in Figure 1. Since the hidden positive examples in U are likely to have different distributions from those captured in the small positive set P, not all the training examples in U that are dissimilar to examples in P are negative examples. As a result, the so-called reliable negative set that the first step of PU learning extracts based on dissimilarity from P would be a very noisy negative set, and therefore not very useful for building a reliable classifier.

Fig. 1. PU learning with a small positive training set

Let us consider the possibility of extracting a set of likely positive examples (LP) from the unlabeled set U to address the problem of P being insufficiently representative of the hidden positive documents in U. Unlike P, the distribution of LP will be similar to that of the other hidden positive examples in U. As such, we can expect that a more accurate classifier can be built with the help of the set LP (together with P). Pictorially, the resulting classifier would correspond to the optimal hyperplane H2 shown in Figure 1. Instead of trying to identify a set of noisy negative documents from the unlabeled set U as existing PU learning techniques do, our proposed technique LPLP therefore focuses on extracting a set of likely positive documents from U. While the positive documents in P and the hidden positive documents in U were not drawn from the same distribution, they should still be similar in some underlying feature dimensions (or subspaces) as they belong to the same class.
For example, the printer pages from two different sites, say amazon.com and cnet.com, would share representative word features such as "printer", "inkjet", "laser", "ppm", etc., though their respective distributions may be quite different. In particular, the pages from cnet.com, whose target readers are more technically savvy, may contain more frequent mentions of keyword terms that correspond to the technical specifications of printers than the pages from amazon.com, whose primary focus is to reach out to the less technically inclined customers. However, we can safely expect that the basic keywords (representative word features) that describe computer printers should be
present in both cases. In this work, we therefore assume that the representative word features for the documents in P should be similar to those for the hidden positive documents in U. If we can find such a set of representative word features (RW) from the positive set P, then we can use them to extract other hidden positive documents from U.

We are now ready to present the details of our proposed technique LPLP. In Section 3.1, we first introduce a method to select the set of representative word features RW from the given positive set P. Then, in Section 3.2, we extract the likely positive documents from U and probabilistically label them based on the set RW. With the help of the resulting set LP, we employ the EM algorithm with a good initialization to build an accurate classifier to identify the hidden positive examples from U in Section 3.3.

3.1 Selecting a Set of Representative Word Features from P

As mentioned above, we expect the positive examples in P and the hidden positive examples in U to share the same representative word features as they belong to the same class. We extract a set RW of representative word features from the positive set P containing the top k words with the highest s(wi) scores. The scoring function s() is based on the TFIDF method [11], which gives high scores to those words that occur frequently in the positive set P but not in the whole corpus P ∪ U, since U contains many other unrelated documents.

1. RW = ∅, F = ∅;
2. For each word feature w ∈ P
3.   If (stopwords(w) != true)
4.     F = F ∪ {stemming(w)};
5. total = 0;
6. For each word feature wi ∈ F
7.   N(wi, P) = Σ_{dj∈P} N(wi, dj) / max_w(N(w, dj));
8.   total += N(wi, P);
9. For each word feature wi ∈ F
10.   s(wi) = (N(wi, P) / total) * log(|P ∪ U| / (df(wi, P) + df(wi, U)));
11. Rank the word features' s(wi) from big to small into a list L;
12. PrTOP = the k-th s(wi) in the list L, wi ∈ F;
13. For each wi ∈ F
14.   If (s(wi) >= PrTOP)
15.     RW = RW ∪ {wi};

Fig. 2. Selecting a set of representative word features from P
Figure 2 shows the detailed algorithm to select the keyword set RW. In step 1, we initialize the representative word set RW and the unique feature set F as empty sets. After removing the stop words (step 3) and performing stemming (step 4) [11], all the word features are stored in the feature set F. For each word feature wi in the positive set P, steps 6 to 8 compute the accumulated word frequency (in each document dj, the word wi's frequency N(wi, dj) is normalized by the maximal word frequency of dj in step 7). Steps 9 to 10 then compute the score of each word feature, which considers both wi's probability of belonging to the positive class and its inverted document frequency, where df(wi, P) and df(wi, U) are wi's document frequencies in P and U respectively. After ranking the scores into the rank list L in step 11, we store into RW those word features from P with the top k scores in L.
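The scoring procedure of Figure 2 can be sketched in Python as follows. This is a minimal illustration under our own simplifying assumptions: documents are given as token lists, a toy stop list and a crude suffix-stripping function stand in for a real stop-word list and stemmer, and the function name `select_representative_words` is ours, not the paper's.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in"}  # illustrative stop list only

def stem(word):
    # placeholder stemmer; a real system would use e.g. Porter stemming [11]
    return word.lower().rstrip("s")

def select_representative_words(P, U, k):
    """Score words by normalized frequency in P times an inverted document
    frequency over P+U, and return the top-k as the set RW (cf. Fig. 2)."""
    # steps 2-8: accumulate normalized word frequencies over positive docs
    freq = Counter()
    for doc in P:
        counts = Counter(stem(w) for w in doc if w.lower() not in STOPWORDS)
        if not counts:
            continue
        max_count = max(counts.values())
        for w, n in counts.items():
            freq[w] += n / max_count          # step 7: normalize by max freq in doc
    total = sum(freq.values())                # step 8

    def df(w, docs):                          # document frequency of w
        return sum(1 for doc in docs if w in {stem(t) for t in doc})

    n_docs = len(P) + len(U)
    scores = {}
    for w, f in freq.items():                 # step 10: TFIDF-style score
        scores[w] = (f / total) * math.log(n_docs / (df(w, P) + df(w, U)))
    ranked = sorted(scores, key=scores.get, reverse=True)  # step 11
    return set(ranked[:k])                    # steps 12-15
```

On toy data where "printer" dominates P and never appears in U, it surfaces as the top representative word, while words shared with U are penalized by the log factor.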
3.2 Identifying LP from U and Probabilistically Labeling the Documents in LP

Once the set RW of representative keywords is determined, we can regard them together as a representative document (rd) of the positive class. We then compare the similarity of each document di in U with rd using the cosine similarity metric [11], which automatically produces a set LP of probabilistically labeled documents with Pr(+|di) > 0. The algorithm for this step is given in Figure 3.

1. LP = ∅; RU = ∅;
2. For each di ∈ U
3.   sim(rd, di) = Σ_k (w_{rd,k} * w_{di,k}) / (sqrt(Σ_k w_{rd,k}²) * sqrt(Σ_k w_{di,k}²));
4. Let m = max_{di∈U} sim(rd, di);
5. For each di ∈ U
6.   If (sim(rd, di) > 0)
7.     Pr(+|di) = sim(rd, di) / m;
8.     LP = LP ∪ {di};
9.   Else
10.    RU = RU ∪ {di};

Fig. 3. Probabilistically labeling a set of documents

In step 1, the likely positive set LP and the remaining unlabeled set RU are both initialized as empty sets. In steps 2 to 3, each unlabeled document di in U is compared with rd using the cosine similarity. Step 4 stores the largest similarity value as m. For each document di in U, if its cosine similarity sim(rd, di) > 0, we assign a probability Pr(+|di) based on the ratio of its similarity and m (step 7), and we store it into the set LP in step 8. Otherwise, di is included in RU instead (step 10). The documents in RU have zero similarity with rd and can be considered a purer negative set than U. Note that in step 7, the hidden positive examples in LP will be assigned high probabilities while the negative examples in LP will be assigned very low probabilities. This is because the representative features in RW were chosen based on those words that occur frequently in P but not in the whole corpus P ∪ U. As such, the hidden positive examples in LP should also contain many of the features in RW, while the negative examples in LP would contain few (if any) of the features in RW.

3.3 The Classification Algorithm

Next, we employ the naïve Bayesian framework to identify the hidden positives in U.
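The labeling step of Figure 3 above can be sketched in Python as follows. This is a minimal sketch under our own assumptions: rd and the documents are sparse term-weight dictionaries, documents are referred to by index, and the function names are ours, not the paper's.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts (step 3)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def probabilistically_label(rd, U):
    """Split U into LP (indices with sim > 0, labeled Pr(+|di) = sim/m)
    and RU (indices with zero similarity to rd); cf. Fig. 3."""
    sims = [cosine(rd, d) for d in U]
    m = max(sims) if sims else 0.0        # step 4
    LP, RU = {}, []
    for i, s in enumerate(sims):
        if s > 0:
            LP[i] = s / m                 # step 7: probabilistic positive label
        else:
            RU.append(i)                  # step 10
    return LP, RU
```

The document most similar to rd always receives Pr(+|di) = 1, documents sharing fewer representative words receive proportionally smaller labels, and documents sharing none fall into RU.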
Given a set of training documents D, each document is considered an ordered list of words, and each word in a document is from the vocabulary V = <w1, w2, ..., w|V|>. We also have a set of predefined classes, C = {c1, c2, ..., c|C|}. For simplicity, we will consider two-class classification in this discussion, i.e., C = {c1, c2}, with c1 = "+" and c2 = "-". To perform classification for a document di, we need to compute the posterior probability Pr(cj|di), cj ∈ {+, -}. Based on Bayesian probability and the multinomial model [12], we have

Pr(cj) = Σ_{i=1}^{|D|} Pr(cj|di) / |D|    (1)

and with Laplacian smoothing,
Pr(wt|cj) = (1 + Σ_{i=1}^{|D|} N(wt, di) Pr(cj|di)) / (|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} N(ws, di) Pr(cj|di)).    (2)

Here, Pr(cj|di) ∈ {0, 1} depending on the class label of the document. Assuming that the probabilities of the words are independent given the class, we obtain the naïve Bayesian classifier:

Pr(cj|di) = Pr(cj) Π_{k=1}^{|di|} Pr(w_{di,k}|cj) / Σ_{r=1}^{|C|} Pr(cr) Π_{k=1}^{|di|} Pr(w_{di,k}|cr).    (3)

In the naïve Bayesian classifier, the class with the highest Pr(cj|di) is assigned as the class of the document. The NB method is known to be an effective technique for text classification even with the violation of some of its basic assumptions (e.g., class conditional independence) [13] [14] [1].

The Expectation-Maximization (EM) algorithm is a popular class of iterative algorithms for problems with incomplete data. It iterates over two basic steps, the Expectation step and the Maximization step. The Expectation step basically fills in the missing data, while the Maximization step then estimates the parameters. When applying EM, equation (3) above comprises the Expectation step, while equations (1) and (2) are used for the Maximization step. Note that the probability of the class given the document now takes values in [0, 1] instead of {0, 1}. The ability of EM to work with missing data is exactly what we need here.

1. For each di ∈ RU,
2.   Pr(+|di) = 0;
3.   Pr(-|di) = 1;
4. PS = LP ∪ P (or LP);
5. For each di ∈ PS
6.   If di ∈ P
7.     Pr(+|di) = 1;
8.     Pr(-|di) = 0;
9.   Else
10.    Pr(+|di) = sim(rd, di) / m;
11.    Pr(-|di) = 0;
12. Build an NB classifier NB-C using PS and RU based on equations (1), (2);
13. While classifier parameters change
14.   For each di ∈ PS ∪ RU
15.     Compute Pr(+|di) and Pr(-|di) using NB-C, i.e., equation (3);
16.     Update Pr(cj) and Pr(wt|cj) using equations (1) and (2) with the new probabilities produced in step 15 (a new NB-C is built in the process);

Fig. 4. The detailed LPLP algorithm

Let us regard all the positive documents as having the positive class value "+". We want to know the class value of each document in the unlabeled set U.
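One E-step/M-step round of equations (1)-(3) can be sketched in Python as follows. This is a minimal sketch under our own assumptions: documents are word-count dicts, the soft labels Pr(cj|di) are dicts over the classes "+" and "-", and the function names are ours, not the paper's.

```python
import math

def m_step(docs, soft, vocab):
    """Estimate Pr(c) and Pr(w|c) from soft labels (equations (1) and (2))."""
    classes = ("+", "-")
    prior = {c: sum(s[c] for s in soft) / len(docs) for c in classes}  # eq. (1)
    cond = {}
    for c in classes:
        # expected count of each word under class c
        weighted = {w: sum(d.get(w, 0) * s[c] for d, s in zip(docs, soft))
                    for w in vocab}
        denom = len(vocab) + sum(weighted.values())
        cond[c] = {w: (1 + weighted[w]) / denom for w in vocab}  # eq. (2), Laplace
    return prior, cond

def e_step(docs, prior, cond):
    """Recompute Pr(c|d) with the naive Bayes rule (equation (3)), in log space."""
    soft = []
    for d in docs:
        logp = {}
        for c in ("+", "-"):
            logp[c] = math.log(prior[c] or 1e-300) + sum(
                n * math.log(cond[c][w]) for w, n in d.items())
        z = max(logp.values())                       # avoid underflow
        exp = {c: math.exp(v - z) for c, v in logp.items()}
        s = sum(exp.values())
        soft.append({c: v / s for c, v in exp.items()})
    return soft
```

Initializing the soft labels as in steps 1-11 of Figure 4 and alternating `m_step` and `e_step` until the parameters stop changing gives the EM loop of steps 12-16.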
EM can help to properly assign a probabilistic class label to each document di in the unlabeled set, i.e., Pr(+|di) or Pr(-|di). Theoretically, in the EM algorithm, the probabilities of the documents in U will converge after a number of iterations [4]. However, a good initialization is important in order to find a good maximum of the likelihood function. For example, if we directly use P as the positive class and U as the negative class (initially), then the EM algorithm will not build an accurate classifier, as the negative class would be too noisy (as explained previously). Thus, in our algorithm, after extracting the likely positive set LP, we initialize the EM algorithm by treating the probabilistically labeled LP (with or without P) as the positive documents. The resulting classifier is more accurate
because 1) LP has a similar distribution to the other hidden positive documents in U, and 2) the remaining unlabeled set RU is also much purer than U as a negative set.

The detailed LPLP algorithm is given in Figure 4. The inputs to the algorithm are LP, RU and P. Steps 1 to 3 initialize the probabilities for each document di in RU, which are all treated as negative documents initially. Step 4 sets the positive set PS; there are two possible ways to achieve this: we either (1) combine LP and P as PS, or (2) use only LP as PS. We will evaluate the effect of the inclusion of P in PS in the next section. Steps 5 to 11 assign the initial probabilities to the documents in P (if P is used) and LP. Each document in P is assigned to the positive class, while each document in LP is probabilistically labeled using the algorithm in Figure 3. Using PS and RU, an NB classifier can be built (step 12). This classifier is then applied to the documents in (LP ∪ RU) to obtain the posterior probabilities (Pr(+|di) and Pr(-|di)) for each document (step 15). We can then iteratively employ the revised posterior probabilities to build a new (and better) NB classifier (step 16). The EM process continues until the parameters of the NB classifier converge.

4 Evaluation Experiments

In this section, we evaluate the proposed LPLP technique under different settings and compare it with existing methods, namely, Roc-SVM [2] and PEBL [3]. Roc-SVM is available on the Web as a part of the LPU system. We implemented PEBL ourselves as it is not publicly available. The results of S-EM [1] are not included here because its performance was generally very poor due to its reliance on the similarity of positive documents in P and in U, as expected.

4.1 Datasets

Our evaluation experiments were done using product Web pages from 5 commercial Web sites: Amazon, CNet, PCMag, J&R and ZDnet. These sites contain many introduction/description pages of different types of products. The pages were cleaned using the web page cleaning technique in [15], i.e., navigation links and advertisements were detected and eliminated.
The data contained web pages of the following product categories: Notebook, Digital Camera, Mobile Phone, Printer and TV. Table 1 lists the number of pages from each site, and the corresponding product categories (or classes). In this work, we treat each page as a text document and we do not use hyperlinks and images for classification.

Table 1. Number of Web pages and their classes.

           Amazon   CNet   J&R   PCMag   ZDnet
Notebook
Camera
Mobile
Printer
TV
Note that we did not use standard text collections such as the Reuters and 20 Newsgroups data in our experiments, as we want to highlight the performance of our approach on data sets that have different positive distributions in P and in U.

4.2 Experiment Settings

As mentioned, the number of available positively labeled documents in practice can be quite small, either because there were few documents to start with, or because it was tedious and expensive to hand-label the training examples on a large scale. To reflect this constraint, we experimented with different numbers of (randomly selected) positive documents in P, i.e., |P| = 5, 15, 25 and allpos. Here allpos means that all documents of a particular product from a Web site were used. The purpose of these experiments is to investigate the relative effectiveness of our proposed method for both small and large positive sets.

We conducted a comprehensive set of experiments using all the possible P and U combinations. That is, we selected every entry (one type of product from each Web site) in Table 1 as the positive set P and used each of the other 4 Web sites as the unlabeled set U. Three products were omitted in our experiments because their |P| < 10, namely, Mobile phone in J&R (9 pages), TV in PCMag (no page), and TV in ZDnet (no page). A grand total of 88 experiments were conducted using all the possible P and U combinations of the 5 sites. Due to the large number of combinations, the results reported below are the average values for the results from all the combinations. To study the sensitivity of the number of representative word features used in identifying likely positive examples, we also performed a series of experiments using different numbers of representative features, i.e., k = 5, 10, 15 and 20 in our algorithm.

4.3 Experimental Results

Since our task is to identify or retrieve positive documents from the unlabeled set U, it is appropriate to use the F value to evaluate the performance of the final classifier.
The F value is the harmonic mean of precision (p) and recall (r), i.e., F = 2pr/(p+r). When either p or r is small, the F value will be small. Only when both of them are large will the F value be large. This is suitable for our purpose, as we do not want to identify positive documents with either too small a precision or too small a recall. Note that in our experimental results, the reported F values give the classification (retrieval) results for the positive documents in U, as U is also the test set.

Let us first show the results of our proposed technique LPLP under different experimental settings. We will then compare it with the two existing techniques. The bar chart in Figure 5 shows the F values (Y-axis) of LPLP using different numbers of positive documents (X-axis) and different numbers of representative words (4 data series). Recall that we presented two options to construct the positive set PS in step 4 of the LPLP algorithm (Figure 4). The first option is to add the extracted likely positive documents (LP) to the original set of positive documents P, represented in Figure 5 by "with P". The second option is to use only the extracted likely positive documents as the positive data in learning, i.e., dropping the original
positive set P (since it is not representative of the hidden positive documents in U). This option is denoted by "without P" in Figure 5.

Fig. 5. F values of LPLP with different numbers of positive documents

Inclusion of P for constructing PS: If there were only a small number of positive documents (|P| = 5, 15 and 25) available, we found that using option 1 (with P) to construct the positive set for the classifier is better than using option 2 (without P), as expected. However, interestingly, if there is a large number of positive documents (allpos in Figure 5), then option 1 is actually inferior to option 2. The reason is that the use of a big positive set P, which is not representative of the positive documents in U, would have introduced too much negative influence on the final classifier (many hidden positive examples in U would be classified into the negative class). However, when the given positive set P is small, its potential negative influence is much smaller, and it therefore helps to strengthen the likely positive documents by providing more positive data. This is a subtle and rather unexpected trade-off.

Number of positive documents in P: From Figure 5, we also observe that the number of given positive documents in P does not influence the final results a great deal. The reason for this is that the computed likely positive documents from U are actually more effective positive documents for learning than the original positive documents in P. This is a very compelling advantage of our proposed technique, as it means that the user does not need to label or to find a large number of positive examples for effective learning. In fact, as discussed above, we also notice that even without using any original positive document in P, the results were very good as well.
Number of representative word features: The results in Figure 5 also show that there is no need to use many representative words for detecting positive documents. In general, 5-15 representative words suffice. Including the less representative word features beyond the top k most representative ones would introduce unnecessary noise in identifying the likely positive documents in U.

Next, we compare the results of our LPLP technique with those of the two best existing techniques mentioned earlier, namely, Roc-SVM [2] and PEBL [3]. Figure 6 shows two series of results. The first series, marked "P", shows the classification results of all three methods using all positive documents (allpos), without the use of the likely positive documents LP as suggested in this paper. In other words, learning
was done using only P and U. Note that for the strictest comparison, what we have shown here are the best possible results for Roc-SVM and PEBL, which may not be obtainable in practice because it is hardly possible to determine which SVM iteration would give the best results in these algorithms (both the Roc-SVM and PEBL algorithms run SVM many times). In fact, their results at convergence were actually much worse. We can see that PEBL performed better than both LPLP and Roc-SVM. However, the absolute F value of PEBL is still very low (0.54). Note also that because of the use of allpos for training, the LPLP result here was obtained without using the likely positive set LP (it is now the standard EM algorithm), hence it was unable to perform as well as it should have.

The second series in Figure 6 shows the comparative results of using the extracted likely positive documents instead of P for learning. Here, our LPLP algorithm performs dramatically better (F = 0.94), even against the best possible results of PEBL (F = 0.84) and Roc-SVM (F = 0.81). Note that here PEBL and Roc-SVM also use the likely positive documents LP extracted from U by our method (we boosted PEBL and Roc-SVM for the purpose of comparison). The likely positives were identified from U using 10 representative words selected from P. Unlike our LPLP algorithm, both Roc-SVM and PEBL do not take probabilistic labels, but only binary labels. As such, for these two algorithms, we chose the likely positive documents from U by requiring each document d to contain at least 5 (out of 10) selected representative words. All the likely positive documents identified were then treated as positive documents, i.e., Pr(+|d) = 1. We also tried using other numbers of representative words in RW and found that 5 words performed well for these two algorithms with our datasets.
We can see that with the use of the likely positives (set LP) identified by our method (instead of P), the classification results of the two existing algorithms have also improved dramatically. In fact, by using LP instead of P, the previously weaker Roc-SVM has caught up so substantially that the best possible result of PEBL is now only slightly better than that of Roc-SVM.

Fig. 6. F values of LPLP and the best results of Roc-SVM and PEBL using all positive documents
Fig. 7. F values of LPLP and the best results of Roc-SVM and PEBL using P together with LP

Finally, in Figure 7, we show the comparative results when the number of positive documents is small, which is more often than not the case in practice. Again, we see that our new method LPLP performed much better than the best possible
results of the two existing methods, Roc-SVM and PEBL (which may not be obtainable in practice, as explained earlier), when there were only 5, 15, or 25 positive documents in P. As explained earlier, including P together with LP (for all three techniques) gave better results when P is small.

In summary, the results in Figures 6 and 7 show that the likely positive documents LP extracted from U can be used to help boost the performance of classification techniques for PU learning problems. In particular, the LPLP algorithm benefited the most and performed the best. This is probably because of its ability to handle probabilistic labels, which leaves it better equipped to take advantage of the probabilistic (and hence potentially noisy) LP set than the SVM-based approaches.

5 Conclusions

In many real-world classification applications, it is often the case that not only are the negative training examples hard to come by, but the number of positive examples available for learning can also be fairly limited, as it is often tedious and expensive to hand-label large amounts of training data. To address the lack of negative examples, many PU learning methods have been proposed to learn from a pool of positive data (P) without any negative data but with the help of unlabeled data (U). However, PU learning methods still do not work well when the number of positive examples is small. In this paper, we address this oft-overlooked issue for PU learning when the number of positive examples is quite small. In addition, we consider the challenging case where the positive examples in P and the hidden positive examples in U may not even be drawn from the same distribution. Existing techniques have been found to perform poorly in this setting. We proposed an effective technique LPLP that can learn effectively from positive and unlabeled examples with a small positive set for document classification.
Instead of identifying a set of reliable negative documents from the unlabeled set U as existing PU techniques do, our new method focuses on extracting a set of likely positive documents from U. In this way, the learning relies less on the limitations associated with the original positive set P, such as its limited size and potential distribution differences. Augmented by the extracted probabilistic LP set, our LPLP algorithm can build a much more robust classifier. We reported experimental results on product page classification that confirmed that our new technique is indeed much more effective than existing methods on this challenging classification problem. In future work, we plan to generalize our approach to similar classification problems beyond document classification.

References

1. Liu, B., Lee, W., Yu, P., Li, X.: Partially Supervised Classification of Text Documents. In: ICML (2002)
2. Li, X., Liu, B.: Learning to Classify Texts Using Positive and Unlabeled Data. In: IJCAI (2003)
3. Yu, H., Han, J., Chang, K.C.-C.: PEBL: Positive Example Based Learning for Web Page Classification Using SVM. In: KDD (2002)
4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society (1977)
5. Denis, F.: PAC Learning from Positive Statistical Queries. In: ALT (1998)
6. Muggleton, S.: Learning from Positive Data. In: Proceedings of the Sixth International Workshop on Inductive Logic Programming. Springer, Heidelberg (1997)
7. Rocchio, J.: Relevance feedback in information retrieval. In: Salton, G. (ed.) The SMART Retrieval System: Experiments in Automatic Document Processing (1971)
8. Yu, H.: General MC: Estimating boundary of positive class from small positive data. In: ICDM (2003)
9. Fung, G.P.C., et al.: Text Classification without Negative Examples Revisit. IEEE Transactions on Knowledge and Data Engineering 18(1), 6-20 (2006)
10. Schölkopf, B., et al.: Estimating the Support of a High-Dimensional Distribution. Neural Computation 13(7) (2001)
11. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval (1986)
12. McCallum, A., Nigam, K.: A comparison of event models for naïve Bayes text classification. In: AAAI Workshop on Learning for Text Categorization (1998)
13. Lewis, D.D.: A sequential algorithm for training text classifiers: corrigendum and additional data. In: SIGIR Forum (1995)
14. Nigam, K., et al.: Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning 39(2-3) (2000)
15. Yi, L., Liu, B., Li, X.: Eliminating noisy information in Web pages for data mining. In: KDD (2003)