Learning to Classify Documents with Only a Small Positive Training Set


Xiao-Li Li 1, Bing Liu 2, and See-Kiong Ng 1

1 Institute for Infocomm Research, Heng Mui Keng Terrace, 119613, Singapore
2 Department of Computer Science, University of Illinois at Chicago, IL 60607-7053
xlli@i2r.a-star.edu.sg, liub@cs.uic.edu, skng@i2r.a-star.edu.sg

J.N. Kok et al. (Eds.): ECML 2007, LNAI 4701, pp. 201-213, 2007. Springer-Verlag Berlin Heidelberg 2007

Abstract. Many real-world classification applications fall into the class of positive and unlabeled (PU) learning problems. In many such applications, not only could the negative training examples be missing, but the number of positive examples available for learning may also be fairly limited due to the impracticality of hand-labeling a large number of training examples. Current PU learning techniques have focused mostly on identifying reliable negative instances from the unlabeled set U. In this paper, we address the oft-overlooked PU learning problem where the number of training examples in the positive set P is small. We propose a novel technique LPLP (Learning from Probabilistically Labeled Positive examples) and apply the approach to classify product pages from commercial websites. The experimental results demonstrate that our approach outperforms existing methods significantly, even in the challenging cases where the positive examples in P and the hidden positive examples in U were not drawn from the same distribution.

1 Introduction

Traditional supervised learning techniques typically require a large number of labeled examples to learn an accurate classifier. However, in practice, it can be an expensive and tedious process to obtain the class labels for large sets of training examples. One way to reduce the amount of labeled training data needed is to develop classification algorithms that can learn from a set of labeled positive examples augmented with a set of unlabeled examples. That is, given a set P of positive examples of a particular class and a set U of unlabeled examples (which contains both hidden positive and hidden negative examples), we build a classifier using P and U to classify the data in U as well as future test data. We call this the PU learning problem.

Several innovative techniques (e.g., [1], [2], [3]) have been proposed to solve the PU learning problem recently. All of these techniques have focused on addressing the lack of labeled negative examples in the training data. It was assumed that there was a sufficiently large set of positive training examples, and also that the positive examples in P and the hidden positive examples in U were drawn from the same distribution. However, in practice, obtaining a large number of positive examples can be rather difficult in many real applications. Oftentimes, we have to make do with a fairly small set of positive training data. In fact, the small positive set may not even adequately represent the whole positive class, as it is highly likely that there could be hidden positives in U that may not be similar to those few examples in the small positive set P.

Moreover, the examples in the positive set P and the hidden positive examples in the unlabeled set U may not even be generated or drawn from the same distribution. A classifier trained merely on the few available examples in P would thus be incompetent at recognizing all the hidden positives in U as well as those in the test sets.

In this work, we consider the problem of learning to classify documents with only a small positive training set. Our work was motivated by a real-life business intelligence application of classifying web pages of product information. The richness of information easily available on the World Wide Web has made it routine for companies to conduct business intelligence by searching the Internet for information on related products. For example, a company that sells computer printers may want to do a product comparison among the various printers currently available in the market. One can first collect sample printer pages by crawling through all product pages from a consolidated e-commerce web site (e.g., amazon.com) and then hand-label those pages containing printer product information to construct the set P of positive examples. Next, to get more product information, one can then crawl through all the product pages from other web sites (e.g., cnet.com) as U. Ideally, PU learning techniques can then be applied to classify all pages in U into printer pages and non-printer pages.

However, we found that while the printer product pages from two websites (say, amazon.com and cnet.com) do indeed share many similarities, they can also be quite distinct, as the different web sites invariably present their products (even similar ones) in different styles and with different focuses. As such, directly applying existing methods would give very poor results because 1) the small positive set P obtained from one site contained only tens of web pages (usually fewer than 30) and therefore does not adequately represent the whole positive class, and 2) the features from the positive examples in P and the hidden positive examples in U are not generated from the same distribution because they were drawn from different web sites.

In this paper, we tackle the challenge of constructing a reliable document (web page) classifier based on only a few labeled positive pages from a single web site, and then use it to automatically extract the hidden positive pages from different web sites (i.e., the unlabeled sets). We propose an effective technique called LPLP (Learning from Probabilistically Labeled Positive examples) to perform this task. Our proposed technique LPLP is based on probabilistically labeling training examples from U and the EM algorithm [4]. The experimental results showed that LPLP significantly outperformed existing PU learning methods.

2 Related Works

A theoretical study of PAC learning from positive and unlabeled examples under the statistical query model was first reported in [5]. Muggleton [6] followed by studying the problem in a Bayesian framework where the distribution of functions and examples are assumed known. [1] reported sample complexity results and provided theoretical elaborations on how the problem could be solved. Subsequently, a number of practical PU learning algorithms were proposed [1], [3] and [2]. These PU learning algorithms all conformed to the theoretical results presented in [1] by following a common two-step strategy, namely: (1) identifying a set of reliable negative documents from the unlabeled set; and then (2) building a classifier using EM or SVM iteratively.

The specific differences between the various algorithms in these two steps are as follows. The S-EM method proposed in [1] was based on naïve Bayesian classification and the EM algorithm [4]. The main idea was to first use a spying technique to identify some reliable negative documents from the unlabeled set, and then to run EM to build the final classifier. The PEBL method [3] uses a different method (1-DNF) for identifying reliable negative examples and then runs SVM iteratively for classifier building. More recently, [2] reported a technique called Roc-SVM. In this technique, reliable negative documents are extracted using the information retrieval technique Rocchio [7]. Again, SVM is used in the second step. A classifier selection criterion is also proposed to select a good classifier from the iterations of SVM. Despite the differences in algorithmic details, the above methods all focused on extracting reliable negative instances from the unlabeled set.

More related to our current work is the recent work by Yu [8], which proposed to estimate the boundary for the positive class. However, the amount of positive examples it required was around 30% of the whole data, which is still too large for many practical applications. In [9], a method called PN-SVM was proposed to deal with the case when the positive set is small. However, it (like all the other existing PU learning algorithms) relied on the assumption that the positive examples in P and the hidden positives in U were all generated from the same distribution. For the first time, our LPLP method proposed in this paper addresses such common weaknesses of current PU learning methods, including handling challenging cases where the positive examples in P and the hidden positive examples in U were not drawn from the same distribution.

Note that the problem could potentially be modeled as a one-class classification problem. For example, in [10], a one-class SVM that uses only positive data to build an SVM classifier was proposed. Such approaches differ from our method in that they do not use unlabeled data for training. However, as previous results reported in [2] have already shown that they are inferior for text classification, we do not consider them in this work.

3 The Proposed Technique

Figure 1 depicts the general scenario of PU learning with a small positive training set P. Let us denote a space ψ that represents the whole positive class, which is located above the hyperplane H2. The small positive set P only occupies a relatively small subspace SP in ψ (SP ⊆ ψ), shown as the oval region in the figure. The examples in the unlabeled set U consist of both hidden positive examples (represented by circled "+") and hidden negative examples (represented by "-"). Since P is small, and ψ contains positive examples from different web sites that present similar products in different styles and with different focuses, we may not expect the distributions of the positive examples in P and those of the hidden positive examples in U to be the same. In other words, the set of hidden positive examples that we are trying to detect may have a very small intersection with SP, or may even be disjoint from it. If we naively use the small set P as the positive training set and the entire unlabeled set U as the negative set, the resulting classifier (corresponding to hyperplane H1) will obviously perform badly in identifying the hidden positive examples in U.

On the other hand, we can attempt to use the more sophisticated PU learning methods to address this problem. Instead of merely using the entire unlabeled set U as the negative training data, the first step of PU learning extracts some reliable negatives from U. However, this step is actually rather difficult in our application scenario as depicted in Figure 1. Since the hidden positive examples in U are likely to have different distributions from those captured in the small positive set P, not all the training examples in U that are dissimilar to examples in P are negative examples. As a result, the so-called reliable negative set that the first step of PU learning extracts based on dissimilarity from P would be a very noisy negative set, and therefore not very useful for building a reliable classifier.

Fig. 1. PU learning with a small positive training set (hyperplanes H1 and H2)

Let us consider the possibility of extracting a set of likely positive examples (LP) from the unlabeled set U to address the problem of P not being sufficiently representative of the hidden positive documents in U. Unlike P, the distribution of LP will be similar to that of the other hidden positive examples in U. As such, we can expect that a more accurate classifier can be built with the help of the set LP (together with P). Pictorially, the resulting classifier would correspond to the optimal hyperplane H2 shown in Figure 1. Instead of trying to identify a set of noisy negative documents from the unlabeled set U as existing PU learning techniques do, our proposed technique LPLP therefore focuses on extracting a set of likely positive documents from U.

While the positive documents in P and the hidden positive documents in U were not drawn from the same distribution, they should still be similar in some underlying feature dimensions (or subspaces) as they belong to the same class. For example, the printer pages from two different sites, say amazon.com and cnet.com, would share representative word features such as "printer", "inkjet", "laser", "ppm", etc., though their respective distributions may be quite different. In particular, the pages from cnet.com, whose target readers are more technically savvy, may contain more frequent mentions of keyword terms corresponding to the technical specifications of printers than the pages from amazon.com, whose primary focus is to reach out to less technically inclined customers. However, we can safely expect that the basic keywords (representative word features) that describe computer printers should be present in both cases.

In this work, we therefore assume that the representative word features for the documents in P should be similar to those for the hidden positive documents in U. If we can find such a set of representative word features (RW) from the positive set P, then we can use them to extract other hidden positive documents from U.

We are now ready to present the details of our proposed technique LPLP. In Section 3.1, we first introduce a method to select the set of representative word features RW from the given positive set P. Then, in Section 3.2, we extract the likely positive documents from U and probabilistically label them based on the set RW. With the help of the resulting set LP, we employ the EM algorithm with a good initialization to build an accurate classifier to identify the hidden positive examples from U in Section 3.3.

3.1 Selecting a Set of Representative Word Features from P

As mentioned above, we expect the positive examples in P and the hidden positive examples in U to share the same representative word features, as they belong to the same class. We extract a set RW of representative word features from the positive set P containing the top k words with the highest s(w_i) scores. The scoring function s() is based on the TFIDF method [11], which gives high scores to those words that occur frequently in the positive set P but not in the whole corpus P ∪ U, since U contains many other unrelated documents. Figure 2 shows the detailed algorithm to select the keyword set RW.

1. RW = ∅; F = ∅;
2. For each word feature w_i ∈ P
3.   If (stopwords(w_i) != true)
4.     F = F ∪ {stemming(w_i)};
5. total = 0;
6. For each word feature w_i ∈ F
7.   N(w_i, P) = Σ_{j=1..|P|} N(w_i, d_j) / max_k N(w_k, d_j);
8.   total += N(w_i, P);
9. For each word feature w_i ∈ F
10.  s(w_i) = (N(w_i, P) / total) * log(|P ∪ U| / (df(w_i, P) + df(w_i, U)));
11. Rank the word features by s(w_i) from big to small into a list L;
12. Pr_TOP = the k-th s(w_i) in the list L, w_i ∈ F;
13. For each w_i ∈ F
14.   If (s(w_i) >= Pr_TOP)
15.     RW = RW ∪ {w_i};

Fig. 2. Selecting a set of representative word features from P

In step 1, we initialize the representative word set RW and the unique feature set F as empty sets. After removing the stop words (step 3) and performing stemming (step 4) [11], all the word features are stored into the feature set F. For each word feature w_i in the positive set P, steps 6 to 8 compute the accumulated word frequency (in each document d_j, the word w_i's frequency N(w_i, d_j) is normalized by the maximal word frequency in d_j in step 7). Steps 9 to 10 then compute the score of each word feature, which considers both w_i's probability of belonging to the positive class and its inverted document frequency, where df(w_i, P) and df(w_i, U) are w_i's document frequencies in P and U respectively. After ranking the scores into the list L in step 11, we store into RW those word features from P with the top k scores in L.
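To make the scoring in Figure 2 concrete, the following is a minimal Python sketch of the selection step. It assumes documents are already tokenized into word lists and that a stop-word set and a stemming function are supplied externally; the function name and signature are illustrative, not from the paper.

```python
import math
from collections import Counter

def select_representative_words(P, U, k, stopwords, stem):
    """Sketch of Figure 2: choose the top-k representative word features from P.

    P, U      : lists of documents, each document a list of word tokens
    k         : number of representative words to keep
    stopwords : set of words to discard
    stem      : stemming function (e.g. from an external stemming library)
    """
    # Steps 2-4: build the unique feature set F from P after stop-word removal and stemming.
    docs_P = [[stem(w) for w in d if w not in stopwords] for d in P]
    docs_U = [[stem(w) for w in d if w not in stopwords] for d in U]
    F = {w for d in docs_P for w in d}

    # Steps 6-8: accumulated frequency in P, normalized per document by its maximum frequency.
    N = dict.fromkeys(F, 0.0)
    for d in docs_P:
        counts = Counter(d)
        if not counts:
            continue
        max_freq = max(counts.values())
        for w, n in counts.items():
            N[w] += n / max_freq
    total = sum(N.values())

    # Steps 9-10: score by frequency in P and inverted document frequency over P and U.
    def df(w, docs):
        return sum(1 for d in docs if w in d)

    corpus_size = len(P) + len(U)
    s = {w: (N[w] / total) * math.log(corpus_size / (df(w, docs_P) + df(w, docs_U)))
         for w in F}

    # Steps 11-15: rank by score and keep the top k words as RW.
    return sorted(F, key=s.get, reverse=True)[:k]
```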

3.2 Identifying LP from U and Probabilistically Labeling the Documents in LP

Once the set RW of representative keywords is determined, we can regard them together as a representative document (rd) of the positive class. We then compare the similarity of each document d_i in U with rd using the cosine similarity metric [11], which automatically produces a set LP of probabilistically labeled documents with Pr(+|d_i) > 0. The algorithm for this step is given in Figure 3.

1. LP = ∅; RU = ∅;
2. For each d_i ∈ U
3.   sim(rd, d_i) = \frac{\sum_w rd_w \cdot d_{i,w}}{\sqrt{\sum_w rd_w^2} \cdot \sqrt{\sum_w d_{i,w}^2}};
4. Let m = max(sim(rd, d_i)), d_i ∈ U;
5. For each d_i ∈ U
6.   If (sim(rd, d_i) > 0)
7.     Pr(+|d_i) = sim(rd, d_i) / m;
8.     LP = LP ∪ {d_i};
9.   Else
10.    RU = RU ∪ {d_i};

Fig. 3. Probabilistically labeling a set of documents

In step 1, the likely positive set LP and the remaining unlabeled set RU are both initialized as empty sets. In steps 2 to 3, each unlabeled document d_i in U is compared with rd using the cosine similarity. Step 4 stores the largest similarity value as m. For each document d_i in U, if its cosine similarity sim(rd, d_i) > 0, we assign a probability Pr(+|d_i) based on the ratio of its similarity to m (step 7), and we store it into the set LP in step 8. Otherwise, d_i is included in RU instead (step 10). The documents in RU have zero similarity with rd and can be considered a purer negative set than U. Note that in step 7, the hidden positive examples in LP will be assigned high probabilities while the negative examples in LP will be assigned very low probabilities. This is because the representative features in RW were chosen based on those words that occur frequently in P but not in the whole corpus P ∪ U. As such, the hidden positive examples in LP should also contain many of the features in RW, while the negative examples in LP would contain few (if any) of the features in RW.
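The following is a small Python sketch of the labeling step in Figure 3, assuming each document (and the representative document rd) is represented as a sparse term-weight vector stored in a dict; the function names are illustrative rather than from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {word: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm > 0 else 0.0

def probabilistically_label(rd, U_vectors):
    """Sketch of Figure 3: split U into a probabilistically labeled set LP and a remainder RU.

    rd        : vector of the representative document built from RW
    U_vectors : list of (doc_id, vector) pairs for the unlabeled documents
    Returns LP as {doc_id: Pr(+|d_i)} and RU as a list of doc_ids.
    """
    sims = {doc_id: cosine(rd, vec) for doc_id, vec in U_vectors}
    m = max(sims.values()) if sims else 0.0

    LP, RU = {}, []
    for doc_id, sim in sims.items():
        if sim > 0 and m > 0:
            LP[doc_id] = sim / m     # Pr(+|d_i) = sim(rd, d_i) / m  (step 7)
        else:
            RU.append(doc_id)        # zero similarity: kept as (purer) negative data
    return LP, RU
```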

3.3 The Classification Algorithm

Next, we employ the naïve Bayesian framework to identify the hidden positives in U. Given a set of training documents D, each document is considered an ordered list of words, and each word in a document is drawn from the vocabulary V = <w_1, w_2, ..., w_|V|>. We also have a set of predefined classes, C = {c_1, c_2, ..., c_|C|}. For simplicity, we will consider two-class classification in this discussion, i.e., C = {c_1, c_2}, with c_1 = "+" and c_2 = "-". To perform classification for a document d_i, we need to compute the posterior probability Pr(c_j|d_i), c_j ∈ {+, -}. Based on the Bayesian probability and the multinomial model [12], we have

Pr(c_j) = \frac{\sum_{i=1}^{|D|} Pr(c_j|d_i)}{|D|}   (1)

and, with Laplacian smoothing,

Pr(w_t|c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i) Pr(c_j|d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i) Pr(c_j|d_i)}.   (2)

Here, Pr(c_j|d_i) ∈ {0, 1} depending on the class label of the document. Assuming that the probabilities of words are independent given the class, we obtain the naïve Bayesian classifier:

Pr(c_j|d_i) = \frac{Pr(c_j) \prod_{k=1}^{|d_i|} Pr(w_{d_i,k}|c_j)}{\sum_{r=1}^{|C|} Pr(c_r) \prod_{k=1}^{|d_i|} Pr(w_{d_i,k}|c_r)}.   (3)

In the naïve Bayesian classifier, the class with the highest Pr(c_j|d_i) is assigned as the class of the document. The NB method is known to be an effective technique for text classification even when some of its basic assumptions (e.g., class conditional independence) are violated [13], [14], [1].

The Expectation-Maximization (EM) algorithm is a popular class of iterative algorithms for problems with incomplete data. It iterates over two basic steps, the Expectation step and the Maximization step. The Expectation step basically fills in the missing data, while the Maximization step then estimates the parameters. When applying EM, equation (3) above comprises the Expectation step, while equations (1) and (2) are used for the Maximization step. Note that the probability of the class given the document now takes a value in [0, 1] instead of {0, 1}. The ability of EM to work with missing data is exactly what we need here.

Let us regard all the positive documents as having the positive class value "+". We want to know the class value of each document in the unlabeled set U. EM can help to properly assign a probabilistic class label to each document d_i in the unlabeled set, i.e., Pr(+|d_i) or Pr(-|d_i). Theoretically, in the EM algorithm, the probabilities of the documents in U will converge after a number of iterations [4]. However, a good initialization is important in order to find a good maximum of the likelihood function. For example, if we directly use P as the positive class and U as the negative class (initially), then the EM algorithm will not build an accurate classifier, as the negative class would be too noisy (as explained previously). Thus, in our algorithm, after extracting the likely positive set LP, we re-initialize the EM algorithm by treating the probabilistically labeled LP (with or without P) as the positive documents.

1. For each d_i ∈ RU
2.   Pr(+|d_i) = 0;
3.   Pr(-|d_i) = 1;
4. PS = LP ∪ P (or LP);
5. For each d_i ∈ PS
6.   If d_i ∈ P
7.     Pr(+|d_i) = 1;
8.     Pr(-|d_i) = 0;
9.   Else
10.    Pr(+|d_i) = sim(rd, d_i) / m;
11.    Pr(-|d_i) = 0;
12. Build an NB-C classifier C using PS and RU based on equations (1), (2);
13. While classifier parameters change
14.   For each d_i ∈ PS ∪ RU
15.     Compute Pr(+|d_i) and Pr(-|d_i) using NB-C, i.e., equation (3);
16.   Update Pr(c_j) and Pr(w_t|c_j) using equations (1) and (2) with the new probabilities produced in step 15 (a new NB-C is being built in the process).

Fig. 4. The detailed LPLP algorithm
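The following is a compact Python sketch of the EM loop of Figure 4 built on a two-class multinomial naive Bayes with Laplacian smoothing (equations 1-3). It assumes the documents are given as a dense word-count matrix and that the initial Pr(+|d_i) values come from the previous labeling step; the function name is illustrative, and as a simplification every document is re-labeled in each iteration (the labels of the documents from P could instead be clamped to 1).

```python
import numpy as np

def lplp_em(X, init_pos_prob, max_iter=50, tol=1e-4):
    """Sketch of the EM loop in Figure 4 with a two-class multinomial naive Bayes.

    X             : (n_docs, n_words) array of word counts, PS documents followed by RU
    init_pos_prob : length-n_docs array with the initial Pr(+|d_i)
                    (1.0 for documents from P, sim(rd, d_i)/m for LP, 0.0 for RU)
    Returns the final Pr(+|d_i) for every document.
    """
    pos = init_pos_prob.astype(float)          # Pr(+|d_i); Pr(-|d_i) = 1 - pos
    n_docs, n_words = X.shape

    for _ in range(max_iter):
        # M-step: re-estimate Pr(c) (eq. 1) and Pr(w_t|c) with Laplacian smoothing (eq. 2).
        prior_pos = np.clip(pos.mean(), 1e-10, 1 - 1e-10)
        counts_pos = X.T @ pos                 # sum_i N(w_t, d_i) Pr(+|d_i)
        counts_neg = X.T @ (1.0 - pos)
        word_pos = (1.0 + counts_pos) / (n_words + counts_pos.sum())
        word_neg = (1.0 + counts_neg) / (n_words + counts_neg.sum())

        # E-step: recompute the posteriors Pr(+|d_i) with the NB classifier (eq. 3),
        # in log space for numerical stability.
        log_pos = np.log(prior_pos) + X @ np.log(word_pos)
        log_neg = np.log(1.0 - prior_pos) + X @ np.log(word_neg)
        new_pos = 1.0 / (1.0 + np.exp(np.clip(log_neg - log_pos, -700, 700)))

        if np.max(np.abs(new_pos - pos)) < tol:   # stop once the labels stabilize
            pos = new_pos
            break
        pos = new_pos
    return pos
```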

The resulting classifier is more accurate because 1) LP has a distribution similar to that of the other hidden positive documents in U, and 2) the remaining unlabeled set RU is also much purer than U as a negative set.

The detailed LPLP algorithm is given in Figure 4. The inputs to the algorithm are LP, RU and P. Steps 1 to 3 initialize the probabilities for each document d_i in RU, which are all treated as negative documents initially. Step 4 sets the positive set PS; there are two possible ways to achieve this: we either (1) combine LP and P as PS, or (2) use only LP as PS. We evaluate the effect of the inclusion of P in PS in the next section. Steps 5 to 11 assign the initial probabilities to the documents in P (if P is used) and LP. Each document in P is assigned to the positive class, while each document in LP is probabilistically labeled using the algorithm in Figure 3. Using PS and RU, an NB classifier can be built (step 12). This classifier is then applied to the documents in (LP ∪ RU) to obtain the posterior probabilities (Pr(+|d_i) and Pr(-|d_i)) for each document (step 15). We then iteratively employ the revised posterior probabilities to build a new (and better) NB classifier (step 16). The EM process continues until the parameters of the NB classifier converge.

4 Evaluation Experiments

In this section, we evaluate the proposed LPLP technique under different settings and compare it with existing methods, namely Roc-SVM [2] and PEBL [3]. Roc-SVM is available on the Web as a part of the LPU system 1. We implemented PEBL ourselves as it is not publicly available. The results of S-EM [1] are not included here because its performance was generally very poor due to its reliance on the similarity of positive documents in P and in U, as expected.

4.1 Datasets

Our evaluation experiments were done using product Web pages from 5 commercial Web sites: Amazon, CNet, PCMag, J&R and ZDnet. These sites contained many introduction/description pages for different types of products. The pages were cleaned using the web page cleaning technique in [15], i.e., navigation links and advertisements were detected and eliminated. The data contained web pages of the following product categories: Notebook, Digital Camera, Mobile Phone, Printer and TV. Table 1 lists the number of pages from each site and the corresponding product categories (or classes). In this work, we treat each page as a text document, and we do not use hyperlinks and images for classification.

Table 1. Number of Web pages and their classes

            Amazon  CNet  J&R  PCMag  ZDnet
Notebook       434   480   51    144    143
Camera         402   219   80    137    151
Mobile          45   109    9     43     97
Printer        767   500  104    107     80
TV             719   449  199      0      0

1 http://www.cs.uic.edu/~liub/lpu/lpu-download.html

Note that we did not use standard text collections such as the Reuters 2 and 20 Newsgroups 3 data in our experiments, as we want to highlight the performance of our approach on data sets that have different positive distributions in P and in U.

2 http://www.research.att.com/~lewis/reuters21578.html
3 http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes/20_newsgroups.tar.gz

4.2 Experiment Settings

As mentioned, the number of available positively labeled documents in practice can be quite small, either because there were few documents to start with, or because it was tedious and expensive to hand-label the training examples on a large scale. To reflect this constraint, we experimented with different numbers of (randomly selected) positive documents in P, i.e., |P| = 5, 15, or 25, and allpos. Here allpos means that all documents of a particular product from a Web site were used. The purpose of these experiments is to investigate the relative effectiveness of our proposed method for both small and large positive sets.

We conducted a comprehensive set of experiments using all the possible P and U combinations. That is, we selected every entry (one type of product from each Web site) in Table 1 as the positive set P and used each of the other 4 Web sites as the unlabeled set U. Three products were omitted in our experiments because their |P| < 10, namely, Mobile phone in J&R (9 pages), TV in PCMag (no page), and TV in ZDnet (no page). A grand total of 88 experiments were conducted using all the possible P and U combinations of the 5 sites. Due to the large number of combinations, the results reported below are the average values over all the combinations. To study the sensitivity of the number of representative word features used in identifying likely positive examples, we also performed a series of experiments using different numbers of representative features, i.e., k = 5, 10, 15 and 20 in our algorithm.

4.3 Experimental Results

Since our task is to identify or retrieve positive documents from the unlabeled set U, it is appropriate to use the F value to evaluate the performance of the final classifier. The F value is the harmonic mean of precision (p) and recall (r), i.e., F = 2*p*r/(p+r). When either p or r is small, the F value will be small; only when both of them are large will the F value be large. This is suitable for our purpose, as we do not want to identify positive documents with either too small a precision or too small a recall. Note that in our experimental results, the reported F values give the classification (retrieval) results of the positive documents in U, as U is also the test set.
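For concreteness, a small helper in the spirit of this evaluation (the function and its arguments are illustrative, not from the paper) that computes the F value from the documents a classifier labels positive in U and the set of hidden positives in U:

```python
def f_value(predicted_pos, true_pos):
    """F = 2*p*r/(p+r), computed over the documents retrieved from U.

    predicted_pos : set of document ids classified as positive
    true_pos      : set of document ids that are actually hidden positives in U
    """
    tp = len(predicted_pos & true_pos)
    if tp == 0:
        return 0.0
    p = tp / len(predicted_pos)   # precision
    r = tp / len(true_pos)        # recall
    return 2 * p * r / (p + r)
```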

Let us first show the results of our proposed technique LPLP under different experimental settings. We will then compare it with two existing techniques. The bar chart in Figure 5 shows the F values (Y-axis) of LPLP using different numbers of positive documents (X-axis) and different numbers of representative words (4 data series). Recall that we presented two options to construct the positive set PS in step 4 of the LPLP algorithm (Figure 4). The first option is to add the extracted likely positive documents (LP) to the original set of positive documents P, represented in Figure 5 by "with P". The second option is to use only the extracted likely positive documents as the positive data in learning, i.e., dropping the original positive set P (since it is not representative of the hidden positive documents in U). This option is denoted by "without P" in Figure 5.

Fig. 5. F values of LPLP with different numbers of positive documents (|P| = 5, 15, 25 and allpos, each with and without P; bars show 5, 10, 15 and 20 representative words)

Inclusion of P for constructing PS: If only a small number of positive documents (|P| = 5, 15 and 25) is available, we found that using option 1 (with P) to construct the positive set for the classifier is better than using option 2 (without P), as expected. However, interestingly, if there is a large number of positive documents (allpos in Figure 5), then option 1 is actually inferior to option 2. The reason is that the use of a big positive set P, which is not representative of the positive documents in U, would introduce too much negative influence on the final classifier (many hidden positive examples in U would be classified as the negative class). However, when the given positive set P is small, its potential negative influence is much smaller, and it therefore helps to strengthen the likely positive documents by providing more positive data. This is a subtle and rather unexpected trade-off.

Number of positive documents in P: From Figure 5, we also observe that the number of the given positive documents in P does not influence the final results a great deal. The reason is that the computed likely positive documents from U are actually more effective positive documents for learning than the original positive documents in P. This is a very compelling advantage of our proposed technique, as it means that the user does not need to label or find a large number of positive examples for effective learning. In fact, as discussed above, we also notice that even without using any original positive documents in P, the results were still very good.

Number of representative word features: The results in Figure 5 also show that there is no need to use many representative words for detecting positive documents. In general, 5-15 representative words suffice. Including the less representative word features beyond the top k most representative ones would introduce unnecessary noise in identifying the likely positive documents in U.

Next, we compare the results of our LPLP technique with those of the two best existing techniques mentioned earlier, namely Roc-SVM [2] and PEBL [3]. Figure 6 shows two series of results. The first series, marked P, shows the classification results of all three methods using all positive documents (allpos), without the use of the likely positive documents LP as suggested in this paper. In other words, learning was done using only P and U.

Note that, for the strictest comparison, what we have shown here are the best possible results for Roc-SVM and PEBL, which may not be obtainable in practice because it is hardly possible to determine which SVM iteration would give the best results in these algorithms (both the Roc-SVM and PEBL algorithms run SVM many times). In fact, their results at convergence were actually much worse. We can see that PEBL performed better than both LPLP and Roc-SVM. However, the absolute F value of PEBL is still very low (0.54). Note also that, because of the use of allpos for training, LPLP's result here was obtained without using the likely positive set LP (it reduces to the standard EM algorithm), hence it was unable to perform as well as it should have.

The second series in Figure 6 shows the comparative results of using the extracted likely positive documents instead of P for learning. Here, our LPLP algorithm performs dramatically better (F = 0.94) even against the best possible results of PEBL (F = 0.84) and Roc-SVM (F = 0.81). Note that here PEBL and Roc-SVM also use the likely positive documents LP extracted from U by our method (we boosted PEBL and Roc-SVM for the purpose of comparison). The likely positives were identified from U using 10 representative words selected from P. Unlike our LPLP algorithm, both Roc-SVM and PEBL do not take probabilistic labels, but only binary labels. As such, for these two algorithms, we chose the likely positive documents from U by requiring each document (d) to contain at least 5 (out of 10) selected representative words. All the likely positive documents identified were then treated as positive documents, i.e., Pr(+|d) = 1. We also tried using other numbers of representative words in RW and found that 5 words performed well for these two algorithms with our datasets. We can see that with the use of the likely positives (set LP) identified by our method (instead of P), the classification results of these two existing algorithms also improved dramatically. In fact, by using LP instead of P, the previously weaker Roc-SVM has caught up so substantially that the best possible result of PEBL is now only slightly better than that of Roc-SVM.

Fig. 6. F values of LPLP and the best results of Roc-SVM and PEBL using all positive documents
Fig. 7. F values of LPLP and the best results of Roc-SVM and PEBL using P together with LP

Finally, in Figure 7, we show the comparative results when the number of positive documents is small, which is more often than not the case in practice. Again, we see that our new method LPLP performed much better than the best possible results of the two existing methods Roc-SVM and PEBL (which may not be obtainable in practice, as explained earlier) when there were only 5, 15, or 25 positive documents in P.

As explained earlier, including P together with LP (for all three techniques) gave better results when P is small.

In summary, the results in Figures 6 and 7 show that the likely positive documents LP extracted from U can be used to help boost the performance of classification techniques for PU learning problems. In particular, the LPLP algorithm benefited the most and performed the best. This is probably because of its ability to handle probabilistic labels, which makes it better equipped than the SVM-based approaches to take advantage of the probabilistic (and hence potentially noisy) LP set.

5 Conclusions

In many real-world classification applications, it is often the case that not only are negative training examples hard to come by, but the number of positive examples available for learning can also be fairly limited, as it is often tedious and expensive to hand-label large amounts of training data. To address the lack of negative examples, many PU learning methods have been proposed to learn from a pool of positive data (P) without any negative data but with the help of unlabeled data (U). However, PU learning methods still do not work well when the positive set is small. In this paper, we address this oft-overlooked issue for PU learning when the number of positive examples is quite small. In addition, we consider the challenging case where the positive examples in P and the hidden positive examples in U may not even be drawn from the same distribution. Existing techniques have been found to perform poorly in this setting.

We proposed an effective technique LPLP that can learn effectively from positive and unlabeled examples with a small positive set for document classification. Instead of identifying a set of reliable negative documents from the unlabeled set U as existing PU techniques do, our new method focuses on extracting a set of likely positive documents from U. In this way, the learning relies less on the limitations associated with the original positive set P, such as its limited size and potential distribution differences. Augmented by the extracted probabilistic LP set, our LPLP algorithm can build a much more robust classifier. We reported experimental results on product page classification that confirm that our new technique is indeed much more effective than existing methods on this challenging classification problem. In future work, we plan to generalize our current approach to solve similar classification problems beyond document classification.

References

1. Liu, B., Lee, W., Yu, P., Li, X.: Partially Supervised Classification of Text Documents. In: ICML, pp. 387-394 (2002)
2. Li, X., Liu, B.: Learning to Classify Texts Using Positive and Unlabeled Data. In: IJCAI, pp. 587-594 (2003)
3. Yu, H., Han, J., Chang, K.C.-C.: PEBL: Positive Example Based Learning for Web Page Classification Using SVM. In: KDD, pp. 239-248 (2002)

4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society (1977)
5. Denis, F.: PAC Learning from Positive Statistical Queries. In: ALT, pp. 112-126 (1998)
6. Muggleton, S.: Learning from Positive Data. In: Proceedings of the Sixth International Workshop on Inductive Logic Programming, pp. 358-376. Springer, Heidelberg (1997)
7. Rocchio, J.: Relevance feedback in information retrieval. In: Salton, G. (ed.) The SMART Retrieval System: Experiments in Automatic Document Processing (1971)
8. Yu, H.: General MC: Estimating boundary of positive class from small positive data. In: ICDM, pp. 693-696 (2003)
9. Fung, G.P.C., et al.: Text Classification without Negative Examples Revisit. IEEE Transactions on Knowledge and Data Engineering 18(1), 6-20 (2006)
10. Schölkopf, B., et al.: Estimating the Support of a High-Dimensional Distribution. Neural Computation 13(7), 1443-1471 (2001)
11. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval (1986)
12. McCallum, A., Nigam, K.: A comparison of event models for naïve Bayes text classification. In: AAAI Workshop on Learning for Text Categorization (1998)
13. Lewis, D.D.: A sequential algorithm for training text classifiers: corrigendum and additional data. In: SIGIR Forum, 13-19 (1995)
14. Nigam, K., et al.: Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning 39(2-3), 103-134 (2000)
15. Yi, L., Liu, B., Li, X.: Eliminating noisy information in Web pages for data mining. In: KDD, pp. 296-305 (2003)