Reliable Negative Extracting Based on knn for Learning from Positive and Unlabeled Examples


94 JOURNAL OF COMPUTERS, VOL. 4, NO. 1, JANUARY 2009

Reliable Negative Extracting Based on kNN for Learning from Positive and Unlabeled Examples

Bangzuo Zhang
College of Computer Science and Technology, Jilin University, Changchun, P. R. China
College of Computer, Northeast Normal University, Changchun, P. R. China
Email:

Wanli Zuo
College of Computer Science and Technology, Jilin University, Changchun, P. R. China
Email:

Abstract: Many real-world classification applications fall into the class of positive and unlabeled learning problems. Almost all existing techniques are based on a two-step strategy. This paper proposes a new reliable negative extracting algorithm for step 1. We adopt the kNN algorithm to rank unlabeled examples by their similarity to the k nearest positive examples, and set a threshold so that unlabeled examples scoring below it are labeled as reliable negative examples, rather than following the more common approach of labeling positive examples. In step 2, we use an iterative SVM technique to refine the final classifier. Our proposed method is simple and efficient, and to some extent independent of k. Experiments on the popular Reuters-21578 collection show the effectiveness of the proposed technique.

Index Terms: Learning from Positive and Unlabeled examples, k Nearest Neighbor, Text Classification, Support Vector Machine, Information Retrieval

I. INTRODUCTION

Traditional learning techniques typically require a large number of labeled examples to learn an accurate classifier. Thus, for binary problems, positive examples and negative examples are mandatory for machine learning and data mining algorithms such as decision trees and neural networks. This approach to building classifiers is called supervised learning. However, in many practical classification applications such as document retrieval and classification, positive information is readily available and unlabeled data can easily be collected; although it is possible to manually label some negative examples, doing so is labor intensive and very time consuming.
One way to reduce the amount of labeled training data needed is to develop classification algorithms that can learn from a set of labeled positive examples augmented with a set of unlabeled examples. That is, given a set P of positive examples of a particular class and a set U of unlabeled examples, build a classifier using P and U to classify the data in U as well as future test data. A first example is web-page classification: suppose we want a program that classifies web sites as interesting for a web user. Positive examples are freely available: the set of web pages corresponding to the web sites in his bookmarks. Moreover, unlabeled web pages are abundant and easily available on the World Wide Web. Many other real-world classification applications also fall into this class of problem, such as diagnosis of disease, where positive data are patients who have the disease and unlabeled data are all patients; or marketing, where positive data are clients who buy the product and unlabeled data are all clients in the database. Denis originally proposed a framework for learning a model from positive examples (POSEX for short) [1] based on the probably approximately correct (PAC) model. The study concentrates on the computational complexity of learning and shows that function classes learnable under the statistical queries model are also learnable from positive and unlabeled examples. Liu et al [2] call this problem LPU (Learning from Positive and Unlabeled examples), while it is also called partially supervised classification [3] and the PU learning problem [4]. Yu et al [5] introduce it as PEBL (Positive Example Based Learning). The key feature of this problem is that there are no labeled negative documents, which makes traditional classification methods inapplicable, as they all need labeled examples of every class. Recently, a few innovative techniques have been proposed to solve this problem, including S-EM [2], Roc-SVM [3], PEBL [5] and NB [6].
One class of these techniques focuses on addressing the lack of labeled negative examples in the training data, based on a two-step strategy as follows:
Step 1: Extract a set of negative examples, called reliable negatives (RN), from the unlabeled examples U. In this step, S-EM uses a Spy technique, Roc-SVM uses the Rocchio algorithm, PEBL uses a technique called 1-DNF, and NB uses the Naive Bayes technique. The key requirement for this step is that the identified negative examples must be reliable, or pure, i.e., with no or very few positive examples among them.
Step 2: Build a set of classifiers by iteratively applying a classification algorithm and then selecting a good classifier from the set. In this step, S-EM uses the Expectation Maximization (EM) algorithm with a NB

(Naive Bayes) classifier as the base classifier, while PEBL and Roc-SVM use the Support Vector Machine (SVM). Both S-EM and Roc-SVM have methods for selecting the final classifier; PEBL simply uses the last classifier at convergence. The underlying idea of these two-step strategies is to iteratively increase the number of unlabeled examples that are classified as negative while keeping the positive examples correctly classified. This idea has been justified to be effective for this problem in [2]. Other classes of methods for learning from positive and unlabeled examples have also been presented. A NB based method (called PNB) [7] tries to statistically remove the effect of positive data in the unlabeled set. The main shortcoming of this method is that it requires the user to supply the positive class probability, which is hard to provide in practice. It is also possible to discard the unlabeled examples and learn only from the positive examples. This was done in the one-class SVM [8], which tries to learn the support of the positive distribution. Some results [6] show that its performance is poorer than that of learning methods that take advantage of the unlabeled data. kNN [9] stands for k-nearest neighbor classification, a well-known statistical approach that has been intensively studied in pattern recognition. kNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The kNN algorithm assigns each example to the majority class of its k closest neighbors, where k is a parameter. For 1NN, the algorithm assigns each example to the class of its closest neighbor. The kNN algorithm is also an often-used method for text categorization and has reported the best result on the Reuters collection [9]. In this paper, we also follow the two-step strategy, and propose a novel method based on the kNN algorithm for step 1. We first use the kNN algorithm to extract reliable negatives and then construct an initial classifier.
We then apply the iterative SVM algorithm until it converges. We carry out experiments on the popular Reuters-21578 collection and demonstrate the effectiveness of our proposed technique. The rest of this paper first reviews the existing two-step LPU algorithms in Section II, then proposes a new reliable negative example extracting method based on the kNN algorithm, shows its effectiveness experimentally on the Reuters collection in Section IV, and finally concludes in Section V.

II. RELATED WORKS

Given a set of training documents D, each document is considered as an ordered list of words. We use w_{d,k} to denote the word in position k of document d, where each word is from the vocabulary V = {w_1, w_2, ..., w_|V|}. The vocabulary is the set of all the words considered for classification. For LPU, we only consider binary classification, so there is a set of predefined classes C = {c_0, c_1}; we use c_0 for the positive class and c_1 for the negative class. Traditional supervised and semi-supervised classification techniques require labeled training examples of all classes to build a classifier, and are thus not suitable for the LPU problem. Recently, several LPU algorithms, including S-EM [2], NB [6], Roc-SVM [3] and PEBL [5], have been proposed; they are all based on the two-step strategy. We first review the existing techniques for step 1 in detail.

A. The Spy Technique in S-EM

The Spy technique in S-EM [2] first randomly selects a set S of positive documents from P and puts them in U. The default proportion is 10% (15% is used in [6]). The algorithm is given in Fig. 1. The spies behave identically to the unknown positive documents hidden in U and hence allow the behavior of those unknown positives to be reliably inferred. The algorithm then runs I-EM using the set P - S as positive and the set U ∪ S as negative (lines 3-7); I-EM basically runs NB twice. After I-EM completes, the resulting classifier uses the probabilities assigned to the documents in S to decide a probability threshold th, which identifies likely negative documents in U and produces the reliable negative set RN.
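The thresholding logic of the Spy technique can be sketched as follows. This is a minimal illustration, not the S-EM implementation: `score` stands in for the probability output Pr(positive | d) of the I-EM/NB classifier trained on (P - S) versus (U ∪ S), and all data values are hypothetical toys.

```python
import random

def spy_extract_rn(P, U, score, spy_frac=0.10, seed=0):
    """Spy-style reliable-negative extraction (sketch of Fig. 1).

    P, U  : lists of examples, in any representation `score` accepts.
    score : stand-in for the I-EM/NB classifier of S-EM; any function
            returning an estimate of Pr(positive | d).
    Returns the members of U scoring below every spy's score.
    """
    rng = random.Random(seed)
    n_spies = max(1, int(spy_frac * len(P)))
    spies = [P[i] for i in rng.sample(range(len(P)), n_spies)]
    # Threshold th: the lowest probability any spy receives; genuine
    # positives hidden in U are unlikely to score below it.
    th = min(score(d) for d in spies)
    return [d for d in U if score(d) < th]

# Toy usage: one-dimensional "documents" scored by their coordinate.
P_toy = [[0.90], [0.95], [1.00], [0.92]]
U_toy = [[0.10], [1.05], [0.20]]          # 1.05 is a hidden positive
rn = spy_extract_rn(P_toy, U_toy, score=lambda d: d[0])
```

Because every spy scores at least as high as the hidden positive examples tend to, the threshold excludes them from RN while the low-scoring examples are kept.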
1. RN = {};
2. S = Sample(P, s%);
3. Us = U ∪ S;
4. Ps = P - S;
5. Assign each document in Ps the class label 1;
6. Assign each document in Us the class label -1;
7. I-EM(Us, Ps); // This produces a NB classifier.
8. Classify each document in Us using the NB classifier;
9. Determine a probability threshold th using S;
10. For each document d ∈ Us
11.   If its probability Pr(1|d) < th
12.   Then RN = RN ∪ {d};
13.   End If
14. End For
Figure 1. The spy technique in S-EM.

However, S-EM is not accurate because it uses the naive Bayesian classifier as the underlying classifier in step 2. The algorithm performs stably when the positive set is very small; when the positive set is larger, it is worse than the others.

B. The Naive Bayes Technique

The NB (Naive Bayes) technique is a popular method for text classification. Liu et al [6] first introduced it into LPU as a new method for step 1. The NB classifier is constructed by using the training documents to estimate the probability of each class given the document feature values of a new instance. To perform classification, it computes the posterior probability Pr(c_j|d_i). Based on Bayesian probability and the multinomial model, it gives

Pr(c_j) = \frac{\sum_{i=1}^{|D|} Pr(c_j|d_i)}{|D|}.   (1)

To avoid zero probability estimates, some smoothing method is usually used. Liu et al [6] use Lidstone smoothing:

Pr(w_t|c_j) = \frac{\lambda + \sum_{i=1}^{|D|} N(w_t, d_i) Pr(c_j|d_i)}{\lambda|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i) Pr(c_j|d_i)},   (2)

where \lambda is the smoothing factor, N(w_t, d_i) is the number of times that word w_t occurs in document d_i, and Pr(c_j|d_i) ∈ {0, 1} depending on the class of the document. Assuming that the probabilities of the words are independent given the class, the NB classifier is defined as equation (3):

Pr(c_j|d_i) = \frac{Pr(c_j) \prod_{k=1}^{|d_i|} Pr(w_{d_i,k}|c_j)}{\sum_{r=1}^{|C|} Pr(c_r) \prod_{k=1}^{|d_i|} Pr(w_{d_i,k}|c_r)}.   (3)

In classifying a document d_i, the class with the highest Pr(c_j|d_i) is assigned as the class of the document. The method of extracting a set RN of reliable negative documents from the unlabeled set U is given briefly in Fig. 2. Despite the fact that the assumption of conditional independence is generally not true for word appearance in documents, the naive Bayes classifier is surprisingly effective.

1. Assign label 1 to each document in P;
2. Assign label -1 to each document in U;
3. Build a NB classifier using P and U;
4. Use the classifier to classify U. Those documents in U that are classified as negative form the reliable negative set RN.
Figure 2. The method of extracting RN using NB.

C. The Rocchio Technique

The Roc-SVM algorithm [3] uses the Rocchio method to identify a set RN from U; Rocchio is a classic method for document routing and filtering in information retrieval. A Rocchio classifier is built by constructing a prototype vector for each class following equation (4):

\vec{c}_j = \alpha \frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \frac{\vec{d}}{\|\vec{d}\|},   (4)

where \alpha and \beta are parameters that adjust the relative impact of relevant and irrelevant training examples; generally \alpha = 16 and \beta = 4 are used. In classification, for each test document td, the cosine similarity of td with each prototype vector is computed, and the class whose prototype vector is more similar to td is assigned to td. The algorithm that uses Rocchio to identify a set RN from U is the same as that in Fig. 2, except that it replaces NB with Rocchio. Rocchio performs well consistently under a variety of conditions.

D. The 1-DNF Technique for PEBL

Yu et al [5] propose the PEBL framework for web page classification, which uses a mapping-convergence algorithm. In the mapping stage, they extract reliable negatives from the unlabeled data by the 1-DNF method. The 1-DNF algorithm is given in Fig. 3. It first builds a disjunctive list of positive features PF, containing the words that occur in the positive set P more frequently than in the unlabeled set U (lines 2-6). Then it tries to filter out possible positive documents from U (lines 8-12): a document in U that does not contain any positive feature in PF is regarded as a strongly negative document. With this algorithm, the RN set is always small, and sometimes consists of short text examples. PEBL is not robust because it performs well in certain situations and fails badly in others. PEBL is sensitive to the number of positive examples: when the positive data is small, the results are often very poor.

E. Techniques in Step 2

There are four techniques for the second step:
1. Running SVM only once using the sets P and RN after step 1. This method is seldom used.
2. Running EM. This method is used in S-EM [2].
3. Running SVM iteratively. This method is used in PEBL [5].
4. Running SVM iteratively and then selecting a final classifier. This method is used in Roc-SVM [3].
The Expectation-Maximization (EM) algorithm is a popular iterative algorithm for maximum likelihood estimation in problems with missing data. EM consists of two steps, the Expectation step and the Maximization step. The Expectation step basically fills in the missing data: it produces and revises the probabilistic labels of the documents in Q = U - RN. The parameters are then estimated in the Maximization step, which leads to the next iteration of the algorithm; EM converges when its parameters stabilize. Here, the EM algorithm iteratively runs NB to revise the probabilistic label of each document in the set Q.

1. PF = {};
2. For i = 1 to n
3.   If (freq(w_i, P)/|P| > freq(w_i, U)/|U|)
4.   Then PF = PF ∪ {w_i};
5.   End If
6. End For
7. RN = U;
8. For each document d ∈ U
9.   If there exists w_i ∈ PF with freq(w_i, d) > 0
10.  Then RN = RN - {d};
11.  End If
12. End For
Figure 3. The 1-DNF technique in PEBL.
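The 1-DNF extraction of Fig. 3 can be sketched as below. This is a minimal illustration under one reading of the frequency test (document frequency in P versus in U); the toy documents are hypothetical.

```python
def one_dnf_extract_rn(P, U):
    """1-DNF reliable-negative extraction (sketch of Fig. 3).

    P, U: lists of documents, each document a list of word tokens.
    A word is a positive feature if its document frequency in P
    exceeds its document frequency in U; documents in U containing
    no positive feature form RN.
    """
    vocab = {w for d in P + U for w in d}

    def doc_freq(w, docs):
        return sum(1 for d in docs if w in d) / len(docs)

    pf = {w for w in vocab if doc_freq(w, P) > doc_freq(w, U)}
    return [d for d in U if not pf.intersection(d)]

# Toy usage with hypothetical documents.
P_toy = [["buy", "stock"], ["stock", "price"]]
U_toy = [["stock", "market"], ["weather", "rain"], ["rain", "cloud"]]
rn = one_dnf_extract_rn(P_toy, U_toy)
```

Here "buy", "stock" and "price" become positive features, so only the two documents sharing no vocabulary with P survive as reliable negatives, which illustrates why the extracted RN set tends to be small.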

SVM is an effective learning algorithm for text classification, and the iterative SVM algorithm yields the best performance (see Section III.C for details). The reason for selecting a classifier is that there is a danger in running SVM repetitively: since SVM is sensitive to noise, if some iteration of SVM extracts many positive documents from Q and puts them in RN, the final SVM classifier will be poor. However, it is hard to catch the best classifier. Liu et al [6] perform an evaluation of all 16 possible combinations of methods for step 1 and step 2 on the Reuters and 20 Newsgroups corpora.

III. THE PROPOSED TECHNIQUES

In this section, we propose a novel technique for the LPU problem based on the two-step strategy. First, we introduce a new reliable negative example extracting method based on the kNN algorithm. Although the kNN algorithm cannot be applied directly to the LPU problem, we use it as a ranking process and set a threshold to label the reliable negative set RN. In step 2, we use SVM iteratively to produce the final classifier.

A. Introduction to the kNN Algorithm

The k-nearest neighbor algorithm [10] is among the simplest of all machine-learning algorithms. kNN requires no explicit training and can use the unprocessed training set directly in classification. An object is classified by a majority vote of its neighbors, the object being assigned to the class most common among its k nearest neighbors. The parameter k is often chosen based on experience or knowledge about the classification problem at hand; k is a positive integer, typically small. If k equals 1, the object is simply assigned to the class of its nearest neighbor. In binary (two-class) classification problems, it is desirable for k to be odd to make ties less likely. The same method can be used for regression, by simply assigning the property value of the object to be the average of the values of its k nearest neighbors. It can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones.

In kNN classification, the user need not perform any estimation of parameters as in Rocchio (centroids) classification or in Naive Bayes (priors and conditional probabilities); kNN simply memorizes all examples in the training set and then compares the test examples to them. For this reason, kNN is also called memory-based learning or instance-based learning. The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. In order to identify neighbors, the objects are represented by position vectors in a multidimensional feature space. It is usual to use the Euclidean distance, though other distance measures, such as the Manhattan distance, could in principle be used instead. The k-nearest neighbor algorithm is sensitive to the local structure of the data. The nearest-neighbor rule is a sub-optimal procedure: its use will usually lead to an error rate greater than the minimum possible, i.e., the Bayes error rate. However, with an unlimited number of prototypes the error rate is never worse than twice the Bayes error rate [10]. The effectiveness of kNN is close to that of the most accurate learning methods in many applications.

The kNN algorithm is also an often-used method for text categorization [9]. Given a test document, the system finds the k nearest neighbors among the training documents, and uses the categories of the k neighbors to weight the category candidates. The similarity score of each neighbor document to the test document is used as the weight of the categories of that neighbor. If several of the k nearest neighbors share a category, the per-neighbor weights of that category are added together, and the resulting weighted sum is used as the likelihood score of that category with respect to the test document. By sorting the scores of the candidate categories, a ranked list is obtained for the test document; by setting a threshold on these scores, binary category assignments are obtained. The decision rule [9] in kNN is written as equation (5):

y(d_x, c_j) = \sum_{d_i \in kNN} sim(d_x, d_i) \, y(d_i, c_j) - b_j,   (5)

where y(d_i, c_j) is the classification of document d_i with respect to category c_j; sim(d_x, d_i) is the similarity between the test document d_x and the training document d_i; and b_j is the category-specific threshold for the binary decisions. For the parameter k in kNN, Y. Yang [9] tests the values 30, 45 and 65, and suggests that the resulting differences in the F1 scores of kNN are almost negligible; accordingly, [11] sets k to 45.

B. The RN Extracting Technique Using kNN

The kNN algorithm cannot be applied directly to the LPU problem. However, it is possible to employ a ranking process [12, 13]: the unlabeled examples are ranked according to their similarity to the training samples. When the distances of the unlabeled examples from their k nearest positive examples are computed, the resulting values can be used to sort the classified examples; nearer unlabeled instances take positions ahead of those that are further away. Hroza et al [12, 13] then decide what the 'true' similarity is, how many unlabeled examples they are willing to accept, what degree of precision is acceptable, and what recall is still satisfactory. According to the priorities assigned to the parameters of their kNN ranking algorithm, they label the first r vectors as positive examples. Hroza et al do not give an operable method for deciding the appropriate value of r. When the user is interested only in a small portion of the most relevant documents, this method can achieve very high precision, but the recall is very small, and hence so is the F1 score. Moreover, it is hard to determine the value of r.
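The weighted-vote decision rule of equation (5) can be sketched as follows; function names and the toy vectors are illustrative, not from the cited systems.

```python
def knn_category_score(x, train, sim, k, b=0.0):
    """Weighted-vote score of equation (5) (sketch).

    train : list of (vector, y) pairs, y in {0, 1} indicating whether
            the training document belongs to the category.
    sim   : similarity function between two vectors.
    b     : category-specific threshold b_j; a positive return value
            means the category is assigned.
    """
    neighbours = sorted(train, key=lambda t: sim(x, t[0]), reverse=True)[:k]
    return sum(sim(x, d) * y for d, y in neighbours) - b

# Toy usage with dot-product similarity on tiny vectors.
dot = lambda a, c: sum(p * q for p, q in zip(a, c))
train_toy = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0)]
score = knn_category_score([1.0, 0.0], train_toy, dot, k=2)
```

With k = 2 the two most similar neighbors both carry the category, so their similarities (1.0 and 0.9) are summed into the category score.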

We follow this ranking idea, but reverse the method to extract reliable negative examples rather than to label positive examples. That is, for the LPU problem we set a predefined threshold T; if the resulting similarity value of an unlabeled example is lower than T, we label it as a reliable negative example. Because the unlabeled set is usually very large, T need not be set elaborately. Once a pure reliable negative set has been extracted, some method can be used to refine the classifier in step 2. Under our method, the decision rule can be rewritten as equation (6):

w(d_x) = \sum_{d_i \in kNN} sim(d_x, d_i) - T.   (6)

When kNN is applied to text examples, we tokenize all documents into vectors with TFIDF weights, following the traditional Information Retrieval (IR) approach. Assuming the term vectors are normalized, the cosine function is a commonly used similarity measure for two documents, as in equation (7):

sim(d_i, d_j) = \sum_{m=1}^{|V|} w_{im} w_{jm}.   (7)

For the parameter k in kNN, Hroza et al [12] test k from 1 to 5, and obtain the best result on the Reuters 10 dataset when k is 5. When considering different word representations and stop-word counts, they reach different conclusions [12, 13]. Our proposed reliable negative extracting algorithm using kNN is shown in Fig. 4.

C. The Iterative SVM Technique

Support Vector Machines (SVM) is a relatively new learning approach introduced by Vapnik in 1995 for solving two-class pattern recognition problems [14]. It is based on the Structural Risk Minimization principle, for which error-bound analysis has been theoretically motivated.

Algorithm: Reliable negative extracting using kNN
Input: P, the positive examples set; U, the unlabeled examples set; k, the number of nearest neighbors; T, the threshold
Output: RN, the reliable negative examples set
Steps:
1. RN = {};
2. For each unlabeled example u_i
3.   For each positive example v_j
4.     Compute the similarity sim(u_i, v_j);
5.   End For
6.   Select the k nearest neighbors v_j (j = 1, ..., k);
7.   Compute the resulting value w(u_i) according to equation (6);
8.   If w(u_i) < 0
9.   Then RN = RN ∪ {u_i};
10.  End If
11. End For
Figure 4. Reliable negative extracting using kNN.
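The extraction procedure of Fig. 4 can be sketched as below, a minimal illustration assuming the documents are already L2-normalized TFIDF vectors so that a dot product realizes the cosine of equation (7); the toy vectors are hypothetical.

```python
def cosine(a, c):
    # Dot product; vectors are assumed L2-normalized, so this equals
    # the cosine similarity of equation (7).
    return sum(x * y for x, y in zip(a, c))

def knn_rank_extract_rn(P, U, k=5, T=0.05):
    """Reliable-negative extraction of Fig. 4 (sketch).

    For each unlabeled example, sum the similarities to its k nearest
    positive examples and subtract T (equation (6)); examples with a
    negative result are reliable negatives. Returns indices into U.
    """
    rn = []
    for i, u in enumerate(U):
        sims = sorted((cosine(u, v) for v in P), reverse=True)[:k]
        if sum(sims) - T < 0:
            rn.append(i)
    return rn

# Toy usage: U[1] is orthogonal to every positive example.
rn = knn_rank_extract_rn(P=[[1.0, 0.0], [1.0, 0.0]],
                         U=[[1.0, 0.0], [0.0, 1.0]], k=2, T=0.05)
```

Only the unlabeled example with no similarity to any positive example falls below T, matching the intuition that a low summed similarity to the nearest positives signals a reliable negative.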
The idea of structural risk minimization is to find a hypothesis for which the lowest true error can be guaranteed. SVM are very universal learners in text classification. In their basic form, SVM learn linear threshold functions; nevertheless, by a simple "plug-in" of an appropriate kernel function, they can be used to learn polynomial classifiers, radial basis function (RBF) networks, and three-layer sigmoid neural nets. One remarkable property of SVM is that their ability to learn can be independent of the dimensionality of the feature space. Consider a binary classification task with data points x_i (i = 1, ..., n) having corresponding labels y_i = +1 or -1, and let the decision function be

f(x) = sign(w · x + b).   (8)

The problem of finding the separating hyperplane can be stated as the following optimization problem:

Minimize: (1/2) w^T w
Subject to: y_i(w^T x_i + b) ≥ 1, i = 1, 2, ..., n.   (9)

To deal with cases where there may be no separating hyperplane due to noisy labels among both positive and negative training examples, the soft-margin SVM is proposed, formulated as

Minimize: (1/2) w^T w + C \sum_{i=1}^{n} \xi_i
Subject to: y_i(w^T x_i + b) ≥ 1 - \xi_i, \xi_i ≥ 0, i = 1, 2, ..., n,   (10)

where C > 0 is a parameter that controls the amount of training error allowed and the \xi_i are slack variables.

Joachims [15] first introduced support vector machines for text categorization. The experimental results show that SVM consistently achieve good performance on text categorization tasks, outperforming existing methods substantially and significantly. From theoretical and empirical evidence, he concludes that SVM acknowledge the particular properties of text: (a) high dimensional feature spaces, (b) few irrelevant features (dense concept vector), and (c) sparse instance vectors. With their ability to generalize well in high dimensional feature spaces, SVM eliminate the need for feature selection, making the application of text categorization considerably easier. Another advantage of SVM over conventional methods is their robustness. Furthermore, SVM do not require much parameter tuning, since they can find good parameter settings automatically.
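The soft-margin objective of equation (10) can be minimized, for a linear kernel, by stochastic subgradient descent. The sketch below is a Pegasos-style illustration, not the SVMlight solver used in this paper; it omits the bias term b, so the toy data is separable through the origin.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=100, seed=0):
    """Pegasos-style minimization of the soft-margin objective of
    equation (10) (sketch; no bias term). X: list of feature vectors;
    y: labels in {+1, -1}. Returns the weight vector w."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    w, t = [0.0] * dim, 0
    for _ in range(epochs):
        for i in rng.sample(range(n), n):      # one pass in random order
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]   # regularizer step
            if margin < 1:                     # hinge-loss subgradient
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def svm_predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Toy usage on a linearly separable set (separator through the origin).
X_toy = [[2.0, 0.0], [2.0, 1.0], [-2.0, 0.0], [-2.0, -1.0]]
y_toy = [1, 1, -1, -1]
w = train_linear_svm(X_toy, y_toy)
```

The regularization parameter lam plays the role of 1/(nC) in equation (10): a smaller lam (larger C) tolerates fewer margin violations.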
All this makes SVM a very promising and easy-to-use method for learning text classifiers from examples. For step 2, we run SVM iteratively as shown in Fig. 5. This method is similar to step 2 of the PEBL and Roc-SVM techniques, except that we do not use an additional classifier selection step: our technique does not select a good classifier from the set of classifiers built by SVM, but uses the last SVM classifier at convergence. The basic idea is to use each iteration of SVM to extract

more possible negative examples from Q (= U - RN) and put them into RN. The iteration converges when no document in Q is classified as negative.

Algorithm: Iterative SVM
Input: P, the positive examples set; RN, the reliable negative set produced by step 1; Q, the remaining unlabeled set, i.e., U - RN
Output: the final classifier S
Steps:
1. Assign the label 1 to each document in P;
2. Assign the label -1 to each document in RN;
3. While (true)
4.   Train a new SVM classifier S with P and RN;
5.   Classify Q using S;
6.   Let W be the set of documents in Q that are classified as negative;
7.   If W ≠ {}
8.   Then Q = Q - W; RN = RN ∪ W;
9.   Else exit the loop;
10.  End If
11. End While
Figure 5. The algorithm of iterative SVM.

Yu et al [5] argue that as long as the initial positive and negative examples are strong, iterative SVM can converge to the unbiased negatives through the iterations regardless of the quality of the initial mapping. A poor initial mapping increases the number of iterations of the algorithm, which means a longer training time, but the final accuracy is the same. Our experiments also show that the classification accuracy converges to that of a traditional SVM trained from labeled examples no matter how bad the initial mapping is.

IV. EXPERIMENT

We now evaluate our proposed technique and compare it with the original LPU algorithms, that is, S-EM [2], Roc-SVM [3], PEBL [5] and NB [6].

A. Experiment Setup and Data Preprocessing

We use Reuters-21578 [16], a popular text collection in text classification experiments, which contains documents collected from the Reuters newswire. Among its 135 categories, only the 10 most populous are used in our experiment. Each category in turn serves as the positive class, with the rest as the negative class; this gives us 10 datasets. Table I gives the number of documents in each of the ten topic categories. In data preprocessing, we use the Bow toolkit [17].
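The step-2 loop of Fig. 5 can be sketched as below. The training and prediction routines are pluggable; the nearest-centroid stand-ins and the one-dimensional toy points are illustrative only, not the SVM used in the paper.

```python
def iterative_classify(P, RN, Q, train, predict):
    """The step-2 loop of Fig. 5 (sketch).

    train(pos, neg) -> model and predict(model, x) -> +1/-1 are
    stand-ins for SVM training and classification. Documents of Q
    classified as negative migrate into RN until none move.
    Returns the final model and the enlarged RN.
    """
    RN, Q = list(RN), list(Q)
    while True:
        model = train(P, RN)
        W = [q for q in Q if predict(model, q) == -1]
        if not W:
            return model, RN
        Q = [q for q in Q if q not in W]
        RN = RN + W

def centroid_train(pos, neg):
    # Toy stand-in "classifier": the two class centroids.
    mean = lambda docs: [sum(col) / len(docs) for col in zip(*docs)]
    return mean(pos), mean(neg)

def centroid_predict(model, x):
    cp, cn = model
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return 1 if dist(cp) <= dist(cn) else -1

# Toy usage on one-dimensional points.
model, rn = iterative_classify(P=[[1.0], [1.2]], RN=[[0.0]],
                               Q=[[0.1], [0.9]],
                               train=centroid_train,
                               predict=centroid_predict)
```

In the toy run, the first iteration moves the point near the negatives into RN, the second finds nothing more to move, and the loop stops, mirroring the convergence condition of Fig. 5.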
We applied stop-word removal; the stop list is the SMART system's list of 524 common words, and unlike Hroza et al [12, 13] we do not vary the number of stop words. No feature selection or stemming was done. TFIDF values are used in the feature vectors. For each dataset, 30% of the documents are randomly selected as test documents. The rest (70%) are used to create training sets as follows: a percentage of the documents from the positive class is first selected as the positive examples set P, and the remaining positive documents together with the negative documents form the unlabeled examples set U. We range the percentage from 10% to 90% to create a wide range of scenarios.

TABLE I. THE MOST POPULAR 10 CATEGORIES IN REUTERS-21578
Acq      2369
Corn      237
Crude     578
Earn     3964
Grain     582
Interest  478
Money     717
Ship      286
Trade     486
Wheat     283

B. Evaluation Measures

In our experiments, we use the popular F1 score on the positive class as the evaluation measure. The F1 score takes into account both recall and precision, and is often used as an optimization criterion in threshold tuning for binary decisions. Its value is maximized when recall and precision are equal or close; otherwise, the smaller of recall and precision dominates the value of F1. Precision, recall and F1 are defined as:

Precision = (# of correct positive predictions) / (# of positive predictions),   (11)

Recall = (# of correct positive predictions) / (# of positive examples),   (12)

F1 = (2 × precision × recall) / (precision + recall).   (13)

For evaluating performance averaged across categories, we use macro averaging: macro-averaged performance scores are determined by first computing the performance measures per category and then averaging those to obtain the global means.

C. Experiment Results

We implemented our proposed algorithm. For SVM, we use the SVMlight system [18] with a linear kernel, and do not tune the parameters. The results of PEBL, S-EM, Roc-SVM and the NB method are taken from the experiments of Liu et al [6]; note that they all use the iterative SVM technique in step 2 and are thus comparable with our proposed method.

First, based on the work of Hroza et al [12, 13], we set k to 5 and test T values ranging from 0.01 to 0.3; the macro-averaged F1 scores on the 10 Reuters datasets for each setting are shown in Fig. 6, where we also compare against the other LPU algorithms. When T is 0.05 or 0.01, our proposed method outperforms the others, especially when the percentage of positives is small. We find that T = 0.05 almost always gives the best result, so we use it as the T value in the subsequent experiments. We also observe that the value of T has a significant impact on the F1 score, so how to tune this value is an important piece of future work.

Figure 6. The macro-averaged F1 scores of our proposed method with different T values, compared with the other four LPU algorithms.

Second, we set T to 0.05 and test different k values for kNN. We test not only the values 1, 2 and 5 used in [12, 13], but also 45 as used in [9, 11], as well as 10, 20 and 30. The macro-averaged F1 scores of our proposed method with different k values are shown in Fig. 7. Our experiments confirm Y. Yang's conclusion [11] that the impact of the parameter k on the resulting F1 scores of kNN is almost negligible, so the k value need not be tuned elaborately. The result is exciting: our proposed method performs well and is to some extent independent of k. The experiments show the effectiveness of our proposed technique.

Figure 7. The macro-averaged F1 scores of our proposed method with different k values.

V. CONCLUSION

Many real-world classification applications fall into the class of positive and unlabeled learning problems. In this paper, we propose a new reliable negative example extracting algorithm that uses kNN to solve the LPU problem, based on the two-step strategy.
We adopt the kNN algorithm to rank the similarity of unlabeled examples to their k nearest positive examples, and then set a threshold T: unlabeled examples whose similarity falls below T are labeled as reliable negative examples, in contrast to the work of J. Hroza et al., which labels positive examples. For step 2, we use an iterative SVM technique to refine the classifier. Experiments on the popular Reuters-21578 collection show the effectiveness of our proposed technique. The proposed technique is simple and efficient, and to some extent independent of k. Besides tuning the threshold T for rank learning with the kNN algorithm, larger-scale testing with more real data could yield more accurate answers, which is also the aim of future work.

ACKNOWLEDGMENT

The work was supported by the National Natural Science Foundation of China under Grant No. , the Science and Technology Development Program of Jilin Province of China under Grant No. , and the Science Foundation for Young Teachers of Northeast Normal University (No. ). The authors wish to thank the anonymous reviewers for their comments and suggestions.

REFERENCES

[1] F. Denis, "PAC Learning from Positive Statistical Queries", Proc. of Workshop on Algorithmic Learning Theory, Springer, Heidelberg, 1998, pp.
[2] B. Liu, Y. Dai, X.L. Li, W.S. Lee, and Philip Yu, "Building Text Classifiers Using Positive and Unlabeled Examples", ICDM-03, Melbourne, Florida, November 2003, pp.
[3] B. Liu, W.S. Lee, P.S. Yu, and X.L. Li, "Partially Supervised Classification of Text Documents", Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002), Sydney, July 2002, pp.
[4] X.L. Li and B. Liu, "Learning to Classify Documents with Only Positive Training Set", ECML 2007, LNAI 4701, 2007, pp.
[5] H. Yu, J. Han, and K.C.-C. Chang, "PEBL: Positive Example Based Learning for Web Page Classification Using SVM", Proc. Eighth Int'l Conf. Knowledge Discovery and Data Mining (KDD'02), ACM Press, New York, 2002, pp.
[6] B. Liu, Y. Dai, X.L. Li, W.S. Lee, and Philip Yu, "Building Text Classifiers Using Positive and Unlabeled Examples", Proceedings of the Third IEEE International Conference on Data Mining (ICDM-03), Melbourne, Florida, November 2003, pp.
[7] F. Denis, R. Gilleron and M. Tommasi, "Text classification from positive and unlabeled examples", IPMU,
[8] L. Manevitz and M. Yousef, "One-class SVMs for document classification", Journal of Machine Learning Research, vol. 2, 2001, pp.
[9] Y. Yang and X. Liu, "A Re-Examination of Text Categorization Methods", Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, 1999, Berkeley, CA, USA, pp.
[10] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, Second Edition, John Wiley & Sons, 2001.
[11] Y. Yang, "An evaluation of statistical approaches to text categorization", Journal of Information Retrieval, 1999, volume 1, pp.
[12] J. Hroza, J. Žižka, B. Pouliquen, C. Ignat and R. Steinberger, "Mining Relevant Text Documents Using Ranking-Based k-NN Algorithms Trained by Only Positive Examples", Proceedings of the Fourth Czech-Slovak Conference Knowledge-2005, February 9-11, 2005, Stará Lesná, Slovak Republic, pp.
[13] J. Hroza, J. Žižka, B. Pouliquen, C. Ignat and R. Steinberger, "The Selection of Electronic Text Documents Supported by Only Positive Examples", JADT 2006, Besançon, France, pp.
[14] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York.
[15] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", European Conference on Machine Learning (ECML).
[16] Reuters Text Categorization Collection, html.
[17] Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering.
[18] T. Joachims, "Making large-scale SVM Learning Practical", Advances in Kernel Methods - Support Vector Learning, MIT-Press, 1999.

Bangzuo Zhang, born in Langzhong City, Sichuan Province, P.R. China, on Feb. 27, . Received the Bachelor of Science in computer science education from Northeast Normal University, China in 1995, and the Master of Engineering in computer application technique from Jilin University, China. Since Sep. 2003, he has been a Ph.D. candidate in computer science and technology at Jilin University, China, and has been a faculty member since July . Currently, he is a Lecturer in the College of Computer, Northeast Normal University, China. He has joined and accomplished 3 national and provincial research programs, such as "Research on the Semi-supervised Text Mining and Application". He has also published 6 papers in international conferences/journals, such as "A Novel Reliable Negative Method Based Clustering For Learning from Positive and Unlabeled Examples" in Lecture Notes in Computer Science. His major research interests include databases, intelligent networks, and web intelligence. Mr. Zhang received the First Class Educational Achievement Award from the Higher Educational Committee of Jilin Province, China.

Wanli Zuo, born in Jilin City, Jilin Province, P.R. China, on Dec. 6, . Received the Bachelor of Engineering, Master of Science, and Ph.D. from Jilin University, P.R. China in 1982, 1985, and 2005 respectively. He has been working at Jilin University since . From July 1996 to July 1997, he conducted collaborative research at Louisiana State University, US, as a senior visiting scholar. He has accomplished 5 national and provincial research programs, such as "Object-oriented Active database based on Petri nets". He has also published more than 60 papers in international conferences/journals and Chinese journals, such as "Relationship Graph and Termination Analysis of Active Rules in Database Systems" in the Chinese Journal of Software. He has also published 4 books, such as "A Course of Operating Systems" by the Higher Educational Press of P.R. China. His major research interests include databases, web intelligence, and search engines. Dr. Zuo is currently a senior member of the Computer Federation of China and a council member of the System Software Association of China. He has received 5 awards from the Educational Department of China, such as the Second Class National Educational Achievement Award in 1996.
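The Step-1 procedure summarized in the conclusion above (ranking each unlabeled example by its similarity to its k nearest positive examples, then taking those below a threshold T as reliable negatives) can be sketched as follows. This is a minimal illustration on hypothetical toy 2-D vectors, not the paper's implementation: the paper works on TF-IDF document vectors and refines the result with an iterative SVM in step 2, and the function name and data here are assumptions for demonstration only.

```python
import numpy as np

def extract_reliable_negatives(P, U, k=5, T=0.1):
    """Rank each unlabeled example in U by the mean cosine similarity to its
    k nearest positive examples in P; examples scoring below threshold T
    are returned as the reliable negative (RN) set."""
    # L2-normalize rows so plain dot products become cosine similarities.
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    sims = Un @ Pn.T                      # |U| x |P| cosine similarity matrix
    k = min(k, P.shape[0])
    # For each unlabeled example, average its k largest similarities.
    topk = np.sort(sims, axis=1)[:, -k:]
    scores = topk.mean(axis=1)
    return np.where(scores < T)[0], scores

# Toy data: positives cluster near (1, 1); two unlabeled points sit far away.
P = np.array([[1.0, 1.0], [0.9, 1.1], [1.1, 0.8]])
U = np.array([[1.0, 0.9],      # resembles the positives -> kept unlabeled
              [-1.0, -1.0],    # dissimilar to every positive
              [-0.8, -1.2]])   # dissimilar to every positive
rn_idx, scores = extract_reliable_negatives(P, U, k=2, T=0.0)
print(rn_idx)  # indices of unlabeled examples taken as reliable negatives
```

The RN set produced this way would then seed the negative class for the iterative SVM refinement of step 2; note that T trades off RN purity against RN size, which is why the conclusion singles it out for tuning.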


More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Human Face Recognition Using Generalized. Kernel Fisher Discriminant

Human Face Recognition Using Generalized. Kernel Fisher Discriminant Human Face Recognton Usng Generalzed Kernel Fsher Dscrmnant ng-yu Sun,2 De-Shuang Huang Ln Guo. Insttute of Intellgent Machnes, Chnese Academy of Scences, P.O.ox 30, Hefe, Anhu, Chna. 2. Department of

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-74 Volume 0 Issue BoTechnology 04 An Indan Journal FULL PAPER BTAIJ 0() 04 [684-689] Revew on Chna s sports ndustry fnancng market based on market -orented

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Fast Feature Value Searching for Face Detection

Fast Feature Value Searching for Face Detection Vol., No. 2 Computer and Informaton Scence Fast Feature Value Searchng for Face Detecton Yunyang Yan Department of Computer Engneerng Huayn Insttute of Technology Hua an 22300, Chna E-mal: areyyyke@63.com

More information

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK L-qng Qu, Yong-quan Lang 2, Jng-Chen 3, 2 College of Informaton Scence and Technology, Shandong Unversty of Scence and Technology,

More information