Reliable Negative Extracting Based on knn for Learning from Positive and Unlabeled Examples


94 JOURNAL OF COMPUTERS, VOL. 4, NO. 1, JANUARY 2009

Reliable Negative Extracting Based on kNN for Learning from Positive and Unlabeled Examples

Bangzuo Zhang
College of Computer Science and Technology, Jilin University, Changchun, P. R. China
College of Computer, Northeast Normal University, Changchun, P. R. China
Email:

Wanli Zuo
College of Computer Science and Technology, Jilin University, Changchun, P. R. China
Email:

Abstract: Many real-world classification applications fall into the class of positive and unlabeled learning problems. Almost all existing techniques are based on a two-step strategy. This paper proposes a new reliable negative extracting algorithm for step 1. We adopt the kNN algorithm to rank unlabeled examples by their similarity to the k nearest positive examples, and set a threshold so that unlabeled examples scoring below it are labeled as reliable negative examples, rather than following the more common approach of labeling positive examples. In step 2, we use an iterative SVM technique to refine the final classifier. Our proposed method is simple and efficient, and to some extent independent of k. Experiments on the popular Reuters-21578 collection show the effectiveness of the proposed technique.

Index Terms: Learning from Positive and Unlabeled examples, k Nearest Neighbor, Text Classification, Support Vector Machine, Information Retrieval

I. INTRODUCTION

Traditional learning techniques typically require a large number of labeled examples to learn an accurate classifier. Thus, for binary problems, positive examples and negative examples are mandatory for machine learning and data mining algorithms such as decision trees and neural networks. This approach to building classifiers is called supervised learning. However, in many practical classification applications such as document retrieval and classification, positive information is readily available and unlabeled data can easily be collected; although it is possible to manually label some negative examples, doing so is labor intensive and very time consuming.
One way to reduce the amount of labeled training data needed is to develop classification algorithms that can learn from a set of labeled positive examples augmented with a set of unlabeled examples. That is, given a set P of positive examples of a particular class and a set U of unlabeled examples, build a classifier using P and U to classify the data in U as well as future test data. A first example is web-page classification: suppose we want a program that classifies web sites as interesting for a web user. Positive examples are freely available: the set of web pages corresponding to the web sites in his bookmarks. Moreover, unlabeled web pages are abundant and easily available on the World Wide Web. Many other real-world classification applications also fall into this class of problem, such as diagnosis of disease, where positive data are patients who have the disease and unlabeled data are all patients; or marketing, where positive data are clients who buy the product and unlabeled data are all clients in the database. Denis originally proposed a framework for learning a model from positive examples (POSEX for short) [1] based on the probably approximately correct (PAC) model. The study concentrates on the computational complexity of learning and shows that function classes learnable under the statistical queries model are also learnable from positive and unlabeled examples. Liu et al [2] call this problem LPU (Learning from Positive and Unlabeled examples), while it is also called partially supervised classification [3] and the PU learning problem [4]. Yu et al [5] introduce it as PEBL (Positive Example Based Learning). The key feature of this problem is that there are no labeled negative documents, which makes traditional classification methods inapplicable, as they all need labeled examples of every class. Recently, a few innovative techniques have been proposed to solve this problem, including S-EM [2], Roc-SVM [3], PEBL [5] and NB [6].
One class of these techniques focuses on addressing the lack of labeled negative examples in the training data, based on a two-step strategy as follows:
Step 1: Extract a set of negative examples, called reliable negatives (RN), from the unlabeled examples U. In this step, S-EM uses a Spy technique, Roc-SVM uses the Rocchio algorithm, PEBL uses a technique called 1-DNF, and NB uses the Naive Bayes technique. The key requirement for this step is that the identified negative examples must be reliable, or pure, i.e., with no or very few positive examples among them.
Step 2: Build a set of classifiers by iteratively applying a classification algorithm and then selecting a good classifier from the set. In this step, S-EM uses the Expectation Maximization (EM) algorithm with a NB

(Naive Bayes) classifier as the base classifier, while PEBL and Roc-SVM use the Support Vector Machine (SVM). Both S-EM and Roc-SVM have methods for selecting the final classifier; PEBL simply uses the last classifier at convergence. The underlying idea of these two-step strategies is to iteratively increase the number of unlabeled examples that are classified as negative while keeping the positive examples correctly classified. This idea has been justified to be effective for this problem in [2]. Other classes of methods for learning from positive and unlabeled examples have also been presented. A NB based method (called PNB) [7] tries to statistically remove the effect of positive data in the unlabeled set. The main shortcoming of this method is that it requires the user to supply the positive class probability, which is hard to provide in practice. It is also possible to discard the unlabeled examples and learn only from the positive examples. This was done in the one-class SVM [8], which tries to learn the support of the positive distribution. Some results [6] show that its performance is poorer than that of learning methods that take advantage of the unlabeled data. kNN [9] stands for k-nearest neighbor classification, a well-known statistical approach that has been intensively studied in pattern recognition. kNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The kNN algorithm assigns each example to the majority class of its k closest neighbors, where k is a parameter. For 1NN, the algorithm assigns each example to the class of its closest neighbor. The kNN algorithm is also an often-used method for text categorization and has reported the best result on the Reuters collection [9]. In this paper, we also follow the two-step strategy, and propose a novel method based on the kNN algorithm for step 1. We first use the kNN algorithm to extract reliable negatives and then construct an initial classifier.
We then apply the iterative SVM algorithm until it converges. We carry out experiments on the popular Reuters-21578 collection and demonstrate the effectiveness of our proposed technique. The rest of this paper first reviews the existing two-step LPU algorithms in Section II, then proposes a new reliable negative example extracting method based on the kNN algorithm, shows its effectiveness experimentally on the Reuters collection in Section IV, and finally concludes in Section V.

II. RELATED WORKS

Given a set of training documents D, each document is considered as an ordered list of words. We use w_{d,k} to denote the word in position k of document d, where each word is from the vocabulary V = {w_1, w_2, ..., w_|V|}. The vocabulary is the set of all the words considered for classification. For LPU, we only consider binary classification, so there is a set of predefined classes C = {c_0, c_1}; we use c_0 for the positive class and c_1 for the negative class. Traditional supervised and semi-supervised classification techniques require labeled training examples of all classes to build a classifier, and are thus not suitable for the LPU problem. Recently, several LPU algorithms, including S-EM [2], NB [6], Roc-SVM [3] and PEBL [5], have been proposed; they are all based on the two-step strategy. We first review the existing techniques for step 1 in detail.

A. The Spy Technique in S-EM

The Spy technique in S-EM [2] first randomly selects a set S of positive documents from P and puts them in U. The default proportion is 10% (15% is used in [6]). The algorithm is given in Fig. 1. The spies behave identically to the unknown positive documents hidden in U and hence allow the behavior of those unknown positives to be reliably inferred. The algorithm then runs I-EM using the set P - S as positive and the set U ∪ S as negative (lines 3-7); I-EM basically runs NB twice. After I-EM completes, the resulting classifier uses the probabilities assigned to the documents in S to decide a probability threshold th, which identifies likely negative documents in U and produces the reliable negative set RN.
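The thresholding logic of the Spy technique can be sketched as follows. This is a minimal illustration, not the S-EM implementation: `score` stands in for the probability output Pr(positive | d) of the I-EM/NB classifier trained on (P - S) versus (U ∪ S), and all data values are hypothetical toys.

```python
import random

def spy_extract_rn(P, U, score, spy_frac=0.10, seed=0):
    """Spy-style reliable-negative extraction (sketch of Fig. 1).

    P, U  : lists of examples, in any representation `score` accepts.
    score : stand-in for the I-EM/NB classifier of S-EM; any function
            returning an estimate of Pr(positive | d).
    Returns the members of U scoring below every spy's score.
    """
    rng = random.Random(seed)
    n_spies = max(1, int(spy_frac * len(P)))
    spies = [P[i] for i in rng.sample(range(len(P)), n_spies)]
    # Threshold th: the lowest probability any spy receives; genuine
    # positives hidden in U are unlikely to score below it.
    th = min(score(d) for d in spies)
    return [d for d in U if score(d) < th]

# Toy usage: one-dimensional "documents" scored by their coordinate.
P_toy = [[0.90], [0.95], [1.00], [0.92]]
U_toy = [[0.10], [1.05], [0.20]]          # 1.05 is a hidden positive
rn = spy_extract_rn(P_toy, U_toy, score=lambda d: d[0])
```

Because every spy scores at least as high as the hidden positive examples tend to, the threshold excludes them from RN while the low-scoring examples are kept.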
1. RN = {};
2. S = Sample(P, s%);
3. Us = U ∪ S;
4. Ps = P - S;
5. Assign each document in Ps the class label 1;
6. Assign each document in Us the class label -1;
7. I-EM(Us, Ps); // This produces a NB classifier.
8. Classify each document in Us using the NB classifier;
9. Determine a probability threshold th using S;
10. For each document d ∈ Us
11.   If its probability Pr(1|d) < th
12.   Then RN = RN ∪ {d};
13.   End If
14. End For
Figure 1. The spy technique in S-EM.

However, S-EM is not accurate because it uses the naive Bayesian classifier as the underlying classifier in step 2. The algorithm performs stably when the positive set is very small; when the positive set is larger, it is worse than the others.

B. The Naive Bayes Technique

The NB (Naive Bayes) technique is a popular method for text classification. Liu et al [6] first introduced it into LPU as a new method for step 1. The NB classifier is constructed by using the training documents to estimate the probability of each class given the document feature values of a new instance. To perform classification, it computes the posterior probability Pr(c_j|d_i). Based on Bayesian probability and the multinomial model, it gives

Pr(c_j) = \frac{\sum_{i=1}^{|D|} Pr(c_j|d_i)}{|D|}.   (1)

To avoid zero probability estimates, some smoothing method is usually used. Liu et al [6] use Lidstone smoothing:

Pr(w_t|c_j) = \frac{\lambda + \sum_{i=1}^{|D|} N(w_t, d_i) Pr(c_j|d_i)}{\lambda|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i) Pr(c_j|d_i)},   (2)

where \lambda is the smoothing factor, N(w_t, d_i) is the number of times that word w_t occurs in document d_i, and Pr(c_j|d_i) ∈ {0, 1} depending on the class of the document. Assuming that the probabilities of the words are independent given the class, the NB classifier is defined as equation (3):

Pr(c_j|d_i) = \frac{Pr(c_j) \prod_{k=1}^{|d_i|} Pr(w_{d_i,k}|c_j)}{\sum_{r=1}^{|C|} Pr(c_r) \prod_{k=1}^{|d_i|} Pr(w_{d_i,k}|c_r)}.   (3)

In classifying a document d_i, the class with the highest Pr(c_j|d_i) is assigned as the class of the document. The method of extracting a set RN of reliable negative documents from the unlabeled set U is given briefly in Fig. 2. Despite the fact that the assumption of conditional independence is generally not true for word appearance in documents, the naive Bayes classifier is surprisingly effective.

1. Assign label 1 to each document in P;
2. Assign label -1 to each document in U;
3. Build a NB classifier using P and U;
4. Use the classifier to classify U. Those documents in U that are classified as negative form the reliable negative set RN.
Figure 2. The method of extracting RN using NB.

C. The Rocchio Technique

The Roc-SVM algorithm [3] uses the Rocchio method to identify a set RN from U; Rocchio is a classic method for document routing and filtering in information retrieval. A Rocchio classifier is built by constructing a prototype vector for each class following equation (4):

\vec{c}_j = \alpha \frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \frac{\vec{d}}{\|\vec{d}\|},   (4)

where \alpha and \beta are parameters that adjust the relative impact of relevant and irrelevant training examples; generally \alpha = 16 and \beta = 4 are used. In classification, for each test document td, the cosine similarity of td with each prototype vector is computed, and the class whose prototype vector is more similar to td is assigned to td. The algorithm that uses Rocchio to identify a set RN from U is the same as that in Fig. 2, except that it replaces NB with Rocchio. Rocchio performs well consistently under a variety of conditions.

D. The 1-DNF Technique for PEBL

Yu et al [5] propose the PEBL framework for web page classification, which uses a mapping-convergence algorithm. In the mapping stage, they extract reliable negatives from the unlabeled data by the 1-DNF method. The 1-DNF algorithm is given in Fig. 3. It first builds a disjunctive list of positive features PF, containing the words that occur in the positive set P more frequently than in the unlabeled set U (lines 2-6). Then it tries to filter out possible positive documents from U (lines 8-12): a document in U that does not contain any positive feature in PF is regarded as a strongly negative document. With this algorithm, the RN set is always small, and sometimes consists of short text examples. PEBL is not robust because it performs well in certain situations and fails badly in others. PEBL is sensitive to the number of positive examples: when the positive data is small, the results are often very poor.

E. Techniques in Step 2

There are four techniques for the second step:
1. Running SVM only once using the sets P and RN after step 1. This method is seldom used.
2. Running EM. This method is used in S-EM [2].
3. Running SVM iteratively. This method is used in PEBL [5].
4. Running SVM iteratively and then selecting a final classifier. This method is used in Roc-SVM [3].
The Expectation-Maximization (EM) algorithm is a popular iterative algorithm for maximum likelihood estimation in problems with missing data. EM consists of two steps, the Expectation step and the Maximization step. The Expectation step basically fills in the missing data: it produces and revises the probabilistic labels of the documents in Q = U - RN. The parameters are then estimated in the Maximization step, which leads to the next iteration of the algorithm; EM converges when its parameters stabilize. Here, the EM algorithm iteratively runs NB to revise the probabilistic label of each document in the set Q.

1. PF = {};
2. For i = 1 to n
3.   If (freq(w_i, P)/|P| > freq(w_i, U)/|U|)
4.   Then PF = PF ∪ {w_i};
5.   End If
6. End For
7. RN = U;
8. For each document d ∈ U
9.   If there exists w_i ∈ PF with freq(w_i, d) > 0
10.  Then RN = RN - {d};
11.  End If
12. End For
Figure 3. The 1-DNF technique in PEBL.
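The 1-DNF extraction of Fig. 3 can be sketched as below. This is a minimal illustration under one reading of the frequency test (document frequency in P versus in U); the toy documents are hypothetical.

```python
def one_dnf_extract_rn(P, U):
    """1-DNF reliable-negative extraction (sketch of Fig. 3).

    P, U: lists of documents, each document a list of word tokens.
    A word is a positive feature if its document frequency in P
    exceeds its document frequency in U; documents in U containing
    no positive feature form RN.
    """
    vocab = {w for d in P + U for w in d}

    def doc_freq(w, docs):
        return sum(1 for d in docs if w in d) / len(docs)

    pf = {w for w in vocab if doc_freq(w, P) > doc_freq(w, U)}
    return [d for d in U if not pf.intersection(d)]

# Toy usage with hypothetical documents.
P_toy = [["buy", "stock"], ["stock", "price"]]
U_toy = [["stock", "market"], ["weather", "rain"], ["rain", "cloud"]]
rn = one_dnf_extract_rn(P_toy, U_toy)
```

Here "buy", "stock" and "price" become positive features, so only the two documents sharing no vocabulary with P survive as reliable negatives, which illustrates why the extracted RN set tends to be small.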

SVM is an effective learning algorithm for text classification, and the iterative SVM algorithm yields the best performance (see Section III.C for details). The reason for selecting a classifier is that there is a danger in running SVM repetitively: since SVM is sensitive to noise, if some iteration of SVM extracts many positive documents from Q and puts them in RN, the final SVM classifier will be poor. However, it is hard to catch the best classifier. Liu et al [6] perform an evaluation of all 16 possible combinations of methods for step 1 and step 2 on the Reuters and 20 Newsgroups corpora.

III. THE PROPOSED TECHNIQUES

In this section, we propose a novel technique for the LPU problem based on the two-step strategy. First, we introduce a new reliable negative example extracting method based on the kNN algorithm. Although the kNN algorithm cannot be applied directly to the LPU problem, we use it as a ranking process and set a threshold to label the reliable negative set RN. In step 2, we use SVM iteratively to produce the final classifier.

A. Introduction to the kNN Algorithm

The k-nearest neighbor algorithm [10] is among the simplest of all machine-learning algorithms. kNN requires no explicit training and can use the unprocessed training set directly in classification. An object is classified by a majority vote of its neighbors, the object being assigned to the class most common among its k nearest neighbors. The parameter k is often chosen based on experience or knowledge about the classification problem at hand; k is a positive integer, typically small. If k equals 1, the object is simply assigned to the class of its nearest neighbor. In binary (two-class) classification problems, it is desirable for k to be odd to make ties less likely. The same method can be used for regression, by simply assigning the property value of the object to be the average of the values of its k nearest neighbors. It can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones.

In kNN classification, the user need not perform any estimation of parameters as in Rocchio (centroids) classification or in Naive Bayes (priors and conditional probabilities); kNN simply memorizes all examples in the training set and then compares the test examples to them. For this reason, kNN is also called memory-based learning or instance-based learning. The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. In order to identify neighbors, the objects are represented by position vectors in a multidimensional feature space. It is usual to use the Euclidean distance, though other distance measures, such as the Manhattan distance, could in principle be used instead. The k-nearest neighbor algorithm is sensitive to the local structure of the data. The nearest-neighbor rule is a sub-optimal procedure: its use will usually lead to an error rate greater than the minimum possible, i.e., the Bayes error rate. However, with an unlimited number of prototypes the error rate is never worse than twice the Bayes error rate [10]. The effectiveness of kNN is close to that of the most accurate learning methods in many applications.

The kNN algorithm is also an often-used method for text categorization [9]. Given a test document, the system finds the k nearest neighbors among the training documents, and uses the categories of the k neighbors to weight the category candidates. The similarity score of each neighbor document to the test document is used as the weight of the categories of that neighbor. If several of the k nearest neighbors share a category, the per-neighbor weights of that category are added together, and the resulting weighted sum is used as the likelihood score of that category with respect to the test document. By sorting the scores of the candidate categories, a ranked list is obtained for the test document; by setting a threshold on these scores, binary category assignments are obtained. The decision rule [9] in kNN is written as equation (5):

y(d_x, c_j) = \sum_{d_i \in kNN} sim(d_x, d_i) \, y(d_i, c_j) - b_j,   (5)

where y(d_i, c_j) is the classification of document d_i with respect to category c_j; sim(d_x, d_i) is the similarity between the test document d_x and the training document d_i; and b_j is the category-specific threshold for the binary decisions. For the parameter k in kNN, Y. Yang [9] tests the values 30, 45 and 65, and suggests that the resulting differences in the F1 scores of kNN are almost negligible; accordingly, [11] sets k to 45.

B. The RN Extracting Technique Using kNN

The kNN algorithm cannot be applied directly to the LPU problem. However, it is possible to employ a ranking process [12, 13]: the unlabeled examples are ranked according to their similarity to the training samples. When the distances of the unlabeled examples from their k nearest positive examples are computed, the resulting values can be used to sort the classified examples; nearer unlabeled instances take positions ahead of those that are further away. Hroza et al [12, 13] then decide what the 'true' similarity is, how many unlabeled examples they are willing to accept, what degree of precision is acceptable, and what recall is still satisfactory. According to the priorities assigned to the parameters of their kNN ranking algorithm, they label the first r vectors as positive examples. Hroza et al do not give an operable method for deciding the appropriate value of r. When the user is interested only in a small portion of the most relevant documents, this method can achieve very high precision, but the recall is very small, and hence so is the F1 score. Moreover, it is hard to determine the value of r.
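The weighted-vote decision rule of equation (5) can be sketched as follows; function names and the toy vectors are illustrative, not from the cited systems.

```python
def knn_category_score(x, train, sim, k, b=0.0):
    """Weighted-vote score of equation (5) (sketch).

    train : list of (vector, y) pairs, y in {0, 1} indicating whether
            the training document belongs to the category.
    sim   : similarity function between two vectors.
    b     : category-specific threshold b_j; a positive return value
            means the category is assigned.
    """
    neighbours = sorted(train, key=lambda t: sim(x, t[0]), reverse=True)[:k]
    return sum(sim(x, d) * y for d, y in neighbours) - b

# Toy usage with dot-product similarity on tiny vectors.
dot = lambda a, c: sum(p * q for p, q in zip(a, c))
train_toy = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0)]
score = knn_category_score([1.0, 0.0], train_toy, dot, k=2)
```

With k = 2 the two most similar neighbors both carry the category, so their similarities (1.0 and 0.9) are summed into the category score.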

We follow this ranking idea, but reverse the method to extract reliable negative examples rather than to label positive examples. That is, for the LPU problem we set a predefined threshold T; if the resulting similarity value of an unlabeled example is lower than T, we label it as a reliable negative example. Because the unlabeled set is usually very large, T need not be set elaborately. Once a pure reliable negative set has been extracted, some method can be used to refine the classifier in step 2. Under our method, the decision rule can be rewritten as equation (6):

w(d_x) = \sum_{d_i \in kNN} sim(d_x, d_i) - T.   (6)

When kNN is applied to text examples, we tokenize all documents into vectors with TFIDF weights, following the traditional Information Retrieval (IR) approach. Assuming the term vectors are normalized, the cosine function is a commonly used similarity measure for two documents, as in equation (7):

sim(d_i, d_j) = \sum_{m=1}^{|V|} w_{im} w_{jm}.   (7)

For the parameter k in kNN, Hroza et al [12] test k from 1 to 5, and obtain the best result on the Reuters 10 dataset when k is 5. When considering different word representations and stop-word counts, they reach different conclusions [12, 13]. Our proposed reliable negative extracting algorithm using kNN is shown in Fig. 4.

C. The Iterative SVM Technique

Support Vector Machines (SVM) is a relatively new learning approach introduced by Vapnik in 1995 for solving two-class pattern recognition problems [14]. It is based on the Structural Risk Minimization principle, for which error-bound analysis has been theoretically motivated.

Algorithm: Reliable negative extracting using kNN
Input: P, the positive examples set; U, the unlabeled examples set; k, the number of nearest neighbors; T, the threshold
Output: RN, the reliable negative examples set
Steps:
1. RN = {};
2. For each unlabeled example u_i
3.   For each positive example v_j
4.     Compute the similarity sim(u_i, v_j);
5.   End For
6.   Select the k nearest neighbors v_j (j = 1, ..., k);
7.   Compute the resulting value w(u_i) according to equation (6);
8.   If w(u_i) < 0
9.   Then RN = RN ∪ {u_i};
10.  End If
11. End For
Figure 4. Reliable negative extracting using kNN.
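The extraction procedure of Fig. 4 can be sketched as below, a minimal illustration assuming the documents are already L2-normalized TFIDF vectors so that a dot product realizes the cosine of equation (7); the toy vectors are hypothetical.

```python
def cosine(a, c):
    # Dot product; vectors are assumed L2-normalized, so this equals
    # the cosine similarity of equation (7).
    return sum(x * y for x, y in zip(a, c))

def knn_rank_extract_rn(P, U, k=5, T=0.05):
    """Reliable-negative extraction of Fig. 4 (sketch).

    For each unlabeled example, sum the similarities to its k nearest
    positive examples and subtract T (equation (6)); examples with a
    negative result are reliable negatives. Returns indices into U.
    """
    rn = []
    for i, u in enumerate(U):
        sims = sorted((cosine(u, v) for v in P), reverse=True)[:k]
        if sum(sims) - T < 0:
            rn.append(i)
    return rn

# Toy usage: U[1] is orthogonal to every positive example.
rn = knn_rank_extract_rn(P=[[1.0, 0.0], [1.0, 0.0]],
                         U=[[1.0, 0.0], [0.0, 1.0]], k=2, T=0.05)
```

Only the unlabeled example with no similarity to any positive example falls below T, matching the intuition that a low summed similarity to the nearest positives signals a reliable negative.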
The idea of structural risk minimization is to find a hypothesis for which the lowest true error can be guaranteed. SVM are very universal learners in text classification. In their basic form, SVM learn linear threshold functions; nevertheless, by a simple "plug-in" of an appropriate kernel function, they can be used to learn polynomial classifiers, radial basis function (RBF) networks, and three-layer sigmoid neural nets. One remarkable property of SVM is that their ability to learn can be independent of the dimensionality of the feature space. Consider a binary classification task with data points x_i (i = 1, ..., n) having corresponding labels y_i = +1 or -1, and let the decision function be

f(x) = sign(w · x + b).   (8)

The problem of finding the separating hyperplane can be stated as the following optimization problem:

Minimize: (1/2) w^T w
Subject to: y_i(w^T x_i + b) ≥ 1, i = 1, 2, ..., n.   (9)

To deal with cases where there may be no separating hyperplane due to noisy labels among both positive and negative training examples, the soft-margin SVM is proposed, formulated as

Minimize: (1/2) w^T w + C \sum_{i=1}^{n} \xi_i
Subject to: y_i(w^T x_i + b) ≥ 1 - \xi_i, \xi_i ≥ 0, i = 1, 2, ..., n,   (10)

where C > 0 is a parameter that controls the amount of training error allowed and the \xi_i are slack variables.

Joachims [15] first introduced support vector machines for text categorization. The experimental results show that SVM consistently achieve good performance on text categorization tasks, outperforming existing methods substantially and significantly. From theoretical and empirical evidence, he concludes that SVM acknowledge the particular properties of text: (a) high dimensional feature spaces, (b) few irrelevant features (dense concept vector), and (c) sparse instance vectors. With their ability to generalize well in high dimensional feature spaces, SVM eliminate the need for feature selection, making the application of text categorization considerably easier. Another advantage of SVM over conventional methods is their robustness. Furthermore, SVM do not require much parameter tuning, since they can find good parameter settings automatically.
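The soft-margin objective of equation (10) can be minimized, for a linear kernel, by stochastic subgradient descent. The sketch below is a Pegasos-style illustration, not the SVMlight solver used in this paper; it omits the bias term b, so the toy data is separable through the origin.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=100, seed=0):
    """Pegasos-style minimization of the soft-margin objective of
    equation (10) (sketch; no bias term). X: list of feature vectors;
    y: labels in {+1, -1}. Returns the weight vector w."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    w, t = [0.0] * dim, 0
    for _ in range(epochs):
        for i in rng.sample(range(n), n):      # one pass in random order
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]   # regularizer step
            if margin < 1:                     # hinge-loss subgradient
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def svm_predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Toy usage on a linearly separable set (separator through the origin).
X_toy = [[2.0, 0.0], [2.0, 1.0], [-2.0, 0.0], [-2.0, -1.0]]
y_toy = [1, 1, -1, -1]
w = train_linear_svm(X_toy, y_toy)
```

The regularization parameter lam plays the role of 1/(nC) in equation (10): a smaller lam (larger C) tolerates fewer margin violations.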
All this makes SVM a very promising and easy-to-use method for learning text classifiers from examples. For step 2, we run SVM iteratively as shown in Fig. 5. This method is similar to step 2 of the PEBL and Roc-SVM techniques, except that we do not use an additional classifier selection step: our technique does not select a good classifier from the set of classifiers built by SVM, but uses the last SVM classifier at convergence. The basic idea is to use each iteration of SVM to extract

more possible negative examples from Q (= U - RN) and put them into RN. The iteration converges when no document in Q is classified as negative.

Algorithm: Iterative SVM
Input: P, the positive examples set; RN, the reliable negative set produced by step 1; Q, the remaining unlabeled set, i.e., U - RN
Output: the final classifier S
Steps:
1. Assign the label 1 to each document in P;
2. Assign the label -1 to each document in RN;
3. While (true)
4.   Train a new SVM classifier S with P and RN;
5.   Classify Q using S;
6.   Let W be the set of documents in Q that are classified as negative;
7.   If W ≠ {}
8.   Then Q = Q - W; RN = RN ∪ W;
9.   Else exit the loop;
10.  End If
11. End While
Figure 5. The algorithm of iterative SVM.

Yu et al [5] argue that as long as the initial positive and negative examples are strong, iterative SVM can converge to the unbiased negatives through the iterations regardless of the quality of the initial mapping. A poor initial mapping increases the number of iterations of the algorithm, which means a longer training time, but the final accuracy is the same. Our experiments also show that the classification accuracy converges to that of a traditional SVM trained from labeled examples no matter how bad the initial mapping is.

IV. EXPERIMENT

We now evaluate our proposed technique and compare it with the original LPU algorithms, that is, S-EM [2], Roc-SVM [3], PEBL [5] and NB [6].

A. Experiment Setup and Data Preprocessing

We use Reuters-21578 [16], a popular text collection in text classification experiments, which contains documents collected from the Reuters newswire. Among its 135 categories, only the 10 most populous are used in our experiment. Each category in turn serves as the positive class, with the rest as the negative class; this gives us 10 datasets. Table I gives the number of documents in each of the ten topic categories. In data preprocessing, we use the Bow toolkit [17].
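The step-2 loop of Fig. 5 can be sketched as below. The training and prediction routines are pluggable; the nearest-centroid stand-ins and the one-dimensional toy points are illustrative only, not the SVM used in the paper.

```python
def iterative_classify(P, RN, Q, train, predict):
    """The step-2 loop of Fig. 5 (sketch).

    train(pos, neg) -> model and predict(model, x) -> +1/-1 are
    stand-ins for SVM training and classification. Documents of Q
    classified as negative migrate into RN until none move.
    Returns the final model and the enlarged RN.
    """
    RN, Q = list(RN), list(Q)
    while True:
        model = train(P, RN)
        W = [q for q in Q if predict(model, q) == -1]
        if not W:
            return model, RN
        Q = [q for q in Q if q not in W]
        RN = RN + W

def centroid_train(pos, neg):
    # Toy stand-in "classifier": the two class centroids.
    mean = lambda docs: [sum(col) / len(docs) for col in zip(*docs)]
    return mean(pos), mean(neg)

def centroid_predict(model, x):
    cp, cn = model
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return 1 if dist(cp) <= dist(cn) else -1

# Toy usage on one-dimensional points.
model, rn = iterative_classify(P=[[1.0], [1.2]], RN=[[0.0]],
                               Q=[[0.1], [0.9]],
                               train=centroid_train,
                               predict=centroid_predict)
```

In the toy run, the first iteration moves the point near the negatives into RN, the second finds nothing more to move, and the loop stops, mirroring the convergence condition of Fig. 5.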
We applied stop-word removal; the stop list is the SMART system's list of 524 common words, and unlike Hroza et al [12, 13] we do not vary the number of stop words. No feature selection or stemming was done. TFIDF values are used in the feature vectors. For each dataset, 30% of the documents are randomly selected as test documents. The rest (70%) are used to create training sets as follows: a percentage of the documents from the positive class is first selected as the positive examples set P, and the remaining positive documents together with the negative documents form the unlabeled examples set U. We range the percentage from 10% to 90% to create a wide range of scenarios.

TABLE I. THE MOST POPULAR 10 CATEGORIES IN REUTERS-21578
Acq      2369
Corn      237
Crude     578
Earn     3964
Grain     582
Interest  478
Money     717
Ship      286
Trade     486
Wheat     283

B. Evaluation Measures

In our experiments, we use the popular F1 score on the positive class as the evaluation measure. The F1 score takes into account both recall and precision, and is often used as an optimization criterion in threshold tuning for binary decisions. Its value is maximized when recall and precision are equal or close; otherwise, the smaller of recall and precision dominates the value of F1. Precision, recall and F1 are defined as:

Precision = (# of correct positive predictions) / (# of positive predictions),   (11)

Recall = (# of correct positive predictions) / (# of positive examples),   (12)

F1 = (2 × precision × recall) / (precision + recall).   (13)

For evaluating performance averaged across categories, we use macro averaging: macro-averaged performance scores are determined by first computing the performance measures per category and then averaging those to obtain the global means.

C. Experiment Results

We implemented our proposed algorithm. For SVM, we use the SVMlight system [18] with a linear kernel, and do not tune the parameters. The results of PEBL, S-EM, Roc-SVM and the NB method are taken from the experiments of Liu et al [6]; note that they all use the iterative SVM technique in step 2 and are thus comparable with our proposed method.

First, based on the work of Hroza et al [12, 13], we set k to 5 and test T values ranging from 0.01 to 0.3; the macro-averaged F1 scores on the 10 Reuters datasets for each setting are shown in Fig. 6, where we also compare against the other LPU algorithms. When T is 0.05 or 0.01, our proposed method outperforms the others, especially when the percentage of positives is small. We find that T = 0.05 almost always gives the best result, so we use it as the T value in the subsequent experiments. We also observe that the value of T has a significant impact on the F1 score, so how to tune this value is an important piece of future work.

Figure 6. The macro-averaged F1 scores of our proposed method with different T values, compared with the other four LPU algorithms.

Second, we set T to 0.05 and test different k values for kNN. We test not only the values 1, 2 and 5 used in [12, 13], but also 45 as used in [9, 11], as well as 10, 20 and 30. The macro-averaged F1 scores of our proposed method with different k values are shown in Fig. 7. Our experiments confirm Y. Yang's conclusion [11] that the impact of the parameter k on the resulting F1 scores of kNN is almost negligible, so the k value need not be tuned elaborately. The result is exciting: our proposed method performs well and is to some extent independent of k. The experiments show the effectiveness of our proposed technique.

Figure 7. The macro-averaged F1 scores of our proposed method with different k values.

V. CONCLUSION

Many real-world classification applications fall into the class of positive and unlabeled learning problems. In this paper, we propose a new reliable negative example extracting algorithm that uses kNN to solve the LPU problem, based on the two-step strategy.
We adopt the kNN algorithm to rank the similarity of unlabeled examples to their k nearest positive examples, and then set a threshold T: unlabeled examples whose similarity falls below T are labeled as reliable negative examples, in contrast to the work of J. Hroza et al., which labels positive examples. For step 2, we use an iterative SVM technique to refine the classifier. Experiments on the popular Reuters-21578 collection show the effectiveness of our proposed technique. The proposed technique is simple and efficient, and to some extent independent of k. Besides tuning the threshold T for rank learning with the kNN algorithm, larger-scale testing with more real data could yield more accurate answers, which is also the aim of future work.

ACKNOWLEDGMENT

The work was supported by the National Natural Science Foundation of China under Grant No. , the Science and Technology Development Program of Jilin Province of China under Grant No. , and the Science Foundation for Young Teachers of Northeast Normal University (No. ). The authors wish to thank the anonymous reviewers for their comments and suggestions.

REFERENCES

[1] F. Denis, "PAC Learning from Positive Statistical Queries", Proc. of Workshop on Algorithmic Learning Theory, Springer, Heidelberg, 1998, pp.
[2] B. Liu, Y. Dai, X.L. Li, W.S. Lee, and Philip Yu, "Building Text Classifiers Using Positive and Unlabeled Examples", ICDM-03, Melbourne, Florida, November 2003, pp.
[3] B. Liu, W.S. Lee, P.S. Yu, and X.L. Li, "Partially Supervised Classification of Text Documents", Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002), Sydney, July 2002, pp.
[4] X.L. Li and B. Liu, "Learning to Classify Documents with Only Positive Training Set", ECML 2007, LNAI 4701, 2007, pp.
[5] H. Yu, J. Han, and K.C.-C. Chang, "PEBL: Positive Example Based Learning for Web Page Classification Using SVM", Proc. Eighth Int'l Conf. Knowledge Discovery and Data Mining (KDD'02), ACM Press, New York, 2002, pp.
[6] B. Liu, Y. Dai, X.L. Li, W.S. Lee, and Philip Yu, "Building Text Classifiers Using Positive and Unlabeled Examples", Proceedings of the Third IEEE International Conference on Data Mining (ICDM-03), Melbourne, Florida, November 2003, pp.
[7] F. Denis, R. Gilleron and M. Tommasi, "Text classification from positive and unlabeled examples", IPMU,
[8] L. Manevitz and M. Yousef, "One-class SVMs for document classification", Journal of Machine Learning Research, vol. 2, 2001, pp.
[9] Y. Yang and X. Liu, "A Re-Examination of Text Categorization Methods", Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, 1999, Berkeley, CA, USA, pp.
[10] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, Second Edition, John Wiley & Sons, 2001.
[11] Y. Yang, "An evaluation of statistical approaches to text categorization", Journal of Information Retrieval, 1999, volume 1, pp.
[12] J. Hroza, J. Žižka, B. Pouliquen, C. Ignat and R. Steinberger, "Mining Relevant Text Documents Using Ranking-Based k-NN Algorithms Trained by Only Positive Examples", Proceedings of the Fourth Czech-Slovak Conference Knowledge-2005, February 9-11, 2005, Stará Lesná, Slovak Republic, pp.
[13] J. Hroza, J. Žižka, B. Pouliquen, C. Ignat and R. Steinberger, "The Selection of Electronic Text Documents Supported by Only Positive Examples", JADT 2006, Besançon, France, pp.
[14] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York.
[15] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", European Conference on Machine Learning (ECML).
[16] Reuters Text Categorization Collection, html.
[17] Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering.
[18] T. Joachims, "Making large-scale SVM Learning Practical", Advances in Kernel Methods - Support Vector Learning, MIT-Press, 1999.

Bangzuo Zhang, born in Langzhong City, Sichuan Province, P.R. China, on Feb. 27, . Received the Bachelor of Science in computer science education from Northeast Normal University, China in 1995, and the Master of Engineering in computer application technique from Jilin University, China. Since Sep. 2003, he has been a Ph.D. candidate in computer science and technology at Jilin University, China, and has been a faculty member since July . Currently, he is a Lecturer in the College of Computer, Northeast Normal University, China. He has joined and accomplished 3 national and provincial research programs, such as "Research on the Semi-supervised Text Mining and Application". He has also published 6 papers in international conferences/journals, such as "A Novel Reliable Negative Method Based Clustering For Learning from Positive and Unlabeled Examples" in Lecture Notes in Computer Science. His major research interests include databases, intelligent networks, and web intelligence. Mr. Zhang received the First Class Educational Achievement Award from the Higher Educational Committee of Jilin Province, China.

Wanli Zuo, born in Jilin City, Jilin Province, P.R. China, on Dec. 6, . Received the Bachelor of Engineering, Master of Science, and Ph.D. from Jilin University, P.R. China in 1982, 1985, and 2005 respectively. He has been working at Jilin University since . From July 1996 to July 1997, he conducted collaborative research at Louisiana State University, US, as a senior visiting scholar. He has accomplished 5 national and provincial research programs, such as "Object-oriented Active database based on Petri nets". He has also published more than 60 papers in international conferences/journals and Chinese journals, such as "Relationship Graph and Termination Analysis of Active Rules in Database Systems" in the Chinese Journal of Software. He has also published 4 books, such as "A Course of Operating Systems" by the Higher Educational Press of P.R. China. His major research interests include databases, web intelligence, and search engines. Dr. Zuo is currently a senior member of the Computer Federation of China and a council member of the System Software Association of China. He has received 5 awards from the Educational Department of China, such as the Second Class National Educational Achievement Award in 1996.
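The Step-1 procedure summarized in the conclusion above (ranking each unlabeled example by its similarity to its k nearest positive examples, then taking those below a threshold T as reliable negatives) can be sketched as follows. This is a minimal illustration on hypothetical toy 2-D vectors, not the paper's implementation: the paper works on TF-IDF document vectors and refines the result with an iterative SVM in step 2, and the function name and data here are assumptions for demonstration only.

```python
import numpy as np

def extract_reliable_negatives(P, U, k=5, T=0.1):
    """Rank each unlabeled example in U by the mean cosine similarity to its
    k nearest positive examples in P; examples scoring below threshold T
    are returned as the reliable negative (RN) set."""
    # L2-normalize rows so plain dot products become cosine similarities.
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    sims = Un @ Pn.T                      # |U| x |P| cosine similarity matrix
    k = min(k, P.shape[0])
    # For each unlabeled example, average its k largest similarities.
    topk = np.sort(sims, axis=1)[:, -k:]
    scores = topk.mean(axis=1)
    return np.where(scores < T)[0], scores

# Toy data: positives cluster near (1, 1); two unlabeled points sit far away.
P = np.array([[1.0, 1.0], [0.9, 1.1], [1.1, 0.8]])
U = np.array([[1.0, 0.9],      # resembles the positives -> kept unlabeled
              [-1.0, -1.0],    # dissimilar to every positive
              [-0.8, -1.2]])   # dissimilar to every positive
rn_idx, scores = extract_reliable_negatives(P, U, k=2, T=0.0)
print(rn_idx)  # indices of unlabeled examples taken as reliable negatives
```

The RN set produced this way would then seed the negative class for the iterative SVM refinement of step 2; note that T trades off RN purity against RN size, which is why the conclusion singles it out for tuning.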


More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Human Face Recognition Using Generalized. Kernel Fisher Discriminant

Human Face Recognition Using Generalized. Kernel Fisher Discriminant Human Face Recognton Usng Generalzed Kernel Fsher Dscrmnant ng-yu Sun,2 De-Shuang Huang Ln Guo. Insttute of Intellgent Machnes, Chnese Academy of Scences, P.O.ox 30, Hefe, Anhu, Chna. 2. Department of

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-74 Volume 0 Issue BoTechnology 04 An Indan Journal FULL PAPER BTAIJ 0() 04 [684-689] Revew on Chna s sports ndustry fnancng market based on market -orented

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Fast Feature Value Searching for Face Detection

Fast Feature Value Searching for Face Detection Vol., No. 2 Computer and Informaton Scence Fast Feature Value Searchng for Face Detecton Yunyang Yan Department of Computer Engneerng Huayn Insttute of Technology Hua an 22300, Chna E-mal: areyyyke@63.com

More information

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK L-qng Qu, Yong-quan Lang 2, Jng-Chen 3, 2 College of Informaton Scence and Technology, Shandong Unversty of Scence and Technology,

More information