A Semi-Supervised Approach Based on k-nearest Neighbor
JOURNAL OF SOFTWARE, VOL. 8, NO. 4, APRIL 2013

A Semi-Supervised Approach Based on k-Nearest Neighbor

Zhiliang Liu
School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, P. R. China
Email: zhiliang_liu@uestc.edu.cn

Xiaomin Zhao
Department of Mechanical Engineering, University of Alberta, Edmonton, Canada
Email: xiaomin@ualberta.ca

Jianxiao Zou and Hongbing Xu
School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, P. R. China
Email: {jxzou, hbxu}@uestc.edu.cn

Abstract: The k-nearest neighbor (k-NN) classifier needs sufficient labeled instances for training in order to achieve robust performance; otherwise, its performance deteriorates when the training set is small. In this paper, a novel algorithm, namely the ordinal semi-supervised k-NN, is proposed to handle cases with few labeled training instances. The algorithm consists of two parts: instance ranking and semi-supervised learning (SSL). Using SSL, the performance of k-NN with small training sets can be improved, because SSL enlarges the training set by including unlabeled instances together with their predicted labels. Instance ranking is used to pick the unlabeled instances that are added to the training set. It gives priority to the unlabeled instances that are closer to class boundaries, because they are more likely to be correctly predicted (these are called high confidence prediction instances). SSL therefore benefits when high confidence prediction instances are added to the training set early. Experimental results demonstrate that the proposed algorithm outperforms the semi-supervised k-NN and the conventional k-NN algorithms when the training set is small.
Index Terms: instance ranking; semi-supervised learning; k-nearest neighbor; weighted mean

NOTATION

U         the given dataset, consisting of the labeled dataset U_l and the unlabeled dataset U_u; in this paper, U = U_l ∪ U_u;
n         the number of features that describe an instance in U;
x (or z)  an instance (a point) in U, an n-dimensional column vector in which each dimension represents a feature;
N_j       the number of instances of the j-th class in U_l;
N_l       the number of instances in U_l;
N_u       the number of instances in U_u;
L         the number of classes (labels) in U_l;
x_i^(j)   the i-th instance within the j-th class;
k         the number of nearest neighbors used for prediction in k-NN;
d(x, z)   the Euclidean distance between two instances x and z;
NN_i(x)   the i-th nearest neighbor of an unlabeled instance x, i = 1, 2, ..., k;
CF_min    the predefined confidence-factor threshold for high confidence prediction instances;
class(x)  returns the true class of instance x; a hat on it indicates a predicted class;
κ         the kernel function; in this paper, the Gaussian radial basis function is adopted;
σ         the width parameter of the Gaussian radial basis function;
R*        the optimal Bayes probability of error;
K         the number of folds, a parameter in K-fold cross-validation.

Manuscript received May 5, 2012; revised August 3, 2012; accepted August 9, 2012.

I. INTRODUCTION

The nearest neighbor rule (NNR) was originally proposed by Fix and Hodges [1] for solving discrimination problems in 1951. The idea behind the NNR is that an unlabeled instance in the input space is likely to have the same class as the instances close to it. Under this rule, the nearest neighbor (1-NN) classifier, a supervised learning algorithm, identifies the class of an unlabeled instance by finding its closest training instance, taking note of that instance's class, and predicting this class for the unlabeled instance.
Instead of looking at only the single nearest instance, the k-nearest neighbor (k-NN) classifier finds the k closest instances in the training set and applies the k nearest neighbor rule (kNNR), typically majority voting in the conventional k-NN, to make a prediction. In fact, the nearest neighbor classifier is the special case of k-NN with k = 1. In references [1] and [2], it is concluded that k-NN reaches the optimal
Bayes probability of error R* in the limit k → ∞ with k/N_l → 0, under assumptions on the underlying probability distributions of the training set. Later, Cover and Hart [3] proved that the asymptotic error of 1-NN lies in the interval

    R* ≤ R_NN ≤ R* (2 − L R* / (L − 1)),    (1)

where R_NN is the 1-NN probability of error. They also pointed out that the nearest neighbor carries half of the classification information in an infinite training set. From Eq. (1), 1-NN attains the same asymptotic optimality as the optimal Bayes probability of error if R* = 0; in that condition, class boundaries are clearly separated from each other and no instances from different classes overlap. Considering joint distribution cases, k-NN was proposed to address the sub-optimality of the NNR [4]. Mathematically, k-NN can be expressed as

    class^(x) = arg max_{j=1,...,L} Σ_{i=1}^{k} f(x, NN_i(x)) δ(class(NN_i(x)), j),    (2)

where δ(i, j) is the Kronecker delta function

    δ(i, j) = 1 if i = j, and 0 otherwise,    (3)

which returns one if the two inputs are identical and zero otherwise; f(x, z) is a voting weight function that defines how much impact a nearest neighbor z has on x. For example, if f(x, z) is set to the constant 1, k-NN assigns an equal weight of one to each of the k nearest neighbors, which is majority voting. As a very popular classification technique, k-NN has been well studied and widely applied in fields such as text processing [5] and fault diagnosis [6]. Reported studies on k-NN can be roughly grouped into three topics: similarity measurement, the k nearest neighbor rule (kNNR), and computation cost. Conventional metrics for similarity measurement in k-NN [7] include the Minkowski, Euclidean, Manhattan, and Chebyshev distances. A few new similarity measures have been proposed recently. Wang et al. [8] developed an adaptive distance metric to improve the performance of k-NN. Yu et al. [9] extended the kernel method to k-NN and termed it the kernel distance metric.
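The majority-voting rule of Eqs. (2) and (3) can be sketched as follows. This is a minimal illustration with f(x, z) = 1 for every neighbor; the function and variable names are ours, not the paper's:

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """Majority-vote k-NN: Eq. (2) with equal voting weights f(x, z) = 1.

    Each of the k nearest neighbors casts one vote (the Kronecker delta
    of Eq. (3) collapses to a per-class count); ties are broken toward
    the smallest class index by argmax.
    """
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances d(x, z)
    nn_idx = np.argsort(d)[:k]                # indices of NN_1(x), ..., NN_k(x)
    votes = np.bincount(y_train[nn_idx])      # vote count per class label
    return int(np.argmax(votes))              # arg max over classes j
```

For instance, with three class-0 and class-1 points around two clusters, a query near the first cluster is assigned class 0 by the vote of its k = 3 nearest neighbors.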
Lei et al. [6] proposed a feature-weighted method (TFSWT) that weights each feature differently to overcome sensitivity to feature scaling. The kNNR, the second topic, is the strategy for making a prediction from the k nearest neighbors. It is so critical to the performance of k-NN that it has attracted the attention of many researchers. Majority voting, one of the most popular and intuitive kNNRs, assigns to an unlabeled instance the class that receives the majority of votes among its k nearest neighbors. Dudani [10] proposed a distance-weighted function that weights the nearest neighbors closer to the unlabeled instance more heavily than the others:

    f(x, z) = (d(x, NN_k(x)) − d(x, z)) / (d(x, NN_k(x)) − d(x, NN_1(x)))  if d(x, NN_k(x)) ≠ d(x, NN_1(x)),
    f(x, z) = 1  otherwise.    (4)

Shepard [11] further used an exponential function of the Euclidean distance to weight the neighbors' votes. Zeng et al. [12] proposed the pseudo NNR, which combines the distance-weighted kNNR with local mean learning [13]. Because k-NN is an instance-based and therefore time-consuming learner, the third topic is also valuable. Instance selection is one option for addressing this problem. Brighton and Mellish [14] reviewed instance selection methods of the past thirty years. The condensed NNR proposed by Hart [15] tends to retain training instances along the class boundaries and abandon instances in the interior of a class. By removing certain training instances, computation cost is reduced to a certain extent. Nowadays k-NN is particularly popular because its poor run-time performance is alleviated by today's available computational power [7].

The varieties of k-NN reviewed above are robust when sufficient training data are available, but challenges such as classification deterioration arise when only a few training data are available, because labeled training data are sometimes difficult, expensive, or time-consuming to obtain [4]. As a result, the performance of classifiers deteriorates [2], [16].
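Dudani's rule of Eq. (4) can be sketched likewise (names are ours; the degenerate case where all k distances coincide takes the "otherwise" branch):

```python
import numpy as np

def dudani_weights(dists):
    """Dudani's distance-weighted voting, Eq. (4): the nearest neighbor
    gets weight 1 and the k-th nearest gets weight 0, linearly in
    distance; if all k distances coincide, every neighbor gets weight 1."""
    d1, dk = dists.min(), dists.max()
    if dk == d1:
        return np.ones_like(dists)
    return (dk - dists) / (dk - d1)

def weighted_knn_predict(x, X_train, y_train, k=3):
    """k-NN of Eq. (2) with f(x, z) given by Dudani's weights."""
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]
    w = dudani_weights(d[nn])
    # accumulate the vote weight per class, then take the arg max over classes
    classes = np.unique(y_train)
    scores = np.array([w[y_train[nn] == c].sum() for c in classes])
    return int(classes[np.argmax(scores)])
```

With k = 3, one very close class-0 neighbor can outvote two distant class-1 neighbors, whereas plain majority voting would pick class 1; that is exactly the effect Dudani's weighting is designed to produce.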
The reason for the deterioration is that a small training set contains only partial information and is not sufficient to represent the intrinsic structure of the classes. That is to say, classifiers tend to under-fit with few training data. Semi-supervised learning (SSL) provides a possible solution to this problem because it enlarges the training set with qualified unlabeled instances, namely high confidence prediction instances [17], and their predicted labels. Xie et al. [24] proposed a semi-supervised k-nearest neighbor classifier named De-YATSI. Lv [25] proposed a semi-supervised learning approach using a local regularizer and a unit-circle class label representation. Wajeed and Adilakshmi [26] proposed a semi-supervised method to increase the accuracy of k-NN for text classification. Giannakopoulos and Petridis [27] applied a semi-supervised version of Fisher discriminant analysis together with the k-nearest neighbor in a diarization system. In this paper, we use instance ranking to order the unlabeled instances according to a certain criterion. We believe that performance can be improved further by using an ordinal/sequential unlabeled set rather than an arbitrary one. The idea behind instance ranking is that instances closer to the class boundaries are more likely to be classified correctly; that is, we have higher confidence in their predictions, so it is better to predict them as early as possible. These predicted instances (more likely to be correct) can then be used by SSL for further training. Simply speaking, instance ranking gives these important instances the priority of being predicted first; by doing SSL sequentially rather than randomly, the training set is enlarged iteratively outward from the initial class boundaries and becomes more and more representative of
the intrinsic structure of the given problem. The performance of classifiers can therefore be improved.

The rest of the paper is structured as follows. Section II introduces instance ranking in detail and shows how it works with SSL in the proposed ordinal semi-supervised k-NN. Section III illustrates the idea of the proposed algorithm on simulated datasets and validates it on benchmark datasets. Conclusions are drawn in Section IV.

II. THE ORDINAL SEMI-SUPERVISED k-NN

As stated in Section I, k-NN suffers from poor classification performance if the training data are limited. To alleviate this problem, the ordinal semi-supervised k-NN is proposed, as shown in Fig. 1. The proposed method contains two parts: instance ranking and semi-supervised learning. In the first part, instance ranking, unlabeled instances are sorted by their distances to the class boundaries estimated on the training set. The second part, semi-supervised learning, repeatedly predicts the top-ranked unlabeled instance until all unlabeled instances are classified. In other words, the top-ranked instance from the first part is classified by SSL, and it updates the training set with its predicted label if it is considered a high confidence instance. The training set may thus grow iteratively, and the procedure terminates when all unlabeled instances have been processed.

Figure 1. Scheme of the ordinal semi-supervised k-NN

A. Instance Ranking

Instance ranking aims to rank unlabeled instances by their relative distances to class boundaries. The weighted mean is the tool used to represent a class boundary; that is, the distance between an unlabeled instance and its corresponding weighted mean of class j approximates its distance to the boundary of class j. Compared with the nearest neighbor, the weighted mean detects class boundaries more robustly, particularly when the nearest neighbor is an outlier, because it considers not only the nearest neighbor but also the other instances of a class. The weighted mean was first used in [18] and [19] for nonparametric weighted feature extraction.

Consider a given dataset U having L classes, where class j (1 ≤ j ≤ L) has N_j instances in U_l. The weighted mean of an unlabeled instance x in U_u with respect to class j is defined as

    M_j(x) = Σ_{i=1}^{N_j} ω(x, x_i^(j)) x_i^(j),    (5)

where ω(x, x_i^(j)) is the weight of x_i^(j) with respect to x, defined as

    ω(x, x_i^(j)) = κ(x, x_i^(j)) / Σ_{t=1}^{N_j} κ(x, x_t^(j)),    (6)

where κ(x, z), the Gaussian radial basis function (Gaussian RBF), evaluates the distance between x and z:

    κ(x, z) = exp(−d(x, z)² / (2σ²)),    (7)

where 0 < σ < ∞ and 0 < κ(x, z) ≤ 1.

An example in Fig. 2 illustrates how to calculate the weighted mean. Class 1 and class 2 are marked in red and green, respectively. Each class has four labeled instances for training, denoted x_i^(1) and x_i^(2), respectively, where 1 ≤ i ≤ 4, and x is one unlabeled instance. First, we calculate the distance d(x, x_i^(1)) between x and every labeled instance x_i^(1), where i = 1, 2, 3, 4.
Second, the weight ω(x, x_i^(1)) is obtained by evaluating the distances between x and every instance in class 1 with κ(x, z) under a predefined σ. Finally, the weighted mean M_1(x) is computed according to Eq. (5). From Fig. 2, notice that the distance between x and M_1(x) is approximately the orthogonal distance to the corresponding class boundary. The weighted mean of x with respect to class 2, M_2(x), is obtained by the same procedure.

Figure 2. An example of how to calculate the weighted mean in a two-class case

Sigma (σ) in the Gaussian RBF is important to the weighted mean because it changes the weight distribution among the instances. In the limit, the weighted mean becomes the arithmetic mean,

    M_j(x) = (1/N_j) Σ_{i=1}^{N_j} x_i^(j),    (8)

as σ → ∞: in that case κ(x, z) → 1, so every weight ω(x, x_i^(j)) tends to the identical value 1/N_j. Let us take an example to illustrate the impact of sigma on the weighted mean.
Suppose the eight training instances of Fig. 2 and one unlabeled instance x are given. The weight distributions and the Euclidean distances between x and the eight training instances are shown in Fig. 3. Three different values, σ = 0.5, 1.0, and 5.0, are considered when computing the weight distribution of x. No matter what the value of σ is, the nearest neighbor is always weighted most heavily, because κ(x, z) is monotonic with respect to d(x, z); see, for instance, the nearest instance in class 1
and in class 2. κ(x, z) with a smaller σ weights the nearest neighbor more heavily than with a bigger σ. In a word, a small σ produces discriminative weights within a class, while a big σ tends to weight the instances of a class identically.

Figure 3. The weight distributions of instances with respect to three different σ: 0.5, 1.0, and 5.0

A distance factor (DF) is then proposed to represent the relative closeness of an unlabeled instance x to its closest class boundary among all the boundaries, as measured by the weighted means. It is defined as

    DF(x) = min_{j=1,...,L} [ d(x, M_j(x)) / Σ_{t=1}^{L} d(x, M_t(x)) ],    (9)

where 0 ≤ DF ≤ 1/L. There are L distances between x and the L class boundaries; the distance factor is the minimum of these after normalization by their sum. DF thus measures how close an instance is to its nearest class boundary: a smaller DF implies the instance is closer. Once the DFs of all unlabeled instances are available, they are sorted in ascending order, ready for the next step.

B. Semi-supervised Learning

Semi-supervised learning is a framework of algorithms proposed to improve the performance of supervised algorithms by using both labeled and unlabeled data. It has been developed mainly because, in cases with limited labeled data, labeling a large amount of training data takes great effort and cost [20]. Generally, semi-supervised learning may be achieved by either co-training or self-training [21]. Co-training assumes that the features can be divided into two independent sets, each sufficient to train a classifier. Two classifiers are trained on the two views separately, so that the classifiers can help each other by adding their most confident prediction instances into each other's training sets.
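The weighted mean of Eqs. (5) and (6) and the distance factor of Eq. (9) can be sketched as follows. One assumption is labeled explicitly: the exact exponent of the Gaussian RBF in Eq. (7) is garbled in this scan, so the standard form exp(−d²/(2σ²)) is used here; the function names are ours:

```python
import numpy as np

def weighted_mean(x, X_class, sigma=1.0):
    """Weighted mean M_j(x) of Eq. (5): a Gaussian-RBF-weighted average
    of the class-j training instances, with weights normalized to sum
    to one as in Eq. (6)."""
    d = np.linalg.norm(X_class - x, axis=1)
    # assumed standard Gaussian RBF form for Eq. (7)
    kappa = np.exp(-d**2 / (2.0 * sigma**2))
    w = kappa / kappa.sum()
    return (w[:, None] * X_class).sum(axis=0)

def distance_factor(x, class_instances, sigma=1.0):
    """Distance factor of Eq. (9): the distance from x to its nearest
    class boundary (approximated by the weighted means), normalized by
    the sum of the distances to all L class boundaries."""
    dists = np.array([np.linalg.norm(x - weighted_mean(x, Xc, sigma))
                      for Xc in class_instances])
    return float(dists.min() / dists.sum())
```

As σ grows, the weighted mean approaches the arithmetic mean of Eq. (8); as σ shrinks, it collapses toward the nearest neighbor, matching the discussion of Fig. 3.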
The assumptions of co-training are too strict to be satisfied in practice, particularly in cases with few training data. Moreover, co-training is complicated because more than one classifier is involved in classification. Self-training, on the other hand, is simple to implement because it uses a single classifier to train on and to select high confidence instances from the unlabeled set. In this work, we implement SSL by the self-training approach; the only classifier used is k-NN. Following the principles of self-training, we define a confidence measure on predictions, the confidence factor (CF), expressed mathematically as

    CF(x) = Σ_{i=1}^{k} δ(class(NN_i(x)), class^(x)) d(x, NN_i(x)) / Σ_{i=1}^{k} d(x, NN_i(x)),    (10)

where class^(x) is the label predicted by Eq. (2) using majority voting, and 0 ≤ CF ≤ 1. In other words, CF is the sum of the distances between the unlabeled instance x and those of its k nearest neighbors that share the predicted class, divided by the sum of the distances between x and all k nearest neighbors regardless of class. If CF is large, there is strong evidence to trust the predicted class of the unlabeled instance. The rule of self-learning is that an unlabeled instance is considered a high confidence instance, and is added to the training set, if its CF is greater than or equal to a predefined threshold CF_min. Once a high confidence instance has been added to the training set, it is treated the same as the initial training instances. By involving more unlabeled instances and their predicted labels, the training set is updated iteratively, and the class boundaries are updated as well. In the next iteration, instance ranking is repeated on the current unlabeled and training sets, and k-NN predicts the next top-ranked unlabeled instance on the latest training set.

C. Algorithm Summary
TABLE I. PSEUDO CODE OF THE ORDINAL SEMI-SUPERVISED k-NN

INPUT:  the values of the parameters σ, CF_min, and k;
OUTPUT: the predicted labels of the unlabeled instances;

MAIN ALGORITHM
For index of unlabeled instances i = 1, 2, ..., N_u   % N_u refers to the size of the initial U_u
    Module 1;
    Module 2;
End

Module 1: Instance Ranking
1.1 For class index j = 1, 2, ..., L
    1.1.1 Calculate d(x, x_i^(j)), the Euclidean distance between each unlabeled instance x in U_u and each instance of class j in U_l;
    1.1.2 Calculate the corresponding weights ω(x, x_i^(j));
    1.1.3 Calculate the weighted mean for each instance x in U_u with respect to class j;
    End
1.2 Compute the distance factor DF(x) by Eq. (9);
1.3 Sort all DF values in ascending order and return the top-ranked instance, i.e., the one with the smallest DF;

Module 2: Semi-Supervised Learning
2.1 Predict the class of the current top-ranked instance x from Module 1 by Eq. (2);
2.2 Compute the confidence factor CF(x) of x using Eq. (10);
2.3 If CF(x) ≥ CF_min
    2.3.1 U_l = U_l ∪ {x};
    End
2.4 U_u = U_u \ {x};   % remove x from U_u
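The loop of Table I can be sketched end to end as follows. This is a simplified single-file reading of Modules 1 and 2 under the same RBF-form assumption noted earlier; the names and the tie-breaking are ours:

```python
import numpy as np

def confidence_factor(x, X_l, y_l, y_hat, k):
    """Confidence factor of Eq. (10): the fraction of the k nearest
    neighbors' distance mass whose labels agree with the prediction."""
    d = np.linalg.norm(X_l - x, axis=1)
    nn = np.argsort(d)[:k]
    total = d[nn].sum()
    same = d[nn][y_l[nn] == y_hat].sum()
    return same / total if total > 0 else 1.0

def ordinal_ssl_knn(X_l, y_l, X_u, k=1, sigma=1.0, cf_min=0.5):
    """One full pass of the ordinal semi-supervised k-NN (Table I).

    Returns the predicted labels in the order the instances were
    processed (i.e., in instance-ranking order, not input order).
    """
    X_l, y_l = X_l.copy(), y_l.copy()
    X_u = [np.asarray(u, dtype=float) for u in X_u]
    preds = []
    while X_u:
        # Module 1: rank the unlabeled instances by distance factor, Eq. (9)
        dfs = []
        for x in X_u:
            dists = []
            for c in np.unique(y_l):
                Xc = X_l[y_l == c]
                d = np.linalg.norm(Xc - x, axis=1)
                w = np.exp(-d**2 / (2 * sigma**2))
                w /= w.sum()                          # Eq. (6)
                m = (w[:, None] * Xc).sum(axis=0)     # Eq. (5)
                dists.append(np.linalg.norm(x - m))
            dists = np.array(dists)
            dfs.append(dists.min() / dists.sum())
        x = X_u.pop(int(np.argmin(dfs)))              # top-ranked instance
        # Module 2: majority-vote k-NN prediction, then self-training
        d = np.linalg.norm(X_l - x, axis=1)
        nn = np.argsort(d)[:k]
        y_hat = int(np.bincount(y_l[nn]).argmax())
        preds.append(y_hat)
        if confidence_factor(x, X_l, y_l, y_hat, k) >= cf_min:
            X_l = np.vstack([X_l, x])                 # U_l = U_l ∪ {x}
            y_l = np.append(y_l, y_hat)
    return preds
```

Mirroring the simulated experiments, starting from one labeled instance per cluster and letting the training set grow iteratively keeps later predictions anchored to the evolving boundary rather than to the two initial points.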
The pseudo code of the proposed method is summarized in TABLE I. Basically, instance ranking and semi-supervised learning are implemented by Module 1 and Module 2, respectively. Instance ranking finds the top-ranked unlabeled instance according to the distance factor. The top-ranked instance is then classified by SSL, which determines whether it is added to the training set with its predicted label. The number of iterations equals the number of unlabeled instances, N_u. In the next section, we illustrate the algorithm visually on simulated datasets and compare it with other methods.

III. EXPERIMENTAL RESULTS

In this section the proposed method is applied to two groups of datasets. The first group contains two two-dimensional simulated datasets, used to visually illustrate the mechanism of the ordinal semi-supervised k-NN. The second group contains four benchmark datasets, used to validate our algorithm in practical applications. Experimental results and discussions are given below.

A. Experiment 1: The Simulated Datasets

Two simulated datasets are artificially generated to visually illustrate the mechanism of the proposed method. One is a two-class, Gaussian-distributed dataset, where the means of the two classes are [2, 5]^T and [6, 5]^T, respectively, and the standard deviations of both classes are [1, 1]^T. The other is the well-known two-moon dataset [22], which is non-Gaussian. We use two half circles to simulate it: one with center [6, 5.5]^T and radius 5, the other with center [11, 6.5]^T and radius 5. The two datasets are summarized in TABLE II.

TABLE II. SUMMARY OF SIMULATED DATASETS

No.  Dataset               Classes  Instances  Features
1    Simulated dataset #1  2        400        2
2    Simulated dataset #2  2        202        2

Fig. 4 shows six intermediate iterations on simulated dataset #1: the 0th, 80th, 160th, 240th, 320th, and 398th (last) iterations. White curves are the decision boundaries of the current iteration. Colors indicate the class of an instance: red is class 1 and green is class 2. Instances drawn as big dots are training instances, while those drawn as small dots are unlabeled instances. A circle around a dot means the instance has already been predicted, and the color of the circle indicates the predicted class. At the 0th iteration, two instances are in the training set: one red instance for class 1 and one green instance for class 2. The remaining 398 (400 − 2) instances are considered unlabeled. 1-NN is the classifier used here. According to the class boundary in Fig. 4(a), many instances are predicted wrongly, since the decision boundary of 1-NN is exactly in the middle of the two training instances. After 80 iterations, in Fig. 4(b), the decision boundary has changed slightly, and prediction proceeds along the class boundaries. In Figs. 4(c) through 4(e), sequential prediction continues until the 398th iteration. The decision boundary evolves with the aid of the ordinal semi-supervised k-NN and the unlabeled instances, and at the last iteration, shown in Fig. 4(f), it is eventually representative enough of the structure of simulated dataset #1.

Figure 4. Intermediate boundary change on simulated dataset #1 ((a) 0th, (b) 80th, (c) 160th, (d) 240th, (e) 320th, (f) 398th iteration); classifier: 1-NN

Fig. 5 shows intermediate results on simulated dataset #2, the two-moon dataset. Unlike simulated dataset #1, the distributions in this dataset are non-Gaussian. As with simulated dataset #1, only two training instances are selected at the 0th iteration, so the remaining 200 (202 − 2) instances are used as unlabeled instances. The decision boundary, as on simulated dataset #1, continually updates itself by extracting and absorbing information from the unlabeled set. At the last (200th) iteration, shown in Fig. 5(f), the decision boundary is good enough for subsequent prediction; that is, a good model has been generated. Panels (a) through (d) of Fig. 5 show the 0th, 20th, 40th, and 60th iterations;
panels (e) and (f) show the 80th and 200th iterations.

Figure 5. Intermediate boundary change on simulated dataset #2; classifier: 1-NN

The two cases studied above demonstrate that the ordinal semi-supervised k-NN is able to improve the performance of k-NN by using unlabeled sets when few training instances are available; in the simulated datasets, only one training instance per class was available. Its effectiveness holds not only for Gaussian-distributed cases, such as simulated dataset #1, but also for non-Gaussian cases, such as simulated dataset #2. Before applying our algorithm, assumptions [22] are made, as for other SSL techniques:

(1) The Smoothness Assumption (SA): instances in a high density region should have similar labels; that is, labels should change only in low density regions.
(2) The Cluster Assumption (CA): two instances from the same cluster should have similar labels.
(3) The Manifold Assumption (MA): the data lie roughly on a low-dimensional manifold.

We have demonstrated the effectiveness of the proposed method in this subsection; next, the proposed method is applied to practical benchmark datasets.

B. Experiment 2: The Benchmark Datasets

The ordinal semi-supervised k-NN (method 1) is compared with two other varieties of k-NN: the semi-supervised k-NN (method 2) and the conventional k-NN (method 3). The difference between method 1 and method 2 is that method 1 uses instance ranking, while method 2 uses none (i.e., random ordering). The k-NN classifier is implemented with knnclassify from the Bioinformatics Toolbox of MATLAB. The parameters of the three methods are predefined as σ = 1, k = 1, and CF_min = 1. The three methods are compared on four benchmark datasets downloaded from the University of California Irvine (UCI) repository [23], summarized in TABLE III.

TABLE III. SUMMARY OF BENCHMARK DATASETS
No.  Dataset     Classes  Instances  Features
1    Vehicle     4        846        18
2    Ionosphere  2        351        34
3    Parkinson   2        195        22
4    Wine        3        178        13

We split each dataset in TABLE III into two subsets because of the limited number of instances available: one subset is used as a labeled set for training; the other, although labeled, is treated as an unlabeled set for SSL. In method 3, the unlabeled sets are used directly as test sets. In methods 1 and 2, after SSL has been applied on the unlabeled set, the unlabeled sets, now treated as test sets, are re-used to test these two methods. The K-fold cross-validation technique is used for data partitioning, where K is a specified integer from 2 to 10. For each K, one fold is used as the unlabeled set and the other K − 1 folds as the initial training set; in total there are K combinations of selecting one fold out of K. We run the programs for each combination, obtain the classification accuracy of each run, and then average them. Classification accuracy is the only measure used to evaluate the methods. It is calculated on the test set with the 0-1 loss function; that is, classification accuracy is the ratio of the number of correctly predicted instances in the test set to the total number of instances in the test (unlabeled) set. A labeled ratio is defined as the size of the training set divided by the total number of instances, i.e., (K − 1)/K in this case. Since K ranges from 2 to 10, the ratio ranges from 1/2 to 9/10, namely 1/2, 2/3, 3/4, ..., 9/10. Next, we repeat the same procedure except that one fold is used as the training set and the remaining K − 1 folds as the unlabeled set; the labeled ratio is then 1/K, ranging from 1/10 to 1/2, namely 1/10, 1/9, 1/8, ..., 1/2. When all calculations are complete, we have 17 averaged values of classification accuracy with respect to the 17 ratios from 1/10 to 9/10.

Figure 6. Classification accuracy of methods 1-3 on the Vehicle dataset versus the labeled ratio

Figure 7.
Classification accuracy on the Ionosphere dataset
Figure 8. Classification accuracy on the Parkinson dataset

Figure 9. Classification accuracy on the Wine dataset

Results are provided in Figs. 6 to 9 in terms of the classification accuracy of the three methods at different ratios. We compare the results for the case of labeled ratio > 1/2 (large training sets) and the case of ratio ≤ 1/2 (small training sets). From Figs. 6 to 9, method 1 outperforms the other two methods when the ratio ranges from 1/10 to 1/2 (i.e., small training sets). The conventional k-NN is sensitive to the ratio and, as mentioned in Section I, deteriorates considerably when the ratio is smaller than 1/2. With a small ratio (≤ 1/2), the training sets are insufficient and cannot provide enough knowledge for robust classification of the test set. Compared with the conventional k-NN, the semi-supervised k-NN improves the classification accuracy. In general, however, the classification accuracy of the semi-supervised k-NN is inferior to that of the proposed method, as the latter benefits from the use of a sequential unlabeled set. When the ratio > 1/2, the classification accuracy of the three methods is quite comparable. This clearly shows that the proposed method is advantageous for small training sets but not for large ones. TABLE IV provides two statistics of classification accuracy for the insufficient cases (ratio ≤ 1/2), namely the mean (MEAN) and the standard deviation (STD). Both are calculated over the nine averaged values of classification accuracy corresponding to the nine ratios less than or equal to 1/2.
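The ratio-sweep protocol described above can be sketched as follows. This is our reconstruction of the K-fold partitioning for K = 2, ..., 10; fit_predict stands in for any of the three methods:

```python
import numpy as np

def accuracy_over_ratios(X, y, fit_predict, K_values=range(2, 11), seed=0):
    """Average 0-1 accuracy per labeled ratio, as in Experiment 2.

    For each K, split the data into K folds. Using K-1 folds as the
    labeled set and 1 fold as the unlabeled/test set gives the ratio
    (K-1)/K; swapping the roles gives the ratio 1/K. Each ratio's
    accuracy is averaged over the K fold rotations. With K = 2..10
    this yields the 17 distinct ratios from 1/10 to 9/10.
    """
    rng = np.random.default_rng(seed)
    results = {}
    for K in K_values:
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, K)
        for small_is_train in (False, True):
            accs = []
            for i in range(K):
                test = folds[i]
                train = np.concatenate(
                    [f for j, f in enumerate(folds) if j != i])
                if small_is_train:
                    train, test = test, train   # one fold trains, K-1 folds test
                y_hat = fit_predict(X[train], y[train], X[test])
                accs.append(float(np.mean(y_hat == y[test])))
            ratio = 1 / K if small_is_train else (K - 1) / K
            results[round(ratio, 3)] = float(np.mean(accs))
    return results
```

Sweeping the ratio like this is what makes the small-training-set regime (ratio ≤ 1/2) directly comparable with the large one in Figs. 6 to 9.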
It can be seen that the proposed method attains not only the highest mean accuracy but also the smallest standard deviation on each of the four datasets, which again demonstrates its effectiveness in the insufficient cases.

TABLE IV. SUMMARY OF EXPERIMENT 2 RESULTS

Dataset      Method 1 (MEAN / STD)   Method 2 (MEAN / STD)   Method 3 (MEAN / STD)
Vehicle
Ionosphere
Parkinson
Wine
Note: italic and shaded text denotes either the largest value of the mean or the smallest value of the standard deviation.

IV. CONCLUSIONS

In this work, we propose a classification method named the ordinal semi-supervised k-NN to deal with insufficient training instances. The proposed method is a nonparametric classification approach and includes two parts, namely instance ranking and semi-supervised learning. Instance ranking produces a sequential unlabeled set. Semi-supervised learning classifies the top-ranked instance and includes it in the training set with its predicted label if it is considered a high confidence instance. The proposed method works for both Gaussian and non-Gaussian distributed cases. Experimental results on four benchmark datasets demonstrate that the proposed method outperforms the conventional k-NN and the semi-supervised k-NN, especially when the training instances are insufficient (ratio ≤ 1/2), and that it is comparable with the other two methods when the training sets are sufficient (ratio > 1/2). However, the proposed method requires more computational time than the conventional k-NN, because its training set continually grows, while that of the conventional k-NN is fixed.

REFERENCES

[1] E. Fix and J. L. Hodges, Jr., "Discriminatory analysis. Nonparametric discrimination: consistency properties," U.S. Air Force School of Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Contract AF 41(128)-31, Rep. 4, 1951.
[2] E. Fix and J. L. Hodges, Jr., "Discriminatory analysis: small sample performance," U.S. Air Force School of Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Rep. 11, 1952.
[3] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
[4] Y. Wang, X.
Xu, H. Zhao, and Z. Hua, "Semi-supervised learning based on nearest neighbor rule and cut edges," Knowledge-Based Systems, vol. 23, no. 6, August 2010.
[5] R. Gil-García and A. Pons-Porrata, "A new nearest neighbor rule for text categorization," Lecture Notes in Computer Science, 2006.
[6] Y. Lei and M. J. Zuo, "Gear crack level identification based on weighted k nearest neighbour classification algorithm," Mechanical Systems and Signal Processing, vol. 23, no. 5, July 2009.
[7] P. Cunningham and S. J. Delany, "k-nearest neighbour classifiers," Technical Report UCD-CSI-2007-4, March 2007.
[8] J. Wang, P. Neskovic, and L. N. Cooper, "Improving nearest neighbor rule with a simple adaptive distance measure," Pattern Recognition Letters, vol. 28, no. 2, pp. 207-213, January 2007.
[9] K. Yu, L. Ji, and X. Zhang, "Kernel nearest-neighbor algorithm," Neural Processing Letters, vol. 15, no. 2, 2002.
[10] S. A. Dudani, "The distance-weighted k-nearest-neighbor rule," IEEE Transactions on Systems, Man and Cybernetics, vol. 6, no. 4, pp. 325-327, April 1976.
[11] R. N. Shepard, "Toward a universal law of generalization for psychological science," Science, vol. 237, no. 4820, pp. 1317-1323, September 1987.
[12] Y. Zeng, Y. Yang, and L. Zhao, "Pseudo nearest neighbor rule for pattern classification," Expert Systems with Applications, vol. 36, no. 2, March 2009.
[13] Y. Mitani and Y. Hamamoto, "A local mean-based nonparametric classifier," Pattern Recognition Letters, vol. 27, no. 10, pp. 1151-1159, July 2006.
[14] H. Brighton and C. Mellish, "Advances in instance selection for instance-based learning algorithms," Data Mining and Knowledge Discovery, vol. 6, no. 2, pp. 153-172, April 2002.
[15] P. E. Hart, "The condensed nearest neighbor rule," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 515-516, May 1968.
[16] A. M. Martínez and A. C. Kak, "PCA versus LDA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, February 2001.
[17] X. Zhu, "Semi-supervised learning literature survey," Computer Sciences TR 1530, University of Wisconsin-Madison, July 2008.
[18] B. C. Kuo and D. A. Landgrebe, "Nonparametric weighted feature extraction for classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 5, May 2004.
[19] B. C. Kuo, C. H. Li, and J. M. Yang, "Kernel nonparametric weighted feature extraction for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 4, April 2009.
[20] A. Albalate and W. Minker, Semi-Supervised and Unsupervised Machine Learning: Novel Strategies, John Wiley & Sons, Hoboken, USA, 2011.
[21] D. Guan, W. Yuan, Y. K. Lee, and S.
Lee, Nearest neghbor edtng aded by unlabeled data, Informaton Scences, vol. 79, no. 3, pp. 73-8, June 009. [] Z. Bodó, and Z. Mner, On Supervsed and Semsupervsed k-nearest Neghbor Algorthms, Proceedngs of the 7th Jont Conference on Mathematcs and Computer Scence, vol. LIII, pp Studa Unverstats Babe s- Bolya, Seres Informatca, 008. [3] A. Frank, and A. Asuncon. UCI Machne Learnng Repostory [ Irvne, CA: Unversty of Calforna, School of Informaton and Computer Scence, 00. [4] Y. Xe, Y. Jang, and M. Tang, A sem-supervsed k- nearest neghbor algorthm based on data edtng, Proceedngs of Chnese control and decson conference, 0. [5] J. Lv, Sem-supervsed learnng usng local regularzer and unt crcle class label representaton, Journal of Software, vol. 7, no. 5, pp , 0. [6] M. A. Wajeed, and T. Adlakshm, Sem-supervsed text classfcaton usng henhanced KNN algorthm, Proceedngs of World Congress on Informaton and Communcaton Technologes, 0. [7] T. Gannakopoulos, and S. Petrds, Fsher lnear sem- Dscrmnant analyss for speaker darzaton, IEEE Transacton on Audo, Speech, and Laguage Processng, vol. 0, no. 7, pp. 93-9, 0. Zhlang Lu receved the B.S. degree n school of electrcal and nformaton engneerng, Southwest Unversty for Natonaltes, Chengdu, n 006. He vsted Unversty of Alberta as a vstng scholar from 009 to 0. He s currently a Ph.D canddate n the school of automaton engeerng, Unversty of Electronc Scence and Technology of Chna, Chengdu. Hs research nterests manly nclude pattern recognton, data mnng, fault dagnoss and prognoss.