Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Daoqiang Zhang 1,2, Songcan Chen 2 and Zhi-Hua Zhou 1*

1 National Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
2 Department of Computer Science and Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

Abstract

Choosing appropriate values for kernel parameters is one of the key problems in many kernel-based methods, because the values of these parameters have a significant impact on the performance of these methods. In this paper, a novel approach is proposed to learn the kernel parameters in the kernel minimum distance (KMD) classifier, where the values of the kernel parameters are computed by optimizing an objective function designed to measure the classification reliability of KMD. Experiments on both artificial and real-world datasets show that the proposed approach works well for learning the kernel parameters of KMD.

Keywords: Kernel minimum distance; kernel parameter optimization; kernel selection

1. Introduction

Minimum distance (MD) and nearest neighbor (NN) are simple but popular techniques in pattern recognition. Recently, both methods have been extended to kernel versions, i.e. the kernel minimum distance (KMD) and kernel nearest neighbor (KNN), for classifying complex and nonlinear patterns such as faces [1], [2]. However, like other kernel-based methods, the performance of KMD and KNN is greatly affected by the selected kernel parameter values. In this paper, we focus on optimizing the kernel parameters for KMD.

In the literature, there are two widely used approaches for choosing the values of kernel parameters in kernel-based methods [1], [3], [4]. The first approach empirically chooses a series of candidate values for the kernel parameter, executes the concerned method under these values again

* Corresponding author. Email: zhouzh@nju.edu.cn, Tel.: +86-5-8368-668, Fax: +86-5-8368-668
and again, and selects the one corresponding to the best performance as the final kernel parameter value. However, this approach suffers from the fact that only a very limited number of candidate values are considered, and therefore the performance of the kernel-based method may not be optimized. The second approach is the well-known cross-validation, which is also widely used in model selection. Compared with the first approach, cross-validation often yields better performance because it searches for the optimal kernel parameter value over a much wider range. However, performing cross-validation is often time-consuming, and hence it cannot be used to adjust the kernel parameters in real time [3]. Furthermore, when there are only a limited number of training examples, the cross-validation approach can hardly ensure a robust estimate.

In this paper, a novel approach is proposed to learn the kernel parameters in KMD. First, an objective function is defined to measure the classification reliability of KMD under different kernel parameters. Then, the optimal values of the kernel parameters are chosen by optimizing this objective function. Experiments on both artificial and real-world datasets show the effectiveness of the proposed approach for learning kernel parameters in KMD.

2. Kernel minimum distance classifier

One of the key ingredients of KMD is the definition of kernel-induced distance measures. Given a data set S = {x_1, ..., x_l} sampled from the input space X, a kernel K(x, y) and a mapping Φ into a feature space satisfy K(x, y) = Φ(x)^T Φ(y). An important property of the kernel is that it can be constructed directly in the original input space without knowing the concrete form of Φ. That is, a kernel implicitly defines a nonlinear mapping function. There are several typical kernels, e.g. the Gaussian kernel K(x, y) = exp(−‖x − y‖²/σ²), the polynomial kernel K(x, y) = (x^T y + 1)^d, etc. The kernel-induced distance between two points defined by a kernel K is shown in Eq. (1):

d²(x, y) = ‖Φ(x) − Φ(y)‖² = K(x, x) − 2K(x, y) + K(y, y).
(1)

Suppose the training data set S contains c different classes, i.e. S_1, S_2, ..., S_c, and each class S_i has l_i samples, satisfying Σ_{i=1}^{c} l_i = l. Let Φ(S_i) = {Φ(x_ij) | x_ij ∈ S_i} be the image of class S_i under the map Φ, and denote the centre of Φ(S_i) as
Φ̄_{S_i} = (1/l_i) Σ_{x_ij ∈ S_i} Φ(x_ij).    (2)

Then the distance between the image of a new point x and the centre Φ̄_{S_i} of class i can be computed as

d²(Φ(x), Φ̄_{S_i}) = ‖Φ(x) − Φ̄_{S_i}‖²
                  = Φ(x)^T Φ(x) + Φ̄_{S_i}^T Φ̄_{S_i} − 2 Φ̄_{S_i}^T Φ(x)
                  = K(x, x) + (1/l_i²) Σ_{x_ij, x_ik ∈ S_i} K(x_ij, x_ik) − (2/l_i) Σ_{x_ij ∈ S_i} K(x, x_ij)    (3)

According to Eq. (3), the classification rule in KMD is to assign the new point x to the class with the smallest distance:

h(x) = argmin_{1 ≤ i ≤ c} { d²(Φ(x), Φ̄_{S_i}) }    (4)

3. The proposed method

The following objective function is defined to measure the classification reliability of KMD under different kernel parameters:

J(θ) = Σ_{i=1}^{l} exp( d²(Φ(x_i), Φ̄_{S_{π(i)}}) − min_{1 ≤ j ≤ c, j ≠ π(i)} d²(Φ(x_i), Φ̄_{S_j}) )    (5)

Here θ denotes the kernel parameters, and π(i) denotes the class label of x_i. The intuition behind Eq. (5) is to make the distance between the image of a sample and the centre of its corresponding class as small as possible, while making the distance between the image of the sample and the centres of the other classes as large as possible. The smaller the value of the objective function, the higher the classification reliability. The exponential function is used to speed up the convergence of the optimization. Note that when d²(Φ(x_i), Φ̄_{S_{π(i)}}) < min_{1 ≤ j ≤ c, j ≠ π(i)} d²(Φ(x_i), Φ̄_{S_j}), the sample x_i is correctly classified.

Equation (5) specifies that the optimal value of a kernel parameter should not only correctly classify the training data, but also make the classification reliability as high as possible. In the extreme case where d²(Φ(x_i), Φ̄_{S_{π(i)}}) = 0 and min_{1 ≤ j ≤ c, j ≠ π(i)} d²(Φ(x_i), Φ̄_{S_j}) = ∞ for each x_i, the highest classification reliability is obtained.
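As an illustration, the computations in Eqs. (1)–(5) can be written in a few lines of Python. This is a minimal sketch of our own, not the authors' code: all function and variable names (`gaussian_kernel_matrix`, `kmd_distances`, `kmd_classify`, `objective_J`) are ours, and a Gaussian kernel with a single parameter `sigma` is assumed.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma):
    """Pairwise Gaussian kernel K(a, b) = exp(-||a - b||^2 / sigma^2)."""
    d2 = (np.sum(A**2, axis=1)[:, None] - 2.0 * A @ B.T
          + np.sum(B**2, axis=1)[None, :])
    return np.exp(-d2 / sigma**2)

def kmd_distances(x, classes, sigma):
    """Eq. (3): squared feature-space distance from Phi(x) to each class centre."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    dists = []
    for S in classes:  # S holds the l_i training samples of one class
        Kxx = gaussian_kernel_matrix(x, x, sigma)[0, 0]    # K(x, x)
        Kxs = gaussian_kernel_matrix(x, S, sigma).mean()   # (1/l_i) sum_j K(x, x_ij)
        Kss = gaussian_kernel_matrix(S, S, sigma).mean()   # (1/l_i^2) sum_jk K(x_ij, x_ik)
        dists.append(Kxx - 2.0 * Kxs + Kss)
    return np.array(dists)

def kmd_classify(x, classes, sigma):
    """Eq. (4): assign x to the class whose centre is nearest in feature space."""
    return int(np.argmin(kmd_distances(x, classes, sigma)))

def objective_J(sigma, X, y, classes):
    """Eq. (5): sum of exp(d^2 to own centre - min d^2 to other centres)."""
    total = 0.0
    for xi, yi in zip(X, y):
        d2 = kmd_distances(xi, classes, sigma)
        total += np.exp(d2[yi] - np.delete(d2, yi).min())
    return total
```

Here `classes` is a list of per-class sample arrays. A small value of `objective_J` means every training sample lies much closer to its own class centre than to any other centre, matching the reliability reading of Eq. (5).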
The optimal values of the kernel parameters can be obtained by minimizing Eq. (5), i.e.

θ* = argmin_θ J(θ).    (6)

In this paper, an iterative algorithm is employed to generate θ*. According to the general gradient method, the updating equation for minimizing the objective function J is given by

θ^(n+1) = θ^(n) − η ∂J/∂θ    (7)

where η is the learning rate and n is the iteration step.

The proposed method, KMD-opt, is summarized as follows:

Step 1. Set the learning rate η and the maximum iteration number N, and set ε to a very small positive number.
Step 2. Initialize the kernel parameters θ^(0) = θ_0 and set the iteration step n = 0.
Step 3. Update the kernel parameters θ^(n) using Eq. (7).
Step 4. If ‖θ^(n+1) − θ^(n)‖ < ε or n ≥ N, stop. Otherwise, set n = n + 1 and go to Step 3.

4. Experiments

This section evaluates the effectiveness of the proposed KMD-opt method. For comparison, MD and KMD are also tested. An artificial data set, Circles, as shown in Fig. 1, and two real-world data sets, Bupa and Pid, from the UCI Machine Learning Repository [5] are used. For each data set, half of the data are used as the training set, while the remaining data are used as the test set. The kernel used in the experiments is the Gaussian kernel K(x, y) = exp(−‖x − y‖²/σ²), where σ is the kernel parameter to be optimized. In this paper, unless explicitly stated otherwise, the initial value of the kernel parameter is set to σ0 = (1/l) Σ_{j=1}^{l} ‖x_j − x̄‖, where x̄ is the centroid of the total l training data. Specifically, the σ0 values for Circles, Bupa and Pid are 0.33, 17.6 and 43.48, respectively. The learning rate η is set to 0.5 and ε is set to 0.1 unless stated otherwise.

Table 1 shows the test accuracies of MD, KMD and KMD-opt; the σ values are also presented. Table 1 shows that in most cases KMD obtains better test accuracy than MD, but when the kernel parameter is not chosen appropriately its performance deteriorates greatly. In all cases,
KMD-opt achieves the best test accuracy. Moreover, from Table 1 it can be found that KMD-opt is quite robust: on every data set, the final σ's it produces are almost identical even though the method starts from different initializations. As an example, the left part of Fig. 2 plots the test accuracy of KMD under a series of σ values on Bupa. It verifies the claim that a good performance of KMD depends greatly on the selection of kernel parameters. The right part of Fig. 2 plots the objective function in Eq. (5) under a series of σ values on Bupa. It can be seen from Fig. 2 that the objective function reaches its minimum at σ values similar to those at which KMD achieves its highest accuracy.

5. Conclusions

In this paper, a novel approach for learning kernel parameters is proposed and successfully applied to the kernel minimum distance (KMD) classifier. An objective function is defined to measure the classification reliability of KMD under different kernel parameters, and the optimal values of the kernel parameters are then obtained by optimizing this objective function. Experiments show the effectiveness of the proposed approach for learning kernel parameters in KMD. In future work, the proposed approach will be extended to other kernel-based learning methods such as the support vector machine (SVM) and kernel Fisher discriminant (KFD).

References

[1] J. Peng, D.R. Heisterkamp, H.K. Dai, Adaptive quasiconformal kernel nearest neighbor classification, IEEE Trans. PAMI 26 (5) (2004) 656-661.
[2] J. Shawe-Taylor, N. Cristianini, Kernel methods for pattern analysis, Cambridge University Press, 2004.
[3] L. Wang, K.L. Chan, Learning kernel parameters by using class separability measure, NIPS Workshop on Kernel Machines, Canada, 2002.
[4] D.Q. Zhang, S.C. Chen, Clustering incomplete data using kernel-based fuzzy c-means algorithm, Neural Processing Letters 18 (3) (2003) 155-162.
[5] C. Blake, E. Keogh, and C.J. Merz, UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA, 1998.
Fig. 1. The Circles data set.

Fig. 2. Test accuracy (left) and objective function values (right) under a series of σ values on Bupa.

Table 1. Comparison of the test accuracies (%) of MD, KMD and KMD-opt (the values in brackets denote the σ values at convergence).

Data set   MD      KMD (by σ)                                 KMD-opt (by initial σ)
                   3σ0     2σ0     σ0      σ0/2    σ0/3       3σ0           2σ0           σ0            σ0/2          σ0/3
Circles    50      100     98      50      100     100        100(3.43)     100(3.43)     100(3.43)     100(3.43)     100(3.43)
Bupa       59.43   68      61.14   57.14   66.9    65.14      69.14(16.55)  69.14(16.55)  69.14(16.55)  69.14(16.54)  69.14(16.54)
Pid        6.5     65.1    58.7    5.78    64.3    64.84      65.63(41.45)  65.63(41.45)  65.63(41.45)  65.63(41.45)  65.63(41.45)
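As a closing illustration, the full KMD-opt procedure of Section 3 (Eqs. (5)–(7), Steps 1–4) can be sketched end to end under stated assumptions: a Gaussian kernel with a single parameter σ, the experimental initialization σ0 = (1/l) Σ_j ‖x_j − x̄‖, and, since the analytic derivative of Eq. (5) is not spelled out above, a central finite difference standing in for the exact gradient ∂J/∂σ of Eq. (7). All names are ours, and the positivity floor on σ is our own safeguard, not part of the method.

```python
import numpy as np

def _gauss(A, B, sigma):
    """Pairwise Gaussian kernel exp(-||a - b||^2 / sigma^2)."""
    d2 = (np.sum(A**2, axis=1)[:, None] - 2.0 * A @ B.T
          + np.sum(B**2, axis=1)[None, :])
    return np.exp(-d2 / sigma**2)

def _J(sigma, X, y, classes):
    """Objective of Eq. (5) for a single kernel parameter sigma."""
    total = 0.0
    for xi, yi in zip(X, y):
        xi = xi[None, :]
        d2 = np.array([_gauss(xi, xi, sigma)[0, 0]          # Eq. (3) per class
                       - 2.0 * _gauss(xi, S, sigma).mean()
                       + _gauss(S, S, sigma).mean() for S in classes])
        total += np.exp(d2[yi] - np.delete(d2, yi).min())
    return total

def initial_sigma(X):
    """sigma_0 = (1/l) * sum_j ||x_j - xbar||, as in the experiments."""
    return float(np.linalg.norm(X - X.mean(axis=0), axis=1).mean())

def kmd_opt(X, y, eta=0.5, N=100, eps=0.1, h=1e-4):
    """Steps 1-4: gradient descent on J(sigma), Eqs. (6)-(7)."""
    classes = [X[y == c] for c in np.unique(y)]
    sigma = initial_sigma(X)                                # Step 2
    for _ in range(N):                                      # Step 3
        grad = (_J(sigma + h, X, y, classes)
                - _J(sigma - h, X, y, classes)) / (2.0 * h)
        new_sigma = max(sigma - eta * grad, 1e-6)           # Eq. (7); keep sigma > 0
        if abs(new_sigma - sigma) < eps:                    # Step 4: convergence test
            return new_sigma
        sigma = new_sigma
    return sigma
```

The finite-difference step is a convenience for a one-dimensional θ; with multiple kernel parameters, or for the speed needed in real-time adjustment, the analytic gradient of Eq. (5) would be used instead.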