
Pattern Recognition 46 (2013)

Localized algorithms for multiple kernel learning

Mehmet Gönen, Ethem Alpaydın
Department of Computer Engineering, Boğaziçi University, TR-34342 Bebek, İstanbul, Turkey
E-mail addresses: gonen@boun.edu.tr (M. Gönen), alpaydin@boun.edu.tr (E. Alpaydın).

Keywords: Multiple kernel learning; Support vector machines; Support vector regression; Classification; Regression; Selective attention

Abstract. Instead of selecting a single kernel, multiple kernel learning (MKL) uses a weighted sum of kernels where the weight of each kernel is optimized during training. Such methods assign the same weight to a kernel over the whole input space, and we discuss localized multiple kernel learning (LMKL), which is composed of a kernel-based learning algorithm and a parametric gating model that assigns local weights to kernel functions. These two components are trained in a coupled manner using a two-step alternating optimization algorithm. Empirical results on benchmark classification and regression data sets validate the applicability of our approach. We see that LMKL achieves higher accuracy compared with canonical MKL on classification problems with different feature representations. LMKL can also identify the relevant parts of images, using the gating model as a saliency detector in image recognition problems. In regression tasks, LMKL improves the performance significantly or reduces the model complexity by storing significantly fewer support vectors. © Elsevier Ltd. All rights reserved.

1. Introduction

The support vector machine (SVM) is a discriminative classifier based on the theory of structural risk minimization [33]. Given a sample of independent and identically distributed training instances $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1, +1\}$ is its class label, SVM finds the linear discriminant with the maximum margin in the feature space induced by the mapping function $\Phi(\cdot)$. The discriminant function is $f(x) = \langle w, \Phi(x) \rangle + b$, whose parameters can be learned by solving the following quadratic optimization problem:

min. $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^N \xi_i$
w.r.t. $w \in \mathbb{R}^S$, $\xi \in \mathbb{R}^N_+$, $b \in \mathbb{R}$
s.t. $y_i(\langle w, \Phi(x_i) \rangle + b) \geq 1 - \xi_i \quad \forall i$

where $w$ is the vector of weight coefficients, $S$ is the dimensionality of the feature space obtained by $\Phi(\cdot)$, $C$ is a predefined positive trade-off parameter between model simplicity and classification error, $\xi$ is the vector of slack variables, and $b$ is the bias term of the separating hyperplane. Instead of solving this optimization problem directly, the Lagrangian dual function enables us to obtain the following dual formulation:

max. $\sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
w.r.t. $\alpha \in [0, C]^N$
s.t. $\sum_{i=1}^N \alpha_i y_i = 0$

where $\alpha$ is the vector of dual variables corresponding to the separation constraints, and the kernel matrix obtained from $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ is positive semidefinite. Solving this, we get $w = \sum_{i=1}^N \alpha_i y_i \Phi(x_i)$, and the discriminant function can be written as

$f(x) = \sum_{i=1}^N \alpha_i y_i k(x_i, x) + b.$

There are several kernel functions used successfully in the literature, such as the linear kernel ($k_L$), the polynomial kernel ($k_P$), and the Gaussian kernel ($k_G$):

$k_L(x_i, x_j) = \langle x_i, x_j \rangle$
$k_P(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^q, \quad q \in \mathbb{N}$
$k_G(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2 / s^2), \quad s \in \mathbb{R}_{++}.$

There are also kernel functions proposed for particular applications, such as natural language processing [4] and bioinformatics [3]. Selecting the kernel function $k(\cdot,\cdot)$ and its parameters (e.g., $q$ or $s$) is an important issue in training. Generally, a cross-validation procedure is used to choose the best performing kernel function among a set of kernel functions on a separate validation set different from the training set.
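As a concrete reference for the kernel definitions above, they can be computed on data matrices as in the following minimal numpy sketch (the function names are ours, not from the paper):

```python
import numpy as np

def linear_kernel(X1, X2):
    # k_L(x_i, x_j) = <x_i, x_j>
    return X1 @ X2.T

def polynomial_kernel(X1, X2, q=2):
    # k_P(x_i, x_j) = (<x_i, x_j> + 1)^q
    return (X1 @ X2.T + 1.0) ** q

def gaussian_kernel(X1, X2, s=1.0):
    # k_G(x_i, x_j) = exp(-||x_i - x_j||^2 / s^2)
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-sq / s ** 2)
```

Each function takes row-per-instance matrices and returns the corresponding kernel (Gram) matrix.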

In recent years, multiple kernel learning (MKL) methods have been proposed, which use multiple kernels instead of selecting one specific kernel function and its corresponding parameters:

$k_\eta(x_i, x_j) = f_\eta(\{k_m(x_i^m, x_j^m)\}_{m=1}^P) \qquad (1)$

where the combination function $f_\eta(\cdot)$ can be a linear or a nonlinear function of the input kernels. The kernel functions, $\{k_m(\cdot,\cdot)\}_{m=1}^P$, take $P$ feature representations (not necessarily different) of the data instances, where $x_i = \{x_i^m\}_{m=1}^P$, $x_i^m \in \mathbb{R}^{D_m}$, and $D_m$ is the dimensionality of the corresponding feature representation.

The reasoning is similar to combining different classifiers: instead of choosing a single kernel function and putting all our eggs in the same basket, it is better to have a set and let an algorithm do the picking or combination. There can be two uses of MKL: (i) Different kernels correspond to different notions of similarity, and instead of trying to find which works best, a learning method does the picking for us, or may use a combination of them. Using a specific kernel may be a source of bias, and by allowing a learner to choose among a set of kernels, a better solution can be found. (ii) Different kernels may be using inputs coming from different representations, possibly from different sources or modalities. Since these are different representations, they have different measures of similarity corresponding to different kernels. In such a case, combining kernels is one possible way to combine multiple information sources.

Since their original conception, there has been significant work on the theory and application of multiple kernel learning. Fixed rules use the combination function in (1) as a fixed function of the kernels, without any training. Once we calculate the combined kernel, we train a single kernel machine using this kernel. For example, we can obtain a valid kernel by taking the summation or the multiplication of two kernels:

$k_\eta(x_i, x_j) = k_1(x_i^1, x_j^1) + k_2(x_i^2, x_j^2)$
$k_\eta(x_i, x_j) = k_1(x_i^1, x_j^1)\, k_2(x_i^2, x_j^2).$

The summation rule has been applied successfully in computational biology [7] and optical digit recognition [5] to combine two or more kernels obtained from different representations.

Instead of using a fixed combination function, we can have a function parameterized by a set of parameters $\Theta$ and a learning procedure to optimize $\Theta$ as well. The simplest case is to parameterize the sum rule as a weighted sum:

$k_\eta(x_i, x_j | \Theta = \eta) = \sum_{m=1}^P \eta_m\, k_m(x_i^m, x_j^m)$

with $\eta_m \in \mathbb{R}$. Different versions of this approach differ in the way they restrict the kernel weights [4,9]: for example, we can use arbitrary weights (i.e., a linear combination), nonnegative kernel weights (i.e., a conic combination), or weights on a simplex (i.e., a convex combination). A linear combination may be restrictive, and nonlinear combinations are also possible [3,8]; our proposed approach is of this type, and we will discuss these in more detail later.

We can also learn the kernel combination weights using a quality measure that gives performance estimates for the kernel matrices calculated on the training data. This corresponds to a function that assigns weights to kernel functions:

$\eta = g_\eta(\{k_m(x_i^m, x_j^m)\}_{m=1}^P).$

The quality measure used for determining the kernel weights could be kernel alignment or another similarity measure such as the Kullback–Leibler divergence [36]. Another possibility, inspired by ensemble and boosting methods, is to iteratively update the combined kernel by adding a new kernel as training continues [5,9]. Finally, in a trained combiner parameterized by $\Theta$, if we assume $\Theta$ to contain random variables with a prior, we can use a Bayesian approach; for the case of a weighted sum, we can, for example, put a prior on the kernel weights [8]. A recent survey of multiple kernel learning algorithms is given in [8].
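The fixed rules and the weighted sum above operate directly on precomputed kernel matrices; as a minimal sketch (our naming, assuming the $P$ kernel matrices have already been computed):

```python
import numpy as np

def combine_fixed(K1, K2, rule="sum"):
    # Fixed combination rules: both the sum and the elementwise
    # product of two valid kernels yield a valid kernel.
    return K1 + K2 if rule == "sum" else K1 * K2

def combine_weighted(kernels, eta):
    # k_eta = sum_m eta_m * K_m; depending on the method, eta may be
    # unconstrained (linear), nonnegative (conic), or on the simplex
    # (convex combination).
    return sum(e * K for e, K in zip(eta, kernels))
```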
This paper is organized as follows: we formulate our proposed nonlinear combination method, localized MKL (LMKL), with detailed mathematical derivations in Section 2. We give our experimental results in Section 3, where we compare LMKL with MKL and single kernel SVM. In Section 4, we discuss the key properties of our proposed method together with related work in the literature. We conclude in Section 5.

2. Localized multiple kernel learning

Using a fixed unweighted or weighted sum assigns the same weight to a kernel over the whole input space. Assigning different weights to a kernel in different regions of the input space may produce a better classifier. If the data has an underlying local structure, different similarity measures may be suited to different regions. We propose to divide the input space into regions using a gating function and to assign combination weights to kernels in a data-dependent way [3]; in the neural network literature, a similar architecture was previously proposed under the name mixture of experts [3]. The discriminant function for binary classification is rewritten as

$f(x) = \sum_{m=1}^P \eta_m(x|V)\, \langle w_m, \Phi_m(x^m) \rangle + b \qquad (2)$

where $\eta_m(x|V)$ is a parametric gating model that assigns a weight to $\Phi_m(x^m)$ as a function of $x$, and $V$ is the matrix of gating model parameters. Note that unlike in MKL, in LMKL it is not obligatory to combine different feature spaces; we can also use multiple copies of the same feature space (i.e., kernel) in different regions of the input space and thereby obtain a more complex discriminant function. For example, as we will see shortly, we can combine multiple linear kernels to get a piecewise linear discriminant.

2.1. Gating models

In order to assign kernel weights in a data-dependent way, we use a gating model. Originally, we investigated the softmax gating model [3]:

$\eta_m(x|V) = \frac{\exp(\langle v_m, x^G \rangle + v_{m0})}{\sum_{h=1}^P \exp(\langle v_h, x^G \rangle + v_{h0})} \quad \forall m \qquad (3)$

where $x^G \in \mathbb{R}^{D_G}$ is the representation of the input instance in the feature space in which we learn the gating model, and $V \in \mathbb{R}^{P \times (D_G+1)}$ contains the gating model parameters $\{v_m, v_{m0}\}_{m=1}^P$. The softmax gating model uses kernels in a competitive manner, and generally a single kernel is active for each input. It is possible to use other gating models, and below we discuss two new ones, namely sigmoid and Gaussian. The gating model defines the shape of the region of expertise of the kernels. The sigmoid function allows multiple kernels to be used in a cooperative manner:

$\eta_m(x|V) = 1 / (1 + \exp(-\langle v_m, x^G \rangle - v_{m0})) \quad \forall m. \qquad (4)$

Instead of parameterizing the boundaries of the local regions for kernels, we can also parameterize their centers and spreads using Gaussian gating:

$\eta_m(x|V) = \frac{\exp(-\|x^G - \mu_m\|_2^2 / \sigma_m^2)}{\sum_{h=1}^P \exp(-\|x^G - \mu_h\|_2^2 / \sigma_h^2)} \quad \forall m \qquad (5)$

where $V \in \mathbb{R}^{P \times (D_G+1)}$ contains the means, $\{\mu_m\}_{m=1}^P$, and the spreads, $\{\sigma_m\}_{m=1}^P$; we do not experiment any further with this gating model in the current work.
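A minimal sketch of the softmax and sigmoid gating models in (3) and (4), assuming the gating representation is given as a row-per-instance matrix; the max-shift in the softmax is a standard numerical-stability detail of ours, not part of the formulation:

```python
import numpy as np

def softmax_gating(Xg, V, v0):
    # eta_m(x|V) = exp(<v_m, x^G> + v_m0) / sum_h exp(<v_h, x^G> + v_h0)
    A = Xg @ V.T + v0                       # N x P activations
    A -= A.max(axis=1, keepdims=True)       # numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def sigmoid_gating(Xg, V, v0):
    # eta_m(x|V) = 1 / (1 + exp(-<v_m, x^G> - v_m0)); rows need not sum
    # to one, so several kernels can be active at once (cooperative).
    return 1.0 / (1.0 + np.exp(-(Xg @ V.T + v0)))
```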

If we combine the same feature representation with different kernels (i.e., $x = x^1 = x^2 = \dots = x^P$), we can simply use it also in the gating model (i.e., $x^G = x$) [3]. If we combine different feature representations with the same kernel, the gating model representation $x^G$ can be one of the representations $\{x^m\}_{m=1}^P$, a concatenation of a subset of them, or a completely different representation. In some application areas such as bioinformatics, data instances may appear in a non-vectorial format such as sequences, trees, and graphs, for which we can calculate kernel matrices but cannot represent the data instances as $x$ vectors directly; in such cases, we may use an empirical kernel map [3], which corresponds to using the kernel values between $x$ and the training points as the feature vector for $x$, and define $x^G$ in terms of the kernel values [5]:

$x^G = [k_G(x_1, x)\;\; k_G(x_2, x)\;\; \dots\;\; k_G(x_N, x)]^\top$

where the gating kernel, $k_G(\cdot,\cdot)$, can be one of the combined kernels $\{k_m(\cdot,\cdot)\}_{m=1}^P$, a combination of them, or a completely different kernel used only for determining the gating boundaries.

2.2. Mathematical model

Using the discriminant function in (2) and regularizing the discriminant coefficients of all the feature spaces together, LMKL obtains the following optimization problem:

min. $\frac{1}{2}\sum_{m=1}^P \|w_m\|_2^2 + C\sum_{i=1}^N \xi_i$
w.r.t. $w_m \in \mathbb{R}^{S_m}$, $\xi \in \mathbb{R}^N_+$, $V \in \mathbb{R}^{P\times(D_G+1)}$, $b \in \mathbb{R}$
s.t. $y_i f(x_i) \geq 1 - \xi_i \quad \forall i \qquad (6)$

where nonconvexity is introduced into the model by the nonlinearity formed by the gating model outputs in the separation constraints. Instead of trying to solve (6) directly, we can use a two-step alternating optimization algorithm [3], also used for choosing kernel parameters [6] and for obtaining the $\eta_m$ parameters of MKL [9]. This procedure consists of two basic steps: (i) solving the model with a fixed gating model, and (ii) updating the gating model parameters using the gradients calculated from the current solution. Note that if we fix the gating model parameters, the optimization problem (6) becomes convex, and we can find the corresponding dual optimization problem using duality. For a fixed $V$, the Lagrangian of the primal problem (6) is

$L_D(V) = \frac{1}{2}\sum_{m=1}^P \|w_m\|_2^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \beta_i \xi_i - \sum_{i=1}^N \alpha_i \big( y_i f(x_i) - 1 + \xi_i \big)$

and taking the derivatives of $L_D(V)$ with respect to the primal variables gives

$\partial L_D(V)/\partial w_m = 0 \Rightarrow w_m = \sum_{i=1}^N \alpha_i y_i\, \eta_m(x_i|V)\, \Phi_m(x_i^m) \quad \forall m$
$\partial L_D(V)/\partial b = 0 \Rightarrow \sum_{i=1}^N \alpha_i y_i = 0$
$\partial L_D(V)/\partial \xi_i = 0 \Rightarrow C = \alpha_i + \beta_i \quad \forall i. \qquad (7)$

From $L_D(V)$ and (7), the dual formulation is obtained as

max. $J(V) = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j\, k_\eta(x_i, x_j)$
w.r.t. $\alpha \in [0, C]^N$
s.t. $\sum_{i=1}^N \alpha_i y_i = 0 \qquad (8)$

where the locally combined kernel function is defined as

$k_\eta(x_i, x_j) = \sum_{m=1}^P \eta_m(x_i|V)\, k_m(x_i^m, x_j^m)\, \eta_m(x_j|V). \qquad (9)$

Note that if the input kernel matrices are positive semidefinite, the combined kernel matrix is also positive semidefinite by construction: the locally combined kernel matrix is the sum of the matrices obtained by pre- and post-multiplying each kernel matrix by the (diagonal matrix formed from the) vector of gating model outputs for that kernel. Using the support vector coefficients obtained from (8) and the gating model parameters, we obtain the following discriminant function:

$f(x) = \sum_{i=1}^N \alpha_i y_i\, k_\eta(x_i, x) + b.$

For a given $V$, the gradients of the objective function in (8) are equal to the gradients of the objective function in (6) due to strong duality, which guarantees that, for a convex quadratic optimization problem, the dual problem has the same optimum value as its primal problem. These gradients are used to update the gating model parameters at each step.
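Before turning to the gradients, note that the locally combined kernel matrix in (9) reduces to a sum of elementwise products; a sketch with our naming:

```python
import numpy as np

def locally_combined_kernel(kernels, G):
    # kernels: list of P (N x N) kernel matrices K_m
    # G: (N x P) gating outputs, G[i, m] = eta_m(x_i | V)
    # Implements (9): K_eta[i, j] = sum_m G[i, m] * K_m[i, j] * G[j, m],
    # i.e. K_eta = sum_m diag(g_m) K_m diag(g_m).
    N = G.shape[0]
    K_eta = np.zeros((N, N))
    for m, K in enumerate(kernels):
        g = G[:, m]
        K_eta += np.outer(g, g) * K    # elementwise product
    return K_eta
```

Since each summand is of the form $D K D$ with $D$ diagonal and $K$ positive semidefinite, positive semidefiniteness of the sum follows immediately, as claimed above.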
2.3. Training with alternating optimization

We can find the gradients of $J(V)$ with respect to the parameters of all three gating models. The gradients of (8) with respect to the parameters of the softmax gating model (3) are

$\frac{\partial J(V)}{\partial v_m} = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \mathcal{U}_{ij} \sum_{h=1}^P \eta_h(x_i|V)\, k_h(x_i^h, x_j^h)\, \eta_h(x_j|V)\, \big( x_i^G (\delta_m^h - \eta_m(x_i|V)) + x_j^G (\delta_m^h - \eta_m(x_j|V)) \big)$

$\frac{\partial J(V)}{\partial v_{m0}} = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \mathcal{U}_{ij} \sum_{h=1}^P \eta_h(x_i|V)\, k_h(x_i^h, x_j^h)\, \eta_h(x_j|V)\, \big( \delta_m^h - \eta_m(x_i|V) + \delta_m^h - \eta_m(x_j|V) \big)$

where $\mathcal{U}_{ij} = \alpha_i \alpha_j y_i y_j$, and $\delta_m^h$ is 1 if $m = h$ and 0 otherwise. The same gradients with respect to the parameters of the sigmoid gating model (4) are

$\frac{\partial J(V)}{\partial v_m} = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \mathcal{U}_{ij}\, \eta_m(x_i|V)\, k_m(x_i^m, x_j^m)\, \eta_m(x_j|V)\, \big( x_i^G (1 - \eta_m(x_i|V)) + x_j^G (1 - \eta_m(x_j|V)) \big)$

$\frac{\partial J(V)}{\partial v_{m0}} = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \mathcal{U}_{ij}\, \eta_m(x_i|V)\, k_m(x_i^m, x_j^m)\, \eta_m(x_j|V)\, \big( 2 - \eta_m(x_i|V) - \eta_m(x_j|V) \big)$

where the gating model parameters for a kernel function are updated independently. We can also find the gradients with respect to the means and the spreads of the Gaussian gating model (5):

$\frac{\partial J(V)}{\partial \mu_m} = -\sum_{i=1}^N\sum_{j=1}^N \mathcal{U}_{ij} \sum_{h=1}^P \eta_h(x_i|V)\, k_h(x_i^h, x_j^h)\, \eta_h(x_j|V)\, \big( (x_i^G - \mu_m)(\delta_m^h - \eta_m(x_i|V)) + (x_j^G - \mu_m)(\delta_m^h - \eta_m(x_j|V)) \big) / \sigma_m^2$

$\frac{\partial J(V)}{\partial \sigma_m} = -\sum_{i=1}^N\sum_{j=1}^N \mathcal{U}_{ij} \sum_{h=1}^P \eta_h(x_i|V)\, k_h(x_i^h, x_j^h)\, \eta_h(x_j|V)\, \big( \|x_i^G - \mu_m\|_2^2\, (\delta_m^h - \eta_m(x_i|V)) + \|x_j^G - \mu_m\|_2^2\, (\delta_m^h - \eta_m(x_j|V)) \big) / \sigma_m^3.$

The complete algorithm of our proposed LMKL is summarized in Algorithm 1. Previously, we performed a predetermined number of iterations [3]; now, we calculate a step size at each iteration using a line search method and detect the convergence of the algorithm by observing the change in the objective function value of (8). This allows converging to a better solution and hence a better learner. Our algorithm is guaranteed to converge in a finite number of iterations: at each iteration, we pick the step size using a line search method, so the objective function value cannot get worse. After a finite number of iterations, the algorithm converges to one of the local optima due to the nonconvexity of the primal problem in (6).

Algorithm 1. Localized Multiple Kernel Learning (LMKL).
1: Initialize $V^{(0)}$ randomly
2: repeat
3:   Calculate $K_\eta^{(t)} = \{k_\eta(x_i, x_j)\}_{i,j=1}^N$ using $V^{(t)}$
4:   Solve the kernel machine with $K_\eta^{(t)}$
5:   Calculate $\partial J(V)/\partial V$ at $V^{(t)}$
6:   Determine the step size, $\Delta^{(t)}$, using a line search method
7:   $V^{(t+1)} \leftarrow V^{(t)} - \Delta^{(t)}\, \partial J(V)/\partial V$
8: until convergence
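A compact sketch of Algorithm 1 for the softmax gating case follows. It is a simplified editorial reading, not the authors' MATLAB/MOSEK implementation: the inner solver is scikit-learn's SVC on a precomputed kernel, and the line search of step 6 is replaced by a fixed step size.

```python
import numpy as np
from sklearn.svm import SVC

def train_lmkl_softmax(kernels, Xg, y, C=1.0, step=0.1, n_iter=50, tol=1e-4):
    """Two-step alternating optimization (Algorithm 1), softmax gating.

    kernels: list of P precomputed (N x N) kernel matrices
    Xg:      (N, D_G) gating representation; y: labels in {-1, +1}
    """
    N, P = Xg.shape[0], len(kernels)
    Xb = np.hstack([Xg, np.ones((N, 1))])          # absorb bias v_m0
    V = 1e-2 * np.random.randn(P, Xb.shape[1])
    prev_J = np.inf
    for _ in range(n_iter):
        A = Xb @ V.T
        G = np.exp(A - A.max(axis=1, keepdims=True))
        G /= G.sum(axis=1, keepdims=True)          # eta_m(x_i|V), eq. (3)
        K = sum(np.outer(G[:, m], G[:, m]) * kernels[m] for m in range(P))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)      # step 4
        alpha = np.zeros(N)
        alpha[svm.support_] = np.abs(svm.dual_coef_[0])
        U = np.outer(alpha * y, alpha * y)         # U_ij = a_i a_j y_i y_j
        J = alpha.sum() - 0.5 * np.sum(U * K)      # dual objective (8)
        if abs(prev_J - J) < tol:
            break
        prev_J = J
        # Gradient of J w.r.t. V for softmax gating, vectorized:
        # dJ/dv_m = -sum_i x_i (T_im - eta_m(x_i) * sum_h T_ih),
        # with T_ih = sum_j U_ij * eta_h(x_i) K_h[i,j] eta_h(x_j).
        T = np.stack([(U * np.outer(G[:, h], G[:, h]) * kernels[h]).sum(axis=1)
                      for h in range(P)], axis=1)  # N x P
        R = T - G * T.sum(axis=1, keepdims=True)
        dV = -(R.T @ Xb)
        V -= step * dV                             # fixed step, no line search
    return V, svm, G
```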

2.4. Extensions to other algorithms

We extend our proposed LMKL framework for two-class classification [3] to other kernel-based algorithms, namely support vector regression (SVR) [6], multiclass SVM (MCSVM), and one-class SVM (OCSVM). Note that any kernel machine that has a hyperplane-based decision function can be localized by replacing $\langle w, \Phi(x)\rangle$ with $\sum_{m=1}^P \eta_m(x|V)\langle w_m, \Phi_m(x^m)\rangle$ and deriving the corresponding update rules.

2.4.1. Support vector regression

We can also apply the localized kernel idea to $\epsilon$-tube SVR [6]. The decision function is rewritten as

$f(x) = \sum_{m=1}^P \eta_m(x|V)\, \langle w_m, \Phi_m(x^m)\rangle + b$

and the modified primal optimization problem is

min. $\frac{1}{2}\sum_{m=1}^P \|w_m\|_2^2 + C\sum_{i=1}^N (\xi_i^+ + \xi_i^-)$
w.r.t. $w_m \in \mathbb{R}^{S_m}$, $\xi^+ \in \mathbb{R}^N_+$, $\xi^- \in \mathbb{R}^N_+$, $V \in \mathbb{R}^{P\times(D_G+1)}$, $b \in \mathbb{R}$
s.t. $\epsilon + \xi_i^+ \geq y_i - f(x_i) \quad \forall i$
     $\epsilon + \xi_i^- \geq f(x_i) - y_i \quad \forall i$

where $\{\xi^+, \xi^-\}$ are the vectors of slack variables and $\epsilon$ is the width of the regression tube. For a given $V$, the corresponding dual formulation is

max. $J(V) = \sum_{i=1}^N y_i(\alpha_i^+ - \alpha_i^-) - \epsilon\sum_{i=1}^N(\alpha_i^+ + \alpha_i^-) - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N (\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-)\, k_\eta(x_i, x_j)$
w.r.t. $\alpha^+ \in [0,C]^N$, $\alpha^- \in [0,C]^N$
s.t. $\sum_{i=1}^N (\alpha_i^+ - \alpha_i^-) = 0$

and the resulting decision function is

$f(x) = \sum_{i=1}^N (\alpha_i^+ - \alpha_i^-)\, k_\eta(x_i, x) + b.$

The same learning algorithm given for two-class classification problems can be applied to regression problems by simply replacing $\mathcal{U}_{ij}$ in the gradient-descent update of the gating model (see Section 2.3) with $(\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-)$.
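In other words, only the $\mathcal{U}$ matrix in the gating gradients changes; in scikit-learn terms this amounts to building it from the signed dual coefficients of a fitted regressor, as in this small sketch (naming is ours):

```python
import numpy as np
from sklearn.svm import SVR

def svr_U(svr, N):
    # For epsilon-SVR, U_ij = (a_i^+ - a_i^-)(a_j^+ - a_j^-);
    # scikit-learn's SVR exposes the signed differences directly
    # in dual_coef_ for the support vectors.
    d = np.zeros(N)
    d[svr.support_] = svr.dual_coef_[0]
    return np.outer(d, d)
```

The alternating-optimization loop sketched earlier then goes through unchanged with this $\mathcal{U}$.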
2.4.2. Multiclass support vector machine

In a multiclass classification problem, a data instance can belong to one of $K$ classes, and the class label is given as $y_i \in \{1, 2, \dots, K\}$. There are two basic approaches in the literature for solving multiclass problems. In the multimachine approach, the original multiclass problem is converted into a number of independent, uncoupled two-class problems. In the single-machine approach, the constraints due to having multiple classes are coupled in a single formulation [33]. We can easily apply LMKL to the multimachine approach by solving (8) for each two-class problem separately. In such a case, we obtain different gating model parameters and hence different kernel weighing strategies for each of the problems. Another possibility is to solve these uncoupled problems separately but learn a common gating model; a similar approach is used for obtaining common kernel weights in MKL for multiclass problems [9].

For the single-machine approach, for class $l$, we write the discriminant function as follows:

$f^l(x) = \sum_{m=1}^P \eta_m(x|V)\, \langle w_m^l, \Phi_m(x^m)\rangle + b^l.$

The modified primal optimization problem is

min. $\frac{1}{2}\sum_{m=1}^P\sum_{l=1}^K \|w_m^l\|_2^2 + C\sum_{i=1}^N\sum_{l=1}^K \xi_i^l$
w.r.t. $w_m^l \in \mathbb{R}^{S_m}$, $\xi^l \in \mathbb{R}^N_+$, $V \in \mathbb{R}^{P\times(D_G+1)}$, $b^l \in \mathbb{R}$
s.t. $f^{y_i}(x_i) - f^l(x_i) \geq 2 - \xi_i^l \quad \forall (i,\, l \neq y_i)$
     $\xi_i^{y_i} = 0 \quad \forall i.$

We can obtain the dual formulation for a given $V$ by following the same derivation steps:

max. $J(V) = 2\sum_{i=1}^N\sum_{l=1}^K \alpha_i^l - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \Big( \delta_{y_i}^{y_j} A_i A_j - A_i \alpha_j^{y_i} - A_j \alpha_i^{y_j} + \sum_{l=1}^K \alpha_i^l \alpha_j^l \Big)\, k_\eta(x_i, x_j)$
w.r.t. $\alpha^l \in \mathbb{R}^N_+$
s.t. $\sum_{i=1}^N (\delta_{y_i}^l A_i - \alpha_i^l) = 0 \quad \forall l$
     $(1 - \delta_{y_i}^l)\, C \geq \alpha_i^l \geq 0 \quad \forall (i, l)$

where $A_i = \sum_{l=1}^K \alpha_i^l$. The resulting discriminant functions, which use the locally combined kernel function, are given as

$f^l(x) = \sum_{i=1}^N (\delta_{y_i}^l A_i - \alpha_i^l)\, k_\eta(x_i, x) + b^l.$

In learning the gating model parameters for multiclass classification problems, $\mathcal{U}_{ij}$ should be replaced with $\sum_{l=1}^K (\delta_{y_i}^l A_i - \alpha_i^l)(\delta_{y_j}^l A_j - \alpha_j^l)$.

2.4.3. One-class support vector machine

OCSVM is a discriminative method proposed for novelty detection problems [3]. The task is to learn the smoothest hyperplane that puts most of the training instances on one side of the hyperplane while allowing the remaining instances to stay on the other side at a cost. In the localized version, we rewrite the discriminant function as

$f(x) = \sum_{m=1}^P \eta_m(x|V)\, \langle w_m, \Phi_m(x^m)\rangle + b,$

and the modified primal optimization problem is

min. $\frac{1}{2}\sum_{m=1}^P \|w_m\|_2^2 + C\sum_{i=1}^N \xi_i + b$
w.r.t. $w_m \in \mathbb{R}^{S_m}$, $\xi \in \mathbb{R}^N_+$, $V \in \mathbb{R}^{P\times(D_G+1)}$, $b \in \mathbb{R}$
s.t. $f(x_i) + \xi_i \geq 0 \quad \forall i.$

For a given $V$, we obtain the following dual optimization problem:

max. $J(V) = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j\, k_\eta(x_i, x_j)$
w.r.t. $\alpha \in [0, C]^N$
s.t. $\sum_{i=1}^N \alpha_i = 1$

and the resulting discriminant function is

$f(x) = \sum_{i=1}^N \alpha_i\, k_\eta(x_i, x) + b.$

In the learning algorithm, $\mathcal{U}_{ij}$ should be replaced with $\alpha_i \alpha_j$ when calculating the gradients with respect to the gating model parameters.

3. Experiments

In this section, we report the empirical performance of LMKL for classification and regression problems on several data sets, and compare LMKL with SVM, SVR, and MKL (using the linear formulation of [4]). We use our own implementations of SVM, SVR, MKL, and LMKL written in MATLAB, and the resulting optimization problems for all these methods are solved using the MOSEK optimization software [6]. Unless otherwise stated, our experimental methodology is as follows: a random one-third of the data set is reserved as the test set, and the remaining two-thirds is resampled using 5×2 cross-validation to generate ten training and validation sets, with stratification (i.e., preserving class ratios) for classification problems. The validation sets of all folds are used to optimize $C$ (by trying the values {0.01, 0.1, 1, 10, 100}) and, for regression problems, $\epsilon$, the width of the error tube. The best configuration (measured as the highest average classification accuracy, or the lowest mean square error (MSE) for regression problems) on the validation folds is used to train the final classifiers/regressors on the training folds, and their performance is measured over the test set. We have ten test set results, and we report their averages and standard deviations, as well as the percentage of instances stored as support vectors and the total training time (in seconds) including the cross-validation. We use the 5×2 cv paired F test for comparison. In the experiments, we normalize the kernel matrices to unit diagonal before training.
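The unit-diagonal normalization mentioned above is the usual $k(x_i, x_j)/\sqrt{k(x_i, x_i)\, k(x_j, x_j)}$ scaling; as a sketch:

```python
import numpy as np

def normalize_unit_diagonal(K):
    # K_ij <- K_ij / sqrt(K_ii * K_jj), so that every instance has
    # unit self-similarity before the kernels are combined.
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```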
3.1. Classification experiments

3.1.1. Illustrative classification problem

In order to illustrate our proposed algorithm, we use the toy data set GAUSS4 [3], consisting of data instances generated from four Gaussian components (two for each class) with the following prior probabilities, mean vectors, and covariance matrices:

$p_1 = 0.25 \quad m_1 = (-3.0, +1.0)^\top \quad S_1 = \mathrm{diag}(0.8,\ 2.0)$
$p_2 = 0.25 \quad m_2 = (+1.0, +1.0)^\top \quad S_2 = \mathrm{diag}(0.8,\ 2.0)$
$p_3 = 0.25 \quad m_3 = (-1.0, -2.2)^\top \quad S_3 = \mathrm{diag}(0.8,\ 4.0)$
$p_4 = 0.25 \quad m_4 = (+3.0, -2.2)^\top \quad S_4 = \mathrm{diag}(0.8,\ 4.0)$

where data instances from the first two components are labeled as positive and the others are labeled as negative. First, we train both MKL and LMKL with softmax gating to combine a linear kernel, $k_L$, and a second-degree polynomial kernel, $k_P$ ($q=2$). Fig. 1(b) shows the classification boundaries calculated and the support vectors stored on one of the training folds by MKL, which assigns combination weights 0.32 and 0.68 to $k_L$ and $k_P$, respectively. We see that, using the kernel matrix obtained by combining $k_L$ and $k_P$ with these weights, we do not achieve a good approximation to the optimal Bayes boundary. As we see in Fig. 1(c), LMKL divides the input space into two regions and uses the polynomial kernel to separate one component from the two others quadratically in one region, and the linear kernel for the remaining component in the other region. We see that we get a very good approximation of the optimal Bayes boundary. The softmax function in the gating model achieves a smooth transition between the two kernels.

The superiority of the localized approach is also apparent in the smoothness of the fit, which uses fewer support vectors: MKL achieves … per cent average test accuracy by storing … per cent of the training instances as support vectors, whereas LMKL achieves … per cent average test accuracy by storing … per cent support vectors. With LMKL, we can also combine multiple copies of the same kernel: Fig. 1(d) shows the classification and gating model boundaries of LMKL using three linear kernels, which approximates the optimal Bayes boundary in a piecewise linear manner. For this configuration, LMKL achieves … per cent average test accuracy by storing … per cent support vectors. Instead of using complex kernels such as high-degree polynomial kernels or the Gaussian kernel, local combination of simple kernels (e.g., linear or low-degree polynomial kernels) can produce accurate classifiers and avoid overfitting. Fig. 2 shows the average test accuracies, support vector percentages, and training times, with one standard deviation, for LMKL with different numbers of linear kernels. We see that even if we provide more kernels than needed, LMKL uses only as many support vectors as required and does not overfit: LMKL obtains nearly the same average test accuracies and support vector percentages with three or more linear kernels. We also see that the training time of LMKL increases linearly with the number of kernels.

3.1.2. Combining multiple feature representations of benchmark data sets

We compare SVM, MKL, and LMKL in terms of classification performance, model complexity (i.e., stored support vector percentage), and training time. We train SVMs with linear kernels calculated on each feature representation separately. We also train an SVM with a linear kernel calculated on the concatenation of all feature representations, which is referred to as ALL. MKL and LMKL combine linear kernels calculated on each feature representation. LMKL uses a single feature representation or the concatenation of all feature representations in the gating model. We use both softmax and sigmoid gating models in our experiments.

We perform experiments on the Multiple Features (MULTIFEAT) digit recognition data set from the UCI Machine Learning Repository, composed of six different data representations for handwritten numerals.

Fig. 1. MKL and LMKL solutions on the GAUSS4 data set. (a) The dashed ellipses show the Gaussians from which data are sampled, and the solid line shows the optimal Bayes discriminant. (b)–(d) The solid lines show the discriminants learned; the circled data points represent the stored support vectors. For the LMKL solutions, the dashed lines show the gating boundaries, where the gating model outputs of neighboring kernels are equal. (a) GAUSS4 data set. (b) MKL with ($k_L$–$k_P$). (c) LMKL with ($k_L$–$k_P$). (d) LMKL with ($k_L$–$k_L$–$k_L$).

Fig. 2. The average test accuracies, support vector percentages, and training times on the GAUSS4 data set obtained by LMKL with multiple copies of linear kernels and softmax gating.

7 M. Gönen, E. Alpaydın / attern Recognton 46 (3) composed of sx dfferent data representatons for handwrtten numerals. The propertes of these feature representatons are summarzed n Table. A bnary classfcaton problem s generated from the MULTIFEAT data set to separate small ( 4 ) dgts from large ( 5 9 ) dgts. We use the concatenaton of all feature representatons n the gatng model for ths data set. Table lsts the classfcaton results on the MULTIFEAT data set obtaned by SVM, MKL, and LMKL. We see that SVM (ALL) s sgnfcantly more accurate than the best SVM wth sngle feature representaton, namely SVM (FAC), but wth a sgnfcant ncrease n the number of support vectors. MKL s as accurate as SVM (ALL) but stores sgnfcantly more support vectors. LMKL wth softmax gatng s as accurate as SVM (ALL) usng sgnfcantly fewer support vectors. LMKL wth sgmod gatng s sgnfcantly more accurate than MKL, SVM (ALL), and sngle kernel SVMs. It stores Table Multple feature representatons n the MULTIFEAT data set. Name Dmenson Data source FAC 6 rofle correlatons FOU 76 Fourer coeffcents of the shapes KAR 64 Karhunen Loeve coeffcents MOR 6 Morphologcal features IX 4 xel averages n 3 wndows ZER 47 Zernke moments Table Classfcaton results on the MULTIFEAT data set. Method Test accuracy Support vector Tranng tme (s) SVM (FAC) SVM (FOU) SVM (KAR) SVM (MOR) SVM (IX) SVM (ZER) SVM (ALL) sgnfcantly fewer support vectors than MKL and SVM (ALL), and tes wth SVM (FAC). For the MULTIFEAT data set, the average kernel weghts and the average number of actve kernels (whose gatng values are nonzero) calculated on the test set are gven n Table 3. We see that both LMKL wth softmax gatng and LMKL wth sgmod gatng use fewer kernels than MKL n the decson functon. MKL uses all kernels wth the same weght for all nputs; LMKL uses a dfferent smaller subset for each nput. By storng sgnfcantly fewer support vectors and usng fewer actve kernels, LMKL s sgnfcantly faster than MKL n the testng phase. MKL and LMKL are teratve methods and need to solve SVM problems at each teraton. LMKL also needs to update the gatng parameters and that s why t requres sgnfcantly longer tranng tmes than MKL when the dmensonalty of the gatng model representaton s hgh (649 n ths set of experments) LMKL needs to calculate the gradents of (8) wth respect to the parameters of the gatng model and to perform a lne search usng these gradents. Learnng wth sgmod gatng s faster than softmax gatng because wth the sgmod durng the gradentupdate only a sngle value s used and updatng takes OðÞ tme, whereas wth the softmax, all gatng outputs are used and updatng s Oð Þ. When learnng tme s crtcal, the tme complexty of ths step can be reduced by decreasng the dmensonalty of the gatng model representaton usng an unsupervsed dmensonalty reducton method. Note also that both the output calculatons and the gradents n separate kernels can be effcently parallelzed when parallel hardware s avalable. Instead of combnng dfferent feature representatons, we can combne multple copes of the same feature representaton wth LMKL. We combne multple copes of lnear kernels on the sngle best FAC representaton usng the sgmod gatng model on the same representaton (see Fg. 3). Even f we ncrease accuracy (not sgnfcantly) by ncreasng the number of copes of the kernels compared to SVM (FAC), we could not acheve the performance obtaned by combnng dfferent representatons wth sgmod gatng. Table 3 Average kernel weghts and number of actve kernels on the MULTIFEAT data set. 
Table 3
Average kernel weights and numbers of active kernels on the MULTIFEAT data set for MKL, LMKL (softmax), and LMKL (sigmoid) over the FAC, FOU, KAR, MOR, PIX, and ZER kernels. (Numeric weight entries lost in this copy.) The average numbers of active kernels are 6.0, 1.43, and 5.36, respectively.

Fig. 3. The average test accuracies, support vector percentages, and training times on the MULTIFEAT data set obtained by LMKL with multiple copies of linear kernels and sigmoid gating on the FAC representation.

For example, LMKL with sigmoid gating and kernels over six different feature representations is better than LMKL with sigmoid gating and six copies of the kernel over the FAC representation, in terms of both classification accuracy (though not significantly) and the number of support vectors stored (significantly) (see Table 2). We also see that the training time of LMKL increases (though not monotonically) with the number of kernels.

We also perform experiments on the Internet Advertisements (ADVERT) data set from the UCI Machine Learning Repository, composed of five different feature representations (different bags of words), with some additional geometry information about the images, which is ignored in our experiments due to missing values. The properties of these feature representations are summarized in Table 4. The classification task is to predict whether an image is an advertisement or not. We use the CAPTION representation in the gating model due to its lower dimensionality compared to the other representations.

Table 4
Multiple feature representations in the ADVERT data set.

Name     Dimension  Data source
URL       457       Phrases occurring in the URL
ORIGURL   495       Phrases occurring in the URL of the image
ANCURL    472       Phrases occurring in the anchor text
ALT       111       Phrases occurring in the alternative text
CAPTION    19       Phrases occurring in the caption terms

Table 5
Classification results on the ADVERT data set. Rows: SVM (URL), SVM (ORIGURL), SVM (ANCURL), SVM (ALT), SVM (CAPTION), SVM (ALL), MKL, LMKL (softmax), LMKL (sigmoid), LMKL (5 × ANCURL and sigmoid); columns: test accuracy, support vector percentage, and training time (s). (Numeric entries lost in this copy.)

Table 5 gives the classification results on the ADVERT data set obtained by SVM, MKL, and LMKL. We see that SVM (ALL) is significantly more accurate than the best SVM with a single feature representation, namely SVM (ANCURL), and uses significantly fewer support vectors. MKL has classification accuracy comparable to SVM (ALL), and the difference between the numbers of support vectors is not significant. LMKL with softmax/sigmoid gating has accuracy comparable to MKL and SVM (ALL). LMKL with sigmoid gating stores significantly fewer support vectors than SVM (ALL).

The average kernel weights and the average numbers of active kernels on the ADVERT data set are given in Table 6. The difference between the running times of MKL and LMKL is not as large as on the MULTIFEAT data set, because the gating model representation (CAPTION) has only 19 dimensions. Differently from the MULTIFEAT data set, LMKL uses approximately the same number of kernels as MKL, or more, on this data set. (On one of the ten folds, MKL chooses five kernels and on the remaining nine folds it chooses four, leading to an average of 4.1.)

Table 6
Average kernel weights and numbers of active kernels on the ADVERT data set for MKL, LMKL (softmax), and LMKL (sigmoid) over the URL, ORIGURL, ANCURL, ALT, and CAPTION kernels. (Numeric weight entries lost in this copy.) The average numbers of active kernels are 4.1, 4.4, and 4.96, respectively.

When we combine multiple copies of linear kernels on the ANCURL representation with LMKL, using the sigmoid gating model on the same representation (see Fig. 4), we see that LMKL stores many fewer support vectors than the single kernel SVM (ANCURL), without sacrificing accuracy. But, as before on the MULTIFEAT data set, we could not achieve the classification accuracy obtained by combining different representations with sigmoid gating. For example, LMKL with sigmoid gating and kernels over five different feature representations is significantly better than LMKL with sigmoid gating and five copies of the kernel over the ANCURL representation in terms of classification accuracy, but the latter stores significantly fewer support vectors (see Table 5).
We again see that the training time of LMKL increases linearly with the number of kernels.

3.1.3. Combining multiple input patches for image recognition problems

For image recognition problems, only some parts of the images contain meaningful information, and it is not necessary to examine the whole image in detail. Instead of defining kernels over the whole input image, we can divide the image into non-overlapping patches and use separate kernels on these patches. The kernels calculated on the parts with relevant information take nonzero weights, and the kernels over the non-relevant patches are ignored. We use a low-resolution (simpler) version of the image as input to the gating model, which selects a subset of the high-resolution localized kernels. In such a case, it is not a good idea to use softmax gating in LMKL, because softmax gating would choose one or very few patches, and a patch by itself does not carry enough discriminative information.

We train SVMs with linear kernels calculated on the whole image at different resolutions. MKL and LMKL combine linear kernels calculated on each image patch. LMKL uses the whole image at different resolutions in the gating model [4]. We perform experiments on the OLIVETTI data set, which consists of 400 grayscale images of 40 subjects. We construct a two-class data set by collecting the male subjects (36 subjects) into one class and the female subjects (four subjects) into the other class. Our experimental methodology for this data set is slightly different: we select two images of each subject randomly and reserve these 80 images in total as the test set. Then, we apply 8-fold cross-validation on the remaining 320 images by putting one image of each subject into the validation set at each fold. MKL and LMKL combine 16 linear kernels calculated on image patches of size 16×16.
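A sketch of this patch-based setup, assuming 64×64 inputs as in OLIVETTI; the block-averaging used here for the low-resolution gating input is our assumption, since the paper does not specify the downsampling method:

```python
import numpy as np

def patch_kernels(images, patch=16):
    # images: (N, 64, 64) array; split each image into non-overlapping
    # patch x patch blocks and build one linear kernel per patch.
    N, H, W = images.shape
    kernels = []
    for r in range(0, H, patch):
        for c in range(0, W, patch):
            X = images[:, r:r + patch, c:c + patch].reshape(N, -1)
            kernels.append(X @ X.T)    # linear kernel on this patch
    return kernels                      # 16 kernels for 64x64 inputs

def gating_representation(images, res=8):
    # Low-resolution version of the image fed to the gating model,
    # obtained by simple block averaging.
    N, H, W = images.shape
    f = H // res
    return images.reshape(N, res, f, res, f).mean(axis=(2, 4)).reshape(N, -1)
```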

Fig. 4. The average test accuracies, support vector percentages, and training times on the ADVERT data set obtained by LMKL with multiple copies of linear kernels and sigmoid gating on the ANCURL representation.

Table 7
Classification results on the OLIVETTI data set. Rows: SVM ($x$ = 4×4), SVM ($x$ = 8×8), SVM ($x$ = 16×16), SVM ($x$ = 32×32), SVM ($x$ = 64×64), MKL, LMKL (softmax, $x^G$ = 4×4), LMKL (softmax, $x^G$ = 8×8), LMKL (softmax, $x^G$ = 16×16), LMKL (sigmoid, $x^G$ = 4×4), LMKL (sigmoid, $x^G$ = 8×8), LMKL (sigmoid, $x^G$ = 16×16); columns: test accuracy, support vector percentage, and training time (s). (Numeric entries lost in this copy.)

Table 7 shows the results of MKL and LMKL combining kernels calculated over non-overlapping patches of the face images. MKL achieves significantly higher classification accuracy than all single kernel SVMs except at 32×32 resolution. LMKL with softmax gating has classification accuracy comparable to MKL and stores significantly fewer support vectors when 4×4 or 16×16 images are used in the gating model. This is mainly due to the normalization property of softmax gating, which generally activates a single patch and ignores the others; this uses fewer support vectors but is not as accurate. LMKL with sigmoid gating significantly improves the classification accuracy over MKL by looking at the 8×8 images in the gating model and choosing a subset of the high-resolution patches. We see that the training time of LMKL increases monotonically with the dimensionality of the gating model representation.

Fig. 5 illustrates example uses of MKL and LMKL with softmax and sigmoid gating. Fig. 5(b)–(c) show the combination weights found by MKL and sample face images, stored as support vectors, weighted with those weights. MKL uses the same weights over the whole input space, and thereby the parts whose weights are nonzero are used in the decision process for all subjects. When we look at the results of LMKL, we see that the gating model activates the important parts of each face image, and these parts are used in the classifier with nonzero weights, whereas the parts whose gating model outputs are zero are not considered. That is, looking at the output of the gating model, we can skip processing the high-resolution versions of those parts. This can be considered similar to a selective attention mechanism, whereby the gating model defines a saliency measure and drives a high-resolution fovea/eye to consider only regions of high saliency. For example, if we use LMKL with softmax gating (see Fig. 5(d)–(f)), the gating model generally activates a single patch containing a part of the eyes or eyebrows, depending on the subject. This may not be enough for good discrimination, and using sigmoid gating is more appropriate. When we use LMKL with sigmoid gating (see Fig. 5(g)–(i)), multiple patches are given nonzero weights in a data-dependent way. Fig. 6 gives the average kernel weights on the test set for MKL, LMKL with softmax gating, and LMKL with sigmoid gating. We see that MKL and LMKL with softmax gating use fewer high-resolution patches than LMKL with sigmoid gating.

We can generalize this idea even further: say we have a number of information sources that are costly to extract or process, and a relatively simpler one. In such a case, we can feed the simple representation to the gating model, feed the costly representations to the actual kernels, and train LMKL. The gating model then chooses a costly representation only when it is needed, and chooses only a subset of the costly representations. Note that the representation used by the gating model does not need to be very precise, because it does not make the actual decision; it only chooses the representation(s) that make the actual decision.
3.2. Regression experiments

3.2.1. Illustrative regression problem

We illustrate the applicability of LMKL to regression problems on the MOTORCYCLE data set discussed in [3]. We train LMKL with three linear kernels and softmax gating ($C$ = … and $\epsilon$ = …) using cross-validation; Fig. 7 shows the average of the global and local fits obtained over these folds. We learn a piecewise linear fit through three local models, obtained using linear kernels in each region, and we combine them using the softmax gating model (shown by dashed lines). The softmax gating model divides the input space between the kernels, generally selecting a single kernel to use, and also ensures a smooth transition between the local fits.
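The kind of fit described above can be reproduced in a few lines; the following toy sketch uses hand-set, purely illustrative parameters (not values fitted to MOTORCYCLE) to show how a softmax gate blends three local linear models into one smooth piecewise linear curve:

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# Three local linear models and a 1-D softmax gate.
x = np.linspace(0.0, 60.0, 200)[:, None]
W = np.array([[0.0], [-4.0], [2.0]])     # local slopes (illustrative)
b = np.array([0.0, 60.0, -120.0])        # local intercepts (illustrative)
V = np.array([[-1.0], [0.0], [1.0]])     # gating directions
v0 = np.array([25.0, 0.0, -35.0])        # gating biases

G = softmax(x @ V.T + v0)                # eta_m(x|V), N x 3
f = np.sum(G * (x @ W.T + b), axis=1)    # f(x) = sum_m eta_m(x) (w_m x + b_m)
```

Each local model dominates in the region where its gate output is near one, and the softmax produces the smooth transitions between segments seen in Fig. 7.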

3.2.2. Combining multiple kernels on benchmark data sets

We compare SVR and LMKL in terms of regression performance (i.e., mean square error), model complexity (i.e., stored support vector percentage), and training time. We train SVRs with different kernels, namely the linear kernel and polynomial kernels up to the fifth degree. LMKL combines these five kernels with both softmax and sigmoid gating models.

Fig. 5. Example uses of MKL and LMKL on the OLIVETTI data set. (a) $\Phi_m(x^m)$: features fed into the kernels; (b) $\eta_m$: combination weights; (c) $\eta_m \Phi_m(x^m)$: features weighted with the combination weights; (d) $x^G$: features fed into the softmax gating model; (e) $\eta_m(x|V)$: softmax gating model outputs; (f) $\eta_m(x|V)\Phi_m(x^m)$: features weighted with the softmax gating model outputs; (g) $x^G$: features fed into the sigmoid gating model; (h) $\eta_m(x|V)$: sigmoid gating model outputs; (i) $\eta_m(x|V)\Phi_m(x^m)$: features weighted with the sigmoid gating model outputs.

Fig. 6. Average kernel weights on the OLIVETTI data set. (a) MKL, (b) LMKL with softmax gating on 16×16 resolution, and (c) LMKL with sigmoid gating on 8×8 resolution.

Fig. 7. Global and local fits (solid lines) obtained by LMKL with three linear kernels and softmax gating on the MOTORCYCLE data set. The dashed lines show the gating model outputs, which are multiplied by 5 for visual clarity.

We perform experiments on the Concrete Compressive Strength (CONCRETE) data set and the Wine Quality (WHITEWINE) data set from the UCI Machine Learning Repository. $\epsilon$ is selected from {1, 2, 4, 8, 16} for the CONCRETE data set and from {0.08, 0.16, 0.32, 0.64, 1.28} for the WHITEWINE data set.

Table 8 lists the regression results on the CONCRETE data set obtained by SVR and LMKL. We see that both LMKL with softmax gating and LMKL with sigmoid gating are significantly more accurate than all of the single kernel SVRs. LMKL with softmax gating uses $k_L$, $k_P$ ($q=4$), and $k_P$ ($q=5$) with relatively higher weights, whereas LMKL with sigmoid gating uses all of the kernels with significant weights (see Table 9). When we combine multiple copies of the linear kernel using the softmax gating model (shown in Fig. 8), we see that LMKL does not overfit, and we get significantly lower error than the best single kernel SVR ($k_P$ with $q=3$). For example, LMKL with five copies of $k_L$ and softmax gating obtains significantly lower error than SVR ($k_P$, $q=3$) and stores significantly fewer support vectors. Similarly to the binary classification results, the training time of LMKL increases linearly with the number of kernels.

Table 10 lists the regression results on the WHITEWINE data set obtained by SVR and LMKL. We see that both LMKL with softmax gating and LMKL with sigmoid gating obtain significantly lower error than SVR ($k_L$), SVR ($k_P$, $q=2$), and SVR ($k_P$, $q=3$), and have error comparable to SVR ($k_P$, $q=4$) and SVR ($k_P$, $q=5$), while storing significantly fewer support vectors than all single kernel SVRs. Even when we do not decrease the error, we learn computationally simpler models by storing many fewer support vectors. We see from Table 11 that LMKL with softmax gating assigns relatively higher weights to $k_L$, $k_P$ ($q=3$), and $k_P$ ($q=5$), whereas LMKL with sigmoid gating uses the polynomial kernels nearly everywhere in the input space and the linear kernel for some of the test instances.

4. Discussion

We discuss the key properties of the proposed method and compare it with similar MKL methods in the literature.

4.1. Computational complexity

When training LMKL, at each iteration we need to solve a canonical kernel machine problem with the kernel combined using the current gating model parameters, and to calculate the gradients of $J(V)$. The gradient calculations use the support vectors of the current iteration. The gradient calculation step has lower time complexity than the kernel machine solver when the gating model representation is low-dimensional. If we have a high-dimensional gating model representation, we can apply an unsupervised dimensionality reduction method (e.g., principal component analysis) to this representation in order to decrease the training time. The computational complexity of LMKL also depends on the complexity of the canonical kernel machine solver used in the main loop, which can be reduced using a hot-start procedure (i.e., starting from the previous solution). The number of iterations before convergence clearly depends on the training data and on the step size selection procedure. The key issue for faster convergence is to select good gradient-descent step sizes at each iteration. The step size of each iteration should be determined with a line search method (e.g., Armijo's rule, whose search procedure allows backtracking and does not use any curve fitting method), which requires solving additional kernel machine problems. Clearly, the time complexity of each iteration increases, but the algorithm converges in fewer iterations. In practice, we see convergence in 5 iterations.

One main advantage of LMKL is reducing the time complexity of the testing phase as a result of localization: when calculating the locally combined kernel function $k_\eta(x_i, x)$ in (9), $k_m(x_i^m, x^m)$ needs to be evaluated only if both $\eta_m(x_i)$ and $\eta_m(x)$ are active (i.e., nonzero).

4.2. Knowledge extraction

The kernel weights obtained by MKL can be used to extract knowledge about the relative contributions of the kernel functions used in the combination. Different kernels define different similarity measures, and we can deduce which similarity measures are appropriate for the task at hand. If the kernel functions are evaluated over different feature subsets or feature representations, the important ones have higher combination weights. With our LMKL framework, we can extract similar information for different regions of the input space. This enables us to extract information about kernels (similarity measures), feature subsets, and/or feature representations in a data-dependent manner.

Table 8
Regression results on the CONCRETE data set. Rows: SVR ($k_L$), SVR ($k_P$, $q$ = 2, 3, 4, 5), LMKL (softmax), LMKL (sigmoid), LMKL (5 × $k_L$ and softmax); columns: MSE, support vector percentage, and training time (s). (Numeric entries lost in this copy.)

Table 9
Average kernel weights and numbers of active kernels on the CONCRETE data set for LMKL (softmax) and LMKL (sigmoid) over $k_L$ and $k_P$ ($q$ = 2, 3, 4, 5). (Numeric weight entries lost in this copy.) The average numbers of active kernels are 4.5 and 4.68, respectively.

Table 10
Regression results on the WHITEWINE data set. Rows: SVR ($k_L$), SVR ($k_P$, $q$ = 2, 3, 4, 5), LMKL (softmax), LMKL (sigmoid); columns: MSE, support vector percentage, and training time (s). (Numeric entries lost in this copy.)

Table 11
Average kernel weights and numbers of active kernels on the WHITEWINE data set for LMKL (softmax) and LMKL (sigmoid) over $k_L$ and $k_P$ ($q$ = 2, 3, 4, 5). (Numeric weight entries lost in this copy.) The average numbers of active kernels are ….5 and 4.58, respectively.

Fig. 8. The average test mean square errors, support vector percentages, and training times on the CONCRETE data set obtained by LMKL with multiple copies of linear kernels and softmax gating.
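The testing-phase saving described in Section 4.1 can be sketched as follows; all names are ours, and kernel_fns[m] is assumed to return the vector of kernel values between the stored support vectors and the test point:

```python
import numpy as np

def predict_lmkl(x_new_feats, g_new, sv_feats, g_sv, alpha_y, b, kernel_fns):
    # Implements f(x) = sum_i alpha_i y_i k_eta(x_i, x) + b, but k_m is
    # evaluated only when both gating outputs are nonzero (Section 4.1).
    f = b
    for m, k_m in enumerate(kernel_fns):
        if g_new[m] == 0.0:             # gate closed for the test point
            continue
        active = g_sv[:, m] != 0.0      # support vectors with open gate
        if not np.any(active):
            continue
        k = k_m(sv_feats[m][active], x_new_feats[m])
        f += np.sum(alpha_y[active] * g_sv[active, m] * g_new[m] * k)
    return f
```

With softmax gating, where typically one or two gates are open per instance, most kernel evaluations are skipped entirely.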

4.3. Regularization

Canonical kernel machines learn sparse models as a result of regularization on the weight vector, but the underlying complexity of the kernel function is the main factor determining model complexity. The main advantage of LMKL over canonical kernel machines, in terms of regularization, is the inherent regularization effect on the gating model: when we regularize the sum of the hyperplane weight vectors in (6), because these weight vectors are written in terms of the gating model as in (7), we also regularize the gating model as a side effect. MKL can only combine different kernel functions, and more complex kernels are favored over simpler ones in order to get better performance. LMKL, however, can also combine multiple copies of the same kernel, and it can dynamically construct a more complex locally combined kernel from these kernels in a data-dependent way. LMKL eliminates some of the kernels by assigning zero weights to the corresponding gating outputs in order to obtain a more regularized solution. Figs. 4 and 8 give empirical support for this regularization effect, where we see that LMKL does not overfit even as we increase the number of kernels.

4.4. Dimensionality reduction

The localized kernel idea can also be combined with dimensionality reduction. If the training instances have a local structure (i.e., lie on low-dimensional manifolds locally), we can learn low-dimensional local projections in each region, which we can also use for visualization. Previously, it had been proposed to integrate a projection matrix into the discriminant function [6]; we extended this idea to project data instances into different feature spaces using local projection matrices combined with a gating model, and to calculate the combined kernel function with the dot product in the combined feature space [7]. The local projection matrices can be learned together with the other parameters, as before, using a two-step alternating optimization algorithm.

4.5. Related work

LMKL finds a nonlinear combination of kernel functions with the help of the gating model. The idea of learning a nonlinear combination has also been discussed in other studies. For example, a latent variable generative model using maximum entropy discrimination to learn data-dependent kernel combination weights is proposed in [3]; this method combines a generative probabilistic model with a discriminative large margin method, using a log-ratio of Gaussian mixtures as the classifier. In more recent work, a nonlinear kernel combination method based on kernel ridge regression and a polynomial combination of kernels is proposed [8]:

$k_\eta(x_i, x_j) = \sum_{q \in Q} \eta_1^{q_1} \eta_2^{q_2} \cdots \eta_P^{q_P}\; k_1(x_i^1, x_j^1)^{q_1}\, k_2(x_i^2, x_j^2)^{q_2} \cdots k_P(x_i^P, x_j^P)^{q_P}$

where $Q = \{q : q \in \mathbb{Z}_+^P,\; \sum_{m=1}^P q_m = d\}$, and the kernel weights are optimized over a positive, bounded, and convex set using a projection-based gradient-descent algorithm.

Similarly to LMKL, a Bayesian approach has been developed for combining different feature representations in a data-dependent way under the Gaussian process framework [7]. A common covariance function is obtained by combining the covariances of the feature representations in a nonlinear manner. This formulation can identify the noisy data instances for each feature representation and prevent them from being used; classification is performed using the standard Gaussian process approach with the common covariance function.
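As an illustration of the polynomial combination of [8] quoted above, the following sketch enumerates the exponent set $Q$ directly (the weights eta would be learned; here they are inputs, and the enumeration is unoptimized):

```python
import numpy as np
from itertools import product

def polynomial_combination(kernels, eta, d=2):
    # k_eta = sum over q with q_1 + ... + q_P = d of
    #         prod_m eta_m^{q_m} * K_m^{q_m} (elementwise powers/products).
    P = len(kernels)
    N = kernels[0].shape[0]
    K = np.zeros((N, N))
    for q in product(range(d + 1), repeat=P):
        if sum(q) != d:
            continue
        term = np.ones((N, N))
        for m in range(P):
            term *= (eta[m] * kernels[m]) ** q[m]
        K += term
    return K
```

For the typical degree $d = 2$, this reduces to all pairwise elementwise products of the weighted kernels.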
Inspired by LMKL, two methods that learn a data-dependent kernel function have been used for image recognition applications [34,35]; they differ in their gating models, which are constants rather than functions of the input. In [34], the training set is divided into clusters as a preprocessing step, and then cluster-specific kernel weights are learned using an alternating optimization method. The combined kernel function can be written as

$k_\eta(x_i, x_j) = \sum_{m=1}^P \eta_m^{c_i}\, k_m(x_i^m, x_j^m)\, \eta_m^{c_j}$

where $\eta_m^{c_i}$ corresponds to the weight of kernel $k_m(\cdot,\cdot)$ in the cluster that $x_i$ belongs to. In the testing phase, the kernel weights of the cluster to which a test instance is assigned are used. In [35], instance-specific kernel weights are used instead of cluster-specific weights. The corresponding combined kernel function is

$k_\eta(x_i, x_j) = \sum_{m=1}^P \eta_m^i\, k_m(x_i^m, x_j^m)\, \eta_m^j$

where $\eta_m^i$ corresponds to the weight of kernel $k_m(\cdot,\cdot)$ for $x_i$, and the instance-specific weights are optimized over the training set with an alternating optimization procedure. In the testing phase, however, the kernel weights for a test instance are all taken to be equal.

5. Conclusions

This work introduces a localized multiple kernel learning framework for kernel-based algorithms. The proposed algorithm has two main ingredients: (i) a gating model that assigns weights to kernels for each data instance, and (ii) a kernel-based learning algorithm with the locally combined kernel. The training of these two components is coupled, and the parameters of both components are optimized together using a two-step alternating optimization procedure. We derive the learning algorithm for three different gating models (softmax, sigmoid, and Gaussian) and apply the localized multiple kernel learning framework to four different machine learning problems (two-class classification, regression, multiclass classification, and one-class classification).

We perform experiments on several two-class classification and regression problems. We compare the empirical performance of LMKL with single kernel SVM and SVR as well as with MKL. For classification problems defined on different feature representations, LMKL is able to construct better classifiers than MKL by combining the kernels on these representations locally. In our experiments, LMKL achieves higher average test accuracies and stores fewer support vectors compared with MKL. If the combined feature representations are complementary and do not contain redundant information, the sigmoid gating model should be selected instead of softmax gating, in order to have the possibility of using more than one representation. We also see that, as expected, combining heterogeneous feature representations is more advantageous than combining multiple copies of the same representation. For image recognition problems, LMKL identifies the relevant parts of each input image separately, using the gating model as a saliency detector over the kernels on the image patches, and LMKL obtains better classification results than MKL.


More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements Module 3: Element Propertes Lecture : Lagrange and Serendpty Elements 5 In last lecture note, the nterpolaton functons are derved on the bass of assumed polynomal from Pascal s trangle for the fled varable.

More information

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines A Modfed Medan Flter for the Removal of Impulse Nose Based on the Support Vector Machnes H. GOMEZ-MORENO, S. MALDONADO-BASCON, F. LOPEZ-FERRERAS, M. UTRILLA- MANSO AND P. GIL-JIMENEZ Departamento de Teoría

More information

Human Face Recognition Using Generalized. Kernel Fisher Discriminant

Human Face Recognition Using Generalized. Kernel Fisher Discriminant Human Face Recognton Usng Generalzed Kernel Fsher Dscrmnant ng-yu Sun,2 De-Shuang Huang Ln Guo. Insttute of Intellgent Machnes, Chnese Academy of Scences, P.O.ox 30, Hefe, Anhu, Chna. 2. Department of

More information

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

5 The Primal-Dual Method

5 The Primal-Dual Method 5 The Prmal-Dual Method Orgnally desgned as a method for solvng lnear programs, where t reduces weghted optmzaton problems to smpler combnatoral ones, the prmal-dual method (PDM) has receved much attenton

More information

LECTURE : MANIFOLD LEARNING

LECTURE : MANIFOLD LEARNING LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

Lecture 4: Principal components

Lecture 4: Principal components /3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 48 CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 3.1 INTRODUCTION The raw mcroarray data s bascally an mage wth dfferent colors ndcatng hybrdzaton (Xue

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016)

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016) Technsche Unverstät München WSe 6/7 Insttut für Informatk Prof. Dr. Thomas Huckle Dpl.-Math. Benjamn Uekermann Parallel Numercs Exercse : Prevous Exam Questons Precondtonng & Iteratve Solvers (From 6)

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd

More information

Face Recognition Based on SVM and 2DPCA

Face Recognition Based on SVM and 2DPCA Vol. 4, o. 3, September, 2011 Face Recognton Based on SVM and 2DPCA Tha Hoang Le, Len Bu Faculty of Informaton Technology, HCMC Unversty of Scence Faculty of Informaton Scences and Engneerng, Unversty

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

A Robust LS-SVM Regression

A Robust LS-SVM Regression PROCEEDIGS OF WORLD ACADEMY OF SCIECE, EGIEERIG AD ECHOLOGY VOLUME 7 AUGUS 5 ISS 37- A Robust LS-SVM Regresson József Valyon, and Gábor Horváth Abstract In comparson to the orgnal SVM, whch nvolves a quadratc

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

LECTURE NOTES Duality Theory, Sensitivity Analysis, and Parametric Programming

LECTURE NOTES Duality Theory, Sensitivity Analysis, and Parametric Programming CEE 60 Davd Rosenberg p. LECTURE NOTES Dualty Theory, Senstvty Analyss, and Parametrc Programmng Learnng Objectves. Revew the prmal LP model formulaton 2. Formulate the Dual Problem of an LP problem (TUES)

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Using Neural Networks and Support Vector Machines in Data Mining

Using Neural Networks and Support Vector Machines in Data Mining Usng eural etworks and Support Vector Machnes n Data Mnng RICHARD A. WASIOWSKI Computer Scence Department Calforna State Unversty Domnguez Hlls Carson, CA 90747 USA Abstract: - Multvarate data analyss

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Histogram of Template for Pedestrian Detection

Histogram of Template for Pedestrian Detection PAPER IEICE TRANS. FUNDAMENTALS/COMMUN./ELECTRON./INF. & SYST., VOL. E85-A/B/C/D, No. xx JANUARY 20xx Hstogram of Template for Pedestran Detecton Shaopeng Tang, Non Member, Satosh Goto Fellow Summary In

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

Collaboratively Regularized Nearest Points for Set Based Recognition

Collaboratively Regularized Nearest Points for Set Based Recognition Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,

More information

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010 Smulaton: Solvng Dynamc Models ABE 5646 Week Chapter 2, Sprng 200 Week Descrpton Readng Materal Mar 5- Mar 9 Evaluatng [Crop] Models Comparng a model wth data - Graphcal, errors - Measures of agreement

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Machine Learning. K-means Algorithm

Machine Learning. K-means Algorithm Macne Learnng CS 6375 --- Sprng 2015 Gaussan Mture Model GMM pectaton Mamzaton M Acknowledgement: some sldes adopted from Crstoper Bsop Vncent Ng. 1 K-means Algortm Specal case of M Goal: represent a data

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

General Vector Machine. Hong Zhao Department of Physics, Xiamen University

General Vector Machine. Hong Zhao Department of Physics, Xiamen University General Vector Machne Hong Zhao (zhaoh@xmu.edu.cn) Department of Physcs, Xamen Unversty The support vector machne (SVM) s an mportant class of learnng machnes for functon approach, pattern recognton, and

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University CAN COMPUTERS LEARN FASTER? Seyda Ertekn Computer Scence & Engneerng The Pennsylvana State Unversty sertekn@cse.psu.edu ABSTRACT Ever snce computers were nvented, manknd wondered whether they mght be made

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

A Multivariate Analysis of Static Code Attributes for Defect Prediction

A Multivariate Analysis of Static Code Attributes for Defect Prediction Research Paper) A Multvarate Analyss of Statc Code Attrbutes for Defect Predcton Burak Turhan, Ayşe Bener Department of Computer Engneerng, Bogazc Unversty 3434, Bebek, Istanbul, Turkey {turhanb, bener}@boun.edu.tr

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Margin-Constrained Multiple Kernel Learning Based Multi-Modal Fusion for Affect Recognition

Margin-Constrained Multiple Kernel Learning Based Multi-Modal Fusion for Affect Recognition Margn-Constraned Multple Kernel Learnng Based Mult-Modal Fuson for Affect Recognton Shzh Chen and Yngl Tan Electrcal Engneerng epartment The Cty College of New Yor New Yor, NY USA {schen, ytan}@ccny.cuny.edu

More information

SENSITIVITY ANALYSIS IN LINEAR PROGRAMMING USING A CALCULATOR

SENSITIVITY ANALYSIS IN LINEAR PROGRAMMING USING A CALCULATOR SENSITIVITY ANALYSIS IN LINEAR PROGRAMMING USING A CALCULATOR Judth Aronow Rchard Jarvnen Independent Consultant Dept of Math/Stat 559 Frost Wnona State Unversty Beaumont, TX 7776 Wnona, MN 55987 aronowju@hal.lamar.edu

More information

High-Boost Mesh Filtering for 3-D Shape Enhancement

High-Boost Mesh Filtering for 3-D Shape Enhancement Hgh-Boost Mesh Flterng for 3-D Shape Enhancement Hrokazu Yagou Λ Alexander Belyaev y Damng We z Λ y z ; ; Shape Modelng Laboratory, Unversty of Azu, Azu-Wakamatsu 965-8580 Japan y Computer Graphcs Group,

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT 3. - 5. 5., Brno, Czech Republc, EU APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT Abstract Josef TOŠENOVSKÝ ) Lenka MONSPORTOVÁ ) Flp TOŠENOVSKÝ

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

Efficient Distributed Linear Classification Algorithms via the Alternating Direction Method of Multipliers

Efficient Distributed Linear Classification Algorithms via the Alternating Direction Method of Multipliers Effcent Dstrbuted Lnear Classfcaton Algorthms va the Alternatng Drecton Method of Multplers Caoxe Zhang Honglak Lee Kang G. Shn Department of EECS Unversty of Mchgan Ann Arbor, MI 48109, USA caoxezh@umch.edu

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information