A Hill-climbing Landmarker Generation Algorithm Based on Efficiency and Correlativity Criteria


Daren Ler, Irena Koprinska, and Sanjay Chawla
School of Information Technologies, University of Sydney
Madsen Building F09, University of Sydney, NSW 2006, Australia
{ler, irena,

Abstract

For a given classification task, there are typically several learning algorithms available. The question then arises: which is the most appropriate algorithm to apply? Recently, we proposed a new algorithm for making such a selection based on landmarking, a meta-learning strategy that utilises meta-features that are measurements based on efficient learning algorithms. This algorithm, which creates a set of landmarkers that each utilise subsets of the algorithms being landmarked, was shown to be able to estimate accuracy well, even when employing a small fraction of the given algorithms. However, that version of the algorithm has exponential computational complexity for training. In this paper, we propose a hill-climbing version of the landmarker generation algorithm, which requires only polynomial training time complexity. Our experiments show that the landmarkers formed have similar results to those of the more complex version of the algorithm.

1. Introduction

The choice of machine learning algorithm can be a vital factor in determining the success or failure of a learning solution. Theoretical (Wolpert, 2001) as well as empirical results (Michie et al., 1994) suggest that no one algorithm is generically superior; this means that we cannot always rely on one particular algorithm to solve all our learning problems. Accordingly, selection is typically done via some form of holdout testing. However, with the growing plethora of machine learning algorithms, and because several different representations are usually reasonable for a given problem, such evaluation is often computationally unviable. As an alternative, some meta-learning (Giraud-Carrier et al., 2004; Vilalta and Drissi, 2002) techniques utilise past experience or meta-knowledge in order to learn when to use which learning algorithm.
As with standard machine learning, the success of meta-learning is greatly dependent upon the quality of the features chosen. And while various strategies for defining such meta-features have been proposed (e.g. Kalousis and Hilario, 2001; Brazdil et al., 1994; Michie et al., 1994), there has been no consensus on what constitutes good features. Landmarking (Fürnkranz and Petrak, 2001; Pfahringer et al., 2000) is a new and promising approach that characterises datasets by directly measuring the performance of simple and fast learning algorithms called landmarkers. However, the selection of landmarkers is typically done in an ad hoc fashion, with the landmarkers generated focused on characterising an arbitrary set of algorithms.

Copyright 2005, American Association for Artificial Intelligence. All rights reserved.

In previous work (Ler et al., 2004b, 2004c), we reinterpreted the role of landmarkers, defining each as a set of learning algorithms that characterises the pattern of performance of one specific learning algorithm. As criteria, we specified that these landmarkers should each be both efficient and correlated (as compared with the landmarked algorithm). However, the landmarker generation algorithm proposed (based on the all-subsets regression technique (Miller, 2002)) has exponential training time computational complexity, making it unwieldy. In this paper, we propose an alternative hill-climbing landmarker generation algorithm based on the forward selection method for variable selection in regression (Miller, 2002). We show that this version of the landmarker generation algorithm performs almost as well as the all-subsets version, while requiring polynomial (as opposed to exponential) training time complexity.

2. Landmarking and Landmarkers

In the literature (Fürnkranz and Petrak, 2001; Pfahringer et al., 2000), a landmarker is typically associated with a single algorithm with low computational complexity. The general idea is that the performance of a learning algorithm on a dataset reveals some characteristics of that dataset.
However, in previous landmarking work (Pfahringer et al., 2000), despite the presence of two landmarker criteria (i.e. efficiency and bias diversity), no actual mechanism for generating appropriate landmarkers was defined, and the choice of landmarkers was made in an ad hoc fashion. In (Ler et al., 2004b, 2004c), landmarking and landmarkers were redefined. Let: (i) a landmarking element be an algorithm whose performance is utilised as a dataset characteristic, and (ii) a landmarker be a function over the performance of a set of landmarking algorithms, which resembles the performance (e.g. accuracy) of one specific learning algorithm; landmarking is thus the process of generating a set of algorithm estimators, i.e. landmarkers. Consequently, we also suggest new landmarker criteria. Previously (Pfahringer et al., 2000), it was suggested that each landmarking element be (i) efficient, and (ii) bias diverse with respect to the other chosen landmarkers. However, as landmarker (and not landmarking element) criteria, there is no guarantee that the landmarkers selected via these criteria will be able to characterise the performance of one, and even less so a set of, candidate algorithms (Ler et al., 2004b). Instead, in (Ler et al., 2004b, 2004c), we proposed that each landmarker be: (i) correlated, i.e. each landmarker should as closely as possible have its output resemble the performance of the candidate algorithm being landmarked, and (ii) efficient.

2.1 Using Criteria to Select Landmarkers

Intuitively, two algorithms with similar performance will be closer to each other in a space of algorithm performance. Conceptually, the distance between two algorithms a and l can be regarded as ||a - l||. Also, we may express ||a - l||^2 as ||a||^2 + ||l||^2 - 2(a . l). Thus, if a is close to l, this implies that ||a - l||^2 is small, and thus that a . l is relatively large. This pertains to the correlativity criterion we use to check if a landmarker (e.g. l) is representative of the performance of some candidate algorithm (e.g. a). Accordingly, correlation may be measured based on r (Wickens, 1995):

    r = cos(a, l) = (a . l) / (||a|| ||l||) = cov(a, l) / (sigma_a sigma_l)

Several other heuristics similarly apply, including r^2 (i.e. the squared Pearson correlation coefficient), Mallow's C_p criterion, and the Akaike and Bayesian Information Criteria (i.e. AIC and BIC respectively) (Miller, 2002). In terms of efficiency, simply utilising the (computational) time taken to run a given landmarker would suffice. However, aggregating the two into a single heuristic is more difficult; several methods to do this have been proposed in (Ler et al., 2004a).
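As a quick sketch of the correlativity measure above, r can be computed as the cosine of the angle between the mean-centred performance vectors (the function name is ours, not the paper's):

```python
import math

def pearson_r(a, l):
    """Correlation between two algorithms' performance vectors a and l,
    computed as the cosine of the angle between the mean-centred vectors."""
    n = len(a)
    a_c = [x - sum(a) / n for x in a]
    l_c = [x - sum(l) / n for x in l]
    dot = sum(x * y for x, y in zip(a_c, l_c))
    norm = math.sqrt(sum(x * x for x in a_c)) * math.sqrt(sum(y * y for y in l_c))
    return dot / norm

# Perfectly linearly related performance vectors give r = 1 (up to float error).
print(pearson_r([0.70, 0.80, 0.90], [0.50, 0.60, 0.70]))
```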
2.2 Potential Landmarking Elements

Essentially, any learning algorithm could be utilised as a landmarking element. Therefore, the initial search space for landmarking elements is extremely large. One remedy is to only consider algorithms that are less complex versions of the candidate algorithms. Such modifications may be categorised into two groups:

Algorithm-specific reductions, which relate to the actual inner workings of one particular algorithm, and could include: limiting the structure formed (e.g. decision stumps for decision trees), and/or limiting the internal search mechanisms within the algorithm (e.g. using randomness (Dietterich, 1997)).

Algorithm-generic reductions, which relate to generic modifications that may be applied to any learning algorithm. As per [4], such modifications could be similar to the sub-sampling techniques used in the ensemble literature (Dietterich, 1997).

Another method is to utilise a subset of the given candidate algorithms itself (Ler et al., 2004b, 2004c). This is described in further detail in Section 3.

3. The Proposed Hill-climbing Landmarker Generation Algorithm

In previous work (Ler et al., 2004a, 2004b, 2004c), we reinterpreted the role of landmarkers and proposed to use a subset of the available algorithms as landmarking elements. More specifically, given a set of candidate algorithms A = {a_1, ..., a_n}, we seek to select a set of landmarkers L = {l_1, ..., l_n} (where each l_i is the landmarker for a_i and utilises the landmarking elements l_i.elements) such that: (i) for all l_i in L, l_i.elements is a subset of A, and (ii) the union of all l_i.elements (i.e. L.elements) is a (proper) subset of A. Thus, given the potential landmarking element sets (i.e. each subset of A), the all-subsets landmarker generation algorithm would generate L by selecting the landmarking element set (i.e. the subset of A) that achieved the highest heuristic score. This method required the computation of sum_{i=1}^{n-1} C(n, i).(n - i) linear regression functions, where n = |A|. Given that the computation of each regression function has complexity O(m), where m is the number of performance estimates (i.e. datasets), the complexity of the all-subsets landmarker generation algorithm is O(nm.2^n), which makes the method unwieldy for high n. (However, note that this complexity is associated with training time and not runtime execution, i.e. application to a new dataset.) As a more efficient alternative, we propose a hill-climbing version of the landmarker generation algorithm, which relates to the previous version of the algorithm in the same manner that standard forward selection (Miller, 2002) relates to all-subsets regression (Miller, 2002). The standard forward selection procedure begins by assigning no independents, i.e. using only the mean value of the dependent for estimation. It then iteratively attempts to add landmarking elements (i.e. independents) one by one. In each iteration, it first computes the regression functions that each utilise a set of independents consisting of the union of the current set of chosen independents and one of the remaining unused independents. Subsequently, the best model new_model is selected using some heuristic or criterion measure (e.g. r^2, C_p), and is compared with the previous model (i.e. the model including all the previously selected independents) old_model via an F-test. This F-test checks if the computed F = (SSE(old_model) - SSE(new_model)) / MSE(new_model) is greater than 4.0 (which is roughly the F value associated with confidence = 0.95, dfn = 1, and dfd (i.e. m - i - 1) > 10). If this inequality holds, then: (i) the new model is adopted, (ii) the newly applied independent is removed from the potential set of independents, and (iii) we proceed to the next iteration. Else, the procedure stops and the final adopted model is old_model. For more details on the forward selection procedure see (Miller, 2002). Essentially, the standard forward selection procedure restricts the amount of computation by excluding models that do not include the already chosen independents. More specifically, given a total number of independents n, in any iteration i, only n - (i - 1) regression functions need be constructed and evaluated, each pertaining to the union of the previously chosen independents and one of the remaining independents that had not been chosen in any previous iteration. Thus, instead of evaluating all possible subsets of A and computing sum_{i=1}^{n-1} C(n, i).(n - i) regression functions, the new hill-climbing landmarker generation algorithm at most requires the computation of sum_{i=1}^{n-1} (n - i + 1).(n - i) regression functions. This translates to a (training time) complexity of O(mn^3), reducing a previously exponential complexity to a polynomial one. However, this is a hill-climbing approach, and thus may not find the optimal model with respect to the chosen criterion (e.g. r^2).

3.1 Algorithm Description

The hill-climbing version of the landmarker generation algorithm is essentially a slightly more complicated version of the standard forward selection procedure described in the previous section; it is described in Figure 1. As per the previous all-subsets version of the algorithm, the general idea is to use a subset of the available algorithms as potential landmarking elements. Loosely, this idea assumes that: (i) several of the available algorithms will have similar performance patterns, such that one pattern is representative of several; and (ii) if a given algorithm has a very dissimilar performance pattern (as compared to the other algorithms), that performance pattern can be correlated to the conjunction of several others.
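The F-to-enter test above is easy to state in code. A minimal sketch (function names are ours, not the paper's), using the fixed 4.0 threshold described in the text:

```python
def f_to_enter(sse_old, sse_new, dfd_new):
    """F statistic for adding one independent: the drop in SSE, scaled by
    the new model's mean squared error (MSE = SSE / residual df)."""
    mse_new = sse_new / dfd_new
    return (sse_old - sse_new) / mse_new

def adopt_new_model(sse_old, sse_new, dfd_new, threshold=4.0):
    # 4.0 roughly matches the F critical value at 95% confidence
    # with dfn = 1 and dfd > 10, as used in the paper.
    return f_to_enter(sse_old, sse_new, dfd_new) > threshold

# Halving the SSE with plenty of residual df comfortably passes the test:
print(adopt_new_model(sse_old=10.0, sse_new=5.0, dfd_new=20))  # True (F = 20.0)
```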
To reiterate, given a set of candidate algorithms A = {a_1, ..., a_n}, we seek to select a set of landmarkers L = {l_1, ..., l_n} (where each l_i is the landmarker for a_i and utilises the landmarking elements l_i.elements) such that: (i) for all l_i in L, l_i.elements is a subset of A, and (ii) the union of all l_i.elements (i.e. union(l_i.elements)) is a (proper) subset of A. For example, given A = {a_1, a_2, a_3, a_4}, a possible set of selected landmarkers may correspond to L, such that L.elements = {a_1, a_2}, and l_1.elements = {a_1}, l_2.elements = {a_2}, l_3.elements = {a_1, a_2}, l_4.elements = {a_2}. As in the all-subsets version, we require that L.elements be a proper subset of A because our objective is to incur less computational cost than actually evaluating A; without this condition, there is a chance that L.elements = A, which would invalidate the efficiency criterion. Referring back to the example above, we see that to obtain the performance of each a_i in A on some new dataset s_new, we only have to evaluate a_1 and a_2 on s_new (i.e. we evaluate a_1 and a_2, and use those values to estimate the performance of a_3 and a_4). Thus, if L.elements = A, we would not have made any gains in efficiency, as all a_i in A would have to be evaluated. This means that in any iteration of the hill-climbing algorithm, we must ensure that the union of the current landmarking algorithms selected does not equal A. (Note that throughout the remainder of this paper, we use the term landmarking elements interchangeably with the term independents, and the term dependents for the remaining algorithms, i.e. A \ independents.)

Inputs:
- A = {a_1, ..., a_n}, a set of n candidate algorithms; and
- for each a_i in A, phi_a_i = {phi_a_i,1, ..., phi_a_i,m}, the performance estimates for algorithm a_i on a corresponding set of datasets S = {s_1, ..., s_m}.

Output:
- L = {l_1, ..., l_n}, a set of landmarkers, where each l_i is the chosen landmarker for a_i.

Let:
- get_regression_fn(dependent, independents) return the coefficients of the linear regression function corresponding to the input dependent and independents.
- find_heuristic(coefficients) return the heuristic (e.g. r^2) for the regression function corresponding to coefficients.
- SSE(coefficients) return the sum of squared errors associated with the regression function corresponding to coefficients.
- MSE(coefficients) return the mean squared error associated with the regression function corresponding to coefficients.
- for all a_i in A, phi_a_i = {phi_a_i,1, ..., phi_a_i,m} be globally accessible.
- max_independents = the maximum number of algorithms to evaluate.

The hill-climbing landmarker generation algorithm:

generate_landmarkers(A) returns L
1   L = nil
2   for a_i in A
3       L[a_i] = get_regression_fn(a_i, {})
4   chosen = {}
5   valid = true
6   while valid
7       fns = nil
8       sum_heuristic = nil
9       for a_i in A \ chosen
10          for a_j in A \ (chosen union {a_i})
11              fns[a_i][a_j] = get_regression_fn(a_i, chosen union {a_j})
12              sum_heuristic[a_j] += find_heuristic(fns[a_i][a_j])
13      best_independent = argmax over a_x in A \ chosen of sum_heuristic[a_x]
14      chosen = chosen union {best_independent}
15      valid_update = false
16      for a_i in A
17          if a_i in chosen
18              L[a_i] = a_i
19          else if [SSE(L[a_i]) - SSE(fns[a_i][best_independent])] / MSE(fns[a_i][best_independent]) > 4.0
20              L[a_i] = fns[a_i][best_independent]
21              valid_update = true
22      if !valid_update or |chosen| == max_independents
23          valid = false

Figure 1. Pseudocode for the proposed hill-climbing landmarker generation algorithm.
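As a sketch only (not the authors' implementation), Figure 1 might be rendered in Python roughly as follows. The regression is plain least squares via the normal equations, the heuristic is r^2, and all names (perf, ols, fit, and the synthetic accuracy values) are our own illustrative assumptions:

```python
def ols(X, y):
    # Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan with
    # partial pivoting; X already includes an intercept column of 1s.
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * p for x, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def fit(perf, dep, elements):
    """Regress dependent algorithm dep on the chosen landmarking elements;
    an empty element set gives the intercept-only (mean) model."""
    m = len(perf[dep])
    X = [[1.0] + [perf[e][r] for e in elements] for r in range(m)]
    y = perf[dep]
    beta = ols(X, y)
    pred = [sum(b * x for b, x in zip(beta, row)) for row in X]
    sse = sum((p - t) ** 2 for p, t in zip(pred, y))
    mean_y = sum(y) / m
    sst = sum((t - mean_y) ** 2 for t in y)
    r2 = 1.0 - sse / sst if sst else 1.0
    return {"beta": beta, "sse": sse, "r2": r2, "elements": list(elements)}

def generate_landmarkers(perf, max_independents):
    A = sorted(perf)
    m = len(perf[A[0]])
    L = {a: fit(perf, a, []) for a in A}   # start from intercept-only models
    chosen = []
    while True:
        remaining = [a for a in A if a not in chosen]
        # score each candidate independent by its summed r^2 over dependents
        fns, score = {}, {}
        for aj in remaining:
            fns[aj] = {ai: fit(perf, ai, chosen + [aj])
                       for ai in remaining if ai != aj}
            score[aj] = sum(f["r2"] for f in fns[aj].values())
        if not score:
            break
        best = max(score, key=score.get)
        chosen.append(best)
        valid_update = False
        for ai in A:
            if ai in chosen:
                L[ai] = ai             # evaluated directly, no estimation needed
            else:
                new = fns[best][ai]
                dfd = m - len(new["elements"]) - 1
                mse = new["sse"] / dfd if dfd > 0 else float("inf")
                old_sse = L[ai]["sse"] if isinstance(L[ai], dict) else 0.0
                if mse > 0 and (old_sse - new["sse"]) / mse > 4.0:   # F-test
                    L[ai] = new
                    valid_update = True
        if not valid_update or len(chosen) >= max_independents:
            break
    return L, chosen

# Synthetic accuracies on 8 datasets: a3 and a4 track a1; a2 does not.
perf = {
    "a1": [0.600, 0.620, 0.650, 0.700, 0.720, 0.750, 0.800, 0.850],
    "a2": [0.550, 0.700, 0.520, 0.680, 0.500, 0.660, 0.480, 0.640],
    "a3": [0.543, 0.560, 0.582, 0.633, 0.646, 0.677, 0.722, 0.763],  # ~0.9 * a1
    "a4": [0.582, 0.594, 0.622, 0.659, 0.677, 0.702, 0.738, 0.781],  # ~0.8 * a1 + 0.1
}
L, chosen = generate_landmarkers(perf, max_independents=2)
print("evaluated directly:", chosen)
```

On this toy data the first independent chosen is one of the correlated trio {a1, a3, a4}, and the remaining algorithms are estimated from it via the fitted regressions.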

It also becomes evident that any chosen landmarking element a_i must be evaluated, and therefore need not be estimated. In fact, from our example, we see that l_1 and l_2 are really not landmarkers; they are merely evaluations of the algorithms a_1 and a_2. Thus, with each iteration, two points become clear. The first is that two subsets of A may be defined: (i) the algorithms to be estimated, I = {a_i | a_i in A \ current_L.elements} (i.e. the dependents, the algorithms that require landmarkers; and correspondingly, the algorithms yet to be chosen, the potential independents or landmarking elements), and (ii) the algorithms to be evaluated, J = {a_i | a_i in current_L.elements} (i.e. the chosen landmarking algorithms, represented by the variable chosen in the pseudocode in Figure 1). Based on L.elements from our example, these are: I = {a_3, a_4} and J = {a_1, a_2}. Secondly, we seek to move one of the algorithms from I to J. Note that essentially any number of algorithms (less than |A|) may be moved via a single iteration. However, doing this increases the complexity of the algorithm. In fact, when attempting to move |A| - 1 algorithms in a single iteration, we are actually executing the original all-subsets version of the algorithm. With either version of the landmarker generation algorithm, we begin with I = A and J = {}. To shift k < |A| algorithms from I to J, we essentially select the k algorithms that yield the greatest overall utility (i.e. measured based on the heuristic chosen, e.g. r^2 or r^2 + efficiency gained). As an independent, each algorithm attains a certain amount of correlation to each of its applicable dependents. Thus, we gauge the utility of each independent based on the mean correlation with its applicable dependents (note that the number of applicable dependents for any potential independent will always be the same; i.e. in iteration i, there will be |A| - |J| - 1 dependents for each independent).
For example, given A = {a_1, a_2, a_3, a_4}, and assuming we wish to choose a single algorithm (k = 1) based solely on r^2, then given the r^2 values for A in Table 1, we see that as an independent (i.e. landmarking algorithm), a_4 has the highest overall utility. The calculation for cases where k > 1 is slightly more complicated, and since the proposed hill-climbing version only requires the calculation for k = 1, we will not elaborate further; for a description of the selection method for k > 1, see (Ler et al., 2004c). Another issue that arises when adapting standard forward selection to our landmarker generation scenario is that the original stopping criterion (i.e. the F-test) must be modified. Essentially, the new stopping criteria require that: (i) the F-test be adapted to the case where a single independent is chosen for multiple dependents, and (ii) a limit be imposed on the total number of independents so that L.elements remains a proper subset of A (and so that a dynamic limit on the training time computational cost may be imposed). The hill-climbing algorithm seeks to mimic standard forward selection, which in each iteration selects the independent with the highest criterion measure (e.g. r^2). However, since the proposed algorithm makes this selection based on an overall utility, it is likely that, at least for some dependents, the independent with the highest criterion measure is not chosen. To (partially) account for this, we do not stop attempting to utilise more independents for any remaining dependent even when an F-test for that dependent fails. Instead, our stopping criterion (over all remaining dependents) applies whenever: (i) the number of selected independents |J| reaches a certain limit, i.e. |J| > the maximum number of algorithms we wish to evaluate; or (ii) all the F-tests (i.e. over all remaining dependents) fail for the currently selected independent.

Indep.   Mean r^2 over its dependents
a_1      1.5/3 = 0.5
a_2      1.8/3 = 0.6
a_3      1.8/3 = 0.6
a_4      2.1/3 = 0.7

Table 1. Example overall utility (mean r^2 over applicable dependents) of each algorithm in A = {a_1, a_2, a_3, a_4} for the one-algorithm (k = 1) case.
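The k = 1 selection step can be sketched in a few lines. The individual r^2 entries below are hypothetical values of our own, chosen only to be consistent with the row means in Table 1:

```python
def pick_independent(r2_table):
    """Return the candidate with the highest overall utility, i.e. the
    highest mean r^2 over its applicable dependents (the k = 1 case)."""
    def mean_r2(indep):
        vals = [v for dep, v in r2_table[indep].items() if dep != indep]
        return sum(vals) / len(vals)
    return max(r2_table, key=mean_r2)

# Hypothetical r^2 values; each row averages to the mean shown in Table 1.
table = {
    "a1": {"a2": 0.4, "a3": 0.5, "a4": 0.6},   # mean 0.5
    "a2": {"a1": 0.5, "a3": 0.6, "a4": 0.7},   # mean 0.6
    "a3": {"a1": 0.6, "a2": 0.5, "a4": 0.7},   # mean 0.6
    "a4": {"a1": 0.7, "a2": 0.6, "a3": 0.8},   # mean 0.7
}
print(pick_independent(table))  # a4
```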
3.2 Heuristics for Correlativity and Efficiency

To choose between the various potential landmarkers (i.e. the regression functions corresponding to the various independent set options in each iteration), we require a criterion measure or heuristic that measures the utility of each pairing of a remaining dependent (i.e. a_i in A \ J) with a potential independent set (i.e. J union {a_j}, where a_j in A \ (J union {a_i})). Given the specified correlativity and efficiency landmarker criteria, two heuristics are evaluated. The first is the squared Pearson product moment coefficient, or coefficient of determination, r^2. By adopting this basic measure of correlation, we rely on the stopping criterion that L.elements be a proper subset of A to ensure the efficiency criterion is met. This means that in each iteration, the choice of independent is purely guided by its correlativity utility, i.e. the (computational) cost of the potential independents is ignored. The second is a semi cost-sensitive variant of the plain r^2 criterion: r^2(a_i, l_k) + ((eff(A) - eff(l_k)) / eff(A)), where a_i is the dependent, l_k is the potential landmarker, and A is the set of all candidate algorithms; thus r^2(a_i, l_k) is the r^2 value observed over the a_i and l_k pairing, and eff(.) is the estimated overall computational cost (i.e. the time taken for training and testing) of its subject. Essentially, (eff(A) - eff(l_k)) / eff(A) corresponds to the amount of efficiency gained (or time/computation saved) when l_k.elements is evaluated instead of A. Note that we have proposed several approaches for estimating eff(.) (Ler et al., 2004a, 2004b). In this paper, we adopt the simpler and less space-intensive method specified in (Ler et al., 2004b), which simply takes the mean training and runtime computational cost over the training datasets for each relevant a_x in A.
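The semi cost-sensitive heuristic is a one-liner. A sketch with a made-up example (the function name and the timing figures are ours):

```python
def cost_sensitive_heuristic(r2, eff_A, eff_lk):
    """Semi cost-sensitive criterion: correlation plus the fraction of
    computation saved by running the landmarker instead of all of A."""
    return r2 + (eff_A - eff_lk) / eff_A

# e.g. r^2 = 0.8 for a landmarker costing 2s versus 10s for all of A:
print(cost_sensitive_heuristic(0.8, 10.0, 2.0))  # 0.8 + 0.8 = 1.6
```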

4. Experiments, Results and Analysis

For our experiments we utilise, as A, 10 classification learning algorithms from WEKA (Witten and Frank, 2000) (i.e. naïve Bayes, k-nearest neighbour (with k = 1 and 7), polynomial support vector machine, decision stump, J4.8 (a WEKA implementation of C4.5), random forest, decision table, Ripper, and ZeroR) and 34 classification datasets randomly chosen from the UCI repository (Blake and Merz, 1998). Leave-one-out cross-validation is used to test the proposed landmarker generation algorithm.

Table 2. Results for the all-subsets version (plain r^2): efficiency gained, r_s, and algorithm-pair ordering versus the number of algorithms evaluated, |L.elements|.

Table 3. Results for the all-subsets version (r^2 + efficiency gained): efficiency gained, r_s, and algorithm-pair ordering versus the number of algorithms evaluated, |L.elements|.

For each fold we use 33 of the datasets to generate landmarkers as described in Section 3. The resultant set of landmarkers indicates which algorithms must be evaluated and which estimated (i.e. the algorithms in L.elements, and the remaining A \ L.elements, respectively). On the dataset left out, we first run the algorithms that must be evaluated and then use their accuracy results (attained via stratified ten-fold cross-validation) to estimate the performance of the other algorithms using the regression functions computed during landmarker generation. The version of the algorithm described in Section 3.1 will attempt to find a set of landmarkers L such that |L.elements| <= max_independents (the user-specified parameter) and where the chosen landmarkers only include algorithms (i.e. independents) that have not failed all F-tests. Thus it is possible that the hill-climbing landmarker generation algorithm returns an L such that |L.elements| < max_independents. Correspondingly, some modifications to the proposed algorithm were made in order to make a direct comparison with the results from the all-subsets version, which (in those experiments; see (Ler et al., 2004b)) produces a set of landmarkers for each possible size of L.elements below |A|. In order to ensure that a specific number of independents are utilised (i.e. a specific number of landmarking algorithms are run), the hill-climbing version will, when it encounters the situation where |L.elements| < max_independents and all F-tests for the current independent fail, continue to move algorithms from I to J without changing the chosen models (i.e. independents) for the remaining algorithms to be estimated in I. In each such iteration, the dependent algorithm (i.e. landmarked algorithm) whose model has the lowest utility is moved. Thus, for each dataset/fold, 9 sets of landmarkers and, correspondingly, 9 sets of accuracy estimates for A are attained (since |A| = 10). The following three evaluations are reported for each of these sets:

Efficiency gained: the mean percentage of computation saved (across the 34 datasets or folds) by running the corresponding L.elements instead of A.

Rank order correlation (r_s): the mean Spearman's rank order correlation coefficient r_s (across the 34 datasets or folds), measuring the rank correlation between: (i) the rank order of the accuracies estimated via the landmarkers and regression, and (ii) the rank order of the accuracies evaluated via ten-fold cross-validation.

Algorithm-pair ordering: the mean percentage of algorithm pairings in which the order predicted via the landmarkers is the same as that observed via ten-fold cross-validation. Note, with |A| = 10, there are C(10, 2) = 45 pairings. Given that x = |L.elements| algorithms are evaluated (via ten-fold cross-validation), C(x, 2) algorithm pairs will correspond to the ordering that is found via ten-fold cross-validation. We denote these as assured: the pairings that are guaranteed to be correct.

The results of the all-subsets version of the landmarker generation algorithm using the plain r^2 and r^2 + efficiency gained heuristics are given in Tables 2 and 3 respectively, while the results from the hill-climbing (forward selection) version are given in Tables 4 and 5. From Tables 2 through 5, we observe that the hill-climbing landmarker generation algorithm quite closely matches the predictive performance of the all-subsets version.
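The two rank-based evaluation measures can be sketched as follows (helper names are ours; Spearman's r_s uses the standard sum-of-squared-rank-differences formula, assuming no ties):

```python
from itertools import combinations

def ranks(xs):
    # rank 1 = highest accuracy; ties are not handled (fine for a sketch)
    order = sorted(range(len(xs)), key=lambda i: -xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rs(est, true):
    """Spearman rank-order correlation between estimated and observed accuracies."""
    n = len(est)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(est), ranks(true)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def pair_order_agreement(est, true):
    """Fraction of algorithm pairs whose predicted order matches the observed one."""
    pairs = list(combinations(range(len(est)), 2))
    same = sum(((est[i] - est[j]) * (true[i] - true[j])) > 0 for i, j in pairs)
    return same / len(pairs)

est, true = [0.9, 0.8, 0.7, 0.6], [0.85, 0.82, 0.65, 0.70]
print(spearman_rs(est, true))          # 0.8  (ranks differ on the last two)
print(pair_order_agreement(est, true)) # 5 of 6 pairs ordered correctly
```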
Correspondingly, we find that even when few landmarking algorithms are utilised, the generated landmarkers are still able to produce a reasonable estimate while making a sizable gain in efficiency. And as with the all-subsets result, when we allow the generation algorithm to utilise larger sets of candidate algorithms as landmarkers (i.e. as we allow larger |L.elements|), the r^2, r_s, and algorithm-pair ordering all increase, and the efficiency gained decreases, with all the values approaching the ten-fold cross-validation result.

Table 4. Results for the hill-climbing version (plain r^2): efficiency gained, r_s, and algorithm-pair ordering versus the number of algorithms evaluated, |L.elements|.

Table 5. Results for the hill-climbing version (r^2 + efficiency gained): efficiency gained, r_s, and algorithm-pair ordering versus the number of algorithms evaluated, |L.elements|.

Acknowledgements

The authors would like to acknowledge the support of the Smart Internet Technology CRC in this research.

References

Blake, C.; and Merz, C. 1998. UCI Repository of Machine Learning Databases. University of California, Irvine, Department of Information and Computer Sciences.

Brazdil, P.; Gama, J.; and Henery, R. 1994. Characterising the applicability of classification algorithms using meta level learning. In Proceedings of the European Conference on Machine Learning.

Dietterich, T. 1997. Machine learning research: four current directions. Artificial Intelligence Magazine, 18(4).

Fürnkranz, J.; and Petrak, J. 2001. An evaluation of landmarking variants. In Proceedings of the European Conference on Machine Learning, Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-learning.

Giraud-Carrier, C.; Vilalta, R.; and Brazdil, P. 2004. Introduction to the special issue on meta-learning. Machine Learning, 54(3).

Kalousis, A.; and Hilario, M. 2001. Model selection via meta-learning: a comparative study. International Journal on Artificial Intelligence Tools, 10(4).

Ler, D.; Koprinska, I.; and Chawla, S. 2004a. Comparisons between heuristics based on correlativity and efficiency for landmarker generation. In Proceedings of the 4th International Conference on Hybrid Intelligent Systems, in press.

Ler, D.; Koprinska, I.; and Chawla, S. 2004b. A landmarker selection algorithm based on correlation and efficiency criteria. In Proceedings of the 17th Australian Joint Conference on Artificial Intelligence.

Ler, D.; Koprinska, I.; and Chawla, S. 2004c. A new landmarker generation algorithm based on correlativity. In Proceedings of the International Conference on Machine Learning and Applications.

Michie, D.; Spiegelhalter, D.; and Taylor, C. 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood.
Miller, A. 2002. Subset Selection in Regression (2nd Edition). Chapman and Hall/CRC.

Pfahringer, B.; Bensusan, H.; and Giraud-Carrier, C. 2000. Meta-learning by landmarking various learning algorithms. In Proceedings of the International Conference on Machine Learning.

Vilalta, R.; and Drissi, Y. 2002. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2).

Wickens, T. 1995. The Geometry of Multivariate Statistics. LEA Publishers.

Witten, I.; and Frank, E. 2000. Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann.

Wolpert, D. 2001. The supervised learning no-free-lunch theorems. In Proceedings of Soft Computing in Industry: Recent Applications.


More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science EECS 730 Introducton to Bonformatcs Sequence Algnment Luke Huan Electrcal Engneerng and Computer Scence http://people.eecs.ku.edu/~huan/ HMM Π s a set of states Transton Probabltes a kl Pr( l 1 k Probablty

More information

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007 Syntheszer 1.0 A Varyng Coeffcent Meta Meta-Analytc nalytc Tool Employng Mcrosoft Excel 007.38.17.5 User s Gude Z. Krzan 009 Table of Contents 1. Introducton and Acknowledgments 3. Operatonal Functons

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

A Statistical Model Selection Strategy Applied to Neural Networks

A Statistical Model Selection Strategy Applied to Neural Networks A Statstcal Model Selecton Strategy Appled to Neural Networks Joaquín Pzarro Elsa Guerrero Pedro L. Galndo joaqun.pzarro@uca.es elsa.guerrero@uca.es pedro.galndo@uca.es Dpto Lenguajes y Sstemas Informátcos

More information

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics Ths module s part of the Memobust Handbook on Methodology of Modern Busness Statstcs 26 March 2014 Theme: Donor Imputaton Contents General secton... 3 1. Summary... 3 2. General descrpton... 3 2.1 Introducton

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Data Mining: Model Evaluation

Data Mining: Model Evaluation Data Mnng: Model Evaluaton Aprl 16, 2013 1 Issues: Evaluatng Classfcaton Methods Accurac classfer accurac: predctng class label predctor accurac: guessng value of predcted attrbutes Speed tme to construct

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

CS1100 Introduction to Programming

CS1100 Introduction to Programming Factoral (n) Recursve Program fact(n) = n*fact(n-) CS00 Introducton to Programmng Recurson and Sortng Madhu Mutyam Department of Computer Scence and Engneerng Indan Insttute of Technology Madras nt fact

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

Parallel Branch and Bound Algorithm - A comparison between serial, OpenMP and MPI implementations

Parallel Branch and Bound Algorithm - A comparison between serial, OpenMP and MPI implementations Journal of Physcs: Conference Seres Parallel Branch and Bound Algorthm - A comparson between seral, OpenMP and MPI mplementatons To cte ths artcle: Luco Barreto and Mchael Bauer 2010 J. Phys.: Conf. Ser.

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

An Efficient Genetic Algorithm with Fuzzy c-means Clustering for Traveling Salesman Problem

An Efficient Genetic Algorithm with Fuzzy c-means Clustering for Traveling Salesman Problem An Effcent Genetc Algorthm wth Fuzzy c-means Clusterng for Travelng Salesman Problem Jong-Won Yoon and Sung-Bae Cho Dept. of Computer Scence Yonse Unversty Seoul, Korea jwyoon@sclab.yonse.ac.r, sbcho@cs.yonse.ac.r

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

Dynamic Integration of Regression Models

Dynamic Integration of Regression Models Dynamc Integraton of Regresson Models Nall Rooney 1, Davd Patterson 1, Sarab Anand 1, Alexey Tsymbal 2 1 NIKEL, Faculty of Engneerng,16J27 Unversty Of Ulster at Jordanstown Newtonabbey, BT37 OQB, Unted

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Collaboratively Regularized Nearest Points for Set Based Recognition

Collaboratively Regularized Nearest Points for Set Based Recognition Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Hybrid Heuristics for the Maximum Diversity Problem

Hybrid Heuristics for the Maximum Diversity Problem Hybrd Heurstcs for the Maxmum Dversty Problem MICAEL GALLEGO Departamento de Informátca, Estadístca y Telemátca, Unversdad Rey Juan Carlos, Span. Mcael.Gallego@urjc.es ABRAHAM DUARTE Departamento de Informátca,

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Machine Learning. Topic 6: Clustering

Machine Learning. Topic 6: Clustering Machne Learnng Topc 6: lusterng lusterng Groupng data nto (hopefully useful) sets. Thngs on the left Thngs on the rght Applcatons of lusterng Hypothess Generaton lusters mght suggest natural groups. Hypothess

More information

LS-TaSC Version 2.1. Willem Roux Livermore Software Technology Corporation, Livermore, CA, USA. Abstract

LS-TaSC Version 2.1. Willem Roux Livermore Software Technology Corporation, Livermore, CA, USA. Abstract 12 th Internatonal LS-DYNA Users Conference Optmzaton(1) LS-TaSC Verson 2.1 Wllem Roux Lvermore Software Technology Corporaton, Lvermore, CA, USA Abstract Ths paper gves an overvew of LS-TaSC verson 2.1,

More information

Exercises (Part 4) Introduction to R UCLA/CCPR. John Fox, February 2005

Exercises (Part 4) Introduction to R UCLA/CCPR. John Fox, February 2005 Exercses (Part 4) Introducton to R UCLA/CCPR John Fox, February 2005 1. A challengng problem: Iterated weghted least squares (IWLS) s a standard method of fttng generalzed lnear models to data. As descrbed

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

SENSITIVITY ANALYSIS IN LINEAR PROGRAMMING USING A CALCULATOR

SENSITIVITY ANALYSIS IN LINEAR PROGRAMMING USING A CALCULATOR SENSITIVITY ANALYSIS IN LINEAR PROGRAMMING USING A CALCULATOR Judth Aronow Rchard Jarvnen Independent Consultant Dept of Math/Stat 559 Frost Wnona State Unversty Beaumont, TX 7776 Wnona, MN 55987 aronowju@hal.lamar.edu

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements Module 3: Element Propertes Lecture : Lagrange and Serendpty Elements 5 In last lecture note, the nterpolaton functons are derved on the bass of assumed polynomal from Pascal s trangle for the fled varable.

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments Comparson of Heurstcs for Schedulng Independent Tasks on Heterogeneous Dstrbuted Envronments Hesam Izakan¹, Ath Abraham², Senor Member, IEEE, Václav Snášel³ ¹ Islamc Azad Unversty, Ramsar Branch, Ramsar,

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Correlative features for the classification of textural images

Correlative features for the classification of textural images Correlatve features for the classfcaton of textural mages M A Turkova 1 and A V Gadel 1, 1 Samara Natonal Research Unversty, Moskovskoe Shosse 34, Samara, Russa, 443086 Image Processng Systems Insttute

More information

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 48 CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 3.1 INTRODUCTION The raw mcroarray data s bascally an mage wth dfferent colors ndcatng hybrdzaton (Xue

More information

Comparing High-Order Boolean Features

Comparing High-Order Boolean Features Brgham Young Unversty BYU cholarsarchve All Faculty Publcatons 2005-07-0 Comparng Hgh-Order Boolean Features Adam Drake adam_drake@yahoo.com Dan A. Ventura ventura@cs.byu.edu Follow ths and addtonal works

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Learning to Project in Multi-Objective Binary Linear Programming

Learning to Project in Multi-Objective Binary Linear Programming Learnng to Project n Mult-Objectve Bnary Lnear Programmng Alvaro Serra-Altamranda Department of Industral and Management System Engneerng, Unversty of South Florda, Tampa, FL, 33620 USA, amserra@mal.usf.edu,

More information

Agenda & Reading. Simple If. Decision-Making Statements. COMPSCI 280 S1C Applications Programming. Programming Fundamentals

Agenda & Reading. Simple If. Decision-Making Statements. COMPSCI 280 S1C Applications Programming. Programming Fundamentals Agenda & Readng COMPSCI 8 SC Applcatons Programmng Programmng Fundamentals Control Flow Agenda: Decsonmakng statements: Smple If, Ifelse, nested felse, Select Case s Whle, DoWhle/Untl, For, For Each, Nested

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 2 Sofa 2016 Prnt ISSN: 1311-9702; Onlne ISSN: 1314-4081 DOI: 10.1515/cat-2016-0017 Hybrdzaton of Expectaton-Maxmzaton

More information

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory Background EECS. Operatng System Fundamentals No. Vrtual Memory Prof. Hu Jang Department of Electrcal Engneerng and Computer Scence, York Unversty Memory-management methods normally requres the entre process

More information

A Powerful Feature Selection approach based on Mutual Information

A Powerful Feature Selection approach based on Mutual Information 6 IJCN Internatonal Journal of Computer cence and Network ecurty, VOL.8 No.4, Aprl 008 A Powerful Feature electon approach based on Mutual Informaton Al El Akad, Abdelall El Ouardgh, and Drss Aboutadne

More information

Parameter estimation for incomplete bivariate longitudinal data in clinical trials

Parameter estimation for incomplete bivariate longitudinal data in clinical trials Parameter estmaton for ncomplete bvarate longtudnal data n clncal trals Naum M. Khutoryansky Novo Nordsk Pharmaceutcals, Inc., Prnceton, NJ ABSTRACT Bvarate models are useful when analyzng longtudnal data

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

CSE 326: Data Structures Quicksort Comparison Sorting Bound

CSE 326: Data Structures Quicksort Comparison Sorting Bound CSE 326: Data Structures Qucksort Comparson Sortng Bound Steve Setz Wnter 2009 Qucksort Qucksort uses a dvde and conquer strategy, but does not requre the O(N) extra space that MergeSort does. Here s the

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information