Generative Local Metric Learning for Nearest Neighbor Classification

Yung-Kyun Noh, Byoung-Tak Zhang, and Daniel D. Lee, Fellow, IEEE

Abstract: We consider the problem of learning a local metric in order to enhance the performance of nearest neighbor classification. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning to dimensionality reduction from a novel perspective, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class-conditional generative models such as a Gaussian.

Index Terms: Metric learning, nearest neighbor classification, f-divergence, generative-discriminative hybridization

1 INTRODUCTION

THE classic dichotomy between generative and discriminative methods for classification in machine learning can clearly be seen in two distinct performance regimes as the number of training examples is varied [1], [2]. Generative methods, which employ models of the underlying distribution p(x|y) for a discrete class label y and input data x in R^D, typically outperform discriminative methods when the number of training examples is small, because the smaller variance of the generative models compensates for any possible bias in the models. On the other hand, more flexible discriminative methods, which directly model p(y|x), can accurately capture the true posterior structure when the number of training examples is large. Thus, given enough training examples, the best performing classification algorithms have typically been purely discriminative.

However, due to the curse of dimensionality when D is large, the number of data examples may be insufficient for discriminative methods to approach their asymptotic performance limits. In this case, it may be possible to improve discriminative methods by exploiting knowledge from generative models.

- Y.-K. Noh is with the School of Mechanical and Aerospace Engineering, Seoul National University, Seoul 08826, Korea. E-mail: nohyung@snu.ac.kr.
- B.-T. Zhang is with the School of Computer Science and Engineering, Seoul National University, Gwanak-gu, Seoul 08826, Korea. E-mail: btzhang@snu.ac.kr.
- D. D. Lee is with the Department of Electrical & Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104. E-mail: ddlee@seas.upenn.edu.

Manuscript received 5 Feb. 2016; revised 30 Dec. 2016; accepted 30 Jan. 2017. Date of publication 7 Feb. 2017; date of current version Dec. 2017. Recommended for acceptance by G. Chechik. For information on obtaining reprints of this article, please send e-mail to reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPAMI
There has been recent work on hybrid models showing some improvement [3], [4], [5], but the generative models have mainly been improved through the discriminative formulation.

In this work, we consider a very simple discriminative classifier, the nearest neighbor classifier, where the class label of an unknown datum is chosen according to the class label of the nearest known datum. The choice of a metric to define "nearest" is then crucial, and we show how this metric can be locally defined based upon information garnered from generative models.

Previous work on metric learning for nearest neighbor classification has focused on a purely discriminative approach. The metric is parameterized by a single global form, which is then optimized on the training data to maximize the pairwise separation between dissimilar points and to minimize the pairwise separation of similar points [6], [7], [8], [9], [10], where [10] has been extended to a faster online algorithm [11]. For k-nearest neighbors, a similar formulation has been introduced which tolerates impostors that cause no errors under majority voting [12]. Discriminative learning can also be used to find multiple metrics applied at different locations, where each metric is locally discriminative [13], or the metric at different locations is determined by a nonlinear transformation using a combination of regression trees [14]. In discriminative learning, calculating the distances of all similar and dissimilar pairs is unavoidable, and this is the main reason that discriminative metric learning does not scale well to large datasets. The optimization problem is simplified to an eigenvector problem in [15], but the optimization over the simplex of all dissimilar pairs remains a major drawback for large scale data.

Here, we introduce a scalable generative method by analyzing how the problem of learning a metric can be related to reducing the theoretical bias of the nearest neighbor classifier. Though the performance of the nearest neighbor classifier has good theoretical guarantees in the limit of infinite data, finite sampling effects introduce a bias that can be minimized by the choice of an appropriate metric.

By directly reducing this bias at each point, the classification error is reduced significantly compared to a global class-separating metric. We show how to choose such a metric by analyzing the probability distribution of nearest neighbors, provided we know the underlying generative models. Analyses of nearest neighbor distributions have been discussed before [16], [17], [18], [19], but we take a simpler approach and derive the metric-dependent term in the bias directly. Minimizing this bias results in a semi-definite programming problem having an analytic solution for the optimal local metric.

In related work, Fukunaga et al. considered learning a metric from a generative setting [20], [21], providing a one-dimensional direction for data projection which is not related to reducing the bias [21]. Moreover, the direction is obtained from nearby data only, so the resulting direction is vulnerable to high-dimensional noise [20]. To avoid the dependency on local noise, their work was extended to use an averaged metric computed with a Monte Carlo average [21], but bias reduction was still not considered. In a different line of research, Jaakkola et al. showed how a generative model can be used to derive a special kernel, called the Fisher kernel [1]. Unfortunately, the Fisher kernel is quite generic and does not necessarily improve nearest neighbor performance.

Our generative approach provides a theoretical relationship between metric learning and the dimensionality reduction problem. In order to find better projections for classification, research on dimensionality reduction has utilized information-theoretic measures such as the Bhattacharyya divergence [23] and mutual information [24], [25]. We argue that these problems can be connected with metric learning for nearest neighbor classification within the general framework of f-divergences. We will also explain how dimensionality reduction is entirely different from metric learning in the generative approach, whereas in the discriminative setting it is simply a special case of metric learning in which the distances along particular directions have been shrunk to zero.

The remainder of the paper is organized as follows. In Section 2, we begin by comparing the metric dependency of the discriminative and generative approaches to nearest neighbor classification. After deriving the bias due to finite sampling in Section 3, we show in Section 4 how minimizing this bias results in a local metric learning algorithm. In Section 5, we explain how metric learning should be understood from a generative perspective, in particular its relationship with dimensionality reduction. Experiments on various datasets are presented in Section 6, and we derive an extended algorithm using k-nearest neighbors for k greater than one in Section 7. Finally, in Section 8, we conclude with a discussion of future work and possible extensions.

2 METRIC AND NEAREST NEIGHBOR CLASSIFICATION

In recent work, determining a good metric for nearest neighbor classification has been considered crucial. However, traditional generative analysis of this problem has simply ignored the metric issue, with good reason. In this section, we explain why conventional metric learning cannot be explored in a traditional generative approach, and we briefly explain how this property can be leveraged for a novel metric learning algorithm.

Fig. 1. Metric change for grid data. When the metric is changed by extending the horizontal direction, the separation between data is increased along the horizontal line.

2.1 Metric Learning for Nearest Neighbor Classification

A nearest neighbor classifier determines the label of an unknown datum according to the label of its nearest neighbor. In general, the meaning of the term "nearest" is defined along with a notion of distance in data space.
One common choice for this distance is the Mahalanobis distance with a positive definite square matrix A in R^{D x D}, where D is the dimensionality of the data space. In this case, the distance between two points x_1 and x_2 is defined as

d(x_1, x_2) = \sqrt{(x_1 - x_2)^T A (x_1 - x_2)},   (1)

and the nearest datum x_NN is the one having minimal distance to the test point among the labeled training data {x_i}_{i=1}^N. In this classification task, the results are highly dependent on the choice of the matrix A, and prior work [6], [7], [8], [9], [10] has attempted to improve performance through a better choice of A. That work assumed the following common heuristic: the training data in different classes should be separated in the new metric space. Given training data, a global A is optimized such that directions separating data in different classes are extended, and directions binding together data from the same class are shrunk. However, in terms of test results, these conventional methods do not improve performance dramatically, as our later experiments on many datasets will show, and our theoretical analysis explains why only small improvements arise, or often none at all.
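To make Eq. (1) concrete, here is a minimal sketch in Python/NumPy of the Mahalanobis distance and the resulting nearest neighbor rule. The function names and the toy data are ours, not from the authors' implementation.

```python
# A minimal sketch of Eq. (1) and 1-NN prediction under a Mahalanobis
# metric A; all names here are our own, not from the authors' code.
import numpy as np

def mahalanobis_dist(x1, x2, A):
    """d(x1, x2) = sqrt((x1 - x2)^T A (x1 - x2)) for positive definite A."""
    diff = x1 - x2
    return np.sqrt(diff @ A @ diff)

def nn_predict(x, X_train, y_train, A):
    """Label x by its nearest training point under the metric A."""
    dists = [mahalanobis_dist(x, xi, A) for xi in X_train]
    return y_train[int(np.argmin(dists))]

# toy usage: identity A recovers the ordinary euclidean nearest neighbor
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
print(nn_predict(np.zeros(3), X, y, np.eye(3)))
```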

2.2 Theoretical Performance of the Nearest Neighbor Classifier

The primary motivation for conventional metric learning algorithms has been to seek a separation between sets of data belonging to different classes. For example, if Class 1 data and Class 2 data are distributed along grid lines as in Fig. 1, extending the metric along the horizontal direction obviously helps provide a smaller error for nearest neighbor classification. Using this motivation, most recent metric learning algorithms have tried to maximize the separation between different classes.

Contrary to these recent metric learning approaches, however, a simple theoretical analysis using a generative model displays no sensitivity to the choice of metric.

Fig. 2. Metric change for data from a probabilistic generative model. For random data from two different probability density functions, extending the horizontal direction simply reduces the densities. There is no visible increase of separation along the horizontal direction between data of different classes any greater than the increase of separation along the vertical direction.

In Fig. 2, with random data from two different probability densities p_1 and p_2 where p_1 > p_2, extending the horizontal direction does not increase the separation between data along the horizontal direction any more than the separation along the vertical direction. Instead, the extension of a particular direction only reduces the scalar densities of both classes together. In a situation where data are dense and the nearest neighbor appears within a small distance, over which the probability densities can be considered uniform, the expected classification accuracies are unaffected by changing the metric. In this situation, no metric learning algorithm can improve the nearest neighbor classification.

A more detailed analysis of this situation is as follows. We consider i.i.d. samples generated from two different distributions p_1(x) and p_2(x) over the vector space x in R^D. With infinite samples, the probability of misclassification using a nearest neighbor classifier is

E_{Asymp} = \int \frac{p_1(x) p_2(x)}{p_1(x) + p_2(x)} dx.   (2)

The equation comes from the probability that the labels of the query datum and its nearest neighbor differ. The probability that the query datum has Class 1 is p_1/(p_1 + p_2), and with many data the probability that its nearest neighbor has Class 2 is p_2/(p_1 + p_2). The product of these two probabilities constitutes one probability of error. Adding the probability that the query has Class 2 while its nearest neighbor has Class 1 gives (p_1 p_2 + p_2 p_1)/(p_1 + p_2)^2, and integrating against the total data density (p_1 + p_2)/2, assuming equal priors, yields the expected error with infinite data in Eq. (2). The equation is better known through its relationship to an upper bound: it is at most twice the optimal Bayes error [20], [22], [26].

By looking at the asymptotic error in a linearly transformed z-space, we can show that Eq. (2) is invariant to a change of metric. If we consider a linear transformation z = L^T x using a full rank matrix L, and distributions q_c(z) for c in {1, 2} in z-space satisfying p_c(x) dx = q_c(z) dz with the accompanying measure change dz = |L| dx, we see that E_{Asymp} in z-space is unchanged. Since any positive definite A can be decomposed as A = L L^T, the asymptotic error remains constant even as the metric shrinks or expands along any spatial directions in data space.

Fig. 3. The nearest neighbor x_NN appears at a finite distance d_N from x_0 due to finite sampling. Given the data distribution p(x), the average probability density over the surface of the D-dimensional hypersphere is \tilde{p}(x_0) = p(x_0) + (d_N^2 / 2D) \nabla^2 p|_{x = x_0} for small d_N.

However, the metric independence only arises in the asymptotic limit. When we do not have infinite samples, the expectation of the error is biased in that it deviates from the asymptotic error, and this bias depends on the metric. From a theoretical perspective, the asymptotic error is the theoretical limit of the expected error, and the bias shrinks as the number of samples increases. Since these metric and sample dependences have not been considered in previous research, the aforementioned class-separating metrics do not exhibit performance improvements when the sample number is large. In the next section, we examine the performance bias associated with finite sampling directly and find a metric that minimizes the deviation from the theoretical asymptotic error.
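The metric invariance of Eq. (2) is easy to verify numerically. The following sketch, our own illustration assuming two 2-D Gaussian classes, approximates E_Asymp on a grid in x-space and again in a linearly transformed z-space; the two values agree up to grid resolution.

```python
# Numeric sanity check (ours, not from the paper) that E_Asymp in Eq. (2)
# is invariant under a full-rank linear map z = L^T x.
import numpy as np

def gauss_pdf(X, m, S):
    """Densities of the rows of X under N(m, S)."""
    d = X - m
    q = np.einsum('ij,jk,ik->i', d, np.linalg.inv(S), d)
    return np.exp(-0.5 * q) / np.sqrt((2*np.pi)**len(m) * np.linalg.det(S))

def asymptotic_error(m1, S1, m2, S2, lim, n=600):
    """Grid approximation of the integral of p1 p2 / (p1 + p2) in 2-D."""
    g = np.linspace(-lim, lim, n)
    X = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)
    p1, p2 = gauss_pdf(X, m1, S1), gauss_pdf(X, m2, S2)
    ratio = np.divide(p1 * p2, p1 + p2,
                      out=np.zeros_like(p1), where=(p1 + p2) > 0)
    return ratio.sum() * (g[1] - g[0])**2

m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.array([1.0, 0.0]), np.array([[2.0, 0.3], [0.3, 0.5]])
L = np.array([[3.0, 0.0], [1.0, 0.2]])      # any full-rank transform
# z = L^T x maps N(m, S) to N(L^T m, L^T S L); the error stays the same.
print(asymptotic_error(m1, S1, m2, S2, lim=10))
print(asymptotic_error(L.T @ m1, L.T @ S1 @ L,
                       L.T @ m2, L.T @ S2 @ L, lim=40, n=1600))
```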
3 PERFORMANCE BIAS DUE TO FINITE SAMPLING

Here, we obtain the expectation of the nearest neighbor classification error from the distribution of nearest neighbors in different classes. With a finite number of samples, the nearest neighbor of a point x_0 appears at a finite distance d_N > 0. This non-zero distance gives rise to the performance difference from the theoretical limit (2). We consider a twice-differentiable probability density function p(x) and approximate it to second order near a test point x_0:

p(x) \approx p(x_0) + \nabla p(x)|_{x = x_0}^T (x - x_0)   (3)
  + \frac{1}{2} (x - x_0)^T \nabla\nabla p(x)|_{x = x_0} (x - x_0),   (4)

with the gradient \nabla p(x) and Hessian matrix \nabla\nabla p(x) defined by taking derivatives with respect to x.

Now, under the condition that the nearest neighbor appears at distance d_N from the test point, the expectation of the probability density p(x_NN) at the nearest neighbor point is derived by averaging the density over the surface of the D-dimensional hypersphere of radius d_N, as in Fig. 3. Here, the point x_NN is the location of the nearest neighbor for a given test point x_0, and the set of x_NN apart from x_0 by d_N is of interest. After averaging p(x_NN), the gradient term disappears, and the resulting expectation is the sum of the probability at x_0 and a residual term containing the Laplacian of p; the detailed derivation is shown in the Appendix.
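This sphere average is easy to check by simulation before using it. The sketch below, our own check assuming a standard normal density for p, samples points uniformly on a sphere of radius d around x_0 and compares the empirical average of p with p(x_0) + (d^2 / 2D) times the Laplacian, the form given as Eq. (6) below.

```python
# Monte Carlo check (ours) of the sphere average: for small d, the mean of
# p over a radius-d sphere around x0 is p(x0) + d^2/(2D) * laplacian(p)(x0).
import numpy as np

D, d, x0 = 3, 0.2, np.array([0.5, -0.3, 1.0])
p = lambda x: np.exp(-0.5 * np.sum(x**2, -1)) / (2*np.pi)**(D/2)  # N(0, I)
lap = lambda x: p(x) * (np.sum(x**2) - D)     # analytic Laplacian of N(0, I)

rng = np.random.default_rng(1)
u = rng.normal(size=(200000, D))
u /= np.linalg.norm(u, axis=1, keepdims=True) # uniform on the unit sphere
mc = p(x0 + d * u).mean()
print(mc, p(x0) + d**2 / (2*D) * lap(x0))     # close for small d
```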

We denote this expected probability by \tilde{p}(x_0):

E_{x_NN}[ p(x_NN) | d_N, x_0 ] \approx p(x_0) + \frac{1}{2} E_{x_NN}[ (x - x_0)^T \nabla\nabla p(x) (x - x_0) \,|\, \|x - x_0\| = d_N ]   (5)
 = p(x_0) + \frac{d_N^2}{2D} \nabla^2 p|_{x = x_0} \equiv \tilde{p}(x_0),   (6)

where the scalar Laplacian \nabla^2 p(x) is given by the sum of the eigenvalues of the Hessian \nabla\nabla p(x).

If we look at the expected error for two-class classification, it is the expectation of the probability that the test point and its neighbor are labeled differently. In other words, the expected error E_NN is the expectation of

e(x, x_NN) = P(c=1|x) P(c=2|x_NN) + P(c=2|x) P(c=1|x_NN)

over both the distribution of x and the distribution of the nearest neighbor x_NN for a given x:

E_NN = E_x[ E_{x_NN}[ e(x, x_NN) | x ] ].   (7)

Using the two class-conditional density functions p_c(x) for c = 1, 2, we replace the posteriors P(c|x) and P(c|x_NN) by p_c(x)/(p_1(x) + p_2(x)) and p_c(x_NN)/(p_1(x_NN) + p_2(x_NN)), respectively. We then approximate the expectation of the posterior E_{x_NN}[P(c|x_NN) | d_N, x] at a fixed distance d_N from the test point x using \tilde{p}_c(x)/(\tilde{p}_1(x) + \tilde{p}_2(x)). If we expand E_NN with respect to d_N and take the expectation using the decomposition E_{x_NN}[f] = E_{d_N}[ E_{x_NN}[f | d_N] ], then the expected error is given to leading order by

E_NN \approx \int \Big[ \frac{p_1 p_2}{p_1 + p_2} + \frac{E_{d_N}[d_N^2]}{4D} \frac{1}{(p_1 + p_2)^2} \big( p_2^2 \nabla^2 p_1 + p_1^2 \nabla^2 p_2 - p_1 p_2 (\nabla^2 p_1 + \nabla^2 p_2) \big) \Big] dx,   (8)

where p_1 and p_2 abbreviate the class-conditional density functions p_1(x) and p_2(x). When E_{d_N}[d_N^2] goes to 0 with an infinite number of samples, this error converges to the asymptotic limit in Eq. (2), as expected. The residual term is the finite-sampling bias of the error discussed earlier. Under the coordinate transformation z = L^T x, with distributions p(x) in x-space and q(z) in z-space, this bias term depends on the choice of metric A = L L^T:

\int \frac{1}{(q_1 + q_2)^2} \big( q_2^2 \nabla^2 q_1 + q_1^2 \nabla^2 q_2 - q_1 q_2 (\nabla^2 q_1 + \nabla^2 q_2) \big) dz   (9)
 = \int \frac{1}{(p_1 + p_2)^2} tr\big[ A^{-1} \big( p_2^2 \nabla\nabla p_1 + p_1^2 \nabla\nabla p_2 - p_1 p_2 (\nabla\nabla p_1 + \nabla\nabla p_2) \big) \big] dx,   (10)

which is derived using p(x) dx = q(z) dz and |L| \nabla^2 q = tr[ A^{-1} \nabla\nabla p ], the latter obtained by differentiating p(x) with respect to z using z = L^T x. The expectation of the squared distance E_{d_N}[d_N^2] in Eq. (8) is related to the determinant |A| by E_{d_N}[d_N^2] \propto |A|^{1/D}, because E_{d_N}[d_N^2] is proportional to ((p_1 + p_2) N)^{-2/D}, as in [22]. By using A with fixed determinant |A| = 1, we do not have to consider the change of E_{d_N}[d_N^2]. Therefore, finding the metric that minimizes the quantity in Eq. (9) at each point is equivalent to minimizing the metric-dependent bias in Eq. (8).

4 REDUCING DEVIATION FROM THE ASYMPTOTIC PERFORMANCE

Finding the local metric that minimizes the bias can be formulated as a semi-definite programming (SDP) problem: minimize the squared residual with respect to a positive semi-definite metric A,

min_A ( tr[ A^{-1} B ] )^2   s.t.   |A| = 1,  A \succeq 0,   (11)

where the matrix B at each point is

B = p_2^2 \nabla\nabla p_1 + p_1^2 \nabla\nabla p_2 - p_1 p_2 (\nabla\nabla p_1 + \nabla\nabla p_2).   (12)

Here B depends upon the given test point x, and the corresponding optimal A is determined locally as well. However, the B's at different points come from the same generative model explaining p_1 and p_2 at different points. For a given generative model, the densities and Hessians p_1(x), p_2(x), \nabla\nabla p_1|_x, and \nabla\nabla p_2|_x are determined at each test point, and the B at that point is determined correspondingly.

The optimization problem in Eq. (11) is a simple SDP problem having an analytical solution. The solution shares its eigenvectors with B.
Let \Lambda_+ in R^{d_+ x d_+} and \Lambda_- in R^{d_- x d_-} be the diagonal matrices containing the positive and negative eigenvalues of B, respectively. If U_+ in R^{D x d_+} contains the eigenvectors corresponding to the eigenvalues in \Lambda_+ and U_- in R^{D x d_-} contains the eigenvectors corresponding to the eigenvalues in \Lambda_-, we use a scaled solution given by

A_opt = \beta [U_+ \; U_-] \begin{bmatrix} d_+ \Lambda_+ & 0 \\ 0 & -d_- \Lambda_- \end{bmatrix} [U_+ \; U_-]^T,   (13)

with \beta chosen to satisfy |A_opt| = 1. Eq. (13) is optimal when either all eigenvalues of the matrix B are positive (A_opt = \beta U_+ \Lambda_+ U_+^T) or all eigenvalues are negative (A_opt = -\beta U_- \Lambda_- U_-^T), which can be obtained by direct differentiation of the objective function. If the matrix B has both positive and negative eigenvalues, Eq. (13) achieves the minimum zero objective, ( tr[ A_opt^{-1} B ] )^2 = 0; note, however, that in this case Eq. (13) is not the unique solution with zero objective. With this particular form, the solution Eq. (13) changes continuously at the boundary where eigenvalues change sign.

The solution A_opt is a local metric, since we assume that the nearest neighbor is close enough to the test point for the expansion in Eqs. (3)-(4) to hold. In principle, distances should then be defined as geodesic distances using this local metric on a Riemannian manifold. However, this is computationally difficult, so we use the surrogate distance based on A = \gamma I + A_opt and treat \gamma as a regularization parameter that is learned in addition to the local metric A_opt.
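Below is our reading of Eqs. (11)-(13) as code: eigendecompose B, weight the positive and negative eigenvalue magnitudes by the counts d_+ and d_- so that tr[A^{-1} B] = 0 in the mixed-sign case, normalize the determinant, and add the gamma regularizer. Function and variable names are ours, and the handling of near-zero eigenvalues is an implementation choice the paper does not specify.

```python
# A sketch of the closed-form local metric of Eq. (13); names are ours.
import numpy as np

def glml_metric(B, gamma=0.0, eps=1e-12):
    """Local metric A = gamma * I + A_opt with det(A_opt) = 1."""
    lam, U = np.linalg.eigh(B)                 # B is symmetric
    d_pos = int(np.sum(lam > eps))
    d_neg = int(np.sum(lam < -eps))
    # weight |lambda| by the count of same-sign eigenvalues, so that in
    # the mixed-sign case tr(A^{-1} B) = d+/d+ - d-/d- = 0
    w = np.where(lam > eps, d_pos, np.where(lam < -eps, d_neg, 0))
    a = np.abs(lam) * w
    a[a <= eps] = eps                          # guard tiny eigenvalues (our choice)
    a = a / np.exp(np.mean(np.log(a)))         # enforce det(A_opt) = 1
    A_opt = (U * a) @ U.T
    return gamma * np.eye(B.shape[0]) + A_opt
```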

The multiway extension of this problem is straightforward. The asymptotic error with C-class distributions extends to

E_{Asymp} = \frac{1}{C} \sum_{c=1}^{C} \int \frac{ p_c \sum_{j \neq c} p_j }{ \sum_j p_j } dx

using the posteriors of each class, and it replaces B in Eq. (11) by the extended matrix

B = \sum_{c=1}^{C} \nabla\nabla p_c \Big( \sum_{j \neq c} p_j \Big) \Big( \sum_{j \neq c} p_j - p_c \Big).   (14)

5 METRIC LEARNING IN GENERATIVE MODELS

Once the Hessian matrices \nabla\nabla p_c are given for all classes c = 1, ..., C, the metric can be obtained from the eigenvalues and eigenvectors of the matrix in Eq. (14). Since p_c is the underlying density function, the Hessian cannot be obtained directly but has to be estimated. In this work, a simple standard Gaussian density function is used for each p_c: we learn a generative model with parameters m_c and \Sigma_c, the mean vector and covariance matrix of each class c. The Hessian of the Gaussian density function is

\nabla\nabla p_c(x) = p_c(x) \big[ \Sigma_c^{-1} (x - m_c)(x - m_c)^T \Sigma_c^{-1} - \Sigma_c^{-1} \big].   (15)

This expression is then used with the estimated parameters to learn the local metric at each point x. Because we use a generative model to obtain the metric, we call our learning method Generative Local Metric Learning (GLML). Any twice-differentiable parametric or nonparametric density function can serve as the generative model, but empirically a coarse Gaussian model for each class works well. For a Gaussian model, the computational complexity of estimating the parameters is O(C (N D^2 + D^3)) for C-class data with N D-dimensional training points, and the metric matrix is obtained at a cost of O(D^3) for solving the eigenvector problem for each test point. The algorithm therefore scales easily with the number of training data N. The procedure with the Gaussian generative model is summarized in Algorithm 1.

Algorithm 1. Generative Local Metric Learning for Nearest Neighbor Classification

Input: data D = {x_i, y_i}_{i=1}^N and a point x
Output: classification output \hat{y}(x)
Procedure:
1: Estimate the mean vector m_c and covariance matrix \Sigma_c of each class c in {1, ..., C} from D.
2: Use the estimated parameters \hat{m}_c, \hat{\Sigma}_c to obtain the B matrix (Eq. (14)) at the point x.
3: Use the eigenvectors and eigenvalues of B to obtain the metric matrix A in Eq. (13).
4: Using the metric A, perform nearest neighbor classification with the new distance in Eq. (1), and obtain \hat{y}(x).
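A compact sketch of Algorithm 1 with class-conditional Gaussians might look as follows; it uses the glml_metric helper sketched above, Eq. (15) for the Hessian, and Eq. (14) for B. The regularization and numerical details are simplified assumptions on our part.

```python
# A sketch of Algorithm 1 (GLML with class-conditional Gaussians),
# assuming the glml_metric helper defined earlier.
import numpy as np

def fit_gaussians(X, y, alpha=1e-3):
    """Step 1: per-class mean and (regularized) covariance."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        m = Xc.mean(axis=0)
        S = np.cov(Xc.T) + alpha * np.eye(X.shape[1])
        params[c] = (m, np.linalg.inv(S), np.linalg.slogdet(S)[1])
    return params

def dens_and_hess(x, m, Sinv, logdet):
    """Gaussian density p_c(x) and its Hessian from Eq. (15)."""
    delta = Sinv @ (x - m)
    logp = -0.5 * ((x - m) @ delta + logdet + len(x) * np.log(2 * np.pi))
    p = np.exp(logp)
    return p, p * (np.outer(delta, delta) - Sinv)

def local_B(x, params):
    """Step 2: the multiclass B matrix of Eq. (14) at the point x."""
    ps, Hs = zip(*(dens_and_hess(x, *params[c]) for c in params))
    tot = sum(ps)
    return sum(H * (tot - p) * (tot - 2 * p) for p, H in zip(ps, Hs))

def glml_predict(x, X_train, y_train, params, gamma=0.1):
    """Steps 3-4: local metric A, then 1-NN under the distance of Eq. (1)."""
    A = glml_metric(local_B(x, params), gamma=gamma)
    diff = X_train - x
    d2 = np.einsum('ij,jk,ik->i', diff, A, diff)
    return y_train[int(np.argmin(d2))]
```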
5.1 Dimensionality Reduction and Metric Learning

Traditional metric learning methods can be understood as purely discriminative; in other words, they do not consider the class-conditional density functions p_1(x) or p_2(x). Those methods focus on maximizing the separation of data belonging to different classes. Their motivation is often compared to that of supervised dimensionality reduction methods, which try to find a low dimensional space where the separation between classes is maximized. Because the motivations are so similar, dimensionality reduction has often been considered a related problem seeking separation between classes, but with the special constraint that the metric along particular directions is forced to zero.

When a generative approach is considered, however, the relationship between dimensionality reduction and metric learning is different. As in the discriminative case, dimensionality reduction in generative models tries to obtain class separation in a transformed space. It assumes particular parametric distributions (typically Gaussians) and uses a criterion to maximize the separation [23], [24], [25], [27]. One general form of these criteria is the f-divergence (also known as Csiszar's general measure of divergence), which can be defined with respect to a convex function f(t) for t in R [28]:

F(p_1(x), p_2(x)) = \int p_1(x) f\Big( \frac{p_2(x)}{p_1(x)} \Big) dx.   (16)

Examples of this divergence include the Bhattacharyya divergence -\int \sqrt{p_1(x) p_2(x)} dx when f(t) = -\sqrt{t}, and the KL-divergence \int p_1(x) \log \frac{p_1(x)}{p_2(x)} dx when f(t) = -\log(t). The use of mutual information between data and labels can be understood as an extension of the KL-divergence. The well-known Linear Discriminant Analysis is a special example of a Bhattacharyya criterion when we assume two-class Gaussians sharing the same covariance matrix.

Unlike in dimensionality reduction, we cannot use an f-divergence as a criterion for metric learning, because any f-divergence is metric-invariant. Instead, we can find the relationship through the asymptotic error in Eq. (2). This asymptotic error is related to an f-divergence by E_{Asymp} = 1 - F(p_1, p_2), where the f-divergence is associated with f(t) = 1/(1 + t). The f-divergence here can be called the asymptotic accuracy of nearest neighbor classification. While dimensionality reduction uses f-divergences as criteria to maximize using generative information, in metric learning we instead view the problem as an estimation problem and optimize the metric for better estimation of the f-divergence. In other words, a better estimate amounts to reconstructing, from finite data, the asymptotic nearest neighbor result, whose error is less than twice the Bayes error; this interpretation clarifies how the idea differs from the concept underlying the dimensionality reduction problem.
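For intuition, the sketch below, our own illustration with two 1-D Gaussians on a grid, evaluates Eq. (16) for the three choices of f mentioned above, recovering the Bhattacharyya coefficient, the KL-divergence, and the asymptotic nearest neighbor accuracy 1 - E_Asymp.

```python
# Grid illustration (ours) of the f-divergence family of Eq. (16) for two
# 1-D Gaussians and three choices of the convex function f.
import numpy as np

x = np.linspace(-12, 12, 20001)
dx = x[1] - x[0]
g = lambda m, s: np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))
p1, p2 = g(0.0, 1.0), g(1.5, 2.0)

F = lambda f: np.sum(p1 * f(p2 / p1)) * dx   # F(p1, p2) = int p1 f(p2/p1)
print(-F(lambda t: -np.sqrt(t)))             # Bhattacharyya coefficient
print(F(lambda t: -np.log(t)))               # KL(p1 || p2)
print(1.0 - F(lambda t: 1.0 / (1.0 + t)))    # E_Asymp of Eq. (2)
```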

5.2 Generative Information for Classification

Classification using generative models typically builds a model of the class-conditional probability densities p_1(x) and p_2(x) and uses the criterion p_1(x)/p_2(x) \gtrless 1 for classification. Nearest neighbors carry probability density information in that higher density tends to give shorter nearest neighbor distances. However, classification using nearest neighbors from finite samples deviates from that using infinite samples, because the probability density itself deviates at the location of the nearest neighbors. If we could keep the probability densities intact at the location of the nearest neighbors x_NN, the expected error at a point x would still be

\frac{ p_1(x) p_2(x_NN) + p_2(x) p_1(x_NN) }{ (p_1(x) + p_2(x))(p_1(x_NN) + p_2(x_NN)) } = \frac{ 2 p_1(x) p_2(x) }{ (p_1(x) + p_2(x))^2 },

where p_1(x) and p_2(x) are the probability densities at the point of interest x, and nearest neighbor classification would achieve the asymptotic classification error.

Our approach of reducing Eq. (9) provides an alternative way of reconstructing such asymptotic errors. By reformulating the matrix in Eq. (12) as (p_2 - p_1)(p_2 \nabla\nabla p_1 - p_1 \nabla\nabla p_2), we can see that the optimal metric tries to minimize the difference between \nabla^2 p_1 / p_1 and \nabla^2 p_2 / p_2. If \nabla^2 p_1 / p_1 \approx \nabla^2 p_2 / p_2, this also implies

\frac{\tilde{p}_1}{p_1} \approx \frac{\tilde{p}_2}{p_2},   (17)

for \tilde{p} = p + (d_N^2 / 2D) \nabla^2 p, the average probability at a distance d_N in Eq. (6). Thus, the algorithm tries to keep the ratio of the average probabilities \tilde{p}_1 / \tilde{p}_2 at a distance d_N as close as possible to the ratio of the probabilities p_1 / p_2 at the test point. Although we cannot keep the probability density itself constant at a distance, our metric can stabilize the ratio between the averages of the probability densities at a distance. This means that the expected nearest neighbor classification at a distance d_N remains least biased even though the density information has changed.

Fig. 4. Optimal local metrics are shown on the left for three example Gaussian distributions in a 5-dimensional space. The projected 2-dimensional distributions are represented as ellipses (one standard deviation from the mean), while the remaining 3 dimensions have isotropic distributions. The local \tilde{p}/p of the three classes is plotted on the right using the euclidean metric I and the optimal metric A_opt. The solution A_opt tries to keep the ratio \tilde{p}/p over the different classes as similar as possible for different distances d_N.

Fig. 4 shows how the learned local metric A_opt varies at a point x for a 3-class Gaussian example and, at a specific point, how the ratio \tilde{p}/p is kept as similar as possible between different classes. In the following experiments, we show how actual nearest neighbor classification benefits from this analysis. At issue is how reliably we can estimate the Hessian from data. It turns out that by applying a simple model, such as a single Gaussian for each class, we see a significant increase in performance on many datasets. A coarse model captures the overall curvature structure of the underlying probability densities, providing guidelines to the nearest neighbor classifier on which directions to focus for reconstructing asymptotic results.

6 EXPERIMENTS

We apply our local metric learning algorithm to synthetic and various real datasets and examine how well it improves nearest neighbor classification performance. We compare our method, GLML with a class-conditional Gaussian model as in Algorithm 1, against recent discriminative metric learning methods that report state-of-the-art performance on a number of datasets. These include Information-Theoretic Metric Learning (ITML)¹ [6], BoostMetric (BM)² [9], and Large Margin Nearest Neighbor (LMNN)³ [10]. As more recently proposed methods, we also present results using Parametric Local Metric Learning (PLML) [13] and Distance Metric Learning with Eigenvalue Optimization (DML-eig) [15]. We use the implementations downloaded from the corresponding authors' websites. We also compare with a local metric given by the Fisher kernel [1], assuming a single Gaussian for the generative model and using the location parameter to derive the Fisher information matrix.

1. http://userweb.cs.utexas.edu/~pjain/itml/
2. http://code.google.com/p/boosting/
3. http:// kilian/Downloads/lmnn.html
The metric from the Fisher kernel was not originally intended for nearest neighbor classification, but it is the only other reported algorithm that learns a local metric from generative models.

6.1 Experiments with Gaussian Data

For the synthetic dataset, we generated data from two-class random Gaussian distributions having two fixed means. The covariance matrices are generated from random orthogonal eigenvectors and random eigenvalues. Experiments were performed varying the input dimensionality, and the classification accuracies are shown in Fig. 5a along with the results of the other algorithms. We used 500 test points and an equal number of training examples. The experiments were performed over different random realizations, and the results were averaged. As the dimensionality grows, the original nearest neighbor performance degrades because of the high dimensionality. However, the proposed local metric substantially outperforms the discriminative nearest neighbor performance in high dimensional spaces. We note that this example is ideal for GLML, and it shows considerable improvement compared to the other methods.
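The paper states only that the covariances come from random orthogonal eigenvectors and random eigenvalues; one plausible way to generate such data, with eigenvalue range and means that are our own assumptions, is:

```python
# One way (an assumption on our part) to generate the Sec. 6.1 data:
# covariances built from a random orthogonal basis and a random spectrum.
import numpy as np

def random_covariance(D, rng):
    """Covariance from random orthogonal eigenvectors and eigenvalues."""
    Q, _ = np.linalg.qr(rng.normal(size=(D, D)))   # random orthogonal Q
    lam = rng.uniform(0.5, 2.0, size=D)            # random eigenvalues
    return (Q * lam) @ Q.T

rng = np.random.default_rng(0)
D, n = 10, 500
m1, m2 = np.zeros(D), np.ones(D) / np.sqrt(D)      # two fixed means
X1 = rng.multivariate_normal(m1, random_covariance(D, rng), size=n)
X2 = rng.multivariate_normal(m2, random_covariance(D, rng), size=n)
```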

Fig. 5. (a) Gaussian synthetic data with varying dimensionalities. As the number of dimensions grows, the accuracies of many methods degrade; GLML, on the other hand, continues to improve its performance well beyond the other methods. Experiments (b)-(i) on benchmark datasets vary the number of training data per class. The speech dataset in (i), TI46, is pronounced by eight men and eight women. The Fisher kernel and BM are omitted from some panels, since their performance is much worse than the nearest neighbor classifier with the euclidean and other metrics. DML-eig and PLML are (partly) missing in several panels due to the complexity of the algorithms.

6.2 Experiments with Real Data

The other experiments use the following benchmark datasets: UCI machine learning repository⁴ datasets (Ionosphere, Wine) and the IDA benchmark repository⁵ (German, Image, Waveform, Twonorm). We also used the USPS handwritten digits and the TI46 speech dataset. For the USPS data, we resized the images to 8 x 8 pixels and trained on the 64-dimensional pixel vectors. The TI46 examples consist of spoken sounds pronounced by eight different men and eight different women. We chose the pronunciations of the ten digits ("zero" to "nine") and performed a 10-class digit classification task. Ten different filters in the Fourier domain were used as features to preprocess the acoustic data. The experiments were averaged over repeated data sampling realizations of each dataset. For DML-eig and PLML, the algorithms are not scalable with the number of data, so the results are not fully obtained for the datasets with many data.

Our algorithm has two hyperparameters. The metric matrix is regularized as A = A_opt + \gamma I with the identity matrix I. We also use a regularized estimated covariance matrix \Sigma = \hat{\Sigma} + \alpha I, where \hat{\Sigma} is the maximum-likelihood estimate. Both \gamma and \alpha are chosen by cross-validation on a subset of the training data, then fixed for testing. The covariance regularization \alpha is fixed across all classes in each experiment.

In Fig. 5a, we show the results for the synthetic data with varying dimensionalities. Many algorithms show decreasing accuracies in higher dimensionalities, a consequence of the curse of dimensionality: in high dimensional spaces, distances become uninformative, and many algorithms fail to extract the information living there. On the other hand, two Gaussians in a high dimensional space become more separable than Gaussians in a lower dimensional space. Our algorithm captures this information effectively and increases the accuracy substantially over the other algorithms.

From the results shown in Figs. 5b-5i, our local metric algorithm generally outperforms most of the other metrics across most of the datasets.

4. http://archive.ics.uci.edu/ml/
5. http:// benchmark
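Returning to the hyperparameter selection described above, a plain hold-out search over gamma and alpha could look as follows; the grids and the split are our own choices, and fit_gaussians and glml_predict refer to the sketches given earlier.

```python
# A hold-out search (ours) over the two GLML hyperparameters of Sec. 6.2,
# reusing the fit_gaussians / glml_predict sketches from Sec. 5.
import numpy as np

def select_hyperparams(X, y, rng, grid_gamma=(0.01, 0.1, 1.0),
                       grid_alpha=(1e-3, 1e-2, 1e-1)):
    idx = rng.permutation(len(X))
    tr, va = idx[: len(X) // 2], idx[len(X) // 2 :]
    best, best_acc = None, -1.0
    for a in grid_alpha:
        params = fit_gaussians(X[tr], y[tr], alpha=a)
        for g in grid_gamma:
            pred = [glml_predict(x, X[tr], y[tr], params, gamma=g)
                    for x in X[va]]
            acc = np.mean(np.asarray(pred) == y[va])
            if acc > best_acc:
                best, best_acc = (g, a), acc
    return best   # (gamma, alpha) fixed afterwards for testing
```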

On quite a number of datasets, many of the other methods do not outperform the original euclidean nearest neighbor classifier. This is because these methods (all except PLML and the Fisher metric) use a global metric, with which performance can be improved only a little. On the other hand, the local metric derived from simple Gaussian distributions always shows a performance gain over the naive nearest neighbor classifier. In contrast, using the Bayes rule with these same simple Gaussian generative models often results in very poor performance.

The computational time using a local metric is also very competitive, since the underlying SDP optimization has a simple spectral solution. This is in contrast to other methods, which numerically solve for a global metric using an SDP over the data points. For some algorithms in our experiments, we did not report the results because those algorithms do not scale with the amount of data. DML-eig obtains the metric matrix by iteratively alternating between two problems: an eigenvector problem of a matrix of the data dimensionality, and an optimization over the set of dissimilar pairs, whose number scales as O(N^2) with the number of data N. We used two hyperparameters, the smoothing parameter in the algorithm and the number of nearest neighbors for dissimilar pair selection, both chosen by cross-validation on the training data; we used 10^-3 as the tolerance value. PLML solves two objective functions at the same time: one is the local smoothness objective of the metric matrices at different points, and the other is the discriminative objective of the metric matrices at anchor points. The amount of computation scales as O(N M^2 + N^2 M) for the smoothness optimization with M anchor points, and as O(M^3 + M^2 N + T) with T similar and dissimilar triplets, where the number of triplets can scale as O(N^3).

Our algorithm easily scales with the number of training data. Once the parameters of the generative model are estimated, the cost of calculating the metric is O(D^3) per test datum for solving the eigenvector problem. In addition, since the Hessian matrix in Eq. (15) divided by p(x),

\nabla\nabla p(x) / p(x) = \Sigma^{-1} (x - m)(x - m)^T \Sigma^{-1} - \Sigma^{-1},

is the sum of the rank-one matrix \Sigma^{-1} (x - m)(x - m)^T \Sigma^{-1} and the negative inverse covariance -\Sigma^{-1}, the eigenvalues and eigenvectors can be obtained using the rank-one modification method with complexity O(D^2) [29], [30]. For example, the two-class problem uses the eigenvalues and eigenvectors of

p_2^2 \nabla\nabla p_1 + p_1^2 \nabla\nabla p_2 - p_1 p_2 (\nabla\nabla p_1 + \nabla\nabla p_2) = (p_2 - p_1)(p_2 \nabla\nabla p_1 - p_1 \nabla\nabla p_2),   (18)

and because

\frac{ p_2 \nabla\nabla p_1 - p_1 \nabla\nabla p_2 }{ p_1 p_2 } = \big[ \Sigma_1^{-1} (x - m_1)(x - m_1)^T \Sigma_1^{-1} - \Sigma_1^{-1} \big] - \big[ \Sigma_2^{-1} (x - m_2)(x - m_2)^T \Sigma_2^{-1} - \Sigma_2^{-1} \big],   (19)

once the eigenvalues and eigenvectors of \Sigma_2^{-1} - \Sigma_1^{-1} are obtained, they can be rank-one modified by the two rank-one matrices \Sigma_1^{-1} (x - m_1)(x - m_1)^T \Sigma_1^{-1} and \Sigma_2^{-1} (x - m_2)(x - m_2)^T \Sigma_2^{-1} at each point of interest x where a test datum is located.
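The factorization in Eqs. (18)-(19) is easy to confirm numerically; the following check, our own, builds B both directly from Eq. (12) and from the factored form, and verifies that they coincide.

```python
# Numeric check (ours) of Eqs. (18)-(19): for two Gaussian classes,
# B = (p2 - p1) p1 p2 (hess(p1)/p1 - hess(p2)/p2), and hess(p_c)/p_c
# differs from -inv(S_c) by a rank-one matrix.
import numpy as np

rng = np.random.default_rng(2)
D = 4
x = rng.normal(size=D)

def gauss(x, m, S):
    Si = np.linalg.inv(S)
    p = np.exp(-0.5 * (x - m) @ Si @ (x - m)) \
        / np.sqrt((2 * np.pi)**D * np.linalg.det(S))
    H_over_p = np.outer(Si @ (x - m), Si @ (x - m)) - Si   # Eq. (15) / p
    return p, H_over_p

p1, R1 = gauss(x, np.zeros(D), np.eye(D))
p2, R2 = gauss(x, np.ones(D), 0.5 * np.eye(D) + 0.1)
B_direct = p2**2 * p1 * R1 + p1**2 * p2 * R2 - p1 * p2 * (p1 * R1 + p2 * R2)
B_factored = (p2 - p1) * p1 * p2 * (R1 - R2)
print(np.allclose(B_direct, B_factored))                   # True
```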
6.3 Experiments with an Inconsistent Generative Model

In the experiments with synthetic and real data, we use a generative model to obtain the curvature information for the metric. In particular, the experiments with Gaussian data in Fig. 5a show that curvature information from a consistent model obviously improves performance. The experiments in this section examine whether the accuracy still improves when the generative model does not contain the true underlying probability density but captures only the rough configuration of the data.

Fig. 6. (a) Non-Gaussian data for nearest neighbor classification. (b) Classification performance using Gaussian generative models (Bayes) and nearest neighbor performance with the euclidean, LMNN, and GLML metrics.

Fig. 6a shows a two-dimensional projection of two-class five-dimensional data generated from two Gaussian mixtures, where each class cannot be modeled using only a single Gaussian. The remaining three dimensions contain an isotropic Gaussian, so any useful information for discrimination lies in the presented two-dimensional space. The ellipsoids for the two classes represent the Gaussian density functions fit to the data by maximum likelihood estimation. Using the fitted Gaussian generative models \hat{p}_1(x) and \hat{p}_2(x), classification can be performed by comparing \hat{p}_1(x) \gtrless \hat{p}_2(x). In Fig. 6b, "Bayes" shows the accuracy of this classification, which is outperformed by euclidean nearest neighbor classification due to the inconsistency of the model.

Compared with the LMNN result, GLML greatly improves the accuracy even though it uses information from the inconsistent generative model. The result shows that even when the generative model is not flexible enough to capture the detailed structure of the true densities, rough information about the configuration helps improve the accuracy of nearest neighbor classification.

In our experiments, the generative model using one Gaussian captures the global structure of the data even though the data themselves are not exactly from a single Gaussian. This compensates for a fundamental weakness of nearest neighbor methods, which use only local data. In particular, in high dimensional spaces, the distance information is easily corrupted by a small amount of ambient noise, but by using a metric derived from the global structure, the noise is efficiently reduced. Nearest neighbor classification is designed to capture the detailed boundaries between classes using local information; the information it needs from elsewhere is a rough guideline of which directions deserve more focus. We understand the cascade of generative and discriminative information to be cooperative precisely because each model captures different information: the rough global structure for the generative model and the detailed local structure for the discriminative one.

Fig. 7. Isomap embedding with (a) euclidean metric, (b) Large Margin Nearest Neighbor (LMNN) metric, (c) GLML metric, and (d) t-Distributed Stochastic Neighbor Embedding (t-SNE). The inset figures represent the label distribution.

6.4 Embedding Experiments

In this section, we show the Isomap embeddings obtained when different metrics are used. Isomap captures local distance information and finds the global manifold structure preserving the local distances [31]. In our experiment, we used Isomap to embed three handwritten digit classes from MNIST [32], including the digits 3 and 8. When the euclidean distance is used, in Fig. 7a, Isomap captures the slope structure of the digits along the horizontal axis: the slope of the writing on the right-hand side tends to be rotated more clockwise than the slope on the left-hand side. In terms of separation between classes, the three digits are roughly separated in this embedding, but there is overlap throughout as well.

LMNN aims at finding the metric that maximally separates different classes. The LMNN embedding in Fig. 7b shows a clearer separation between the digits than the euclidean embedding in Fig. 7a, but it still shows overlap around the boundaries. Given its motivation, GLML does not intentionally separate different classes; instead, it finds the metric that reconstructs the asymptotic classification results. Nevertheless, in Fig. 7c, the classes are more clearly separated under GLML than in the LMNN embedding, and the slope structure appears along the horizontal axis as well. In fact, the embedding shows an unfolding of the manifold in which the slope of the characters rotates clockwise as we move along the horizontal axis from the left to the center, and rotates back counterclockwise as we move from the center to the right. This unfolding is due to the metric separating the different classes while simultaneously preserving the slope structure.

Here we used a single Gaussian generative model for each class, as before. A result of a conventional embedding method, t-Distributed Stochastic Neighbor Embedding (t-SNE), is also presented in Fig. 7d for comparison. The embedding using t-SNE is known to find clustering structure without using the labels [33]. According to the figure, t-SNE finds three big clusters, each of which corresponds to a single digit. However, within these clearly separated clusters, digits belonging to other clusters appear at many points; this can be contrasted with our Gaussian GLML results, in which the different classes are separated far more cleanly.

7 EXTENSION TO k-NEAREST NEIGHBOR CLASSIFICATION

In this section, we derive the GLML metric for k-nearest neighbor classification. Compared to nearest neighbor classification, k-nearest neighbor classification must combine the expectations of the combinatorial cases arising from the majority voting strategy. Using the expected error of these combinatorial cases and their deviations from the asymptotic expectations, we derive the metric for k-nearest neighbor classification.

First, the asymptotic expected error for classifying a datum at x is as follows, where the class posteriors are \mu_1 = p_1(x)/(p_1(x) + p_2(x)) and \mu_2 = p_2(x)/(p_1(x) + p_2(x)):

E_k(x) = \mu_1 \sum_{ \{c_1, ..., c_k \,|\, \sum_i \delta_{c_i = 1} < \sum_i \delta_{c_i = 2} \} } \prod_{i=1}^{k} \mu_{c_i} + \mu_2 \sum_{ \{c_1, ..., c_k \,|\, \sum_i \delta_{c_i = 1} > \sum_i \delta_{c_i = 2} \} } \prod_{i=1}^{k} \mu_{c_i}
 = \mu_1 \sum_{l=0}^{(k-1)/2} \binom{k}{l} \mu_1^l \mu_2^{k-l} + \mu_2 \sum_{l=0}^{(k-1)/2} \binom{k}{l} \mu_2^l \mu_1^{k-l},   (20)

where \mu_1 and \mu_2 are the posteriors that x has the Class 1 and Class 2 labels, respectively, c_j is the class of the jth nearest neighbor, and \{c_1, ..., c_k \,|\, \sum \delta_{c_i = 1} \lessgtr \sum \delta_{c_i = 2}\} is the set of all k-tuples of labels in which the number of Class 1 labels is less (or greater) than the number of Class 2 labels. An error occurs when x has Class 1 and majority voting classifies x as Class 2, under the condition \sum \delta_{c_i = 1} < \sum \delta_{c_i = 2}; the expectation of this error is the first term of Eq. (20). The other error occurs when x has Class 2 and majority voting classifies x as Class 1, which gives the second term.

We now consider Eq. (20) with the perturbed probability \tilde{p}_c(x) = p_c(x) + a \nabla^2 p_c(x), the expected probability at the location of a nearest neighbor. Here a is a constant representing the deviation of the probability density at the nearest neighbor point. To find the deviation of the posteriors \mu_1 and \mu_2, we can use the same constant a regardless of the class c, because we consider the same nearest neighbor position for both classes. For the posteriors \tilde{\mu}_1 and \tilde{\mu}_2 at a nearest neighbor point, we use \tilde{p}_1 and \tilde{p}_2 to calculate the deviation for each of the nearest neighbors. Differentiating with respect to a at a = 0,

\frac{\partial}{\partial a} \Big[ \frac{\tilde{p}_1}{\tilde{p}_1 + \tilde{p}_2} \Big] = \frac{ (p_1 + p_2) \nabla^2 p_1 - p_1 (\nabla^2 p_1 + \nabla^2 p_2) }{ (p_1 + p_2)^2 } = \frac{ p_2 \nabla^2 p_1 - p_1 \nabla^2 p_2 }{ (p_1 + p_2)^2 },   (21)
\frac{\partial}{\partial a} \Big[ \frac{\tilde{p}_2}{\tilde{p}_1 + \tilde{p}_2} \Big] = - \frac{ p_2 \nabla^2 p_1 - p_1 \nabla^2 p_2 }{ (p_1 + p_2)^2 }.   (22)

Now consider the modified posterior \tilde{\mu}_c^{(i)} of class c at the ith nearest neighbor, with constant a^{(j)} for the jth nearest neighbor, which varies according to the location of each nearest neighbor for j = 1, 2, ..., k. Substituting the modified posteriors into the products of probabilities in Eq. (20) and expanding to first order gives

E_k(x; \tilde{\mu}) \approx E_k(x) + \sum_{j=1}^{k} a^{(j)} \Delta_j,   (23)-(24)

where each correction term \Delta_j contains exactly one derivative factor \partial \tilde{\mu}_{c_j}^{(j)} / \partial a. All terms in the deviation therefore contain one c_j per term and, according to Eqs. (21) and (22), all of them share the single metric-dependent quantity p_2 \nabla^2 p_1 - p_1 \nabla^2 p_2. In order to minimize the amount of deviation for k-nearest neighbor classification, the metric has to be chosen to minimize | p_2 \nabla^2 p_1 - p_1 \nabla^2 p_2 |. Thus, the metric with the least deviation for nearest neighbor classification also minimizes the deviation for k-nearest neighbor classification.
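Eq. (20) can be evaluated directly from the local posterior; the sketch below, our own, does so for odd k and shows the error dropping from twice the Bayes error at k = 1 toward the Bayes error as k grows.

```python
# Direct evaluation (ours) of the k-NN asymptotic error of Eq. (20) from
# the local posteriors mu1 and mu2 = 1 - mu1, for odd k.
from math import comb

def knn_asymptotic_error(mu1, k):
    """P(majority of k i.i.d. neighbors disagrees with the query label)."""
    mu2 = 1.0 - mu1
    half = (k - 1) // 2
    e1 = sum(comb(k, l) * mu1**l * mu2**(k - l) for l in range(half + 1))
    e2 = sum(comb(k, l) * mu2**l * mu1**(k - l) for l in range(half + 1))
    return mu1 * e1 + mu2 * e2

for k in (1, 3, 5, 25):
    print(k, knn_asymptotic_error(0.3, k))   # 0.42 at k=1, toward Bayes 0.3
```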
7.1 Experiments with Caltech 101 and Caltech 256

In this section, we present experimental results on Caltech 101 and Caltech 256 [34], which contain images of 101 and 256 different object categories, respectively. For Caltech 101, we first use a bag of words of Speeded Up Robust Features (SURF) [35], and Fig. 8a shows how our algorithm increases the classification accuracy of nearest neighbor classification.

Fig. 8. Nearest neighbor classification results with various metrics and 5 nearest neighbors. The nearest neighbor search uses the GLML, PLML, and DML-eig metrics for (a) the Caltech 101 dataset with SURF features, (b) the Caltech 101 dataset with trained AlexNet hidden-layer features, and (c) the Caltech 256 dataset with trained AlexNet hidden-layer features.

In this figure, we report the average and standard error of 5-fold experiments with k = 5, using 70 percent randomly selected training data and the remaining 30 percent as test data for each class. The SURF features are obtained from the training data, and they are clustered to produce a bag of words. Using the centers of 500 clusters, we converted all images into 500-dimensional vectors, and from the 500-dimensional data we extracted lower-dimensional data using principal component analysis (PCA). We show the results of the original nearest neighbor classification and of nearest neighbor classification using the GLML, PLML, and DML-eig metrics on the data with 101 labels. According to Fig. 8a, DML-eig shows accuracy very similar to the original 5-nearest neighbor classification, and PLML shows some increase at 500 dimensions, but at the cost of computation time: 33 hours, compared with minutes for the entire GLML experiment.

In Fig. 8b, we used the features from AlexNet with weights trained on ImageNet [36]. The 4096 features of the second fully-connected layer are used, and the dimensionality is reduced using PCA. For the same Caltech 101 data used in Fig. 8a, this setting yields accuracy above 80 percent with plain 5-nearest neighbor classification, and with the GLML metric the accuracy increases by the largest margin among the results from the various metrics. In Fig. 8c, we applied the same setting to the Caltech 256 data, with more data as well as more labels. We again used the average of 5-fold experiments, but with Caltech 256 we used 90 percent of the data for training and 10 percent for testing, because the 10 percent test set already includes enough data. In all experiments, GLML increases the original 5-nearest neighbor classification accuracy, while the other recent metric learning methods fail to increase the accuracy with large numbers of data and labels.

8 CONCLUSIONS

In this study, we showed how a local metric for nearest neighbor classification can be learned using generative models. Our experiments show improvement over competitive methods on a number of experimental datasets. The learning algorithm is derived from an analysis of the asymptotic performance of the nearest neighbor classifier, such that the optimal metric minimizes the bias of the expected performance of the classifier. This connection to generative models is very powerful, as it can easily be extended to handle missing data, one of the major advantages of generative models in machine learning. The use of kernel methods in conjunction with our method is also straightforward. We used simple Gaussians for the generative models but could easily extend them to other possibilities, such as mixture models, hidden Markov models, or other dynamic generative models.

APPENDIX A
AVERAGE PROBABILITY DENSITY AT THE NEAREST NEIGHBOR POINT

In this appendix, we derive the average probability density on the surface of the hypersphere where a nearest neighbor is found, as in Fig. 3. Due to the local curvature structure of the density function, the expectation of the probability density p(x_NN) at a nearest neighbor point is perturbed, and here we derive the deviation in Eq. (6) from the density p(x_0) at the point of interest x_0. First, we consider the volume of a D-dimensional hypersphere with radius r,

V = \frac{ \pi^{D/2} r^D }{ \Gamma(D/2 + 1) },   (25)

and the volume of its surface shell with an infinitesimal thickness dr,

\frac{d}{dr} \Big[ \frac{ \pi^{D/2} r^D }{ \Gamma(D/2 + 1) } \Big] dr = \frac{ D \pi^{D/2} r^{D-1} }{ \Gamma(D/2 + 1) } dr.   (26)
Finally, we consider the integration of (x_0 - x)^T \nabla\nabla p(x) (x_0 - x) over the ball, with Hessian matrix \nabla\nabla p(x). We use the eigenvectors of the Hessian, u_1, ..., u_D, as bases with coefficients b_1, ..., b_D satisfying x_0 - x = b_1 u_1 + ... + b_D u_D, to represent the integration as

\int (x_0 - x)^T \nabla\nabla p(x) (x_0 - x) \, dx_0   (27)
 = \int_{ \sqrt{b_1^2 + ... + b_D^2} \le d } (b_1 u_1 + ... + b_D u_D)^T \nabla\nabla p(x) (b_1 u_1 + ... + b_D u_D) \, db_1 ... db_D   (28)
 = \int_{ \sqrt{b_1^2 + ... + b_D^2} \le d } \sum_{i=1}^{D} \lambda_i b_i^2 \, db_1 ... db_D,   (29)

with the eigenvalues of the Hessian, \lambda_1, ..., \lambda_D.

With the integration of Eq. (29) and the derivative with respect to the radius, we obtain the average of the quadratic form on the surface:

\frac{ \pi^{D/2} r^{D+1} \sum_i \lambda_i }{ \Gamma(D/2 + 1) } dr \Big/ \text{Volume of the surface}   (30)
 = \frac{ \pi^{D/2} r^{D+1} \sum_i \lambda_i }{ \Gamma(D/2 + 1) } dr \Big/ \frac{ D \pi^{D/2} r^{D-1} }{ \Gamma(D/2 + 1) } dr   (31)
 = \frac{r^2}{D} \nabla^2 p |_{x_0}.   (32)

Here, we used the Laplacian as the sum of the Hessian eigenvalues; in other words, \nabla^2 p = tr[\nabla\nabla p] = \sum_i \lambda_i.

Now the average of the probability density on the surface of a hypersphere can be derived as follows. We consider a D-dimensional hypersphere of radius d centered at x, and a point x_0 on the surface, close to x. The probability density p(x_0) can be approximated using the second order Taylor series

p(x_0) \approx p(x) + \nabla p(x)^T (x_0 - x) + \frac{1}{2} (x_0 - x)^T \nabla\nabla p(x) (x_0 - x).   (33)

We consider the average of p(x_0) over the surface of the hypersphere,

\langle p(x_0) \rangle_{S(D,r)} = p(x) + \langle \nabla p|_x^T (x_0 - x) \rangle_{S(D,r)} + \frac{1}{2} \langle (x_0 - x)^T \nabla\nabla p|_x (x_0 - x) \rangle_{S(D,r)},   (34)

where S(D, r) is the surface of the D-dimensional hypersphere with radius r. The average \langle p(x) \rangle_{S(D,r)} = p(x) because p(x) is constant over the average, and the second term \langle \nabla p|_x^T (x_0 - x) \rangle_{S(D,r)} = 0 due to symmetry. The average of the third term is \frac{1}{2} \langle (x_0 - x)^T \nabla\nabla p|_x (x_0 - x) \rangle_{S(D,r)} = \frac{r^2}{2D} \nabla^2 p|_x, yielding the expectation of the probability density on the surface

\langle p(x_0) \rangle_{S(D,r)} = p(x) + \frac{r^2}{2D} \nabla^2 p|_x.   (35)

ACKNOWLEDGMENTS

Y.-K. Noh was supported by grants from the U.S. Air Force Office of Scientific Research, the National Security Research Institute in Korea, and the SNU-MAE BK21+ program. B.-T. Zhang was supported by grants from the U.S. Air Force Office of Scientific Research and the Korea government (IITP, KEIT). D. D. Lee was supported by grants from the U.S. National Science Foundation, the U.S. Department of Transportation, the U.S. Air Force Office of Scientific Research, the U.S. Army Research Laboratory, and the U.S. Office of Naval Research.

REFERENCES

[1] T. Jaakkola and D. Haussler, "Exploiting generative models in discriminative classifiers," in Proc. Advances in Neural Information Processing Systems, 1998.
[2] A. Ng and M. Jordan, "On discriminative versus generative classifiers: A comparison of logistic regression and naive Bayes," in Proc. Advances in Neural Information Processing Systems, 2001.
[3] S. Lacoste-Julien, F. Sha, and M. Jordan, "DiscLDA: Discriminative learning for dimensionality reduction and classification," in Proc. Advances in Neural Information Processing Systems, 2009.
[4] J. Lasserre, C. Bishop, and T. Minka, "Principled hybrids of generative and discriminative models," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, 2006.
[5] R. Raina, Y. Shen, A. Ng, and A. McCallum, "Classification with hybrid generative/discriminative models," in Proc. Advances in Neural Information Processing Systems, 2004.
[6] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon, "Information-theoretic metric learning," in Proc. 24th Int. Conf. Machine Learning, 2007.
[7] A. Globerson and S. Roweis, "Metric learning by collapsing classes," in Proc. Advances in Neural Information Processing Systems, 2006.
[8] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," in Proc. Advances in Neural Information Processing Systems, 2005.
[9] C. Shen, J. Kim, L. Wang, and A. van den Hengel, "Positive semidefinite metric learning with boosting," in Proc. Advances in Neural Information Processing Systems, 2009.
[10] K. Weinberger, J. Blitzer, and L. Saul, "Distance metric learning for large margin nearest neighbor classification," in Proc. Advances in Neural Information Processing Systems, 2006.
[11] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, "Large scale online learning of image similarity through ranking," J. Machine Learning Research, vol. 11, 2010.
[12] S. Trivedi, D. McAllester, and G. Shakhnarovich, "Discriminative metric learning by neighborhood gerrymandering," in Proc. Advances in Neural Information Processing Systems, 2014.
[13] J. Wang, A. Kalousis, and A. Woznica, "Parametric local metric learning for nearest neighbor classification," in Proc. Advances in Neural Information Processing Systems, 2012.
[14] D. Kedem, S. Tyree, F. Sha, G. R. Lanckriet, and K. Q. Weinberger, "Non-linear metric learning," in Proc. Advances in Neural Information Processing Systems, 2012.
[15] Y. Ying and P. Li, "Distance metric learning with eigenvalue optimization," J. Machine Learning Research, vol. 13, pp. 1-26, Jan. 2012.
[16] M. N. Goria, N. N. Leonenko, V. V. Mergel, and P. Inverardi, "A new class of random vector entropy estimators and its applications in testing statistical hypotheses," J. Nonparametric Statistics, vol. 17, no. 3, 2005.
[17] F. Perez-Cruz, "Estimation of information theoretic measures for continuous random variables," in Proc. Advances in Neural Information Processing Systems, 2009.
[18] Q. Wang, S. R. Kulkarni, and S. Verdu, "A nearest-neighbor approach to estimating divergence between continuous random vectors," in Proc. IEEE Int. Symp. Information Theory, 2006.
[19] Q. Wang, S. R. Kulkarni, and S. Verdu, "Divergence estimation for multidimensional densities via k-nearest-neighbor distances," IEEE Trans. Information Theory, vol. 55, no. 5, May 2009.
[20] K. Fukunaga and R. Short, "The optimal distance measure for nearest neighbour classification," IEEE Trans. Information Theory, vol. 27, no. 5, pp. 622-627, Sep. 1981.
[21] K. Fukunaga and T. E. Flick, "An optimal global nearest neighbour measure," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 6, no. 3, May 1984.
[22] R. R. Snapp and S. S. Venkatesh, "Asymptotic expansions of the k nearest neighbor risk," Annals of Statistics, vol. 26, no. 3, 1998.
[23] K. Fukunaga, Introduction to Statistical Pattern Recognition. San Diego, CA, USA: Academic Press, 1990.
[24] K. Das and Z. Nenadic, "Approximate information discriminant analysis: A computationally simple heteroscedastic feature extraction technique," Pattern Recognition, vol. 41, no. 5, 2008.
[25] Z. Nenadic, "Information discriminant analysis: Feature extraction with an information-theoretic objective," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 8, Aug. 2007.
[26] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. Hoboken, NJ, USA: Wiley-Interscience, 2000.
[27] M. Loog and R. Duin, "Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 6, Jun. 2004.
[28] J. Kapur, Measures of Information and Their Applications. New York, NY, USA: Wiley, 1994.

[29] R. Fletcher and M. Powell, "On the modification of LDL^T factorizations," Mathematics of Computation, vol. 28, no. 128, pp. 1067-1087, 1974.
[30] J. R. Bunch, C. P. Nielsen, and D. Sorensen, "Rank-one modification of the symmetric eigenproblem," Numerische Mathematik, vol. 31, pp. 31-48, 1978.
[31] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, 2000.
[32] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits." [Online]. Available: http://yann.lecun.com/exdb/mnist/
[33] L. van der Maaten and G. Hinton, "Visualizing high-dimensional data using t-SNE," J. Machine Learning Research, vol. 9, pp. 2579-2605, 2008.
[34] L. Fei-Fei, R. Fergus, and P. Perona, "One-shot learning of object categories," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594-611, Apr. 2006.
[35] T. Tuytelaars, L. Van Gool, H. Bay, and A. Ess, "SURF: Speeded up robust features," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Advances in Neural Information Processing Systems, 2012.

Yung-Kyun Noh received the BS degree in physics from POSTECH in 1998 and the PhD degree in computer science from Seoul National University in 2011. He is currently a BK assistant professor in the School of Mechanical and Aerospace Engineering, Seoul National University. His research interests include metric learning and dimensionality reduction in machine learning, and he is especially interested in applying the statistical theory of nearest neighbors to real, large datasets. From 1998 to 2003, he was a member of the Marketing Research Group, MPC Ltd., Korea, where he worked as a researcher and a system engineer. He worked in the GRASP Robotics Laboratory, University of Pennsylvania, as a visiting researcher from 2007 to 2011.

Daniel D. Lee received the BA (summa cum laude) degree in physics from Harvard University in 1990 and the PhD in condensed matter physics from the Massachusetts Institute of Technology in 1995. He is the UPS Foundation chair professor in the School of Engineering and Applied Science, University of Pennsylvania. Before coming to Penn, he was a researcher with AT&T and Lucent Bell Laboratories in the Theoretical Physics and Biological Computation departments. He has received the National Science Foundation CAREER award and the University of Pennsylvania Lindback award for distinguished teaching. He was also a fellow of the Hebrew University Institute of Advanced Studies in Jerusalem, an affiliate of the Korea Advanced Institute of Science and Technology, and organized the US-Japan National Academy of Engineering Frontiers of Engineering symposium. As director of the GRASP Laboratory and co-director of the CMU-Penn University Transportation Center, his group focuses on understanding general computational principles in biological systems and on applying that knowledge to build autonomous systems. He is a fellow of the IEEE.

Byoung-Tak Zhang (M'95) received the BS and MS degrees in computer science and engineering from Seoul National University (SNU), Seoul, Korea, in 1986 and 1988, respectively, and the PhD degree in computer science from the University of Bonn, Bonn, Germany, in 1992. He is the POSCO chair professor in the School of Computer Science and Engineering and adjunct professor in the Graduate Programs in Cognitive Science and Brain Science, Seoul National University (SNU), and directs the Biointelligence Laboratory and the Center for Cognitive Robotics and Artificial Intelligence (CRAIC).
Prior to joining SNU, he was a research fellow at the German National Research Center for Information Technology (GMD) from 1992 to 1995. He was a visiting professor in the Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts, and at the Cognitive Interaction Technology Excellence Center (CITEC), Bielefeld, and the Cognition for Technical Systems (CoTeSys) Cluster of Excellence, Munich, Germany (2010-2012). His research focuses on understanding the computational principles of the human brain and mind and on building autonomous intelligent robots that perceive, act, think, and learn like humans. He serves on the editorial boards of Genetic Programming and Evolvable Machines, Applied Intelligence, and the Journal of Cognitive Science, and served as an associate editor for BioSystems, Advances in Natural Computation, and the IEEE Transactions on Evolutionary Computation.
