SELECTION OF THE NUMBER OF NEIGHBOURS OF EACH DATA POINT FOR THE LOCALLY LINEAR EMBEDDING ALGORITHM

ISSN 1392-124X INFORMATION TECHNOLOGY AND CONTROL, 2007, Vol. 36, No. 4

Rasa Karbauskaitė 1,2, Olga Kurasova 1,2, Gintautas Dzemyda 1,2

1 Institute of Mathematics and Informatics, Akademijos St. 4, 08663 Vilnius, Lithuania
2 Vilnius Pedagogical University, Studentų St. 39, 08106 Vilnius, Lithuania

Abstract. This paper deals with a method called locally linear embedding (LLE). It is a nonlinear dimensionality reduction technique that computes low-dimensional, neighbourhood-preserving embeddings of high-dimensional data and attempts to discover nonlinear structure in such data. The implementation of the algorithm is fairly straightforward, as the algorithm has only two control parameters: the number of neighbours of each data point and the regularisation parameter. The mapping quality is quite sensitive to these parameters. In this paper, we propose a new way of selecting the number of the nearest neighbours of each data point. Our approach is experimentally verified on two data sets: artificial data and real-world pictures.

Keywords: locally linear embedding; dimensionality reduction; manifold learning.

1. Introduction

Data coming from the real world are often difficult to understand because of their high dimensionality. A number of dimensionality reduction techniques have been proposed that allow the user to better analyse or visualize complex data sets. Dimensionality reduction techniques may be divided into two classes. The first class consists of linear methods, such as principal component analysis (PCA) [7] or classical scaling [4, 5]. However, the underlying structure of real data is often highly nonlinear and hence cannot be approximated by linear manifolds. The second class includes nonlinear algorithms, such as nonlinear variants of multidimensional scaling (MDS) [4, 5], the self-organising map (SOM) [8], generative topographic mapping (GTM) [3], principal curves and surfaces [6], etc.
Several nonlinear manifold learning methods, namely locally linear embedding (LLE) [11, 12], Isomap [13], and Laplacian Eigenmaps [2], have been developed recently. These methods are intended to overcome the difficulties experienced with the classical nonlinear approaches mentioned above: they are simple to implement, have a very small number of free parameters, and do not get trapped in local minima. These algorithms are able to recover the intrinsic geometric structure of a broad class of nonlinear data manifolds and come in two flavours: local and global. Local approaches (e.g., LLE, Laplacian Eigenmaps) attempt to preserve the local geometry of the data; in particular, they seek to map nearby points on the manifold to nearby points in the low-dimensional representation. Global approaches (e.g., Isomap) attempt to preserve geometry at all scales, by mapping nearby points on the manifold to nearby points in a low-dimensional space, and faraway points to faraway points.

In this paper, we concentrate on the LLE algorithm. What are the advantages of LLE compared with PCA and MDS? Dimensionality reduction by LLE succeeds in identifying the underlying structure of the manifold, while PCA and MDS map faraway data points on the manifold to nearby points in the plane, failing to identify the structure. Unlike MDS, LLE eliminates the need to estimate pairwise distances between widely separated data points. This is illustrated in Figure 2 by mapping a nonlinear two-dimensional S-manifold (Figure 1).

Figure 1. A nonlinear S-manifold consisting of 1000 data points

Figure 2. Embeddings of the S-manifold, obtained by different methods: a) the result obtained by LLE, b) the result obtained by PCA, c) the result obtained by MDS

The main control parameter of the LLE algorithm is the number of neighbours of each data point. This parameter strongly influences the results obtained. We propose here a new way of selecting the number of the nearest neighbours of each data point and apply LLE to the visualization of high-dimensional data.

2. Locally linear embedding method

Locally linear embedding (LLE) [11, 12] is a nonlinear method for dimensionality reduction and manifold learning. Given a set of data points distributed on a manifold in a high-dimensional space, LLE is able to project the data to a lower-dimensional space by unfolding the manifold. LLE assumes that the manifold is well sampled, i.e., there are enough data that each data point and its neighbours lie on or close to a locally linear patch. Therefore, a data point can be approximated as a weighted linear combination of its neighbours. The basic idea of LLE is that such a linear combination is invariant under linear transformations (translation, rotation, and scaling) and, therefore, should remain unchanged after the manifold has been unfolded to a low-dimensional space. The low-dimensional configuration of data points is obtained by solving two constrained least squares optimisation problems.

The input of the LLE algorithm consists of $m$ $n$-dimensional vectors $X_i$, $i = 1, \ldots, m$ ($X_i \in R^n$). The output consists of $m$ $d$-dimensional vectors $Y_i$, $i = 1, \ldots, m$ ($Y_i \in R^d$). The LLE algorithm has three steps.

In the first step, one identifies $k$ neighbours of each data point $X_i$. Different criteria for neighbour selection can be adopted; the simplest possibility is to choose the $k$ nearest neighbours according to the Euclidean distance.

In the second step, one computes the weights $w_{ij}$ that best reconstruct each data point $X_i$ from its neighbours $X_{N(1)}, \ldots, X_{N(k)}$, minimizing the following error function:

$$E(W) = \sum_{i=1}^{m} \Big\| X_i - \sum_{j=1}^{k} w_{ij} X_{N(j)} \Big\|^2,$$

subject to the constraints $\sum_{j=1}^{k} w_{ij} = 1$ and $w_{ij} = 0$ if $X_i$ and $X_j$ are not neighbours.
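The first two steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names `solve_linear` and `lle_weights`, the default value of `reg`, and the choice to scale the regularisation term by the trace of the local Gram matrix are all assumptions made here. For one point, the constrained problem reduces to solving $C w = \mathbf{1}$ with $C_{ab} = (X_i - X_{N(a)}) \cdot (X_i - X_{N(b)})$ and then rescaling $w$ so that it sums to one.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def lle_weights(points, k, reg=1e-3):
    """Return, for every point, (neighbour indices, reconstruction weights)."""
    result = []
    for i, xi in enumerate(points):
        # Step 1: k nearest neighbours by Euclidean distance (point itself excluded).
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(xi, points[j]))
        nbrs = order[:k]
        # Local Gram matrix of the offsets X_i - X_N(a).
        diffs = [[p - q for p, q in zip(xi, points[j])] for j in nbrs]
        c = [[sum(u * v for u, v in zip(da, db)) for db in diffs] for da in diffs]
        # Assumed regularisation: add a small multiple of the trace to the diagonal.
        trace = sum(c[a][a] for a in range(k))
        eps = reg * trace if trace > 0 else reg
        for a in range(k):
            c[a][a] += eps
        # Step 2: solve C w = 1, then rescale so that the weights sum to one.
        w = solve_linear(c, [1.0] * k)
        total = sum(w)
        result.append((nbrs, [v / total for v in w]))
    return result


# Four collinear points: an interior point is reconstructed from its two
# adjacent neighbours with equal weights.
for nbrs, w in lle_weights([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)], k=2):
    print(nbrs, [round(v, 3) for v in w])
```

The regularisation term matters when $k$ exceeds the data dimension, because the local Gram matrix is then singular; this is the second control parameter mentioned in the abstract.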
Computing the weights is a typical constrained least squares optimisation problem, which can be easily solved via a linear system of equations.

The third step consists in mapping each data point $X_i$ to a low-dimensional vector $Y_i$ that best preserves the high-dimensional neighbourhood geometry represented by the weights $w_{ij}$. That is, the weights are fixed and we need to minimize the following function:

$$\Phi(Y) = \sum_{i=1}^{m} \Big\| Y_i - \sum_{j=1}^{k} w_{ij} Y_{N(j)} \Big\|^2,$$

subject to two constraints: $\sum_{i=1}^{m} Y_i = 0$ and $\frac{1}{m} \sum_{i=1}^{m} Y_i Y_i^T = I$, where $I$ is the $d \times d$ identity matrix. These constraints provide a unique solution. The most straightforward method for computing the $d$-dimensional coordinates ($d < n$) is to find the bottom $d + 1$ eigenvectors of the sparse matrix

$$M = (I - W)^T (I - W),$$

where $W = (w_{1j}, w_{2j}, \ldots, w_{mj})$, $j = 1, \ldots, k$. These eigenvectors are associated with the $d + 1$ smallest eigenvalues of $M$. The bottom eigenvector, whose eigenvalue is closest to zero, is the unit vector with all equal components, and it is discarded. The remaining $d$ eigenvectors form the $d$ embedding coordinates found by LLE.

3. Selection of the number of the nearest neighbours

The most important step for the success of LLE is the first one, that is, defining the number $k$ of the nearest neighbours for each data point. The mapping quality is rather sensitive to this parameter. If $k$ is set too small, the continuous manifold can falsely be divided into disjoint sub-manifolds, so that the mapping does not reflect any global properties (Figure 3, for example $k = 5$). If $k$ is too high, the large number of nearest neighbours causes smoothing or elimination of small-scale structures in the manifold; the mapping loses its nonlinear character (Figure 3, for example $k = 100$) and behaves like traditional PCA (Figure 2b).

Figure 3. Embeddings of the 2-dimensional S-manifold, computed by LLE for different choices of the number of the nearest neighbours k (panels: k = 5, 6, 7, 8, 10, 15, 20, 30, 31, 40, 70, 100)

Figure 4. Embeddings of the S-manifolds with LLE: a) m = 1000, k = 50; b) m = 2000, k = 50

The results of LLE [12] are typically stable over some range of neighbourhood sizes. Figure 3 shows a range of embeddings discovered by the LLE algorithm, all on the same data set, but using different numbers of the nearest neighbours $k$. A reliable embedding is obtained over a wide range of values, i.e., $k \in [8; 30]$. However, as mentioned in [12], the size of that range depends on various features of the data, such as the sampling density and the manifold geometry. The dependence of LLE results on sampling density is shown in Figure 4. Two 2-dimensional S-manifolds were investigated, one consisting of 1000 points and the other of 2000 points. In both cases, the embeddings were computed with $k = 50$. LLE failed to unravel the S-manifold of 1000 points and succeeded in unravelling the manifold of 2000 points.

If the structure of the manifold is known in advance, we can rely on a subjective evaluation by visual inspection. But what can we say about the reliability of the embeddings computed using a certain value of the parameter $k$ when the structure of the manifold is not clear? To assess the embeddings, it is necessary to use quantitative numerical measures. Spearman's rho or the residual variance is commonly used for estimating topology preservation in dimensionality reduction. Automatic selection of the number of the nearest neighbours was proposed in [9].
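The failure mode for small $k$ described in this section, where the manifold falls apart into disjoint sub-manifolds, can be checked before running LLE by counting the connected components of the $k$-nearest-neighbour graph. The sketch below is a diagnostic added here for illustration, not part of the method proposed in the paper; the function names and the toy data are hypothetical.

```python
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph as adjacency sets (Euclidean distance)."""
    n = len(points)
    adj = [set() for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def count_components(adj):
    """Number of connected components, found by depth-first search."""
    seen = set()
    components = 0
    for start in range(len(adj)):
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return components


# Toy data: two tight groups of three points, far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 0.0), (5.1, 0.0), (5.2, 0.0)]
for k in (2, 3):
    print(k, count_components(knn_graph(points, k)))
```

With $k = 2$ every neighbour lies in the same group, so the graph has two components; with $k = 3$ each point must reach into the other group and the graph becomes connected. A connected graph is only a necessary condition for a sensible embedding, since a large $k$ connects the graph while also smoothing away the nonlinear structure.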

3.1. A new way for selecting a proper range of neighbourhood sizes

As shown in Figure 3, it is not necessary to find the optimal number of the nearest neighbours; it is enough to estimate a proper range of neighbourhood sizes. In this paper, we propose a new way of solving this problem. In order to quantitatively estimate the topology preservation, we compute Spearman's rho. It estimates the correlation of rank-order data, i.e., how well the corresponding low-dimensional projection preserves the order of the pairwise distances between the high-dimensional data points converted to ranks. Spearman's rho is computed by the following equation:

$$\rho = 1 - \frac{6 \sum_{i=1}^{T} \big( r_x(i) - r_y(i) \big)^2}{T^3 - T},$$

where $T$ is the number of distances to be compared, and $r_x(i)$ and $r_y(i)$ are the ranks of the pairwise distances calculated for the original and projected data points. The value of $\rho$ lies in $[-1, 1]$; the best value of Spearman's rho is equal to one.

In the calculation of Spearman's rho, distances both on the plane and in the multidimensional space are used. The question arises which distances should be evaluated when estimating Spearman's rho: Euclidean or geodesic? Euclidean distances are usually used on the plane. In the multidimensional space, either Euclidean or geodesic distances are applied. Geodesic distances represent the shortest paths along the curved surface of the manifold. The author in [1] states that the Euclidean distance is not good for finding the shortest path between points within the framework of the manifold. The paper [13] states that it is necessary to apply geodesic distances in order to preserve the global structure of the manifold. It is reasonable to use Euclidean distances when the manifold is flat; therefore, in further experiments we will always evaluate only Euclidean distances on the plane.

The S-manifold ($m = 1000$) has been investigated. The LLE algorithm was run many times, gradually increasing the number of neighbours $k \in [5; 100]$ and each time calculating Spearman's rho (Figure 5).
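The rank-correlation formula above translates directly into code. The following standard-library sketch is illustrative only: the function names are chosen here, and tied distances are given their average rank, a detail the text does not specify (with ties, the closed-form expression is an approximation).

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def ranks(values):
    """Ranks starting at 1; tied values share their average rank (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            r[order[t]] = avg
        pos = end + 1
    return r


def spearman_rho(dist_x, dist_y):
    """rho = 1 - 6 * sum((r_x(i) - r_y(i))^2) / (T^3 - T)."""
    t = len(dist_x)
    rx, ry = ranks(dist_x), ranks(dist_y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (t ** 3 - t)


# Hypothetical data: a projection that keeps the order of all pairwise distances.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
projected = [(0.0,), (2.0,), (6.0,), (14.0,)]
print(spearman_rho(pairwise_distances(original), pairwise_distances(projected)))
```

Because only ranks enter the formula, the uniformly stretched projection in the example scores the best possible value, while a projection that fully reverses the distance order would score the worst.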
Two dependences of Spearman's rho on $k$ have been obtained: (I) the Euclidean distances were evaluated in the multidimensional space, (II) the geodesic distances were evaluated in the multidimensional space. Let the number of neighbours be $k = 100$. We see that, when estimating the Euclidean distances, the value of Spearman's rho is close to 1 (0.97), and when estimating the geodesic distances in the multidimensional space, the value of Spearman's rho is much lower (0.82). If $k = 100$, the Euclidean distances are preserved very well, but the structure of the manifold, which we wished to preserve, is destroyed (Figure 3, $k = 100$). This experiment corroborates the fact that it is indispensable to evaluate geodesic distances in the multidimensional space; therefore, we will evaluate only geodesic distances in our further experiments.

Figure 5. Dependences of Spearman's rho on k obtained after visualizing the S-manifold by LLE: (I) Euclidean distances were evaluated in the multidimensional space, (II) geodesic distances were evaluated in the multidimensional space

Only one parameter is selected in the algorithm for calculating geodesic distances: the number of the nearest neighbours necessary to draw a graph. Denote it as $k_{geod}$. The LLE algorithm has the same kind of parameter, the number of neighbours $k$. What value should $k_{geod}$ take when calculating geodesic distances? Should $k_{geod}$ coincide with the number of neighbours $k$ chosen in the LLE algorithm?

Figure 6. Dependences of Spearman's rho on k obtained after visualizing the S-manifold by LLE. Geodesic distances were evaluated in the multidimensional space with (a) $k_{geod} = 10$, (b) $k_{geod} = k$

In Figure 6, two dependences of Spearman's rho on $k$ are shown: (a) when calculating geodesic distances in the multidimensional space, a very small number of neighbours was fixed, e.g., $k_{geod} = 10$; (b) when calculating geodesic distances, the number of neighbours was varied just as in the LLE algorithm, i.e., $k_{geod} = k$. If $k = 100$, the value of Spearman's rho according to curve (a) is rather low (0.82), and the curve, having declined, rises only slightly.
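The role of $k_{geod}$ can be made concrete with a small sketch of graph-based geodesic distances: connect every point to its $k_{geod}$ nearest neighbours, weight the edges by Euclidean length, and take shortest paths in the resulting graph. The text does not say which shortest-path algorithm was used; Dijkstra's algorithm, the function name, and the five-point example are assumptions made for illustration.

```python
import heapq
import math


def geodesic_distances(points, k_geod):
    """All-pairs shortest-path lengths in the k_geod-nearest-neighbour graph."""
    n = len(points)
    adj = [dict() for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k_geod]:
            d = math.dist(points[i], points[j])
            adj[i][j] = d  # the graph is made symmetric
            adj[j][i] = d
    all_dist = []
    for src in range(n):
        # Dijkstra's algorithm from src; unreachable points keep distance inf.
        best = [math.inf] * n
        best[src] = 0.0
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > best[u]:
                continue
            for v, w in adj[u].items():
                nd = d + w
                if nd < best[v]:
                    best[v] = nd
                    heapq.heappush(heap, (nd, v))
        all_dist.append(best)
    return all_dist


# Hypothetical C-shaped curve; its two end points are 2 apart in a straight line.
curve = [(1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (1.0, 2.0)]
for k_geod in (2, 4):
    print(k_geod, round(geodesic_distances(curve, k_geod)[0][4], 3))
```

With $k_{geod} = 2$ the path between the two end points has to travel around the curve and is longer than the straight-line distance; with $k_{geod} = 4$ the end points become direct neighbours and the graph distance collapses to the Euclidean one. This is the shortcut effect that inflates Spearman's rho when $k_{geod} = k$ is large.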
The low values on curve (a) show that distances are badly retained and the mapping does not represent the global structure. Curve (b) illustrates that the value of Spearman's rho approaches 1 (0.95), which implies that the LLE result is rather good. However, it is obvious that after visualising these data by LLE with $k = 100$, the resulting mapping does not reflect the structure of the manifold (Figure 3, $k = 100$), though the value of Spearman's rho is close to 1. The reason is as follows: if a very large number of neighbours $k_{geod}$ is selected while calculating geodesic distances in the multidimensional space, then the structure of the nonlinear manifold is destroyed, i.e., the nearest neighbours of a point in the multidimensional space may be points reached by cutting across the manifold (Euclidean distances are calculated when looking for neighbours). In Figure 1, the neighbours of the point marked by a black circular disk fall into the black circle. In this case, the LLE algorithm uses as many neighbours as the calculation of geodesic distances: $k = k_{geod}$ (neighbours in the LLE algorithm are found by calculating Euclidean distances). Therefore, faraway points on the manifold are treated as close ones both in the multidimensional space and on the plane. This is the reason why the value of Spearman's rho increases with an increase in the number of the nearest neighbours. Good embeddings in Figure 3 are obtained when curve (a) in Figure 6 reaches its maximum. Therefore, Spearman's rho with a fixed, rather small $k_{geod}$ may be used as a criterion of visualization quality.

4. Application of LLE in analysis of picture set

One of the practical applications of the LLE method is the visualization of points whose coordinates are made up of the parameters of pictures. A picture is digitised, i.e., a vector consists of the colour parameters of its pixels; therefore, it is of very large dimension. A particular feature of these data is that they consist of pictures of the same object, taken while gradually turning the object by a certain angle. In this way, the points differ from one another only slightly, making up a certain manifold.
For the experiment, greyscale pictures were used, obtained by gradually rotating a duckling through 360° [10]. The number of pictures (points) was $m = 72$. The images had $128 \times 128$ greyscale pixels; therefore, the dimension of the points in the multidimensional space is $n = 16384$. The LLE algorithm was run 35 times for $k \in [2; 36]$. Each time Spearman's rho was calculated.

Three dependences of Spearman's rho on $k$ are shown in Figure 7: (I) when calculating geodesic distances in the multidimensional space, a very small number of neighbours was fixed, e.g., $k_{geod} = 2$; (II) when calculating geodesic distances, the number of neighbours was varied just as in the LLE algorithm, i.e., $k_{geod} = k$; (III) Euclidean distances were evaluated in the multidimensional space. We see that cases (I) and (II) have the highest values of Spearman's rho, i.e., $0.9 \le \rho \le 0.97$ for $k \in [2; 8]$, while case (III) has much lower values, $0.66 \le \rho \le 0.7$, for $k \in [2; 8]$. For $k \ge 9$, the values of Spearman's rho diminish considerably in case (I) ($\rho \approx 0.54$ for $k = 9$), decrease somewhat less in case (II) ($\rho \approx 0.82$ for $k = 9$), and, on the contrary, increase in case (III).

Embeddings obtained after visualising these data by LLE are presented in Figure 8. Since the object was gradually turned through 360°, it is likely that the true representation is the one obtained in Figure 8a, for $k \in [2; 8]$. Hence it follows that cases (I) and (II) yield the right result. Case (I) shows an especially clear difference between these solutions.

Figure 7. Dependences of Spearman's rho on k obtained after visualizing pictures of a rotating duckling by LLE: (I) geodesic distances were evaluated in the multidimensional space, $k_{geod} = 2$; (II) geodesic distances were evaluated in the multidimensional space, $k_{geod} = k$; (III) Euclidean distances were evaluated in the multidimensional space

Figure 8. 2-dimensional embeddings of m = 72 pictures of a rotating duckling, obtained by LLE using k nearest neighbours: a) k = 4, b) k = 9. Larger circles mark representative samples of pictures.
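The selection rule used in Sections 3.1 and 4, keeping the values of $k$ for which Spearman's rho (computed with a small fixed $k_{geod}$) is at or near its maximum, can be sketched as follows. The function name, the `tolerance` parameter, and the numbers in the example are hypothetical; the example values are made up for illustration and are not taken from Figures 6 or 7.

```python
def proper_k_range(rho_by_k, tolerance=0.02):
    """Return the sorted values of k whose rho is within `tolerance` of the maximum.

    rho_by_k maps a number of neighbours k to the Spearman's rho obtained
    for the LLE embedding computed with that k.
    """
    best = max(rho_by_k.values())
    return sorted(k for k, rho in rho_by_k.items() if rho >= best - tolerance)


# Made-up curve: low for very small k, a plateau, then a decline for large k.
rho_by_k = {5: 0.60, 8: 0.95, 10: 0.96, 20: 0.95, 40: 0.85, 100: 0.80}
print(proper_k_range(rho_by_k))
```

How close to the maximum a value must be to count as "proper" is a judgment call; the text only states that good embeddings coincide with the maximum of the curve, so the tolerance here should be read as a tunable assumption.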

5. Conclusions

In this paper, we have explored the LLE algorithm for nonlinear dimensionality reduction. The main control parameter of LLE is the number of the nearest neighbours of each data point. This parameter greatly influences the results obtained. We have proposed a new way of selecting the value of this parameter. In order to quantitatively estimate the topology preservation, we compute Spearman's rho. The experiments have shown that Spearman's rho is a suitable quantitative measure of topology preservation after visualizing the data by the LLE algorithm. For Spearman's rho to properly reflect the projections obtained, it is necessary to evaluate geodesic rather than Euclidean distances in the n-dimensional space when calculating its value, and to select a rather small number of neighbours in the geodesic distance algorithm.

Acknowledgment

The authors are very grateful to Dr. Olga Kayo and Dr. Oleg Okun from the Oulu University for their valuable remarks that allowed us to improve the quality of this paper. The research is partially supported by the Lithuanian State Science and Studies Foundation project "Information technology tools of clinical decision support and citizens wellness for e.health system" (No. B-07019).

References

[1] C.C. Aggarwal, A. Hinneburg, D.A. Keim. On the surprising behavior of distance metrics in high dimensional space. Lecture Notes in Computer Science, 1973, 2001.
[2] M. Belkin, P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In T.G. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[3] C.M. Bishop, M. Svensén, C.K.I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1): 215-234, 1998.
[4] I. Borg, P. Groenen. Modern multidimensional scaling. Springer-Verlag, Berlin, 1997.
[5] T. Cox, M. Cox. Multidimensional Scaling. Chapman & Hall, London, 1994.
[6] T. Hastie, W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84: 502-516, 1989.
[7] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1989.
[8] T. Kohonen. Self-organizing maps. Springer Series in Information Sciences. Springer-Verlag, Berlin, 1995.
[9] O. Kouropteva, O. Okun, M. Pietikäinen. Selection of the optimal parameter value for the locally linear embedding algorithm. Proc. of 2002 International Conference on Fuzzy Systems and Knowledge Discovery, 2002, 359-363.
[10] S.A. Nene, S.K. Nayar, H. Murase. Columbia Object Image Library (COIL-20). Technical Report CUCS-005-96, 1996.
[11] S.T. Roweis, L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2000, 2323-2326.
[12] L.K. Saul, S.T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. J. Machine Learning Research, 4, June 2003, 119-155.
[13] J.B. Tenenbaum, V. de Silva, J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2000, 2319-2323.

Received September 2007.