Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis


Journal of Machine Learning Research 7 (2006) 1183-1204. Submitted 12/05; Revised 3/06; Published 7/06

Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis

Jieping Ye, Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA. JIEPING.YE@ASU.EDU
Tao Xiong, Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA. TXIONG@ECE.UMN.EDU

Editor: David Madigan

Abstract

Dimensionality reduction is an important pre-processing step in many applications. Linear discriminant analysis (LDA) is a classical statistical approach for supervised dimensionality reduction. It aims to maximize the ratio of the between-class distance to the within-class distance, thus maximizing the class discrimination. It has been used widely in many applications. However, the classical LDA formulation requires the nonsingularity of the scatter matrices involved. For undersampled problems, where the data dimensionality is much larger than the sample size, all scatter matrices are singular and classical LDA fails. Many extensions, including null space LDA (NLDA) and orthogonal LDA (OLDA), have been proposed in the past to overcome this problem. NLDA aims to maximize the between-class distance in the null space of the within-class scatter matrix, while OLDA computes a set of orthogonal discriminant vectors via the simultaneous diagonalization of the scatter matrices. They have been applied successfully in various applications. In this paper, we present a computational and theoretical analysis of NLDA and OLDA. Our main result shows that under a mild condition which holds in many applications involving high-dimensional data, NLDA is equivalent to OLDA. We have performed extensive experiments on various types of data, and the results are consistent with our theoretical analysis. We further apply regularization to OLDA. The resulting algorithm is called regularized OLDA (or ROLDA for short). An efficient algorithm is presented to estimate the regularization value in ROLDA. A comparative study on classification shows that ROLDA is very competitive with OLDA. This confirms the effectiveness of the regularization in ROLDA.

Keywords: linear discriminant analysis, dimensionality reduction, null space, orthogonal matrix, regularization

(c) 2006 Jieping Ye and Tao Xiong.

1. Introduction

Dimensionality reduction is important in many applications of data mining, machine learning, and bioinformatics, due to the so-called curse of dimensionality (Bellman, 1961; Duda et al., 2000; Fukunaga, 1990; Hastie et al., 2001). Many methods have been proposed for dimensionality reduction, such as principal component analysis (PCA) (Jolliffe, 1986) and linear discriminant analysis

(LDA) (Fukunaga, 1990). LDA aims to find the optimal discriminant vectors (transformation) by maximizing the ratio of the between-class distance to the within-class distance, thus achieving the maximum class discrimination. It has been applied successfully in many applications, including information retrieval (Berry et al., 1995; Deerwester et al., 1990), face recognition (Belhumeur et al., 1997; Swets and Weng, 1996; Turk and Pentland, 1991), and microarray gene expression data analysis (Dudoit et al., 2002). However, classical LDA requires the so-called total scatter matrix to be nonsingular. In many applications such as those mentioned above, all scatter matrices in question can be singular, since the data points are from a very high-dimensional space and in general the sample size does not exceed this dimensionality. This is known as the singularity or undersampled problem (Krzanowski et al., 1995).

In recent years, many approaches have been proposed to deal with such high-dimensional, undersampled problems, including null space LDA (NLDA) (Chen et al., 2000; Huang et al., 2002), orthogonal LDA (OLDA) (Ye, 2005), uncorrelated LDA (ULDA) (Ye et al., 2004a; Ye, 2005), subspace LDA (Belhumeur et al., 1997; Swets and Weng, 1996), regularized LDA (Friedman, 1989), and pseudo-inverse LDA (Raudys and Duin, 1998; Skurichina and Duin, 1996). Null space LDA computes the discriminant vectors in the null space of the within-class scatter matrix. Uncorrelated LDA and orthogonal LDA belong to a family of algorithms for generalized discriminant analysis proposed in (Ye, 2005). The features in ULDA are uncorrelated, while the discriminant vectors in OLDA are orthogonal to each other. Subspace LDA (or PCA+LDA) applies an intermediate dimensionality reduction stage, such as PCA, to reduce the dimensionality of the original data before classical LDA is applied. Regularized LDA uses a scaled multiple of the identity matrix to make the scatter matrix nonsingular. Pseudo-inverse LDA employs the pseudo-inverse to overcome the singularity problem. More details on these methods, as well as their relationships, can be found in (Ye, 2005).

In this paper, we present a detailed computational and theoretical analysis of null space LDA and orthogonal LDA. In (Chen et al., 2000), null space LDA (NLDA) was proposed, where the between-class distance is maximized in the null space of the within-class scatter matrix. The singularity problem is thus implicitly avoided. A similar idea was mentioned briefly in (Belhumeur et al., 1997). Huang et al. (2002) improved the efficiency of the algorithm by first removing the null space of the total scatter matrix, based on the observation that the null space of the total scatter matrix is the intersection of the null space of the between-class scatter matrix and the null space of the within-class scatter matrix.

In orthogonal LDA (OLDA), a set of orthogonal discriminant vectors is computed, based on a generalized optimization criterion (Ye, 2005). The optimal transformation is computed through the simultaneous diagonalization of the scatter matrices, while the singularity problem is overcome implicitly. Discriminant analysis with orthogonal transformations has been studied in (Duchene and Leclercq, 1988; Foley and Sammon, 1975). By a close examination of the computations involved in OLDA, we can decompose the OLDA algorithm into three steps: first remove the null space of the total scatter matrix; next apply classical uncorrelated LDA (ULDA), a variant of classical LDA (details can be found in Section 2.1); and finally apply an orthogonalization step to the transformation.
Both the NLDA algorithm (Huang et al., 2002) and the OLDA algorithm (Ye, 2005) result in orthogonal transformations. However, they apply different schemes in deriving the optimal transformations. NLDA computes an orthogonal transformation in the null space of the within-class scatter matrix, while OLDA computes an orthogonal transformation through the simultaneous diagonalization

of the scatter matrices. Interestingly, we show in Section 5 that NLDA is equivalent to OLDA under a mild condition C1,^1 which holds in many applications involving high-dimensional data (see Section 7). Based on this equivalence result, an improved algorithm for NLDA, called iNLDA, is presented, which further reduces the computational cost of the original NLDA algorithm.

We extend the OLDA algorithm by applying the regularization technique, which is commonly used to stabilize the sample covariance matrix estimation and improve the classification performance (Friedman, 1989). The algorithm is called regularized OLDA (or ROLDA for short). The key idea in ROLDA is to add a constant λ to the diagonal elements of the total scatter matrix. Here λ > 0 is known as the regularization parameter. Choosing an appropriate regularization value is a critical issue in ROLDA, as a large λ may significantly disturb the information in the scatter matrix, while a small λ may not be effective in improving the classification performance. Cross-validation is commonly used to estimate the optimal λ from a finite set of candidates. Selecting an optimal value for a parameter such as λ is called model selection (Hastie et al., 2001). The computational cost of model selection for ROLDA can be expensive, especially when the candidate set is large, since it requires expensive matrix computations for each λ. We show in Section 6 that the computations in ROLDA can be decomposed into two components: the first component involves matrices of high dimensionality but is independent of λ, while the second component involves matrices of low dimensionality. When searching for the optimal λ from a set of candidates via cross-validation, we repeat only the computations involved in the second component, thus reducing the computational cost of model selection in ROLDA.

We have conducted experiments using 14 data sets from various data sources, including low-dimensional data from the UCI Machine Learning Repository^2 and high-dimensional data such as text documents, face images, and gene expression data. (Details on these data sets can be found in Section 7.) We did a comparative study of NLDA, iNLDA, OLDA, ULDA, ROLDA, and Support Vector Machines (SVM) (Schölkopf and Smola, 2002; Vapnik, 1998) in classification. The experimental results show that:

- For all low-dimensional data sets, the null space of the within-class scatter matrix is empty, and neither NLDA nor iNLDA applies. However, OLDA is applicable, and the reduced dimensionality of OLDA is in general k - 1, where k is the number of classes.
- Condition C1 holds for most high-dimensional data sets (eight out of nine). NLDA, iNLDA, and OLDA achieve the same classification performance in all cases where condition C1 holds. In the cases where condition C1 does not hold, OLDA outperforms NLDA and iNLDA, as OLDA retains a larger number of reduced dimensions than NLDA and iNLDA. These empirical results are consistent with our theoretical analysis. iNLDA and NLDA achieve similar performance in all cases.
- OLDA is very competitive with ULDA. This confirms the effectiveness of the final orthogonalization step in OLDA.
- ROLDA achieves better classification performance than OLDA, which shows the effectiveness of the regularization in ROLDA. Overall, ROLDA and SVM are very competitive with the other methods in classification.

The rest of the paper is organized as follows. An overview of classical LDA and classical uncorrelated LDA is given in Section 2. NLDA and OLDA are discussed in Sections 3 and 4, respectively.
1. Condition C1 requires that the rank of the total scatter matrix equals the sum of the ranks of the between-class scatter matrix and the within-class scatter matrix. More details will be given in Section 5.
2. http://www.ics.uci.edu/~mlearn/MLRepository.html

The relationship between NLDA and OLDA is studied in Section 5. The ROLDA algorithm is presented in Section 6. Section 7 includes the experimental results. We conclude in Section 8. For convenience, Table 1 lists the important notation used in the rest of this paper.

Table 1: Notation.

    A    data matrix                         n    number of training data points
    m    data dimensionality                 l    reduced dimensionality
    k    number of classes                   S_b  between-class scatter matrix
    S_w  within-class scatter matrix         S_t  total scatter matrix
    G    transformation matrix               S_i  covariance matrix of the i-th class
    c_i  centroid of the i-th class          n_i  sample size of the i-th class
    c    global centroid                     K    number of neighbors in K-NN
    t    rank of S_t                         q    rank of S_b

2. Classical Linear Discriminant Analysis

Given a data set consisting of n data points {a_j}_{j=1}^n in R^m, classical LDA computes a linear transformation G ∈ R^{m×l} (l < m) that maps each a_j in the m-dimensional space to a vector â_j in the l-dimensional space by â_j = G^T a_j. Define three matrices H_w, H_b, and H_t as follows:

    H_w = \frac{1}{\sqrt{n}} [A_1 - c_1 e_1^T, \ldots, A_k - c_k e_k^T],            (1)
    H_b = \frac{1}{\sqrt{n}} [\sqrt{n_1}(c_1 - c), \ldots, \sqrt{n_k}(c_k - c)],    (2)
    H_t = \frac{1}{\sqrt{n}} (A - c e^T),                                           (3)

where A = [a_1, ..., a_n] is the data matrix; A_i, c_i, S_i, and n_i are the data matrix, the centroid, the covariance matrix, and the sample size of the i-th class, respectively; c is the global centroid; k is the number of classes; and e (respectively, e_i) is the vector of all ones of length n (respectively, n_i). Then the between-class scatter matrix S_b, the within-class scatter matrix S_w, and the total scatter matrix S_t are defined as follows (Fukunaga, 1990):

    S_w = H_w H_w^T,   S_b = H_b H_b^T,   and   S_t = H_t H_t^T.

It follows from the definitions (Ye, 2005) that trace(S_w) measures the within-class cohesion, trace(S_b) measures the between-class separation, and trace(S_t) measures the variance of the data set, where the trace of a square matrix is the summation of its diagonal entries (Golub and Van Loan, 1996). It is easy to verify that S_t = S_b + S_w. In the lower-dimensional space resulting from the linear transformation G, the scatter matrices become

    S_w^L = G^T S_w G,   S_b^L = G^T S_b G,   and   S_t^L = G^T S_t G.

An optimal transformation G would maximize trace(S_b^L) and minimize trace(S_w^L).
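To make Eqs. (1)-(3) concrete, the following sketch (ours, not part of the paper; NumPy-based, with illustrative names) builds H_w, H_b, and H_t from a data matrix and a label vector and assembles the three scatter matrices; by construction it satisfies S_t = S_b + S_w up to floating-point error.

```python
import numpy as np

def scatter_matrices(A, labels):
    """Form H_w, H_b, H_t of Eqs. (1)-(3) and return S_w, S_b, S_t.

    A      : m x n data matrix (one data point per column)
    labels : length-n NumPy array of class labels
    """
    m, n = A.shape
    c = A.mean(axis=1, keepdims=True)                  # global centroid
    Hw_blocks, Hb_cols = [], []
    for cls in np.unique(labels):
        Ai = A[:, labels == cls]                       # data matrix of class i
        ci = Ai.mean(axis=1, keepdims=True)            # class centroid c_i
        Hw_blocks.append(Ai - ci)                      # A_i - c_i e_i^T
        Hb_cols.append(np.sqrt(Ai.shape[1]) * (ci - c))
    Hw = np.hstack(Hw_blocks) / np.sqrt(n)
    Hb = np.hstack(Hb_cols) / np.sqrt(n)
    Ht = (A - c) / np.sqrt(n)                          # (A - c e^T) / sqrt(n)
    return Hw @ Hw.T, Hb @ Hb.T, Ht @ Ht.T             # S_w, S_b, S_t
```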

Classical LDA aims to compute the optimal G by solving the following optimization problem:

    G^* = \arg\max_{G \in R^{m \times l}:\, G^T S_w G = I_l} \mathrm{trace}\left( (G^T S_w G)^{-1} G^T S_b G \right).    (4)

Other optimization criteria, including those based on the determinant, could also be used instead (Duda et al., 2000; Fukunaga, 1990). The solution to the optimization problem in Eq. (4) is given by the eigenvectors of S_w^{-1} S_b corresponding to the nonzero eigenvalues, provided that the within-class scatter matrix S_w is nonsingular (Fukunaga, 1990). The columns of G form the discriminant vectors of classical LDA. Since the rank of the between-class scatter matrix is bounded from above by k - 1, there are at most k - 1 discriminant vectors in classical LDA. Note that classical LDA does not handle singular scatter matrices, which limits its applicability to low-dimensional data. Several methods, including null space LDA, orthogonal LDA, and subspace LDA, were proposed in the past to deal with this singularity problem, as discussed in Section 1.

2.1 Classical Uncorrelated LDA

Classical uncorrelated LDA (cuLDA) is an extension of classical LDA. A key property of cuLDA is that the features in the transformed space are uncorrelated, thus reducing the redundancy in the transformed space. cuLDA aims to find the optimal discriminant vectors that are S_t-orthogonal.^3 Specifically, suppose the r vectors φ_1, φ_2, ..., φ_r have been obtained; then the (r+1)-th vector φ_{r+1} is the one that maximizes the Fisher criterion function (Jin et al., 2001)

    f(φ) = \frac{φ^T S_b φ}{φ^T S_w φ},    (5)

subject to the constraints φ_{r+1}^T S_t φ_i = 0, for i = 1, ..., r. The algorithm in (Jin et al., 2001) finds the discriminant vectors φ_i successively by solving a sequence of generalized eigenvalue problems, which is expensive for large and high-dimensional data sets. However, it has been shown (Ye et al., 2004a) that the discriminant vectors of cuLDA can be computed efficiently by solving the following optimization problem:

    G^* = \arg\max_{G \in R^{m \times l}:\, G^T S_t G = I_l} \mathrm{trace}\left( (G^T S_w G)^{-1} G^T S_b G \right),    (6)

where G = [φ_1, ..., φ_l], if there exist l discriminant vectors in cuLDA. Note that in Eq. (6), all discriminant vectors in G are computed simultaneously. The optimization problem above is a variant of the one in Eq. (4). The optimal G is given by the eigenvectors of S_t^{-1} S_b.

3. Two vectors x and y are S_t-orthogonal if x^T S_t y = 0.
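Since Eqs. (4) and (6) differ only in the constraint matrix, both reduce to a generalized symmetric eigenproblem. A minimal sketch (ours, not from the paper) under the stated nonsingularity assumption, using SciPy's generalized eigensolver:

```python
import numpy as np
from scipy.linalg import eigh

def classical_lda(Sb, Sw, k):
    """Classical LDA (Eq. 4): eigenvectors of S_w^{-1} S_b for the largest
    eigenvalues, assuming S_w is nonsingular. Solving the generalized
    symmetric problem S_b x = lambda S_w x also enforces the constraint
    G^T S_w G = I_l, since eigh normalizes eigenvectors against S_w.
    """
    evals, evecs = eigh(Sb, Sw)           # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:k - 1]]        # at most k-1 discriminant vectors

# cuLDA (Eq. 6) is the same computation with S_t in place of S_w:
# G = classical_lda(Sb, St, k)
```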

3. Null Space LDA

Chen et al. (2000) proposed null space LDA (NLDA) for dimensionality reduction, where the between-class distance is maximized in the null space of the within-class scatter matrix. The basic idea behind this algorithm is that the null space of S_w may contain significant discriminant information if the projection of S_b is not zero in that direction (Chen et al., 2000; Lu et al., 2003). The singularity problem is thus overcome implicitly. The optimal transformation of NLDA can be computed by solving the following optimization problem:

    G^* = \arg\max_{G:\, G^T S_w G = 0} \mathrm{trace}(G^T S_b G).    (7)

The computation of the optimal G involves the computation of the null space of S_w, which may be large for high-dimensional data. Indeed, the dimensionality of the null space of S_w is at least m + k - n, where m is the data dimensionality, k is the number of classes, and n is the sample size. In (Chen et al., 2000), a pixel grouping method was used to extract geometric features and reduce the dimensionality of the samples, and NLDA was then applied in the new feature space.

Huang et al. (2002) improved the efficiency of the algorithm in (Chen et al., 2000) by first removing the null space of the total scatter matrix S_t, based on the observation that the null space of S_t is the intersection of the null spaces of S_b and S_w, as S_t = S_w + S_b. We can efficiently remove the null space of S_t as follows. Let H_t = U Σ V^T be the Singular Value Decomposition (SVD) (Golub and Van Loan, 1996) of H_t, where H_t is defined in Eq. (3), U and V are orthogonal,

    Σ = \begin{pmatrix} Σ_t & 0 \\ 0 & 0 \end{pmatrix},

Σ_t ∈ R^{t×t} is diagonal with the diagonal entries sorted in non-increasing order, and t = rank(S_t). Then

    S_t = H_t H_t^T = U Σ V^T V Σ^T U^T = U Σ Σ^T U^T = U \begin{pmatrix} Σ_t^2 & 0 \\ 0 & 0 \end{pmatrix} U^T.    (8)

Let U = (U_1, U_2) be a partition of U with U_1 ∈ R^{m×t} and U_2 ∈ R^{m×(m-t)}. Then the null space of S_t can be removed by projecting the data onto the subspace spanned by the columns of U_1. Let S̃_b, S̃_w, and S̃_t be the scatter matrices after the removal of the null space of S_t. That is,

    S̃_b = U_1^T S_b U_1,   S̃_w = U_1^T S_w U_1,   and   S̃_t = U_1^T S_t U_1.

Note that only U_1 is involved in the projection. We can thus apply the reduced SVD computation (Golub and Van Loan, 1996) on H_t, with time complexity O(mn^2) instead of O(m^2 n). When the data dimensionality m is much larger than the sample size n, this leads to a big reduction in computational cost.

With the computed U_1, the optimal transformation of NLDA is given by G = U_1 N, where N is obtained by solving the following optimization problem:

    N^* = \arg\max_{N:\, N^T S̃_w N = 0} \mathrm{trace}(N^T S̃_b N).    (9)

That is, the columns of N lie in the null space of S̃_w, while maximizing trace(N^T S̃_b N). Let W be the matrix whose columns span the null space of S̃_w. Then N = WM for some matrix M, which is to be determined next. Since the constraint in Eq. (9) is satisfied with N = WM for any M, the optimal M can be computed by maximizing trace(M^T W^T S̃_b W M). By imposing the orthogonality constraint on M (Huang et al., 2002), the optimal M is given by the eigenvectors of W^T S̃_b W corresponding to the nonzero eigenvalues. With the computed U_1, W, and M above, the optimal transformation of NLDA is given by G = U_1 W M.

In (Huang et al., 2002), the matrix W is computed via the eigen-decomposition of S̃_w. More specifically, let

    S̃_w = [W, W̄] \begin{pmatrix} 0 & 0 \\ 0 & Δ_w \end{pmatrix} [W, W̄]^T

be its eigen-decomposition, where [W, W̄] is orthogonal and Δ_w is diagonal with positive diagonal entries. Then W spans the null space of S̃_w. The pseudo-code for the NLDA algorithm is given in Algorithm 1.

Algorithm 1: NLDA (Null space LDA)
Input: data matrix A
Output: transformation matrix G
1. Form the matrix H_t as in Eq. (3);
2. Compute the reduced SVD of H_t as H_t = U_1 Σ_t V_1^T;
3. Form the matrices S̃_b = U_1^T S_b U_1 and S̃_w = U_1^T S_w U_1;
4. Compute the null space, W, of S̃_w, via the eigen-decomposition;
5. Construct the matrix M, consisting of the top eigenvectors of W^T S̃_b W;
6. G ← U_1 W M.
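A direct NumPy rendering of Algorithm 1 might look as follows (our sketch, not the authors' code; it reuses scatter_matrices from Section 2 and a rank tolerance tol). For very high-dimensional data one would form U_1^T S_w U_1 from the factor U_1^T H_w rather than from the m x m matrix S_w; the explicit form is kept here for readability.

```python
import numpy as np

def nlda(A, labels, tol=1e-10):
    """Sketch of Algorithm 1 (NLDA): G = U_1 W M."""
    n = A.shape[1]
    c = A.mean(axis=1, keepdims=True)
    Ht = (A - c) / np.sqrt(n)                       # Step 1: H_t of Eq. (3)
    U, s, _ = np.linalg.svd(Ht, full_matrices=False)
    U1 = U[:, s > tol]                              # Step 2: keep t = rank(S_t) columns
    Sw, Sb, _ = scatter_matrices(A, labels)
    Sw_t = U1.T @ Sw @ U1                           # Step 3: projected scatter matrices
    Sb_t = U1.T @ Sb @ U1
    vals, vecs = np.linalg.eigh(Sw_t)               # Step 4: null space W of projected S_w
    W = vecs[:, vals < tol]
    bvals, bvecs = np.linalg.eigh(W.T @ Sb_t @ W)   # Step 5: top eigenvectors M
    M = bvecs[:, bvals > tol][:, ::-1]              # nonzero eigenvalues, descending
    return U1 @ W @ M                               # Step 6
```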

4. Orthogonal LDA

Orthogonal LDA (OLDA) was proposed in (Ye, 2005) as an extension of classical LDA. The discriminant vectors in OLDA are orthogonal to each other. Furthermore, OLDA is applicable even when all scatter matrices are singular, thus overcoming the singularity problem. It has been applied successfully in many applications, including document classification, face recognition, and gene expression data classification. The optimal transformation in OLDA can be computed by solving the following optimization problem:

    G^* = \arg\max_{G \in R^{m \times l}:\, G^T G = I_l} \mathrm{trace}\left( (G^T S_t G)^{+} G^T S_b G \right),    (10)

where M^+ denotes the pseudo-inverse of a matrix M (Golub and Van Loan, 1996). The orthogonality condition is imposed in the constraint.

The computation of the optimal transformation of OLDA is based on the simultaneous diagonalization of the three scatter matrices, as follows (Ye, 2005). From Eq. (8), U_2 lies in the null space of both S_b and S_w. Thus,

    U^T S_b U = \begin{pmatrix} U_1^T S_b U_1 & 0 \\ 0 & 0 \end{pmatrix},   U^T S_w U = \begin{pmatrix} U_1^T S_w U_1 & 0 \\ 0 & 0 \end{pmatrix}.    (11)

Denote B = Σ_t^{-1} U_1^T H_b and let B = P Σ Q^T be the SVD of B, where P and Q are orthogonal and Σ is diagonal. Define the matrix X as

    X = U \begin{pmatrix} Σ_t^{-1} P & 0 \\ 0 & I_{m-t} \end{pmatrix}.    (12)

It can be shown (Ye, 2005) that X simultaneously diagonalizes S_b, S_w, and S_t. That is,

    X^T S_b X = D_b,   X^T S_w X = D_w,   and   X^T S_t X = D_t,    (13)

where D_b, D_w, and D_t are diagonal, with the diagonal entries in D_b sorted in non-increasing order. The main result in (Ye, 2005) shows that the optimal transformation of OLDA can be computed through the orthogonalization of the columns in X, as summarized in the following theorem:

Theorem 4.1 Let X be the matrix defined in Eq. (12) and let X_q be the matrix consisting of the first q columns of X, where q = rank(S_b). Let X_q = QR be the QR-decomposition of X_q, where Q has orthonormal columns and R is upper triangular. Then G = Q solves the optimization problem in Eq. (10).

From Theorem 4.1, only the first q columns of X are used in computing the optimal G. From Eq. (12), the first q columns of X are given by

    X_q = U_1 Σ_t^{-1} P_q,    (14)

where P_q consists of the first q columns of the matrix P. We can observe that U_1 corresponds to the removal of the null space of S_t, as in NLDA, while Σ_t^{-1} P_q is the optimal transformation when classical ULDA is applied to the (intermediate-dimensionality) reduced space resulting from the projection onto U_1. The OLDA algorithm can thus be decomposed into three steps: (1) remove the null space of S_t; (2) apply classical ULDA as an intermediate step, since the reduced total scatter matrix is nonsingular; and (3) apply an orthogonalization step to the transformation, which corresponds to the QR decomposition of X_q in Theorem 4.1. The pseudo-code for the OLDA algorithm is given in Algorithm 2.

Algorithm 2: OLDA (Orthogonal LDA)
Input: data matrix A
Output: transformation matrix G
1. Compute U_1, Σ_t, and P;
2. X_q ← U_1 Σ_t^{-1} P_q, where q = rank(S_b);
3. Compute the QR decomposition of X_q as X_q = QR;
4. G ← Q.

Remark 1 The ULDA algorithm in (Ye et al., 2004a; Ye, 2005) consists of steps 1 and 2 above, without the final orthogonalization step. Experimental results in Section 7 show that OLDA is competitive with ULDA. The rationale behind this may be that ULDA involves the minimum redundancy in the transformed space and is susceptible to overfitting; OLDA, on the other hand, removes the R matrix through the QR decomposition in the final orthogonalization step, which introduces redundancy in the reduced space but may be less susceptible to overfitting.
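Algorithm 2 admits an equally short sketch (ours, under the same assumptions as the NLDA sketch; hb_matrix builds H_b of Eq. (2), and tol is a rank tolerance):

```python
import numpy as np

def hb_matrix(A, labels):
    """H_b of Eq. (2): columns sqrt(n_i) * (c_i - c), scaled by 1/sqrt(n)."""
    n = A.shape[1]
    c = A.mean(axis=1, keepdims=True)
    cols = [np.sqrt((labels == cls).sum()) *
            (A[:, labels == cls].mean(axis=1, keepdims=True) - c)
            for cls in np.unique(labels)]
    return np.hstack(cols) / np.sqrt(n)

def olda(A, labels, tol=1e-10):
    """Sketch of Algorithm 2 (OLDA): QR-orthogonalize X_q = U_1 Sigma_t^{-1} P_q."""
    n = A.shape[1]
    c = A.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd((A - c) / np.sqrt(n), full_matrices=False)
    U1, sigma_t = U[:, s > tol], s[s > tol]         # Step 1: U_1 and Sigma_t
    B = (U1.T @ hb_matrix(A, labels)) / sigma_t[:, None]
    P, sb, _ = np.linalg.svd(B)                     # B = P Sigma Q^T
    q = int((sb > tol).sum())                       # q = rank(S_b)
    Xq = U1 @ (P[:, :q] / sigma_t[:, None])         # Step 2: Eq. (14)
    Q, _ = np.linalg.qr(Xq)                         # Steps 3-4
    return Q
```

Dropping the final QR step returns U_1 Σ_t^{-1} P_q, that is, the ULDA transformation of Remark 1 above.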

5. Relationship Between NLDA and OLDA

Both the NLDA algorithm and the OLDA algorithm result in orthogonal transformations. Our empirical results show that they often lead to similar performance, especially for high-dimensional data. This suggests that there may be an intrinsic relationship between the two algorithms. In this section, we take a closer look at the relationship between NLDA and OLDA. More specifically, we show that NLDA is equivalent to OLDA under a mild condition

    C1: rank(S_t) = rank(S_b) + rank(S_w),    (15)

which holds in many applications involving high-dimensional data (see Section 7). It is easy to verify from the definitions of the scatter matrices that rank(S_t) ≤ rank(S_b) + rank(S_w). From Eqs. (8) and (11), the null space, U_2, of S_t can be removed, as follows:

    S̃_t = U_1^T S_t U_1 = U_1^T S_b U_1 + U_1^T S_w U_1 = S̃_b + S̃_w ∈ R^{t×t}.

Since the null space of S_t is the intersection of the null spaces of S_b and S_w, the following equalities hold:

    rank(S̃_t) = rank(S_t) = t,   rank(S̃_b) = rank(S_b),   and   rank(S̃_w) = rank(S_w).

Thus condition C1 is equivalent to rank(S̃_t) = rank(S̃_b) + rank(S̃_w).

The null space of S̃_b and the null space of S̃_w are critical in our analysis. The relationship between these two null spaces is studied in the following lemma.

Lemma 5.1 Let S̃_t, S̃_b, and S̃_w be defined as above and t = rank(S̃_t). Let {w_1, ..., w_r} form an orthonormal basis for the null space of S̃_w, and let {b_1, ..., b_s} form an orthonormal basis for the null space of S̃_b. Then the vectors w_1, ..., w_r, b_1, ..., b_s are linearly independent.

Proof We prove this by contradiction. Assume there exist α_i's and β_j's, not all zero, such that

    \sum_{i=1}^{r} α_i w_i + \sum_{j=1}^{s} β_j b_j = 0.

It follows that

    0 = \Big( \sum_i α_i w_i + \sum_j β_j b_j \Big)^T S̃_w \Big( \sum_i α_i w_i + \sum_j β_j b_j \Big) = \Big( \sum_j β_j b_j \Big)^T S̃_w \Big( \sum_j β_j b_j \Big),

since the w_i lie in the null space of S̃_w. Hence,

    \Big( \sum_j β_j b_j \Big)^T S̃_t \Big( \sum_j β_j b_j \Big) = \Big( \sum_j β_j b_j \Big)^T S̃_b \Big( \sum_j β_j b_j \Big) + \Big( \sum_j β_j b_j \Big)^T S̃_w \Big( \sum_j β_j b_j \Big) = 0.

Since S̃_t is nonsingular, we have \sum_j β_j b_j = 0. Thus β_j = 0 for all j, since {b_1, ..., b_s} forms an orthonormal basis for the null space of S̃_b. Similarly, we have

    0 = \Big( \sum_i α_i w_i + \sum_j β_j b_j \Big)^T S̃_b \Big( \sum_i α_i w_i + \sum_j β_j b_j \Big) = \Big( \sum_i α_i w_i \Big)^T S̃_b \Big( \sum_i α_i w_i \Big),

since the b_j lie in the null space of S̃_b, and hence

    \Big( \sum_i α_i w_i \Big)^T S̃_t \Big( \sum_i α_i w_i \Big) = \Big( \sum_i α_i w_i \Big)^T S̃_b \Big( \sum_i α_i w_i \Big) + \Big( \sum_i α_i w_i \Big)^T S̃_w \Big( \sum_i α_i w_i \Big) = 0.

Hence \sum_{i=1}^{r} α_i w_i = 0, and α_i = 0 for all i, since {w_1, ..., w_r} forms an orthonormal basis for the null space of S̃_w. This contradicts our assumption that not all of the α_i's and β_j's are zero. Thus, w_1, ..., w_r, b_1, ..., b_s are linearly independent.

Next, we show how to compute the optimal transformation of NLDA using these two null spaces. Recall that in NLDA, the null space of S_t may be removed first. In the following discussion, we work on the reduced scatter matrices S̃_w, S̃_b, and S̃_t directly, as in Lemma 5.1. The main result is summarized in the following theorem.

Theorem 5.1 Let U_1, S̃_t, S̃_b, and S̃_w be defined as above and t = rank(S̃_t). Let R = [W, B], where W = [w_1, ..., w_r], B = [b_1, ..., b_s], and {w_1, ..., w_r, b_1, ..., b_s} are defined as in Lemma 5.1. Assume that condition C1: rank(S_t) = rank(S_b) + rank(S_w) holds. Then G = U_1 W M solves the optimization problem in Eq. (9), where the matrix M, consisting of the eigenvectors of W^T S̃_b W, is orthogonal.

Proof From Lemma 5.1, {w_1, ..., w_r, b_1, ..., b_s} ⊂ R^t is linearly independent. Condition C1 implies that t = r + s. Thus {w_1, ..., w_r, b_1, ..., b_s} forms a basis for R^t; that is, R = [W, B] is nonsingular. It follows that

    R^T S̃_t R = R^T S̃_b R + R^T S̃_w R
              = \begin{pmatrix} W^T S̃_b W & W^T S̃_b B \\ B^T S̃_b W & B^T S̃_b B \end{pmatrix} + \begin{pmatrix} W^T S̃_w W & W^T S̃_w B \\ B^T S̃_w W & B^T S̃_w B \end{pmatrix}
              = \begin{pmatrix} W^T S̃_b W & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & B^T S̃_w B \end{pmatrix}.

Since the matrix R^T S̃_t R has full rank, W^T S̃_b W, the projection of S̃_b onto the null space of S̃_w, is nonsingular. Let W^T S̃_b W = M Δ_b M^T be the eigen-decomposition of W^T S̃_b W, where M is orthogonal and Δ_b is diagonal with positive diagonal entries (note that W^T S̃_b W is positive definite). Then, from Section 3, the optimal transformation G of NLDA is given by G = U_1 W M.

Recall that the matrix M in NLDA is computed so that trace(M^T W^T S̃_b W M) is maximized. Since trace(Q A Q^T) = trace(A) for any orthogonal Q, the solution in NLDA is invariant under an arbitrary orthogonal transformation. Thus G = U_1 W is also a solution to NLDA, since M is orthogonal, as summarized in the following corollary.

Corollary 5.1 Assume that condition C1: rank(S_t) = rank(S_b) + rank(S_w) holds. Let U_1 and W be defined as in Theorem 5.1. Then G = U_1 W solves the optimization problem in Eq. (9). That is, G = U_1 W is an optimal transformation of NLDA.

Corollary 5.1 implies that when condition C1 holds, Step 5 in Algorithm 1 may be removed, as well as the formation of S̃_b in Step 3 and the multiplication of U_1 W by M in Step 6. This improves the efficiency of the NLDA algorithm. The improved NLDA (iNLDA) algorithm is given in Algorithm 3. Note that it is recommended in (Liu et al., 2004) that the maximization of the between-class distance in Step 5 of Algorithm 1 be removed to avoid possible overfitting. However, Corollary 5.1 shows that under condition C1, the removal of Step 5 has no effect on the performance of the NLDA algorithm.

Algorithm 3: iNLDA (improved NLDA)
Input: data matrix A
Output: transformation matrix G
1. Form the matrix H_t as in Eq. (3);
2. Compute the reduced SVD of H_t as H_t = U_1 Σ_t V_1^T;
3. Construct the matrix S̃_w = U_1^T S_w U_1;
4. Compute the null space, W, of S̃_w, via the eigen-decomposition;
5. G ← U_1 W.
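Under condition C1, Algorithm 3 therefore drops both S̃_b and M; a sketch (ours, same conventions as the earlier sketches):

```python
import numpy as np

def inlda(A, labels, tol=1e-10):
    """Sketch of Algorithm 3 (iNLDA): G = U_1 W, valid under condition C1."""
    n = A.shape[1]
    c = A.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd((A - c) / np.sqrt(n), full_matrices=False)
    U1 = U[:, s > tol]
    Sw, _, _ = scatter_matrices(A, labels)          # from the Section 2 sketch
    vals, vecs = np.linalg.eigh(U1.T @ Sw @ U1)
    W = vecs[:, vals < tol]                         # null space of projected S_w
    return U1 @ W
```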

Next, we show the equivalence between NLDA and OLDA when condition C1 holds. The main result is summarized in the following theorem.

Theorem 5.2 Assume that condition C1: rank(S_t) = rank(S_b) + rank(S_w) holds. Let U_1 and W be defined as in Theorem 5.1. Then G = U_1 W solves the optimization problem in Eq. (10). That is, under the given assumption, OLDA and NLDA are equivalent.

Proof Recall that the optimization problem involved in OLDA is

    G^* = \arg\max_{G \in R^{m \times l}:\, G^T G = I_l} \mathrm{trace}\left( (S_t^L)^{+} S_b^L \right),    (16)

where S_t^L = G^T S_t G and S_b^L = G^T S_b G. From Section 4, the maximum number, l, of discriminant vectors is no larger than q, the rank of S_b. Recall that

    q = rank(S_b) = rank(S̃_b) = rank(S̃_t) - rank(S̃_w) = r,

where r is the dimension of the null space of S̃_w. Based on the properties of the matrix trace, we have

    trace((S_t^L)^+ S_b^L) + trace((S_t^L)^+ S_w^L) = trace((S_t^L)^+ S_t^L) = rank(S_t^L) ≤ q = r,

where the second equality follows since trace(A^+ A) = rank(A) for any square matrix A, and the inequality follows since the rank of S_t^L ∈ R^{l×l} is at most l ≤ q. It follows that

    trace((S_t^L)^+ S_b^L) ≤ r,

since trace((S_t^L)^+ S_w^L), the trace of the product of two positive semi-definite matrices, is always nonnegative.

Next, we show that the maximum is achieved when G = U_1 W. Recall that the dimension of the null space, W, of S̃_w is r; that is, W ∈ R^{t×r}. It follows that (U_1 W)^T S_t (U_1 W) ∈ R^{r×r} and rank((U_1 W)^T S_t (U_1 W)) = r. Furthermore, (U_1 W)^T S_w (U_1 W) = W^T S̃_w W = 0, as W spans the null space of S̃_w. It follows that

    trace\left( ((U_1 W)^T S_t (U_1 W))^+ (U_1 W)^T S_w (U_1 W) \right) = 0.

Hence,

    trace\left( ((U_1 W)^T S_t (U_1 W))^+ (U_1 W)^T S_b (U_1 W) \right)
        = rank\left( (U_1 W)^T S_t (U_1 W) \right) - trace\left( ((U_1 W)^T S_t (U_1 W))^+ (U_1 W)^T S_w (U_1 W) \right)
        = r.

Thus G = U_1 W solves the optimization problem in Eq. (10). That is, OLDA and NLDA are equivalent.

Theorem 5.2 above shows that under condition C1, OLDA and NLDA are equivalent. Next, we show that condition C1 holds when the data points are linearly independent, as summarized below.

Theorem 5.3 Assume that condition C2 holds, that is, the n data points in the data matrix A ∈ R^{m×n} are linearly independent. Then condition C1: rank(S_t) = rank(S_b) + rank(S_w) holds.

Proof Since the n columns of A are linearly independent, H_t = \frac{1}{\sqrt{n}}(A - c e^T) has rank n - 1; that is, rank(S_t) = n - 1. Next we show that rank(S_b) = k - 1 and rank(S_w) = n - k, so that condition C1 holds. It is easy to verify that rank(S_b) ≤ k - 1 and rank(S_w) ≤ n - k. We have

    n - 1 = rank(S_t) ≤ rank(S_b) + rank(S_w) ≤ (k - 1) + (n - k) = n - 1.    (17)

It follows that all inequalities in Eq. (17) become equalities. That is,

    rank(S_b) = k - 1,   rank(S_w) = n - k,   and   rank(S_t) = rank(S_b) + rank(S_w).    (18)

Thus, condition C1 holds.

Our experimental results in Section 7 show that for high-dimensional data, the linear independence condition C2 holds in many cases, while condition C1 is satisfied in most cases. This explains why NLDA and OLDA often achieve the same performance in applications involving high-dimensional data, such as text documents, face images, and gene expression data.

6. Regularized Orthogonal LDA

Recall that OLDA involves the pseudo-inverse of the total scatter matrix, whose estimation may not be reliable, especially for undersampled data, where the number of dimensions exceeds the sample size. In this case, the parameter estimates can be highly unstable, giving rise to high variance. By employing a method of regularization, one attempts to improve the estimates by regulating this bias-variance trade-off (Friedman, 1989).

We apply the regularization technique to OLDA by adding a constant λ to the diagonal elements of the total scatter matrix. Here λ > 0 is known as the regularization parameter. The algorithm is called regularized OLDA (ROLDA). The optimal transformation, G, of ROLDA can be computed by solving the following optimization problem:

    G^* = \arg\max_{G \in R^{m \times l}:\, G^T G = I_l} \mathrm{trace}\left( (G^T (S_t + λ I_m) G)^{+} G^T S_b G \right).    (19)

The optimal G can be computed by solving an eigenvalue problem, as summarized in the following theorem (the proof follows Theorem 3.1 in (Ye, 2005) and is thus omitted):

Theorem 6.1 Let X_q be the matrix consisting of the first q eigenvectors of the matrix

    (S_t + λ I_m)^{-1} S_b    (20)

corresponding to the nonzero eigenvalues, where q = rank(S_b). Let X_q = QR be the QR-decomposition of X_q, where Q has orthonormal columns and R is upper triangular. Then G = Q solves the optimization problem in Eq. (19).

Theorem 6.1 implies that the main computation involved in ROLDA is the eigen-decomposition of the matrix (S_t + λ I_m)^{-1} S_b. Direct formation of this matrix is expensive for high-dimensional data, as it is of size m by m. In the following, we present an efficient way of computing the eigen-decomposition. Denote

    B̂ = (Σ_t^2 + λ I_t)^{-1/2} U_1^T H_b    (21)

and let

    B̂ = P̂ Σ̂ Q̂^T    (22)

be the SVD of B̂. From Eqs. (8) and (11), we have

    (S_t + λ I_m)^{-1} S_b
      = U \begin{pmatrix} (Σ_t^2 + λ I_t)^{-1} & 0 \\ 0 & λ^{-1} I_{m-t} \end{pmatrix} U^T U \begin{pmatrix} U_1^T S_b U_1 & 0 \\ 0 & 0 \end{pmatrix} U^T
      = U \begin{pmatrix} (Σ_t^2 + λ I_t)^{-1} U_1^T H_b H_b^T U_1 & 0 \\ 0 & 0 \end{pmatrix} U^T
      = U \begin{pmatrix} (Σ_t^2 + λ I_t)^{-1/2} B̂ B̂^T (Σ_t^2 + λ I_t)^{1/2} & 0 \\ 0 & 0 \end{pmatrix} U^T
      = U \begin{pmatrix} (Σ_t^2 + λ I_t)^{-1/2} P̂ Σ̂ Σ̂^T P̂^T (Σ_t^2 + λ I_t)^{1/2} & 0 \\ 0 & 0 \end{pmatrix} U^T.

It follows that the columns of the matrix

    U_1 (Σ_t^2 + λ I_t)^{-1/2} P̂_q

form the eigenvectors of (S_t + λ I_m)^{-1} S_b corresponding to the top q nonzero eigenvalues, where P̂_q denotes the first q columns of P̂. That is, X_q in Theorem 6.1 is given by

    X_q = U_1 (Σ_t^2 + λ I_t)^{-1/2} P̂_q.    (23)

The pseudo-code for the ROLDA algorithm is given in Algorithm 4.

Algorithm 4: ROLDA (Regularized OLDA)
Input: data matrix A and regularization value λ
Output: transformation matrix G
1. Compute U_1, Σ_t, and P̂_q, where q = rank(S_b);
2. X_q ← U_1 (Σ_t^2 + λ I_t)^{-1/2} P̂_q;
3. Compute the QR decomposition of X_q as X_q = QR;
4. G ← Q.
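Since U_1 and U_1^T H_b do not depend on λ, a single ROLDA fit per Eq. (23) only touches small, λ-dependent matrices. A sketch of Algorithm 4 (ours; U1, sigma_t, and Hb_L = U_1^T H_b are assumed precomputed as in the OLDA sketch):

```python
import numpy as np

def rolda_transform(U1, sigma_t, Hb_L, lam, tol=1e-10):
    """Sketch of Algorithm 4 (ROLDA) for a fixed regularization value lam."""
    d = 1.0 / np.sqrt(sigma_t**2 + lam)    # diagonal of (Sigma_t^2 + lam I)^(-1/2)
    Bh = d[:, None] * Hb_L                 # B-hat of Eq. (21)
    P, sb, _ = np.linalg.svd(Bh)           # B-hat = P-hat Sigma-hat Q-hat^T
    q = int((sb > tol).sum())              # q = rank(S_b)
    Dqp = d[:, None] * P[:, :q]            # small t x q factor
    Q, _ = np.linalg.qr(Dqp)               # QR on the small factor only
    return U1 @ Q                          # G = U_1 Q, since U_1^T U_1 = I
```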

The computations in ROLDA can be decomposed into two components: the first component involves the matrix U_1 ∈ R^{m×t}, which is of high dimensionality but independent of λ, while the second component involves the matrix (Σ_t^2 + λ I_t)^{-1/2} P̂_q ∈ R^{t×q}, which is of low dimensionality. When we apply cross-validation to search for the optimal λ from a set of candidates, we repeat only the computations involved in the second component, thus keeping the computational cost of model selection small. More specifically, let

    Λ = {λ_1, ..., λ_{|Λ|}}    (24)

be the candidate set for the regularization parameter λ, where |Λ| denotes the size of the candidate set. We apply v-fold cross-validation for model selection (we choose v = 5 in our experiments), where the data is divided into v subsets of (approximately) equal size. All subsets are mutually exclusive, and in the i-th fold, the i-th subset is held out for testing while all other subsets are used for training. For each λ_j (j = 1, ..., |Λ|), we compute the cross-validation accuracy, Accu(j), defined as the mean of the accuracies over all folds. The optimal regularization value λ_{j*} is the one with

    j^* = \arg\max_j Accu(j).    (25)

The K-Nearest-Neighbor algorithm with K = 1, called 1-NN, is used for computing the accuracy. The pseudo-code for the model selection procedure in ROLDA is given in Algorithm 5. Note that we apply the QR decomposition to

    (Σ_t^2 + λ I_t)^{-1/2} P̂_q ∈ R^{t×q}    (26)

instead of

    X_q = U_1 (Σ_t^2 + λ I_t)^{-1/2} P̂_q ∈ R^{m×q},    (27)

as done in Theorem 6.1, since U_1 has orthonormal columns.

Algorithm 5: Model selection for ROLDA
Input: data matrix A and candidate set Λ = {λ_1, ..., λ_{|Λ|}}
Output: optimal regularization value λ_{j*}
1. For i = 1 : v  /* v-fold cross-validation */
2.   Construct A^i and A^î;  /* A^i = all folds but the i-th, for training; A^î = the i-th fold, for testing */
3.   Construct H_b and H_t using A^i as in Eqs. (2) and (3), respectively;
4.   Compute the reduced SVD of H_t as H_t = U_1 Σ_t V_1^T; t ← rank(H_t);
5.   H_{b,L} ← U_1^T H_b; q ← rank(H_b);
6.   A_L^i ← U_1^T A^i; A_L^î ← U_1^T A^î;  /* projection by U_1 */
7.   For j = 1 : |Λ|  /* λ_1, ..., λ_{|Λ|} */
8.     D_j ← (Σ_t^2 + λ_j I_t)^{-1/2}; B̂ ← D_j H_{b,L};
9.     Compute the SVD of B̂ as B̂ = P̂ Σ̂ Q̂^T;
10.    D_{q,P} ← D_j P̂_q; compute the QR decomposition of D_{q,P} as D_{q,P} = QR;
11.    (A_L^i)' ← Q^T A_L^i; (A_L^î)' ← Q^T A_L^î;
12.    Run 1-NN on ((A_L^i)', (A_L^î)') and compute the accuracy, denoted Accu(i, j);
13.  EndFor
14. EndFor
15. Accu(j) ← (1/v) Σ_{i=1}^{v} Accu(i, j);
16. j^* ← arg max_j Accu(j);
17. Output λ_{j*} as the optimal regularization value.
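A condensed version of Algorithm 5 (our sketch; it reuses hb_matrix and rolda_transform from the sketches above, uses a simple random fold split in place of the paper's exact partitioning, and expects labels as a NumPy array):

```python
import numpy as np

def one_nn(Ztr, ytr, Zte):
    """1-NN prediction; columns of Ztr/Zte are (projected) data points."""
    d2 = (Zte**2).sum(0)[:, None] - 2 * Zte.T @ Ztr + (Ztr**2).sum(0)[None, :]
    return ytr[np.argmin(d2, axis=1)]

def select_lambda(A, labels, lam_grid, v=5, tol=1e-10, seed=0):
    n = A.shape[1]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), v)
    acc = np.zeros(len(lam_grid))
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        At, yt = A[:, train], labels[train]
        # lambda-independent component (computed once per fold)
        c = At.mean(axis=1, keepdims=True)
        U, s, _ = np.linalg.svd((At - c) / np.sqrt(At.shape[1]),
                                full_matrices=False)
        U1, sigma_t = U[:, s > tol], s[s > tol]
        Hb_L = U1.T @ hb_matrix(At, yt)
        # lambda-dependent component (cheap, repeated per candidate)
        for j, lam in enumerate(lam_grid):
            G = rolda_transform(U1, sigma_t, Hb_L, lam)
            pred = one_nn(G.T @ At, yt, G.T @ A[:, held_out])
            acc[j] += np.mean(pred == labels[held_out]) / v
    return lam_grid[int(np.argmax(acc))]
```

One possible realization of the candidate set of Section 7 (Eq. (28)) is alphas = np.linspace(1/1025, 1024/1025, 1024) followed by lam_grid = alphas / (1 - alphas).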

6.1 Time Complexity

We conclude this section by analyzing the time complexity of the model selection procedure described above. Line 4 in Algorithm 5 takes O(n^2 m) time for the reduced SVD computation. Lines 5 and 6 take O(mtk) = O(mnk) and O(tmn) = O(mn^2) time, respectively, for the matrix multiplications. For each λ_j, j = 1, ..., |Λ|, of the For loop, Lines 9 and 10 take O(tk^2) = O(nk^2) time for the SVD, the QR decomposition, and the matrix multiplications. Line 11 takes O(ktn) = O(kn^2) time for the matrix multiplication. The computation of the classification accuracy by 1-NN in Line 12 takes O(n^2 k/v) time, as the size of the test set, A_L^î, is about n/v. Thus, the time complexity, T(|Λ|), of the model selection procedure is

    T(|Λ|) = O\left( v \left( n^2 m + mn^2 + mnk + |Λ| (nk^2 + kn^2 + n^2 k/v) \right) \right).

For high-dimensional and undersampled data, where the sample size n is much smaller than the dimensionality m, the time complexity simplifies to

    T(|Λ|) = O\left( v (n^2 m + |Λ| n^2 k) \right) = O\left( v n^2 m \left( 1 + \frac{k}{m} |Λ| \right) \right).

When the number, k, of classes in the data set is much smaller than the dimensionality, m, the overhead of estimating the optimal regularization value over a large candidate set may be small. Our experiments on a collection of high-dimensional and undersampled data (see Section 7) show that the computational cost of the model selection procedure in ROLDA grows slowly as |Λ| increases.

7. Experimental Studies

In this section, we perform extensive experimental studies to evaluate the theoretical results and the ROLDA algorithm presented in this paper. Section 7.1 describes our test data sets. We perform a detailed comparison of NLDA, iNLDA, and OLDA in Section 7.2; the results are consistent with our theoretical analysis. In Section 7.3, we compare the classification performance of NLDA, iNLDA, OLDA, ULDA, ROLDA, and SVM. The K-Nearest-Neighbor (K-NN) algorithm with K = 1 is used as the classifier for all LDA-based algorithms.

7.1 Data Sets

We used 14 data sets from various data sources in our experimental studies. The statistics of our test data sets are summarized in Table 2.

The first five data sets, including spambase,^4 balance, wine, waveform, and vowel, are low-dimensional data from the UCI Machine Learning Repository. The next nine data sets, including text documents, face images, and gene expression data, have high dimensionality: re1, re0, and tr41 are three text document data sets, where re1 and re0 are derived from the Reuters-21578 text categorization test collection, Distribution 1.0,^5 and tr41 is derived from the TREC-5, TREC-6, and TREC-7 collections;^6 ORL,^7 AR,^8 and PIX^9 are three face image data sets; GCM, colon, and ALLAML4 are three gene expression data sets (Ye et al., 2004b).

4. Only a subset of the original spambase data set is used in our study.
5. http://www.daviddlewis.com/resources/testcollections/reuters21578/
6. http://trec.nist.gov
7. http://www.uk.research.att.com/facedatabase.html
8. http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html
9. http://peipa.essex.ac.uk/ipa/pix/faces/manchester/test-hard/

Table 2: Statistics of our test data sets. For the first five data sets, we used the given partition into training and test sets; for the last nine data sets, we used random splittings into training and test sets of ratio 2:1.

    Data Set   Sample size (n)            # of dimensions (m)   # of classes (k)
               training  test   total
    spambase   400       600    1000      56                    2
    balance    416       209    625       4                     3
    wine       118       60     178       13                    3
    waveform   300       500    800       21                    3
    vowel      528       462    990       10                    11
    re1        -         -      490       3759                  5
    re0        -         -      320       2887                  4
    tr41       -         -      210       7454                  7
    ORL        -         -      400       10304                 40
    AR         -         -      650       8888                  50
    PIX        -         -      300       10000                 30
    GCM        -         -      198       16063                 14
    colon      -         -      62        2000                  2
    ALLAML4    -         -      72        7129                  4

7.2 Comparison of NLDA, iNLDA, and OLDA

In this experiment, we did a comparative study of NLDA, iNLDA, and OLDA. For the first five low-dimensional data sets from the UCI Machine Learning Repository, we used the given splitting into training and test sets. The results are summarized in Table 3. For the nine high-dimensional data sets, we performed the study by repeated random splittings into training and test sets: the data was partitioned randomly into a training set, where each class consists of two-thirds of the whole class, and a test set with each class consisting of the remaining one-third. The splitting was repeated 20 times, and the resulting accuracies of the different algorithms for the first ten splittings are summarized in Table 4 (the mean accuracy over all 20 splittings is reported in the next section). The ranks of the three scatter matrices, S_b, S_w, and S_t, for each splitting are also reported.
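Conditions C1 and C2 are easy to check on a given training split; a small sketch (ours), reusing scatter_matrices from the Section 2 sketch:

```python
import numpy as np

def check_conditions(A, labels, tol=1e-10):
    """Report rank(S_b), rank(S_w), rank(S_t), and whether C1 and C2 hold."""
    Sw, Sb, St = scatter_matrices(A, labels)
    rb = np.linalg.matrix_rank(Sb, tol=tol)
    rw = np.linalg.matrix_rank(Sw, tol=tol)
    rt = np.linalg.matrix_rank(St, tol=tol)
    c1 = (rt == rb + rw)            # C1: rank(S_t) = rank(S_b) + rank(S_w)
    c2 = (rt == A.shape[1] - 1)     # rank(S_t) = n - 1, the check used in Section 7.2
    return rb, rw, rt, c1, c2
```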

Table 3: Comparison of NLDA, iNLDA, and OLDA in classification accuracy (in percentage) on five low-dimensional data sets from the UCI Machine Learning Repository. The ranks of the three scatter matrices are reported; a dash indicates that the method does not apply.

    Data Set   NLDA  iNLDA  OLDA   rank(S_b)  rank(S_w)  rank(S_t)
    spambase   -     -      88.17  1          56         56
    balance    -     -      86.60  2          4          4
    wine       -     -      98.33  2          13         13
    waveform   -     -      73.20  2          21         21
    vowel      -     -      56.28  10         10         10

The main observations from Table 3 and Table 4 include:

- For the first five low-dimensional data sets, we have rank(S_b) = k - 1 and rank(S_w) = rank(S_t) = m, where m is the data dimensionality. Thus the null space of S_w is empty, and neither NLDA nor iNLDA applies. However, OLDA is applicable, and the reduced dimensionality of OLDA is k - 1.
- For the nine high-dimensional data sets, condition C1: rank(S_t) = rank(S_b) + rank(S_w) is satisfied in all cases except the re0 data set. For re0, either rank(S_t) = rank(S_b) + rank(S_w) or rank(S_t) = rank(S_b) + rank(S_w) - 1 holds; that is, condition C1 is not severely violated for re0. Note that re0 has the smallest number of dimensions among the nine high-dimensional data sets. From these experiments, we may infer that condition C1 is more likely to hold for high-dimensional data.
- NLDA, iNLDA, and OLDA achieve the same classification performance in all cases where condition C1 holds. This empirical result confirms the theoretical analysis in Section 5 and explains why NLDA and OLDA often achieve similar performance for high-dimensional data. We also observe that NLDA and iNLDA achieve similar performance in all cases.
- The numbers of training data points for the nine high-dimensional data sets (in the same order as in the table) are 325, 212, 140, 280, 450, 210, 125, 68, and 48, respectively. By examining the rank of S_t in Table 4, we can observe that the training data in six of the nine data sets, namely tr41, ORL, AR, GCM, colon, and ALLAML4, are linearly independent; that is, the independence assumption C2 from Theorem 5.3 holds for these data sets. It is clear from the table that for these six data sets, condition C1 holds and NLDA, iNLDA, and OLDA achieve the same performance, consistent with the theoretical analysis in Section 5.
- For the re0 data set, where condition C1 does not hold, i.e., rank(S_t) < rank(S_b) + rank(S_w), OLDA achieves higher classification accuracy than NLDA and iNLDA. Recall that the reduced dimensionality of OLDA equals q = rank(S_b), while the reduced dimensionality of NLDA and iNLDA equals the dimension of the null space of S̃_w, which is rank(S_t) - rank(S_w) < rank(S_b). That is, OLDA keeps more dimensions in the transformed space than NLDA and iNLDA. The experimental results on re0 show that these extra dimensions improve the classification performance of OLDA.

7.3 Comparative Studies on Classification

In this experiment, we conducted a comparative study of NLDA, iNLDA, OLDA, ULDA, ROLDA, and SVM in terms of classification. For ROLDA, the optimal λ is estimated through cross-validation on a candidate set Λ = {λ_j}_{j=1}^{|Λ|}. Recall that T(|Λ|) denotes the computational cost of the model selection procedure in ROLDA, where |Λ| is the size of the candidate set of regularization values.

Table 4: Comparison of classification accuracy (in percentage) for NLDA, iNLDA, and OLDA on nine high-dimensional data sets, over ten different splittings into training and test sets of ratio 2:1 (for each of the k classes). The ranks of the three scatter matrices for each splitting are also reported. For all data sets except re0, the three methods achieve identical accuracies on every splitting, so a single accuracy row is shown for them.

re1 (NLDA = iNLDA = OLDA):
    accuracy    92.73 93.33 93.33 93.94 94.55 95.15 96.36 95.15 92.12 93.94
    rank(S_b)   4 on all splittings
    rank(S_w)   316 318 319 316 316 320 316 318 317 318
    rank(S_t)   320 322 323 320 320 324 320 322 321 322

re0:
    NLDA        64.81 62.04 64.81 68.52 87.96 70.37 71.30 73.15 87.04 75.93
    iNLDA       65.74 62.04 64.81 69.44 87.96 70.37 71.30 72.22 87.04 75.93
    OLDA        75.93 75.00 77.78 74.07 87.96 80.56 74.07 78.70 87.04 79.63
    rank(S_b)   3 on all splittings
    rank(S_w)   205 204 203 203 205 204 201 203 203 205
    rank(S_t)   207 206 205 205 208 206 203 205 206 207

tr41 (NLDA = iNLDA = OLDA):
    accuracy    97.14 95.71 97.14 98.57 97.14 98.57 100.0 95.71 98.57 95.71
    ranks       rank(S_b) = 6, rank(S_w) = 133, rank(S_t) = 139 on all splittings

ORL (NLDA = iNLDA = OLDA):
    accuracy    99.17 96.67 98.33 98.33 95.00 95.83 98.33 97.50 98.33 95.83
    ranks       rank(S_b) = 39, rank(S_w) = 240, rank(S_t) = 279 on all splittings

AR (NLDA = iNLDA = OLDA):
    accuracy    96.50 94.50 96.50 94.00 93.50 94.50 93.50 97.00 94.00 96.00
    ranks       rank(S_b) = 49, rank(S_w) = 400, rank(S_t) = 449 on all splittings

PIX (NLDA = iNLDA = OLDA):
    accuracy    98.89 97.78 98.89 97.78 98.89 98.89 98.89 97.78 98.89 97.78
    rank(S_b)   29 on all splittings
    rank(S_w)   178 179 179 179 178 180 179 179 180 178
    rank(S_t)   207 208 208 208 207 209 208 208 209 207

GCM (NLDA = iNLDA = OLDA):
    accuracy    81.54 80.00 81.54 83.08 84.62 87.69 75.38 78.46 84.62 83.08
    ranks       rank(S_b) = 13, rank(S_w) = 111, rank(S_t) = 124 on all splittings

colon (NLDA = iNLDA = OLDA):
    accuracy    91.18 94.12 100.0 97.06 91.18 91.18 97.06 94.12 94.12 97.06
    ranks       rank(S_b) = 1, rank(S_w) = 66, rank(S_t) = 67 on all splittings

ALLAML4 (NLDA = iNLDA = OLDA):
    accuracy    95.83 91.67 95.83 95.83 87.50 95.83 95.83 100.0 91.67 95.83
    ranks       rank(S_b) = 3, rank(S_w) = 44, rank(S_t) = 47 on all splittings

We have performed model selection on all nine high-dimensional data sets using different values of |Λ|. We have observed that T(|Λ|) grows slowly as |Λ| increases, and the ratio T(1024)/T(1) on the nine data sets ranges from 1 to 5. Thus, we can run model selection using a large candidate set of regularization values without dramatically increasing the cost. In the following experiments, we apply model selection to ROLDA with a candidate set of size |Λ| = 1024, where

    λ_j = α_j / (1 - α_j),    (28)

with the {α_j}_{j=1}^{|Λ|} uniformly distributed between 0 and 1. As for SVM, we employed cross-validation to estimate the optimal parameter, using a candidate set of size 50.

To compare the different classification algorithms, we applied the same experimental setting as in Section 7.2. The splitting into training and test sets of ratio 2:1 (for each of the k classes) was repeated 20 times. The final accuracy reported is the average over the 20 different runs; the standard deviation for each data set is also reported. The results on the nine high-dimensional data sets are summarized in Table 5.

Table 5: Comparison of classification accuracy (in percentage) for six different methods: NLDA, iNLDA, OLDA, ULDA, ROLDA, and SVM on nine high-dimensional data sets. The mean accuracy and standard deviation (in parentheses) over 20 different runs are reported.

    Data Set  NLDA          iNLDA         OLDA          ULDA          ROLDA         SVM
    re1       94.33 (1.72)  94.33 (1.72)  94.33 (1.72)  94.76 (1.67)  94.79 (1.64)  94.54 (1.88)
    re0       74.03 (9.22)  74.15 (8.19)  79.54 (4.73)  79.72 (4.82)  85.79 (3.66)  85.87 (3.34)
    tr41      97.00 (2.01)  97.00 (2.01)  97.00 (2.01)  97.14 (2.02)  97.17 (2.04)  97.14 (2.01)
    ORL       97.29 (1.79)  97.29 (1.79)  97.29 (1.79)  92.75 (1.82)  97.52 (1.64)  97.55 (1.34)
    AR        95.42 (1.30)  95.42 (1.30)  95.42 (1.30)  94.37 (1.46)  97.30 (1.32)  95.75 (1.43)
    PIX       98.22 (1.41)  98.22 (1.41)  98.22 (1.41)  96.61 (1.92)  98.29 (1.32)  98.50 (1.24)
    GCM       81.77 (3.61)  81.77 (3.61)  81.77 (3.61)  80.46 (3.71)  82.69 (3.42)  75.31 (4.45)
    colon     86.50 (5.64)  86.50 (5.64)  86.50 (5.64)  86.50 (5.64)  87.00 (6.16)  87.25 (5.25)
    ALLAML4   93.54 (3.70)  93.54 (3.70)  93.54 (3.70)  93.75 (3.45)  93.75 (3.45)  93.70 (3.40)

As observed in Section 7.2, OLDA has the same performance as NLDA and iNLDA in all cases except the re0 data set, while NLDA and iNLDA achieve similar performance in all cases. Overall, ROLDA and SVM are very competitive with the other methods. SVM performs well in all cases except GCM; the poor performance of SVM on GCM has also been observed in (Li et al., 2004). ROLDA outperforms OLDA on re0, AR, and GCM, and is comparable to OLDA in all other cases. This confirms the effectiveness of the regularization applied in ROLDA. Note that, from Remark 1, ULDA is closely related to OLDA; however, unlike OLDA, ULDA does not apply the final orthogonalization step. The experimental results in Table 5 confirm the effectiveness of the orthogonalization step in OLDA, especially for the three face image data sets and GCM.

8. Conclusions

In this paper, we present a computational and theoretical analysis of two LDA-based algorithms: null space LDA and orthogonal LDA. NLDA computes the discriminant vectors in the null space of the within-class scatter matrix, while OLDA computes a set of orthogonal discriminant vectors via the simultaneous diagonalization of the scatter matrices. Both have been applied successfully in many applications, such as document classification, face recognition, and gene expression data classification.

Both NLDA and OLDA result in orthogonal transformations; however, they apply different schemes in deriving the optimal transformations. Our theoretical analysis in this paper shows that under a mild condition C1, which holds in many applications involving high-dimensional data, NLDA is equivalent to OLDA. Based on the theoretical analysis, an improved null space LDA algorithm, called iNLDA, is proposed. We have performed extensive experimental studies on 14 data sets, including both low-dimensional and high-dimensional data. The results show that condition C1 holds for eight of the nine high-dimensional data sets, while the null space of S_w is empty for all five low-dimensional data sets. Thus, NLDA may not be applicable for low-dimensional data, while OLDA is still applicable in this case. The results are also consistent with our theoretical analysis: for all cases where condition C1 holds, NLDA, iNLDA, and OLDA achieve the same classification performance, while in the cases where condition C1 is violated, OLDA outperforms NLDA and iNLDA, due to the extra dimensions used in OLDA. We also compared NLDA, iNLDA, and OLDA with uncorrelated LDA (ULDA), which does not perform the final orthogonalization step. The results show that OLDA is very competitive with ULDA, which confirms the effectiveness of the orthogonalization step used in OLDA. Our empirical and theoretical results provide further insights into the nature of these two LDA-based algorithms.

We also present the ROLDA algorithm, which extends the OLDA algorithm by applying the regularization technique. Regularization may stabilize the sample covariance matrix estimation and improve the classification performance. ROLDA involves the regularization parameter λ, which is commonly estimated via cross-validation. To speed up the cross-validation process, we decompose the computations in ROLDA into two components: the first component involves matrices of high dimensionality but is independent of λ, while the second component involves matrices of low dimensionality. When searching for the optimal λ from a candidate set, we repeat only the computations involved in the second component. A comparative study on classification shows that ROLDA is very competitive with OLDA, which demonstrates the effectiveness of the regularization applied in ROLDA.

Our extensive experimental studies show that condition C1 holds for most high-dimensional data sets. We plan to carry out a theoretical analysis of this property in the future; some of the theoretical results in (Hall et al., 2005) may be useful for this analysis. The algorithms in (Yang et al., 2005; Yu and Yang, 2001) are closely related to the null space LDA algorithm discussed in this paper. The analysis presented here may be useful in understanding why these algorithms perform well in many applications, especially in face recognition. We plan to explore this further in the future.

Acknowledgements

We thank the reviewers for their helpful comments. The research of JY is sponsored, in part, by the Center for Evolutionary Functional Genomics of the Biodesign Institute at Arizona State University.

References

P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.