Semi-supervised feature extraction for population structure identification using the Laplacian linear discriminant Usman Roshan 1,* 1

Size: px

Start display at page:

Download "Semi-supervised feature extraction for population structure identification using the Laplacian linear discriminant Usman Roshan 1,* 1"

Valerie Jacobs
5 years ago
Views:

1 Population and Geneti Analysis Semi-supervised feature extration for population struture identifiation using the Laplaian linear disriminant Usman Roshan 1,* 1 Department of Computer Siene, New Jersey Institute of Tehnology, GITC 4400, University Heights, Newark, NJ 07102, USA ABSTRACT Motivation: The identifiation of population struture from genomewide SNP data is of signifiant interest in the population and medial genetis ommunity. A popular solution is to perform unsupervised feature extration using prinipal omponent analysis. Prinipal omponent analysis, however, relies only on global properties of the data. Results: The Laplaian linear disriminant takes into onsideration loal properties of the data as well and, as we show in this study, it an be extended to the semi-supervised setting. This an then be applied to extrat features for identifying population struture when the anestry of some individuals in some sub-populations of the admixture is known. Using real data we simulate suh semisupervised senarios and extrat features using the Laplaian linear disriminant, kernel prinipal omponent analysis, and two reent semi-supervised feature extrators. We show that there is a statistially signifiant improvement in auray when the nearest mean lassifier or k-means lustering is applied on the Laplaian linear disriminant features ompared to kernel prinipal omponent analysis and the other methods. Availability: All neessary software and data for reproduibility purposes is at Contat: usman@s.njit.edu 1 INTRODUCTION The problem of lustering humans into groups of similar geographial anestry arises in the fields of medial and population genetis. In medial genetis disease assoiation studies an lead to misleading results if the underlying population struture is not taken into aount (Ziv and Burhard, 2003; Marhini et. al., 2004; Xu and Shete, 2005; Devlin and Roeder, 1999). In population genetis this an be used to unover demographi history and address related sientifi questions (Cavalli-Sforza and Feldman, 2003). Consequently several methods have been developed for identifying struture from genome wide single nuleotide polymorphism (SNP) data (Tsai et. al., 2005). The two prevailing methods are prinipal omponent analysis (PCA) (Pashou et. al., 2007), whih is an unsupervised feature extration method, and the model-based Markov Chain Monte Carlo method implemented in STRUCTURE (Prithard et. al., 2000), where prior anestry an be speified if available. PCA is very fast and has been shown to separate inter and intra-ontinental admixtures with high auray when followed by the simple k- means lustering algorithm (Pashou et. al., 2007). The model- based approah of STRUCTURE is also onsidered aurate. However it is very slow and thus prohibitive on admixtures with several thousand SNPs or hundreds of individuals. In this study onsider the following problem. Assume that some individuals in the population have known anestry. In pratie this an be obtained using urrent benhmarks suh as the HAPMAP projet ( How an we then extrat features for struture identifiation while taking the prior anestry into onsideration? We all this semi-supervised feature extration sine the data of individuals with unknown anestry is also utilized for extrating features. We propose a semi-supervised Laplaian linear disriminant (LLDA) (Tang et. al., 2006) for solving this problem. LLDA an be onsidered to be a more general feature extrator than PCA. PCA uses only global properties of the data sine it is based on the total satter matrix. LLDA, whih is similar to the maximum margin riterion disriminant (Li et. al., 2006), uses both the total satter matrix, whih aptures global properties of the data, as well as the within lass satter matrix, whih aptures loal properties. Although the within lass satter matrix is normally defined in a supervised framework, it an also be omputed in an unsupervised one using Laplaian matries (Nijima and Okuno, 2007). In this study we extend it to a semi-supervised setting and then extrat features with it. In related work Zhang et. al., 2007, propose a Laplaian based approah that uses must-link and annot-link onstraints for extrating features. Sugiyama et. al., 2007, present a semi-supervised loal Fisher disriminant for feature extration. We ontrast our approah with theirs in the Methods Setion. Using real benhmarks we simulate different semi-supervised senarios and ompare our proposed LLDA approah to kernel PCA and the two reent semi-supervised feature extrators. We ompare the error of the nearest mean lassifier and the k-means lustering algorithm on all the projetions. We show that LLDA attains statistially signifiant higher auraies than kernel PCA even with a small number of individuals with known anestry. These auraies improve as the perent of individuals with known anestry inreases. The two other semi-supervised methods, however, have higher error than our LLDA approah and also kernel PCA on the data onsidered here. 2 METHODS Throughout we assume that n represents the total number of individuals in the population and k represents the number of sub-populations in it. * To whom orrespondene should be addressed. Oxford University Press

2 U. Roshan 2.1 Bakground Enoding the data We assume that for eah individual in the population bialleli SNPs have been assayed. We use the following enoding sheme to onvert eah individual s set of sequened SNPs into numerial vetors on whih PCA an then be performed. Let g be an individual s set of sequened SNPs and let g i represent the i th SNP. g i an be AA, AB, or BB where A and B are alphabetially ordered SNP bases. Following the enoding of smartpa (Patterson et. al., 2006) and similar to the one in (Pashou et. al., 2007) we define the feature vetor x for g by setting x i to 0 if g i is AA, 1 if AB, and 2 if BB. Let eah suh vetor x for the i th individual be the i th olumn of the data matrix X, i.e. X = [x 1,, x n] Prinipal omponent analysis (PCA) Let x be a random variable that represents the feature vetor of an individual s SNP genotype as defined above. Suppose we are interested in omputing a projetion of x onto one dimension suh that the variane (i.e. spread) of the projeted data is maximized. In other words we want to find w suh that Variane(w T x) is maximized subjet to w T w=1. This yields the optimization problem maxw!' T w subjet to w T w =1 (1) w where Σ=E((x-µ)(x-µ) T ) is the ovariane of x and µ=e(x) is the expeted value of x. Using Lagrange multipliers one an show that the eigenvetor of Σ with the largest eigenvalue is the solution to problem (1). The eigenvetor with the next largest eigenvalue yields a projetion orthogonal to the first one and of maximum variane (Alpaydin, 2004). In pratie we use the sample ovariane matrix instead of Σ. Given the data matrix X = [x 1,, x n] (as desribed in the enoding) we define X = [x 1-m,, x n-m], where x i is the feature vetor of the i th individual and m=σx i/n is the mean feature vetor of the population. The sample ovariane matrix is then defined as X X T. This is also alled total satter matrix and is equivalent to S t = 1 n n i=1 (x i! m)(x i! m) T The eigenvetor of S t with the largest eigenvalue (also alled the first prinipal omponent vetor) gives the projetion of maximum variane in the sample. The eigenvetor with the next largest eigenvalue yields a projetion orthogonal to the first one and also of maximum variane. In unsupervised senarios PCA an be very helpful in eluidating lusters in the data Maximum margin disriminant analysis (MMC) In a supervised senario one an use the standard Fisher disriminant (Alpaydin, 2004). Another supervised method is the reent maximum margin disriminant (MMC) (Li et. al., 2006) that does not suffer from singularity problems like its Fisher ounterpart. Define the within lass satter matrix S i for lass i as n i S i = 1 (x (i) j! m (i) )(x (i) j! m (i) ) T n i j =1 the overall within lass satter matrix as S w = 1! n k S k = 1 n n n k!! j =1 and the between lass satter matrix as S b = 1 n n k (m (k )! m)(m (k)! m) T (x (k) j m (k) (k )(x ) j m (k ) ) T where x j (i) is the j th feature vetor of lass i, is the number of lasses, n i is the size of lass i, m (i) is the mean feature vetor of lass i, and m is the mean feature vetor of the entire dataset. The maximum margin riterion for feature extration is defined as (Li et. al., 2006) J = 1 2!! i=1 j =1 p i p j d(c i,c j ) where p i and p j are lass prior probabilities and d(c i,c j) is the interlass distane. Define the interlass distane d(c i,c j) as d(c i,c j)=d(m i,m j)-tr(s i)- tr(s j) where d(m i,m j) is the Eulidean distane between the mean vetors m i and m j and tr(s i) is the overall variane of lass S i. Then if p i=n i/n and d(c i,c j)=d(m i,m j)-tr(s i)-tr(s j), it an be shown that J=tr(S b-s w) (Li et. al., 2006). The linear MMC disriminant aims to find a matrix W = [w 1, w 2,, w d] that maximizes tr(w T (S b! S w )W ) subjet to (w k T w ) for all k [1,d]. This is equivalent to solving max d w 1 w d w k T (S b! S w )w k subjet to w k T w k =1 for k =1,,d Using Lagrange multipliers it an be shown that the d largest eigenvetor of S b-s w are the solution to W. The projetion of the data matrix X is then given by W T X. Sine S t=s w+s b, S b-s w an be rewritten as S t-2s w (Nijima and Okuno, 2007). Thus MMC also takes into onsideration loal properties of the data (by onsidering S w) whereas PCA only onsiders S t Laplaian linear disriminant analysis (LLDA) In order to desribe this disriminant we first define the Laplaian matrix of a weighted graph. Suppose we are given a weighted graph G with n nodes and its assoiated weight matrix W = {w ij : i,j [1,n]}. Then the Laplaian L of G is defined as (Tang et. al., 2006) $!w ij i j( & n & L[i, j] = % ) & d i = w ik i = j& ' * Now we desribe LLDA. First note that the matries S t and S w defined earlier an also be written as (Nijima and Okuno, 2007) S t = 1 n X(I! 1 n eet )X T = 1 n X(I!W g)x T = 1 n XL g X T (3) S w = 1 n X(I! 1 e (k ) e (k )T )X T = 1 n k n X(I!W l )X T = 1 n XL l X T k= l (2) 2

3 Semi-supervised feature extration for population struture identifiation using the Laplaian linear disriminant where e is n dimensional with all entries set to 1 and the i th entry of e (k) is set to 1 if x i belongs to lass k and 0 otherwise. I-W g an be viewed as the Laplaian L g of a global graph where all verties are onneted and eah edge has weight 1/n. Similarly I-W l is the Laplaian L l of a loal graph suh that all verties belonging to the same lass are onneted with weight 1/n k. In terms of the graph Laplaians we an represent S t-2s w as (1/n)X T (L g- 2L l)x using (3). MMC represented in this form is known as Laplaian linear disriminant analysis (Nijima and Okuno, 2007). The d leading eigenvetors of (1/n)X T (L g-2l l)x form the olumns of W and the projetion is then given by W T X. Note that so far the Laplaian of the loal graph is welldefined only under a supervised senario. However, it an also be used in an unsupervised setting (Nijima and Okuno, 2007) and a semi-supervised one as we show below Related work Zhang et. al., 2007, also use a Laplaian graph based approah for semi supervised dimensionality redution. They speify must-link and annotlink pairwise onstraints in the weight matrix and then proeed to ompute its Laplaian. The largest eigenvetors of the Laplaian are then used to projet the data. Our approah, however, is based on the LLDA, whih in turn is based on the MMC riterion. The LLDA approah takes both the global and loal properties into onsideration in separate Laplaians. Furthermore, the Laplaians in our ase are onstruted from nearest neighbor graphs that also inorporate prior information (see Subsetion 2.2.1). The approah of Zhang et. al., 2007, on the other hand, does not use nearest neighbor graphs for onstrution of the weight matrix. Sugiyama et. al., 2007, propose a semi-supervised loal Fisher disriminant for dimensionality redution. They use regularized between-lass and within-lass satter matries and define a trade-off parameter to ontrol the regularization. Their approah is based on the loal Fisher disriminant whereas LLDA (and subsequently our method) is loser to the MMC riterion. Li et. al., 2006, have shown MMC to produe lower error than the Fisher disriminant on faial reognition benhmarks. 2.2 Our ontribution Semi-supervised Laplaian linear disriminant In this study we desribe a semi-supervised extension. Suppose that the anestry of some individuals from some sub-populations in the admixture is known. In order to take this into aount for feature extration in an LLDA framework we proeed as follows. First we define the weight matries for the loal and global graphs as follows. Our definitions of weight matries are similar to the ones in Nijima and Okuno, 2007, exept that we inorporate prior knowledge.! 100p ij if anestry of i and j are speified and l(i) = l( j) % W l [i, j] = k(i, j) if i is among m nearest neighbors of j or vie - versa& $ 0 otherwise ' k(i, j) if i! j% W g [i, j] = & $ 0 if i = j ' where k(i,j) is the similarity of individuals i and j, and l(i) returns the lass (anestry) of individual i. p ij is defined as p ij =! k q ik q jk where q ik is the probability that individual i belongs to sub-population k of the admixture. Current benhmarks are obtained from individuals where both parents and grandparents of an individual are from the same anestry (e.g. Thus q ik is usually 1 or 0. However if an individual has different anestry from the mother and father then q ik will be a probability between 0 and 1. We use prior knowledge also in the onstrution of the nearest neighbor graph. For a given individual i whose anestry is speified we alulate its m nearest neighbors as follows. Let C be the set of individuals with the same speified anestry as that of i. We onsider all individuals in C to be loser to i even if they are onsidered far under a given distane metri. Thus, we add them to the nearest neighbors of i sorted by distane to i. If C < m then we sort the remaining individuals, i.e. the total population without C, by their distane to i and add the losest m- C ones to the nearest neighbor set of i. If the prior anestry of i is not speified we alulate its distane to all other individuals in the admixture and inlude the m nearest ones in the nearest neighbor set of i. After omputing the matries W g and W l we alulate the global and loal Laplaians aording to (2). We then form the matrix (1/n)X T (L g-2l l)x and ompute the d leading eigenvetors as the solution to W, where d is the desired number of redued dimensions. Note that the semi-supervised LLDA as well as the unsupervised one in Nijima and Okuno, 2007 both inlude MMC as a speial ase. If we set W g[i,j]=1/n for all i,j, and W l[i,j]=1/n k if i and j are both in the same lass k and 0 otherwise, we then obtain MMC Our implementation: Kernel-PCA + LLDA The dimensions of (1/n)X T (L g-2l l)x an be very large when thousands of SNPs are given. This an onsiderably slow down the eigenvetor omputations. To overome this we follow the spirit of PCA+LDA algorithms for image reognition (Yang et. al., 2005). We first ompute the full PCA projetion of the SNP data. This an be done effiiently using kernel PCA (Sholkopf and Smola, 2002). Kernel PCA also allows us to model nonlinear relationships in the data. After the kernel PCA transformation eah individual is now represented by an n dimensional vetor (where n is the number of individuals). This is a signifiant redution in dimension. SNPs an be in the order of hundreds of thousands or millions whereas the number of individuals in the population may be muh smaller. We treat the kernel PCA projetion as the new data matrix, i.e. olumn i of the data matrix X is replaed with the kernel PCA projeted vetor of the i th individual. We then apply LLDA on this matrix. Although our method is tehnially kernel-pca + semi-supervised LLDA we refer it to as LLDA hereon. All software and the datasets (inluding the kernel PCA projetions) are available from In partiular, sripts and data required to reprodue the results shown in this study are also provided Parameters In the omputation of W l we set k(i,j) to 1 everywhere, although note that Eulidean, dot produt, or Gaussian kernels may also be used (Nijima and Okuno, 2007). For the nearest neighbor searh we use the Eulidean distane between the kernel-pca projeted vetors. The number of dimensions in the PCA projeted vetors used for the Eulidean distane alula- 3

4 U. Roshan tion is set to 10 and the parameter m (in the definition of W l) is also set to 10. We experimented with fewer dimensions and smaller values of m but observed negligible differenes in error. Sine we use benhmarks where both maternal and paternal parents and grandparents have the same anestry, q ik is set to 1 for the appropriate subpopulation k and 0 for remaining ones if i th s prior anestry is available. Thus p ij is always 1 when both i and j have the same anestry. We used the kernel PCA program gist-kpa of the GIST software suite ( for omputing kernel-pca projetions with the polynomial degree two kernel. We experimented with Gaussian and higher order polynomials but did not observe a signifiant differene in error. We implemented LLDA using Perl and employed the Perl Math Cephes pakage ( for matrix operations. 3 RESULTS In order to study the performane of our proposed approah we simulate two different semi-supervised senarios. In the first senario we assume that for a given admixture some individuals from eah sub-population have known anestry. In the seond one some individuals from only some sub-populations have known anestry. Note that the semi-supervised senario does not affet the PCA projetion sine that is an unsupervised method. However, it produes different LLDA or other semi-supervised projetions. In order to ompare different projetions under the first senario we apply the nearest mean lassifier (Alpaydin, 2004). Sine training samples are available from eah sub-population in this ase it is straightforward to apply this lassifier. The proedure is simple: first alulate the training means of eah sub-population and then assign eah individual to the sub-population with the shortest Eulidean distane to its mean. In the seond senario we use the k-means lustering algorithm (Alpaydin, 2004) on both the projetions. In this situation only some individuals of some subpopulations have known anestry and therefore it is not possible to apply a lassifier. When applying the nearest means or k-means methods on a projetion we onsider only the leading k dimensions, where k is the number of sub-populations in the admixture. Note that this number is not required to ompute the LLDA projetion. We only use it in order to ompare the projetions under the nearest means and k- means algorithms. In pratie the number k an be estimated using various methods (Sanguinetti et. al., 2005). Given a true and estimated lassifiation we use the lassifiation error rate for determining its error. Let T[i] [1,k] be the true subpopulation ID of individual i and similarly let C[i] be the estimated sub-population ID. The error rate is then ( 100! { j : T[ j] C[ j], j [1,n] } ) n If a lustering is given then the above approah is not appliable. Instead we first have to make sure that the estimated subpopulation IDs math the true ones by the following method: for eah estimated sub-population we assign it the ID that is maximum among the true population IDs of all individuals in it. We then proeed with the same formula. We measure the statistial signifiane of the differene in errors between two methods using the Wiloxon signed rank test (Kanji, 1999). As shown in earlier studies PCA followed by k-means an separate sub-populations of the HAPMAP dataset (suh as Chinese and Japanese individuals) with high auray (Pashou et. al., 2007). We fous here on the reent large dataset from Noah Rosenberg s lab (Jakobsson et. al., 2008). This ontains 525,9210 genome-wide SNPs of 485 individuals from 29 populations. From this large dataset we extrat individuals from three different regions as speified in the dataset: East Asia: 10 from Cambodia, 15 from Siberia, 49 from China, and 16 from Japan; 459,188 SNPs without missing entries Afria: 32 Biaka Pygmy individuals, 15 Mbuti Pygmy, 24 Mandenka, 25 Yoruba, 7 San from Namibia, 8 Bantu of South Afria, and 12 Bantu of Kenya; 454,732 SNPs without missing entries. Middle East: 43 Druze from Israel-Carmel, 47 Bedouins from Israel-Negev, 26 Palestinians from Israel-Central, and 30 Mozabite from Algeria-Mzab; 438,596 SNPs without missing entries. We onduted several preliminary studies to determine whih methods to present in the main results. We first ompared kernel- PCA to the implementation of the smartpa program (Patterson et. al., 2006) and found it to have lower error when evaluated under k- means. We also ompared the SSDR method of Zhang et. al., 2007, and the semi-supervised loal Fisher disriminant (SSLFDA) of Sugiyama et. al., 2007, to LLDA under the first semi-supervised senario. We implemented both of these methods in Perl and provide them in our software distribution at In line with our semi-supervised LLDA implementation (kernel-pca + LLDA) we apply both SSDR and SSLFDA on the full kernel PCA projetion. We found both SSDR and SSLFDA to have higher errors than LLDA and even the original kernel PCA on the data onsidered here. Subsequently we omit them from the remaining results. 3.1 Random number of individuals with known anestry from eah sub-population of the admixture In the first semi-supervised senario we assume that some individuals from eah sub-population of the admixture have known anestry. In order to apply LLDA here we would need a minimum of two individuals from eah sub-population with known anestry Between 2 and 4 random individuals of eah sub-population with known anestry (for Middle Eastern admixture between 2 and 8) 4

5 Semi-supervised feature extration for population struture identifiation using the Laplaian linear disriminant We simulate datasets with known anestry as follows. Assume that the admixture has p sub-populations. For eah sub-population j [1,p] we generate a random r j [2,4] and randomly selet r j individuals from sub-population j to have known anestry. For the Middle Eastern admixture, however, we use the range [2,8]. We generate 200 suh random datasets. On eah suh dataset we extrat LLDA features. The kernel-pca features are unsupervised and so independent of the prior anestry. The number of random individuals with known anestry is muh smaller than the total in the admixture. For the datasets in this subsetion the mean number of individuals with known anestries in the East Asian, Afrian, and Middle Eastern admixtures are 11.0, 19.1, and 18.3 respetively. These onstitute 12.2%, 15.5%, and 12.5% of the three admixtures respetively. Sine eah sub-population has a minimum of two individuals with known anestry we an apply the simple nearest means lassifier on the two projetions and measure their error. In Table 1 we report the mean error on both the projetions (averaged over the 200 random trials). On all three admixtures the LLDA error is lower by a statistially signifiant margin. Table 1. Error of nearest means lassifier. * denotes Wiloxon signed rank test p-value < Admixture Kernel PCA Laplaian linear disriminant East Asia * Afria * Middle East * Between 2 and 20% random individuals of eah subpopulation with known anestry We simulate 200 random datasets with known anestry in a manner similar to the one desribed in the previous setion. However, the range for eah r j is [2,0.2n j ] where n j is the number of individuals in sub-population j of the admixture. The mean perent of individuals with known anestry is 13.5%, 15.1%, and 12% for the three admixtures respetively. LLDA still performs signifiantly better as Table 2 shows. Table 2. Caption as Table 1. Admixture Kernel PCA Laplaian linear disriminant East Asia * Afria * Middle East * 3.2 Random number of individuals with known anestry but only from some sub-populations of the admixture In the seond semi-supervised senario we assume that only some individuals from some sub-populations have known anestry. This presents a more diffiult senario sine now a lassifiation method annot be applied. In order to ompare the two projetions in this senario we apply the popular k-means lustering method. This may not neessarily the best way of omparison but it still gives us some idea of how well separated the two projetions are. We run k-means 100 times on eah input and report the error of the lustering with the highest objetive funtion value Between 2 and 4 random individuals of some subpopulations with known anestry (for Middle Eastern admixture between 2 and 8) Assume that the admixture has p sub-populations. We first selet a random number p [1,p] and then pik p random sub-population IDs (from 1 through p) using Fisher-Yates sampling (Knuth, 1998). For eah seleted sub-population j we generate a random r j [2,4] and randomly selet r j individuals from sub-population j to have known anestry (as before). Again for the Middle Eastern admixture we use the range [2,8]. We generate 200 suh random datasets. On eah suh dataset we extrat LLDA features. The kernel-pca features are unsupervised and so independent of the prior anestry. The number of seleted sub-populations p and number of individuals per seleted sub-population varies aross these datasets. For example for the 200 datasets in this subsetion the mean values of p for the East Asian, Afrian, and Middle Eastern admixtures are 2.6, 3.9, and 2.5. These values are very similar for the data in the proeeding subsetions. The number of individuals with known anestry is also muh smaller than the total admixture sizes. For the data in this subsetion the mean number of individuals with known anestry are 7.1, 10.8, and 11.8 for the East Asian, Afrian, and Middle Eastern admixtures. These onstitute 7.8%, 8.8%, and 8.1% of the respetive admixtures. We apply the k-means lustering algorithm on both the kernel- PCA and LLDA projetions. K-means on the kernel-pca projetion has the same error aross the different datasets beause the prior anestry does not affet the projetion. In Table 3 we report the k-means error averaged over the 200 random trials. Exept for the East Asian admixture LLDA has statistially signifiant lower errors. Table 3. Error of k-means. * denotes Wiloxon signed rank test p-value < Admixture Kernel PCA Laplaian linear disriminant East Asia Afria * Middle East * 5

6 U. Roshan Between 2 and 20% random individuals of some subpopulations with known anestry We proeed as in the previous subsetion. We first selet p random population IDs. However, this time we selet r j randomly from [2,0.2n j ] where n j is the number of individuals in the seleted sub-population j of the admixture. The mean perent of individuals with known anestry is 9%, 9.1%, and 7.4% for the three admixtures respetively. As Table 4 shows LLDA performs signifiantly better than kernel-pca. Table 4. Caption as Table 3. Admixture Kernel PCA Laplaian linear disriminant East Asia * Afria * Middle East * On the HAPMAP dataset PCA followed by k-means separates the Japanese and Chinese sub-populations with 1% error (Pashou et. al., 2007). We observed the same error with kernel PCA. On the admixtures onsidered here however, PCA error is onsiderably higher. For example, on the Middle Eastern admixture kernel PCA followed by k-means reahes an error of 29.5%. This admixture is partiularly hard beause of the losely related Bedouin and Palestinian sub-populations. LLDA gives an improvement of almost 10% when mean perent of known anestry is 16.8% over randomly seleted sub-populations of this admixture (see Table 6). The perent of individuals with known anestry is intentionally kept small throughout our experiments in order to simulate hard senarios. In pratie prior anestry of many individuals may be available whih in turn will provide a greater advantage with LLDA. This trend an be observed from Tables 4 through 6: as the mean number of individuals with known anestry inreases the LLDA error drops. Our LLDA implementation is also very fast: for any of the three given admixtures it finishes in a few seonds Between 2 and 35% random individuals of some subpopulations with known anestry We proeed as previously but selet r j randomly from [2,0.35n j ]. The mean perent of individuals with known anestry is 12.2%, 11.9%, and 12% for the three admixtures respetively. Table 5 shows the improvements gained from LLDA. Table 5. Caption as Table 3. Admixture Kernel PCA Laplaian linear disriminant East Asia * Afria * Middle East * Between 2 and 50% random individuals of some subpopulations with known anestry Finally we selet r j randomly from [2,0.5n j ]. The mean number of individuals with known anestry (over the 200 trials) is still small for the data in this subsetion: 15.5 for East Asian admixture, 19.3 for Afrian, and 24.6 for the Middle Eastern. These onstitute 17.2%, 15.7%, and 16.8% of the three admixtures respetively. We summarize the results in Table 6. Table 6. Caption as Table 3. Admixture Kernel PCA Laplaian linear disriminant East Asia * Afria * Middle East * 4 CONCLUSIONS We proposed a semi-supervised Laplaian linear disriminant for extrating features for identifying population struture when the anestry of some individuals is known in advane. Using real benhmarks we simulate various semi-supervised senarios and show that LLDA outperforms kernel PCA by a statistially signifiant margin. In omparison to two reent semi-supervised feature extrators, LLDA and kernel PCA have lower error on the data onsidered here. The proposed LLDA method is fast, an be easily implemented, and aommodates mixed prior anestries. ACKNOWLEDGEMENTS Computational experiments were performed on the CIPRES luster supported by NSF award We thank system administrators at New Jersey Institute of Tehnology and the San Diego Superomputing Center for their support. REFERENCES Alpaydin, E. (2004) Introdution to Mahine Learning, MIT Press Cavalli-Sforza L., Feldman M. (2003) The appliation of moleular geneti approahes to the study of human evolution, Nature Genetis 33: Devlin B, Roeder K (1999) Genomi ontrol for assoiation studies. Biometris 55: Jakobsson M., Sholz S.W., Sheet P., Gibbs J.R., VanLiere J.M., Fung H-C., Szpieh Z.A., Degnan J.H., Wang K., Guerreiro R., Bras J.M., Shymik J.C., Hernandez D.G., Traynor B.J., Simon-Sanhez J., Matarin M., Britton A., van de Leemput J., Rafferty I., Buan M., Cann H.M., Hardy J.A., Rosenberg N.A., Singleton A.B. (2008) Genotype, haplotype and opy-number variation in worldwide human populations. Nature 451: Kanji G. K. (1999) 100 Statistial Tests, London, U.K., Sage Publiations Knuth D. E. (1998) The Art of Computer Programming volume 2, 3 rd edition, Li H., Tiang J., Zhang J. (2006) Effiient and robust feature extration by maximum margin riterion, IEEE Trans. Neural Networks 17(1): Marhini J, Cardon L, Phillips M, Donnelly P (2004) The effets of human population struture on large geneti assoiation studies. Nature Genetis 36: Nijima S., Okuno Y. (2007) Laplaian linear disriminant analysis approah to unsupervised feature seletion, IEEE/ACM Transations on Computational Biology and Bioinformatis PP(99):1-1, doi: /tcbb Disussion 6

7 Semi-supervised feature extration for population struture identifiation using the Laplaian linear disriminant Pashou P, Ziv E, Burhard EG, Choudhry S, Rodriguez-Cintron W, et al. (2007) PCA-Correlated SNPs for Struture Identifiation in Worldwide Human Populations. PLoS Genetis 3(9): e160 doi: /journal.pgen Patterson N, Prie A, Reih D (2006) Population struture and eigenanalysis. PLoS Genetis 2: e190. doi: /journal.pgen Prithard J, Stephens M, Donnelly P (2000) Inferene of population struture using multilous genotype data. Genetis 155: Sanguinetti G., Laidler J., Lawrene N. D. (2005) Automati determination of the number of lusters using spetral algorithms, Proeedings of the IEEE Workshop on Mahine Learning for Signal Proessing, Sholkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA, MIT Press. Sugiyama M., Ide T., Nakajimi S., Sese J. (2007) Semi-supervised loal Fisher disriminant analysis for dimensionality redution, Proeedings of the Workshop on Information-Based Indution Sienes, Tokyo, Japan Tang H, Fang T., Shi P-F. (2006) Laplaian linear disriminant analysis, Pattern Reognition, 39: Tsai H, Choudhry S, Naqvi M, Rodriguez-Cintron W, Burhard E, et al. (2005) Comparison of three methods to estimate geneti anestry and ontrol for stratifiation in geneti assoiation studies among admixed populations, Hum Genet 118: Xu H., Shete S. (2005) Effets of population struture on geneti assoiation studies. BMC Genetis 6(Suppl 1): S109 Yang J., Frangi A.F.,Yang J-Y., Zhang D., Zhong J. (2005) KPCA plus LDA: a omplete kernel Fisher disriminant framework for feature extration and reognition, IEEE Transations on Pattern Analysis and Mahine Intelligene 27(2): Zhang D., Zhou Z-H., Chen S., (2007) Semi-supervised dimensionality redution, Proeedings of the 2007 SIAM International Conferene on Data Mining, Minneapolis, Minnesota, USA Ziv E, Burhard E (2003) Human population struture and geneti assoiation studies. Pharmaogenomis 4:

the data. Structured Principal Component Analysis (SPCA)

the data. Structured Principal Component Analysis (SPCA) Strutured Prinipal Component Analysis Kristin M. Branson and Sameer Agarwal Department of Computer Siene and Engineering University of California, San Diego La Jolla, CA 9193-114 Abstrat Many tasks involving