Clustering of SNP Data with Application to Genomics


Michael K. Ng, Mark J. Li, Sio I. Ao
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
Email: mng@math.hkbu.edu.hk

Yiu-ming Cheung
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong

Pak C. Sham
Genome Research Center, The University of Hong Kong, Pokfulam Road, Hong Kong

Joshua Z. Huang
E-Business Technology Institute, The University of Hong Kong, Pokfulam Road, Hong Kong

Research supported in part by HKRGC 7035/04P, 7035/05P and HKBU FRGs.

Abstract

Single nucleotide polymorphisms (SNPs) are very common throughout the genome and hence are potentially valuable for mapping disease susceptibility loci by detecting association between SNP markers and disease. Many methods may only be applicable when marker haplotypes, rather than genotypes (categorical data), are available for analysis. In this paper, we explore the properties of k-modes (categorical data) clustering algorithms applied to SNP data for detecting association between SNP markers and disease. Subspace k-modes clustering properties are also considered and tested.

1 Introduction

Because of their ubiquity, there has been considerable interest in using single nucleotide polymorphisms (SNPs) to fine-map susceptibility loci [5]. It is estimated that 90% of naturally occurring sequence variations are SNPs [9]. These variations are sufficiently finely spaced that one may reasonably expect to find SNPs within a defined chromosomal region which can be sufficient to manifest detectable linkage disequilibrium in some human populations. Detecting association between SNPs and disease may provide useful evidence for the existence of a susceptibility locus within such a region, allowing one to proceed to more intensive investigations which can lead to identification of the gene and pathogenic polymorphisms. Several strategies have been proposed that utilize two-point methods to localize the position of a disease locus [11]. However, SNPs
studied individually might be expected to provide relatively little information for detecting association between a disease and a chromosomal region [25, 27], especially if more than one mutation is present. Potentially, the amount of information available from SNPs could be increased dramatically by utilizing information from several marker loci simultaneously, with the aim of detecting association with a marker haplotype rather than just one biallelic marker. Composite likelihood methods combining disease associations with a series of linked markers from haplotypes have been proposed by Collins et al. [10], Lam et al. [23] and McPeek & Strahs [24].

A problem with any method which directly utilizes case and control haplotypes is that such haplotypes are rarely available for autosomal markers. In practice, if one wishes to use haplotypes then one must generally rely on a combination of deduction and estimation. Methods which are based on explicitly modelling a pattern of development of linkage disequilibrium relationships between marker and disease loci may be expected to perform badly if the assumptions of the model are violated. Methods which assume only one mutational event may perform poorly if more than one has occurred. Finally, genotyping and map errors may cause particular problems for methods which rely on a regular relationship between physical position and linkage disequilibrium parameters.

An alternative approach which might tackle some of these difficulties would be to utilize data mining (clustering) methods to investigate association between a disease phenotype and a multilocus genotype, without assuming the availability of haplotypes required by most of the above-mentioned methods, and without any attempt to model the process whereby disease-related haplotypes might have been generated. Since multilocus genotype data are categorical, the main contribution of this paper is to use k-modes clustering algorithms to detect association between a disease and multiple marker genotypes.

The outline of this paper is as follows. In Section 2, we review a k-modes clustering algorithm. In Section 3, a real data set is employed to test the performance of the k-modes clustering algorithm, and to compare it with other methods such as logistic regression, neural networks and decision trees. Finally, we consider subspace clustering algorithms and present some preliminary clustering results.

2 The K-modes Clustering Method

Since it was first published in 1997, the k-modes algorithm [19] has become a popular technique for solving categorical data clustering problems in different application domains. The k-modes algorithm extends the k-means algorithm by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes in the clustering process so as to minimize the clustering cost function. These extensions remove the numeric-only limitation of the k-means algorithm and enable the k-means clustering process to be used to efficiently cluster large categorical data sets from real-world databases. An equivalent nonparametric approach to deriving clusters from categorical data is presented in [21].

We note that the modes of clusters can be viewed as the representative haplotypes of the corresponding clusters. On the assumption that some association has been maintained through linkage disequilibrium, this implies that particular haplotypes should be more commonly found on chromosomes bearing pathogenic mutations. Such haplotypes should act in a similar way to multiallelic markers, and should be better able to produce detectable association, especially when there are multiple mutation events.

We assume the set of individuals to be clustered is stored in a table $T$ defined by a set of SNP locus attributes (or simply
attributes), $A_1, A_2, \ldots, A_m$. Each attribute $A_j$ describes a domain of values, denoted by $\mathrm{DOM}(A_j)$, associated with a defined set of genotypes. An example is given in the following table:

Individual (Case or Control)   Locus 1   Locus 2   ...   Locus m
...
n

For instance, if $A_1$ has two alleles 1 and 3, the three possible genotypes are 11, 13 and 33. Genotypes for each attribute are treated as categorical (nominal), so that no genetic model assumptions are incorporated. A domain $\mathrm{DOM}(A_j)$ is defined as categorical if it is finite and unordered, e.g., for any $a, b \in \mathrm{DOM}(A_j)$, either $a = b$ or $a \neq b$. An individual $X$ in $T$ can be logically represented as a conjunction of attribute-value pairs $[A_1 = x_1] \wedge [A_2 = x_2] \wedge \cdots \wedge [A_m = x_m]$, where $x_j \in \mathrm{DOM}(A_j)$ for $1 \le j \le m$. Without ambiguity, we represent $X$ as a vector $[x_1, x_2, \ldots, x_m]$. $X$ is called a categorical individual if it has only categorical genotype values. We assume that every individual has exactly $m$ attribute genotype values. If the value of an attribute $A_j$ is missing, then we denote the attribute value of $A_j$ by a category $\epsilon$, which means empty.

Let $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$ be a set of $n$ individuals. Individual $X_i$ is represented as $[x_{i,1}, x_{i,2}, \ldots, x_{i,m}]$. We write $X_i = X_k$ if $x_{i,j} = x_{k,j}$ for $1 \le j \le m$. The relation $X_i = X_k$ does not mean that $X_i$ and $X_k$ are the same individual in the table, but rather that the two individuals have equal genotype values in attributes $A_1, A_2, \ldots, A_m$.

The k-modes algorithm [19] makes the following modifications to the k-means algorithm: (i) using a simple matching dissimilarity measure for categorical individuals, (ii) replacing the means of clusters with the modes, and (iii) using a frequency-based method to find the modes. These modifications remove the numeric-only limitation of the k-means algorithm but maintain its efficiency in clustering large categorical data sets [19].

Let $X$ and $Y$ be two categorical individuals represented by $[x_1, x_2, \ldots, x_m]$ and $[y_1, y_2, \ldots, y_m]$ respectively. The simple matching dissimilarity measure between $X$
and $Y$ is defined as follows:

\[
d(X, Y) \equiv \sum_{j=1}^{m} \delta(x_j, y_j),
\qquad \text{where} \quad
\delta(x_j, y_j) =
\begin{cases}
0, & x_j = y_j, \\
1, & x_j \neq y_j.
\end{cases}
\tag{1}
\]

It is easy to verify that the function $d$ defines a metric space on the set of categorical individuals. Traditionally, the simple matching approach is often used for binary variables which are converted from categorical variables. We note that $d$ is also a kind of generalized Hamming distance.

The k-modes algorithm uses the k-means paradigm to cluster categorical data. The objective of clustering a set of $n$ categorical individuals into $k$ clusters is to find $W$ and $Z$ that minimize

\[
F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li} \, d(Z_l, X_i)
\tag{2}
\]

subject to

\[
w_{li} \in \{0, 1\}, \qquad 1 \le l \le k, \quad 1 \le i \le n,
\tag{3}
\]

\[
\sum_{l=1}^{k} w_{li} = 1, \qquad 1 \le i \le n,
\tag{4}
\]

and

\[
0 < \sum_{i=1}^{n} w_{li} < n, \qquad 1 \le l \le k,
\tag{5}
\]

where $k$ ($\le n$) is a known number of clusters, $W = [w_{li}]$ is a $k$-by-$n$ $\{0, 1\}$ matrix, $Z = [Z_1, Z_2, \ldots, Z_k]$, and $Z_l$ is the $l$th cluster center with the categorical attributes $A_1, A_2, \ldots, A_m$. We recall that $Z_l$ can be viewed as the representative haplotype of the corresponding cluster.

Minimization of $F$ in (2) with the constraints in (3), (4) and (5) forms a class of constrained nonlinear optimization problems whose solutions are unknown. The usual method towards optimization of $F$ in (2) is to use partial optimization for $Z$ and $W$. In this method we first fix $Z$ and find necessary conditions on $W$ to minimize $F$. Then we fix $W$ and minimize $F$ with respect to $Z$. This process is formalized in the k-modes algorithm as follows.

Algorithm (The k-modes algorithm)

1. Choose an initial mode $Z^{(1)}$ of each cluster. Determine $W^{(1)}$ such that $F(W, Z^{(1)})$ is minimized. Set $t = 1$.
2. Determine $Z^{(t+1)}$ such that $F(W^{(t)}, Z^{(t+1)})$ is minimized. If $F(W^{(t)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t)})$, then stop; otherwise go to Step 3.
3. Determine $W^{(t+1)}$ such that $F(W^{(t+1)}, Z^{(t+1)})$ is minimized. If $F(W^{(t+1)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t+1)})$, then stop; otherwise set $t = t + 1$ and go to Step 2.

The matrices $W$ and $Z$ are calculated as follows. Let $\hat{Z}$ be fixed and consider the problem: $\min_W F(W, \hat{Z})$ subject to (3), (4) and (5). The minimizer $\hat{W}$ is given by

\[
\hat{w}_{li} =
\begin{cases}
1, & \text{if } d(\hat{Z}_l, X_i) \le d(\hat{Z}_h, X_i), \quad 1 \le h \le k, \\
0, & \text{otherwise.}
\end{cases}
\]

Let $\mathbf{X}$ be a set of categorical individuals described by categorical attributes $A_1, A_2, \ldots, A_m$ and $\mathrm{DOM}(A_j) = \{a_j^{(1)}, a_j^{(2)}, \ldots, a_j^{(n_j)}\}$, where $n_j$ is the number of categories of attribute $A_j$ for $1 \le j \le m$. Let the cluster centers $Z_l$ be represented by $[z_{l,1}, z_{l,2}, \ldots, z_{l,m}]$ for $1 \le l \le k$. Then the quantity $\sum_{l=1}^{k} \sum_{i=1}^{n} w_{li} \, d(Z_l, X_i)$ is minimized iff $z_{l,j} = a_j^{(r)} \in \mathrm{DOM}(A_j)$, where

\[
\left| \{ w_{li} \mid x_{i,j} = a_j^{(r)}, \; w_{li} = 1 \} \right|
\ge
\left| \{ w_{li} \mid x_{i,j} = a_j^{(t)}, \; w_{li} = 1 \} \right|,
\qquad 1 \le t \le n_j,
\]

for $1 \le j \le m$. Here $|\mathbf{X}|$ denotes the number of
elements in the set $\mathbf{X}$.

3 Experimental Results

In this section, a real data set is employed to test the performance of the k-modes clustering algorithm, and to compare it with other methods such as logistic regression, neural networks and decision trees. We analyze the case/control populations of patients in a data set from the Genome Research Center, The University of Hong Kong. The data consist of 488 cases (patients) recruited from hospitals in Hong Kong and 520 controls (normal) recruited from the community. 144 SNPs on chromosome 3p were picked by CLUSTAG, developed by us [4], giving an average marker density of 1 tagging SNP per 25 kb.

The following table shows the summary classification results for the k-modes clustering algorithm and the other methods. Each classification result is computed as the average of ten runs of the algorithm. In the tests of logistic regression, decision tree and neural network, 60% of the data is used for training and 40% for validation.

Method                         Validation Accuracy
Logistic Regression
Decision Tree
Neural Network
k-modes Clustering (k = 2)
k-modes Clustering (k = 4)
k-modes Clustering (k = 6)

Subspace Clustering

When more SNPs (a genome-wide genotyping) are used to detect the association between a disease and multiple marker genotypes, we may need to consider subspace clustering techniques. More precisely, we expect that in a typical dataset containing the genotype data of several thousand SNPs in different individuals, it is common to find only several tens of SNPs having genotype patterns that are highly specific to each cluster of individuals. These SNPs are called the relevant SNPs, as opposed to the irrelevant SNPs that do not help much in identifying the cluster members (i.e., individuals of the same type). Due to the large number of SNPs being irrelevant to each cluster, two individuals in the same cluster could have low similarity when measured by a similarity function (matching distances in Section 2)

that considers the genotypes of all SNPs. The clusters may thus be undetectable by the k-modes clustering algorithm. The subspace clustering problem is defined for such a scenario. Each subspace cluster is a set of individuals with an associated set of relevant SNPs such that, in the subspace formed by the relevant SNPs, the individuals are similar to each other but dissimilar to individuals outside the cluster.

In general, subspace clustering seeks to group objects into clusters on subsets of dimensions or attributes of a data set. It pursues two tasks: identification of the subsets of dimensions where clusters can be found, and discovery of the clusters from different subsets of dimensions. According to the way in which the subsets of dimensions are identified, we can divide subspace clustering methods into two categories. The methods in the first category determine the exact subsets of dimensions where clusters are discovered. We call these methods hard subspace clustering; see for instance [1, 2, 3, 8, 17, 26, 28, 29]. The methods in the second category determine the subsets of dimensions according to the contributions of the dimensions in discovering the corresponding clusters. The contribution of a dimension is measured by a weight that is assigned to the dimension in the clustering process. We call these methods soft subspace clustering because every dimension contributes to the discovery of clusters, but the dimensions with larger weights form the subsets of dimensions of the clusters; see for instance [6, 12, 13, 14, 15, 16, 22].

We modify the weighting clustering algorithm EWKM [6] to formulate the k-modes subspace clustering algorithm. The CYP2D6 data of Hosking et al. [18] is used to test the algorithm. Four functional CYP2D6 polymorphisms predicting 99% of slow metabolizers were typed on 1018 Caucasians, and 41 predicted slow metabolizers were identified. Therefore, 977 are called normal. A full description of the data, the database IDs and primers for the marker
SNPs are given in Hosking et al. [18]. We produced 100 clustering results for the k-modes algorithm and for EWKM with different γ. Here γ is the parameter used to control the number of relevant SNPs included in a cluster; for details, see [?]. If we consider the clustering accuracy as a good clustering result, the chance to obtain the good result by employing EWKM is close to 60% in a large range of γ values [02,70], which is better than the k-modes algorithm. The distribution of the clustering accuracies is shown in Table 1.

Table 1. Distribution of accuracies in 100 runs for the k-modes and EWKM algorithms. (Columns: clustering accuracy; k-modes; EWKM (γ).)

Our preliminary results show that useful information can be obtained in case-control association studies by using a subspace clustering method to analyze high-dimensional categorical data from multiple SNPs. The method has the advantage of being almost entirely atheoretical, with no attempt to model the population history which has produced disease-related haplotypes. It might therefore be less sensitive than other methods to map errors or model violations. The preliminary investigations we have carried out show that subspace clustering methods can provide a simple and practical method for dealing with the multilocus genotypes which are obtained from standard case-control studies. The proposed method allows multiple markers to be analyzed simultaneously, even when haplotypes are unavailable, and does not rely on any model of population history or any genetic map to account for present patterns of linkage disequilibrium.

References

[1] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park, Fast algorithms for projected clustering, Proc. ACM SIGMOD, pp. 61-72, 1999.
[2] C. Aggarwal and P. Yu, Finding generalized projected clusters in high dimensional spaces, Proc. ACM SIGMOD, pp. 70-81, 2000.
[3] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic subspace clustering of high dimensional data mining
applications, Proc. ACM SIGMOD, 1998.
[4] S. Ao, K. Yip, M. Ng, D. Cheung, P. Yee, I. Melhado and P. Sham, CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs, Bioinformatics, vol. 21, 2005.
[5] A. Brookes, The essence of SNPs, Gene, vol. 234.
[6] Y. Chan, W. Ching, M. Ng, Z. Huang, An optimization algorithm for clustering using weighted dissimilarity measures, Pattern Recognition, vol. 37, no. 5, 2004.
[7] A. Chaturvedi, P. Green and J. Carroll, K-modes clustering, Journal of Classification, vol. 18, pp. 35-55, 2001.
[8] C. H. Cheng, A. W. Fu, and Y. Zhang, Entropy-based subspace clustering for mining numerical data, Proc. of the 5th ACM SIGKDD International Conference on Knowledge and Data Mining, pp. 84-93, 1999.
[9] F. Collins, L. Brooks and A. Chakravarti, A DNA polymorphism discovery resource for research on human genetic variation, Genome Research, vol. 8, 1998.
[10] A. Collins and N. Morton, Mapping a disease locus by allelic association, Proc. Natl. Acad. Sci. USA, vol. 95, 1998.
[11] B. Devlin and N. Risch, A comparison of linkage disequilibrium measures for fine-scale mapping, Genomics, vol. 29, 1995.
[12] C. Domeniconi, Locally adaptive techniques for pattern classification, Dissertation for Doctor of Philosophy, 2002.
[13] C. Domeniconi, D. Papadopoulos, D. Gunopulos, and S. Ma, Subspace clustering of high dimensional data, Proc. of SIAM International Conference on Data Mining, 2004.
[14] J. Friedman and J. Meulman, Clustering objects on subsets of attributes, J. R. Statist. Soc. B, vol. 66, no. 4, 2004.
[15] H. Frigui and O. Nasraoui, Unsupervised learning of prototypes and attribute weights, Pattern Recognition, vol. 37, no. 3, 2004.
[16] H. Frigui and O. Nasraoui, Simultaneous clustering and dynamic keyword weighting for text documents, Survey of Text Mining, Michael Berry, Ed., Springer, pp. 45-70, 2004.
[17] S. Goil, H. Nagesh, and A. Choudhary, MAFIA: Efficient and scalable subspace clustering for very large data sets, Technical Report CPDC-TR, Northwestern University, 1999.
[18] L. Hosking, R. Boyd, C. Xu, M. Nissum, K. Cantone, I. Purvis, R. Khakhar, M. Barnes, U. Liberwirth, K. Hagen-Mann, M. Ehm, J. Riley, Linkage disequilibrium mapping identifies a 390 kb region associated with CYP2D6 poor drug metabolising activity, Pharmacogenomics J., vol. 2, 2002.
[19] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, vol. 2, no. 3, 1998.
[20] Z. Huang and M. Ng, A fuzzy k-modes algorithm for clustering categorical data, IEEE Transactions on Fuzzy Systems, vol. 7, no. 4, 1999.
[21] Z. Huang and M. Ng, A note on k-modes clustering, Journal of Classification, vol. 20, 2003.
[22] L. Jing, M. Ng, J. Xu, and Z. Huang, Subspace clustering of text documents with feature weighting k-means algorithm, PAKDD, 2005.
[23] J. Lam, K. Roeder and B. Devlin, Haplotype fine mapping by evolutionary trees, Am. J. Hum. Genet., vol. 66, 2000.
[24] M. McPeek and A. Strahs, Assessment of linkage disequilibrium by the decay of haplotype sharing with application to fine-scale genetic mapping, Am. J. Hum. Genet., vol. 65, 1999.
[25] P. Sham, J. Zhao and D. Curtis, The effect of marker characteristics on the power to detect linkage disequilibrium due to single or multiple ancestral mutations, Ann. Hum. Genet., vol. 64, 2000.
[26] K. Woo and J. Lee, FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting, PhD thesis, Korea Advanced Institute of Science and Technology, Taejon, Korea, 2002.
[27] M. Xiong and L. Jin, Comparison of the power and accuracy of biallelic and microsatellite markers in population-based gene-mapping methods, Am. J. Hum. Genet., vol. 64, 1999.
[28] J. Yang, W. Wang, H. Wang, and P. Yu, δ-clusters: capturing subspace correlation in a large data set, Proc. 18th International Conference on Data Engineering, 2002.
[29] K. Yip, D. Cheung, and M. Ng, A practical projected clustering algorithm, IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, 2004.
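
Illustrative sketch of the k-modes iteration of Section 2. The code below is a minimal Python sketch, not the authors' implementation. It shows the simple matching dissimilarity (1), the assignment step (fix Z, minimize over W), the frequency-based mode update (fix W, minimize over Z), and the cost F in (2). All function names (`matching_dissimilarity`, `assign`, `update_modes`, `cost`, `k_modes`) and the toy genotype table are hypothetical. Two simplifying assumptions are made: the stopping test is applied once per full sweep rather than after each half-step as in the algorithm statement, and a cluster that becomes empty simply keeps its previous mode, so constraint (5) is not enforced.

```python
from collections import Counter


def matching_dissimilarity(x, y):
    """Simple matching dissimilarity (1): number of loci with different genotypes."""
    return sum(a != b for a, b in zip(x, y))


def assign(individuals, modes):
    """Fix Z, minimize over W: each individual joins its nearest mode.

    Ties are broken in favour of the lowest cluster index (an assumption).
    """
    return [
        min(range(len(modes)), key=lambda l: matching_dissimilarity(modes[l], x))
        for x in individuals
    ]


def update_modes(individuals, labels, modes):
    """Fix W, minimize over Z: per locus, take the most frequent genotype."""
    new_modes = []
    for l, old_mode in enumerate(modes):
        members = [x for x, lab in zip(individuals, labels) if lab == l]
        if not members:
            # Assumption: an empty cluster keeps its previous mode.
            new_modes.append(old_mode)
            continue
        new_modes.append(
            tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))
        )
    return new_modes


def cost(individuals, labels, modes):
    """Objective F(W, Z) in (2) for a hard assignment given as labels."""
    return sum(matching_dissimilarity(modes[lab], x)
               for x, lab in zip(individuals, labels))


def k_modes(individuals, initial_modes, max_iter=100):
    """Alternate mode updates and reassignments until F stops decreasing."""
    modes = [tuple(z) for z in initial_modes]
    labels = assign(individuals, modes)
    best = cost(individuals, labels, modes)
    for _ in range(max_iter):
        new_modes = update_modes(individuals, labels, modes)
        new_labels = assign(individuals, new_modes)
        new_cost = cost(individuals, new_labels, new_modes)
        if new_cost >= best:  # no further decrease: stop
            break
        modes, labels, best = new_modes, new_labels, new_cost
    return labels, modes, best


# Toy genotype table (made-up data): rows are individuals, columns are loci.
data = [
    ("11", "13", "33"),
    ("11", "13", "33"),
    ("11", "33", "33"),
    ("33", "11", "13"),
    ("33", "11", "13"),
    ("13", "11", "13"),
]
labels, modes, total_cost = k_modes(data, initial_modes=[data[0], data[3]])
print(labels, modes, total_cost)
```

In a case-control setting, the resulting cluster labels could then be cross-tabulated against case/control status; how the paper maps clusters to a validation accuracy is not specified in the text, so that step is left out of the sketch.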
