9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology


I9 Introduction to Bioinformatics, Clustering algorithms
Yuzhen Ye
School of Informatics & Computing, IUB

Outline
- Data mining tasks
  - Predictive tasks vs descriptive tasks
  - Example tasks: cluster analysis, predictive modeling, association analysis, anomaly detection
- Clustering algorithms
  - Hierarchical Clustering (UPGMA)
  - K-Means Clustering

Data mining tasks: predictive vs descriptive tasks
- Predictive tasks: predict the value of a particular attribute (the target variable) based on the values of other attributes
- Descriptive tasks: derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data

Predictive modeling
- The goal of predictive modeling is to learn a model that minimizes the error between the predicted and true values of the target variable
- Classification: for discrete target variables
- Regression: for continuous target variables

Cluster analysis
- The goal is to find groups of similar observations/objects
- Many clustering algorithms have been developed

Applications of clustering in biology
- Viewing and analyzing vast amounts of biological data as a whole set can be perplexing
- The data are easier to interpret if they are partitioned into clusters that combine similar data points
- Clustering of genes/proteins
  - UPGMA is applied to build the guide tree for multiple sequence alignments
  - Distance-based methods for phylogenetic reconstruction

Clustering of microarray data
- Plot each datum (gene) as a point in N-dimensional space
- Make a distance matrix holding the distance between every pair of gene points in the N-dimensional space
- Genes separated by a small distance share the same expression characteristics and might be functionally related or similar
- Clustering reveals groups of functionally related genes
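The distance-matrix step for microarray data can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and plain Python lists; the function names `euclidean` and `distance_matrix` and the toy profiles are made up for this example and do not come from the slides.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(profiles):
    """Symmetric matrix of pairwise distances between all profiles."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Distance is symmetric, so fill both halves at once.
            d[i][j] = d[j][i] = euclidean(profiles[i], profiles[j])
    return d

# Toy profiles (made-up values): 3 genes measured under N = 2 conditions.
profiles = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
d = distance_matrix(profiles)
```

In this toy matrix, genes 0 and 2 are the closest pair, so a distance-based method would tend to group them first.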

Clustering of microarray data (cont'd)
Two key factors:
1) What distance measure is used
2) What principle is used to construct clusters

Measures of similarity and dissimilarity (distance)
- There are many different ways of calculating similarity and distance
- Knowing your data is important
- When working with distances, pay attention to three properties: positivity, symmetry, and the triangle inequality

Distance measure
- Correlation coefficient (between two variables X and Y)
  - Pearson correlation coefficient (sensitive to outliers): the covariance of X and Y divided by the square root of the product of their variances,
    $r = \mathrm{cov}(X, Y) / \sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}$
    Values range from -1 to 1; a value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases
  - Spearman correlation coefficient (nonparametric version; uses the ranks)
  - Absolute value of the correlation coefficient, $|r|$
- Euclidean distance

Homogeneity and separation principles
- Homogeneity: elements within a cluster are close to each other
- Separation: elements in different clusters are further apart from each other
- Clustering is not an easy task! Given a set of points, a clustering algorithm might split them into two distinct clusters in more than one way

Bad clustering
- This clustering violates both the Homogeneity and Separation principles
- Close distances between points in separate clusters
- Far distances between points in the same cluster

Good clustering
- This clustering satisfies both the Homogeneity and Separation principles
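The two correlation coefficients listed above can be sketched in a few lines. This is an illustrative sketch, not code from the course: the helper names `pearson`, `ranks`, and `spearman` are assumptions, tied values are given their average rank, and the input vectors are toy values.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy data (made-up): y_cubic grows monotonically with x but not linearly.
x = [1.0, 2.0, 3.0, 4.0]
y_linear = [2.0, 4.0, 6.0, 8.0]
y_cubic = [1.0, 8.0, 27.0, 64.0]
```

On the monotonic but nonlinear pair, the rank-based coefficient reaches 1 while Pearson stays below 1, which is one reason to know your data before choosing a measure.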

Clustering techniques
- Hierarchical clustering
  - Agglomerative: start with every element in its own cluster, and iteratively join clusters together
  - Divisive: start with one cluster and iteratively divide it into smaller clusters
- Hierarchical: organize elements into a tree; leaves represent genes, and the length of the path between two leaves represents the distance between the objects (genes, etc.). Similar objects lie within the same subtrees

Hierarchical clustering (cont'd)
- Hierarchical clustering is often used to reveal evolutionary history

Hierarchical clustering algorithm
 1. Hierarchical Clustering (d, n)
 2.   Form n clusters, each with one element
 3.   Construct a graph T by assigning one vertex to each cluster
 4.   while there is more than one cluster
 5.     Find the two closest clusters C1 and C2
 6.     Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
 7.     Compute the distance from C to all other clusters
 8.     Add a new vertex C to T and connect it to vertices C1 and C2
 9.     Remove the rows and columns of d corresponding to C1 and C2
10.     Add a row and column to d corresponding to the new cluster C
11.   return T

The algorithm takes an n x n matrix d of pairwise distances as input.
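A compact sketch of the agglomerative procedure is given here under stated assumptions. It uses the average distance between all pairs of elements as the cluster-to-cluster distance, recomputes that distance from the original matrix instead of updating rows and columns of d in place, and returns the tree T as nested tuples. The names `average_linkage` and `hierarchical_clustering` and the toy matrix are illustrative only.

```python
from itertools import combinations

def average_linkage(c1, c2, d):
    """Average distance over all pairs (x in c1, y in c2)."""
    return sum(d[x][y] for x in c1 for y in c2) / (len(c1) * len(c2))

def hierarchical_clustering(d, linkage=average_linkage):
    """Merge the two closest clusters until one remains.

    Returns the tree as nested tuples whose leaves are element indices.
    Assumes d is a non-empty square matrix of pairwise distances.
    """
    # Each cluster is (subtree, list of member indices).
    clusters = [(i, [i]) for i in range(len(d))]
    while len(clusters) > 1:
        # Find the two closest clusters under the chosen linkage.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]][1], clusters[p[1]][1], d),
        )
        (tree_a, members_a), (tree_b, members_b) = clusters[a], clusters[b]
        merged = ((tree_a, tree_b), members_a + members_b)
        # Drop the two merged clusters and add the new one.
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]

# Toy distance matrix (made-up): elements 0,1 are close, and so are 2,3.
d = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
tree = hierarchical_clustering(d)
```

Recomputing linkage from the original matrix keeps the sketch short at the cost of extra work per iteration; updating d in place, as in steps 9 and 10, avoids that repeated computation.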
Different ways of defining the distance between clusters may lead to different clusterings.

Hierarchical clustering: recomputing distances
- Single-linkage: the distance between two clusters is the smallest distance between any pair of their elements
  $d_{min}(C, C^*) = \min_{x \in C,\, y \in C^*} d(x, y)$
- Complete-linkage: the distance between two clusters is the largest distance between any pair of their elements
  $d_{max}(C, C^*) = \max_{x \in C,\, y \in C^*} d(x, y)$
- Average-linkage: the distance between two clusters is the average distance between all pairs of their elements
  $d_{avg}(C, C^*) = \frac{1}{|C|\,|C^*|} \sum_{x \in C,\, y \in C^*} d(x, y)$

K-Means clustering problem
- Input: a set V consisting of n points, and a parameter k
- Output: a set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V, X) over all possible choices of X
- Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point in X
- Given a set of n data points $V = \{v_1, \ldots, v_n\}$ and a set of k points X, define the squared error distortion
  $d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
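The squared error distortion can be computed directly from its definition. The sketch below assumes points are tuples of coordinates and uses `math.dist` (Python 3.8 or later); the function names and the toy points are illustrative assumptions.

```python
import math

def dist_to_centers(v, centers):
    """d(v, X): Euclidean distance from v to the closest center in X."""
    return min(math.dist(v, x) for x in centers)

def squared_error_distortion(points, centers):
    """d(V, X): mean squared distance from each point to its closest center."""
    return sum(dist_to_centers(v, centers) ** 2 for v in points) / len(points)

# Toy points (made-up): two pairs of points far apart from each other.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
two_centers = [(0, 1), (10, 1)]
one_center = [(5, 1)]

good = squared_error_distortion(points, two_centers)
poor = squared_error_distortion(points, one_center)
```

With one center per pair every point lies at distance 1 from its center, so the distortion is much lower than with a single center placed between the pairs.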

1-Means clustering problem
- Input: a set V consisting of n points
- Output: a single point x (cluster center) that minimizes the squared error distortion d(V, x) over all possible choices of x
- The 1-Means clustering problem is easy. However, the problem becomes very difficult (NP-complete) for more than one center
- An efficient heuristic method for K-Means clustering is the Lloyd algorithm

K-Means clustering: Lloyd algorithm
1. Lloyd Algorithm
2.   Arbitrarily assign the k cluster centers
3.   while the cluster centers keep changing
4.     Assign each data point to the cluster C_i corresponding to the closest cluster representative (center) (1 ≤ i ≤ k)
5.     After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster; that is, the new cluster representative is $\sum_{v \in C} v / |C|$ for every cluster C

*This may lead to merely a locally optimal clustering.

[Figure: scatter plots of data points (x), with both axes labeled "expression in condition"]
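The Lloyd iteration can be sketched as follows. This is a minimal sketch under several assumptions: the caller supplies the initial centers (standing in for the arbitrary assignment), points are tuples of coordinates, a cluster that ends up empty keeps its previous center, and `max_iter` is a safety cap added for this example. The function name `lloyd` is illustrative.

```python
import math

def lloyd(points, centers, max_iter=100):
    """Alternate assignment and center-of-gravity updates until centers stop changing."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for v in points:
            i = min(range(len(centers)), key=lambda j: math.dist(v, centers[j]))
            clusters[i].append(v)
        # Update step: new center is the mean of the cluster's points.
        new_centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # centers stopped changing
        centers = new_centers
    return centers, clusters

# Toy points (made-up): two well-separated pairs, with k = 2 starting centers.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, clusters = lloyd(points, [(0, 0), (10, 0)])
```

Because the result depends on the starting centers, a common practice is to run the procedure from several different starts and keep the clustering with the lowest distortion.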


If two genes are related (have similar functions or are co-regulated), their expression profiles should be similar (e.g. low Euclidean distance or high correlation).

However, they may have similar expression patterns only under some conditions (e.g. they respond similarly to a certain external stimulus, but each of them has some distinct functions at other times).

Similarly, for two related conditions, some genes may exhibit different expression patterns (e.g. two tumor samples of different sub-types).

As a result, each cluster may involve only a subset of genes and a subset of conditions, which together form a checkerboard structure. In reality, each gene/condition may participate in multiple clusters.

Co-clustering: simultaneous clustering of the rows and columns of a matrix
- To discover such data patterns, some biclustering methods have been proposed to cluster both genes and conditions simultaneously
- Differences from projected clustering (by observation, not by definition):
  - Projected clustering has a primary clustering target; biclustering usually treats rows and columns equally
  - Most projected clustering methods define attribute relevance based on value distances; most biclustering methods define biclusters based on other measures
  - Some biclustering methods do not have the concept of irrelevant attributes

Working with big matrices
[Table: matrix with genes g1 to gn as rows and t1 to t9 as columns; cell values not recoverable]
- Columns: time, tissue, sample, environments
- Entries: gene expression by microarray or RNAseq; gene abundance

Clustering vs classification
- In the terminology of machine learning:
  - clustering is unsupervised learning
  - classification is supervised learning
- Examples
  - Clustering: cluster protein sequences to define families
  - Classification: assign a new protein sequence to one of the PFAM families

Sample classification
- Classify samples based on gene expression pattern
  - AML: acute myeloid leukemia
  - ALL: acute lymphoblastic leukemia
  - c: idealized expression pattern
  - c*: random idealized expression pattern
- The prediction for a new sample is based on "weighted votes" of a set of informative genes (each gene votes for AML or ALL depending on its expression level)

Ref: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring (Science, 1999)
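The weighted-vote idea can be illustrated with a simplified sketch. The assumptions here are substantial: each gene's weight is taken as the difference of class means divided by the sum of class standard deviations, its decision boundary as the midpoint of the class means, and the selection of informative genes and any confidence threshold are left out. The function names and the expression values are toy inventions, not data or code from the cited study.

```python
from statistics import mean, stdev

def train(aml_samples, all_samples):
    """Per-gene (weight, boundary) from training samples of each class.

    Assumes at least two samples per class and nonzero spread per gene.
    """
    params = []
    for g in range(len(aml_samples[0])):
        a = [s[g] for s in aml_samples]
        b = [s[g] for s in all_samples]
        weight = (mean(a) - mean(b)) / (stdev(a) + stdev(b))
        boundary = (mean(a) + mean(b)) / 2
        params.append((weight, boundary))
    return params

def predict(sample, params):
    """Sum the per-gene votes; positive votes favor AML, negative favor ALL."""
    votes = [w * (x - b) for x, (w, b) in zip(sample, params)]
    v_aml = sum(v for v in votes if v > 0)
    v_all = -sum(v for v in votes if v < 0)
    return "AML" if v_aml > v_all else "ALL"

# Toy training data (made-up): 2 samples per class, 2 genes per sample.
aml_train = [[8.0, 1.0], [10.0, 3.0]]
all_train = [[2.0, 7.0], [4.0, 9.0]]
params = train(aml_train, all_train)
```

A gene whose expression separates the two classes cleanly gets a large weight, so it contributes more to the final vote than a gene with overlapping class distributions.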
