Microarray Data Analysis
Computational Biology, IST - Technical University of Lisbon
Ana Teresa Freitas, 2016/2017

Microarrays
Rows represent genes
Columns represent samples
Many problems may be solved using clustering
Example of a microarray dataset
Microarray data
G_i: expression levels of gene i, across all samples
S_j: expression levels of all genes, for one sample
Typical examples of samples: heat shock, phases in the cell cycle, cancer, normal, ...

Microarray data (genes x mRNA samples):

        sample1  sample2  sample3  sample4  sample5
gene1     0.46     0.30     0.80     1.51     0.90
gene2    -0.10     0.49     0.42     0.06     0.46
gene3     0.15     0.74     0.04     0.10     0.20
gene4    -0.45    -1.03    -0.79    -0.56    -0.32
gene5    -0.06     1.06     1.35     1.09    -1.09

Entry (i, j) = expression level of gene i in mRNA sample j:
log2(treated-exp-value / control-exp-value)
What do we actually measure?
We measure the signal of the cDNA target(s) which hybridize(s) to the probe (plus backgrounds, ratios, standard deviations, dust, etc.)
What do we wish to know (an abstraction)?
[mRNA]_1a, [mRNA]_1b, ..., [mRNA]_Na, [mRNA]_Nb
where N = number of genes; a and b = different colors
Factors with an impact on the signal level:
Amount of mRNA
Labeling efficiencies
Quality of the RNA
Laser/dye combination
Detection efficiency of the photomultiplier
Typical Assumption
[mRNA]_{n,a} is proportional to signal_{n,a}:
[mRNA]_{n,a} = k * signal_{n,a}
k = normalization constant, n = gene index, a = color

Low-level analysis
Image analysis: computation of probe intensities/signals
Normalization: the attempt to compensate for systematic technical differences between chips, in order to see more clearly the systematic biological differences between samples. Statisticians use the term 'bias' to describe systematic errors, which affect a large number of genes.
Normalization - Sources of Systematic Errors
Different incorporation efficiency of dyes
Different amounts of mRNA
Experimenter/protocol issues (comparing chips processed by different labs)
Different scanning parameters
Batch bias

Normalization - Two problems:
How to detect biases? Which genes to use for estimating biases among chips?
How to remove the biases?
Which genes to use for bias detection? All genes on the chip
Assumption: most of the genes are equally expressed in the compared samples; the proportion of differential genes is low (<20%).
Limits:
Not appropriate when comparing highly heterogeneous samples (different tissues)
Not appropriate for the analysis of dedicated chips (apoptosis chips, inflammation chips, etc.)

Which genes to use for bias detection? Housekeeping genes
Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples
Affy novel chips: normalization set of 100 genes
NHGRI's cDNA microarrays: set of 70 "housekeeping" genes
Limits:
The validity of the assumption is questionable
Housekeeping genes are usually expressed at high levels, hence not informative for the low-intensity range
Normalization methods
Global normalization (Scaling): forces the chips to have equal mean (median) intensity
Intensity-dependent normalization (Lowess): forces equal means at all intensities
Quantile normalization: forces the chips to have identical intensity distributions

Quantile Normalization
Sort each column in the data matrix according to the genes' (probes') intensities in each chip
Compute the mean intensity at each rank across the chips
Replace each intensity by the mean intensity at its rank
Re-order the columns to their original state, so that each row again corresponds to a gene
(figure: Chip #1, Chip #2, Chip #3 -> Average chip)
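The four quantile-normalization steps above can be sketched in a few lines of numpy (a minimal sketch; the function name is mine, and ties are broken arbitrarily rather than averaged, which production implementations handle more carefully):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile normalization: force every column (chip) of X to have
    the identical intensity distribution, namely the 'average chip'."""
    # Step 1: rank of each intensity within its own chip
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    # Step 2: mean intensity at each rank across chips (the average chip)
    mean_at_rank = np.sort(X, axis=0).mean(axis=1)
    # Steps 3-4: replace each intensity by the mean at its rank,
    # restoring the original row (gene) order
    return mean_at_rank[ranks]

chips = np.array([[5.0, 4.0, 3.0],
                  [2.0, 1.0, 4.0],
                  [3.0, 4.0, 6.0],
                  [4.0, 2.0, 8.0]])
normalized = quantile_normalize(chips)
# After normalization every chip contains exactly the same set of values,
# so all chips have identical intensity distributions.
```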
Quantile Normalization
(figure: intensity distributions before and after normalization)

What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis: grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications:
As a stand-alone tool to get insight into the data distribution
As a preprocessing step for other algorithms
Things to study (1)
Clustering (grouping) genes: i.e., finding groups of co-regulated genes
Example: expression levels across time of two clusters of co-regulated genes

Things to study (2)
Clustering (grouping) samples: i.e., finding groups of samples with similar genetic profiles (e.g., cancer types).
Groups of similar behaviour?
Things to study (3)
Classifying genes: i.e., deciding if a gene is co-regulated with some known gene(s), based on their expression profiles across samples.
(figure: annotated gene 1, annotated gene 2, unknown gene)
Co-regulation? Similar biological function? Same transcription factor?

Things to study (4)
Classifying samples: i.e., classifying new samples, based on a set of classified samples (example: cancer versus normal; different types of cancer).
(figure: classified samples A and B vs. samples to be classified)
Things to study (5)
Selecting genes:
a) Deciding if a given gene, in isolation, behaves differently in a control versus an experimental situation (e.g., cancer vs. normal, two types of cancer, treatment vs. non-treatment).
b) Selecting which group of genes is significantly different in a control versus an experimental situation (same examples).
c) Selecting which group of genes is relevant for a given classification problem.

Clustering methods
Similarity-based (need a similarity function)
Construct a partition
Agglomerative, bottom-up
Searching for an optimal partition
Typically hard clustering
Model-based (latent models, probabilistic or algebraic)
First compute the model
Clusters are obtained easily after having a model
Typically soft clustering
Similarity-based clustering
Define a similarity function to measure the similarity between two objects
Common criteria: find a partition that
Maximizes intra-cluster similarity
Minimizes inter-cluster similarity
Two ways to construct the partition:
Hierarchical (e.g., agglomerative hierarchical clustering)
Search starting from a random partition (e.g., K-means)

Agglomerative Hierarchical Clustering
Given a similarity function to measure the similarity between two objects
Gradually group similar objects together in a bottom-up fashion
Stop when some stopping criterion is met
Variations: different ways to compute group similarity based on individual object similarities
Distance Metrics
For clustering algorithms, the calculation of a distance between gene vectors or experiment vectors is a necessary step.
Distance metrics can be classified as:
Metric distances
Semi-metric distances
Metric distances satisfy:
1. d_ab >= 0
2. d_ab = d_ba
3. d_aa = 0
4. d_ab <= d_ac + d_cb (triangle inequality)
Semi-metric distances obey 1) to 3), but fail 4).

Distance Metrics
Minkowski distance:
d(i, j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)
If q = 1, d is the Manhattan distance
If q = 2, d is the Euclidean distance
Both are metric distances.
Distance Metrics
Pearson correlation coefficient (semi-metric distance):
r(i, j) = sum_k (x_ik - mean_i)(x_jk - mean_j) / sqrt( sum_k (x_ik - mean_i)^2 * sum_k (x_jk - mean_j)^2 )
-1 <= r(i, j) <= +1; a distance can be defined as d(i, j) = 1 - r(i, j)

Distance Metrics
Entropy-based distances: Mutual Information (semi-metric distance)
Mutual Information (MI) is a statistical representation of the correlation of two signals A and B.
MI is a measure of the additional information known about one expression pattern when given another.
MI is not based on linear models and can therefore also detect non-linear dependencies.
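The Minkowski and correlation-based distances above can be sketched directly in numpy (function names are mine):

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance; q=1 gives Manhattan, q=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def pearson_distance(x, y):
    """Distance derived from the Pearson correlation coefficient:
    0 for perfectly correlated profiles, 2 for anti-correlated ones."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(minkowski(x, y, 1))      # Manhattan: 1 + 2 + 3 = 6
print(minkowski(x, y, 2))      # Euclidean: sqrt(1 + 4 + 9)
print(pearson_distance(x, y))  # y = 2x is perfectly correlated -> ~0
```

Note how the Pearson distance is 0 even though the Euclidean distance is large: correlation compares the shape of two expression profiles, not their magnitudes.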
Similarity-induced Structure

How to Compute Group Similarity?
Three popular methods, given two groups g1 and g2:
Single-link algorithm: s(g1, g2) = similarity of the closest pair
Complete-link algorithm: s(g1, g2) = similarity of the farthest pair
Average-link algorithm: s(g1, g2) = average similarity over all pairs
Comparison of the Three Methods
Single-link: loose clusters; individual decision, sensitive to outliers
Complete-link: tight clusters; individual decision, sensitive to outliers
Average-link: in between; group decision, insensitive to outliers
Which one is the best? It depends on what you need!

Hierarchical (agglomerative) clustering
Strictly speaking, agglomerative clustering does not produce clusters, but a dendrogram.
(figure: a dendrogram over five objects; the vertical axis shows dissimilarity)
Cutting the dendrogram at a certain level yields clusters. Dendrogram cutting is a problem analogous to the selection of K in K-means clustering.
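The single-link variant can be sketched with a brute-force bottom-up merge loop (the function name is mine; real analyses would use an optimized library routine, but the merge history returned here is exactly the information a dendrogram encodes):

```python
import numpy as np

def single_link_merges(X):
    """Bottom-up single-link clustering: repeatedly merge the two
    clusters whose closest pair of members is nearest."""
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: distance of the closest pair of members
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Two tight pairs far apart: the pairs merge first (at distance 1),
# then the two pairs merge at a much larger distance.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
for left, right, dist in single_link_merges(X):
    print(left, right, round(dist, 2))
```

Cutting the resulting merge history between the small and the large merge distances recovers the two intuitive clusters.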
Example of agglomerative gene clustering (Eisen et al., 1998)
Microarray data from a time course of serum stimulation of primary human fibroblasts.
Experiment: foreskin fibroblasts were grown in culture and were deprived of serum for 48 hr. Serum was added back and samples were taken at time 0, 15 min, 30 min, 1 hr, 2 hr, 3 hr, 4 hr, 8 hr, 12 hr, 16 hr, 20 hr, 24 hr.
Clustering: agglomerative clustering, correlation coefficient + average-link.
Clusters with a biological interpretation: (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate-early response, (D) signalling and angiogenesis, (E) wound healing and tissue remodelling.

Data Structures
Data matrix (n objects x p features):

  x_11 ... x_1f ... x_1p
   :        :        :
  x_i1 ... x_if ... x_ip
   :        :        :
  x_n1 ... x_nf ... x_np

Dissimilarity matrix (n x n, lower triangular):

  0
  d(2,1)  0
  d(3,1)  d(3,2)  0
   :       :
  d(n,1)  d(n,2)  ...  0
Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters
Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimum: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
Step 1: Partition the objects into k nonempty subsets
Step 2: Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
Step 3: Assign each object to the cluster with the nearest seed point
Step 4: Go back to Step 2; stop when there are no more new assignments
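The four k-means steps above can be sketched in numpy (a minimal version; the function name is mine, and initial centers are passed in explicitly rather than chosen at random):

```python
import numpy as np

def k_means(X, centers, max_iter=100):
    """Minimal k-means: assign each object to the nearest seed point,
    recompute centroids, repeat until assignments stop changing."""
    centers = np.asarray(centers, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no reassignment: converged (often only a local optimum)
        labels = new_labels
        # Step 2: recompute each centroid as the mean of its cluster
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 4.0], [6.0, 5.0]])
labels, centers = k_means(X, centers=[[1.0, 1.0], [5.0, 4.0]])
print(labels)   # the two low points and the two high points separate
print(centers)  # centroids move to the cluster means
```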
The K-Means Clustering Method - Example
(figure: K-means iterations for K = 2 - arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; repeat until stable)

Comments on the K-Means Method
Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
For comparison: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing.
Weaknesses:
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
A few variants of k-means differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes (Huang, 1998)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: the k-prototype method

What is the problem with the k-means method?
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster.
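The mean-versus-medoid difference is easy to see numerically (a sketch; the `medoid` helper name is mine):

```python
import numpy as np

def medoid(X):
    """The medoid: the actual data object minimizing the total
    distance to all other objects in the cluster."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return X[dists.sum(axis=1).argmin()]

cluster = np.array([[1.0], [2.0], [3.0], [100.0]])  # one extreme outlier
print(cluster.mean(axis=0))  # mean is dragged far away from the bulk
print(medoid(cluster))       # medoid stays with the data
```

The single outlier pulls the mean to 26.5, while the medoid remains one of the central objects, which is exactly why k-medoids is more robust than k-means.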
Problem 1
Consider the following expression matrix, where the expression levels of 2 genes (G1 and G2) were analyzed in 7 healthy/infected tissues (conditions C1 to C7). Consider also the problem of grouping tissues given the expression profiles of the genes, using clustering algorithms.
Determine the dendrogram found by a hierarchical clustering algorithm (HCA) using a bottom-up approach, the Euclidean distance to compute the distance between conditions, and the single-link distance to compute the distance between groups (inter-cluster distance). How would you use the dendrogram to group the tissues into clusters, and which would those clusters be?
Determine the groups found by the K-means (K=2) algorithm when the centroids are initialized with C5 = (4,3) and C6 = (1,1).

Biclustering: Motivation
Gene expression matrices have been extensively analyzed using clustering in one of two dimensions:
The gene dimension
The condition dimension
This corresponds to the:
Analysis of the expression patterns of genes, by comparing rows in the matrix.
Analysis of the expression patterns of samples, by comparing columns in the matrix.
Biclustering: Motivation
Common objectives pursued when analyzing gene expression data include:
1. Grouping of genes according to their expression under multiple conditions.
2. Classification of a new gene, given its expression and the expression of other genes with known classification.
3. Grouping of conditions based on the expression of a number of genes.
4. Classification of a new sample, given the expression of the genes under that experimental condition.

What is Biclustering?
Biclustering = simultaneous clustering of both rows and columns of a data matrix.
The concept can be traced back to the 70s (Hartigan, 1972), although it has been rarely used or studied.
The term was introduced by Cheng and Church (2000), who were the first to use it in gene expression data analysis.
The technique is also used in other fields, such as collaborative filtering, information retrieval and data mining.
What is Biclustering?
We consider an n by m data matrix, A = (X, Y), where
X = {x_1, ..., x_n} = set of n rows
Y = {y_1, ..., y_m} = set of m columns
a_ij = numeric value (discrete or real) representing the relation between row i and column j.
In the case of gene expression matrices:
X = set of genes
Y = set of conditions
a_ij = expression level of gene i under condition j (real value).

What is Biclustering?
Gene expression matrix A = (X, Y):

            Condition 1 ... Condition j ... Condition m
  Gene 1       a_11    ...    a_1j     ...    a_1m
   :            :              :               :
  Gene i       a_i1    ...    a_ij     ...    a_im
   :            :              :               :
  Gene n       a_n1    ...    a_nj     ...    a_nm
What is Biclustering?
Given the matrix A = (X, Y):
I = subset of rows
J = subset of columns
(I, Y) = a subset of rows that exhibit similar behavior across the set of all columns = a cluster of rows
(X, J) = a subset of columns that exhibit similar behavior across the set of all rows = a cluster of columns

What is Biclustering?
(I, J) = a subset of rows and a subset of columns, where the rows exhibit similar behavior across the columns and vice-versa
= the sub-matrix of A that contains only the elements a_ij with row set I and column set J
= a bicluster
We want to identify a set of biclusters B_k = (I_k, J_k). Each bicluster B_k must satisfy some specific homogeneity characteristics.
What is Biclustering?
(figure: a 6 x 10 expression matrix with rows X = {G_1, ..., G_6} and columns Y = {C_1, ..., C_10};
a cluster of rows (I, Y) with I = {G_2, G_3, G_4};
a cluster of columns (X, J) with J = {C_4, C_5, C_6};
and the bicluster (I, J) = ({G_2, G_3, G_4}, {C_4, C_5, C_6}))

What is Biclustering?
Biclustering goals:
Perform simultaneous clustering on the row and column dimensions of the gene expression matrix, instead of clustering the rows and columns separately.
Identify sub-matrices (subsets of rows and subsets of columns) with interesting properties.
Gene expression data analysis: identify subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition.
Madeira, Sara C. and Oliveira, Arlindo L., "Biclustering Algorithms for Biological Data Analysis: A Survey", IEEE/ACM Trans. Comput. Biol. Bioinformatics, January 2004.
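In numpy terms, the bicluster ({G2, G3, G4}, {C4, C5, C6}) from the example is just the sub-matrix of A indexed by I and J (a sketch with stand-in values):

```python
import numpy as np

# Stand-in 6x10 expression matrix (genes G1..G6, conditions C1..C10)
A = np.arange(60, dtype=float).reshape(6, 10)

I = [1, 2, 3]   # rows    G2, G3, G4  (0-based indices)
J = [3, 4, 5]   # columns C4, C5, C6  (0-based indices)

bicluster = A[np.ix_(I, J)]   # the sub-matrix (I, J)
print(bicluster.shape)        # (3, 3)
```

`np.ix_` builds the cross-product index, so every selected row is restricted to every selected column, which is exactly the definition of the bicluster (I, J).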
Bicluster Types
An interesting criterion to evaluate a biclustering algorithm concerns the type of biclusters the algorithm is able to find.
There are four major classes of biclusters:
1. Biclusters with constant values.
2. Biclusters with constant values on rows or columns.
3. Biclusters with coherent values.
4. Biclusters with coherent evolutions.

Constant Values
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
Constant Values on Rows or Columns

Constant rows:          Constant columns:
1.0 1.0 1.0 1.0         1.0 2.0 3.0 4.0
2.0 2.0 2.0 2.0         1.0 2.0 3.0 4.0
3.0 3.0 3.0 3.0         1.0 2.0 3.0 4.0
4.0 4.0 4.0 4.0         1.0 2.0 3.0 4.0

Coherent Values

Additive model:         Multiplicative model:
1.0 2.0 5.0 0.0         1.0 2.0 0.5 1.5
2.0 3.0 6.0 1.0         2.0 4.0 1.0 3.0
4.0 5.0 8.0 3.0         4.0 8.0 2.0 6.0
5.0 6.0 9.0 4.0         3.0 6.0 1.5 4.5
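The two coherent-value biclusters above can be regenerated from row and column effects, which is exactly what the additive and multiplicative models state (the effect vectors below are read off the example matrices):

```python
import numpy as np

# Additive model: a_ij = row_i + col_j
rows_add = np.array([1.0, 2.0, 4.0, 5.0])
cols_add = np.array([0.0, 1.0, 4.0, -1.0])
additive = rows_add[:, None] + cols_add[None, :]

# Multiplicative model: a_ij = row_i * col_j
rows_mul = np.array([1.0, 2.0, 4.0, 3.0])
cols_mul = np.array([1.0, 2.0, 0.5, 1.5])
multiplicative = rows_mul[:, None] * cols_mul[None, :]

print(additive[0])        # [1. 2. 5. 0.] -- first row of the additive example
print(multiplicative[0])  # [1.  2.  0.5 1.5]
```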
Coherent Evolutions
(figures: Overall Coherent Evolution; Coherent Evolution on the Rows)

Coherent Evolutions
(figures: Coherent Evolution on the Columns; Order-Preserving Sub-Matrix (OPSM))
Algorithms
In the simplest case, a bicluster corresponds to a biclique in the corresponding bipartite graph.
Finding a maximum-size bicluster is then equivalent to finding the maximum edge biclique in a bipartite graph. This problem is known to be NP-complete (Peeters, 2003).
More complex cases, where the actual numeric values in the matrix A are taken into account to compute the quality of a bicluster, have a complexity that is necessarily no lower than this simpler case.
Algorithms
Given this, the large majority of the algorithms use heuristic approaches to identify biclusters. In many cases the algorithm is preceded by a normalization step applied to the data matrix, whose goal is to make the patterns of interest more evident. Some algorithms avoid heuristics but exhibit an exponential worst-case runtime.

Algorithms
Different objectives:
Identify one bicluster.
Identify a given number of biclusters.
Different approaches:
Discover one bicluster at a time.
Discover one set of biclusters at a time.
Discover all biclusters at the same time (simultaneous bicluster identification).

Algorithms: Heuristic Approaches
Iterative row and column clustering combination:
Apply clustering algorithms to the rows and columns of the data matrix, separately.
Combine the results using some sort of iterative procedure that combines the two cluster arrangements.
Divide and conquer:
Break the problem into several subproblems that are similar to the original problem but smaller in size.
Solve the subproblems recursively.
Algorithms: Heuristic Approaches
Divide and conquer (continued):
Combine the intermediate solutions to create a solution to the original problem.
Usually break the matrix into submatrices (biclusters) based on a certain criterion and then continue the biclustering process on the new submatrices.
Greedy iterative search:
Always make the locally optimal choice in the hope that this choice will lead to a globally good solution.
Usually perform greedy row/column additions/removals.

Algorithms
Exhaustive bicluster enumeration:
A number of methods have been used to speed up exhaustive search.
In some cases the algorithms assume restrictions on the size of the biclusters that should be listed.
Measure cluster homogeneity
Missing values: replaced by random numbers
Find one bicluster at a time
Hide discovered biclusters using random numbers
Example
Consider the following expression matrix:
(figure: matrix A = (X, Y), with row set I and column set J)
Run the brute-force deletion and addition algorithm to find a biclustering.
Example
Run the algorithm with δ = 0 (maximum acceptable mean squared residue score) and α = 1.5 (threshold for the multiple node deletion).

Row means:        Column means:
a_1J = 5/4        a_I1 = 7/4
a_2J = 8/4        a_I2 = 4/4
a_3J = 6/4        a_I3 = 9/4
a_4J = 10/4       a_I4 = 9/4

Overall mean: a_IJ = 29/(4*4)

H(I,J) = (1/(4*4)) * [ (a_11 - a_1J - a_I1 + a_IJ)^2 + (a_12 - a_1J - a_I2 + a_IJ)^2
                     + (a_13 - a_1J - a_I3 + a_IJ)^2 + (a_14 - a_1J - a_I4 + a_IJ)^2
                     + (a_21 - a_2J - a_I1 + a_IJ)^2 + ... ] = 1.8

http://www.kemaleren.com/cheng-and-church.html
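The mean squared residue computed step by step above vectorizes naturally (a sketch; the function name is mine). A perfectly additive bicluster scores H = 0, which is why δ bounds the acceptable residue:

```python
import numpy as np

def mean_squared_residue(A):
    """Cheng & Church score H(I, J): mean over the bicluster of
    (a_ij - a_iJ - a_Ij + a_IJ)^2, using row, column and overall means."""
    row_means = A.mean(axis=1, keepdims=True)  # a_iJ
    col_means = A.mean(axis=0, keepdims=True)  # a_Ij
    overall = A.mean()                         # a_IJ
    residue = A - row_means - col_means + overall
    return (residue ** 2).mean()

# A perfectly additive bicluster has zero residue...
additive = np.array([[1.0, 2.0], [3.0, 4.0]])
print(mean_squared_residue(additive))   # 0.0

# ...while incoherent values score higher.
noisy = np.array([[1.0, 2.0], [4.0, 3.0]])
print(mean_squared_residue(noisy))      # 0.25
```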