Clustering Techniques



1 Clustering Techniques Bioinformatics: Issues and Algorithms CSE Fall 2007 Lecture 16 Lopresti Fall 2007 Lecture

2 Administrative notes Your final project / paper proposal is due on Friday, November 9 at 5:00 pm. The proposal just needs to be a couple of paragraphs telling me the problem area you plan to work on and some of the references you'll probably use. If there's a possible connection between the work you'd like to do and the topics you've heard Professor Marzillier talk about, I'll discuss your proposal with her to get her feedback and suggestions (e.g., other papers you might read, datasets you might use for testing code you develop, etc.). I'll send you feedback on your proposal by the middle of the following week; then you're off and running!

3 Outline: DNA Microarrays; Hierarchical Clustering; K-Means Clustering; Conservative & Greedy K-Means Clustering; Corrupted Cliques Problem; CAST Clustering Algorithm.

4 Applications of clustering Motivation for clustering (from a general perspective): Viewing and analyzing vast amounts of biological data in its unstructured entirety can be perplexing. It is easier to interpret data if it is organized into clusters that combine similar (i.e., related) data points. From a biological perspective, applications include: (1) analyzing data from DNA microarray experiments (expression analysis, i.e., determining which genes are switched on or off under certain conditions of interest); (2) building and understanding phylogenetic (evolutionary) trees based on genomic or other data.

5 Inferring gene functionality What's the problem? Biologists want to know the functions of newly sequenced genes. Simply comparing a new gene sequence to known DNA sequences often does not reveal the function of the new gene: for 40% of sequenced genes, functionality cannot be ascertained by comparing to sequences of known genes. Microarrays allow biologists to infer gene function even when sequence similarity alone is insufficient to infer it.

6 Life: a recipe for making proteins (source: http://www.cbs.dtu.)

7 Recall the Central Dogma (source: http://www.cbs.dtu.)

8 Hybridization is central (source: http://www.cbs.dtu.)

9 Microarrays: the concept Measure the level of transcription for a very large number of genes in a single experiment.

10 Microarrays and expression analysis Microarrays measure the activity (expression level) of genes under varying conditions and/or at different points in time. The expression level is estimated by measuring the amount of mRNA for that particular gene: A gene is active if it is being transcribed. More mRNA usually indicates more gene activity.

11 Microarrays: how?

12 Stanford microarrays: production (source: http://www.cbs.dtu.)

13 Stanford microarrays: production

14 Stanford microarrays: production Coating: 1. Rinse slides in NaOH and EtOH (2 h, shaking). 2. Wash with water. 3. Coat slides with poly-L-lysine (1 h, shaking). 4. Wash and dry. Attach probes: 1. Produce probes (oligos, cDNA library, PCR products). 2. Print using a robot.

15 Stanford microarrays: production Spotting: mechanical deposition of probes.

16 Stanford microarrays: production

17 Stanford microarrays: production

18 Stanford microarrays: production Microarrayer (source: http://www.cbs.dtu.)

19 Microarray experiments Steps: (1) Produce cDNA from mRNA (DNA is more stable). (2) Attach a phosphor to the cDNA to see when a gene is expressed. Different color phosphors are available to compare many samples at once. (3) Hybridize the cDNA over the microarray. (4) Scan the microarray with a phosphor-illuminating laser: illumination reveals transcribed genes. (5) Scan the microarray multiple times for the different color phosphors.

20 Microarray experiments (figure annotations) Phosphors can be added here instead. Then, instead of staining, laser illumination can be used.

21 Using microarrays Track a sample over a period of time to see how gene expression changes. Track two different samples under the same conditions to see differences in gene expression. (Figure: each box represents one gene's expression over time.)

22 Using microarrays Interpreting colors: Green: expressed only in the control cells. Red: expressed only in the experimental cells. Yellow: equally expressed in both samples. Black: not expressed in either control or experimental cells.

23 Microarray data What does a biologist do with microarray data? Microarray data is usually transformed into an intensity matrix. The intensity matrix allows biologists to find correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related. This is where clustering comes into play! [Table: intensity matrix with one row per gene and one column per measurement time (Time X, Time Y, Time Z); each entry is the intensity (expression level) of the gene at the measured time. Rows with similar values suggest similar behavior.]

24 Clustering microarray data Plot each gene's expression pattern as a data point in N-dimensional space. Build a matrix of distances between every two gene points. Genes separated by a small distance share the same expression pattern and might be functionally related or similar. Clustering reveals groups of functionally related genes. (Figure: different genes that express similarly. From "Cluster analysis and display of genome-wide expression patterns" by Eisen, Spellman, Brown, and Botstein, Proc. Natl. Acad. Sci. USA, Vol. 95, December.)

25 Clustering microarray data (figure panels) Intensity matrix; pairwise distance matrix; expression patterns as points in 3-D space; three different clusters.
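
The step from intensity matrix to pairwise distance matrix can be sketched in a few lines of Python. This is only an illustration under assumptions not stated on the slides: distances are Euclidean, the helper name `pairwise_distances` is made up here, and the intensity values are hypothetical.

```python
import math

def pairwise_distances(intensity):
    """Return the n x n matrix of Euclidean distances between the rows
    (one row per gene) of an intensity matrix."""
    n = len(intensity)
    d = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            # The distance matrix is symmetric, so fill both halves at once.
            d[a][b] = d[b][a] = math.dist(intensity[a], intensity[b])
    return d

# Hypothetical intensity matrix: 3 genes measured at 3 time points.
intensity = [
    [10.0, 8.0, 10.0],
    [10.0, 0.0, 9.0],
    [10.0, 8.5, 9.0],
]
d = pairwise_distances(intensity)
```

In this toy input the first and third genes end up much closer to each other than either is to the second, which is the kind of signal a clustering algorithm would pick up.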

26 Homogeneity and Separation Principles All approaches to clustering are guided by two basic principles: Homogeneity: elements within a given cluster are close to each other. Separation: elements in different clusters are further apart. Note that clustering is not an easy task! (Don't be misled by simple illustrative examples.) Given these points, a clustering algorithm might make two distinct clusters as follows...

27 Bad clustering This clustering violates both the Homogeneity and Separation Principles: some points in separate clusters are close together, and some points in the same cluster are far apart.

28 Good clustering This clustering satisfies both the Homogeneity and Separation Principles.

29 Clustering techniques Agglomerative: start with every element in its own cluster, and iteratively join clusters together. Divisive: start with one cluster and iteratively divide it into smaller clusters. Hierarchical: organize elements into a tree whose leaves represent genes; the length of the path between two leaves represents the distance between the corresponding genes. Similar genes lie within the same subtrees.

30 Hierarchical clustering

31 Hierarchical clustering Hierarchical clustering is often used to reveal evolutionary history.

32 Hierarchical clustering algorithm
The algorithm takes an n x n matrix d of pairwise distances between points as input.
HierarchicalClustering(d, n)
    form n clusters, each with one element
    construct graph T by assigning one vertex to each cluster
    while there is more than one cluster
        find the two closest clusters C1 and C2
        merge C1 and C2 into a new cluster C of size |C1| + |C2|
        compute the distance from C to all other clusters
        add a new vertex C to T and connect it to vertices C1 and C2
        remove the rows and columns of d corresponding to C1 and C2
        add a row and column to d corresponding to the new cluster C
    return T

33 Hierarchical clustering algorithm Note on the HierarchicalClustering(d, n) pseudocode from the previous slide: different ways to define distances between clusters may lead to different clusterings!
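
A minimal Python sketch of the HierarchicalClustering pseudocode above. Several choices here are assumptions made for illustration, not part of the slides: the tree T is encoded as a list of merges, the cluster distance is the minimum pairwise distance, and distances are recomputed from the original matrix instead of updating rows and columns of d in place.

```python
def hierarchical_clustering(d):
    """Agglomerative clustering on a symmetric distance matrix d.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is one simple way to encode the tree T.
    """
    clusters = [frozenset([i]) for i in range(len(d))]
    merges = []

    def dist(a, b):
        # Assumed cluster distance: smallest distance between any pair.
        return min(d[x][y] for x in a for y in b)

    while len(clusters) > 1:
        # Find the two closest clusters (ties broken by position).
        best, i, j = min(
            (dist(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        a, b = clusters[i], clusters[j]
        merges.append((sorted(a), sorted(b), best))
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(a | b)
    return merges

# Hypothetical 4 x 4 distance matrix.
d = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
merges = hierarchical_clustering(d)
```

Swapping the body of `dist` for an average over all pairs gives a different linkage rule and, on other inputs, possibly a different tree.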

34 Computing distances
$$d_{min}(C, C^*) = \min_{x \in C,\; y \in C^*} d(x, y)$$
The distance between two clusters is the smallest distance between any pair of their elements.
$$d_{avg}(C, C^*) = \frac{1}{|C^*|\,|C|} \sum_{x \in C,\; y \in C^*} d(x, y)$$
The distance between two clusters is the average distance between all pairs of their elements.
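
The two cluster distances translate directly into code. The sketch below assumes clusters are given as lists of row indices into a distance matrix d; the function names and the example matrix are hypothetical.

```python
def d_min(d, c1, c2):
    """Smallest distance between any element of c1 and any element of c2."""
    return min(d[x][y] for x in c1 for y in c2)

def d_avg(d, c1, c2):
    """Average distance over all pairs (x in c1, y in c2)."""
    return sum(d[x][y] for x in c1 for y in c2) / (len(c1) * len(c2))

# Hypothetical 4 x 4 distance matrix and two clusters of two elements each.
d = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
closest = d_min(d, [0, 1], [2, 3])   # pairs have distances 6, 7, 5, 8
average = d_avg(d, [0, 1], [2, 3])
```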

35 Squared-error distortion Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point in X. Given a set of n data points V = {v_1, ..., v_n} and a set of k points X, define the squared-error distortion as:
$$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$$
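
As a quick check of the definition, here is a small Python sketch of the squared-error distortion. The function names and the sample points are made up for illustration.

```python
import math

def distance_to_set(v, centers):
    """d(v, X): Euclidean distance from v to the closest point in X."""
    return min(math.dist(v, x) for x in centers)

def squared_error_distortion(points, centers):
    """d(V, X): mean of the squared distances d(v_i, X) over all n points."""
    return sum(distance_to_set(v, centers) ** 2 for v in points) / len(points)

# Hypothetical data: four corners of a 4 x 2 rectangle.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]

two_centers = squared_error_distortion(points, [(0.0, 1.0), (4.0, 1.0)])
one_center = squared_error_distortion(points, [(2.0, 1.0)])
```

With two well-placed centers every point is at distance 1 from its nearest center, so the distortion is 1; with a single center in the middle it rises to 5.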

36 Clustering microarray data: k-means clustering K-means clustering is one way to organize this data: Given a set of n data points and an integer k, we want to find a set of k points (cluster centers) that minimizes the mean-squared distance from each data point to its nearest cluster center. Sketch of algorithm: (1) Choose k initial center points randomly and cluster the data. (2) Calculate new centers for each cluster using the points in the cluster. (3) Re-cluster all data using the new center points. Repeat the last two steps until no data points are moved from one cluster to another, or until some other convergence criterion is met.

37 Formal definition of K-Means Clustering
The K-Means Clustering Problem. Input: A set, V, consisting of n points, along with a parameter k. Output: A set X consisting of k points (cluster centers) that minimizes the squared-error distortion d(V, X) over all possible choices of X.
A (trivially) simple variation, 1-means clustering: The 1-Means Clustering Problem. Input: A set, V, consisting of n points. Output: A single point x (cluster center) that minimizes the squared-error distortion d(V, x) over all possible choices of x.
1-means clustering is easy. General k-means clustering is NP-hard, however.
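
One way to see why 1-means is easy: for Euclidean distance, the center of gravity (coordinate-wise mean) of the points minimizes the squared-error distortion. The Python sketch below illustrates this on hypothetical data; the function names are invented for the example.

```python
def center_of_gravity(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def distortion_one_center(points, x):
    """Squared-error distortion d(V, x) for a single center x."""
    return sum(
        sum((p[k] - x[k]) ** 2 for k in range(len(x))) for p in points
    ) / len(points)

# Hypothetical data points.
points = [(1.0, 1.0), (3.0, 1.0), (2.0, 4.0)]
center = center_of_gravity(points)

# Compare against a few nearby candidate centers.
best = distortion_one_center(points, center)
others = [
    distortion_one_center(points, (center[0] + dx, center[1] + dy))
    for dx, dy in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -1.0)]
]
```

Every perturbed candidate has a strictly larger distortion than the center of gravity, so no search is needed when k = 1.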

38 Clustering microarray data: k-means clustering Pick k = 2 centers at random. Cluster the data around these center points. Re-calculate the centers based on the current clusters. (Figures from "Data Analysis Tools for DNA Microarrays" by Sorin Draghici.)

39 Clustering microarray data: k-means clustering Re-cluster the data around the new center points. Repeat the last two steps until no more data points are moved into a different cluster. (Figures from "Data Analysis Tools for DNA Microarrays" by Sorin Draghici.)

40 K-means clustering: Lloyd's algorithm
K-MeansClustering(k)
    arbitrarily assign k cluster centers
    while cluster centers keep changing
        assign each data point to the cluster C_i corresponding to the closest cluster representative (center), 1 ≤ i ≤ k
        after assignment of all data points, compute new cluster representatives according to the cluster centers of gravity, i.e., the new representative of cluster C is $\frac{1}{|C|} \sum_{v \in C} v$
    output final cluster centers
Note that this may only lead to a locally optimal clustering.
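
A minimal Python sketch of Lloyd's algorithm as outlined above. To keep the example deterministic, the initial centers are passed in by the caller instead of being chosen arbitrarily; the `max_iter` cap and the rule of leaving an empty cluster's center unchanged are assumptions added here, not part of the slide.

```python
import math

def lloyd_k_means(points, centers, max_iter=100):
    """Alternate assignment and center-of-gravity updates until stable."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to its cluster's center of gravity.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(old)  # assumption: keep an unused center
        if new_centers == centers:
            break  # centers stopped changing
        centers = new_centers
    return centers, clusters

# Hypothetical data and starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, clusters = lloyd_k_means(points, [(0.0, 0.0), (10.0, 10.0)])
```

Different starting centers can lead to different final clusterings, which is the local-optimum caveat noted on the slide.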

41-44 K-means clustering: another example [Figure sequence over four slides: data points plotted by expression level in two conditions, with the cluster centers x1, x2, x3 changing position from one slide to the next.]

45 Conservative k-means clustering Observations: Lloyd's algorithm is fast, but in each iteration it moves many data points, which does not necessarily lead to better convergence. A more conservative method would be to move one point at a time, and only if the move improves the overall clustering cost. The smaller the clustering cost of a partition of data points, the better that clustering is. Different measures (e.g., squared-error distortion) can be used for this clustering cost.

46 Greedy k-means clustering
ProgressiveGreedyK-Means(k)
    select an arbitrary partition P into k clusters
    while forever
        bestChange ← 0
        for every cluster C
            for every element i not in C
                if moving i to cluster C reduces its clustering cost
                    if cost(P) − cost(P_{i→C}) > bestChange
                        bestChange ← cost(P) − cost(P_{i→C})
                        i* ← i
                        C* ← C
        if bestChange > 0
            change partition P by moving i* to C*
        else
            return P
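
The ProgressiveGreedyK-Means pseudocode can be sketched in Python as follows. This version assumes the clustering cost is the squared-error distortion with each cluster represented by its center of gravity, takes the starting partition as an argument, and never moves the last point out of a cluster so that k clusters are preserved; these are illustrative choices, and the names are hypothetical.

```python
def cost(partition):
    """Squared-error distortion, using each cluster's center of gravity."""
    total, n = 0.0, 0
    for cluster in partition:
        center = [sum(coord) / len(cluster) for coord in zip(*cluster)]
        total += sum(
            sum((a - b) ** 2 for a, b in zip(p, center)) for p in cluster
        )
        n += len(cluster)
    return total / n

def progressive_greedy_k_means(partition):
    partition = [list(c) for c in partition]
    while True:
        best_change, best_move = 0.0, None
        current = cost(partition)
        for target in range(len(partition)):
            for source in range(len(partition)):
                # Skip same-cluster moves and moves that would empty a cluster.
                if source == target or len(partition[source]) == 1:
                    continue
                for p in partition[source]:
                    trial = [list(c) for c in partition]
                    trial[source].remove(p)
                    trial[target].append(p)
                    change = current - cost(trial)
                    if change > best_change:
                        best_change, best_move = change, (p, source, target)
        if best_move is None:
            return partition  # no single move improves the cost
        p, source, target = best_move
        partition[source].remove(p)
        partition[target].append(p)

# Hypothetical starting partition that mixes two obvious groups.
start = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (9.0, 8.0)]]
result = progressive_greedy_k_means(start)
```

Each pass applies only the single best move, so the cost decreases strictly until no improving move remains.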

47 Clique graphs A more structured view of clustering: A clique is a graph with every vertex connected to every other vertex. A clique graph is a graph where each connected component is a clique. (Figure labels: clique of size 3; clique of size 5; clique of size 6; clique graph with 3 connected components.)
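
The definition of a clique graph can be checked mechanically: find the connected components, then verify that every vertex is adjacent to all other vertices in its component. The Python sketch below assumes an undirected graph without self-loops, given as a vertex list and an edge list; the function names are hypothetical.

```python
def connected_components(vertices, edges):
    """Return (list of components as sets, adjacency dict)."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Depth-first search from an unvisited vertex.
        stack, component = [start], set()
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        components.append(component)
    return components, adjacency

def is_clique_graph(vertices, edges):
    """True if every connected component is a clique."""
    components, adjacency = connected_components(vertices, edges)
    return all(
        adjacency[v] == component - {v}
        for component in components
        for v in component
    )

# A triangle plus a separate edge: two cliques, so a clique graph.
triangle_and_edge = is_clique_graph(
    [1, 2, 3, 4, 5], [(1, 2), (2, 3), (1, 3), (4, 5)]
)
# A path on three vertices is connected but misses the edge (1, 3).
path = is_clique_graph([1, 2, 3], [(1, 2), (2, 3)])
```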

48 Transformation into a clique graph Any graph can be transformed into a clique graph by adding or removing edges. What can we do here? Delete 2 edges.

49 Transformation into a clique graph As with the edit distance we studied earlier, there are many possible transformations. For example: add 2 edges, or delete 4 edges.

50 Formal definition of Corrupted Cliques Problem The Corrupted Cliques Problem. Input: A graph, G. Output: The smallest number of additions and removals of edges that will transform G into a clique graph. Our ultimate goal is to have: Vertices represent data points. Edges represent relationships between data points. Cliques represent meaningful groupings (i.e., clusters).

51 Distance graphs Transform a distance matrix into a distance graph: Genes are represented as vertices in the graph. Choose a distance threshold θ. If the distance between two vertices is below θ, draw an edge between them. The resulting graph may contain cliques. These cliques represent clusters of similar data points!
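
Building the distance graph is a one-line filter over the distance matrix. The sketch below uses a hypothetical 4 x 4 matrix and identifies genes by their row index; the function name is made up for the example.

```python
def distance_graph(d, theta):
    """Edges (i, j), i < j, for every pair whose distance is below theta."""
    n = len(d)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if d[i][j] < theta]

# Hypothetical distance matrix for four genes.
d = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
tight = distance_graph(d, 3)   # only the two closest pairs are linked
loose = distance_graph(d, 6)   # a larger threshold adds the edge (1, 2)
```

With the smaller threshold the graph is already a clique graph (two disjoint edges); with the larger one it becomes a path, which is not.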

52 Transforming distance graph into clique graph (Figure panels: distance matrix d; distance graph for θ = 7; clique graph.) The distance graph for θ = 7 is not quite a clique graph. However, it can be transformed into one by removing edges (g1, g10) and (g1, g9).

53 Heuristics for Corrupted Cliques Problem The Corrupted Cliques Problem (CCP) is NP-hard, but some heuristics exist to solve it approximately. For example, CAST (Cluster Affinity Search Technique) is a practical and fast heuristic for CCP. CAST is based on the notion of genes being close to a given cluster C or distant from it. The distance between gene i and cluster C is defined as: d(i, C) = average distance between i and each gene in C. Gene i is close to cluster C if d(i, C) < θ, and distant otherwise.

54 CAST algorithm
S = set of elements, G = distance graph, θ = distance threshold.
CAST(S, G, θ)
    P ← ∅
    while S ≠ ∅
        v ← vertex of maximal degree in the distance graph G
        C ← {v}
        while a close gene i not in C or a distant gene i in C exists
            find the nearest close gene i not in C and add it to C
            remove the farthest distant gene i in C
        add cluster C to partition P
        S ← S \ C
        remove the vertices of cluster C from the distance graph G
    return P
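
A rough Python sketch of the CAST pseudocode above, working directly on a distance matrix so that the distance graph is implicit (an edge exists when the distance is below θ). Several details are assumptions made for this illustration: genes are row indices, d(i, C) averages over every gene in C (including i itself when i is in C), ties are broken by index, and a `max_rounds` cap guards against the inner loop oscillating.

```python
def cast(d, theta, max_rounds=1000):
    remaining = set(range(len(d)))   # S in the pseudocode
    partition = []                   # P in the pseudocode

    def dist_to_cluster(i, cluster):
        # d(i, C): average distance between i and each gene in C.
        return sum(d[i][j] for j in cluster) / len(cluster)

    while remaining:
        def degree(i):
            # Degree of i in the distance graph restricted to unclustered genes.
            return sum(1 for j in remaining if j != i and d[i][j] < theta)

        seed = max(sorted(remaining), key=degree)
        cluster = {seed}
        for _ in range(max_rounds):
            changed = False
            close = [i for i in remaining - cluster
                     if dist_to_cluster(i, cluster) < theta]
            if close:
                # Add the nearest close gene.
                cluster.add(min(close, key=lambda i: dist_to_cluster(i, cluster)))
                changed = True
            distant = [i for i in cluster
                       if dist_to_cluster(i, cluster) >= theta]
            if distant and len(cluster) > 1:
                # Remove the farthest distant gene.
                cluster.remove(max(distant, key=lambda i: dist_to_cluster(i, cluster)))
                changed = True
            if not changed:
                break
        partition.append(sorted(cluster))
        remaining -= cluster
    return partition

# Hypothetical distance matrix for four genes.
d = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
clusters = cast(d, theta=3)
```

Because each outer iteration removes at least the seed's final cluster from the remaining set, the procedure always terminates, although the resulting partition depends on θ and on the tie-breaking rule.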

55 Wrap-up Readings for next time: BBP Chapters and 20 (tools, datasets, and applications). Remember: Come to class having done the readings. Check Blackboard regularly for updates.
