Measure of Distance
We wish to define the distance between two objects.
Distance metrics between points:
- Euclidean distance (EUC)
- Manhattan distance (MAN)
- Pearson sample correlation (COR)
- Angle distance (EISEN; considered by Eisen et al., 1998)
- Spearman sample correlation (SPEAR)
- Kendall's τ sample correlation (TAU)
- Mahalanobis distance
Distance metrics between distributions:
- Kullback-Leibler information
- Hamming's mutual information

R: Distance Metric Between Points
- dist function in the stats package: Euclidean, Manhattan
- hopach package: disscosangle(x, na.rm = TRUE)
- bioDist package: cor.dist, spearman.dist, tau.dist

Euclidean distance:
g1 = (-1.76, -1.45, 0.33), g2 = (0.04, -0.75, 0.29), g3 = (1.51, -1.60, 2.07)

\[
\begin{aligned}
g_1 \text{ vs } g_2:&\quad \sqrt{(-1.76-0.04)^2 + (-1.45-(-0.75))^2 + (0.33-0.29)^2} \approx 1.93 \\
g_1 \text{ vs } g_3:&\quad \sqrt{(-1.76-1.51)^2 + (-1.45-(-1.60))^2 + (0.33-2.07)^2} \approx 3.71 \\
g_2 \text{ vs } g_3:&\quad \sqrt{(0.04-1.51)^2 + (-0.75-(-1.60))^2 + (0.29-2.07)^2} \approx 2.46
\end{aligned}
\]

Distance matrix:
      g1     g2
g2    1.93
g3    3.71   2.46

Manhattan distance:
g1 = (-1.76, -1.45, 0.33), g2 = (0.04, -0.75, 0.29), g3 = (1.51, -1.60, 2.07)

\[
\begin{aligned}
g_1 \text{ vs } g_2:&\quad |-1.76-0.04| + |-1.45-(-0.75)| + |0.33-0.29| = 2.54 \\
g_1 \text{ vs } g_3:&\quad |-1.76-1.51| + |-1.45-(-1.60)| + |0.33-2.07| = 5.16 \\
g_2 \text{ vs } g_3:&\quad |0.04-1.51| + |-0.75-(-1.60)| + |0.29-2.07| = 4.10
\end{aligned}
\]

Distance matrix:
      g1     g2
g2    2.54
g3    5.16   4.10
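The two worked examples above can be checked in R. This is a minimal sketch using dist from the stats package; the matrix name g is only an illustrative choice.

```r
# Expression profiles of the three genes from the worked examples
g <- rbind(g1 = c(-1.76, -1.45, 0.33),
           g2 = c( 0.04, -0.75, 0.29),
           g3 = c( 1.51, -1.60, 2.07))

# dist() computes distances between the rows of a matrix
euc <- dist(g, method = "euclidean")
man <- dist(g, method = "manhattan")

print(round(euc, 2))
print(round(man, 2))
```

The result is a lower-triangular "dist" object; as.matrix(euc) converts it to a full symmetric matrix.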

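Cosine Correlation Distance

\[
d(\mathbf{x}, \mathbf{y}) = 1 - \cos\alpha
= 1 - \frac{\mathbf{x}^{\top}\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}
= 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}
\]

where α is the angle between x and y.
Note: see disscosangle (hopach).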
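A minimal base-R sketch of a cosine-angle distance, assuming the distance is defined as one minus the cosine of the angle between the two vectors. The helper name cos_angle_dist is hypothetical, and this is not the hopach implementation.

```r
# Cosine-angle distance between two numeric vectors: 1 - cos(angle)
cos_angle_dist <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

g1 <- c(-1.76, -1.45, 0.33)
g2 <- c( 0.04, -0.75, 0.29)

d12 <- cos_angle_dist(g1, g2)
print(d12)

# Rescaling a vector does not change the angle, so the distance is unchanged
print(cos_angle_dist(g1, 10 * g2))
# A vector has distance 0 from itself
print(cos_angle_dist(g1, g1))
```

Because only the angle matters, this distance compares the shape of two profiles and ignores their magnitude.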

Correlation-Based Distance
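One common way to turn a correlation into a distance is to use one minus the correlation. The sketch below assumes that convention and uses only base R on made-up data; the bioDist functions may differ in detail.

```r
set.seed(1)
# Hypothetical toy data: 4 genes (rows) measured in 6 samples (columns)
x <- matrix(rnorm(24), nrow = 4,
            dimnames = list(paste0("gene", 1:4), NULL))

# cor() correlates columns, so transpose to correlate the genes (rows)
d_cor   <- as.dist(1 - cor(t(x), method = "pearson"))
d_spear <- as.dist(1 - cor(t(x), method = "spearman"))
d_tau   <- as.dist(1 - cor(t(x), method = "kendall"))

print(round(d_cor, 2))
```

Under this convention the distance lies between 0 (perfect positive correlation) and 2 (perfect negative correlation).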

Kullback-Leibler Information
Kullback-Leibler information (KLI) considers whether the shapes of the feature distributions of two genes are similar.
(Figure: two densities, f_1(x) and f_2(x).)

Kullback-Leibler Information

\[
KLI(f_1, f_2) = \int \log\!\left(\frac{f_1(x)}{f_2(x)}\right) f_1(x)\,dx
\]

\[
d_{KLD}(f_1, f_2) = \frac{KLI(f_1, f_2) + KLI(f_2, f_1)}{2}
\]

Note:
1. KLI (and d_KLD) = 0 if f_1(x) = f_2(x).
2. KLI is not symmetric, but d_KLD is.
3. d_KLD does not satisfy the triangle inequality.
4. KLI (and hence d_KLD) is not defined when f_1(x) ≠ 0 but f_2(x) = 0 for some x.
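The notes can be illustrated with a discrete analogue, where the integral becomes a sum over bins. This is a minimal sketch on made-up probability vectors; the helper names kli and d_kld are hypothetical and are not the bioDist functions.

```r
# KLI(p, q) = sum p * log(p / q) for discrete distributions on the same bins
kli <- function(p, q) {
  keep <- p > 0                 # terms with p = 0 contribute 0
  sum(p[keep] * log(p[keep] / q[keep]))
}
# Symmetrized version: average of the two directed values
d_kld <- function(p, q) (kli(p, q) + kli(q, p)) / 2

p <- c(0.1, 0.4, 0.5)
q <- c(0.3, 0.3, 0.4)

print(kli(p, q))
print(kli(q, p))       # differs from kli(p, q): KLI is not symmetric
print(d_kld(p, q))
print(kli(p, p))       # 0 when the two distributions are identical
```

If q is 0 in a bin where p is positive, the sum becomes infinite, which matches the remark that the measure is not defined in that case.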

Mutual Information
Mutual information (MI) attempts to measure the distance from independence.

\[
MI(f_1, f_2) = \int_x \int_y \log\!\left(\frac{f(x, y)}{f_1(x)\, f_2(y)}\right) f(x, y)\,dy\,dx
\]

Note:
1. If x and y are independent, then f(x, y) = f_1(x) f_2(y), so that MI = 0.
2. MI does not satisfy the triangle inequality.
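A discrete analogue makes the first note concrete: when the joint table is the product of its marginals, the mutual information is 0. This is a minimal sketch on made-up tables; the helper name mutual_info is hypothetical.

```r
# Mutual information of a discrete joint distribution given as a table
mutual_info <- function(joint) {
  px <- rowSums(joint)          # marginal of x
  py <- colSums(joint)          # marginal of y
  indep <- outer(px, py)        # joint distribution under independence
  keep <- joint > 0             # cells with probability 0 contribute 0
  sum(joint[keep] * log(joint[keep] / indep[keep]))
}

# Independent case: the joint is the product of its marginals
indep_joint <- outer(c(0.2, 0.8), c(0.5, 0.5))
print(mutual_info(indep_joint))

# Dependent case: probability mass concentrated on the diagonal
dep_joint <- matrix(c(0.4, 0.1,
                      0.1, 0.4), nrow = 2, byrow = TRUE)
print(mutual_info(dep_joint))
```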

Mutual Information (Joe, 1989)
Transformation:

\[
\delta^{*} = \left[1 - \exp(-2\,MI)\right]^{1/2}, \qquad 0 \le \delta^{*} \le 1
\]

δ* can be interpreted as a generalization of the correlation!
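A standard special case, added here as an illustration rather than taken from the slides, shows why the transformed quantity behaves like a correlation. For a bivariate normal distribution with correlation ρ, the mutual information has a closed form, and the transformation recovers |ρ|:

```latex
\[
MI = -\tfrac{1}{2}\log\left(1 - \rho^{2}\right)
\quad\Longrightarrow\quad
\delta^{*} = \left[1 - \exp(-2\,MI)\right]^{1/2}
           = \left[1 - (1 - \rho^{2})\right]^{1/2}
           = |\rho| .
\]
```

Independence (ρ = 0) gives δ* = 0, and a perfect linear relationship (|ρ| = 1) gives δ* = 1.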

R: Distance Between Distributions
bioDist package:
- KLD.matrix (kernel density)
- KLdist.matrix (binning)
- mutualInfo

Exercise: Apop.xls http://homepage.ntu.edu.tw/~lyliu/introbioinfo/apop.xls Try to compute the distances between the rows (genes).

Distance: Visualization
man = dist(apop, "manhattan")
heatmap(as.matrix(man))                                     # plot 1
heatmap(as.matrix(man), Rowv = NA, Colv = NA)               # plot 2
heatmap(as.matrix(man), Rowv = NA, Colv = NA, symm = TRUE)  # plot 3
library(gplots)
heatmap.2(as.matrix(man), dendrogram = "none", keysize = 1.5,
          Rowv = FALSE, Colv = FALSE,
          trace = "none", density.info = "none")            # plot 4


pairs(cbind(man,mi,klsmooth,klbin))

Cluster Analysis Clustering is the process of grouping together similar entities. It is appropriate when there is no prior knowledge about the data. In a machine learning framework, it is known as unsupervised learning since there is no known desired answer for any particular gene or experiment.

Cluster Analysis
Entities that are similar to each other are grouped together to form a cluster.
Step 1: Define the similarity between entities → distance metric
Step 2: Form clusters → clustering algorithms


Clustering
According to the distance between objects, entities that are closer to each other are grouped together to form a cluster. → clustering algorithms
Note: Anything can be clustered. The clustering results may not reflect any biological relationship between the members of a given cluster.

Clustering
The results of clustering are usually shown as a clustering tree, or dendrogram.

Clustering Algorithm
- Partitioning: k-means, PAM
- Hierarchical clustering
- Model based: SOM

Partitioning Algorithms
Partitioning method: Construct a partition of n objects into a set of k clusters.
(Figure: example with k = 3.)

Partitioning Algorithms
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
- k-means: Each cluster is represented by the center of the cluster.
- k-medoids or PAM (Partitioning Around Medoids): Each cluster is represented by one of the objects in the cluster.

K-means Clustering
Step 1: Determine the number of clusters, k.
Step 2: Randomly choose k points as the centers of the clusters.
Step 3: Calculate the distance from each object to the k centers and assign every object to the closest cluster center.
Step 4: Calculate a new center for each updated cluster.
Step 5: Repeat steps 3 and 4 until no objects are relocated.
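The steps above can be sketched in a few lines of base R. This is an illustrative implementation under stated assumptions (Euclidean distance, initial centers drawn from the data, an empty cluster keeps its previous center), not the algorithm used by R's built-in kmeans function. The name simple_kmeans and the toy data are hypothetical.

```r
# Minimal k-means sketch following steps 1-5 (Euclidean distance)
simple_kmeans <- function(x, k, max_iter = 100) {
  # Step 2: pick k distinct data points at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 3: squared distance from every object to every center
    d2 <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centers[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    new_cluster <- max.col(-d2, ties.method = "first")
    # Step 5: stop when no object changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 4: recompute each center as the mean of its members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(42)
# Hypothetical toy data: two groups of 20 points in 2-D
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2))
fit <- simple_kmeans(x, k = 2)
print(table(fit$cluster))
```

In practice the built-in kmeans function is the usual choice; the sketch only makes the loop structure visible.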

K-means Clustering Example: k = 2
(Figure panels: Step 3, Step 4, repeat Step 3, repeat Step 4.)

Example of K-means Clustering Result
(Figure; axis label: average distance to the center of clusters.)

k-means Clustering: Properties
1. It is possible to produce empty clusters. To avoid this, one can:
   (i) place the starting cluster centers in the general area populated by the given data, or
   (ii) randomly choose k of the data points as the initial centers.
2. The results of the algorithm can change between successive runs, because the initial centers are chosen at random.
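The run-to-run variation can be seen with R's built-in kmeans function. The sketch below uses made-up data; the nstart argument asks kmeans to try several random starts and keep the best one, which is a common way to reduce the variation.

```r
set.seed(7)
# Hypothetical toy data: three overlapping groups of 30 points in 2-D
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 2), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))

# Five single-start runs: the total within-cluster sum of squares may differ
single_runs <- sapply(1:5, function(i) {
  kmeans(x, centers = 3, nstart = 1)$tot.withinss
})
print(round(single_runs, 2))

# nstart = 25 tries 25 random starts and keeps the best solution
best <- kmeans(x, centers = 3, nstart = 25)
print(round(best$tot.withinss, 2))
```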

PAM
PAM (Partitioning Around Medoids):
- starts from an initial set of medoids (objects)
- iteratively replaces one of the medoids with one of the non-medoids if this improves the total distance of the resulting clustering
- provides a novel graphical display, the silhouette plot, which helps the user select the optimal number of clusters
- works effectively for small data sets, but does not scale well to large data sets

PAM
Step 1: Select k representative objects arbitrarily.
Step 2: For each pair of non-selected object h and selected object i, calculate the total swapping cost TC_ih.
        If TC_ih < 0, i is replaced by h.
        Then assign each non-selected object to the most similar representative object.
Step 3: Repeat Step 2 until there is no change.
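In R, PAM is available as pam in the cluster package. The sketch below runs it on made-up data with a Manhattan dissimilarity. Note that cluster::pam selects its initial medoids with a deterministic build step rather than arbitrarily, so it differs from Step 1 above in that detail.

```r
library(cluster)   # provides pam()

set.seed(3)
# Hypothetical toy data: two groups of 15 points in 2-D
x <- rbind(matrix(rnorm(30, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(30, mean = 4, sd = 0.5), ncol = 2))

# pam() accepts a dissimilarity object, so any distance metric can be used
d <- dist(x, method = "manhattan")
fit <- pam(d, k = 2, diss = TRUE)

print(fit$id.med)              # indices of the observations chosen as medoids
print(table(fit$clustering))   # cluster sizes
```

Because each cluster is represented by an actual observation, PAM only needs pairwise dissimilarities and never has to average the data.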

PAM
(Figure: PAM example with K = 2; panels show Total Cost = 20 and Total Cost = 26.)
- Arbitrarily choose k objects as the initial medoids.
- Assign each remaining object to the nearest medoid.
- Randomly select a non-medoid object, O_random.
- Compute the total cost of swapping.
- Swap O and O_random if the quality is improved.
- Loop until no change.

PAM
- The next plot is called a silhouette plot.
- Each observation is represented by a horizontal bar.
- The groups are slightly separated.
- The length of a bar is a measure of how close the observation is to its assigned group (versus the others).

Silhouette plot of pam(x = as.dist(d), k = 3, diss = TRUE)
(Figure: n = 38 observations labelled ALL B, ALL T and AML; 3 clusters; x axis: silhouette width s_i.)

cluster j   n_j   average s_i
1           18    0.40
2            8    0.54
3           12    0.73

Average silhouette width: 0.53

Partitioning Methods: Comment
Number of clusters, k:
- If there are features that clearly distinguish between the classes (e.g., cancer and healthy), the algorithm might use them to construct meaningful clusters.
- If the analysis has an exploratory character, one could repeat the clustering for several values of k.
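Repeating the clustering for several values of k can be combined with the average silhouette width as a rough guide. This is a minimal sketch on made-up data using pam from the cluster package; choosing the k with the largest average width is one heuristic, not a rule.

```r
library(cluster)   # pam() and its silhouette information

set.seed(11)
# Hypothetical toy data: three groups of 20 points in 2-D
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.4), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.4), ncol = 2),
           matrix(rnorm(40, mean = 6, sd = 0.4), ncol = 2))
d <- dist(x)

# Average silhouette width for k = 2, ..., 6
ks <- 2:6
avg_width <- sapply(ks, function(k) {
  pam(d, k = k, diss = TRUE)$silinfo$avg.width
})
names(avg_width) <- ks
print(round(avg_width, 2))

# The k with the largest average width is one candidate choice
best_k <- ks[which.max(avg_width)]
print(best_k)
```

Calling plot(silhouette(pam(d, k = best_k, diss = TRUE))) draws the corresponding silhouette plot.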


Hierarchical Clustering
k-means clustering returns a set of k clusters. Hierarchical clustering returns a complete tree with the individual patterns as leaves and the convergence point of all branches as the root.
(Figure: complete tree with root, leaves, and a distance axis.)

Hierarchical Clustering
Step 1: Choose a distance measure.
Step 2: Construct the hierarchical tree:
- Bottom-up (agglomerative) method: n → 1; start from the individual patterns and merge smaller clusters to form bigger clusters.
- Top-down (divisive) method: 1 → n; start at the root and split clusters into smaller ones using non-hierarchical algorithms (e.g., k-means with k = 2).

Hierarchical Clustering: Example
Example: Consider 5 experiments (A, B, C, D, E) with the following distance matrix.
Bottom-up (agglomerative) method: put similar clusters together to form bigger clusters.

      A     B     C     D     E
A           100   400   800   1000
B                 300   700   900
C                       400   600
D                             200
E

Hierarchical Clustering: Example

        {A,B}   C     D     E
{A,B}           ?     ?     ?
C                     400   600
D                           200
E

We need to define inter-cluster distances.

Inter-Cluster Distances
- Single linkage
- Complete linkage
- Centroid linkage
- Average linkage
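For reference, the usual textbook definitions of these four linkages are given below. The slides name the linkages without formulas, so the notation (clusters A and B, object-level distance d, cluster means written with a bar) is an assumption made here.

```latex
\[
\begin{aligned}
\text{Single linkage:}   \quad & d(A,B) = \min_{a \in A,\, b \in B} d(a,b) \\
\text{Complete linkage:} \quad & d(A,B) = \max_{a \in A,\, b \in B} d(a,b) \\
\text{Average linkage:}  \quad & d(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b) \\
\text{Centroid linkage:} \quad & d(A,B) = d(\bar{a}, \bar{b}), \qquad
  \bar{a} = \frac{1}{|A|} \sum_{a \in A} a, \quad \bar{b} = \frac{1}{|B|} \sum_{b \in B} b
\end{aligned}
\]
```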

Hierarchical Clustering: Example
If we use average linkage: d_{A,B},C = (d_A,C + d_B,C) / 2 = (400 + 300) / 2 = 350, etc.

        {A,B}   C     D     E
{A,B}           350   750   950
C                     400   600
D                           200
E

Hierarchical Clustering: Example
(using average linkage)

        {A,B}   C     {D,E}
{A,B}           350   850
C                     500
{D,E}
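The five-experiment example can be reproduced with hclust from the stats package. This is a sketch: the object names are illustrative, and R's "average" method weights each merged cluster by its size (UPGMA), which agrees with the tables above for the merges they show.

```r
# Distance matrix for the five experiments in the worked example
labs <- c("A", "B", "C", "D", "E")
m <- matrix(0, 5, 5, dimnames = list(labs, labs))
m[lower.tri(m)] <- c(100, 400, 800, 1000,   # A vs B, C, D, E
                     300, 700, 900,         # B vs C, D, E
                     400, 600,              # C vs D, E
                     200)                   # D vs E
d <- as.dist(m)

# Agglomerative clustering with average linkage
hc <- hclust(d, method = "average")
print(hc$height)   # heights at which clusters are merged
```

The first three merge heights are 100 (A with B), 200 (D with E) and 350 ({A,B} with C). The final merge of {A,B,C} with {D,E} happens at about 733.3, the mean of the six between-cluster distances. plot(hc) draws the dendrogram.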

Cutting Tree Diagrams
A hierarchical clustering diagram can be used to divide the data into a pre-determined number of clusters by cutting the tree at a certain depth.
(Figure: the same tree cut into 2 groups and into 4 groups.)
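In R, cutting a tree is done with cutree. The sketch below rebuilds the five-experiment distance matrix from the earlier example (A, B, C, D, E) so that it runs on its own, then cuts the average-linkage tree into 2 and 3 groups.

```r
labs <- c("A", "B", "C", "D", "E")
m <- matrix(0, 5, 5, dimnames = list(labs, labs))
m[lower.tri(m)] <- c(100, 400, 800, 1000, 300, 700, 900, 400, 600, 200)
hc <- hclust(as.dist(m), method = "average")

# Cut the tree into a fixed number of groups
groups2 <- cutree(hc, k = 2)
groups3 <- cutree(hc, k = 3)
print(groups2)
print(groups3)
```

With k = 2 the groups are {A, B, C} and {D, E}; with k = 3 they are {A, B}, {C} and {D, E}. cutree also accepts h = to cut at a given height instead of a given number of groups.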

Properties of Hierarchical Clustering
Different tree-constructing methods:
- Bottom-up: running the same method on the same data always gives the same result.
- Top-down: running the same method on the same data can give different results, because the splitting step (e.g., k-means) depends on randomly chosen initial centers.
- Different linkage types produce different results.

Hierarchical Clustering: Comments
- Objective of the research: to obtain a clustering that reflects the structure of the data.
- The dendrogram itself is almost never the answer to the research question.
- Implementations of hierarchical clustering should not be judged simply by their speed; slower algorithms may be doing a better job of extracting the features of the data.
- The order of the objects and clusters in the dendrogram may be misleading.

Orders in Dendrogram


SOM: Motivation
Misleading dendrograms (figure: K-means vs. hierarchical).
SOM clustering is designed to create a plot in which similar patterns are plotted next to each other.

Self-Organizing Feature Maps (SOM)
- SOM: a map consisting of many simple elements (nodes or neurons); it is constructed by training.
- SOMs are believed to resemble processing that can occur in the brain.
- Useful for visualizing high-dimensional data in 2-D or 3-D space.

Self-Organizing Feature Maps (SOM)
- Clustering is performed by having several units compete for the current object.
- The unit whose weight vector is closest to the current object wins.
- The winner and its neighbors learn by having their weights adjusted.
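The competition and learning steps above can be sketched in base R for a one-dimensional map. Everything here is an illustrative assumption: the function name train_som, the Gaussian neighborhood, the linear decay of the learning rate and radius, and the toy data. Dedicated SOM packages use more refined schemes.

```r
# Minimal 1-D SOM sketch: units on a line, trained by competitive learning
train_som <- function(x, n_units = 5, n_iter = 500,
                      alpha0 = 0.5, radius0 = 2) {
  # Initialize unit weight vectors from randomly chosen objects
  w <- x[sample(nrow(x), n_units), , drop = FALSE]
  for (t in seq_len(n_iter)) {
    frac   <- 1 - (t - 1) / n_iter          # decays from 1 towards 0
    alpha  <- alpha0 * frac                 # learning rate
    radius <- max(radius0 * frac, 0.5)      # neighborhood width on the map
    xi <- x[sample(nrow(x), 1), ]           # current object
    xi_rows <- matrix(xi, n_units, ncol(x), byrow = TRUE)
    # Competition: the unit whose weight vector is closest to xi wins
    winner <- which.min(rowSums((w - xi_rows)^2))
    # Learning: the winner and its map neighbors move towards xi
    h <- exp(-((seq_len(n_units) - winner)^2) / (2 * radius^2))
    w <- w + alpha * h * (xi_rows - w)
  }
  w
}

set.seed(5)
# Hypothetical toy data: 100 points in 3-D
x <- matrix(rnorm(300), ncol = 3)
w <- train_som(x)

# Map each object to its closest unit
unit <- apply(x, 1, function(xi) which.min(colSums((t(w) - xi)^2)))
print(table(unit))
```

Because neighboring units are updated together, units that are adjacent on the map tend to end up with similar weight vectors, which is what places similar patterns next to each other.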

Self-Organizing Feature Maps (SOM)
This process can be visualized by imagining all SOM units being connected to each other by rubber bands.
(Figure: a 2-D SOFM trained on 3-dimensional data.)

Papers:
- Eisen 1998
- Algorithmic Approaches to Clustering Gene Expression Data. http://citeseer.nj.nec.com/shamir01algorithmic.html
- Tibshirani, Hastie, Narasimhan and Chu (2002). http://www.pnas.org/cgi/reprint/99/10/6567
- Rousseeuw, P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53-65.