Measure of Distance


Measure of Distance
We wish to define the distance between two objects.
Distance metrics between points:
- Euclidean distance (EUC)
- Manhattan distance (MAN)
- Pearson sample correlation (COR)
- Angle distance (EISEN; considered by Eisen et al., 1998)
- Spearman sample correlation (SPEAR)
- Kendall's τ sample correlation (TAU)
- Mahalanobis distance
Distance metrics between distributions:
- Kullback-Leibler information
- Mutual information

R: Distance Metrics Between Points
dist function in the stats package: Euclidean, Manhattan
hopach package: disscosangle(X, na.rm = TRUE)
biodist package: cor.dist, spearman.dist, tau.dist
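A minimal sketch of these calls, assuming a numeric matrix apop with genes in the rows (as in the exercise later); hopach and biodist are Bioconductor packages:

d.euc <- dist(apop, method = "euclidean")   # stats::dist, Euclidean distance between rows
d.man <- dist(apop, method = "manhattan")   # stats::dist, Manhattan distance
library(hopach)                             # Bioconductor
d.cos <- disscosangle(apop, na.rm = TRUE)   # cosine-angle (uncentered correlation) distance
library(biodist)                            # Bioconductor
d.cor   <- cor.dist(apop)                   # Pearson-correlation-based distance
d.spear <- spearman.dist(apop)              # Spearman-correlation-based distance
d.tau   <- tau.dist(apop)                   # Kendall's tau-based distance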

g1 = (-1.76, -1.45, 0.33)
g2 = ( 0.04, -0.75, 0.29)
g3 = ( 1.51, -1.60, 2.07)

Euclidean distance:
g1 vs g2: sqrt((-1.76 - 0.04)^2 + (-1.45 - (-0.75))^2 + (0.33 - 0.29)^2) ≈ 1.93
g1 vs g3: sqrt((-1.76 - 1.51)^2 + (-1.45 - (-1.60))^2 + (0.33 - 2.07)^2) ≈ 3.71
g2 vs g3: sqrt(( 0.04 - 1.51)^2 + (-0.75 - (-1.60))^2 + (0.29 - 2.07)^2) ≈ 2.46

       g1     g2
g2   1.93
g3   3.71   2.46

g1 = (-1.76, -1.45, 0.33)
g2 = ( 0.04, -0.75, 0.29)
g3 = ( 1.51, -1.60, 2.07)

Manhattan distance:
g1 vs g2: |-1.76 - 0.04| + |-1.45 - (-0.75)| + |0.33 - 0.29| = 2.54
g1 vs g3: |-1.76 - 1.51| + |-1.45 - (-1.60)| + |0.33 - 2.07| = 5.16
g2 vs g3: | 0.04 - 1.51| + |-0.75 - (-1.60)| + |0.29 - 2.07| = 4.10

       g1     g2
g2   2.54
g3   5.16   4.10
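These worked examples can be checked directly with dist() (a minimal sketch):

g <- rbind(g1 = c(-1.76, -1.45, 0.33),
           g2 = c( 0.04, -0.75, 0.29),
           g3 = c( 1.51, -1.60, 2.07))
dist(g, method = "euclidean")   # 1.93 (g1-g2), 3.71 (g1-g3), 2.46 (g2-g3)
dist(g, method = "manhattan")   # 2.54 (g1-g2), 5.16 (g1-g3), 4.10 (g2-g3)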

Cosine Correlation (Angle) Distance
d(x, y) = 1 - (Σ_i x_i y_i) / (sqrt(Σ_i x_i^2) · sqrt(Σ_i y_i^2)) = 1 - cos(α),
where α is the angle between the vectors x and y.
Note: disscosangle (hopach)
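As a sketch of the formula above in base R (the helper name is hypothetical, not a library function):

cosangle.dist <- function(x, y) 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cosangle.dist(c(-1.76, -1.45, 0.33), c(0.04, -0.75, 0.29))   # distance between g1 and g2, about 0.40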

Correlation-based distance:
d(x, y) = 1 - r(x, y),
where r(x, y) is the sample correlation between x and y (Pearson, Spearman, or Kendall's τ).

Measure of Distance
We wish to define the distance between two objects.
Distance metrics between points:
- Euclidean distance (EUC)
- Manhattan distance (MAN)
- Pearson sample correlation (COR)
- Angle distance (EISEN; considered by Eisen et al., 1998)
- Spearman sample correlation (SPEAR)
- Kendall's τ sample correlation (TAU)
- Mahalanobis distance
Distance metrics between distributions:
- Kullback-Leibler information
- Mutual information

Kullback-Leibler Information
The Kullback-Leibler information (KLI) asks whether the shape of the distribution of the features is similar between two genes, i.e., whether the two densities f1(x) and f2(x) are close.
(Illustration: two density curves, f1(x) and f2(x).)

Kullback-Leibler Information
KLI(f1, f2) = ∫ f1(x) log[ f1(x) / f2(x) ] dx
d_KLD(f1, f2) = [ KLI(f1, f2) + KLI(f2, f1) ] / 2
Note:
1. KLI (and d_KLD) = 0 if f1(x) = f2(x).
2. KLI is not symmetric, but d_KLD is.
3. d_KLD does not satisfy the triangle inequality.
4. KLI (and d_KLD) is not defined when f1(x) ≠ 0 but f2(x) = 0 for some x.
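For discrete distributions, i.e. probability vectors p and q on the same support, a minimal sketch of these two quantities:

kli <- function(p, q) sum(p * log(p / q))           # Kullback-Leibler information KLI(p, q)
kld <- function(p, q) (kli(p, q) + kli(q, p)) / 2   # symmetrized distance d_KLD
p <- c(0.2, 0.5, 0.3); q <- c(0.3, 0.4, 0.3)
kli(p, q); kli(q, p)   # not equal: KLI is not symmetric
kld(p, q)              # symmetric; requires q > 0 wherever p > 0 (Note 4)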

Mutual Information
Mutual information (MI) attempts to measure the distance from independence.
MI(f1, f2) = ∫∫ f(x, y) log[ f(x, y) / (f1(x) f2(y)) ] dy dx
Note:
1. If x and y are independent, then f(x, y) = f1(x) f2(y), so that MI = 0.
2. MI does not satisfy the triangle inequality.

Mutual Information (Joe, 1989)
Transformation: δ* = [ 1 - exp(-2 MI) ]^(1/2), with 0 ≤ δ* ≤ 1.
δ* can be interpreted as a generalization of the correlation!
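In R the transformation is a one-liner (a sketch; mi stands for a mutual-information value):

delta.star <- function(mi) sqrt(1 - exp(-2 * mi))
delta.star(0)   # 0: independence
delta.star(2)   # about 0.99: strong dependence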

R: Distance Between Distributions
biodist package:
- KLD.matrix (kernel density)
- KLdist.matrix (binning)
- mutualinfo

Exercise: Apop.xls http://homepage.ntu.edu.tw/~lyliu/introbioinfo/apop.xls Try to compute the distances between the rows (genes).

Distance: Visualization
man = dist(apop, "manhattan")
heatmap(as.matrix(man))                                       # plot 1
heatmap(as.matrix(man), Rowv = NA, Colv = NA)                 # plot 2
heatmap(as.matrix(man), Rowv = NA, Colv = NA, symm = TRUE)    # plot 3
library(gplots)
heatmap.2(as.matrix(man), dendrogram = "none", keysize = 1.5,
          Rowv = FALSE, Colv = FALSE, trace = "none", density.info = "none")   # plot 4


pairs(cbind(man, mi, klsmooth, klbin))   # pairwise scatterplots comparing the distance measures

Cluster Analysis
Clustering is the process of grouping together similar entities. It is appropriate when there is no prior knowledge about the data. In a machine learning framework it is known as unsupervised learning, since there is no known desired answer for any particular gene or experiment.

Cluster Analysis
The entities that are similar to each other are grouped together to form a cluster.
Step 1: Defining the similarity between entities (distance metric)
Step 2: Forming clusters (clustering algorithms)

Measure of Distance
Distance metrics between points:
- Euclidean distance (EUC)
- Manhattan distance (MAN)
- Pearson sample correlation (COR)
- Angle distance (EISEN; considered by Eisen et al., 1998)
- Spearman sample correlation (SPEAR)
- Kendall's τ sample correlation (TAU)
- Mahalanobis distance
Distance metrics between distributions:
- Kullback-Leibler information
- Mutual information

Cluster Analysis
The entities that are similar to each other are grouped together to form a cluster.
Step 1: Defining the similarity between entities (distance metric)
Step 2: Forming clusters (clustering algorithms)

Clustering
According to the distance between objects, the entities that are closer to each other are grouped together to form a cluster (clustering algorithms).
Note: anything can be clustered. The clustering results may not be related to any biological meaning shared by the members of a given cluster.

Clustering
Usually the results of clustering are shown as a clustering tree, or dendrogram.

Clustering Algorithms
- Partitioning: k-means, PAM
- Hierarchical clustering
- Model based: SOM

Partitioning Algorithms
Partitioning method: construct a partition of n objects into a set of k clusters.
(Illustration: a partition with k = 3.)

Partitioning Algorithms
Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
- k-means: each cluster is represented by the center of the cluster.
- k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster.

K-means Clustering
Step 1: Determine the number of clusters, k.
Step 2: Randomly choose k points as the centers of the clusters.
Step 3: Calculate the distance from each pattern to the k centers and associate every object with the closest cluster center.
Step 4: Calculate a new center for each updated cluster.
Step 5: Repeat Steps 3 and 4 until no objects are relocated.
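A minimal sketch with the built-in kmeans() function, assuming the apop matrix from the earlier exercise:

set.seed(1)                                    # results depend on the random initial centers
fit <- kmeans(apop, centers = 3, nstart = 10)  # nstart repeats Steps 2-5 from several random starts
fit$cluster    # cluster label of each gene
fit$centers    # the k cluster centers
fit$withinss   # within-cluster sums of squares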

K-means Clustering
Example with k = 2. (Illustration: Steps 3 and 4, then Steps 3 and 4 repeated until the assignments stop changing.)

Example of a K-means Clustering Result
(Illustration: clustered data with the average distance to the cluster centers.)

k-means Clustering: Properties
1. It is possible to produce empty clusters. To avoid this, one can (i) place the starting cluster centers in the general area populated by the given data, or (ii) randomly choose k of the data points as the initial centers.

k-means Clustering: Properties
2. Because the initial centers are chosen randomly, the results can change between successive runs of the algorithm.

PAM
PAM (Partitioning Around Medoids):
- starts from an initial set of medoids (objects);
- iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering;
- provides a novel graphical display, the silhouette plot, which allows the user to select the optimal number of clusters;
- works effectively for small data sets, but does not scale well to large data sets.

PAM
Step 1: Select k representative objects (medoids) arbitrarily.
Step 2: For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih. If TC_ih < 0, i is replaced by h. Then assign each non-selected object to the most similar representative object.
Step 3: Repeat Step 2 until there is no change.
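A minimal sketch with pam() from the cluster package, again assuming the apop matrix:

library(cluster)
d <- dist(apop, method = "euclidean")
fit <- pam(d, k = 3, diss = TRUE)   # medoids are chosen among the genes themselves
fit$medoids                         # the k representative objects
fit$clustering                      # cluster membership of each gene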

PAM
(Illustration with k = 2: arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping a medoid with O_random; perform the swap if the quality is improved; loop until no change. The two configurations shown have total costs 20 and 26.)

PAM
The next plot is called a silhouette plot:
- each observation is represented by a horizontal bar;
- the groups are slightly separated;
- the length of a bar is a measure of how close the observation is to its assigned group (versus the others).

Silhouette plot of pam(x = as.dist(d), k = 3, diss = TRUE) for n = 38 samples labelled ALL B, ALL T, and AML, 3 clusters:
cluster 1: n = 18, average silhouette width 0.40
cluster 2: n = 8, average silhouette width 0.54
cluster 3: n = 12, average silhouette width 0.73
Overall average silhouette width: 0.53

Partitioning Methods: Comments
Number of clusters, k:
- If there are features that clearly distinguish between the classes (e.g. cancer vs. healthy), the algorithm might use them to construct meaningful clusters.
- If the analysis has an exploratory character, one could repeat the clustering for several values of k, as in the sketch below.
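One hedged way to compare several values of k, reusing the distance d and the cluster package from the sketch above; pam() reports the average silhouette width (Rousseeuw, 1987) for each k:

for (k in 2:6) {
  fit <- pam(d, k = k, diss = TRUE)
  cat("k =", k, "average silhouette width =", round(fit$silinfo$avg.width, 2), "\n")
}
plot(silhouette(pam(d, k = 3, diss = TRUE)))   # silhouette plot for one chosen k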

Clustering Algorithms
- Partitioning: k-means, PAM
- Hierarchical clustering
- Model based: SOM

Hierarchical Clustering
k-means clustering returns a set of k clusters. Hierarchical clustering returns a complete tree, with the individual patterns as leaves and the convergence point of all branches as the root.
(Illustration: a dendrogram with its root, leaves, and a distance axis.)

Hierarchical Clustering
Step 1: Choose a distance measurement.
Step 2: Construct the hierarchical tree:
- Bottom-up (agglomerative) method (n → 1): start from the individual patterns and merge smaller clusters to form bigger clusters.
- Top-down (divisive) method (1 → n): start at the root and split clusters into smaller ones using non-hierarchical algorithms (e.g., k-means with k = 2).

Hierarchical Clustering: Example
Consider 5 experiments (A, B, C, D, E) with the following distance matrix. Bottom-up (agglomerative) method: put similar clusters together to form bigger clusters.

      A     B     C     D     E
A     -   100   400   800  1000
B           -   300   700   900
C                 -   400   600
D                       -   200
E                             -

Hierarchical Clustering: Example {A,B} C D E {A,B}??? C 400 600 D 200 E Need to define inter-cluster distances

Inter-Cluster Distances
- Single linkage
- Complete linkage
- Centroid linkage
- Average linkage

Hierarchical Clustering: Example
If we use average linkage: d_{{A,B},C} = (d_{A,C} + d_{B,C}) / 2 = (400 + 300) / 2 = 350, etc.

        {A,B}   C     D     E
{A,B}     -    350   750   950
C                -   400   600
D                      -   200
E                            -

Hierarchical Clustering: Example (average linkage, continued)
After merging D and E:

        {A,B}   C    {D,E}
{A,B}     -    350    850
C                -    500
{D,E}                   -
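The same toy example can be reproduced with hclust() (a sketch). Note that hclust's method = "average" is UPGMA, which averages over all original pairs, while the step-by-step averaging of the reduced matrices used above corresponds to method = "mcquitty" (WPGMA):

d.ex <- as.dist(matrix(c(
     0,  100,  400,  800, 1000,
   100,    0,  300,  700,  900,
   400,  300,    0,  400,  600,
   800,  700,  400,    0,  200,
  1000,  900,  600,  200,    0),
  nrow = 5, dimnames = list(LETTERS[1:5], LETTERS[1:5])))
hc <- hclust(d.ex, method = "mcquitty")   # matches the hand calculation above
plot(hc)                                  # {A,B} and {D,E} merge first, then C joins {A,B}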

Cutting Tree Diagrams
A hierarchical clustering diagram can be used to divide the data into a pre-determined number of clusters by cutting the tree at a certain depth.
(Illustration: cutting the same dendrogram at two different heights gives 2 groups or 4 groups.)
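In R this cut is done with cutree(), reusing the tree hc from the sketch above:

cutree(hc, k = 2)    # cut into 2 groups
cutree(hc, k = 4)    # cut into 4 groups
cutree(hc, h = 300)  # or cut at a given height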

Properties of Hierarchical Clustering
Different tree-constructing methods:
- The same bottom-up (agglomerative) method applied to the same data always gives the same result.
- The same top-down (divisive) method applied to the same data can give different results between runs (e.g., when the splits use k-means with random starting centers).
- Different linkage types produce different results.

Hierarchical Clustering: Comments
The objective of the research is to obtain a clustering that reflects the structure of the data; the dendrogram itself is almost never the answer to the research question.
Various implementations of hierarchical clustering should not be judged simply by their speed; slower algorithms may be trying to do a better job of extracting the data features.
The order of the objects and clusters in the dendrogram may be misleading.

Orders in Dendrogram
(Illustration: the left-to-right order of the leaves in a dendrogram is not unique; any branch can be flipped without changing the tree, so neighboring leaves are not necessarily the most similar objects.)

Clustering Algorithms
- Partitioning: k-means, PAM
- Hierarchical clustering
- Model based: SOM

SOM: Motivation
(Illustration: dendrograms and cluster layouts from k-means and hierarchical clustering can be misleading about which patterns are similar.)
SOM clustering is designed to create a plot in which similar patterns are plotted next to each other.

Self-Organizing Feature Maps (SOM)
A SOM is a map consisting of many simple elements (nodes or neurons); it is constructed by training.
SOMs are believed to resemble processing that can occur in the brain.
They are useful for visualizing high-dimensional data in 2- or 3-dimensional space.

Self-Organizing Feature Maps (SOM)
Clustering is performed by having several units compete for the current object.
The unit whose weight vector is closest to the current object wins.
The winner and its neighbors learn by having their weights adjusted.

Self-Organizing Feature Maps (SOM)
This process can be visualized by imagining all SOM units being connected to each other by rubber bands.
(Illustration: a 2-D SOM trained on 3-dimensional data.)
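A minimal sketch with the kohonen package (an assumption; the slides do not name an R package for SOM), again on the apop matrix:

library(kohonen)
set.seed(1)
sm <- som(scale(apop), grid = somgrid(xdim = 4, ydim = 4, topo = "hexagonal"))
plot(sm, type = "mapping")   # each gene is placed on the 4 x 4 grid; similar genes land on nearby units
plot(sm, type = "codes")     # the learned weight (code) vectors of the units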

References
Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. PNAS, 95, 14863-14868.
Shamir, R. and Sharan, R. Algorithmic Approaches to Clustering Gene Expression Data. http://citeseer.nj.nec.com/shamir01algorithmic.html
Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). PNAS, 99(10), 6567-6572. http://www.pnas.org/cgi/reprint/99/10/6567
Rousseeuw, P.J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53-65.