Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis. Giorgio Valentini, e-mail: valentini@dsi.unimi.it

Clustering of Microarray Data
1. Clustering of gene expression profiles (rows) => discovery of co-regulated and functionally related genes (or unrelated genes: different clusters)
2. Clustering of samples (columns) => identification of sub-types of related samples
3. Two-way clustering => combined sample clustering with gene clustering, to identify which genes are the most important for sample clustering
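
As a concrete illustration of these three modes (not from the original slides), here is a minimal Python sketch that clusters a hypothetical expression matrix by rows (genes) and by columns (samples) with SciPy; the data and the cluster counts are made up for illustration.

```python
# A minimal sketch (illustrative, synthetic data): hierarchical clustering of
# a gene-expression matrix by rows (genes) and by columns (samples).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))              # 100 genes x 12 samples (stand-in data)

Z_genes = linkage(X, method='average')      # 1. cluster gene profiles (rows)
Z_samples = linkage(X.T, method='average')  # 2. cluster samples (columns)

# 3. "two-way" view: flat gene clusters and sample clusters side by side
gene_labels = fcluster(Z_genes, t=5, criterion='maxclust')
sample_labels = fcluster(Z_samples, t=3, criterion='maxclust')
```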

Hierarchical Clustering
[Figure: nested clusters of points 1-6 (left) and the corresponding dendrogram (right).]

Dendrograms
- The root represents the whole data set
- A leaf represents a single object in the data set
- An internal node represents the union of all objects in its subtree
- The height of an internal node represents the distance between its two child nodes

Hierarchical Clustering
Two main types of hierarchical clustering:
1. Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters, until only one cluster (or k clusters) is left. This requires defining the notion of cluster proximity.
2. Divisive: start with one, all-inclusive cluster; at each step, split a cluster, until each cluster contains a single point (or there are k clusters). Need to decide which cluster to split at each step.

Basic Agglomerative Hierarchical Clustering Algorithm
1. Initially, each object forms its own cluster
2. Compute all pairwise distances between the initial clusters (objects)
repeat
3. Merge the closest pair (A, B) in the set of the current clusters into a new cluster C = A ∪ B
4. Remove A and B from the set of current clusters; insert C into the set of current clusters
5. Determine the distance between the new cluster C and all other clusters in the set of current clusters
until only a single cluster remains
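
A minimal sketch of this algorithm in Python (my transcription, not code from the slides), with single linkage chosen for step 5; any of the inter-cluster distances defined below could be substituted there.

```python
import numpy as np

def agglomerative_single_linkage(points):
    """Return the merge history as (cluster_a, cluster_b, distance) triples."""
    points = np.asarray(points, dtype=float)
    # 1. initially, each object forms its own cluster
    clusters = set(range(len(points)))
    # 2. compute all pairwise distances between the initial clusters (objects)
    d = {(i, j): float(np.linalg.norm(points[i] - points[j]))
         for i in clusters for j in clusters if i < j}

    def get(i, j):                        # distance lookup, order-insensitive
        return d[(min(i, j), max(i, j))]

    merges, next_id = [], len(points)
    while len(clusters) > 1:
        # 3. merge the closest pair (A, B) into a new cluster C = A ∪ B
        a, b = min(((i, j) for i in clusters for j in clusters if i < j),
                   key=lambda p: get(*p))
        merges.append((a, b, get(a, b)))
        # 5. distance from C to every other cluster; single linkage uses
        #    d(C, K) = min(d(A, K), d(B, K))  (use max for complete linkage)
        for k in clusters - {a, b}:
            d[(min(next_id, k), max(next_id, k))] = min(get(a, k), get(b, k))
        # 4. remove A and B from the current clusters; insert C
        clusters -= {a, b}
        clusters.add(next_id)
        next_id += 1
    return merges
```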

Agglomerative Hierarchical Clustering: Starting Situation
For agglomerative hierarchical clustering we start with clusters of individual points and a proximity matrix. [Figure: individual points p1, p2, ..., p5, ... and the initial proximity matrix.]

Agglomerative Hierarchical Clustering: Intermediate Situation
After some merging steps, we have some clusters. [Figure: clusters C1, ..., C5 and the current proximity matrix.]

Agglomerative Hierarchical Clustering: Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. [Figure: clusters C1, ..., C5, with C2 and C5 about to be merged.]

Agglomerative Hierarchical Clustering: After Merging
The question is: how do we update the proximity matrix? [Figure: the merged cluster C2 ∪ C5 and the distance matrix, with its entries to C1, C3 and C4 marked "?".]
The key operation is the computation of the distance between two clusters. The different approaches to defining the distance between clusters distinguish the different algorithms.
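
For single and complete linkage this update has a simple closed form: the new row of the proximity matrix is the elementwise min (or max) of the two old rows. A small illustrative sketch (the function name is mine):

```python
import numpy as np

def merged_row(D, a, b, linkage="single"):
    """Distances from the merged cluster A ∪ B to all other clusters,
    given the full proximity matrix D and the row indices a, b."""
    combine = np.minimum if linkage == "single" else np.maximum
    row = combine(D[a], D[b])         # elementwise min (or max) of the old rows
    return np.delete(row, [a, b])     # entries for A and B themselves drop out
```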

Inter-cluster distances
Four widely used ways of defining the inter-cluster distance, i.e., the distance between two separate clusters $C_i$ and $C_j$, are:
o single linkage method (nearest neighbor): $d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$
o complete linkage method (furthest neighbor): $d(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$
o average linkage method (unweighted pair-group average): $d(C_i, C_j) = \operatorname{avg}_{x \in C_i,\, y \in C_j} d(x, y)$
o centroid linkage method (distance between cluster centroids $c_i$ and $c_j$): $d(C_i, C_j) = d(c_i, c_j)$
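
Written out in Python for two sets of points (an illustrative sketch; `dist` is the underlying point-to-point distance, here Euclidean):

```python
import numpy as np
from itertools import product

def dist(x, y):
    return float(np.linalg.norm(x - y))

def single_link(Ci, Cj):    # nearest neighbor
    return min(dist(x, y) for x, y in product(Ci, Cj))

def complete_link(Ci, Cj):  # furthest neighbor
    return max(dist(x, y) for x, y in product(Ci, Cj))

def average_link(Ci, Cj):   # unweighted pair-group average
    return sum(dist(x, y) for x, y in product(Ci, Cj)) / (len(Ci) * len(Cj))

def centroid_link(Ci, Cj):  # distance between cluster centroids
    return dist(np.mean(Ci, axis=0), np.mean(Cj, axis=0))
```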

Single linkage (minimum distance) method
The distance (dissimilarity) of two clusters is based on the two most similar (closest) points in the different clusters $C_i$ and $C_j$:
$d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$
- Determined by one pair of points, i.e., by one link in the proximity graph.
- Can handle non-elliptical shapes.
- Sensitive to noise and outliers.
Similarity matrix:
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00

Single linkage: $d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$

Hierarchical Clustering: minimum distance
[Figure: nested clusters of points 1-6 and the corresponding single-linkage dendrogram.]

Strength of minimum distance
[Figure: original points and the two clusters found; single linkage can handle non-elliptical shapes.]

Limitation of minimum distance
[Figure: original points and the two clusters found; single linkage is sensitive to noise and outliers.]

Complete linkage (maximum distance) method
The distance of two clusters is based on the two least similar (most distant) points in the different clusters $C_i$ and $C_j$:
$d(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$
- Determined by all pairs of points in the two clusters.
- Tends to break large clusters.
- Less susceptible to noise and outliers.
(Same example similarity matrix as above.)

Complete linkage: $d(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$

Cluster Similarity: maximum distance or Complete Linkage
The similarity of two clusters is based on the two most distant points in the different clusters.
- Tends to break large clusters.
- Less susceptible to noise and outliers.
- Biased towards globular clusters.

Hierarchical Clustering: maximum distance
[Figure: nested clusters of points 1-6 and the corresponding complete-linkage dendrogram.]

Strength of maximum distance
[Figure: original points and the two clusters found; complete linkage is less susceptible to noise and outliers.]

Limitations of maximum distance
[Figure: original points and the two clusters found; complete linkage tends to break large clusters.]

Average linkage (average distance) method
The distance of two clusters is the average of the pairwise distances between points in the two clusters $C_i$ and $C_j$:
$d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$
- Compromise between single and complete link.
- Need to use average connectivity for scalability, since total connectivity favors large clusters.
- Less susceptible to noise and outliers.
- Biased towards globular clusters.
(Same example similarity matrix as above.)

Average linkage: $d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$

Hierarchical Clustering: average distance
[Figure: nested clusters of points 1-6 and the corresponding average-linkage dendrogram.]

Centroid linkage (centroid distance) method
The distance of two clusters is the distance between the two centroids $c_i$ and $c_j$ of the two clusters $C_i$ and $C_j$:
$d(C_i, C_j) = d(c_i, c_j)$, where $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ and $c_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$
- Compromise between single and complete link.
- Less computationally intensive than average linkage.

Centroid linkage: $d(C_i, C_j) = d(c_i, c_j)$, with $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ and $c_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$

Cluster Similarity: Ward's Method
The similarity of two clusters is based on the increase in squared error when the two clusters are merged.
- Similar to group average if the distance between points is the squared distance.
- Less susceptible to noise and outliers.
- Biased towards globular clusters.
- Hierarchical analogue of K-means, but Ward's greedy merging does not necessarily reach a (local) minimum of the squared error.
- Can be used to initialize K-means (see the sketch below).
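
A sketch of that initialization idea on synthetic data, assuming scikit-learn is available: run Ward's method, take the resulting cluster means, and hand them to K-means as starting centroids.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # made-up data for illustration

k = 3
# Ward's method gives an initial flat partition...
ward_labels = AgglomerativeClustering(n_clusters=k, linkage='ward').fit_predict(X)
# ...whose cluster means seed K-means.
centroids = np.stack([X[ward_labels == c].mean(axis=0) for c in range(k)])
kmeans = KMeans(n_clusters=k, init=centroids, n_init=1).fit(X)
```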

Hierarchical Clustering: Ward's method
[Figure: nested clusters of points 1-6 and the corresponding Ward's-method dendrogram.]

Hierarchical Clustering: comparison
[Figure: the same six points clustered with MIN (single linkage), MAX (complete linkage), group average, and Ward's method, side by side.]

Comparison of minimum, maximum, average and centroid distance
Minimum distance: when d_min is used to measure the distance between clusters, the algorithm is called the nearest-neighbor or single-linkage clustering algorithm. If the algorithm is allowed to run until only one cluster remains, the result is a minimum spanning tree (MST). This algorithm favors elongated classes.
Maximum distance: when d_max is used to measure the distance between clusters, the algorithm is called the farthest-neighbor or complete-linkage clustering algorithm. From a graph-theoretic point of view, each cluster constitutes a complete sub-graph. This algorithm favors compact classes.
Average and centroid distance: the minimum and maximum distance are extremely sensitive to outliers, since their measurement of between-cluster distance involves minima or maxima. The average and centroid distance approaches are more robust to outliers. Of the two, the centroid distance is computationally more attractive: notice that the average distance approach involves the computation of $|C_i| \cdot |C_j|$ distances for each pair of clusters.
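
The MST remark can be checked numerically (a sketch on random data, not from the slides): the n-1 merge heights produced by single linkage coincide with the n-1 edge weights of the minimum spanning tree over the same points.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# heights at which single linkage merges clusters...
merge_heights = np.sort(linkage(X, method='single')[:, 2])
# ...equal the edge weights of the MST of the pairwise-distance graph
mst = minimum_spanning_tree(squareform(pdist(X)))
assert np.allclose(merge_heights, np.sort(mst.data))
```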

Hierarchical Clustering: time and space requirements
- O(N^2) space, since it uses the proximity matrix (N is the number of points).
- O(N^3) time in many cases: there are N steps, and at each step the proximity matrix, of size N^2, must be searched and updated.
- By being careful, the complexity can be reduced to O(N^2 log N) time for some approaches.

Hierarchical Clustering: problems and limitations
- Once a decision is made to combine two clusters, it cannot be undone.
- No objective function is directly minimized.
- Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling different sized clusters and convex shapes; breaking large clusters.

Advantages and disadvantages of hierarchical clustering
Advantages:
- Does not require the number of clusters to be known in advance
- No input parameters (besides the choice of the (dis)similarity measure)
- Computes a complete hierarchy of clusters
- Good result visualizations are integrated into the methods
Disadvantages:
- May not scale well: the runtime for the standard methods is O(n^2 log n)
- No explicit clusters: a flat partition can be derived afterwards, e.g. via a cut through the dendrogram or a termination condition in the construction (see the sketch below)
- No automatic discovery of the optimal number of clusters
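
For the "no explicit clusters" point, a flat partition is easy to derive afterwards; a sketch with SciPy, on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Z = linkage(X, method='average')

labels_by_height = fcluster(Z, t=2.5, criterion='distance')  # cut the dendrogram at height 2.5
labels_by_count = fcluster(Z, t=4, criterion='maxclust')     # or ask for at most 4 clusters
```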

Hierarchical clustering of tissues and genes: Alizadeh et al. 2000, "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", Nature 403:503-511.