Hierarchical clustering for gene expression data analysis
Giorgio Valentini
e-mail: valentini@dsi.unimi.it
Clustering of Microarray Data
1. Clustering of gene expression profiles (rows) => discovery of co-regulated and functionally related genes (or unrelated genes: different clusters)
2. Clustering of samples (columns) => identification of sub-types of related samples
3. Two-way clustering => combined sample clustering with gene clustering, to identify which genes are the most important for sample clustering
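The three uses above can be sketched with SciPy's hierarchical-clustering routines. The expression matrix below is synthetic random data standing in for a real genes-by-samples table, and the group counts and correlation metric are illustrative assumptions, not from the slides:

```python
# Sketch: clustering a genes-by-samples matrix along both axes with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 6))  # 20 genes (rows) x 6 samples (columns), toy data

# 1. Cluster gene expression profiles (rows).
gene_tree = linkage(expr, method="average", metric="correlation")

# 2. Cluster samples (columns) by clustering the transposed matrix.
sample_tree = linkage(expr.T, method="average", metric="correlation")

# 3. Two-way clustering: combine a flat cut of each tree to reorder both axes.
gene_groups = fcluster(gene_tree, t=3, criterion="maxclust")      # 3 gene groups
sample_groups = fcluster(sample_tree, t=2, criterion="maxclust")  # 2 sample groups
print(gene_groups.shape, sample_groups.shape)
```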
Hierarchical Clustering
[Figure: six example points grouped into nested clusters, with the corresponding dendrogram]
Dendrograms
- The root represents the whole data set
- A leaf represents a single object in the data set
- An internal node represents the union of all objects in its subtree
- The height of an internal node represents the distance between its two child nodes
Hierarchical Clustering
Two main types of hierarchical clustering:
1. Agglomerative:
- Start with the points as individual clusters
- At each step, merge the closest pair of clusters, until only one cluster (or k clusters) is left
- This requires defining the notion of cluster proximity
2. Divisive:
- Start with one, all-inclusive cluster
- At each step, split a cluster, until each cluster contains a point (or there are k clusters)
- Need to decide which cluster to split at each step
Basic Agglomerative Hierarchical Clustering Algorithm
1. Initially, each object forms its own cluster
2. Compute all pairwise distances between the initial clusters (objects)
repeat
3. Merge the closest pair (A, B) in the set of the current clusters into a new cluster C = A ∪ B
4. Remove A and B from the set of current clusters; insert C into the set of current clusters
5. Determine the distance between the new cluster C and all other clusters in the set of current clusters
until only a single cluster remains
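A minimal Python sketch of this loop, using single linkage to compute step 5; the function names and the four toy points are illustrative assumptions:

```python
# Naive agglomerative clustering: O(N^3), for exposition only.
import math

def agglomerate(points):
    """Merge clusters until one remains; return the merge history."""
    # 1. Each object starts as its own cluster (a list of point indices).
    clusters = [[i] for i in range(len(points))]

    def d(a, b):  # Euclidean distance between two points
        return math.dist(points[a], points[b])

    def cluster_dist(A, B):  # single-linkage distance between index clusters
        return min(d(a, b) for a in A for b in B)

    history = []
    while len(clusters) > 1:
        # 3. Find the closest pair of clusters...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]  # ...and merge them: C = A ∪ B
        # 4. Remove A and B, insert C.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

steps = agglomerate([(0, 0), (0, 1), (5, 0), (5, 1)])
print(steps)
```

The two vertical pairs merge first (distance 1), then the two resulting clusters merge last.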
Agglomerative Hierarchical Clustering: Starting Situation
For agglomerative hierarchical clustering we start with clusters of individual points and a proximity matrix.
[Figure: points p1 … p5 and their pairwise proximity matrix]
Agglomerative Hierarchical Clustering: Intermediate Situation
After some merging steps, we have some clusters.
[Figure: clusters C1 … C5 and the current proximity matrix]
Agglomerative Hierarchical Clustering: Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1 … C5, with C2 and C5 as the closest pair]
Agglomerative Hierarchical Clustering: After Merging
The question is: how do we update the proximity matrix?
[Figure: distance matrix containing the merged cluster C2 ∪ C5; its distances to C1, C3 and C4 are still to be determined]
The key operation is the computation of the distance between two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms.
Inter-cluster distances
Four widely used ways of defining the inter-cluster distance, i.e., the distance between two separate clusters Ci and Cj, are:
- single linkage method (nearest neighbor): d(Ci, Cj) = min{ d(x, y) : x ∈ Ci, y ∈ Cj }
- complete linkage method (furthest neighbor): d(Ci, Cj) = max{ d(x, y) : x ∈ Ci, y ∈ Cj }
- average linkage method (unweighted pair-group average): d(Ci, Cj) = avg{ d(x, y) : x ∈ Ci, y ∈ Cj }
- centroid linkage method (distance between cluster centroids ci and cj): d(Ci, Cj) = d(ci, cj)
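The four definitions translate directly into Python. Here `Ci` and `Cj` are lists of point tuples and `d` is plain Euclidean distance; the function names and the two toy clusters are illustrative, and the code is a didactic sketch, not an efficient implementation:

```python
# The four inter-cluster distances, written one-to-one from their definitions.
import math

def d(x, y):
    return math.dist(x, y)

def single_link(Ci, Cj):    # min over all cross-cluster pairs
    return min(d(x, y) for x in Ci for y in Cj)

def complete_link(Ci, Cj):  # max over all cross-cluster pairs
    return max(d(x, y) for x in Ci for y in Cj)

def average_link(Ci, Cj):   # mean over all cross-cluster pairs
    return sum(d(x, y) for x in Ci for y in Cj) / (len(Ci) * len(Cj))

def centroid_link(Ci, Cj):  # distance between the two centroids
    ci = [sum(v) / len(Ci) for v in zip(*Ci)]
    cj = [sum(v) / len(Cj) for v in zip(*Cj)]
    return d(ci, cj)

A = [(0.0, 0.0), (0.0, 2.0)]
B = [(4.0, 0.0), (4.0, 2.0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
```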
Single linkage (minimum distance) method
The distance (dissimilarity) of two clusters is based on the two most similar (closest) points in the different clusters Ci and Cj:
d(Ci, Cj) = min{ d(x, y) : x ∈ Ci, y ∈ Cj }
- Determined by one pair of points, i.e., by one link in the proximity graph
- Can handle non-elliptical shapes
- Sensitive to noise and outliers
Similarity matrix:
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
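Single linkage can be run on this five-object similarity matrix with SciPy, after converting similarities to distances; the conversion dist = 1 − sim is an assumption (any monotone decreasing conversion gives the same merge order):

```python
# Single linkage on the five-object similarity matrix from the slide.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

sim = np.array([
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
])
Z = linkage(squareform(1.0 - sim), method="single")

# A two-cluster cut groups {I1, I2, I3} against {I4, I5}.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```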
Single linkage
d(Ci, Cj) = min{ d(x, y) : x ∈ Ci, y ∈ Cj }
Hierarchical Clustering: minimum distance
[Figure: nested clusters and dendrogram produced by single linkage on the six example points]
Strength of minimum distance
[Figure: original points vs. the two clusters found]
Limitation of minimum distance
[Figure: original points vs. the two clusters found]
Complete Linkage (maximum distance) method
The distance of two clusters is based on the two least similar (most distant) points in the different clusters Ci and Cj:
d(Ci, Cj) = max{ d(x, y) : x ∈ Ci, y ∈ Cj }
- Determined by all pairs of points in the two clusters
- Tends to break large clusters
- Less susceptible to noise and outliers
(Same similarity matrix as in the single-linkage example.)
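Clustering the same five-object similarity matrix with complete linkage illustrates how the linkage choice changes the result: a two-cluster cut now separates {I1, I2} from {I3, I4, I5} instead of {I1, I2, I3} from {I4, I5}. As before, the similarity-to-distance conversion dist = 1 − sim is an assumption:

```python
# Complete linkage on the same five-object similarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

sim = np.array([
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
])
Z = linkage(squareform(1.0 - sim), method="complete")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```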
Complete linkage
d(Ci, Cj) = max{ d(x, y) : x ∈ Ci, y ∈ Cj }
Cluster Similarity: maximum distance or Complete Linkage
The similarity of two clusters is based on the two most distant points in the different clusters.
- Tends to break large clusters
- Less susceptible to noise and outliers
- Biased towards globular clusters
Hierarchical Clustering: maximum distance
[Figure: nested clusters and dendrogram produced by complete linkage on the six example points]
Strength of maximum distance
[Figure: original points vs. the two clusters found]
Limitations of maximum distance
[Figure: original points vs. the two clusters found]
Average linkage (average distance) method
The distance of two clusters is the average of the pairwise distances between points in the two clusters Ci and Cj:
d(Ci, Cj) = (1 / (|Ci| · |Cj|)) Σ_{x ∈ Ci, y ∈ Cj} d(x, y)
- Compromise between single and complete link
- Need to use average connectivity for scalability, since total connectivity favors large clusters
- Less susceptible to noise and outliers
- Biased towards globular clusters
(Same similarity matrix as in the single-linkage example.)
Average linkage
d(Ci, Cj) = (1 / (|Ci| · |Cj|)) Σ_{x ∈ Ci, y ∈ Cj} d(x, y)
Hierarchical Clustering: average distance
[Figure: nested clusters and dendrogram produced by average linkage on the six example points]
Centroid linkage (centroid distance) method
The distance of two clusters is the distance between the two centroids ci and cj of the clusters Ci and Cj:
d(Ci, Cj) = d(ci, cj), where ci = (1/|Ci|) Σ_{x ∈ Ci} x and cj = (1/|Cj|) Σ_{x ∈ Cj} x
- Compromise between single and complete link
- Less computationally intensive than average linkage
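Note that, unlike the three pairwise linkages, centroid linkage needs the raw coordinate vectors (to average them into centroids), not just a pairwise distance matrix. A small sketch with four illustrative toy points:

```python
# Centroid linkage in SciPy: pass the raw Euclidean observations.
import numpy as np
from scipy.cluster.hierarchy import linkage

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
Z = linkage(pts, method="centroid")  # requires raw observations, Euclidean metric

# The two vertical pairs merge first; the final merge height is the distance
# between their centroids (0, 0.5) and (5, 0.5), i.e. 5.0.
print(Z[-1, 2])
```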
Centroid linkage
d(Ci, Cj) = d(ci, cj), where ci = (1/|Ci|) Σ_{x ∈ Ci} x and cj = (1/|Cj|) Σ_{x ∈ Cj} x
Cluster Similarity: Ward's Method
The similarity of two clusters is based on the increase in squared error when the two clusters are merged.
- Similar to group average if the distance between points is the distance squared
- Less susceptible to noise and outliers
- Biased towards globular clusters
- Hierarchical analogue of K-means: but Ward's method does not correspond to a local minimum; it can be used to initialize K-means
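The initialization idea can be sketched with SciPy alone: cut a Ward hierarchy into k clusters and take the cluster means as starting centroids for k-means. The two toy blobs and the cluster count are illustrative assumptions:

```python
# Ward's method cut into k clusters; cluster means as k-means starting centroids.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated toy blobs of 10 points each.
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(4.0, 0.3, (10, 2))])

labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")

# Means of the Ward clusters: natural initial centroids for k-means.
centroids = np.array([X[labels == k].mean(axis=0) for k in (1, 2)])
print(centroids.round(2))
```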
Hierarchical Clustering: Ward's method
[Figure: nested clusters and dendrogram produced by Ward's method on the six example points]
Hierarchical Clustering: comparison
[Figure: side-by-side comparison of the clusterings produced by MIN (single linkage), MAX (complete linkage), group average and Ward's method on the same six points]
Comparison of minimum, maximum, average and centroid distance
Minimum distance
- When d_min is used to measure the distance between clusters, the algorithm is called the nearest-neighbor or single-linkage clustering algorithm
- If the algorithm is allowed to run until only one cluster remains, the result is a minimum spanning tree (MST)
- This algorithm favors elongated classes
Maximum distance
- When d_max is used to measure the distance between clusters, the algorithm is called the farthest-neighbor or complete-linkage clustering algorithm
- From a graph-theoretic point of view, each cluster constitutes a complete sub-graph
- This algorithm favors compact classes
Average and centroid distance
- The minimum and maximum distance are extremely sensitive to outliers, since their measurement of between-cluster distance involves minima or maxima
- The average and centroid distance approaches are more robust to outliers
- Of the two, the centroid distance is computationally more attractive: notice that the average distance approach involves the computation of |Ci| · |Cj| distances for each pair of clusters
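The single-linkage/MST connection mentioned above can be checked numerically: the N−1 merge heights of single linkage are exactly the edge weights of the minimum spanning tree, sorted. The random point set below is an illustrative assumption:

```python
# Checking that single-linkage merge heights equal sorted MST edge weights.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))  # 8 random points (toy data)

heights = linkage(X, method="single")[:, 2]        # the 7 merge distances
mst = minimum_spanning_tree(squareform(pdist(X)))  # the 7 MST edge weights
print(np.allclose(np.sort(heights), np.sort(mst.data)))
```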
Hierarchical Clustering: Time and Space requirements
- O(N²) space, since it uses the proximity matrix (N is the number of points)
- O(N³) time in many cases: there are N steps, and at each step the proximity matrix (of size N²) must be updated and searched
- By being careful, the complexity can be reduced to O(N² log N) time for some approaches
Hierarchical Clustering: problems and limitations
- Once a decision is made to combine two clusters, it cannot be undone
- No objective function is directly minimized
- Different schemes have problems with one or more of the following:
  - Sensitivity to noise and outliers
  - Difficulty handling different sized clusters and convex shapes
  - Breaking large clusters
Advantages and disadvantages of Hierarchical clustering
Advantages
- Does not require the number of clusters to be known in advance
- No input parameters (besides the choice of the (dis)similarity)
- Computes a complete hierarchy of clusters
- Good result visualizations integrated into the methods
Disadvantages
- May not scale well: runtime for the standard methods is O(n² log n)
- No explicit clusters: a flat partition can be derived afterwards (e.g. via a cut through the dendrogram or a termination condition in the construction)
- No automatic discovery of the optimal clusters
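The flat-partition point above has two common forms: cut the dendrogram at a fixed height, or request a fixed number of clusters. A sketch with illustrative 1-D toy points:

```python
# Deriving a flat partition from a hierarchy with SciPy's fcluster.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]])
Z = linkage(X, method="average")

by_height = fcluster(Z, t=1.0, criterion="distance")  # cut dendrogram at height 1.0
by_count = fcluster(Z, t=3, criterion="maxclust")     # ask for exactly 3 clusters
print(by_height, by_count)
```

Both cuts recover the same partition here: {0, 0.1, 0.2}, {5, 5.1} and {9}.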
Hierarchical clustering of tissues and genes: Alizadeh et al. 2000, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403:3.