Cutting the Dendrogram through Permutation Tests


Dario Bruzzese (1) and Domenico Vistocco (2)

(1) Dipartimento di Medicina Preventiva, Università di Napoli - Federico II, Via S. Pansini 5, Napoli, Italy, dario.bruzzese@unina.it
(2) Dipartimento di Scienze Economiche, Università di Cassino, Via S. Angelo S.N., Cassino, Italy, vistocco@unicas.it

Abstract. This paper introduces a novel approach for detecting a sub-optimal partition starting from the dendrogram produced by a hierarchical clustering technique. The approach exploits permutation tests and can be used regardless of the agglomeration method and distance measure adopted in the classification process, since it relies on the same criteria used to grow the tree. Moreover, the proposed approach can detect partitions that are not identifiable through a traditional cut, as the resulting clusters may correspond to different heights of the tree.

Keywords: Hierarchical clustering, Permutation tests, Cluster detection

1 Introduction and motivation

Hierarchical clustering is one of the most widespread analytical approaches to classification problems, mainly owing to the visual power of its associated graphical representation, the dendrogram, and to the directness of the cluster generation process. All this aside, properly choosing the optimal number of clusters still represents the main difficulty for the final user. Various (semi)automatic criteria can be devised to reach the final classification; very often, the informal solution adopted is to find the height in the dendrogram where large changes in fusion level occur. Broadly speaking, all these criteria determine a threshold value of the ultrametric used to grow the dendrogram, such that all the units with a dissimilarity below this threshold belong to the same cluster. Such an approach, however, searches for the solution only within a small subset of the whole family of partitions housed in the dendrogram: those which stem from horizontal cuts. This induces a one-to-one relation between the number k of clusters and the partition, so that fixing one element of the relation (e.g. the number of clusters) univocally determines the other. It could happen, however, that clusters differ in terms of their internal coherence in such a way that the same threshold value would not be suitable for all of them. Figure 1 shows the dendrogram obtained on a simulated dataset: it contains 4 different clusters of the same cardinality, generated from multivariate normal distributions with different mean vectors and variance-covariance matrices. The Ward criterion with the Euclidean distance was used to grow the tree.

Fig. 1. Two different partitions in 4 clusters of a simulated dataset. The solid line refers to a traditional horizontal criterion, while the dashed line refers to a possible solution offered by the proposed algorithm.

The solid line in Figure 1 highlights the 4-cluster solution obtained by cutting the dendrogram with a traditional horizontal criterion. This partition, which is actually the only one that can produce a 4-cluster solution, isolates a very small cluster on the left side of the dendrogram, while leaving ungrouped the two clusters on the right that, on the contrary, contain units belonging to different populations. A different solution, among those that still comply with the hierarchical classification process, could thus be the one described by the dashed line, characterized by two local thresholds located at different heights; it turns out that this non-conventional cut better recovers the original cluster structure (according to the misclassification index, the first partition produces an error rate of 0.58, while the second is characterized by an error rate of 0.40).

The possibility of merging clusters at different heights (thus departing from the horizontal-cut principle previously described) makes it necessary to implement a procedure able to automatically explore the complete set of partitions, placing partial thresholds wherever two clusters plainly reflect specific characteristics. The proposed algorithm exploits the theoretical framework of permutation tests to reach this goal. The most important by-product of this approach is the automatic identification of the number of clusters.

The paper is organized as follows: the idea used for detecting the partition is introduced in Section 2, and the notation and the proposed procedure are detailed in Section 3; Section 4 shows some results on a genetic dataset, together with a simulation study exploring the influence of the tuning parameters on the algorithm output. Finally, some concluding remarks and future work directions close the paper.

2 The basic intuition

The proposed algorithm exploits a permutation test approach to automatically detect a partition starting from the dendrogram produced by a hierarchical clustering. The algorithm retraces the tree downward, starting from the root, where all objects are classified in a unique cluster, and moving a partial threshold down until a link joining two clusters is encountered. A permutation test is then performed in order to verify whether the two clusters must be accounted as a unique group (the null hypothesis) or not (the alternative one). If the null cannot be rejected, the corresponding branch becomes a cluster of the final partition and none of its sub-branches is processed any further; otherwise, each of them will be visited in the course of the procedure. In both cases, the partial threshold continues its path and the next branch of the dendrogram is processed. The algorithm stops when there are no more branches that withstand the test (i.e. the null cannot be rejected any more).

The permutation test on which the whole procedure is based can be summarized as follows. Under the null, if all the units belonging to the two clusters are mixed together and then randomly split up, with the only constraint of preserving the group cardinalities, the distance between the shuffled clusters should not be very different from the original one. Repeating the shuffling m times, a Monte Carlo p-value can be computed from the number of permuted distances at least as extreme as the original one. The whole algorithm is detailed in the next section.

3 The algorithm

Let n denote the number of objects to classify, let C_L^k and C_R^k denote the two classes merged at level k (k = 1, ..., n-1), and let h(C_L^k ∪ C_R^k) denote the height necessary to merge C_L^k and C_R^k. Finally, let h(C_j^k) denote the height at which C_j^k has been obtained (j ∈ {L, R}). In Figure 2 the adopted formalism is superimposed, for k ∈ {1, 2}, on the dendrogram shown in the previous section.

For each k, the difference between max_{j ∈ {L,R}} h(C_j^k) and min_{j ∈ {L,R}} h(C_j^k) can be considered as the minimum cost necessary to merge the two classes: minimum because the dissimilarity measure used in the agglomeration process must rise at least from min_{j ∈ {L,R}} h(C_j^k) to max_{j ∈ {L,R}} h(C_j^k) in order to merge the two clusters.
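To make the notation concrete, the following R sketch (our illustration, not the authors' code; merge_heights is a hypothetical helper) recovers the three heights involved at any merge step of an hclust object. Note that hclust numbers merge steps bottom-up, whereas the paper indexes them starting from the root.

# Hypothetical helper: for the merge performed at step k of an hclust object,
# recover h(C_L), h(C_R) and h(C_L u C_R) from the $merge and $height
# components (see ?hclust). Singletons enter the tree at height 0.
merge_heights <- function(hc, k) {
  kids  <- hc$merge[k, ]  # negative entries are singletons, positive ones earlier merges
  h_kid <- vapply(kids, function(j) if (j < 0) 0 else hc$height[j], numeric(1))
  c(h_left = h_kid[1], h_right = h_kid[2], h_union = hc$height[k])
}

hc <- hclust(dist(USArrests), method = "ward.D2")
merge_heights(hc, nrow(USArrests) - 1)  # the last step is the root merge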

Fig. 2. Exemplification of the main notation adopted, with respect to the dendrogram reported in Figure 1.

The difference between h(C_L^k ∪ C_R^k) and max_{j ∈ {L,R}} h(C_j^k) can instead be considered as the cost actually incurred for merging C_L^k and C_R^k. The ratio between these two costs,

  cost(C_L^k ∪ C_R^k) = [h(C_L^k ∪ C_R^k) - max_{j ∈ {L,R}} h(C_j^k)] / [max_{j ∈ {L,R}} h(C_j^k) - min_{j ∈ {L,R}} h(C_j^k)],

is thus a measure that characterizes the aggregation process resulting in the new class C_L^k ∪ C_R^k, and it is indeed the statistic used in the permutation test approach for automatically detecting the clusters.

The proposed procedure is detailed in Algorithm 1. In particular, we denote with aggregationLevelsToVisit a vector containing the heights of the dendrogram still to be explored, and with permClusters an object storing the clusters detected by the procedure. The permutation test step is embedded in row 6 of Algorithm 1: for each k, a permutation test is designed to test the null hypothesis that the two groups C_L^k and C_R^k really belong to the same cluster, i.e. H_0: C_L^k ≡ C_R^k. Under H_0, mixing up (i.e. permuting) the statistical units of C_L^k and C_R^k should not alter the aggregation process resulting in their merging.
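Under the same reading of the formula, the ratio can be transcribed directly in R, building on the hypothetical merge_heights helper sketched above (again an illustration, not the published implementation):

# Cost ratio for the merge at step k: incurred cost over minimum cost.
# The ratio is undefined when both children appear at the same height
# (e.g. two singletons), a degenerate case a full implementation must handle.
cost_ratio <- function(hc, k) {
  h <- merge_heights(hc, k)
  unname((h["h_union"] - max(h["h_left"], h["h_right"])) /
           (max(h["h_left"], h["h_right"]) - min(h["h_left"], h["h_right"])))
}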

Input: A dataset and its related dendrogram
Output: A partition of the dataset
 1. initialization:
 2.   aggregationLevelsToVisit <- h(C_L^1 ∪ C_R^1)
 3.   permClusters <- [ ]
 4.   i <- 1
 5. repeat
 6.   if C_L^i ≡ C_R^i
 7.     add C_L^i ∪ C_R^i to permClusters
 8.   else
 9.     add h(C_L^i) and h(C_R^i) to aggregationLevelsToVisit
10.     sort aggregationLevelsToVisit in descending order
11.   end
12.   remove the first element from aggregationLevelsToVisit
13.   i <- i + 1
14. until aggregationLevelsToVisit is empty

Algorithm 1: The proposed PermClust algorithm

Let mC_L^k and mC_R^k be the two new classes obtained by permuting the elements of C_L^k and C_R^k. As a matter of fact, the hierarchical clustering process is invariant with respect to permutations of the original observations, so growing a single dendrogram on the permuted set would simply re-establish the same structure. For this reason, after mC_L^k and mC_R^k have been obtained, a new dendrogram is generated for each of them; the heights at which the two classes are built up again clearly correspond to the heights of the root nodes of the corresponding dendrograms. The ratio

  cost(mC_L^k ∪ mC_R^k) = [h(C_L^k ∪ C_R^k) - max_{j ∈ {L,R}} h(mC_j^k)] / [max_{j ∈ {L,R}} h(mC_j^k) - min_{j ∈ {L,R}} h(mC_j^k)]

is then a measure that characterizes the aggregation process resulting in the new (potential) class mC_L^k ∪ mC_R^k. Under H_0, the aggregation process resulting in the new cluster C_L^k ∪ C_R^k should be very similar to the one that would potentially have produced mC_L^k ∪ mC_R^k; thus the two values cost(C_L^k ∪ C_R^k) and cost(mC_L^k ∪ mC_R^k) should be close enough. The permutation procedure is repeated M times, and each time a new couple mC_L^k, mC_R^k is obtained. The Monte Carlo p-value (Good, 1994) is then computed as:

  p = [#{cost(mC_L^k ∪ mC_R^k) ≥ cost(C_L^k ∪ C_R^k)} + 1] / M

4 Some results

The PermClust algorithm has been applied both to real and to synthetic datasets; in the following, the main results are presented.
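One possible rendering of this test in R is sketched below. It is a hedged reconstruction from the description above (the paper does not reproduce the published implementation), it assumes both classes contain at least two units, and it takes the observed heights h(C_L^k), h(C_R^k) and h(C_L^k ∪ C_R^k) from the original tree:

# Hedged sketch of the permutation test for H0: C_L = C_R.
# x_left, x_right: data matrices of the two classes; h_left, h_right, h_union:
# the observed heights from the original dendrogram; M: number of permutations.
perm_test <- function(x_left, x_right, h_left, h_right, h_union,
                      M = 999, method = "ward.D2") {
  cost <- function(hl, hr) (h_union - max(hl, hr)) / (max(hl, hr) - min(hl, hr))
  obs  <- cost(h_left, h_right)
  # h(mC_j): root height of a dendrogram re-grown on each permuted class
  root_h <- function(x) max(hclust(dist(x), method = method)$height)
  pooled <- rbind(x_left, x_right)
  n_l    <- nrow(x_left)
  perm_costs <- replicate(M, {
    idx <- sample(nrow(pooled))  # mix the units of the two classes...
    cost(root_h(pooled[idx[1:n_l], , drop = FALSE]),      # ...mC_L keeps |C_L| units
         root_h(pooled[idx[-(1:n_l)], , drop = FALSE]))   # ...mC_R keeps |C_R| units
  })
  (sum(perm_costs >= obs) + 1) / M  # Monte Carlo p-value as in the formula above
}

In the traversal of Algorithm 1, the branch under examination would then be accepted as a final cluster of the partition whenever the returned p-value exceeds the chosen significance level.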

In all the computations, the dendrograms have been generated with the Euclidean distance and the Ward agglomeration criterion (Maechler et al., 2005). Unless otherwise specified, p-values less than 0.01 were considered statistically significant in the permutation test step.

Figure 3(a) shows (a zoom of) the dendrogram obtained on the Yeast galactose dataset, which describes a subset of 205 genes reflecting four functional categories of the Gene Ontology (Ideker et al., 2001). [Footnote 1: For this application the algorithm, written in the R language, takes almost 50 seconds on an Intel Core 2 Duo 2.26 GHz machine with 4 GB of RAM. More efficiency could be achieved by optimizing the code and implementing it in a compiled language.]

Fig. 3. (a) The dendrogram obtained on the Yeast galactose dataset with the partition selected by the PermClust algorithm; numbers refer to the p-values of the associated permutation tests. (b) Visual representation of the confusion matrix resulting from the PermClust algorithm and (c) from a k-means with k = 4.

The obtained partition is highlighted using red rectangles and clearly reveals the 4-cluster structure originally contained in the dataset. Panels (b) and (c) show the confusion matrices related to the proposed algorithm (b) and to a k-means procedure (c) with k equal to 4. The different sub-panels depict the original clusters in the dataset, while the different bars refer to the clusters detected by the classification procedures. It can be noticed that the proposed algorithm correctly assigns the units to the first and the fourth class, while a small fraction of units belonging to the second cluster is misclassified into the third cluster. K-means, on the contrary, is unable to grasp the second cluster (whose units are misclassified into the third cluster), and small misclassification rates also characterize the first and the fourth clusters. The overall misclassification rate was 1.5% for PermClust and 8.3% for the k-means procedure. It is worth noticing that the partition selected by the proposed algorithm agrees with the orthodox 4-cluster solution, but here it has been detected automatically.
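For reference, trees of the kind described here can be grown in R along these lines (a sketch: geneExpr is a placeholder name for the 205-gene expression matrix, and "ward.D2" is the name under which the Ward criterion on Euclidean distances is available in current versions of hclust):

# Sketch of the dendrogram construction used throughout the experiments.
d  <- dist(geneExpr, method = "euclidean")  # 'geneExpr' is a placeholder name
hc <- hclust(d, method = "ward.D2")         # Ward criterion on Euclidean distances
plot(hc, labels = FALSE)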

The PermClust algorithm has also been tested on artificial datasets. In particular, Figure 4 shows the results of the algorithm on artificial datasets generated according to the random cluster generation method proposed in Qiu and Joe (2006a, 2009). The generated data differ in terms of the number of clusters (k = 2, 3, 4, 5, 6, 7) and of the number of variables (p = 5, 10). [Footnote 2: With p = 15 the performance of the algorithm is almost equal to p = 10; the corresponding figure is not reported for the sake of brevity.] The artificial data have been generated using a value of 0.01 for the separation index (Qiu and Joe, 2006b) between any cluster and its nearest neighboring cluster, which reflects a close cluster structure. For each combination of k and p, s = 100 different datasets have been generated.

Fig. 4. Distribution of the number of clusters detected by the PermClust algorithm for artificial datasets in the case of 5 variables (a) and 10 variables (b).

Figure 4(a) shows the number of clusters composing the partition detected by the PermClust algorithm using p = 5 variables. The different columns of the figure depict the different values of k, while the rows refer to the significance level used in the permutation test step of the algorithm (see row 6 of Algorithm 1). The barplots in each panel show the distribution of the number of clusters detected by the algorithm over the s simulations. The same structure is used for the case of a dataset with 10 variables (Figure 4(b)). As can be noticed, the stability of the algorithm strictly depends on the combination of the significance level, the cardinality of the cluster structure and the number of variables. In particular, while a significance level of 0.01 (last row of Figures 4(a) and (b)) always allows the best results to be achieved, the accuracy of the solution is inversely proportional to the ratio between k and p. In the case of a simple cluster structure (k = 2, 3), the algorithm seems to fail even with a large number of available variables.
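Simulated data of this kind can be produced with the clusterGeneration package cited above. The following sketch shows one plausible call for a single scenario of the design; the argument names follow the genRandomClust() help page and should be checked against the installed package version:

library(clusterGeneration)  # Qiu and Joe (2009)
set.seed(1)
# One scenario of the design: k = 4 clusters, p = 5 (non-noisy) variables,
# separation index 0.01 between each cluster and its nearest neighbor.
sim <- genRandomClust(numClust = 4, sepVal = 0.01, numNonNoisy = 5,
                      numReplicate = 100, fileName = "permclust_sim")
str(sim$datList[[1]])  # the first of the s = 100 generated datasets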

5 Concluding remarks and further developments

The output of hierarchical clustering methods is typically displayed as a dendrogram describing a family of partitions indexed by an ultrametric distance. After the tree structure of the dendrogram has been set up, the trickiest problem is that of cutting the tree with a suitable threshold in order to extract a sub-optimal classification. Several (more or less) objective criteria may be used to achieve this goal, e.g. the deepest step, but most often the partition relies on a subjective choice led by interpretation issues. Additionally, whatever the chosen criterion, only one solution can be obtained for each desired granularity, i.e. the one where clusters are joined at consecutive heights starting from the adopted threshold.

In this paper we propose an algorithm, exploiting the methodological framework of permutation tests, that automatically finds a sub-optimal partition whose clusters do not necessarily obey the afore-mentioned principle. The algorithm allows us to explore partitions which are not directly achievable using a standard cut-level approach.

Further work should concern a comparison of the obtained partition with partitions of the same dataset derived from common partitioning methods; a comparison in terms of widely used quality indexes (Rand, 1971) should strengthen the proposal. Furthermore, the study of the stability of the obtained partitions with respect to the tuning parameters used in the permutation test procedure, and the study of the computational complexity, are topics of interest for further research.

References

GOOD, P. (1994). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer, New York.
IDEKER, T., THORSSON, V., RANISH, J.A., CHRISTMAS, R., BUHLER, J., ENG, J.K., BUMGARNER, R.E., GOODLETT, D.R., AEBERSOLD, R. and HOOD, L. (2001). Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292.
MAECHLER, M., ROUSSEEUW, P., STRUYF, A. and HUBERT, M. (2005). Cluster Analysis Basics and Extensions. Unpublished.
QIU, W.L. and JOE, H. (2006a). Generation of Random Clusters with Specified Degree of Separation. Journal of Classification, 23(2).
QIU, W.L. and JOE, H. (2006b). Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50.
QIU, W.L. and JOE, H. (2009). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation). R package.
R DEVELOPMENT CORE TEAM (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
RAND, W.M. (1971). Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336).
