Interpretability and Informativeness of Clustering Methods for Exploratory Analysis of Clinical Data


Martin Azizyan, Aarti Singh
Machine Learning Department, Carnegie Mellon University

Wei Wu
Lane Center for Computational Biology, Carnegie Mellon University

1 Introduction

Clustering methods are among the most commonly used tools for exploratory data analysis. However, using clustering for data analysis can be challenging for modern datasets that are high-dimensional, complex in nature, and lack a ground-truth labeling. Traditional tools, such as summarizing and plotting clusters, are of limited benefit in a high-dimensional setting. On the other hand, while many clustering methods have been studied theoretically, such analysis often has limited instructive value in practice due to unverifiable assumptions, oracle-dependent tuning parameters, or unquantified finite-sample effects.

This study is motivated by a clinical dataset that is affected by each of the complications described above, which can reduce the reliability and informativeness of common approaches to cluster analysis. The dataset, which contains measurements of 11 demographic and medical features of 78 patients with various severity levels of asthma and healthy control subjects from the Severe Asthma Research Program (SARP), was studied previously in [1], where K-means clustering was used to discover subphenotypes of similarly presenting patients. Clinical knowledge suggests that i) the dataset contains multiple overlapping clusters of patients with no clear low-density separation between them, because symptoms are not clearly delineated across severity levels of asthma, and ii) the density of data points (patients) differs between clusters, because patients become increasingly rare as the severity of asthma increases (more patients have mild asthma than severe asthma).
These characteristics plague other clinical datasets as well, and hence we expect our investigation to inform practice in conducting exploratory analyses of modern clinical datasets using clustering.

In order to understand to what extent results based on K-means or other clustering methods can be interpreted as reliable, we compare their behavior on several synthetic examples designed to capture possible complex characteristics of the data, and on the dataset itself. In addition to K-means, we examine the clusterpath [2, 3], a k-NN graph based density clustering algorithm [4], hierarchical clustering [5], and spectral clustering [6]. The clusterpath and density clustering are chosen because they partition data by taking the density of the data into consideration, which seems suitable for the asthma dataset. Since these methods are motivated by hierarchical clustering and/or spectral clustering, the latter two methods are also included for comparison.

We believe that it is useful to better understand the behavior of different types of clustering algorithms on real-life data and to equip practitioners with some prior intuition regarding the conceptual meaning of the clusters discovered by these algorithms. In particular, understanding the significant differences between clustering methods on real data can i) inform the choice of a clustering method from the large set of available ones, and ii) enable meaningful conclusions to be drawn by applying several clustering methods to the same dataset and comparing the results.

2 Methods

The objective of K-means clustering is to find a partitioning of the dataset that minimizes the total sum of squared distances between points in the same cluster. Since solving this optimization problem exactly is combinatorially difficult, common practice is to run an approximation algorithm from a large number of random restarts and keep the best solution found (we use the implementation in R [7]).
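The restart strategy described above can be sketched as follows. This is a minimal illustration in Python, not the R implementation used in the paper: it assumes Lloyd's algorithm as the approximation algorithm, initialization from randomly sampled data points, and the within-cluster sum of squared distances to the cluster mean as the cost (equivalent to the pairwise objective up to per-cluster normalization). The function names `lloyd` and `kmeans_restarts` and all parameter defaults are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, k, rng, n_iter=100):
    """One run of Lloyd's algorithm from a random initialization."""
    centers = rng.sample(points, k)  # assumption: initialize at k data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return cost, labels


def kmeans_restarts(points, k, n_restarts=20, seed=0):
    """Run Lloyd's algorithm n_restarts times and keep the lowest-cost run."""
    rng = random.Random(seed)
    runs = (lloyd(points, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[0])
```

Each run only reaches a local optimum, so the returned cost depends on the initialization; taking the minimum over restarts reduces, but does not remove, that dependence.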

Spectral clustering [6] first embeds the data points into a lower-dimensional space using eigenvectors of the Laplacian of a graph of similarities between the data points, and subsequently applies K-means to the embedded points. We use the symmetric normalized graph Laplacian [8] to compute the embedding in all our experiments.

The clusterpath [2, 3] is a convex formulation of clustering. The algorithm requires as input a (weighted) graph capturing the similarity between the data points (as in spectral clustering), in addition to the original matrix representing the points in Euclidean space. Furthermore, as a tuning parameter is varied over a range of values, the resulting clusterings often form a sequence of refinements that can be represented as a hierarchical clustering. Due to space constraints, we refer to [] for further details.

While the clusterpath can be thought of as a convex relaxation of a clustering objective similar to K-means, it is clear that, at least in some cases, the two methods can produce very different results [], in part due to the use of the graph in the clusterpath. Typically either a k-nearest-neighbor graph or a Gaussian kernel-based graph is used. We use a k-nearest-neighbor graph to limit the number of edges (with non-zero weight) in the graph, since the number of edges can significantly affect the computational cost of the algorithm.

Computing the clusterpath is non-trivial. Recently, several methods for accelerating the optimization of the clusterpath objective have been proposed [9, 10]. In this work, we implement an algorithm similar to the method described in [9], with some additional tuning for performance, including adaptive restarting of the acceleration. Despite this, computing the clusterpath is a few orders of magnitude slower than any of the other methods we use.
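Both spectral clustering and the clusterpath, as described above, start from a similarity graph over the data points. The sketch below shows one way such a graph and the symmetric normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2} could be built in Python. It rests on assumptions not stated in the paper: edges are unweighted (0/1), and the graph is symmetrized by connecting i and j if either is among the other's k nearest neighbors. The function names are hypothetical, and the eigendecomposition that produces the spectral embedding is omitted.

```python
import math


def knn_graph(points, k):
    """Symmetric 0/1 adjacency matrix of a k-nearest-neighbor graph.

    Assumption: i and j are connected if j is among the k nearest
    neighbors of i, or i is among the k nearest neighbors of j.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            W[i][j] = W[j][i] = 1.0
    return W


def normalized_laplacian(W):
    """Symmetric normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}."""
    n = len(W)
    deg = [sum(row) for row in W]  # D is the diagonal matrix of degrees
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # W has no self-loops, so the diagonal entry is 1
                # (0 for an isolated vertex, by convention).
                L[i][j] = 1.0 if deg[i] > 0 else 0.0
            elif W[i][j] != 0.0:
                L[i][j] = -W[i][j] / math.sqrt(deg[i] * deg[j])
    return L
```

In the spectral clustering procedure, the eigenvectors of L_sym associated with its smallest eigenvalues would then serve as the low-dimensional embedding passed to K-means.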
The density clustering method described in [4] uses pruned k-nearest-neighbor graphs to estimate the points lying in connected components of the level sets of the density. Each split in the resulting cluster tree represents points lying in high-density regions that are separated by a low-density region. See, for example, [11] for further discussion of density clustering methods.

In order to maximize the comparability of results, for each dataset we use the same k-nearest-neighbor graph as input to spectral clustering, density clustering, and the clusterpath. For hierarchical clustering, we use a bottom-up approach based on Ward's clustering criterion [12].

3 Results

We begin by comparing the methods described in the previous section using the projection of the asthma dataset on its top principal components [1]. The PCA projection is plotted in Figure 1a, with the 89 healthy control subjects shown in black, and the remaining 89 asthma patients with varying degrees of symptom severity shown in red. Even though this low-dimensional projection cannot capture all the structure of the original dataset, it can serve as a guide for designing further experiments to elucidate aspects of the algorithms in question that may be relevant to a clinical dataset.

Figure 1 shows the results of each of the methods described in the previous section on the PCA-projected dataset. We used a k-nearest-neighbor graph where applicable (the results were not very sensitive to the number of nearest neighbors used). The number of clusters computed using K-means was set to 6 to emulate the analysis in [1]. Our first observation is that each method except spectral clustering identifies the control subject group as a single cluster. The spectral clustering result with 6 clusters (Figure 1c) is similar to the K-means clustering, except that the control group is split and the green and cyan clusters are joined.
The latter two groups are split when the number of clusters is increased to 7 (Figure 1d); however, a portion of the control group is then placed in a cluster predominantly composed of asthma patients. This effect is surprising given that the density cluster tree separates the control group as a distinct cluster: in principle, spectral clustering should partition the data in low-density regions as well. On the other hand, it is evident from Figure 1f that density clustering alone provides very limited information about this dataset beyond separating the control group. Although the portion of the data consisting of asthma patients has some interesting structure, the only conclusion about these data points that can be reached from the density cluster tree is that they appear unimodal. Finally, we note that while both the clusterpath and hierarchical clustering approximately replicate the K-means partitioning, a visual inspection of the hierarchical clustering tree gives the impression that there are several reliably separated clusters, while the clusterpath only strongly separates the control group.
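To make the notion of low-density separation used above concrete, the following Python sketch extracts the connected components of an estimated density level set. It is a deliberately simplified illustration, not the pruned k-nearest-neighbor procedure of the cited work: it assumes that a point counts as high-density at scale r when its k-th nearest neighbor lies within distance r, and that two retained points are connected when they are within distance r of each other. The function names and parameters are hypothetical.

```python
import math


def knn_radius(points, k):
    """Distance from each point to its k-th nearest neighbor.

    A small radius indicates a locally dense region.
    """
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def level_set_components(points, k, r):
    """Connected components of the points deemed high-density at scale r."""
    radii = knn_radius(points, k)
    kept = [i for i in range(len(points)) if radii[i] <= r]

    # Union-find over the retained points.
    parent = {i: i for i in kept}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a in kept:
        for b in kept:
            if a < b and math.dist(points[a], points[b]) <= r:
                parent[find(a)] = find(b)

    groups = {}
    for i in kept:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Sweeping r from large to small traces out a tree: a single component splits only where a region of lower density separates two denser regions. If no such region exists, as for overlapping clusters, the data appear unimodal at every scale.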

Figure 1: Results on the PCA projection of the asthma dataset. Panels: (a) PCA of asthma dataset (legend: control, patient); (b) K-means, K = 6; (c) Spectral (6 clusters); (d) Spectral (7 clusters); (e) Clusterpath; (f) Density cluster tree; (g) Hierarchical clustering. Leaves in subfigures (e)-(g) are colored according to the K-means results in subfigure (b).

To further explore the behavior of these methods in the absence of well-separated clusters, we generate a two-dimensional synthetic dataset by drawing 100 samples from each of three overlapping non-spherical Gaussian components, giving the points shown in Figure 2a. We see from the density cluster tree (Figure 2d) that the mixture components are not separated by regions of detectably lower density. Despite this, K-means and spectral clustering both recover reasonable approximations of the true mixture component labels, as do the clusterpath and hierarchical clustering. It is interesting to note that the black cluster is significantly better separated than the other two according to both the clusterpath and hierarchical clustering.

Finally, we compare the results of each clustering method on the full dataset analyzed in [1]. Figure 3 shows the K-means and spectral clustering results (computed using the full, not projected, dataset) plotted on the same PCA projection as Figure 1a. The leaves of the dendrograms are colored using the labels of the K-means clustering in Figure 3a (which are identical to the K = 6 clusters discovered by K-means in the analysis of [1]). As with the low-dimensional PCA projection, spectral clustering (Figure 3b) again fails to keep the control group in a single cluster. K-means, the clusterpath, and hierarchical clustering all identify the normal control group as a separate cluster. On the other hand, density clustering now fails to find any clusters whatsoever.
It is not immediately clear why: it is possible that the control group is only distributed around a separate mode after the PCA projection, or that the dimensionality of the data is simply too high (compared to the number of samples) for this particular algorithm to detect a density cluster. In either case, the density clustering results are entirely non-informative here. The clusterpath results are not much better; beyond separating the control group and a few outliers, the clusterpath tree has no structure. The same is not true of hierarchical clustering: although the tree is noisier than its low-dimensional counterpart in Figure 1, it does have some non-trivial structure. It also seems to find some clusters similar to those found by K-means.

Figure 2: Results on a simple synthetic dataset with overlapping clusters. Panels: (a) True mixture component labels; (b) Clusterpath; (c) K-means clustering; (d) Density cluster tree; (e) Spectral clustering; (f) Hierarchical clustering. Dendrograms are colored according to the true mixture component labels.

Figure 3: Results on the full-dimensional asthma dataset. Panels: (a) K-means, K = 6; (b) Spectral (6 clusters); (c) Clusterpath; (d) Density cluster tree; (e) Hierarchical clustering. Leaves in subfigures (c)-(e) are colored according to the K-means results in subfigure (a).

4 Conclusion

To summarize, in this paper we explored the interpretability and informativeness of popular clustering methods for identifying groups of asthma patients with different severity levels based on their phenotypes. This dataset, like many other clinical datasets, is characterized by clusters of varying density, with the asthma patients corresponding to an almost unimodal distribution. Our results indicate that this characteristic renders methods such as density clustering (which has been the subject of many empirical and theoretical studies) non-informative, despite being density sensitive, because it relies on the existence of low-density separation between clusters. Among partitional methods, K-means seems to outperform spectral clustering, since the latter tends to break up the normal control group. Among hierarchical methods, the clusterpath fails to yield informative results (other than the normal control cluster) on the asthma dataset, and requires orders of magnitude more computational resources even when using a highly tuned algorithm, while agglomerative hierarchical clustering with the Ward criterion appears to generate results that are similar to those of K-means, but noisier.

Acknowledgments

We thank Dr. Sally Wenzel at the University of Pittsburgh School of Medicine for sharing the asthma data with us. This research is supported in part by NSF grant IIS-11168, NSF CAREER award IIS-11, and R01GM08769.

References

[1] Wei Wu, Eugene Bleecker, Wendy Moore, William W Busse, Mario Castro, Kian Fan Chung, William J Calhoun, Serpil Erzurum, Benjamin Gaston, Elliot Israel, et al. Unsupervised phenotyping of severe asthma research program participants using expanded lung data. Journal of Allergy and Clinical Immunology, 1(): , 01.
[2] Toby Dylan Hocking, Armand Joulin, Francis Bach, Jean-Philippe Vert, et al. Clusterpath: an algorithm for clustering using convex fusion penalties. In 8th international conference on machine learning, 2011.
[3] Fredrik Lindsten, Henrik Ohlsson, and Lennart Ljung. Clustering using sum-of-norms regularization: With application to particle filter output computation. In Statistical Signal Processing Workshop (SSP), 2011 IEEE. IEEE, 2011.
[4] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 1, 2010.
[5] Rui Xu, Donald Wunsch, et al. Survey of clustering algorithms. Neural Networks, IEEE Transactions on, 16():6 678, 00.
[6] Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, :89 86, 00.
[7] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 01.
[8] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17():9 16, 2007.
[9] E. C. Chi and K. Lange. Splitting Methods for Convex Clustering. ArXiv e-prints, April 01.
[10] G. K. Chen, E. Chi, J. Ranola, and K. Lange. Convex Clustering: An Attractive Alternative to Hierarchical Clustering. ArXiv e-prints, September 01.
[11] Pavel Berkhin. A survey of clustering data mining techniques. In Grouping multidimensional data, pages 71. Springer, 2006.
[12] F. Murtagh and P. Legendre. Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm. ArXiv e-prints, November 2011.
[13] IT Jolliffe. Principal component analysis. Springer Series in Statistics, Berlin: Springer, 1986, 1, 1986.
