Parallel and Hierarchical Mode Association Clustering with an R Package Modalclust


Surajit Ray and Yansong Cheng
Department of Mathematics and Statistics, Boston University, 111 Cummington Street, Boston, MA 02215, USA

Abstract

Modalclust is an R package which performs Hierarchical Mode Association Clustering (HMAC) along with its parallel implementation (PHMAC) over several processors. Modal clustering techniques are especially designed to efficiently extract clusters in high dimensions with arbitrary density shapes. Further, clustering is performed over several resolutions and the results are summarized as a hierarchical tree, thus providing a model-based multi-resolution cluster analysis. Finally, we implement a novel parallel version of HMAC which distributes the clustering job over several processors, thereby dramatically increasing the speed of the clustering procedure, especially for large data sets. The package also provides a number of functions for visualizing clusters in high dimensions, which can also be used with other clustering software.

Keywords: Hierarchical mode association clustering, Modal clustering, Modal EM, Nonparametric density, Parallel computing, R, Clustering of complex data, Robust analysis

1. Introduction

Cluster analysis is a ubiquitous technique in statistical analysis that has been widely used in multiple disciplines for many years. Historically, cluster analysis techniques have been approached from either a fully parametric view, e.g. mixture-model based clustering, or a distribution-free approach, e.g. linkage-based hierarchical clustering. While the parametric paradigm provides the inferential framework and accounts for sampling variability, it often lacks the flexibility to accommodate complex clusters and is often not scalable to high-dimensional data. On the other hand, the distribution-free approaches are usually fast and capable of uncovering complex clusters by making use of different distance measures. However, the inferential framework is distinctly

missing in the distribution-free clustering techniques. Accordingly, most clustering packages in R (R Development Core Team, 2010) also fall under the two above-mentioned groups of clustering techniques. This paper describes a software package for cluster analysis that combines the strengths of these two seemingly different approaches and develops a framework for parallel implementation of clustering techniques.

For most model-based approaches to clustering, the following limitations are well recognized in the literature: (i) the number of clusters has to be specified; (ii) the mixing densities have to be specified, and as estimating the parameters of the mixture models is often computationally very expensive, we are often forced to limit our choices to simple distributions such as the Gaussian; (iii) computational speed is inadequate, especially in high dimensions, and this, coupled with the complexity of the proposed model, often limits the use of model-based techniques either theoretically or computationally; (iv) it is not straightforward to extend model-based clustering to uncover heterogeneity at multiple resolutions, similar to that offered by the model-free linkage-based hierarchical clustering. Influential work towards resolving the first three issues has been carried out in Fraley and Raftery (1998), Fraley (1998), Fraley and Raftery (1999, 2002a,b), Ray and Lindsay (2005) and Ray and Lindsay (2008). Many previous approaches have focused on model selection, either for choosing the number of components or for determining the covariance structure of the density under consideration, or both. They work efficiently if the underlying distribution is chosen correctly, but none of these model-based approaches is designed to handle a completely arbitrary underlying distribution (see Figure 5 for one such example). That is, we think that limitations due to issues (iii) and (iv) above often necessitate the use of model-free techniques.

The hierarchical mode association clustering, HMAC (Li et al., 2007), which is constructed by first determining modes of the high-dimensional density and then associating sample points to those modes, is the first multivariate model-based clustering approach resolving many of the drawbacks of standard model-based clustering. Specifically, it can accommodate flexible subpopulation structures at multiple resolutions while retaining the desired natural inferential framework of parametric mixtures. Modalclust is the package implemented in R (R Development Core Team, 2010) for carrying out HMAC along with its parallel implementation (PHMAC) over several processors. Though mode-counting or mode hunting has been extensively used as a clustering technique (see

Silverman, 1981, 1983, Hartigan and Hartigan, 1985, Cheng and Hall, 1998, 1999, Hartigan, 1988, and references therein), most implementations are limited to univariate data. Generalization to higher dimensions has been limited both by the computational complexity of finding modes in higher dimensions and by the lack of any natural framework for studying the inferential properties of modes in higher dimensions. HMAC provides a computationally fast iterative algorithm for calculating the modes and thereby provides a clustering approach which is scalable to high dimensions. This article describes the R package that implements HMAC and additionally provides a wide array of visualization tools for representing clusters in high dimensions. Further, we propose a novel parallel implementation of the approach which dramatically reduces the computational time for large data sets, both in the data dimension and in the number of observations.

This paper is organized as follows: Section 2 briefly introduces the Modal Expectation Maximization (MEM) algorithm and builds the notion of mode association clustering. Section 3 describes a parallel computing framework for HMAC along with computing time comparisons. Section 4 illustrates the clustering functions in the R package Modalclust along with examples of the plotting functions especially designed for objects of class hmac. Section 5 provides the conclusion and discussion.

2. Modal EM and HMAC

The main challenge of using mode-based clustering in high dimensions is the cost of computing modes, which are mathematically evaluated as local maxima of the density function with support on $\mathbb{R}^D$, $D$ being the data dimension. Traditional techniques for finding local maxima, such as hill climbing, work well for univariate data, but multivariate hill climbing is computationally expensive, thereby limiting its use in high dimensions. Li et al. (2007) proposed an algorithm that finds a local maximum of a kernel density by ascending iterations starting from the data points. Since the algorithm is very similar to the Expectation Maximization (EM) algorithm, it is named Modal Expectation Maximization (MEM). Given a multivariate kernel $K$, let the density of the data be given by
$$f(x \mid \Sigma) = \sum_{i=1}^{n} \frac{1}{n} K(x \mid x_i; \Sigma),$$
where $\Sigma$ is the matrix of smoothing parameters. Now, given any initial value $x^{(0)}$, the MEM finds a local maximum of the mixture density by alternating the following two steps until some user-defined stopping criterion is met.

Step 1. Let
$$p_i = \frac{K(x^{(r)} \mid x_i; \Sigma)}{n f(x^{(r)})}, \qquad i = 1, \ldots, n.$$

Step 2. Update
$$x^{(r+1)} = \arg\max_x \sum_{i=1}^{n} p_i \log f_i(x),$$
where $f_i(x) = K(x \mid x_i; \Sigma)$ denotes the $i$th mixture component.

Details of the convergence of the MEM approach can be found in Li et al. (2007). The above iterative steps provide a computationally simpler approach for hill climbing from any starting point $x \in \mathbb{R}^D$ by exploiting the properties of density functions. Further, in the special case of Gaussian kernels, i.e., $K(x \mid x_i; \Sigma) = \phi(x \mid x_i, \Sigma)$, where $\phi(\cdot)$ is the pdf of a Gaussian distribution, the update of $x^{(r+1)}$ is simply
$$x^{(r+1)} = \sum_{i=1}^{n} p_i x_i,$$
allowing us to avoid the numerical optimization of Step 2.

Now we present the HMAC algorithm. First we scale the data and use a kernel density estimator with a normal kernel to estimate the density of the data. The variance of the kernel, $\Sigma$, is a diagonal matrix with all diagonal entries equal to $\sigma^2$, denoted by $D(\sigma^2)$; thus $\sigma^2$ is the single smoothing parameter for all dimensions. The choice of the smoothing parameter is an area of research in itself. In the present version of the program we incorporate the strategy of using pseudo degrees of freedom, first proposed in Lindsay et al. (2008) and then developed in Lindsay et al. (2011). Their strategy provides us with a range of smoothing parameters, and exploring them from the finest to the coarsest resolution provides the user with the desired hierarchical clustering. First we describe the steps of Mode Association Clustering (MAC) for a single bandwidth $\sigma^2$.

Algorithm for MAC for a single bandwidth

Step 1. Given a set of data $S = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, form the kernel density
$$f(x \mid S, \sigma^2) = \sum_{i=1}^{n} \frac{1}{n} \phi\big(x \mid x_i, D(\sigma^2)\big). \qquad (1)$$

Step 2. Use $f(x \mid S, \sigma^2)$ as the density function. Use each $x_i$, $i = 1, 2, \ldots, n$, as the initial value in the MEM algorithm and find a mode of $f(x \mid S, \sigma^2)$. Let the mode identified by starting from $x_i$ be $M_\sigma(x_i)$.

Step 3. Extract the distinct values from the set $\{M_\sigma(x_i), i = 1, 2, \ldots, n\}$ to form a set $G$. Label the elements of $G$ from 1 to $|G|$. In practice, due to finite precision, two modes are regarded as equal if their distance is below a threshold.

Step 4. If $M_\sigma(x_i)$ equals the $k$th element of $G$, then $x_i$ is put in the $k$th cluster.
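To make these two building blocks concrete, the following minimal R sketch implements the MEM hill climbing for a Gaussian kernel and the resulting single-bandwidth mode association. It is purely illustrative: the function names (mem_climb, mac_single), the tolerances and the mode-merging rule are ours, not the internals of Modalclust.

# Illustrative sketch only; Modalclust implements its own optimized version.
mem_climb <- function(x0, X, sigma2, tol = 1e-8, max.iter = 500) {
  x <- x0
  for (r in seq_len(max.iter)) {
    # Step 1: p_i proportional to phi(x | x_i, D(sigma2)); the Gaussian
    # normalizing constant cancels when the weights are normalized.
    d2 <- rowSums(sweep(X, 2, x)^2)
    p <- exp(-d2 / (2 * sigma2))
    p <- p / sum(p)
    # Step 2 (Gaussian case): x^(r+1) = sum_i p_i x_i
    x.new <- colSums(p * X)
    if (sum((x.new - x)^2) < tol) break
    x <- x.new
  }
  x.new
}

mac_single <- function(X, sigma2, start = X, mode.tol = 1e-3) {
  # Climb from every row of 'start' using the density built from all of X,
  # then merge modes whose coordinates agree up to mode.tol (Step 3).
  modes <- t(apply(start, 1, mem_climb, X = X, sigma2 = sigma2))
  G <- modes[!duplicated(round(modes / mode.tol)), , drop = FALSE]
  cluster <- apply(modes, 1, function(m) which.min(colSums((t(G) - m)^2)))
  list(modes = G, cluster = cluster)
}

For example, mac_single(scale(as.matrix(faithful)), sigma2 = 0.1) would climb from each (scaled) observation of R's faithful data and report the distinct modes together with the induced cluster labels.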

We note that as the bandwidth $\sigma$ increases, the kernel density estimator $f(x \mid S, \sigma^2)$ in (1) becomes smoother, and thus more points tend to climb to the same mode. This suggests a natural approach for hierarchical organization (or nesting) of our MAC clusters. Thus, given a range of bandwidths $\sigma_1 < \sigma_2 < \cdots < \sigma_L$, the clustering can be performed in the following bottom-up manner. First we perform MAC at the smallest bandwidth $\sigma_1$. At any bandwidth $\sigma_l$, the cluster representatives in $G_{l-1}$ obtained from the preceding bandwidth are fed into MAC using the density $f(x \mid S, \sigma_l^2)$. The modes identified at this level form a new set of cluster representatives $G_l$. This procedure is repeated across all $\sigma_l$'s. This preserves the hierarchy of clusters, hence the name Hierarchical Mode Association Clustering (HMAC). To summarize, we present the HMAC procedure in the following box.

Algorithm of HMAC for a sequence of bandwidths

Step 1. Start with the data $G_0 = \{x_1, \ldots, x_n\}$, set the level $l = 0$ and initialize the mode association of the $i$th data point as $P_0(x_i) = i$.

Step 2. Set $l \leftarrow l + 1$.

Step 3. Form the kernel density as in (1) using $\sigma_l^2$.

Step 4. Cluster the points in $G_{l-1}$ using the density $f(x \mid S, \sigma_l^2)$. Let the set of distinct modes obtained be $G_l$.

Step 5. If $P_{l-1}(x_i) = k$ and the $k$th element of $G_{l-1}$ is clustered to the $k'$th mode in $G_l$, then $P_l(x_i) = k'$. In other words, the cluster of $x_i$ at level $l$ is determined by its cluster representative in $G_{l-1}$.

Step 6. Stop if $l = L$; otherwise go to Step 2.
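The hierarchy described above can be mirrored in a few lines of R built on the mac_single() sketch given earlier. Again, this only illustrates Steps 1-6 under our own naming; in practice the package's functions described in Section 4 should be used.

# Illustrative HMAC loop over a sequence of bandwidths (not the package code).
hmac_sketch <- function(X, sigmas) {
  G <- X                      # G_0: the (scaled) data themselves
  P <- seq_len(nrow(X))       # P_0(x_i) = i
  levels <- vector("list", length(sigmas))
  for (l in seq_along(sigmas)) {
    # Climb from the representatives G_{l-1} with respect to f(x | S, sigma_l^2)
    fit <- mac_single(X, sigmas[l]^2, start = G)
    P <- fit$cluster[P]       # each point follows its representative in G_{l-1}
    G <- fit$modes            # the distinct modes form G_l
    levels[[l]] <- list(sigma = sigmas[l], n.cluster = nrow(G), membership = P)
  }
  levels
}

Because each level climbs only from the modes of the previous level, the number of clusters reported at successive levels is non-increasing, which is what produces the hierarchical tree summarized by the package's plotting functions.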

3. Parallel HMAC

The HMAC approach, though scalable to a higher number of dimensions, becomes computationally expensive when the number of objects $n$ becomes large. The first level of HMAC requires one to start the MEM from each observation and find its modal association with respect to the overall density, and this step alone becomes expensive for large $n$. Note that the hierarchical steps from level $l = 2$ onwards only need to start the MEM from the modes of the previous level $G_{l-1}$, and hence their computational cost does not grow at the same rate with $n$. Fortunately, mode association clustering provides a natural framework for a divide and conquer clustering algorithm. One can simply partition the data into $m$ partitions, perform modal clustering on each of those partitions, and then pool the modes obtained from each of these partitions, allowing the modes to form the top $L-1$ layers of the hierarchy. If the user has access to multiple computing cores on the same machine or several processors of a shared memory computing cluster, the divide and conquer algorithm can be seamlessly parallelized. This section provides the details of the parallel version of HMAC and its application, together with some comparisons of the performance of the parallel approach. First we describe the main framework of parallel HMAC and then we provide details regarding how to choose the partitions.

Algorithm for parallel HMAC (PHMAC) in multiple dimensions

Step 1. Let $G_0 = \{x_1, \ldots, x_n\}$. Partition the data ($n$ objects) into $m$ partitions $G_0^j$, $j = 1, 2, \ldots, m$.

Step 2. Perform HMAC on each of these partitions at the lowest resolution, i.e., using $\sigma_1$, and obtain the modes $G_1^j$, $j = 1, 2, \ldots, m$.

Step 3. Pool the modes from each partition to form $G_1 = \bigcup_{j=1}^{m} G_1^j$.

Step 4. Perform HMAC starting from its Step 2 with $l = 1$ and obtain the final hierarchical clustering.

A demonstration of the different steps of parallel clustering with four random partitions is given in Figure 1. The original data set is partitioned into four random partitions and an initial modal clustering is performed within each partition. In the next step, the modes of each of these partitions are merged to form the overall modal clusters in Figure 1(c). A minimal sketch of this divide and conquer scheme is given below.

Figure 1: Steps in the parallel HMAC procedure for a simulated data set: (a) the simulated data with four clusters along with the contour plot, where the color indicates the final clustering using phmac; (b) the four random partitions of the unlabeled data along with the modes (red asterisks) of each partition; (c) the modes obtained from the four partitions; (d) the final modes (green triangles) obtained by starting from the modes of the partitioned data.
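The following hedged R sketch mirrors the scheme using the parallel package and the mac_single() helper introduced in Section 2. The random partitioning and helper names are our own illustration, and the package's phmac() should be preferred in practice; note also that mclapply relies on forking, so on Windows a cluster-based variant such as parLapply would be needed.

# Illustrative divide and conquer PHMAC, not the package implementation.
library(parallel)

phmac_sketch <- function(X, sigmas, npart = 4, mc.cores = npart) {
  part <- sample(rep_len(seq_len(npart), nrow(X)))   # random partition labels
  # Steps 1-2: modal clustering within each partition at the smallest bandwidth
  part.modes <- mclapply(seq_len(npart), function(j)
    mac_single(X[part == j, , drop = FALSE], sigmas[1]^2)$modes,
    mc.cores = mc.cores)
  # Step 3: pool the partition-level modes to form G_1
  G <- do.call(rbind, part.modes)
  # Step 4: continue climbing from the pooled modes, now with respect to the
  # density of the full data, over the remaining (coarser) bandwidths
  out <- vector("list", length(sigmas) - 1)
  for (l in seq_along(out)) {
    fit <- mac_single(X, sigmas[l + 1]^2, start = G)
    G <- fit$modes
    out[[l]] <- list(sigma = sigmas[l + 1], n.cluster = nrow(G))
  }
  out
}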

3.1. Discussion on parallel clustering

In general, model-based clustering is hard to parallelize. Modes, however, have a natural hierarchy, and it is computationally easy to merge modes from different partitions. One still needs to decide how best to perform the partitioning and how many partitions to use. In this paper we only provide some guidelines regarding these two choices, without exploring their optimality in detail. In the absence of any other knowledge one should randomly partition the data. Other choices include partitioning the data based on certain coordinates which form a natural clustering, and then taking products of a few of those coordinates to build the overall partition. This strategy might increase the computational speed by restricting the modes within a relatively homogeneous set of observations. Another choice might be to sample the data and build partitions based on the modes of the sampled data.

The number of partitions one should use to minimize the computational time is a complex function of the number of available processors, the number of observations to be clustered and the choice of smoothing parameter. Using more partitions might speed up the first step, but it runs the risk of having to merge more modes at the next level, where the hill climbing is done from the collection of modes from all partitions with respect to the overall density. On the other hand, for large $n$, choosing too few partitions, or none at all, incurs a huge computational cost at the lowest level. Finally, the choice of smoothing parameter will also determine how many modes one needs to start from at the merged level.

Next we compare the computing speed of parallel versus serial clustering using 1, 2, 4, 8 and 12 processor cores. Tests were performed on a 64-bit machine with four quad-core AMD 8384 processors (2.7 GHz per core) and 16 GB RAM, running Linux CentOS 5 and R version 2.11.0. From Table 1 it is clear that parallel computing significantly increases the computing speed. Because the nonparametric density estimate in (1) is a sum of kernels centered at every data point, the amount of computation needed to identify the mode associated with a single point grows linearly with $n$, the size of the data set; the computational complexity of clustering all the data by HMAC is thus quadratic in $n$. Suppose we have $p$ processors. The computational complexity of the initial level is then of order $n^2$ for serial HMAC and of order $(n/p)^2$ for parallel HMAC. But, as discussed before, the computational speed is not a monotonically decreasing function of the number of processors. In theory, more processors reduce the computational cost of the initial step; in practice, if the data set is not sufficiently large, using more processors may not save time, as it may produce a large number of modes at the pooled stage. For n=10,000 or n=50,000, using more processors provides a dramatic decrease in computing time, whereas for n=2,000 there is a clear dip in elapsed time when using 4 or 8 processors instead of the maximum 12. For n=50,000 the decrease in computing time from one processor to 12 processors is more than 40-fold (see Figure 2), and even a user with just two processors reduces the computing time to one third of what a single processor would take. Even for n=10,000 the advantage of using 12 processors is almost 30-fold, whereas for n=2,000 the advantage is only about 8-fold. In fact, for n=2,000 the lowest time is actually clocked by 8 processors, though using all 12 processors does not increase the time very much. These comparisons show the potential of parallelizing the modal clustering algorithm and its usefulness for clustering high-throughput data.

Table 1: Comparison of computing time (elapsed time in seconds) using different numbers of processors

                          Number of processors
Data dimensions          1           2           4           8          12
n=2,000,  d=2        56.58       17.01        7.84        6.91        8.02
n=2,000,  d=20      323.16      128.13      112.42      190.11      250.22
n=2,000,  d=40      730.18      560.16      687.79      764.29      753.36
n=10,000, d=2      3849.83      871.33      276.88      145.61      131.22
n=10,000, d=20     8410.96     1694.82      585.33      536.32      459.88
n=50,000, d=2    210295.29    71152.82    23383.61    11959.24     4875.64

Figure 2: Comparison of the fold increase in time of clustering two-dimensional data of different sample sizes (n=2,000, n=10,000, n=50,000) with respect to using 12 processors.
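The comparison in Table 1 can be reproduced, at a smaller scale, with a few lines of R. The snippet below is only a sketch: it uses simulated data and assumes that phmac() (described in Section 4) accepts a numeric matrix together with the npart and parallel arguments shown there.

# Hedged timing sketch on simulated data; sizes are illustrative only.
library(Modalclust)

set.seed(42)
X <- matrix(rnorm(2000 * 2), ncol = 2)                  # n = 2,000, d = 2
t.serial   <- system.time(phmac(X, npart = 1, parallel = FALSE))
t.parallel <- system.time(phmac(X, npart = 4, parallel = TRUE))
rbind(serial = t.serial, parallel = t.parallel)[, "elapsed"]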

4. Example of using the R package Modalclust

In this section we demonstrate the functions and plotting tools available in the Modalclust package.

4.1. Modal Clustering

First we provide an example of performing modal clustering to extract the subpopulations in the logcta20 data set. The description of the data set is given in the package, and the scatter plot along with its smooth density is provided in Figure 3. First we use the following commands to download and install the package:

> install.packages("Modalclust")
> library(Modalclust)

Using the following commands, we obtain the standard (serial) HMAC and the parallel HMAC using two processors for the logcta20 data.

> logcta20.hmac <- phmac(logcta20, npart = 1, parallel = FALSE)
> logcta20p2.hmac <- phmac(logcta20, npart = 2, parallel = TRUE)

Figure 3: Scatter plot of logcta20 data.

Figure 4: HMAC implied clusters of logcta20 data at level 4 with 3 clusters; colors represent different clusters.

Both implementations result in the final clustering given in Figure 4, which clearly identifies the three distinct subpopulations that have been biologically validated. Other model-based clustering methods in R, such as Mclust and kmeans, could not capture the subpopulation structure, as the individual densities are non-normal in shape. Distance-based clustering methods, e.g. hclust with a range of linkage functions, performed even more poorly.

By default the function selects an interesting range of smoothing parameters with ten $\sigma^2$ values, and the final clustering only shows the results from the levels which produced a merging relative to the previous level. For example, for logcta20 the smoothing parameters chosen automatically are

> logcta20.hmac$sigma
[1] 0.26 0.29 0.31 0.34 0.38 0.43 0.49 0.58 0.72 0.94

which are chosen using the pseudo degrees of freedom criterion mentioned in Section 2. Though we started with 10 different smoothing levels, the final clustering shows only 6 distinct levels, along with a decreasing number of hierarchical clusters:

> logcta20.hmac$level
[1] 1 2 3 3 3 4 4 4 5 6
> logcta20.hmac$n.cluster
[1] 11 7 5 5 5 3 3 3 2 1

The user can also supply smoothing levels using the option sigmaselect in phmac, as illustrated below. There is also the option of starting phmac from user-defined modes instead of the original data points; this option comes in handy if the user wishes to merge clusters obtained from other model-based clustering methods, e.g. Mclust or kmeans.
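For instance, a call of the following form supplies user-chosen smoothing levels through sigmaselect; the particular values here are arbitrary and purely for illustration.

> logcta20.user <- phmac(logcta20, npart = 1, parallel = FALSE,
+                        sigmaselect = c(0.3, 0.5, 0.8, 1.2))
> logcta20.user$n.cluster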

4.2. Some Examples of Plotting

There are several plotting functions in Modalclust which can be used to visualize the output of the function phmac. The plotting functions are defined for objects of class hmac, which is the default class of a phmac output. These plot functions will be illustrated through a data set named disc2d, composed of 400 observations representing the shape of two half discs. The scatter plot of disc2d along with its contours is given in Figure 5.

Figure 5: The scatter plot of disc2d data along with its probability contours.

Figure 6: Hierarchical tree (dendrogram) of the disc2d data showing the clustering at four levels of smoothing.

First we discuss the standard plot function for an object of class hmac. This unique and extremely informative plot shows the hierarchical tree obtained from modal clustering. It is available for any number of dimensions and can be obtained by

> data(disc2d.hmac)
> plot(disc2d.hmac)

The dendrogram obtained for the disc2d data is given in Figure 6. The y-axis gives the different levels and the tree displays the merging at different levels. There are several options available for drawing the tree, including starting the tree from a specific level, drawing the tree only up to a desired number of clusters, and comparing the clustering results with user-defined clusters. Output from this tree is used by other plotting functions available in this package.

Figure 7: Hard clustering for disc2d data at level 3; colors represent different clusters.

Figure 8: Soft clustering for disc2d data at level 2; colors represent different clusters and gray points mark boundary points.

The rest of the plotting functions are mainly designed for visualizing clustering results for two-dimensional data, although one can provide multivariate extensions of the functions by considering all possible pairwise dimensions. One can obtain the hard clustering of the data for each level using the command

> hard.hmac(disc2d.hmac)

Alternatively, the user can specify the smoothing level or the desired number of clusters and obtain the corresponding hard clustering of the data. For example, the plot in Figure 7 can be obtained by either of the two commands

> hard.hmac(disc2d.hmac, n.cluster = 2)
> hard.hmac(disc2d.hmac, level = 3)

Another function allows the user to visualize the soft clustering of the data, based on the posterior probability of each observation belonging to the clusters at a specified level. For example, the plot in Figure 8 can be obtained using

> soft.hmac(disc2d.hmac, n.cluster = 3)

The plot enables us to visualize the probabilistic clustering of the three-cluster model. The user can specify a probability threshold for separating observations which clearly belong to a cluster from those which lie on the boundary of more than one cluster. Points having posterior probability below the user-specified boundlevel (default value 0.4) are assigned as boundary points and colored in gray. In Figure 8 there are five boundary points among the 400 original observations that are clustered.

Additionally, at any specified level or cluster size, the plot=FALSE option in hard.hmac returns the cluster membership. Similarly, the plot=FALSE option in soft.hmac returns a list containing the posterior probabilities of the observations and the boundary points.

> disc2d.2clust <- hard.hmac(disc2d.hmac, n.cluster = 2, plot = FALSE)
> disc2d.2clust.soft <- soft.hmac(disc2d.hmac, n.cluster = 2, plot = FALSE)

Finally, we demonstrate another very useful function for choosing a cluster dynamically from a two-dimensional plot. The function choose.cluster allows the user to click on any part of a two-dimensional plot and dynamically select the cluster that point would belong to. One can start the display by invoking the command

> choose.cluster(disc2d.hmac, n.cluster = 2)

which will open a graphical window with the scatter plot as displayed in the left panel of Figure 9. After the user clicks a point anywhere near the upper disc, the points in the cluster consisting of the upper disc will light up, as in the right panel of Figure 9.

Figure 9: Graphical display for choosing the cluster using the function choose.cluster for the disc2d data at level 3 with 2 clusters. The left panel displays the plot before the click and the right panel highlights the points after the user clicks at the arrow head.

If the user clicks any existing data point, all other points belonging to the same cluster will light up. One can stop the program by clicking anywhere outside the plot area. This is an extremely useful function and can be used to merge clusters based on expert opinion or to design semi-supervised clustering.

5. Discussion

Modalclust performs a hierarchical model-based clustering allowing for arbitrary density shapes. Parallel computing can dramatically increase the computing speed by splitting the data and running HMAC simultaneously on multiple processor cores, and the plotting functions give the user a comprehensive visualization and understanding of the clustering result. One direction of future work is to further increase the computing speed for large data sets. As Section 3 shows, parallel computing helps considerably, but it relies on the available computing equipment; a user with no, or only a few, multicore processors will still need substantial computing resources to cluster large data sets. One potential way to address this is to use k-means or another fast clustering technique initially, and then run HMAC from the centers of the initial clusters. For example, given a data set with 20,000 observations, one can first run k-means with a chosen number of centers, say 200, and then cluster those 200 centers by HMAC. Theoretically this is sub-optimal compared with running HMAC on all points; in practice it greatly reduces the computing cost while still obtaining the right clustering. A sketch of this two-stage strategy is given below.
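The following hedged sketch illustrates the two-stage strategy just described; it assumes that phmac accepts a numeric matrix of points (here the k-means centers), and the simulated data and number of centers are purely illustrative.

> set.seed(1)
> big.data <- matrix(rnorm(20000 * 2), ncol = 2)   # stand-in for 20,000 observations
> centers <- kmeans(big.data, centers = 200, nstart = 5)$centers
> centers.hmac <- phmac(centers, npart = 1, parallel = FALSE)
> plot(centers.hmac)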

In addition, we are currently working on an implementation of modal clustering for online or streaming data, where the goal would be to update an existing clustering with the new data, without storing all the original data points, while allowing for the creation of new clusters and the merging of existing clusters.

Sources, binaries and documentation of Modalclust are available for download from the Comprehensive R Archive Network, http://cran.r-project.org/, under the GNU Public License.

References

Cheng, M.-Y., Hall, P., 1998. Calibrating the excess mass and dip tests of modality. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 (3), 579-589.

Cheng, M.-Y., Hall, P., 1999. Mode testing in difficult cases. Ann. Statist. 27 (4), 1294-1315.

Fraley, C., 1998. Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal on Scientific Computing 20, 270-281.

Fraley, C., Raftery, A. E., 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. Computer Journal 41, 578-588.

Fraley, C., Raftery, A. E., 1999. MCLUST: Software for model-based cluster analysis. Journal of Classification 16, 297-306.

Fraley, C., Raftery, A. E., October 2002a. MCLUST: Software for model-based clustering, density estimation, and discriminant analysis. Technical Report 415, University of Washington, Department of Statistics.

Fraley, C., Raftery, A. E., 2002b. Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association 97, 611-631.

Hartigan, J. A., 1988. The span test for unimodality. In: Bock, H. H. (Ed.), Classification and Related Methods of Data Analysis. Elsevier/North-Holland [Elsevier Science Publishing Co., New York; North-Holland Publishing Co., Amsterdam], pp. 229-236.

Hartigan, J. A., Hartigan, P. M., 1985. The dip test of unimodality. Ann. Statist. 13 (1), 70-84.

Li, J., Ray, S., Lindsay, B., 2007. A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research 8, 1687-1723.

Lindsay, B. G., Markatou, M., Ray, S., 2011. Spectral degrees of freedom and high-dimensional smoothing. Tech. rep., under preparation.

Lindsay, B. G., Markatou, M., Ray, S., Yang, K., Chen, S.-C., 2008. Quadratic distances on probabilities: a unified foundation. Ann. Statist. 36 (2), 983-1006.

R Development Core Team, 2010. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. URL http://www.r-project.org

Ray, S., Lindsay, B., 2005. The topography of multivariate normal mixtures. Ann. Statist. 33 (5), 2042-2065.

Ray, S., Lindsay, B. G., 2008. Model selection in high dimensions: a quadratic-risk-based approach. J. Roy. Statist. Soc. Ser. B 70 (1), 95-118.

Silverman, B. W., 1981. Using kernel density estimates to investigate multimodality. J. Roy. Statist. Soc. Ser. B 43 (1), 97-99.

Silverman, B. W., 1983. Some properties of a test for multimodality based on kernel density estimates. In: Probability, Statistics and Analysis. Vol. 79 of London Math. Soc. Lecture Note Ser. Cambridge Univ. Press, Cambridge, pp. 248-259.