Hierarchical 3-D von Mises-Fisher Mixture Model


Md. Abul Hasnat, Université Jean Monnet, Saint-Étienne, France. Olivier Alata, Université Jean Monnet, Saint-Étienne, France. Alain Trémeau, Université Jean Monnet, Saint-Étienne, France.

Abstract

In this paper, we propose a complete method for clustering data which are in the form of unit vectors. The solution consists of a distribution-based clustering algorithm with the assumption of a generative model. In the model, the data are generated from a finite statistical mixture model based on the von Mises-Fisher (vmf) distribution. Initially, the Bregman soft clustering algorithm is applied to obtain the parameters of the vmf mixture model (vmf-mm) for a certain maximum number of components. Then, a hierarchy of mixture models is generated from these parameters. The hierarchy is built by using the Bregman divergence both to compute the dissimilarity among distributions and to fuse/merge the centroids of the clusters. After constructing the hierarchy, the Kullback-Leibler divergence (KLD) is used to compute the distance between statistical mixture models with different numbers of components. Finally, a threshold on the KLD value is used to select the number of components of the mixture model. The proposed method is called the Hierarchical 3-D von Mises-Fisher mixture model. We validated the method by applying it on simulated data. Additionally, we applied the proposed method to cluster image normals, which are computed from depth images. As an outcome of the clustering, we obtained a bottom-up segmentation of the depth image. The obtained results confirm our assumption that the proposed method can be a potential tool to analyze depth images.

Proceedings of the 1st Workshop on Divergences and Divergence Learning (at ICML 2013), Atlanta, Georgia, USA. Copyright 2013 by the author(s).

1. Introduction

Data/features in the form of a unit vector exhibit directional behavior. For this type of feature, directional distributions (Mardia & Jupp 2000) are the standard choice to construct a statistical mixture model (Murphy 2012). Such data frequently appear in a variety of domains, e.g. to analyze images, speech signals, text documents, gene expressions (Banerjee et al. 2005) or treatment beams (Bangert et al. 2010). The sample space for directional distributions is the circle, the sphere or a hypersphere. The most prominent distributions in directional statistics (von Mises-Fisher, Kent, Watson, Bingham, etc.) belong to the exponential family of distributions (Mardia & Jupp 2000). Directional distributions are associated with complicated normalizing constants. For this reason, an analytical solution for the maximum likelihood estimate (MLE) of the parameters, even for a single distribution, is difficult to obtain (Sra 2012). The minimal set of parameters of a directional distribution is the mean and the concentration. A satisfactory approximation is available (Mardia & Jupp 2000) for lower dimensional data. However, for higher dimensional data, estimation of the concentration parameter is non-trivial since it involves functional inversion of ratios of special functions such as Bessel functions (Banerjee et al. 2005). The fundamental directional distribution is the von Mises-Fisher (vmf) distribution, which is also called the Fisher distribution for d = 3 (Mardia & Jupp 2000). It models data concentrated around a mean direction. A heuristic approximation of the parameters for higher dimensional data distributed according to a mixture of vmf distributions is given by (Banerjee et al. 2005).
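For d = 3 this heuristic is particularly simple. The sketch below (an illustrative aside, not code from the paper) estimates the mean direction and, using the approximation of Banerjee et al. (2005), the concentration of a single vmf component from a sample of unit vectors; the function name and the use of NumPy are our own choices.

```python
import numpy as np

def estimate_vmf_parameters(X, d=3):
    """Approximate MLE for a single vMF component (heuristic of Banerjee et al. 2005).

    X : (N, d) array of unit vectors.
    Returns the mean direction mu and an approximate concentration kappa.
    """
    r = X.mean(axis=0)                  # resultant vector
    r_bar = np.linalg.norm(r)           # mean resultant length, in (0, 1)
    mu = r / r_bar                      # MLE of the mean direction
    # Approximation that avoids inverting ratios of Bessel functions
    kappa = r_bar * (d - r_bar ** 2) / (1.0 - r_bar ** 2)
    return mu, kappa
```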
In the data clustering problem, the statistical mixture model is one of the most prominent and widely used tools. It consists of a base probability distribution for the observed data and a prior probability for the clusters that generate the data samples (Murphy 2012). Therefore, a mixture model is a powerful mechanism to explain the appearance of data through a generative process.

Expectation Maximization (EM) is the most common algorithm to learn the parameters of a mixture model. The standard EM approach maximizes the log likelihood of the data (Murphy 2012), while respecting constraints in the optimization goal. However, the maximization step (M-step) of the EM algorithm is often computationally expensive. Banerjee et al. (2005) proposed the Bregman soft clustering algorithm for MLE of the parameters. The algorithm has the following attractive features: (a) it is equivalent to EM for mixtures of exponential families; (b) it simplifies the computationally expensive M-step; (c) it is applicable to mixed data types; and (d) its computational complexity is linear in the number of data points.

Bregman soft clustering is a centroid-based parametric clustering approach (like k-means) which arises from a particular choice of Bregman divergence (Banerjee et al. 2005). Bregman divergences include a large number of distortion functions commonly used in data clustering problems. Due to the bijection between Bregman divergences and the exponential family, the maximization step for the density parameters in the EM algorithm reduces to a simple weighted averaging step (equivalent to updating the centroid of the cluster). Moreover, the Bregman divergence can be applied to compute the relative entropy (KLD) between statistical distributions that belong to the exponential family. Therefore, in a hierarchical mixture model (Garcia & Nielsen 2010) the Bregman divergence can be effectively used to compute the dissimilarity matrix. These benefits of using the Bregman divergence provide a strong motivation to design a clustering model that estimates the parameters as well as the optimal number of components.
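To make the role of the Bregman divergence concrete, the short sketch below (an illustrative aside, not part of the original paper) evaluates the generic definition $B_F(x, y) = F(x) - F(y) - \langle x - y, \nabla F(y) \rangle$ and recovers two familiar distortion functions as special cases of the convex generator F.

```python
import numpy as np

def bregman_divergence(F, grad_F, x, y):
    """Generic Bregman divergence B_F(x, y) = F(x) - F(y) - <x - y, grad F(y)>."""
    return F(x) - F(y) - np.dot(x - y, grad_F(y))

# F(x) = ||x||^2 yields the squared Euclidean distance.
sq_norm = lambda x: np.dot(x, x)
grad_sq_norm = lambda x: 2.0 * x

# F(p) = sum_i p_i log p_i (negative Shannon entropy) yields the KL divergence
# between probability vectors p and q.
neg_entropy = lambda p: np.sum(p * np.log(p))
grad_neg_entropy = lambda p: np.log(p) + 1.0

x, y = np.array([1.0, 2.0]), np.array([0.5, 1.5])
print(bregman_divergence(sq_norm, grad_sq_norm, x, y))          # equals ||x - y||^2
p, q = np.array([0.7, 0.3]), np.array([0.4, 0.6])
print(bregman_divergence(neg_entropy, grad_neg_entropy, p, q))  # equals KL(p || q)
```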
A statistical distribution can take advantage of the Bregman divergence if its canonical exponential family representation is available. While such a representation exists for several distributions (Nielsen & Garcia 2009), the vmf distribution is yet to have one. Finite mixtures of vmf distributions were introduced by (Banerjee et al. 2005), who proposed an EM algorithm for MLE of the parameters. However, they did not address the issue of component selection. A nonparametric Bayesian framework based on an infinite vmf mixture model was proposed by (Bangert et al. 2010), where a strategy to obtain the number of components of the model is discussed. However, that model is nondeterministic and computationally expensive. A nonlinear least-squares technique to compute a vmf mixture model was proposed by (McGraw et al. 2006), which is different from the family of methods we consider in this research.

In this paper, we propose a hierarchical von Mises-Fisher mixture model (H-vMF-MM) on the 2-D sphere (i.e., for unit vectors in 3-D Euclidean space). The model addresses several prominent issues: parameter estimation, selection of the number of components, and computational efficiency. In order to adapt our clustering problem to Bregman soft clustering, we derive the canonical exponential family representation of the vmf distribution. We compute the Bregman divergence by exploiting the expectation parameters and the Legendre dual of the log normalizing function (Banerjee et al. 2005). The clustering approach begins with the Bregman soft clustering algorithm considering a maximum number of components, called K_max, in the mixture. A hierarchical clustering is then performed on the components of the vmf-mm. This provides a hierarchical structure as well as the parameters of simplified mixture models (Garcia & Nielsen 2010) with K_max down to 1 components. Finally, we apply a KLD-based threshold among the different mixture models in order to obtain the desired number of components. Experiments on simulated data exhibit satisfactory clustering performance of the proposed approach. We applied the proposed method to analyze depth images (captured by a Microsoft Kinect camera) with the aim of obtaining a bottom-up image segmentation. For this purpose, we consider the image normal (a 3-dimensional unit vector describing the surface property at each pixel) as the feature vector. We conducted experiments on the NYU depth dataset (Silberman et al. 2012). Experimental results show that the proposed method can be considered a very useful tool for analyzing depth images captured by range sensing devices.

Our key contributions in this paper are: (a) we provide a mathematical formulation to compute the Bregman divergence among vmf distributions (for d = 3); (b) we exploit this divergence to design a computationally efficient hierarchical clustering scheme. Overall, we propose a complete clustering algorithm. We validate the performance of the proposed algorithm by applying it on simulated data as well as on real image features.

The remainder of this paper is structured as follows: Section 2 describes the complete clustering model. Experimental results followed by a discussion are reported in Section 3. Finally, Section 4 draws conclusions and outlines possible future extensions of the model.

2. Hierarchical 3-D von Mises-Fisher (vmf) Mixture Model

2.1 vmf distributions and their canonical form (for d = 3)

EXPONENTIAL FAMILY OF DISTRIBUTIONS

A multivariate probability density function belongs to the exponential family if it has the following form (Murphy 2012; Banerjee et al. 2005; Garcia & Nielsen 2010):

$f(x; \theta) = \exp\big( \langle t(x), \theta \rangle - F(\theta) + k(x) \big)$    (1)

Here, $t(x)$ denotes the sufficient statistics, $\theta$ denotes the natural parameters, $F(\theta)$ is the log partition function, $k(x)$ is the carrier measure and $\langle \cdot, \cdot \rangle$ is the inner product.

The expectation of the sufficient statistics w.r.t. the density function (Eq. (1)) is called the expectation parameter (Banerjee et al. 2005), which is computed as $\eta = E_{f}[t(x)]$. There exists a one-to-one correspondence between the expectation and natural parameters, which allows them to span spaces that exhibit a dual relationship (Banerjee et al. 2005). The relationship between these two forms of parameters can be expressed as:

$\eta = \nabla F(\theta)$    (2)

Here, $\nabla F$ is the gradient of $F$.

VON MISES-FISHER DISTRIBUTIONS FOR 3 DIMENSIONS

For a d-dimensional random unit vector $x$ (i.e. $x \in \mathbb{R}^d$ and $\|x\| = 1$), the vmf distribution on the (d-1)-dimensional sphere $S^{d-1}$ is defined as (Mardia & Jupp 2000):

$f(x; \mu, \kappa) = c_d(\kappa) \exp(\kappa \mu^{T} x)$    (3)

Here, $\mu$ denotes the mean direction (with $\|\mu\| = 1$) and $\kappa$ denotes the concentration parameter (with $\kappa \geq 0$). The normalization constant is equal to:

$c_d(\kappa) = \dfrac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)}$

Here $I_r$ represents the modified Bessel function of the first kind and order r. For d = 3, the normalizing factor of the distribution is (Mardia & Jupp 2000):

$c_3(\kappa) = \dfrac{\kappa}{2\pi (e^{\kappa} - e^{-\kappa})}$

Therefore we can rewrite Eq. (3) as:

$f(x; \mu, \kappa) = \dfrac{\kappa}{2\pi (e^{\kappa} - e^{-\kappa})} \exp(\kappa \mu^{T} x)$, for d = 3    (4)

2.2 Bregman divergence (BD) for vmf (d = 3) distributions

BD FOR EXPONENTIAL FAMILY OF DISTRIBUTIONS

For a strictly convex function F (with natural parameters $\theta_1$ and $\theta_2$), the BD can be formally defined as (Banerjee et al. 2005):

$B_F(\theta_1, \theta_2) = F(\theta_1) - F(\theta_2) - \langle \theta_1 - \theta_2, \nabla F(\theta_2) \rangle$    (5)

Here, $F$ is the log normalizing function of the natural parameter and $\nabla F$ is its gradient. The one-to-one correspondence (Eq. (2)) between the natural and expectation parameters provides an equivalent form of the BD (an alternative to Eq. (5)) as:

$B_F(\theta_1, \theta_2) = B_{F^*}(\eta_2, \eta_1) = F^*(\eta_2) - F^*(\eta_1) - \langle \eta_2 - \eta_1, \nabla F^*(\eta_1) \rangle$    (6)

Here, $F^*$ is the Legendre dual of the log normalizing function (Banerjee et al. 2005) and $\nabla F^*$ is its gradient. $F^*$ can be computed by exploiting its relationship with the log normalization function (Banerjee et al. 2005), which is expressed as:

$F^*(\eta) = \langle \theta, \eta \rangle - F(\theta)$, with $\theta = \nabla F^*(\eta) = (\nabla F)^{-1}(\eta)$    (7)

Due to the bijection¹ between BDs and exponential families, Eqs. (5) and (6) can be used to compute the distance between exponential family distributions.

¹ The bijection is expressed as $f(x; \theta) = \exp\big(-B_{F^*}(t(x), \eta)\big)\, b_{F^*}(x)$, where $b_{F^*}$ is a uniquely determined function. For more details, please see Theorem 3 of (Banerjee et al. 2005).

BD AMONG VMF (3 DIMENSIONS) DISTRIBUTIONS

Considering the canonical representation of the exponential family (Eq. (1)), the vmf for d = 3 (Eq. (4)) can be decomposed as: sufficient statistics $t(x) = x$, natural parameter $\theta = \kappa \mu$, log normalizing function $F(\theta) = \log\big( 2\pi (e^{\|\theta\|} - e^{-\|\theta\|}) / \|\theta\| \big)$ and carrier measure $k(x) = 0$. The mean and concentration parameters can be written in terms of the natural parameter as:

$\kappa = \|\theta\|, \quad \mu = \theta / \|\theta\|$    (8)

The gradient of the log normalizing function can be written as:

$\nabla F(\theta) = \left( \dfrac{e^{\|\theta\|} + e^{-\|\theta\|}}{e^{\|\theta\|} - e^{-\|\theta\|}} - \dfrac{1}{\|\theta\|} \right) \dfrac{\theta}{\|\theta\|}$    (9)

Considering Eq. (2) we can write $\eta = \nabla F(\theta)$ and, from Eq. (7), $\theta = \nabla F^*(\eta)$. Since $\theta$ and $\eta$ are collinear, we can now write:

$F^*(\eta) = \langle \theta, \eta \rangle - F(\theta) = \|\theta\| \|\eta\| - \log\big( 2\pi (e^{\|\theta\|} - e^{-\|\theta\|}) / \|\theta\| \big)$    (10)

From Eq. (9), using the property of collinear vectors, we can also write:

$\|\eta\| = \dfrac{e^{\|\theta\|} + e^{-\|\theta\|}}{e^{\|\theta\|} - e^{-\|\theta\|}} - \dfrac{1}{\|\theta\|}$    (11)

We can apply the Newton-Raphson method to compute $\|\theta\|$ from $\|\eta\|$ using the following iterative update equation:

$\|\theta\|_{t+1} = \|\theta\|_{t} - \dfrac{A(\|\theta\|_{t}) - \|\eta\|}{A'(\|\theta\|_{t})}$    (12)

where $A(\kappa) = \coth(\kappa) - 1/\kappa$ denotes the right-hand side of Eq. (11) and $A'(\kappa) = 1/\kappa^{2} - 1/\sinh^{2}(\kappa)$ is its derivative.
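The sketch below implements these conversions numerically for d = 3, under the decomposition derived above ($\theta = \kappa\mu$, $t(x) = x$, and $F(\theta) = \log(4\pi \sinh\|\theta\| / \|\theta\|)$, which is the same quantity as in Eq. (10)); the function names, the Banerjee-style starting point for the Newton-Raphson iteration and the fixed iteration count are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def F_natural(theta):
    """Log normalizer F(theta) of the vMF (d = 3), theta = kappa * mu.
    F(theta) = log(2*pi*(exp(k) - exp(-k)) / k) = log(4*pi*sinh(k)/k), with k = ||theta||."""
    k = np.linalg.norm(theta)
    # written with logs to stay finite for large kappa
    return np.log(4.0 * np.pi) + k + np.log1p(-np.exp(-2.0 * k)) - np.log(2.0 * k)

def grad_F(theta):
    """Expectation parameter eta = grad F(theta), Eq. (9)."""
    k = np.linalg.norm(theta)
    return (1.0 / np.tanh(k) - 1.0 / k) * theta / k

def theta_from_eta(eta, n_iter=20):
    """Invert eta -> theta by Newton-Raphson on Eqs. (11)-(12): A(k) = coth(k) - 1/k = ||eta||."""
    r = np.linalg.norm(eta)
    k = r * (3.0 - r ** 2) / (1.0 - r ** 2)      # Banerjee et al. (2005) starting point
    for _ in range(n_iter):
        A = 1.0 / np.tanh(k) - 1.0 / k
        A_prime = 1.0 / k ** 2 - 1.0 / np.sinh(k) ** 2
        k -= (A - r) / A_prime
    return k * eta / r

def F_dual(eta):
    """Legendre dual F*(eta) = <theta(eta), eta> - F(theta(eta)), Eqs. (7) and (10)."""
    theta = theta_from_eta(eta)
    return float(np.dot(theta, eta)) - F_natural(theta)

def bregman_dual(eta1, eta2):
    """Expectation-parameter form of the divergence, B_{F*}(eta1, eta2), cf. Eq. (6)."""
    theta2 = theta_from_eta(eta2)
    return F_dual(eta1) - F_dual(eta2) - float(np.dot(eta1 - eta2, theta2))
```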

Considering (Nielsen & Garcia 2009), we can use Eqs. (6), (10), (11) and (12) to compute the BD between clusters. However, this computation is not appropriate for computing the divergence between a sample and a cluster. The reason is that, for a vmf sample (a unit vector), the concentration computed from the corresponding expectation parameter is very high and the dual log normalizer $F^*$ eventually diverges. This problem has already been addressed for the univariate Gaussian, and a corresponding solution was proposed by (Garcia & Nielsen 2010; Nielsen & Garcia 2009). Adopting that solution in our case, we can write the simplified BD (a simplification of Eq. (6)) as:

$\tilde{B}_{F^*}(x, \eta_j) = -\big[ F^*(\eta_j) + \langle x - \eta_j, \theta_j \rangle \big]$    (13)

2.3 Clustering

A generative model (Murphy 2012) is assumed for the appearance of the directional data clusters. The model consists of a mixture of vmf distributions (Banerjee et al. 2005):

$f(x_i \mid \Theta) = \sum_{j=1}^{K} \pi_j\, f_j(x_i \mid \theta_j)$    (14)

Here, $x_i$ denotes a single sample, $\Theta = \{\pi_j, \theta_j\}_{j=1}^{K}$ is the set of component parameters, $\pi_j$ is the mixing proportion and $f_j$ is the vmf distribution of a particular component.

BREGMAN SOFT CLUSTERING WITH FIXED K

The solution of this clustering problem (Eq. (14)) is to compute the MLE of the parameters of each component (vmf distribution) (Banerjee et al. 2005). Therefore, the goal is to obtain $\Theta^*$ such that:

$\Theta^* = \arg\max_{\Theta} \sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_j\, f_j(x_i \mid \theta_j)$, with $\sum_{j=1}^{K} \pi_j = 1$ and $\pi_j \geq 0$    (15)

Here, $X = \{x_1, \ldots, x_N\}$ and N denotes the total number of data samples. Bregman soft clustering exploits the BD in the EM framework (Murphy 2012) to compute the MLE of the model (mixture of exponential family distributions) parameters and provides a soft clustering of the dataset (Banerjee et al. 2005). In the expectation step (E-step) of the algorithm, the posterior probability is computed as:

$p(j \mid x_i) = \dfrac{\pi_j \exp\big(-\tilde{B}_{F^*}(x_i, \eta_j)\big)}{\sum_{l=1}^{K} \pi_l \exp\big(-\tilde{B}_{F^*}(x_i, \eta_l)\big)}$    (16)

Here, $x_i$ denotes the expectation parameter associated with data sample i and $\eta_j$ denotes the expectation parameter of cluster j. Eq. (13) is used to compute the BD. The maximization step (M-step) updates the mixing proportion and the expectation parameter of each class as (Banerjee et al. 2005):

$\pi_j = \dfrac{1}{N} \sum_{i=1}^{N} p(j \mid x_i), \qquad \eta_j = \dfrac{\sum_{i=1}^{N} p(j \mid x_i)\, x_i}{\sum_{i=1}^{N} p(j \mid x_i)}$    (17)
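A minimal sketch of this E-step/M-step loop for the 3-D vmf-mm is given below. It reuses F_natural and theta_from_eta from the sketch in Section 2.2, initialises the expectation parameters from randomly chosen (shrunk) data points rather than with k-means++ as the paper does, and should be read as one possible reconstruction of Eqs. (15)-(17), not as the authors' implementation.

```python
import numpy as np
# Reuses F_natural and theta_from_eta from the sketch in Section 2.2.

def sample_divergence(X, theta):
    """Simplified sample-to-cluster divergence of Eq. (13), written as F(theta) - <x, theta>,
    i.e. the negative vMF log-density up to a term that does not depend on the cluster."""
    return F_natural(theta) - X @ theta

def bregman_soft_clustering(X, K, n_iter=50, seed=0):
    """EM / Bregman soft clustering for a K-component vMF-MM in 3-D (Eqs. (15)-(17))."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    pi = np.full(K, 1.0 / K)
    # K random data points, shrunk so that the expectation parameters lie inside the unit ball
    eta = 0.9 * X[rng.choice(N, size=K, replace=False)]
    for _ in range(n_iter):
        theta = np.stack([theta_from_eta(e) for e in eta])
        # E-step (Eq. (16)): responsibilities p(j | x_i) proportional to pi_j * exp(-divergence)
        log_resp = np.log(pi)[None, :] - np.stack(
            [sample_divergence(X, theta[j]) for j in range(K)], axis=1)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step (Eq. (17)): mixing proportions and weighted-average expectation parameters
        pi = resp.mean(axis=0)
        eta = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return pi, eta, resp
```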
HIERARCHICAL CLUSTERING

In the previous step, we applied Bregman soft clustering with a fixed number of components. Let us denote this number K_max for further usage, where K_max is the maximum number of clusters to begin with. In the context of data clustering with a finite mixture model, the choice of the number of components is one of the principal problems (Murphy 2012). In our approach we address this problem by first generating a hierarchical structure consisting of K_max down to 1 components and then choosing the optimal number of components from this hierarchical structure. We apply agglomerative hierarchical clustering (Murphy 2012; Garcia & Nielsen 2010) on the mixture model parameters (computed after applying Bregman soft clustering). The dissimilarity matrix is computed using a BD-based sided distance (Garcia & Nielsen 2010). The left-sided divergence based on the expectation parameters (the type of divergence is chosen empirically, as explained in Section 3.1.3) of Eq. (6) is chosen to compute the distance between two clusters as:

$d(c_j, c_l) = B_{F^*}(\eta_j, \eta_l)$    (18)

The average distance criterion (Murphy 2012) is chosen empirically (explained in Section 3.1.3) as the linkage criterion of the hierarchical clustering in order to determine the order of subset merging.

2.4 Hierarchical mixture model for vmf distributions

We consider a statistical mixture model consisting of K_max vmf distributions (Eq. (14)). Let $T = \{\pi_j, \eta_j\}_{j=1}^{K_{max}}$ be the set of component parameters that we obtained after applying Bregman soft clustering on X. Applying hierarchical clustering on T generates a nested cluster structure through a bottom-up merging process (Murphy 2012). During the merging process, in each iteration two components are merged and the number of clusters is reduced by 1. The parameters of the merged cluster are computed as a weighted average of the expectation parameters:

$\pi_{merge} = \pi_a + \pi_b, \qquad \eta_{merge} = \dfrac{\pi_a \eta_a + \pi_b \eta_b}{\pi_a + \pi_b}$    (19)

Computing the expectation parameter of a merged cluster in this way is analogous to the computation of the left-sided Bregman centroid² (the type of centroid is chosen empirically, as explained in Section 3.1.3) (Garcia & Nielsen 2010).
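The sketch below performs this bottom-up simplification for a 3-D vmf-mm, reusing bregman_dual from the sketch in Section 2.2. For brevity it recomputes pairwise left-sided divergences between the fused centroids at each step instead of maintaining an average-linkage dissimilarity matrix as the paper does, so it should be read as an illustrative variant of Eqs. (18)-(19) rather than the exact procedure.

```python
import numpy as np
# Reuses bregman_dual (cf. Eq. (6)) from the sketch in Section 2.2.

def simplify_mixture(pi, eta):
    """Bottom-up simplification of a vMF-MM: returns the parameters of every level
    of the hierarchy, from the initial K_max components down to a single component.

    pi  : (K,) mixing proportions;  eta : (K, 3) expectation parameters.
    Dissimilarity: left-sided divergence B_{F*}(eta_i, eta_j) (Eq. (18));
    fusion: weighted average of expectation parameters (Eq. (19)).
    """
    levels = [(pi.copy(), eta.copy())]
    while len(pi) > 1:
        K = len(pi)
        best, pair = np.inf, None
        for i in range(K):
            for j in range(K):
                if i != j:
                    d = bregman_dual(eta[i], eta[j])
                    if d < best:
                        best, pair = d, (i, j)
        a, b = pair
        w = pi[a] + pi[b]
        eta_new = (pi[a] * eta[a] + pi[b] * eta[b]) / w
        keep = [k for k in range(K) if k not in (a, b)]
        pi = np.append(pi[keep], w)
        eta = np.vstack([eta[keep], eta_new])
        levels.append((pi.copy(), eta.copy()))
    return levels
```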

Note that the type of Bregman centroid used for merging/fusing the parameters of two clusters must correspond to the sided/symmetric BD used to build the dissimilarity matrix. The decision to use a sided (left or right) or symmetric centroid is made empirically. Moreover, the merging process updates the cluster membership of each sample. The nested structure obtained from the above method is called the hierarchical von Mises-Fisher mixture model (H-vMF-MM). This model facilitates flexible access to the cluster parameters and the associated data members at any particular resolution (i.e., for any particular number of clusters) between K_max and 1. In the literature, this is called the mixture model simplification process (Garcia & Nielsen 2010).

2.5 Choice of the optimal number of components

Let us consider T as the vmf-mm with K_max components. The H-vMF-MM generates parameters for any particular number of clusters between K_max and 1. The problem of finding the optimal mixture model size can be described as the identification of the desired mixture model M* with k* components from T.

KULLBACK-LEIBLER DIVERGENCE (KLD) BASED COMPONENT SELECTION

The KLD is the fundamental measure of distance between two statistical distributions (Garcia & Nielsen 2010; Hershey & Olsen 2007). An equivalent measure can be obtained with the BD (using Eq. (6)). In order to select the number of components, it is necessary to measure the distance among mixture models. To the best of our knowledge, no solution exists for using the BD for such a measure. Similarly, no closed-form expression exists for computing the KL divergence among mixture models (Hershey & Olsen 2007). Therefore, we need to use an approximation of the distance (Hershey & Olsen 2007). The goal of applying the KL divergence $D_{KL}$ among mixture models is to obtain M* such that:

$D_{KL}(T \,\|\, M^*) \leq t_{KLD}$    (20)

where the threshold value $t_{KLD}$ is user defined (Garcia & Nielsen 2010). A classical Monte-Carlo sampling based approximation is employed to compute $D_{KL}$ between two mixture models f and g (Hershey & Olsen 2007) in the following form:

$D_{KL}(f \,\|\, g) \approx \dfrac{1}{n} \sum_{i=1}^{n} \big[ \log f(x_i) - \log g(x_i) \big], \quad x_i \sim f$    (21)

Here, n denotes the number of i.i.d. samples obtained using a sampling procedure from the associated mixture model.

2.6 Complete clustering method

We propose a complete data clustering method, which is illustrated in Fig. 1. We consider the data as N x 3 unit vectors. The clustering procedure begins by applying Bregman soft clustering on the vmf-mm with K_max components. We initialize the mixture model using the k-means++ (Arthur & Vassilvitskii 2007) clustering algorithm. For each component, Bregman soft clustering generates the associated probability and parameters. For the data samples, it provides the associated labels. In the next step, we apply hierarchical clustering on the set of parameters obtained from the previous step. The outcome of the hierarchical clustering is a nested cluster structure composed of mixture model parameters with different numbers of components ranging from K_max to 1. Additionally, the hierarchical clustering updates the labels of the input data samples. Finally, the clustered data membership of every sample is obtained as:

$l_i = \arg\min_{j} \tilde{B}_{F^*}(x_i, \eta_j)$

Here, $x_i$ and $\eta_j$ denote the expectation parameters associated with sample i and cluster $j = \{1, \ldots, k^*\}$. Let $H = \{M_{K_{max}}, \ldots, M_{1}\}$ be the set of mixture models consisting of different numbers of components, with $k \in \{1, \ldots, K_{max}\}$. An example of a mixture model from the set H is $M_k = \{\pi_j, \eta_j\}_{j=1}^{k}$.

² In Eq. (19), the centroid is computed from the expectation parameters, which is different from the centroid computation with the natural parameters in (Garcia & Nielsen 2010).
Figure 1. Proposed data clustering method. Prm: parameters $\{\pi_j, \eta_j\}$ of the mixture model; $p(j \mid x_i)$: soft data membership to the clusters, computed with Eq. (16).
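Before moving to the experiments, the sketch below illustrates Eqs. (20)-(21): a Monte-Carlo estimate of the KLD between two vmf mixture models and a threshold-based selection of the simplified model. It reuses F_natural and theta_from_eta from the sketch in Section 2.2, expects the samples to be drawn from the K_max-component model (e.g. with the sampler sketched in Section 3.1), and reads Eq. (20) as "keep the most simplified model whose KLD to the full model stays below the threshold"; that stopping rule is our reconstruction.

```python
import numpy as np
# Reuses F_natural and theta_from_eta from the sketch in Section 2.2.

def mixture_log_density(X, pi, eta):
    """Log density of a vMF-MM (d = 3) given mixing proportions and expectation parameters."""
    thetas = [theta_from_eta(e) for e in eta]
    comp = np.stack([np.log(p) + X @ t - F_natural(t) for p, t in zip(pi, thetas)], axis=1)
    m = comp.max(axis=1, keepdims=True)               # log-sum-exp for numerical stability
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).ravel()

def mc_kld(samples, model_f, model_g):
    """Monte-Carlo estimate of KLD(f || g), Eq. (21); samples are i.i.d. draws from f."""
    return np.mean(mixture_log_density(samples, *model_f)
                   - mixture_log_density(samples, *model_g))

def select_model(levels, samples, threshold):
    """Pick the most simplified model in the hierarchy whose KLD to the full
    K_max-component model stays below the user-defined threshold (Eq. (20))."""
    full = levels[0]                   # (pi, eta) of the initial vMF-MM
    chosen = full
    for model in levels[1:]:           # progressively simplified models
        if mc_kld(samples, full, model) <= threshold:
            chosen = model
        else:
            break
    return chosen
```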

3. Experiments

3.1 Clustering with simulated data

3.1.1 SIMULATED DATA SAMPLES

We considered a finite set of sample unit vectors $X = \{x_1, \ldots, x_N\}$ on the sphere. These samples were drawn independently from a vmf-mm with different numbers of components. For generating the samples, we followed the standard sampling method for the vmf-mm (Dhillon & Sra 2003) and the Metropolis-Hastings (MH) algorithm (Murphy 2012). In order to verify the efficiency of our proposed method, we generated two types of samples: (a) well separated, with manually selected parameters, and (b) not well separated, with random parameters. For each type, we generated 10,000 i.i.d. samples. Fig. 2 illustrates examples of the different types of simulated data samples.

Figure 2. (a) Well separated samples generated from a 3-component vmf-mm. (b) Not well separated samples generated from a 7-component vmf-mm.

3.1.2 BREGMAN SOFT CLUSTERING

According to the proposed model, the primary choice in the entire clustering process is the maximum number of components K_max. For the simulated samples, we set K_max to a value determined empirically. In order to evaluate the Bregman soft clustering performance, we computed the negative log likelihood (nllh) value:

$\mathrm{nllh} = -\sum_{i=1}^{N} \log \sum_{j=1}^{K_{max}} \pi_j\, f_j(x_i \mid \theta_j)$

We kept track of the nllh values with respect to the number of iterations necessary to converge. At this stage, we did not compare the resulting clusters with the simulated ground truth, due to the fact that we preset K_max for the soft clustering.

3.1.3 HIERARCHICAL CLUSTERING CONSTRUCTION

Similar to hierarchical representations of mixtures of exponential families (Garcia & Nielsen 2010), we evaluated the appropriate distance types (left/right/symmetric) and linkage criteria (Murphy 2012) with respect to the KLD and the resolution (number of components). The parameter fusion choices (left/right/symmetric centroid) during subset merging correspond to the type of divergence/distance that we used to compute the dissimilarity matrix. We computed the average of the KLD values obtained from the evaluation of data samples consisting of mixture models (vmf-mm) with different numbers of components. Below (Table 1 and Fig. 3), we present results obtained from evaluating the hierarchical cluster construction with a mixture model which consists of 7 components. We began our experiments by selecting an appropriate linkage criterion (single, complete, average, Ward, weighted, median and centroid). To this aim, we computed the cophenetic correlation coefficient (Romesburg 1984) for combinations of distances and linkage criteria. Table 1 presents the numerical evaluation, which indicates that the average linkage criterion is the best choice for our model.

Table 1. Numerical evaluation using the cophenetic correlation coefficient. Each entry in the table indicates the average value for a particular choice of distance type (left sided, right sided, symmetric) and linkage criterion (single, complete, average, Ward, weighted, median, centroid).

Fig. 3 illustrates the results obtained when evaluating the distance/divergence types. Similar to (Garcia & Nielsen 2010), we observe that the left-sided BD provides the best simplification quality with respect to the KLD values.

Figure 3. Evaluation of distance type and linkage criteria. Average KLD values for the different types of distances. Linkage criterion: average link. KLD threshold value: 0.1.
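For completeness, a minimal generator for such simulated data is sketched below. It exploits the closed-form inverse CDF available for d = 3 (the cosine of the angle to the mean direction has density proportional to $\exp(\kappa w)$ on [-1, 1]); the paper instead follows Dhillon & Sra (2003) and a Metropolis-Hastings sampler, so this is an illustrative alternative rather than the procedure used for Fig. 2.

```python
import numpy as np

def sample_vmf3(mu, kappa, n, rng):
    """Draw n samples from a vMF on the 2-sphere (d = 3) via the closed-form inverse CDF."""
    mu = mu / np.linalg.norm(mu)
    xi = rng.random(n)
    w = 1.0 + np.log(xi + (1.0 - xi) * np.exp(-2.0 * kappa)) / kappa   # cosine of angle to mu
    # Uniform directions in the tangent plane orthogonal to mu
    helper = np.array([1.0, 0.0, 0.0]) if abs(mu[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(mu, helper); e1 /= np.linalg.norm(e1)
    e2 = np.cross(mu, e1)
    phi = rng.uniform(0.0, 2.0 * np.pi, n)
    v = np.outer(np.cos(phi), e1) + np.outer(np.sin(phi), e2)
    return w[:, None] * mu + np.sqrt(1.0 - w ** 2)[:, None] * v

def sample_vmf_mixture(pis, mus, kappas, n, seed=0):
    """Draw n i.i.d. samples (and their component labels) from a vMF-MM."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(rng.choice(len(pis), size=n, p=pis), minlength=len(pis))
    X = np.vstack([sample_vmf3(mus[j], kappas[j], counts[j], rng) for j in range(len(pis))])
    labels = np.repeat(np.arange(len(pis)), counts)
    return X, labels
```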

Figure 4. Resulting clusters generated for different numbers of components; the associated KLD threshold values are provided.

We applied these experiments on the simulated data (mixture models with different numbers of components). Based on the complete evaluation, we chose the left-sided BD with the average-link criterion to construct the H-vMF-MM.

3.1.4 KLD BASED COMPONENT SELECTION

In this approach, a simplified mixture is obtained based on a user-defined threshold value. From Fig. 3 we can get an idea of how to select a threshold value for the simulated data. From our experiments we observed that, for the well separated samples, choosing a very small threshold value (see Fig. 3, line at the bottom) allows the selection of the correct number of components. However, this observation did not hold for the not-well-separated samples. Therefore, we learnt the optimal threshold from the ground truth data with our threshold selection algorithm. The threshold selection algorithm was applied on the not-well-separated samples only. For this purpose, we generated simulated data with different numbers of samples (2k, 5k, 10k, 20k, 50k), different numbers of components (3, 5, 7, 9) and different parameter values. In each group of samples, the optimal threshold value algorithm was applied ~5000 times (50 simulations, 10 times per simulation, 10 different values). Several such groups of samples were experimented on, with sufficient randomness ensured in the parameter selection; the optimal threshold selection algorithm was therefore applied ~50k times in total. From the experiments on threshold learning, we observed that the threshold value has an inverse relationship with the number of classes. Table 2 presents the threshold values obtained for different numbers of classes.

Table 2. Empirical thresholds obtained from learning the threshold value on simulated data (columns: number of classes, threshold value).

3.1.5 EVALUATION AND COMPARATIVE STUDY

We evaluated and compared the performance of the H-vMF-MM based on accuracy and computational efficiency. In order to analyze the accuracy, we used simulated data sets for which the ground truths were known. Table 3 presents the comparison³ of clustering accuracy for simulated samples with different numbers of classes and types.

Table 3. Comparison of clustering accuracy. Experimented on two different numbers of classes (3 and 5) and two different types of simulated samples, well separated (ws) and not well separated (nws); rows: 3 cl ws, 3 cl nws, 5 cl ws, 5 cl nws. Methods (columns): k-means++ (KMPP), Gaussian Mixture Model (GMM), Spherical k-means (SPKM), vmf-mm and H-vMF-MM.

³ In order to compare with the different methods, we obtained MATLAB implementations either provided by the authors (KMPP, SPKM and VMFMM) or from a standard toolbox (GMM).
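Comparing clusterings against ground truth requires matching the predicted cluster labels to the true class labels; the paper does not spell out the matching it uses, so the sketch below shows one common choice (permutation-matched accuracy via the Hungarian algorithm) purely as an illustration of how the figures in Table 3 could be computed, not as the authors' protocol.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, predicted_labels):
    """Accuracy of a clustering, maximised over the assignment of cluster to class labels."""
    K = int(max(true_labels.max(), predicted_labels.max())) + 1
    count = np.zeros((K, K))
    for t, p in zip(true_labels, predicted_labels):
        count[t, p] += 1
    rows, cols = linear_sum_assignment(-count)     # maximise the matched counts
    return count[rows, cols].sum() / len(true_labels)
```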

From the results in Table 3, it is evident that the H-vMF-MM provides sufficiently high clustering accuracy. Moreover, it is better than or equivalent to the other methods. It appears that for the not-well-separated samples its performance is notably better than the others. Next, we evaluated the computational efficiency. For this purpose, we explored an alternative pathway that does not use the BD. Its steps are: (a) apply classical EM (Banerjee et al. 2005) to obtain the vmf-mm parameters; (b) construct the hierarchical structure with the cosine distance; (c) use the KLD threshold to obtain the clustering at a particular resolution. We observed that, due to the incorporation of the BD, our proposed model improves the computational efficiency of the first two steps compared with the alternative pathway.

3.2 Depth Image Analysis

The proposed method was experimented on depth images obtained from the NYU depth dataset (Silberman et al. 2012). First, we computed the image normal (Silberman et al. 2012) for every pixel. Then, we applied the H-vMF-MM to cluster the image. The resulting clusters generate a bottom-up segmentation of the depth image. Fig. 4 illustrates the clustering performance of the H-vMF-MM on a depth image at different resolutions (numbers of components). In Fig. 4, the RGB image is shown in order to give the reader an idea of the contents of the depth image. The KLD threshold exhibits an inverse relation with the resolution (see Fig. 4), which is similar to the experiments with the not-well-separated simulated data (see Section 3.1.4). Therefore, we can interpret the obtained image segmentation from the perspective of increasing or decreasing the threshold value. Increasing the threshold is equivalent to merging image regions. This is evident when the threshold value is increased from 0.19 to 0.2 (the resolution decreases from 7 to 6; see the 3rd row and 3rd column of Fig. 4). On the contrary, decreasing the threshold is equivalent to splitting image regions. We observe from the results that the clustering provides a sufficient semantic interpretation of the structure of the indoor scene. Most interestingly, we notice that it provides the three principal surfaces (planes in the indoor scene) when the resolution is 4 (see the 2nd row, 3rd column of Fig. 4). It appears that the more we increase the resolution (starting from 2), the more principal surfaces present in the image we can discover. However, increasing the resolution too much will cause over-segmentation (evident at resolution 7). Therefore, a careful choice of the threshold value is very important. On the other hand, we observe from Table 2 that a unique threshold will not recover the true number of classes in all cases; rather, a unique threshold value for the overall clustering task will over- or under-partition the generated clusters. Therefore, in the context of this research, this remains an open problem for future work. Additionally, we noticed that the computed normals contain noisy information. This may be another significant issue that affects the final clustering result. It is evident from Fig. 4, where a new cluster appears around the paper towel dispenser when the number of components is 6 or more (see the 3rd row). The source of this noise is the low depth accuracy.
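The image normals used as features above can be obtained in several ways. The sketch below is a simple stand-in, assuming an orthographic approximation in which the normal at each pixel is taken as $(-\partial Z/\partial x, -\partial Z/\partial y, 1)$ normalized; it is not the normal computation of Silberman et al. (2012), which the paper actually uses.

```python
import numpy as np

def normals_from_depth(depth):
    """Per-pixel unit normals from a depth map Z(y, x) using image gradients
    (orthographic approximation: n proportional to (-dZ/dx, -dZ/dy, 1))."""
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    n = np.dstack((-dz_dx, -dz_dy, np.ones_like(dz_dx)))
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    return n   # shape (H, W, 3); reshape to (H*W, 3) unit vectors before clustering
```

The resulting (H x W) x 3 array of unit vectors is then the input to the clustering pipeline of Section 2.6.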
4. Conclusion

Statistical distributions which belong to the exponential family provide an advantage when designing a mixture model: the model parameters can be computed efficiently by exploiting algorithms such as Bregman soft clustering (Banerjee et al. 2005). Directional distributions provide the benefit of designing a mixture model that captures the true nature of data which have the form of a unit vector. Among them, the most prominent distributions belong to the exponential family. However, because of the complicated normalization term in those distributions, it is difficult to take advantage of this membership of the exponential family. It turns out that, for vectors of three or fewer dimensions, the normalization term is mathematically less complicated. We take advantage of this and derive the canonical representation of the most fundamental directional distribution, the von Mises-Fisher distribution. With this canonical representation, we propose a complete clustering model called the Hierarchical von Mises-Fisher Mixture Model (H-vMF-MM), which most closely resembles the hierarchical models proposed by Garcia & Nielsen (2010). In our proposed model we exploited the Bregman divergence at two stages of the clustering task: (a) soft clustering and (b) hierarchical clustering. In addition, we used the KLD in order to determine the resolution (number of components) of the clustering. Therefore, the appropriate use of divergences plays a significant role in the design of the complete method. We conducted initial experiments on simulated data, which clearly justify the validity of the proposed model for directional data. We then used the model to cluster image normals (Silberman et al. 2012), which eventually generates a bottom-up segmentation of the depth image. The segmentation results generated by the model provide a semantic interpretation of indoor scene surfaces. Therefore, the proposed model can be used for computer vision tasks. We believe, however, that this model will be equally applicable to other data mining and clustering tasks which involve directional data. We foresee several future directions to extend the work, such as: (a) determining the KLD threshold value in order to obtain the optimal number of semantic classes; (b) exploring other clustering approaches, such as total Bregman soft clustering (Liu et al. 2012), in order to enhance robustness w.r.t. noise and outliers; (c) proposing similar models for other directional distributions which belong to the exponential family; (d) proposing a model for high dimensional data; and (e) extending the model to a Bayesian framework.

References

Arthur, D., and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana. Society for Industrial and Applied Mathematics.

Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research (JMLR), 6.

Banerjee, A., Merugu, S., Dhillon, I., and Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6.

Bangert, M., Hennig, P., and Oelfke, U. (2010). Using an infinite von Mises-Fisher mixture model to cluster treatment beam directions in external radiation therapy. In International Conference on Machine Learning and Applications (ICMLA).

Dhillon, I. S., and Sra, S. (2003). Modeling data using directional distributions. Technical report, Department of Computer Sciences, University of Texas at Austin.

Garcia, V., and Nielsen, F. (2010). Simplification and hierarchical representations of mixtures of exponential families. Signal Processing, 90(12).

Hershey, J. R., and Olsen, P. A. (2007). Approximating the Kullback-Leibler divergence between Gaussian mixture models. In IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4.

Liu, M., Vemuri, B. C., Amari, S.-I., and Nielsen, F. (2012). Shape retrieval using hierarchical total Bregman soft clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12).

Mardia, K. V., and Jupp, P. (2000). Directional Statistics (2nd ed.). John Wiley and Sons Ltd.

McGraw, T., Vemuri, B., Yezierski, R., and Mareci, T. (2006). Segmentation of high angular resolution diffusion MRI modeled as a field of von Mises-Fisher mixtures. In European Conference on Computer Vision. Springer-Verlag.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Nielsen, F., and Garcia, V. (2009). Statistical exponential families: A digest with flash cards. Computing Research Repository (CoRR).

Romesburg, H. C. (1984). Cluster Analysis for Researchers. Belmont, Calif.: Lifetime Learning Publications.

Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision.

Sra, S. (2012). A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x). Computational Statistics, 27.


More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Unsupervised Learning: Clustering Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com (Some material

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2 161 Machine Learning Hierarchical clustering Reading: Bishop: 9-9.2 Second half: Overview Clustering - Hierarchical, semi-supervised learning Graphical models - Bayesian networks, HMMs, Reasoning under

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Anil K Goswami 1, Swati Sharma 2, Praveen Kumar 3 1 DRDO, New Delhi, India 2 PDM College of Engineering for

More information

Clustering algorithms and introduction to persistent homology

Clustering algorithms and introduction to persistent homology Foundations of Geometric Methods in Data Analysis 2017-18 Clustering algorithms and introduction to persistent homology Frédéric Chazal INRIA Saclay - Ile-de-France frederic.chazal@inria.fr Introduction

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

A Family of Contextual Measures of Similarity between Distributions with Application to Image Retrieval

A Family of Contextual Measures of Similarity between Distributions with Application to Image Retrieval A Family of Contextual Measures of Similarity between Distributions with Application to Image Retrieval Florent Perronnin, Yan Liu and Jean-Michel Renders Xerox Research Centre Europe (XRCE) Textual and

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

One-Shot Learning with a Hierarchical Nonparametric Bayesian Model

One-Shot Learning with a Hierarchical Nonparametric Bayesian Model One-Shot Learning with a Hierarchical Nonparametric Bayesian Model R. Salakhutdinov, J. Tenenbaum and A. Torralba MIT Technical Report, 2010 Presented by Esther Salazar Duke University June 10, 2011 E.

More information

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany

More information

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot Foundations of Machine Learning CentraleSupélec Fall 2017 12. Clustering Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr Learning objectives

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Bioimage Informatics

Bioimage Informatics Bioimage Informatics Lecture 14, Spring 2012 Bioimage Data Analysis (IV) Image Segmentation (part 3) Lecture 14 March 07, 2012 1 Outline Review: intensity thresholding based image segmentation Morphological

More information

Recommendation System Using Yelp Data CS 229 Machine Learning Jia Le Xu, Yingran Xu

Recommendation System Using Yelp Data CS 229 Machine Learning Jia Le Xu, Yingran Xu Recommendation System Using Yelp Data CS 229 Machine Learning Jia Le Xu, Yingran Xu 1 Introduction Yelp Dataset Challenge provides a large number of user, business and review data which can be used for

More information

Cluster Analysis and Visualization. Workshop on Statistics and Machine Learning 2004/2/6

Cluster Analysis and Visualization. Workshop on Statistics and Machine Learning 2004/2/6 Cluster Analysis and Visualization Workshop on Statistics and Machine Learning 2004/2/6 Outlines Introduction Stages in Clustering Clustering Analysis and Visualization One/two-dimensional Data Histogram,

More information

Traditional clustering fails if:

Traditional clustering fails if: Traditional clustering fails if: -- in the input space the clusters are not linearly separable; -- the distance measure is not adequate; -- the assumptions limit the shape or the number of the clusters.

More information

COMS 4771 Clustering. Nakul Verma

COMS 4771 Clustering. Nakul Verma COMS 4771 Clustering Nakul Verma Supervised Learning Data: Supervised learning Assumption: there is a (relatively simple) function such that for most i Learning task: given n examples from the data, find

More information

Modeling and Reasoning with Bayesian Networks. Adnan Darwiche University of California Los Angeles, CA

Modeling and Reasoning with Bayesian Networks. Adnan Darwiche University of California Los Angeles, CA Modeling and Reasoning with Bayesian Networks Adnan Darwiche University of California Los Angeles, CA darwiche@cs.ucla.edu June 24, 2008 Contents Preface 1 1 Introduction 1 1.1 Automated Reasoning........................

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves Machine Learning A 708.064 11W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence I [2 P] a) [1 P] Give an example for a probability distribution P (A, B, C) that disproves

More information

IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING

IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING SECOND EDITION IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING ith Algorithms for ENVI/IDL Morton J. Canty с*' Q\ CRC Press Taylor &. Francis Group Boca Raton London New York CRC

More information