Online Adaptive Hierarchical Clustering in a Decision Tree Framework


Journal of Pattern Recognition Research 2 (2). Received March 8; accepted May 7.

Jayanta Basak
NetApp India Private Limited, Advanced Technology Group, Bangalore, India
basak@netapp.com, basakjayanta@yahoo.com

Abstract

Online clustering is an issue when crunching large amounts of data. Moreover, a coarse-to-fine grain analysis is also desirable. We address both of these problems in a single framework by designing an online adaptive hierarchical clustering algorithm in a decision tree framework. Our model consists of an online adaptive binary tree and a code formation layer. The adaptive tree hierarchically partitions the data, and the finest-level clusters are represented by the leaf nodes. The code formation layer stores the representative codes of the clusters corresponding to the leaf nodes, and the tree adapts the separating hyperplanes between the clusters at every layer in an online adaptive mode. The membership of a sample in a cluster is decided by the tree, and the tree parameters are guided by the stored codes. As opposed to existing hierarchical clustering techniques, where some local objective function is optimized at every level, we adapt the tree in an online adaptive mode by minimizing a global objective functional. We use the same global objective functional as the fuzzy c-means algorithm (FCM); however, we observe that the effect of the control parameter is different from that in the FCM. In our model the control parameter regulates the size and the number of clusters, whereas in the FCM the number of clusters is always constant (c). We never freeze the adaptation process: for every input sample, the tree allocates it to a certain leaf node and at the same time adapts the tree parameters simultaneously with the adaptation of the stored codes. We validate the effectiveness of our model on several real-life datasets and also show that the model is able to perform unsupervised classification on certain datasets.

Keywords: Decision tree, online adaptive learning, fuzzy c-means

Pattern clustering [1, 2] is a well-studied topic in the pattern recognition literature, in which samples are grouped into different clusters based on some self-similarity measure. Online clustering is an issue when crunching large amounts of data. Moreover, a coarse-to-fine grain analysis is also desirable. We address both of these problems in a single framework by designing an online adaptive hierarchical clustering algorithm in a decision tree framework. Depending on different criteria such as data representation, similarity/dissimilarity measure, interpretation, domain knowledge, and modality (e.g., incremental/batch mode), different clustering algorithms have been developed; a comprehensive discussion can be found in [2]. Methods of clustering can be generically partitioned into two groups, namely partitional and hierarchical. In partitional, or flat, clustering, the data samples are grouped into several disjoint groups depending on a criterion function, and there is no hierarchy between the clusters. Among the different partitional (flat) clustering algorithms, K-means and fuzzy c-means are widely studied in the literature. The quality of partitional clustering algorithms often depends on the objective functional that is minimized to produce the clustering.
Fuzzy c-means, or soft c-means, [3] (Appendix I) and its variants generally employ a global objective functional, and the clusters are described in terms of cluster centers and the membership values of the samples in the different clusters.
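Since the fuzzy c-means objective and its two-step update are referred to repeatedly below (the paper's Appendix I is not reproduced in this excerpt), here is a minimal NumPy sketch of the standard FCM iteration for orientation; the function name, arguments, and initialization strategy are our own illustrative choices, not the paper's.

```python
import numpy as np

def fcm(X, c, h=2.0, n_iter=100, eps=1e-9, seed=0):
    """X: (n_samples, n_dims); c: number of clusters; h: fuzzifier exponent (> 1)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iter):
        # squared distances to every center, kept strictly positive
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + eps
        # membership update: u_ij proportional to d_ij^(-2/(h-1)), rows sum to 1
        inv = d2 ** (-1.0 / (h - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # center update: weighted mean with weights u^h
        w = u ** h
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return centers, u
```

The exponent h plays the role of the fuzzifier (h = 1/α in the notation used later), and the number of clusters c is fixed by the user; this fixed c is the main point of contrast with the adaptive tree described below.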

Hierarchical clustering [1], on the other hand, generates nested hierarchical groupings of the data samples, where the top layers of the hierarchy contain a coarse-level grouping with fewer clusters and the lower layers represent finer-level groupings with a larger number of clusters. Hierarchical clustering is usually performed in two different ways, namely divisive and agglomerative. In agglomerative clustering, smaller clusters are grouped into larger clusters in a bottom-up manner [4, 5]. In divisive hierarchical clustering, the set of data samples is iteratively partitioned in a top-down manner to construct finer-level clusterings [6, 7, 8, 9, 10, 11]. In divisive clustering, hierarchical self-organizing maps have also been used, where at each level a growing self-organizing map is generated depending on the data distribution at that layer [, 9, 8].

Decision trees have been used for divisive hierarchical clustering [2, 3, 4, 5]. As an example, in [2], the construction of an unsupervised decision tree was mapped onto that of a supervised decision tree: samples are artificially injected such that empty regions contain sparse injected samples, and the decision tree is then constructed by discriminating between the original samples and the artificially injected samples. At every level of the tree, the number of injected samples is controlled depending on the number of actual data samples available at that node. In [4], the top-down induction mechanism of decision trees is used for clustering; it is specifically used to build the first-order logical decision tree representation of an inductive logic programming system. In predictive clustering [3], a decision tree structure is induced based on the attribute values such that the separation between the clusters and the homogeneity within clusters, in terms of the attribute values, are maximized; a value of the target variable is then associated with each cluster represented by a leaf of the tree. In [5], the dataset is iteratively partitioned at each level of the tree by measuring the entropy of the histogram of the data distribution. However, in all these algorithms [2, 3, 4, 5], the decision tree is constructed based only on local splitting criteria at each node, as in the supervised counterpart of decision trees [6, 7, 8]. In such top-down divisive partitioning, if an error is committed at a top layer, it becomes very hard to correct it at a lower layer. To overcome this, efforts have been made to define a global objective function on a decision tree in the context of clustering, and the objective function is optimized to generate the clustering. For example, with a given tree topology, a stochastic approximation of the data distribution through deterministic annealing is performed in [9] to obtain groups of data samples at the leaf nodes of the tree. Essentially, in this approach the cross-entropy measure between the probabilistic cluster assignments in the leaf nodes and those in the parent nodes is minimized. The formulation is then extended to an online learning set-up to handle non-stationary data.

In this paper, we present a method for performing online adaptive hierarchical clustering in a decision tree framework (a preliminary version of this work can be found in [2]). Our model consists of two components, namely an online adaptive decision tree and a code formation layer.
The adaptive tree component represents the clusters in a hierarchical manner, and the leaf nodes represent the clusters at the finest granularity. The code formation layer stores the representative codes of the clusters corresponding to the leaf nodes. For the adaptive tree, we use a complete binary tree of a specified depth. We use a global objective function that considers the activation of the leaf nodes and the stored codes together to represent the compactness of the clusters. We use the same objective function as the fuzzy c-means algorithm (Appendix I), where we replace the cluster centers by the stored codes.

In our model, we do not explicitly compute the cluster centers: the clusters are not represented by their centers but are determined by the separating hyperplanes of the adaptive tree. We adapt the tree parameters and the stored codes simultaneously in an online adaptive manner by minimizing the objective functional for each sample separately, considering the samples to be independently selected. We do not perform any estimation of the data distribution; rather, we perform deterministic gradient descent on the objective functional. In this adaptation process, we adapt the parameters of the entire tree for every sample. Therefore, unlike the usual divisive clustering, we do not recursively partition the dataset in a top-down manner. We never freeze the learning, i.e., whenever a new data sample is available, the leaf nodes of the tree are activated and we find the maximally activated leaf node to represent the corresponding cluster for the input sample; at the same time, we incrementally adapt the parameters of the tree and the stored codes. We do not decrease the learning rate explicitly depending on the number of iterations; rather, we compute the optimal learning rate for every sample to minimize the objective functional. We also observe that even though we consider a fixed tree topology, the tree structure adapts depending on the dataset and the order of appearance of the input samples. In this framework, we attain an online adaptive hierarchical clustering algorithm in a tree framework which exhibits stability as well as plasticity. Since we do not compute the cluster centers directly from the data (rather, they are determined by the tree parameters), we observe stability in the behavior of the tree. At the same time, the incremental learning rules provide plasticity. The selection of a global objective function relieves the tree of local partitioning rules: the hyperplanes at the intermediate nodes are adjusted based on the global behavior. This enables incremental learning and also reduces the risk of committing early errors due to local partitioning. The intuition of reducing error due to local partitioning is similar to that in the supervised OADT [2], where we performed classification of the Gaussian parity problem, which is not possible in a classical decision tree framework with local partitioning.

1. Description of the Model

The model (Figure 1) consists of two components, namely an online adaptive tree [2, 22] and a code formation layer. We have previously used the online adaptive tree for pattern classification and function approximation tasks in the supervised mode [2, 22]. Here we employ the online adaptive tree in the unsupervised mode for adaptive hierarchical clustering. The adaptive tree is a complete binary tree of a specified depth l. Each non-terminal node i has a local decision hyperplane characterized by a vector (w_i, θ_i) with ‖w_i‖ = 1, such that (w_i, θ_i) together has n free variables for an n-dimensional input space. Each nonterminal node of the adaptive tree receives the same input pattern x = [x_1, x_2, …, x_n] and the activation from its parent node. The root node receives only the input pattern x. Each nonterminal node produces two outputs, one for the left child and one for the right child. We number the nodes in breadth-first order such that the left child of node i is indexed as 2i and the right one as 2i + 1.
The activations of the two child nodes of a non-terminal node i are given as

u_{2i} = u_i f(w_i^T x + θ_i),
u_{2i+1} = u_i f(−(w_i^T x + θ_i)),    (1)

where u_i is the activation of node i and f(·) is the local soft partitioning function governed by the parameter vector (w_i, θ_i). We choose f(·) to be a sigmoidal function.
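To make the recursion in Equation (1) concrete, the sketch below propagates activations through a complete binary tree for a single input. It is a minimal NumPy illustration under our own naming conventions (nodes indexed 1..2^(l+1)−1 in breadth-first order, as in the paper, with an arbitrary toy slope), not the paper's MATLAB implementation.

```python
import numpy as np

def sigmoid(v, m):
    # f(v) = 1 / (1 + exp(-m v)); m controls the steepness of the soft split
    return 1.0 / (1.0 + np.exp(-m * v))

def node_activations(x, W, theta, depth, m):
    """W[i], theta[i] parameterize the hyperplane at non-terminal node i (1-based).
    Returns activations u[1..2^(depth+1)-1]; leaves are indices 2^depth .. 2^(depth+1)-1."""
    n_nodes = 2 ** (depth + 1) - 1
    u = np.zeros(n_nodes + 1)
    u[1] = 1.0                                # root activation is fixed to unity
    for i in range(1, 2 ** depth):            # every non-terminal node
        s = sigmoid(W[i] @ x + theta[i], m)
        u[2 * i] = u[i] * s                   # left child:  u_{2i}   = u_i f(+v)
        u[2 * i + 1] = u[i] * (1.0 - s)       # right child: u_{2i+1} = u_i f(-v), since f(-v) = 1 - f(v)
    return u

# toy check on a random tree of depth 3
rng = np.random.default_rng(0)
depth, n_dim, m = 3, 2, 15.0
W = rng.normal(size=(2 ** depth, n_dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # enforce ||w_i|| = 1
theta = np.zeros(2 ** depth)
u = node_activations(rng.uniform(-1, 1, n_dim), W, theta, depth, m)
print(u[2 ** depth:].sum())   # ~1.0: the leaf activations sum to the root activation (derived below)
```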

Fig. 1: Structure of the model. It has two components, namely the adaptive tree and a code formation layer. Every intermediate node and the root node of the adaptive tree receives the input pattern x; the code formation layer also receives the input pattern x.

The activation of the root node, indexed by 1, is always set to unity. The activation of a leaf node p can therefore be derived as

u_p = ∏_{i ∈ P_p} f( lr(i,p) (w_i^T x + θ_i) ),    (2)

where P_p is the path from the root node to the leaf node p, and lr(·,·) is an indicator function indicating whether the leaf node p lies on the left or the right path of node i:

lr(i,p) = +1 if p is on the left path of i, −1 if p is on the right path of i, and 0 otherwise.    (3)

Considering the property of the sigmoidal function f(·) that f(−v) = 1 − f(v), we observe that the sum of the activations of two siblings is always equal to that of their parent, irrespective of the level of the nodes (since every intermediate node uses a sigmoidal function to produce the activations of its left and right children). Since the sum of two sibling leaf nodes equals their parent, the sum of activations of two sibling parent nodes equals the parent at the next higher layer, and so on. We extend this reasoning to obtain (Appendix II)

∑_{j ∈ Ω} u_j = u_1,    (4)

where Ω represents the set of leaf nodes and u_1 is the activation of the root node. As mentioned before, we always set u_1 = 1, and therefore

∑_{j ∈ Ω} u_j = 1.    (5)

We have chosen f(·) as

f(v) = 1 / (1 + exp(−mv)),    (6)

where m determines the steepness of f(·). In the appendix (Appendix III), we show that

m ≥ (1/δ) log(l/ε),    (7)

where δ is a margin between a point (pattern) and the closest decision hyperplane, and ε is a constant such that the minimum activation of a maximally activated node should be no less than 1 − ε. For example, if we choose ε = 0.1 and δ = 0.1, then m ≥ 10 log(10 l).

The code formation layer stores the codes representative of the clusters corresponding to the leaf nodes. The codes are equivalent to the cluster centers used in the fuzzy c-means algorithm; however, we do not compute the codes as membership-weighted means. Corresponding to each leaf node j, there is a code β_j = (β_{j1}, β_{j2}, …, β_{jn}) stored in the code formation layer. Thus for N leaf nodes, corresponding to a maximum of N clusters, we have N code vectors β_1, β_2, …, β_N, where β_N = (β_{N1}, β_{N2}, …, β_{Nn}), as shown in Figure 1. The code formation layer receives the same input x and measures the deviation of the input x from the stored code β_j for an activated leaf node j, i.e., ‖x − β_j‖². If the leaf node activation is high and the deviation is also large, then the error (the value of the objective function) is large; if either the deviation or the leaf node activation is small, then the error should be small.
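A quick numeric sanity check of the slope bound, under our reading of the reconstructed Equation (7), m ≥ (1/δ) log(l/ε): with m chosen this way, a sample lying at least δ from every hyperplane on its path keeps its maximally activated leaf above 1 − ε. Names and toy values below are ours.

```python
import math

def min_slope(depth, delta, eps):
    # m >= (1/delta) * log(depth / eps), our reading of Equation (7)
    return math.log(depth / eps) / delta

depth, delta, eps = 5, 0.1, 0.1
m = min_slope(depth, delta, eps)
# margin scenario along a root-to-leaf path: every factor equals f(delta) = 1/(1+exp(-m*delta))
worst_leaf_activation = (1.0 / (1.0 + math.exp(-m * delta))) ** depth
print(m, worst_leaf_activation >= 1 - eps)   # e.g. m ~ 39.1, True
```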

Therefore, at each leaf node j we have an error ‖x − β_j‖² u_j^{1/α}, where u_j is the activation of leaf node j and α is a control parameter. We iterate to minimize the value of an objective function that measures the total error over all leaf nodes, by adapting the stored codes β and the stored parameter values (w, θ). In this model, the maximum number of possible clusters is N = 2^l, where l denotes the depth of the adaptive tree.

2. Learning Rules

We define a global objective functional E by considering the discrepancy between the input patterns and the existing codes in β, together with the activation of the leaf nodes:

E = ∑_x E(x) = (1/2) ∑_x ∑_{j ∈ Ω} ‖x − β_j‖² u_j^{1/α}(x),    (8)

where α is a constant which acts as a control parameter. The objective functional is identical to that of fuzzy c-means [3]. In fuzzy c-means (Appendix I), u represents the membership value of a sample in a specific cluster and β represents the cluster centers. The parameter α controls the fuzzification, or spread, of the membership values. In the limiting condition α = 1, fuzzy c-means becomes hard c-means (Appendix I). The number of clusters in fuzzy c-means is constant, as defined by the parameter c. The membership values and the cluster centers are updated in an iterative two-step EM-type algorithm, where the membership values are computed from the cluster centers in the first step and the cluster centers are updated using the membership values in the second step of each iteration.

The stored codes β in our model act as the reference points of the clusters, analogous to the cluster centers in fuzzy c-means [3]. However, in our model the clusters are not explicitly defined by the stored codes; rather, we obtain the clusters from the separating hyperplanes of the adaptive tree. We learn both β and the separating hyperplanes together, unlike fuzzy c-means [3]. Moreover, the parameter α in our model controls the number and size of the clusters, unlike the EM-based fuzzy c-means algorithm where the number of clusters is constant.

In the online mode, we compute the objective functional for each individual sample x as

E(x) = (1/2) ∑_{j ∈ Ω} ‖x − β_j‖² u_j^{1/α},    (9)

considering the samples to be independently selected. We then adapt the parameters of the model to minimize E(x) for each individual sample in the online mode. For a given α and an observed sample x, we minimize E(x) by steepest gradient descent such that

Δβ_j = −η_opt ∂E/∂β_j,  Δw_i = −η_opt ∂E/∂w_i,  Δθ_i = −η_opt ∂E/∂θ_i,    (10)

where η_opt is an optimal learning rate parameter decided in the online mode. Evaluating the partials in Equation (10), we obtain

Δβ_j = η_opt u_j^{1/α} (x − β_j),
Δw_i = −η_opt q_i x,
Δθ_i = −η_opt λ q_i,    (11)

where q_i is expressed as

q_i = (m / 2α) ∑_{j ∈ Ω} ‖x − β_j‖² (u_j)^{1/α} (1 − v_{ij}) lr(i,j).    (12)

Each nonterminal node i locally computes v_{ij}, which is given as

v_{ij} = f( (w_i^T x + θ_i) lr(i,j) ).    (13)
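As a concrete illustration of Equations (9) and (11), the fragment below evaluates the per-sample objective and applies the resulting update of the stored codes for one input. It is a hedged NumPy sketch with our own variable names and toy values; a fixed learning rate is used here, whereas the optimal rate is discussed next.

```python
import numpy as np

# Toy per-sample quantities (our own illustrative values, not from the paper):
x = np.array([0.3, -0.2])                    # one normalized input sample
u_leaf = np.array([0.82, 0.10, 0.05, 0.03])  # leaf activations, summing to 1 (Eq. 5)
beta = np.zeros((4, 2))                      # stored codes, initialized to zero
alpha, eta_opt = 0.8, 0.05                   # control parameter and a fixed toy learning rate

# Per-sample objective, Eq. (9): E(x) = 1/2 * sum_j ||x - beta_j||^2 * u_j^(1/alpha)
resid = x[None, :] - beta                    # (N_leaves, n_dims)
E_x = 0.5 * np.sum((resid ** 2).sum(1) * u_leaf ** (1.0 / alpha))

# Code update, first line of Eq. (11): d(beta_j) = eta * u_j^(1/alpha) * (x - beta_j)
beta += eta_opt * (u_leaf ** (1.0 / alpha))[:, None] * resid
print(E_x, beta[0])   # the maximally activated leaf's code moves most toward x
```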

Since (w, θ) at each nonterminal node has n free variables together (n being the input dimensionality), subject to the condition ‖w‖ = 1, the updating of w in Equation (11) is restricted by the normalization condition. The parameter w is updated in such a way that Δw lies in the hyperplane normal to w, which ensures that w · Δw = 0, i.e., the constraint ‖w‖ = 1 is satisfied for small changes after updating. Therefore, we take the projection of q_i x onto the normal hyperplane of w_i and obtain Δw_i as

Δw_i = −η_opt q_i (I − w_i w_i^T) x,    (14)

where I is the identity matrix. In Equation (11), the parameter λ represents the scaling of the learning rate parameter with respect to θ. We normalize the input variables such that x is bounded in [−1, +1]^n, and we select λ = 1. Note that if the input is not normalized, then a different value can be selected for λ.

Since we never freeze the adaptation process, we decide the learning rate adaptively based on the discrepancy measure E(x) for a sample x in the online mode. For every input sample x at every iteration, we select η_opt using the line search method (also used by Friedman in [23]) such that the change from E to E + ΔE is linearly extrapolated to zero. Equating the first-order approximation of ΔE (line search) to −E, we obtain

η_opt = E / ( ∑_{j ∈ Ω} (u_j)^{2/α} + ∑_{i ∈ N} q_i² (λ + ‖x‖² − (w_i^T x)²) ).    (15)

From Equation (15), we observe that for very compact or point clusters, η can become very large near the optima. In order to stabilize the behavior of the adaptive tree, we modify the learning rate as

η_opt = E / ( γ + ∑_{j ∈ Ω} (u_j)^{2/α} + ∑_{i ∈ N} q_i² (λ + ‖x‖² − (w_i^T x)²) ),    (16)

with a constant γ. The parameter γ acts as a regularizer in obtaining the learning rate. In our model, we chose γ = 1.

From the learning rules, we observe that the codes represented by β are updated such that they approximate the input pattern subject to the activation of the leaf nodes. The maximally activated leaf node attracts the corresponding β most strongly towards the input pattern. If there is a close match between the input pattern x and the stored code β corresponding to the maximally activated leaf node, then the decision hyperplanes are not perturbed much, since ‖x − β‖ becomes very small. On the other hand, if there is a mismatch, then the movement of the decision hyperplane is decided by the stored function lr(i,j). If the maximally activated leaf node j is on the left path of the nonterminal node i, then lr(i,j) = 1, i.e., w_i is moved in such a way that Δw_i has the opposite sign of x, subject to Δw_i lying in the normal hyperplane of w_i.
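The following sketch puts the reconstructed Equations (12), (14), and (16) together for one toy input on a depth-2 tree: it computes q_i for every non-terminal node, the regularized learning rate, and the projected hyperplane update. Names, indexing, and values are ours, and an explicit re-normalization of w_i is added as a safeguard on top of the paper's small-step argument.

```python
import numpy as np

def sigmoid(v, m):
    return 1.0 / (1.0 + np.exp(-m * v))

# lr(i, j): +1 if leaf j lies on the left path of node i, -1 on the right path, 0 otherwise.
def lr(i, j):
    while j > i:
        if j // 2 == i:
            return 1 if j % 2 == 0 else -1
        j //= 2
    return 0

# Toy depth-2 tree: internal nodes 1..3, leaves 4..7 (same indexing as the earlier sketch).
depth, n_dim, m, alpha, lam, gamma = 2, 2, 10.0, 0.8, 1.0, 1.0
rng = np.random.default_rng(1)
W = rng.normal(size=(4, n_dim)); W /= np.linalg.norm(W, axis=1, keepdims=True)
theta = np.zeros(4)
beta = rng.uniform(-1, 1, size=(4, n_dim))            # one code per leaf 4..7
x = rng.uniform(-1, 1, n_dim)

# Leaf activations (Eq. 2) and squared deviations from the codes.
u = {1: 1.0}
for i in range(1, 4):
    s = sigmoid(W[i] @ x + theta[i], m)
    u[2 * i], u[2 * i + 1] = u[i] * s, u[i] * (1.0 - s)
leaves = list(range(4, 8))
u_leaf = np.array([u[j] for j in leaves])
d2 = ((x - beta) ** 2).sum(axis=1)

# q_i for every internal node (Eq. 12), using v_ij from Eq. (13).
q = np.zeros(4)
for i in range(1, 4):
    for k, j in enumerate(leaves):
        if lr(i, j) != 0:
            v_ij = sigmoid((W[i] @ x + theta[i]) * lr(i, j), m)
            q[i] += (m / (2 * alpha)) * d2[k] * u_leaf[k] ** (1 / alpha) * (1 - v_ij) * lr(i, j)

# Regularized learning rate (Eq. 16) and projected hyperplane update (Eq. 14).
E_x = 0.5 * np.sum(d2 * u_leaf ** (1 / alpha))
denom = gamma + np.sum(u_leaf ** (2 / alpha)) \
        + np.sum(q[1:] ** 2 * (lam + x @ x - (W[1:] @ x) ** 2))
eta = E_x / denom
for i in range(1, 4):
    P = np.eye(n_dim) - np.outer(W[i], W[i])          # projector onto the normal hyperplane of w_i
    W[i] -= eta * q[i] * (P @ x)
    W[i] /= np.linalg.norm(W[i])                      # keep ||w_i|| = 1 after the small step
    theta[i] -= eta * lam * q[i]
print(eta, q[1:])
```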

Since the activated leaf node is on the left path, i.e., x is on the left half of the decision hyperplane, the decision hyperplane effectively moves opposite to x, increasing the activation of the activated leaf node. The magnitude of the shift depends on the activation value of the leaf and on the discrepancy between the corresponding stored code and the input pattern. Similar reasoning holds if the maximally activated leaf node j is on the right path of the nonterminal node i. We also observe that as we move towards the root node, the decision hyperplanes are perturbed by the cumulative effect of the entire subtree under the nonterminal node (if j is not under the subtree of i, then lr(i,j) = 0). We also notice that the updating rules for the decision hyperplanes are independent of each other. For example, a leaf node j can be on the left path of a nonterminal node i with lr(i,j) = 1, and yet be on the right path of i's left child. Therefore, the leaf nodes affect different nonterminal nodes differently at different levels, depending on the stored function lr(i,j).

In the adaptive process, if the input patterns are drawn from a stable distribution, then E decreases (due to the decrease in the discrepancy between the stored codes and the input patterns) and therefore η_opt decreases with the number of iterations; the updating rules then reveal that the corresponding codes in β are also less affected by newly presented input patterns. In our model, we do not explicitly decrease η through any learning schedule; rather, η_opt adjusts itself depending on the input distribution.

The overall algorithm for adapting the tree in the online mode is as follows.

Step 1: Initialize the tree parameters (w) and the code vectors (β).
Repeat the next steps whenever a new pattern x appears (in the online mode we never freeze the learning):
Step 2: For the given x, compute the leaf node activations.
Step 3: For the given x, compute q for each intermediate node of the tree (Equation 12).
Step 4: For the given x, compute η_opt (Equation 16).
Step 5: For the given x, compute Δw for all intermediate nodes (Equation 14) and Δβ for all leaf nodes (Equation 11).
Step 6: For the given x, update w as w = w + Δw for all intermediate nodes and update β as β = β + Δβ for all leaf nodes.
Continue Steps 2-6 as each new pattern is presented to the model.

3. Choice of α

The objective functional in Equation (8) is exactly the same as that in fuzzy c-means [24, 3] (Appendix I), where β corresponds to the cluster centers, u corresponds to the membership values, and h = 1/α. However, h in soft c-means is chosen in [1, ∞) (i.e., 1/α ∈ [1, ∞)), and usually h > 1, which means α < 1. In the limit α = 1, the fuzzy c-means algorithm becomes the hard c-means algorithm (Appendix I). In fuzzy c-means, α is chosen less than unity (i.e., h > 1) in order to make the membership function wider and obtain smoother cluster boundaries. However, α in fuzzy c-means has no role in deciding the number of clusters, the number of clusters being a user-defined parameter. In our model, on the other hand, the leaf node activations do not directly depend on the computed codes in the code formation layer. Rather, the decision hyperplanes stored in the intermediate nodes of the adaptive tree determine the leaf node activations.

The code formation layer, parameterized by the β matrix, stores the respective codes corresponding to the groups of samples identified by the adaptive tree, each leaf node representing one group of samples. The decision hyperplanes in the adaptive tree and the codes in the code formation layer are adapted together to minimize the objective functional (Equation (8)). The number of clusters in our model is guided by the choice of α. In the limiting condition, as α → 0, E → 0, and the model does not adapt at all and remains in its initial configuration. On the other hand, for α → ∞, the individual leaf node activations play no role in E, and the parameters of the adaptive tree are adapted in such a way that all samples are merged into one group. We can also observe the same behavior from the combination of the adaptation rules and the learning rate (Equations (12) and (15)): with an increase in the value of α, the magnitude of the changes in the parameter values increases, which makes the model more adaptive. In other words, even though there are 2^l leaf nodes corresponding to a maximum of 2^l clusters, the effective number of clusters depends on the parameter α. The effective number of clusters is the number of leaf nodes that are maximally activated for some input sample: if there exists a leaf node which is never maximally activated for any input sample, then that leaf node does not represent any cluster. In this model, the effective number of clusters therefore depends on the control parameter α.

We explain the effect of α with a simple example. Let there be only two different samples x_1 and x_2, and let the model have only two leaf nodes. Without loss of generality, let us denote the activations of the leaf nodes by u_1 and u_2 respectively, and let the distance between the two samples be d = ‖x_1 − x_2‖. Ideally, there are two possible cases.

Case I: The sample x_1 forms one cluster and x_2 forms the second one, such that the codes formed are β_1 = x_1 and β_2 = x_2. For the minimum value of E, a separating hyperplane exactly normal to the vector x_1 − x_2 and equidistant from the samples is constructed. In that case, when x_1 is presented to the model, u_1(x_1) = 1/(1 + exp(−md/2)) and u_2(x_1) = 1/(1 + exp(md/2)). In such a scenario,

E(x_1) = ‖x_1 − β_1‖² u_1^{1/α} + ‖x_1 − β_2‖² u_2^{1/α},    (17)

i.e.,

E(x_1) = d² (1 + exp(md/2))^{−1/α}.    (18)

We get exactly the same value for E(x_2) due to symmetry, and therefore the total loss is

E_1 = E(x_1) + E(x_2) = 2d² (1 + exp(md/2))^{−1/α}.    (19)

Case II: Both samples x_1 and x_2 form one cluster and the code β_1 is placed exactly in the middle, such that β_1 = (x_1 + x_2)/2. Since both samples are in the same cluster, there is no separating hyperplane between them (ideally it is at an infinite distance from x_1 and x_2), and only one leaf node is activated, such that u_1 = 1 and u_2 = 0 for both x_1 and x_2. In such a case,

E(x_1) = ‖x_1 − β_1‖² u_1^{1/α} + ‖x_1 − β_1‖² u_2^{1/α},    (20)

i.e.,

E(x_1) = d²/4.    (21)

Since we obtain exactly the same value for E(x_2), the total loss is

E_2 = E(x_1) + E(x_2) = d²/2.    (22)

From Equations (19) and (22), we obtain

E_1 / E_2 = 4 (1 + exp(md/2))^{−1/α}.    (23)

From Equation (23), we observe that there exists an

α* = (1/2) log_2(1 + exp(md/2))    (24)

such that for α ≥ α*, E_1 ≥ E_2. In other words, since we minimize the objective functional, the scenario in Case II is preferred over that in Case I if we choose α > α*, i.e., both samples are clustered into the same group. On the other hand, for a choice of α < α*, the samples prefer to represent two different groups (since in that case E_1 < E_2). As discussed in the appendix (Appendix III),

(1 + exp(−mδ))^{−l} ≥ 1 − ε,    (25)

where l is the depth of the tree, δ is the minimum distance of a sample from the nearest hyperplane, and (1 − ε) is the minimum activation of the corresponding cluster (the activation of the maximally activated leaf node). In this example, l = 1, since there are only two leaf nodes. If we consider δ = d/2, then

1 / (1 + exp(−md/2)) ≥ 1 − ε.    (26)

After simplification,

1 + exp(md/2) ≥ 1/ε.    (27)

Therefore, from Equations (24) and (27), we have

α* ≥ (1/2) log_2(1/ε).    (28)

Therefore, if we want to prevent fragmentation, i.e., assign both x_1 and x_2 to the same cluster, then we should have

α ≥ α* ≥ (1/2) log_2(1/ε).    (29)

In order to assign both samples to a single cluster, the required value of α thus follows from the chosen ε; for example, if we consider ε = 0.25, we should have α ≥ 1. The same reasoning is valid for more than two samples with multiple levels in the adaptive tree; in the case of multiple levels, we assume that the two closest points are separated only at the last level of the tree. Therefore, for a small α, the adaptive tree can partition single small clusters into multiple ones. In short, for a choice of large α, the number of groups formed by the adaptive tree decreases and large chunks of data are grouped together. On the other hand, if α is decreased, then the number of groups increases. In other words, we can vary the value of α in the range [α*, ∞) to obtain finer to coarser clustering.
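As a small numeric illustration of the threshold in Equations (24) and (28) (using the formulas as reconstructed above), the snippet below computes α* for a toy pair of points and checks which of the two cases yields the lower loss; the values are our own.

```python
import math

def alpha_star(m, d):
    # Equation (24): alpha* = (1/2) * log2(1 + exp(m*d/2))
    return 0.5 * math.log2(1.0 + math.exp(m * d / 2.0))

def losses(m, d, alpha):
    # Case I (two singleton clusters) vs. Case II (one merged cluster), Eqs. (19) and (22)
    E1 = 2.0 * d * d * (1.0 + math.exp(m * d / 2.0)) ** (-1.0 / alpha)
    E2 = d * d / 2.0
    return E1, E2

m, d = 15.0, 0.4
a_star = alpha_star(m, d)                 # ~2.2 for these toy values
for alpha in (0.5 * a_star, 2.0 * a_star):
    E1, E2 = losses(m, d, alpha)
    print(alpha, "two clusters" if E1 < E2 else "one cluster")
# below alpha* the split is cheaper, above alpha* merging is cheaper,
# and Eq. (28) lower-bounds alpha* by (1/2) * log2(1/eps).
```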

In soft c-means, on the contrary, the number of clusters is always constant and equal to c. The parameter α in soft c-means is used to control the radius of each soft cluster during the computation of its centroid, and the membership values of the samples in the different clusters depend on the computed cluster centers. In our model, although the maximum number of possible clusters is equal to the number of leaf nodes, the actual number of clusters depends on the choice of α: with an increase in α, the number of clusters decreases. In other words, although the objective functional in Equation (8) is exactly the same as that in soft c-means, the effect of α is different in the two algorithms.

4. Experimental Results

We implemented the model in the MATLAB environment and experimented with its performance on both synthetic and real-life data sets.

4.1 Protocol

The entire batch of samples is repeatedly presented in the online mode, and we call the number of times the batch of samples is presented the number of epochs. If the data density is high or the dataset has a set of compact clusters, then we observe that the model takes fewer epochs to converge; for relatively low data density or in the presence of sparse clusters, the model takes a larger number of epochs. On average, we observed that the model converges near its local optimal solution within 5 epochs and sometimes even fewer, although the required number of epochs increases with the depth of the tree.

We normalize all input patterns such that each component of x lies in the range [−1, +1]. We normalize each component of x (say, x_i) as

x̂_i = (2x_i − (x_i^max + x_i^min)) / (x_i^max − x_i^min),    (30)

where x_i^max and x_i^min are the maximum and minimum values of x_i over all observations, respectively. In this normalization, we do not separately process the outliers; data outliers can badly influence the variables under such a normalization. Instead of linear scaling, the use of a non-linear scaling, or of a more robust scaling such as one based on the inter-quartile range, may further improve the performance of our model. However, in this paper we preserve the exact distribution of the samples and test the capability of the model in extracting the groups of samples in the online adaptive mode.

The performance of the model depends on the slope (parameter m) of the sigmoidal function (Equation (6)). In general, we observed that for a given control parameter α, the number of decision hyperplanes increases as the slope decreases (this is also consistent with Equation (24), where α* decreases with decreasing m). This is due to the fact that if we decrease the slope, then more than one leaf node receives significant activation to attract the codes (β) in the code formation layer, and thereby the number of clusters increases. On the other hand, with an increase in the slope, only a few (in most cases only one) leaf nodes are significantly activated to influence the decision hyperplanes, and therefore the number of clusters decreases. In general, for a small m, a group of leaf nodes together represents a compact cluster. As stated in Equation (7), for a given depth l of the adaptive tree, we can fix the slope as m ≥ (1/δ) log(l/ε), where δ is a margin between a point (pattern) and the closest decision hyperplane, and ε is a constant such that the minimum activation of a maximally activated node should be no less than 1 − ε. Experimentally, we fixed δ = 0.2 for ε = 0.1.
In other words, we fix the slope as m = 5 log(10 l) for a given depth l of the tree. Note that the value of δ we have chosen is valid for the normalized input space that we use.
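A minimal sketch of the preprocessing and slope choice described in this protocol (Equation (30), and the reading m = (1/δ) log(l/ε) of Equation (7)); the function names and example values are ours.

```python
import numpy as np

def normalize_to_unit_box(X):
    """Map each column of X into [-1, +1] as in Equation (30); no outlier handling."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (2.0 * X - (x_max + x_min)) / (x_max - x_min)

def slope(depth, delta=0.2, eps=0.1):
    # m = (1/delta) * log(depth/eps); with delta=0.2, eps=0.1 this is 5*log(10*depth)
    return np.log(depth / eps) / delta

X = np.array([[2.0, 10.0], [4.0, 30.0], [6.0, 50.0]])
print(normalize_to_unit_box(X))   # each column now spans [-1, +1]
print(slope(depth=4))             # ~18.4
```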

If the data is not normalized and the input range is larger, then a larger value of δ can also be selected.

We also observed the behavior of the model with respect to the control parameter α: as we increase α, the number of clusters extracted by the model decreases, and vice versa. In the adaptation rule (Equation (11)), as stated before, we always select λ as unity, since we operate in the normalized input space. We also select the constant γ = 1 in computing the learning rate, irrespective of any other parameter. We initialize β to zero, and we initialize θ at each nonterminal node to zero. We initialize the weight vectors w such that every component of each w is the same; in other words, for an n-dimensional input space, we initialize w such that every component of w is equal to 1/√n, which makes ‖w‖ = 1. Therefore, the performance of the model depends solely on the order in which the samples are presented, independently of the initial condition (since the initial condition is the same in all experiments). We also experimented with initializing each weight vector from a randomly generated Gaussian distribution, and observed that initialization with randomly generated, normally distributed weight vectors can sometimes lead to better results than the fixed initialization (all weights being equal). In the next section, we illustrate the results for the Iris and Crabs datasets with an initialization using normally distributed weight vectors. We then perform class discovery with the fixed initialization (all weights being equal), and report the results on a face dataset using the same fixed initialization. We experimented with different depths of the tree, and naturally the number of clusters extracted by the model increases with the depth of the adaptive tree.

4.2 Examples

4.2.1 Synthetic Data

We used a synthetically generated dataset, drawn from a mixture of Gaussians, to test the performance of the model. Figure 2 illustrates the clusters extracted by the model for different values of α and different depths. The objective of this experiment is to show that the model behaves in the manner discussed above, i.e., with an increase in α the number of clusters decreases and vice versa; similarly, with an increase in depth the number of clusters increases. To illustrate the effect of α on the grouping of the data, Figure 3 shows the results for α = 0.1, 0.2, 0.4, and 0.6, respectively, for an adaptive tree of depth l = 5 with m = 5 for 2 epochs. We observe that for α = 0.1, a relatively large number of regions is identified by the tree; this is because the regions are created initially and, owing to the smaller value of α, they are not merged with other regions. As α is increased to 0.6, we observe that different groups of data are merged into one segment by the tree, and seven different groups are identified.

Crabs Data

In the crabs dataset [25], there are in total 200 samples, each sample having five attributes. The crabs are from two different species, and each species has male and female categories. Thus there are in total four different classes, and the data set contains 50 samples from each class. The samples from the different classes are highly overlapped when the data is projected onto the two-dimensional space spanned by the first and second principal components. Interestingly, the four classes are nicely observable when the data is displayed in the space spanned by the second and third principal components.
We therefore obtain the second and third principal components and project the dataset onto these two principal components to obtain the two-dimensional projected data. We then normalize such that each projected sample is bounded in [−1, +1].
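A hedged sketch of this preprocessing step for the crabs data: project onto the second and third principal components and rescale to [−1, +1]. The loading of the actual crabs measurements is replaced by a random stand-in, and all names are ours.

```python
import numpy as np

def project_pc23(X):
    """Project rows of X onto the 2nd and 3rd principal components (0-based columns 1 and 2)."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[1:3].T                        # scores on PC2 and PC3

def to_unit_box(Z):
    z_min, z_max = Z.min(axis=0), Z.max(axis=0)
    return (2.0 * Z - (z_max + z_min)) / (z_max - z_min)

# Stand-in for the 200 x 5 crabs measurements (the real data is not bundled here).
X = np.random.default_rng(0).normal(size=(200, 5))
Z = to_unit_box(project_pc23(X))
print(Z.shape, Z.min(axis=0), Z.max(axis=0))     # (200, 2), each component spanning [-1, +1]
```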

Fig. 2: The decision hyperplanes constructed by the model for different depths and different control parameters, with a constant m = 5. The constructed hyperplanes are shown for (a) depth 3 and α = 0.2, (b) depth 3 and α = 0.5, (c) depth 3 and α = 3.0, (d) depth 4 and α = 0.2, (e) depth 4 and α = 3.0, (f) depth 4 and α = 4.0, (g) depth 5 and α = 0.2, and (h) depth 5 and α = 4.0.

Fig. 3: The decision hyperplanes constructed by the tree with a constant m = 5 and depth equal to 5, as the control parameter α is varied. (a), (b), (c), and (d) illustrate the behavior of the tree for α = 0.1, α = 0.2, α = 0.4, and α = 0.6, respectively.

Figure 4(a)-(d) illustrates the behavior of the model for depths equal to two through five with a constant α = 0.2, and Figure 4(e)-(f) illustrates the behavior for depths equal to four and five with α = 2.0. In Figure 4(a), we observe that the model is not able to completely separate the data points from the different classes with only four regions; however, if we consider only the species information, then the two species are well separated with a depth of 2. In Figure 4(c), we observe that the model generated a region where samples from two different classes (same species, different sex) are mixed up. In all other situations, the model is able to separate the four different classes by allocating more than one leaf node to each class. As a comparison, we illustrate the results obtained with the fuzzy c-means algorithm (Appendix I) for different numbers of clusters. We experimented with different values of the exponent h and observed that for a relatively small number of clusters (such as 4 or 5), the results are the same over a large range of exponent values; as the number of clusters increases, the nature of the grouping depends on the exponent value. We report the results of FCM for different numbers of clusters with a fixed exponent h = 2.0. We observe that the proposed clustering algorithm is able to separate the different class structures as well as the fuzzy c-means algorithm does, although the proposed algorithm is hierarchical in nature.

Iris Data

We illustrate the behavior of the model for the iris dataset (the original dataset was reported in [26]; we obtained the data from [27]). In the iris dataset, there are three different types of iris flowers, each category having four different features, namely sepal length, sepal width, petal length, and petal width. There are 50 samples from each class, resulting in a total of 150 samples in the four-dimensional space. The three different classes can be nicely identified when the data set is displayed in a two-dimensional space spanned by two derived features, namely the sepal area (sepal length × sepal width) and the petal area (petal length × petal width). We transform the data into these two dimensions and then normalize. Figure 6 illustrates the behavior of the model for this dataset with these two normalized derived features for depths equal to two through four with a constant α = 0.2. We observe that a model with depth equal to 2 is able to separate one class completely; however, it creates one region (corresponding to one leaf node) where samples from two different classes are mixed up. The purity of the regions improves considerably when we use a depth equal to 3. For a depth equal to 4, we observe that the model is able to perfectly separate the three different classes of the data. As a comparison, we illustrate the results obtained with the fuzzy c-means algorithm (Appendix I) for different numbers of clusters. We experimented with different values of the exponent h and observed that for a relatively small number of clusters (such as 4 or 5), the results are the same over a large range of exponent values; as the number of clusters increases, the nature of the grouping depends on the exponent value. We report the results of FCM for different numbers of clusters with a fixed exponent h = 2.0, and we report only those results where the performance of FCM was visually most appealing. We observe that even for 3 clusters, FCM is able to separate one class completely, just as the proposed model does with a depth of 2.
As the number of clusters increases, the FCM algorithm is able to partially separate the other two classes. However, we observe that the class lying between the other two is better separated by the proposed model than by FCM: in our model the class structure is captured by two approximately parallel lines, which is not so apparent in FCM, although FCM separates this class to a certain extent from the other two classes.
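A small sketch of the two derived iris features used above (sepal area and petal area), followed by the same [−1, +1] rescaling. The scikit-learn loader is used here purely for convenience and is our choice, not the paper's.

```python
import numpy as np
from sklearn.datasets import load_iris   # convenience loader; any source of the 150 x 4 iris matrix works

X = load_iris().data                      # columns: sepal length, sepal width, petal length, petal width
sepal_area = X[:, 0] * X[:, 1]
petal_area = X[:, 2] * X[:, 3]
Z = np.column_stack([sepal_area, petal_area])

# rescale each derived feature to [-1, +1], as in Equation (30)
z_min, z_max = Z.min(axis=0), Z.max(axis=0)
Z = (2.0 * Z - (z_max + z_min)) / (z_max - z_min)
print(Z.shape)                            # (150, 2): input to the depth-2..4 trees discussed above
```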

Fig. 4: The decision hyperplanes constructed by the proposed model on the crabs dataset projected onto the two-dimensional plane spanned by the second and third principal components. The four different classes in the dataset are marked with different symbols. The figure illustrates the hyperplanes with a constant m = 5 for (a) depth 2 and α = 0.2, (b) depth 3 and α = 0.2, (c) depth 4 and α = 0.2, (d) depth 5 and α = 0.2, (e) depth 4 and α = 2.0, and (f) depth 5 and α = 2.0.

Fig. 5: The decision hyperplanes constructed by the FCM algorithm on the crabs dataset projected onto the two-dimensional plane spanned by the second and third principal components. The four different classes in the dataset are marked with different symbols. The figure illustrates the hyperplanes with a constant h = 2.0 for the number of clusters equal to (a) 4, (b) 6, (c) 8, (d) 10, (e) 12, and (f) 14.

Fig. 6: The decision hyperplanes constructed by the proposed model on the iris dataset projected onto the two-dimensional plane spanned by the sepal area (product of the first and second attribute values) and the petal area (product of the third and fourth attribute values). The figure illustrates the hyperplanes with a constant m = 5 and α = 0.2 for (a) depth equal to 2, (b) depth equal to 3, and (c) depth equal to 4.

Fig. 7: The decision hyperplanes constructed by the FCM algorithm on the iris dataset projected onto the two-dimensional plane spanned by the sepal area and the petal area. The figure illustrates the hyperplanes with a constant h = 2.0 for the number of clusters equal to (a) 3, (b) 4, (c) 6, (d) 7, (e) 9, and (f).

Unsupervised Classification on Real-life Data

Here we demonstrate the effectiveness of the model in performing class discovery on certain real-life data sets (from the UCI repository [27]) by means of unsupervised classification. In order to obtain the classification performance, we consider the samples allocated to each leaf node: for each leaf node, we obtain the class label of the majority of the samples allocated to it and assign that class label to the leaf node. Thus we assign a specific class label to each leaf node of the tree, depending on the group of samples allocated to that leaf node. For the crabs dataset, we use two class labels (species only) instead of the four labels. Apart from the iris and crabs datasets, we also use five other data sets, namely pima-indians-diabetes (originally reported in [28]; we call it PIMA), Wisconsin Diagnostic Breast Cancer (originally reported in [29]; in short, WDBC), Wisconsin Prognostic Breast Cancer (originally reported in [3]; in short, WPBC), the E-Coli bacteria dataset, and the BUPA liver disease dataset. We obtained these data sets from [27]. We modified the Ecoli data originally used in [3], and later in [32], for predicting protein localization sites: the original dataset consists of eight different classes, out of which three classes, namely outer membrane lipoprotein, inner membrane lipoprotein, and inner membrane cleavable signal sequence, have only five, two, and two instances respectively. We omitted the samples from these three classes and report the results for the remaining five classes. Table 1 summarizes the data set descriptions.

Table 1: Description of the pattern classification datasets (number of instances, features, and classes of each): Indian diabetes (Pima), Diagnostic Breast Cancer (Wdbc), Prognostic Breast Cancer (Wpbc), Liver Disease (Bupa), Flower (Iris), Bacteria (Ecoli), and Crabs.

In performing the unsupervised classification on these real-life datasets, we compare the results of our model with those obtained with the fuzzy c-means clustering algorithm. As a further comparison, we also provide the cross-validated results obtained by a supervised decision tree, C4.5 [8, 33], on these data sets. In our model, randomness can enter in two different ways: in the initialization process, and in the order of presentation of the samples in the online mode. We eliminate the randomness in the initialization process by using a fixed initialization: we initialize all components of all the weight vectors (w) to be equal, initialize the bias component (θ) to zero for all intermediate nodes, and initialize the code vectors (β) to zero. We experiment with different random orders of presentation of the samples in the online mode and report the average performance over different trials for each dataset and each tree. In the case of fuzzy c-means, the second source of randomness is not present, since it is a batch-mode algorithm; however, the performance of the FCM algorithm depends on the initialization, and we report the average performance of the FCM algorithm over different trials.
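The leaf-labeling evaluation described above reduces to a few lines: each leaf is labeled by the majority class of the samples allocated to it, and the unsupervised classification accuracy is the fraction of samples whose leaf label matches their true class. A minimal sketch, with our own names:

```python
import numpy as np

def leaf_majority_accuracy(leaf_ids, y_true):
    """leaf_ids[i]: index of the maximally activated leaf for sample i; y_true[i]: its class label."""
    correct = 0
    for leaf in np.unique(leaf_ids):
        members = y_true[leaf_ids == leaf]
        majority = np.bincount(members).argmax()    # class label assigned to this leaf
        correct += (members == majority).sum()
    return correct / len(y_true)

# toy example: 3 leaves over 8 samples from 2 classes
leaf_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_true   = np.array([0, 0, 1, 1, 1, 1, 0, 0])
print(leaf_majority_accuracy(leaf_ids, y_true))     # 0.75
```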

In performing the unsupervised classification, we experimented with different values of α and different depths of the tree. We observed that the performance depends on the value of α: if α is very low, then the classification accuracy degrades due to fragmentation; the performance improves with an increase in α; and the classification accuracy degrades again if the value of α becomes large, since many samples are grouped together. Figure 8 illustrates this behavior of the model on three different datasets, namely PIMA, WDBC, and ECOLI.

Fig. 8: Dependence of the classification accuracy on the parameter α of the adaptive tree, for trees of depth 3 to 7, on (a) the PIMA dataset, (b) the WDBC dataset, and (c) the ECOLI dataset.

In Table 2, we report the best results obtained by the proposed model and by FCM. As a comparison, we provide the cross-validated results obtained by the supervised C4.5 decision tree. For the fuzzy c-means algorithm, we experimented with different combinations of the number of clusters and the exponent (h); we increased the number of clusters up to 15 and the exponent (h) up to 5.0. For each dataset, we report the best result obtained by the FCM algorithm, along with the number of clusters and the exponent for each score. In the case of our model, we likewise tested different combinations of the depth of the adaptive tree and the parameter α; we increased the depth up to 7 and α up to 3.0. We report the best results and the corresponding value of α for each depth of the adaptive tree. Since the performance of the adaptive tree depends on the sequence in which the samples appear, we report the standard deviation of the scores over trials with different input sequences.

Similarly, the performance of the FCM depends on the initialization, and we report the standard deviation over different trials. We observe that the adaptive tree is able to produce better classification accuracy than FCM for the BUPA, PIMA, and IRIS datasets. On the other hand, FCM is much better for the CRABS dataset, and also performs better for the WDBC and ECOLI datasets. The performances are comparable for the WPBC dataset. In other words, we observe that although we perform hierarchical clustering, the adaptive tree is able to produce results comparable with a partitional clustering algorithm on certain datasets. We also compare the results with those produced by agglomerative hierarchical clustering (AHC) using the group average linkage (dendrogram). We have constrained the AHC to produce at most 20 clusters for each dataset, and we report the results for 20 clusters. We observe that in the case of AHC the results consistently improve as we increase the number of clusters; in order to make it comparable with the fuzzy c-means, we have constrained it to 20 clusters. We observe that the adaptive tree is able to produce comparable, and sometimes better, results than AHC when AHC is constrained to 20 clusters.

Table 2: Classification accuracies obtained by the unsupervised online adaptive tree (OADT stands for online adaptive decision tree) at different depths, together with the corresponding values of α and the standard deviations over trials, on the BUPA, PIMA, WDBC, WPBC, IRIS, ECOLI, and CRABS datasets. As a comparison, the table also reports the results produced by the fuzzy c-means algorithm (with the number of clusters and the exponent h), the dendrogram (AHC), and the supervised C4.5 decision tree.

We validate the clustering produced by the adaptive tree by computing the F-measure validation index [5]. In Table 3, we report the F-measure indices for a fixed value of α for different depths of the tree on all the datasets of Table 1. We also tested the effectiveness of our model in performing unsupervised classification of acute leukemia samples from gene expression measured with DNA microarrays, as used in Golub et al. [34]. In this data set, there are 72 samples consisting of two different types of leukemia, namely acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). For each


More information

Bioinformatics - Lecture 07

Bioinformatics - Lecture 07 Bioinformatics - Lecture 07 Bioinformatics Clusters and networks Martin Saturka http://www.bioplexity.org/lectures/ EBI version 0.4 Creative Commons Attribution-Share Alike 2.5 License Learning on profiles

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

More information

Chapter DM:II. II. Cluster Analysis

Chapter DM:II. II. Cluster Analysis Chapter DM:II II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-1

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

K-Means Clustering 3/3/17

K-Means Clustering 3/3/17 K-Means Clustering 3/3/17 Unsupervised Learning We have a collection of unlabeled data points. We want to find underlying structure in the data. Examples: Identify groups of similar data points. Clustering

More information

Uninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall

Uninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall Midterm Exam Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall Covers topics through Decision Trees and Random Forests (does not include constraint satisfaction) Closed book 8.5 x 11 sheet with notes

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Unsupervised: no target value to predict

Unsupervised: no target value to predict Clustering Unsupervised: no target value to predict Differences between models/algorithms: Exclusive vs. overlapping Deterministic vs. probabilistic Hierarchical vs. flat Incremental vs. batch learning

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1

More information

Applying the Possibilistic C-Means Algorithm in Kernel-Induced Spaces

Applying the Possibilistic C-Means Algorithm in Kernel-Induced Spaces 1 Applying the Possibilistic C-Means Algorithm in Kernel-Induced Spaces Maurizio Filippone, Francesco Masulli, and Stefano Rovetta M. Filippone is with the Department of Computer Science of the University

More information

Comparing Univariate and Multivariate Decision Trees *

Comparing Univariate and Multivariate Decision Trees * Comparing Univariate and Multivariate Decision Trees * Olcay Taner Yıldız, Ethem Alpaydın Department of Computer Engineering Boğaziçi University, 80815 İstanbul Turkey yildizol@cmpe.boun.edu.tr, alpaydin@boun.edu.tr

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Clustering with Reinforcement Learning

Clustering with Reinforcement Learning Clustering with Reinforcement Learning Wesam Barbakh and Colin Fyfe, The University of Paisley, Scotland. email:wesam.barbakh,colin.fyfe@paisley.ac.uk Abstract We show how a previously derived method of

More information

A Unified Framework to Integrate Supervision and Metric Learning into Clustering

A Unified Framework to Integrate Supervision and Metric Learning into Clustering A Unified Framework to Integrate Supervision and Metric Learning into Clustering Xin Li and Dan Roth Department of Computer Science University of Illinois, Urbana, IL 61801 (xli1,danr)@uiuc.edu December

More information

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors Dejan Sarka Anomaly Detection Sponsors About me SQL Server MVP (17 years) and MCT (20 years) 25 years working with SQL Server Authoring 16 th book Authoring many courses, articles Agenda Introduction Simple

More information

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013 Machine Learning Topic 5: Linear Discriminants Bryan Pardo, EECS 349 Machine Learning, 2013 Thanks to Mark Cartwright for his extensive contributions to these slides Thanks to Alpaydin, Bishop, and Duda/Hart/Stork

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

Univariate and Multivariate Decision Trees

Univariate and Multivariate Decision Trees Univariate and Multivariate Decision Trees Olcay Taner Yıldız and Ethem Alpaydın Department of Computer Engineering Boğaziçi University İstanbul 80815 Turkey Abstract. Univariate decision trees at each

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

Using the Kolmogorov-Smirnov Test for Image Segmentation

Using the Kolmogorov-Smirnov Test for Image Segmentation Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer

More information

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Implementation: Real machine learning schemes Decision trees Classification

More information

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

The Projected Dip-means Clustering Algorithm

The Projected Dip-means Clustering Algorithm Theofilos Chamalis Department of Computer Science & Engineering University of Ioannina GR 45110, Ioannina, Greece thchama@cs.uoi.gr ABSTRACT One of the major research issues in data clustering concerns

More information

Kernel Based Fuzzy Ant Clustering with Partition validity

Kernel Based Fuzzy Ant Clustering with Partition validity 2006 IEEE International Conference on Fuzzy Systems Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 6-2, 2006 Kernel Based Fuzzy Ant Clustering with Partition validity Yuhua Gu and Lawrence

More information

All lecture slides will be available at CSC2515_Winter15.html

All lecture slides will be available at  CSC2515_Winter15.html CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 9: Support Vector Machines All lecture slides will be available at http://www.cs.toronto.edu/~urtasun/courses/csc2515/ CSC2515_Winter15.html Many

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

Further Applications of a Particle Visualization Framework

Further Applications of a Particle Visualization Framework Further Applications of a Particle Visualization Framework Ke Yin, Ian Davidson Department of Computer Science SUNY-Albany 1400 Washington Ave. Albany, NY, USA, 12222. Abstract. Our previous work introduced

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

K-Means Clustering. Sargur Srihari

K-Means Clustering. Sargur Srihari K-Means Clustering Sargur srihari@cedar.buffalo.edu 1 Topics in Mixture Models and EM Mixture models K-means Clustering Mixtures of Gaussians Maximum Likelihood EM for Gaussian mistures EM Algorithm Gaussian

More information

Fall 09, Homework 5

Fall 09, Homework 5 5-38 Fall 09, Homework 5 Due: Wednesday, November 8th, beginning of the class You can work in a group of up to two people. This group does not need to be the same group as for the other homeworks. You

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines Quiz Section Week 8 May 17, 2016 Machine learning and Support Vector Machines Another definition of supervised machine learning Given N training examples (objects) {(x 1,y 1 ), (x 2,y 2 ),, (x N,y N )}

More information

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

More information

Unsupervised Learning: Clustering

Unsupervised Learning: Clustering Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer Machine Learning Supervised Learning Unsupervised Learning

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

Generative and discriminative classification techniques

Generative and discriminative classification techniques Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14

More information

CLUSTERING IN BIOINFORMATICS

CLUSTERING IN BIOINFORMATICS CLUSTERING IN BIOINFORMATICS CSE/BIMM/BENG 8 MAY 4, 0 OVERVIEW Define the clustering problem Motivation: gene expression and microarrays Types of clustering Clustering algorithms Other applications of

More information

COMS 4771 Clustering. Nakul Verma

COMS 4771 Clustering. Nakul Verma COMS 4771 Clustering Nakul Verma Supervised Learning Data: Supervised learning Assumption: there is a (relatively simple) function such that for most i Learning task: given n examples from the data, find

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Associative Cellular Learning Automata and its Applications

Associative Cellular Learning Automata and its Applications Associative Cellular Learning Automata and its Applications Meysam Ahangaran and Nasrin Taghizadeh and Hamid Beigy Department of Computer Engineering, Sharif University of Technology, Tehran, Iran ahangaran@iust.ac.ir,

More information

Introduction to R and Statistical Data Analysis

Introduction to R and Statistical Data Analysis Microarray Center Introduction to R and Statistical Data Analysis PART II Petr Nazarov petr.nazarov@crp-sante.lu 22-11-2010 OUTLINE PART II Descriptive statistics in R (8) sum, mean, median, sd, var, cor,

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki Wagner Meira Jr. Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA Department

More information

Lecture #11: The Perceptron

Lecture #11: The Perceptron Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be

More information

Clustering algorithms and introduction to persistent homology

Clustering algorithms and introduction to persistent homology Foundations of Geometric Methods in Data Analysis 2017-18 Clustering algorithms and introduction to persistent homology Frédéric Chazal INRIA Saclay - Ile-de-France frederic.chazal@inria.fr Introduction

More information

CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA. By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr.

CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA. By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr. CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr. Michael Nechyba 1. Abstract The objective of this project is to apply well known

More information

Methods for Intelligent Systems

Methods for Intelligent Systems Methods for Intelligent Systems Lecture Notes on Clustering (II) Davide Eynard eynard@elet.polimi.it Department of Electronics and Information Politecnico di Milano Davide Eynard - Lecture Notes on Clustering

More information

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Midterm Examination CS540-2: Introduction to Artificial Intelligence Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search

More information

CloNI: clustering of JN -interval discretization

CloNI: clustering of JN -interval discretization CloNI: clustering of JN -interval discretization C. Ratanamahatana Department of Computer Science, University of California, Riverside, USA Abstract It is known that the naive Bayesian classifier typically

More information

Introduction to Artificial Intelligence

Introduction to Artificial Intelligence Introduction to Artificial Intelligence COMP307 Machine Learning 2: 3-K Techniques Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline K-Nearest Neighbour method Classification (Supervised learning) Basic NN (1-NN)

More information

Machine Learning: Think Big and Parallel

Machine Learning: Think Big and Parallel Day 1 Inderjit S. Dhillon Dept of Computer Science UT Austin CS395T: Topics in Multicore Programming Oct 1, 2013 Outline Scikit-learn: Machine Learning in Python Supervised Learning day1 Regression: Least

More information

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2 161 Machine Learning Hierarchical clustering Reading: Bishop: 9-9.2 Second half: Overview Clustering - Hierarchical, semi-supervised learning Graphical models - Bayesian networks, HMMs, Reasoning under

More information

Comparative Study of Clustering Algorithms using R

Comparative Study of Clustering Algorithms using R Comparative Study of Clustering Algorithms using R Debayan Das 1 and D. Peter Augustine 2 1 ( M.Sc Computer Science Student, Christ University, Bangalore, India) 2 (Associate Professor, Department of Computer

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

Machine Learning in Biology

Machine Learning in Biology Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CX 4242 DVA March 6, 2014 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Analyze! Limited memory size! Data may not be fitted to the memory of your machine! Slow computation!

More information

Module 4. Non-linear machine learning econometrics: Support Vector Machine

Module 4. Non-linear machine learning econometrics: Support Vector Machine Module 4. Non-linear machine learning econometrics: Support Vector Machine THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Introduction When the assumption of linearity

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

4. Ad-hoc I: Hierarchical clustering

4. Ad-hoc I: Hierarchical clustering 4. Ad-hoc I: Hierarchical clustering Hierarchical versus Flat Flat methods generate a single partition into k clusters. The number k of clusters has to be determined by the user ahead of time. Hierarchical

More information

Hierarchical clustering

Hierarchical clustering Hierarchical clustering Rebecca C. Steorts, Duke University STA 325, Chapter 10 ISL 1 / 63 Agenda K-means versus Hierarchical clustering Agglomerative vs divisive clustering Dendogram (tree) Hierarchical

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Learning without Class Labels (or correct outputs) Density Estimation Learn P(X) given training data for X Clustering Partition data into clusters Dimensionality Reduction Discover

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information