Online Adaptive Hierarchical Clustering in a Decision Tree Framework


Journal of Pattern Recognition Research 2 (2). Received March 8; accepted May 7.

Jayanta Basak
NetApp India Private Limited, Advanced Technology Group, Bangalore, India
basak@netapp.com, basakjayanta@yahoo.com

Abstract

Online clustering is an issue when crunching large amounts of data. Moreover, a coarse-to-fine grain analysis is also desirable. We address both of these problems in a single framework by designing an online adaptive hierarchical clustering algorithm in a decision tree framework. Our model consists of an online adaptive binary tree and a code formation layer. The adaptive tree hierarchically partitions the data, and the finest-level clusters are represented by the leaf nodes. The code formation layer stores the representative codes of the clusters corresponding to the leaf nodes, and the tree adapts the separating hyperplanes between the clusters at every layer in an online adaptive mode. The membership of a sample in a cluster is decided by the tree, and the tree parameters are guided by the stored codes. As opposed to existing hierarchical clustering techniques, where some local objective function is optimized at every level, we adapt the tree in an online adaptive mode by minimizing a global objective functional. We use the same global objective functional as the fuzzy c-means algorithm (FCM); however, we observe that the effect of the control parameter is different from that in the FCM. In our model the control parameter regulates the size and the number of clusters, whereas in the FCM the number of clusters is always constant (c). We never freeze the adaptation process: for every input sample, the tree allocates it to a certain leaf node and at the same time adapts the tree parameters simultaneously with the adaptation of the stored codes. We validate the effectiveness of our model on several real-life datasets and also show that the model is able to perform unsupervised classification on certain datasets.

Keywords: Decision tree, online adaptive learning, fuzzy c-means

Pattern clustering [1, 2] is a well-studied topic in the pattern recognition literature, in which samples are grouped into different clusters based on some self-similarity measure. Online clustering is an issue when crunching large amounts of data. Moreover, a coarse-to-fine grain analysis is also desirable. We address both of these problems in a single framework by designing an online adaptive hierarchical clustering algorithm in a decision tree framework. Depending on different criteria such as data representation, similarity/dissimilarity measure, interpretation, domain knowledge, and modality (e.g., incremental/batch mode), different clustering algorithms have been developed; a comprehensive discussion can be found in [2]. Methods of clustering can be generically partitioned into two groups, namely partitional and hierarchical. In partitional, or flat, clustering, the data samples are grouped into several disjoint groups depending on a criterion function, and there is no hierarchy between the clusters. Among the different partitional (flat) clustering algorithms, K-means and fuzzy c-means are widely studied in the literature. The quality of partitional clustering algorithms often depends on the objective functional that is minimized to produce the clustering.
Fuzzy c-means, or soft c-means, [3] (Appendix I) and its variants generally employ a global objective functional, and the clusters are described in terms of cluster centers and the membership values of the samples in the different clusters.
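Since the fuzzy c-means objective and its two-step update are referred to repeatedly below (the paper's Appendix I is not reproduced in this excerpt), here is a minimal NumPy sketch of the standard FCM iteration for orientation; the function name, arguments, and initialization strategy are our own illustrative choices, not the paper's.

```python
import numpy as np

def fcm(X, c, h=2.0, n_iter=100, eps=1e-9, seed=0):
    """X: (n_samples, n_dims); c: number of clusters; h: fuzzifier exponent (> 1)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iter):
        # squared distances to every center, kept strictly positive
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + eps
        # membership update: u_ij proportional to d_ij^(-2/(h-1)), rows sum to 1
        inv = d2 ** (-1.0 / (h - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # center update: weighted mean with weights u^h
        w = u ** h
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return centers, u
```

The exponent h plays the role of the fuzzifier (h = 1/α in the notation used later), and the number of clusters c is fixed by the user; this fixed c is the main point of contrast with the adaptive tree described below.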

Hierarchical clustering [1], on the other hand, generates nested hierarchical groupings of the data samples, where the top layers of the hierarchy contain a coarse-level grouping with fewer clusters and the lower layers represent finer-level groupings with a larger number of clusters. Hierarchical clustering is usually performed in two different ways, namely divisive and agglomerative. In agglomerative clustering, smaller clusters are grouped into larger clusters in a bottom-up manner [4, 5]. In divisive hierarchical clustering, the set of data samples is iteratively partitioned in a top-down manner to construct finer-level clusterings [6, 7, 8, 9, 10, 11]. In divisive clustering, hierarchical self-organizing maps have also been used, where at each level a growing self-organizing map is generated depending on the data distribution at that layer [, 9, 8].

Decision trees have been used for divisive hierarchical clustering [2, 3, 4, 5]. As an example, in [2], the construction of an unsupervised decision tree was mapped onto that of a supervised decision tree: samples are artificially injected such that empty regions contain sparse injected samples, and the decision tree is then constructed by discriminating between the original samples and the artificially injected samples. At every level of the tree, the number of injected samples is controlled depending on the number of actual data samples available at that node. In [4], the top-down induction mechanism of decision trees is used for clustering; it is specifically used to build the first-order logical decision tree representation of an inductive logic programming system. In predictive clustering [3], a decision tree structure is induced based on the attribute values such that the separation between the clusters and the homogeneity within clusters, in terms of the attribute values, are maximized; a value of the target variable is then associated with each cluster represented by a leaf of the tree. In [5], the dataset is iteratively partitioned at each level of the tree by measuring the entropy of the histogram of the data distribution. However, in all these algorithms [2, 3, 4, 5], the decision tree is constructed based only on local splitting criteria at each node, as in the supervised counterpart of decision trees [6, 7, 8]. In such top-down divisive partitioning, if an error is committed at a top layer, it becomes very hard to correct it at a lower layer. To overcome this, efforts have been made to define a global objective function on a decision tree in the context of clustering, and the objective function is optimized to generate the clustering. For example, with a given tree topology, a stochastic approximation of the data distribution through deterministic annealing is performed in [9] to obtain groups of data samples at the leaf nodes of the tree. Essentially, in this approach the cross-entropy measure between the probabilistic cluster assignments in the leaf nodes and those in the parent nodes is minimized. The formulation is then extended to an online learning set-up to handle non-stationary data.

In this paper, we present a method for performing online adaptive hierarchical clustering in a decision tree framework (a preliminary version of this work can be found in [2]). Our model consists of two components, namely an online adaptive decision tree and a code formation layer.
The adaptive tree component represents the clusters in a hierarchical manner, and the leaf nodes represent the clusters at the finest granularity. The code formation layer stores the representative codes of the clusters corresponding to the leaf nodes. For the adaptive tree, we use a complete binary tree of a specified depth. We use a global objective function that considers the activation of the leaf nodes and the stored codes together to represent the compactness of the clusters. We use the same objective function as the fuzzy c-means algorithm (Appendix I), where we replace the cluster centers by the stored codes.

In our model, we do not explicitly compute the cluster centers: the clusters are not represented by their centers but are determined by the separating hyperplanes of the adaptive tree. We adapt the tree parameters and the stored codes simultaneously in an online adaptive manner by minimizing the objective functional for each sample separately, considering the samples to be independently selected. We do not perform any estimation of the data distribution; rather, we perform deterministic gradient descent on the objective functional. In this adaptation process, we adapt the parameters of the entire tree for every sample. Therefore, unlike the usual divisive clustering, we do not recursively partition the dataset in a top-down manner. We never freeze the learning, i.e., whenever a new data sample is available, the leaf nodes of the tree are activated and we find the maximally activated leaf node to represent the corresponding cluster for the input sample; at the same time, we incrementally adapt the parameters of the tree and the stored codes. We do not decrease the learning rate explicitly depending on the number of iterations; rather, we compute the optimal learning rate for every sample to minimize the objective functional. We also observe that even though we consider a fixed tree topology, the tree structure adapts depending on the dataset and the order of appearance of the input samples. In this framework, we attain an online adaptive hierarchical clustering algorithm in a tree framework which exhibits stability as well as plasticity. Since we do not compute the cluster centers directly from the data (rather, they are determined by the tree parameters), we observe stability in the behavior of the tree. At the same time, the incremental learning rules provide plasticity. The selection of a global objective function relieves the tree of local partitioning rules: the hyperplanes at the intermediate nodes are adjusted based on the global behavior. This enables incremental learning and also reduces the risk of committing early errors due to local partitioning. The intuition of reducing error due to local partitioning is similar to that in the supervised OADT [2], where we performed classification of the Gaussian parity problem, which is not possible in a classical decision tree framework with local partitioning.

1. Description of the Model

The model (Figure 1) consists of two components, namely an online adaptive tree [2, 22] and a code formation layer. We have previously used the online adaptive tree for pattern classification and function approximation tasks in the supervised mode [2, 22]. Here we employ the online adaptive tree in the unsupervised mode for adaptive hierarchical clustering. The adaptive tree is a complete binary tree of a specified depth l. Each non-terminal node i has a local decision hyperplane characterized by a vector (w_i, θ_i) with ‖w_i‖ = 1, such that (w_i, θ_i) together has n free variables for an n-dimensional input space. Each nonterminal node of the adaptive tree receives the same input pattern x = [x_1, x_2, …, x_n] and the activation from its parent node. The root node receives only the input pattern x. Each nonterminal node produces two outputs, one for the left child and one for the right child. We number the nodes in breadth-first order such that the left child of node i is indexed as 2i and the right one as 2i + 1.
The activations of the two child nodes of a non-terminal node i are given as

u_{2i} = u_i f(w_i^T x + θ_i),
u_{2i+1} = u_i f(−(w_i^T x + θ_i)),    (1)

where u_i is the activation of node i and f(·) is the local soft partitioning function governed by the parameter vector (w_i, θ_i). We choose f(·) to be a sigmoidal function.
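To make the recursion in Equation (1) concrete, the sketch below propagates activations through a complete binary tree for a single input. It is a minimal NumPy illustration under our own naming conventions (nodes indexed 1..2^(l+1)−1 in breadth-first order, as in the paper, with an arbitrary toy slope), not the paper's MATLAB implementation.

```python
import numpy as np

def sigmoid(v, m):
    # f(v) = 1 / (1 + exp(-m v)); m controls the steepness of the soft split
    return 1.0 / (1.0 + np.exp(-m * v))

def node_activations(x, W, theta, depth, m):
    """W[i], theta[i] parameterize the hyperplane at non-terminal node i (1-based).
    Returns activations u[1..2^(depth+1)-1]; leaves are indices 2^depth .. 2^(depth+1)-1."""
    n_nodes = 2 ** (depth + 1) - 1
    u = np.zeros(n_nodes + 1)
    u[1] = 1.0                                # root activation is fixed to unity
    for i in range(1, 2 ** depth):            # every non-terminal node
        s = sigmoid(W[i] @ x + theta[i], m)
        u[2 * i] = u[i] * s                   # left child:  u_{2i}   = u_i f(+v)
        u[2 * i + 1] = u[i] * (1.0 - s)       # right child: u_{2i+1} = u_i f(-v), since f(-v) = 1 - f(v)
    return u

# toy check on a random tree of depth 3
rng = np.random.default_rng(0)
depth, n_dim, m = 3, 2, 15.0
W = rng.normal(size=(2 ** depth, n_dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # enforce ||w_i|| = 1
theta = np.zeros(2 ** depth)
u = node_activations(rng.uniform(-1, 1, n_dim), W, theta, depth, m)
print(u[2 ** depth:].sum())   # ~1.0: the leaf activations sum to the root activation (derived below)
```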

Fig. 1: Structure of the model. It has two components, namely the adaptive tree and a code formation layer. Every intermediate node and the root node of the adaptive tree receives the input pattern x; the code formation layer also receives the input pattern x.

The activation of the root node, indexed by 1, is always set to unity. The activation of a leaf node p can therefore be derived as

u_p = ∏_{i ∈ P_p} f( lr(i,p) (w_i^T x + θ_i) ),    (2)

where P_p is the path from the root node to the leaf node p, and lr(·,·) is an indicator function indicating whether the leaf node p lies on the left or the right path of node i:

lr(i,p) = +1 if p is on the left path of i, −1 if p is on the right path of i, and 0 otherwise.    (3)

Considering the property of the sigmoidal function f(·) that f(−v) = 1 − f(v), we observe that the sum of the activations of two siblings is always equal to that of their parent, irrespective of the level of the nodes (since every intermediate node uses a sigmoidal function to produce the activations of its left and right children). Since the sum of two sibling leaf nodes equals their parent, the sum of activations of two sibling parent nodes equals the parent at the next higher layer, and so on. We extend this reasoning to obtain (Appendix II)

∑_{j ∈ Ω} u_j = u_1,    (4)

where Ω represents the set of leaf nodes and u_1 is the activation of the root node. As mentioned before, we always set u_1 = 1, and therefore

∑_{j ∈ Ω} u_j = 1.    (5)

We have chosen f(·) as

f(v) = 1 / (1 + exp(−mv)),    (6)

where m determines the steepness of f(·). In the appendix (Appendix III), we show that

m ≥ (1/δ) log(l/ε),    (7)

where δ is a margin between a point (pattern) and the closest decision hyperplane, and ε is a constant such that the minimum activation of a maximally activated node should be no less than 1 − ε. For example, if we choose ε = 0.1 and δ = 0.1, then m ≥ 10 log(10 l).

The code formation layer stores the codes representative of the clusters corresponding to the leaf nodes. The codes are equivalent to the cluster centers used in the fuzzy c-means algorithm; however, we do not compute the codes as membership-weighted means. Corresponding to each leaf node j, there is a code β_j = (β_{j1}, β_{j2}, …, β_{jn}) stored in the code formation layer. Thus for N leaf nodes, corresponding to a maximum of N clusters, we have N code vectors β_1, β_2, …, β_N, where β_N = (β_{N1}, β_{N2}, …, β_{Nn}), as shown in Figure 1. The code formation layer receives the same input x and measures the deviation of the input x from the stored code β_j for an activated leaf node j, i.e., ‖x − β_j‖². If the leaf node activation is high and the deviation is also large, then the error (the value of the objective function) is large; if either the deviation or the leaf node activation is small, then the error should be small.
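A quick numeric sanity check of the slope bound, under our reading of the reconstructed Equation (7), m ≥ (1/δ) log(l/ε): with m chosen this way, a sample lying at least δ from every hyperplane on its path keeps its maximally activated leaf above 1 − ε. Names and toy values below are ours.

```python
import math

def min_slope(depth, delta, eps):
    # m >= (1/delta) * log(depth / eps), our reading of Equation (7)
    return math.log(depth / eps) / delta

depth, delta, eps = 5, 0.1, 0.1
m = min_slope(depth, delta, eps)
# margin scenario along a root-to-leaf path: every factor equals f(delta) = 1/(1+exp(-m*delta))
worst_leaf_activation = (1.0 / (1.0 + math.exp(-m * delta))) ** depth
print(m, worst_leaf_activation >= 1 - eps)   # e.g. m ~ 39.1, True
```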

Therefore, at each leaf node j we have an error ‖x − β_j‖² u_j^{1/α}, where u_j is the activation of leaf node j and α is a control parameter. We iterate to minimize the value of an objective function that measures the total error over all leaf nodes, by adapting the stored codes β and the stored parameter values (w, θ). In this model, the maximum number of possible clusters is N = 2^l, where l denotes the depth of the adaptive tree.

2. Learning Rules

We define a global objective functional E by considering the discrepancy between the input patterns and the existing codes in β, together with the activation of the leaf nodes:

E = ∑_x E(x) = (1/2) ∑_x ∑_{j ∈ Ω} ‖x − β_j‖² u_j^{1/α}(x),    (8)

where α is a constant which acts as a control parameter. The objective functional is identical to that of fuzzy c-means [3]. In fuzzy c-means (Appendix I), u represents the membership value of a sample in a specific cluster and β represents the cluster centers. The parameter α controls the fuzzification, or spread, of the membership values. In the limiting condition α = 1, fuzzy c-means becomes hard c-means (Appendix I). The number of clusters in fuzzy c-means is constant, as defined by the parameter c. The membership values and the cluster centers are updated in an iterative two-step EM-type algorithm, where the membership values are computed from the cluster centers in the first step and the cluster centers are updated using the membership values in the second step of each iteration.

The stored codes β in our model act as the reference points of the clusters, analogous to the cluster centers in fuzzy c-means [3]. However, in our model the clusters are not explicitly defined by the stored codes; rather, we obtain the clusters from the separating hyperplanes of the adaptive tree. We learn both β and the separating hyperplanes together, unlike fuzzy c-means [3]. Moreover, the parameter α in our model controls the number and size of the clusters, unlike the EM-based fuzzy c-means algorithm where the number of clusters is constant.

In the online mode, we compute the objective functional for each individual sample x as

E(x) = (1/2) ∑_{j ∈ Ω} ‖x − β_j‖² u_j^{1/α},    (9)

considering the samples to be independently selected. We then adapt the parameters of the model to minimize E(x) for each individual sample in the online mode. For a given α and an observed sample x, we minimize E(x) by steepest gradient descent such that

Δβ_j = −η_opt ∂E/∂β_j,  Δw_i = −η_opt ∂E/∂w_i,  Δθ_i = −η_opt ∂E/∂θ_i,    (10)

where η_opt is an optimal learning rate parameter decided in the online mode. Evaluating the partials in Equation (10), we obtain

Δβ_j = η_opt u_j^{1/α} (x − β_j),
Δw_i = −η_opt q_i x,
Δθ_i = −η_opt λ q_i,    (11)

where q_i is expressed as

q_i = (m / 2α) ∑_{j ∈ Ω} ‖x − β_j‖² (u_j)^{1/α} (1 − v_{ij}) lr(i,j).    (12)

Each nonterminal node i locally computes v_{ij}, which is given as

v_{ij} = f( (w_i^T x + θ_i) lr(i,j) ).    (13)
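As a concrete illustration of Equations (9) and (11), the fragment below evaluates the per-sample objective and applies the resulting update of the stored codes for one input. It is a hedged NumPy sketch with our own variable names and toy values; a fixed learning rate is used here, whereas the optimal rate is discussed next.

```python
import numpy as np

# Toy per-sample quantities (our own illustrative values, not from the paper):
x = np.array([0.3, -0.2])                    # one normalized input sample
u_leaf = np.array([0.82, 0.10, 0.05, 0.03])  # leaf activations, summing to 1 (Eq. 5)
beta = np.zeros((4, 2))                      # stored codes, initialized to zero
alpha, eta_opt = 0.8, 0.05                   # control parameter and a fixed toy learning rate

# Per-sample objective, Eq. (9): E(x) = 1/2 * sum_j ||x - beta_j||^2 * u_j^(1/alpha)
resid = x[None, :] - beta                    # (N_leaves, n_dims)
E_x = 0.5 * np.sum((resid ** 2).sum(1) * u_leaf ** (1.0 / alpha))

# Code update, first line of Eq. (11): d(beta_j) = eta * u_j^(1/alpha) * (x - beta_j)
beta += eta_opt * (u_leaf ** (1.0 / alpha))[:, None] * resid
print(E_x, beta[0])   # the maximally activated leaf's code moves most toward x
```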

Since (w, θ) at each nonterminal node has n free variables together (n being the input dimensionality), subject to the condition ‖w‖ = 1, the updating of w in Equation (11) is restricted by the normalization condition. The parameter w is updated in such a way that Δw lies in the hyperplane normal to w, which ensures that w · Δw = 0, i.e., the constraint ‖w‖ = 1 is satisfied for small changes after updating. Therefore, we take the projection of q_i x onto the normal hyperplane of w_i and obtain Δw_i as

Δw_i = −η_opt q_i (I − w_i w_i^T) x,    (14)

where I is the identity matrix. In Equation (11), the parameter λ represents the scaling of the learning rate parameter with respect to θ. We normalize the input variables such that x is bounded in [−1, +1]^n, and we select λ = 1. Note that if the input is not normalized, then a different value can be selected for λ.

Since we never freeze the adaptation process, we decide the learning rate adaptively based on the discrepancy measure E(x) for a sample x in the online mode. For every input sample x at every iteration, we select η_opt using the line search method (also used by Friedman in [23]) such that the change from E to E + ΔE is linearly extrapolated to zero. Equating the first-order approximation of ΔE (line search) to −E, we obtain

η_opt = E / ( ∑_{j ∈ Ω} (u_j)^{2/α} + ∑_{i ∈ N} q_i² (λ + ‖x‖² − (w_i^T x)²) ).    (15)

From Equation (15), we observe that for very compact or point clusters, η can become very large near the optima. In order to stabilize the behavior of the adaptive tree, we modify the learning rate as

η_opt = E / ( γ + ∑_{j ∈ Ω} (u_j)^{2/α} + ∑_{i ∈ N} q_i² (λ + ‖x‖² − (w_i^T x)²) ),    (16)

with a constant γ. The parameter γ acts as a regularizer in obtaining the learning rate. In our model, we chose γ = 1.

From the learning rules, we observe that the codes represented by β are updated such that they approximate the input pattern subject to the activation of the leaf nodes. The maximally activated leaf node attracts the corresponding β most strongly towards the input pattern. If there is a close match between the input pattern x and the stored code β corresponding to the maximally activated leaf node, then the decision hyperplanes are not perturbed much, since ‖x − β‖ becomes very small. On the other hand, if there is a mismatch, then the movement of the decision hyperplane is decided by the stored function lr(i,j). If the maximally activated leaf node j is on the left path of the nonterminal node i, then lr(i,j) = 1, i.e., w_i is moved in such a way that Δw_i has the opposite sign of x, subject to Δw_i lying in the normal hyperplane of w_i.
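The following sketch puts the reconstructed Equations (12), (14), and (16) together for one toy input on a depth-2 tree: it computes q_i for every non-terminal node, the regularized learning rate, and the projected hyperplane update. Names, indexing, and values are ours, and an explicit re-normalization of w_i is added as a safeguard on top of the paper's small-step argument.

```python
import numpy as np

def sigmoid(v, m):
    return 1.0 / (1.0 + np.exp(-m * v))

# lr(i, j): +1 if leaf j lies on the left path of node i, -1 on the right path, 0 otherwise.
def lr(i, j):
    while j > i:
        if j // 2 == i:
            return 1 if j % 2 == 0 else -1
        j //= 2
    return 0

# Toy depth-2 tree: internal nodes 1..3, leaves 4..7 (same indexing as the earlier sketch).
depth, n_dim, m, alpha, lam, gamma = 2, 2, 10.0, 0.8, 1.0, 1.0
rng = np.random.default_rng(1)
W = rng.normal(size=(4, n_dim)); W /= np.linalg.norm(W, axis=1, keepdims=True)
theta = np.zeros(4)
beta = rng.uniform(-1, 1, size=(4, n_dim))            # one code per leaf 4..7
x = rng.uniform(-1, 1, n_dim)

# Leaf activations (Eq. 2) and squared deviations from the codes.
u = {1: 1.0}
for i in range(1, 4):
    s = sigmoid(W[i] @ x + theta[i], m)
    u[2 * i], u[2 * i + 1] = u[i] * s, u[i] * (1.0 - s)
leaves = list(range(4, 8))
u_leaf = np.array([u[j] for j in leaves])
d2 = ((x - beta) ** 2).sum(axis=1)

# q_i for every internal node (Eq. 12), using v_ij from Eq. (13).
q = np.zeros(4)
for i in range(1, 4):
    for k, j in enumerate(leaves):
        if lr(i, j) != 0:
            v_ij = sigmoid((W[i] @ x + theta[i]) * lr(i, j), m)
            q[i] += (m / (2 * alpha)) * d2[k] * u_leaf[k] ** (1 / alpha) * (1 - v_ij) * lr(i, j)

# Regularized learning rate (Eq. 16) and projected hyperplane update (Eq. 14).
E_x = 0.5 * np.sum(d2 * u_leaf ** (1 / alpha))
denom = gamma + np.sum(u_leaf ** (2 / alpha)) \
        + np.sum(q[1:] ** 2 * (lam + x @ x - (W[1:] @ x) ** 2))
eta = E_x / denom
for i in range(1, 4):
    P = np.eye(n_dim) - np.outer(W[i], W[i])          # projector onto the normal hyperplane of w_i
    W[i] -= eta * q[i] * (P @ x)
    W[i] /= np.linalg.norm(W[i])                      # keep ||w_i|| = 1 after the small step
    theta[i] -= eta * lam * q[i]
print(eta, q[1:])
```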

Since the activated leaf node is on the left path, i.e., x is on the left half of the decision hyperplane, the decision hyperplane effectively moves opposite to x, increasing the activation of the activated leaf node. The magnitude of the shift depends on the activation value of the leaf and on the discrepancy between the corresponding stored code and the input pattern. Similar reasoning holds if the maximally activated leaf node j is on the right path of the nonterminal node i. We also observe that as we move towards the root node, the decision hyperplanes are perturbed by the cumulative effect of the entire subtree under the nonterminal node (if j is not under the subtree of i, then lr(i,j) = 0). We also notice that the updating rules for the decision hyperplanes are independent of each other. For example, a leaf node j can be on the left path of a nonterminal node i with lr(i,j) = 1, and yet be on the right path of i's left child. Therefore, the leaf nodes affect different nonterminal nodes differently at different levels, depending on the stored function lr(i,j).

In the adaptive process, if the input patterns are drawn from a stable distribution, then E decreases (due to the decrease in the discrepancy between the stored codes and the input patterns) and therefore η_opt decreases with the number of iterations; the updating rules then reveal that the corresponding codes in β are also less affected by newly presented input patterns. In our model, we do not explicitly decrease η through any learning schedule; rather, η_opt adjusts itself depending on the input distribution.

The overall algorithm for adapting the tree in the online mode is as follows.

Step 1: Initialize the tree parameters (w) and the code vectors (β).
Repeat the next steps whenever a new pattern x appears (in the online mode we never freeze the learning):
Step 2: For the given x, compute the leaf node activations.
Step 3: For the given x, compute q for each intermediate node of the tree (Equation 12).
Step 4: For the given x, compute η_opt (Equation 16).
Step 5: For the given x, compute Δw for all intermediate nodes (Equation 14) and Δβ for all leaf nodes (Equation 11).
Step 6: For the given x, update w as w = w + Δw for all intermediate nodes and update β as β = β + Δβ for all leaf nodes.
Continue Steps 2-6 as each new pattern is presented to the model.

3. Choice of α

The objective functional in Equation (8) is exactly the same as that in fuzzy c-means [24, 3] (Appendix I), where β corresponds to the cluster centers, u corresponds to the membership values, and h = 1/α. However, h in soft c-means is chosen in [1, ∞) (i.e., 1/α ∈ [1, ∞)), and usually h > 1, which means α < 1. In the limit α = 1, the fuzzy c-means algorithm becomes the hard c-means algorithm (Appendix I). In fuzzy c-means, α is chosen less than unity (i.e., h > 1) in order to make the membership function wider and obtain smoother cluster boundaries. However, α in fuzzy c-means has no role in deciding the number of clusters, the number of clusters being a user-defined parameter. In our model, on the other hand, the leaf node activations do not directly depend on the computed codes in the code formation layer. Rather, the decision hyperplanes stored in the intermediate nodes of the adaptive tree determine the leaf node activations.

The code formation layer, parameterized by the β matrix, stores the respective codes corresponding to the groups of samples identified by the adaptive tree, each leaf node representing one group of samples. The decision hyperplanes in the adaptive tree and the codes in the code formation layer are adapted together to minimize the objective functional (Equation (8)). The number of clusters in our model is guided by the choice of α. In the limiting condition, as α → 0, E → 0, and the model does not adapt at all and remains in its initial configuration. On the other hand, for α → ∞, the individual leaf node activations play no role in E, and the parameters of the adaptive tree are adapted in such a way that all samples are merged into one group. We can also observe the same behavior from the combination of the adaptation rules and the learning rate (Equations (12) and (15)): with an increase in the value of α, the magnitude of the changes in the parameter values increases, which makes the model more adaptive. In other words, even though there are 2^l leaf nodes corresponding to a maximum of 2^l clusters, the effective number of clusters depends on the parameter α. The effective number of clusters is the number of leaf nodes that are maximally activated for some input sample: if there exists a leaf node which is never maximally activated for any input sample, then that leaf node does not represent any cluster. In this model, the effective number of clusters therefore depends on the control parameter α.

We explain the effect of α with a simple example. Let there be only two different samples x_1 and x_2, and let the model have only two leaf nodes. Without loss of generality, let us denote the activations of the leaf nodes by u_1 and u_2 respectively, and let the distance between the two samples be d = ‖x_1 − x_2‖. Ideally, there are two possible cases.

Case I: The sample x_1 forms one cluster and x_2 forms the second one, such that the codes formed are β_1 = x_1 and β_2 = x_2. For the minimum value of E, a separating hyperplane exactly normal to the vector x_1 − x_2 and equidistant from the samples is constructed. In that case, when x_1 is presented to the model, u_1(x_1) = 1/(1 + exp(−md/2)) and u_2(x_1) = 1/(1 + exp(md/2)). In such a scenario,

E(x_1) = ‖x_1 − β_1‖² u_1^{1/α} + ‖x_1 − β_2‖² u_2^{1/α},    (17)

i.e.,

E(x_1) = d² (1 + exp(md/2))^{−1/α}.    (18)

We get exactly the same value for E(x_2) due to symmetry, and therefore the total loss is

E_1 = E(x_1) + E(x_2) = 2d² (1 + exp(md/2))^{−1/α}.    (19)

Case II: Both samples x_1 and x_2 form one cluster and the code β_1 is placed exactly in the middle, such that β_1 = (x_1 + x_2)/2. Since both samples are in the same cluster, there is no separating hyperplane between them (ideally it is at an infinite distance from x_1 and x_2), and only one leaf node is activated, such that u_1 = 1 and u_2 = 0 for both x_1 and x_2. In such a case,

E(x_1) = ‖x_1 − β_1‖² u_1^{1/α} + ‖x_1 − β_1‖² u_2^{1/α},    (20)

i.e.,

E(x_1) = d²/4.    (21)

Since we obtain exactly the same value for E(x_2), the total loss is

E_2 = E(x_1) + E(x_2) = d²/2.    (22)

From Equations (19) and (22), we obtain

E_1 / E_2 = 4 (1 + exp(md/2))^{−1/α}.    (23)

From Equation (23), we observe that there exists an

α* = (1/2) log_2(1 + exp(md/2))    (24)

such that for α ≥ α*, E_1 ≥ E_2. In other words, since we minimize the objective functional, the scenario in Case II is preferred over that in Case I if we choose α > α*, i.e., both samples are clustered into the same group. On the other hand, for a choice of α < α*, the samples prefer to represent two different groups (since in that case E_1 < E_2). As discussed in the appendix (Appendix III),

(1 + exp(−mδ))^{−l} ≥ 1 − ε,    (25)

where l is the depth of the tree, δ is the minimum distance of a sample from the nearest hyperplane, and (1 − ε) is the minimum activation of the corresponding cluster (the activation of the maximally activated leaf node). In this example, l = 1, since there are only two leaf nodes. If we consider δ = d/2, then

1 / (1 + exp(−md/2)) ≥ 1 − ε.    (26)

After simplification,

1 + exp(md/2) ≥ 1/ε.    (27)

Therefore, from Equations (24) and (27), we have

α* ≥ (1/2) log_2(1/ε).    (28)

Therefore, if we want to prevent fragmentation, i.e., assign both x_1 and x_2 to the same cluster, then we should have

α ≥ α* ≥ (1/2) log_2(1/ε).    (29)

In order to assign both samples to a single cluster, the required value of α thus follows from the chosen ε; for example, if we consider ε = 0.25, we should have α ≥ 1. The same reasoning is valid for more than two samples with multiple levels in the adaptive tree; in the case of multiple levels, we assume that the two closest points are separated only at the last level of the tree. Therefore, for a small α, the adaptive tree can partition single small clusters into multiple ones. In short, for a choice of large α, the number of groups formed by the adaptive tree decreases and large chunks of data are grouped together. On the other hand, if α is decreased, then the number of groups increases. In other words, we can vary the value of α in the range [α*, ∞) to obtain finer to coarser clustering.
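As a small numeric illustration of the threshold in Equations (24) and (28) (using the formulas as reconstructed above), the snippet below computes α* for a toy pair of points and checks which of the two cases yields the lower loss; the values are our own.

```python
import math

def alpha_star(m, d):
    # Equation (24): alpha* = (1/2) * log2(1 + exp(m*d/2))
    return 0.5 * math.log2(1.0 + math.exp(m * d / 2.0))

def losses(m, d, alpha):
    # Case I (two singleton clusters) vs. Case II (one merged cluster), Eqs. (19) and (22)
    E1 = 2.0 * d * d * (1.0 + math.exp(m * d / 2.0)) ** (-1.0 / alpha)
    E2 = d * d / 2.0
    return E1, E2

m, d = 15.0, 0.4
a_star = alpha_star(m, d)                 # ~2.2 for these toy values
for alpha in (0.5 * a_star, 2.0 * a_star):
    E1, E2 = losses(m, d, alpha)
    print(alpha, "two clusters" if E1 < E2 else "one cluster")
# below alpha* the split is cheaper, above alpha* merging is cheaper,
# and Eq. (28) lower-bounds alpha* by (1/2) * log2(1/eps).
```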

In soft c-means, on the contrary, the number of clusters is always constant and equal to c. The parameter α in soft c-means is used to control the radius of each soft cluster during the computation of its centroid, and the membership values of the samples in the different clusters depend on the computed cluster centers. In our model, although the maximum number of possible clusters is equal to the number of leaf nodes, the actual number of clusters depends on the choice of α: with an increase in α, the number of clusters decreases. In other words, although the objective functional in Equation (8) is exactly the same as that in soft c-means, the effect of α is different in the two algorithms.

4. Experimental Results

We implemented the model in the MATLAB environment and experimented with its performance on both synthetic and real-life data sets.

4.1 Protocol

The entire batch of samples is repeatedly presented in the online mode, and we call the number of times the batch of samples is presented the number of epochs. If the data density is high or the dataset has a set of compact clusters, then we observe that the model takes fewer epochs to converge; for relatively low data density or in the presence of sparse clusters, the model takes a larger number of epochs. On average, we observed that the model converges near its local optimal solution within 5 epochs and sometimes even fewer, although the required number of epochs increases with the depth of the tree.

We normalize all input patterns such that each component of x lies in the range [−1, +1]. We normalize each component of x (say, x_i) as

x̂_i = (2x_i − (x_i^max + x_i^min)) / (x_i^max − x_i^min),    (30)

where x_i^max and x_i^min are the maximum and minimum values of x_i over all observations, respectively. In this normalization, we do not separately process the outliers; data outliers can badly influence the variables under such a normalization. Instead of linear scaling, the use of a non-linear scaling, or of a more robust scaling such as one based on the inter-quartile range, may further improve the performance of our model. However, in this paper we preserve the exact distribution of the samples and test the capability of the model in extracting the groups of samples in the online adaptive mode.

The performance of the model depends on the slope (parameter m) of the sigmoidal function (Equation (6)). In general, we observed that for a given control parameter α, the number of decision hyperplanes increases as the slope decreases (this is also consistent with Equation (24), where α* decreases with decreasing m). This is due to the fact that if we decrease the slope, then more than one leaf node receives significant activation to attract the codes (β) in the code formation layer, and thereby the number of clusters increases. On the other hand, with an increase in the slope, only a few (in most cases only one) leaf nodes are significantly activated to influence the decision hyperplanes, and therefore the number of clusters decreases. In general, for a small m, a group of leaf nodes together represents a compact cluster. As stated in Equation (7), for a given depth l of the adaptive tree, we can fix the slope as m ≥ (1/δ) log(l/ε), where δ is a margin between a point (pattern) and the closest decision hyperplane, and ε is a constant such that the minimum activation of a maximally activated node should be no less than 1 − ε. Experimentally, we fixed δ = 0.2 for ε = 0.1.
In other words, we fix the slope as m = 5 log(10 l) for a given depth l of the tree. Note that the value of δ we have chosen is valid for the normalized input space that we use.
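A minimal sketch of the preprocessing and slope choice described in this protocol (Equation (30), and the reading m = (1/δ) log(l/ε) of Equation (7)); the function names and example values are ours.

```python
import numpy as np

def normalize_to_unit_box(X):
    """Map each column of X into [-1, +1] as in Equation (30); no outlier handling."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (2.0 * X - (x_max + x_min)) / (x_max - x_min)

def slope(depth, delta=0.2, eps=0.1):
    # m = (1/delta) * log(depth/eps); with delta=0.2, eps=0.1 this is 5*log(10*depth)
    return np.log(depth / eps) / delta

X = np.array([[2.0, 10.0], [4.0, 30.0], [6.0, 50.0]])
print(normalize_to_unit_box(X))   # each column now spans [-1, +1]
print(slope(depth=4))             # ~18.4
```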

If the data is not normalized and the input range is larger, then a larger value of δ can also be selected.

We also observed the behavior of the model with respect to the control parameter α: as we increase α, the number of clusters extracted by the model decreases, and vice versa. In the adaptation rule (Equation (11)), as stated before, we always select λ as unity, since we operate in the normalized input space. We also select the constant γ = 1 in computing the learning rate, irrespective of any other parameter. We initialize β to zero, and we initialize θ at each nonterminal node to zero. We initialize the weight vectors w such that every component of each w is the same; in other words, for an n-dimensional input space, we initialize w such that every component of w is equal to 1/√n, which makes ‖w‖ = 1. Therefore, the performance of the model depends solely on the order in which the samples are presented, independently of the initial condition (since the initial condition is the same in all experiments). We also experimented with initializing each weight vector from a randomly generated Gaussian distribution, and observed that initialization with randomly generated, normally distributed weight vectors can sometimes lead to better results than the fixed initialization (all weights being equal). In the next section, we illustrate the results for the Iris and Crabs datasets with an initialization using normally distributed weight vectors. We then perform class discovery with the fixed initialization (all weights being equal), and report the results on a face dataset using the same fixed initialization. We experimented with different depths of the tree, and naturally the number of clusters extracted by the model increases with the depth of the adaptive tree.

4.2 Examples

4.2.1 Synthetic Data

We used a synthetically generated dataset, drawn from a mixture of Gaussians, to test the performance of the model. Figure 2 illustrates the clusters extracted by the model for different values of α and different depths. The objective of this experiment is to show that the model behaves in the manner discussed above, i.e., with an increase in α the number of clusters decreases and vice versa; similarly, with an increase in depth the number of clusters increases. To illustrate the effect of α on the grouping of the data, Figure 3 shows the results for α = 0.1, 0.2, 0.4, and 0.6, respectively, for an adaptive tree of depth l = 5 with m = 5 for 2 epochs. We observe that for α = 0.1, a relatively large number of regions is identified by the tree; this is because the regions are created initially and, owing to the smaller value of α, they are not merged with other regions. As α is increased to 0.6, we observe that different groups of data are merged into one segment by the tree, and seven different groups are identified.

Crabs Data

In the crabs dataset [25], there are in total 200 samples, each sample having five attributes. The crabs are from two different species, and each species has male and female categories. Thus there are in total four different classes, and the data set contains 50 samples from each class. The samples from the different classes are highly overlapped when the data is projected onto the two-dimensional space spanned by the first and second principal components. Interestingly, the four classes are nicely observable when the data is displayed in the space spanned by the second and third principal components.
We therefore obtain the second and third principal components and project the dataset onto these two principal components to obtain the two-dimensional projected data. We then normalize such that each projected sample is bounded in [−1, +1].
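A hedged sketch of this preprocessing step for the crabs data: project onto the second and third principal components and rescale to [−1, +1]. The loading of the actual crabs measurements is replaced by a random stand-in, and all names are ours.

```python
import numpy as np

def project_pc23(X):
    """Project rows of X onto the 2nd and 3rd principal components (0-based columns 1 and 2)."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[1:3].T                        # scores on PC2 and PC3

def to_unit_box(Z):
    z_min, z_max = Z.min(axis=0), Z.max(axis=0)
    return (2.0 * Z - (z_max + z_min)) / (z_max - z_min)

# Stand-in for the 200 x 5 crabs measurements (the real data is not bundled here).
X = np.random.default_rng(0).normal(size=(200, 5))
Z = to_unit_box(project_pc23(X))
print(Z.shape, Z.min(axis=0), Z.max(axis=0))     # (200, 2), each component spanning [-1, +1]
```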

Fig. 2: The decision hyperplanes constructed by the model for different depths and different control parameters, with a constant m = 5. The constructed hyperplanes are shown for (a) depth 3 and α = 0.2, (b) depth 3 and α = 0.5, (c) depth 3 and α = 3.0, (d) depth 4 and α = 0.2, (e) depth 4 and α = 3.0, (f) depth 4 and α = 4.0, (g) depth 5 and α = 0.2, and (h) depth 5 and α = 4.0.

Fig. 3: The decision hyperplanes constructed by the tree with a constant m = 5 and depth equal to 5, as the control parameter α is varied. (a), (b), (c), and (d) illustrate the behavior of the tree for α = 0.1, α = 0.2, α = 0.4, and α = 0.6, respectively.

Figure 4(a)-(d) illustrates the behavior of the model for depths equal to two through five with a constant α = 0.2, and Figure 4(e)-(f) illustrates the behavior for depths equal to four and five with α = 2.0. In Figure 4(a), we observe that the model is not able to completely separate the data points from the different classes with only four regions; however, if we consider only the species information, then the two species are well separated with a depth of 2. In Figure 4(c), we observe that the model generated a region where samples from two different classes (same species, different sex) are mixed up. In all other situations, the model is able to separate the four different classes by allocating more than one leaf node to each class. As a comparison, we illustrate the results obtained with the fuzzy c-means algorithm (Appendix I) for different numbers of clusters. We experimented with different values of the exponent h and observed that for a relatively small number of clusters (such as 4 or 5), the results are the same over a large range of exponent values; as the number of clusters increases, the nature of the grouping depends on the exponent value. We report the results of FCM for different numbers of clusters with a fixed exponent h = 2.0. We observe that the proposed clustering algorithm is able to separate the different class structures as well as the fuzzy c-means algorithm does, although the proposed algorithm is hierarchical in nature.

Iris Data

We illustrate the behavior of the model for the iris dataset (the original dataset was reported in [26]; we obtained the data from [27]). In the iris dataset, there are three different types of iris flowers, each category having four different features, namely sepal length, sepal width, petal length, and petal width. There are 50 samples from each class, resulting in a total of 150 samples in the four-dimensional space. The three different classes can be nicely identified when the data set is displayed in a two-dimensional space spanned by two derived features, namely the sepal area (sepal length × sepal width) and the petal area (petal length × petal width). We transform the data into these two dimensions and then normalize. Figure 6 illustrates the behavior of the model for this dataset with these two normalized derived features for depths equal to two through four with a constant α = 0.2. We observe that a model with depth equal to 2 is able to separate one class completely; however, it creates one region (corresponding to one leaf node) where samples from two different classes are mixed up. The purity of the regions improves considerably when we use a depth equal to 3. For a depth equal to 4, we observe that the model is able to perfectly separate the three different classes of the data. As a comparison, we illustrate the results obtained with the fuzzy c-means algorithm (Appendix I) for different numbers of clusters. We experimented with different values of the exponent h and observed that for a relatively small number of clusters (such as 4 or 5), the results are the same over a large range of exponent values; as the number of clusters increases, the nature of the grouping depends on the exponent value. We report the results of FCM for different numbers of clusters with a fixed exponent h = 2.0, and we report only those results where the performance of FCM was visually most appealing. We observe that even for 3 clusters, FCM is able to separate one class completely, just as the proposed model does with a depth of 2.
As the number of clusters increases, the FCM algorithm is able to partially separate the other two classes. However, we observe that the class lying between the other two is better separated by the proposed model than by FCM: in our model the class structure is captured by two approximately parallel lines, which is not so apparent in FCM, although FCM separates this class to a certain extent from the other two classes.
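A small sketch of the two derived iris features used above (sepal area and petal area), followed by the same [−1, +1] rescaling. The scikit-learn loader is used here purely for convenience and is our choice, not the paper's.

```python
import numpy as np
from sklearn.datasets import load_iris   # convenience loader; any source of the 150 x 4 iris matrix works

X = load_iris().data                      # columns: sepal length, sepal width, petal length, petal width
sepal_area = X[:, 0] * X[:, 1]
petal_area = X[:, 2] * X[:, 3]
Z = np.column_stack([sepal_area, petal_area])

# rescale each derived feature to [-1, +1], as in Equation (30)
z_min, z_max = Z.min(axis=0), Z.max(axis=0)
Z = (2.0 * Z - (z_max + z_min)) / (z_max - z_min)
print(Z.shape)                            # (150, 2): input to the depth-2..4 trees discussed above
```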

Fig. 4: The decision hyperplanes constructed by the proposed model on the crabs dataset projected onto the two-dimensional plane spanned by the second and third principal components. The four different classes in the dataset are marked with different symbols. The figure illustrates the hyperplanes with a constant m = 5 for (a) depth 2 and α = 0.2, (b) depth 3 and α = 0.2, (c) depth 4 and α = 0.2, (d) depth 5 and α = 0.2, (e) depth 4 and α = 2.0, and (f) depth 5 and α = 2.0.

Fig. 5: The decision hyperplanes constructed by the FCM algorithm on the crabs dataset projected onto the two-dimensional plane spanned by the second and third principal components. The four different classes in the dataset are marked with different symbols. The figure illustrates the hyperplanes with a constant h = 2.0 for the number of clusters equal to (a) 4, (b) 6, (c) 8, (d) 10, (e) 12, and (f) 14.

Fig. 6: The decision hyperplanes constructed by the proposed model on the iris dataset projected onto the two-dimensional plane spanned by the sepal area (product of the first and second attribute values) and the petal area (product of the third and fourth attribute values). The figure illustrates the hyperplanes with a constant m = 5 and α = 0.2 for (a) depth equal to 2, (b) depth equal to 3, and (c) depth equal to 4.

Fig. 7: The decision hyperplanes constructed by the FCM algorithm on the iris dataset projected onto the two-dimensional plane spanned by the sepal area and the petal area. The figure illustrates the hyperplanes with a constant h = 2.0 for the number of clusters equal to (a) 3, (b) 4, (c) 6, (d) 7, (e) 9, and (f).

Unsupervised Classification on Real-life Data

Here we demonstrate the effectiveness of the model in performing class discovery on certain real-life data sets (from the UCI repository [27]) by means of unsupervised classification. In order to obtain the classification performance, we consider the samples allocated to each leaf node: for each leaf node, we obtain the class label of the majority of the samples allocated to it and assign that class label to the leaf node. Thus we assign a specific class label to each leaf node of the tree, depending on the group of samples allocated to that leaf node. For the crabs dataset, we use two class labels (species only) instead of the four labels. Apart from the iris and crabs datasets, we also use five other data sets, namely pima-indians-diabetes (originally reported in [28]; we call it PIMA), Wisconsin Diagnostic Breast Cancer (originally reported in [29]; in short, WDBC), Wisconsin Prognostic Breast Cancer (originally reported in [3]; in short, WPBC), the E-Coli bacteria dataset, and the BUPA liver disease dataset. We obtained these data sets from [27]. We modified the Ecoli data originally used in [3], and later in [32], for predicting protein localization sites: the original dataset consists of eight different classes, out of which three classes, namely outer membrane lipoprotein, inner membrane lipoprotein, and inner membrane cleavable signal sequence, have only five, two, and two instances respectively. We omitted the samples from these three classes and report the results for the remaining five classes. Table 1 summarizes the data set descriptions.

Table 1: Description of the pattern classification datasets (number of instances, features, and classes of each): Indian diabetes (Pima), Diagnostic Breast Cancer (Wdbc), Prognostic Breast Cancer (Wpbc), Liver Disease (Bupa), Flower (Iris), Bacteria (Ecoli), and Crabs.

In performing the unsupervised classification on these real-life datasets, we compare the results of our model with those obtained with the fuzzy c-means clustering algorithm. As a further comparison, we also provide the cross-validated results obtained by a supervised decision tree, C4.5 [8, 33], on these data sets. In our model, randomness can enter in two different ways: in the initialization process, and in the order of presentation of the samples in the online mode. We eliminate the randomness in the initialization process by using a fixed initialization: we initialize all components of all the weight vectors (w) to be equal, initialize the bias component (θ) to zero for all intermediate nodes, and initialize the code vectors (β) to zero. We experiment with different random orders of presentation of the samples in the online mode and report the average performance over different trials for each dataset and each tree. In the case of fuzzy c-means, the second source of randomness is not present, since it is a batch-mode algorithm; however, the performance of the FCM algorithm depends on the initialization, and we report the average performance of the FCM algorithm over different trials.
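The leaf-labeling evaluation described above reduces to a few lines: each leaf is labeled by the majority class of the samples allocated to it, and the unsupervised classification accuracy is the fraction of samples whose leaf label matches their true class. A minimal sketch, with our own names:

```python
import numpy as np

def leaf_majority_accuracy(leaf_ids, y_true):
    """leaf_ids[i]: index of the maximally activated leaf for sample i; y_true[i]: its class label."""
    correct = 0
    for leaf in np.unique(leaf_ids):
        members = y_true[leaf_ids == leaf]
        majority = np.bincount(members).argmax()    # class label assigned to this leaf
        correct += (members == majority).sum()
    return correct / len(y_true)

# toy example: 3 leaves over 8 samples from 2 classes
leaf_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_true   = np.array([0, 0, 1, 1, 1, 1, 0, 0])
print(leaf_majority_accuracy(leaf_ids, y_true))     # 0.75
```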

In performing the unsupervised classification, we experimented with different values of α and different depths of the tree. We observed that the performance depends on the value of α: if α is very low, then the classification accuracy degrades due to fragmentation; the performance improves with an increase in α; and the classification accuracy degrades again if the value of α becomes large, since many samples are grouped together. Figure 8 illustrates this behavior of the model on three different datasets, namely PIMA, WDBC, and ECOLI.

Fig. 8: Dependence of the classification accuracy on the parameter α of the adaptive tree, for trees of depth 3 to 7, on (a) the PIMA dataset, (b) the WDBC dataset, and (c) the ECOLI dataset.

In Table 2, we report the best results obtained by the proposed model and by FCM. As a comparison, we provide the cross-validated results obtained by the supervised C4.5 decision tree. For the fuzzy c-means algorithm, we experimented with different combinations of the number of clusters and the exponent (h); we increased the number of clusters up to 15 and the exponent (h) up to 5.0. For each dataset, we report the best result obtained by the FCM algorithm, along with the number of clusters and the exponent for each score. In the case of our model, we likewise tested different combinations of the depth of the adaptive tree and the parameter α; we increased the depth up to 7 and α up to 3.0. We report the best results and the corresponding value of α for each depth of the adaptive tree. Since the performance of the adaptive tree depends on the sequence in which the samples appear, we report the standard deviation of the scores over trials with different input sequences.

Similarly, the performance of the FCM depends on the initialization, and we report the standard deviation over different trials. We observe that the adaptive tree is able to produce better classification accuracy than FCM for the BUPA, PIMA, and IRIS datasets. On the other hand, FCM is much better for the CRABS dataset, and also performs better for the WDBC and ECOLI datasets. The performances are comparable for the WPBC dataset. In other words, we observe that although we perform hierarchical clustering, the adaptive tree is able to produce results comparable with a partitional clustering algorithm on certain datasets. We also compare the results with those produced by agglomerative hierarchical clustering (AHC) using the group average linkage (dendrogram). We have constrained the AHC to produce at most 20 clusters for each dataset, and we report the results for 20 clusters. We observe that in the case of AHC the results consistently improve as we increase the number of clusters; in order to make it comparable with the fuzzy c-means, we have constrained it to 20 clusters. We observe that the adaptive tree is able to produce comparable, and sometimes better, results than AHC when AHC is constrained to 20 clusters.

Table 2: Classification accuracies obtained by the unsupervised online adaptive tree (OADT stands for online adaptive decision tree) at different depths, together with the corresponding values of α and the standard deviations over trials, on the BUPA, PIMA, WDBC, WPBC, IRIS, ECOLI, and CRABS datasets. As a comparison, the table also reports the results produced by the fuzzy c-means algorithm (with the number of clusters and the exponent h), the dendrogram (AHC), and the supervised C4.5 decision tree.

We validate the clustering produced by the adaptive tree by computing the F-measure validation index [5]. In Table 3, we report the F-measure indices for a fixed value of α for different depths of the tree on all the datasets of Table 1. We also tested the effectiveness of our model in performing unsupervised classification of acute leukemia samples from gene expression measured with DNA microarrays, as used in Golub et al. [34]. In this data set, there are 72 samples consisting of two different types of leukemia, namely acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). For each


More information

Bioinformatics - Lecture 07

Bioinformatics - Lecture 07 Bioinformatics - Lecture 07 Bioinformatics Clusters and networks Martin Saturka http://www.bioplexity.org/lectures/ EBI version 0.4 Creative Commons Attribution-Share Alike 2.5 License Learning on profiles

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

More information

Chapter DM:II. II. Cluster Analysis

Chapter DM:II. II. Cluster Analysis Chapter DM:II II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-1

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

K-Means Clustering 3/3/17

K-Means Clustering 3/3/17 K-Means Clustering 3/3/17 Unsupervised Learning We have a collection of unlabeled data points. We want to find underlying structure in the data. Examples: Identify groups of similar data points. Clustering

More information

Uninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall

Uninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall Midterm Exam Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall Covers topics through Decision Trees and Random Forests (does not include constraint satisfaction) Closed book 8.5 x 11 sheet with notes

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Unsupervised: no target value to predict

Unsupervised: no target value to predict Clustering Unsupervised: no target value to predict Differences between models/algorithms: Exclusive vs. overlapping Deterministic vs. probabilistic Hierarchical vs. flat Incremental vs. batch learning

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1

More information

Applying the Possibilistic C-Means Algorithm in Kernel-Induced Spaces

Applying the Possibilistic C-Means Algorithm in Kernel-Induced Spaces 1 Applying the Possibilistic C-Means Algorithm in Kernel-Induced Spaces Maurizio Filippone, Francesco Masulli, and Stefano Rovetta M. Filippone is with the Department of Computer Science of the University

More information

Comparing Univariate and Multivariate Decision Trees *

Comparing Univariate and Multivariate Decision Trees * Comparing Univariate and Multivariate Decision Trees * Olcay Taner Yıldız, Ethem Alpaydın Department of Computer Engineering Boğaziçi University, 80815 İstanbul Turkey yildizol@cmpe.boun.edu.tr, alpaydin@boun.edu.tr

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Clustering with Reinforcement Learning

Clustering with Reinforcement Learning Clustering with Reinforcement Learning Wesam Barbakh and Colin Fyfe, The University of Paisley, Scotland. email:wesam.barbakh,colin.fyfe@paisley.ac.uk Abstract We show how a previously derived method of

More information

A Unified Framework to Integrate Supervision and Metric Learning into Clustering

A Unified Framework to Integrate Supervision and Metric Learning into Clustering A Unified Framework to Integrate Supervision and Metric Learning into Clustering Xin Li and Dan Roth Department of Computer Science University of Illinois, Urbana, IL 61801 (xli1,danr)@uiuc.edu December

More information

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors Dejan Sarka Anomaly Detection Sponsors About me SQL Server MVP (17 years) and MCT (20 years) 25 years working with SQL Server Authoring 16 th book Authoring many courses, articles Agenda Introduction Simple

More information

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013 Machine Learning Topic 5: Linear Discriminants Bryan Pardo, EECS 349 Machine Learning, 2013 Thanks to Mark Cartwright for his extensive contributions to these slides Thanks to Alpaydin, Bishop, and Duda/Hart/Stork

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

Univariate and Multivariate Decision Trees

Univariate and Multivariate Decision Trees Univariate and Multivariate Decision Trees Olcay Taner Yıldız and Ethem Alpaydın Department of Computer Engineering Boğaziçi University İstanbul 80815 Turkey Abstract. Univariate decision trees at each

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

Using the Kolmogorov-Smirnov Test for Image Segmentation

Using the Kolmogorov-Smirnov Test for Image Segmentation Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer

More information

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Implementation: Real machine learning schemes Decision trees Classification

More information

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

The Projected Dip-means Clustering Algorithm

The Projected Dip-means Clustering Algorithm Theofilos Chamalis Department of Computer Science & Engineering University of Ioannina GR 45110, Ioannina, Greece thchama@cs.uoi.gr ABSTRACT One of the major research issues in data clustering concerns

More information

Kernel Based Fuzzy Ant Clustering with Partition validity

Kernel Based Fuzzy Ant Clustering with Partition validity 2006 IEEE International Conference on Fuzzy Systems Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 6-2, 2006 Kernel Based Fuzzy Ant Clustering with Partition validity Yuhua Gu and Lawrence

More information

All lecture slides will be available at CSC2515_Winter15.html

All lecture slides will be available at  CSC2515_Winter15.html CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 9: Support Vector Machines All lecture slides will be available at http://www.cs.toronto.edu/~urtasun/courses/csc2515/ CSC2515_Winter15.html Many

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

Further Applications of a Particle Visualization Framework

Further Applications of a Particle Visualization Framework Further Applications of a Particle Visualization Framework Ke Yin, Ian Davidson Department of Computer Science SUNY-Albany 1400 Washington Ave. Albany, NY, USA, 12222. Abstract. Our previous work introduced

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

K-Means Clustering. Sargur Srihari

K-Means Clustering. Sargur Srihari K-Means Clustering Sargur srihari@cedar.buffalo.edu 1 Topics in Mixture Models and EM Mixture models K-means Clustering Mixtures of Gaussians Maximum Likelihood EM for Gaussian mistures EM Algorithm Gaussian

More information

Fall 09, Homework 5

Fall 09, Homework 5 5-38 Fall 09, Homework 5 Due: Wednesday, November 8th, beginning of the class You can work in a group of up to two people. This group does not need to be the same group as for the other homeworks. You

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines Quiz Section Week 8 May 17, 2016 Machine learning and Support Vector Machines Another definition of supervised machine learning Given N training examples (objects) {(x 1,y 1 ), (x 2,y 2 ),, (x N,y N )}

More information

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

More information

Unsupervised Learning: Clustering

Unsupervised Learning: Clustering Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer Machine Learning Supervised Learning Unsupervised Learning

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

Generative and discriminative classification techniques

Generative and discriminative classification techniques Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14

More information

CLUSTERING IN BIOINFORMATICS

CLUSTERING IN BIOINFORMATICS CLUSTERING IN BIOINFORMATICS CSE/BIMM/BENG 8 MAY 4, 0 OVERVIEW Define the clustering problem Motivation: gene expression and microarrays Types of clustering Clustering algorithms Other applications of

More information

COMS 4771 Clustering. Nakul Verma

COMS 4771 Clustering. Nakul Verma COMS 4771 Clustering Nakul Verma Supervised Learning Data: Supervised learning Assumption: there is a (relatively simple) function such that for most i Learning task: given n examples from the data, find

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Associative Cellular Learning Automata and its Applications

Associative Cellular Learning Automata and its Applications Associative Cellular Learning Automata and its Applications Meysam Ahangaran and Nasrin Taghizadeh and Hamid Beigy Department of Computer Engineering, Sharif University of Technology, Tehran, Iran ahangaran@iust.ac.ir,

More information

Introduction to R and Statistical Data Analysis

Introduction to R and Statistical Data Analysis Microarray Center Introduction to R and Statistical Data Analysis PART II Petr Nazarov petr.nazarov@crp-sante.lu 22-11-2010 OUTLINE PART II Descriptive statistics in R (8) sum, mean, median, sd, var, cor,

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki Wagner Meira Jr. Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA Department

More information

Lecture #11: The Perceptron

Lecture #11: The Perceptron Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be

More information

Clustering algorithms and introduction to persistent homology

Clustering algorithms and introduction to persistent homology Foundations of Geometric Methods in Data Analysis 2017-18 Clustering algorithms and introduction to persistent homology Frédéric Chazal INRIA Saclay - Ile-de-France frederic.chazal@inria.fr Introduction

More information

CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA. By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr.

CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA. By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr. CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr. Michael Nechyba 1. Abstract The objective of this project is to apply well known

More information

Methods for Intelligent Systems

Methods for Intelligent Systems Methods for Intelligent Systems Lecture Notes on Clustering (II) Davide Eynard eynard@elet.polimi.it Department of Electronics and Information Politecnico di Milano Davide Eynard - Lecture Notes on Clustering

More information

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Midterm Examination CS540-2: Introduction to Artificial Intelligence Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search

More information

CloNI: clustering of JN -interval discretization

CloNI: clustering of JN -interval discretization CloNI: clustering of JN -interval discretization C. Ratanamahatana Department of Computer Science, University of California, Riverside, USA Abstract It is known that the naive Bayesian classifier typically

More information

Introduction to Artificial Intelligence

Introduction to Artificial Intelligence Introduction to Artificial Intelligence COMP307 Machine Learning 2: 3-K Techniques Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline K-Nearest Neighbour method Classification (Supervised learning) Basic NN (1-NN)

More information

Machine Learning: Think Big and Parallel

Machine Learning: Think Big and Parallel Day 1 Inderjit S. Dhillon Dept of Computer Science UT Austin CS395T: Topics in Multicore Programming Oct 1, 2013 Outline Scikit-learn: Machine Learning in Python Supervised Learning day1 Regression: Least

More information

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2 161 Machine Learning Hierarchical clustering Reading: Bishop: 9-9.2 Second half: Overview Clustering - Hierarchical, semi-supervised learning Graphical models - Bayesian networks, HMMs, Reasoning under

More information

Comparative Study of Clustering Algorithms using R

Comparative Study of Clustering Algorithms using R Comparative Study of Clustering Algorithms using R Debayan Das 1 and D. Peter Augustine 2 1 ( M.Sc Computer Science Student, Christ University, Bangalore, India) 2 (Associate Professor, Department of Computer

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

Machine Learning in Biology

Machine Learning in Biology Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CX 4242 DVA March 6, 2014 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Analyze! Limited memory size! Data may not be fitted to the memory of your machine! Slow computation!

More information

Module 4. Non-linear machine learning econometrics: Support Vector Machine

Module 4. Non-linear machine learning econometrics: Support Vector Machine Module 4. Non-linear machine learning econometrics: Support Vector Machine THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Introduction When the assumption of linearity

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

4. Ad-hoc I: Hierarchical clustering

4. Ad-hoc I: Hierarchical clustering 4. Ad-hoc I: Hierarchical clustering Hierarchical versus Flat Flat methods generate a single partition into k clusters. The number k of clusters has to be determined by the user ahead of time. Hierarchical

More information

Hierarchical clustering

Hierarchical clustering Hierarchical clustering Rebecca C. Steorts, Duke University STA 325, Chapter 10 ISL 1 / 63 Agenda K-means versus Hierarchical clustering Agglomerative vs divisive clustering Dendogram (tree) Hierarchical

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Learning without Class Labels (or correct outputs) Density Estimation Learn P(X) given training data for X Clustering Partition data into clusters Dimensionality Reduction Discover

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information