Cluster-Based Cumulative Ensembles


Hanan G. Ayad and Mohamed S. Kamel
Pattern Analysis and Machine Intelligence Lab, Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
{hanan, mkamel}@pami.uwaterloo.ca
http://pami.uwaterloo.ca/

Abstract. In this paper, we propose a cluster-based cumulative representation for cluster ensembles. Cluster labels are mapped to incrementally accumulated clusters, and a matching criterion based on maximum similarity is used. The ensemble method is investigated with bootstrap re-sampling, where the k-means algorithm is used to generate high granularity clusterings. For combining, group average hierarchical meta-clustering is applied, and the Jaccard measure is used for cluster similarity computation. Patterns are assigned to combined meta-clusters based on estimated cluster assignment probabilities. The cluster-based cumulative ensembles are more compact than co-association-based ensembles. Experimental results on artificial and real data show reduction of the error rate across varying ensemble parameters and cluster structures.

1 Introduction

Motivated by the advances in classifier ensembles, which combine the predictions of multiple classifiers, cluster ensembles that combine multiple data partitionings have started to gain increasing interest [1-8]. Cluster ensembles can be illustrated by the schematic model in Figure 1. The model includes two main elements, the ensemble generation and the combination scheme. The ensemble generation takes as input a dataset of d-dimensional pattern vectors represented by an N × d matrix X = {x^(i)}_{i=1}^{N}, where N is the number of patterns and the row vector x^(i) represents the ith pattern. The ensemble generation produces multiple clusterings, represented here by cluster label vectors {y^(b)}_{b=1}^{B}. The combining scheme (or the consensus function [1]) can be thought of as comprising two sub-elements. The first is the ensemble mapping, which defines a representation Z of the ensemble outputs and an associated mapping method. The lack of direct correspondence between the labels generated by the individual clusterings leads to the need for this mapping component. For instance, the co-association (or co-occurrence) matrix [2] is an example of a representation generated by an ensemble mapping that side-steps the label correspondence problem, at a computational cost of O(N²). The maximum likelihood mapping [8] is another example of ensemble mapping, in which the re-labelling problem is formulated as a weighted bipartite matching problem and is solved using the Hungarian method [9] with a computational cost of O(k³), where k is the number of clusters.

[Fig. 1. Schematic model of cluster ensembles: the input dataset X feeds the ensemble generation, which produces the clusterings y^(1), ..., y^(B); the ensemble mapping builds the representation Z, from which the combining algorithm produces the estimated cluster labels ŷ of the combined clustering and the estimated probabilities p̂.]

The second sub-element of a combining scheme is the combining algorithm, which uses Z to generate the combined clustering ŷ. A potential derivative of the cluster ensemble is the estimation of the probabilities p̂ with which data points belong to the combined clusters. The combining algorithm often lends itself to a clustering problem, where the data is given by the new representation Z. It is noted that if the label correspondence problem is resolved and the number of clusters in the base clusterings {y^(b)}_{b=1}^{B} is the same as the number of clusters k in the combined clustering ŷ, majority voting [6] or maximum likelihood classification [8] can be readily applied. However, if the two numbers differ, co-association-based consensus functions are often applied [2-4]. While allowing arbitrary cluster structures to be discovered, co-association-based consensus functions are computationally expensive and hence not practical for large datasets.

Re-sampling methods are well established approaches for estimating improved data statistics [10]. In particular, bagging [11] has been introduced in regression and classification. In bagging, the training dataset of size N is perturbed using bootstrap re-sampling to generate learning datasets by randomly sampling N patterns with replacement. This yields duplicate patterns in a bootstrap dataset. The bootstrap re-sampling process is independently repeated B times, and the B datasets are treated as independent learning sets. Dudoit and Fridlyand [6] used bagging with the Partitioning Around Medoids (PAM) clustering method to improve the accuracy of clustering. They use two methods for combining multiple partitions: the first applies voting, and the second creates a new dissimilarity matrix similar to the co-association matrix used in [2]. In the voting method, the same number of clusters is used for clustering and combining, and the input dataset is clustered once to create a reference clustering. The cluster labels of each bootstrap replication are permuted such that they fit best to the reference clustering. They reported that the bagged clusterings were generally as accurate as, and often significantly more accurate than, a single clustering.
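To make the bootstrap generation step concrete, the following is a minimal sketch (ours, not the authors' code) that draws B bootstrap learning sets and clusters each with k-means; the function name generate_bootstrap_clusterings and the use of scikit-learn's KMeans are assumptions. Label 0 marks patterns absent from a given bootstrap sample, matching the convention defined in Section 2.1 below.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_bootstrap_clusterings(X, c, B, seed=0):
    """Draw B bootstrap samples of X and cluster each into c clusters with k-means.

    Returns an N x B integer array of labels in {0, 1, ..., c}, where 0 marks
    patterns that did not appear in that bootstrap learning set.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    labels = np.zeros((N, B), dtype=int)          # 0 = not sampled in this replicate
    for b in range(B):
        idx = rng.integers(0, N, size=N)          # sample N patterns with replacement
        km = KMeans(n_clusters=c, n_init=10).fit(X[idx])
        # Duplicates of a pattern receive identical k-means labels, so the
        # repeated writes below are consistent; unsampled patterns keep 0.
        labels[idx, b] = km.labels_ + 1           # shift labels to 1..c, reserving 0
    return labels
```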

Fischer and Buhmann [8] applied bagging to improve the quality of the path-based clustering method. They critiqued the use of a reference clustering in the mapping method of Dudoit and Fridlyand [6], arguing that it imposes undesirable influence. Instead, they selected a re-labelling out of all k! permutations for a clustering, such that it maximizes the sum, over all objects of the new mapping configuration, of the empirical cluster assignment probabilities estimated from the previous mappings. The problem of finding the best permutation is formulated as a weighted bipartite matching problem, and the Hungarian method is used to solve it. They reported that bagging increases the reliability of the results and provides a measure of uncertainty of the cluster assignment. Again, in this method, the number of clusters used in the ensemble is the same as the number of combined clusters. Minaei, Topchy and Punch [7] empirically investigated the effectiveness of bootstrapping with several consensus functions by examining the accuracy of the combined clustering for varied resolution of partitions (i.e., number of clusters) and ensemble size. They report that clustering of bootstrap samples leads to improved consensus clustering of the data. They further conclude that the best consensus function remains an open question, as different consensus functions seem to suit different cluster structures.

In this paper, we propose an ensemble mapping representation based on the generated clusters, as high-level data granules. Re-labelling of clusters is based on maximizing individual cluster similarity to incrementally-accumulated clusters. Based on this representation, different combining algorithms can be used, such as hierarchical clustering algorithms. Here, group average (i.e., average link) hierarchical meta-clustering is applied. We experimentally investigate the effectiveness of the proposed consensus function, with bootstrap re-sampling and the k-means as the underlying clustering algorithm.

2 Cluster-Based Cumulative Ensemble

2.1 Ensemble Mapping

The ensemble representation consists of a cumulative c × N matrix Z summarising the ensemble outputs, where c is a given number of clusters used in generating the multiple clusterings, such that k ≤ c ≪ N, where k is the number of combined clusters. The values in Z reflect the frequency of occurrence of each pattern in each of the accumulated clusters. The k-means algorithm with the Euclidean distance is used to generate a clustering y^(b) = π(X^(b), c) of a bootstrapped learning set in {X^(b)}_{b=1}^{B}, where B is the size of the ensemble and y^(b) is an N-dimensional labelling vector. That is, π is a mapping function π : X^(b) → {0, 1, ..., c}, where label 0 is assigned to patterns that did not appear in the bootstrap learning set X^(b). Each instance of the c × N matrix, denoted by Z^(b), is incrementally updated from the ensemble {y^(b)}_{b=1}^{B} as follows.

1. Z^(1) is initialized using y^(1), as given below. Re-labelling and accumulation start by processing clustering y^(2).

z^(1)_{ij} = 1 if object j is in cluster i according to clustering y^(1), and z^(1)_{ij} = 0 otherwise.

2. Let each cluster in a given clustering y^(b+1) be represented by a binary N-dimensional vector v, with 1s in the entries corresponding to the cluster members and 0s otherwise. Let each cluster extracted from the rows z^(b)_i of Z^(b) be represented by the binary N-dimensional vector w, whose entries are 1s for the non-zero columns of z^(b)_i and 0s otherwise. Compute the similarity between each pair of vectors v and w using the Jaccard measure, given as

J(v, w) = v·w / (‖v‖² + ‖w‖² − v·w).

3. Map each cluster label i ∈ {1, ..., c} in clustering y^(b+1) to its most similar cluster, labelled j ∈ {1, ..., c}, of the previously accumulated clusters represented by the rows of Z^(b). Hence, increment the entries of row j of Z^(b) corresponding to the members of the cluster labelled i in clustering y^(b+1).

4. Z^(b+1) ← Z^(b).

The mapping process is repeated until Z^(B) is computed; a code sketch is given at the end of this subsection. The cumulative cluster-based mapping of the ensemble culminates in the matrix Z = Z^(B), a voting structure that summarises the ensemble. While in the maximum likelihood mapping [8] the best cluster label permutation is found and c = k is used, in this paper each cluster is re-labelled to match its most similar cluster from the accumulated clusters. This is done for the following reasons. First, since the base clusterings represent high resolution partitions of non-identical bootstrap learning sets, the clusterings are highly diverse, such that finding the best permutation becomes less meaningful. For quantitative measures of diversity in cluster ensembles, the reader is referred to [5]. Second, since the accumulated clusters will be merged at a later stage by the combining algorithm, we are most concerned at this stage with a mapping that maximizes the similarities and hence minimizes the variance of the mapped clusters. We found that this matching method can occasionally cause a cumulative cluster to become singled out, when no subsequently added clusters are mapped to it. If a hierarchical clustering algorithm is used, this problem can lead to a degenerate dendrogram and empty cluster(s) in the combined clustering. Therefore, we detect this condition and discard the corresponding solution. Usually, a good solution is reached in a few iterations. An alternative remedy is to match each of the cumulative clusters to its most similar cluster from each subsequently mapped clustering, instead of the reverse way. This ensures that the above-mentioned condition does not occur, but it can introduce influence from earlier clusterings and incorporate less of the diversity in the ensemble.

An advantage of this representation is that it allows several alternative views (interpretations) to be considered by the combining algorithm. For instance, Z may be treated as a pattern matrix. This allows different distance/similarity measures and combining algorithms to be applied to generate the combined clustering. Alternatively, Z may be treated as the joint probability distribution of two discrete random variables indexing the rows and columns of Z. This allows for information theoretic formulations for finding the combined clusters.
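The following is a minimal sketch of steps 1-4 (ours, not the authors' implementation). It assumes the label convention above: an N × B array of base labels in {0, 1, ..., c}, with 0 meaning the pattern was absent from that bootstrap sample; the names jaccard_binary and accumulate_clusters are hypothetical.

```python
import numpy as np

def jaccard_binary(v, w):
    """Jaccard measure J(v, w) = v.w / (||v||^2 + ||w||^2 - v.w) for binary vectors."""
    vw = float(np.dot(v, w))
    denom = float(np.dot(v, v) + np.dot(w, w) - vw)
    return vw / denom if denom > 0.0 else 0.0

def accumulate_clusters(labels, c):
    """Build the cumulative c x N matrix Z from an N x B label array (steps 1-4)."""
    N, B = labels.shape
    Z = np.zeros((c, N))
    for i in range(c):                            # step 1: initialize Z^(1) from y^(1)
        Z[i, labels[:, 0] == i + 1] = 1.0
    for b in range(1, B):                         # process y^(2), ..., y^(B)
        W = (Z > 0).astype(float)                 # binary view of the rows of Z^(b)
        for i in range(c):
            v = (labels[:, b] == i + 1).astype(float)   # step 2: cluster i as a binary vector
            if not v.any():
                continue
            sims = [jaccard_binary(v, W[j]) for j in range(c)]
            j = int(np.argmax(sims))              # step 3: most similar accumulated cluster
            Z[j] += v                             # increment frequencies in row j
    return Z                                      # step 4: Z^(b+1) <- Z^(b), up to Z^(B)
```

The degenerate case discussed above (a row of Z that attracts no later clusters) is not detected in this sketch; a practical run would check for it and re-draw the ensemble, as the paper does.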

Furthermore, the size of this representation is c × N, versus N² for the co-association-based representation, where c ≪ N. While, in the case of the co-association matrix, the hierarchical clustering algorithm runs on the N × N matrix, in the case of the cluster-based cumulative representation it runs on a distance matrix computed from the c × N matrix Z.

2.2 Combining Using Hierarchical Group Average Meta-clustering

Motivated by what is believed to be a reasonable discriminating strategy, based on the average of a chosen distance measure between clusters, the proposed combining algorithm is group average hierarchical clustering. The combining algorithm starts by computing the distances between the rows of Z (i.e., the cumulative clusters). This is a total of c(c − 1)/2 distances, and one minus the binary Jaccard measure, given in Section 2.1, is used to compute them. The group-average hierarchical clustering is used to cluster the clusters, hence the name meta-clustering. In this algorithm, the distance between a pair of clusters d(C1, C2) is defined as the average distance between the objects in each cluster, where the objects in this case are the cumulative clusters. It is computed as

d(C1, C2) = mean over (z1, z2) ∈ C1 × C2 of d(z1, z2), where d(z1, z2) = 1 − J(z1, z2).

The dendrogram is cut to generate k meta-clusters {M_j}_{j=1}^{k}, representing a partitioning of the cumulative clusters {z_i}_{i=1}^{c}. The merged clusters are averaged in a k × N matrix M = {m_ji}, for j ∈ {1, ..., k} and i ∈ {1, ..., N}. So far, only the binary version of the cumulative matrix has been used for distance computations. Now, in determining the final clustering, the frequency values accumulated in Z are averaged in the meta-cluster matrix M and used to compute the cluster assignment probabilities. Then, each object is assigned to its most likely meta-cluster. Let M be a random variable indexing the meta-clusters and taking values in {1, ..., k}, let X be a random variable indexing the patterns and taking values in {1, ..., N}, and let p̂(M = j | X = i) be the conditional probability of each of the k meta-clusters, given an object i, which we also write as p̂(M_j | x_i). Here, we use x_i to denote the object index of the pattern x^(i), and we use M_j to denote the meta-cluster represented by row j of M. The probability estimates are computed as

p̂(M_j | x_i) = m_ji / Σ_{l=1}^{k} m_li.
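A minimal sketch of this combining step follows (our rendering, with the hypothetical name combine_meta_clusters), assuming SciPy's group-average linkage as the meta-clustering algorithm; note that pdist with metric='jaccard' returns exactly 1 − J on binary vectors.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def combine_meta_clusters(Z, k):
    """Group-average meta-clustering of the c cumulative clusters (rows of Z).

    Cuts the dendrogram at k meta-clusters, averages the merged rows into the
    k x N matrix M, and assigns each object to its most probable meta-cluster.
    Assumes no meta-cluster comes out empty (the paper discards such runs).
    """
    W = (Z > 0).astype(bool)                      # binary version of Z
    d = pdist(W, metric='jaccard')                # the c-choose-2 distances 1 - J
    tree = linkage(d, method='average')           # group average (alink) meta-clustering
    meta = fcluster(tree, t=k, criterion='maxclust')   # meta-cluster labels in 1..k
    M = np.vstack([Z[meta == j].mean(axis=0) for j in range(1, k + 1)])
    col = M.sum(axis=0)                           # p_hat(M_j | x_i) = m_ji / sum_l m_li
    probs = M / np.where(col > 0, col, 1.0)       # guard against never-sampled objects
    y_hat = probs.argmax(axis=0) + 1              # combined clustering, labels in 1..k
    return y_hat, probs
```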

3 Experimental Analysis

Performance is evaluated based on the error rates, which are computed by solving the correspondence problem between the labels of a clustering solution and the true clustering using the Hungarian method.

3.1 Experiments with Artificial Data

The artificial datasets are shown in Figure 2. The first, called Elongated-ellipses, consists of 1000 2D points in 2 equal clusters. The Crescents dataset consists of 1000 2D points in 2 equal clusters. The Differing-ellipses dataset consists of 250 2D points in 2 clusters of sizes 50 and 200. The dataset called 8D5K was generated and used in [1]; it consists of 1000 points from five 8D Gaussian distributions (200 points each). For visualization, the 8D5K data is projected onto the first two principal components.

[Fig. 2. Scatter plots of the artificial datasets: Elongated-ellipses, Crescents, Differing-ellipses, and 8D5K. The last, 8-dimensional, dataset is projected on the first 2 principal components.]

For each dataset, we use B = 100 and vary c. We measure the error rate of the corresponding bagged ensemble at the true number of clusters k and compare it to that of the k-means at the same k. The results in Figure 3 show that the proposed bagging ensemble significantly lowers the error rate for varied cluster structures. In order to illustrate the cluster-based cumulative ensemble, we show in Figure 4 (a) a plot of the point frequencies in each of the accumulated clusters at c = 4 for the Elongated-ellipses dataset. The points are ordered such that the first 500 points belong to the first cluster, followed by 500 from the second cluster. The dendrogram corresponding to the hierarchical group average meta-clustering on the 4 cumulative clusters is shown in Figure 4 (b).
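As an aside on the evaluation protocol, the error rate described at the start of this section can be computed with SciPy's assignment solver, which plays the role of the Hungarian method [9]; this sketch (with the hypothetical name error_rate) is ours, not the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def error_rate(y_true, y_pred, k):
    """Error rate after optimally matching predicted labels to true labels (both in 1..k)."""
    table = np.zeros((k, k))                      # contingency table of label co-occurrences
    for t, p in zip(y_true, y_pred):
        table[t - 1, p - 1] += 1
    rows, cols = linear_sum_assignment(-table)    # maximize correctly matched counts
    return 1.0 - table[rows, cols].sum() / len(y_true)
```

Chaining the earlier sketches (generate_bootstrap_clusterings, accumulate_clusters, combine_meta_clusters) and scoring ŷ with error_rate reproduces the experimental protocol in outline.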

[Fig. 3. Error rates for the artificial datasets using the bagged cluster ensembles and the k-means algorithm at given k, plotted against c for the Elongated-ellipses, Crescents, Differing-ellipses, and 8D5K datasets.]

[Fig. 4. (a) Accumulated clusters: point frequencies in each of the four cumulative clusters C1-C4 of the Elongated-ellipses dataset. (b) Generated dendrogram of the group average meta-clustering.]

3.2 Experiments with Real Data

We use six datasets from the UCI machine learning repository. Since the Euclidean distance is not scale invariant, we standardize the features for those datasets in which the scales vary widely across features. The datasets used are: (1) the iris plant dataset, (2) the wine recognition dataset, (3) the Wisconsin breast cancer dataset, (4) the Wisconsin diagnostic breast cancer (WDBC) dataset, (5) a random sample of size 500 from the optical recognition of handwritten digits dataset, and (6) a random sample of size 500 from the pen-based recognition of handwritten digits dataset. We standardized the features for the wine recognition and WDBC datasets. The mean error rates of the k-means (over 100 runs), at the true k, for the above datasets are .27, .378, .395, .923, .288, .3298, respectively.

[Fig. 5. Error rates on the real datasets (Iris Plant, Wine, Breast Cancer, WDBC, Optical digits, Pen-based digits) for the proposed ensemble versus co-association-based ensembles using single, complete and average link.]

Figure 5 shows a comparison of the cluster-based cumulative ensembles with hierarchical group average (denoted in Figure 5 by cluster-based alink) to pattern co-association-based ensembles when the single, complete and average link variants of the hierarchical clustering are applied (denoted by pattern-based slink, clink, and alink, respectively). In the experiments, we use B = 100 and k corresponding to the true number of clusters. The results show that the cluster-based alink ensembles perform competitively compared to the pattern-based alink ensembles. On the other hand, the co-association-based single and complete link ensembles showed poor performance.

3.3 Varying the Ensemble Size

We study the effect of the ensemble size B, for values of B up to 100. Figure 6 shows the mean error rates on real and artificial datasets for B = 5, 10, 25, 50, and 100, and for varying numbers of base clusters c. Each ensemble at a given c and B is repeated r = 5 times, and the mean error rate is computed. There is a general trend of reduction in the error rates as B increases. However, we observe that most of the gain in accuracy occurs by B = 25 and 50. We also observe that the differences between the error rates of ensembles with varying values of c tend to decrease as B increases, i.e., the variability of the error rates corresponding to different values of c is reduced when B is increased. However, in some cases, the amount of reduction in the error depends on c. For instance, this can be observed for c = 4 in the Crescents and Differing-ellipses datasets.

[Fig. 6. Effect of the ensemble size B on the mean error rate for the Iris Plant, Wine, Crescents, and Differing-ellipses datasets, with one curve per value of c. The X-axis is log scale.]

4 Conclusion

The proposed cluster-based cumulative representation is more compact than the co-association matrix. Experimental results on artificial datasets emphasised the potential of the proposed ensemble method in substantially lowering the error rate and in finding arbitrary cluster structures.

For the real datasets, the cluster-based cumulative ensembles, using group average hierarchical clustering, significantly outperformed co-association-based ensembles using the single and complete link algorithms, and showed competitive performance compared to co-association-based ensembles using the group average algorithm. In [12], the group average algorithm is shown to approximately minimize the maximum cluster variance. Such a model seems to be a better fit to the data summarised in Z. A further potential benefit of this work is that co-association-based consensus functions other than hierarchical methods, such as [3, 4], can also be adapted to the cluster-based cumulative representation, rendering them more efficient. This will be investigated in future work.

Acknowledgments

This work was partially funded by an NSERC strategic grant.

References

1. A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research (JMLR), 3:583-617, December 2002.
2. A. Fred and A. K. Jain. Data clustering using evidence accumulation. In Proceedings of the 16th International Conference on Pattern Recognition (ICPR 2002), volume 4, pages 276-280, Quebec City, Quebec, Canada, August 2002.
3. H. Ayad and M. Kamel. Finding natural clusters using multi-clusterer combiner based on shared nearest neighbors. In Multiple Classifier Systems: Fourth International Workshop, MCS 2003, UK, Proceedings, pages 166-175, 2003.
4. H. Ayad, O. Basir, and M. Kamel. A probabilistic model using information theoretic measures for cluster ensembles. In Multiple Classifier Systems: Fifth International Workshop, MCS 2004, Cagliari, Italy, Proceedings, pages 144-153, 2004.
5. L. I. Kuncheva and S. T. Hadjitodorov. Using diversity in cluster ensembles. In IEEE International Conference on Systems, Man and Cybernetics, Proceedings, The Hague, The Netherlands, 2004.
6. S. Dudoit and J. Fridlyand. Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9):1090-1099, 2003.
7. B. Minaei, A. Topchy, and W. Punch. Ensembles of partitions via data resampling. In Intl. Conf. on Information Technology: Coding and Computing (ITCC 2004), Proceedings, Las Vegas, April 2004.
8. B. Fischer and J. M. Buhmann. Bagging for path-based clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(11):1411-1415, 2003.
9. H. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97, 1955.
10. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2001.
11. L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
12. S. Kamvar, D. Klein, and C. Manning. Interpreting and extending classical agglomerative clustering algorithms using a model-based approach. In Proceedings of the 19th Int. Conf. on Machine Learning, pages 283-290, 2002.