From Comparing Clusterings to Combining Clusterings


Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Zhiwu Lu and Yuxin Peng and Jianguo Xiao
Institute of Computer Science and Technology, Peking University, Beijing 100871, China
{luzhiwu,pengyuxin,xjg}@icst.pku.edu.cn

Abstract

This paper presents a fast simulated annealing framework for combining multiple clusterings (i.e. clustering ensemble) based on some measures of agreement between partitions, which are originally used to compare two clusterings (the obtained clustering vs. a ground truth clustering) for the evaluation of a clustering algorithm. Though we can follow a greedy strategy to optimize these measures as objective functions of clustering ensemble, some local optima may be obtained and, at the same time, the computational cost is too large. To avoid the local optima, we instead consider a simulated annealing optimization scheme that operates through single label changes. Moreover, for those measures between partitions based on the relationship (joined or separated) of pairs of objects, such as the Rand index, we can update the objective incrementally for each label change, which makes the simulated annealing optimization scheme computationally feasible. The simulation and real-life experiments then demonstrate that the proposed framework can achieve superior results.

Introduction

Comparing clusterings plays an important role in the evaluation of clustering algorithms. A number of criteria have been proposed to measure how close the obtained clustering is to a ground truth clustering, such as mutual information (MI) (Strehl and Ghosh 2002), the Rand index (Rand 1971; Hubert and Arabie 1985), the Jaccard index (Denoeud and Guénoche 2006), and the Wallace index (Wallace 1983). One important application of these measures is the objective evaluation of image segmentation algorithms (Unnikrishnan, Pantofaru, and Hebert 2007), since image segmentation can be considered as a clustering problem.

Since the major difficulty of clustering combination lies just in finding a consensus partition from the ensemble of partitions, these measures for comparing clusterings can further be used as the objective functions of clustering ensemble. The only difference is that the consensus partition has to be compared to multiple partitions. Such consensus functions have been developed in (Strehl and Ghosh 2002) based on MI. Though a greedy strategy can be used to maximize normalized MI via single label changes, the computational cost is too large. Hence, we resort to those measures between partitions based on the relationship (joined or separated) of pairs of objects, such as the Rand index, Jaccard index, and Wallace index, which can be updated incrementally for each single label change. Moreover, to resolve the local convergence problem, we follow a simulated annealing optimization scheme, which is computationally feasible due to the incremental update of the objective function.

We have thus proposed a fast simulated annealing framework for clustering ensemble based on measures for comparing clusterings. There are three main advantages to the proposed framework: (1) it develops a series of consensus functions for clustering ensemble, not just one; (2) it avoids the local optima problem; (3) our consensus functions have low computational complexity, O(nk²r) for n objects, k clusters in the target partition, and r clusterings in the ensemble. Our framework is readily applicable to large data sets, as opposed to other consensus functions which are based on the co-association of objects in clusters from an ensemble, with quadratic complexity O(n²kr).
Moreover, unlike those algorithms that search for a consensus partition via re-labeling and subsequent voting, this framework can operate with arbitrary partitions with varying numbers of clusters, not constrained to a predetermined number of clusters in the ensemble partitions.

The rest of this paper is organized as follows. Section 2 describes relevant research on clustering combination. In section 3, we briefly introduce some measures for comparing clusterings and give three of them in detail. Section 4 then presents the simulated annealing framework for clustering ensemble based on the three measures. The experimental results on several data sets are presented in section 5, followed by the conclusions in section 6.

Motivation and Related Work

Approaches to the combination of clusterings differ in two main respects, namely the way in which the contributing component clusterings are obtained and the method by which they are combined. One important consensus function is proposed by (Fred and Jain 2005) to summarize various clustering results in a co-association matrix. Co-association values represent the strength of association between objects by analyzing how often each pair of objects appears in the same cluster. The co-association matrix then serves as a similarity matrix for the data items. The final clustering is formed from the co-association matrix by linking the objects whose co-association value exceeds a certain threshold. One drawback of the co-association consensus function is its quadratic computational complexity in the number of objects, O(n²). Moreover, experiments in (Topchy, Jain, and Punch 2005) show that co-association methods are usually unreliable when the number of clusterings is r < 50.
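To make the co-association idea concrete, here is a minimal Python sketch of the above scheme (ours, not code from any of the cited papers; numpy and scipy are assumed, and the 0.5 threshold is illustrative). It averages pairwise co-occurrence over the ensemble and links pairs above the threshold via connected components:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def co_association(ensemble):
    """Average n x n co-occurrence matrix over r label vectors."""
    n = len(ensemble[0])
    co = np.zeros((n, n))
    for labels in ensemble:
        labels = np.asarray(labels)
        co += (labels[:, None] == labels[None, :])  # 1 where a pair is joined
    return co / len(ensemble)

# Final clustering: link objects whose co-association exceeds a threshold.
ensemble = [np.array([0, 0, 1, 1]), np.array([0, 0, 1, 2]), np.array([1, 1, 0, 0])]
co = co_association(ensemble)
_, consensus = connected_components(csr_matrix(co > 0.5), directed=False)
print(consensus)  # e.g. [0 0 1 1]

Building the full n x n matrix is exactly the O(n²) cost criticized above.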

Some hypergraph-based consensus functions have also been developed in (Strehl and Ghosh 2002). All the clusters in the ensemble partitions can be represented as hyperedges on a graph with n vertices. Each hyperedge describes a set of objects belonging to the same cluster. A consensus function can then be formulated as a solution to a k-way min-cut hypergraph partitioning problem. One hypergraph-based method is the meta-clustering algorithm (MCLA), which also uses hyperedge collapsing operations to determine soft cluster membership values for each object. Hypergraph methods seem to work best for nearly balanced clusters.

A different consensus function has been developed in (Topchy, Jain, and Punch 2003) based on information-theoretic principles. An elegant solution can be obtained from a generalized definition of MI, namely quadratic MI (QMI), which can be effectively maximized by the k-means algorithm in the space of specially transformed cluster labels of the given ensemble. However, it is sensitive to initialization due to the local optimization scheme of k-means.

In (Dudoit and Fridlyand 2003; Fischer and Buhmann 2003), a combination of partitions by re-labeling and voting is implemented. These works pursue direct re-labeling approaches to the correspondence problem. A re-labeling can be done optimally between two clusterings using the Hungarian algorithm. After an overall consistent re-labeling, voting can be applied to determine the cluster membership of each object. However, this voting method needs a very large number of clusterings to obtain a reliable result.

A probabilistic model of consensus is offered by (Topchy, Jain, and Punch 2005) using a finite mixture of multinomial distributions in the space of cluster labels. A combined partition is found as a solution to the corresponding maximum likelihood problem using the EM algorithm. Since the EM consensus function needs to estimate too many parameters, accuracy degradation will inevitably occur with an increasing number of partitions when the sample size is fixed.

To summarize, existing consensus functions suffer from a number of drawbacks that include computational complexity, the heuristic character of the objective function, and the uncertain statistical status of the consensus solution. This paper aims to overcome these drawbacks by developing a fast simulated annealing framework for combining multiple clusterings based on measures for comparing clusterings.

Measures for Comparing Clusterings

This section first presents the basic notation for comparing two clusterings, and then introduces three measures of agreement between partitions which will be used for combining multiple clusterings in the rest of the paper.

Notations and Problem Statement

Let λ^a and λ^b be two clusterings of the sample data set X = {x_t}_{t=1}^n, with k_a and k_b groups respectively. To compare these two clusterings, we first have to give a quantitative measure of agreement between them. In the case of evaluating a clustering algorithm, this means that we have to show how close the obtained clustering is to a ground truth clustering.
Since these measures will further be used as objective functions of clustering ensemble, it is important that we can update them incrementally for a single label change. Computing the new objective function in this way leads to much less computational cost. Hence, we focus on those measures which can be specified as:

    S(λ^a, λ^b) = f({n_i^a}_{i=1}^{k_a}, {n_j^b}_{j=1}^{k_b}, {n_{ij}}),   (1)

where n_i^a is the number of objects in cluster C_i according to λ^a, n_j^b is the number of objects in cluster C_j according to λ^b, and n_{ij} denotes the number of objects that are in cluster C_i according to λ^a as well as in group C_j according to λ^b. When an object (which is in C_j according to λ^b) moves from cluster C_i to cluster C_{i'} according to λ^a, only the following updates arise for this single label change:

    n̂_i^a = n_i^a − 1,   n̂_{i'}^a = n_{i'}^a + 1,   (2)
    n̂_{ij} = n_{ij} − 1,   n̂_{i'j} = n_{i'j} + 1.   (3)

According to (1), S(λ^a, λ^b) may then be updated incrementally. Though many measures for comparing clusterings can be represented as (1), in the following we will focus on one special type of measures based on the relationship (joined or separated) of pairs of objects, such as the Rand index, Jaccard index, and Wallace index.

The comparison of partitions for this type of measures is based on the pairs of objects of X. Two partitions λ^a and λ^b agree on a pair of objects x and x' if these objects are simultaneously joined or separated in both. On the other hand, there is a disagreement if x and x' are joined in one of them and separated in the other. Let n_A be the number of pairs simultaneously joined, n_B the number of pairs joined in λ^a and separated in λ^b, n_C the number of pairs separated in λ^a and joined in λ^b, and n_D the number of pairs simultaneously separated. According to (Hubert and Arabie 1985), we have

    n_A = Σ_{i,j} C(n_{ij}, 2),   n_B = Σ_i C(n_i^a, 2) − n_A,   n_C = Σ_j C(n_j^b, 2) − n_A,

where C(m, 2) = m(m − 1)/2 counts the pairs among m objects. Moreover, we can easily obtain n_D = C(n, 2) − n_A − n_B − n_C.

Rand Index

The Rand index is a popular nonparametric measure in the statistics literature and works by counting pairs of objects that have compatible label relationships in the two clusterings to be compared. More formally, the Rand index (Rand 1971) can be computed as the ratio of the number of pairs of objects having the same label relationship in λ^a and λ^b:

    R(λ^a, λ^b) = (n_A + n_D) / C(n, 2),   (4)

where n_A + n_D = C(n, 2) + 2 Σ_{i,j} C(n_{ij}, 2) − Σ_i C(n_i^a, 2) − Σ_j C(n_j^b, 2).
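As a concrete illustration of these pair counts (our sketch, not the authors' code; integer labels 0..k−1 and numpy are assumed), the contingency table {n_ij} of two label vectors yields n_A, n_B, n_C, n_D and the Rand index (4) directly:

import numpy as np

def contingency(a, b):
    """Contingency table n_ij for two label vectors with labels 0..k-1."""
    a, b = np.asarray(a), np.asarray(b)
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(table, (a, b), 1)
    return table

def pair_counts(a, b):
    """n_A, n_B, n_C, n_D of (Hubert and Arabie 1985)."""
    table = contingency(a, b)
    c2 = lambda m: m * (m - 1) // 2           # C(m, 2), elementwise
    n_A = c2(table).sum()
    n_B = c2(table.sum(axis=1)).sum() - n_A   # joined in a, separated in b
    n_C = c2(table.sum(axis=0)).sum() - n_A   # separated in a, joined in b
    n_D = c2(len(np.asarray(a))) - n_A - n_B - n_C
    return n_A, n_B, n_C, n_D

def rand_index(a, b):
    n_A, n_B, n_C, n_D = pair_counts(a, b)    # equation (4)
    return (n_A + n_D) / (n_A + n_B + n_C + n_D)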

A problem with the Rand index is that its expected value for two random partitions does not take a constant value. The corrected Rand index proposed by (Hubert and Arabie 1985) assumes the generalized hypergeometric distribution as the model of randomness, i.e., the two partitions λ^a and λ^b are picked at random such that the numbers of objects in the clusters are fixed. Under this model, the corrected Rand index can be given as:

    CR(λ^a, λ^b) = (Σ_{i,j} C(n_{ij}, 2) − h_a h_b / C(n, 2)) / ((h_a + h_b)/2 − h_a h_b / C(n, 2)),   (5)

where h_a = Σ_i C(n_i^a, 2) and h_b = Σ_j C(n_j^b, 2). In the following, we actually use this version of the Rand index for combining multiple clusterings.

Jaccard Index

In the Rand index, the pairs simultaneously joined or separated are counted in the same way. However, partitions are often interpreted as classes of joined objects, the separations being the consequences of this clustering. We then use the Jaccard index (Denoeud and Guénoche 2006), noted J, which does not consider the n_D simultaneous separations:

    J(λ^a, λ^b) = n_A / n_D̄ = Σ_{i,j} C(n_{ij}, 2) / (h_a + h_b − Σ_{i,j} C(n_{ij}, 2)),   (6)

where n_D̄ = n_A + n_B + n_C = h_a + h_b − n_A.

Wallace Index

This index is very natural: it is the number of joined pairs common to the two partitions λ^a and λ^b divided by the number of possible pairs (Wallace 1983):

    W(λ^a, λ^b) = n_A / sqrt(h_a h_b) = Σ_{i,j} C(n_{ij}, 2) / sqrt(h_a h_b).   (7)

The one-sided quantity n_A / h_a depends on the partition of reference and, if we do not want to favor either λ^a or λ^b, the geometric average is used as above.

The Proposed Framework

The above measures of agreement between partitions for comparing clusterings are further used as objective functions of clustering ensemble. In this section, we first give details about the clustering ensemble problem, and then present a fast simulated annealing framework for combining multiple clusterings that operates through single label changes to optimize these measure-based objective functions.

The Clustering Ensemble Problem

Given a set of r partitions Λ = {λ^q | q = 1, ..., r}, with the q-th partition λ^q having k_q clusters, the consensus function Γ for combining multiple clusterings can be defined just as in (Strehl and Ghosh 2002):

    Γ : Λ → λ*,   N^{n×r} → N^n,   (8)

which maps a set of clusterings to an integrated clustering. If there is no prior information about the relative importance of the individual groupings, then a reasonable goal for the consensus answer is to seek a clustering that shares the most information with the original clusterings. More precisely, based on the measure of agreement (i.e. shared information) between partitions, we can define a measure between a set of r partitions Λ and a single partition λ as the average shared information:

    S(λ, Λ) = (1/r) Σ_{q=1}^r S(λ, λ^q).   (9)

Hence, the problem of clustering ensemble is just to find a consensus partition λ* of the data set X that maximizes the objective function S(λ, Λ) over the gathered partitions Λ:

    λ* = argmax_λ (1/r) Σ_{q=1}^r S(λ, λ^q).   (10)

The desired number of clusters k in the consensus clustering λ* deserves a separate discussion that is beyond the scope of this paper. Here, we simply assume that the target number of clusters is predetermined for the consensus clustering. More details about this model selection problem can be found in (Figueiredo and Jain 2002).

To update the objective function of clustering ensemble incrementally, we have to consider those measures which take the form of (1). Though many measures for comparing clusterings can be represented as (1), we will focus on the special type of measures based on the relationship (joined or separated) of pairs of objects in the following. In practice, only three measures, i.e. the Rand index, Jaccard index, and Wallace index, are used as the objective functions of clustering ensemble.
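Continuing the sketch above (again illustrative rather than the authors' code), the corrected Rand (5), Jaccard (6), and Wallace (7) indices follow from h_a, h_b, and n_A, and the ensemble objective (9) is just their average over the r partitions:

import numpy as np

def agreement_indices(a, b):
    """Corrected Rand (5), Jaccard (6), Wallace (7); reuses pair_counts above."""
    n_A, n_B, n_C, n_D = pair_counts(a, b)
    h_a, h_b = n_A + n_B, n_A + n_C   # h_a = sum_i C(n_i^a,2), h_b = sum_j C(n_j^b,2)
    total = n_A + n_B + n_C + n_D     # C(n, 2)
    expected = h_a * h_b / total
    cr = (n_A - expected) / ((h_a + h_b) / 2.0 - expected)
    jac = n_A / (h_a + h_b - n_A)
    wal = n_A / np.sqrt(h_a * h_b)
    return cr, jac, wal

def ensemble_objective(labels, ensemble, which=0):
    """S(lambda, Lambda) in (9): mean agreement with the r ensemble partitions."""
    return np.mean([agreement_indices(labels, lq)[which] for lq in ensemble])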
Moreover, to resolve the local convergence problem of the greedy optimization strategy, we further take into account the simulated annealing scheme. Note that our clustering ensemble algorithms developed in the following can be modified slightly when other types of measures specified as (1) are used as objective functions. Hence, we have actually presented a general simulated annealing framework for combining multiple clusterings.

Clustering Ensemble via Simulated Annealing

Given a set of r partitions Λ = {λ^q | q = 1, ..., r}, the objective function of clustering ensemble can be set as the measure between a single partition λ and Λ in (9). The measure S(λ, λ^q) between λ and λ^q can be the Rand index, the Jaccard index, or the Wallace index. According to (5)-(7), we can set S(λ, λ^q) to any of the following three measures:

    S(λ, λ^q) = (h_0^q − h h^q / C(n, 2)) / ((h + h^q)/2 − h h^q / C(n, 2)),   (11)
    S(λ, λ^q) = h_0^q / (h + h^q − h_0^q),   (12)
    S(λ, λ^q) = h_0^q / sqrt(h h^q),   (13)

where h_0^q = Σ_{i,j} C(n_{ij}^q, 2), h = Σ_i C(n_i, 2), and h^q = Σ_j C(n_j^q, 2). Here, the frequency counts are denoted a little differently from (1): n_i is the number of objects in cluster C_i according to λ, n_j^q is the number of objects in cluster C_j according to λ^q, and n_{ij}^q is the number of objects that are in cluster C_i according to λ and in cluster C_j according to λ^q. The corresponding algorithms based on these three measures under the simulated annealing optimization scheme are denoted SA-RI, SA-JI, and SA-WI, respectively.

To find the consensus partition from the multiple clusterings Λ, we can maximize the objective function S(λ, Λ) through single label changes. That is, we randomly select an object x_t from the data set X = {x_t}_{t=1}^n and change its label λ(x_t) = i to another randomly selected label i', i.e., move it from its current cluster C_i to another cluster C_{i'}. Such a single label change only leads to the following updates:

    n̂_i = n_i − 1,   n̂_{i'} = n_{i'} + 1,   (14)
    n̂_{ij}^q = n_{ij}^q − 1,   n̂_{i'j}^q = n_{i'j}^q + 1,   (15)

where j = λ^q(x_t) (q = 1, ..., r). For each λ^q ∈ Λ, to update S(λ, λ^q), we can first calculate h and h_0^q incrementally:

    ĥ = h − n_i + n_{i'} + 1,   (16)
    ĥ_0^q = h_0^q − n_{ij}^q + n_{i'j}^q + 1.   (17)

Note that h^q stays fixed for each label change. Hence, we can obtain the new Ŝ(λ, λ^q) according to (11)-(13), and the new objective function Ŝ(λ, Λ) is just the mean of {Ŝ(λ, λ^q)}_{q=1}^r. It is worth pointing out that the update of the objective function has only linear time complexity O(r) for a single label change, which makes the simulated annealing scheme computationally feasible for the maximization of S(λ, Λ).

We further adopt a simplified simulated annealing scheme to determine whether to accept the single label change λ(x_t): i → i'. At a temperature T, the probability of selecting the single label change λ(x_t): i → i' is calculated as follows:

    P(λ(x_t): i → i') = 1 if ΔS > 0, and exp(ΔS / T) otherwise,   (18)

where ΔS = Ŝ(λ, Λ) − S(λ, Λ). We actually select the single label change if P(λ(x_t): i → i') is higher than a threshold P_0 (0 < P_0 < 1); otherwise, we discard it and try the next single label change. The complete description of our simulated annealing framework for clustering ensemble is summarized in Table 1; the time complexity is O(nk²r).

Table 1: Clustering Ensemble via Simulated Annealing

Input:
  1. A set of r partitions Λ = {λ^q | q = 1, ..., r}
  2. The desired number of clusters k
  3. The threshold for selecting a label change, P_0
  4. The cooling ratio c (0 < c < 1)
Output: The consensus clustering λ*
Process:
  1. Select a candidate clustering λ by some combination method, and set the temperature T = T_0.
  2. Start a loop with all objects set unvisited (v(t) = 0, t = 1, ..., n). Randomly select an unvisited object x_t from X, and try changing the label λ(x_t) to each of the other k − 1 labels. If a label change is selected according to (18), we immediately set v(t) = 1 and try a new unvisited object. If there is no label change for x_t, we also set v(t) = 1 and go to a new object. The loop stops when all objects are visited.
  3. Set T = c · T, and go to step 2. If there is no label change during two successive loops, stop the algorithm and output λ* = λ.
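The following Python sketch is a compact reading of Table 1 for the SA-RI objective (11), with the incremental updates (14)-(17) and the acceptance rule (18). It is our illustration, not the authors' Matlab implementation; 0-based integer labels, numpy, and a random initialization (the paper initializes with k-modes) are assumed.

import numpy as np

def sa_consensus(ensemble, k, p0=0.85, c=0.99, seed=0):
    """Consensus clustering via simulated annealing (Table 1), SA-RI variant."""
    rng = np.random.default_rng(seed)
    ensemble = [np.asarray(lq) for lq in ensemble]
    n = ensemble[0].size
    total = n * (n - 1) // 2                              # C(n, 2)
    c2 = lambda m: m * (m - 1) // 2
    lam = rng.integers(0, k, size=n)                      # candidate partition
    cont = [np.zeros((k, lq.max() + 1), dtype=np.int64) for lq in ensemble]
    for cq, lq in zip(cont, ensemble):
        np.add.at(cq, (lam, lq), 1)                       # contingency tables n_ij^q
    sizes = np.bincount(lam, minlength=k)                 # n_i
    h = c2(sizes).sum()                                   # h = sum_i C(n_i, 2)
    h0 = np.array([c2(cq).sum() for cq in cont], float)   # h_0^q
    hq = np.array([c2(np.bincount(lq)).sum() for lq in ensemble], float)  # fixed h^q

    def s_mean(h, h0):                                    # averaged corrected Rand, (9) + (11)
        e = h * hq / total
        return np.mean((h0 - e) / ((h + hq) / 2.0 - e))

    s = s_mean(h, h0)
    t, quiet = max(0.1 * abs(s), 1e-9), 0                 # T_0 = 0.1 S_0 (see experiments)
    while quiet < 2:                                      # stop after two changeless loops
        changed = False
        for obj in rng.permutation(n):
            i = lam[obj]
            for i2 in rng.permutation(k):                 # try the other k - 1 labels
                if i2 == i:
                    continue
                h_new = h - sizes[i] + sizes[i2] + 1      # update (16)
                h0_new = np.array([h0[q] - cont[q][i, lq[obj]] + cont[q][i2, lq[obj]] + 1
                                   for q, lq in enumerate(ensemble)])  # update (17)
                ds = s_mean(h_new, h0_new) - s
                if ds > 0 or np.exp(ds / t) > p0:         # acceptance rule (18)
                    for cq, lq in zip(cont, ensemble):    # commit updates (14)-(15)
                        cq[i, lq[obj]] -= 1
                        cq[i2, lq[obj]] += 1
                    sizes[i] -= 1; sizes[i2] += 1
                    lam[obj] = i2
                    h, h0, s = h_new, h0_new, s + ds
                    changed = True
                    break                                 # one accepted move per object
        t *= c                                            # cooling schedule
        quiet = 0 if changed else quiet + 1
    return lam

Initializing lam from a cheap consensus (the paper uses k-modes) rather than at random would match step 1 of Table 1 more closely.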

Experimental Results

The experiments are conducted with artificial and real-life data sets, where the true natural clusters are known, to validate both the accuracy and robustness of consensus via our simulated annealing framework. We explore the data sets using seven different consensus functions.

Data Sets

The details of the four data sets used in the experiments are summarized in Table 2. Two artificial data sets, 2-spirals and half-rings, are shown in Figure 1; they are difficult for any centroid-based clustering algorithm. We also use two real-life data sets, iris and wine, from the UCI benchmark repository. Since the last feature of the wine data is far larger than the others, we first regularize the features into the interval [0, 10]; a scaling of this kind is sketched below.
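The regularization step can be as simple as min-max scaling each feature column into [0, 10] (our sketch; whether the paper scales all features or only the dominant one is not specified):

import numpy as np

def rescale(X, lo=0.0, hi=10.0):
    """Min-max scale each column of X into [lo, hi]."""
    X = np.asarray(X, dtype=float)
    mn, mx = X.min(axis=0), X.max(axis=0)
    return lo + (hi - lo) * (X - mn) / np.where(mx > mn, mx - mn, 1.0)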
Note that the other three data sets are kept unchanged.

Table 2: Details of the four data sets. The average clustering error is obtained by the k-means algorithm.

Data set     #features   k   n     Avg. error (%)
2-spirals    2           2   190   42.5
half-rings   2           2   500   26.4
iris         4           3   150   12.7
wine         13          3   178   28.4

The average clustering errors obtained by the k-means algorithm over 20 independent runs on the four data sets are listed in Table 2 and are considered as baselines for the consensus functions. As for the regularization of the wine data, the average error of the k-means algorithm is decreased from 36.3% to 28.4% over 20 independent runs. Here, we evaluate the performance of a clustering algorithm by matching the detected and the known partitions of the data sets, just as in (Topchy, Jain, and Punch 2005). The best possible matching of clusters provides a measure of performance expressed as the misassignment rate. To determine the clustering error, one needs to solve the correspondence problem between the labels of the known and derived clusters. The optimal correspondence can be obtained using the Hungarian method for the minimal weight bipartite matching problem, with O(k³) complexity for k clusters.
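The optimal correspondence can be computed with an off-the-shelf Hungarian solver; for instance (our illustration, scipy assumed):

import numpy as np
from scipy.optimize import linear_sum_assignment

def misassignment_rate(true_labels, pred_labels):
    """Clustering error after the best one-to-one matching of cluster labels."""
    t, p = np.asarray(true_labels), np.asarray(pred_labels)
    k = int(max(t.max(), p.max())) + 1
    overlap = np.zeros((k, k), dtype=np.int64)
    np.add.at(overlap, (t, p), 1)                  # confusion counts
    rows, cols = linear_sum_assignment(-overlap)   # maximize matched objects
    return 1.0 - overlap[rows, cols].sum() / t.size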

Figure 1: Two artificial data sets difficult for any centroid-based clustering algorithms: (a) 2-spirals; (b) half-rings.

Table 3: Average error rate (%) on the 2-spirals data set. The k-means algorithm randomly selects k ∈ [4, 7] to generate the r partitions for the different combination methods.

r    SA-RI   SA-JI   SA-WI   k-modes   EM     QMI    MCLA
10   37.5    39.6    38.7    45.       45.    46.8   39.3
20   35.9    37.8    37.3    43.8      44.4   47.8   37.6
30   36.0    37.0    39.3    4.        43.6   47.3   40.
40   37.6    39.7    37.6    40.8      4.     46.9   38.4
50   36.     39.     36.     4.8       43.9   44.4   36.4

Selection of Parameters and Algorithms

To implement our simulated annealing framework for clustering ensemble, we have to select two important parameters, i.e., the threshold P_0 for selecting a label change and the cooling ratio c (0 < c < 1). When the cooling ratio c takes a larger value, we may obtain a better solution but the algorithm may converge more slowly. Meanwhile, when the threshold P_0 is larger, the algorithm may converge faster but the local optima are avoided with a lower probability. To achieve a tradeoff between clustering accuracy and speed, we simply set P_0 = 0.85 and c = 0.99 in all the experiments. Moreover, the temperature is initialized by T_0 = 0.1 S_0, where S_0 is the initial value of the objective function.

Our three simulated annealing methods (i.e. SA-RI, SA-JI, and SA-WI) for clustering combination are also compared to four other consensus functions:

1. The k-modes algorithm for consensus clustering in this paper, originally developed for categorical clustering (Huang 1998).
2. The EM algorithm for consensus clustering via the mixture model (Topchy, Jain, and Punch 2005).
3. The QMI approach described in (Topchy, Jain, and Punch 2003), which is actually implemented by the k-means algorithm in the space of specially transformed cluster labels of the given ensemble.
4. MCLA, which is a hypergraph method introduced in (Strehl and Ghosh 2002); the code is available at http://www.strehl.com.

Note that our methods are initialized by k-modes just because this algorithm runs very fast; other consensus functions can be used as initializations similarly. Since the co-association methods have O(n²) complexity and may lead to severe computational limitations, our methods are not compared to these algorithms. The performance of the co-association methods has already been analyzed in (Topchy, Jain, and Punch 2003).

Table 4: Average error rate (%) on the half-rings data set. The k-means algorithm randomly selects k ∈ [3, 5] to generate the r partitions for the different combination methods.

r    SA-RI   SA-JI   SA-WI   k-modes   EM     QMI    MCLA
10   0.4     .4      0.3     6.9       6.4    5.7    4.6
20   8.5     .5      3.5     7.7       4.4    5.3    9.9
30   8.      0.4     9.0     5.        6.9    4.6    4.9
40   7.6     7.7     9.      8.5       7.5    5.9    3.5
50   8.3     9.4     0.0     9.3       8.5    6.6    .7

The k-means algorithm is used as the method of generating the partitions for the combination. Diversity of the partitions is ensured by: (1) initializing the algorithm randomly; (2) selecting the number of clusters k randomly. In the experiments, we actually give k a random value around the number of true natural clusters k* (k ≥ k*). We have found that this method of generating partitions leads to better results than random initialization alone. Moreover, we vary the number of combined clusterings r in the range [10, 50]. The sketch below illustrates this generation scheme.
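The diversity scheme just described (random initialization plus a random number of clusters) can be sketched with scikit-learn's k-means; the library choice is ours, for illustration:

import numpy as np
from sklearn.cluster import KMeans

def generate_ensemble(X, r, k_range, seed=0):
    """Generate r diverse partitions by k-means with random init and random k."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(r):
        k = int(rng.integers(k_range[0], k_range[1] + 1))
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=int(rng.integers(0, 2**31 - 1)))
        ensemble.append(km.fit_predict(X))
    return ensemble

Feeding such an ensemble to the sa_consensus sketch above reproduces the overall pipeline of the experiments.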
Comparison with Other Consensus Functions

Only the main results for each of the four data sets are presented in Tables 3-6 due to space limitations. We have also initialized our simulated annealing methods by other consensus functions besides k-modes, and similar results can be obtained. The tables report the average error rate (%) of clustering combination over 20 independent runs. The first observation is that our simulated annealing methods (especially SA-RI) perform generally better than the other consensus functions. Since our methods only lead to slightly higher clustering errors in a few cases as compared with MCLA, we can consider our methods preferred under an overall evaluation.

Table 5: Average error rate (%) on the iris data set. The k-means algorithm randomly selects k ∈ [3, 5] to generate the r partitions for the different combination methods.

r    SA-RI   SA-JI   SA-WI   k-modes   EM     QMI    MCLA
10   0.7     0.7     0.6     3.4       .3     4.3    0.4
20   0.6     0.9     0.8     .9        7.5    4.8    0.6
30   0.7     0.7     0.9     3.        8.     .3     0.5
40   0.7     .8      0.7     .6        6.6    3.9    0.7
50   0.7     0.7     0.7     9.9       6.9    .6     0.7

Table 6: Average error rate (%) on the wine data set. The k-means algorithm randomly selects k ∈ [4, 6] to generate the r partitions for the different combination methods.

r    SA-RI   SA-JI   SA-WI   k-modes   EM     QMI    MCLA
10   6.5     6.7     6.5     .3        7.     8.8    7.6
20   6.5     6.5     6.3     .4        7.9    0.4    8.5
30   6.4     6.3     6.3     .4        3.     7.5    7.4
40   6.3     6.3     6.      0.        7.     7.4    7.5
50   6.3     6.      6.      8.        .      7.3    7.8

Figure 2: The ascent of the corrected Rand index over the number of loops on two real-life data sets (only SA-RI considered): (a) iris; (b) wine.

Among our three methods, SA-RI performs the best in general. All co-association methods are usually unreliable with r < 50, and this is exactly where our methods are positioned. The k-modes, EM, and QMI consensus functions all have the local convergence problem. Since our methods are initialized by k-modes, we can see that local optima are successfully avoided due to the simulated annealing optimization scheme. Figure 2 further shows the ascent of the corrected Rand index on the two real-life data sets (only SA-RI with r = 30 considered) during optimization.

Moreover, it is also interesting to note that, as expected, the average error of consensus clustering by our simulated annealing methods is lower than the average error of the k-means clusterings in the ensemble (Table 2) when k is chosen to be equal to the true number of clusters k*. Finally, the average time taken by our three methods (Matlab code) is less than 30 seconds per run on a 2 GHz PC in all cases. As reported in (Strehl and Ghosh 2002), experiments with n = 400, k = 10, r = 8 average one hour using the greedy algorithm based on normalized MI (similar to our methods). However, our methods only take about 10 seconds in this case, i.e., our methods are computationally feasible in spite of the costly annealing procedure.

Conclusions

We have proposed a fast simulated annealing framework for combining multiple clusterings based on measures for comparing clusterings. When the objective functions of clustering ensemble are specified as those measures based on the relationship of pairs of objects in the data set, we can update them incrementally for each single label change, which makes the proposed simulated annealing optimization scheme computationally feasible. The simulation and real-life experiments then demonstrate that the proposed framework can achieve superior results. Since clustering ensemble is actually equivalent to categorical clustering, our methods will further be evaluated in this application in future work.

Acknowledgements

This work was fully supported by the National Natural Science Foundation of China under Grant No. 60503062, the Beijing Natural Science Foundation of China under Grant No. 4082015, and the Program for New Century Excellent Talents in University under Grant No. NCET-06-0009.

References

Denoeud, L., and Guénoche, A. 2006. Comparison of distance indices between partitions. In Proceedings of IFCS 2006: Data Science and Classification, 21-28.

Dudoit, S., and Fridlyand, J. 2003. Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9):1090-1099.

Figueiredo, M. A. T., and Jain, A. K. 2002. Unsupervised learning of finite mixture models. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(3):381-396.

Fischer, R. B., and Buhmann, J. M. 2003. Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(4):513-518.

Fred, A. L. N., and Jain, A. K. 2005. Combining multiple clusterings using evidence accumulation. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(6):835-850.
Huang, Z. 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2:283-304.

Hubert, L., and Arabie, P. 1985. Comparing partitions. Journal of Classification 2:193-218.

Rand, W. M. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66:846-850.

Strehl, A., and Ghosh, J. 2002. Cluster ensembles - a knowledge reuse framework for combining partitionings. In Proceedings of the Conference on Artificial Intelligence (AAAI), 93-99.

Topchy, A.; Jain, A. K.; and Punch, W. 2003. Combining multiple weak clusterings. In Proceedings of the IEEE International Conference on Data Mining, 331-338.

Topchy, A.; Jain, A. K.; and Punch, W. 2005. Clustering ensembles: models of consensus and weak partitions. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(12):1866-1881.

Unnikrishnan, R.; Pantofaru, C.; and Hebert, M. 2007. Toward objective evaluation of image segmentation algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence 29(6):929-944.

Wallace, D. L. 1983. Comment on a method for comparing two hierarchical clusterings. Journal of the American Statistical Association 78:569-576.