
Cover Feature

Chameleon: Hierarchical Clustering Using Dynamic Modeling

Many advanced algorithms have difficulty dealing with highly variable clusters that do not follow a preconceived model. By basing its selections on both interconnectivity and closeness, the Chameleon algorithm yields accurate results for these highly variable clusters.

George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar
University of Minnesota

Clustering is a discovery process in data mining.[1] It groups a set of data in a way that maximizes the similarity within clusters and minimizes the similarity between two different clusters.[1,2] These discovered clusters can help explain the characteristics of the underlying data distribution and serve as the foundation for other data mining and analysis techniques. Clustering is useful in characterizing customer groups based on purchasing patterns, categorizing Web documents,[3] grouping genes and proteins that have similar functionality,[4] grouping spatial locations prone to earthquakes based on seismological data, and so on.

Most existing clustering algorithms find clusters that fit some static model. Although effective in some cases, these algorithms can break down (that is, cluster the data incorrectly) if the user doesn't select appropriate static-model parameters, or the model may simply be unable to capture the clusters' characteristics. Most of these algorithms break down when the data contains clusters of diverse shapes, densities, and sizes.

Existing algorithms use a static model of the clusters and do not use information about the nature of individual clusters as they are merged. Furthermore, one set of schemes (the CURE algorithm and related schemes) ignores information about the aggregate interconnectivity of items in two clusters. The other set of schemes (the ROCK algorithm, the group-averaging method, and related schemes) ignores information about the closeness of two clusters as defined by the similarity of the closest items across the two clusters. (For more information, see the "Limitations of Traditional Clustering Algorithms" sidebar.)

By considering only either interconnectivity or closeness, these algorithms can easily select and merge the wrong pair of clusters. For instance, an algorithm that focuses only on the closeness of two clusters will incorrectly merge the clusters in Figure 1a over those in Figure 1b. Similarly, an algorithm that focuses only on interconnectivity will, in Figure 2, incorrectly merge the dark-blue with the red cluster rather than the green one. Here, we assume that the aggregate interconnectivity between the items in the dark-blue and red clusters is greater than that of the dark-blue and green clusters. However, the border points of the dark-blue cluster are much closer to those of the green cluster than to those of the red cluster.

Limitations of Traditional Clustering Algorithms

Partition-based clustering techniques such as K-means[2] and CLARANS[6] attempt to break a data set into K clusters such that the partition optimizes a given criterion. These algorithms assume that clusters are hyper-ellipsoidal and of similar sizes. They can't find clusters that vary in size, as shown in Figure A1, or that have concave shapes, as shown in Figure A2.

DBScan[7] (Density-Based Spatial Clustering of Applications with Noise), a well-known spatial clustering algorithm, can find clusters of arbitrary shapes. DBScan defines a cluster to be a maximum set of density-connected points, which means that every core point in a cluster must have at least a minimum number of points (MinPts) within a given radius (Eps). DBScan assumes that all points within genuine clusters can be reached from one another by traversing a path of density-connected points, and that points across different clusters cannot. DBScan can find arbitrarily shaped clusters if the cluster density can be determined beforehand and the cluster density is uniform.

Hierarchical clustering algorithms produce a nested sequence of clusters with a single, all-inclusive cluster at the top and single-point clusters at the bottom. Agglomerative hierarchical algorithms[2] start with each data point as a separate cluster. Each step of the algorithm involves merging the two clusters that are the most similar. After each merger, the total number of clusters decreases by one. Users can repeat these steps until they obtain the desired number of clusters or until the distance between the two closest clusters goes above a certain threshold. The many variations of agglomerative hierarchical algorithms[2] primarily differ in how they update the similarity between existing and merged clusters.

In some hierarchical methods, each cluster is represented by a centroid or medoid (a data point that is the closest to the center of the cluster), and the similarity between two clusters is measured by the similarity between the centroids/medoids. Both of these schemes fail for data in which points in a given cluster are closer to the center of another cluster than to the center of their own cluster. This situation occurs in many natural clusters,[4,8] for example, if there is a large variation in cluster sizes, as in Figure A1, or when cluster shapes are concave, as in Figure A2.

The single-link hierarchical method[2] measures the similarity between two clusters by the similarity of the closest pair of data points belonging to different clusters. Unlike the centroid/medoid-based methods, this method can find clusters of arbitrary shape and different sizes. However, it is highly susceptible to noise, outliers, and artifacts.

Researchers have proposed CURE[9] (Clustering Using Representatives) to remedy the drawbacks of both of these methods while combining their advantages. In CURE, a cluster is represented by selecting a constant number of well-scattered points and shrinking them toward the cluster's centroid according to a shrinking factor. CURE measures the similarity between two clusters by the similarity of the closest pair of points belonging to different clusters. Unlike centroid/medoid-based methods, CURE can find clusters of arbitrary shapes and sizes, as it represents each cluster via multiple representative points. Shrinking the representative points toward the centroid allows CURE to avoid some of the problems associated with noise and outliers.

However, these techniques fail to account for special characteristics of individual clusters. They can make incorrect merging decisions when the underlying data does not follow the assumed model or when noise is present.

In some algorithms, the similarity between two clusters is captured by the aggregate of the similarities (that is, the interconnectivity) among pairs of items belonging to different clusters. The rationale for this approach is that subclusters belonging to the same cluster will tend to have high interconnectivity. But the aggregate interconnectivity between two clusters depends on the size of the clusters; in general, pairs of larger clusters will have higher interconnectivity. Many such schemes therefore normalize the aggregate similarity between a pair of clusters with respect to the expected interconnectivity of the clusters involved. For example, the widely used group-average method[2] assumes fully connected clusters and thus scales the aggregate similarity between two clusters by n × m, where n and m are the numbers of members in the two clusters. ROCK[8] (Robust Clustering Using Links), a recently developed algorithm that operates on a derived similarity graph, scales the aggregate interconnectivity with respect to a user-specified interconnectivity model. The major limitation of all such schemes is that they assume a static, user-supplied interconnectivity model. Such models are inflexible and can easily lead to incorrect merging decisions when the model under- or overestimates the interconnectivity of the data set or when different clusters exhibit different interconnectivity characteristics. Although some schemes allow the connectivity to vary for different problem domains (as does ROCK[8]), it is still the same for all clusters irrespective of their densities and shapes.

Figure A. Data sets on which centroid and medoid approaches fail: clusters (1) of widely different sizes and (2) with concave shapes.

CHAMELEON: CLUSTERING USING DYNAMIC MODELING

Chameleon is a new agglomerative hierarchical clustering algorithm that overcomes the limitations of existing clustering algorithms. Figure 3 provides an overview of the overall approach used by Chameleon to find the clusters in a data set. The Chameleon algorithm's key feature is that it accounts for both interconnectivity and closeness in identifying the most similar pair of clusters. It thus avoids the limitations discussed earlier. Furthermore, Chameleon uses a novel approach to model the degree of interconnectivity and closeness between each pair of clusters. This approach considers the internal characteristics of the clusters themselves. Thus, it does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the merged clusters.
Chameleon operates on a sparse graph in which nodes represent data items and weighted edges represent similarities among the data items. This sparse-graph representation allows Chameleon to scale to large data sets and to successfully use data sets that are available only in similarity space and not in metric spaces.[5] Data sets in a metric space have a fixed number of attributes for each data item, whereas data sets in a similarity space only provide similarities between data items.

Figure 1. Algorithms based only on closeness will incorrectly merge (a) the dark-blue and green clusters because these two clusters are closer together than (b) the red and cyan clusters.

Figure 2. Algorithms based only on the interconnectivity of two clusters will incorrectly merge the dark-blue and red clusters rather than the dark-blue and green clusters.

Chameleon finds the clusters in the data set by using a two-phase algorithm. During the first phase, Chameleon uses a graph-partitioning algorithm to cluster the data items into several relatively small subclusters. During the second phase, it uses an algorithm to find the genuine clusters by repeatedly combining these subclusters.

Modeling the data

Given a matrix whose entries represent the similarity between data items, many methods can be used to find a graph representation.[2,8] In fact, modeling data items as a graph is very common in many hierarchical clustering algorithms.[2,8,9] Chameleon's sparse-graph representation of the items is based on the commonly used k-nearest-neighbor graph approach. Each vertex of the k-nearest-neighbor graph represents a data item. An edge exists between two vertices v and u if u is among the k most similar points of v, or v is among the k most similar points of u.
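To make the construction concrete, here is a minimal sketch of such a symmetric k-nearest-neighbor graph for points with coordinates. The Euclidean distance, the similarity weight 1/(1 + distance), and the use of numpy and networkx are illustrative assumptions, not choices made in the article (whose setting also covers data available only as pairwise similarities).

```python
# A sketch of the symmetric k-nearest-neighbor graph described above.
import numpy as np
import networkx as nx

def knn_graph(points: np.ndarray, k: int) -> nx.Graph:
    """Edge (u, v) exists if v is among the k most similar points of u,
    or u is among the k most similar points of v."""
    n = len(points)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)              # exclude self-edges
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for u in range(n):
        for v in np.argsort(dist[u])[:k]:       # the k most similar points of u
            # An undirected edge implements the "or" in the definition;
            # the weight is a similarity, higher for closer points.
            G.add_edge(u, int(v), weight=1.0 / (1.0 + dist[u, v]))
    return G
```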

Figure 3. Chameleon uses a two-phase algorithm, which first partitions the data items into subclusters and then repeatedly combines these subclusters to obtain the final clusters. (Pipeline: data set → construct a sparse k-nearest-neighbor graph → partition the graph → merge partitions → final clusters.)

Figure 4 illustrates the 1-, 2-, and 3-nearest-neighbor graphs of a simple data set. There are several advantages to representing the items to be clustered using a k-nearest-neighbor graph. Data items that are far apart are completely disconnected, and the weights on the edges capture the underlying population density of the space. Items in denser and sparser regions are modeled uniformly, and the sparsity of the representation leads to computationally efficient algorithms. Because Chameleon operates on a sparse graph, each cluster is nothing more than a subgraph of the data set's original sparse-graph representation.

Figure 4. k-nearest-neighbor graphs from original data in two dimensions: (a) original data, (b) 1-, (c) 2-, and (d) 3-nearest-neighbor graphs.

Modeling cluster similarity

Chameleon uses a dynamic modeling framework to determine the similarity between pairs of clusters by looking at their relative interconnectivity (RI) and relative closeness (RC). Chameleon selects pairs to merge for which both RI and RC are high; that is, it selects clusters that are well interconnected as well as close together.

Relative interconnectivity. Clustering algorithms typically measure the absolute interconnectivity between clusters C_i and C_j in terms of edge cut, the sum of the weights of the edges that straddle the two clusters, which we denote EC(C_i, C_j). Relative interconnectivity between clusters is their absolute interconnectivity normalized with respect to their internal interconnectivities. To get a cluster's internal interconnectivity, we sum the edges crossing a min-cut bisection that splits the cluster into two roughly equal parts; recent advances in graph partitioning have made it possible to efficiently find such quantities.[10] Thus, the relative interconnectivity between a pair of clusters C_i and C_j is

$$\mathrm{RI}(C_i, C_j) = \frac{\mathrm{EC}(C_i, C_j)}{\dfrac{\mathrm{EC}(C_i) + \mathrm{EC}(C_j)}{2}}$$

By focusing on relative interconnectivity, Chameleon can overcome the limitations of existing algorithms that use static interconnectivity models. Relative interconnectivity can account for differences in cluster shapes as well as differences in the degree of interconnectivity for different clusters.
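As an illustration, the following sketch computes RI on a weighted networkx graph such as the one built earlier. The article computes internal interconnectivity with a true min-cut bisection found by a graph partitioner;[10] the Kernighan-Lin heuristic below is an assumed stand-in that keeps the example self-contained.

```python
# A sketch of relative interconnectivity (RI) under the definitions above.
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def edge_cut(G: nx.Graph, A, B) -> float:
    """EC(A, B): sum of the weights of edges straddling vertex sets A and B."""
    A, B = set(A), set(B)
    return sum(d["weight"] for u, v, d in G.edges(data=True)
               if (u in A and v in B) or (u in B and v in A))

def internal_edge_cut(G: nx.Graph, C) -> float:
    """EC(C): edge cut of a bisection of cluster C into two roughly equal
    parts (heuristic stand-in for the article's min-cut bisection)."""
    part_a, part_b = kernighan_lin_bisection(G.subgraph(C), weight="weight")
    return edge_cut(G, part_a, part_b)

def relative_interconnectivity(G: nx.Graph, Ci, Cj) -> float:
    """RI(Ci, Cj) = EC(Ci, Cj) / ((EC(Ci) + EC(Cj)) / 2)."""
    internal = (internal_edge_cut(G, Ci) + internal_edge_cut(G, Cj)) / 2
    return edge_cut(G, Ci, Cj) / internal if internal else 0.0
```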

Relative closeness. Relative closeness involves concepts that are analogous to those developed for relative interconnectivity. The absolute closeness of clusters is the average weight (as opposed to the sum of weights for interconnectivity) of the edges that connect vertices in C_i to those in C_j. Since these connections come from the k-nearest-neighbor graph, their average strength provides a good measure of the affinity between the data items along the interface layer of the two clusters, and at the same time this measure is tolerant of outliers and noise. To get a cluster's internal closeness, we take the average of the edge weights across a min-cut bisection that splits the cluster into two roughly equal parts. The relative closeness between a pair of clusters is the absolute closeness normalized with respect to the internal closeness of the two clusters:

$$\mathrm{RC}(C_i, C_j) = \frac{\bar{S}_{\mathrm{EC}}(C_i, C_j)}{\dfrac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{\mathrm{EC}}(C_i) + \dfrac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{\mathrm{EC}}(C_j)}$$

where $\bar{S}_{\mathrm{EC}}(C_i)$ and $\bar{S}_{\mathrm{EC}}(C_j)$ are the average weights of the edges that belong in the min-cut bisector of clusters C_i and C_j, and $\bar{S}_{\mathrm{EC}}(C_i, C_j)$ is the average weight of the edges that connect vertices in C_i and C_j. Terms |C_i| and |C_j| are the numbers of data points in each cluster. This equation normalizes the absolute closeness of the two clusters by the weighted average of the internal closeness of C_i and C_j, which discourages the merging of small sparse clusters into large dense clusters. In general, the relative closeness between two clusters is less than one because the edges that connect vertices in different clusters have a smaller weight.

By focusing on the relative closeness, Chameleon can overcome the limitations of existing algorithms that look only at the absolute closeness. By looking at the relative closeness, Chameleon correctly merges clusters so that the resulting cluster has a uniform degree of closeness between its items.
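Continuing the sketch above (and reusing its edge_cut helper and the Kernighan-Lin stand-in for the min-cut bisection), relative closeness can be written as follows.

```python
# A sketch of relative closeness (RC): average rather than summed edge
# weights, normalized by the size-weighted internal closeness.
from networkx.algorithms.community import kernighan_lin_bisection

def average_cut_weight(G, A, B) -> float:
    """S_EC(A, B): average weight of the edges straddling A and B."""
    A, B = set(A), set(B)
    w = [d["weight"] for u, v, d in G.edges(data=True)
         if (u in A and v in B) or (u in B and v in A)]
    return sum(w) / len(w) if w else 0.0

def internal_closeness(G, C) -> float:
    """S_EC(C): average weight across a (heuristic) min-cut bisection of C."""
    part_a, part_b = kernighan_lin_bisection(G.subgraph(C), weight="weight")
    return average_cut_weight(G, part_a, part_b)

def relative_closeness(G, Ci, Cj) -> float:
    ni, nj = len(Ci), len(Cj)
    denom = (ni / (ni + nj)) * internal_closeness(G, Ci) \
          + (nj / (ni + nj)) * internal_closeness(G, Cj)
    return average_cut_weight(G, Ci, Cj) / denom if denom else 0.0
```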
Process

The dynamic framework for modeling cluster similarity is applicable only when each cluster contains a sufficiently large number of vertices (data items). The reason is that to compute the relative interconnectivity and closeness of clusters, Chameleon needs to compute each cluster's internal interconnectivity and closeness, neither of which can be accurately calculated for clusters containing only a few data points. For this reason, Chameleon has a first phase that clusters the data items into several subclusters that contain a sufficient number of items to allow dynamic modeling. In a second phase, it discovers the genuine clusters in the data set by using the dynamic modeling framework to hierarchically merge these subclusters.

Chameleon finds the initial subclusters using hMETIS,[10] a fast, high-quality graph-partitioning algorithm. hMETIS partitions the k-nearest-neighbor graph of the data set into several partitions such that it minimizes the edge cut. Since each edge in the k-nearest-neighbor graph represents the similarity among data points, a partitioning that minimizes the edge cut effectively minimizes the relationship (affinity) among data points across the partitions. After finding subclusters, Chameleon switches to an algorithm that repeatedly combines these small subclusters, using the relative interconnectivity and relative closeness framework. There are many ways to develop an algorithm that accounts for both of these measures; Chameleon uses two different schemes.

User-specified thresholds. The first scheme merges only those pairs of clusters exceeding user-specified thresholds for relative interconnectivity (T_RI) and relative closeness (T_RC). In this approach, Chameleon visits each cluster C_i and checks for adjacent clusters C_j with RI and RC that exceed these thresholds. If more than one adjacent cluster satisfies these conditions, then Chameleon merges C_i with the cluster that it is most connected to, that is, the pair C_i and C_j with the highest absolute interconnectivity. Once Chameleon chooses a partner for every cluster, it performs these mergers and repeats the entire process. Users can control the characteristics of the desired clusters with T_RI and T_RC.

Function-defined optimization. The second scheme implemented in Chameleon uses a function to combine the relative interconnectivity and relative closeness, and selects the cluster pairs that maximize this function. Since our goal is to merge pairs for which both the relative interconnectivity and relative closeness are high, a natural way of defining such a function is to take their product; that is, select clusters that maximize RI(C_i, C_j) × RC(C_i, C_j). This formula gives equal importance to both parameters. However, quite often we may prefer clusterings that give a higher preference to one of these two measures. For this reason, Chameleon selects the pair of clusters that maximizes

$$\mathrm{RI}(C_i, C_j) \times \mathrm{RC}(C_i, C_j)^{\alpha}$$

where α is a user-specified parameter. If α > 1, then Chameleon gives a higher importance to the relative closeness, and when α < 1, it gives a higher importance to the relative interconnectivity.
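A single round of the function-defined scheme might look like the sketch below, reusing the RI and RC helpers sketched earlier. The list-of-vertex-sets representation of the phase-1 subclusters is an assumption; the article obtains those subclusters with hMETIS, for which any edge-cut-minimizing partitioner could stand in.

```python
# A sketch of one merge step of the function-defined optimization
# scheme: pick the adjacent pair maximizing RI * RC**alpha.
def merge_step(G, clusters, alpha=2.0):
    best_pair, best_score = None, 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            # Only pairs connected in the k-NN graph are candidates.
            if edge_cut(G, clusters[i], clusters[j]) == 0:
                continue
            score = (relative_interconnectivity(G, clusters[i], clusters[j])
                     * relative_closeness(G, clusters[i], clusters[j]) ** alpha)
            if score > best_score:
                best_pair, best_score = (i, j), score
    if best_pair is None:
        return clusters                  # no adjacent pairs left to merge
    i, j = best_pair
    merged = set(clusters[i]) | set(clusters[j])
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

Repeating merge_step until no adjacent pair remains, or until the desired number of clusters is reached, yields the agglomerative hierarchy.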

Figure 5. Four data sets used in our experiments: (a) DS1 has 8,000 points; (b) DS2 has 6,000 points; (c) DS3 has 10,000 points; and (d) DS4 has 8,000 points.

RESULTS AND COMPARISONS

We compared Chameleon's performance against that of CURE[9] and DBScan[7] on four different data sets. These data sets contain from 6,000 to 10,000 points in two dimensions; the points form geometric shapes as shown in Figure 5. These data sets represent some difficult clustering instances because they contain clusters of arbitrary shape, proximity, orientation, and varying densities, and they also contain significant noise and special artifacts. Many of these results are available at http://www.cs.umn.edu/~han/chameleon.html.

Chameleon is applicable to any data set for which a similarity matrix is available (or can be constructed). However, we evaluated it with two-dimensional data sets because similar data sets have been used to evaluate other state-of-the-art algorithms, and clusters in 2D data sets are also easy to visualize and compare.

Figure 6 shows the clusters that Chameleon found for each of the four data sets, using the same set of parameter values. In particular, we used k = 10 in a function-defined optimization with α = 2.0. Points in different clusters are represented using a combination of colors and glyphs; as a result, points that belong to the same cluster use both the same color and the same glyph. For example, in the DS3 clusters, there are two cyan clusters: one contains the points in the region between the two circles inside the ellipse, and the other, the line between the two horizontal bars and the c-shaped cluster. Two clusters are dark blue: one corresponds to the upside-down c-shaped cluster and the other, the circle inside the candy cane. Their points use different glyphs (bells and squares for the first pair, and squares and bells for the second), so they denote different clusters.

Figure 6. Clusters discovered by Chameleon for the four data sets.

The clustering shown in Figure 6 corresponds to the earliest point at which Chameleon was able to find the genuine clusters in the data set; that is, it corresponds to the earliest iteration at which Chameleon identifies the genuine clusters and places each in a single cluster. Thus, for a given data set, Figure 6 shows more than the number of genuine clusters; the additional clusters contain outliers.

Chameleon correctly identifies the genuine clusters in all four data sets. For example, in DS1, Chameleon finds six clusters, five of which are genuine clusters, and the sixth (shown in brown) corresponds to outliers that connect the two ellipsoid clusters. In DS3, Chameleon finds the nine genuine clusters and two clusters that contain outliers. In DS2 and DS4, Chameleon accurately finds only the genuine clusters.

CURE

We also evaluated CURE, which has been shown to effectively find clusters in two-dimensional data sets.[9] CURE was able to find the right clusters for DS1 and DS2, but it failed to find the right clusters on the remaining two data sets. Figure 7 shows the results obtained by CURE for the DS3 and DS4 data sets. For each of the data sets, Figure 7 shows two different clustering solutions containing different numbers of clusters. The first clustering solution corresponds to the earliest point in the process at which CURE merges subclusters that belong to two different genuine clusters. As we can see from Figure 7a, in the case of DS3, CURE makes a mistake when going down to 25 clusters, as it merges one of the circles inside the ellipse with a portion of the ellipse. Similarly, in the case of DS4 (Figure 7c), CURE also makes a mistake when going down to 25 clusters by merging the small circular cluster with a portion of the upside-down Y-shaped cluster. The second clustering solutions (Figures 7b and 7d) contain as many clusters as those discovered by Chameleon. These solutions are considerably worse than the first set of solutions, indicating that CURE's merging scheme makes multiple mistakes for these data sets.

Figure 7. Clusters identified by CURE with shrinking factor 0.3 and number of representative points equal to 10. For DS3, CURE first merges subclusters that belong to two different genuine clusters at (a) 25 clusters; with (b) 11 clusters specified (the same number that Chameleon found), CURE obtains the result shown. CURE produced these results for DS4 with (c) 25 and (d) 8 clusters specified.

DBSCAN

Figure 8 shows the clusters found by DBScan for DS1 and DS2, using different values of the Eps parameter. Following the recommendation of DBScan's developers,[7] we fixed MinPts at 4 and varied Eps in these experiments.

Figure 8. DBScan results for DS1 with MinPts at 4 and Eps at (a) 0.5 and (b) 0.4.

Figure 9. DBScan results for DS2 with MinPts at 4 and Eps at (a) 5.0, (b) 3.5, and (c) 3.0.

The clusters produced for DS1 illustrate that DBScan cannot effectively find clusters of different density. In Figure 8a, DBScan puts the two red ellipses (at the top of the image) into the same cluster because the outlier points satisfy the density requirements as dictated by the Eps and MinPts parameters. Decreasing the value of Eps can separate these clusters, as shown in Figure 8b for Eps at 0.4. Although DBScan now separates the ellipses without fragmenting them, it fragments the largest but lower-density cluster into several small subclusters. Our experiments have shown that DBScan exhibits similar characteristics on DS4.

The clusters produced for DS2 illustrate that DBScan cannot effectively find clusters with variable internal densities. Figures 9a through 9c show a sequence of three clustering solutions for decreasing values of Eps. As we decrease Eps in the hope of separating the two clusters, the natural clusters in the data set are fragmented into several smaller clusters. DBScan finds the correct clusters in DS3 given the right combination of Eps and MinPts: if we fix MinPts at 4, then the algorithm works fine on this data set as long as Eps is from 5.7 to 6.1.
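For readers who want to run this kind of parameter-sensitivity sweep themselves, a minimal sketch using scikit-learn's DBSCAN follows. The random placeholder data stands in for DS1/DS2, which are not distributed with this article; eps and min_samples correspond to the Eps and MinPts parameters discussed above.

```python
# A sketch of the Eps sweep with MinPts fixed at 4, using scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(1000, 2)          # placeholder for a 2D data set
for eps in (0.5, 0.4, 0.3):
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    print(f"Eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```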

Even though we chose to model the data using a k-nearest-neighbor graph, it is entirely possible to use other graph representations, such as those based on mutual shared neighbors,[2,8,11] which are suitable for particular application domains. Furthermore, different domains may require different models for capturing relative closeness and interconnectivity. In any of these situations, we believe that the two-phase framework of Chameleon would still be highly effective. Our future research includes the verification of Chameleon on different application domains. We also want to study the effectiveness of different techniques for modeling data as well as cluster similarity.

Acknowledgments

This work was partially supported by Army Research Office contract DA/DAAG55-98-1-0441 and the Army High-Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008, the content of which does not necessarily reflect the government's position or policy, and should infer no official endorsement. This work was also supported by an IBM Partnership Award. Access to computing facilities was provided by the Army High-Performance Computing Research Center and the Minnesota Supercomputer Institute. We also thank Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu for providing the DBScan program.

References
1. M.S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. Knowledge and Data Eng., Dec. 1996, pp. 866-883.
2. A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Upper Saddle River, N.J., 1988.
3. D. Boley et al., "Document Categorization and Query Generation on the World Wide Web Using WebACE," to be published in AI Review, 1999.
4. E.H. Han et al., "Hypergraph-Based Clustering in High-Dimensional Data Sets: A Summary of Results," Bull. Tech. Committee on Data Eng., Vol. 21, No. 1, 1998, pp. 15-22. Also available in a longer version as Tech. Report TR-97-019, Dept. of Computer Science, Univ. of Minnesota, 1997.
5. V. Ganti et al., "Clustering Large Datasets in Arbitrary Metric Spaces," Proc. 15th Int'l Conf. Data Eng., IEEE CS Press, Los Alamitos, Calif., 1999, pp. 502-511.
6. R. Ng and J. Han, "Efficient and Effective Clustering Method for Spatial Data Mining," Proc. 20th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1994, pp. 144-155.
7. M. Ester et al., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. 2nd Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1996, pp. 226-231.
8. S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," Proc. 15th Int'l Conf. Data Eng., IEEE CS Press, Los Alamitos, Calif., 1999, pp. 512-521.
9. S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1998, pp. 73-84.
10. G. Karypis and V. Kumar, "hMETIS 1.5: A Hypergraph Partitioning Package," Tech. Report, Dept. of Computer Science, Univ. of Minnesota, 1998; http://winter.cs.umn.edu/~karypis/metis.
11. K.C. Gowda and G. Krishna, "Agglomerative Clustering Using the Concept of Mutual Nearest Neighborhood," Pattern Recognition, Feb. 1978, pp. 105-112.

George Karypis is an assistant professor of computer science at the University of Minnesota. His research interests include data mining, bioinformatics, parallel computing, graph partitioning, and scientific computing. Karypis has a PhD in computer science from the University of Minnesota. He is a member of the IEEE Computer Society and the ACM.

Eui-Hong (Sam) Han is a PhD candidate at the University of Minnesota. His research interests include data mining and information retrieval. Han has an MS in computer science from the University of Texas, Austin. He is a member of the ACM.

Vipin Kumar is a professor of computer science at the University of Minnesota. His research interests include data mining, parallel computing, and scientific computing. Kumar has a PhD in computer science from the University of Maryland, College Park. He is a member of the IEEE, ACM, and SIAM.

Contact the authors at the Univ. of Minnesota, Dept. of Computer Science and Eng., 4-192 EECS Bldg., 200 Union St. SE, Minneapolis, MN 55455; {karypis, han, kumar}@cs.umn.edu.