Analyzing Popular Clustering Algorithms from Different Viewpoints
Journal of Software, Vol.13, No.8 (2002)

QIAN Wei-ning, ZHOU Ao-ying
(Department of Computer Science, Fudan University, Shanghai, China)
(Laboratory for Intelligent Information Processing, Fudan University, Shanghai, China)
Received September 3, 2001; accepted February 25, 2002

Abstract: Clustering is widely studied in the data mining community. It is used to partition a data set into clusters so that intra-cluster data are similar and inter-cluster data are dissimilar. Different clustering methods use different similarity definitions and techniques. Several popular clustering algorithms are analyzed from three different viewpoints: (1) clustering criteria, (2) cluster representation, and (3) algorithm framework. Furthermore, some newly built algorithms, which mix or generalize other algorithms, are introduced. Since the analysis is made from several viewpoints, it can cover and distinguish most of the existing algorithms. It is the basis of research on self-tuning algorithms and clustering benchmarks.

Key words: data mining; clustering; algorithm

Clustering is an important data-mining technique used to find data segmentation and pattern information. It is widely used in applications such as financial data classification, spatial data processing, satellite photo analysis, and automatic detection in medical images. The problem of clustering is to partition the data set into segments (called clusters) so that intra-cluster data are similar and inter-cluster data are dissimilar. It can be formalized as follows:

Definition 1. Given a data set $V = \{v_1, v_2, \ldots, v_n\}$, the $v_i$ ($i = 1, 2, \ldots, n$) are called data points. The process of partitioning $V$ into $\{C_1, C_2, \ldots, C_k\}$, with $C_i \subseteq V$ ($i = 1, 2, \ldots, k$) and $\bigcup_{i=1}^{k} C_i = V$, based on the similarity between data points, is called clustering; the $C_i$ ($i = 1, 2, \ldots, k$) are called clusters.

The definition does not define the similarity between data points. In fact, different methods use different criteria. Clustering is also known as an unsupervised learning process, since there is no prior knowledge about the data set.
Therefore, clustering analysis usually acts as a preprocessing step for other KDD operations. The quality of the clustering result is important for the whole KDD process. As with other data mining operations, high performance and scalability are two further requirements besides accuracy. Thus, a good clustering algorithm should satisfy the following requirements: it is independent of in-advance knowledge; it needs only easy-to-set parameters; it is accurate; it is fast; and it has good scalability.

Supported by the National Grand Fundamental Research 973 Program of China under Grant No.G ( 973 ); the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No ( ). QIAN Wei-ning is a Ph.D. candidate at the Department of Computer Science, Fudan University. His research interests are clustering, data mining and Web data management. ZHOU Ao-ying is a professor and doctoral supervisor at the Department of Computer Science, Fudan University. His current research interests include Web data management, data mining, and object management over peer-to-peer networks.

Much research work has been done on building clustering algorithms. Each uses novel techniques to improve the ability to handle data sets with certain characteristics. However, different algorithms use different criteria, as mentioned above. Since there is no benchmark for clustering methods, it is difficult to compare these algorithms by a common measurement. Nevertheless, a detailed comparison is necessary, because: (1) the advantages and disadvantages should be analyzed, so that improvements can be developed on existing algorithms; (2) the user should be able to choose the right algorithm for a certain data set, so that the optimal result and performance can be obtained; (3) the detailed comparison is the basis for building a clustering benchmark.

In this paper, we analyze several existing popular algorithms from different aspects. Our work differs from other survey work [1~3] in that we compare these algorithms universally from different viewpoints. Other surveys either try to generalize some methods into a certain framework, as in Refs.[1,2], which can cover only a limited set of algorithms, or just introduce clustering algorithms one by one as a tutorial [3], so that no comparison among algorithms is made. Since different algorithms use different criteria and techniques, those surveys can cover only some of the algorithms. Furthermore, some algorithms cannot be distinguished, since they use the same technique and therefore fall into the same category in a certain framework.

The rest of this paper is organized as follows: Sections 1 to 3 analyze the clustering algorithms from three different viewpoints, namely clustering criteria, cluster representation, and algorithm framework. Section 4 introduces some methods that are mixtures or generalizations of other algorithms. Section 5 introduces research focused on auto-detection of clusters. Finally, Section 6 gives concluding remarks.
It should be noted that, from each single viewpoint, although we try to classify as many algorithms as we can, some are still missing, and some algorithms may fall into the same category. However, when these algorithms are observed from all the viewpoints together, different algorithms can be distinguished. This is the motivation of our work.

1 Criteria

The basis of clustering analysis is the definition of similarity. Usually, the definition of similarity contains two parts: (1) the similarity between data points; (2) the similarity between sets of data points. Not all clustering methods need both of them; some algorithms use only one. The clustering criteria can be classified into three categories: distance-based, density-based, and linkage-based. Distance-based and density-based clustering is usually applied to data in Euclidean space, while linkage-based clustering can be applied to data in an arbitrary metric space.

1.1 Distance-Based clustering

The basic idea of distance-based clustering is that a cluster consists of data points close to each other. The distance between two data points is easy to define in Euclidean space. The widely used distance definitions include Euclidean distance and Manhattan distance. However, there are several choices for the similarity definition between two sets of data points, as follows:

$$\mathrm{Similarity}_{rep}(C_i, C_j) = \mathrm{distance}(rep_i, rep_j) \tag{1}$$

or

$$\mathrm{Similarity}_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{v_i \in C_i,\, v_j \in C_j} \mathrm{distance}(v_i, v_j) \tag{2}$$

or

$$\mathrm{Similarity}_{max}(C_i, C_j) = \max\{\mathrm{distance}(v_i, v_j) \mid v_i \in C_i,\ v_j \in C_j\} \tag{3}$$

or

$$\mathrm{Similarity}_{min}(C_i, C_j) = \min\{\mathrm{distance}(v_i, v_j) \mid v_i \in C_i,\ v_j \in C_j\} \tag{4}$$

In (1), $rep_i$ and $rep_j$ are the representatives of $C_i$ and $C_j$, respectively; in (2), $n_i$ and $n_j$ are the numbers of data points in $C_i$ and $C_j$. The representative of a data set is usually the mean, as in k-means [4]. Single representative methods usually employ Definition (1). It is obvious that the complexity of (2), (3), and (4) is $O(|C_i| \cdot |C_j|)$, which is inefficient for large data sets. Although they are more global definitions, they are usually not applied directly as the similarity definition for sub-clusters or clusters. The only exception is BIRCH [5], in which the CF-vector and CF-tree are employed to accelerate the computation. Some trade-off approaches are taken, as will be discussed in Section 2.1, in which the detailed analysis of single representative methods is also given.

The advantage of distance-based clustering is that distance is easy to compute and to understand. Distance-based clustering algorithms usually need the parameter K, the number of final clusters the user wants, or the minimum distance that distinguishes two clusters. However, their distinct disadvantage is that they are sensitive to noise. Although techniques against noise are introduced in some of them, these result in other serious problems. CURE [6] uses a representative-shrinking technique to reduce the impact of noise. However, this brings the problem that it fails to identify clusters of hollow shape, as the result of our experiment in Fig.1 shows. This shortcoming counteracts the advantage of multi-representatives, namely that the algorithm can identify arbitrary-shaped clusters.

Fig.1 Hollow-shaped cluster identified by CURE

BIRCH, which is the first clustering algorithm to consider noise, introduces a new parameter T, which is substantially a parameter related to density. It is hard for the user to understand this parameter unless the page storage ability of the CF-tree is known (page_size/entry_size/T is an approximation of the density in that page). In addition, it may cause the loss of small clusters and long-shaped clusters.
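To make the four set-similarity definitions (1) to (4) concrete, the following is a minimal sketch in Python. It assumes points are tuples of floats, Euclidean distance, and the mean as representative; the function names are illustrative and are not taken from any of the cited algorithms.

```python
import math
from itertools import product


def distance(u, v):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_point(cluster):
    """Component-wise mean of a non-empty cluster (the representative assumed here)."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def similarity_rep(ci, cj):
    """Definition (1): distance between the two representatives."""
    return distance(mean_point(ci), mean_point(cj))


def similarity_avg(ci, cj):
    """Definition (2): average distance over all cross-cluster pairs."""
    total = sum(distance(u, v) for u, v in product(ci, cj))
    return total / (len(ci) * len(cj))


def similarity_max(ci, cj):
    """Definition (3): largest distance over all cross-cluster pairs."""
    return max(distance(u, v) for u, v in product(ci, cj))


def similarity_min(ci, cj):
    """Definition (4): smallest distance over all cross-cluster pairs."""
    return min(distance(u, v) for u, v in product(ci, cj))


if __name__ == "__main__":
    c1 = [(0.0, 0.0), (0.0, 2.0)]
    c2 = [(3.0, 0.0), (3.0, 2.0)]
    for f in (similarity_rep, similarity_avg, similarity_max, similarity_min):
        print(f.__name__, f(c1, c2))
```

Definition (1) touches each point once to build the means, while (2) to (4) enumerate every cross-cluster pair, which is where the $O(|C_i| \cdot |C_j|)$ cost noted above comes from.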
For lack of space, the detailed discussion is omitted here.

1.2 Density-Based clustering

Unlike distance-based clustering methods, density-based clustering holds that clusters are dense areas. Therefore, the similarity definition of data points is based on whether they belong to connected dense regions: the data points belonging to the same connected dense region belong to the same cluster. Based on how density is computed, density-based clustering can be further classified into Nearest-Neighbor (called NN in the rest of this paper) methods and cell-based methods. The difference between them is that the former define density based on the data set, and the latter define it based on the data space. No matter which kind a density-based clustering algorithm belongs to, it always needs a minimum-density threshold parameter, which is the key to defining a dense region.

1.2.1 NN methods

NN methods treat only those points that have more than k neighbors in a hyper-sphere of radius ε as data points in clusters. Since the neighbors of each point must be counted, index structures that support region queries, such as the R*-tree or the X-tree, are always employed. Because of the curse of dimensionality [7], these methods do not scale well with dimensionality. Furthermore, NN methods result in frequent I/O when the data sets are very large. However, for most multi-dimensional data sets, these methods are efficient. In short, the shortcomings of this kind of method are the shortcomings of the index structures they are based on.

Traditional NN methods, such as DBSCAN and its descendants [8~10], need the parameters of density threshold and ε. Recently, OPTICS [11], whose basic idea is the same as that of DBSCAN, focuses on automatic identification of cluster structures. Since the novel techniques in OPTICS do not belong to the topic of this sub-section, we discuss them later, together with the other automatic approaches.

1.2.2 Cell-Based methods

Cell-based methods count density information based on units (cells). STING [12], WaveCluster [13], DBCLASD [14], CLIQUE [15], and OptiGrid [16] all fall into this category. Cell-based methods have the shortcoming that cells are only approximations of dense areas. Some methods introduce techniques to solve this problem, as will be introduced in Section 2.3.

Density-based clustering methods all meet problems when data sets contain clusters or sub-clusters whose granularity is smaller than the granularity of the units used for computing density. A well-known example is the dumbbell-shaped clusters, as shown in our experimental result, Fig.2. However, for density-based clustering methods, it is easy to remove noise if the parameters are properly set. That is to say, they are robust to noise.

Fig.2 Dumbbell-shaped clusters identified by a density-based algorithm (DBSCAN)

1.3 Linkage-Based clustering

Unlike distance-based or density-based clustering, linkage-based clustering can be applied to arbitrary metric spaces. Furthermore, since in high-dimensional space the distance and density information is not sufficient for clustering, linkage-based clustering is often employed. Algorithms of this kind include ROCK [17], CHAMELEON [18], ARHP [19,20], STIRR [21], CACTUS [22], etc. Linkage-based methods are based on a graph or hyper-graph model.
They usually map the data set into a graph/hyper-graph, then cluster the data points based on the edge/hyper-edge information, so that highly connected data points are assigned to the same cluster. The difference between the graph model and the hyper-graph model is that the former reflects the similarity of pairs of nodes, while the latter usually reflects co-occurrence information. ROCK and CHAMELEON use the graph model, while ARHP, PDDP, STIRR, and CACTUS use the hyper-graph model. Although the developers of CACTUS did not state that it is a hyper-graph-model-based algorithm, it belongs to that kind.

The quality of a linkage-based clustering result depends on the definition of the link or hyper-edge. Since it is impossible to handle a complete graph, the graph/hyper-graph model always eliminates the edges/hyper-edges whose weight is low, so that the graph/hyper-graph is sparse. However, this gain in efficiency may reduce the accuracy.

The algorithms falling into this category use different frameworks. ROCK and CHAMELEON are hierarchical clustering methods, while ARHP is a divisive method, and STIRR uses a dynamical system model. Furthermore, since the co-occurrence problem is similar to the association rule mining problem, ARHP and CACTUS both borrow the Apriori algorithm [23] to find the clusters. Another algorithm employing an Apriori-like algorithm is CLIQUE. There, however, the monotonicity lemma is used to find high-dimensional clusters based on clusters found in subspaces. CLIQUE is not a linkage-based clustering method, which is the difference between it and the other algorithms discussed in this subsection. The detailed discussion of algorithm frameworks is given in Section 3. Since CHAMELEON uses both link and distance information, it is discussed separately, with the mixed approaches.

2 Cluster Representation

The purpose of clustering is to identify the data clusters, which are summaries of similar data. Each algorithm has to represent the clusters and sub-clusters in some form. Although labeling each data point with a cluster identity is a straightforward idea, most methods do not employ this approach. This may be because: (1) the summary, which should be easily understandable, is more than (data-point, cluster-id) pairs; (2) it is time- and space-expensive to label all the data points in the process of clustering; (3) some methods employ accurate, compact cluster representatives, which make the time-consuming process of labeling unnecessary. We classify the cluster representation techniques into four kinds, as discussed in the following.

2.1 Representative points

Most distance-based clustering methods use some points to represent clusters. These points are called representative points. The representatives may be data points, or other points that do not exist in the database, such as the means of some sets of data points. The representation techniques falling into this category can be further classified into three classes.

2.1.1 Single representative

The simplest approach is to use one point as the representative of each cluster. Each data point is assigned to the cluster whose representative is the closest one. The representative point may be the mean of the cluster, as in k-means [4] methods, or the data point in the database that is closest to the center, as in k-medoids methods.
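The two choices of single representative described above, and the nearest-representative assignment, can be sketched as follows. This is only an illustration under stated assumptions: points are tuples, distance is Euclidean, and `closest_to_center` follows the wording of this section (the data point nearest to the mean) rather than the cost-minimizing search used by PAM-style k-medoids algorithms. All function names are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_point(cluster):
    """k-means style representative: the mean, which need not be a data point."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def closest_to_center(cluster):
    """k-medoids style representative as described above: the data point nearest the mean."""
    center = mean_point(cluster)
    return min(cluster, key=lambda p: distance(p, center))


def assign(points, representatives):
    """Label each point with the index of its closest representative."""
    return [
        min(range(len(representatives)), key=lambda i: distance(p, representatives[i]))
        for p in points
    ]


if __name__ == "__main__":
    cluster = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    print(mean_point(cluster))         # a point that is not in the data
    print(closest_to_center(cluster))  # an actual data point
    print(assign([(0.0, 0.0), (4.0, 0.0), (10.0, 0.0)], [(1.0, 0.0), (9.0, 0.0)]))
```

Because every point is assigned purely by distance to one representative, the boundary between two clusters is always a hyper-plane, which is one way to see why this representation favors compact, similarly sized clusters.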
Other algorithms of this kind include BIRCH [5], CLARA [24], and CLARANS [25]. The different effects of k-means and k-medoids methods on the clustering result are introduced in detail in Ref.[25]. Since this is not related to the motivation of this paper, we do not survey it here.

The shortcomings of the single representative approach are obvious: (1) only spherical clusters can be identified; and (2) a large cluster with a small cluster beside it will be split, with some data points of the large cluster assigned to the small cluster. These two conditions are shown in Fig.3 (the right part of this figure is borrowed from Ref.[6], Fig.1(b)). Therefore, this approach fails when processing data sets with arbitrary-shaped clusters or clusters of very different sizes.

Fig.3 Non-spherical clusters and clusters with different scales identified by single representative methods

2.1.2 All data points

Using all the data points in a cluster to represent it is another straightforward approach. However, it is time-expensive, since: (1) the data sets are always large, so the label information cannot fit in memory, which leads to frequent disk access; and (2) computing intra- and inter-cluster information requires access to all data points. Furthermore, the label information is hard to understand. Therefore, no popular algorithm takes this approach.

2.1.3 Multi-Representatives

The multi-representatives approach is introduced in CURE and is a trade-off between the single-point and all-points methods. The first representative is the data point farthest from the mean of the cluster. Then, each time, the data point whose distance to the nearest existing representative is the largest is chosen, until the number of representatives is large enough. In Ref.[6], the experiments show that for most data sets, 10 representatives lead to a satisfactory result. In the long version of Ref.[26], the authors who developed CURE also mention that for complex data sets, more representatives are needed. However, before clustering, the complexity of the clusters is unknown. Furthermore, the relationship between the complexity of the clusters and the number of representatives is not clear. This forces the user to choose a large number of representatives. Since the time complexity of CURE is $O(n^2 \log n)$, in which n is the number of data points at the beginning, a large number of representatives in the initial sub-clusters affects the efficiency, as shown in our experimental result, Fig.4. (Sub-clusters exist because a simple partitioning technique is used in CURE [6]. The time complexity with respect to the number of representatives c is $O(c \log c)$ if the number of initial sub-clusters is fixed.) Furthermore, together with its technique for handling outliers (the shrinking of representatives), it fails to identify clusters of hollow shape, as already discussed in Section 1.1 and shown in Fig.1. However, it outperforms the single-point and all-points approaches when both effectiveness and efficiency are considered.

Fig.4 Performance of CURE vs. number of representatives in a cluster (x-axis: number of representatives in a cluster; y-axis: time (s))

2.2 Dense area

Some density-based clustering algorithms use dense areas to denote clusters and sub-clusters. DBSCAN [8], its descendants [9,10], and OPTICS [11] belong to this category. The dense area representation is similar to the all-data-points method, except that only core points are used. Core points are those data points that have more neighbors within a certain region than the threshold. Only core points are used to expand a sub-cluster, and the expansion stops when no further expansion can be applied to core points. Dense areas can capture arbitrary-shaped clusters, except for dumbbell-shaped clusters.
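The core-point expansion described above can be sketched as follows. This is a simplified, brute-force illustration in the spirit of DBSCAN, not the published algorithm: region queries scan all points instead of using an index, and it assumes the common convention that a point is a core point when at least `min_pts` points (itself included) lie within distance `eps`. The names `region_query`, `dense_area_clusters`, `eps` and `min_pts` are illustrative.

```python
import math
from collections import deque


def distance(u, v):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx], found by a full scan."""
    p = points[idx]
    return [j for j, q in enumerate(points) if distance(p, q) <= eps]


def dense_area_clusters(points, eps, min_pts):
    """Return one label per point; points never reached from a core point stay None."""
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if not is_core[i] or labels[i] is not None:
            continue
        # Start a new sub-cluster at an unlabeled core point and expand it.
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            cur = queue.popleft()
            for j in neighbors[cur]:
                if labels[j] is None:
                    labels[j] = cluster_id
                    # Only core points continue the expansion.
                    if is_core[j]:
                        queue.append(j)
        cluster_id += 1
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dense_area_clusters(pts, eps=1.5, min_pts=3))
```

Every point triggers one region query, and here each query scans the whole data set, so the sketch is quadratic in the number of points.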
However, the cost of computing core points is high, so special index structures are needed. In the DBSCAN series of algorithms and in OPTICS, the R*-tree is used to support region queries. Since these methods need to scan the whole database, and each point may cause a region query, they always result in frequent I/O when applied to large databases, as shown in the experiments given later in this paper.

2.3 Cells

Some grid-based methods use cells to summarize the clusters, such as STING [12], WaveCluster [13], CLIQUE [15], DBCLASD [14], and OptiGrid [16]. Unlike dense areas, which are condensations of dense data points, cells are partitions of the data space. Therefore, a cell is an approximation of the data points falling into it. This makes the algorithms taking this approach inaccurate under some conditions. In Ref.[12], the authors argue that under a sufficient condition STING can ensure that the result is accurate. However, this conclusion assumes that the characteristics of the queries are known a priori. WaveCluster exploits the multi-resolution property of wavelets to identify clusters at different resolutions, which ensures that the highest-resolution clusters are accurate.

The advantages of using cells to represent clusters are straightforward. Firstly, the number of cells is much smaller than the size of the database. Therefore, the amount of data to process is limited, which leads to the high scalability of those approaches. Secondly, the cost of computing the properties of cells is low compared with finding dense areas, which needs the support of complex data structures. This is because cells are data independent, while dense areas depend on the data distribution. Finally, like dense areas, cells reflect the data distribution information of a local area, although only approximately.

Since the number of neighboring relationships explodes as the dimensionality increases, algorithms that exploit the neighboring information of cells are usually inefficient for high-dimensional data. The only exception is CLIQUE. Unlike other cell representation methods, CLIQUE finds dense units (cells) from low-dimensional subspaces up to high-dimensional subspaces. Therefore, it scales well with dimensionality.
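As a rough illustration of the cell representation, the sketch below partitions the data space into axis-aligned cells of equal width, counts the points per cell, and keeps the cells whose count reaches a threshold. It is a generic grid sketch, not the procedure of STING, WaveCluster, CLIQUE or OptiGrid; `cell_size` and `min_count` are assumed parameters. The helper `neighbor_cell_count` uses the fact that a cell in a d-dimensional grid touches $3^d - 1$ other cells (diagonals included), which is one way to read the "explosion" of neighboring relationships mentioned above.

```python
import math
from collections import Counter


def cell_of(point, cell_size):
    """Integer grid coordinates of the cell that contains the point."""
    return tuple(math.floor(x / cell_size) for x in point)


def dense_cells(points, cell_size, min_count):
    """Map each cell holding at least min_count points to its point count."""
    counts = Counter(cell_of(p, cell_size) for p in points)
    return {cell: c for cell, c in counts.items() if c >= min_count}


def neighbor_cell_count(dimensions):
    """Number of cells adjacent to one cell (including diagonals) in a d-dimensional grid."""
    return 3 ** dimensions - 1


if __name__ == "__main__":
    pts = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (2.5, 2.5), (-0.5, 0.5)]
    print(dense_cells(pts, cell_size=1.0, min_count=2))
    for d in (2, 3, 10, 20):
        print(d, neighbor_cell_count(d))
```

The cell of a point is computed from the point alone, without looking at any other data, which is the data independence that makes cell properties cheap to compute.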
Although OptiGrid is a cell-based clustering method, it does not use neighboring information, so it is efficient for high-dimensional data sets.

2.4 Probability

Some methods use probability to denote the degree to which a data point belongs to a cluster. EM [27,28] and AutoClass [29] belong to this category. The problem of classifying a data point into more than one cluster is also known as fuzzy clustering or soft clustering. In most cases, the performance of soft clustering is unsatisfactory. Reference [2] provides a detailed survey of fuzzy clustering. For lack of space, we do not go into detail here.

3 Algorithm Framework

In the above two sections, we discussed the clustering criteria and the cluster representation, which are the two most important factors for clustering effectiveness. In this section, the algorithm framework is discussed. The algorithm framework determines the time complexity of the algorithms and the parameters needed. Furthermore, the algorithm framework also affects the preprocessing techniques. These are the focuses of the following three subsections.

3.1 Optimization methods

Optimization methods usually try to optimize a certain measure. Traditional optimization methods are also known as partitioning methods. The most famous ones include k-means [4] (including its variants k-modes [30] and k-prototypes [30]) and k-medoids (including PAM [24], CLARA [24], CLARANS [25], etc.). Some newly built algorithms also fall into this category, including STIRR [21].

K-means methods try to minimize a dissimilarity criterion (typically the squared-error criterion). K-means algorithms are usually linear in the size of the data set. However, they are usually sensitive to outliers and often terminate at a local optimum. Therefore, the quality of the result is not satisfactory. Furthermore, they are usually designed as memory-resident algorithms, which limits their scalability.

Unlike k-means, k-medoids methods use data points to represent a cluster. Since noise and outliers have less influence on the medoids, they are more robust than k-means. However, the cost of k-medoids algorithms is also high. PAM, CLARA, and CLARANS are the three most famous k-medoids algorithms. PAM is the first k-medoids method. CLARA and CLARANS both use sampling techniques: CLARA uses fixed samples, while CLARANS does not. Furthermore, CLARANS exploits randomized search. Therefore, CLARANS is more scalable than PAM and CLARA.

Unlike k-means or k-medoids, some newly built optimization algorithms, such as STIRR, do not use representatives. STIRR is designed to handle categorical data, for which means or medoids are difficult to define. It maps the data set into a hyper-graph and then employs dynamical system techniques to find basins, which are fixed points of the system. Therefore, it can be viewed as the process of finding an optimum of the system configuration.

3.2 Agglomerative methods

Agglomerative algorithms treat data points or data set partitions as sub-clusters at the beginning. Then they merge the sub-clusters iteratively until the final clusters are obtained. BIRCH [5], CURE [6], ISAAC [31], ROCK [17], STING [12], and CHAMELEON [18] all fall into this category.

Agglomerative methods have the shortcoming that the time complexity is at least $O(n^2)$. Therefore, several techniques are employed to accelerate the processing. Since the number of merge operations depends on the number of initial objects, some preprocessing techniques are used to reduce the number of objects to be processed. Sampling and partitioning are two widely used preprocessing techniques.
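The merge loop of the agglomerative framework can be sketched in a few lines. The version below is deliberately naive: it starts from single points, uses the minimum cross-cluster distance (Definition (4)) as the merging measure, and stops when an assumed target number of clusters `k` is reached. It uses no sampling, partitioning or index structure, so it is slower than the algorithms listed above; all names are illustrative.

```python
import math
from itertools import product


def distance(u, v):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def min_link(ci, cj):
    """Smallest distance between any point of ci and any point of cj."""
    return min(distance(u, v) for u, v in product(ci, cj))


def agglomerate(points, k, measure=min_link):
    """Merge the closest pair of sub-clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own sub-cluster
    while len(clusters) > k:
        best = None  # (measure value, index i, index j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = measure(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,), (11.0,), (30.0,)]
    print(agglomerate(pts, k=3))
```

Each pass re-evaluates every pair of sub-clusters, which makes plain why reducing the number of initial objects pays off so quickly.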
The developers of CURE proved that a small sample can guarantee the quality of clustering, while CURE, STING, and CHAMELEON all use partitioning before merging the sub-clusters. Another technique used to accelerate the processing is indexing. Nearly all agglomerative algorithms exploit special index structures: BIRCH uses the CF-tree, CURE uses a k-d-tree and a heap, ROCK uses a two-level heap, STING uses a quad-tree-like index, and CHAMELEON uses a k-d-tree and a heap-based priority queue.

Agglomerative methods usually need a parameter known as the stop condition, which determines when the merge operations should stop. This parameter may be k, the number of final clusters, or a threshold denoting the minimum value of the merging measurement.

3.3 Divisive methods

Divisive methods belong to the hierarchical methods, as agglomerative methods do. Divisive methods begin with one large cluster containing all the data points, and then partition the cluster recursively based on dissimilarity until some stop condition is reached. ARHP [19,20], PDDP [20], and OptiGrid [16] fall into this category.

ARHP uses the hyper-graph model. The whole data set is first mapped to a hyper-graph by using association rule discovery techniques. Then the sub-graphs whose fitness is larger than a threshold are partitioned out. Finally, the vertices are assigned to the clusters to which they are highly connected. Unlike ARHP, which uses fitness to partition the clusters, PDDP and OptiGrid use a hyper-plane to split a cluster in each iteration.

Like agglomerative methods, divisive methods also need a stop condition parameter. It can be either the number of final clusters k, or a threshold for partitioning, such as a fitness threshold.

The advantage of divisive methods is that, for the graph/hyper-graph model, mature research work such as HMETIS [32] can be employed. In fact, even CHAMELEON [18], an agglomerative method, has a divisive step as preprocessing to obtain the initial sub-clusters. Since it is only preprocessing, the parameter is easy to set.
4 Mixed or Generalized Clustering Approaches

As analyzed above, algorithms using a single criterion may fail on some kinds of data sets. Some recent research focuses on combining or generalizing different criteria. In this section, three algorithms of this kind are introduced and analyzed.

4.1 CHAMELEON: distance + connectivity method

CHAMELEON [18] is an algorithm combining several existing clustering techniques. From the clustering criteria viewpoint, it combines a distance measurement (relative closeness) with a linkage measurement (relative inter-connectivity). Furthermore, it generalizes the classic distance measurement in that it uses relative criteria, which were first introduced in linkage-based clustering [19]. From the algorithm framework viewpoint, it uses a divisive method as a partitioning step to generate the initial sub-clusters, and the main phase of the algorithm employs the agglomerative framework. From the cluster representation viewpoint, it is an all-points method. However, the points here may be the initial sub-clusters.

The advantages and shortcomings of CHAMELEON can be derived easily from the multiple-viewpoint analysis. It is strong at identifying arbitrary-shaped clusters and highly intra-connected clusters, since relative distance and relative connectivity are used. However, it needs two parameters as the thresholds of relative distance and relative connectivity, respectively. Furthermore, the divisive partitioning needs another parameter. This is the shortcoming of combining so many techniques. In addition, the framework determines that an index structure supporting region queries (e.g. a k-d-tree) and a heap must be used. Although the time complexity is analyzed theoretically, no scaling-up technique or experiment is provided in the paper.

4.2 Hybrid: distance + density method

The Hybrid algorithm is a clustering method combining distance and density criteria [33]. From the viewpoint of criteria, it uses distance and density information. From the cluster representation viewpoint, it uses the multi-representative technique.
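For reference, the multi-representative selection described for CURE in Section 2.1 (farthest point from the mean first, then repeatedly the point farthest from its nearest existing representative) can be sketched as below. Whether the Hybrid algorithm selects its representatives in exactly this way is not stated in this paper, so the sketch should be read as an illustration of the CURE-style technique only. The shrinking step moves each representative toward the mean by an assumed fraction `alpha`; the function names and parameters are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_point(cluster):
    """Component-wise mean of a non-empty cluster."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def select_representatives(cluster, c):
    """Pick up to c well-scattered data points of the cluster."""
    center = mean_point(cluster)
    # First representative: the data point farthest from the mean.
    reps = [max(cluster, key=lambda p: distance(p, center))]
    while len(reps) < min(c, len(cluster)):
        candidates = [p for p in cluster if p not in reps]
        if not candidates:  # only duplicates of chosen points remain
            break
        # Next: the point whose distance to its nearest representative is largest.
        reps.append(max(candidates, key=lambda p: min(distance(p, r) for r in reps)))
    return reps


def shrink(reps, center, alpha):
    """Move every representative toward the center by the fraction alpha."""
    return [tuple(r_d + alpha * (c_d - r_d) for r_d, c_d in zip(r, center)) for r in reps]


if __name__ == "__main__":
    cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (1.0, 1.0)]
    reps = select_representatives(cluster, 3)
    print(reps)
    print(shrink(reps, mean_point(cluster), alpha=0.5))
```

Shrinking pulls boundary representatives inward, which dampens the effect of outliers but, as noted in Section 1.1, also makes hollow-shaped clusters hard to capture.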
Although cells are employed to enable scaling up, they are not used to represent the clusters, so the cluster representation can be more accurate. From the framework viewpoint, it is an agglomerative algorithm.

As discussed before, the advantages and shortcomings are straightforward after the analysis. It can identify arbitrary-shaped clusters and is insensitive to noise and outliers, since both distance and density information are used. However, this introduces three parameters: one for distance computing and two for density computing. Furthermore, the framework determines the use of a k-d-tree and a heap structure. Unlike CHAMELEON, it is designed to handle very large databases. The cell-based indexing not only reduces the data to be processed, but also accelerates the labeling process. As shown in our experiments, Fig.5, it outperforms two popular clustering algorithms, DBSCAN and CURE, since the R*-tree incurs high overhead when processing large data sets, while CURE fails when the data set grows beyond main memory. A detailed description of the experiments can be found in Ref.[33].

4.3 DENCLUE: generalized density method

DENCLUE [34] is a density-based clustering method that tries to generalize several other clustering algorithms. It can be viewed as a kind of survey of density-based clustering algorithms, since it can cover almost all density-based algorithms by using different influence functions and density functions. The developers of DENCLUE also state that it can generalize hierarchical algorithms and partitioning algorithms (named traditional optimization algorithms in this paper). However, it can only denote the framework of those algorithms. It cannot cover the algorithms that use representatives, even if different functions or parameters are set.
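The idea of swapping influence functions can be illustrated with a small sketch. It assumes the usual formulation in which the density at a location is the sum of the influences of all data points on that location, with a Gaussian and a square-wave influence function as two interchangeable choices; `sigma` is the assumed width parameter, and the function names are illustrative rather than taken from Ref.[34].

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def gaussian_influence(x, y, sigma):
    """Smooth influence of data point y on location x."""
    return math.exp(-distance(x, y) ** 2 / (2 * sigma ** 2))


def square_wave_influence(x, y, sigma):
    """Influence 1 inside radius sigma and 0 outside."""
    return 1.0 if distance(x, y) <= sigma else 0.0


def density(x, data, influence, sigma):
    """Density at x: the summed influence of every data point on x."""
    return sum(influence(x, p, sigma) for p in data)


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    print(density((0.0, 0.0), data, square_wave_influence, sigma=1.0))
    print(density((0.0, 0.0), data, gaussian_influence, sigma=1.0))
```

With the square-wave function, the density at a point is simply the number of data points within radius `sigma`, which is the kind of neighbor count used by the NN methods of Section 1.2; the Gaussian function gives a smooth version of the same quantity.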
Fig.5 Scaling-up experiments of CURE, DBSCAN, and the Hybrid algorithm (x-axis: data size; y-axis: time (s); series: Hybrid, DBSCAN, CURE)

Since DENCLUE is in fact a density-based method, it needs parameters for calculating density, and it is robust to noise. Furthermore, its cell-based technique determines that a tree-based index should be used, so that it can handle very large data sets. It also employs a filtering technique to reduce the complexity of handling high-dimensional data. However, this introduces another parameter.

5 Automatic and Visualization Approaches

Since clustering is a process of unsupervised learning, setting appropriate parameters is a problem for many algorithms. The above analysis shows that most clustering algorithms need some parameters. Although these may be straightforward in some cases, they are difficult to set in many environments. Furthermore, current cluster representation techniques can be easily understood only when the data is in a low-dimensional space. Therefore, some algorithms are built for automatic clustering. Meanwhile, other efforts have been made to visualize the process of clustering, so that the user can set the parameters easily and the result is more understandable.

OPTICS [11] is an algorithm designed to discover cluster structure. It is essentially a density-based clustering algorithm, as DBSCAN is. The difference between OPTICS and other density-based methods is that it uses reachability-plots to visualize the process of clustering. Furthermore, it introduces an automatic technique to detect the steep points, so that clusters can be discovered. By using different parameters, it can discover clusters at different density levels. Therefore, the cluster structure is an organization of clusters of different densities.

In Ref.[35], the authors introduced an algorithm to build a multi-granularity cluster-tree.
They argued that an accurate multi-granularity cluster-tree should be vertically distinguished, horizontally distinguished, and complete, which ensures that each node in the cluster-tree denotes a cluster at a certain granularity, while any cluster at any granularity has a corresponding node in the cluster-tree. The construction of the multi-granularity cluster-tree employs distance-based clustering in the agglomerative framework, which is the main difference between the multi-granularity cluster-tree and the cluster structure in Ref.[11]. Therefore, OPTICS treats clusters of different density as clusters at different levels, and may treat clusters of different scale as clusters at the same level, while the multi-granularity cluster-tree treats them the opposite way, as shown in Fig.6. The difference exists because the motivation for building the multi-granularity cluster-tree is to provide a cluster management facility that eases the understanding of the clustering result, while OPTICS is designed for automatic or interactive clustering.
Fig.6

Some researchers in computer graphics have also developed algorithms to visualize the clustering process, such as H-BLOB [36]. The basic ideas are similar: (1) visualize the clustering process, so that the user can see the construction of the clusters; (2) clusters may exist at different levels when different parameters are used, whichever criterion is used.

6 Conclusions

In this paper, we try to analyze the existing popular clustering algorithms both theoretically and experimentally from three different viewpoints: clustering criteria, cluster representation, and algorithm framework, so that most algorithms can be covered and distinguished. This work can be the basis of: (1) analysis of the advantages and disadvantages of clustering algorithms; (2) clustering algorithm selection by data mining users; (3) automatic clustering algorithm selection for different data sets; (4) development of self-tuning clustering algorithms; (5) construction of a clustering benchmark.

The analysis shows that most current algorithms have their shortcomings while being effective or efficient for data sets with some special characteristics. Furthermore, three algorithms that generalize or mix other algorithms are introduced and analyzed from the three viewpoints introduced in this paper. Finally, some automatic/visualization algorithms for clustering are introduced. They are attempts by researchers to push the unsupervised learning process to a more understandable and automatic stage.

Acknowledgement

We would like to thank Dr. Wen Jin of Simon Fraser University for his suggestions on the outline and draft of this paper. We would also like to thank Dr. Joerg Sander for providing the source code of DBSCAN, and Ms. Hale Qian for helping us to implement the CURE and Hybrid algorithms.

References:

[1] Fasulo, D. An analysis of recent work on clustering algorithms. Technical Report, Department of Computer Science and Engineering, University of Washington.
[2] Baraldi, A., Blonda, P. A survey of fuzzy clustering algorithms for pattern recognition. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 1999,29:786~801.
[3] Keim, D.A., Hinneburg, A. Clustering techniques for large data sets from the past to the future. Tutorial Notes for ACM SIGKDD 1999 International Conference on Knowledge Discovery and Data Mining. San Diego, CA, ACM, ~181.
[4] McQueen, J. Some methods for classification and analysis of multivariate observations. In: LeCam, L., Neyman, J., eds. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. ~297.
[5] Zhang, T., Ramakrishnan, R., Livny, M. BIRCH: an efficient data clustering method for very large databases. In: Jagadish, H.V., Mumick, I.S., eds. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. Quebec: ACM Press, ~114.
[6] Guha, S., Rastogi, R., Shim, K. CURE: an efficient clustering algorithm for large databases. In: Haas, L.M., Tiwary, A., eds. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. Seattle: ACM Press, ~84.
[7] Beyer, K.S., Goldstein, J., Ramakrishnan, R., et al. When is nearest neighbor meaningful? In: Beeri, C., Buneman, P., eds. Proceedings of the 7th International Conference on Database Theory, ICDT 99. LNCS1540, Jerusalem, Israel: Springer, ~235.
[8] Ester, M., Kriegel, H.-P., Sander, J., et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U.M., eds. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD 96). AAAI Press, ~231.
[9] Ester, M., Kriegel, H.-P., Sander, J., et al. Incremental clustering for mining in a data warehousing environment. In: Gupta, A., Shmueli, O., Widom, J., eds. Proceedings of the 24th International Conference on Very Large Data Bases. New York: Morgan Kaufmann, ~333.
[10] Sander, J., Ester, M., Kriegel, H.-P., et al. Density-Based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 1998,2(2):169~194.
[11] Ankerst, M., Breunig, M.M., Kriegel, H.-P., et al. OPTICS: ordering points to identify the clustering structure. In: Delis, A., Faloutsos, C., Ghandeharizadeh, S., eds. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. Philadelphia: ACM Press, ~60.
[12] Wang, W., Yang, J., Muntz, R. STING: a statistical information grid approach to spatial data mining. In: Jarke, M., Carey, M.J., Dittrich, K.R., et al., eds. Proceedings of the 23rd International Conference on Very Large Data Bases. Athens: Morgan Kaufmann, ~195.
[13] Sheikholeslami, G., Chatterjee, S., Zhang, A. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Gupta, A., Shmueli, O., Widom, J., eds. Proceedings of the 24th International Conference on Very Large Data Bases. New York: Morgan Kaufmann, ~438.
[14] Xu, X., Ester, M., Kriegel, H.-P., et al. A distribution-based clustering algorithm for mining in large spatial databases. In: Proceedings of the 14th International Conference on Data Engineering. Orlando: IEEE Computer Society Press, ~331.
[15] Agrawal, R., Gehrke, J., Gunopulos, D., et al. Automatic subspace clustering of high dimensional data for data mining applications. In: Haas, L.M., Tiwary, A., eds. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. Seattle: ACM Press, ~105.
[16] Hinneburg, A., Keim, D.A. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Atkinson, M.P., Orlowska, M.E., Valduriez, P., et al., eds. Proceedings of the 25th International Conference on Very Large Data Bases. Edinburgh: Morgan Kaufmann, ~517.
[17] Guha, S., Rastogi, R., Shim, K. ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of the 15th International Conference on Data Engineering. Sydney: IEEE Computer Society Press, ~521.
[18] Karypis, G., Han, E.H., Kumar, V. CHAMELEON: a hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 1999,32(8):68~75.
[19] Han, E.H., Karypis, G., Kumar, V., et al. Hypergraph based clustering in high-dimensional data sets: a summary of results. Data Engineering Bulletin, 1998,21(1):15~22.
[20] Boley, D., Gini, M., Gross, R., et al. Partitioning-Based clustering for web document categorization. Decision Support System Journal, 1999,27(3):329~341.
[21] Gibson, D., Kleinberg, J.M., Raghavan, P. Clustering categorical data: an approach based on dynamical systems. In: Gupta, A., Shmueli, O., Widom, J., eds. Proceedings of the 24th International Conference on Very Large Data Bases. New York: Morgan Kaufmann, ~322.
[22] Ganti, V., Gehrke, J., Ramakrishnan, R. CACTUS, clustering categorical data using summaries. In: Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining. San Diego: ACM Press, ~83.
[23] Agrawal, R., Srikant, R. Fast algorithms for mining association rules. In: Bocca, J.B., Jarke, M., Zaniolo, C., eds. Proceedings of the 20th International Conference on Very Large Data Bases (VLDB 94). Santiago: Morgan Kaufmann, ~499.
[24] Kaufman, L., Rousseeuw, P.J. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons,
[25] Ng, R.T., Han, J. Efficient and effective clustering methods for spatial data mining. In: Bocca, J.B., Jarke, M., Zaniolo, C., eds. Proceedings of the 20th International Conference on Very Large Data Bases (VLDB 94). Santiago: Morgan Kaufmann, ~155.
[26] Guha, S., Rastogi, R., Shim, K. CURE: an efficient clustering algorithm for large databases. Information System Journal, 1998,26(1):35~58.
[27] Dempster, A.P., Laird, N.M., Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (Series B), 1977,29(1):1~38.
[28] Lauritzen, S.L. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 1995,19:191~201.
[29] Cheeseman, P., Stutz, J. Bayesian classification (AutoClass): theory and results. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., et al., eds. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, ~180.
[30] Huang, Z. Extensions to the K-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 1998,2:283~304.
[31] Talavera, L., Bejar, J. Efficient construction of comprehensible hierarchical clustering. In: Zytkow, J.M., Quafafou, M., eds. Principles of Data Mining and Knowledge Discovery, Proceedings of the 2nd European Symposium, PKDD 98. LNCS1510, Nantes: Springer-Verlag, ~101.
[32] Karypis, G., Aggarwal, R., Kumar, V., et al. Multilevel hypergraph partitioning: application in VLSI domain. In: Proceedings of the 34th Conference on Design Automation. Anaheim, CA: ACM Press, ~529.
[33] Zhou, A., Qian, W., Qian, H., et al. A hybrid approach to clustering in very large databases. In: Cheung, D., Williams, G.J., Li, Q., eds. Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining. LNCS2035, Hong Kong: Springer-Verlag, ~524.
[34] Hinneburg, A., Keim, D.A. An efficient approach to clustering in large multimedia databases with noise. In: Agrawal, R., Stolorz, P.E., Piatetsky-Shapiro, G., eds. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD 98). New York: AAAI Press, ~65.
[35] Zhou, A., Qian, W., Qian, H., et al. SACT: automatic cluster-tree construction for very large spatial databases. Technical Report, Computer Science Department, Fudan University,
[36] Sprenger, T.C., Brunella, R., Gross, M.H. H-BLOB: a hierarchical visual clustering method using implicit surfaces. Technical Report No.341, Computer Science Department, ETH Zürich, ftp://ftp.inf.ethz.ch/pub/publications/tech-reports/3xx/341.pdf.
More informationSkew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach
Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research
More informationSpatial Data Dynamic Balancing Distribution Method Based on the Minimum Spatial Proximity for Parallel Spatial Database
JOURNAL OF SOFTWARE, VOL. 6, NO. 7, JULY 211 1337 Spatal Data Dynamc Balancng Dstrbuton Method Based on the Mnmum Spatal Proxmty for Parallel Spatal Database Yan Zhou College of Automaton Unversty of Electrc
More informationOptimal Workload-based Weighted Wavelet Synopses
Optmal Workload-based Weghted Wavelet Synopses Yoss Matas School of Computer Scence Tel Avv Unversty Tel Avv 69978, Israel matas@tau.ac.l Danel Urel School of Computer Scence Tel Avv Unversty Tel Avv 69978,
More informationEXTENDED BIC CRITERION FOR MODEL SELECTION
IDIAP RESEARCH REPORT EXTEDED BIC CRITERIO FOR ODEL SELECTIO Itshak Lapdot Andrew orrs IDIAP-RR-0-4 Dalle olle Insttute for Perceptual Artfcal Intellgence P.O.Box 59 artgny Valas Swtzerland phone +4 7
More informationCHAPTER 2 DECOMPOSITION OF GRAPHS
CHAPTER DECOMPOSITION OF GRAPHS. INTRODUCTION A graph H s called a Supersubdvson of a graph G f H s obtaned from G by replacng every edge uv of G by a bpartte graph,m (m may vary for each edge by dentfyng
More informationClustering Algorithm Combining CPSO with K-Means Chunqin Gu 1, a, Qian Tao 2, b
Internatonal Conference on Advances n Mechancal Engneerng and Industral Informatcs (AMEII 05) Clusterng Algorthm Combnng CPSO wth K-Means Chunqn Gu, a, Qan Tao, b Department of Informaton Scence, Zhongka
More informationUB at GeoCLEF Department of Geography Abstract
UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department
More information