Clustering algorithms and validity measures


M. Halkidi, Y. Batistakis, M. Vazirgiannis
Department of Informatics, Athens University of Economics & Business
Email: {mhalk, yannis,

Abstract. Clustering aims at discovering groups and identifying interesting distributions and patterns in data sets. Researchers have extensively studied clustering since it arises in many application domains in engineering and the social sciences. In recent years, the availability of huge transactional and experimental data sets and the arising requirements for data mining have created a need for clustering algorithms that scale and can be applied in diverse domains. This paper surveys the clustering methods and approaches available in the literature in a comparative way. It also presents the basic concepts, principles and assumptions upon which the clustering algorithms are based. Another important issue is the validity of the clustering schemes resulting from applying the algorithms. This is also related to the inherent features of the data set under concern. We review and compare the clustering validity measures available in the literature. Furthermore, we illustrate the issues that are under-addressed by the recent algorithms and we point out new research directions.

1. Introduction

Clustering is one of the most useful tasks in the data mining process for discovering groups and identifying interesting distributions and patterns in the underlying data. The clustering problem is about partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than to points in different clusters [0]. For example, consider a retail database whose records contain items purchased by customers. A clustering procedure could group the customers in such a way that customers with similar buying patterns are in the same cluster. Thus, the main concern in the clustering process is to reveal the organization of patterns into sensible groups, which allows us to discover similarities and differences, as well as to derive useful conclusions about them. This idea is applicable in many fields, such as the life sciences, medical sciences and engineering.
Clustering may be found under different names in different contexts, such as unsupervised learning (in pattern recognition), numerical taxonomy (in biology, ecology), typology (in the social sciences) and partition (in graph theory) [8]. In the clustering process, there are no predefined classes and no examples that would show what kind of relations should be valid among the data; that is why clustering is perceived as an unsupervised process []. On the other hand, classification is a procedure of assigning a data item to one of a predefined set of categories [8]. Clustering produces the initial categories into which the values of a data set are classified during the classification process. The clustering process may result in different partitionings of a data set, depending on the specific criterion used for clustering. Thus, there is a need for preprocessing before we undertake a clustering task on a data set. The basic steps of the clustering process can be summarized as follows [8]:

Feature selection. The goal is to select properly the features on which clustering is to be performed, so as to encode as much information as possible concerning the task of interest. Thus, preprocessing of the data may be necessary prior to its utilization in the clustering task.

Clustering algorithm. This step refers to the choice of an algorithm that results in the definition of a good clustering scheme for the data set. A proximity measure and a clustering criterion mainly characterize a clustering algorithm, as well as its efficiency in defining a clustering scheme that fits the data set. i) The proximity measure quantifies how similar two data points (i.e. feature vectors) are. In most cases we have to ensure that all selected features contribute equally to the computation of the proximity measure and that no features dominate others. ii) The clustering criterion can be expressed via a cost function or some other type of rule. We should stress that we have to take into account the type of clusters that are expected to occur in the data set.
Thus, we may define a good clustering criterion, leading to a partitioning that fits the data set well.

Validation of the results. The correctness of the clustering algorithm's results is verified using appropriate criteria and techniques. Since clustering algorithms define clusters that are not known a priori, irrespective of the clustering method, the final partition of the data requires some kind of evaluation in most applications [4].

Interpretation of the results. In many cases, the experts in the application area have to integrate the clustering results with other experimental evidence and analysis in order to draw the right conclusion.

Clustering Applications. Cluster analysis is a major tool in a number of applications in many fields of business and science. Hereby, we summarize the basic directions in which clustering is used [8]:

Data reduction. Cluster analysis can contribute to the compression of the information included in the data. In several cases, the amount of available data is very large and its processing becomes very demanding. Clustering can be used to partition the data set into a number of interesting clusters. Then, instead of processing the data set as one entity, we adopt the representatives of the defined clusters in our process. Thus, data compression is achieved.

Hypothesis generation. Cluster analysis is used here in order to infer some hypotheses concerning the data. For instance, we may find in a retail database that there are two significant groups of customers based on their age and the time of purchases. Then, we may infer some hypotheses for the data, that is, "young people go shopping in the evening", "old people go shopping in the morning".

Hypothesis testing. In this case, cluster analysis is used for the verification of the validity of a specific hypothesis. For example, consider the following hypothesis: "young people go shopping in the evening". One way to verify whether it is true is to apply cluster analysis to a representative set of stores. Suppose that each store is represented by its customers' details (age, job, etc.) and the time of its transactions. If, after applying cluster analysis, a cluster that corresponds to "young people buying in the evening" is formed, then the hypothesis is supported by cluster analysis.

Prediction based on groups. Cluster analysis is applied to the data set and the resulting clusters are characterized by the features of the patterns that belong to them. Then, unknown patterns can be classified into specific clusters based on their similarity to the clusters' features. Useful knowledge related to our data can be extracted.
Assume, for example, that cluster analysis is applied to a data set concerning patients infected by the same disease. The result is a number of clusters of patients, according to their reaction to specific drugs. Then, for a new patient, we identify the cluster in which he/she can be classified, and based on this decision his/her medication can be chosen. More specifically, some typical applications of clustering are in the following fields []:

Business. In business, clustering may help marketers discover significant groups in their customer database and characterize them based on purchasing patterns.

Biology. In biology, it can be used to define taxonomies, categorize genes with similar functionality and gain insights into structures inherent in populations.

Spatial data analysis. Due to the huge amounts of spatial data that may be obtained from satellite images, medical equipment, Geographical Information Systems (GIS), image database exploration, etc., it is expensive and difficult for users to examine spatial data in detail. Clustering may help to automate the process of analysing and understanding spatial data. It is used to identify and extract interesting characteristics and patterns that may exist in large spatial databases.

Web mining. In this case, clustering is used to discover significant groups of documents on the Web, a huge collection of semi-structured documents. This classification of Web documents assists in information discovery.

In general terms, clustering may serve as a preprocessing step for other algorithms, such as classification, which would then operate on the detected clusters.

Clustering Algorithm Categories. A multitude of clustering methods has been proposed in the literature. Clustering algorithms can be classified according to: i) the type of data input to the algorithm, ii) the clustering criterion defining the similarity between data points, and iii) the theory and fundamental concepts on which the clustering analysis techniques are based (e.g. fuzzy theory, statistics). Thus, according to the method adopted to define clusters, the algorithms can be broadly classified into the following types [6]:

Partitional clustering attempts to directly decompose the data set into a set of disjoint clusters. More specifically, these algorithms attempt to determine an integer number of partitions that optimise a certain criterion function. The criterion function may emphasize the local or the global structure of the data, and its optimisation is an iterative procedure.

Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters. The result of the algorithm is a tree of clusters, called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained.

Density-based clustering. The key idea of this type of clustering is to group neighbouring objects of a data set into clusters based on density conditions.

Grid-based clustering. This type of algorithm is mainly proposed for spatial data mining. Its main characteristic is that it quantises the space into a finite number of cells and then does all operations on the quantised space.

For each of the above categories there is a wealth of subtypes and different algorithms for finding the clusters. Thus, according to the type of variables allowed in the data set, clustering algorithms can be categorized into [, 5, 4]:

Statistical algorithms, which are based on statistical analysis concepts. They use similarity measures to partition objects and they are limited to numeric data.

Conceptual algorithms, which are used to cluster categorical data. They cluster objects according to the concepts they carry.

Another classification criterion is the way clustering handles uncertainty in terms of cluster overlapping:

Fuzzy clustering, which uses fuzzy techniques to cluster data and considers that an object can be classified into more than one cluster. This type of algorithm leads to clustering schemes that are compatible with everyday-life experience, as they handle the uncertainty of real data. The most important fuzzy clustering algorithm is Fuzzy C-Means [].

Crisp clustering, which considers non-overlapping partitions, meaning that a data point either belongs to a class or not. Most clustering algorithms result in crisp clusters, and thus can be categorized as crisp clustering.

Kohonen net clustering, which is based on the concepts of neural networks. The Kohonen network has input and output nodes. The input layer (input nodes) has a node for each attribute of the record, each one connected to every output node (output layer). Each connection is associated with a weight, which determines the position of the corresponding output node. Thus, according to an algorithm that changes the weights properly, the output nodes move to form clusters.

In general terms, clustering algorithms are based on a criterion for assessing the quality of a given partitioning. More specifically, they take as input some parameters (e.g. number of clusters, density of clusters) and attempt to define the best partitioning of a data set for the given parameters. Thus, they define a partitioning of a data set based on certain assumptions, and not necessarily the "best" one that fits the data set.
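The Kohonen-net behaviour described above, in which output nodes hold weight vectors that are pulled toward the inputs they "win", can be sketched minimally as follows. This is an illustrative sketch only; the deterministic initialisation on the first points and the decaying learning-rate schedule are our assumptions, not details given in the survey.

```python
def kohonen_cluster(points, n_out, lr=0.5, epochs=20):
    """Toy competitive-learning sketch of Kohonen-style clustering:
    each output node holds a weight vector; for every input the winning
    (nearest) node is pulled toward that input, so nodes drift to clusters."""
    # initialise output-node weights on the first n_out points (a simplification)
    weights = [list(points[j]) for j in range(n_out)]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # decaying learning rate (assumed schedule)
        for x in points:
            # winner: output node whose weight vector is nearest to the input x
            win = min(range(n_out),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(x, weights[j])))
            # move only the winner's weights toward x
            weights[win] = [w + rate * (a - w) for a, w in zip(x, weights[win])]
    return weights

pts = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9)]
w = kohonen_cluster(pts, n_out=2)  # one node settles near each group of points
```

With two well-separated groups, the two output nodes end up positioned near the respective group centres, which is the cluster-forming effect the text describes.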
Since clustering algorithms discover clusters which are not known a priori, the final partition of a data set requires some sort of evaluation in most applications [4]. For instance, questions like "how many clusters are there in the data set?", "does the resulting clustering scheme fit our data set?", "is there a better partitioning for our data set?" call for clustering results validation and are the subject of methods discussed in the literature. These methods aim at the quantitative evaluation of the results of clustering algorithms and are known under the general term cluster validity methods.

The remainder of the paper is organized as follows. In the next section we present the main categories of clustering algorithms available in the literature. Then, in Section 3 we discuss the main characteristics of these algorithms in a comparative way. In Section 4 we present the main concepts of clustering validity indices and the techniques proposed in the literature for evaluating clustering results. Moreover, an experimental study based on some of these validity indices is presented, using synthetic and real data sets. We conclude in Section 5 by summarizing and presenting the trends in clustering.

2. Clustering Algorithms

In recent years, a number of clustering algorithms have been proposed and are available in the literature. Some representative algorithms of the above categories follow.

2.1 Partitional Algorithms

In this category, K-Means is a commonly used algorithm [8]. The aim of K-Means clustering is the optimisation of an objective function that is described by the equation

    E = Σ_{i=1}^{c} Σ_{x∈C_i} d(x, m_i)    (1)

In the above equation, m_i is the center of cluster C_i, while d(x, m_i) is the Euclidean distance between a point x and m_i. Thus, the criterion function E attempts to minimize the distance of every point from the center of the cluster to which the point belongs. More specifically, the algorithm begins by initialising a set of c cluster centers. Then, it assigns each object of the dataset to the cluster whose center is the nearest, and recomputes the centers. The process continues until the centers of the clusters stop changing.
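The assign-then-recompute loop of K-Means just described can be sketched compactly. This is a minimal illustration of the iteration, with a naive "first c points" initialisation assumed for brevity (real implementations initialise more carefully).

```python
import math

def kmeans(points, c, iters=100):
    """Minimal K-Means sketch of the E criterion above: assign each point
    to the nearest centre, recompute centres as cluster means, repeat."""
    centers = [list(p) for p in points[:c]]  # naive initialisation (assumption)
    for _ in range(iters):
        clusters = [[] for _ in range(c)]
        for p in points:
            # nearest-centre assignment
            i = min(range(c), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # recompute each centre as the mean of its cluster
        new = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:  # centres stopped changing: converged
            return new, clusters
        centers = new
    return centers, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters = kmeans(points, 2)
# for this toy data the centres converge to [0, 0.5] and [10, 10.5]
```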
Another algorithm of this category is PAM (Partitioning Around Medoids). The objective of PAM is to determine a representative object (medoid) for each cluster, that is, to find the most centrally located objects within the clusters. The algorithm begins by selecting an object as the medoid for each of the c clusters. Then, each of the non-selected objects is grouped with the medoid to which it is the most similar. PAM swaps medoids with other non-selected objects until all objects qualify as medoids. It is clear that PAM is an expensive algorithm as regards finding the medoids, as it compares an object with the entire dataset [].

CLARA (Clustering Large Applications) is an implementation of PAM on a subset of the dataset. It draws multiple samples of the dataset, applies PAM on each sample, and then outputs the best clustering out of these samples [].

CLARANS (Clustering Large Applications based on Randomized Search) combines the sampling technique with PAM. The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of medoids. The clustering obtained after replacing a medoid is called a neighbour of the current clustering. CLARANS selects a node and compares it to a user-defined number of its neighbours, searching for a local minimum. If a better neighbour is found (i.e., one having lower square error), CLARANS moves to the neighbour's node and the process starts again; otherwise the current clustering is a local optimum. When a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum.

Finally, K-prototypes and K-mode [5] are based on the K-Means algorithm, but they aim at clustering categorical data.

2.2 Hierarchical Algorithms

Hierarchical clustering algorithms can further be divided, according to the method by which they produce clusters, into [8]:

Agglomerative algorithms. They produce a sequence of clustering schemes with a decreasing number of clusters at each step. The clustering scheme produced at each step results from the previous one by merging the two closest clusters into one.

Divisive algorithms. These algorithms produce a sequence of clustering schemes with an increasing number of clusters at each step. Contrary to the agglomerative algorithms, the clustering produced at each step results from the previous one by splitting a cluster into two.

In the sequel, we describe some representative hierarchical clustering algorithms.

BIRCH [3] uses a hierarchical data structure called a CF-tree for partitioning the incoming data points in an incremental and dynamic way. The CF-tree is a height-balanced tree, which stores the clustering features and is based on two parameters: the branching factor B and the threshold T, which refers to the diameter of a cluster (the diameter (or radius) of each cluster must be less than T). BIRCH can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans. It is also the first clustering algorithm to handle noise effectively [3]. However, its clusters do not always correspond to natural clusters, since each node in the CF-tree can hold only a limited number of entries due to its size.
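The agglomerative merge-the-two-closest-clusters step described above can be sketched directly. This is a minimal illustration assuming a single-link (minimum pairwise distance) definition of cluster closeness; stopping at k clusters corresponds to cutting the dendrogram at the desired level.

```python
import math

def agglomerative(points, k):
    """Sketch of agglomerative (single-link) clustering: start with one
    cluster per point and repeatedly merge the two closest clusters,
    which is equivalent to cutting the dendrogram at k clusters."""
    clusters = [[p] for p in points]

    def link(a, b):
        # single-link distance: closest pair of points across the two clusters
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        # find the pair of clusters with the smallest link distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters

cl = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], 2)
# the two nearby pairs end up as the two clusters
```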
Moreover, BIRCH is order-sensitive, as it may generate different clusters for different orders of the same input data.

CURE [0] represents each cluster by a certain number of points that are generated by selecting well-scattered points and then shrinking them toward the cluster centroid by a specified fraction. It uses a combination of random sampling and partition clustering to handle large databases.

ROCK [] is a robust clustering algorithm for Boolean and categorical data. It introduces two new concepts, a point's neighbours and links, and it relies on them to measure the similarity/proximity between a pair of data points.

2.3 Density-based algorithms

Density-based algorithms typically regard clusters as dense regions of objects in the data space that are separated by regions of low density. A widely known algorithm of this category is DBSCAN [6]. The key idea in DBSCAN is that, for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points. DBSCAN can handle noise (outliers) and discover clusters of arbitrary shape. Moreover, DBSCAN is used as the basis for an incremental clustering algorithm proposed in [7]. Due to its density-based nature, the insertion or deletion of an object affects the current clustering only in the neighbourhood of this object, and thus efficient algorithms based on DBSCAN can be given for incremental insertions and deletions to an existing clustering [7].

In [4] another density-based clustering algorithm, DENCLUE, is proposed. This algorithm introduces a new approach to clustering large multimedia databases. The basic idea of this approach is to model the overall point density analytically as the sum of the influence functions of the data points. The influence function can be seen as a function which describes the impact of a data point within its neighbourhood. Clusters can then be identified by determining density attractors, which are the local maxima of the overall density function. In addition, clusters of arbitrary shape can be easily described by a simple equation based on the overall density function.
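The DBSCAN neighbourhood rule above, that every point of a cluster must have at least a minimum number of points within a given radius, can be sketched as follows. This is a simplified illustration of the idea (quadratic neighbourhood search, no spatial index), not the paper's implementation.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: a point whose eps-neighbourhood holds at least
    min_pts points seeds a cluster that is grown through density-reachable
    neighbours; points reached by no core point stay noise (label -1)."""
    labels = {p: None for p in points}
    cid = -1

    def neighbours(p):
        return [q for q in points if math.dist(p, q) <= eps]

    for p in points:
        if labels[p] is not None:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:
            labels[p] = -1            # provisionally noise
            continue
        cid += 1                      # p is a core point: start a new cluster
        labels[p] = cid
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cid       # noise reclaimed as a border point
            if labels[q] is not None:
                continue
            labels[q] = cid
            qn = neighbours(q)
            if len(qn) >= min_pts:    # q is also a core point: keep expanding
                queue.extend(qn)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
labels = dbscan(pts, eps=1.5, min_pts=3)
# the three nearby points form one cluster; the isolated point is noise
```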
The main advantages of DENCLUE are that it has good clustering properties for data sets with large amounts of noise and that it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets. However, DENCLUE clustering is based on two parameters and, as in most other approaches, the quality of the resulting clustering depends on their choice. These parameters are [4]: i) the parameter σ, which determines the influence of a data point in its neighbourhood, and ii) ξ, which describes whether a density attractor is significant, allowing a reduction of the number of density attractors and helping to improve performance.

2.4 Grid-based algorithms

Recently a number of clustering algorithms have been presented for spatial data, known as grid-based algorithms. These algorithms quantise the space into a finite number of cells and then do all operations on the quantised space. STING (Statistical Information Grid-based method) is representative of this category. It divides the spatial area into rectangular cells using a hierarchical structure. STING [30] goes through the data set and computes the statistical parameters (such as mean, variance, minimum, maximum and type of distribution) of each numerical feature of the objects within the cells. Then it generates a hierarchical structure of the grid cells so as to represent the clustering information at different levels. Based on this structure, STING enables the use of the clustering information to answer queries or to efficiently assign a new object to a cluster.

WaveCluster [5] is the latest grid-based algorithm proposed in the literature. It is based on signal processing techniques (the wavelet transform) to convert the spatial data into the frequency domain. More specifically, it first summarizes the data by imposing a multidimensional grid structure onto the data space []. Each grid cell summarizes the information of a group of points that map into the cell. Then it uses a wavelet transformation to transform the original feature space. In the wavelet transform, convolution with an appropriate function results in a transformed space where the natural clusters in the data become distinguishable. Thus, we can identify the clusters by finding the dense regions in the transformed domain. A priori knowledge about the exact number of clusters is not required by WaveCluster.

2.5 Fuzzy Clustering

The algorithms described above result in crisp clusters, meaning that a data point either belongs to a cluster or not. The clusters are non-overlapping, and this kind of partitioning is called crisp clustering. The issue of uncertainty support in the clustering task leads to the introduction of algorithms that use fuzzy logic concepts in their procedure. A common fuzzy clustering algorithm is Fuzzy C-Means (FCM), an extension of the classical C-Means algorithm for fuzzy applications []. FCM attempts to find the most characteristic point in each cluster, which can be considered as the center of the cluster, and then the grade of membership of each object in the clusters. Another approach proposed in the literature to solve the problems of crisp clustering is based on probabilistic models.
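The FCM idea of cluster centres plus per-object membership grades can be sketched as the standard alternating update between memberships and weighted centres. This is an illustrative sketch; the deterministic membership initialisation and the fuzziness value m = 2 are our assumptions, not details taken from the survey.

```python
import math

def fuzzy_cmeans(points, c, m=2.0, iters=50):
    """Sketch of Fuzzy C-Means: alternate between membership degrees
    u[i][j] (how much point i belongs to cluster j) and membership-weighted
    cluster centres, minimizing J_m(U, V) = sum_ij u_ij^m d^2(x_i, v_j)."""
    n = len(points)
    # deterministic initial memberships: split points evenly (a simplification)
    u = [[1.0 if j == (i * c) // n else 0.0 for j in range(c)] for i in range(n)]
    for _ in range(iters):
        # centres: weighted means with weights u_ij^m
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            centers.append([sum(wi * p[d] for wi, p in zip(w, points)) / sum(w)
                            for d in range(len(points[0]))])
        # memberships: standard FCM update u_ij = 1 / sum_k (d_ij/d_ik)^(2/(m-1))
        for i, p in enumerate(points):
            dist = [max(math.dist(p, centers[j]), 1e-12) for j in range(c)]
            for j in range(c):
                u[i][j] = 1.0 / sum((dist[j] / dist[k]) ** (2 / (m - 1))
                                    for k in range(c))
    return centers, u

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, u = fuzzy_cmeans(pts, 2)
# each point ends with a high membership in its own cluster but a small,
# nonzero membership in the other: the degrees of belief the text describes
```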
The basis of this type of clustering algorithm is the EM algorithm, which provides a quite general approach to learning in the presence of unobservable variables [0]. A common algorithm of this type is the probabilistic variant of K-Means, which is based on a mixture of Gaussian distributions. This variant of K-Means uses probability density rather than distance to associate records with clusters []. More specifically, it regards the centers of the clusters as the means of Gaussian distributions. Then, it estimates the probability that a data point was generated by the j-th Gaussian (i.e., that it belongs to the j-th cluster). This approach relies on the Gaussian model to extract clusters and assigns the data points to clusters assuming that they are generated by a normal distribution. Also, this approach is implemented only by algorithms which are based on the EM (Expectation Maximization) algorithm.

3. Comparison of Clustering Algorithms

Clustering is broadly recognized as a useful tool in many applications. Researchers of many disciplines have addressed the clustering problem. However, it is a difficult problem, which combines concepts of diverse scientific fields (such as databases, machine learning, pattern recognition, statistics). Thus, the differences in assumptions and context among the different research communities have caused a number of clustering methodologies and algorithms to be defined. This section offers an overview of the main characteristics of the clustering algorithms, presented in a comparative way. We consider the algorithms categorized in four groups based on their clustering method: partitional, hierarchical, density-based and grid-based algorithms. Tables 1, 2 and 3 summarize the main concepts and characteristics of the most representative algorithms of these clustering categories. More specifically, our study is based on the following features of the algorithms: i) the type of data that an algorithm supports (numerical, categorical), ii) the shape of the clusters, iii) the ability to handle noise and outliers, iv) the clustering criterion and v) the complexity. Moreover, we present the input parameters of the algorithms, while we study the influence of these parameters on the clustering results.
Finally, we describe the type of the algorithms' results, i.e., the information that an algorithm provides so as to represent the discovered clusters in a data set.

As Table 1 depicts, partitional algorithms are applicable mainly to numerical data sets. However, there are some variants of K-Means, such as K-mode, which handle categorical data. K-mode is based on the K-Means method to discover clusters, while it adopts new concepts in order to handle categorical data. Thus, the cluster centers are replaced with "modes", and a new dissimilarity measure is used to deal with categorical objects. Another characteristic of partitional algorithms is that they are unable to handle noise and outliers, and they are not suitable for discovering clusters with non-convex shapes. Moreover, they are based on certain assumptions to partition a data set. Thus, they need to specify the number of clusters in advance, except for CLARANS, which needs as input the maximum number of neighbours of a node as well as the number of local minima that will be found in order to define a partitioning of a dataset. The result of the clustering process is the set of representative points of the discovered clusters. These points may be the centers or the medoids (the most centrally located object within a cluster) of the clusters, depending on the algorithm. As regards the clustering criteria, the objective of these algorithms is to minimize the distance of the objects within a cluster from the representative point of this cluster. Thus, the criterion of K-Means aims at the minimization of the distance of the objects belonging to a cluster from the cluster center, while PAM minimizes their distance from the cluster's medoid. CLARA and CLARANS, as mentioned above, are based on the clustering criterion of PAM. However, they consider samples of the data set on which clustering is applied, and as a consequence they may deal with larger data sets than PAM. More specifically, CLARA draws multiple samples of the data set and applies PAM on each sample. Then it outputs the best of these clusterings. The problem of this approach is that its efficiency depends on the sample size. Also, the clustering results are produced based only on samples of the data set. Thus, it is clear that if a sample is biased, a good clustering based on the samples will not necessarily represent a good clustering of the whole data set. On the other hand, CLARANS is a mixture of PAM and CLARA. A key difference between CLARANS and PAM is that the former searches a subset of the dataset in order to define the clusters []. The subsets are drawn with some randomness at each step of the search, in contrast to CLARA, which has a fixed sample at every stage. This has the benefit of not confining the search to a localized area. In general terms, CLARANS is more efficient and scalable than both CLARA and PAM.

The algorithms described above are crisp clustering algorithms, that is, they consider that a data point (object) may belong to one and only one cluster. However, the boundaries of a cluster can hardly be defined in a crisp way if we consider real-life cases. FCM is a representative algorithm of fuzzy clustering, which is based on K-Means concepts in order to partition a data set into clusters. However, it introduces the concept of uncertainty and assigns the objects to the clusters with an attached degree of belief. Thus, an object may belong to more than one cluster, with different degrees of belief.

A summarized view of the characteristics of hierarchical clustering methods is presented in Table 2. The algorithms of this category create a hierarchical decomposition of the database, represented as a dendrogram. They are more efficient at handling noise and outliers than partitional algorithms.
However, they break down due to their non-linear time complexity (typically O(n²), where n is the number of points in the dataset) and huge I/O cost when the number of input data points is large. BIRCH tackles this problem using a hierarchical data structure called a CF-tree for multiphase clustering. In BIRCH, a single scan of the dataset yields a good clustering, and one or more additional scans can be used to improve the quality further. However, it handles only numerical data and it is order-sensitive (i.e., it may generate different clusters for different orders of the same input data). Also, BIRCH does not perform well when the clusters do not have uniform size and shape, since it uses only the centroid of a cluster when redistributing the data points in the final phase.

On the other hand, CURE employs a combination of random sampling and partitioning to handle large databases. It identifies clusters having non-spherical shapes and wide variances in size by representing each cluster by multiple points. The representative points of a cluster are generated by selecting well-scattered points from the cluster and shrinking them toward the centre of the cluster by a specified fraction. However, CURE is sensitive to some parameters, such as the number of representative points, the shrink factor used for handling outliers and the number of partitions. Thus, the quality of the clustering results depends on the selection of these parameters.

ROCK is a representative hierarchical clustering algorithm for categorical data. It introduces a novel concept called "links" in order to measure the similarity/proximity between a pair of data points. Thus, the ROCK clustering method extends to non-metric similarity measures that are relevant to categorical data sets. It also exhibits good scalability properties in comparison with the traditional algorithms, employing random sampling techniques. Moreover, it seems to handle successfully data sets with significant differences in the sizes of the clusters.

The third category of our study is the density-based clustering algorithms (Table 3). They suitably handle arbitrarily shaped collections of points (e.g. ellipsoidal, spiral, cylindrical) as well as clusters of different sizes. Moreover, they can efficiently separate noise (outliers). Two widely known algorithms of this category, as mentioned above, are DBSCAN and DENCLUE. DBSCAN requires the user to specify the radius of the neighbourhood of a point, Eps, and the minimum number of points in the neighbourhood, MinPts. It is then obvious that DBSCAN is very sensitive to the parameters Eps and MinPts, which are difficult to determine. Similarly, DENCLUE requires careful selection of its input parameter values (i.e., σ and ξ), since such parameters may influence the quality of the clustering results. However, the major advantages of DENCLUE in comparison with other clustering algorithms are []: i) it has a solid mathematical foundation and generalizes other clustering methods, such as partitional and hierarchical ones, ii) it has good clustering properties for data sets with large amounts of noise, iii) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets, and iv) it uses grid cells and only keeps information about the cells that actually contain points. It manages these cells in a tree-based access structure, and thus it is significantly faster than some influential algorithms, such as DBSCAN. In general terms, the complexity of density-based algorithms is O(log n). They do not perform any sort of sampling, and thus they may incur substantial I/O costs. Finally, density-based algorithms may fail to use random sampling to reduce the input size, unless the sample size is large. This is because there may be substantial differences between the density in the sample's clusters and the density in the whole data set.

Table 1. The main characteristics of the Partitional Clustering Algorithms.

K-Means
  Type of data: numerical. Complexity*: O(n). Geometry: non-convex shapes not handled. Outliers/noise: no.
  Input parameters: number of clusters. Results: centers of clusters.
  Clustering criterion: min_{v_1,…,v_c} E, where E = Σ_{i=1}^{c} Σ_{x∈C_i} d²(x, v_i).

K-mode
  Type of data: categorical. Complexity: O(n). Geometry: non-convex shapes not handled. Outliers/noise: no.
  Input parameters: number of clusters. Results: modes of clusters.
  Clustering criterion: min_{Q_1,…,Q_c} E = Σ_{i=1}^{c} Σ_{l} D(X_l, Q_i), where D(X_l, Q_i) is the distance between categorical object X_l and mode Q_i.

PAM
  Type of data: numerical. Complexity: O(k(n−k)²). Geometry: non-convex shapes not handled. Outliers/noise: no.
  Input parameters: number of clusters. Results: medoids of clusters.
  Clustering criterion: min TC_ih, TC_ih = Σ_j C_jih (C_jih is the cost of replacing medoid i with h as far as object O_j is concerned).

CLARA
  Type of data: numerical. Complexity: O(k(40+k)² + k(n−k)). Geometry: non-convex shapes not handled. Outliers/noise: no.
  Input parameters: number of clusters. Results: medoids of clusters.
  Clustering criterion: min TC_ih, TC_ih = Σ_j C_jih.

CLARANS
  Type of data: numerical. Complexity: O(kn²). Geometry: non-convex shapes not handled. Outliers/noise: no.
  Input parameters: number of clusters, maximum number of neighbours examined. Results: medoids of clusters.
  Clustering criterion: min TC_ih, TC_ih = Σ_j C_jih.

FCM (Fuzzy C-Means)
  Type of data: numerical. Complexity: O(n). Geometry: non-convex shapes not handled. Outliers/noise: no.
  Input parameters: number of clusters. Results: centers of clusters, beliefs (membership degrees).
  Clustering criterion: min_{U,V} J_m(U, V), where J_m(U, V) = Σ_{i=1}^{n} Σ_{j=1}^{c} u_ij^m d²(x_i, v_j).

* n is the number of points in the dataset and k the number of clusters defined.

Table 2. The main characteristics of the Hierarchical Clustering Algorithms.

BIRCH
  Type of data: numerical. Complexity*: O(n). Geometry: non-convex shapes not handled. Outliers: yes.
  Input parameters: radius of clusters, branching factor.
  Results: CF = (number of points in the cluster N, linear sum of the points in the cluster LS, square sum of the data points SS).
  Clustering criterion: a point is assigned to the closest node (cluster) according to a chosen distance metric; the cluster definition also requires that the number of points in each cluster satisfies a threshold.

CURE
  Type of data: numerical. Complexity: O(n² log n) time, O(n) space. Geometry: arbitrary. Outliers: yes.
  Input parameters: number of clusters, number of representatives.
  Results: assignment of data values to clusters.
  Clustering criterion: the clusters with the closest pair of representatives (well-scattered points) are merged at each step.

ROCK
  Type of data: categorical. Complexity: O(n² + n m_m m_a + n² log n) time, O(min{n², n m_m m_a}) space, where m_m is the maximum number of neighbours for a point and m_a is the average number of neighbours for a point. Geometry: arbitrary. Outliers: yes.
  Input parameters: number of clusters.
  Results: assignment of data values to clusters.
  Clustering criterion: max Σ_l E_l, with E_l = n_l Σ_{p_q,p_r∈V_l} link(p_q, p_r) / n_l^{1+2f(θ)}, where link(p_q, p_r) is the number of common neighbours between p_q and p_r.

* n is the number of points in the dataset under consideration.

Table 3. The main characteristics of the Density-based Clustering Algorithms.

DBSCAN
  Type of data: numerical. Complexity*: O(n log n). Geometry: arbitrary. Outliers/noise: yes.
  Input parameters: cluster radius (Eps), minimum number of objects (MinPts).
  Results: assignment of data values to clusters.
  Clustering criterion: merge points that are density-reachable into one cluster.

DENCLUE
  Type of data: numerical. Complexity: O(log n). Geometry: arbitrary. Outliers/noise: yes.
  Input parameters: cluster radius σ, minimum number of objects ξ.
  Results: assignment of data values to clusters.
  Clustering criterion: f_Gauss^D(x*) = Σ_{x_i near x*} e^(−d²(x*, x_i)/(2σ²)), where x* is the density attractor for a point x; if f_Gauss^D(x*) > ξ, then x is attached to the cluster belonging to x*.

Table 4. The main characteristics of the Grid-based Clustering Algorithms.

WaveCluster
  Type of data: spatial. Complexity*: O(n). Geometry: arbitrary. Outliers: yes.
  Input parameters: wavelets, the number of grid cells for each dimension, the number of applications of the wavelet transform.
  Output: clustered objects.
  Clustering criterion: decompose the feature space by applying a wavelet transformation; the average sub-band yields the clusters and the detail sub-bands the cluster boundaries.

STING
  Type of data: spatial. Complexity: O(K), where K is the number of grid cells at the lowest level. Geometry: arbitrary. Outliers: yes.
  Input parameters: number of objects in a cell.
  Output: clustered objects.
  Clustering criterion: divide the spatial area into rectangular cells and employ a hierarchical structure; each cell at a high level is partitioned into a number of smaller cells at the next lower level.

* n is the number of points in the dataset under consideration.

Figure 1. (a) A data set that consists of three clusters; (b) the result of applying K-means when we ask for four clusters.

The last category of our study (see Table 4) refers to grid-based algorithms. The basic concept of these algorithms is that they define a grid over the data space and then perform all operations on the quantised space. In general terms, these approaches are very efficient for large databases and are capable of finding clusters of arbitrary shape and of handling outliers. STING is one of the well-known grid-based algorithms. It divides the spatial area into rectangular cells and stores the statistical parameters of the numerical features of the objects within the cells. The grid structure facilitates parallel processing and incremental updating. Since STING goes through the database once to compute the statistical parameters of the cells, it is generally an efficient method for generating clusters; its time complexity is O(n). However, STING uses a multiresolution approach to perform cluster analysis, so the quality of its clustering results depends on the granularity of the lowest level of the grid. Moreover, STING does not consider the spatial relationship between the children and their neighbouring cells when constructing a parent cell. As a result, all cluster boundaries are either horizontal or vertical, and thus the quality of the clusters is questionable [5]. On the other hand, WaveCluster efficiently detects clusters of arbitrary shape at different scales by exploiting well-known signal processing techniques. It does not require the specification of input parameters such as the number of clusters or a neighbourhood radius, though an a-priori estimate of the expected number of clusters helps in selecting the correct resolution of the clusters. In experimental studies, WaveCluster was found to outperform BIRCH, CLARANS and DBSCAN in terms of efficiency and clustering quality. The same study also shows that it is not efficient in high-dimensional spaces [].

4. Clustering results validity assessment

4.1 Problem Specification

The objective of the clustering methods is to discover significant groups present in a data set.
In general, they should search for clusters whose members are close to each other (in other words, have a high degree of similarity) and well separated. A problem we face in clustering is deciding the optimal number of clusters that fits a data set. In most algorithms' experimental evaluations, 2D data sets are used so that the reader is able to visually verify the validity of the results (i.e., how well the clustering algorithm discovered the clusters of the data set). It is clear that visualization of the data set is a crucial verification of the clustering results. In the case of large multidimensional data sets (e.g., more than three dimensions), however, effective visualization of the data set is difficult. Moreover, perceiving the clusters using the available visualization tools is a difficult task for humans who are not accustomed to higher-dimensional spaces.

The various clustering algorithms behave differently depending on: i) the features of the data set (geometry and density distribution of the clusters), and ii) the input parameter values.

For instance, assume the data set in Figure 1a. It is obvious that we can discover three clusters in the given data set. However, if we run a clustering algorithm (e.g., K-means) with parameter values chosen so as to partition the data set into four clusters (in the case of K-means, the number of clusters), the result of the clustering process would be the clustering scheme presented in Figure 1b. In our example the clustering algorithm (K-means) found the best four clusters into which our data set could be partitioned. However, this is not the optimal partitioning for the considered data set. We define here the term optimal clustering scheme as the outcome of running a clustering algorithm (i.e., a partitioning) that best fits the inherent partitions of the data set. It is obvious from Figure 1b that the depicted scheme does not fit the data set well; the optimal clustering for our data set would be a scheme with three clusters.
As a consequence, if the clustering algorithm's parameters are assigned improper values, the clustering method may produce a partitioning scheme that is not optimal for the specific data set, leading to wrong decisions. The problems of deciding the number of clusters that best fits a data set, as well as the evaluation of clustering results, have been the subject of several research efforts [3, 9, 4, 7, 8, 3].
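The effect described above can be reproduced with a short sketch. The following is a minimal, illustrative Lloyd's k-means in numpy (not the implementation used in the paper); the blob centers and seed are arbitrary choices for the demonstration:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's k-means: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers (keep a center in place if it lost all points).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Three well-separated Gaussian blobs, as in Figure 1a.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.05, size=(50, 2))
               for c in [(0, 0), (1, 0), (0.5, 1)]])
labels3, _ = kmeans(X, 3)
labels4, _ = kmeans(X, 4)  # asking for four clusters forces a true cluster to split
```

With k = 4, the algorithm still converges and reports "the best four clusters", but one of the inherent groups is necessarily split, mirroring Figure 1b.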

Figure 2. Confidence interval for (a) a two-tailed index, (b) a right-tailed index, (c) a left-tailed index, where q_p^0 is the rho proportion of q under hypothesis H0 [8].

In the sequel, we discuss the fundamental concepts of clustering validity and present the most important criteria in the context of clustering validity assessment.

4.2 Validity Indices

In this section, we discuss methods suitable for the quantitative evaluation of clustering results, known as cluster validity methods. We have to mention, however, that these methods only give an indication of the quality of the resulting partitioning, and thus they can only be considered as a tool at the disposal of experts for evaluating the clustering results. In the sequel, we describe the fundamental criteria for each of the cluster validity approaches described above, as well as their representative indices.

How Monte Carlo is used in cluster validity. The goal of using Monte Carlo techniques is the computation of the probability density function of the defined statistical indices. First, we generate a large number of synthetic data sets. For each such synthetic data set, called X_i, we compute the value of the defined index, denoted q_i. Then, based on the respective values of q_i for each of the data sets X_i, we create a scatter-plot. This scatter-plot is an approximation of the probability density function of the index. Figure 2 shows the three possible shapes of the probability density function of an index q. There are three different possible cases depending on the critical interval D_rho, corresponding to the significance level rho (a statistical constant). As we can see, the probability density function of a statistical index q under H0 has a single maximum, and the region D_rho is either a half-line or a union of two half-lines.
Assuming that this shape is right-tailed (Figure 2b), and that we have generated the scatter-plot using r values of the index q, denoted q_i, then in order to accept or reject the Null Hypothesis H0 we examine the following conditions [8]: we reject (accept) H0 if q's value for our data set is greater (smaller) than (1-rho)*r of the q_i values of the respective synthetic data sets X_i. Assuming that the shape is left-tailed (Figure 2c), we reject (accept) H0 if q's value for our data set is smaller (greater) than rho*r of the q_i values. Assuming that the shape is two-tailed (Figure 2a), we accept H0 if q is greater than (rho/2)*r of the q_i values and smaller than (1-rho/2)*r of the q_i values.

4.2.1 External Criteria. Based on external criteria we can work in two different ways. First, we can evaluate the resulting clustering structure C by comparing it to an independent partition P of the data, built according to our intuition about the clustering structure of the data set. Second, we can compare the proximity matrix P to the partition P.

a) Comparison of C with partition P (not for hierarchies of clusterings). Consider C = {C_1, ..., C_m} a clustering structure of a data set X and P = {P_1, ..., P_s} a defined partition of the data, where m != s. We refer to a pair of points (x_v, x_u) from the data set using the following terms:

- SS: both points belong to the same cluster of the clustering structure C and to the same group of partition P.
- SD: both points belong to the same cluster of C and to different groups of P.
- DS: both points belong to different clusters of C and to the same group of P.
- DD: both points belong to different clusters of C and to different groups of P.

Assuming now that a, b, c and d are the numbers of SS, SD, DS and DD pairs respectively, then a + b + c + d = M, which is the maximum number of all pairs in the data set (that is, M = N(N-1)/2, where N is the total number of points in the data set). Now we can define the following indices to measure the degree of similarity between C and P:
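The three acceptance rules above can be written as one small decision function. This is an illustrative sketch (the function name and interface are ours, not the paper's); it compares the observed index q against the r simulated values at significance level rho:

```python
import numpy as np

def mc_decision(q, q_samples, rho=0.05, tail="right"):
    """Accept/reject H0 by comparing q with the r Monte Carlo index values.

    tail: 'right', 'left' or 'two' -- the shape of the index's density.
    Returns True if H0 is rejected at significance level rho.
    """
    q_samples = np.sort(np.asarray(q_samples, dtype=float))
    r = len(q_samples)
    if tail == "right":
        # Reject if q exceeds (1-rho)*r of the simulated values.
        return bool(q > q_samples[int((1 - rho) * r) - 1])
    if tail == "left":
        # Reject if q is below rho*r of the simulated values.
        return bool(q < q_samples[int(rho * r)])
    # Two-tailed: accept only if q lies strictly inside the central band.
    lo = q_samples[int(rho / 2 * r)]
    hi = q_samples[int((1 - rho / 2) * r) - 1]
    return not (lo < q < hi)
```

For a right-tailed index such as the Rand statistic below, an observed value far above the simulated ones rejects the random-structure hypothesis.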

Rand statistic: R = (a + d) / M

Jaccard coefficient: J = a / (a + b + c)

The above two indices take values between 0 and 1, and are maximized when m = s. Another index is the Fowlkes-Mallows index:

FM = sqrt(m_1 * m_2) = a / sqrt((a + b)(a + c))    (2)

where m_1 = a/(a + b) and m_2 = a/(a + c). For the previous three indices it has been proven that high values indicate great similarity between C and P: the higher the values of these indices are, the more similar C and P are. Other indices are:

Hubert's Gamma statistic:

Gamma = (1/M) Sum_{i=1}^{N-1} Sum_{j=i+1}^{N} X(i, j) * Y(i, j)    (3)

High values of this index indicate a strong similarity between X and Y.

Normalized Gamma statistic:

Gamma^ = [ (1/M) Sum_{i=1}^{N-1} Sum_{j=i+1}^{N} (X(i, j) - mu_X)(Y(i, j) - mu_Y) ] / (sigma_X * sigma_Y)    (4)

where mu_X, mu_Y, sigma_X, sigma_Y are the respective means and standard deviations of the X and Y matrices. This index takes values between -1 and 1.

All these statistics have right-tailed probability density functions under the random hypothesis. In order to use these indices in statistical tests, we must know their respective probability density functions under the Null Hypothesis H0, which is the hypothesis of random structure of our data set. This means that, using statistical tests, if we accept the Null Hypothesis then our data are randomly distributed. However, the computation of the probability density functions of these indices is difficult. A solution to this problem is to use Monte Carlo techniques. The procedure is as follows:

For i = 1 to r:
- Generate a data set X_i with N vectors (points) in the area of X, i.e. the generated vectors have the same dimension as those of the data set X.
- Assign each vector y_{j,i} of X_i to the group that x_j of X belongs to, according to the partition P.
- Run the same clustering algorithm used to produce structure C on each X_i, and let C_i be the resulting clustering structure.
- Compute q(C_i), the value of the defined index q for P and C_i.
End For

Create a scatter-plot of the r validity index values q(C_i) computed in the loop.
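The pair-counting indices above follow directly from the SS/SD/DS/DD counts. The sketch below (our own helper, using only the standard library) computes a, b, c, d for two labelings and returns R, J and FM:

```python
from itertools import combinations
from math import sqrt

def external_indices(C, P):
    """Rand, Jaccard and Fowlkes-Mallows indices for two labelings.

    C, P: sequences giving, for each point, its cluster label in the
    clustering structure C and its group label in the partition P.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(C)), 2):
        same_c, same_p = C[i] == C[j], P[i] == P[j]
        if same_c and same_p:
            a += 1      # SS pair
        elif same_c:
            b += 1      # SD pair
        elif same_p:
            c += 1      # DS pair
        else:
            d += 1      # DD pair
    M = a + b + c + d   # = N(N-1)/2
    R = (a + d) / M
    J = a / (a + b + c) if a + b + c else 0.0
    FM = a / sqrt((a + b) * (a + c)) if a else 0.0
    return R, J, FM
```

Note that the indices depend only on label co-membership, so relabeling the clusters (e.g. swapping 0 and 1) leaves them unchanged.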
After having plotted the approximation of the probability density function of the defined statistical index, we compare its value for the original data set, say q, to the q(C_i) values, say q_i. The indices R, J, FM and Gamma defined previously are used as the q index mentioned in the above procedure.

Example: Assume a given data set X containing 100 three-dimensional vectors (points). The points of X form four clusters of 25 points each. Each cluster is generated by a normal distribution. The covariance matrices of these distributions are all equal to 0.2*I, where I is the 3x3 identity matrix. The mean vectors of the four distributions are [0.2, 0.2, 0.2]^T, [0.5, 0.2, 0.8]^T, [0.5, 0.8, 0.2]^T and [0.8, 0.8, 0.8]^T. We independently group the data set X into four groups according to the partition P, for which the first 25 vectors (points) belong to the first group P_1, the next 25 to the second group P_2, the next 25 to the third group P_3 and the last 25 vectors to the fourth group P_4. We run the k-means clustering algorithm for four clusters and let C be the resulting clustering structure. We compute the values of the indices for the clustering C and the partition P, and we get R = 0.9, J = 0.68 and FM = 0.8, together with the corresponding value of Gamma. Then we follow the steps described above in order to define the probability density functions of these four statistics. We generate 100 data sets X_i, i = 1, ..., 100, each consisting of 100 random vectors (in 3 dimensions) drawn from the uniform distribution. According to the partition P defined earlier, for each X_i we assign its first 25 vectors to P_1 and the second, third and fourth groups of 25 vectors to P_2, P_3 and P_4 respectively. Then we run k-means 100 times, once for each X_i, so as to define the respective clustering structures of the data sets, denoted C_i. For each of them we compute the values of the indices R_i, J_i, FM_i, Gamma_i, i = 1, ..., 100. We set the significance level rho = 0.05 and compare these values to the R, J, FM and Gamma values corresponding to X. We accept or reject the null hypothesis depending on whether (1-rho)*r of the R_i, J_i, FM_i, Gamma_i values are greater or smaller than the corresponding values of R, J, FM and Gamma.
In our case the R_i, J_i, FM_i and Gamma_i values are all smaller than the corresponding values of R, J, FM and Gamma, which leads us to the conclusion that the null hypothesis H0 is rejected — something we expected, given the predefined clustering structure of the data set X.

b) Comparison of P (proximity matrix) with partition P. The partition P can be considered as a mapping g: X -> {1, ..., nc}. Assuming the matrix Y with Y(i, j) = 1 if g(x_i) = g(x_j) and 0 otherwise, for i, j = 1, ..., N, we can compute the Gamma (or normalized Gamma) statistic using the proximity matrix P and the matrix Y. Based on the index value, we get an indication of the similarity of the two matrices. To proceed with the evaluation procedure we use the Monte Carlo techniques mentioned above. In the Generate step of the procedure we generate the corresponding mappings g_i for every generated data set X_i. Then, in the Compute step, we compute the matrix Y_i

for each X_i, in order to find the corresponding Gamma statistic index.

4.2.2 Internal Criteria. Using this approach of cluster validity, our goal is to evaluate the clustering result of an algorithm using only quantities and features inherent to the data set. There are two cases in which we apply internal criteria of cluster validity, depending on the clustering structure: a) a hierarchy of clustering schemes, and b) a single clustering scheme.

a) Validating a hierarchy of clustering schemes. The hierarchy diagram produced by a hierarchical algorithm can be represented by a matrix called the cophenetic matrix, P_c. We may define a statistical index to measure the degree of similarity between the P_c and P (proximity) matrices. This index is called the Cophenetic Correlation Coefficient and is defined as:

CPCC = [ (1/M) Sum_{i=1}^{N-1} Sum_{j=i+1}^{N} d_ij * c_ij - mu_P * mu_C ] / sqrt( [ (1/M) Sum_{i=1}^{N-1} Sum_{j=i+1}^{N} d_ij^2 - mu_P^2 ] * [ (1/M) Sum_{i=1}^{N-1} Sum_{j=i+1}^{N} c_ij^2 - mu_C^2 ] )    (5)

where M = N(N-1)/2 and N is the number of points in the data set. Also, mu_P and mu_C are the means of the matrices P and P_c respectively, given by equation (6):

mu_P = (1/M) Sum_{i=1}^{N-1} Sum_{j=i+1}^{N} P(i, j),   mu_C = (1/M) Sum_{i=1}^{N-1} Sum_{j=i+1}^{N} P_c(i, j)    (6)

Moreover, d_ij and c_ij are the (i, j) elements of the matrices P and P_c respectively. The index takes values in [-1, 1], and a value close to 1 is an indication of significant similarity between the two matrices. The Monte Carlo procedure described above is also used in this case of validation.

b) Validating a single clustering scheme. The goal here is to find the degree of agreement between a given clustering scheme C, consisting of nc clusters, and the proximity matrix P. The index used for this approach is Hubert's Gamma statistic (or its normalized version). An additional matrix Y is used for the computation of the index, with Y(i, j) = 1 if x_i and x_j belong to different clusters, and 0 otherwise, for i, j = 1, ..., N. The application of Monte Carlo techniques is again the way to test the random hypothesis on a given data set.

4.2.3 Relative Criteria. The basis of the validation methods described above is statistical testing. Thus, the major drawback of techniques based on internal or external criteria is their high computational cost. A different validation approach is discussed in this section.
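Since equation (5) is simply a Pearson-style correlation over the M point pairs, it can be computed in a few lines. The following is a minimal numpy sketch (the function name is ours); it assumes P and P_c are given as full symmetric N x N matrices:

```python
import numpy as np

def cpcc(P, Pc):
    """Cophenetic Correlation Coefficient between a proximity matrix P
    and the cophenetic matrix Pc of a hierarchy (both N x N, symmetric)."""
    # Upper-triangle indices give the M = N(N-1)/2 distinct point pairs.
    iu = np.triu_indices(len(P), k=1)
    d, c = np.asarray(P, float)[iu], np.asarray(Pc, float)[iu]
    mu_p, mu_c = d.mean(), c.mean()
    num = (d * c).mean() - mu_p * mu_c
    den = np.sqrt(((d ** 2).mean() - mu_p ** 2) *
                  ((c ** 2).mean() - mu_c ** 2))
    return num / den
```

When the hierarchy reproduces the original proximities exactly (P_c = P), the index attains its maximum value of 1.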
It is based on relative criteria and does not involve statistical tests. The fundamental idea of this approach is to choose the best clustering scheme, out of a set of defined schemes, according to a pre-specified criterion. More specifically, the problem can be stated as follows: Let P be the set of parameters associated with a specific clustering algorithm (e.g., the number of clusters nc). Among the clustering schemes C_i, i = 1, ..., nc, defined by a specific algorithm for different values of the parameters in P, choose the one that best fits the data set. We can then consider the following cases of the problem:

I) P does not contain the number of clusters, nc, as a parameter. In this case, the choice of the optimal parameter values proceeds as follows: we run the algorithm for a wide range of its parameter values and choose the largest range for which nc remains constant (usually nc << N, the number of tuples). We then choose as appropriate values of the parameters in P the values that correspond to the middle of this range. This procedure also identifies the number of clusters that underlie our data set.

II) P contains nc as a parameter. The procedure for identifying the best clustering scheme is based on a validity index. Having selected a suitable performance index q, we proceed with the following steps:

- We run the clustering algorithm for all values of nc between a minimum nc_min and a maximum nc_max, defined a-priori by the user.
- For each value of nc, we run the algorithm r times, using different sets of values for the other parameters of the algorithm (e.g., different initial conditions).
- We plot the best value of the index q obtained for each nc as a function of nc.

Based on this plot we may identify the best clustering scheme. We have to stress that there are two approaches for defining the best clustering, depending on the behaviour of q with respect to nc. Thus, if the validity index does not exhibit an increasing or decreasing trend as nc increases, we seek the maximum (minimum) of the plot.
On the other hand, for indices that increase (or decrease) as the number of clusters increases, we search for the value of nc at which a significant local change in the value of the index occurs. This change appears as a "knee" in the plot, and it is an indication of the number of clusters underlying the data set. Moreover, the absence of a knee may be an indication that the data set possesses no clustering structure. In the sequel, some representative validity indices for crisp and fuzzy clustering are presented.

Crisp clustering. This section discusses validity indices suitable for crisp clustering.
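A crude but serviceable way to locate the knee is to find where the successive change of the index is largest. The helper below is our own illustrative sketch of that heuristic, not a method from the paper:

```python
def knee(ncs, q):
    """Return the nc after which a monotone validity index flattens out,
    i.e. the point of the largest jump between successive q values."""
    drops = [abs(q[i + 1] - q[i]) for i in range(len(q) - 1)]
    i = max(range(len(drops)), key=drops.__getitem__)
    return ncs[i + 1]

# A decreasing index that drops sharply between nc=2 and nc=3 and then
# flattens suggests three clusters.
best_nc = knee([2, 3, 4, 5, 6], [10.0, 4.0, 3.8, 3.7, 3.6])
```

More careful knee detection would also check that the change is significant relative to the index's variability; the absence of any dominant jump is then the "no clustering structure" signal mentioned above.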

a) The modified Hubert Gamma statistic. The definition of the modified Hubert Gamma statistic is given by the equation

Gamma = (1/M) Sum_{i=1}^{N-1} Sum_{j=i+1}^{N} P(i, j) * Q(i, j)    (7)

where M = N(N-1)/2, P is the proximity matrix of the data set, and Q is an N x N matrix whose (i, j) element is equal to the distance between the representative points (v_ci, v_cj) of the clusters to which the objects x_i and x_j belong. Similarly, we can define the normalized modified Hubert Gamma statistic (as in equation (4)). If d(v_ci, v_cj) is close to d(x_i, x_j) for i, j = 1, 2, ..., N, then P and Q are in close agreement and the values of Gamma and Gamma^ (normalized Gamma) are high. Conversely, a high value of Gamma (Gamma^) indicates the existence of compact clusters. Thus, in the plot of normalized Gamma versus nc, we seek a significant knee that corresponds to a significant increase of normalized Gamma. The number of clusters at which the knee occurs is an indication of the number of clusters underlying the data. We note that for nc = 1 and nc = N the index is not defined.

b) Dunn and Dunn-like indices. A cluster validity index for crisp clustering, proposed in [5], attempts to identify compact and well-separated clusters. For a specific number of clusters nc, the index is defined by equation (8):

D_nc = min_{i=1,...,nc} { min_{j=i+1,...,nc} [ d(c_i, c_j) / max_{k=1,...,nc} diam(c_k) ] }    (8)

where d(c_i, c_j) is the dissimilarity function between two clusters c_i and c_j, defined as

d(c_i, c_j) = min_{x in c_i, y in c_j} d(x, y)    (9)

and diam(c) is the diameter of a cluster, which may be considered as a measure of the dispersion of the cluster. The diameter of a cluster C can be defined as follows:

diam(C) = max_{x, y in C} d(x, y)    (10)

It is clear that if the data set contains compact and well-separated clusters, the distances between the clusters are expected to be large and the diameters of the clusters are expected to be small. Thus, based on Dunn's index definition, we may conclude that large values of the index indicate the presence of compact and well-separated clusters. The index D_nc does not exhibit any trend with respect to the number of clusters; thus, the maximum in the plot of D_nc versus the number of clusters can be an indication of the number of clusters that fits the data.
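Equations (8)-(10) translate directly into code. The following is a straightforward (deliberately unoptimized) numpy sketch of the Dunn index, reflecting why its computation is expensive — all pairwise distances are examined:

```python
import numpy as np

def dunn(X, labels):
    """Dunn index: minimum inter-cluster distance (eq. 9) divided by the
    maximum cluster diameter (eq. 10). Large values indicate compact,
    well-separated clusters."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Maximum diameter over all clusters (denominator of eq. 8).
    diam = max(np.linalg.norm(a - b) for c in clusters for a in c for b in c)
    # Minimum single-linkage distance between any two clusters (eq. 9).
    sep = min(np.linalg.norm(a - b)
              for i in range(len(clusters))
              for j in range(i + 1, len(clusters))
              for a in clusters[i] for b in clusters[j])
    return sep / diam
```

The nested loops make the cost quadratic in the number of points, and a single noise point inflates diam(c), illustrating the two drawbacks discussed next.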
The shortcomings of the Dunn index are: i) the considerable amount of time required for its computation, and ii) its sensitivity to the presence of noise in the data set, since noise is likely to increase the values of diam(c) (i.e., the denominator of equation (8)).

Three indices that are more robust to the presence of noise are proposed in [3]. They are widely known as Dunn-like indices, since they are based on the Dunn index. The three indices use for their definitions the concepts of the minimum spanning tree (MST), the relative neighbourhood graph (RNG) and the Gabriel graph (GG) respectively [8]. Consider the index based on the MST. Let c_i be a cluster and G the complete graph whose vertices correspond to the vectors of c_i. The weight w_e of an edge e of this graph equals the distance between its two end points x, y. Let E^MST be the set of edges of the MST of the graph G and e^MST the edge in E^MST with the maximum weight. Then the diameter of c_i is defined as the weight of e^MST, and the Dunn-like index based on the concept of the MST is given by the equation

D_nc^MST = min_{i=1,...,nc} { min_{j=i+1,...,nc} [ d(c_i, c_j) / max_{k=1,...,nc} diam^MST(c_k) ] }    (11)

The number of clusters at which D_nc^MST takes its maximum value indicates the number of clusters in the underlying data. Based on similar arguments we may define the Dunn-like indices for the GG and RNG graphs.

c) The Davies-Bouldin (DB) index. A similarity measure R_ij between clusters C_i and C_j is defined based on a measure of dispersion s_i of a cluster C_i and a dissimilarity measure d_ij between two clusters. The R_ij index is defined to satisfy the following conditions [4]:

1. R_ij >= 0
2. R_ij = R_ji
3. If s_i = 0 and s_j = 0 then R_ij = 0
4. If s_j > s_k and d_ij = d_ik then R_ij > R_ik
5. If s_j = s_k and d_ij < d_ik then R_ij > R_ik.

These conditions state that R_ij is nonnegative and symmetric. A simple choice for R_ij that satisfies the above conditions is [4]:

R_ij = (s_i + s_j) / d_ij    (12)

Then the DB index is defined as

DB_nc = (1/nc) Sum_{i=1}^{nc} R_i,   where R_i = max_{j=1,...,nc, j != i} R_ij    (13)

It is clear from the above definition that DB_nc is the average similarity between each cluster c_i, i = 1, ..., nc, and its most similar one. It is desirable for the clusters to have

the minimum possible similarity to each other; therefore we seek clusterings that minimize DB. The DB_nc index exhibits no trend with respect to the number of clusters, and thus we seek the minimum value of DB_nc in its plot versus the number of clusters. Some alternative definitions of the dissimilarity between two clusters, as well as of the dispersion of a cluster c_i, are given in [4]. Moreover, three variants of the DB_nc index are proposed in [3]; they are based on the MST, RNG and GG concepts, similarly to the case of the Dunn-like indices.

Other validity indices for crisp clustering have been proposed in [3] and []. The implementation of most of these indices is computationally very expensive, especially when the number of clusters and the number of objects in the data set grow very large [3]. In [9], an evaluation study of thirty validity indices proposed in the literature is presented. It is based on small data sets (about 50 points each) with well-separated clusters. The results of this study [9] place Calinski and Harabasz (1974), Je(2)/Je(1) (1984), C-index (1976), Gamma and Beale among the six best indices. However, it is noted that although the results concerning these methods are encouraging, they are likely to be data dependent. Thus, the behaviour of the indices may change if different data structures are used [9]. Also, some indices are based on only a sample of the clustering results; a representative example is Je(2)/Je(1), which is computed based only on the information provided by the items involved in the last cluster merge.

d) RMSSTD, SPR, RS, CD. At this point we give the definitions of four validity indices, which have to be used simultaneously to determine the number of clusters existing in the data set. These four indices can be applied to each step of a hierarchical clustering algorithm and are known as [6]:

- Root-mean-square standard deviation (RMSSTD) of the new cluster
- Semi-partial R-squared (SPR)
- R-squared (RS)
- Distance between the two clusters (CD).
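Equations (12)-(13) admit a compact implementation. The sketch below is an illustrative version using centroids as representatives and the mean distance to the centroid as the dispersion s_i (one common choice; [4] discusses alternatives):

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB index (eq. 13): the average, over clusters, of the worst-case
    similarity R_ij = (s_i + s_j) / d_ij to any other cluster.
    Lower values indicate a better partitioning."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    ks = np.unique(labels)
    centers = np.array([X[labels == k].mean(axis=0) for k in ks])
    # Dispersion s_i: mean distance of the cluster's points to its center.
    s = np.array([np.linalg.norm(X[labels == k] - centers[i], axis=1).mean()
                  for i, k in enumerate(ks)])
    nc = len(ks)
    R = [max((s[i] + s[j]) / np.linalg.norm(centers[i] - centers[j])
             for j in range(nc) if j != i)
         for i in range(nc)]
    return sum(R) / nc
```

Plotted against nc, the minimum of this quantity points at the candidate number of clusters, as described above.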
Going into a more detailed description of them, we can say the following.

RMSSTD of a new clustering scheme, defined at a level of the clustering hierarchy, is the square root of the pooled sample variance of all the variables (attributes used in the clustering process). This index measures the homogeneity of the clusters formed at each step of the hierarchical algorithm. Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible. If the values of RMSSTD are higher at a given step than at the previous one, we have an indication that the new clustering scheme is less homogeneous.

In the following definitions we use the notation SS, which means Sum of Squares and refers to the equation

SS = Sum_{i=1}^{n} (X_i - Xbar)^2.

Along with this we use some additional notation: i) SS_w, referring to the within-group sum of squares; ii) SS_b, referring to the between-groups sum of squares; iii) SS_t, referring to the total sum of squares of the whole data set.

SPR of the new cluster is the difference between the pooled SS_w of the new cluster and the sum of the pooled SS_w values of the clusters joined to obtain the new cluster (loss of homogeneity), divided by the pooled SS_t for the whole data set. This index measures the loss of homogeneity after merging the two clusters at a single algorithm step. If the index value is zero, the new cluster is obtained by merging two perfectly homogeneous clusters; if its value is high, the new cluster is obtained by merging two heterogeneous clusters.

RS of the new cluster is the ratio of SS_b to SS_t. As we can understand, SS_b is a measure of the difference between groups. Since SS_t = SS_b + SS_w, the greater SS_b is, the smaller SS_w is, and vice versa. As a result, the greater the differences between groups, the more homogeneous each group is, and vice versa. Thus, RS may be considered both a measure of the degree of difference between clusters and a measure of the degree of homogeneity within them. The values of RS range between 0 and 1; a value of zero indicates that no difference exists among the groups.
On the other hand, when RS equals 1 there is an indication of a significant difference among the groups. The CD index measures the distance between the two clusters that are merged at a given step. This distance depends on the representatives selected for the hierarchical clustering we perform. For instance, in the case of centroid hierarchical clustering, the representatives of the formed clusters are the cluster centers, so CD is the distance between the centers of the clusters. If we use single linkage, CD measures the minimum Euclidean distance between all possible pairs of points; in the case of complete linkage, CD is the maximum Euclidean distance between all pairs of data points; and so on.

Using these four indices we determine the number of clusters that exist in our data set, plotting the values of all four indices over a number of different stages of the clustering algorithm. In this plot we search for the steepest knee or, in other words, the greatest jump in the values of these indices as we move from a higher to a smaller number of clusters.

In the case of non-hierarchical clustering (e.g., K-means) we may also use some of these indices in order to evaluate the resulting clustering. The indices that are more

meaningful to use in this case are RMSSTD and RS. The idea here is to run the algorithm a number of times with a different number of clusters each time. We then plot the respective values of the validity indices for these clusterings and, as the previous example shows, search for a significant knee in the plots. The number of clusters at which the knee is observed indicates the optimal clustering for our data set. In this case the validity indices described before take the following forms:

RMSSTD = [ Sum_{i=1..nc, j=1..d} Sum_{k=1}^{n_ij} (x_k - xbar_j)^2 / Sum_{i=1..nc, j=1..d} (n_ij - 1) ]^(1/2)    (14)

RS = [ Sum_{j=1}^{d} Sum_{k=1}^{n_j} (x_k - xbar_j)^2 - Sum_{i=1..nc, j=1..d} Sum_{k=1}^{n_ij} (x_k - xbar_j)^2 ] / Sum_{j=1}^{d} Sum_{k=1}^{n_j} (x_k - xbar_j)^2    (15)

where nc is the number of clusters, d the number of variables (the data dimension), n_j the number of data values of dimension j, and n_ij the number of data values of dimension j that belong to cluster i. Also, xbar_j is the mean of the data values of dimension j.

e) The SD validity index. A more recent cluster validity approach is proposed in [3]. The SD validity index is defined based on the concepts of the average scattering for clusters and the total separation between clusters. In the sequel, we give the fundamental definitions for this index.

Average scattering for clusters. The average scattering for clusters is defined as

Scat(nc) = (1/nc) Sum_{i=1}^{nc} ||sigma(v_i)|| / ||sigma(X)||    (16)

Total separation between clusters. The definition of the total scattering (separation) between clusters is given by equation (17):

Dis(nc) = (D_max / D_min) Sum_{k=1}^{nc} ( Sum_{z=1}^{nc} ||v_k - v_z|| )^(-1)    (17)

where D_max = max(||v_i - v_j||), i, j in {1, 2, 3, ..., nc}, is the maximum distance between cluster centers and D_min = min(||v_i - v_j||), i, j in {1, 2, ..., nc}, is the minimum distance between cluster centers.

Now we can define a validity index based on equations (16) and (17) as follows:

SD(nc) = a * Scat(nc) + Dis(nc)    (18)

where a is a weighting factor equal to Dis(c_max), with c_max the maximum number of input clusters. The first term, Scat(nc), defined by equation (16), indicates the average compactness of the clusters (i.e., intra-cluster distance). A small value of this term indicates compact clusters, and as the scattering within clusters increases (i.e., they become less compact) the value of Scat(nc) also increases. The second term, Dis(nc), indicates the total separation between the nc clusters (i.e., an indication of inter-cluster distance).
Contrary to the first term, the second term, Dis(nc), is influenced by the geometry of the cluster centers and increases with the number of clusters. It is obvious from the previous discussion that the two terms of SD have different ranges; thus a weighting factor is needed in order to incorporate both terms in a balanced way. The number of clusters, c, that minimizes the above index can be considered an optimal value for the number of clusters present in the data set. The influence of the maximum number of clusters c_max (related to the weighting factor) on the selection of the optimal clustering scheme is also discussed in [3], where it is shown that SD proposes an optimal number of clusters almost irrespectively of c_max. However, the index cannot handle arbitrarily shaped clusters properly; the same applies to all the aforementioned indices.

Fuzzy clustering. In this section, we present validity indices suitable for fuzzy clustering. The objective is to seek clustering schemes in which most of the vectors of the data set exhibit a high degree of membership in one cluster. We note here that a fuzzy clustering is defined by a matrix U = [u_ij], where u_ij denotes the degree of membership of the vector x_i in cluster j. Also, a set of cluster representatives is defined. Similarly to the crisp clustering case, we define a validity index q and search for the minimum or maximum in the plot of q versus nc. Also, in case q exhibits a trend with respect to the number of clusters, we seek a significant knee of decrease (or increase) in the plot of q. In the sequel, two categories of fuzzy validity indices are discussed. The first category uses only the membership values u_ij of a fuzzy partition of the data, while the second involves both the U matrix and the data set itself.

a) Validity indices involving only the membership values. Bezdek proposed in [] the partition coefficient, which is defined as

PC = (1/N) Sum_{i=1}^{N} Sum_{j=1}^{nc} u_ij^2    (19)

The PC index values range in [1/nc, 1], where nc is the number of clusters. The closer the index is to unity, the "crisper" the clustering is. In case all membership values of the fuzzy partition are equal, that is, u_ij = 1/nc, PC obtains its lowest value.
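The SD index of equation (18) can be sketched as follows. This is an illustrative numpy version under one simplifying assumption: sigma(v_i) and sigma(X) are taken as the per-dimension variance vectors of cluster i and of the whole data set, and alpha is supplied by the caller (it would normally be Dis(c_max) from a preliminary run at the largest nc examined):

```python
import numpy as np

def sd_index(X, labels, alpha):
    """SD validity index: alpha * Scat(nc) + Dis(nc) (eqs. 16-18).
    Lower values indicate a better number of clusters."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    ks = np.unique(labels)
    centers = np.array([X[labels == k].mean(axis=0) for k in ks])
    # Scat(nc): mean norm of within-cluster variance vectors, relative
    # to the norm of the data set's variance vector (eq. 16).
    scat = np.mean([np.linalg.norm(X[labels == k].var(axis=0)) for k in ks])
    scat /= np.linalg.norm(X.var(axis=0))
    # Dis(nc): ratio Dmax/Dmin times the sum of inverse total
    # center-to-center distances (eq. 17).
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    dmax, dmin = d.max(), d[d > 0].min()
    dis = (dmax / dmin) * np.sum(1.0 / d.sum(axis=1))
    return alpha * scat + dis
```

Evaluating sd_index over nc = 2, ..., c_max and taking the minimizer implements the selection rule described above.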
Thus, the closer the value of PC is to 1/nc, the fuzzier the clustering is. Furthermore, a value close to 1/nc indicates that there is no clustering tendency in the considered data set, or that the clustering algorithm failed to reveal it.

The partition entropy coefficient is another index of this category. It is defined as

PE = -(1/N) Sum_{i=1}^{N} Sum_{j=1}^{nc} u_ij * log_a(u_ij)    (20)

where a is the base of the logarithm. The index is computed for values of nc greater than 1 and its values range in [0, log_a nc]. The closer the value of PE is to 0, the "harder" the clustering is. As in the previous case, values of the index close to the upper bound (i.e., log_a nc) indicate the absence of any clustering structure in the data set, or the inability of the algorithm to extract it.

The drawbacks of these indices are: i) their monotonous dependency on the number of clusters — thus we seek significant knees of increase (for PC) or decrease (for PE) in the plots of the indices versus the number of clusters; ii) their sensitivity to the fuzzifier m — more specifically, as m approaches 1 the indices give the same values for all values of nc, while as m grows large both PC and PE exhibit a significant knee at nc = 2; iii) their lack of direct connection to the geometry of the data [3], since they do not use the data themselves.

b) Indices involving the membership values and the data set. The Xie-Beni index [3], XB, also called the compactness and separation validity function, is a representative index of this category. Consider a fuzzy partition of the data set X = {x_j; j = 1, ..., n} with v_i (i = 1, ..., nc) the centers of the clusters and u_ij the membership of data point j with respect to cluster i. The fuzzy deviation of x_j from cluster i, d_ij, is defined as the distance between x_j and the center of cluster i, weighted by the fuzzy membership of data point j with respect to cluster i:

d_ij = u_ij ||x_j - v_i||    (21)

Also, for a cluster i, the summation of the squares of the fuzzy deviations of the data points in X, denoted sigma_i, is called the variation of cluster i, and the summation of the variations of all clusters, sigma, is called the total variation of the data set. The quantity pi = sigma/N is called the compactness of the partition; since N is the number of points in the data set, pi is the average variation. Also, the separation of the fuzzy partition is defined as the minimum distance between cluster centres, d_min = min ||v_i - v_j||. Then the XB index is defined as XB = pi / (N * d_min), where N is the number of points in the data set.
It is clear that small values of XB are expected for compact and well-separated clusters. We note, however, that XB decreases monotonically as the number of clusters nc gets very large and close to n. One way to eliminate this decreasing tendency of the index is to determine a starting point, c_max, of the monotonic behaviour and to search for the minimum value of XB in the range [2, c_max]. Moreover, the values of the index XB depend on the fuzzifier m: as m grows large, XB tends to infinity.

Another index of this category is the Fukuyama-Sugeno index, which is defined as

FS_m = Sum_{i=1}^{N} Sum_{j=1}^{nc} u_ij^m ( ||x_i - v_j||^2_A - ||v_j - vbar||^2_A )    (22)

where vbar is the mean vector of X and A is an l x l positive definite, symmetric matrix. When A = I, the above distances become the squared Euclidean distance. It is clear that for compact and well-separated clusters we expect small values of FS_m. The first term in the parenthesis measures the compactness of the clusters and the second one measures the distances of the cluster representatives.

Other fuzzy validity indices are proposed in [9]; they are based on the concepts of hypervolume and density. Let Sigma_j be the fuzzy covariance matrix of the j-th cluster, defined as

Sigma_j = Sum_{i=1}^{N} u_ij^m (x_i - v_j)(x_i - v_j)^T / Sum_{i=1}^{N} u_ij^m    (23)

Then the total fuzzy hypervolume is given by the equation

FH = Sum_{j=1}^{nc} V_j    (24)

where V_j = |Sigma_j|^(1/2). Small values of FH indicate the existence of compact clusters. The average partition density is also an index of this category. It can be defined as

PA = (1/nc) Sum_{j=1}^{nc} S_j / V_j    (25)

where S_j = Sum_{x in X_j} u_ij is called the sum of the central members of cluster j, and X_j is the set of data points that lie within a pre-specified region around v_j. A different measure is the partition density index, defined as

PD = S / FH    (26)

where S = Sum_{j=1}^{nc} S_j. A few other indices are proposed and discussed in [7, 4].

4.3 Other approaches for cluster validity

Another approach for finding the best number of clusters of a data set is proposed in [7]. It introduces a practical clustering algorithm based on Monte Carlo cross-validation. More specifically, the algorithm consists of M
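The two membership-only indices of equations (19)-(20) can be computed together from the membership matrix U. The helper below is our own sketch (it treats the 0*log(0) terms as 0, which is the usual convention):

```python
import numpy as np

def pc_pe(U, base=np.e):
    """Partition coefficient (eq. 19) and partition entropy (eq. 20)
    of a fuzzy membership matrix U (N points x nc clusters, rows sum to 1)."""
    U = np.asarray(U, dtype=float)
    N = len(U)
    pc = (U ** 2).sum() / N
    with np.errstate(divide="ignore"):
        # Replace log(0) entries by 0 so that 0 * log(0) contributes 0.
        logs = np.where(U > 0, np.log(U) / np.log(base), 0.0)
    pe = -(U * logs).sum() / N
    return pc, pe
```

A crisp partition gives PC = 1 and PE = 0, while a maximally fuzzy one (u_ij = 1/nc everywhere) gives PC = 1/nc and PE = log_a(nc), matching the bounds discussed above.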

Figure 3: Datasets — (a) DataSet1, (b) DataSet2, (c) DataSet3, (d) DataSet4, (e) Real_Data

cross-validation runs over M chosen train/test partitions of a data set, D. For each partition u, the EM algorithm is used to fit c clusters to the training data, while c is varied from 1 to c_max. Then, the log-likelihood L_c^u(D) is calculated for each model with c clusters. It is defined using the probability density function of the data as

L(D) = Σ_{i=1}^{n} log f(x_i / Φ)    (17)

where f is the probability density function for the data and Φ denotes the parameters that have been estimated from the data. This is repeated M times and the M cross-validated estimates are averaged for each nc. Based on these estimates we may define the posterior probability for each value of the number of clusters nc, p(nc/D). If one of the p(nc/D) is near 1, there is strong evidence that the particular number of clusters is the best for our data set.

The evaluation approach proposed in [7] is based on density functions considered for the data set. Thus, it relies on concepts related to probabilistic models in order to estimate the number of clusters best fitting a data set, and it does not use concepts directly related to the data (i.e., inter-cluster and intra-cluster distances).

4.4 An experimental study of validity indices

In this section we present a comparative experimental evaluation of the important validity measures, aiming at illustrating their advantages and disadvantages. We consider the known relative validity indices proposed in the literature, such as RS-RMSSTD [6], DB [8] and the recent one SD [3]. The definitions of these validity indices can be found in Section 4.3. RMSSTD and RS have to be taken into account simultaneously in order to find the correct number of clusters. The optimal values of the number of clusters are those for which a significant local change in the values of RS and RMSSTD occurs. As regards DB, an indication of the optimal clustering scheme is the point at which it takes its minimum value.
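The cross-validated likelihood idea of Section 4.3 can be sketched in a few lines. The sketch below is not the algorithm of [7]: it is our own toy 1-D EM for Gaussian mixtures, with a quantile-based initialisation of our choosing, and it simply compares the held-out log-likelihood L(D) of equation (17) for c = 1 versus c = 2 on clearly bimodal data; a faithful implementation would average over M random train/test partitions and turn the averaged estimates into posteriors p(nc/D).

```python
import math

def pdf(x, mu, var):
    """Gaussian density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, c, iters=50):
    """Toy EM for a c-component 1-D Gaussian mixture (illustrative only)."""
    xs_sorted = sorted(xs)
    mu = [xs_sorted[int((j + 0.5) * len(xs) / c)] for j in range(c)]  # quantile init
    var = [1.0] * c
    w = [1.0 / c] * c
    for _ in range(iters):
        # E-step: responsibilities r_ij = P(component j | x_i)
        R = []
        for x in xs:
            p = [w[j] * pdf(x, mu[j], var[j]) for j in range(c)]
            s = sum(p)
            R.append([pj / s for pj in p])
        # M-step: re-estimate weights, means and variances
        for j in range(c):
            nj = sum(r[j] for r in R)
            w[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(R, xs)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(R, xs)) / nj + 1e-6
    return w, mu, var

def log_likelihood(xs, params):
    """L(D) = sum_i log f(x_i | Phi), as in equation (17)."""
    w, mu, var = params
    return sum(math.log(sum(w[j] * pdf(x, mu[j], var[j]) for j in range(len(w))))
               for x in xs)

# Two well-separated 1-D clusters around 0 and 5; fit on train, score on test.
train = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2]
test = [-0.15, -0.05, 0.05, 0.15, 4.85, 4.95, 5.05, 5.15]
scores = {c: log_likelihood(test, em_gmm_1d(train, c)) for c in (1, 2)}
best_c = max(scores, key=scores.get)
print(best_c)  # the held-out likelihood favours two components here
```

The key point the sketch shows is that the model is scored on data it was not fitted to, so an over- or under-specified number of components is penalised by the density itself rather than by any inter- or intra-cluster distance.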
For our study, we used four synthetic two-dimensional data sets, further referred to as DataSet1, DataSet2, DataSet3 and DataSet4 (see Figure 3a-d), and a real data set, Real_Data (Figure 3e), representing a part of the Greek road network [9]. Table 5 summarizes the results of the validity indices (RS, RMSSTD, DB, SD) for different clustering schemes of the above-mentioned data sets, as resulting from a clustering algorithm. For our study, we use the results of the algorithms K-means and CURE with their input value (number of clusters) ranging between 2 and 8.

Indices RS and RMSSTD propose the partitioning of DataSet1 into three clusters, while DB selects six as the best partitioning. On the other hand, SD selects four clusters as the best partitioning for DataSet1, which is the correct number of clusters fitting the data set. Moreover, the index DB selects the correct number of clusters (i.e., seven) as the optimal partitioning for DataSet3, while RS-RMSSTD and SD select the clustering schemes of five and six clusters respectively. Also, all indices propose three clusters as the best partitioning for Real_Data. In the case of DataSet2, DB and SD select three clusters as the optimal scheme, while RS-RMSSTD selects two (i.e., the correct number of clusters fitting the data set).

Table 5: Optimal number of clusters proposed by validity indices RS-RMSSTD, DB, SD

                DataSet1   DataSet2   DataSet3   DataSet4   Real_Data
  RS, RMSSTD       3          2          5          -           3
  DB               6          3          7          -           3
  SD               4          3          6          -           3
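To make the minimum-seeking rule for DB concrete, the sketch below (our own minimal implementation, with plain Euclidean distances and a toy 2-D data set of our choosing) computes the Davies-Bouldin index for two candidate partitions: the correct three-cluster scheme scores lower than a scheme that merges two of the clusters, which is exactly the behaviour exploited in the experiments above.

```python
import math

def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def dist(p, q):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def davies_bouldin(clusters):
    """DB = (1/nc) * sum_i max_{j != i} (s_i + s_j) / d(v_i, v_j),
    where s_i is the average distance of cluster i's points to its centroid v_i."""
    V = [centroid(c) for c in clusters]
    s = [sum(dist(p, v) for p in c) / len(c) for c, v in zip(clusters, V)]
    nc = len(clusters)
    return sum(max((s[i] + s[j]) / dist(V[i], V[j])
                   for j in range(nc) if j != i) for i in range(nc)) / nc

# Three compact 2-D blobs around (0,0), (5,0) and (0,5).
blob = lambda cx, cy: [[cx + dx, cy + dy] for dx in (-0.1, 0.1) for dy in (-0.1, 0.1)]
a, b, c = blob(0, 0), blob(5, 0), blob(0, 5)

good = [a, b, c]      # the correct 3-cluster scheme
merged = [a + b, c]   # a 2-cluster scheme that merges two blobs
print(davies_bouldin(good), davies_bouldin(merged))
```

Scanning such scores across the candidate partitions produced by K-means or CURE for each number of clusters, and taking the minimum, is the selection procedure the table above reports for DB.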

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems Determnng Fuzzy Sets for Quanttatve Attrbutes n Data Mnng Problems ATTILA GYENESEI Turku Centre for Computer Scence (TUCS) Unversty of Turku, Department of Computer Scence Lemmnkäsenkatu 4A, FIN-5 Turku

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

Machine Learning. Topic 6: Clustering

Machine Learning. Topic 6: Clustering Machne Learnng Topc 6: lusterng lusterng Groupng data nto (hopefully useful) sets. Thngs on the left Thngs on the rght Applcatons of lusterng Hypothess Generaton lusters mght suggest natural groups. Hypothess

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Clustering. A. Bellaachia Page: 1

Clustering. A. Bellaachia Page: 1 Clusterng. Obectves.. Clusterng.... Defntons... General Applcatons.3. What s a good clusterng?. 3.4. Requrements 3 3. Data Structures 4 4. Smlarty Measures. 4 4.. Standardze data.. 5 4.. Bnary varables..

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

SCALABLE AND VISUALIZATION-ORIENTED CLUSTERING FOR EXPLORATORY SPATIAL ANALYSIS

SCALABLE AND VISUALIZATION-ORIENTED CLUSTERING FOR EXPLORATORY SPATIAL ANALYSIS SCALABLE AND VISUALIZATION-ORIENTED CLUSTERING FOR EXPLORATORY SPATIAL ANALYSIS J.H.Guan, F.B.Zhu, F.L.Ban a School of Computer, Spatal Informaton & Dgtal Engneerng Center, Wuhan Unversty, Wuhan, 430079,

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Supervsed vs. Unsupervsed Learnng Up to now we consdered supervsed learnng scenaro, where we are gven 1. samples 1,, n 2. class labels for all samples 1,, n Ths s also

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT 3. - 5. 5., Brno, Czech Republc, EU APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT Abstract Josef TOŠENOVSKÝ ) Lenka MONSPORTOVÁ ) Flp TOŠENOVSKÝ

More information

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data Malaysan Journal of Mathematcal Scences 11(S) Aprl : 35 46 (2017) Specal Issue: The 2nd Internatonal Conference and Workshop on Mathematcal Analyss (ICWOMA 2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

Hierarchical agglomerative. Cluster Analysis. Christine Siedle Clustering 1

Hierarchical agglomerative. Cluster Analysis. Christine Siedle Clustering 1 Herarchcal agglomeratve Cluster Analyss Chrstne Sedle 19-3-2004 Clusterng 1 Classfcaton Basc (unconscous & conscous) human strategy to reduce complexty Always based Cluster analyss to fnd or confrm types

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

1. Introduction. Abstract

1. Introduction. Abstract Image Retreval Usng a Herarchy of Clusters Danela Stan & Ishwar K. Seth Intellgent Informaton Engneerng Laboratory, Department of Computer Scence & Engneerng, Oaland Unversty, Rochester, Mchgan 48309-4478

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Clustering is a discovery process in data mining.

Clustering is a discovery process in data mining. Cover Feature Chameleon: Herarchcal Clusterng Usng Dynamc Modelng Many advanced algorthms have dffculty dealng wth hghly varable clusters that do not follow a preconceved model. By basng ts selectons on

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

A Clustering Algorithm for Key Frame Extraction Based on Density Peak

A Clustering Algorithm for Key Frame Extraction Based on Density Peak Journal of Computer and Communcatons, 2018, 6, 118-128 http://www.scrp.org/ournal/cc ISSN Onlne: 2327-5227 ISSN Prnt: 2327-5219 A Clusterng Algorthm for Key Frame Extracton Based on Densty Peak Hong Zhao

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

K-means and Hierarchical Clustering

K-means and Hierarchical Clustering Note to other teachers and users of these sldes. Andrew would be delghted f you found ths source materal useful n gvng your own lectures. Feel free to use these sldes verbatm, or to modfy them to ft your

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

A Statistical Model Selection Strategy Applied to Neural Networks

A Statistical Model Selection Strategy Applied to Neural Networks A Statstcal Model Selecton Strategy Appled to Neural Networks Joaquín Pzarro Elsa Guerrero Pedro L. Galndo joaqun.pzarro@uca.es elsa.guerrero@uca.es pedro.galndo@uca.es Dpto Lenguajes y Sstemas Informátcos

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Graph-based Clustering

Graph-based Clustering Graphbased Clusterng Transform the data nto a graph representaton ertces are the data ponts to be clustered Edges are eghted based on smlarty beteen data ponts Graph parttonng Þ Each connected component

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

Lecture 4: Principal components

Lecture 4: Principal components /3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010 Smulaton: Solvng Dynamc Models ABE 5646 Week Chapter 2, Sprng 200 Week Descrpton Readng Materal Mar 5- Mar 9 Evaluatng [Crop] Models Comparng a model wth data - Graphcal, errors - Measures of agreement

More information

Clustering Algorithm of Similarity Segmentation based on Point Sorting

Clustering Algorithm of Similarity Segmentation based on Point Sorting Internatonal onference on Logstcs Engneerng, Management and omputer Scence (LEMS 2015) lusterng Algorthm of Smlarty Segmentaton based on Pont Sortng Hanbng L, Yan Wang*, Lan Huang, Mngda L, Yng Sun, Hanyuan

More information

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK L-qng Qu, Yong-quan Lang 2, Jng-Chen 3, 2 College of Informaton Scence and Technology, Shandong Unversty of Scence and Technology,

More information

Fuzzy C-Means Initialized by Fixed Threshold Clustering for Improving Image Retrieval

Fuzzy C-Means Initialized by Fixed Threshold Clustering for Improving Image Retrieval Fuzzy -Means Intalzed by Fxed Threshold lusterng for Improvng Image Retreval NAWARA HANSIRI, SIRIPORN SUPRATID,HOM KIMPAN 3 Faculty of Informaton Technology Rangst Unversty Muang-Ake, Paholyotn Road, Patumtan,


Synthesizer 1.0: A Varying Coefficient Meta-Analytic Tool Employing Microsoft Excel 2007 (User's Guide, Z. Krizan)

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007 Syntheszer 1.0 A Varyng Coeffcent Meta Meta-Analytc nalytc Tool Employng Mcrosoft Excel 007.38.17.5 User s Gude Z. Krzan 009 Table of Contents 1. Introducton and Acknowledgments 3. Operatonal Functons


Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng


SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:


Subsets and the Boolean operations on sets: A ⊆ B & B ⊆ A ⟺ A = B (1)

Section 1.2: Subsets and the Boolean operations on sets. If every element of the set A is an element of the set B, we say that A is a subset of B, or that A is contained in B, or that B contains A, and we write A ⊆ B.


An Internal Clustering Validation Index for Boolean Data

Bulgarian Academy of Sciences, Cybernetics and Information Technologies, Volume 16, No 6. Special issue with a selection of extended papers from the 6th International Conference on Logistics, Informatics and Service Science


Kent State University CS 4/560 Design and Analysis of Algorithms. Dept. of Math & Computer Science. LECT-16: Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems


A Simple and Efficient Goal Programming Model for Computing of Fuzzy Linear Regression Parameters with Considering Outliers

A Simple and Efficient Goal Programming Model for Computing of Fuzzy Linear Regression Parameters with Considering Outliers 62626262621 Journal of Uncertan Systems Vol.5, No.1, pp.62-71, 211 Onlne at: www.us.org.u A Smple and Effcent Goal Programmng Model for Computng of Fuzzy Lnear Regresson Parameters wth Consderng Outlers


EXTENDED BIC CRITERION FOR MODEL SELECTION

EXTENDED BIC CRITERION FOR MODEL SELECTION IDIAP RESEARCH REPORT EXTEDED BIC CRITERIO FOR ODEL SELECTIO Itshak Lapdot Andrew orrs IDIAP-RR-0-4 Dalle olle Insttute for Perceptual Artfcal Intellgence P.O.Box 59 artgny Valas Swtzerland phone +4 7


BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-74 Volume 0 Issue BoTechnology 04 An Indan Journal FULL PAPER BTAIJ 0() 04 [684-689] Revew on Chna s sports ndustry fnancng market based on market -orented


Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty


From Comparing Clusterings to Combining Clusterings

From Comparing Clusterings to Combining Clusterings Proceedngs of the Twenty-Thrd AAAI Conference on Artfcal Intellgence (008 From Comparng Clusterngs to Combnng Clusterngs Zhwu Lu and Yuxn Peng and Janguo Xao Insttute of Computer Scence and Technology,


NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson


Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru


Multi-stable Perception. Necker Cube

Multi-stable Perception. Necker Cube Mult-stable Percepton Necker Cube Spnnng dancer lluson, Nobuuk Kaahara Fttng and Algnment Computer Vson Szelsk 6.1 James Has Acknowledgment: Man sldes from Derek Hoem, Lana Lazebnk, and Grauman&Lebe 2008


Hybrid Non-Blind Color Image Watermarking

Hybrid Non-Blind Color Image Watermarking Hybrd Non-Blnd Color Image Watermarkng Ms C.N.Sujatha 1, Dr. P. Satyanarayana 2 1 Assocate Professor, Dept. of ECE, SNIST, Yamnampet, Ghatkesar Hyderabad-501301, Telangana 2 Professor, Dept. of ECE, AITS,


12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero


Optimal Workload-based Weighted Wavelet Synopses

Optimal Workload-based Weighted Wavelet Synopses Optmal Workload-based Weghted Wavelet Synopses Yoss Matas School of Computer Scence Tel Avv Unversty Tel Avv 69978, Israel matas@tau.ac.l Danel Urel School of Computer Scence Tel Avv Unversty Tel Avv 69978,


A Semi-parametric Regression Model to Estimate Variability of NO 2

A Semi-parametric Regression Model to Estimate Variability of NO 2 Envronment and Polluton; Vol. 2, No. 1; 2013 ISSN 1927-0909 E-ISSN 1927-0917 Publshed by Canadan Center of Scence and Educaton A Sem-parametrc Regresson Model to Estmate Varablty of NO 2 Meczysław Szyszkowcz


Fitting: Deformable contours April 26 th, 2018

Fitting: Deformable contours April 26 th, 2018 4/6/08 Fttng: Deformable contours Aprl 6 th, 08 Yong Jae Lee UC Davs Recap so far: Groupng and Fttng Goal: move from array of pxel values (or flter outputs) to a collecton of regons, objects, and shapes.
