A Hierarchical Clustering and Validity Index for Mixed Data


Graduate Theses and Dissertations, Graduate College, 2012

A Hierarchical Clustering and Validity Index for Mixed Data
Rui Yang, Iowa State University

Part of the Industrial Engineering Commons

Recommended Citation: Yang, Rui, "A Hierarchical Clustering and Validity Index for Mixed Data" (2012). Graduate Theses and Dissertations.

This Dissertation is brought to you for free and open access by the Graduate College at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact digirep@iastate.edu.

A hierarchical clustering and validity index for mixed data

by

Rui Yang

A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Major: Industrial Engineering

Program of Study Committee:
Sigurdur Olafsson, Major Professor
Dianne Cook
Heike Hofmann
John Jackman
Jo Min

Iowa State University
Ames, Iowa
2012

Copyright Rui Yang. All rights reserved.

ACKNOWLEDGEMENTS

I would like to thank everyone who supported and encouraged me while I completed my degree. Looking back, I could not have asked for a better person to be my major advisor than Dr. Sigurdur Olafsson. He allowed me the space to think creatively, but he was always there when I needed help or guidance. I would also like to thank my committee members, Dr. Jo Min, Dr. John Jackman, Dr. Dianne Cook, and Dr. Heike Hofmann, for their concern, advice, and encouragement in improving the quality of this thesis in every possible way. Most importantly, I thank my husband, Jan Fan, for providing unwavering support for my pursuit of this dream, and my best friend, Hu Ban, who has always been my biggest emotional support. Finally, I would like to express my greatest appreciation to my family and friends.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1 INTRODUCTION
  Motivation
  Objective
  Overview
  Summary

CHAPTER 2 REVIEW OF LITERATURE
  Cluster Analysis
  Numeric Clustering
  Categorical Clustering
  Mixed Clustering
  Cluster Validation
  External Indices
  Internal Indices
  Summary and Discussion

CHAPTER 3 HIERARCHICAL CLUSTERING FOR MIXED DATA
  Motivation
  Background
  Notation
  k-prototype Distance
  Optimal Weight Distance
  Goodall Distance
  Co-occurrence Distance
  Proposed Hierarchical Clustering Method
  Agglomerative Hierarchical Algorithm
  Evaluation Methodology
  Experiment
  Synthetic Datasets
  Real-world Datasets
  Properties of Proposed Clustering Method
  Iris Dataset
  Vote Dataset
  Heart Disease Dataset
  Australian Credit Dataset
  DNA-nominal Dataset
  Summary

CHAPTER 4 BK INDEX FOR MIXED DATA
  Motivation
  Background
  Calinski-Harabasz Index
  Dunn Index
  Silhouette Index
  Proposed Entropy-based Validity
  Notation
  BK Index
  Proposed Algorithm
  Experiment
  Synthetic Datasets
  Real-world Datasets
  Preprocessed Real Datasets
  Summary

CHAPTER 5 CONCLUSION

BIBLIOGRAPHY

LIST OF TABLES

Table 1: The base dataset ds1, three well-separated clusters
Table 2: ds2, co-occurrence with 20% noise
Table 3: ds5, stronger co-occurrence with 20% noise
Table 4: ds26, clusters only dependent on real attributes
Table 5: ds27, clusters only dependent on categorical attributes
Table 6: ds29, clusters only dependent on real attributes and Cat
Table 7: Accuracy of synthetic datasets with co-occurrence
Table 8: Accuracy of synthetic datasets when adding non-Gaussian noise
Table 9: Accuracy of datasets with co-occurrence & nominal non-Gaussian noise
Table 10: Accuracy of datasets with co-occurrence & real non-Gaussian noise
Table 11: Accuracy of synthetic datasets when relaxing some attributes
Table 12: Six real datasets from UCI
Table 13: Accuracy of real datasets with four distances
Table 14: Comparative study on Heart Disease dataset
Table 15: Results of real datasets with the co-occurrence distance
Table 16: Accuracy of preprocessed Iris dataset compared to original
Table 17: Accuracy of preprocessed Vote dataset compared to original
Table 18: Accuracy of preprocessed Heart Disease dataset compared to original
Table 19: Accuracy of preprocessed Australian Credit dataset
Table 20: Accuracy of preprocessed DNA-nominal dataset compared to original
Table 21: Estimated numbers of clusters by four validity indices
Table 22: Estimated numbers of clusters by four validity indices (continued)
Table 23: Estimated numbers of clusters by four validity indices for real datasets
Table 24: Estimated numbers of clusters by four validity indices for Iris
Table 25: Estimated numbers of clusters by four validity indices for Vote
Table 26: Estimated numbers by four validity indices for Heart Disease
Table 27: Estimated numbers by four validity indices for Australian Credit
Table 28: Estimated numbers of clusters by four validity indices for DNA

LIST OF FIGURES

Figure 1: The proposed clustering framework
Figure 2: The scatter plots of Iris. (Left) SL vs. SW. (Right) PL vs. PW
Figure 3: Plots of four indices on base dataset ds
Figure 4: Plots of four indices on very noisy dataset ds
Figure 5: B(k) for six real-world datasets
Figure 6: Plots of four indices on Heart Disease
Figure 7: Plots of four indices on Iris
Figure 8: Plots of four indices on Iris-Disc
Figure 9: Plots of four indices on Vote
Figure 10: Plots of four indices on Australian Credit
Figure 11: Plots of four indices on DNA
Figure 12: Plots of four indices on Iris
Figure 13: Plots of four indices on Vote
Figure 14: Plots of four indices on Heart
Figure 15: Plots of four indices on DNA

ABSTRACT

This study develops novel approaches to partition mixed data into natural groups, that is, to cluster datasets containing both numeric and nominal attributes. Such data arise in many diverse applications. Our approach addresses two important issues in clustering mixed datasets. One is how to find the optimal number of clusters, which is important because this number is unknown in many applications. The other is how to group the objects naturally according to a suitable similarity measurement. These problems are especially difficult for mixed datasets because they involve determining how to unify the two different representation schemes for numeric and nominal data.

To address the issue of constructing clusters, that is, of naturally grouping objects, we compare the performance of four distances capable of dealing with mixed datasets when incorporated into a classical agglomerative hierarchical clustering approach. Based on these results, we conclude that the so-called co-occurrence distance performs well as a dissimilarity measure: it obtains good clustering results with reasonable computation, thus balancing effectiveness and efficiency.

The second important contribution of this research is to define an entropy-based validity index to validate the sequence of partitions generated by the hierarchical clustering with the co-occurrence distance. A cluster validity index called the BK index is modified for mixed data and used in conjunction with the proposed clustering algorithm. This index is compared to three well-known indices, namely, the Calinski-Harabasz index (CH), the Dunn index (DU), and the Silhouette index (SI). The results show that the modified BK index outperforms the three other indices in its ability to identify the true number of clusters.

Finally, the study also identifies the limitations of hierarchical clustering with a co-occurrence distance and provides some remedies that improve not only the clustering accuracy but especially the ability to correctly identify the best number of classes in mixed datasets.

CHAPTER 1 INTRODUCTION

1.1 Motivation

Clustering is one of the fundamental techniques in data mining. The primary objective of clustering is to partition a set of objects into homogeneous groups (Jain and Dubes, 1988). An effective clustering algorithm needs a suitable measure of similarity or dissimilarity, so that a partition structure can be identified in the form of natural groups, where objects that are similar tend to fall into the same group and objects that are relatively distinct tend to separate into different groups. Clustering has been extensively applied in diverse fields, including healthcare systems (Mateo et al., 2008), customer relations management (Jing et al., 2007), manufacturing systems (Sukk et al., 2006), biotechnology (Kim et al., 2009), finance (Lao et al., 2008), and geographical information systems (Touray et al., 2010).

Many algorithms that form clusters in numeric domains have been proposed. The majority exploit inherent geometry or density. These include classical k-means (Kaufman and Rousseeuw, 1990; Jing et al., 2007) and agglomerative hierarchical clustering (Day and Edelsbrunner, 1984; Yasunori et al., 2007). More recently, several studies have tackled the problem of clustering and extracting from categorical data, e.g., batch self-organizing maps (Chen and Marques, 2005), the matrix partitioning method (Jiau et al., 2006), k-distributions (Cai et al., 2007), and fuzzy c-means (Brouwer and Groenwold, 2010). However, while the majority of useful data is described by a combination of mixed features (Li and Biswas, 2002), traditional clustering algorithms are designed primarily for one data type. The literature on clustering mixed data is still relatively sparse (Hsu et al., 2007; Ahmad and Dey, 2007; Lee and Pedrycz, 2009), and more work is needed in this area.

The main obstacle to clustering mixed data is determining how to unify the distance representation schemes for numeric and categorical data.
Numeric clustering adopts distance metrics, while symbolic clustering uses a counting scheme to calculate conditional probability estimates as a means of defining the relation between groups. Pragmatic methods that convert one type of attribute to the other and then apply traditional single-type clustering algorithms may lead to significant loss of information. If categorical data with a large domain is converted to numeric data by binary encoding, more space and time are required. Moreover, if quantitative and binary attributes are included in the same index, these procedures will generally give the latter excessive weight (Goodall, 1966).

Apart from the need for a suitable distance measure for mixed data, another critical issue is how to evaluate clustering structures objectively and quantitatively, that is, without using domain knowledge and expert experience. The need to estimate the number of clusters in continuous data has led to the development of a large number of what are usually called validity criteria (Halkidi et al., 2002; Kim and Ramakrishna, 2005), but there are few criteria for evaluating partitions produced by categorical clustering (Celeux and Govaert, 1991; Chen and Liu, 2009). To the best of our knowledge, there is no literature that satisfactorily addresses the cluster validation problems related to data with both discrete and real features.

The limitations of existing clustering methodologies and criterion functions in dealing with mixed data motivate us to develop clustering algorithms that can better handle both numeric and categorical attributes.

1.2 Objective

As stated above, this dissertation addresses the question of how to partition mixed data into natural groups efficiently and effectively. Later, the proposed approach will be applied to identify the optimal classification scheme among those partitions. The extension of clustering to a more general setting requires significant changes in algorithmic techniques in several fundamental respects. To meet the objectives stated above, the following three research tasks will therefore be addressed:

(1) We will develop an agglomerative hierarchical clustering method for clustering mixed datasets and investigate the performance of various distance measurements that represent both data types. As mentioned above, the traditional way of converting data into a single type has many disadvantages.
Within the context of an agglomerative hierarchical clustering method, we will investigate quantitative measures of similarity among objects that preserve not only the structure of categorical attributes but also the relative distance of numeric values. Specifically, the measurements will be the co-occurrence distance (Ahmad and Dey, 2007), the k-prototype distance (Huang, 1998), the optimal weight distance (Modha and Spangler, 2003), and the Goodall distance (Goodall, 1966). In the literature, the first three distances have been applied in the k-means family, and the last is used in an agglomerative hierarchical clustering called the SBAC method (Li and Biswas, 2002). Our goal is to choose the distance measure that not only produces a low error rate in partitioning but is also suitable for searching for the proper number of clusters.

(2) We will investigate how to determine the optimal number of classes in mixed datasets. Evaluation of clustering structures can provide crucial insights into whether the derived clustering partition is meaningful or not. For numeric clustering, the number of clusters can be validated through geometric shape or density distribution, while cluster entropy and category utility are frequently used for categorical clustering. We will investigate how to extend the extant validity indices and make them capable of handling both data types. Specifically, we will investigate two approaches, since they can integrate current validation methods smoothly. First, if a quantitative distance can represent numeric and categorical dissimilarities in a compatible way, then this geometric-like distance may be exploited in traditional numeric validation methods like the Calinski-Harabasz index, in which the optimal number of clusters is determined by minimizing the intra-cluster distance while maximizing the inter-cluster distance. Second, it is also possible to calculate the entropies of cluster structures over both components. A low entropy is desirable since it indicates an ordered structure. We claim that the increments of expected entropies of optimal cluster structures over a series of successive cluster numbers indicate the optimal number of clusters.

(3) We will explore the properties of the proposed algorithm. There is no cure-all algorithm for clustering problems. It is important to understand which datasets are best suited to analysis by the new algorithm.
Some test datasets with various characteristics will be generated to investigate specific properties. These properties can then guide data preprocessing operations on real-world datasets, such as feature selection.

1.3 Overview

The outcome of the research is a framework that combines hierarchical clustering integrated with a co-occurrence distance and an extension of the BK index to search for the best number of classes. Figure 1 shows this procedure. As illustrated, it contains six main steps to recover the underlying structure of mixed datasets under the assumption that the number of classes is unknown.

Figure 1: The proposed clustering framework.

The framework works as follows:
(1) Calculate the significance of each attribute; the numeric part will be used as weights in the co-occurrence distance.
(2) Conduct a feature selection according to the order of attribute significance and the properties of the proposed algorithm. This generates a reduced dataset that is better suited to the algorithm and yields better results. This step is optional, exploratory, and iterative.
(3) Calculate the co-occurrence distance for the dataset of interest or the reduced mixed dataset.
(4) Construct a dendrogram by a hierarchical algorithm.
(5) Determine the optimal number of classes by using the BK index.
(6) Cut the tree according to this number and report the results.

1.4 Summary

All of the research tasks are either completely or partially new to the literature. The extended clustering applicable to arbitrary collections of datasets and the validity index for mixed datasets are particularly novel and make significant contributions. The contributions of this study can be summarized as follows.
(1) Compare four distances capable of handling mixed datasets when used with hierarchical clustering.
(2) Identify limitations of hierarchical clustering with a co-occurrence distance and propose solutions.
(3) Define a validity index to search for the optimal number of clusters.

The remainder of the dissertation is organized as follows. In Chapter 2, we survey the related literature. In Chapter 3, we compare the four distances when used in an agglomerative clustering algorithm and then choose the co-occurrence distance. The limitations of the proposed algorithm are identified, and the corresponding solutions are provided. In Chapter 4, a validity index is integrated with the proposed algorithm to estimate the number of clusters for mixed data with numeric and categorical features. We conclude and suggest future studies in Chapter 5.

CHAPTER 2 REVIEW OF LITERATURE

We briefly review the literature on cluster analysis and cluster validation. The first section provides a basic understanding of clustering methods, with emphasis on categorical clustering and mixed clustering. Then we introduce several well-known validity indices for determining the optimal number of clusters.

2.1 Cluster Analysis

Cluster analysis was first proposed in numeric domains, where a distance is clearly defined. It was then extended to categorical data. However, much of the data in the real world contains a mixture of categorical and continuous features. As a result, the demand for cluster analysis of mixed data is increasing.

2.1.1 Numeric Clustering

Clustering partitions the data into groups where objects that are similar tend to fall into the same groups and objects that are relatively distinct tend to separate into different groups. Traditional clustering methodologies handle datasets with numeric attributes. The proximity measure can be defined by geometrical distance.

A set of data with $n$ objects $(o_1, \ldots, o_n)$ is divided into $k$ disjoint clusters $(C_1, \ldots, C_k)$, called a partition $P(k)$. Here $n$ is the number of objects in the dataset and $n_i$ the number of objects in the $i$th cluster. $D(C_i, C_j)$ is the distance between the $i$th cluster and the $j$th cluster, and $d(o_i, o_j)$ is the distance between the $i$th object and the $j$th object. The centroid of the $i$th cluster is defined as $z_i = \frac{1}{n_i} \sum_{o \in C_i} o$, and $Z = \{z_1, \ldots, z_k\}$ is a set of $k$ center locations. $D(Z, o_i)$ is the shortest distance between object $o_i$ and its nearest center.

Clustering constructs a flat (non-hierarchical) or hierarchical partitioning of the objects. Hierarchical algorithms use the distance matrix as input and create a sequence of nested partitions, either from singleton clusters to a cluster including all individuals or vice versa. Some details on agglomerative methods are provided here, since divisive clustering is not commonly used in practice. To begin, the $n$ objects form $n$ singleton clusters. The two clusters with the minimal distance are merged. The distances between the newly generated cluster and the others are updated according to some linkage method. The search for the two clusters with the minimal distance and the merging process continue until all objects are in the same cluster. The commonly used linkage methods are listed as follows, along with the definitions of inter-cluster distances and update rules.

(1) The single linkage defines the cluster distance as the smallest distance between a pair of objects in two different clusters. It is known as the nearest neighbor method, which tends to cause a chaining effect.

\[ D(C_i, C_j) = \min_{o_p \in C_i,\; o_q \in C_j} d(o_p, o_q) \tag{2.1} \]
\[ D(C_k, (C_i, C_j)) = \min\big( D(C_k, C_i),\, D(C_k, C_j) \big) \tag{2.2} \]

(2) The complete linkage takes the distance between the furthest pair of objects in two different clusters as the cluster distance.

\[ D(C_i, C_j) = \max_{o_p \in C_i,\; o_q \in C_j} d(o_p, o_q) \tag{2.3} \]
\[ D(C_k, (C_i, C_j)) = \max\big( D(C_k, C_i),\, D(C_k, C_j) \big) \tag{2.4} \]

(3) The average linkage uses the average distance between all pairs of objects in cluster $C_i$ and cluster $C_j$.

\[ D(C_i, C_j) = \frac{1}{n_i n_j} \sum_{o_p \in C_i,\; o_q \in C_j} d(o_p, o_q) \tag{2.5} \]
\[ D(C_k, (C_i, C_j)) = \frac{n_i}{n_i + n_j} D(C_k, C_i) + \frac{n_j}{n_i + n_j} D(C_k, C_j) \tag{2.6} \]

(4) The centroid linkage takes the distance between the centroids of two clusters as the cluster distance.

\[ D(C_i, C_j) = d(z_i, z_j) \tag{2.7} \]
\[ D(C_k, (C_i, C_j)) = \frac{n_i}{n_i + n_j} D(C_k, C_i) + \frac{n_j}{n_i + n_j} D(C_k, C_j) - \frac{n_i n_j}{(n_i + n_j)^2} D(C_i, C_j) \tag{2.8} \]

(5) Ward's linkage is called the minimum variance method, since it uses the increment of the within-class sum of squared errors when joining clusters $C_i$ and $C_j$ as the cluster distance between cluster $C_i$ and cluster $C_j$.

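\[ D(C_i, C_j) = \frac{n_i n_j}{n_i + n_j}\, d^2(z_i, z_j) \tag{2.9} \]
\[ D(C_k, (C_i, C_j)) = \frac{n_i + n_k}{n_i + n_j + n_k} D(C_k, C_i) + \frac{n_j + n_k}{n_i + n_j + n_k} D(C_k, C_j) - \frac{n_k}{n_i + n_j + n_k} D(C_i, C_j) \tag{2.10} \]

Non-hierarchical partition clustering employs an iterative approach to group data into a pre-specified number of clusters by minimizing a sum of weighted within-cluster distances between every object and its cluster center:

\[ \text{minimize} \quad f(Z) = \sum_{i=1}^{n} w_i\, D(Z, o_i) \tag{2.11} \]

The weight, $w_i > 0$, modifies the distances. If we treat the weighted distance $w_i D(Z, o_i)$ as a cost, we can formulate the problem as a standard discrete optimization problem. Let $x_{ij}$ be a decision variable with

\[ x_{ij} = \begin{cases} 1 & \text{if object } i \text{ is assigned to the } j\text{th cluster} \\ 0 & \text{otherwise.} \end{cases} \]

To ensure that every object is assigned to exactly one cluster, we require $\sum_{j=1}^{k} x_{ij} = 1$. The notation is interpreted as follows: $i$ is the index of objects; $j$ is the index of clusters; $d_{ij}$ is the distance between object $i$ and the center of cluster $j$.

\[ \text{minimize} \quad f(X) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_i\, d_{ij}\, x_{ij} \tag{2.12} \]
\[ \text{subject to} \quad \sum_{j=1}^{k} x_{ij} = 1 \;\; \forall i, \qquad x_{ij} \in \{0, 1\} \;\; \forall i, j. \]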
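The agglomerative procedure and the update rules in Eqs. (2.2), (2.4), and (2.6) can be illustrated with a short sketch. This is a naive, unoptimized illustration and not the implementation used in this study; the function name `agglomerate`, the dictionary-based bookkeeping, and the toy data are assumptions made only for the example.

```python
from itertools import combinations


def agglomerate(dist, linkage="average"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    dist[i][j] holds d(o_i, o_j). Returns the merge history as a list of
    (members_of_first, members_of_second, merge_distance) tuples.
    """
    n = len(dist)
    # Active clusters: cluster id -> list of member object indices.
    members = {i: [i] for i in range(n)}
    # Inter-cluster distances keyed by an unordered pair of cluster ids.
    D = {frozenset((i, j)): dist[i][j] for i, j in combinations(range(n), 2)}
    history = []
    next_id = n
    while len(members) > 1:
        # Pick the pair of active clusters with the minimal distance.
        pair = min(D, key=D.get)
        i, j = sorted(pair)
        d_ij = D.pop(pair)
        ni, nj = len(members[i]), len(members[j])
        history.append((sorted(members[i]), sorted(members[j]), d_ij))
        # Update the distance from every other cluster k to the merged cluster.
        for k in members:
            if k in (i, j):
                continue
            d_ki = D.pop(frozenset((k, i)))
            d_kj = D.pop(frozenset((k, j)))
            if linkage == "single":        # Eq. (2.2)
                new = min(d_ki, d_kj)
            elif linkage == "complete":    # Eq. (2.4)
                new = max(d_ki, d_kj)
            elif linkage == "average":     # Eq. (2.6)
                new = (ni * d_ki + nj * d_kj) / (ni + nj)
            else:
                raise ValueError(f"unknown linkage: {linkage}")
            D[frozenset((k, next_id))] = new
        members[next_id] = members.pop(i) + members.pop(j)
        next_id += 1
    return history


# Toy example: five points on a line, with absolute difference as distance.
line = [0.0, 1.0, 5.0, 6.0, 20.0]
dist = [[abs(a - b) for b in line] for a in line]
for left, right, d in agglomerate(dist, "average"):
    print(left, right, d)
```

On this toy input the three linkages agree on the first two merges and differ only in the distances at which the larger clusters are joined, which is the behavior the update rules are meant to capture. Because the function takes any symmetric dissimilarity matrix, a distance for mixed data could be supplied in place of the numeric one used here.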

By selecting a vector of cluster centers from the set of feasible alternatives defined by the constraints, the model achieves the minimum total cost, namely, the minimum total weighted within-group distance over all groups. Unfortunately, this problem is NP-hard even for k = 2 (Drineas et al., 2004). It is impossible to find exact solutions in polynomial time unless P = NP. However, there are some efficient approximate approaches, such as k-means algorithms.

2.1.2 Categorical Clustering

For categorical data, which has no order relationships, conceptual clustering algorithms based on hierarchical clustering were proposed. These algorithms use conditional probability estimates to define relations between groups. Intra-class similarity is the probability $\Pr(a_i = v_{ij} \mid C_k)$ and inter-class dissimilarity is the probability $\Pr(C_k \mid a_i = v_{ij})$, where $a_i = v_{ij}$ is an attribute-value pair indicating that the $i$th attribute takes its $j$th possible value. Category Utility (CU) is a heuristic evaluation measure (Fisher, 1987) used to guide construction of the tree in systems such as COBWEB (Huang and Ng, 1999) and its derivatives, e.g., COBWEB/3 (McKusick and Thomson, 1990) and ITERATE (Biswas et al., 1998). CU attempts to maximize both the probability that two objects in the same cluster have attribute values in common and the probability that objects from different clusters have different values.

ROCK (Guha et al., 1999) is a clustering algorithm that works for both boolean and categorical attributes. This algorithm employs the concept of links to measure the similarity between a pair of data points. The number of links between a pair of points is the number of common neighbors shared by the points. Clusters are merged through hierarchical clustering, which checks the links while merging clusters. The main objective of the algorithm is to group together objects that have more links.

CACTUS (Ganti et al., 1999) is a hierarchical algorithm that groups categorical data by looking at the support of two attribute values. Support is the frequency with which two values appear together in objects. The higher the support, the more similar the two attribute values are. Two attribute values are strongly connected if their support exceeds the value expected under the assumption of attribute independence. This concept is extended to a set of attribute values that are pairwise strongly connected. Finding the co-occurrence of a set of attribute values is computationally intensive.
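The notion of support and strong connection described above can be made concrete with a small sketch. It is an illustration only, written under two labeled assumptions: the expected support of a value pair is taken as n / (|D_a| |D_b|), which corresponds to independent attributes whose values are equally likely, and a pair counts as strongly connected when its support exceeds a hypothetical threshold factor `alpha` times that expectation. The function name and data layout are likewise assumptions for the example.

```python
from collections import Counter
from itertools import combinations


def strongly_connected_pairs(rows, alpha=2.0):
    """Find value pairs from different attributes with unusually high support.

    rows: list of equal-length tuples of categorical values.
    Returns {(attr_a, value_a, attr_b, value_b): support} for every pair whose
    support exceeds alpha times the support expected under independence.
    """
    n = len(rows)
    m = len(rows[0])
    # Observed domain of each attribute (assumption: unseen values are ignored).
    domains = [set(r[a] for r in rows) for a in range(m)]
    # Support: number of objects in which the two values appear together.
    support = Counter()
    for r in rows:
        for a, b in combinations(range(m), 2):
            support[(a, r[a], b, r[b])] += 1
    strong = {}
    for (a, va, b, vb), s in support.items():
        # Expected support if attributes are independent and values uniform.
        expected = n / (len(domains[a]) * len(domains[b]))
        if s > alpha * expected:
            strong[(a, va, b, vb)] = s
    return strong


rows = [
    ("red", "round"), ("red", "round"), ("red", "round"),
    ("blue", "square"), ("blue", "square"),
    ("green", "round"), ("green", "square"), ("blue", "round"),
]
print(strongly_connected_pairs(rows))
```

With 8 objects, 3 colors, and 2 shapes, the expected support is 8 / 6, so with `alpha = 2.0` only the pair (red, round), which occurs 3 times, passes the threshold. The nested loop over attribute pairs also shows why the cost grows quickly as the number of attributes and values increases.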

2.1.3 Mixed Clustering

Most clustering algorithms are designed for either categorical data or numeric data. However, in the real world, a majority of datasets are described by a combination of continuous and categorical features. A general method is to transform one data type to another. In most cases, nominal attributes are encoded by simple matching or binary mapping, and then clustering is performed on the newly computed numeric proximity. Binary encoding transforms each categorical attribute to a set of binary attributes and then encodes a categorical value as this set of binary values. Simple matching generates a distance measurement that yields a difference of zero when comparing two identical categorical values and a difference of one when comparing two distinct values. However, these coding methods have the disadvantages of (1) losing information derivable from the ordering of different values, (2) losing the structure of categorical values with different levels of similarity, (3) requiring more space and time when the domain of the categorical attribute is large, (4) ignoring the context of a pair of values, e.g., the co-occurrence with other attributes, and (5) giving different weights to the attributes according to the number of different values they may take. Moreover, if quantitative and binary attributes are included in the same index, these procedures will generally give the latter excessive weight (Goodall, 1966).

An alternative approach is to discretize numeric values and then apply symbolic clustering algorithms. The discretization process often loses important information, especially the relative difference between two values of a numeric feature. In addition, it causes a boundary problem: two close values near a discretization boundary may be assigned to two different ranges. Another difficult problem is to estimate the optimal intervals during discretization.

Huang (1998) extended k-modes to mixed datasets and developed the k-prototype algorithm. The distances of the two types of features are calculated separately.
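A minimal sketch of a two-part distance of this kind is given below, assuming squared Euclidean distance on the numeric attributes, simple matching on the categorical attributes, and a weight `gamma` that balances the two parts. The function name, the argument layout, and the default value of `gamma` are illustrative assumptions rather than Huang's notation.

```python
def mixed_distance(x_num, x_cat, y_num, y_cat, gamma=1.0):
    """Two-part dissimilarity between objects x and y with mixed attributes.

    x_num, y_num: sequences of numeric attribute values.
    x_cat, y_cat: sequences of categorical attribute values.
    gamma: assumed weight of the categorical part relative to the numeric part.
    """
    # Numeric part: squared Euclidean distance.
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    # Categorical part: simple matching, 0 for equal values and 1 otherwise.
    mismatches = sum(1 for a, b in zip(x_cat, y_cat) if a != b)
    return numeric + gamma * mismatches


# Numeric part: (1 - 2)^2 + (2 - 4)^2 = 5; one categorical mismatch.
print(mixed_distance([1.0, 2.0], ["a", "x"], [2.0, 4.0], ["a", "y"], gamma=0.5))
```

The sketch makes the role of the weight visible: a large `gamma` lets the categorical mismatches dominate, while a small one lets the numeric scale dominate, so in practice the numeric attributes would normally be standardized before such a distance is used.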
The numerical distances are measured by Euclidean distance, while the categorical distances are measured by simple matching. The centers of categorical attributes are defined as the modes in the cluster.

Ahmad and Dey (2007) proposed a fuzzy prototype k-means algorithm. Similar to k-prototype, the cost function is made up of two components. The difference is that the categorical distances are measured by the co-occurrence of two attributes, and the categorical cluster centers are the lists of values in every attribute with their frequencies in the cluster.

Modha and Spangler (2003) used k-means to cluster mixed datasets, but they carefully chose the weights for different features by minimizing the ratio of the within-cluster scatter to the between-cluster scatter of the distorted distance.

ECOWEB (Reich and Fenves, 1991) defines the Category Utility measure for numeric attributes by approximating the probability over some user-described interval, the choice of which greatly affects performance.

AUTOCLASS (Cheeseman and Stutz, 1995) assumes a classical finite mixture distribution model on the data and uses a Bayesian method to maximize the posterior probability of the clustering partition model given the data. The number of classes in the data is pre-specified. The computational cost is extremely high.

SBAC (Li and Biswas, 2002) is a hierarchical clustering of mixed data based on the Goodall similarity measurement with the assumption of attribute independence. The distance exploits the property that a pair of objects is closer than other pairs if it shares an uncommon feature. This algorithm is computationally prohibitive and demands a large amount of memory.

Hsu and his colleagues (2007) exploited the semantic properties in the domain of categorical attributes and represented each attribute with a tree structure whose leaves are the possible values of the attribute and whose links are associated with user-specified weights. This hierarchical distance scheme is integrated with agglomerative hierarchical clustering and compared to binary coding and simple matching.

Some fuzzy clustering algorithms have recently been proposed to handle datasets with mixed features. Unlike hard clustering, where each object belongs to only one cluster, fuzzy clustering algorithms assign each object to all of the clusters with a certain degree of membership. Yang et al. (2004) investigated the symbolic dissimilarity originally proposed by Gowda and Diday (1991) and modified the three component parts of the dissimilarity measure. This fuzzy clustering can handle both categorical data and fuzzy data. GFCM (Lee and Pedrycz, 2009) used a fuzzy center instead of a singleton prototype for the categorical components, which takes a list of partial values of a categorical attribute with their frequencies in the cluster. The number of values in the prototype is an input parameter. Besides searching for the optimal membership matrix and prototype matrix, the algorithm has to choose a set of values from a categorical attribute domain and present them in the prototype. The size of the labels and the fuzzification coefficients affect the performance.

2.2 Cluster Validation

Clustering algorithms expose the inherent partitions in the underlying data, while cluster validation methods evaluate the resulting clusters quantitatively and objectively, e.g., whether the cluster structure is meaningful or just an artifact of the clustering algorithm. There are two main categories of testing criteria, known as external indices and internal indices. External indices are distinguished from internal indices by the presence of prior information about known categories.

2.2.1 External Indices

Given a known prior cluster structure ($P$) of the data, external indices evaluate a clustering structure resulting from a clustering algorithm ($P'$) by counting the pairs of points on which the two partitions agree and disagree. A pair of points can fall into one of the four cases below:

a: number of point pairs in the same cluster in both $P$ and $P'$
b: number of point pairs in the same cluster in $P$ but not in $P'$
c: number of point pairs in the same cluster in $P'$ but not in $P$
d: number of point pairs in different clusters under both $P$ and $P'$

Wallace (1983) proposed the two asymmetric criteria $W_1$ and $W_2$ as

\[ W_1(P, P') = \frac{a}{a + b} \tag{2.13} \]

and

\[ W_2(P, P') = \frac{a}{a + c}, \tag{2.14} \]

representing the probability that a pair of points which are in the same cluster in $P$ (respectively, $P'$) are also in the same cluster under the other clustering. Fowlkes and Mallows (1983) took the geometric mean of the asymmetric Wallace indices and introduced a symmetric criterion:

21 13 a a FM (P, P ) =. a + c a + b The Fowlkes-Mallows ndex assumes the two parttons are ndependent. (2.15) The Rand ndex emphaszes the probablty that a par of ponts n the same group or n dfferent groups n both parttons whle Jaccard s coeffcent does not take an account nto conont absence and only measures the porton of pars n the same cluster. The Rand ndex (Rand, 1971) a + d R (P, P ) = a + b + c + d The Jaccard ndex (Jan and Dubes, 1988) J (P, P ) = Internal Indces a a + b + c (2.16) (2.17) Internal ndces are valdaton measures whch evaluate clusterng results usng only nformaton ntrnsc to the underlyng data. Wthout true cluster labels, estmatng the number of clusters, k, n a gven dataset s a central task n cluster valdaton. Overestmaton of k complcates the true clusterng structure, and makes t dffcult to nterpret and analyze the results; on the other hand, underestmaton causes the loss of nformaton and msleads the fnal decson. In the followng secton, we wll brefly revew several well-known ndces. One of the oldest and most cted ndces s proposed by Dunn (Dunn, 1974) to dentfy the clusters that are compact and well separated by maxmzng the nter-cluster dstance whle mnmzng the ntra-cluster dstance. The Dunn ndex for k clusters s defned as D( C, C ) k* = arg max DU ( k) = mn mn, k 2 = 1,, k = + 1,, k max dam( Cm) m= 1,, k where D(C, C ) s the dstance between two clusters C and C as the mnmum dstance (2.18) between a par of obects n the two dfferent clusters separately and the dameter of cluster C m, dam(c m ), as the maxmum dstance between two obects n the cluster. The optmal

22 14 number of clusters s calculated at the largest value of the Dunn ndex. The Dunn ndex s senstve to nose. By redefntons of the cluster dameter and the cluster dstance, a famly of cluster valdaton ndces s proposed (Bezdek and Pal, 1998). Based on the rato of the between-cluster scatter matrx (S B ) and the wthn-cluster scatter matrx (S W ), the Calnsk-Harabasz ndex (Calnsk and Harabasz, 1974) s the best among the top 30 ndces ranked by Mllgan and Cooper (1985). The optmal number of clusters s determned by maxmzng CH(k). Tr( S ) / ( k 1) B k* = arg max CH ( k ) =. (2.19) k 2 Tr( SW ) / ( n k) Smlar to the Calnsk-Harabasz ndex, the Daves-Bouldn ndex (Daves and Bouldn, 1979) obtans clusters wth the mnmum ntra-cluster dstance as well as the maxmum dstance between cluster centrods. The mnmum value of the ndex ndcates a sutable partton for the dataset. k 1 dam( C ) + dam( C ) k* = arg mn DB( k) = max (2.20) k 2 k = 1,, k, = 1 d( z, z ) where the dameter of a cluster s defned as dam C 1 = (2.21) 2 ( ) d( o, z ). n o C The Slhouette ndex (Kaufman and Rousseeuw, 1990) computes for each obect a wdth dependng on ts membershp n any cluster. For the th obect, let a be the average dstance to other obects n ts cluster and b the mnmum of the average dssmlartes between obect and other obects n other clusters. The slhouette wdth s defned as (b - a )/max{a, b }. Slhouette ndex s the average Slhouette wdth of all the data ponts. The partton wth the hghest SI(k) s taken to be optmal. n 1 ( b a ) k* = arg max SI ( k ) = (2.22) k 2 n = 1 max( b, a ) The Geometrc ndex (Lam and Yan, 2005) s recently proposed to accommodate data wth clusters of dfferent denstes and overlap clusters. The optmal number of clusters s found by mnmzng the GE(k) ndex. Let d be the dmensonalty of the data and λ pq the

eigenvalue of the covariance matrix from the data. $D(C_i, C_j)$ is the inter-cluster distance between cluster $i$ and cluster $j$. The GE index is constructed as

$$k^* = \arg\min_{k \ge 2} GE(k) = \max_{1 \le q \le k} \left\{ \frac{\left( 2 \sum_{p=1}^{d} \sqrt{\lambda_{pq}} \right)^2}{\min_{1 \le j \le k,\ j \ne q} D(C_q, C_j)} \right\}. \tag{2.23}$$

Unlike the criteria mentioned above, which employ geometric-like distances, CU and entropy-based methods use a counting scheme to evaluate the performance of a categorical clustering algorithm. The CU of a partition with k clusters is defined in Eq. (2.24). A cluster solution with high CU is desired since it improves the likelihood of similar patterns falling into the same cluster.

$$k^* = \arg\max_{k \ge 2} CU(k) = \sum_{l=1}^{k} \frac{n_l}{n} \sum_{i} \sum_{j} \left[ \Pr(a_i = v_{ij} \mid C_l)^2 - \Pr(a_i = v_{ij})^2 \right] \tag{2.24}$$

The entropy-based method computes the expected entropy of a partition with respect to a class attribute $a$. The smaller the expected entropy, the better the quality of the partition with respect to $a$. The expected entropy is expected to decrease monotonically as the number of clusters increases, but from some point onwards the decrease flattens remarkably. Rather than searching for the location of an elbow on the plot of the expected entropy versus the number of clusters, Chen and Liu (2009) calculated the second-order difference of the incremental expected entropy of the partition structure, which is called the BK index. The largest value indicates an elbow point, which is the potential number of clusters.

$$k^* = \arg\min_{k \ge 2} E_a(k) = -\sum_{l=1}^{k} \frac{n_l}{n} \sum_{j} \Pr(a = v_j \mid C_l) \log \Pr(a = v_j \mid C_l) \tag{2.25}$$

The significance test on external variables is another commonly used method. It compares the partitions using variables not used in the generation of those clusters.

Although validity indices for mixed features are relatively sparse, a few exist to evaluate fuzzy clustering algorithms based on the fuzzy partition matrices and/or the dissimilarity among objects and prototypes. For example, Lee (2009) proposed an index called CPI(k) as

$$k^* = \arg\max_{k \ge 2} CPI(k) = \frac{1}{k} \sum_{l=1}^{k} \frac{\sum_{i,j} u_{li} u_{lj}\, sim(o_i, o_j)}{\sum_{i,j} u_{li} u_{lj}} - \frac{1}{k(k-1)} \sum_{s=1}^{k} \sum_{t=1,\ t \ne s}^{k} \frac{\sum_{i,j} u_{si} u_{tj}\, sim(o_i, o_j)}{\sum_{i,j} u_{si} u_{tj}}, \tag{2.26}$$

where $u_{li}$ is the membership of object $o_i$ in cluster $l$, $0 \le u_{li} \le 1$, and $sim(o_i, o_j)$ is any similarity measure between objects $o_i$ and $o_j$.

2.3 Summary and Discussion

Real-life systems are overwhelmed with large mixed datasets that include numeric and nominal data. However, the majority of clustering algorithms are designed for one data type. This study will propose a novel approach to partition mixed datasets, evaluate the resulting cluster solutions, and determine the optimal number of clusters.
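To make the internal indices reviewed in this chapter more concrete, the Dunn index of Eq. (2.18) can be sketched in Python. This is only an illustrative sketch and not part of the original study: it assumes numeric points compared with the Euclidean distance, and the function name `dunn_index` and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(points, labels):
    """Smallest inter-cluster distance divided by the largest cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # diam(C_m): maximum distance between two objects of the same cluster
    max_diameter = max(
        max((dist(p, q) for p, q in combinations(members, 2)), default=0.0)
        for members in clusters.values()
    )
    if max_diameter == 0.0:
        raise ValueError("every cluster has zero diameter; the index is undefined")

    # D(C_i, C_j): minimum distance between objects of two different clusters
    min_separation = min(
        dist(p, q)
        for first, second in combinations(clusters.values(), 2)
        for p in first
        for q in second
    )
    return min_separation / max_diameter


# Hypothetical data: two tight groups that lie far apart
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = dunn_index(points, [0, 0, 1, 1])  # groups match the geometry
bad = dunn_index(points, [0, 1, 0, 1])   # groups cut across the geometry
```

Because the largest diameter is the same for every pair of clusters, the nested minimum in Eq. (2.18) reduces to the smallest pairwise separation divided by that diameter. The partition that follows the geometry receives the larger value, which is the behavior the index is meant to reward.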

CHAPTER 3 HIERARCHICAL CLUSTERING FOR MIXED DATA

Many partitioning algorithms require the number of classes, k, as a user-specified parameter. However, k is not available in many applications. Hierarchical clustering does not need this prior information. This method creates a sequence of nested partitions. In order to form a set of clusters, a cutting point is determined by using expert experience to interpret each branch in the dendrogram, or by applying validity indices to estimate where the best levels are.

Our algorithm can be divided into two main procedures. First, a cluster tree is constructed by a hierarchical algorithm in a bottom-up manner; then a search procedure follows to obtain the optimal number of classes. In this chapter, we present the first procedure, which generates a cluster tree by hierarchical clustering on the distances from a co-occurrence measure. We compare four distance measures capable of handling mixed data when used with agglomerative hierarchical clustering, and identify the conditions under which the co-occurrence distance would outperform the other distances. The performance is tested on some standard real-life as well as synthetic datasets.

3.1 Motivation

A distance measure that has previously been found to perform well for the fuzzy prototype k-means algorithm (Ahmad and Dey, 2007) will be adopted to define the proximity between pairs of objects. Without any assumption about the data distribution, it considers the co-occurrence probability of two attribute values in a certain class. Intuitively, some attribute values are associated with different classes. For example, the color of a banana is yellow while the color of a strawberry is red. If a basket has only strawberries and bananas and, by chance, we pick a yellow fruit, then it must be a banana. Likewise, if we know the type of a fruit in this basket, then its color can be decided. Yellow and red have strong correlations with bananas and strawberries, respectively. The color of the fruit can distinguish each type of fruit. In this context, we can assume the distance between yellow and red is large with respect to the type of fruit. However, if we harvest yellow lemons and red tomatoes, and put them into this basket, then it is difficult to tell the

types of the fruits from their colors. In this situation, in terms of the type of fruit, the distance between yellow and red is small. Therefore, a distance can be defined based on the co-occurrence of colors with respect to fruit types. A strong co-occurrence relationship between the two levels of color and the type of fruit results in a large distance, and vice versa.

Based on the power of an attribute to separate data into homogeneous segments, Ahmad and Dey (2007) defined the distance between categorical attribute values and calculated the weights for numeric attributes by exploiting this co-occurrence relation. The overall distance is the sum of the categorical and the weighted numeric distances, and it is applied in the k-means algorithm to cluster mixed datasets. The comparative study showed good performance. However, Ahmad and Dey's fuzzy prototype k-means method does not work in applications without a known number of groups. Unlike Ahmad and Dey's method, therefore, our study employs this distance in a hierarchical algorithm to derive a tree structure, which can generate a series of partitions with successive cluster numbers. These partitions will be evaluated by the validity index proposed in the following chapter.

In this section, we compare the hierarchical algorithm with four distances capable of handling mixed data types in terms of clustering accuracy, and explore the properties of datasets for which the co-occurrence distance is advantageous when used with agglomerative hierarchical clustering.

3.2 Background

Traditional approaches to clustering datasets with mixed data types adopt distance representations by converting one type of attribute to another. One way is to transform categorical data into numeric data by simple matching or binary coding. Alternatively, the continuous attributes are discretized into categorical data. As mentioned in the preceding chapters, these two ways are not effective in dealing with mixed datasets. Some researchers have made efforts to balance numeric and nominal distances. Huang (1998) introduced a weight factor for the categorical distance in his k-prototype algorithm. Modha and Spangler (2003) went further and found the optimal weight that minimizes the within-cluster weighted distance while maximizing the between-cluster weighted distance. The SBAC method (Li and Biswas, 2002) adopted the Goodall distance, based on the concept that two

species are closer if they have rarer characteristics in common. From the viewpoint of probability, the Goodall distance unites categorical and numeric distances within a common framework. Ahmad and Dey (2007) define a distance by exploiting a co-occurrence relation of values in different attributes. The k-prototype, the optimal weight, the Goodall, and the co-occurrence distances are discussed in greater detail below.

Notation

DS(U, A) represents a set of objects in terms of their attributes, where U is a nonempty finite set of objects and A is a nonempty finite set of attributes. For example, a strawberry and a banana are two objects in U, while color, type, and weight of the fruit are the attributes in A. Let n be the number of objects and m the number of attributes. There exists a function between the set of objects and each attribute such that $a_p : U \to V_{a_p}$ for any $a_p \in A$ ($p = 1, 2, \dots, m$), i.e., $a_p(x_i) \in V_{a_p}$ ($i = 1, 2, \dots, n$), $x_i \in U$, where $V_{a_p}$ is called the domain of attribute $a_p$. $a_p(x_i)$ is the value of object $x_i$ on attribute $a_p$. For example, the color of a banana is denoted as $a_{color}(banana) = yellow$.

Generally speaking, the set of attributes can be divided into two subsets $A_r$ and $A_c$ according to data type, where $A_r$ is the set of numeric attributes and $A_c$ the set of categorical attributes. Thus, $A = A_r \cup A_c$. If A is the set including color, type, and weight of the fruit, then $A_c$ is the set with color and type of the fruit while $A_r$ contains the weight of the fruit. $m_r$ and $m_c$ are the numbers of numeric and categorical attributes, respectively, so $m = m_r + m_c$.

Given $a_p \in A$ and $x, y \in U$, if $a_p(x) = a_p(y)$, then x and y are said to have no difference w.r.t. $a_p$. The distance of x and y on $a_p$ is zero, denoted by $D^p(x, y) = 0$ when $a_p(x) = a_p(y)$. For example, if one fruit in the basket is a banana and another is a lemon, we know their colors are both yellow. This can be denoted as $a_{color}(banana) = a_{color}(lemon)$, or $D^{color}(banana, lemon) = 0$. The total distance between two objects x and y is denoted by d(x, y).

Let $R_i = \{(x, y) \in U \times U : a_i(x) = a_i(y)\}$. Thus, the relation $R_i$ partitions U into disjoint subsets according to the values on attribute $a_i$. These subsets are called the equivalence classes of $a_i$. The equivalence class including x is $S_i(x) = \{y \in U : (x, y) \in R_i\}$. For example, if x represents a banana, then $S_{color}(banana)$ contains all the fruits in the basket that have the

same color as the banana. Thus, $S_{color}(banana)$ is the set of all bananas and lemons in the basket. $W_i(B_i) = \bigcup \{S_i(x) : a_i(x) \in B_i\}$, with $B_i \subseteq V_{a_i}$, is the set of objects whose values on $a_i$ are among $B_i$. Objects having the same value w.r.t. $a_i$ are therefore all included in one segment, either $W_i(B_i)$ or $U/W_i(B_i)$. $U/W_i(B_i)$ is simply called the complement of $W_i(B_i)$, usually denoted by $\sim W_i(B_i)$. If $W_{color}(\{yellow\})$ represents the set of fruits with a yellow appearance, then $\sim W_{color}(\{yellow\})$ is the set of fruits with any color except yellow.

k-prototype Distance

In order to cluster mixed datasets, Huang (1998) used a user-specified weight $\gamma$ to balance the distance over numeric and nominal attributes and applied this measure in his k-prototype algorithm. The numeric distance is the squared Euclidean distance, and the categorical distance is simple matching. All numeric attributes are normalized to the range [0, 1]. A small $\gamma$ value indicates that the clustering is dominated by numeric attributes, while a large $\gamma$ value implies that categorical attributes dominate the clustering. Huang suggested the weight should be in the range [0.5, 1.4]. The total distance between a pair of objects x and y is formulated as

$$d(x, y) = \sum_{a_i \in A_r} \left( \frac{a_i(x) - a_i(y)}{\max(V_{a_i}) - \min(V_{a_i})} \right)^2 + \gamma \sum_{a_i \in A_c} D^i(x, y), \tag{3.1}$$

and the computational complexity is $O(mn^2)$.

Optimal Weight Distance

Modha and Spangler (2003) used optimization techniques to further balance numeric and nominal distances. Instead of a user-specified weight, their method searches for an optimal weight that minimizes the ratio of the within- and between-cluster weighted distances when the number of clusters is given. The numeric features are standardized based on the mean and standard deviation, and then the distance is found by taking the squared Euclidean distance. Each categorical value is represented by a binary vector using 1-of-v encoding (v is the number of attribute values), and the distance is found by taking the cosine distance. The optimal weight distance combines the weighted distances of the two data types.
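The k-prototype distance of Eq. (3.1) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than Huang's implementation: objects are represented as dictionaries, the function name `k_prototype_distance` and the fruit records are hypothetical, and every numeric attribute is assumed to have a nonzero range.

```python
def k_prototype_distance(x, y, numeric_ranges, gamma=1.0):
    """Eq. (3.1): squared range-normalized numeric differences plus
    gamma times the number of mismatching categorical values."""
    numeric_part = 0.0
    categorical_part = 0
    for name, value in x.items():
        if name in numeric_ranges:
            low, high = numeric_ranges[name]  # min and max of the attribute domain
            numeric_part += ((value - y[name]) / (high - low)) ** 2
        else:
            # simple matching: 0 for identical values, 1 otherwise
            categorical_part += value != y[name]
    return numeric_part + gamma * categorical_part


# Hypothetical fruit records; "weight" is the only numeric attribute
banana = {"color": "yellow", "type": "banana", "weight": 120.0}
strawberry = {"color": "red", "type": "strawberry", "weight": 20.0}
lemon = {"color": "yellow", "type": "lemon", "weight": 100.0}
ranges = {"weight": (20.0, 120.0)}

d_banana_strawberry = k_prototype_distance(banana, strawberry, ranges, gamma=0.5)
d_banana_lemon = k_prototype_distance(banana, lemon, ranges, gamma=0.5)
```

With $\gamma = 0.5$, the banana and the strawberry differ in both categorical attributes and in the full weight range, while the banana and the lemon share a color and have similar weights, so the second distance is much smaller. Changing $\gamma$ shifts the balance between the two parts, which is why the choice of weight matters.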

The optimal weight distance with weight $w$ is

$$d(x, y) = (1 - w) \sum_{a_i \in A_r} \frac{(a_i(x) - a_i(y))^2}{var(V_{a_i})} + w \sum_{a_i \in A_c} D^i(x, y). \tag{3.2}$$

In order to get the optimal weight, the number of clusters should be known in advance. Since the objective function of this minimization problem is nonlinear, it is hard to pursue an optimal solution. Modha and Spangler evaluated the objective for a large number of candidate weights in the interval [0, 1] in order to find the best one. The computational complexity is $O(mn^2)$ when choosing iterations to search for the weight.

Goodall Distance

Goodall (1966) proposed a similarity index based on the premise that a pair of objects sharing an uncommon value of an attribute is closer than other pairs that share only a common value. For example, a salmon and a bass both have scales, while a dolphin and a salmon both have vertebrae. Since there are more animals having vertebrae than those having scales, a salmon is closer to a bass than to a dolphin. The author assumed that the attributes are independent. Li and Biswas (2002) adopted this distance in the SBAC method.

The distance for non-identical nominal values is one: $D^i(x, y) = 1$ when $a_i(x) \ne a_i(y)$, $a_i \in A_c$. For example, in the case of the fruit basket, the distance between yellow and red is formulated as $D^{color}(banana, strawberry) = 1$. However, $D^{color}(banana, lemon)$ is much more complex. For a pair of identical nominal values, the distance is the sum of the probabilities of picking an identical value pair whose value is equally or more similar to the pair in question, that is, having lower or equal frequency. The formulation is as follows when $s = a_i(x) = a_i(y)$, $a_i \in A_c$:

$$D^i(x, y) = \sum_{r \in V_{a_i},\ freq(r) \le freq(s)} \frac{freq(r)(freq(r) - 1)}{n(n - 1)}. \tag{3.3}$$

The distance between identical numeric values is calculated as in the nominal case, using Eq. (3.3). To calculate the distance between two different numeric values, first divide the domain into successive segments by the unique values of the numeric feature and count the frequency in every interval. Then sum the probabilities of the ranges (l, m) that are smaller, or of equal width but with less or equal frequency.
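The nominal case, that is, the rule $D^i(x, y) = 1$ for unequal values together with Eq. (3.3) for equal values, can be sketched as follows. The sketch covers a single attribute only and leaves out the later combination across attributes; the function name `goodall_nominal_distance` and the color counts are hypothetical, and at least two objects are assumed.

```python
from collections import Counter


def goodall_nominal_distance(column, s, t):
    """Per-attribute Goodall distance between nominal values s and t,
    where column lists the attribute value of every object."""
    if s != t:
        return 1.0  # non-identical nominal values
    n = len(column)
    freq = Counter(column)
    # Eq. (3.3): sum over the values r that are at most as frequent as s
    pairs = sum(f * (f - 1) for f in freq.values() if f <= freq[s])
    return pairs / (n * (n - 1))


# Hypothetical basket of ten fruits described by color only
colors = ["yellow"] * 6 + ["red"] * 3 + ["green"]
d_common = goodall_nominal_distance(colors, "yellow", "yellow")
d_rare = goodall_nominal_distance(colors, "red", "red")
d_different = goodall_nominal_distance(colors, "yellow", "red")
```

Two red fruits end up closer than two yellow fruits because red is the rarer color, which mirrors the salmon and bass example above, while any pair with different colors is at the maximum distance of one.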
Given $s = a_i(x)$, $t = a_i(y)$, $a_i(x) \ne a_i(y)$, $a_i \in A_r$, the distance on a numeric attribute $a_i$ is defined as follows.

$$D^i(x, y) = \sum_{r \in V_{a_i}} \frac{freq(r)(freq(r) - 1)}{n(n - 1)} + \sum_{\substack{m - l < t - s \ \text{or} \\ \{m - l = t - s,\ freq(m - l) \le freq(t - s)\}}} \frac{2\, freq(l)(freq(m) - 1)}{n(n - 1)} \tag{3.4}$$

After the distances for each attribute have been calculated, a $\chi^2$ transformation is used to get the corresponding chi-square values. The sum of these values is distributed as $\chi^2$ with degrees of freedom equal to twice the number of attributes. The probability of this sum is the Goodall distance of a pair of objects.

It takes $O(n + v_i \log v_i)$ to compute the nominal distance for a categorical attribute with $v_i$ levels. For $m_c$ attributes, $m_c$ such calls are needed. Therefore, the running time is $O(n m_c + m_c q_c \log q_c)$, where $q_c = \max\{|V_{a_i}|, a_i \in A_c\}$. Sorting the intervals of a numeric attribute with $v_i$ unique values requires $O(v_i^2 \log v_i)$. Given $l_j$, the number of intervals in the $j$th group of equal-range intervals, the same-range intervals are sorted by their frequencies in $O(\sum_{j=1}^{v_i^2} l_j \log l_j)$, which is upper bounded by $O(v_i^2\, l \log l)$, where $l$ is the maximum over all $l_j$. Accordingly, the computational complexity of the numeric distances is $O(n m_r + m_r q_r^2 \log q_r + m_r q_r^2\, l \log l)$, and the total running time is $O(nm + m q_c \log q_c + m q_r^2 \log q_r + m q_r^2\, l \log l)$, where $q_r$ is the maximum number of unique values in the numeric attributes. Even in an ordinary dataset, $q_r$ can be huge.

Co-occurrence Distance

Ahmad and Dey (2007) exploit the property that if there is a stronger connection of values on $a_i$ (e.g., s and t) with different values on $a_j$, then s and t are more powerful in separating a dataset into pure segments w.r.t. $a_j$. The extreme case is when s and t are associated with entirely different values in $a_j$. Given $W_j(B_j)$, let $P_i(W_j(B_j) \mid s)$ be the conditional probability w.r.t. $W_j(B_j)$ when the value of object x on attribute $a_i$ is s, and $P_i(\sim W_j(B_j) \mid s)$ the conditional probability of the set $U/W_j(B_j)$ when the value of object x on attribute $a_i$ is s, where $s \in V_{a_i}$, $a_i \in A_c$. These two formulations are

$$P_i(W_j(B_j) \mid s) = \frac{|\{y : y \in W_j(B_j),\ a_i(y) = s\}|}{|\{x : x \in U,\ a_i(x) = s\}|} \tag{3.5}$$

$$P_i(\sim W_j(B_j) \mid s) = \frac{|\{y : y \in U/W_j(B_j),\ a_i(y) = s\}|}{|\{x : x \in U,\ a_i(x) = s\}|}. \tag{3.6}$$

The distance between two levels of categorical attribute $a_i$ with respect to attribute $a_j$ is defined as

$$D^{ij}(s, t) = \max_{B_j} \left\{ P_i(W_j(B_j) \mid s) + P_i(\sim W_j(B_j) \mid t) - 1.0 \right\}, \tag{3.7}$$

where $s, t \in V_{a_i}$, $B_j \subseteq V_{a_j}$, $s \ne t$, and $a_i, a_j \in A_c$. Ahmad and Dey (2007) showed that an optimal solution can be obtained by an algorithm that is polynomial in terms of $|V_{a_j}|$. The distance between two levels of categorical attribute $a_i$ is the average of $D^{ij}(s, t)$ over all categorical attributes except $a_i$. Given $s = a_i(x)$, $t = a_i(y)$, $s \ne t$, $a_i \in A_c$,

$$D^i(x, y) = \frac{1}{m_c - 1} \sum_{j \ne i,\ a_j \in A_c} D^{ij}(s, t), \quad \text{and} \tag{3.8a}$$

$$Dist_i(s, t) \equiv D^i(x, y) = \frac{1}{m_c - 1} \sum_{j \ne i,\ a_j \in A_c} D^{ij}(s, t). \tag{3.8b}$$

The distance between two values on a numerical attribute $a_i \in A_r$ is

$$D^i(x, y) = \frac{|a_i(x) - a_i(y)|}{\max(V_{a_i}) - \min(V_{a_i})}, \tag{3.9}$$

which is equivalent to normalizing the numerical attribute $a_i$ first and then taking the absolute value of the difference. The distance between every pair of objects w.r.t. the attribute set A is defined as

$$d(x, y) = \sum_{a_i \in A_r} w_i \left( \frac{a_i(x) - a_i(y)}{\max(V_{a_i}) - \min(V_{a_i})} \right)^2 + \sum_{a_i \in A_c} D^i(x, y), \tag{3.10}$$

where $w_i$ is the weight of the real attribute $a_i$. The weight $w_i$ is introduced to modify the numerical distances based on their power to divide the dataset into pure segments. First, a numerical attribute $a_i \in A_r$ is discretized into $v_i$ intervals, where $v_i$ is a predefined integer. Then $w_i$ is calculated as in Eq. (3.11) to reveal the significance of attribute $a_i$ in separating the dataset.
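The categorical part of the co-occurrence distance, Eqs. (3.5) to (3.8a), can be sketched in Python. This is a minimal sketch and not Ahmad and Dey's code: it assumes that every column of `rows` is categorical, that both values occur in the data, and that there are at least two attributes. The function names and the fruit rows (including the apples) are hypothetical. Instead of enumerating subsets $B_j$, it uses the fact that the maximum in Eq. (3.7) is reached by placing in $B_j$ every value that is at least as likely given s as given t.

```python
from collections import Counter


def pairwise_cooccurrence(rows, i, j, s, t):
    """Eq. (3.7): distance between values s and t of attribute i
    with respect to attribute j."""
    given_s = Counter(row[j] for row in rows if row[i] == s)
    given_t = Counter(row[j] for row in rows if row[i] == t)
    n_s = sum(given_s.values())  # objects whose value on attribute i is s
    n_t = sum(given_t.values())  # objects whose value on attribute i is t
    # Each value v of attribute j contributes the larger of P(v | s) and
    # P(v | t), which equals the maximum over all subsets B_j.
    total = sum(
        max(given_s[v] / n_s, given_t[v] / n_t)
        for v in set(given_s) | set(given_t)
    )
    return total - 1.0


def cooccurrence_distance(rows, i, s, t):
    """Eq. (3.8a): average of Eq. (3.7) over all attributes except i."""
    others = [j for j in range(len(rows[0])) if j != i]
    return sum(pairwise_cooccurrence(rows, i, j, s, t) for j in others) / len(others)


# Hypothetical baskets with columns (color, type)
separable = [("yellow", "banana")] * 3 + [("red", "strawberry")] * 3
mixed = (
    [("yellow", "banana")] * 2 + [("yellow", "apple")] * 2
    + [("red", "strawberry")] * 2 + [("red", "apple")] * 2
)
d_separable = cooccurrence_distance(separable, 0, "yellow", "red")
d_mixed = cooccurrence_distance(mixed, 0, "yellow", "red")
```

In the first basket each color points to a single type, so yellow and red are at the maximum distance of one. In the second basket apples occur in both colors, the two colors share part of their conditional distribution over types, and the distance drops to one half.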

$$w_i = \frac{\sum_{(s, t)} Dist_i(s, t)}{v_i (v_i - 1)}, \tag{3.11}$$

s and t are the newly assigned categorical values for the discretized numerical attribute $a_i$, and $v_i$ is the number of categorical values, $v_i = |V_{a_i}|$. The running time of calculating the co-occurrence distance is $O(n^2 m + n m^2 + q^3 m^2)$, where $q = \max\{|V_{a_i}|, a_i \in A_c\}$.

3.3 Proposed Hierarchical Clustering Method

In this section, we will integrate an agglomerative clustering algorithm with the four distance measures capable of handling mixed datasets.

Agglomerative Hierarchical Algorithm

Agglomerative hierarchical algorithms start from singleton clusters and merge the clusters with minimal distances until all objects are included in one cluster. The distance between individual objects is as important in creating clusters as the cluster distance, but the cluster distance has greater weight in creating the final partition. Although there are a large number of distance definitions between a cluster and a newly formed cluster, we choose Ward's method, which minimizes the increase of the within-class sum of squared errors, since we wish the formed clusters to be compact, not chain-like or consisting of a single object. The within-class sum of squared errors is the sum of the squared Euclidean distances between each object and its nearest cluster center and is formulated as

$$E = \sum_{k} \sum_{o \in C_k} \| o - z_k \|^2, \tag{3.12}$$

where $o$ is an object in the $k$th cluster and $z_k$ the centroid of this cluster. If two clusters $C_i$ and $C_j$ are merged, the increment of E is calculated with the following equation:

$$\Delta E_{ij} = \frac{n_i n_j}{n_i + n_j} \| z_i - z_j \|^2. \tag{3.13}$$

Thus, the distance between two clusters $C_i$ and $C_j$ is defined as


More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems Determnng Fuzzy Sets for Quanttatve Attrbutes n Data Mnng Problems ATTILA GYENESEI Turku Centre for Computer Scence (TUCS) Unversty of Turku, Department of Computer Scence Lemmnkäsenkatu 4A, FIN-5 Turku

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

CSE 326: Data Structures Quicksort Comparison Sorting Bound

CSE 326: Data Structures Quicksort Comparison Sorting Bound CSE 326: Data Structures Qucksort Comparson Sortng Bound Steve Setz Wnter 2009 Qucksort Qucksort uses a dvde and conquer strategy, but does not requre the O(N) extra space that MergeSort does. Here s the

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

5 The Primal-Dual Method

5 The Primal-Dual Method 5 The Prmal-Dual Method Orgnally desgned as a method for solvng lnear programs, where t reduces weghted optmzaton problems to smpler combnatoral ones, the prmal-dual method (PDM) has receved much attenton

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-74 Volume 0 Issue BoTechnology 04 An Indan Journal FULL PAPER BTAIJ 0() 04 [684-689] Revew on Chna s sports ndustry fnancng market based on market -orented

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

From Comparing Clusterings to Combining Clusterings

From Comparing Clusterings to Combining Clusterings Proceedngs of the Twenty-Thrd AAAI Conference on Artfcal Intellgence (008 From Comparng Clusterngs to Combnng Clusterngs Zhwu Lu and Yuxn Peng and Janguo Xao Insttute of Computer Scence and Technology,

More information

Life Tables (Times) Summary. Sample StatFolio: lifetable times.sgp

Life Tables (Times) Summary. Sample StatFolio: lifetable times.sgp Lfe Tables (Tmes) Summary... 1 Data Input... 2 Analyss Summary... 3 Survval Functon... 5 Log Survval Functon... 6 Cumulatve Hazard Functon... 7 Percentles... 7 Group Comparsons... 8 Summary The Lfe Tables

More information

A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment

A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment A Webpage Smlarty Measure for Web Sessons Clusterng Usng Sequence Algnment Mozhgan Azmpour-Kv School of Engneerng and Scence Sharf Unversty of Technology, Internatonal Campus Ksh Island, Iran mogan_az@ksh.sharf.edu

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place

More information

Clustering validation

Clustering validation MOHAMMAD REZAEI Clusterng valdaton Publcatons of the Unversty of Eastern Fnland Dssertatons n Forestry and Natural Scences No 5 Academc Dssertaton To be presented by permsson of the Faculty of Scence and

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Fuzzy C-Means Initialized by Fixed Threshold Clustering for Improving Image Retrieval

Fuzzy C-Means Initialized by Fixed Threshold Clustering for Improving Image Retrieval Fuzzy -Means Intalzed by Fxed Threshold lusterng for Improvng Image Retreval NAWARA HANSIRI, SIRIPORN SUPRATID,HOM KIMPAN 3 Faculty of Informaton Technology Rangst Unversty Muang-Ake, Paholyotn Road, Patumtan,

More information

Intra-Parametric Analysis of a Fuzzy MOLP

Intra-Parametric Analysis of a Fuzzy MOLP Intra-Parametrc Analyss of a Fuzzy MOLP a MIAO-LING WANG a Department of Industral Engneerng and Management a Mnghsn Insttute of Technology and Hsnchu Tawan, ROC b HSIAO-FAN WANG b Insttute of Industral

More information

Insertion Sort. Divide and Conquer Sorting. Divide and Conquer. Mergesort. Mergesort Example. Auxiliary Array

Insertion Sort. Divide and Conquer Sorting. Divide and Conquer. Mergesort. Mergesort Example. Auxiliary Array Inserton Sort Dvde and Conquer Sortng CSE 6 Data Structures Lecture 18 What f frst k elements of array are already sorted? 4, 7, 1, 5, 1, 16 We can shft the tal of the sorted elements lst down and then

More information

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems

More information

A fault tree analysis strategy using binary decision diagrams

A fault tree analysis strategy using binary decision diagrams Loughborough Unversty Insttutonal Repostory A fault tree analyss strategy usng bnary decson dagrams Ths tem was submtted to Loughborough Unversty's Insttutonal Repostory by the/an author. Addtonal Informaton:

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT 3. - 5. 5., Brno, Czech Republc, EU APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT Abstract Josef TOŠENOVSKÝ ) Lenka MONSPORTOVÁ ) Flp TOŠENOVSKÝ

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

Optimal Workload-based Weighted Wavelet Synopses

Optimal Workload-based Weighted Wavelet Synopses Optmal Workload-based Weghted Wavelet Synopses Yoss Matas School of Computer Scence Tel Avv Unversty Tel Avv 69978, Israel matas@tau.ac.l Danel Urel School of Computer Scence Tel Avv Unversty Tel Avv 69978,

More information

cos(a, b) = at b a b. To get a distance measure, subtract the cosine similarity from one. dist(a, b) =1 cos(a, b)

cos(a, b) = at b a b. To get a distance measure, subtract the cosine similarity from one. dist(a, b) =1 cos(a, b) 8 Clusterng 8.1 Some Clusterng Examples Clusterng comes up n many contexts. For example, one mght want to cluster journal artcles nto clusters of artcles on related topcs. In dong ths, one frst represents

More information

TOWARDS FUZZY-HARD CLUSTERING MAPPING PROCESSES. MINYAR SASSI National Engineering School of Tunis BP. 37, Le Belvédère, 1002 Tunis, Tunisia

TOWARDS FUZZY-HARD CLUSTERING MAPPING PROCESSES. MINYAR SASSI National Engineering School of Tunis BP. 37, Le Belvédère, 1002 Tunis, Tunisia TOWARDS FUZZY-HARD CLUSTERING MAPPING PROCESSES MINYAR SASSI Natonal Engneerng School of Tuns BP. 37, Le Belvédère, 00 Tuns, Tunsa Although the valdaton step can appear crucal n the case of clusterng adoptng

More information

(1) The control processes are too complex to analyze by conventional quantitative techniques.

(1) The control processes are too complex to analyze by conventional quantitative techniques. Chapter 0 Fuzzy Control and Fuzzy Expert Systems The fuzzy logc controller (FLC) s ntroduced n ths chapter. After ntroducng the archtecture of the FLC, we study ts components step by step and suggest a

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

An Efficient Genetic Algorithm with Fuzzy c-means Clustering for Traveling Salesman Problem

An Efficient Genetic Algorithm with Fuzzy c-means Clustering for Traveling Salesman Problem An Effcent Genetc Algorthm wth Fuzzy c-means Clusterng for Travelng Salesman Problem Jong-Won Yoon and Sung-Bae Cho Dept. of Computer Scence Yonse Unversty Seoul, Korea jwyoon@sclab.yonse.ac.r, sbcho@cs.yonse.ac.r

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

CSE 326: Data Structures Quicksort Comparison Sorting Bound

CSE 326: Data Structures Quicksort Comparison Sorting Bound CSE 326: Data Structures Qucksort Comparson Sortng Bound Bran Curless Sprng 2008 Announcements (5/14/08) Homework due at begnnng of class on Frday. Secton tomorrow: Graded homeworks returned More dscusson

More information

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007 Syntheszer 1.0 A Varyng Coeffcent Meta Meta-Analytc nalytc Tool Employng Mcrosoft Excel 007.38.17.5 User s Gude Z. Krzan 009 Table of Contents 1. Introducton and Acknowledgments 3. Operatonal Functons

More information