Clustering validation


MOHAMMAD REZAEI

Clustering validation

Publications of the University of Eastern Finland
Dissertations in Forestry and Natural Sciences No 5

Academic Dissertation
To be presented by permission of the Faculty of Science and Forestry for public examination in Louhela auditorium in Science Park Building at the University of Eastern Finland, Joensuu, on June 0, 06, at 12 o'clock noon.

School of Computing

Grano Oy
Joensuu, 06
Editor: Dr. Pertti Pasanen
Distribution: University of Eastern Finland Library / Sales of publications
P.O.Box 07, FI-800 Joensuu, Finland
tel
ISBN: (printed)
ISSNL:
ISSN:
ISBN: (pdf)
ISSNL:
ISSN:

Author: Mohammad Rezaei
University of Eastern Finland
School of Computing
P.O.Box 800 JOENSUU FINLAND
email:

Supervisor: Professor Pasi Fränti, PhD
University of Eastern Finland
School of Computing
P.O.Box 800 JOENSUU FINLAND
email:

Reviewers: Professor Ana Luisa N. Fred, PhD
Instituto Superior Técnico
Torre Norte, Instituto de Telecomunicações
Av. Rovisco Pais, , Lisbon PORTUGAL
email:

Professor James Bailey, PhD
University of Melbourne
Department of Computing and Information Systems
Victoria 300 AUSTRALIA
email:

Opponent: Professor Ioan Tabus, PhD
Tampere University of Technology
Department of Signal Processing
P.O.Box Tampere FINLAND
email:

ABSTRACT:

Cluster analysis or clustering is one of the most fundamental and essential data mining tasks with broad applications. It aims at finding a structure in a set of unlabeled data, producing clusters so that objects in one cluster are similar in some way and different from objects in other clusters. Basic elements of clustering include the proximity measure between objects, cost function, algorithm, and cluster validation. There is a close relationship between these elements. Although there has been extensive research on clustering methods and their applications, less attention has been paid to the relationships between the basic elements. This thesis first provides an overview of the basic elements of cluster analysis. It then focuses on cluster validity, as four publications are devoted to this element.

Chapter 1 sketches the clustering procedure and provides definitions of basic components. Chapter 2 reviews popular proximity measures for different types of data. A novel similarity measure for comparing two groups of words is introduced, which is used in the clustering of items characterized by a set of keywords. Chapter 3 presents basic clustering algorithms and Chapter 4 analyzes cost functions. A clustering algorithm is expected to optimize a given cost function. However, in many cases the cost function is unknown and hidden within the algorithm, making the evaluation of clustering results and the analysis of the algorithms difficult.

Numerous clustering algorithms have been developed for different application fields. Different algorithms, or even one algorithm with different parameters, can give different results for the same data set. The best clustering can be selected based on the cost function if the number of clusters is fixed and the cost function has been defined; otherwise cluster validity indices, internal and external, are used. Chapter 5 reviews several popular internal indices. We study the problem of determining the number of clusters in a data set using these indices, and we propose a new internal index for finding the number of clusters in hierarchical clustering of words. External

validity indices are studied in Chapter 6 and two new external indices, centroid index and pair sets index, are introduced. We present a novel experimental setup based on generated partitions to evaluate external indices. We also study whether external indices are applicable to the problem of determining the number of clusters. The conclusion is that external indices can be used for the problem, but only in theory and in controlled environments where the type of data is well known and no surprises appear. In practice, this is rarely the case.

AMS classification: 6H30, 9C0
Universal Decimal Classification: , ,
Library of Congress Subject Headings: Data mining; cluster analysis; algorithms
Yleinen suomalainen asiasanasto: tiedonlouhinta; klusterianalyysi; validointi; algoritmit

Preface

This PhD dissertation contains the results of research completed at the School of Computing of the University of Eastern Finland during the years. Many individuals have helped me both directly and indirectly in my research and in writing this thesis. I would like to express my sincere gratitude to my supervisor, Professor Pasi Fränti, for giving me the chance to study in the PhD program and for his support with research throughout the years. I would never have finished this dissertation without his help and guidance. I would also like to thank my colleagues who helped me during my PhD study, especially Dr. Qinpei Zhao. I am thankful to Professor Ana Luisa N. Fred and Professor James Bailey, the reviewers of the thesis, for their feedback and comments.

I extend my heartfelt gratitude to my father and mother, my first teachers. Thank you so much for your help and support. I would like to express my deepest love and gratitude to my wife and my sons.

This research has been supported by MOPSI and MOPIS projects, SCITECO and LUMET grants from the University of Eastern Finland, and the Nokia Foundation.

Joensuu, May 9, 06
Mohammad Rezaei

LIST OF ORIGINAL PUBLICATIONS

P1 P. Fränti, M. Rezaei, Q. Zhao, Centroid index: cluster level similarity measure, Pattern Recognition, 47(9), pp , 04.
P2 M. Rezaei, P. Fränti, Set matching measures for external cluster validity, IEEE Transactions on Knowledge and Data Engineering, 06, (accepted).
P3 M. Rezaei, P. Fränti, Can number of clusters be solved by external validity index?, 06, (submitted).
P4 Q. Zhao, M. Rezaei, P. Fränti, Keyword clustering for automatic categorization, International Conference on Pattern Recognition (ICPR), pp , 0.
P5 M. Rezaei, P. Fränti, Matching similarity for keyword-based clustering, Joint IAPR International Workshop, SSPR & SPR 04, Joensuu, (S+SSPR), pp. 93-0, 04.

Throughout the thesis, these papers will be referred to by [P1]-[P5]. These papers are included at the end of this thesis by the permission of their copyright holders.

AUTHOR'S CONTRIBUTION

The idea of the paper [P1] originates from Prof. Pasi Fränti. The author contributed by refining the definition of the centroid index and extending it to the corresponding point-level index. The principal ideas of the other papers originate from the author.

Implementations for the papers [P2], [P3], and [P5] were performed completely by the author. The author implemented the point-level index in [P1]. Implementation of the idea in [P4] was done by the author, except for the libraries and similarity measures using WordNet. The author performed all experiments for [P2]-[P5] and part of the experiments for [P1].

[P1] was written by Prof. Pasi Fränti and [P4] by Dr. Qinpei Zhao. The author helped to refine the text and provided materials for some sections of these papers. The author has written the papers [P2], [P3], and [P5].

List of symbols

N      number of data objects
X      data object as vector
x_i    ith data object
P_j    jth cluster of clustering solution P
K      number of clusters
c_j    centroid of the jth cluster
n_j    number of objects in the jth cluster
x̄      average of all data objects
D      dimension of data

Contents

1 Introduction
2 Proximity measures
    Elementary data types
    Numerical distances
    Non-numerical distances
    Semantic similarity between words
    Semantic similarity between groups of words
3 Clustering algorithms
    K-means
    Random swap
    Agglomerative clustering
    DBSCAN
4 Cost functions
    Total Squared Error (TSE)
    All pairwise distances (APD)
    Spanning tree (ST)
    K-nearest neighbor connectivity
    Linkage criteria
5 Internal validity indices
    Internal indices
    Sum of squares within clusters (SSW)
    Sum of squares between clusters (SSB)
    Calinski-Harabasz index (CH)
    Silhouette coefficient (SC)
    Dunn family of indices
    Solving number of clusters

6 External validity indices
    Desired properties
    Pair-counting indices
    Information-theoretic indices
    Set matching indices
    Experimental setup for evaluation
    Solving the number of clusters
Summary of contributions
Conclusions
References
Appendix: Original publications


1 Introduction

Clustering is the division of data objects into groups or clusters such that objects in the same group are more similar than objects in different groups. Clustering plays an important role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, customer relationship management (CRM), marketing, medical diagnostics, computational biology, and visualization [].

Figure 1.1: Basic components of cluster analysis. (Diagram labels: Data; Clustering method, consisting of proximity measure [P5], clustering criterion, and clustering algorithm; Clusters; Clustering tendency?; How many clusters? [P3][P4]; Cluster validation, with internal index [P4] and external index [P1][P2]; Results interpretation; Knowledge.)

Figure 1.1 shows the components of cluster analysis. Data is represented in terms of features that form d-dimensional feature vectors. Feature extraction and selection from original entities must be performed so that the features provide as much distinction as possible between different entities concerning the task of interest. This is performed by an expert in the field. For example, the extraction of features from a speech signal to

distinguish between different people is performed by an expert in the speech processing field []. Moreover, extracted features may need preprocessing, such as dimensionality reduction and normalization of the features, so that all features have the same scale and contribute equally. In what follows, we assume that the features have already been extracted and the required preprocessing has been performed.

The basic components of cluster analysis are the following:
1. Proximity measure
2. Clustering criterion
3. Clustering algorithm
4. Cluster validation
5. Results interpretation

A similarity or dissimilarity (distance) measure between two data objects is a basic requirement for clustering, and it is chosen based on the problem at hand. For example, suppose that the problem concerns a time analysis of travelling in a city. Using Euclidean distance between two places is not accurate because one cannot typically travel through buildings. We study several proximity measures in Chapter 2, including a new similarity between two groups of words.

The clustering criterion determines the type of clusters that are expected. The criterion is expressed as a cost (or objective) function, or some other rules. For example, for the same data set, one criterion leads to hyperspherical clusters, whereas another leads to elongated clusters []. The cost function is hidden in many existing clustering approaches; however, the function can be determined through further analysis. We study several cost functions in Chapter 4.

The clustering algorithm is the procedure that groups data in order to optimize the clustering criterion. Numerous clustering algorithms have been developed for different fields. Good algorithms find a clustering close to the optimum efficiently. In Chapter 3, we review basic clustering algorithms.

Different clustering algorithms, and even one algorithm with different parameters and initial assumptions, can produce different clusterings for the same data set. For a fixed number of

clusters, different results can be evaluated based on the clustering criterion if available. In the general case, cluster validation techniques are used to evaluate the results of a clustering algorithm [3] and to decide which clustering best fits the data. Cluster validation is performed using cluster validity indices, which are divided into two groups: internal indices and external indices [P]. Internal indices measure the quality of a clustering solution using only the underlying data [4], [5]. External indices compare two clustering solutions of the same dataset. They might compare a clustering with the ground truth to evaluate a clustering algorithm. Both internal and external indices are used for determining the number of clusters. We study cluster validity indices in Chapters 5 and 6.

The goal of clustering is to provide meaningful insights into the data in order to develop a better understanding of the data. Therefore, in many cases, the expert in the application field is encouraged to interpret the resulting partitions and integrate the results with other experimental evidence and analysis in order to draw the right conclusions.


2 Proximity measures

A data object represents an entity and is described by attributes or features of a certain type, such as a number or a word. Attributes are often represented by a multidimensional vector [6]. The type of attributes is one of the factors that determines how to measure the similarity between two objects. Other factors are related to the problem at hand. For example, the similarity of two words for some applications is measured by considering the letters in the words. However, for other applications, this does not provide good results, and the semantic similarity between two words is required.

A dissimilarity or similarity measure can be effective without being a metric [7], but sometimes metric requirements are desirable. A dissimilarity metric must satisfy the following conditions [7]:

Non-negativity: \(D(x_i, x_j) \ge 0\)
Symmetry: \(D(x_i, x_j) = D(x_j, x_i)\)
Reflexivity: \(D(x_i, x_j) = 0\) if and only if \(x_i = x_j\)
Triangular inequality: \(D(x_i, x_j) + D(x_j, x_k) \ge D(x_i, x_k)\)

A similarity metric satisfies the following:

Limited range: \(S(x_i, x_j) \le S_0\)
Symmetry: \(S(x_i, x_j) = S(x_j, x_i)\)
Reflexivity: \(S(x_i, x_j) = S_0\) if and only if \(x_i = x_j\)
Triangular inequality: \(S(x_i, x_j)\, S(x_j, x_k) \le S(x_i, x_k)\,\bigl(S(x_i, x_j) + S(x_j, x_k)\bigr)\)

2.1 ELEMENTARY DATA TYPES

Numeric: Numeric data are classified in two groups: interval and ratio. The interval between each consecutive point of measurement is equal to every other for interval data, such as

time and temperature. They do not have a meaningful zero point. For example, am is not the absence of time. The difference between 0:15 and 0:30 has exactly the same value as the difference between 8:00 and 8:15. In ratio data, such as the number of people in line, a value of zero indicates an absence of whatever is measured. Another classification for numeric data includes discrete data and continuous data.

Categorical: Every object belongs to one of a limited number of possible categories, states, or names. Categorical data are classified into two groups: nominal and ordinal. Categories in nominal data such as marriage status (married, widow, single) are not ordered. Binary data can be considered as nominal data with only two states: 0 and 1. On the other hand, categories in ordinal data, such as degree of pain (severe, moderate, mild, none), are ordered.

2.2 NUMERICAL DISTANCES

Euclidean distance

Euclidean distance is the most common metric that is used for numerical vector objects. For two d-dimensional objects \(x_i\) and \(x_j\), Euclidean distance is calculated as follows:

\[ d_{ij} = \left( \sum_{l=1}^{d} (x_{il} - x_{jl})^2 \right)^{1/2} \tag{2.1} \]

Centroid-based clustering algorithms, such as K-means, that use Euclidean distance tend to provide hyperspherical clusters [6]. Euclidean distance is a special case (p=2) of a more general metric called Minkowski distance:

\[ d_{ij} = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^p \right)^{1/p} \tag{2.2} \]

Another popular special case of Minkowski distance is Manhattan or city-block distance where p=1, see Figure 2.1:

\[ d_{ij} = \sum_{l=1}^{d} |x_{il} - x_{jl}| \tag{2.3} \]

A clustering algorithm that uses Manhattan distance tends to build hyper-rectangular clusters [6].

Figure 2.1: Euclidean and Manhattan distances. (The figure shows a 2-D example with two points, one of which is (6,3), and the Euclidean and Manhattan distances between them.)

Mahalanobis distance

All the objects in a cluster affect the Mahalanobis distance between two objects through the within-group covariance matrix S. Clustering algorithms that use this distance tend to build hyper-ellipsoidal clusters.

\[ d_{ij} = (x_i - x_j)^T S^{-1} (x_i - x_j) \tag{2.4} \]

The within-group covariance matrix for uncorrelated features becomes an identity matrix and, therefore, Mahalanobis distance simplifies to Euclidean distance [6].
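The numerical distances of this section can be illustrated with a minimal sketch. The function names and the sample points are hypothetical and chosen only for illustration. The Mahalanobis function is written as the quadratic form without a square root, which is an assumption about the intended definition; with an identity covariance matrix it therefore returns the squared Euclidean distance.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two numeric vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def euclidean(x, y):
    """Special case p = 2."""
    return minkowski(x, y, 2)

def manhattan(x, y):
    """Special case p = 1 (city-block distance)."""
    return minkowski(x, y, 1)

def mahalanobis(x, y, S):
    """Quadratic form (x - y)^T S^{-1} (x - y); no square root is taken (assumption)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ np.linalg.solve(S, diff))

a, b = (0.0, 0.0), (3.0, 4.0)          # hypothetical sample points
print(euclidean(a, b))                  # 5.0
print(manhattan(a, b))                  # 7.0
print(mahalanobis(a, b, np.eye(2)))     # 25.0, the squared Euclidean distance
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of S explicitly, which is usually the numerically safer choice.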

2.3 NON-NUMERICAL DISTANCES

Cosine similarity

Cosine similarity is the most popular measure used in document clustering and is based on the angle between the vectors of two objects.

\[ s_{ij} = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|} \tag{2.5} \]

The more similar two objects are, the more parallel they are in the feature space, and the greater the cosine value. The cosine value does not provide information on the magnitude of the difference.

Hamming distance

Hamming distance is used for comparing categorical data and strings of equal length. It counts the number of different elements in two objects [8]:

\[ d_{ij} = \sum_{l=1}^{d} d_l(x_{il}, x_{jl}), \qquad d_l(x_{il}, x_{jl}) = \begin{cases} 0, & x_{il} = x_{jl} \\ 1, & x_{il} \ne x_{jl} \end{cases} \tag{2.6} \]

Following are some examples:

Cables, Tablet: d=2
0000, 000: d=3
(male, blond, blue, A), (female, blond, brown, A): d=2

Gower similarity is a variant of Hamming distance, which is normalized by the number of attributes and has been extended for mixed categorical and numerical data [9]. The simple form of Gower similarity for categorical data can be written as follows:

\[ S_{ij} = \frac{1}{d} \sum_{l=1}^{d} S_l(x_{il}, x_{jl}), \qquad S_l(x_{il}, x_{jl}) = \begin{cases} 1, & x_{il} = x_{jl} \\ 0, & x_{il} \ne x_{jl} \end{cases} \tag{2.7} \]
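As a small illustration of the measures in this section, the following sketch implements cosine similarity, Hamming distance, and the simple categorical form of Gower similarity. The function names are hypothetical; the two non-binary examples from the text are reused as inputs.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms

def hamming(x, y):
    """Number of positions in which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))

def gower_categorical(x, y):
    """Fraction of matching attributes (simple Gower similarity)."""
    return 1.0 - hamming(x, y) / len(x)

print(hamming("Cables", "Tablet"))                        # 2
person_a = ("male", "blond", "blue", "A")
person_b = ("female", "blond", "brown", "A")
print(hamming(person_a, person_b))                        # 2
print(gower_categorical(person_a, person_b))              # 0.5
```

For categorical data the two measures carry the same information: the Gower similarity is one minus the Hamming distance divided by the number of attributes.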

Edit distance

Levenshtein or edit distance measures the dissimilarity of two strings (e.g., words) by counting the minimum number of insertions, deletions, and substitutions required to transform one string into the other. Several variants exist. For example, longest common subsequence (LCS) allows only insertions and deletions [0]. We describe the edit distance by an example: the dissimilarity between kitten and sitting. Transforming kitten into sitting can be performed in three steps as follows:

Substitute k with s: sitten
Substitute e with i: sittin
Insert g at the end: sitting

Therefore, the edit distance between the two words is 3.

2.4 SEMANTIC SIMILARITY BETWEEN WORDS

Semantic similarity between two words is measured according to their meaning rather than their syntactical representation. Measures for the semantic similarity of words can be categorized as corpus-based, search engine-based, knowledge-based and hybrid.

Corpus-based measures such as point-wise mutual information (PMI) [] and latent semantic analysis (LSA) [] define the similarity based on large corpora and term co-occurrence. The number of occurrences and co-occurrences of two words in a large number of documents is used to approximate their similarity. A high similarity is achieved when the number of co-occurrences is only slightly lower than the number of occurrences of each word.

Search engine-based measures such as Google distance are based on web counts and snippets from the results of a search engine [] [3] [4]. Flickr distance first searches for two target words separately through image tags and then uses image content to calculate the distance between two words [5].
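The edit distance example given earlier on this page (kitten and sitting) can be checked with a short dynamic-programming sketch. This is the standard textbook formulation with unit costs, not necessarily the implementation used in the thesis; the function name is hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(t) + 1))            # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]                            # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.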

Knowledge-based measures use lexical databases such as WordNet [6] or CYC [6]. These databases can be considered computational formats of large amounts of human knowledge. The knowledge extraction process is time consuming and the database depends on human judgment. Moreover, it does not scale easily to new words, fields, and languages [7] [8].

WordNet is a taxonomy that requires a procedure to derive a similarity score between words. Despite its limitations, it has been successfully used for clustering [P4]. Figure 2.2 illustrates a small part of the WordNet hierarchy where mammal is the least common subsumer of wolf and hunting dog. Depth of a word is the number of links between it and the root word in WordNet. As an example, the Wu and Palmer measure [9] is defined as follows:

\[ S(w_1, w_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(w_1, w_2))}{\mathrm{depth}(w_1) + \mathrm{depth}(w_2)} \tag{2.8} \]

where LCS is the least common subsumer of the words \(w_1\) and \(w_2\).

Figure 2.2: Part of WordNet taxonomy. (Nodes shown: animal; fish, mammal, reptile, amphibian; wolf, horse, dog, cat; mare, stallion, hunting dog; dachshund, terrier.)

Jiang-Conrath [6] is a hybrid of corpus-based and knowledge-based methods in that it extracts the information content of two words and their least common subsumer in a corpus. Methods based on Wikipedia or similar websites are also hybrid in the sense that they use organized corpora with links between documents [0].
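The Wu and Palmer measure can be sketched on a toy taxonomy. The `PARENT` table below is a hypothetical reading of the WordNet fragment in the figure (the exact parent of each node is an assumption), and depth is counted as the number of links to the root, as stated in the text. Real WordNet libraries may count depth differently.

```python
# Hypothetical child -> parent table for a small taxonomy rooted at "animal".
PARENT = {
    "fish": "animal", "mammal": "animal", "reptile": "animal", "amphibian": "animal",
    "wolf": "mammal", "horse": "mammal", "dog": "mammal", "cat": "mammal",
    "mare": "horse", "stallion": "horse", "hunting dog": "dog",
    "dachshund": "hunting dog", "terrier": "hunting dog",
}

def path_to_root(word):
    """The word followed by all of its ancestors, ending at the root."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(word):
    """Number of links between the word and the root."""
    return len(path_to_root(word)) - 1

def least_common_subsumer(w1, w2):
    """Deepest node that is an ancestor of (or equal to) both words."""
    ancestors_of_w2 = set(path_to_root(w2))
    return next(node for node in path_to_root(w1) if node in ancestors_of_w2)

def wu_palmer(w1, w2):
    total = depth(w1) + depth(w2)
    if total == 0:                      # both words are the root
        return 1.0
    return 2.0 * depth(least_common_subsumer(w1, w2)) / total

print(least_common_subsumer("wolf", "hunting dog"))  # mammal
print(wu_palmer("wolf", "hunting dog"))              # 0.4
print(wu_palmer("dachshund", "terrier"))             # 0.75
```

Words that share a deep common ancestor (dachshund and terrier) score higher than words whose only shared ancestor lies near the root (wolf and hunting dog).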

2.5 SEMANTIC SIMILARITY BETWEEN GROUPS OF WORDS

The semantic clustering of objects such as documents, web sites, and movies based on their keywords requires a similarity measure between two sets of keywords. Existing measures include minimum, maximum, and average similarity. Consider the bipartite graph in Figure 2.3 where the similarity between every two words is written on their corresponding link. Minimum and maximum measures are based on the links with minimum (0.0) and maximum (0.84) values. The average measure considers all the links and calculates the average value (0.57).

These measures have fundamental limitations in providing a reasonable similarity value between two sets of words [P5]. For example, the minimum and average measures give a lower value than 1.00 for two sets with the same words. The maximum measure gives 1.00 for two different sets which have only one common word.

Figure 2.3: Minimum and maximum similarities between two location-based services are derived by considering the two keywords with minimum and maximum similarities. (The figure shows two services with the keywords restaurant, cafeteria, cafe and sauna, holiday, cottage; the max and min links are marked, and average = 0.57.)

In [P5], we present a new measure based on matching the words of two groups, assuming that a similarity measure between two individual words is available. The proposed matching similarity measure is based on a greedy pairing algorithm which first finds the two most similar words across

the sets, and then iteratively matches the next most similar words. Finally, the remaining non-paired keywords (of the object with more keywords) are just matched with the most similar words in the other object. Figure 2.4 illustrates the matching process between two sample objects.

Figure 2.4: Matching between the words of two objects. (The figure shows the services Vesileppis and Tahko Spa with the keywords restaurant, gym, skiing, dance, spa and restaurant, gym, spa; S = 0.79.)

Consider two objects with \(N_1\) and \(N_2\) keywords so that \(N_1 > N_2\). We define the normalized similarity between the two objects as follows:

\[ S = \frac{1}{N_1} \sum_{i=1}^{N_1} S(w_i, w_{p(i)}) \tag{2.9} \]

where \(S(w_i, w_j)\) measures the similarity between two words, and \(p(i)\) provides the matched word for \(w_i\) in the other object. The proposed measure eliminates the disadvantages of the minimum, maximum, and average similarity measures.
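The greedy pairing described above can be sketched as follows. This is an illustrative reading of the description, not the implementation from [P5]. The word-level similarity `exact_match` is a hypothetical stand-in; in practice a semantic measure would be passed in instead, which is why the value printed here differs from the one reported in the figure.

```python
def matching_similarity(words_a, words_b, sim):
    """Greedy matching similarity between two keyword lists.

    `sim(u, v)` is any word-level similarity in [0, 1] (supplied by the caller).
    """
    if not words_a or not words_b:
        return 0.0
    # `big` is the object with more keywords; it determines the normalization.
    big, small = (words_a, words_b) if len(words_a) >= len(words_b) else (words_b, words_a)
    # Greedy pairing: visit candidate pairs from most to least similar.
    candidates = sorted(((sim(u, v), i, j)
                         for i, u in enumerate(big)
                         for j, v in enumerate(small)), reverse=True)
    used_big, used_small, total = set(), set(), 0.0
    for score, i, j in candidates:
        if i not in used_big and j not in used_small:
            used_big.add(i)
            used_small.add(j)
            total += score
    # Leftover words of the bigger object take their most similar word.
    for i, u in enumerate(big):
        if i not in used_big:
            total += max(sim(u, v) for v in small)
    return total / len(big)

def exact_match(u, v):
    """Hypothetical word similarity: 1 for identical words, 0 otherwise."""
    return 1.0 if u == v else 0.0

a = ["restaurant", "gym", "skiing", "dance", "spa"]
b = ["restaurant", "gym", "spa"]
print(matching_similarity(a, b, exact_match))  # 0.6
print(matching_similarity(b, b, exact_match))  # 1.0
```

Two identical keyword sets score 1.0, and a single shared word no longer yields 1.0 for otherwise different sets, which are the two shortcomings of the minimum, average, and maximum measures noted in the text.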

3 Clustering algorithms

3.1 K-MEANS

K-means is a partitional clustering algorithm that aims at minimizing the total squared error (TSE). To cluster N data objects into K clusters, K centroids are initially selected in some way, for example, through randomly chosen data objects. Two steps of the algorithm are then iteratively performed: assignment and update, for a fixed number of iterations or until convergence. In the first step, objects are assigned to their nearest centroid. In the second step, new centroids are calculated by averaging the objects in each cluster []. Time complexity is O(IKN), where I is the number of iterations [].

K-means suffers from several drawbacks [6]. The main drawback is that the result is highly dependent on the initial selection of centroids. Different centroids lead to different local optima that may be very far away from the global one. Consequently, many variants of K-means have been proposed to tackle these obstacles. For example, several techniques such as K-means++ [3] have been proposed for the better selection of initial centroids. Iterative methods such as genetic algorithm [4] and random swap [5] improve results by modifying the centroids.

3.2 RANDOM SWAP

The randomized local search or random swap algorithm [5] selects one of the centroids in a given clustering randomly and moves it to another location. K-means is then applied to fine-tune the clustering result. The process is repeated for a given number of iterations chosen as an input parameter. In each iteration, the new resulting clustering is accepted if it improves TSE, and is

then used for the next iteration. With a large number of iterations, typically 5,000, the method usually provides good results. This trial-and-error approach is simple to implement and very effective in practice.

3.3 AGGLOMERATIVE CLUSTERING

Agglomerative clustering is a bottom-up approach in which each object is initially considered as its own cluster. Two clusters are then iteratively merged based on a criterion [6]. Several criteria have been proposed for selecting the next two clusters to be merged, such as single-linkage, average-linkage, complete-linkage, centroid-linkage, and Ward's method [7].

Classical agglomerative clustering using any of these criteria is not appropriate for large-scale data sets due to the quadratic computational complexities in both execution time and storage space. The time complexity of the basic agglomerative clustering is \(O(N^3)\). The fast algorithm introduced in [8] employs a nearest neighbor table that only uses \(O(N)\) memory and reduces the time complexity to \(O(\tau N^2)\), where \(\tau \ll N\). Even this algorithm can still be too slow for real-time applications. In [6], an algorithm based on a k-nearest neighbor graph is proposed to improve the speed close to \(O(N \log N)\) with a slight decrease in accuracy. However, graph creation is the bottleneck of the algorithm and should be solved. Otherwise, this step dominates the time complexity.

Agglomerative clustering is sensitive to noise and outliers. It does not reconsider an object after it is assigned to a cluster, and therefore, previous misclassifications cannot be corrected afterwards [6].

3.4 DBSCAN

Density Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm which aims at finding arbitrary shaped clusters and eliminating noise. It

creates clusters from the points whose neighborhood within a given radius (eps) contains a minimum number (minpt) of other points [9]. Using every such point, the algorithm grows a cluster by joining other points that are close to the cluster. The results are independent of the order of processing the objects.

Three types of points are defined, see Figure 3.1. Core points contain at least minpt (5 in this example) points in their eps neighborhood. Border points do not contain enough points in their neighborhood but they fall in the neighborhood of some core points. Other points are considered noise or outliers.

A point \(x_j\) is directly density reachable from \(x_i\) if \(x_i\) is a core point and \(x_j\) is in its eps neighborhood. A point \(x_j\) is defined density reachable from a core point \(x_i\) if a chain of points from \(x_i\) to \(x_j\) exists so that each point is directly density reachable from the previous point. The concept of density connectivity is also defined to describe the relations between the border points that belong to the same cluster but are not density reachable from each other. Two points are density connected if they are density reachable from a common core point.

A cluster is built from a core point and its neighboring objects within eps distance, and it grows using the concepts of density-reachable and density-connected. Two conditions should hold:

1. If \(x_i\) is in cluster C, and \(x_j\) is density reachable from \(x_i\), then \(x_j\) also belongs to cluster C.
2. If \(x_i\) and \(x_j\) belong to cluster C, \(x_i\) and \(x_j\) are density connected.

The results are highly dependent on the input parameters eps and minpt. Finding appropriate parameters for a data set is not trivial, and the problem becomes more complicated when different parts of the data require different parameters []. Several methods such as Ordering Points To Identify the Clustering Structure (OPTICS) [30] have been proposed to address this problem. Time complexity of the original DBSCAN is \(O(N^2)\) but efforts [3] [3] have been made to reduce it close to \(O(N)\).
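The three point types can be illustrated with a minimal sketch that only classifies points and does not grow clusters. It counts other points within eps, following the wording above; some DBSCAN implementations also count the point itself, so this choice is an assumption. The data and parameter values are hypothetical.

```python
import numpy as np

def classify_points(X, eps, minpt):
    """Label every point as 'core', 'border', or 'noise'."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within = dist <= eps
    np.fill_diagonal(within, False)                 # count only *other* points
    core = within.sum(axis=1) >= minpt
    # A border point is not core but lies in the eps neighborhood of a core point.
    border = ~core & (within & core[None, :]).any(axis=1)
    return np.where(core, "core", np.where(border, "border", "noise"))

# Hypothetical data: a dense group, one point on its edge, one far away.
X = [[0.0, 0], [0.1, 0], [0.2, 0], [0.3, 0], [0.6, 0], [5.0, 0]]
print(classify_points(X, eps=0.35, minpt=3).tolist())
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

The full pairwise distance matrix makes this sketch quadratic in the number of points, which matches the cost of the original algorithm without a spatial index.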

Figure 3.1: Three types of points are defined in the DBSCAN algorithm; two clusters are identified in this example, where eps= and minpt=5. (Diagram labels: noise, outlier, border, core, eps, and the two clusters.)
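The K-means steps of Section 3.1 and the random swap idea of Section 3.2 can be sketched together as follows. This is a simplified illustration rather than the published algorithm: the swapped centroid is moved to a randomly chosen data object, two K-means iterations are used for fine-tuning, and 50 swaps are tried. All of these are assumptions made for the example, as are the function names and the generated data.

```python
import numpy as np

def assign(X, C):
    """Assignment step: index of the nearest centroid for every object."""
    d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def tse(X, C, labels):
    """Total squared error of a partition."""
    return float(((X - C[labels]) ** 2).sum())

def kmeans(X, C, iters=10):
    """Alternate assignment and update steps for a fixed number of iterations."""
    C = C.copy()
    for _ in range(iters):
        labels = assign(X, C)
        for j in range(len(C)):
            members = X[labels == j]
            if len(members):              # an empty cluster keeps its old centroid
                C[j] = members.mean(axis=0)
    return C, assign(X, C)

def random_swap(X, k, swaps=50, seed=0):
    """Relocate one centroid, fine-tune with K-means, keep the result if TSE improves."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    C, labels = kmeans(X, X[rng.choice(len(X), size=k, replace=False)])
    best = tse(X, C, labels)
    for _ in range(swaps):
        trial = C.copy()
        trial[rng.integers(k)] = X[rng.integers(len(X))]   # assumed relocation rule
        trial, trial_labels = kmeans(X, trial, iters=2)
        cost = tse(X, trial, trial_labels)
        if cost < best:
            C, labels, best = trial, trial_labels, cost
    return C, labels, best

# Hypothetical data: three well-separated groups in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in ((0, 0), (5, 5), (0, 5))])
C, labels, best = random_swap(X, k=3)
print(round(best, 2))
```

Because a trial solution is accepted only when it lowers TSE, the cost never increases from one swap to the next; this is the property that lets the method escape poor local optima left by the initial centroids.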

4 Cost functions

An objective function or cost function measures the error in a clustering. The optimal clustering is achieved by minimizing the cost function. However, not all clustering algorithms are based on minimizing a cost function. Some include the cost function hidden within the algorithm. This makes the evaluation of clustering results and the analysis of the algorithms difficult. For example, DBSCAN produces a clustering heuristically with two given input parameters. Different parameter values result in different clusterings. No objective function has been reported to decide which clustering is the best. There is, however, a cost function, but it may be hidden. This chapter addresses several cost functions that are used in existing clustering methods.

4.1 TOTAL SQUARED ERROR (TSE)

Total squared error (TSE) is the objective function for most centroid-based clustering algorithms such as k-means, and it is the sum of variances in individual clusters. Given data inputs \(x_i\), i=1..N, centroids \(c_j\), j=1..k, and labels of data \(l_i\), i=1..N, \(l_i\)=1..k, TSE is defined as [6]:

\[ TSE = \sum_{i=1}^{N} \| x_i - c_{l_i} \|^2 \tag{4.1} \]

Mean squared error (MSE) equals TSE normalized by the total number of objects. There is no difference between minimizing MSE and TSE.

\[ MSE = \frac{1}{N} \sum_{i=1}^{N} \| x_i - c_{l_i} \|^2 \tag{4.2} \]

For a fixed number of clusters k, the best clustering is the one that provides minimum TSE. However, when the number of

clusters varies, the clustering that best fits the data cannot be concluded merely based on TSE because increasing k will always provide a smaller TSE. This would lead all points into their own clusters. The TSE in equation (4.1) can be used only for data in which the centroid of a cluster can be calculated by averaging the objects in the cluster.

4.2 ALL PAIRWISE DISTANCES (APD)

This cost function considers all pairwise distances (APD) between the objects in a cluster. The centroid is not needed. Therefore, APD can be used for any type of data if the distance between every two objects is available. The criterion is defined as:

\[ APD = \sum_{x_i, x_j \in C_l} \| x_i - x_j \|^2 \tag{4.3} \]

It can be shown for Euclidean distance that [33]:

\[ APD = APD_1 + APD_2 + \dots + APD_k = n_1 TSE_1 + n_2 TSE_2 + \dots + n_k TSE_k \tag{4.4} \]

where \(APD_j\), \(n_j\), and \(TSE_j\) are the sum of all pairwise distances, the number of objects, and the total squared error in cluster j, respectively. It is shown in [34] that applying all pairwise distances as the clustering criterion leads to more balanced clusters than TSE.

TSE can be calculated for non-numeric data without having centroids as follows. The sum of all pairwise distances is calculated for each cluster, and the result is divided by the number of objects in the cluster, giving the total squared error \(TSE_j\). Summing up the total squared errors of all clusters results in TSE.
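The relation between all pairwise distances and TSE within one cluster can be checked numerically. The sketch below assumes squared Euclidean distances summed over unordered pairs, which is the reading under which the sum of pairwise distances equals the number of objects times the cluster's TSE. The function names and the sample cluster are hypothetical.

```python
from itertools import combinations

import numpy as np

def cluster_tse(points):
    """Sum of squared distances from the objects to their centroid."""
    points = np.asarray(points, dtype=float)
    return float(((points - points.mean(axis=0)) ** 2).sum())

def cluster_apd(points):
    """Sum of squared distances over all unordered pairs of objects."""
    points = np.asarray(points, dtype=float)
    return float(sum(((a - b) ** 2).sum() for a, b in combinations(points, 2)))

cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]   # corners of a square
n = len(cluster)
print(cluster_tse(cluster))                  # 8.0
print(cluster_apd(cluster))                  # 32.0, which is n * TSE
print(cluster_apd(cluster) / n)              # 8.0, TSE recovered without a centroid
```

Dividing the pairwise sum by the number of objects recovers the cluster's TSE without ever computing a centroid, which is what makes the approach usable when only pairwise distances are available.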

4.3 SPANNING TREE (ST)

The cost function is the sum of the costs of spanning trees (ST) of the individual clusters. The optimal solution for the cost function is achieved from the minimum spanning tree (MST) of the data objects. Given the MST in Figure 4.1 (left), we can get three clusters by cutting the two largest links. This cost function is suitable for detecting well separated arbitrary shaped clusters. However, it fails in real life data sets with noise, see Figure 4.1 (right).

Figure 4.1: Spanning trees of clusters are used to derive the cost function. (The right panel contains a point labeled noise.)

4.4 K-NEAREST NEIGHBOR CONNECTIVITY

This cost function measures connectedness by counting the number of k nearest neighbors of each object that are placed in a different cluster than the object [35]. It is calculated as:

\[ CONN = \sum_{i=1}^{N} \sum_{j=1}^{k} \delta\bigl(x_i, x_i^{(j)}\bigr), \qquad \delta\bigl(x_i, x_i^{(j)}\bigr) = \begin{cases} 1/j, & x_i^{(j)} \notin P_l \\ 0, & \text{otherwise} \end{cases} \tag{4.5} \]

where \(x_i^{(j)}\) is the jth nearest neighbor of \(x_i\), and \(P_l\) represents the cluster that \(x_i\) belongs to. The number of neighbors k is an input parameter. The cost function should be minimized. The optimal case is when all k nearest neighbors of an object are located in the same cluster as the object. The impact of the first neighbor on the cost function is the highest, and it decreases for the next

neighbors by the factor 1/j, j=1..k. The 5 nearest neighbors of one object are depicted in Figure 4.2, of which the fourth and fifth neighbors are from the other cluster. The error is calculated as 1/4+1/5=0.45. Summing up the errors for all the points gives the value of the cost function.

Figure 4.2: Five nearest neighbors are considered to calculate the cost function. For the selected point, two neighbors are located in the other cluster.

4.5 LINKAGE CRITERIA

In agglomerative clustering, a global cost function has not been defined in the literature. Instead, a merge cost is defined which aims at optimizing the clustering locally. Several criteria such as single-link and complete-link are used for merging two clusters, see Figure 4.3. We reveal the global cost function through analyzing the local ones.

Single-link criterion is the distance between the two most similar objects in two clusters. The goal of single-link is to find clusters with the highest connectivity. Two objects in a cluster can be far away but connected through other points in the cluster. The cost function is the sum of the costs of spanning trees of individual clusters. Single-link can be related to Kruskal's algorithm, which is known to be optimal for MST. It can be shown that k clusters correspond to the MST forest of k trees.

Complete-link criterion is the distance between the two most dissimilar objects in two clusters. Complete-link aims at finding homogeneous clusters so that the maximum distance between the objects in each cluster is minimized. Once two new clusters are merged, the resulting distance is the maximum distance over all

clusters, which indicates the worst cluster. Given a clustering, the largest pairwise distance in each cluster is determined. The overall cost function is the maximum of the largest distances from all clusters. We call this cost function MAX-MAX. Agglomerative clustering using the complete-link criterion does not guarantee the optimal solution for the MAX-MAX cost, see Figure 4.4.

The average-link criterion selects the two clusters for which the average distance between all pairs of objects in them is minimum. The corresponding cost function is therefore defined over all pairwise distances.

The centroid-link criterion is the distance between the centroids of two clusters. It can be used only for data in which the centroids of clusters can be derived.

Ward's criterion selects the clusters to be merged that result in a minimum increase in TSE [36]. The increase in TSE resulting from merging two clusters i and j is calculated as:

$$\Delta \mathrm{TSE} = \frac{n_i n_j}{n_i + n_j} \lVert c_i - c_j \rVert^2 \qquad (4.6)$$

where $c_i$ and $c_j$ are the centroids, and $n_i$ and $n_j$ are the numbers of objects in the two clusters.

Figure 4.3: Distance between two clusters (panels: single-link, complete-link, average-link)
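As a small check of Ward's merge cost, the sketch below compares the closed-form increase with the increase obtained by recomputing TSE directly. It is only an illustration: the helper names (`centroid`, `tse`, `ward_merge_cost`) and the toy clusters are assumptions made here, not part of the thesis.

```python
def centroid(points):
    """Mean vector of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def tse(points):
    """Total squared error of one cluster around its own centroid."""
    c = centroid(points)
    return sum(sum((p[d] - c[d]) ** 2 for d in range(len(c))) for p in points)


def ward_merge_cost(a, b):
    """Closed-form increase in TSE caused by merging clusters a and b."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


# Hypothetical toy clusters (not from the thesis).
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 1.0), (5.0, 3.0), (8.0, 2.0)]

closed_form = ward_merge_cost(a, b)
direct = tse(a + b) - tse(a) - tse(b)   # TSE after the merge minus TSE before
print(closed_form, direct)
```

Because the merge cost needs only the two centroids and the two cluster sizes, an agglomerative algorithm can evaluate candidate merges without revisiting the individual objects.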

Figure 4.4: Complete-link agglomerative clustering (left) results in a higher value of the MAX-MAX cost function than the random swap algorithm (right). The numbers show the order of merges.
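Returning to the k-nearest neighbor connectivity cost of Section 4.4, the following sketch shows one way the cost could be computed. It assumes Euclidean distance and a brute-force neighbor search; the function names and the toy data are hypothetical. The first call reproduces the worked example in which the fourth and fifth neighbors lie in another cluster.

```python
import math


def neighbor_penalty(own_label, neighbor_labels):
    """Penalty of one object; neighbor_labels are ordered nearest first."""
    return sum(1.0 / j
               for j, lab in enumerate(neighbor_labels, start=1)
               if lab != own_label)


def knn_connectivity(points, labels, k):
    """Sum of the penalties of all objects (lower is better)."""
    total = 0.0
    for i, p in enumerate(points):
        # Rank all other objects by Euclidean distance to p (brute force).
        others = sorted((q for q in range(len(points)) if q != i),
                        key=lambda q: math.dist(p, points[q]))
        total += neighbor_penalty(labels[i], [labels[q] for q in others[:k]])
    return total


# Worked example: neighbors 4 and 5 belong to the other cluster.
example = neighbor_penalty("A", ["A", "A", "A", "B", "B"])

# Hypothetical one-dimensional toy data with two clusters.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
labels = [0, 0, 0, 1, 1]
cost = knn_connectivity(points, labels, k=2)
print(example, cost)
```

In the toy data, the two objects of the small cluster each have their second neighbor in the other cluster, so each contributes 1/2 to the cost.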

5 Internal validity indices

Clustering is defined as an optimization problem in which the quality is evaluated directly from the optimization criterion. A straightforward criterion works with a fixed number of clusters k. Internal validity indices extend this to variable k.

5.1 INTERNAL INDICES

Internal indices use a clustering and the underlying data set to assess the quality of the clustering [37]. They are designed based on the goal of clustering: placing similar objects in the same cluster and dissimilar objects in different clusters. Accordingly, two concepts are defined: intra-cluster similarity and inter-cluster similarity. Intra-cluster similarity (e.g. compactness, connectedness, and homogeneity) measures the similarity of the objects within a cluster, and inter-cluster similarity or separation measures how distant individual clusters (or their objects) are.

Compactness is suitable for clustering algorithms that tend to provide spherical clusters. Examples include centroid-based clustering algorithms such as K-means, and average-link agglomerative clustering. Connectedness is suitable for density-based algorithms such as DBSCAN [37]. Several variants of compactness and connectedness exist. The average of pairwise intra-cluster distances and the average of centroid-based similarities are representatives of compactness. A popular measure of connectedness is k-nearest neighbor connectivity, which counts violations of nearest neighbor relationships [37].

A good clustering of a data set is expected to provide well-separated clusters [38]. Separation is defined in different ways. Three common methods are the distance between the closest objects, the most distant objects, and the centers of two clusters [39].

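Several internal indices have been proposed that combine compactness and separation [3] [37] [39] [40] [4] [4]. Popular indices are listed in Table 5.1. Most of the indices have been invented for determining the number of clusters that fits the data.

Table 5.1: Selection of popular internal validity indices

SSW [43]: $\mathrm{SSW} = \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2$, where $c_{p_i}$ is the centroid of the cluster of $x_i$

SSB [43]: $\mathrm{SSB} = \sum_{k=1}^{K} n_k \lVert c_k - \bar{x} \rVert^2$

Calinski-Harabasz [44]: $\mathrm{CH} = \dfrac{\mathrm{SSB}/(K-1)}{\mathrm{SSW}/(N-K)}$

Ball&Hall [45]: $\mathrm{BH} = \mathrm{SSW}/K$

Xu-index [46]: $\mathrm{Xu} = D \log_2\!\left(\sqrt{\mathrm{SSW}/(D N^2)}\right) + \log K$

Dunn's index [47]: $\mathrm{DI} = \dfrac{\min_{i=1..K} \min_{j=i+1..K} d(c_i, c_j)}{\max_{k=1..K} \mathrm{diam}(c_k)}$, where $d(c_i, c_j) = \min_{x \in c_i,\, x' \in c_j} \lVert x - x' \rVert$ and $\mathrm{diam}(c_k) = \max_{x, x' \in c_k} \lVert x - x' \rVert$

Davies&Bouldin [48]: $\mathrm{DBI} = \dfrac{1}{K} \sum_{i=1}^{K} R_i$, where $R_i = \max_{j=1..K,\, j \neq i} R_{ij}$, $R_{ij} = \dfrac{\mathrm{MSE}_i + \mathrm{MSE}_j}{\lVert c_i - c_j \rVert^2}$ and $\mathrm{MSE}_i = \dfrac{1}{n_i} \sum_{x \in c_i} \lVert x - c_i \rVert^2$

SC [49]: $\mathrm{SC} = \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{b(x_i) - a(x_i)}{\max(a(x_i), b(x_i))}$, where $a(x_i) = \dfrac{1}{n_p - 1} \sum_{x_j \in C_p,\, j \neq i} \lVert x_i - x_j \rVert$ and $b(x_i) = \min_{q \neq p} \dfrac{1}{n_q} \sum_{x_j \in C_q} \lVert x_i - x_j \rVert$ for $x_i \in C_p$

BIC [43]: $\mathrm{BIC} = L \cdot N - \dfrac{1}{2} K (D+1) \sum_{i=1}^{K} \log(n_i)$

Xie-Beni [50]: $\mathrm{XB} = \dfrac{\sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik}^2 \lVert x_i - c_k \rVert^2}{N \cdot \min_{t \neq s} \lVert c_t - c_s \rVert^2}$

WB [5]: $\mathrm{WB} = K \cdot \mathrm{SSW} / \mathrm{SSB}$

5.2 SUM OF SQUARES WITHIN CLUSTERS (SSW)

Sum of squares within clusters (SSW) [43], or within-cluster variance, is equal to the TSE, see Figure 5.1. The index can only be used for numerical data because it requires the centroids of clusters. SSW measures the compactness of clusters and is suitable for centroid-based clustering, where hyperspherical clusters are desired. The value of SSW always decreases as the number of clusters increases.

Figure 5.1: Illustration of the sum of squares within clusters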
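The sum-of-squares entries of the table can be illustrated with a short sketch. It assumes numerical data stored as tuples, centroids computed as cluster means, and one cluster label per object; the function names and the toy data are hypothetical and chosen only so that the values are easy to verify by hand.

```python
def centroid(points):
    """Mean vector of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def ssw_ssb(points, labels):
    """Return (SSW, SSB) for a partition given as one label per object."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    overall = centroid(points)
    ssw = ssb = 0.0
    for members in clusters.values():
        c = centroid(members)
        ssw += sum(sq_dist(p, c) for p in members)   # within-cluster error
        ssb += len(members) * sq_dist(c, overall)    # size-weighted separation
    return ssw, ssb


def calinski_harabasz(points, labels):
    """Ratio [SSB / (K - 1)] / [SSW / (N - K)]; needs 1 < K < N and SSW > 0."""
    ssw, ssb = ssw_ssb(points, labels)
    k, n = len(set(labels)), len(points)
    return (ssb / (k - 1)) / (ssw / (n - k))


# Hypothetical toy data: two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
ssw, ssb = ssw_ssb(points, labels)
ch = calinski_harabasz(points, labels)
print(ssw, ssb, ch)
```

With centroids defined as cluster means, SSW and SSB add up to the total sum of squares around the overall mean, which is a convenient sanity check for an implementation.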

5.3 SUM OF SQUARES BETWEEN CLUSTERS (SSB)

The sum of squares between clusters (SSB) [43] measures the degree of separation between clusters by calculating the between-cluster variance. The separation between clusters is determined according to the distances of the centroids to the mean vector of all objects, see Figure 5.2. The cluster size n that appears as a factor in the formula presented in Table 5.1 indicates that a larger cluster has more impact on the index. This criterion requires the centroids or prototypes of clusters and all data. Increasing the number of clusters usually results in a larger SSB value.

Figure 5.2: Illustration of the sum of squares between clusters.

5.4 CALINSKI-HARABASZ INDEX (CH)

The Calinski-Harabasz (CH) index [44] uses the ratio of separation to compactness to provide the best possible separation and compactness simultaneously. A maximum of the index value indicates the best clustering, with high separation and low error in compactness. A higher number of clusters for a data set provides higher SSB and lower SSW. However, SSW decreases more than SSB increases. Therefore, the penalty factor (K-1) prevents the selection of a higher number of clusters than the correct one. The term N-K is included to support cases in which the number of clusters is comparable to

the total number of objects. However, N is usually much higher than K, and the term can be shortened to N. This index, similar to SSB and SSW, is limited to numerical data with hyperspherical clusters.

5.5 SILHOUETTE COEFFICIENT (SC)

The silhouette coefficient (SC) [49] measures how well each object is placed in its cluster and separated from the objects in other clusters. The average dissimilarity of each object $x_i$ to all objects in the same cluster is calculated as $a(x_i)$, which indicates how well $x_i$ is assigned to its cluster. The lowest average dissimilarity of $x_i$ to the other clusters is calculated as $b(x_i)$.

$$\mathrm{SC} = \frac{1}{N} \sum_{i=1}^{N} \frac{b(x_i) - a(x_i)}{\max(a(x_i), b(x_i))} \qquad (5.1)$$

The dissimilarity between two objects is sufficient for calculating the index. Therefore, SC can be used for any type of data and any clustering structure.

5.6 DUNN FAMILY OF INDICES

The Dunn index [47] is defined as follows:

$$\mathrm{DI} = \frac{\min_{i=1..K} \min_{j=i+1..K} d(c_i, c_j)}{\max_{k=1..K} \mathrm{diam}(c_k)} \qquad (5.2)$$

where $d(c_i, c_j)$ is the dissimilarity between two clusters and $\mathrm{diam}(c_k) = \max d(x, x')$ is the diameter of cluster $c_k$, where $x, x' \in c_k$. The numerator of the equation is a measure of separation, the distance between the two closest clusters. The diameter of a cluster shows the dispersion (the opposite of compactness) of the cluster. The cluster with the maximum diameter is considered. A larger value of the index indicates a better clustering of a data set, with more compact and well-separated clusters.
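Both indices above need only pairwise dissimilarities, which the following sketch makes explicit. It is an illustration under stated assumptions: Euclidean distance stands in for the dissimilarity, the cluster dissimilarity in the Dunn index is taken as the distance between the closest objects, objects in singleton clusters contribute zero to the silhouette, and at least two clusters are present. The function names and toy data are hypothetical.

```python
import math
from itertools import combinations


def silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all objects."""
    n = len(points)
    total = 0.0
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if j != i:
                lab = labels[j]
                sums[lab] = sums.get(lab, 0.0) + math.dist(points[i], points[j])
                counts[lab] = counts.get(lab, 0) + 1
        own = labels[i]
        if own not in counts:
            continue  # assumption: an object alone in its cluster scores 0
        a = sums[own] / counts[own]
        b = min(sums[lab] / counts[lab] for lab in sums if lab != own)
        total += (b - a) / max(a, b)
    return total / n


def dunn(points, labels):
    """Smallest between-cluster distance divided by the largest diameter."""
    pairs = list(combinations(zip(points, labels), 2))
    separation = min(math.dist(p, q) for (p, lp), (q, lq) in pairs if lp != lq)
    diameter = max(math.dist(p, q) for (p, lp), (q, lq) in pairs if lp == lq)
    return separation / diameter


# Hypothetical toy data: two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
sc = silhouette(points, labels)
di = dunn(points, labels)
print(sc, di)
```

Both functions compare all pairs of objects, so the running time grows quadratically with the number of objects in this brute-force form.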

The Dunn index is sensitive to noise and has a high time complexity [5]. Three related indices have been introduced in [5] based on the Dunn index to alleviate these limitations. They are called Dunn-like indices.

5.7 SOLVING NUMBER OF CLUSTERS

To determine the number of clusters, clustering is applied to the data set for a range of k ∈ [Kmin, Kmax], and the validity index values are calculated. The best number of clusters k* is selected according to the extremum of the validity index. Figure 5.3 shows data set S with 5 clusters and the normalized values of SSW and SSB. The random swap clustering algorithm [5] is applied while the number of clusters is varied in the range [, 5].

Figure 5.3: Data set S (left), and the measured values of SSW and SSB (right)

The error in compactness measured by SSW decreases, and the separation measured by SSB increases, as the number of clusters increases. However, the rates of decrease and increase drop significantly after k=5, a knee point that indicates the correct number of clusters. Although several methods for detecting the knee point have been summarized in [43], none of them works in all cases. It would be easier to use a validity

index that provides a clear minimum or maximum value at the correct number of clusters. For example, CH [44] provides a maximum by considering both SSW and SSB, and also a penalty factor on the number of clusters k, see Figure 5.4.

Figure 5.4: Determining the number of clusters for the data set S using the CH index

Most of the existing internal indices require the prototypes of the clusters, but these are not always easy to calculate, such as in a clustering of words based on their semantic similarity. In [P4], we introduce a new internal index to be used for determining the number of clusters in a hierarchical clustering of words. To find out which level of the hierarchy provides the best categorization of the data, an internal index needs to evaluate the compactness within clusters and the separation between clusters at each level. We define the proposed index as the ratio of compactness and separation:

$$\mathrm{SC}(k) = \frac{C(k)}{S(k)} \qquad (5.3)$$

$$C(k) = \max_{t=1..k} \Big\{ \max_{w_i, w_j \in c_t,\, w_i \neq w_j} JC(w_i, w_j) \Big\} + \frac{I}{N} \qquad (5.4)$$

$$S(k) = \frac{\sum_{s<t} \min_{w_i \in c_s,\, w_j \in c_t} JC(w_i, w_j)}{k(k-1)/2} \qquad (5.5)$$

where $w_i$ is the ith keyword, $c_t$ is cluster t at the level of the hierarchy where the number of clusters is k, JC is the Jiang & Conrath function that measures the distance between two words, I is the number of clusters with only one word, and N is the total number of words.

Compactness measures the maximum pairwise distance in each cluster and takes the maximum value among all clusters. Compactness for clusters with a single object cannot be considered zero, because the clustering in which each object is in its own cluster would then result in the best compactness. To avoid this, we add the term I/N to the compactness equation. At the beginning of clustering, when each object belongs to its own cluster, the compactness equals 1 because I=N.

Separation measures the minimum distance between the words of every two clusters and sums up the values. Normalization by k(k-1)/2 provides a value on the same scale as compactness. A good clustering provides a small distance value for compactness and a large distance value for separation. Therefore, the level of the hierarchy with k clusters that results in the minimum SC is selected as the best level.
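The selection of the best hierarchy level can be sketched as follows. This is an illustration only: the Jiang & Conrath distance is replaced by a made-up distance derived from hypothetical one-dimensional word positions, the three hierarchy levels are invented, and the function names are assumptions. Each level must contain at least two clusters.

```python
from itertools import combinations


def compactness(clusters, dist, n_words):
    """Largest within-cluster pairwise distance plus I/N (I = singleton clusters)."""
    singletons = sum(1 for c in clusters if len(c) == 1)
    worst = max((dist(u, v) for c in clusters for u, v in combinations(c, 2)),
                default=0.0)
    return worst + singletons / n_words


def separation(clusters, dist):
    """Closest word pair of every two clusters, averaged over k(k-1)/2 pairs."""
    k = len(clusters)
    closest = [min(dist(u, v) for u in s for v in t)
               for s, t in combinations(clusters, 2)]
    return sum(closest) / (k * (k - 1) / 2)


def level_index(clusters, dist, n_words):
    """Ratio of compactness to separation; the minimum marks the best level."""
    return compactness(clusters, dist, n_words) / separation(clusters, dist)


# Hypothetical stand-in for a semantic word distance (NOT Jiang & Conrath).
position = {"cat": 0.0, "dog": 0.1, "car": 0.8, "bus": 0.9}


def dist(u, v):
    return abs(position[u] - position[v])


# Hypothetical levels of a hierarchy, keyed by the number of clusters k.
levels = {
    4: [["cat"], ["dog"], ["car"], ["bus"]],
    3: [["cat", "dog"], ["car"], ["bus"]],
    2: [["cat", "dog"], ["car", "bus"]],
}
scores = {k: level_index(c, dist, len(position)) for k, c in levels.items()}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

In this toy hierarchy the level with two clusters gives the smallest ratio, while the all-singleton level is penalized because its compactness equals 1.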

6 External validity indices

External validity indices measure how well the results of a clustering match the ground truth (if available) or another clustering [53] [P]. They are the criteria for testing and evaluating clustering results and for the analysis of clustering tendency in a data set. Some authors define an external index for comparing a clustering with ground truth [4] [37] and define a relative index for comparing two clusterings of a data set [3] [5]. However, many others classify both as external indices. External indices have been used in ensemble clustering [40] [54] [55] [56], genetic algorithms [57], and evaluating the stability of k-means [55].

In this section, we first introduce several properties for a validity index based on which its performance can be evaluated. We then provide a review of the external indices in three categories: pair-counting, information theoretic, and set-matching, see Table 6.1 [P]. Finally, we describe our new setup of experiments for evaluating the external indices.

Given two partitions $P = \{P_1, P_2, \ldots, P_K\}$ of K clusters and $G = \{G_1, G_2, \ldots, G_{K'}\}$ of K' clusters, an external validity index measures the similarity between P and G. Most external indices are derived using the values in the contingency table of P and G, see Table 6.2. The table is a matrix where $n_{ij}$ is the number of objects that are both in clusters $P_i$ and $G_j$: $n_{ij} = |P_i \cap G_j|$, and $n_i$ and $m_j$ are the sizes of clusters $P_i$ and $G_j$, respectively.

Table 6.1: External validity indices

Pair-counting measures

Rand index [58]: $\mathrm{RI} = \dfrac{a + d}{N(N-1)/2}$

Adjusted Rand index [59]: $\mathrm{ARI} = \dfrac{\mathrm{RI} - E(\mathrm{RI})}{1 - E(\mathrm{RI})}$

Information theoretic measures

Mutual information [60]: $\mathrm{MI} = \sum_{i=1}^{K} \sum_{j=1}^{K'} \dfrac{n_{ij}}{N} \log \dfrac{N n_{ij}}{n_i m_j}$

Normalized mutual information [60]: $\mathrm{NMI} = \dfrac{\mathrm{MI}(P, G)}{(H(P) + H(G))/2}$ or $\mathrm{NMI} = \dfrac{\mathrm{MI}(P, G)}{\sqrt{H(P) H(G)}}$

Normalized Variation of Information [6]: $\mathrm{NVI} = \dfrac{H(P) + H(G) - 2\,\mathrm{MI}(P, G)}{H(P) + H(G)}$

Set-matching measures

F measure [6]: $\mathrm{FM} = \dfrac{1}{N} \sum_{i=1}^{K} n_i \max_{j} \dfrac{2 n_{ij}}{n_i + m_j}$

Criterion H [63]: $H = 1 - \dfrac{1}{N} \max \sum_{i=1}^{K} n_{ij}$, where the maximum is taken over the matchings of clusters i to clusters j = 1..K'

Normalized Van Dongen [64]: $\mathrm{NVD} = \dfrac{2N - \sum_{i=1}^{K} \max_{j} n_{ij} - \sum_{j=1}^{K'} \max_{i} n_{ij}}{2N}$

Purity [5]: $\mathrm{Purity} = \dfrac{1}{N} \sum_{i=1}^{K} \max_{j} n_{ij}$

Centroid index [P]: $\mathrm{CI}_1(P, G) = \sum_{j=1}^{K'} \mathrm{orphan}(G_j)$, $\mathrm{CI}_2(P, G) = \max(\mathrm{CI}_1(P, G), \mathrm{CI}_1(G, P))$

Centroid similarity index [P]: $\mathrm{CSI} = \dfrac{\sum_{i=1}^{K} n_{ij} + \sum_{j=1}^{K'} n_{ij}}{2N}$, where i, j are the indices of matched clusters

Centroid ratio [65]: $\mathrm{CR} = \dfrac{1}{K} \sum_{i=1}^{K} \gamma_i$, where $\gamma_i = 1$ for an unstable pair and $\gamma_i = 0$ for a stable pair

Pair sets index [P]: $\mathrm{PSI} = \begin{cases} \dfrac{S - E(S)}{\max(K, K') - E(S)}, & S \geq E(S),\ \max(K, K') > 1 \\ 0, & S < E(S) \\ 1, & K = K' = 1 \end{cases}$, where $S = \sum_{i=1}^{\min(K, K')} \dfrac{n_{ij}}{\max(n_i, m_j)}$ and i, j are the indices of paired clusters
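The contingency table and two of the listed indices can be sketched in a few lines. The sketch assumes that each partition is given as a list with one cluster label per object, and it takes a and d in the Rand index to be the numbers of object pairs that are together in both partitions and apart in both partitions, respectively. The function names and the toy labelings are hypothetical.

```python
from collections import Counter
from itertools import combinations


def contingency(p, g):
    """Map (i, j) to the number of objects in cluster i of P and cluster j of G."""
    return Counter(zip(p, g))


def pairs(count):
    """Number of unordered pairs among `count` objects."""
    return count * (count - 1) // 2


def pair_counts(p, g):
    """Pair counts (a, b, c, d) derived from the contingency table."""
    a = sum(pairs(v) for v in contingency(p, g).values())
    same_p = sum(pairs(v) for v in Counter(p).values())   # pairs together in P
    same_g = sum(pairs(v) for v in Counter(g).values())   # pairs together in G
    b, c = same_p - a, same_g - a
    d = pairs(len(p)) - a - b - c
    return a, b, c, d


def rand_index(p, g):
    """Share of object pairs on which the two partitions agree."""
    a, b, c, d = pair_counts(p, g)
    return (a + d) / pairs(len(p))


def purity(p, g):
    """Sum over the clusters of P of their largest overlap with G, divided by N."""
    best = {}
    for (i, j), v in contingency(p, g).items():
        best[i] = max(best.get(i, 0), v)
    return sum(best.values()) / len(p)


# Hypothetical labelings of six objects.
p = [0, 0, 0, 1, 1, 1]
g = [0, 0, 1, 1, 1, 1]
counts = pair_counts(p, g)
ri = rand_index(p, g)
pur = purity(p, g)
print(counts, ri, pur)
```

Deriving the pair counts from the table avoids enumerating all N(N-1)/2 pairs, which matters for large data sets; the brute-force enumeration remains useful as a test.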

Table 6.2: Contingency table for two partitions P and G

         G_1     G_2     ...    G_K'     Sum
P_1      n_11    n_12    ...    n_1K'    n_1
P_2      n_21    n_22    ...    n_2K'    n_2
...      ...     ...     ...    ...      ...
P_K      n_K1    n_K2    ...    n_KK'    n_K
Sum      m_1     m_2     ...    m_K'     N

6.1 DESIRED PROPERTIES

An external validity index needs to satisfy several properties to be consistent and comparable for different data sets and clustering structures.

Normalization transforms the index to a fixed range, for example [0, 1], which makes comparison easier for data sets of different size and structure. Normalization is the most commonly agreed property in the clustering community [66], and is usually performed as:

$$I_d^{n}(P, G) = \frac{I_d(P, G) - \min(I_d)}{\max(I_d) - \min(I_d)} \qquad (6.1)$$

where $\min(I_d)$ and $\max(I_d)$ are the minimum and maximum values of $I_d$.

Index values are expected to be constant when different random clusterings are compared with a ground truth [59]. A random partition is created by selecting a random number of clusters of random size. The similarity between the random partition and the ground truth originates merely by chance. Take the Rand index as an example: the value of the index for two random partitions is not a constant, and is in a narrower range of [0.5, 1] instead of [0, 1]. By correction for chance or adjustment, the expected value of an index E(I) is transformed to zero (similarity) or one (dissimilarity) [59] [67]. Adjustment and normalization can be performed jointly as follows:

$$\text{Dissimilarity:} \quad I_d^{adj}(P, G) = \frac{I_d(P, G) - \min(I_d)}{E(I_d) - \min(I_d)} \qquad \text{Similarity:} \quad I_s^{adj}(P, G) = \frac{I_s(P, G) - E(I_s)}{\max(I_s) - E(I_s)} \qquad (6.2)$$

where the minimum (similarity) or maximum (dissimilarity) is replaced by the expected value E(I).

Metric property has also been considered. Although a similarity/dissimilarity measure can be effective without being a metric [7], a metric is sometimes preferred. Considering a dissimilarity index $I_d$ and clusterings $P_1$, $P_2$ and $P_3$, the metric properties require [] [68]:

1. Non-negativity: $I_d(P_1, P_2) \geq 0$
2. Reflexivity: $I_d(P_1, P_2) = 0$ if and only if $P_1 = P_2$
3. Symmetry: $I_d(P_1, P_2) = I_d(P_2, P_1)$
4. Triangular inequality: $I_d(P_1, P_2) + I_d(P_2, P_3) \geq I_d(P_1, P_3)$

A similarity metric satisfies the following []:

1. Limited range: $I_s(P_1, P_2) \leq I_0 < \infty$
2. Reflexivity: $I_s(P_1, P_2) = I_0$ if and only if $P_1 = P_2$
3. Symmetry: $I_s(P_1, P_2) = I_s(P_2, P_1)$
4. Triangular inequality: $I_s(P_1, P_2) \times I_s(P_2, P_3) \leq I_s(P_1, P_3) \times (I_s(P_1, P_2) + I_s(P_2, P_3))$

The triangular inequality for a similarity index $I_s$ is derived here from the corresponding inequality for a dissimilarity index defined as $c/I_s$ (c > 0). However, other forms of the inequality are possible by defining other dissimilarities such as $\max(I_s) - I_s$. It is trivial to show that if $c/I_s$ (or $\max(I_s) - I_s$) is a dissimilarity metric, $I_s$ is a similarity metric as well []. Hence, metric properties for a similarity index can be checked for its corresponding dissimilarity [P].

Cluster size imbalance signifies that a data set can include clusters with a large difference in their sizes. Some researchers argue that larger clusters are more important than smaller clusters, but we assume that each cluster has the same importance independent of its size. Invariance to the size of clusters is therefore another desired property of an index. The size of a data set should not affect the index either [P].
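The normalization and the correction for chance described in this section amount to simple rescaling, as the sketch below illustrates. The function names and the numeric values (a raw similarity of 0.9 with an expected value of 0.6) are hypothetical and serve only to show the arithmetic.

```python
def normalize(value, lowest, highest):
    """Map a raw index value from [lowest, highest] onto [0, 1]."""
    return (value - lowest) / (highest - lowest)


def adjust_similarity(value, expected, highest):
    """Similarity form: the expected value maps to 0 and the maximum to 1."""
    return (value - expected) / (highest - expected)


def adjust_dissimilarity(value, expected, lowest):
    """Dissimilarity form: the minimum maps to 0 and the expected value to 1."""
    return (value - lowest) / (expected - lowest)


# Hypothetical numbers: a raw similarity of 0.9 on a scale whose maximum is 1
# and whose expected value for random partitions is assumed to be 0.6.
raw, expected, highest = 0.9, 0.6, 1.0
adjusted = adjust_similarity(raw, expected, highest)
print(adjusted)
```

After adjustment, a value near zero means that the observed similarity is no better than what random partitions would give, which the raw value of 0.9 does not reveal.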

An index should be independent of the number of clusters. Some indices such as the Rand index (RI) [58] give higher similarity when there are more clusters [68]. An index should also be applicable for comparing two clusterings with different numbers of clusters.

Monotonicity is another required property. This property states that the similarity of two clusterings monotonically decreases as their difference increases [P].

Once these desired properties are met, index values for different data sets are on the same scale and comparable. For instance, if an index gives 90% and 70% similarities, 90% should represent higher similarity. However, this is true only if the index is independent of the data set and its clustering structure [P].

6.2 PAIR-COUNTING INDICES

Pair-counting measures count the pairs of points on which two clusterings agree or disagree. For instance, if two objects in one cluster in the first partition are also placed in the same cluster in the second partition, this is considered an agreement. Most existing external validity indices are classified in this group [P]. Four values are defined:

a represents the number of pairs that are in the same cluster both in P and G;
b represents the number of pairs that are in the same cluster in P but in different clusters in G;
c represents the number of pairs that are in different clusters in P but in the same cluster in G;
d represents the number of pairs that are in different clusters both in P and G.

Values a and d count agreements while values b and c count disagreements. Examples of each case are illustrated in Figure 6.1. The values of a, b, c, and d can be calculated from the contingency table [59] as follows:

$$a = \frac{1}{2} \sum_{i=1}^{K} \sum_{j=1}^{K'} n_{ij} (n_{ij} - 1)$$

$$b = \frac{1}{2} \Big( \sum_{i=1}^{K} n_i^2 - \sum_{i=1}^{K} \sum_{j=1}^{K'} n_{ij}^2 \Big)$$


Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

K-means and Hierarchical Clustering

K-means and Hierarchical Clustering Note to other teachers and users of these sldes. Andrew would be delghted f you found ths source materal useful n gvng your own lectures. Feel free to use these sldes verbatm, or to modfy them to ft your

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

Keyword-based Document Clustering

Keyword-based Document Clustering Keyword-based ocument lusterng Seung-Shk Kang School of omputer Scence Kookmn Unversty & AIrc hungnung-dong Songbuk-gu Seoul 36-72 Korea sskang@kookmn.ac.kr Abstract ocument clusterng s an aggregaton of

More information

LECTURE : MANIFOLD LEARNING

LECTURE : MANIFOLD LEARNING LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors

More information

Machine Learning. Topic 6: Clustering

Machine Learning. Topic 6: Clustering Machne Learnng Topc 6: lusterng lusterng Groupng data nto (hopefully useful) sets. Thngs on the left Thngs on the rght Applcatons of lusterng Hypothess Generaton lusters mght suggest natural groups. Hypothess

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-74 Volume 0 Issue BoTechnology 04 An Indan Journal FULL PAPER BTAIJ 0() 04 [684-689] Revew on Chna s sports ndustry fnancng market based on market -orented

More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

Generating Fuzzy Term Sets for Software Project Attributes using and Real Coded Genetic Algorithms

Generating Fuzzy Term Sets for Software Project Attributes using and Real Coded Genetic Algorithms Generatng Fuzzy Ter Sets for Software Proect Attrbutes usng Fuzzy C-Means C and Real Coded Genetc Algorths Al Idr, Ph.D., ENSIAS, Rabat Alan Abran, Ph.D., ETS, Montreal Azeddne Zah, FST, Fes Internatonal

More information

Object-Based Techniques for Image Retrieval

Object-Based Techniques for Image Retrieval 54 Zhang, Gao, & Luo Chapter VII Object-Based Technques for Image Retreval Y. J. Zhang, Tsnghua Unversty, Chna Y. Y. Gao, Tsnghua Unversty, Chna Y. Luo, Tsnghua Unversty, Chna ABSTRACT To overcome the

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

Detection of an Object by using Principal Component Analysis

Detection of an Object by using Principal Component Analysis Detecton of an Object by usng Prncpal Component Analyss 1. G. Nagaven, 2. Dr. T. Sreenvasulu Reddy 1. M.Tech, Department of EEE, SVUCE, Trupath, Inda. 2. Assoc. Professor, Department of ECE, SVUCE, Trupath,

More information

Survey of Cluster Analysis and its Various Aspects

Survey of Cluster Analysis and its Various Aspects Harmnder Kaur et al, Internatonal Journal of Computer Scence and Moble Computng, Vol.4 Issue.0, October- 05, pg. 353-363 Avalable Onlne at www.csmc.com Internatonal Journal of Computer Scence and Moble

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

From Comparing Clusterings to Combining Clusterings

From Comparing Clusterings to Combining Clusterings Proceedngs of the Twenty-Thrd AAAI Conference on Artfcal Intellgence (008 From Comparng Clusterngs to Combnng Clusterngs Zhwu Lu and Yuxn Peng and Janguo Xao Insttute of Computer Scence and Technology,

More information

Efficient Segmentation and Classification of Remote Sensing Image Using Local Self Similarity

Efficient Segmentation and Classification of Remote Sensing Image Using Local Self Similarity ISSN(Onlne): 2320-9801 ISSN (Prnt): 2320-9798 Internatonal Journal of Innovatve Research n Computer and Communcaton Engneerng (An ISO 3297: 2007 Certfed Organzaton) Vol.2, Specal Issue 1, March 2014 Proceedngs

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION 1 THE PUBLISHING HOUSE PROCEEDINGS OF THE ROMANIAN ACADEMY, Seres A, OF THE ROMANIAN ACADEMY Volume 4, Number 2/2003, pp.000-000 A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION Tudor BARBU Insttute

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

A Hierarchical Clustering and Validity Index for Mixed Data

A Hierarchical Clustering and Validity Index for Mixed Data Graduate Theses and Dssertatons Graduate College 2012 A Herarchcal Clusterng and Valdty Index for Mxed Data Ru Yang Iowa State Unversty Follow ths and addtonal works at: http://lb.dr.astate.edu/etd Part

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Collaboratively Regularized Nearest Points for Set Based Recognition

Collaboratively Regularized Nearest Points for Set Based Recognition Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,

More information

SCALABLE AND VISUALIZATION-ORIENTED CLUSTERING FOR EXPLORATORY SPATIAL ANALYSIS

SCALABLE AND VISUALIZATION-ORIENTED CLUSTERING FOR EXPLORATORY SPATIAL ANALYSIS SCALABLE AND VISUALIZATION-ORIENTED CLUSTERING FOR EXPLORATORY SPATIAL ANALYSIS J.H.Guan, F.B.Zhu, F.L.Ban a School of Computer, Spatal Informaton & Dgtal Engneerng Center, Wuhan Unversty, Wuhan, 430079,

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Clustering is a discovery process in data mining.

Clustering is a discovery process in data mining. Cover Feature Chameleon: Herarchcal Clusterng Usng Dynamc Modelng Many advanced algorthms have dffculty dealng wth hghly varable clusters that do not follow a preconceved model. By basng ts selectons on

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Clustering Algorithm of Similarity Segmentation based on Point Sorting

Clustering Algorithm of Similarity Segmentation based on Point Sorting Internatonal onference on Logstcs Engneerng, Management and omputer Scence (LEMS 2015) lusterng Algorthm of Smlarty Segmentaton based on Pont Sortng Hanbng L, Yan Wang*, Lan Huang, Mngda L, Yng Sun, Hanyuan

More information

LinkSelector: A Web Mining Approach to. Hyperlink Selection for Web Portals

LinkSelector: A Web Mining Approach to. Hyperlink Selection for Web Portals nkselector: A Web Mnng Approach to Hyperlnk Selecton for Web Portals Xao Fang and Olva R. u Sheng Department of Management Informaton Systems Unversty of Arzona, AZ 8572 {xfang,sheng}@bpa.arzona.edu Submtted

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

Overview. Basic Setup [9] Motivation and Tasks. Modularization 2008/2/20 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION

Overview. Basic Setup [9] Motivation and Tasks. Modularization 2008/2/20 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION Overvew 2 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION Introducton Mult- Smulator MASIM Theoretcal Work and Smulaton Results Concluson Jay Wagenpfel, Adran Trachte Motvaton and Tasks Basc Setup

More information

USING GRAPHING SKILLS

USING GRAPHING SKILLS Name: BOLOGY: Date: _ Class: USNG GRAPHNG SKLLS NTRODUCTON: Recorded data can be plotted on a graph. A graph s a pctoral representaton of nformaton recorded n a data table. t s used to show a relatonshp

More information

Web Mining: Clustering Web Documents A Preliminary Review

Web Mining: Clustering Web Documents A Preliminary Review Web Mnng: Clusterng Web Documents A Prelmnary Revew Khaled M. Hammouda Department of Systems Desgn Engneerng Unversty of Waterloo Waterloo, Ontaro, Canada 2L 3G1 hammouda@pam.uwaterloo.ca February 26,

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Document Representation and Clustering with WordNet Based Similarity Rough Set Model

Document Representation and Clustering with WordNet Based Similarity Rough Set Model IJCSI Internatonal Journal of Computer Scence Issues, Vol. 8, Issue 5, No 3, September 20 ISSN (Onlne): 694-084 www.ijcsi.org Document Representaton and Clusterng wth WordNet Based Smlarty Rough Set Model

More information

Face Recognition University at Buffalo CSE666 Lecture Slides Resources:

Face Recognition University at Buffalo CSE666 Lecture Slides Resources: Face Recognton Unversty at Buffalo CSE666 Lecture Sldes Resources: http://www.face-rec.org/algorthms/ Overvew of face recognton algorthms Correlaton - Pxel based correspondence between two face mages Structural

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

A Multi-step Strategy for Shape Similarity Search In Kamon Image Database

A Multi-step Strategy for Shape Similarity Search In Kamon Image Database A Mult-step Strategy for Shape Smlarty Search In Kamon Image Database Paul W.H. Kwan, Kazuo Torach 2, Kesuke Kameyama 2, Junbn Gao 3, Nobuyuk Otsu 4 School of Mathematcs, Statstcs and Computer Scence,

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment

A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment A Webpage Smlarty Measure for Web Sessons Clusterng Usng Sequence Algnment Mozhgan Azmpour-Kv School of Engneerng and Scence Sharf Unversty of Technology, Internatonal Campus Ksh Island, Iran mogan_az@ksh.sharf.edu

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information