
Advanced Algorithmics: Clustering
Jaak Vilo, 2009 Spring

Topics:
- What is clustering
- Hierarchical clustering
- K-means + K-medoids
- SOM
- Fuzzy, EM

Unsupervised vs. Supervised
- Clustering: find groups inherent to the data
- Classification: find a classifier for known classes
- An old problem; many methods; no single method best suits all needs

Vehicle Example / Vehicle Clusters
A table of vehicles (V1 to V9) with the features: top speed (km/h), colour (red, black, gray, blue, white, ...), air resistance, weight (kg). [Numeric values lost in transcription.] Plotted in the weight [kg] vs. top speed [km/h] plane, the vehicles fall into three groups: lorries (heavy, slow), medium market cars, and sports cars (light, fast).

Terminology
In the weight vs. top speed plot, each axis is a feature, the axes together span the feature space, each vehicle is an object (data point), and each group (lorries, medium market cars, sports cars) carries a cluster label.

Motivation: Why Clustering?
Problem: identify a (small number of) groups of similar objects in a given (large) set of objects.
Goals:
- Find representatives for homogeneous groups: data compression
- Find natural clusters and describe their properties: natural data types
- Find suitable and useful groupings: useful data classes
- Find unusual data objects: outlier detection

Clustering is easy (for humans): edge detection (advantage to smooth contours), texture clustering.

Clustering, cont.
Distance measures: which two profiles are similar to each other?
- Euclidean, Manhattan, etc.
- Rank correlation
- Correlation, angle, etc.
- Time warping

How do we formally describe which objects are close to each other and which are not? There is more than one way to define distances. A distance d is a metric if:
- d(x, x) = 0
- d(x, y) = d(y, x) >= 0
- d(A, B) <= d(A, C) + d(C, B) (triangle inequality)

Some standard distance measures, for profiles f and g over c coordinates:
- Euclidean distance: d(f, g) = \sqrt{\sum_{i=1}^{c} (f_i - g_i)^2}
- Euclidean squared: d(f, g) = \sum_{i=1}^{c} (f_i - g_i)^2
- Manhattan (city-block): d(f, g) = \sum_{i=1}^{c} |f_i - g_i|
- Average distance: d(f, g) = \sqrt{\frac{1}{c} \sum_{i=1}^{c} (f_i - g_i)^2}
- Pearson correlation: r(f, g) = \frac{\sum_{i=1}^{c} (f_i - \bar{f})(g_i - \bar{g})}{\sqrt{\sum_{i=1}^{c} (f_i - \bar{f})^2 \sum_{i=1}^{c} (g_i - \bar{g})^2}}. If the means of each column are 0, this becomes \cos\theta = \frac{\sum_{i=1}^{c} f_i g_i}{\sqrt{\sum_{i=1}^{c} f_i^2} \sqrt{\sum_{i=1}^{c} g_i^2}}
- Chord distance: d(f, g) = \sqrt{2 (1 - \cos\theta)}, the Euclidean distance between two vectors whose lengths have been normalized to 1
(Legendre & Legendre: Numerical Ecology, 2nd ed.)
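The standard measures listed above can be sketched in a few lines of Python. This is a minimal illustration; the function names are chosen here for clarity and are not from any particular library:

```python
import math

def euclidean(f, g):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((fi - gi) ** 2 for fi, gi in zip(f, g)))

def manhattan(f, g):
    """Manhattan (city-block) distance: sum of absolute differences."""
    return sum(abs(fi - gi) for fi, gi in zip(f, g))

def pearson_distance(f, g):
    """1 - Pearson correlation between the two profiles."""
    c = len(f)
    mf, mg = sum(f) / c, sum(g) / c
    num = sum((fi - mf) * (gi - mg) for fi, gi in zip(f, g))
    den = math.sqrt(sum((fi - mf) ** 2 for fi in f)
                    * sum((gi - mg) ** 2 for gi in g))
    return 1.0 - num / den

def chord(f, g):
    """Chord distance: Euclidean distance after normalizing both
    vectors to length 1, equal to sqrt(2 * (1 - cos(theta)))."""
    nf = math.sqrt(sum(fi * fi for fi in f))
    ng = math.sqrt(sum(gi * gi for gi in g))
    cos_theta = sum(fi * gi for fi, gi in zip(f, g)) / (nf * ng)
    return math.sqrt(2.0 * (1.0 - cos_theta))
```

Note that Euclidean and Manhattan are metrics (they satisfy the triangle inequality above), while correlation-based distances in general are not; this matters for the pruning tricks used later in the lecture.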

Rank correlation
d(f, g) = \frac{6 \sum_{i=1}^{c} (\mathrm{rank}(f_i) - \mathrm{rank}(g_i))^2}{c (c^2 - 1)}
Rank: the smallest value has rank 1, the next rank 2, etc. Equal values get the average of their ranks.

Hierarchical clustering
1. Compute the all-against-all distance matrix
2. Linkage strategy: identify the closest clusters and merge them
Performance: O(d n^2) for the distances.

Calculate all pairwise distances and assign each object to a singleton cluster. Keep joining the two closest clusters, using:
- Minimum distance => single linkage
- Maximum distance => complete linkage
- Average distance => average linkage (UPGMA, WPGMA)

While more than 1 cluster remains:
- select the smallest distance
- merge the two clusters
- update the changed distances after the merger

Updating distances: merge Ca and Cb into C, then recalculate all distances D(C, Ci):
- Single link (minimal distance): D(C, Ci) = min{ D(Ci, Ca), D(Ci, Cb) }
- Complete link (maximum distance): D(C, Ci) = max{ D(Ci, Ca), D(Ci, Cb) }
- Average link (UPGMA, Unweighted Pair Group Method with Arithmetic mean): D(C, Ci) = n_a/(n_a + n_b) * D(Ci, Ca) + n_b/(n_a + n_b) * D(Ci, Cb)
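The merge loop and the three linkage update rules can be sketched as follows. This is a naive O(n^3) illustration with names chosen here, not an optimized implementation:

```python
import itertools

def agglomerate(points, distance, linkage="average"):
    """Naive agglomerative clustering; clusters are frozensets of
    point indices. Returns the merge history as (a, b, merge_distance)."""
    clusters = [frozenset([i]) for i in range(len(points))]
    # 1. all-against-all distance matrix between singleton clusters
    d = {frozenset([ca, cb]): distance(points[min(ca)], points[min(cb)])
         for ca, cb in itertools.combinations(clusters, 2)}
    merges = []
    while len(clusters) > 1:
        # 2. identify the two closest clusters and merge them
        ca, cb = min(itertools.combinations(clusters, 2),
                     key=lambda pair: d[frozenset(pair)])
        merges.append((ca, cb, d[frozenset((ca, cb))]))
        merged = ca | cb
        clusters.remove(ca)
        clusters.remove(cb)
        # 3. update only the distances that changed after the merger
        for c in clusters:
            da, db = d[frozenset((c, ca))], d[frozenset((c, cb))]
            if linkage == "single":      # minimum distance
                dn = min(da, db)
            elif linkage == "complete":  # maximum distance
                dn = max(da, db)
            else:                        # UPGMA: size-weighted average
                dn = (len(ca) * da + len(cb) * db) / (len(ca) + len(cb))
            d[frozenset((c, merged))] = dn
        clusters.append(merged)
    return merges
```

The update step touches only the distances involving the new cluster, exactly as on the slide; the overall cost is still dominated by the initial O(n^2) distance matrix and the repeated minimum search.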

Running time for hierarchical clustering
[Figure residue: timing curves for distance computation (10 and 100 attributes) and for clustering over growing data sizes; the numeric labels were lost in transcription.]
O(n^2) distances; then n - 1 times: select the smallest distance, merge, and update all distances to the new cluster.
(Slide credit: Persistent Systems Pvt. Ltd., http://www.persistent.co.in)

Hierarchical clustering output
Design any heatmap colouring scheme; cut the dendrogram; zoom in. (Example: yeast genomes.)

Limits of standard clustering
- Hierarchical clustering is (very) good for visualization (first impression) and browsing
- Speed for modern data sets remains relatively slow (minutes or even hours); the ArrayExpress database needs faster analytical tools
- Hard to predict the number of clusters (=> unsupervised)
- The output can exceed screen resolution (thousands of genes and dozens of experiments vs. a typical monitor or laptop display), so subtrees are collapsed to fit
(Developed and implemented in Expression Profiler in October 2000.)

VisHC, 2009: Fast Approximate Hierarchical Clustering using Similarity Heuristics
Hierarchical clustering is applied in gene expression data analysis, where the number of genes can reach tens of thousands. Each subtree is a cluster; the hierarchy is built by iteratively joining the two most similar clusters into a larger one.

Fast hierarchical clustering: avoid calculating all O(n^2) distances:
- Estimate distances
- Use pivots
- Find close objects
- Cluster with partial information

[Figure residue: input data and its visualization; distances from one pivot; distances from two pivots vs. Euclidean distances; average linkage hierarchical clustering.]

Meelis Kull, Jaak Vilo. Fast Approximate Hierarchical Clustering using Similarity Heuristics. BioData Mining, 1:9, 2008. [HappieClust website] [doi:10.1186/1756-0381-1-9] [PubMed]
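The pivot idea can be illustrated with a small sketch (a simplification written for this text, not the HappieClust code): distances from every object to a few pivots give a cheap lower bound on every pairwise distance via the triangle inequality, |d(x, p) - d(y, p)| <= d(x, y), so pairs that are certainly far apart can be skipped without computing their true distance.

```python
import math
import random

def euclid(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def pivot_lower_bounds(points, n_pivots=2, seed=0):
    """Map each point to its vector of distances from a few random
    pivot objects; return a function giving a lower bound on d(x, y)."""
    rng = random.Random(seed)
    pivots = rng.sample(points, n_pivots)
    coords = [[euclid(x, p) for p in pivots] for x in points]

    def lower_bound(i, j):
        # Chebyshev distance in pivot space never exceeds the true
        # distance, by the triangle inequality applied per pivot.
        return max(abs(a - b) for a, b in zip(coords[i], coords[j]))

    return lower_bound
```

Only pairs whose lower bound falls below the similarity threshold of interest need their true distance computed; all other pairs are pruned.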

Epsilon Grid Order (EGO)
The distances from two pivots are laid out on an epsilon-grid. Here we use the Chebyshev distance (maximum of coordinate differences): by the triangle inequality, the Euclidean distance in the original space cannot be smaller than the Chebyshev distance in pivot space.
1) Data points are sorted according to the EGO order
2) Each point is compared with the later points until they are more than one hypercube away (i.e. a point is compared only with the points in the marked neighbouring hypercubes)

Major Clustering Approaches
- Partitioning algorithms (representative-based / prototype-based clustering): construct various partitions and then evaluate them by some criterion or fitness function (k-means)
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions (DBSCAN, DENCLUE, ...)
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the models to the data (EM)

Representative-Based Clustering
Aims at finding a set of objects (called representatives) among all objects in the data set that best represent the objects in the data set. Each representative corresponds to a cluster; the remaining objects are clustered around these representatives by assigning each object to the cluster of the closest representative.
Remarks:
1. The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm; k-means also shares the characteristics of representative-based clustering, except that the representatives used by k-means do not necessarily have to belong to the data set.
2. If the representatives do not need to belong to the data set, we call the algorithms prototype-based clustering.
K-means is a prototype-based clustering algorithm. K-means and K-medoids partition the data points into K groups, each centered around its mean or medoid: the mean is an abstract point, while the medoid is the most central object of the cluster.

K-means
1. Guess K centres
2. Assign objects to clusters
3. Move each centre to the gravity centre of its cluster

Representative-Based Supervised Clustering
Objective of RSC: find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X), where q is an objective/fitness function.

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e. the mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when there are no more new assignments
[Figure residue: example scatter plots showing assignments and centroid moves over the iterations.]
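The four steps can be sketched as a minimal Lloyd-style k-means (illustrative only; the function and parameter names are chosen here):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means on tuples of numbers; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # step 1: initial seed points
    labels = None
    for _ in range(iters):
        # step 3: assign each object to the cluster with the nearest seed
        new_labels = [min(range(k),
                          key=lambda c: sum((xi - ci) ** 2
                                            for xi, ci in zip(x, centroids[c])))
                      for x in points]
        if new_labels == labels:           # step 4: stop when nothing changes
            break
        labels = new_labels
        # step 2: recompute each centroid as the mean point of its cluster
        for c in range(k):
            members = [x for x, lab in zip(points, labels) if lab == c]
            if members:                    # an empty cluster keeps its centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, labels
```

The empty-cluster guard is one of several possible policies for the complication discussed below; another common choice is to re-seed an empty cluster with the point farthest from its centroid.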

Comments on K-Means
Strengths:
- Relatively efficient: O(t*k*n*d), where n is the number of objects, k the number of clusters, t the number of iterations, and d the number of dimensions. Usually d, k, t << n; in this case, k-means' runtime is O(n).
- Storage is only O(n): in contrast to other representative-based algorithms, it only computes distances between centroids and objects in the dataset, not between objects in the dataset; therefore, the distance matrix does not need to be stored.
- Easy to use; well studied; we know what to expect.
- Finds a local optimum of the SSE fitness function. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Implicitly uses a fitness function (finds a local minimum for SSE, see later) but does not waste time computing fitness values.
Weaknesses:
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance.
- Sensitive to outliers.
- Not suitable for discovering clusters with non-convex shapes.
- Sensitive to initialization; bad initialization might lead to bad results.

Complication: Empty Clusters (K=3)
[Figure residue.] Assume the k-means initialization assigns the green, blue, and brown points to a single cluster; after centroids are computed and objects are reassigned, it can easily be seen that the brown cluster becomes empty.

Convex Shape Cluster
- Convex shape: if we take two points belonging to a cluster, then all the points on a direct line connecting these two points must also be in the cluster.
- The shapes of k-means/k-medoids clusters are convex polygons.
- The shapes of the clusters of a representative-based clustering algorithm can be computed as a Voronoi diagram for the set of cluster representatives. Voronoi cells are always convex, but there are convex shapes that are different from those of Voronoi cells.

Voronoi Diagram for a Representative-Based Clustering
Each cell contains one representative (e.g. a medoid or centroid), and every location within the cell is closer to that representative than to any other. A Voronoi diagram divides the space into such cells.
Voronoi cells define the cluster boundaries!

K-means clustering
- A cluster: the objects closest to a center; new centers: the center of gravity of each cluster
- Start clustering by choosing K centers randomly (or the most distant centers, etc.)
- Iterate the clustering step until no cluster changes
- Deterministic for a fixed initialization; might get stuck in a local minimum
[K-means clustering output; URLMAP example.]

K-means finds a local optimum:
- vary: run many times with random starts
- make an educated guess to start with, e.g. sample the data, perform hierarchical clustering, select K centers

K-medoids
Choose the cluster center to be one of the existing objects. Why? With more complex data or distance measures, the real center cannot be found easily. What is the mean of categorical data such as yellow, red, pink? Instead of trying to invent one, use one of the existing objects, whatever the distance measure.

Self-Organising Maps (SOM)
- An M x N matrix of neurons, each representing a cluster
- Object X is assigned to the weight vector W to which it is most similar; W and its near surroundings are changed to resemble X more
- Train, train, train
- Problem: there is no clear objective function for mapping D-dimensional data onto a 2-dimensional map

Motivation: The Problem Statement
The problem is how to find semantic relationships among lots of information without manual labor. How do I know where to put my new data if I know nothing about the information's topology? When I have a topic, how can I get all the information about it if I don't know where to search?

Motivation: The Idea
The computer classifies information automatically and puts related items together: input patterns of text objects are placed onto a semantic map so that related topics end up close to each other.
(JASS, Information Visualization with SOMs)
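A minimal alternating k-medoids sketch (a simplification of PAM; the names are chosen here): the only operation it needs on the data is the distance function itself, which is exactly why it also works when no mean is definable.

```python
import random

def kmedoids(points, k, distance, iters=50, seed=0):
    """Alternating k-medoids: medoids are always existing objects,
    so any distance measure works. Returns (medoid_indices, labels)."""
    rng = random.Random(seed)
    medoid_idx = rng.sample(range(len(points)), k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assign every object to its closest medoid
        labels = [min(range(k),
                      key=lambda c: distance(x, points[medoid_idx[c]]))
                  for x in points]
        # re-pick each medoid as the most central member of its cluster
        new_idx = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:                    # keep the old medoid if empty
                new_idx.append(medoid_idx[c])
                continue
            new_idx.append(min(members,
                key=lambda i: sum(distance(points[i], points[j])
                                  for j in members)))
        if new_idx == medoid_idx:
            break
        medoid_idx = new_idx
    return medoid_idx, labels
```

Because the medoid update only compares existing objects, the same code runs unchanged on categorical data given a suitable distance (e.g. a mismatch count).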

Self-Organizing Maps: Origins
- Ideas first introduced by C. von der Malsburg (1973), developed and refined by T. Kohonen (1982)
- A neural network algorithm using unsupervised competitive learning
- Primarily used for organization and visualization of complex data
- Biological basis: brain maps (Teuvo Kohonen)

SOM Architecture
- A lattice of neurons (nodes) accepts and responds to a set of input signals
- Responses are compared; a winning neuron is selected from the lattice
- The selected neuron is activated together with its neighbourhood neurons
- An adaptive process changes the weights so that they more closely resemble the inputs
- A 2d array of neurons with weighted synapses w_j1, w_j2, ..., w_jn; the set of input signals x_1, x_2, ..., x_n is connected to all neurons in the lattice

Initialisation
(1) Randomly initialise the weight vectors w_j for all nodes j.

SOM Result Example: Classifying World Poverty (Helsinki University of Technology)
A poverty map based on 39 indicators from World Bank statistics (1992).

Input vector
(2) Choose an input vector x from the training set. For texts, a document is represented by the frequency distribution of its words. A text example: "Self-organizing maps (SOMs) are a data visualization technique invented by Professor Teuvo Kohonen which reduce the dimensions of data through the use of self-organizing neural networks. The problem that data visualization attempts to solve is that humans simply cannot visualize high dimensional data, so techniques are created to help us understand this high dimensional data." (Word counts: self-organizing, maps, data, visualization, technique, invented, Professor, Teuvo Kohonen, dimensions, ...)
Finding a Winner
(3) Find the best-matching neuron i(x), usually the neuron whose weight vector has the smallest Euclidean distance from the input vector x. The winning node is the one that is, in some sense, closest to the input vector. Euclidean distance is the straight-line distance between the data points, as if they were plotted on a (multi-dimensional) graph: for two vectors a = (a_1, a_2, ..., a_n) and b = (b_1, b_2, ..., b_n),
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}

Weight Update
SOM weight update equation:
w_j(t+1) = w_j(t) + \alpha(t) \, \eta_{i(x)}(j, t) \, [x - w_j(t)]
The weights of every node are updated at each cycle by adding to the current weights the product of the current learning rate \alpha(t), the degree of neighbourhood with respect to the winner \eta_{i(x)}(j, t), and the difference between the input vector and the current weights.
- Example of \alpha(t): the learning rate decays with the number of cycles
- Example of \eta_{i(x)}(j, t): the x-axis shows the distance from the winning node, the y-axis the degree of neighbourhood (max. 1)

Example: animal names and their attributes
[Table residue: 16 animals (dove, hen, duck, goose, owl, hawk, eagle, fox, dog, wolf, cat, tiger, lion, horse, zebra, cow) described by binary attributes: size (small / medium / big), 2 legs / 4 legs, has hair / hooves / mane / feathers, likes to hunt / run / fly / swim.] On the trained map, a grouping according to similarity emerges: birds, peaceful species, hunters.
[Teuvo Kohonen, Self-Organizing Maps, Springer]

Clustering etc. algorithms
- Hierarchical clustering methods + visualisation
- K-means, Self-Organising Maps (SOM)
- SOTA trees (Self-Organising Maps + tree)
- Fuzzy, EM (an object can belong to several clusters)
- Graph theory (cliques, strongly connected components)
- Similarity search: find Y for X s.t. d(X, Y) is below a threshold
- Model-based (rediscover distributions)
- Planar embeddings, multidimensional scaling
- Principal Component Analysis
- Correspondence analysis
- Independent Component Analysis

Similarity searches
Query: cyc (cyc, an activator for cyc, a repressor for cyc) => the query genes plus the most similar ones for each = clusters. Expand a tight cluster by the other most similar genes.
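The update equation can be turned into a tiny SOM training loop. This is an illustrative sketch with a Gaussian neighbourhood and linearly decaying schedules; all names and schedule choices here are assumptions, not part of the original slides:

```python
import math
import random

def train_som(data, rows=5, cols=5, epochs=300, lr0=0.5, radius0=2.0, seed=0):
    """Tiny SOM on a rows x cols grid, applying
    w_j(t+1) = w_j(t) + alpha(t) * eta(j, winner, t) * (x - w_j(t))."""
    rng = random.Random(seed)
    dim = len(data[0])
    # (1) randomly initialise the weight vectors for all nodes
    w = {(r, c): [rng.random() for _ in range(dim)]
         for r in range(rows) for c in range(cols)}
    for t in range(epochs):
        frac = t / epochs
        alpha = lr0 * (1 - frac)            # decaying learning rate
        radius = 1 + radius0 * (1 - frac)   # shrinking neighbourhood
        x = rng.choice(data)                # (2) choose an input vector
        # (3) winner: node whose weights are nearest in Euclidean distance
        win = min(w, key=lambda j: sum((wi - xi) ** 2
                                       for wi, xi in zip(w[j], x)))
        # (4) pull the winner and its grid neighbours towards x
        for j, wj in w.items():
            grid_d2 = (j[0] - win[0]) ** 2 + (j[1] - win[1]) ** 2
            eta = math.exp(-grid_d2 / (2 * radius ** 2))
            w[j] = [wi + alpha * eta * (xi - wi) for wi, xi in zip(wj, x)]
    return w
```

Because eta depends on grid distance rather than weight-space distance, nearby nodes on the map end up with similar weight vectors, which is exactly the topology-preserving effect behind the animal and poverty maps above.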

EM: Expectation Maximization
- A popular iterative refinement algorithm; an extension of k-means
- Assign each object to a cluster according to a weight (probability distribution)
- New means/covariances are computed based on the weighted measures
General idea:
- Start with an initial estimate of the parameter vector
- Iteratively rescore the patterns against the mixture density produced by the parameter vector
- The rescored patterns are used to update the parameter estimates
- Patterns belong to the same cluster if their scores place them in the same mixture component
- The algorithm converges fast but may not reach the global optimum

The EM (Expectation Maximization) Algorithm
- Initially, randomly assign k cluster centers
- Iteratively refine the clusters based on two steps:
  - Expectation step: assign each data point X_i to cluster C_j with a probability given by the current mixture model
  - Maximization step: estimate the model parameters

Other Clustering Methods
- PCA (Principal Component Analysis), commonly computed via SVD (Singular Value Decomposition): reduces the dimensionality of gene expression space; finds the best view that helps separate the data into groups
- Supervised methods: SVM (Support Vector Machine): previous knowledge of which genes are expected to cluster is used for training; a binary classifier uses a feature space and a kernel function to define an optimal hyperplane; also used for classification of samples, e.g. expression fingerprinting for disease classification
(Slide credit: Persistent Systems Pvt. Ltd., http://www.persistent.co.in; April 2009)
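The E- and M-steps can be sketched for a one-dimensional Gaussian mixture. This is an illustrative toy under simple assumptions (random data points as initial means, a variance floor to avoid collapse), not a production GMM:

```python
import math
import random

def em_gmm_1d(data, k=2, iters=100, seed=0):
    """EM for a 1-d Gaussian mixture: the E-step assigns each point to
    each component with a probability (its responsibility); the M-step
    re-estimates means, variances, and mixing weights from them."""
    rng = random.Random(seed)
    mu = rng.sample(data, k)        # initial cluster centers
    var = [1.0] * k
    pi = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of component c for each point x
        resp = []
        for x in data:
            dens = [pi[c] / math.sqrt(2 * math.pi * var[c])
                    * math.exp(-(x - mu[c]) ** 2 / (2 * var[c]))
                    for c in range(k)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: weighted re-estimation of the model parameters
        for c in range(k):
            nc = sum(r[c] for r in resp)
            mu[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var[c] = max(1e-6, sum(r[c] * (x - mu[c]) ** 2
                                   for r, x in zip(resp, data)) / nc)
            pi[c] = nc / len(data)
    return mu, var, pi
```

Replacing the soft responsibilities with a hard arg-max assignment and the variances with a fixed value recovers k-means, which is the sense in which EM extends it.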