APPLIED MACHINE LEARNING

Methods for Clustering: K-means, Soft K-means, DBSCAN

Objectives
- Learn basic techniques for data clustering: K-means and soft K-means, GMM (next lecture), DBSCAN.
- Understand the issues and major challenges in clustering: choice of metric, choice of number of clusters.

What is clustering?
Clustering is a type of multivariate statistical analysis, also known as cluster analysis, unsupervised classification analysis, or numerical taxonomy.
Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
Cluster: a collection of data objects that are similar to one another and can thus be treated collectively as one group.

Classification versus Clustering
Supervised classification = classification: we know the class labels and the number of classes.
Unsupervised classification = clustering: we do not know the class labels and may not know the number of classes.

Classification versus Clustering
Unsupervised classification = clustering: a hard problem when no pair of objects has exactly the same features. We need to determine how similar two or more objects are to one another.

Which clusters can you create? Which two subgroups of pictures are similar and why?

What is Good Clustering?
A good clustering method produces high-quality clusters when:
- the intra-class (that is, intra-cluster) similarity is high;
- the inter-class similarity is low.
The quality measure of a cluster depends on the similarity measure used!

Exercise (images: Person 1 with glasses, Person 1 without glasses, Person 2 without glasses, Person 2 with glasses)
Intra-class similarity is the highest when:
a) you choose to classify images with and without glasses
b) you choose to classify images of person 1 against person 2

Exercise (same four groups of images, shown as a projection onto the first two principal components after PCA)
Intra-class similarity is the highest when:
a) you choose to classify images with and without glasses
b) you choose to classify images of person 1 against person 2

Exercise (same four groups of images, projection onto e1 against e2)
The eigenvector e1 is composed of a mix of the main characteristics of the two faces and is hence explanatory of both. However, since both faces have little in common, the two groups have different coordinates on e1 but quasi-identical coordinates for the glasses within each subgroup. Projecting onto e1 hence offers a means to compute a metric of similarity across the two persons.

Exercise (same four groups of images, projection onto e1 against e3)
When projecting onto e1 and e3, we can separate the images of person 1 with and without glasses, as the eigenvector e3 embeds features distinctive of person 1 primarily.

Exercise: Projection onto the first two principal components after PCA
Design a method to find the groups when you no longer have the class labels.

Sensitivity to Prior Knowledge
(Figure: data in axes x1, x2, x3, showing relevant data and outliers (noise).)
Priors: data cluster within a circle; there are 2 clusters.

Sensitivity to Prior Knowledge
(Figure: data in axes x1, x2, x3.)
Priors: data follow a complex distribution; there are 3 clusters.

Cluster Types
Globular clusters: K-means produces globular clusters.
Non-globular clusters: DBSCAN produces non-globular clusters.

What is Good Clustering?
Requirements for good clustering:
- Discovery of clusters with arbitrary shape
- Ability to deal with noise and outliers
- Insensitivity to the ordering of input records
- Scalability
- High dimensionality
- Interpretability and reusability

How to cluster?
(Figure: data in axes x1, x2.)
What choice of model (circle, ellipse) for the cluster? How many models?

K-means Clustering
K-means clustering generates a number K of disjoint clusters to minimize:
$J(\mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \| x_i - \mu_k \|^2$
$x_i$: i-th data point; $\mu_k$: geometric centroid; $c_k$: cluster label or number.
What choice of model (circle, ellipse) for the cluster? Circle.
How many models? Fixed number: K = 2.
Where to place them for optimal clustering?

K-means Clustering
Initialization: initialize at random the positions of the centers of the clusters.
In mldemos, each centroid is initialized on one datapoint, with no overlap across centroids.

K-means Clustering
Assignment Step: Calculate the distance from each data point to each centroid. Assign the responsibility of each data point to its closest centroid:
$k_i = \arg\min_k d(x_i, \mu_k)$
Responsibility of cluster k for point $x_i$: $r_i^k = 1$ if $k = k_i$, 0 otherwise.
$x_i$: i-th data point; $\mu_k$: geometric centroid.
If a tie happens (i.e. two centroids are equidistant to a data point), one assigns the data point to the winning centroid with the smallest index.

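K-means Clustering
Update Step (M-step): Recompute the position of each centroid based on the assignment of the points:
$\mu_k = \frac{\sum_i r_i^k x_i}{\sum_i r_i^k}$
where $r_i^k = 1$ if $k = \arg\min_{k'} d(x_i, \mu_{k'})$, 0 otherwise, is the responsibility of cluster k for point $x_i$.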

K-means Clustering
Assignment Step: Calculate the distance from each data point to each (updated) centroid $\mu_k = \frac{\sum_i r_i^k x_i}{\sum_i r_i^k}$. Assign the responsibility of each data point to its closest centroid: $r_i^k = 1$ if $k = \arg\min_{k'} d(x_i, \mu_{k'})$, 0 otherwise.
If a tie happens (i.e. two centroids are equidistant to a data point), one assigns the data point to the winning centroid with the smallest index.

K-means Clustering
Update Step (M-step): Recompute the position of each centroid based on the assignment of the points.
Stopping Criterion: Go back to step 2 and repeat the process until the clusters are stable.

K-means Clustering
K-means creates a hard partitioning of the dataset.
(Figure: intersection points between cluster regions, axes x1, x2.)

Effect of the distance metric on K-means
(Figure panels: L1-norm, L2-norm, L3-norm, L8-norm.)
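To make the role of the metric concrete, here is a minimal sketch of the assignment step under an Lp-norm. The function names `minkowski` and `closest_centroid`, and the example point and centroids, are assumptions made for illustration; they are not taken from the slides or from mldemos.

```python
def minkowski(a, b, p):
    """Lp-norm distance between two points given as sequences of floats."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def closest_centroid(point, centroids, p):
    """Index of the closest centroid under the Lp-norm (ties go to the smallest index)."""
    return min(range(len(centroids)), key=lambda k: minkowski(point, centroids[k], p))


# Hypothetical example: the same point is assigned to different centroids
# depending on the norm used.
point = (0.0, 0.0)
centroids = [(3.0, 3.0), (5.0, 0.0)]
for p in (1, 2, 3, 8):
    print(p, closest_centroid(point, centroids, p))
```

Under the L1-norm the point is closer to the second centroid (distance 5 against 6), while under the L2-norm it is closer to the first (about 4.24 against 5), so the cluster boundaries move with the choice of norm.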

K-means Clustering: Algorithm
1. Initialization: Pick K arbitrary centroids and set their geometric means to random values (in mldemos, each centroid is initialized on one datapoint, with no overlap across centroids).
2. Calculate the distance from each data point to each centroid.
3. Assignment Step (E-step): Assign the responsibility of each data point to its closest centroid. If a tie happens (i.e. two centroids are equidistant to a data point), one assigns the data point to the winning centroid with the smallest index.
   $r_i^k = 1$ if $k = \arg\min_{k'} d(x_i, \mu_{k'})$, 0 otherwise
4. Update Step (M-step): Adjust the centroids to be the means of all data points assigned to them.
   $\mu_k = \frac{\sum_i r_i^k x_i}{\sum_i r_i^k}$
5. Go back to step 2 and repeat the process until the clusters are stable.
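The five steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the mldemos implementation: the names `sq_dist` and `kmeans`, the use of the squared Euclidean distance, the `max_iter` and `seed` parameters, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Hard K-means on a list of points; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: each centroid starts on a distinct datapoint (no overlap).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: assign each point to its closest centroid.
        # min() returns the first minimum, so ties go to the smallest index.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Step 5: stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Because the result depends on the random initialization, it is usual to call such a function with several seeds and keep the best run.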

K-means Clustering
The K-means algorithm is a simple version of Expectation-Maximization applied to a model composed of isotropic Gaussian functions (see next lecture).

K-means Clustering: Properties
- There are always K clusters.
- The clusters do not overlap (soft K-means relaxes this assumption, see next slides).
- Each member of a cluster is closer to the centroid of its cluster than to the centroid of any other cluster.
- The algorithm is guaranteed to converge in a finite number of iterations.
- But it converges to a local optimum! It is hence very sensitive to the initialization of the centroids.

Soft K-means Clustering
$r_i^k$: responsibility of cluster k for point $x_i$:
$r_i^k = \frac{e^{-\beta\, d(\mu_k, x_i)}}{\sum_{k'} e^{-\beta\, d(\mu_{k'}, x_i)}}$, with $r_i^k \in [0, 1]$
Normalized over clusters: $\sum_k r_i^k = 1$
Assignment Step (E-step): Calculate the distance from each data point to each centroid. Instead of assigning the responsibility of each data point to its closest centroid only, each data point $x_i$ is given a soft "degree of assignment" to each of the means.

Soft K-means Clustering
$r_i^k$: responsibility of cluster k for point $x_i$:
$r_i^k = \frac{e^{-\beta\, d(\mu_k, x_i)}}{\sum_{k'} e^{-\beta\, d(\mu_{k'}, x_i)}}$, with $r_i^k \in [0, 1]$
Normalized over clusters: $\sum_k r_i^k = 1$
Update Step (M-step): Recompute the position of each centroid based on the assignment of the points. The model parameters, i.e. the means, are adjusted to match the weighted sample means of the data points that they are responsible for:
$\mu_k = \frac{\sum_i r_i^k x_i}{\sum_i r_i^k}$
The update step of soft K-means is identical to that of hard K-means, aside from the fact that the responsibilities to a particular cluster are now real numbers varying between 0 and 1.
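One iteration of soft K-means (E-step followed by M-step) can be sketched as below. The function name `soft_kmeans_step`, the argument name `beta` for the stiffness, and the choice of the Euclidean distance for d are assumptions made for this example; the slides do not fix the distance function.

```python
import math


def soft_kmeans_step(points, means, beta):
    """One E-step and M-step of soft K-means; returns (responsibilities, new_means)."""
    # E-step: soft degree of assignment of every point to every mean.
    resp = []
    for x in points:
        dists = [math.dist(m, x) for m in means]
        d_min = min(dists)  # shift for numerical stability; cancels in the ratio
        weights = [math.exp(-beta * (d - d_min)) for d in dists]
        total = sum(weights)
        resp.append([w / total for w in weights])
    # M-step: each mean becomes the responsibility-weighted mean of the data.
    # Note: a cluster whose total responsibility underflows to 0 would divide by zero.
    dim = len(points[0])
    new_means = []
    for k in range(len(means)):
        r_k = sum(r[k] for r in resp)
        new_means.append(
            [sum(r[k] * x[d] for r, x in zip(resp, points)) / r_k for d in range(dim)]
        )
    return resp, new_means


# Hypothetical toy data on a line, with two initial means.
pts = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
mus = [(0.0, 0.0), (10.0, 0.0)]
for beta in (0.0, 1.0, 50.0):
    resp, new_mus = soft_kmeans_step(pts, mus, beta)
    print(beta, new_mus)
```

With `beta = 0` every point is shared equally and both means collapse onto the global mean; with a large `beta` the responsibilities approach the 0/1 values of hard K-means.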

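Soft K-means Clustering
$\beta$ is the stiffness; $\sigma = 1/\sqrt{\beta}$ measures the disparity across clusters.
$r_i^k$: responsibility of cluster k for point $x_i$:
$r_i^k = \frac{e^{-\beta\, d(\mu_k, x_i)}}{\sum_{k'} e^{-\beta\, d(\mu_{k'}, x_i)}}$, with $r_i^k \in [0, 1]$
Normalized over clusters: $\sum_k r_i^k = 1$
Small $\beta$ ~ large $\sigma$; large $\beta$ ~ small $\sigma$.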

Soft K-means Clustering
Soft K-means algorithm with a small (left), medium (center) and large (right) $\beta$. (Figure panels: $\beta$ = 1, 5, 10.)

Soft K-means Clustering
Iterations of the soft K-means algorithm from the random initialization (left) to convergence (right). Computed with $\beta$ = 10.

(Soft) K-means Clustering: Properties
Advantages:
- Computationally faster than other clustering techniques.
- Produces tighter clusters, especially if the clusters are globular.
- Guaranteed to converge.
Drawbacks:
- Does not work well with non-globular clusters.
- Sensitive to the choice of initial partitions: different initial partitions can result in different final clusters.
- Assumes a fixed number K of clusters. It is, therefore, good practice to run the algorithm several times using different K values, to determine the optimal number of clusters.

K-means Clustering: Weaknesses
Unbalanced clusters: K-means takes into account only the distance between the means and the data points; it has no representation of the variance of the data within each cluster.
Elongated clusters: K-means imposes a fixed shape for each cluster (sphere).

K-means Clustering: Weaknesses
Very sensitive to the choice of the number of clusters K and to the initialization. (Mldemos example.)

K-means: Limitations
(Figure: data in axes x1, x2, x3, showing relevant data and outliers (noise).)
K-means would not be able to reject outliers.

K-means: Limitations
- K-means would not be able to reject outliers.
- K-means assigns all datapoints to a cluster; outliers get assigned to the closest cluster.
- DBSCAN can determine outliers and can generate non-globular clusters.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
1. Pick a datapoint at random.
2. Compute the number of datapoints within ε.
3. If this number is < m_data, set this datapoint as an outlier.
4. Go back to 1.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
1. Pick a datapoint at random.
2. Compute the number of datapoints within ε.
3. For each datapoint found, assign it to the same cluster.
4. Go back to 1.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
1. Pick a datapoint at random.
2. Compute the number of datapoints within ε.
3. For each datapoint found, assign it to the same cluster.
4. Merge two clusters if the distance between the clusters is < ε.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Hyperparameters:
- ε: size of the neighborhood
- m_data: minimum number of datapoints
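The procedure outlined on the preceding slides can be sketched in plain Python. This follows the commonly used formulation of DBSCAN rather than the slides' simplified steps word for word, and it rests on several assumptions: the names `dbscan`, `eps` and `m_data`, the Euclidean distance, counting a point as part of its own neighborhood, visiting points in index order instead of at random, and using the label -1 for outliers.

```python
import math


def dbscan(points, eps, m_data):
    """Label each point with a cluster index (0, 1, ...) or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < m_data:
            labels[i] = NOISE  # may later be re-labeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= m_data:  # dense enough: keep growing the cluster
                queue.extend(reachable)
        cluster += 1
    return labels


# Hypothetical toy data: two dense squares and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
print(dbscan(data, eps=1.5, m_data=3))
```

Growing a cluster through every sufficiently dense neighbor plays the role of the "merge two clusters if the distance between them is < ε" step: regions connected by dense neighborhoods end up with the same label.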

Comparison: K-means / DBSCAN

                      K-means                  DBSCAN
Hyperparameters       K: number of clusters    ε: size of neighborhood; m_data: min. number of datapoints
Computational cost    O(K*M)                   O(M*log(M))
Type of cluster       Globular                 Non-globular (arbitrary shapes, nonlinear boundaries)
Robustness to noise   Not robust               Robust to outliers within ε

M: number of datapoints.

K-means is computationally cheap. However, it is not robust to noise and produces only globular clusters.
DBSCAN is computationally intensive, but it can automatically detect noise and produces clusters of arbitrary shape.
Both K-means and DBSCAN depend on choosing the hyperparameters well. To determine the hyperparameters, use evaluation methods for clustering (next).

Evaluation of Clustering Methods
Clustering methods rely on hyperparameters:
- number of clusters, elements in the cluster, distance metric;
- we need to determine the goodness of these choices.
Clustering is unsupervised classification:
- we do not know the real number of clusters or the data labels;
- it is difficult to evaluate these choices without ground truth.

ADVANCED MACHINE LEARNING
Evaluation of Clustering Methods
Two types of measures: internal versus external measures.
Internal measures rely on measures of similarity: (low) intra-cluster distance versus (high) inter-cluster distances. Internal measures are problematic, as the metric of similarity is often already optimized by the clustering algorithm.
External measures rely on ground truth (class labels): given a (sub)set of known class labels, compute the similarity of clusters to class labels. In real-world data, it is hard or infeasible to gather ground truth.

Internal Measure: RSS
The Residual Sum of Squares (RSS) is an internal measure (available in mldemos). It computes the distance (in norm-2) of each datapoint from its centroid, for all clusters:
$\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x_i \in c_k} \| x_i - \mu_k \|^2$

RSS for K-Means
The goal of K-means is to find cluster centers $\mu_k$ which minimize distortion.
Measure of distortion: $\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x_i \in c_k} \| x_i - \mu_k \|^2$
By increasing K we decrease RSS; what is the optimal K such that RSS → 0?
RSS = 0 when K = M: one has as many clusters as datapoints! (Example: M: 100 datapoints, N: 2 dimensions, K: M clusters, RSS: 0.)
However, RSS can still be used to determine an optimal K by monitoring the slope of the decrease of the measure as K increases.
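A direct computation of the RSS makes the degenerate case K = M easy to check. The function name `rss` and the toy values below are assumptions made for illustration.

```python
def rss(points, centroids, labels):
    """Residual sum of squares: squared norm-2 distance of each point to its centroid."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )


# Hypothetical toy data with K = 2 clusters.
pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(rss(pts, [(1.0, 0.0), (10.0, 0.0)], [0, 0, 1]))

# K = M: every datapoint is its own centroid, so the RSS vanishes.
print(rss(pts, pts, [0, 1, 2]))
```

The second call shows why the RSS alone cannot select K: it keeps decreasing until each datapoint has its own cluster.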

K-means Clustering: Examples
Procedure:
- Run K-means while monotonically increasing the number of clusters.
- For each K, run K-means with several initializations and take the best run.
- Use the RSS measure to quantify the improvement in clustering and determine a plateau.
The optimal K is at the elbow of the curve.
(Example: M: 100 datapoints, N: 2 dimensions, K: 4 clusters.)

K-means with RSS: Examples
Cluster Analysis of Hedge Funds (fonds spéculatifs) [N. Das, 9th Int. Conf. on Computing Economics and Finance, 2011]
There is no legal definition of hedge funds: they consist of a wide category of investment funds with high risk and high returns, and a variety of strategies for guiding the investment.
Research question: classify the type of hedge funds based on the information provided to the client.
Data dimensions (features), such as: asset class, size of the hedge fund, incentive fee, risk level, and liquidity of hedge funds.
Procedure:
- Run K-means while monotonically increasing the number of clusters.
- For each K, run K-means with several initializations and take the best run.
- Use the RSS measure to quantify the improvement in clustering and determine a plateau.
(Figure: RSS against the number of clusters (K), with the cutoff marked.)
Optimal results are found with 7 clusters.

K-means Clustering: Examples
The elbow or plateau method for choosing the optimal K from the RSS curve can be unreliable for certain datasets.
(Example: M: 100 datapoints, N: 3 dimensions; candidate elbows at K = 2 and K = 11.)
Which one is the optimal K? We don't know! We need an additional penalty or criterion!

Other Metrics to Evaluate Clustering Methods
AIC and BIC determine how well the model fits the dataset in a probabilistic sense (maximum-likelihood measure). The measure is balanced by how many parameters are needed to get a good fit.
- Akaike Information Criterion: $\mathrm{AIC} = -2 \ln L + 2B$
- Bayesian Information Criterion: $\mathrm{BIC} = -2 \ln L + B \ln M$
L: maximum likelihood of the model; B: number of free parameters; M: number of datapoints.
As the number of datapoints (observations) increases, BIC assigns more weight to simpler models than AIC does. A low BIC implies either fewer explanatory variables, a better fit, or both.
Both criteria add a penalty for the increase in computational cost due to the number of parameters and (for BIC) the number of datapoints.
Choosing AIC versus BIC depends on the application: is the purpose of the analysis to make predictions, or to decide which model best represents reality? AIC may have better predictive ability than BIC, but BIC finds a computationally more efficient solution.

AIC for K-Means
For the particular case of K-means, we do not have a maximum likelihood estimate of the model:
$\mathrm{AIC} = -2 \ln(L) + 2B$ (L: likelihood of the model; B: number of free parameters)
However, we can formulate a metric based on the RSS that penalizes model complexity (# K clusters), conceptually following AIC:
$\mathrm{AIC}_{\mathrm{RSS}} = \mathrm{RSS} + B$, with $\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x_i \in c_k} \| x_i - \mu_k \|^2$
(Slide annotations: weighting factor; number of free parameters.)
Number of free parameters: B = K*N (K: # clusters, N: # dimensions).

BIC for K-Means
For the particular case of K-means, we do not have a maximum likelihood estimate of the model:
$\mathrm{BIC} = -2 \ln(L) + \ln(M)\, B$
However, we can formulate a metric based on the RSS that penalizes model complexity (# K clusters, # M datapoints), conceptually following BIC:
$\mathrm{BIC}_{\mathrm{RSS}} = \mathrm{RSS} + \ln(M)\, B$, with $\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x_i \in c_k} \| x_i - \mu_k \|^2$
The weighting factor ln(M) penalizes with respect to the number of datapoints (i.e. computational complexity).
Number of free parameters: B = K*N (K: # clusters, N: # dimensions).
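The two RSS-based criteria are simple enough to compute by hand; the sketch below evaluates them for a range of K and picks the minimum. The function names `aic_rss` and `bic_rss` and the RSS values in `rss_by_k` are made up for illustration (they are not results from the slides); the formulas are the ones stated above, RSS + B and RSS + ln(M) B with B = K*N.

```python
import math


def aic_rss(rss, k, n):
    """AIC-like criterion for K-means: RSS + B, with B = K * N free parameters."""
    return rss + k * n


def bic_rss(rss, k, n, m):
    """BIC-like criterion for K-means: RSS + ln(M) * B, with B = K * N."""
    return rss + math.log(m) * k * n


# Hypothetical RSS of the best K-means run for each K (invented numbers).
rss_by_k = {1: 400.0, 2: 150.0, 3: 60.0, 4: 20.0, 5: 18.5, 6: 17.0}
N, M = 2, 100  # dimensions and number of datapoints (assumed)

best_aic = min(rss_by_k, key=lambda k: aic_rss(rss_by_k[k], k, N))
best_bic = min(rss_by_k, key=lambda k: bic_rss(rss_by_k[k], k, N, M))
print(best_aic, best_bic)
```

With these invented values the RSS keeps decreasing as K grows, but both penalized criteria reach their minimum at K = 4, where the gain in RSS no longer compensates for the extra parameters.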

K-means Clustering: Examples
Procedure:
- Run K-means while monotonically increasing the number of clusters.
- For each K, run K-means with several initializations and take the best run.
- Use the AIC/BIC curves to find the optimal K, which is at min(AIC) or min(BIC).
(Example: M: 100 datapoints, N: 3 dimensions. Both min(BIC) and min(AIC) are at K = 2 clusters.)

BIC for K-Means
$\mathrm{BIC}_{\mathrm{RSS}} = \mathrm{RSS} + \ln(M)\,(K \cdot N)$
(Example: M: 100 datapoints, N: 2 dimensions, K: 14 clusters.)

BIC for K-Means
$\mathrm{BIC}_{\mathrm{RSS}} = \mathrm{RSS} + \ln(M)\,(K \cdot N)$
(Example: M: 100 datapoints, N: 2 dimensions, K: 4 clusters.)

AIC / BIC for DBSCAN
Compute the centroid of each cluster and apply the AIC/BIC of K-means.

       DBSCAN large ε   DBSCAN medium ε   DBSCAN small ε
RSS    43               26                0.5
BIC    42               34                78
AIC    69               51                24

AIC / BIC for DBSCAN
Compute the centroid of each cluster and apply the AIC/BIC of K-means.

       K-means   DBSCAN large ε   DBSCAN medium ε   DBSCAN small ε
RSS    51        95               59                0.6
BIC    65        118              88                331
AIC    55        102              67                93

Evaluation of Clustering Methods
Two types of measures: internal versus external measures.
External measures assume that a subset of datapoints have class labels (semi-supervised learning). They measure how well these datapoints are clustered.
One needs to have an idea of the number of existing classes and to have labeled some datapoints.
This is interesting only in cases when labeling is highly time-consuming, i.e. when the data is very large (e.g. in speech recognition).

Semi-Supervised Learning
Clustering F1-Measure (careful: similar to, but not the same as, the F-measure we will see for classification!)
Tradeoff between clustering correctly all datapoints of the same class in the same cluster and making sure that each cluster contains points of only one class.
$F(C, K) = \sum_{c_i \in C} \frac{|c_i|}{M} \max_k F(c_i, k)$
$F(c_i, k) = \frac{2\, R(c_i, k)\, P(c_i, k)}{R(c_i, k) + P(c_i, k)}$, $R(c_i, k) = \frac{n_{ik}}{|c_i|}$, $P(c_i, k) = \frac{n_{ik}}{|k|}$
M: number of labeled datapoints; $C = \{c_i\}$: the set of classes; K: number of clusters; $n_{ik}$: number of members of class $c_i$ and of cluster k.

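Example (figure: labeled and unlabeled datapoints of Class 1 and Class 2)
$R(c_1, k=1) = \frac{2}{2} = 1$, $R(c_2, k=2) = \frac{4}{4} = 1$
$P(c_1, k=1) = \frac{2}{6}$, $P(c_2, k=2) = \frac{4}{6}$
Recall R: proportion of datapoints correctly classified/clustered. Precision P: proportion of datapoints of the same class in the cluster.
$F(C, K) = \sum_{c_i \in C} \frac{|c_i|}{M} \max_k F(c_i, k)$
$F(c_i, k) = \frac{2\, R(c_i, k)\, P(c_i, k)}{R(c_i, k) + P(c_i, k)}$, $R(c_i, k) = \frac{n_{ik}}{|c_i|}$, $P(c_i, k) = \frac{n_{ik}}{|k|}$
M: number of labeled datapoints; $C = \{c_i\}$: the set of classes; K: number of clusters; $n_{ik}$: number of members of class $c_i$ and of cluster k.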

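Example (figure: labeled and unlabeled datapoints of Class 1 and Class 2)
$F(C, K) = \frac{2}{6} F(c_1, k=1) + \frac{4}{6} F(c_2, k=2) = 0.7$
$F(C, K) = \sum_{c_i \in C} \frac{|c_i|}{M} \max_k F(c_i, k)$
- The factor $|c_i| / M$ penalizes by the fraction of labeled points in each class.
- The $\max_k$ picks, for each class, the cluster with the maximal F1-measure.
$F(c_i, k) = \frac{2\, R(c_i, k)\, P(c_i, k)}{R(c_i, k) + P(c_i, k)}$, $R(c_i, k) = \frac{n_{ik}}{|c_i|}$, $P(c_i, k) = \frac{n_{ik}}{|k|}$
M: number of labeled datapoints; $C = \{c_i\}$: the set of classes; K: number of clusters; $n_{ik}$: number of members of class $c_i$ and of cluster k.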

Summary of F1-Measure
Clustering F1-Measure (careful: similar to, but not the same as, the F-measure we will see for classification!)
Tradeoff between clustering correctly all datapoints of the same class in the same cluster and making sure that each cluster contains points of only one class.
$F(C, K) = \sum_{c_i \in C} \frac{|c_i|}{M} \max_k F(c_i, k)$
- The factor $|c_i| / M$ penalizes by the fraction of labeled points in each class.
- The $\max_k$ picks, for each class, the cluster with the maximal F1-measure.
$F(c_i, k) = \frac{2\, R(c_i, k)\, P(c_i, k)}{R(c_i, k) + P(c_i, k)}$, $R(c_i, k) = \frac{n_{ik}}{|c_i|}$, $P(c_i, k) = \frac{n_{ik}}{|k|}$
M: number of labeled datapoints; $C = \{c_i\}$: the set of classes; K: number of clusters; $n_{ik}$: number of members of class $c_i$ and of cluster k.
Recall R: proportion of datapoints correctly classified/clustered. Precision P: proportion of datapoints of the same class in the cluster.
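The clustering F1-measure summarized above can be sketched as a short function. The name `clustering_f1` and its two-list interface are assumptions made for the example, as is the choice to count only labeled datapoints when computing the cluster size |k|.

```python
def clustering_f1(class_labels, cluster_labels):
    """Clustering F1-measure over labeled datapoints.

    class_labels[i] is the known class of labeled point i,
    cluster_labels[i] is the cluster that point was assigned to.
    """
    m = len(class_labels)
    pairs = list(zip(class_labels, cluster_labels))
    total = 0.0
    for c in set(class_labels):
        size_c = class_labels.count(c)
        best = 0.0
        for k in set(cluster_labels):
            n_ck = pairs.count((c, k))  # members of class c inside cluster k
            if n_ck == 0:
                continue
            recall = n_ck / size_c
            precision = n_ck / cluster_labels.count(k)
            best = max(best, 2 * recall * precision / (recall + precision))
        # Weight each class by its share of the labeled points.
        total += (size_c / m) * best
    return total


# Hypothetical configuration: one cluster holding 2 labeled points of class 1
# and 4 labeled points of class 2 (recalls of 1, precisions of 2/6 and 4/6).
print(clustering_f1([1, 1, 2, 2, 2, 2], [0, 0, 0, 0, 0, 0]))
```

A clustering that separates the two classes perfectly scores 1, while lumping both classes into a single cluster, as in the configuration printed here, lowers the score through the precision terms.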

Summary of Lecture
- Introduced two clustering techniques: K-means and DBSCAN.
- Discussed pros and cons in terms of computational time and power of representation (globular/non-globular clusters).
- Introduced metrics to evaluate clustering and help choose the hyperparameters: internal measures (RSS, AIC, BIC) and external measures (F1-measure, also called F-measure for clustering).
Next week: Practical on Clustering. You will compare the performance of K-means and DBSCAN on your datasets and use the internal and external measures to assess this performance and choose the hyperparameters.

Robotic Application of Clustering Method
Variety of hand postures when grasping objects. How to generate correct hand postures on robots?
El-Khoury, S., Miao, L. and Billard, A. (2013) On the Generation of a Variety of Grasps. Robotics and Autonomous Systems Journal.

Robotic Application of Clustering Method
(Figure: 4-DOF industrial hand (Barrett Technology); 9-DOF humanoid hand (iCub Robot).)
Problem: Choose the point of contact and generate a feasible posture for the fingers to touch the object at the correct point and with the desired force.
Difficulty: High number of degrees of freedom (large number of possible points of contact, large number of DOFs to control).

Formulate the problem as constraint-based optimization: minimize the generated torques at the fingertips under constraints:
- Force closure
- Kinematic feasibility
- Collision avoidance
Nonconvex optimization yields several local / feasible solutions.
- 1890 trials converged to 791 feasible solutions; took ~12.14 s for each solution.
- 1890 trials converged to 612 feasible solutions; took ~2.65 s for each solution.
Took too long for realistic application!

Apply K-means on all solutions and group them into clusters. (Figure panels: 11 clusters; 20 clusters.)

A. Shukla and A. Billard, NIPS 2012