Distance Based Methods. Lecture 13 August 9, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2


2 Today's Lecture Measures of distance (similarity and dissimilarity) Multivariate methods that use distances (dissimilarities) and similarities Hierarchical clustering algorithms Multidimensional scaling K-means clustering

3 Clustering Definitions Clustering objects (or observations) can provide detail regarding the nature and structure of data Clustering is distinct from classification in terminology Classification pertains to a known number of groups, with the objective being to assign new observations to these groups Cluster analysis, by contrast, makes no assumptions concerning the number of groups or the group structure

4 Magnitude of the Problem Clustering algorithms make use of measures of similarity (or, alternatively, dissimilarity) to define and group variables or observations Clustering presents a host of technical problems For a reasonably sized data set with n objects (either variables or individuals), the number of ways of grouping n objects into k groups is: (1/k!) Σ_{j=0}^{k} (-1)^j C(k, j) (k - j)^n For example, there are about 4.7 × 10^13 (over 46 trillion) ways that 25 objects can be clustered into 4 groups; which solution is best?
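The count in this formula is the Stirling number of the second kind; a quick pure-Python check (the function name is my own) confirms the magnitude for 25 objects in 4 groups:

```python
from math import comb, factorial

def n_groupings(n, k):
    """Number of ways to sort n objects into k non-empty groups
    (the Stirling number of the second kind)."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n
               for j in range(k + 1)) // factorial(k)

print(n_groupings(25, 4))  # 46771289738810 -- about 4.7 x 10^13
```

With that many candidate groupings, the best solution clearly cannot be found by enumeration.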

5 Numerical Problems In theory, one way to find the best solution is to try each possible grouping of all of the objects an optimization process called integer programming It is difficult, if not impossible, to do such a method given the state of today's computers (although computers are catching up with such problems) Rather than using such brute-force methods, a set of heuristics has been developed to allow for fast clustering of objects into groups Such methods are called heuristics because they do not guarantee that the solution will be optimal (best), only that the solution will be better than most

6 Clustering Heuristic Inputs The inputs into clustering heuristics are in the form of measures of similarities or dissimilarities The result of the heuristic depends in large part on the measure of similarity/dissimilarity used by the procedure Because of this, we will start this lecture by examining measures of distance (used synonymously with similarity/dissimilarity)

7 Measures of Distance Care must be taken with choosing the metric by which similarity is quantified Important considerations include: The nature of the variables (e.g., discrete, continuous, binary) Scales of measurement (nominal, ordinal, interval, or ratio) The nature of the matter under study

8 Clustering Variables vs. Clustering Observations When variables are to be clustered, oft used measures of similarity include correlation coefficients (or similar measures for noncontinuous variables) When observations are to be clustered, distance metrics are often used

9 Euclidean Distance Euclidean distance is a frequent choice of a distance metric: d(x, y) = sqrt[ (x1 - y1)^2 + (x2 - y2)^2 + ... + (xp - yp)^2 ] This distance is often chosen because it represents an understandable metric But sometimes it may not be the best choice

10 Euclidean Distance? Imagine I wanted to know how many miles it was from my old house in Sacramento to Lombard Street in San Francisco Knowing how far it was on a straight line would not do me too much good, particularly with the number of one-way streets that exist in San Francisco

11 Other Distance Metrics Other popular distance metrics include the Minkowski metric: d(x, y) = [ Σ_{i=1}^{p} |x_i - y_i|^m ]^{1/m} The key to this metric is the choice of m: If m = 2, this provides the Euclidean distance If m = 1, this provides the city-block distance
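The metric is a one-liner in code; a small sketch (function name my own) showing both special cases of m:

```python
def minkowski(x, y, m):
    """Minkowski distance between vectors x and y with exponent m."""
    return sum(abs(xi - yi) ** m for xi, yi in zip(x, y)) ** (1.0 / m)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 2))  # 5.0  (m = 2: Euclidean)
print(minkowski(x, y, 1))  # 7.0  (m = 1: city-block)
```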

12 Preferred Distance Properties It is often desirable to use a distance metric that meets the following properties: d(P, Q) = d(Q, P) d(P, Q) > 0 if P ≠ Q d(P, Q) = 0 if P = Q d(P, Q) ≤ d(P, R) + d(R, Q) The Euclidean and Minkowski metrics satisfy these properties

13 Triangle Inequality The fourth property is called the triangle inequality, which often gets violated by non-routine measures of distance This inequality can be shown with a triangle having vertices P, Q, and R (with lines representing Euclidean distances): d(P, R) + d(R, Q) ≥ d(P, Q)

14 Binary Variables In the case of binary-valued variables (variables that have a 0/1 coding), many other distance metrics may be defined The squared Euclidean distance provides a count of the number of mismatched observations: here, d(i,k) = 2 This count is sometimes called the Hamming distance
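A minimal sketch of the mismatch count; the two 0/1 vectors are illustrative inventions of my own, chosen so that they reproduce the slide's d(i,k) = 2:

```python
def hamming(a, b):
    """Number of mismatched positions between two 0/1 vectors
    (equal to the squared Euclidean distance for binary data)."""
    return sum(ai != bi for ai, bi in zip(a, b))

i = (1, 0, 1, 1, 0)  # hypothetical observation i
k = (1, 1, 1, 0, 0)  # hypothetical observation k
print(hamming(i, k))  # 2
```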

15 Other Binary Distance Measures There are a number of other ways to define the distance between a set of binary variables Most of these measures reflect the varied importance placed on differing cells in a 2 x 2 table

16 General Distance Measure Properties Use of measures of distance that are monotonic in their ordering of object distances will provide identical results from clustering heuristics Many times this will only be an issue if the distance measure is for binary variables or the distance measure does not satisfy the triangle inequality

17 Hierarchical Clustering Methods Because of the large size of possible clustering solutions, searching through all combinations is not feasible Hierarchical clustering techniques proceed by taking a set of objects and grouping a set at a time Two types of hierarchical clustering methods exist: Agglomerative hierarchical methods Divisive hierarchical methods

18 Agglomerative Clustering Methods Agglomerative clustering methods start first with the individual objects Initially, each object is its own cluster The most similar objects are then grouped together into a single cluster (with two objects) We will find that what we mean by similar will change depending on the method

19 Agglomerative Clustering Methods The remaining steps involve merging the clusters according to the similarity or dissimilarity of the objects within the cluster to those outside of the cluster The method concludes when all objects are part of a single cluster

20 Divisive Clustering Methods Divisive hierarchical methods work in the opposite direction, beginning with a single, n-object sized cluster The large cluster is then divided into two subgroups where the objects in opposing groups are relatively distant from each other

21 Divisive Clustering Methods The process continues similarly until there are as many clusters as there are objects While divisive clustering is briefly discussed in the chapter, we will not discuss it in detail in this class

22 To Summarize So you can see that we have this idea of steps At each step two clusters combine to form one (Agglomerative) OR At each step a cluster is divided into two new clusters (Divisive)

23 Methods for Viewing Clusters As you could imagine, when we consider the methods for hierarchical clustering, there are a large number of clusters that are formed sequentially One of the most frequently used tools to view the clusters (and level at which they were formed) is the dendrogram A dendrogram is a graph that describes the differing hierarchical clusters, and the distance at which each is formed

24 Example Data Set #1 To demonstrate several of the hierarchical clustering methods, an example data set is used Data come from a 1991 study by the economic research department of the Union Bank of Switzerland representing economic conditions of 48 cities around the world Three variables were collected: Average working hours for 12 occupations Price of 112 goods and services excluding rent Index of net hourly earnings in 12 occupations

25 Example of Dendrogram Here, for example, Amsterdam and Brussels were combined to form a group This is an example of an agglomerative cluster, where cities start off in their own group and are then combined Where the lines connect is where those two previous groups were joined Notice that we do not specify the groups, but if we know how many we want, we simply go to the step where there are that many groups

26 Similarity? So, we mentioned that: The most similar objects are then grouped together into a single cluster (with two objects) So the next question is how do we measure similarity between clusters More specifically, how do we redefine it when a cluster contains a combination of old clusters We find that there are several ways to define similar and each way defines a new method of clustering

27 Agglomerative Methods Next we discuss several different ways to complete agglomerative hierarchical clustering: Single Linkage Complete Linkage Average Linkage Centroid Median Ward's Method

28 Agglomerative Methods In describing these I will give the definition of how we define similar when clusters are combined. And for the first two I will give detailed examples.

29 Example Distance Matrix The example will be based on the five-object distance matrix below (entries inferred from the worked calculations that follow):

       1    2    3    4    5
  1    0
  2    9    0
  3    3    7    0
  4    6    5    9    0
  5   11   10    2    8    0

30 Single Linkage The single linkage method of clustering involves combining clusters by finding the nearest neighbor: the distance between two clusters is the distance between their closest pair of observations Notice that this is equal to the minimum distance of any observation in one cluster to any observation in the other cluster

31 Single Linkage So the distance between any two clusters is: D(A, B) = min d(y_i, y_j) for all y_i in A and y_j in B Notice any other distance is longer So how would we do this using our distance matrix?

32 Single Linkage Example The first step in the process is to determine the two elements with the smallest distance, and combine them into a single cluster Here, the two objects that are most similar are objects 3 and 5; we will now combine these into a new cluster, and compute the distance from that cluster to the remaining clusters (objects) via the single linkage rule

33 Single Linkage Example The shaded rows/columns are the portions of the table with the distances of objects 3 and 5 to the remaining objects Our rule says that the distance of our new cluster (35) to each remaining object is equal to the minimum of the two corresponding values In equation form, our new distances are: d(35),1 = min{d31, d51} = min{3, 11} = 3 d(35),2 = min{d32, d52} = min{7, 10} = 7 d(35),4 = min{d34, d54} = min{9, 8} = 8

34 Single Linkage Example Using the distance values, we now consolidate our table so that (35) is now a single row/column The distance from the (35) cluster to the remaining objects is given below: d(35)1 = min{d31, d51} = min{3, 11} = 3 d(35)2 = min{d32, d52} = min{7, 10} = 7 d(35)4 = min{d34, d54} = min{9, 8} = 8

35 Single Linkage Example We now repeat the process, by finding the smallest distance within the set of remaining clusters The smallest distance is between object 1 and cluster (35) Therefore, object 1 joins cluster (35), creating cluster (135) The distance from cluster (135) to the other clusters is then computed: d(135)2 = min{d(35)2, d12} = min{7, 9} = 7 d(135)4 = min{d(35)4, d14} = min{8, 6} = 6

36 Single Linkage Example Using the distance values, we now consolidate our table so that (135) is now a single row/column The distance from the (135) cluster to the remaining objects is given below: d(135)2 = min{d(35)2, d12} = min{7, 9} = 7 d(135)4 = min{d(35)4, d14} = min{8, 6} = 6

37 Single Linkage Example We now repeat the process, by finding the smallest distance within the set of remaining clusters The smallest distance is between object 2 and object 4 These two objects will be joined to form cluster (24) The distance from (24) to (135) is then computed: d(135)(24) = min{d(135)2, d(135)4} = min{7, 6} = 6 The final cluster (12345) is formed at a distance of 6
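The whole single linkage sequence above can be sketched in a few lines of Python; the distance matrix values here are inferred from the worked calculations (e.g., d31 = 3, d51 = 11, d24 = 5, d35 = 2), and all names are my own:

```python
# Five-object distance matrix inferred from the worked example.
D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10,
     (3, 4): 9, (3, 5): 2, (4, 5): 8}

def d(a, b):
    """Object-to-object distance, regardless of argument order."""
    return D[(a, b)] if (a, b) in D else D[(b, a)]

def single_linkage(objects):
    """Repeatedly merge the two clusters whose closest pair of
    members is nearest; record each merge and its distance."""
    clusters = [frozenset([o]) for o in objects]
    merges = []
    while len(clusters) > 1:
        link = lambda a, b: min(d(x, y) for x in a for y in b)
        a, b = min(((p, q) for i, p in enumerate(clusters) for q in clusters[i + 1:]),
                   key=lambda pair: link(*pair))
        merges.append((sorted(a | b), link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

for members, height in single_linkage([1, 2, 3, 4, 5]):
    print(members, height)  # (35) at 2, (135) at 3, (24) at 5, (12345) at 6
```

The printed merge order and heights reproduce the worked example: (35) at 2, (135) at 3, (24) at 5, and the final cluster (12345) at 6.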

38 The Dendrogram For example, here is where 3 and 5 joined

39 Complete Linkage The complete linkage method of clustering involves combining clusters by finding the farthest neighbor: the distance between two clusters is the distance between their farthest pair of observations This ensures that all objects in a cluster are within some maximum distance of each other

40 Complete Linkage So the distance between any two clusters is: D(A, B) = max d(y_i, y_j) for all y_i in A and y_j in B Notice any other distance is shorter So how would we do this using our distance matrix?

41 Complete Linkage Example To demonstrate complete linkage in action, consider the five-object distance matrix The first step in the process is to determine the two elements with the smallest distance, and combine them into a single cluster Here, the two objects that are most similar are objects 3 and 5 We will now combine these into a new cluster, and compute the distance from that cluster to the remaining clusters (objects) via the complete linkage rule

42 Complete Linkage Example The shaded rows/columns are the portions of the table with the distances from an object to object 3 or object 5 The distance from the (35) cluster to the remaining objects is given below: d(35)1 = max{d31, d51} = max{3, 11} = 11 d(35)2 = max{d32, d52} = max{7, 10} = 10 d(35)4 = max{d34, d54} = max{9, 8} = 9

43 Complete Linkage Example Using the distance values, we now consolidate our table so that (35) is now a single row/column Notice our new computed distances with (35): d(35)1 = max{d31, d51} = max{3, 11} = 11 d(35)2 = max{d32, d52} = max{7, 10} = 10 d(35)4 = max{d34, d54} = max{9, 8} = 9

44 Complete Linkage Example We now repeat the process, by finding the smallest distance within the set of remaining clusters The smallest distance is between object 2 and object 4, so we use our rule to combine them into cluster (24) The distance from cluster (24) to the other clusters is then computed: d(24)(35) = max{d2(35), d4(35)} = max{10, 9} = 10 d(24)1 = max{d21, d41} = max{9, 6} = 9

45 Complete Linkage Example Using the distance values, we now consolidate our table so that (24) is now a single row/column The distance from the (24) cluster to the remaining objects is given below: d(24)(35) = max{d2(35), d4(35)} = max{10, 9} = 10 d(24)1 = max{d21, d41} = max{9, 6} = 9

46 Complete Linkage Example We now repeat the process, by finding the smallest distance within the set of remaining clusters The smallest distance is between cluster (24) and object 1 These will be joined to form cluster (124) The distance from (124) to (35) is then computed: d(35)(124) = max{d1(35), d(24)(35)} = max{11, 10} = 11 The final cluster (12345) is formed at a distance of 11
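The complete linkage sequence can be checked the same way in Python; the distance values are again inferred from the worked example, and names are my own:

```python
# Five-object distance matrix inferred from the worked example.
D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10,
     (3, 4): 9, (3, 5): 2, (4, 5): 8}

def d(a, b):
    """Object-to-object distance, regardless of argument order."""
    return D[(a, b)] if (a, b) in D else D[(b, a)]

def complete_linkage(objects):
    """Merge the pair of clusters whose farthest pair of members is
    nearest; the merge height is that maximum distance."""
    clusters = [frozenset([o]) for o in objects]
    merges = []
    while len(clusters) > 1:
        link = lambda a, b: max(d(x, y) for x in a for y in b)
        a, b = min(((p, q) for i, p in enumerate(clusters) for q in clusters[i + 1:]),
                   key=lambda pair: link(*pair))
        merges.append((sorted(a | b), link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

for members, height in complete_linkage([1, 2, 3, 4, 5]):
    print(members, height)  # (35) at 2, (24) at 5, (124) at 9, (12345) at 11
```

Note how the farthest-neighbor rule changes the merge order relative to single linkage: object 1 now joins (24) rather than (35), and the final merge happens at 11 instead of 6.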

47 The Dendrogram

48 Average Linkage The average linkage method proceeds similarly to the single and complete linkage methods, with the exception that the distance between two clusters is now represented by the average distance between all pairs of objects, one from each cluster In practice, the average linkage method will often produce results very similar to the complete linkage method

49 So To summarize we have explained three of the methods to combine groups Notice that once things are in the same group they cannot be separated Also, as before with factor analysis the method you use is largely up to you

50 Example Distance Matrix To further demonstrate the agglomerative methods, imagine we had the following distance matrix for five objects:

51 Linkage Methods for Agglomerative Clustering We will discuss three techniques for agglomerative clustering: 1. Single linkage 2. Complete linkage 3. Average linkage Each of these techniques provides an assignment rule that determines when an observation becomes part of a cluster

52 Cluster Analysis in SAS To perform hierarchical clustering in SAS, proc cluster is used Prior to using proc cluster, the data must be converted from the raw observations into a distance matrix SAS provides a built-in macro to create distance matrices that are then used as input into the proc cluster routine To produce dendrograms in SAS, proc tree is used

53 Distance Macro Prior to specifying the distance macro, a pair of %include statements must be placed (preferably at the beginning of the syntax file): %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\xmacro.sas'; %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\distnew.sas'; The generic syntax for the distance-creating macro is to use %distance() all arguments are optional %distance(data=cities, id=city, options=nomiss, out=distcity, shape=square, method=euclid, var=work--salary); For more information on the distance macro, go to:

54 Distance Macro Example %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\xmacro.sas'; %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\distnew.sas'; title '1991 City Data'; data cities; input city $ work price salary; cards; Amsterdam Athens ; %distance(data=cities, id=city, options=nomiss, out=distcity, shape=square, method=euclid, var=work--salary);

55 Proc Cluster The proc cluster guide can be found at: ter_toc.htm The generic syntax is: proc cluster data=dataset method=single; id idvariable; run;

56 Same Example in SAS *SAS Example #1; %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\xmacro.sas'; %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\distnew.sas'; title 'small distance matrix'; data dist(type=distance); input object1-object5 name $; cards; 0 9 3 6 11 object1 9 0 7 5 10 object2 3 7 0 9 2 object3 6 5 9 0 8 object4 11 10 2 8 0 object5 ; run; title 'Single Link'; proc cluster data=dist method=single nonorm; id name; run; proc tree horizontal spaces=2; id name; run; Output: The CLUSTER Procedure Single Linkage Cluster Analysis Mean Distance Between Observations = 7 Cluster History NCL 4: object3 + object5 (FREQ 2, Min Dist 2) NCL 3: object1 + CL4 (FREQ 3, Min Dist 3) NCL 2: object2 + object4 (FREQ 2, Min Dist 5) NCL 1: CL3 + CL2 (FREQ 5, Min Dist 6)

57 Dendrogram for Single Link Clustering Method

58 Complete Linkage Example The shaded rows/columns are the portions of the table with the distances from an object to object 3 or object 5. The distance from the (35) cluster to the remaining objects is given below: d(35)1 = max{d31, d51} = max{3, 11} = 11 d(35)2 = max{d32, d52} = max{7, 10} = 10 d(35)4 = max{d34, d54} = max{9, 8} = 9

59 Same Example in SAS title 'Complete Link'; proc cluster data=dist method=complete nonorm; id name; run; proc tree horizontal spaces=2; id name; run; Output: The CLUSTER Procedure Complete Linkage Cluster Analysis Cluster History NCL 4: object3 + object5 (FREQ 2, Max Dist 2) NCL 3: object2 + object4 (FREQ 2, Max Dist 5) NCL 2: object1 + CL3 (FREQ 3, Max Dist 9) NCL 1: CL2 + CL4 (FREQ 5, Max Dist 11)

60 Dendrogram for Complete Link Clustering Method

61 Average Linkage The average linkage method proceeds similarly to the single and complete linkage methods, with the exception that the distance between two clusters is now represented by the average distance between all pairs of objects, one from each cluster In practice, the average linkage method will often produce results very similar to the complete linkage method In the interests of time, I will forgo the gory details and get right into the SAS code

62 Average Link Example in SAS title 'Average Link'; proc cluster data=dist method=average nonorm; id name; run; proc tree horizontal spaces=2; id name; run; Output: The CLUSTER Procedure Average Linkage Cluster Analysis Cluster History (NCL, Clusters Joined, FREQ, RMS Dist) NCL 4: object3 + object5 NCL 3: object2 + object4 NCL 2: object1 + CL4 NCL 1: CL2 + CL3
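As a cross-check of the merge order, here is a plain-average (UPGMA-style) sketch in Python on the inferred distance matrix; note that SAS reports RMS distances for method=average, so its printed distance column will differ from the plain averages below (names are my own):

```python
# Five-object distance matrix inferred from the earlier worked example.
D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10,
     (3, 4): 9, (3, 5): 2, (4, 5): 8}

def d(a, b):
    """Object-to-object distance, regardless of argument order."""
    return D[(a, b)] if (a, b) in D else D[(b, a)]

def average_linkage(objects):
    """Merge the pair of clusters with the smallest average of all
    between-cluster object distances (plain-average rule)."""
    clusters = [frozenset([o]) for o in objects]
    merges = []
    while len(clusters) > 1:
        link = lambda a, b: sum(d(x, y) for x in a for y in b) / (len(a) * len(b))
        a, b = min(((p, q) for i, p in enumerate(clusters) for q in clusters[i + 1:]),
                   key=lambda pair: link(*pair))
        merges.append((sorted(a | b), link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

for members, height in average_linkage([1, 2, 3, 4, 5]):
    print(members, round(height, 4))
# merge order: (35), (24), (135), (12345); final average distance 49/6
```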

63 Dendrogram for Average Link Clustering Method

64 Clustering Cutoffs In hierarchical clustering, the final step is to decide just what distance to use to define the final set of clusters Many people choose a spot based on one of two criteria: 1. There is a large gap between agglomeration steps 2. The clusters represent sets of objects that are consistent with preconceived theories To demonstrate, let's pull out the economic data from the cities of the world

65 City Data Example For this data set, we will use the average linkage method of agglomeration. This corresponds to SAS Example #2 in the code %distance(data=cities, id=city, options=nomiss, out=distcity, shape=square, method=euclid, var=work--salary); proc cluster data=distcity method=average; id city; run; proc tree horizontal spaces=2; id city; run;

66 City Data Example The CLUSTER Procedure Average Linkage Cluster Analysis Root-Mean-Square Distance Between Observations = Cluster History Norm T RMS i NCL --Clusters Joined--- FREQ Dist e 45 Caracas Singpore London Paris Milan Vienna Amsterda Brussels Lisbon Rio_de_J Mexico_C Nairobi Copenhag Madrid Geneva Zurich Bogota Kuala_Lu Frankfur Sydney Nicosia Seoul Dublin CL Chicago New_York Johannes CL

67 City Data Dendrogram

68 K-means Clustering Another procedure used to cluster objects is the k-means clustering method K-means initially (and randomly) assigns objects to a total of k clusters The method then goes through each object and reassigns the object to the cluster that has the smallest distance The mean (or centroid) of the cluster is then updated to reflect the new observation This process is repeated until very little change in the clusters occurs
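The steps above can be sketched as a batch (Lloyd's-algorithm) variant of k-means, which recomputes all centroids after each full pass of reassignments; the toy data points and names are my own:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Batch k-means (Lloyd's algorithm): assign every point to its
    nearest centroid, recompute centroids, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial assignment of seeds
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            # reassign each point to the cluster with the nearest centroid
            nearest = min(range(k),
                          key=lambda j: sum((pi - ci) ** 2
                                            for pi, ci in zip(p, centroids[j])))
            groups[nearest].append(p)
        # update each centroid to the mean of its current members
        new = [tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centroids[j]
               for j, g in enumerate(groups)]
        if new == centroids:  # little (here: no) change -> stop
            break
        centroids = new
    return centroids, groups

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, groups = kmeans(pts, 2)
print(sorted(len(g) for g in groups))  # [3, 3]: the two obvious blobs
```

Re-running with different seeds illustrates the sensitivity to starting values mentioned below; with well-separated blobs like these, all starts converge to the same solution, but on harder data they need not.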

69 K-means Clustering With k-means, the user does not have to choose a final distance cut-off value, as in hierarchical clustering K-means is sensitive to local optima: solutions will depend upon starting values There is a need to re-run this method multiple times to find the best solution

70 K-means Example (SAS Example #3) To do K-means clustering in SAS, the fastclus procedure is used. The guide for the fastclus procedure can be found at: astclus_sect016.htm We will attempt to use proc fastclus to cluster our city data set into three clusters

71 K-means Example *SAS Example #3 - K-means with Cities; proc fastclus data=cities maxc=3 maxiter=1000 out=clus; var work price salary; run; proc sort data=clus; by cluster; run; proc print data=clus; var city cluster; run;

72 1991 City Data 22:13 Saturday, November 27 The FASTCLUS Procedure Replace=FULL Radius=0 Maxclusters=3 Maxiter=1000 Converge=0.02 Initial Seeds (work, price, salary by cluster) Minimum Distance Between Initial Seeds Iteration History (criterion and relative change in cluster seeds by iteration) Convergence criterion is satisfied. Criterion Based on Final Seeds Cluster Summary (frequency, RMS std deviation, maximum distance from seed to observation, nearest cluster, and distance between cluster centroids, for each cluster)

73 Statistics for Variables (work, price, salary, OVER-ALL): Total STD, Within STD, R-Square, RSQ/(1-RSQ) Pseudo F Statistic The FASTCLUS Procedure Replace=FULL Radius=0 Maxclusters=3 Maxiter=1000 Converge=0.02 Approximate Expected Over-All R-Squared Cubic Clustering Criterion WARNING: The two values above are invalid for correlated variables. Cluster Means (work, price, salary by cluster) Cluster Standard Deviations (work, price, salary by cluster)

74 Obs city CLUSTER 1 Amsterda 1 2 Athens 1 3 Brussels 1 4 Copenhag 1 5 Dublin 1 6 Dusseldo 1 7 Frankfur 1 8 Helsinki 1 9 Lagos 1 10 Lisbon 1 11 London 1 12 Luxembou 1 13 Madrid 1 14 Milan 1 15 Montreal 1 16 Nicosia 1 17 Oslo 1 18 Paris 1 19 Rio_de_J 1 20 Seoul 1 21 Stockhol 1 22 Sydney 1 23 Vienna 1 24 Bombay 2 25 Buenos_A 2 26 Caracas 2 27 Chicago 2 28 Geneva 2 29 Houston 2 30 Johannes 2 31 Los_Ange 2 32 Mexico_C 2 33 Nairobi 2 34 New_York 2 35 Panama 2 36 Sao_Paul 2 37 Singpore 2 38 Tel_Aviv 2 39 Tokyo 2 40 Toronto 2 41 Zurich 2 42 Bogota 3 43 Hong_Kon 3 44 Kuala_Lu 3 45 Manila 3 46 Taipei 3

75 Multidimensional Scaling Up to this point, we have been working with methods for taking distances and then clustering objects But does using distances remind you of anything? Have you ever been in a really long car ride, and had nothing better to look at than an atlas? Did you ever notice that the back of the atlas provides a nice distance matrix to look at? Those distances come from the proximity of cities on the map

76 Multidimensional Scaling Multidimensional scaling is the process by which a map is drawn from a set of distances Before we get into the specifics of the procedure, let's begin with an example Let's take a state (that is a favorite of mine) that you may or may not be very familiar with Nevada

77 Multidimensional Scaling You may be familiar with several cities in NV Las Vegas Reno Carson City But have you heard of Lovelock? Fernley? Winnemucca?

78 Mapping Nevada From the State of Nevada website, I was able to download a mileage chart for the distance between cities in the Great State of Nevada Using this chart and MDS, we will now draw a map of the Silver State with the cities in their proper locations

79 Nevada from MDS vs. from a Map

80 How MDS Works Multidimensional scaling attempts to recreate the distances in the observed distance matrix by a set of coordinates in a smaller number of dimensions The algorithm works by selecting a set of values for the distances, then computing an index of how badly the distances fit, called Stress: Stress(q) = [ Σ_{i<k} ( d_ik^(q) − d̂_ik^(q) )² / Σ_{i<k} ( d_ik^(q) )² ]^{1/2}

81 How MDS Works (Continued) Based on an optimization procedure, new values for the distances are selected, and the new value for Stress is computed The algorithm exits when values of Stress do not change by large amounts Note that this procedure is also subject to local optima: multiple runs of the procedure may be necessary
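A sketch of the Stress computation under these definitions, normalizing (as one common convention) by the squared configuration distances; the names, coordinates, and observed distances are illustrative inventions of my own:

```python
from math import dist, sqrt

def stress(observed, coords):
    """Stress of a configuration: square root of the sum of squared
    differences between observed and configuration distances,
    divided by the sum of squared configuration distances."""
    n = len(coords)
    pairs = [(i, k) for i in range(n) for k in range(i + 1, n)]
    fitted = {p: dist(coords[p[0]], coords[p[1]]) for p in pairs}
    num = sum((observed[p] - fitted[p]) ** 2 for p in pairs)
    den = sum(f ** 2 for f in fitted.values())
    return sqrt(num / den)

# A 3-4-5 right triangle reproduces these observed distances exactly,
# so the configuration has zero Stress:
coords = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
observed = {(0, 1): 3.0, (0, 2): 4.0, (1, 2): 5.0}
print(stress(observed, coords))  # 0.0
```

An MDS optimizer is essentially a loop that perturbs the coordinates to reduce this quantity until it stops changing by much, which is why different starting configurations can end in different local optima.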

82 Types of MDS Although this is a gross simplification of the methods used for MDS, there are generally two techniques used to scale distances: Metric MDS uses distances (magnitudes) to obtain the solution Nonmetric MDS uses ordinal (rank) information about objects' distances to obtain the solution

83 MDS of Paired Comparison Data In graduate school, I once went through a phase where I was into NASCAR A friend of mine and I would always argue about whether or not NASCAR was a sport We attempted to settle the argument the old-fashioned way, by applying advanced quantitative methods to data sets

84 MDS of Sports Perceptions So, I set out to understand a bit more about how auto racing in general is perceived relative to other sports I created a questionnaire asking for people to judge the similarity of two sports at a time, across 21 different sports I had eight of my closest friends complete the survey (which was quite lengthy because of all of the pairs of sports needing to be compared)

85 MDS in SAS To do MDS in SAS, we will use proc mds The online user guide for proc mds can be found at: _toc.htm This is also SAS Example #5 in the example code file online

86 MDS of Sports Perceptions proc mds data=sports level=ordinal coef=diagonal dimension=2 pfinal out=out outres=res ; run; title1 'Plot of configuration'; %plotit(data=out(where=(_type_='config')), datatype=mds, labelvar=_name_, vtoh=1.75); %plotit(data=out(where=(_type_='config')), datatype=mds, plotvars=dim1 dim2, labelvar=_name_, vtoh=1.75); run;

87 Plot of configuration 00:39 Sunday, November 28 Multidimensional Scaling: Data=WORK.SPORTS.DATA Shape=TRIANGLE Condition=MATRIX Level=ORDINAL Coef=DIAGONAL Dimension=2 Formula=1 Fit=1 Mconverge=0.01 Gconverge=0.01 Maxiter=100 Over=2 Ridge= Iteration history: an initial step followed by alternating Monotone and Lev-Mar (Levenberg-Marquardt) iterations, with the badness-of-fit criterion and convergence measures reported at each step, ending with: Convergence criteria are satisfied.

88 Multidimensional Scaling: Data=WORK.SPORTS.DATA Shape=TRIANGLE Condition=MATRIX Level=ORDINAL Coef=DIAGONAL Dimension=2 Formula=1 Fit=1 Configuration (Dim1, Dim2 coordinates) for: Auto_Racing Baseball Basketball Biking Bowling Boxing Card_Games Chess Football Golf Hockey Martial_Arts Pool Skiing Soccer Swimming Tennis Track Volleyball Weightlifting Dimension Coefficients by _MATRIX_ Fit summary by _MATRIX_: number of nonmissing data, weight, badness-of-fit criterion, distance correlation, and uncorrected distance correlation

89 MDS Plot

90 More MDS As alluded to earlier, there are multiple methods for performing MDS There are ways of examining MDS for each individual (Individual Differences Scaling, or INDSCAL) There are ways of doing MDS on categorical data (Correspondence Analysis)

91 Wrapping Up Today we covered methods for clustering and displaying distances Again, we only scratched the surface of the methods presented today Tomorrow we will talk about more methods for clustering: mixture models Mixture models are model-based clustering methods: distributional assumptions are needed


More information

Cluster Analysis: Agglomerate Hierarchical Clustering

Cluster Analysis: Agglomerate Hierarchical Clustering Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical

More information

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

More information

DATA CLASSIFICATORY TECHNIQUES

DATA CLASSIFICATORY TECHNIQUES DATA CLASSIFICATORY TECHNIQUES AMRENDER KUMAR AND V.K.BHATIA Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 akjha@iasri.res.in 1. Introduction Rudimentary, exploratory

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Multivariate analysis - February 2006 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

The EMCLUS Procedure. The EMCLUS Procedure

The EMCLUS Procedure. The EMCLUS Procedure The EMCLUS Procedure Overview Procedure Syntax PROC EMCLUS Statement VAR Statement INITCLUS Statement Output from PROC EMCLUS EXAMPLES-SECTION Example 1: Syntax for PROC FASTCLUS Example 2: Use of the

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning A review of clustering and other exploratory data analysis methods HST.951J: Medical Decision Support Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Forestry Applied Multivariate Statistics. Cluster Analysis

Forestry Applied Multivariate Statistics. Cluster Analysis 1 Forestry 531 -- Applied Multivariate Statistics Cluster Analysis Purpose: To group similar entities together based on their attributes. Entities can be variables or observations. [illustration in Class]

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Week 7 Picturing Network. Vahe and Bethany

Week 7 Picturing Network. Vahe and Bethany Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Applied Clustering Techniques. Jing Dong

Applied Clustering Techniques. Jing Dong Applied Clustering Techniques Jing Dong Nov 31, 2016 What is cluster analysis? What is Cluster Analysis? Cluster: o Similar to one another within the same cluster o Dissimilar to the objects in other clusters

More information

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

INF4820, Algorithms for AI and NLP: Hierarchical Clustering INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

4. Ad-hoc I: Hierarchical clustering

4. Ad-hoc I: Hierarchical clustering 4. Ad-hoc I: Hierarchical clustering Hierarchical versus Flat Flat methods generate a single partition into k clusters. The number k of clusters has to be determined by the user ahead of time. Hierarchical

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

Chapter 6: Cluster Analysis

Chapter 6: Cluster Analysis Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Clustering algorithms 6CCS3WSN-7CCSMWAL

Clustering algorithms 6CCS3WSN-7CCSMWAL Clustering algorithms 6CCS3WSN-7CCSMWAL Contents Introduction: Types of clustering Hierarchical clustering Spatial clustering (k means etc) Community detection (next week) What are we trying to cluster

More information

CHAPTER THREE THE DISTANCE FUNCTION APPROACH

CHAPTER THREE THE DISTANCE FUNCTION APPROACH 50 CHAPTER THREE THE DISTANCE FUNCTION APPROACH 51 3.1 INTRODUCTION Poverty is a multi-dimensional phenomenon with several dimensions. Many dimensions are divided into several attributes. An example of

More information

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Hierarchical Clustering / Dendrograms

Hierarchical Clustering / Dendrograms Chapter 445 Hierarchical Clustering / Dendrograms Introduction The agglomerative hierarchical clustering algorithms available in this program module build a cluster hierarchy that is commonly displayed

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

Predict Outcomes and Reveal Relationships in Categorical Data

Predict Outcomes and Reveal Relationships in Categorical Data PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,

More information

11/2/2017 MIST.6060 Business Intelligence and Data Mining 1. Clustering. Two widely used distance metrics to measure the distance between two records

11/2/2017 MIST.6060 Business Intelligence and Data Mining 1. Clustering. Two widely used distance metrics to measure the distance between two records 11/2/2017 MIST.6060 Business Intelligence and Data Mining 1 An Example Clustering X 2 X 1 Objective of Clustering The objective of clustering is to group the data into clusters such that the records within

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Chapter 1. Using the Cluster Analysis. Background Information

Chapter 1. Using the Cluster Analysis. Background Information Chapter 1 Using the Cluster Analysis Background Information Cluster analysis is the name of a multivariate technique used to identify similar characteristics in a group of observations. In cluster analysis,

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Hierarchical Clustering

Hierarchical Clustering What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering

More information

Hierarchical Clustering Lecture 9

Hierarchical Clustering Lecture 9 Hierarchical Clustering Lecture 9 Marina Santini Acknowledgements Slides borrowed and adapted from: Data Mining by I. H. Witten, E. Frank and M. A. Hall 1 Lecture 9: Required Reading Witten et al. (2011:

More information

Getting to Know Your Data

Getting to Know Your Data Chapter 2 Getting to Know Your Data 2.1 Exercises 1. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Hierarchical clustering

Hierarchical clustering Hierarchical clustering Rebecca C. Steorts, Duke University STA 325, Chapter 10 ISL 1 / 63 Agenda K-means versus Hierarchical clustering Agglomerative vs divisive clustering Dendogram (tree) Hierarchical

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

2. Find the smallest element of the dissimilarity matrix. If this is D lm then fuse groups l and m.

2. Find the smallest element of the dissimilarity matrix. If this is D lm then fuse groups l and m. Cluster Analysis The main aim of cluster analysis is to find a group structure for all the cases in a sample of data such that all those which are in a particular group (cluster) are relatively similar

More information

STATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010

STATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010 STATS306B Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010 Outline K-means, K-medoids, EM algorithm choosing number of clusters: Gap test hierarchical clustering spectral

More information

Market basket analysis

Market basket analysis Market basket analysis Find joint values of the variables X = (X 1,..., X p ) that appear most frequently in the data base. It is most often applied to binary-valued data X j. In this context the observations

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

Hierarchical Clustering

Hierarchical Clustering Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

SAS/STAT 13.2 User s Guide. The MDS Procedure

SAS/STAT 13.2 User s Guide. The MDS Procedure SAS/STAT 13.2 User s Guide The MDS Procedure This document is an individual chapter from SAS/STAT 13.2 User s Guide. The correct bibliographic citation for the complete manual is as follows: SAS Institute

More information

Lecture 4 Hierarchical clustering

Lecture 4 Hierarchical clustering CSE : Unsupervised learning Spring 00 Lecture Hierarchical clustering. Multiple levels of granularity So far we ve talked about the k-center, k-means, and k-medoid problems, all of which involve pre-specifying

More information

Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC

Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC Clustering Idea Given a set of data can we find a natural grouping? Essential R commands: D =rnorm(12,0,1)

More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

Stat 321: Transposable Data Clustering

Stat 321: Transposable Data Clustering Stat 321: Transposable Data Clustering Art B. Owen Stanford Statistics Art B. Owen (Stanford Statistics) Clustering 1 / 27 Clustering Given n objects with d attributes, place them (the objects) into groups.

More information

Image Analysis - Lecture 5

Image Analysis - Lecture 5 Texture Segmentation Clustering Review Image Analysis - Lecture 5 Texture and Segmentation Magnus Oskarsson Lecture 5 Texture Segmentation Clustering Review Contents Texture Textons Filter Banks Gabor

More information

CS573 Data Privacy and Security. Li Xiong

CS573 Data Privacy and Security. Li Xiong CS573 Data Privacy and Security Anonymizationmethods Li Xiong Today Clustering based anonymization(cont) Permutation based anonymization Other privacy principles Microaggregation/Clustering Two steps:

More information

Homework # 4. Example: Age in years. Answer: Discrete, quantitative, ratio. a) Year that an event happened, e.g., 1917, 1950, 2000.

Homework # 4. Example: Age in years. Answer: Discrete, quantitative, ratio. a) Year that an event happened, e.g., 1917, 1950, 2000. Homework # 4 1. Attribute Types Classify the following attributes as binary, discrete, or continuous. Further classify the attributes as qualitative (nominal or ordinal) or quantitative (interval or ratio).

More information

Cluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008

Cluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008 Cluster Analysis Jia Li Department of Statistics Penn State University Summer School in Statistics for Astronomers IV June 9-1, 8 1 Clustering A basic tool in data mining/pattern recognition: Divide a

More information

SAS/STAT 12.3 User s Guide. The MDS Procedure (Chapter)

SAS/STAT 12.3 User s Guide. The MDS Procedure (Chapter) SAS/STAT 12.3 User s Guide The MDS Procedure (Chapter) This document is an individual chapter from SAS/STAT 12.3 User s Guide. The correct bibliographic citation for the complete manual is as follows:

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

Clustering Techniques

Clustering Techniques Clustering Techniques Bioinformatics: Issues and Algorithms CSE 308-408 Fall 2007 Lecture 16 Lopresti Fall 2007 Lecture 16-1 - Administrative notes Your final project / paper proposal is due on Friday,

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Chapter DM:II. II. Cluster Analysis

Chapter DM:II. II. Cluster Analysis Chapter DM:II II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-1

More information

What is Unsupervised Learning?

What is Unsupervised Learning? Clustering What is Unsupervised Learning? Unlike in supervised learning, in unsupervised learning, there are no labels We simply a search for patterns in the data Examples Clustering Density Estimation

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using

More information

CS7267 MACHINE LEARNING

CS7267 MACHINE LEARNING S7267 MAHINE LEARNING HIERARHIAL LUSTERING Ref: hengkai Li, Department of omputer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) Mingon Kang, Ph.D. omputer Science,

More information

Clustering Part 3. Hierarchical Clustering

Clustering Part 3. Hierarchical Clustering Clustering Part Dr Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Hierarchical Clustering Two main types: Agglomerative Start with the points

More information

Statistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975.

Statistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975. Statistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975. SPSS Statistics were designed INTRODUCTION TO SPSS Objective About the

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis 7 Supervised learning vs unsupervised learning Unsupervised Learning Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute These patterns are then

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.

More information

Summer School in Statistics for Astronomers & Physicists June 15-17, Cluster Analysis

Summer School in Statistics for Astronomers & Physicists June 15-17, Cluster Analysis Summer School in Statistics for Astronomers & Physicists June 15-17, 2005 Session on Computational Algorithms for Astrostatistics Cluster Analysis Max Buot Department of Statistics Carnegie-Mellon University

More information

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other

More information

Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

What is Cluster Analysis?

What is Cluster Analysis? Cluster Analysis What is Cluster Analysis? Finding groups of objects (data points) such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

Project and Production Management Prof. Arun Kanda Department of Mechanical Engineering Indian Institute of Technology, Delhi

Project and Production Management Prof. Arun Kanda Department of Mechanical Engineering Indian Institute of Technology, Delhi Project and Production Management Prof. Arun Kanda Department of Mechanical Engineering Indian Institute of Technology, Delhi Lecture - 8 Consistency and Redundancy in Project networks In today s lecture

More information

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 06 07 Department of CS - DM - UHD Road map Cluster Analysis: Basic

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

K-Means Clustering 3/3/17

K-Means Clustering 3/3/17 K-Means Clustering 3/3/17 Unsupervised Learning We have a collection of unlabeled data points. We want to find underlying structure in the data. Examples: Identify groups of similar data points. Clustering

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information