Distance Based Methods. Lecture 13 August 9, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2
2 Today's Lecture: Measures of distance (similarity and dissimilarity). Multivariate methods that use distances (dissimilarities) and similarities: hierarchical clustering algorithms, multidimensional scaling, and K-means clustering.
3 Clustering Definitions: Clustering objects (or observations) can provide detail regarding the nature and structure of data. Clustering is distinct from classification in terminology: classification pertains to a known number of groups, with the objective being to assign new observations to these groups. Cluster analysis, in contrast, is a technique in which no assumptions are made concerning the number of groups or the group structure.
4 Magnitude of the Problem: Clustering algorithms make use of measures of similarity (or, alternatively, dissimilarity) to define and group variables or observations. Clustering presents a host of technical problems. For a reasonably sized data set with n objects (either variables or individuals), the number of ways of grouping n objects into k groups is: S(n, k) = (1/k!) * sum_{j=0}^{k} (-1)^(k-j) * C(k, j) * j^n. For example, there are over four trillion ways that 25 objects can be clustered into 4 groups, so which solution is best?
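The count above can be checked directly. This is an illustrative Python sketch (the lecture itself uses SAS); it simply evaluates the slide's formula, the Stirling number of the second kind:

```python
from math import comb, factorial

def n_groupings(n, k):
    """Number of ways to partition n objects into k non-empty groups,
    via the slide's formula: (1/k!) * sum_{j=0}^{k} (-1)^(k-j) * C(k,j) * j^n
    (the Stirling number of the second kind)."""
    total = sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(k + 1))
    return total // factorial(k)  # the sum is always divisible by k!

# 25 objects into 4 groups: well over four trillion distinct clusterings
print(n_groupings(25, 4))
```

Searching that many partitions exhaustively is clearly hopeless, which is why the heuristics discussed next exist.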
5 Numerical Problems: In theory, one way to find the best solution is to try each possible grouping of all of the objects, an optimization process called integer programming. It is difficult, if not impossible, to do this given the state of today's computers (although computers are catching up with such problems). Rather than using such brute-force methods, a set of heuristics has been developed to allow fast clustering of objects into groups. Such methods are called heuristics because they do not guarantee that the solution will be optimal (best), only that it will be better than most.
6 Clustering Heuristic Inputs The inputs into clustering heuristics are in the form of measures of similarities or dissimilarities The result of the heuristic depends in large part on the measure of similarity/dissimilarity used by the procedure Because of this, we will start this lecture by examining measures of distance (used synonymously with similarity/dissimilarity)
7 Measures of Distance Care must be taken with choosing the metric by which similarity is quantified Important considerations include: The nature of the variables (e.g., discrete, continuous, binary) Scales of measurement (nominal, ordinal, interval, or ratio) The nature of the matter under study
8 Clustering Variables vs. Clustering Observations When variables are to be clustered, oft used measures of similarity include correlation coefficients (or similar measures for noncontinuous variables) When observations are to be clustered, distance metrics are often used
9 Euclidean Distance: Euclidean distance is a frequent choice of distance metric: d(x, y) = sqrt( (x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_p - y_p)^2 ). This distance is often chosen because it represents an understandable metric, but sometimes it may not be the best choice.
10 Euclidean Distance? Imagine I wanted to know how many miles it was from my old house in Sacramento to Lombard Street in San Francisco Knowing how far it was on a straight line would not do me too much good, particularly with the number of one-way streets that exist in San Francisco
11 Other Distance Metrics: Other popular distance metrics include the Minkowski metric: d(x, y) = [ sum_{i=1}^{p} |x_i - y_i|^m ]^(1/m). The key to this metric is the choice of m: m = 2 gives the Euclidean distance; m = 1 gives the city-block distance.
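A minimal sketch of the Minkowski metric (Python, illustrative only; the function name is ours):

```python
def minkowski(x, y, m):
    """Minkowski distance: the m-th root of the summed m-th powers of
    the coordinate differences. m = 2 is Euclidean; m = 1 is city-block."""
    return sum(abs(xi - yi) ** m for xi, yi in zip(x, y)) ** (1 / m)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 2))  # Euclidean distance: 5.0
print(minkowski(x, y, 1))  # city-block distance: 7.0
```

The city-block value (7.0) is larger than the straight-line value (5.0), which is exactly the Sacramento-to-San-Francisco point made above: driving distance along a street grid exceeds the straight-line distance.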
12 Preferred Distance Properties: It is often desirable to use a distance metric that meets the following properties: d(P,Q) = d(Q,P); d(P,Q) > 0 if P ≠ Q; d(P,Q) = 0 if P = Q; d(P,Q) ≤ d(P,R) + d(R,Q). The Euclidean and Minkowski metrics satisfy these properties.
13 Triangle Inequality: The fourth property is called the triangle inequality, which often gets violated by non-routine measures of distance. The inequality can be shown with a triangle whose vertices are P, Q, and R (with the sides representing Euclidean distances): the direct side is never longer than the two-leg path, d(P,Q) ≤ d(P,R) + d(R,Q).
14 Binary Variables: In the case of binary-valued variables (variables that have a 0/1 coding), many other distance metrics may be defined. The squared Euclidean distance provides a count of the number of mismatched coordinates (in the slide's example, two coordinates differ, so d(i,k) = 2). This count is sometimes called the Hamming distance.
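For 0/1 data the mismatch count is easy to compute. A hypothetical example with two mismatched coordinates (the binary profiles below are made up, since the slide's table did not survive extraction):

```python
def hamming(u, v):
    """Count mismatched coordinates between two binary vectors;
    this equals the squared Euclidean distance for 0/1 data."""
    return sum(a != b for a, b in zip(u, v))

# two made-up binary profiles that disagree in the 2nd and 4th positions
print(hamming((1, 0, 1, 1, 0), (1, 1, 1, 0, 0)))  # 2 mismatches
```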
15 Other Binary Distance Measures There are a number of other ways to define the distance between a set of binary variables Most of these measures reflect the varied importance placed on differing cells in a 2 x 2 table
16 General Distance Measure Properties: Distance measures that produce the same (monotone) ordering of the object distances will yield identical results from clustering heuristics such as single and complete linkage. Many times this will only be an issue if the distance measure is for binary variables or the distance measure does not satisfy the triangle inequality.
17 Hierarchical Clustering Methods Because of the large size of possible clustering solutions, searching through all combinations is not feasible Hierarchical clustering techniques proceed by taking a set of objects and grouping a set at a time Two types of hierarchical clustering methods exist: Agglomerative hierarchical methods Divisive hierarchical methods
18 Agglomerative Clustering Methods: Agglomerative clustering methods start first with the individual objects. Initially, each object is its own cluster. The most similar objects are then grouped together into a single cluster (with two objects). We will find that what we mean by similar changes depending on the method.
19 Agglomerative Clustering Methods: The remaining steps involve merging the clusters according to the similarity or dissimilarity of the objects within the cluster to those outside of the cluster. The method concludes when all objects are part of a single cluster.
20 Divisive Clustering Methods: Divisive hierarchical methods work in the opposite direction, beginning with a single, n-object cluster. The large cluster is then divided into two subgroups such that the objects in opposing groups are relatively distant from each other.
21 Divisive Clustering Methods: The process continues similarly until there are as many clusters as there are objects. While divisive clustering is briefly discussed in the chapter, we will not discuss it in detail in this class.
22 To Summarize: So you can see that we have this idea of steps. At each step, either two clusters combine to form one (agglomerative), or a cluster is divided into two new clusters (divisive).
23 Methods for Viewing Clusters As you could imagine, when we consider the methods for hierarchical clustering, there are a large number of clusters that are formed sequentially One of the most frequently used tools to view the clusters (and level at which they were formed) is the dendrogram A dendrogram is a graph that describes the differing hierarchical clusters, and the distance at which each is formed
24 Example Data Set #1: To demonstrate several of the hierarchical clustering methods, an example data set is used. The data come from a 1991 study by the economic research department of the Union Bank of Switzerland, representing economic conditions in 48 cities around the world. Three variables were collected: average working hours for 12 occupations; price of 112 goods and services, excluding rent; index of net hourly earnings in 12 occupations.
25 The Cities: Example of a Dendrogram. Here, Amsterdam and Brussels were combined to form a group. This is an example of agglomerative clustering, where cities start off in their own group and are then combined. Where the lines connect is where those two previous groups were joined. Notice that we do not specify the number of groups; if we know how many we want, we simply go to the step where there are that many groups.
26 Similarity? So, we mentioned that the most similar objects are grouped together into a single cluster (with two objects). The next question is: how do we measure similarity between clusters? More specifically, how do we redefine it when a cluster contains a combination of old clusters? We find that there are several ways to define similar, and each way defines a new method of clustering.
27 Agglomerative Methods: Next we discuss several different ways to complete agglomerative hierarchical clustering: single linkage, complete linkage, average linkage, centroid, median, and Ward's method.
28 Agglomerative Methods In describing these I will give the definition of how we define similar when clusters are combined. And for the first two I will give detailed examples.
29 Example Distance Matrix The example will be based on the distance matrix below
30 Single Linkage: The single linkage method of clustering involves combining clusters by finding the nearest neighbor: the cluster closest to any given observation within the current cluster. Notice that this is equal to the minimum distance between any observation in cluster 1 and any observation in cluster 3.
31 Single Linkage: So the distance between any two clusters is D(A, B) = min d(y_i, y_j) for all y_i in A and y_j in B. Notice that any other distance is longer. So how would we do this using our distance matrix?
32 Single Linkage Example: The first step in the process is to determine the two elements with the smallest distance and combine them into a single cluster. Here, the two objects that are most similar are objects 3 and 5. We will now combine these into a new cluster and compute the distance from that cluster to the remaining clusters (objects) via the single linkage rule.
33 Single Linkage Example: The shaded rows/columns are the portions of the table with the distances of objects 3 and 5 from each remaining object. Our rule says that the distance of our new cluster (35) from each remaining object is equal to the minimum of the two values. In equation form, our new distances are: d(35),1 = min{d31, d51} = min{3, 11} = 3; d(35),2 = min{d32, d52} = min{7, 10} = 7; d(35),4 = min{d34, d54} = min{9, 8} = 8.
34 Single Linkage Example: Using the distance values, we now consolidate our table so that (35) is a single row/column. The distance from the (35) cluster to the remaining objects is: d(35),1 = min{d31, d51} = min{3, 11} = 3; d(35),2 = min{d32, d52} = min{7, 10} = 7; d(35),4 = min{d34, d54} = min{9, 8} = 8.
35 Single Linkage Example: We now repeat the process by finding the smallest distance within the set of remaining clusters. The smallest distance is between object 1 and cluster (35). Therefore, object 1 joins cluster (35), creating cluster (135). The distance from cluster (135) to the other clusters is then computed: d(135),2 = min{d(35),2, d12} = min{7, 9} = 7; d(135),4 = min{d(35),4, d14} = min{8, 6} = 6.
36 Single Linkage Example: Using the distance values, we now consolidate our table so that (135) is a single row/column. The distance from the (135) cluster to the remaining objects is: d(135),2 = min{d(35),2, d12} = min{7, 9} = 7; d(135),4 = min{d(35),4, d14} = min{8, 6} = 6.
37 Single Linkage Example: We now repeat the process by finding the smallest distance within the set of remaining clusters. The smallest distance is between object 2 and object 4. These two objects are joined to form cluster (24). The distance from (24) to (135) is then computed: d(135),(24) = min{d(135),2, d(135),4} = min{7, 6} = 6. The final cluster (12345) is formed at a distance of 6.
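The whole single-linkage run can be reproduced with a small agglomerative routine. This is an illustrative Python sketch, not the lecture's SAS code; note the entry d(2,4) = 5 is inferred from the merge order described above, since that cell of the distance matrix is garbled in the transcription:

```python
def agglomerate(dist, agg=min):
    """Agglomerative clustering over a dict mapping frozenset pairs of
    object labels to distances. agg=min gives single linkage. Returns
    the merge history as (members_a, members_b, height) tuples."""
    clusters = {frozenset([o]) for pair in dist for o in pair}

    def between(a, b):
        # linkage distance between clusters a and b
        return agg(dist[frozenset([i, j])] for i in a for j in b)

    history = []
    while len(clusters) > 1:
        a, b = min(((x, y) for x in clusters for y in clusters if x != y),
                   key=lambda p: between(*p))
        history.append((sorted(a), sorted(b), between(a, b)))
        clusters = (clusters - {a, b}) | {a | b}
    return history

# the five-object distance matrix from the example
d = {frozenset(p): v for p, v in [
    ((1, 2), 9), ((1, 3), 3), ((1, 4), 6), ((1, 5), 11), ((2, 3), 7),
    ((2, 4), 5), ((2, 5), 10), ((3, 4), 9), ((3, 5), 2), ((4, 5), 8)]}
for step in agglomerate(d):
    print(step)  # merges occur at heights 2, 3, 5, 6, as in the slides
```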
38 The Dendrogram: For example, here is where 3 and 5 joined.
39 Complete Linkage: The complete linkage method of clustering involves combining clusters by finding the farthest neighbor: the cluster farthest from any given observation within the current cluster. This ensures that all objects in a cluster are within some maximum distance of each other.
40 Complete Linkage: So the distance between any two clusters is D(A, B) = max d(y_i, y_j) for all y_i in A and y_j in B. Notice that any other distance is shorter. So how would we do this using our distance matrix?
41 Complete Linkage Example: To demonstrate complete linkage in action, consider the five-object distance matrix. The first step in the process is to determine the two elements with the smallest distance and combine them into a single cluster. Here, the two objects that are most similar are objects 3 and 5. We will now combine these into a new cluster and compute the distance from that cluster to the remaining clusters (objects) via the complete linkage rule.
42 Complete Linkage Example: The shaded rows/columns are the portions of the table with the distances from an object to object 3 or object 5. The distance from the (35) cluster to the remaining objects is: d(35),1 = max{d31, d51} = max{3, 11} = 11; d(35),2 = max{d32, d52} = max{7, 10} = 10; d(35),4 = max{d34, d54} = max{9, 8} = 9.
43 Complete Linkage Example: Using the distance values, we now consolidate our table so that (35) is a single row/column. The distance from the (35) cluster to the remaining objects is: d(35),1 = max{d31, d51} = max{3, 11} = 11; d(35),2 = max{d32, d52} = max{7, 10} = 10; d(35),4 = max{d34, d54} = max{9, 8} = 9.
44 Complete Linkage Example: We now repeat the process by finding the smallest distance within the set of remaining clusters. The smallest distance is between object 2 and object 4; therefore, they form cluster (24). The distance from cluster (24) to the other clusters is then computed: d(24),(35) = max{d2,(35), d4,(35)} = max{10, 9} = 10; d(24),1 = max{d21, d41} = max{9, 6} = 9.
45 Complete Linkage Example: Using the distance values, we now consolidate our table so that (24) is a single row/column. The distance from the (24) cluster to the remaining clusters is: d(24),(35) = max{d2,(35), d4,(35)} = max{10, 9} = 10; d(24),1 = max{d21, d41} = max{9, 6} = 9.
46 Complete Linkage Example: We now repeat the process by finding the smallest distance within the set of remaining clusters. The smallest distance is between cluster (24) and object 1. These are joined to form cluster (124). The distance from (124) to (35) is then computed: d(35),(124) = max{d1,(35), d(24),(35)} = max{11, 10} = 11. The final cluster (12345) is formed at a distance of 11.
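The complete-linkage run follows the same recipe with max in place of min. Again an illustrative Python sketch (with d(2,4) = 5 inferred from the merge order, since that entry is garbled in the transcribed matrix):

```python
def complete_linkage_heights(dist):
    """Agglomerative complete linkage: the distance between two clusters
    is the MAXIMUM pairwise distance across them. Returns merge heights."""
    clusters = {frozenset([o]) for pair in dist for o in pair}

    def between(a, b):
        return max(dist[frozenset([i, j])] for i in a for j in b)

    heights = []
    while len(clusters) > 1:
        a, b = min(((x, y) for x in clusters for y in clusters if x != y),
                   key=lambda p: between(*p))
        heights.append(between(a, b))
        clusters = (clusters - {a, b}) | {a | b}
    return heights

# the same five-object distance matrix used in the worked example
d = {frozenset(p): v for p, v in [
    ((1, 2), 9), ((1, 3), 3), ((1, 4), 6), ((1, 5), 11), ((2, 3), 7),
    ((2, 4), 5), ((2, 5), 10), ((3, 4), 9), ((3, 5), 2), ((4, 5), 8)]}
print(complete_linkage_heights(d))  # [2, 5, 9, 11]
```

Note how the merge order differs from single linkage: objects 2 and 4 join before object 1 does, and the final merge happens at 11 rather than 6.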
47 The Dendrogram
48 Average Linkage: The average linkage method proceeds similarly to the single and complete linkage methods, except that at each agglomeration step the distance between two clusters is represented by the average of the distances between all pairs of objects, one from each cluster. In practice, the average linkage method will often produce results very similar to the complete linkage method.
49 So, to summarize, we have explained three of the methods for combining groups. Notice that once objects are in the same group, they cannot be separated. Also, as before with factor analysis, the choice of method is largely up to you.
50 Example Distance Matrix To further demonstrate the agglomerative methods, imagine we had the following distance matrix for five objects:
51 Linkage Methods for Agglomerative Clustering We will discuss three techniques for agglomerative clustering: 1. Single linkage 2. Complete linkage 3. Average linkage Each of these techniques provides an assignment rule that determines when an observation becomes part of a cluster
52 Cluster Analysis in SAS To perform hierarchical clustering in SAS, proc cluster is used Prior to using proc cluster, the data must be converted from the raw observations into a distance matrix SAS provides a built-in macro to create distance matrices that are then used as input into the proc cluster routine To produce dendrograms in SAS, proc tree is used
53 Distance Macro Prior to specifying the distance macro, a pair of %include statements must be placed (preferably at the beginning of the syntax file): %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\xmacro.sas'; %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\distnew.sas'; The generic syntax for the distance-creating macro is to use %distance() all arguments are optional %distance(data=cities, id=city, options=nomiss, out=distcity, shape=square, method=euclid, var=work--salary); For more information on the distance macro, go to:
54 Distance Macro Example %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\xmacro.sas'; %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\distnew.sas'; title '1991 City Data'; data cities; input city $ work price salary; cards; Amsterdam Athens ; %distance(data=cities, id=city, options=nomiss, out=distcity, shape=square, method=euclid, var=work--salary);
55 Proc Cluster The proc cluster guide can be found at: ter_toc.htm The generic syntax is: proc cluster data=dataset method=single; id idvariable; run;
56 Same Example in SAS *SAS Example #1; %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\xmacro.sas'; %include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\distnew.sas'; title 'small distance matrix'; data dist(type=distance); input object1-object5 name $; cards; object object object object object5 ; run; The CLUSTER Procedure Single Linkage Cluster Analysis Mean Distance Between Observations = 7 Cluster History Norm T Min i NCL --Clusters Joined--- FREQ Dist e 4 object3 object object1 CL object2 object CL3 CL title 'Single Link'; proc cluster data=dist method=single nonorm; id name; run; proc tree horizontal spaces=2; id name; run;
57 Dendrogram for Single Link Clustering Method
58 Complete Linkage Example: The shaded rows/columns are the portions of the table with the distances from an object to object 3 or object 5. The distance from the (35) cluster to the remaining objects is: d(35),1 = max{d31, d51} = max{3, 11} = 11; d(35),2 = max{d32, d52} = max{7, 10} = 10; d(35),4 = max{d34, d54} = max{9, 8} = 9.
59 Same Example in SAS title 'Complete Link'; proc cluster data=dist method=complete nonorm; id name; run; proc tree horizontal spaces=2; id name; run; The CLUSTER Procedure Complete Linkage Cluster Analysis Cluster History T Max i NCL --Clusters Joined--- FREQ Dist e 4 object3 object object2 object object1 CL CL2 CL4 5 11
60 Dendrogram for Complete Link Clustering Method
61 Average Linkage: The average linkage method proceeds similarly to the single and complete linkage methods, except that the distance between two clusters is now the average of the distances between all pairs of objects, one from each cluster. In practice, the average linkage method will often produce results very similar to the complete linkage method. In the interests of time, I will forgo the gory details and get right into the SAS code.
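Before the SAS run, here is the same idea sketched in Python, taking the between-cluster distance as the plain average of all cross-cluster pairwise distances (a simplification: SAS's average method reports RMS distances in its output, so the printed heights illustrate the merge order rather than SAS's exact values; d(2,4) = 5 is inferred from the earlier worked example):

```python
from statistics import fmean

def average_linkage_heights(dist):
    """Agglomerative average linkage: the distance between two clusters
    is the mean of all pairwise distances across them."""
    clusters = {frozenset([o]) for pair in dist for o in pair}

    def between(a, b):
        return fmean(dist[frozenset([i, j])] for i in a for j in b)

    heights = []
    while len(clusters) > 1:
        a, b = min(((x, y) for x in clusters for y in clusters if x != y),
                   key=lambda p: between(*p))
        heights.append(between(a, b))
        clusters = (clusters - {a, b}) | {a | b}
    return heights

# the five-object distance matrix from the earlier worked example
d = {frozenset(p): v for p, v in [
    ((1, 2), 9), ((1, 3), 3), ((1, 4), 6), ((1, 5), 11), ((2, 3), 7),
    ((2, 4), 5), ((2, 5), 10), ((3, 4), 9), ((3, 5), 2), ((4, 5), 8)]}
print(average_linkage_heights(d))
```

The merge order here, (3,5), then (2,4), then object 1 joining (3,5), matches the cluster history in the SAS average-linkage output.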
62 Average Link Example in SAS title 'Average Link'; proc cluster data=dist method=average nonorm; id name; run; proc tree horizontal spaces=2; id name; run; The CLUSTER Procedure Average Linkage Cluster Analysis Cluster History T RMS i NCL --Clusters Joined--- FREQ Dist e 4 object3 object object2 object object1 CL CL2 CL
63 Dendrogram for Average Link Clustering Method
64 Clustering Cutoffs: In hierarchical clustering, the final decision is what distance cutoff to use to define the final set of clusters. Many people choose a spot based on one of two criteria: 1. There is a large gap between agglomeration steps. 2. The clusters represent sets of objects that are consistent with preconceived theories. To demonstrate, let's pull out the economic data from the cities of the world.
65 City Data Example For this data set, we will use the average linkage method of agglomeration. This corresponds to SAS Example #2 in the code %distance(data=cities, id=city, options=nomiss, out=distcity, shape=square, method=euclid, var=work--salary); proc cluster data=distcity method=average; id city; run; proc tree horizontal spaces=2; id city; run;
66 City Data Example The CLUSTER Procedure Average Linkage Cluster Analysis Root-Mean-Square Distance Between Observations = Cluster History Norm T RMS i NCL --Clusters Joined--- FREQ Dist e 45 Caracas Singpore London Paris Milan Vienna Amsterda Brussels Lisbon Rio_de_J Mexico_C Nairobi Copenhag Madrid Geneva Zurich Bogota Kuala_Lu Frankfur Sydney Nicosia Seoul Dublin CL Chicago New_York Johannes CL
67 City Data Dendrogram
68 K-means Clustering Another procedure used to cluster objects is the k-means clustering method K-means initially (and randomly) assigns objects to a total of k clusters The method then goes through each object and reassigns the object to the cluster that has the smallest distance The mean (or centroid) of the cluster is then updated to reflect the new observation This process is repeated until very little change in the clusters occurs
69 K-means Clustering: K-means frees the user from choosing the distance cutoff required in hierarchical clustering. However, K-means is sensitive to local optima: solutions depend upon the starting values, so there is a need to re-run the method multiple times to find the best solution.
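The algorithm described on the previous slide can be sketched in a few lines of Python (Lloyd's algorithm; the data points below are made up to show two obvious clusters):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """A minimal Lloyd's-algorithm sketch: pick k random points as seeds,
    then alternate (1) assigning each point to its nearest centroid and
    (2) moving each centroid to the mean of its members, until stable."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))
    labels = None
    for _ in range(max_iter):
        new = [min(range(k),
                   key=lambda c: sum((p - m) ** 2
                                     for p, m in zip(pt, centroids[c])))
               for pt in points]
        if new == labels:          # no point changed cluster: converged
            break
        labels = new
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:            # keep the old seed if a cluster empties
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return centroids, labels

# two well-separated made-up blobs; a single run recovers them
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, labels = kmeans(pts, 2)
print(labels)
```

Because the result depends on the random seeds, in practice one re-runs k-means with several seeds and keeps the solution with the smallest within-cluster sum of squares, which is the local-optima caveat from the slide.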
70 K-means Example (SAS Example #3) To do K-means clustering in SAS, the fastclus procedure is used. The guide for the fastclus procedure can be found at: astclus_sect016.htm We will attempt to use proc fastclus to cluster our city data set into three clusters
71 K-means Example *SAS Example #3 - K-means with Cities; proc fastclus data=cities maxc=3 maxiter=1000 out=clus; var work price salary; run; proc sort data=clus; by cluster; run; proc print data=clus; var city cluster; run;
72 1991 City Data (22:13 Saturday, November 27). The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=3 Maxiter=1000 Converge=0.02. The printed output (numeric values lost in transcription) shows: the initial seeds (work, price, salary) for each of the three clusters; the minimum distance between initial seeds; the iteration history of the relative change in cluster seeds, ending with "Convergence criterion is satisfied."; the criterion based on final seeds; and a cluster summary giving, for each cluster, the frequency, RMS standard deviation, maximum distance from seed to observation, nearest cluster, and distance between cluster centroids.
73 Statistics for Variables: for work, price, salary, and over-all, the output lists Total STD, Within STD, R-Square, and RSQ/(1-RSQ), followed by the Pseudo F Statistic, the Approximate Expected Over-All R-Squared, and the Cubic Clustering Criterion. WARNING: The two values above are invalid for correlated variables. The output ends with tables of cluster means and cluster standard deviations for work, price, and salary.
74 1 Amsterda 1 2 Athens 1 3 Brussels 1 4 Copenhag 1 5 Dublin 1 6 Dusseldo 1 7 Frankfur 1 8 Helsinki 1 9 Lagos 1 10 Lisbon 1 11 London 1 12 Luxembou 1 13 Madrid 1 14 Milan 1 15 Montreal 1 16 Nicosia 1 17 Oslo 1 18 Paris 1 19 Rio_de_J 1 20 Seoul 1 21 Stockhol 1 22 Sydney 1 23 Vienna 1 24 Bombay 2 25 Buenos_A 2 26 Caracas 2 27 Chicago 2 28 Geneva 2 29 Houston 2 30 Johannes 2 31 Los_Ange 2 32 Mexico_C 2 33 Nairobi 2 34 New_York 2 35 Panama 2 36 Sao_Paul 2 37 Singpore 2 38 Tel_Aviv 2 39 Tokyo 2 40 Toronto 2 41 Zurich 2 42 Bogota 3 43 Hong_Kon 3 44 Kuala_Lu 3 45 Manila 3 46 Taipei 3 Obs city CLUSTER
75 Multidimensional Scaling Up to this point, we have been working with methods for taking distances and then clustering objects But does using distances remind you of anything? Have you ever been in a really long car ride, and had nothing better to look at than an atlas? Did you ever notice that the back of the atlas provides a nice distance matrix to look at? Those distances come from the proximity of cities on the map
76 Multidimensional Scaling: Multidimensional scaling is the process by which a map is drawn from a set of distances. Before we get into the specifics of the procedure, let's begin with an example. Let's take a state (a favorite of mine) that you may or may not be very familiar with: Nevada.
77 Multidimensional Scaling You may be familiar with several cities in NV Las Vegas Reno Carson City But have you heard of Lovelock? Fernley? Winnemucca?
78 Mapping Nevada From the State of Nevada website, I was able to download a mileage chart for the distance between cities in the Great State of Nevada Using this chart and MDS, we will now draw a map of the Silver State with the cities in their proper locations
79 Nevada from MDS vs. from a Map
80 How MDS Works: Multidimensional scaling attempts to recreate the distances in the observed distance matrix with a set of coordinates in a smaller number of dimensions. The algorithm works by selecting a set of fitted distances, then computing an index of how badly the distances fit, called Stress: Stress(q) = [ sum_{i<k} ( d_ik^(q) - dhat_ik^(q) )^2 / sum_{i<k} ( d_ik^(q) )^2 ]^(1/2), where dhat_ik^(q) are the distances reproduced by the q-dimensional configuration.
81 How MDS Works (Continued): Based on an optimization procedure, new values for the fitted distances are selected, and the new value of Stress is computed. The algorithm exits when the value of Stress stops changing by large amounts. Note that this procedure is also subject to local optima: multiple runs of the procedure may be necessary.
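As a concrete illustration of the Stress index (a Python sketch; the distance values below are made up):

```python
def stress(d_obs, d_fit):
    """Stress = sqrt( sum (d_ik - dhat_ik)^2 / sum d_ik^2 ), with the
    sums running over the distinct pairs (i, k)."""
    num = sum((o - f) ** 2 for o, f in zip(d_obs, d_fit))
    den = sum(o ** 2 for o in d_obs)
    return (num / den) ** 0.5

observed = [3.0, 4.0, 5.0, 6.0]  # observed pairwise distances (made up)
fitted = [3.0, 4.1, 4.8, 6.0]    # distances reproduced by a configuration
print(stress(observed, fitted))
```

A perfect configuration reproduces every distance exactly and has Stress 0; the optimizer keeps adjusting the configuration until Stress stops changing appreciably.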
82 Types of MDS: Although this is a gross simplification of the methods used for MDS, there are generally two techniques used to scale distances: metric MDS uses the distances (magnitudes) themselves to obtain a solution; nonmetric MDS uses only the ordinal (rank) information in the objects' distances.
83 MDS of Paired Comparison Data In graduate school, I once went through a phase where I was into NASCAR A friend of mine and I would always argue about whether or not NASCAR was a sport We attempted to settle the argument the old-fashioned way, by applying advanced quantitative methods to data sets
84 MDS of Sports Perceptions So, I set out to understand a bit more about how auto racing in general is perceived relative to other sports I created a questionnaire asking for people to judge the similarity of two sports at a time, across 21 different sports I had eight of my closest friends complete the survey (which was quite lengthy because of all of the pairs of sports needing to be compared)
85 MDS in SAS To do MDS in SAS, we will use proc mds The online user guide for proc mds can be found at: _toc.htm This is also SAS Example #5 in the example code file online
86 MDS of Sports Perceptions proc mds data=sports level=ordinal coef=diagonal dimension=2 pfinal out=out outres=res ; run; title1 'Plot of configuration'; %plotit(data=out(where=(_type_='config')), datatype=mds, labelvar=_name_, vtoh=1.75); %plotit(data=out(where=(_type_='config')), datatype=mds, plotvars=dim1 dim2, labelvar=_name_, vtoh=1.75); run;
87 Plot of configuration (00:39 Sunday, November 28). Multidimensional Scaling: Data=WORK.SPORTS.DATA Shape=TRIANGLE Condition=MATRIX Level=ORDINAL Coef=DIAGONAL Dimension=2 Formula=1 Fit=1 Mconverge=0.01 Gconverge=0.01 Maxiter=100 Over=2 Ridge=. The output lists the iteration history (iteration type, badness-of-fit criterion, change in criterion, and the monotone and gradient convergence measures) through roughly 80 Monotone and Levenberg-Marquardt iterations, ending with: Convergence criteria are satisfied.
88 Multidimensional Scaling: Data=WORK.SPORTS.DATA Shape=TRIANGLE Condition=MATRIX Level=ORDINAL Coef=DIAGONAL Dimension=2 Formula=1 Fit=1. The Configuration table gives Dim1 and Dim2 coordinates for each sport: Auto_Racing, Baseball, Basketball, Biking, Bowling, Boxing, Card_Games, Chess, Football, Golf, Hockey, Martial_Arts, Pool, Skiing, Soccer, Swimming, Tennis, Track, Volleyball, and Weightlifting. The output also reports the dimension coefficients and, for the matrix, the number of nonmissing data, the weight, the badness-of-fit criterion, and the distance and uncorrected distance correlations.
89 MDS Plot
90 More MDS As alluded to earlier, there are multiple methods for performing MDS There are ways of examining MDS for each individual (Individual Differences Scaling INDSCAL) There are ways of doing MDS on categorical data (Correspondence Analysis)
91 Wrapping Up: Today we covered methods for clustering and displaying distances. Again, we only scratched the surface of the methods presented today. Tomorrow we will talk about more methods for clustering: mixture models. Mixture models are model-based clustering methods; distributional assumptions are needed.
More information