Cluster Analysis

Prof. Thomas B. Fomby
Department of Economics, Southern Methodist University, Dallas, TX
April 2008 (revised April 2010)


Cluster Analysis, sometimes called data segmentation or customer segmentation, is an unsupervised learning method. As you will recall, a method is an unsupervised learning method if it doesn't involve prediction or classification. The major purpose of cluster analysis is to group collections of objects (e.g., customers) into clusters so that the objects within each cluster are similar to one another. One reason a company might want to organize its customers into groups is to better understand the nature of its customers. Given the delineation of its customers into distinct groups, the company could advertise differently to each group, send different catalogues to each group, and the like.

In terms of building prediction and classification models, cluster analysis can help the analyst identify groups of observations, defined in terms of the input variables, that in turn can lead to a separate model for each group. This assumes, of course, that the relationships between the output and the input variables are not the same across the groups. One can always test the poolability of the models, either by conventional hypothesis tests when considering econometric models, or by accuracy measures across validation and test data partitions when considering machine learning models.

As one comes to understand after working on several clustering projects, clustering is an art form and must be practiced with care. The more experience you have in doing cluster analysis, the better you become as a practitioner.

Before beginning cluster analysis it is often recommended that the data be normalized first. Cluster analysis based on variables with very different scales of measurement can lead to clusters that are not very robust to adding or deleting variables or observations. In this discussion we focus on clustering only continuous input variables; the clustering of mixed data, some continuous and some categorical, is beyond the scope of this discussion.

Now let us begin. There are two basic approaches to clustering:

a) Hierarchical clustering (the agglomerative variety is discussed here)
b) Non-hierarchical clustering (K-means)
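To make the normalization recommendation concrete, here is a minimal sketch of z-score standardization in Python. The original notes work in XLMiner, so this is only an illustration; the data values and number of variables below are hypothetical.

```python
import numpy as np

def zscore(X):
    """Standardize each column: subtract the sample mean, divide by the sample std."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical cases measured on three variables with very different scales.
X = np.array([[1200.0, 3.1, 0.02],
              [ 950.0, 2.8, 0.10],
              [1430.0, 4.0, 0.07],
              [ 870.0, 2.5, 0.01],
              [1100.0, 3.3, 0.05]])
Z = zscore(X)
print(Z.mean(axis=0).round(6))         # each column now has mean ~0
print(Z.std(axis=0, ddof=1).round(6))  # and sample standard deviation 1
```

Computing distances on Z rather than X keeps any one variable's scale from dominating the clustering.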

Hierarchical Clustering

With respect to hierarchical clustering, the final clusters chosen are built in a series of steps. If we start with N objects, each in its own separate cluster, then combine one cluster with another to obtain N − 1 clusters, and continue to combine clusters into fewer and fewer clusters with more and more objects in each, we are engaging in agglomerative clustering. In contrast, if we start with all of the objects in a single cluster, then remove one of the objects to form a second cluster, and continue to build more and more clusters with fewer and fewer objects in each until every object is in its own cluster, we are engaging in divisive clustering. The distinction between these two hierarchical methods is represented in the figure below, taken from the XLMINER help file.

Figure 1: Hierarchical Clustering: Agglomerative versus Divisive Methods

The above figure is called a dendrogram and represents the fusions or divisions made at each successive stage of the analysis. More formally, a dendrogram is a tree-like diagram that summarizes the process of clustering.

Distance Measures Used in Clustering

In order to build clusters, either agglomeratively or divisively, we need to define the distance between two objects (cases), $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $x_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$, and eventually between clusters. Let us first examine the distance between two objects. If the units of measure of the p variables are quite different, it is suggested that the variables first be normalized by forming z-scores, that is, by subtracting the sample means from the original variables and dividing the deviations by their respective sample standard deviations.

The most often used measure of distance (dissimilarity) between two cases is the Euclidean distance, defined by

$$d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}. \qquad (1)$$

Alternatively, a weighted Euclidean distance can be used, defined by

$$d^{*}_{ij} = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \cdots + w_p (x_{ip} - x_{jp})^2} \qquad (2)$$

where the weights $w_1, w_2, \ldots, w_p$ satisfy $w_i \ge 0$ and $\sum_{i=1}^{p} w_i = 1$. For the remaining discussion let us focus on the Euclidean distance between objects (cases).

Moving to the distance between clusters, we need to somehow define the distance between the objects in one cluster and the objects in another cluster. Cluster distances are usually defined in one of three basic ways: Single Linkage (Nearest Neighbor), Complete Linkage (Farthest Neighbor), and Average Linkage. Each of these cluster distance measures is defined in order below.

Single Linkage (Nearest Neighbor)

The Single Linkage distance between two clusters is defined as the distance between the nearest pair of objects in the two clusters (one object in each cluster). If cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is $B_1, B_2, \ldots, B_n$, the Single Linkage distance between clusters A and B is

$$D(A, B) = \min \{ d_{ij} : \text{object } A_i \text{ is in cluster } A \text{ and object } B_j \text{ is in cluster } B \}$$

where $d_{ij}$ is the Euclidean distance between $A_i$ and $B_j$. At each stage of hierarchical clustering based on the Single Linkage distance measure, the clusters A and B for which D(A, B) is minimum are merged. The Single Linkage distance is represented in the XLMINER Help File figure below.
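A minimal sketch of these object-level distances and of the single-linkage cluster distance, assuming small hypothetical clusters of two-dimensional cases (illustrative Python, not part of the original XLMiner workflow):

```python
import numpy as np

def euclidean(xi, xj):
    """Equation (1): Euclidean distance between two cases."""
    return np.sqrt(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))

def weighted_euclidean(xi, xj, w):
    """Equation (2): weighted Euclidean distance; weights are >= 0 and sum to 1."""
    return np.sqrt(np.sum(np.asarray(w) * (np.asarray(xi) - np.asarray(xj)) ** 2))

def single_linkage(A, B):
    """D(A, B): minimum distance over all pairs with one object from each cluster."""
    return min(euclidean(a, b) for a in A for b in B)

# Hypothetical clusters of two-dimensional cases.
A = [[0.0, 0.0], [0.5, 0.2]]
B = [[3.0, 3.0], [2.5, 2.7]]
print(euclidean(A[0], B[0]))                       # ~4.243
print(weighted_euclidean(A[0], B[0], [0.5, 0.5]))  # ~3.0
print(single_linkage(A, B))                        # distance of the nearest pair
```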

Figure 2: Single Linkage Distance Between Clusters

Complete Linkage (Farthest Neighbor)

The Complete Linkage distance between two clusters is defined as the distance between the most distant (farthest) pair of objects in the two clusters (one object in each cluster). If cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is $B_1, B_2, \ldots, B_n$, the Complete Linkage distance between clusters A and B is

$$D(A, B) = \max \{ d_{ij} : \text{object } A_i \text{ is in cluster } A \text{ and object } B_j \text{ is in cluster } B \}$$

where $d_{ij}$ is the Euclidean distance between $A_i$ and $B_j$. At each stage of hierarchical clustering based on the Complete Linkage distance measure, the clusters A and B for which D(A, B) is minimum are merged. The Complete Linkage distance is represented in the XLMINER Help File figure below.

Figure 3: Complete Linkage Distance Between Clusters

Average Linkage

Under Average Linkage the distance between two clusters is defined to be the average of the distances between all pairs of objects, where each pair is made up of one object from each cluster. If cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is $B_1, B_2, \ldots, B_n$, the Average Linkage distance between clusters A and B is

$$D(A, B) = \frac{T_{AB}}{N_A N_B}$$

where $T_{AB}$ is the sum of all pairwise distances between cluster A and cluster B, and $N_A$ and $N_B$ are the sizes of clusters A and B, respectively. At each stage of hierarchical clustering based on the Average Linkage distance measure, the clusters A and B are merged such that, after the merger, the average pairwise distance within the newly formed cluster is minimum. The Average Linkage distance is represented in the XLMINER Help File figure below.
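The three cluster-distance rules can be sketched compactly from the matrix of pairwise Euclidean distances; the clusters below are hypothetical two-dimensional examples, and the helper function is my own rather than anything from XLMiner:

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_distance(A, B, method="single"):
    """Distance between clusters A and B under the three linkage rules in the text."""
    D = cdist(A, B)            # all pairwise Euclidean distances d_ij
    if method == "single":     # nearest neighbor: min d_ij
        return D.min()
    if method == "complete":   # farthest neighbor: max d_ij
        return D.max()
    if method == "average":    # T_AB / (N_A * N_B): mean of all pairwise distances
        return D.mean()
    raise ValueError(method)

# Hypothetical clusters of two-dimensional cases.
A = np.array([[0.0, 0.0], [0.5, 0.2]])
B = np.array([[3.0, 3.0], [2.5, 2.7]])
for m in ("single", "complete", "average"):
    print(m, round(cluster_distance(A, B, m), 3))
```

Single linkage is the smallest entry of the pairwise distance matrix, complete linkage the largest, and average linkage its mean, which is exactly $T_{AB}/(N_A N_B)$.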

Figure 4: Average Linkage Distance Between Clusters

Steps in Agglomerative Clustering

The steps in agglomerative clustering are as follows:

1. Start with n clusters (each observation is its own cluster).
2. The two closest observations are merged into one cluster.
3. At every subsequent step, the two clusters that are closest to each other are merged. That is, either single observations are added to existing clusters or two existing clusters are merged.
4. This process continues until all observations are merged into a single cluster.

This process of agglomeration leads to the construction of a dendrogram, a tree-like diagram that summarizes the process of clustering. For any given number of clusters, we can determine the records in the clusters by sliding a horizontal line (ruler) up and down the dendrogram until the number of vertical lines it intersects equals the number of clusters desired.
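These steps are what routines such as scipy.cluster.hierarchy.linkage carry out. The sketch below, on hypothetical standardized data, builds the merge tree and draws the corresponding dendrogram (the original notes do this in XLMiner):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical standardized data: 8 cases, 3 variables (a stand-in for a real data set).
rng = np.random.default_rng(0)
Z_data = rng.standard_normal((8, 3))

# linkage() performs exactly the agglomerative steps above:
# start with n singleton clusters and repeatedly merge the two closest clusters.
merge_tree = linkage(Z_data, method="average", metric="euclidean")

dendrogram(merge_tree)   # tree-like summary of the merge sequence
plt.ylabel("Distance")
plt.show()
```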

Dendrograms are more useful visually when there is a smaller number of cases, as in the Utilities.xls data set. The agglomerative procedure also works for larger data sets, but it is computationally intensive in that n-by-n distance matrices are its basic building blocks.

To demonstrate the construction and interpretation of a dendrogram, let's cluster the data contained in the Utilities.xls data set. This data set consists of observations on utilities, with each utility described by 8 variables. As noted above, we have three different choices of distance between clusters: Single Linkage (Nearest Neighbor), Complete Linkage (Farthest Neighbor), and Average Linkage. A separate dendrogram can be generated for each choice of distance measure. Let's look at the dendrogram generated by using the Average Linkage measure, reproduced below.

[Figure: Dendrogram (Average linkage) for the Utilities data; the vertical axis measures the distance at which clusters are merged.]

If we put our horizontal ruler at 4.0 for the maximal distance allowed between clusters (as measured by average linkage), we cut across 4 vertical lines and thus get 4 clusters: { }; { }; { }; { }. If we put our horizontal ruler at 3.5 for the maximal distance allowed between clusters, we cut across 7 vertical lines and thus get 7 clusters: { }; { }; { }; { }; { }; { }; { }. The four-cluster grouping is constructed by combining the first and second clusters, the third and fourth clusters, and the sixth and seventh clusters of the seven-cluster grouping. You can now see why this type of clustering is called hierarchical: the 4-cluster grouping is constructed by combining cluster groupings immediately below it. As you move up slowly from the bottom of the dendrogram to the top, you move from n clusters to n − 1 clusters to n − 2 clusters, and so on, until all of the observations are contained in one cluster.
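The "sliding ruler" operation corresponds to cutting the merge tree at a chosen height. Here is a sketch using scipy.cluster.hierarchy.fcluster on hypothetical standardized data; the actual Utilities values and memberships are not reproduced here:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical standardized data standing in for the normalized Utilities variables.
rng = np.random.default_rng(1)
Z_data = rng.standard_normal((22, 8))

merge_tree = linkage(Z_data, method="average")

# Cut the tree at a maximum allowed cluster distance of 4.0.
labels_at_4 = fcluster(merge_tree, t=4.0, criterion="distance")
print(len(set(labels_at_4)), "clusters at height 4.0")

# Lowering the cut-off gives a finer partition from which the coarser one is built.
labels_at_3_5 = fcluster(merge_tree, t=3.5, criterion="distance")
print(len(set(labels_at_3_5)), "clusters at height 3.5")
```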

To show how sensitive the choice of clusters is to the choice of distance measure, consider the Single Linkage dendrogram for the Utilities data:

[Figure: Dendrogram (Single linkage) for the Utilities data; the vertical axis measures the distance at which clusters are merged.]

In the case of forming 4 groups, set the maximal allowed distance to 3.0 in the above dendrogram. Then we get the following 4 clusters: { }; { }; { }; { }. These four clusters are quite different from the 4 clusters determined by using the Average Linkage dendrogram. This just goes to show that cluster analysis is an art form: the clusters should be interpreted with caution and accepted only if they make sense given the domain-specific knowledge we have concerning the utilities under study.

We should also note some additional limitations of hierarchical clustering:

- For very large data sets, it can be expensive and slow.
- It makes only one pass through the data, so early clustering decisions affect the rest of the clustering results.
- It often has low stability; that is, adding or subtracting variables or adding or dropping observations can affect the groupings substantially.
- It is sensitive to outliers and their treatment.

Non-hierarchical Clustering (K-means)

The following is, hopefully, a not too technical discussion of K-means clustering. It is a non-hierarchical method in the sense that if one has, say, 2 clusters generated by pre-specifying 2 means (centroids) in the K-means algorithm and 3 clusters generated by pre-specifying 3 means, then it may be the case that no combination of any two clusters of the 3-cluster grouping can give rise to the 2-cluster grouping. In this sense the K-means algorithm is non-hierarchical.

Let us turn again to the Utilities data and use the K-means clustering method to determine 4 clusters based on the normalized data. We use the following choices:

- Normalized data
- 10 random starts
- 10 iterations per start
- Fixed random seed = 1345
- Number of reported clusters = 4

The K-means algorithm in XLMiner for four clusters then generated the following clusters: cluster 1 = { }; cluster 2 = { }; cluster 3 = { }; cluster 4 = { }. Again we obtain another distinct 4-cluster grouping. One can then use domain-specific knowledge to determine whether this 4-cluster grouping makes more or less sense than the 4-cluster groupings determined by either of the choices of cluster distance in the agglomerative approach.

The Steps in the K-means Clustering Approach

Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional real vector, K-means clustering aims to partition the n observations into K sets (K < n), $S = \{S_1, S_2, \ldots, S_K\}$, so as to minimize the within-cluster sum of squares (WCSS):

$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in S_k} \lVert x_i - \mu_k \rVert^2 \qquad (1)$$

where $\mu_k$ is the mean of the points in $S_k$. Minimizing (1) can, in theory, be done by integer programming, but this can be extremely time-consuming. Instead, the Lloyd algorithm is more often used. Its steps are as follows. Given an initial set of K means $m_1^{(1)}, m_2^{(1)}, \ldots, m_K^{(1)}$, which can be specified randomly or by some heuristic, the algorithm proceeds by alternating between two steps.

Assignment Step: Assign each observation to the cluster with the closest mean:

$$S_k^{(t)} = \{ x_i : \lVert x_i - m_k^{(t)} \rVert \le \lVert x_i - m_j^{(t)} \rVert \text{ for all } j = 1, \ldots, K \}. \qquad (2)$$

Update Step: Calculate the new means as the centroids of the observations in the clusters, i.e.,

$$m_k^{(t+1)} = \frac{1}{\lvert S_k^{(t)} \rvert} \sum_{x_i \in S_k^{(t)}} x_i \quad \text{for } k = 1, \ldots, K. \qquad (3)$$

Repeat the Assignment and Update steps until the WCSS (equation (1)) no longer changes. Then the centroids and members of the K clusters are determined.

Note: When using random assignment of the K means to start the algorithm, one might try several starting sets of means and then choose the best starting point to be the random set that produces the smallest WCSS among all of the random starting points tried.

Regardless of the clustering technique used, one should strive to choose clusters that are interpretable and make sense given the domain-specific knowledge that we have about the problem at hand.

Review: Utilities.xls data
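A from-scratch sketch of the Lloyd iterations described above, in illustrative Python on hypothetical data; the function and variable names are my own and not from the original notes:

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd iterations: alternate the Assignment and Update steps
    until the cluster assignments (and hence the WCSS) stop changing."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize the K means by picking K distinct observations at random.
    means = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step (equation (2)): nearest mean for every observation.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # assignments unchanged, so WCSS unchanged
        labels = new_labels
        # Update step (equation (3)): each mean becomes its cluster's centroid.
        for k in range(K):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    wcss = sum(np.sum((X[labels == k] - means[k]) ** 2) for k in range(K))
    return labels, means, wcss

# Hypothetical data; in practice one would rerun with several seeds and keep
# the run with the smallest WCSS, as the note above suggests.
X = np.random.default_rng(3).standard_normal((30, 2))
labels, means, wcss = lloyd_kmeans(X, K=3)
print(labels[:10], round(wcss, 3))
```

Library routines such as scikit-learn's KMeans implement the same alternation, with its n_init parameter controlling how many random starting sets of means are tried, which echoes the note above about keeping the start with the smallest WCSS.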