Chapter 6: Cluster Analysis

The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values of the q variables measured on each individual. In clustering, the items are often called objects. We wish to create clusters such that the objects within each cluster are similar and objects in different clusters are dissimilar. The dissimilarity between any two objects is typically quantified using a distance measure (such as Euclidean distance). Cluster analysis is a type of unsupervised classification, because we do not know the nature of the groups (or, typically, the number of groups) before we classify the objects into clusters.
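As a small, made-up illustration of the Euclidean distance computation (the two objects and their q = 2 measurements below are invented for this example, not taken from the notes), R's dist function returns the pairwise distances between the rows of a matrix:

x <- matrix(c(1, 2,
              4, 6), nrow = 2, byrow = TRUE)   # two objects, each measured on q = 2 variables
dist(x)                                        # Euclidean distance: sqrt((1-4)^2 + (2-6)^2) = 5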

Applications of Cluster Analysis

In marketing, researchers attempt to find distinct clusters within the consumer population so that a separate marketing strategy can be used for each cluster. In ecology, scientists classify plants or animals into various groups based on measurable characteristics. Researchers in genetics may separate genes into several classes based on their expression ratios measured at different time points.

The Need for Clustering Algorithms

We assume there are n objects that we wish to separate into a small number (say, k) of clusters, where k < n. If we know the true number of clusters k ahead of time, then the number of ways to partition the n objects into k clusters is a Stirling number of the second kind. Example: There are 48,004,081,105,038,305 ways to separate n = 30 objects into k = 4 clusters.
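This count can be reproduced from the standard recurrence S(n, k) = k * S(n-1, k) + S(n-1, k-1) for Stirling numbers of the second kind. Below is a minimal R sketch, assuming the gmp package is available for exact big-integer arithmetic (the function name stirling2 is ours, not from the notes):

library(gmp)

stirling2 <- function(n, k) {
  row <- as.bigz(c(1, rep(0, k)))        # S(0, 0) = 1; S(0, j) = 0 for j > 0
  for (i in seq_len(n)) {
    new_row <- as.bigz(rep(0, k + 1))    # S(i, 0) = 0 for i >= 1
    for (j in seq_len(k)) {
      new_row[j + 1] <- j * row[j + 1] + row[j]   # S(i, j) = j*S(i-1, j) + S(i-1, j-1)
    }
    row <- new_row
  }
  row[k + 1]
}

stirling2(30, 4)   # 48004081105038305, the count quoted above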

The Need for Clustering Algorithms (Continued)

If we don't know the value of k, the possible number of partitions is even larger. Example: There are 35,742,549,198,872,617,291,353,508,656,626,642,567 possible partitions of n = 42 objects if we let the number of clusters k vary. (This is called the Bell number for n.) Clearly even a computer cannot investigate all the possible clustering partitions to see which is best. We need intelligently designed algorithms that search among the best possible partitions relatively quickly.
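Continuing the sketch above (same gmp assumption), the Bell number for n is simply the sum of the Stirling numbers S(n, k) over all possible cluster counts k = 1, ..., n:

bell <- as.bigz(0)
for (k in 1:42) bell <- bell + stirling2(42, k)
bell   # 35742549198872617291353508656626642567, the count quoted above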

Types of Clustering Algorithms

There are three major classes of clustering methods; from oldest to newest, they are:

1. Hierarchical methods
2. Partitioning methods
3. Model-based methods

Hierarchical methods cluster the data in a series of n steps, typically joining observations together step by step to form clusters. Partitioning methods first determine k, and then typically attempt to find the partition into k clusters that optimizes some objective function of the data. Model-based clustering takes a statistical approach, formulating a model that categorizes the data into subpopulations and using maximum likelihood to estimate the model parameters.
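As a rough illustration of the three classes in R (a hedged sketch: the built-in USArrests data and the choice of 3 clusters are arbitrary, and the model-based example assumes the mclust package is installed):

d  <- dist(USArrests)                        # pairwise Euclidean distances
hc <- hclust(d)                              # hierarchical: a sequence of n - 1 merges
km <- kmeans(USArrests, centers = 3)         # partitioning: optimizes within-cluster sum of squares
library(mclust)
mc <- Mclust(USArrests, G = 3)               # model-based: Gaussian mixture fit by maximum likelihood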

Hierarchical Clustering

Agglomerative hierarchical clustering begins with n clusters, each containing a single object. At each step, the two clusters that are closest are merged together. So as the steps iterate, there are n clusters, then n - 1 clusters, then n - 2, and so on. By the last step, there is 1 cluster containing all n objects. The R function hclust will perform a variety of hierarchical clustering methods.
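For example (a minimal sketch, again using the built-in USArrests data as a stand-in for one's own data):

d  <- dist(USArrests)          # Euclidean distances between the n objects
hc <- hclust(d)                # agglomerative clustering (complete linkage by default)
plot(hc)                       # dendrogram showing the sequence of n - 1 merges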

Defining Closeness of Clusters

The key step in a hierarchical clustering algorithm is specifying how to determine the two closest clusters at any given stage. For the first step, it's easy: join the two objects whose (say, Euclidean) distance is smallest. After that, we have a choice: do we join two individual objects together, or merge an object into a cluster that already has multiple objects? Intercluster dissimilarity is typically defined in one of three ways, which give rise to the different linkage methods.

Linkage Methods in Hierarchical Clustering

The single linkage algorithm, at each step, joins the clusters whose minimum distance between objects is smallest, i.e., joins the clusters A and B with the smallest

d_{AB} = \min_{i \in A, j \in B} d_{ij}.

Single linkage clustering is sometimes called nearest-neighbor clustering. The complete linkage algorithm, at each step, joins the clusters whose maximum distance between objects is smallest, i.e., joins the clusters A and B with the smallest

d_{AB} = \max_{i \in A, j \in B} d_{ij}.

Complete linkage clustering is sometimes called farthest-neighbor clustering.

Linkage Methods in Hierarchical Clustering (Continued)

The average linkage algorithm, at each step, joins the clusters whose average distance between objects is smallest, i.e., joins the clusters A and B with the smallest

d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij},

where n_A and n_B are the number of objects in clusters A and B, respectively. See Figure 6.4 in the Everitt textbook.
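In hclust, the linkage is chosen through the method argument; a hedged sketch, again using USArrests purely as illustrative data:

d <- dist(USArrests)
hc_single   <- hclust(d, method = "single")    # nearest-neighbor linkage
hc_complete <- hclust(d, method = "complete")  # farthest-neighbor linkage
hc_average  <- hclust(d, method = "average")   # average linkage
par(mfrow = c(1, 3))                           # the three dendrograms can differ substantially
plot(hc_single); plot(hc_complete); plot(hc_average)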

Dendrograms

A hierarchical algorithm actually produces not one partition of the data but a whole sequence of partitions, one for each step 1, 2, ..., n. The series of mergings can be represented at a glance by a treelike structure called a dendrogram. To get a single k-cluster solution, we cut the dendrogram horizontally at a height that produces k groups (the R function cutree can do this). It is strongly recommended to examine the full dendrogram before deciding where to cut it: a natural set of clusters may be apparent at a glance.
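Continuing the earlier sketch (the choice k = 3 is arbitrary and only for illustration):

clusters <- cutree(hc_complete, k = 3)   # cut the dendrogram into k = 3 groups
table(clusters)                          # how many objects fall in each cluster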

Standardization of Observations

If the variables in our data set are of different types or are measured on very different scales, then some variables may play an inappropriately dominant role in the clustering process. In this case, it is recommended to standardize the variables in some way before clustering the objects. Possible standardization approaches:

1. Divide each column by its sample standard deviation, so that all variables have standard deviation 1.
2. Divide each variable by its sample range (max - min); Milligan and Cooper (1988) found that this approach best preserved the clustering structure.
3. Convert the data to z-scores by (for each variable) subtracting the sample mean and then dividing by the sample standard deviation, a common option in clustering software packages.
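All three options can be carried out with R's scale function; a minimal sketch, where X stands for a generic numeric data matrix (a placeholder, not an object from the notes):

X_sd    <- scale(X, center = FALSE, scale = apply(X, 2, sd))                          # option 1: divide by SD
X_range <- scale(X, center = FALSE, scale = apply(X, 2, function(v) diff(range(v))))  # option 2: divide by range
X_z     <- scale(X)                                                                   # option 3: z-scores
d       <- dist(X_z)             # then cluster the standardized data, e.g. hclust(d)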

Pros and Cons of Hierarchical Clustering

An advantage of hierarchical clustering methods is their computational speed for small data sets. Another advantage is that the dendrogram gives a picture of the clustering solution for a whole range of choices of k (the number of clusters) at once. On the other hand, a major disadvantage is that once two clusters have been joined, they can never be split apart later in the algorithm, even if such a move would improve the clustering. The so-called partitioning methods of cluster analysis do not have this restriction. In addition, hierarchical methods can be less efficient than partitioning methods for large data sets, where n is much greater than k.