Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

Similar documents
Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Hierarchical clustering

Clustering CS 550: Machine Learning

Hierarchical clustering

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

Unsupervised Learning : Clustering

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

CS7267 MACHINE LEARNING

Introduction to Clustering

Cluster analysis. Agnieszka Nowak - Brzezinska

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Cluster Analysis: Basic Concepts and Algorithms

Clustering Basic Concepts and Algorithms 1

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

DATA MINING - 1DL105, 1Dl111. An introductory class in data mining

Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Exploratory data analysis for microarrays

Cluster Analysis: Agglomerate Hierarchical Clustering

CSE 5243 INTRO. TO DATA MINING

Gene Clustering & Classification

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

Unsupervised Learning and Clustering

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis for Microarray Data

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Unsupervised Learning and Clustering

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Chapter 6: Cluster Analysis

Cluster Analysis. Ying Shen, SSE, Tongji University

CSE 5243 INTRO. TO DATA MINING

10701 Machine Learning. Clustering

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

Clustering fundamentals

Understanding Clustering Supervising the unsupervised

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning (BSMC-GA 4439) Wenke Liu

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

Unsupervised Learning

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Clustering Part 3. Hierarchical Clustering

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Clustering: K-means and Kernel K-means

Information Retrieval and Web Search Engines

Hierarchical Clustering

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

Statistics 202: Data Mining. c Jonathan Taylor. Outliers Based in part on slides from textbook, slides of Susan Holmes.

Clustering. Content. Typical Applications. Clustering: Unsupervised data mining technique

CSE 494/598 Lecture-11: Clustering & Classification

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

ECLT 5810 Clustering

Machine Learning (BSMC-GA 4439) Wenke Liu

CSE 347/447: DATA MINING

Hierarchical Clustering

Information Retrieval and Web Search Engines

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Machine Learning for OR & FE

MIS2502: Data Analytics Clustering and Segmentation. Jing Gong

Clustering Lecture 3: Hierarchical Methods

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

University of Florida CISE department Gator Engineering. Clustering Part 2

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

ECLT 5810 Clustering

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

K-means clustering Based in part on slides from textbook, slides of Susan Holmes. December 2, Statistics 202: Data Mining.

CLUSTERING IN BIOINFORMATICS

STATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010

HW4 VINH NGUYEN. Q1 (6 points). Chapter 8 Exercise 20

数据挖掘 Introduction to Data Mining

Today s lecture. Clustering and unsupervised learning. Hierarchical clustering. K-means, K-medoids, VQ

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Clustering algorithms

EECS 730 Introduction to Bioinformatics Microarray. Luke Huan Electrical Engineering and Computer Science

Cluster Analysis: Basic Concepts and Algorithms

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

Network Traffic Measurements and Analysis

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

What to come. There will be a few more topics we will cover on supervised learning

What is Cluster Analysis?

Chapter DM:II. II. Cluster Analysis

Distances, Clustering! Rafael Irizarry!

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Chapter 10 in Introduction to statistical learning

CS Introduction to Data Mining Instructor: Abdullah Mueen

Clustering in Data Mining

Transcription:

Clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1

Clustering Clustering Goal: Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups. An unsupervised problem that tries to produce labelled data from unlabelled data. Many different techniques, but most revolve around forming groups with small within cluster distances relative to between cluster distances. 2 / 1

What is Cluster Analysis? Cluster analysis! Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Intra-cluster distances are minimized Inter-cluster distances are maximized Tan,Steinbach, Kumar Introduction to 4/18/2004 2 3 / 1

Clustering Applications Understanding of some structure in a dataset. Group related documents for browsing; Group genes and proteins that have similar functionality; Group stocks with similar price fluctuations Summarization: reduce the size of large data sets (sometimes known as vector quantization... ) 4 / 1

Cluster analysis 14.3 Cluster Analysis 527 FIGURE 14.14. DNA microarray data: average linkage hierarchical clustering has been applied independently to the rows (genes) and columns (samples), determining the ordering of the rows and columns (see text). The colors range from 5 / 1

Cluster analysis Original image... 6 / 1

Cluster analysis Some compression... 7 / 1

Cluster analysis Too much compression? 8 / 1

Cluster analysis Notion of a Cluster can be Ambiguous How many clusters? Six Clusters Two Clusters Four Clusters Clusters can be ambiguous. 9 / 1 Tan,Steinbach, Kumar Introduction to 4/18/2004 5

Clustering Types of clustering Partitional A division data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Hierarchical A set of nested clusters organized as a hierarchical tree. Each data object is in exactly one subset for any horizontal cut of the tree... 10 / 1

Cluster analysis 502 14. Unsupervised Learning X2 X 1 FIGURE 14.4. Simulated data in the plane, clustered into three classes (represented A partitional by orange, example blue and green) by the K-means clustering algorithm 11 / 1

Cluster analysis 522 14. Unsupervised Learning LEUKEMIA K562B-repro K562A-repro CNS CNS UNKNOWN OVARIAN MCF7A-repro MCF7D-repro LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA OVARIAN OVARIAN LEUKEMIA OVARIAN OVARIAN PROSTATE OVARIAN PROSTATE CNS CNS CNS FIGURE 14.12. Dendrogram from agglomerative hierarchical clustering with average linkage to the human tumor microarray data. A hierarchical example chical structure produced by the algorithm. Hierarchical methods impose 12 / 1

Clustering Other distinctions Exclusivity Are points in only one cluster? Soft vs. hard Can we give a score for each case and each cluster? Partial vs. complete Do we cluster all points, or only some? Heterogeneity Are the clusters similar in size, shape, etc. 13 / 1

Types of clusters: well-separated! Well-Separated Clusters: A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters This type of cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. Tan,Steinbach, Kumar Introduction to 4/18/2004 11 14 / 1

! Center-based A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of a cluster, than to the center of any other cluster Types of clusters: center-based The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of a cluster 4 center-based clusters This type of cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of a cluster, than to the center of any other cluster. Tan,Steinbach, Kumar Introduction to 4/18/2004 12 15 / 1

! Center-based A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of a cluster, than to the center of any other cluster Types of clusters: center-based The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of a cluster 4 center-based clusters The center of these clusters is usually, the average of all the points in the cluster, or a medoid, the most representative point of a cluster. Tan,Steinbach, Kumar Introduction to 4/18/2004 12 16 / 1

! Contiguous Cluster (Nearest neighbor o Types of Transitive) clusters: contiguity-based A cluster is a set of points such that a point in a clus closer (or more similar) to one or more other points i cluster than to any point not in the cluster. This type of cluster is a set of points such that a point in a cluster is closer (or more similar) 8 contiguous to one or clusters more other points in the cluster than to any point not in the cluster. Tan,Steinbach, Kumar Introduction to 4/18 17 / 1

Types! Density-based of clusters: contiguity-based A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density. Used when the clusters are irregular or intertwined, and when noise and outliers are present. These types of clusters 6 density-based are made upclusters of dense region of points, often described by one of the other cluster types, but separated by low-density regions, often in the form of noise. Tan,Steinbach, Kumar Introduction to 4/18/2004 14 18 / 1

Clustering Mathematical characterizations Most clustering algorithms are based on a dissimilarity measure d. Data may be of mixed type so some of the similarities we saw earlier may be used. Most clustering algorithms do not insist that the dissimilarity is truly a distance. (Local) Given two elements of the partition C r, C s, we might consider d SL (C r, C s ) = min d(x, y) x C r,y C s These types of measures are often used in hierarchical clustering algorithms. 19 / 1

Clustering Mathematical characterizations (Global) A clustering is a partition C = {C 1,..., C k } of the cases. It writes T = 1 2 n d 2 (x i, x j ) i,j=1 and tries to minimize W (C). This problem is NP Hard... = W (C) + B(C) These types of measures are often used in partitional clustering algorithms. 20 / 1

Clustering Most common models Partitional K-means / medoid ; mixture models Hierarchical agglomerative (bottom-up) hierarchical clustering. We ll spend a some time on each of these... 21 / 1

Cluster analysis 502 14. Unsupervised Learning X2 X 1 FIGURE 14.4. Simulated data in the plane, clustered into three classes (represented K-means by orange, clustering blue and green) by the K-means clustering algorithm 22 / 1

Cluster analysis 522 14. Unsupervised Learning LEUKEMIA K562B-repro K562A-repro CNS CNS UNKNOWN OVARIAN MCF7A-repro MCF7D-repro LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA OVARIAN OVARIAN LEUKEMIA OVARIAN OVARIAN PROSTATE OVARIAN PROSTATE CNS CNS CNS FIGURE 14.12. Dendrogram from agglomerative hierarchical clustering with average linkage to the human tumor microarray data. Agglomerative hierarchical clustering chical structure produced by the algorithm. Hierarchical methods impose 23 / 1

24 / 1