Data Mining and Data Warehousing
Henryk Maciejewski
Data Mining: Clustering


Clustering Algorithms - Contents
- K-means
- Hierarchical algorithms
- Linkage functions
- Vector quantization
- SOM

Clustering Formulation
[Slide figure: a data table of objects (rows) by attributes (columns), mapped to a cluster model]
- Find groups of similar points (observations) in multidimensional space
- No target variable (unsupervised learning)

Methods of Clustering - Overview
Variety of methods:
- Hierarchical clustering: creates a hierarchy of clusters (one cluster entirely contained within another cluster)
- Non-hierarchical methods: create disjoint clusters
- Overlapping clusters: objects can belong to more than one cluster simultaneously
- Fuzzy clusters: defined by the probability (grade) of membership of each object in each cluster
Useful data preprocessing prior to clustering (sketched below):
- PCA (Principal Component Analysis) to reduce the dimensionality of the data
- Data standardization (transform the data so that variables with larger variance do not dominate the results of clustering)
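A minimal preprocessing sketch in R (the language used later in these slides). The data frame countries is a hypothetical stand-in generated with random numbers, not the actual 1995 dataset; scale() and prcomp() are standard R functions.

# Hypothetical stand-in for the 97-country dataset (random numbers).
countries <- data.frame(Birth = rnorm(97, 25, 10),
                        Death = rnorm(97, 10, 3),
                        InfantDeath = rnorm(97, 40, 30))

# Standardization: center each variable and scale it to unit variance,
# so that high-variance variables do not dominate the distances.
countries.std <- scale(countries)

# Optional dimensionality reduction with PCA prior to clustering.
pca <- prcomp(countries.std)
summary(pca)  # proportion of variance explained by each component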

Introductory Example
97 countries described by 3 attributes: Birth, Death, and InfantDeath rates (given per 1000 inhabitants; data from 1995)

Example (continued) - Analysis I
Clustering the raw data with the K-means algorithm.
Result: 3 clusters (number of observations per cluster: 13, 32, 52)
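A hedged sketch of Analysis I in R, assuming the hypothetical countries data frame from the preprocessing sketch above; kmeans() is run directly on the raw (unstandardized) data:

set.seed(1)                            # K-means is non-deterministic
km.raw <- kmeans(countries, centers = 3, nstart = 20)
table(km.raw$cluster)                  # cluster sizes
km.raw$centers                         # cluster profiles (variable means)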

Example - Profiles of Clusters

Example - Profiles of Clusters
Notice: the data were clustered based on the InfantDeath rate only!

Example - Standardization of Data (Analysis II)
Data standardized prior to clustering (each variable divided by its standard deviation).
Result: 3 clusters (with 35, 46, and 16 observations); the data are now clustered based on InfantDeath and Death.
Comparing Analysis II with Analysis I, observe that variables with the largest variance have the largest influence on the results of clustering.

Example - Profiles of Clusters (Analysis II)

Methods of Clustering
Non-hierarchical methods:
- K-means clustering: non-deterministic, O(n), where n is the number of observations
Hierarchical methods:
- Agglomerative (join small clusters) or divisive (split big clusters)
- Deterministic, O(n^2) to O(n^3), depending on the clustering method (i.e., the definition of the inter-cluster distance)

Methods of Clustering - Remarks
- Clustering large datasets: use K-means. If results of hierarchical clustering are needed, first use K-means to produce e.g. 50 clusters, then run hierarchical clustering on the K-means results (a two-stage approach, sketched below).
- Consensus clustering: to discover the real clusters in the data, analyze the stability of the results with noise injected.
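A minimal sketch of the two-stage approach in R, again assuming the standardized countries.std matrix from the earlier snippet (with genuinely large datasets, 50 stands in for a value much smaller than n):

set.seed(1)
km.pre <- kmeans(countries.std, centers = 50, nstart = 5)  # stage 1: K-means
# Stage 2: hierarchical clustering on the 50 centroids instead of on all
# n observations, avoiding the O(n^2)-O(n^3) cost of hclust on raw data.
hc <- hclust(dist(km.pre$centers), method = "ward.D2")
plot(hc)  # dendrogram built over the K-means centroids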

K-means Algorithm
1. Select k points (the centroids of the initial clusters; select them randomly)
2. Assign each observation to the nearest centroid (nearest cluster)
3. For each cluster, find the new centroid
4. Repeat steps 2 and 3 until no change occurs in the cluster assignments
A from-scratch sketch of these steps follows.
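A from-scratch R sketch of the four steps above, for a numeric matrix X (illustrative only: not optimized, and empty clusters are not handled):

kmeans_sketch <- function(X, k, max.iter = 100) {
  centroids <- X[sample(nrow(X), k), , drop = FALSE]     # step 1
  cl <- rep(0, nrow(X))
  for (iter in 1:max.iter) {
    # step 2: distance from every observation to every centroid
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    new.cl <- apply(d, 1, which.min)
    if (all(new.cl == cl)) break                         # step 4: converged
    cl <- new.cl
    for (j in 1:k)                                       # step 3: new centroids
      centroids[j, ] <- colMeans(X[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = centroids)
}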

K-means Algorithm
Result: k separate clusters.
The algorithm requires that the correct number of clusters k be specified in advance (a difficult problem: how do we know the real number of clusters in the data?)

Hierarchical Clustering - Notation
- $x_i$: observations, $i = 1..n$
- $C_K$: clusters
- $G$: current number of clusters
- $D_{KL}$: distance between clusters $C_K$ and $C_L$
The between-cluster distance $D_{KL}$ is the linkage function; various definitions are available, and the results of clustering depend on $D_{KL}$.
[Slide figure: two clusters $C_K$ and $C_L$ with the distance $D_{KL}$ between them]

Hierarchical Clustering Algorithm (agglomerative hierarchical clustering)
1. $C_k = \{x_k\}$, $k = 1..n$; $G = n$
2. Find $K, L$ such that $D_{KL} = \min_{1 \le I, J \le G} D_{IJ}$
3. Replace clusters $C_K$ and $C_L$ by the cluster $C_K \cup C_L$; set $G = G - 1$
4. Repeat steps 2 and 3 while $G > 1$
Result: a hierarchy of clusters (a dendrogram). An hclust() sketch follows.
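In R, the built-in hclust() function implements this agglomerative scheme. A hedged sketch, again assuming the standardized countries.std matrix from the earlier snippets; the method argument selects the linkage function discussed on the following slides (e.g. "single", "complete", "average", "ward.D2"):

hc <- hclust(dist(countries.std), method = "average")
plot(hc)                        # dendrogram: the hierarchy of clusters
clusters <- cutree(hc, k = 3)   # cut the hierarchy into 3 disjoint clusters
table(clusters)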

Hierarchy of Clusters - Dendrogram

Definitions of Distance Between Clusters
Different definitions of the distance between clusters:
- Average linkage
- Single linkage
- Complete linkage
- Density linkage
- Ward's minimum variance method
(The SAS CLUSTER procedure accepts 11 different definitions of the inter-cluster distance.)

Average Linkage - Notation
- $x_i$: observations, $i = 1..n$
- $d(x, y)$: distance between observations (Euclidean distance assumed from now on)
- $C_K$: clusters; $N_K$: number of observations in cluster $C_K$
- $D_{KL}$: distance between clusters $C_K$ and $C_L$
- $\bar{x}_K$: mean observation in cluster $C_K$
- $W_K = \sum_{x_i \in C_K} \|x_i - \bar{x}_K\|^2$: variance in the cluster
Average linkage: $D_{KL} = \frac{1}{N_K N_L} \sum_{x_i \in C_K} \sum_{x_j \in C_L} d(x_i, x_j)$
- Tends to join clusters with small variance
- Resulting clusters tend to have similar variance

Complete Linkage
(Notation as defined for average linkage.)
Complete linkage: $D_{KL} = \max_{x_i \in C_K,\, x_j \in C_L} d(x_i, x_j)$
- Resulting clusters tend to have similar diameter

Single Linkage
(Notation as defined for average linkage.)
Single linkage: $D_{KL} = \min_{x_i \in C_K,\, x_j \in C_L} d(x_i, x_j)$
- Tends to produce elongated clusters, irregular in shape

Ward's Minimum Variance Method
(Notation as defined for average linkage.)
$D_{KL} = B_{KL} = W_M - W_K - W_L$, where $C_M = C_K \cup C_L$
Ward's minimum variance method:
- Tends to join small clusters
- Tends to produce clusters with a similar number of observations

Density Linkage - Notation
- $x_i$: observations, $i = 1..n$; $d(x, y)$: distance between observations
- $r$: a fixed constant
- $f(x)$: the proportion of observations within the sphere of radius $r$ centered at $x$, divided by the volume of the sphere (a measure of the density of points near observation $x$)
Density linkage: single linkage is performed using a density-based measure $d^*$ in place of $d$ (in the fixed-radius, uniform-kernel variant used by SAS, $d^*(x, y) = \frac{1}{2}(1/f(x) + 1/f(y))$ when $d(x, y) \le r$, and $\infty$ otherwise).
- Capable of discovering clusters of irregular shape

Example - Average Linkage (elongated clusters in the data)

Example - K-means (elongated clusters in the data)

Example - Density Linkage (elongated clusters in the data)

Example - K-means (nonconvex clusters in the data)

Example - Centroid Linkage (nonconvex clusters in the data)

Example - Density Linkage (nonconvex clusters in the data)

Example - True Clusters (clusters of unequal size)

Example - K-means (clusters of unequal size)

Example - Ward's Method (clusters of unequal size)

Example - Average Linkage (clusters of unequal size)

Example - Centroid Linkage (clusters of unequal size)

Example - Single Linkage (clusters of unequal size)

Example - Well-Separated Data (any method will work)

Example - Poorly Separated Data (true clusters)

Example - Poorly Separated Data (K-means)

Example - Poorly Separated Data (Ward's method)

Clustering Methods - Final Remarks
Standardization of variables prior to clustering:
- Often necessary; otherwise variables with large variance tend to have a large influence on the clustering.
- Often the standardized measurement $z_{ij}$ is computed as the z-score $z_{ij} = (x_{ij} - \bar{x}_j) / s_j$, where $x_{ij}$ is the original measurement for observation $i$ and variable $j$, $\bar{x}_j$ is the mean value of variable $j$, and $s_j$ is the mean absolute deviation of variable $j$ (or its standard deviation).
- Other ideas: divide each variable by its range, its maximum value, or its standard deviation (these variants are sketched below).
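A minimal R sketch of the standardization variants listed above, for a hypothetical numeric matrix X:

X <- matrix(rnorm(200), ncol = 4)              # hypothetical data

# z-score with the mean absolute deviation as the scale estimate s_j
ctr   <- sweep(X, 2, colMeans(X))              # x_ij - mean_j
s.mad <- apply(X, 2, function(v) mean(abs(v - mean(v))))
z.mad <- sweep(ctr, 2, s.mad, "/")

# Alternatives: divide by the range, the maximum, or the standard deviation
z.range <- sweep(X, 2, apply(X, 2, function(v) diff(range(v))), "/")
z.max   <- sweep(X, 2, apply(X, 2, max), "/")
z.sd    <- scale(X)                            # center and divide by sd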

Clustering Methods - Final Remarks
The number of clusters:
- There is no satisfactory theory to determine the right number of clusters in the data.
- Various criteria can be observed to help determine the right number, e.g. criteria based on the variance accounted for by the clusters: $R^2 = 1 - P_G / T$, or the semipartial $R^2 = B_{KL} / T$, where $T$ is the total variance of the observations, $P_G = \sum_K W_K$ over the $G$ clusters, and $B_{KL} = W_M - W_K - W_L$ with $C_M = C_K \cup C_L$.
- The Cubic Clustering Criterion (CCC).
- Data visualization is often useful for determining the number of clusters: a scatterplot for 2-3 dimensional data; in high dimensions, apply a PCA transformation (or similar) and visualize the data in the 2-3 dimensional space of the first principal components.
A sketch of the $R^2$ criterion follows.
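A hedged sketch of the $R^2$ criterion for K-means solutions in R, using the sums of squares returned by kmeans() (totss corresponds to $T$, tot.withinss to $P_G$), again on the hypothetical countries.std matrix:

r2 <- sapply(2:10, function(G) {
  km <- kmeans(countries.std, centers = G, nstart = 20)
  1 - km$tot.withinss / km$totss               # R^2 = 1 - P_G / T
})
plot(2:10, r2, type = "b",
     xlab = "number of clusters G", ylab = "R^2")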

Example - $R^2$ and Semipartial $R^2$

Example - Number of Clusters: Useful Checks
- Pseudo $t^2$ (PST2): 3, 6, or 9 clusters (look one step before a peak in the value)
- Pseudo $F$ (PSF): 9 clusters (peak in the value)
- CCC: 18 clusters (CCC around 3)

Kohonen VQ (Vector Quantization)
An algorithm similar to K-means. Idea of the VQ algorithm:
1. Select k points (initial cluster centroids)
2. For observation $x_i$, find the nearest centroid (the winning seed), denoted by $S_n$
3. Modify $S_n$ according to the formula $S_n = S_n (1 - L) + x_i L$, where $L$ is the learning constant (decreasing during the learning process)
4. Repeat steps 2 and 3 over all training observations
5. Repeat steps 2-4 for a given number of iterations

VQ - MacQueen Method
For constant $L$, the VQ algorithm does not converge.
MacQueen method: the learning constant $L$ is the reciprocal of the number of observations $N_n$ in the cluster associated with the winning seed $S_n$, i.e. $L = 1/N_n$. This algorithm converges; a sketch follows.
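A from-scratch R sketch of VQ with the MacQueen learning rate ($L = 1/N_n$), for a numeric matrix X; illustrative only:

vq_macqueen <- function(X, k, epochs = 10) {
  seeds <- X[sample(nrow(X), k), , drop = FALSE]   # step 1: initial seeds
  counts <- rep(1, k)                              # N_n per seed
  for (e in 1:epochs) {
    for (i in 1:nrow(X)) {
      # step 2: the winning seed = nearest centroid to x_i
      n <- which.min(colSums((t(seeds) - X[i, ])^2))
      counts[n] <- counts[n] + 1
      L <- 1 / counts[n]                           # MacQueen: L = 1/N_n
      seeds[n, ] <- seeds[n, ] * (1 - L) + X[i, ] * L   # step 3
    }
  }
  seeds
}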

Kohonen SOM (Self-Organizing Maps)
1. Select k initial points (cluster centroids) and arrange them on a 2D map
2. For observation $x_i$, find the winning seed $S_n$
3. Modify all centroids: $S_j = S_j (1 - K(j, n) L) + x_i K(j, n) L$, where $L$ is the learning constant (decreasing during training) and $K(j, n)$ is a function decreasing with increasing distance on the 2D map between the $S_j$ and $S_n$ centroids ($K(j, j) = 1$)
4. Repeat steps 2 and 3 over all training observations

Example - SOM-based clustering of the wine data (R language, dataset wines, package kohonen)
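A minimal runnable version of this example; the grid size and random seed are assumptions, not taken from the slides:

library(kohonen)
data(wines)                         # wine data shipped with the package
wines.sc <- scale(wines)
set.seed(7)
wine.som <- som(wines.sc, grid = somgrid(5, 4, "hexagonal"))
plot(wine.som, type = "codes")      # codebook vectors on the 2D grid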

Example - SOM-based clustering of the wine data (continued)

R implementation of the SOM algorithm: function som() (package kohonen)
Results: the structure wine.som; important members:
wine.som$codes          # codebook vectors
wine.som$unit.classif   # winning units for all data points

Codebook vectors represent the clusters created at each 2D grid element (the attributes of the codebook vectors are the mean values of the respective attributes of the cluster elements).


Results: the assignment of observations (individual wines) to the 2D grid.
The seeds (codebook vectors) can themselves be grouped, e.g. with hierarchical clustering (the hclust function), as sketched below:
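A hedged sketch of this grouping step; the choice of 3 clusters is an assumption for illustration (note that in kohonen version 3 and later, wine.som$codes is a list of matrices):

codes <- wine.som$codes[[1]]                     # codebook vectors as a matrix
hc <- hclust(dist(codes))
som.clusters <- cutree(hc, k = 3)
plot(wine.som, type = "mapping", bgcol = rainbow(3)[som.clusters])
add.cluster.boundaries(wine.som, som.clusters)   # outline the seed groups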

Example - SOM in R

Example - SOM in R (continued)

Example - SOM in R (continued)