Cluster Analysis. Mu-Chun Su, Department of Computer Science and Information Engineering, National Central University. 2003/3/11

Introduction Cluster analysis is the formal study of algorithms and methods for grouping data. It is a tool for exploring the structure of the data, with applications in a variety of engineering and scientific disciplines.

Applications of Cluster Analysis (1) Biology, Psychology, Archeology, Geology, Marketing, Information retrieval, Remote sensing, etc.

Applications of Cluster Analysis (2) Characterizing customer groups based on purchasing patterns. Categorizing Web documents. Grouping genes and proteins that have similar functionality. Grouping spatial locations prone to earthquakes based on seismological data. Feature extraction. Image segmentation.

Background While it is easy to give a functional definition of a cluster, it is very difficult to give an operational definition of one. A cluster is a set of entities which are alike, and entities from different clusters are not alike. Should this similarity be judged at a global level or a local level?

Data Representation (1) [figure]

Data Representation (2) Pattern Matrix: It can be viewed as an n x d matrix, where n and d represent the number of objects and features, respectively. Ex (read here as n = 5 objects by d = 4 features):
2  4.5   8   1
5  4    13   5
7  0     0  19
2  4     6  -2
6  7     7  26

Data Representation (3) Proximity Matrix: It accumulates the pairwise indices of proximity in a matrix in which each row and column represents a pattern. Ex:
0 4 7 6 1
4 0 9 1 4
7 9 0 8 3
6 1 8 0 5
1 4 3 5 0
Note: All proximity matrices are symmetric.
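
To make the relationship between the two representations concrete, the following NumPy sketch (my own illustration, not part of the original slides) computes a proximity matrix of pairwise Euclidean distances from a pattern matrix; it reuses the 5 x 4 reading of the example matrix above, and Euclidean distance is only one possible proximity index.

```python
import numpy as np

# Pattern matrix from the previous slide, read here as n = 5 objects x d = 4 features
# (an assumption about the row/column layout of the flattened example).
X = np.array([
    [2, 4.5,  8,  1],
    [5, 4.0, 13,  5],
    [7, 0.0,  0, 19],
    [2, 4.0,  6, -2],
    [6, 7.0,  7, 26],
])

# Proximity matrix of pairwise Euclidean distances: D[i, k] = ||x_i - x_k||.
diff = X[:, None, :] - X[None, :, :]       # shape (n, n, d)
D = np.sqrt((diff ** 2).sum(axis=-1))      # shape (n, n)

print(np.round(D, 2))                      # symmetric, with zeros on the diagonal
```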

Data Types and Scales (1) Data Types: the degree of quantization in the data. Binary: 0/1, Yes/No. Discrete: a finite number of possible values. Continuous: a point on the real line.

Data Types and Scales (2) Data Scale: It indicates the relative significance of numbers. Qualitative (nominal and ordinal) scales: discrete numbers can be coded on these qualitative scales. (1) A nominal scale is not really a scale at all, because numbers are simply used as names. E.g. a (yes, no) response could be coded as (0, 1), (1, 0), or (50, 100). (2) The ordinal scale: the numbers have meaning only in relation to one another. E.g. (1, 2, 3), (10, 20, 30), and (100, 200, 300) are all equivalent from an ordinal viewpoint.

Data Types and Scales (3) Quantitative (interval and ratio) scales: a unit of measurement exists (interval), or an absolute zero exists along with a unit of measurement (ratio). (1) Interval: the interpretation of the numbers depends on this unit. E.g. 90 degrees on a Fahrenheit vs. a Celsius scale, or judged satisfaction. (2) Ratio: the ratio between two numbers has meaning. E.g. the distance between two cities.

Proximity Indices A proximity index between the ith and kth patterns is denoted d(i,k). The most common proximity index for patterns is the Minkowski metric, which measures dissimilarity:
$d(i,k) = \left( \sum_{j=1}^{d} |x_{ij} - x_{kj}|^{r} \right)^{1/r}$

Common Distance Metrics
Euclidean distance (r = 2): $d(i,k) = \left[ \sum_{j=1}^{d} (x_{ij} - x_{kj})^{2} \right]^{1/2} = \left[ (\mathbf{x}_i - \mathbf{x}_k)^{T} (\mathbf{x}_i - \mathbf{x}_k) \right]^{1/2}$
Manhattan or city block distance (r = 1): $d(i,k) = \sum_{j=1}^{d} |x_{ij} - x_{kj}|$
Mahalanobis distance: $d(i,k) = (\mathbf{x}_i - \mathbf{x}_k)^{T} \Sigma^{-1} (\mathbf{x}_i - \mathbf{x}_k)$
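
These metrics can be written in a few lines of NumPy. The sketch below is an illustrative implementation (not from the slides); the sample vectors are arbitrary, and, following the slide, the Mahalanobis form omits a square root.

```python
import numpy as np

def minkowski(x_i, x_k, r=2.0):
    """Minkowski dissimilarity: (sum_j |x_ij - x_kj|^r)^(1/r)."""
    x_i, x_k = np.asarray(x_i, float), np.asarray(x_k, float)
    return float((np.abs(x_i - x_k) ** r).sum() ** (1.0 / r))

def mahalanobis(x_i, x_k, cov):
    """Mahalanobis form from the slide: (x_i - x_k)^T Sigma^{-1} (x_i - x_k)."""
    diff = np.asarray(x_i, float) - np.asarray(x_k, float)
    return float(diff @ np.linalg.inv(cov) @ diff)

a, b = np.array([2.0, 4.5, 8.0, 1.0]), np.array([5.0, 4.0, 13.0, 5.0])
print(minkowski(a, b, r=2))           # Euclidean distance (r = 2)
print(minkowski(a, b, r=1))           # Manhattan / city block distance (r = 1)
print(mahalanobis(a, b, np.eye(4)))   # with Sigma = I this equals the squared Euclidean distance
```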

Normalization (1) Some normalization is usually employed, based on the requirements of the analysis.

Normalization (2) Zero mean and unit variance. Let $m_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}$ and $\sigma_j^{2} = \frac{1}{N} \sum_{i=1}^{N} (x_{ij} - m_j)^{2}$ denote the sample mean and variance of the jth feature. (1) Invariant to rigid displacements: $x^{*}_{ij} = x_{ij} - m_j$. (2) All features have zero mean and unit variance: $x^{*}_{ij} = \dfrac{x_{ij} - m_j}{\sigma_j}$.
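
As a small illustration (not from the slides), the following sketch applies the zero-mean, unit-variance normalization feature by feature and checks the result; it uses the 1/N variance convention and assumes no feature has zero variance.

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance normalization of each feature (column) of X."""
    X = np.asarray(X, float)
    m = X.mean(axis=0)        # m_j: per-feature sample mean
    sigma = X.std(axis=0)     # sigma_j: per-feature standard deviation (1/N convention)
    return (X - m) / sigma

X = np.array([[2, 4.5,  8,  1],
              [5, 4.0, 13,  5],
              [7, 0.0,  0, 19],
              [2, 4.0,  6, -2],
              [6, 7.0,  7, 26]], dtype=float)

Z = standardize(X)
print(np.round(Z.mean(axis=0), 10))   # approximately 0 for every feature
print(np.round(Z.std(axis=0), 10))    # 1 for every feature
```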

Classification Types (1) Clustering is a special kind of classification.

Classification Types (2) Exclusive vs. Nonexclusive: An exclusive classification assigns each object to exactly one subset, or cluster, whereas a nonexclusive classification can assign an object to several classes. Unsupervised vs. Supervised: An unsupervised classification uses only the proximity matrix to perform the classification, whereas a supervised classification uses category labels on the objects as well as the proximity matrix.

Classification Types (3) Hierarchical vs. Partitional: A hierarchical classification is a nested sequence of partitions, whereas a partitional classification is a single partition.

Hierarchical Clustering (1) A picture of a hierarchical clustering is much easier for a human being to comprehend than is a list of abstract symbols. A dendrogram is a special type of tree structure that provides a convenient picture of a hierarchical clustering. Two types: agglomerative and divisive. Agglomerative: it starts with the disjoint clustering, which places each of the n objects in an individual cluster, and then merges them in a nested procedure. Divisive: it performs the task in the reverse order.

Hierarchical Clustering (2) Step 1: Assign each object to its own cluster. Step 2: Compute the distances between all clusters. Step 3: Merge the two clusters that are closest to each other. Step 4: Return to Step 2 until there is only one cluster left.
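
These four steps are what standard agglomerative-clustering routines carry out internally. The sketch below (an illustration using SciPy on made-up data, not part of the slides) builds the merge history and then cuts the resulting dendrogram into a fixed number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy data; any pattern matrix of shape (n objects, d features) would do.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.8], [9.0, 0.0]])

# Steps 1-4 above: start from singleton clusters and repeatedly merge the two closest.
Z = linkage(pdist(X), method='single')    # (n-1) x 4 merge history (the dendrogram)

# Cutting the dendrogram gives a flat clustering, here into 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```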

Hierarchical Clustering (3) An example of a nested sequence of partitions:
{X1}, {X2}, {X3}, {X4}, {X5}
{X1, X2}, {X3}, {X4}, {X5}
{X1, X2}, {X3, X4}, {X5}
{X1, X2, X3, X4}, {X5}
{X1, X2, X3, X4, X5}
Note: Cutting the dendrogram horizontally creates a clustering.

Hierarchical Clustering (4)
The single-linkage algorithm: $D_{SL}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a,b)$
The complete-linkage algorithm: $D_{CL}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a,b)$
The average-linkage algorithm: $D_{AL}(C_i, C_j) = \frac{1}{N_i N_j} \sum_{a \in C_i,\, b \in C_j} d(a,b)$

Hierarchical Clustering (5) The single-linkage algorithm allows clusters to grow long and thin. The complete-linkage algorithm produces more compact clusters. Both the single-linkage and complete-linkage algorithms are susceptible to distortion by outliers or deviant observations. The average-linkage algorithm is an attempt to compromise between the extremes of the single-linkage and complete-linkage algorithms.
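
A quick way to see these tendencies is to run the three linkage rules on the same data. The sketch below (my own toy experiment, not from the slides) uses two elongated point clouds plus a single outlier and compares the cluster sizes each rule produces when the dendrogram is cut into three clusters; the exact outcome depends on the random draw, but the contrast between methods is usually visible.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two long, thin groups plus one outlier, to expose the differences discussed above.
X = np.vstack([
    rng.normal([0, 0], [3.0, 0.3], size=(30, 2)),
    rng.normal([0, 5], [3.0, 0.3], size=(30, 2)),
    [[20.0, 2.5]],                                  # outlier
])

d = pdist(X)
for method in ('single', 'complete', 'average'):
    labels = fcluster(linkage(d, method=method), t=3, criterion='maxclust')
    print(method, 'cluster sizes:', np.bincount(labels)[1:])
```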

Hierarchical Clustering (6) [figure]

Partitional Clustering Partitional: we generate a single partition of the data in an attempt to recover the natural groups present in the data. Basic idea: simply select a criterion, evaluate it for all possible partitions containing K clusters, and pick the partition that optimizes the criterion. Hierarchical techniques are favored in the biological, social, and behavioral sciences because of the need to construct taxonomies; partitional techniques are common in engineering applications where single partitions are important.

Algorithm for Iterative Partitional Clustering Step 1. Select an initial partition with K clusters. Step 2. Generate a new partition by assigning each pattern to its closest cluster center. Step 3. Compute new cluster centers as the centers of the clusters. Step 4. Repeat Steps 2 and 3 until an optimum value of the criterion function is found. Step 5. Adjust the number of clusters by merging and splitting existing clusters or by removing small, or outlier, clusters.

The K-means Algorithm (1) Step 1: Choose K initial cluster centers $c_1(1), \ldots, c_K(1)$. Step 2: At the kth iterative step, distribute the samples among the K cluster domains using the relation $x \in S_j(k)$ if $\|x - c_j(k)\| < \|x - c_i(k)\|$ for all $i \neq j$. Step 3: Compute the new cluster centers $c_j(k+1) = \frac{1}{N_j} \sum_{x \in S_j(k)} x$, $j = 1, \ldots, K$, where $N_j$ is the number of samples in $S_j(k)$. Step 4: If $c_j(k+1) = c_j(k)$ for $j = 1, \ldots, K$, the algorithm has converged and the procedure is terminated. Otherwise, go to Step 2.
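
The four steps translate directly into code. The following from-scratch sketch (illustrative only, not the original implementation) uses randomly chosen data points as the initial centers and Euclidean distance.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means following the slide: assign to the nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # Step 1: initial centers
    for _ in range(n_iter):
        # Step 2: assign each sample to its closest cluster center.
        labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1).argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned samples.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(K)])
        # Step 4: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in ([0, 0], [3, 3], [0, 3])])
labels, centers = kmeans(X, K=3)
print(np.round(centers, 2))
```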

The K-Means Algorithm (2) Seed patterns can be the first K patterns or K randomly chosen data points. Different initial partitions can lead to different final clustering results. If the clustering results using several different initial partitions all lead to the same final partition, we have some confidence in the result. The Euclidean distance can be replaced by the Mahalanobis distance.

The K-Means Algorithm (3)-(5) [figures]

Nearest-Neighbor Clustering Algorithm (1) Step 1: Set i = 1 and k = 1. Assign pattern $x_1$ to cluster $C_1$. Step 2: Set i = i + 1. Find the nearest neighbor of $x_i$ among the patterns already assigned to clusters. Let $d_m$ denote the distance from $x_i$ to its nearest neighbor, and suppose that the nearest neighbor is in cluster $C_m$. Step 3: If $d_m \leq t$ (a prespecified threshold), then assign $x_i$ to $C_m$. Otherwise, set k = k + 1 and assign $x_i$ to a new cluster $C_k$. Step 4: If every pattern has been assigned to a cluster, stop. Else, go to Step 2. Note: The number of clusters generated, K, is a function of the parameter t. As the value of t increases, fewer clusters are generated.
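
A direct implementation of these four steps is short. The sketch below (an illustration, not from the slides) processes the patterns in order and shows how raising the threshold t reduces the number of clusters, as the note states.

```python
import numpy as np

def nn_threshold_clustering(X, t):
    """Sequential nearest-neighbor clustering with distance threshold t (steps 1-4 above)."""
    X = np.asarray(X, float)
    labels = np.empty(len(X), dtype=int)
    labels[0] = 0                                   # Step 1: x_1 starts the first cluster
    n_clusters = 1
    for i in range(1, len(X)):                      # Step 2: examine patterns in order
        d = np.linalg.norm(X[:i] - X[i], axis=1)    # distances to already-assigned patterns
        nearest = d.argmin()
        if d[nearest] <= t:                         # Step 3: join the nearest neighbor's cluster...
            labels[i] = labels[nearest]
        else:                                       # ...or start a new cluster
            labels[i] = n_clusters
            n_clusters += 1
    return labels

X = np.array([[0, 0], [0.4, 0.1], [5, 5], [5.2, 4.9], [0.2, 0.3]])
for t in (1.0, 10.0):                               # a larger t yields fewer clusters
    print(t, nn_threshold_clustering(X, t))
```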

Nearest-Neighbor Clustering Algorithm (2)-(3) [figures]

Projections Projection algorithms map a set of N n-dimensional patterns onto an m-dimensional space, where m < n. The main motivation for projection algorithms is to permit visual examination of multidimensional data, so that one can cluster by eye and qualitatively validate clustering results. Projection algorithms can be categorized into two types: linear and nonlinear.

Linear Projections (1) $y_i = H x_i$ for $i = 1, \ldots, N$. Linear projection algorithms are relatively simple to use and have well-understood mathematical properties. Eigenvector projection (the Karhunen-Loève method) is commonly used: the eigenvectors of the covariance matrix define a linear projection that replaces the features in the raw data with uncorrelated features.

Linear Projections (2) Let $\Sigma$ denote the covariance matrix of the data, with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ and corresponding eigenvectors (principal components) $c_1, c_2, \ldots, c_d$, where $m = \frac{1}{N} \sum_{i=1}^{N} x_i$ and $\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - m)(x_i - m)^{T}$.

Linear Projections (3) Define the m x d transformation matrix H as $H = \begin{bmatrix} c_1^{T} \\ c_2^{T} \\ \vdots \\ c_m^{T} \end{bmatrix}$.

Linear Projections (4) This matrix projects the pattern space into an m-dimensional subspace (m < d) whose axes are in the directions of the largest eigenvalues of $\Sigma$: $y_i = H x_i$ for $i = 1, \ldots, N$. The covariance matrix in the new space becomes the diagonal matrix $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$.

Linear Projections (5) This implies that the m new features are uncorrelated. One could choose m so that $r_m = \sum_{i=1}^{m} \lambda_i \Big/ \sum_{i=1}^{d} \lambda_i \geq 0.95$, which would assure that 95% of the variance is retained in the new space. Thus a good eigenvector projection is one that retains a large proportion of the variance present in the original feature space with only a few features in the transformed space.
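
Putting the last few slides together, the following sketch (my own illustration, not from the slides) performs the eigenvector projection: it forms the covariance matrix, sorts the eigenvalues, picks the smallest m with r_m >= 0.95, and projects the data with the m x d matrix H. Centering by the sample mean before projecting is an implementation choice here; the slides write y_i = H x_i directly.

```python
import numpy as np

def eigenvector_projection(X, var_retained=0.95):
    """Project X (n x d) onto the top-m principal components retaining >= var_retained variance."""
    X = np.asarray(X, float)
    mean = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)    # covariance matrix (1/N convention)
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # ascending eigenvalues for symmetric Sigma
    order = np.argsort(eigvals)[::-1]             # reorder so lambda_1 >= lambda_2 >= ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = np.cumsum(eigvals) / eigvals.sum()   # r_m for m = 1, ..., d
    m = int(np.searchsorted(ratios, var_retained) + 1)
    H = eigvecs[:, :m].T                          # m x d transformation matrix
    Y = (X - mean) @ H.T                          # projected patterns y_i
    return Y, H, ratios[m - 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ np.diag([5.0, 2.0, 1.0, 0.1, 0.05])   # unequal feature variances
Y, H, r_m = eigenvector_projection(X)
print(Y.shape, round(float(r_m), 3))              # typically m < d while r_m >= 0.95
```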

Linear Projections (6)-(8) [figures]

Linear Projections (9) There is no guarantee that the features with the largest eigenvalues will be best for preserving the separation among categories.

Nonlinear Projections The inability of linear projections to preserve complex data structures has made nonlinear projections more popular in recent years. Most nonlinear projection algorithms are based on maximizing or minimizing an objective function. Nonlinear projection algorithms are expensive to use, so several heuristics are employed to reduce the search time for the optimal solution. In exploratory data analysis, we seek two-dimensional projections to visually perceive the structure present in the data.

Sammon's Algorithm (1) Sammon proposed a nonlinear technique that tries to create a two-dimensional configuration of points in which interpattern distances are preserved. Let $\{x_i\}$ denote a set of N n-dimensional patterns and let $d(i,j)$ denote the distance between patterns $x_i$ and $x_j$ in the n-dimensional space. Let $\{y_i\}$ denote the corresponding set of N m-dimensional patterns to be found and let $D(i,j)$ denote the distance between patterns $y_i$ and $y_j$ in the m-dimensional space.

Sammon's Algorithm (2) Sammon suggested minimizing the error function E, called the stress: $E = \dfrac{1}{\sum_{i<j} d(i,j)} \sum_{i<j} \dfrac{[d(i,j) - D(i,j)]^{2}}{d(i,j)}$. Sammon's algorithm starts with a random configuration of N patterns in m dimensions and uses the method of steepest descent to reconfigure the patterns so as to minimize E in an iterative fashion. The algorithm should be applied with several initial configurations to improve the chance of reaching a global minimum of E.
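
The stress E itself is straightforward to compute. The sketch below (not from the slides) evaluates it for an arbitrary configuration; it assumes no two patterns coincide, so that no d(i,j) is zero.

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X, Y):
    """Sammon's stress E between original distances d(i,j) and mapped distances D(i,j)."""
    d = pdist(np.asarray(X, float))     # pairwise distances in the original space
    D = pdist(np.asarray(Y, float))     # pairwise distances in the projected space
    return float(((d - D) ** 2 / d).sum() / d.sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
Y = rng.normal(size=(50, 2))            # an arbitrary 2-D configuration (no optimization yet)
print(round(sammon_stress(X, Y), 4))
```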

Sammon's Algorithm (3) The steepest-descent update for the jth coordinate of the ith projected point is
$y_{ij}(t+1) = y_{ij}(t) - \alpha \dfrac{\partial E(t)}{\partial y_{ij}(t)} = y_{ij}(t) + \dfrac{2\alpha}{\lambda} \sum_{k=1,\, k \neq i}^{N} \left[ \dfrac{d(i,k) - D(i,k)}{d(i,k)\, D(i,k)} \right] (y_{ij} - y_{kj})$, where $\lambda = \sum_{i<j} d(i,j)$.
Ref: N. R. Pal and V. K. Eluri, "Two efficient connectionist schemes for structure preserving dimensionality reduction," IEEE Trans. on Neural Networks, vol. 9, no. 6, pp. 1142-1154, 1998.
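
For completeness, here is a plain steepest-descent sketch built around the update rule above (my own simplified illustration, not the Pal-Eluri scheme or Sammon's original second-order iteration); the step size and iteration count are arbitrary and not tuned for convergence.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sammon_map(X, m=2, alpha=0.3, n_iter=300, seed=0):
    """Steepest-descent sketch of Sammon's mapping using the update rule above."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    N = len(X)
    dc = pdist(X)                                # condensed original-space distances
    lam = dc.sum()                               # lambda = sum_{i<j} d(i, j)
    d = squareform(dc)                           # full matrix d(i, k)
    np.fill_diagonal(d, 1.0)                     # placeholder to avoid division by zero
    Y = rng.normal(scale=1e-2, size=(N, m))      # random initial m-dimensional configuration
    for _ in range(n_iter):
        D = squareform(pdist(Y))                 # projected-space distances D(i, k)
        np.fill_diagonal(D, 1.0)
        W = (d - D) / (d * D)                    # [d(i,k) - D(i,k)] / [d(i,k) D(i,k)]
        np.fill_diagonal(W, 0.0)
        # y_i <- y_i + (2*alpha/lambda) * sum_k W[i,k] * (y_i - y_k)
        Y = Y + (2.0 * alpha / lam) * (W.sum(axis=1, keepdims=True) * Y - W @ Y)
    return Y

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
print(sammon_map(X).shape)                       # (60, 2)
```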

Sammon's Algorithm (4) Figure: (a) the iris data set; (b) a 10-dimensional data set.