Cluster Analysis
Mu-Chun Su
Department of Computer Science and Information Engineering
National Central University
2003/3/11
Introduction
Cluster analysis is the formal study of algorithms and methods for grouping data; it is a tool for exploring the structure of data. It has applications in a variety of engineering and scientific disciplines.
Applications of Cluster Analysis (1)
Biology, Psychology, Archeology, Geology, Marketing, Information retrieval, Remote sensing, etc.
Applications of Cluster Analysis (2)
Characterizing customer groups based on purchasing patterns. Categorizing Web documents. Grouping genes and proteins that have similar functionality. Grouping spatial locations prone to earthquakes based on seismological data. Feature extraction. Image segmentation.
Background
While it is easy to give a functional definition of a cluster, it is very difficult to give an operational definition of one. A cluster is a set of entities which are alike, and entities from different clusters are not alike. Should this likeness be judged at a global level or a local level?
Data Representation (1)
Data Representation (2)
Pattern Matrix: the data can be viewed as an n × d matrix, where n and d represent the number of objects and features, respectively. Ex (n = 5 objects, d = 4 features):
 2  4.5  8   1
 5  4   13   5
 7  0    0  19
 2  4    6  -2
 6  7    7  26
Data Representation (3)
Proximity Matrix: it accumulates the pairwise indices of proximity in a matrix in which each row and column represents a pattern. Ex:
0 4 7 6 1
4 0 9 1 4
7 9 0 8 3
6 1 8 0 5
1 4 3 5 0
Note: all proximity matrices are symmetric.
Data Types and Scales (1)
Data Types: the degree of quantization in the data. Binary: 0/1, Yes/No. Discrete: a finite number of possible values. Continuous: a point on the real line.
Data Types and Scales (2)
Data Scale: it indicates the relative significance of numbers. Qualitative (nominal and ordinal) scales: discrete numbers can be coded on these qualitative scales. (1) A nominal scale is not really a scale at all, because the numbers are simply used as names. E.g. a (yes, no) response could be coded as (0, 1), (1, 0), or (50, 100). (2) The ordinal scale: the numbers have meaning only in relation to one another. E.g. (1, 2, 3), (10, 20, 30), and (100, 200, 300) are all equivalent from an ordinal viewpoint.
Data Types and Scales (3)
Quantitative (interval and ratio) scales: a unit of measurement exists vs. an absolute zero exists along with a unit of measurement. (1) Interval: the interpretation of the numbers depends on the unit. E.g. 90 degrees Fahrenheit vs. Celsius, or judged satisfaction. (2) Ratio: the ratio between two numbers has meaning. E.g. the distance between two cities.
Proximity Indices
A proximity index between the ith and kth patterns is denoted d(i,k). The most common proximity index for patterns is the Minkowski metric, which measures dissimilarity:
d(i,k) = \left( \sum_{j=1}^{d} |x_{ij} - x_{kj}|^r \right)^{1/r}
Common Distance Metrics
Euclidean distance (r = 2):
d(i,k) = \left[ \sum_{j=1}^{d} (x_{ij} - x_{kj})^2 \right]^{1/2} = \left[ (\mathbf{x}_i - \mathbf{x}_k)^T (\mathbf{x}_i - \mathbf{x}_k) \right]^{1/2}
Manhattan or city-block distance (r = 1):
d(i,k) = \sum_{j=1}^{d} |x_{ij} - x_{kj}|
Mahalanobis distance:
d(i,k) = (\mathbf{x}_i - \mathbf{x}_k)^T \Sigma^{-1} (\mathbf{x}_i - \mathbf{x}_k)
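The three metrics above can be checked with a short NumPy sketch. The two sample points and the data used to estimate the covariance are made up for illustration; note that the Mahalanobis expression on this slide is the squared form, which the code follows.

```python
import numpy as np

# Hypothetical 2-D patterns; any feature vectors work the same way.
x_i = np.array([1.0, 2.0])
x_k = np.array([4.0, 6.0])

# Euclidean distance (Minkowski metric with r = 2)
d_euclid = np.sqrt(np.sum((x_i - x_k) ** 2))   # 5.0

# Manhattan / city-block distance (r = 1)
d_city = np.sum(np.abs(x_i - x_k))             # 7.0

# Mahalanobis (squared form, as on the slide), with a covariance
# estimated from a small made-up pattern matrix
X = np.array([[1.0, 2.0], [4.0, 6.0], [2.0, 3.0], [5.0, 4.0]])
cov = np.cov(X, rowvar=False)                  # d x d covariance matrix
diff = x_i - x_k
d_mahal = diff @ np.linalg.inv(cov) @ diff
```

Unlike the Euclidean metric, the Mahalanobis distance rescales each direction by the data's spread, so correlated features are not double-counted.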
Normalization (1)
Some normalization is usually employed based on the requirements of the analysis.
Normalization (2)
Zero mean and unit variance. Let
m_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}, \quad \sigma_j^2 = \frac{1}{N} \sum_{i=1}^{N} (x_{ij} - m_j)^2
(1) Invariant to rigid displacements: x^*_{ij} = x_{ij} - m_j
(2) All features have zero mean and unit variance: x^{**}_{ij} = \frac{x_{ij} - m_j}{\sigma_j}
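Both normalizations are one-liners in NumPy; the toy pattern matrix below is made up for illustration.

```python
import numpy as np

# Toy pattern matrix: rows are objects, columns are features.
X = np.array([[2.0, 4.0], [4.0, 8.0], [6.0, 12.0]])

m = X.mean(axis=0)            # per-feature mean m_j
sigma = X.std(axis=0)         # per-feature std sigma_j (1/N form, as above)

X_centered = X - m            # (1) translation: invariant to rigid displacements
X_standard = (X - m) / sigma  # (2) zero mean and unit variance per feature
```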
Classification Types (1)
Clustering is a special kind of classification.
Classification Types (2)
Exclusive vs. Nonexclusive: an exclusive classification assigns each object to exactly one subset, or cluster; a nonexclusive classification can assign an object to several classes. Unsupervised vs. Supervised: an unsupervised classification uses only the proximity matrix to perform the classification; a supervised classification uses category labels on the objects as well as the proximity matrix.
Classification Types (3)
Hierarchical vs. Partitional: a hierarchical classification is a nested sequence of partitions, whereas a partitional classification is a single partition.
Hierarchical Clustering (1)
A picture of a hierarchical clustering is much easier for a human being to comprehend than is a list of abstract symbols. A dendrogram is a special type of tree structure that provides a convenient picture of a hierarchical clustering. Two types: agglomerative and divisive. Agglomerative: starts with the disjoint clustering, which places each of the n objects in an individual cluster, and then merges them in a nested procedure. Divisive: performs the task in the reverse order.
Hierarchical Clustering (2)
Step 1: Assign each object to its own cluster.
Step 2: Compute the distances between all clusters.
Step 3: Merge the two clusters that are closest to each other.
Step 4: Return to Step 2 until there is only one cluster left.
Hierarchical Clustering (3)
{X1}, {X2}, {X3}, {X4}, {X5}
{X1, X2}, {X3}, {X4}, {X5}
{X1, X2}, {X3, X4}, {X5}
{X1, X2, X3, X4}, {X5}
{X1, X2, X3, X4, X5}
Note: cutting a dendrogram horizontally creates a clustering.
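Steps 1–4 can be sketched directly in Python. The sketch below uses single linkage as the cluster-to-cluster distance and takes the 5×5 proximity matrix from the Data Representation slide as input; stopping at two clusters is an arbitrary choice for illustration.

```python
import numpy as np

def single_linkage(D, num_clusters):
    """Agglomerative clustering on a proximity matrix D.

    Starts from singleton clusters and repeatedly merges the closest
    pair (single-linkage distance) until num_clusters remain.
    """
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > num_clusters:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: minimum pairwise distance between clusters
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters

# Proximity matrix from the Data Representation slide (5 patterns)
D = [[0, 4, 7, 6, 1],
     [4, 0, 9, 1, 4],
     [7, 9, 0, 8, 3],
     [6, 1, 8, 0, 5],
     [1, 4, 3, 5, 0]]
print(single_linkage(D, 2))   # → [[0, 4, 2], [1, 3]]
```

Recording the merge order and merge distances instead of only the final partition would yield exactly the dendrogram described above.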
Hierarchical Clustering (4)
The single-linkage algorithm: D_{SL}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)
The complete-linkage algorithm: D_{CL}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)
The average-linkage algorithm: D_{AL}(C_i, C_j) = \frac{1}{N_i N_j} \sum_{a \in C_i} \sum_{b \in C_j} d(a, b)
Hierarchical Clustering (5)
The single-linkage algorithm allows clusters to grow long and thin. The complete-linkage algorithm produces more compact clusters. Both the single-linkage and the complete-linkage algorithms are susceptible to distortion by outliers or deviant observations. The average-linkage algorithm is an attempt to compromise between the extremes of the single-linkage and complete-linkage algorithms.
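The three linkage criteria are just the min, max, and mean of the pairwise distance table between two clusters. A sketch with two made-up 1-D clusters shows the outlier sensitivity: the far point at 10 dominates complete linkage but barely affects single linkage.

```python
import numpy as np

# Two hypothetical clusters of 1-D points; 10 acts as an outlier in Cj.
Ci = np.array([0.0, 1.0])
Cj = np.array([4.0, 10.0])

# d(a, b) for every a in Ci, b in Cj
pairwise = np.abs(Ci[:, None] - Cj[None, :])

D_SL = pairwise.min()    # single linkage:   3.0  (|1 - 4|)
D_CL = pairwise.max()    # complete linkage: 10.0 (|0 - 10|)
D_AL = pairwise.mean()   # average linkage:  6.5
```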
Hierarchical Clustering (6)
Partitional Clustering
Partitional: we generate a single partition of the data in an attempt to recover the natural groups present in the data. Basic idea: simply select a criterion, evaluate it for all possible partitions containing K clusters, and pick the partition that optimizes the criterion. Hierarchical techniques suit the biological, social, and behavioral sciences because of the need to construct taxonomies; partitional techniques suit engineering applications where single partitions are important.
Algorithm for Iterative Partitional Clustering
Step 1: Select an initial partition with K clusters.
Step 2: Generate a new partition by assigning each pattern to its closest cluster center.
Step 3: Compute new cluster centers as the centers of the clusters.
Step 4: Repeat Steps 2 and 3 until an optimum value of the criterion function is found.
Step 5: Adjust the number of clusters by merging and splitting existing clusters or by removing small, or outlier, clusters.
The K-Means Algorithm (1)
Step 1: Choose K initial cluster centers c_1(1), ..., c_K(1).
Step 2: At the kth iterative step, distribute the samples among the K cluster domains using the relation x \in S_j(k) if \|x - c_j(k)\| < \|x - c_i(k)\| for all i \neq j.
Step 3: Compute the new cluster centers
c_j(k+1) = \frac{1}{N_j} \sum_{x \in S_j(k)} x, \quad j = 1, ..., K
where N_j is the number of samples in S_j(k).
Step 4: If c_j(k+1) = c_j(k) for all j, the algorithm has converged and the procedure is terminated. Otherwise go to Step 2.
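Steps 1–4 can be sketched in a few lines of NumPy. The two toy blobs are made up for illustration, and seeding with K randomly chosen data points is one of several possible initializations.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Minimal K-means sketch following Steps 1-4 above."""
    rng = np.random.default_rng(seed)
    # Step 1: seed with K randomly chosen data points
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Step 2: assign each sample to its nearest center (Euclidean)
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Step 3: recompute each center as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(K)])
        # Step 4: stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, centers = kmeans(X, K=2)
```

A production version would also guard against a cluster becoming empty during Step 3; the toy data above never triggers that case.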
The K-Means Algorithm (2)
Seed patterns can be the first K patterns or K randomly chosen data points. Different initial partitions can lead to different final clustering results. If the clustering results using several different initial partitions all lead to the same final partition, we have some confidence in the result. The Euclidean distance can be replaced by the Mahalanobis distance.
The K-Means Algorithm (3)-(5): [figures]
Nearest-Neighbor Clustering Algorithm (1)
Step 1: Set i = 1 and k = 1. Assign pattern x_1 to cluster C_1.
Step 2: Set i = i + 1. Find the nearest neighbor of x_i among the patterns already assigned to clusters. Let d_m denote the distance from x_i to its nearest neighbor, and suppose that the nearest neighbor is in cluster C_m.
Step 3: If d_m \le t (a prespecified threshold), assign x_i to C_m. Otherwise, set k = k + 1 and assign x_i to a new cluster C_k.
Step 4: If every pattern has been assigned to a cluster, stop. Else, go to Step 2.
Note: the number of clusters generated, K, is a function of the parameter t. As the value of t increases, fewer clusters are generated.
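The four steps above translate into a short single-pass sketch; the 1-D points and threshold are made up for illustration.

```python
import numpy as np

def nn_clustering(X, t):
    """Nearest-neighbor clustering sketch per Steps 1-4 above.

    t is the distance threshold; a larger t yields fewer clusters.
    """
    labels = [0]                                    # x_1 starts cluster 0
    for i in range(1, len(X)):
        # distance from x_i to every already-assigned pattern
        dists = np.linalg.norm(X[:i] - X[i], axis=1)
        nearest = int(np.argmin(dists))
        if dists[nearest] <= t:
            labels.append(labels[nearest])          # join the neighbor's cluster
        else:
            labels.append(max(labels) + 1)          # open a new cluster
    return labels

X = np.array([[0.0], [0.5], [5.0], [5.4]])
print(nn_clustering(X, t=1.0))   # → [0, 0, 1, 1]
```

Because patterns are processed in order and never reassigned, the result depends on the presentation order of the data as well as on t.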
Nearest-Neighbor Clustering Algorithm (2)-(3): [figures]
Projections
Projection algorithms map a set of N n-dimensional patterns onto an m-dimensional space, where m < n. The main motivation for projection algorithms is to permit visual examination of multidimensional data, so that one can cluster by eye and qualitatively validate clustering results. Projection algorithms can be categorized into two types: linear and nonlinear.
Linear Projections (1)
y_i = H x_i for i = 1, ..., N
Linear projection algorithms are relatively simple to use and have well-understood mathematical properties. Eigenvector projection (the Karhunen-Loeve method) is commonly used. The eigenvectors of the covariance matrix Σ define a linear projection and replace the features in the raw data with uncorrelated features.
Linear Projections (2)
Let Σ denote the covariance matrix of the data and \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d its eigenvalues. Let c_1, c_2, ..., c_d denote the corresponding eigenvectors (principal components).
m = \frac{1}{N} \sum_{i=1}^{N} x_i, \quad \Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - m)(x_i - m)^T
Linear Projections (3)
Define the m × d transformation matrix H as
H = \begin{bmatrix} c_1^T \\ c_2^T \\ \vdots \\ c_m^T \end{bmatrix}
Linear Projections (4)
This matrix projects the pattern space into an m-dimensional subspace (m < d) whose axes are in the directions of the largest eigenvalues of Σ:
y_i = H x_i for i = 1, ..., N
The covariance matrix in the new space becomes the diagonal matrix diag(\lambda_1, \lambda_2, ..., \lambda_m).
Linear Projections (5)
This implies that the m new features are uncorrelated. One could choose m so that
r_m = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \ge 0.95
which would assure that 95% of the variance is retained in the new space. Thus a good eigenvector projection is one which retains a large proportion of the variance present in the original feature space with only a few features in the transformed space.
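The whole eigenvector-projection pipeline fits in a short NumPy sketch. The data are made up so that the third feature is almost pure noise, and `np.cov` is used for Σ (it normalizes by N−1 rather than N, which leaves the eigenvectors and the ratio r_m unchanged).

```python
import numpy as np

# Made-up data: 3 features, the third near-constant noise.
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0],
                     base[:, 0] + 0.1 * base[:, 1],
                     0.01 * rng.normal(size=200)])

m = X.mean(axis=0)
Sigma = np.cov(X - m, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]          # sort eigenpairs descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

m_dims = 2
H = eigvecs[:, :m_dims].T                  # m x d transformation matrix
Y = (X - m) @ H.T                          # y_i = H (x_i - m)

# fraction of variance retained by the top m eigenvalues
retained = eigvals[:m_dims].sum() / eigvals.sum()
```

The covariance of `Y` comes out (numerically) diagonal, matching the claim that the new features are uncorrelated.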
Linear Projections (6)-(8): [figures]
Linear Projections (9)
There is no guarantee that the features with the largest eigenvalues will be best for preserving the separation among categories.
Nonlinear Projections
The inability of linear projections to preserve complex data structures has made nonlinear projections more popular in recent years. Most nonlinear projection algorithms are based on maximizing or minimizing an objective function. Nonlinear projection algorithms are expensive to use, so several heuristics are employed to reduce the search time for the optimal solution. In exploratory data analysis, we seek two-dimensional projections to visually perceive the structure present in the data.
Sammon's Algorithm (1)
Sammon proposed a nonlinear technique that tries to create a two-dimensional configuration of points in which interpattern distances are preserved. Let {x_i} denote a set of N n-dimensional patterns and let d(i, j) denote the distance between patterns x_i and x_j in the n-dimensional space. Let {y_i} denote a corresponding set of N m-dimensional patterns to be found, and let D(i, j) denote the distance between patterns y_i and y_j in the m-dimensional space.
Sammon's Algorithm (2)
Sammon suggested minimizing the error function E, called the stress:
E = \frac{1}{\sum_{i<j} d(i,j)} \sum_{i<j} \frac{[d(i,j) - D(i,j)]^2}{d(i,j)}
Sammon's algorithm starts with a random configuration of N patterns in m dimensions and uses the method of steepest descent to reconfigure the patterns so as to minimize E in an iterative fashion. The algorithm should be applied with several initial configurations to ensure a global minimum of E.
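A minimal NumPy sketch of the steepest-descent minimization of E is shown below. It is a simplification: Sammon's original algorithm uses a second-order (pseudo-Newton) step, whereas this sketch applies a plain first-order update; the initialization (first m coordinates plus tiny noise), step size, and toy data are illustrative choices.

```python
import numpy as np

def sammon(X, m=2, alpha=0.3, iters=500, seed=0):
    """First-order sketch of Sammon's mapping (plain steepest descent)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    d = np.linalg.norm(X[:, None] - X[None], axis=2)      # d(i,k) in input space
    lam = d[np.triu_indices(N, 1)].sum()                  # lambda = sum_{i<j} d(i,j)
    # initialize with the first m coordinates plus tiny noise
    Y = X[:, :m] + rng.normal(scale=1e-3, size=(N, m))
    for _ in range(iters):
        D = np.linalg.norm(Y[:, None] - Y[None], axis=2)  # D(i,k) in output space
        np.fill_diagonal(D, 1.0)                          # dummy; masked below
        dsafe = d.copy()
        np.fill_diagonal(dsafe, 1.0)
        W = (d - D) / (D * dsafe)                         # (d - D) / (D * d)
        np.fill_diagonal(W, 0.0)
        # y_ij <- y_ij + (2 alpha / lambda) sum_{k != i} W_ik (y_ij - y_kj)
        Y = Y + (2 * alpha / lam) * (W.sum(axis=1)[:, None] * Y - W @ Y)
    return Y

# Two hypothetical clusters in 3-D, mapped to 2-D
X = np.array([[0., 0., 0.], [1., 0., 1.], [0., 1., 2.],
              [5., 5., 0.], [6., 5., 1.], [5., 6., 2.]])
Y = sammon(X)
```

Each iteration nudges every output point so that the interpoint distances D(i, k) move toward the original distances d(i, k), which is exactly what drives E downward.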
Sammon's Algorithm (3)
y_{ij}(t+1) = y_{ij}(t) - \alpha \frac{\partial E(t)}{\partial y_{ij}(t)} = y_{ij}(t) + \frac{2\alpha}{\lambda} \sum_{k=1, k \neq i}^{N} \left[ \frac{d(i,k) - D(i,k)}{D(i,k)\, d(i,k)} \right] (y_{ij}(t) - y_{kj}(t))
where \lambda = \sum_{i<j} d(i, j).
Ref: N. R. Pal and V. K. Eluri, "Two efficient connectionist schemes for structure preserving dimensionality reduction," IEEE Trans. on Neural Networks, vol. 9, no. 6, pp. 1142-1154, 1998.
Sammon's Algorithm (4)
Figure: (a) iris data set; (b) 10-dimensional data set.