Methods for Intelligent Systems


Lecture Notes on Clustering (II)

Davide Eynard
Department of Electronics and Information, Politecnico di Milano

Course Schedule

Date        Topic
28/03/2006  Clustering Introduction & Algorithms (I) (K-Means, Hierarchical)
11/04/2006  Clustering Algorithms (II) (Fuzzy, SOM, Gaussians, PDDP)
16/05/2006  How many clusters? (Evaluations and tuning)
20/06/2006  Monograph on Text Clustering (I)
21/06/2006  Monograph on Text Clustering (II) (14.15 AM2) (+ exercises)

Lecture outline

- PDDP
- Fuzzy C-Means
- Gaussian Mixtures
- Self-Organizing Maps

PDDP

PDDP stands for "Principal Direction Divisive Partitioning":

- Principal Direction, because the algorithm is based on the computation of the leading principal direction at each stage of the partitioning
- Partitioning, because each document is placed in exactly one cluster, so that at every stage the clusters are disjoint and their union equals the entire set of documents
- Divisive, because it is a hierarchical divisive (vs. hierarchical agglomerative) clustering algorithm

PDDP Algorithm Description

- The sample space consists of m samples, where each sample (document) is an n-vector of numerical values.
- Each document is represented by a column vector of attribute values $d = (d_1, d_2, \ldots, d_n)^T$ whose i-th entry, $d_i$, is the relative frequency of the i-th word.
- Each document vector is normalized to have a Euclidean length of 1:

  $d_i = \frac{TF_i}{\sqrt{\sum_j (TF_j)^2}}$

  where $TF_i$ is the number of occurrences of word i in the particular document d.
- The entire set of documents is represented by an $n \times m$ matrix $M = (d_1, \ldots, d_m)$ whose i-th column, $d_i$, is the column vector representing the i-th document.
- The algorithm proceeds by separating the entire set of documents into two partitions by using principal directions.
- Each of the two partitions is then split into two subpartitions by applying the same process recursively.
- The result is a hierarchical structure of partitions arranged into a binary tree.

Two questions remain open:

- What method is used to split a partition into two subpartitions?
- In what order are the partitions selected to be split?
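The unit-length normalization of term frequencies described above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function name `normalize_tf` and the handling of an all-zero document are assumptions.

```python
import math

def normalize_tf(term_counts):
    """Scale raw term counts TF_i so the resulting vector has Euclidean length 1."""
    norm = math.sqrt(sum(tf * tf for tf in term_counts))
    if norm == 0:
        # Assumption: an empty document is mapped to the zero vector.
        return [0.0 for _ in term_counts]
    return [tf / norm for tf in term_counts]

d = normalize_tf([3, 0, 4])
print(d)  # [0.6, 0.0, 0.8]
```

Here the counts (3, 0, 4) have length 5, so dividing by 5 gives a vector of length 1.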

Splitting a partition

- The mean or centroid of the document set is

  $w = \frac{d_1 + \cdots + d_m}{m} = \frac{1}{m} M e$

- In the general case, the covariance matrix is

  $C = (M - w e^T)(M - w e^T)^T = A A^T$

  where $A = M - w e^T$ and $e = (1, 1, \ldots, 1)^T$ is a vector of appropriate dimension.
- The eigenvectors corresponding to the k largest eigenvalues are called the principal components or principal directions.
- A partition of p documents is represented by an $n \times p$ matrix $M_p = (d_1, \ldots, d_p)$ where each $d_i$ is an n-vector representing a document.
- The matrix $M_p$ is a submatrix of M consisting of some selection of p columns of M, not necessarily the first p in the set!
- The principal directions of the matrix $M_p$ are the eigenvectors of its sample covariance matrix C.
- We are interested in projecting each document onto the single leading eigenvector u (the principal direction).
- Besides reducing the dimensionality, the transformation often has the effect of removing much of the noise present in the data.

Principal Component Analysis

[Figure slides illustrating Principal Component Analysis; images not preserved]

Splitting a partition

- The projection of the i-th document $d_i$ is given by the formula

  $\sigma v_i = u^T (d_i - w)$

  where $\sigma$ is a positive constant.
- All the documents are translated so that their mean is at the origin, then they are projected onto the principal direction.
- The values $v_1, \ldots, v_p$ (one per document in the partition) are used to determine the split, according to their sign.
- We still have to decide at each stage which node should be split next: a "scatter" value is used, which measures the distance between each document in the cluster and the overall mean of the cluster (a measure of its cohesiveness).

The algorithm

Start with the $n \times m$ matrix M of (scaled) document vectors, and a desired number of clusters $c_{max}$.

1. Initialize Binary Tree with a single Root Node
2. For c = 2, 3, ..., c_max do
3.   Select node K with largest scat value
4.   Create nodes L := leftchild(K) and R := rightchild(K)
5.   Set indices(L) := indices of the non-positive entries in rightvec(K)
6.   Set indices(R) := indices of the positive entries in rightvec(K)
7.   Compute all the other fields for the nodes L, R
8. end.

Result: a binary tree with $c_{max}$ leaf nodes forming a partitioning of the entire data set.

Conclusions

- The PDDP algorithm performs at least as well as an agglomerative algorithm, but is much faster.
- Its expected running time is linear in the number of documents, whereas unmodified agglomerative algorithms typically have $O(m^2)$ running time.
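The splitting loop above can be sketched as follows. This is a simplified illustration under several assumptions rather than the reference implementation: it keeps only the current leaf nodes instead of the full binary tree, it takes the scatter value to be the sum of squared distances from the cluster centroid, and it finds the leading eigenvector of $C = A A^T$ by power iteration (which can fail if the start vector happens to be orthogonal to that eigenvector). All function names are hypothetical.

```python
import math

def center(docs):
    """Subtract the centroid w from every document vector."""
    w = [sum(col) / len(docs) for col in zip(*docs)]
    return [[x - wi for x, wi in zip(d, w)] for d in docs]

def scatter(docs):
    """Assumed scatter value: total squared distance from the centroid."""
    return sum(x * x for row in center(docs) for x in row)

def principal_direction(centered, n_iter=200):
    """Leading eigenvector of C = A A^T via power iteration."""
    n = len(centered[0])
    u = [1.0 / (i + 1) for i in range(n)]  # arbitrary asymmetric start vector
    for _ in range(n_iter):
        proj = [sum(a * b for a, b in zip(row, u)) for row in centered]
        cu = [sum(p * row[i] for p, row in zip(proj, centered)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in cu))
        if norm == 0.0:
            break  # no variance left in this cluster
        u = [x / norm for x in cu]
    return u

def pddp(docs, c_max):
    """Return up to c_max leaf clusters as lists of document indices."""
    leaves = [list(range(len(docs)))]
    while len(leaves) < c_max:
        # Select the leaf with the largest scatter value.
        k = max(range(len(leaves)),
                key=lambda j: scatter([docs[i] for i in leaves[j]]))
        idx = leaves[k]
        centered = center([docs[i] for i in idx])
        u = principal_direction(centered)
        # Projection v_i = u^T (d_i - w); the sign decides the side.
        v = [sum(a * b for a, b in zip(row, u)) for row in centered]
        left = [i for i, val in zip(idx, v) if val <= 0]
        right = [i for i, val in zip(idx, v) if val > 0]
        if not left or not right:
            break  # the selected leaf cannot be split further
        leaves[k:k + 1] = [left, right]
    return leaves

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(pddp(docs, 2))
```

On this toy input the first split separates documents 0 and 1 from documents 2 and 3; which group ends up "left" depends on the sign of the eigenvector, which is arbitrary.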

Experimental results

[Figure slides showing experimental results; images not preserved]

Fuzzy C-Means

Fuzzy C-Means (FCM, developed by Dunn in 1973 and improved by Bezdek in 1981) is a method of clustering which allows one piece of data to belong to two or more clusters.

- It is frequently used in pattern recognition.
- It is based on the minimization of the following objective function:

  $J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \, \|x_i - c_j\|^2, \quad 1 \le m < \infty$

  where:
  - m is any real number greater than 1 (fuzziness coefficient),
  - $u_{ij}$ is the degree of membership of $x_i$ in the cluster j,
  - $x_i$ is the i-th of the d-dimensional measured data,
  - $c_j$ is the d-dimensional center of the cluster,
  - $\|\cdot\|$ is any norm expressing the similarity between measured data and the center.

K-Means vs. FCM

- With K-Means, a datum either belongs to centroid A or to centroid B.
- With FCM, the same datum does not belong exclusively to one cluster, but it may belong to several clusters with different values of the membership coefficient.

Data representation

(KM) $U_{N \times C}$ = [matrix not preserved]
(FCM) $U_{N \times C}$ = [matrix not preserved]

FCM Algorithm

The algorithm is composed of the following steps:

1. Initialize the matrix $U = [u_{ij}]$, $U^{(0)}$
2. At step k: calculate the center vectors $C^{(k)} = [c_j]$ with $U^{(k)}$:

   $c_j = \frac{\sum_{i=1}^{N} u_{ij}^m \, x_i}{\sum_{i=1}^{N} u_{ij}^m}$

3. Update $U^{(k)}$ to $U^{(k+1)}$:

   $u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\|x_i - c_j\|}{\|x_i - c_k\|} \right)^{\frac{2}{m-1}}}$

4. If $\|U^{(k+1)} - U^{(k)}\| < \varepsilon$ then STOP; otherwise return to step 2.
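The four FCM steps can be sketched as below. This is a minimal illustration under stated assumptions: the membership matrix is initialized at random with rows summing to 1, the Euclidean norm is used for distances, the stopping test uses the largest absolute change in any membership, and a point that coincides with a center receives full membership there. The function name `fcm` and its parameters are hypothetical.

```python
import math
import random

def fcm(points, n_clusters, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy C-Means on a list of d-dimensional points (lists of floats)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Step 1: random membership matrix U(0); each row sums to 1.
    U = []
    for _ in points:
        row = [rng.random() for _ in range(n_clusters)]
        total = sum(row)
        U.append([r / total for r in row])
    centers = []
    for _ in range(max_iter):
        # Step 2: centers are means weighted by u_ij^m.
        centers = []
        for j in range(n_clusters):
            w = [U[i][j] ** m for i in range(len(points))]
            total = sum(w)
            centers.append([sum(wi * p[d] for wi, p in zip(w, points)) / total
                            for d in range(dim)])
        # Step 3: update memberships from distances to the centers.
        new_U = []
        for p in points:
            dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))
                     for c in centers]
            if min(dists) == 0.0:
                # Assumption: a point sitting on a center belongs fully to it.
                row = [1.0 if dj == 0.0 else 0.0 for dj in dists]
                total = sum(row)
                row = [r / total for r in row]
            else:
                power = 2.0 / (m - 1.0)
                row = [1.0 / sum((dj / dk) ** power for dk in dists)
                       for dj in dists]
            new_U.append(row)
        # Step 4: stop when the memberships barely change.
        change = max(abs(a - b) for ra, rb in zip(new_U, U)
                     for a, b in zip(ra, rb))
        U = new_U
        if change < eps:
            break
    return centers, U

points = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
centers, U = fcm(points, 2)
print(sorted(round(c[0], 2) for c in centers))
```

Each row of `U` holds one point's membership coefficients, which sum to 1; with K-Means the same row would contain a single 1 and zeros elsewhere.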

An Example

[Figure slides showing a Fuzzy C-Means example; images not preserved]

Clustering as a Mixture of Gaussians

- Mixture of Gaussians is a model-based clustering approach.
- It uses a statistical model for clusters and attempts to optimize the fit between the data and the model.
- Each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete).
- The entire data set is modelled by a mixture of these distributions.
- A mixture model with high likelihood tends to have the following traits:
  - Component distributions have high "peaks" (data in one cluster are tight)
  - The mixture model "covers" the data well (dominant patterns in data are captured by component distributions)

Advantages of Model-Based Clustering

- well-studied statistical inference techniques are available
- flexibility in choosing the component distribution
- a density estimation is obtained for each cluster
- a "soft" classification is available

Mixture of Gaussians

It is the most widely used model-based clustering method: we can actually consider clusters as Gaussian distributions centered on their barycentres (as we can see in the figure, where the grey circle represents the first variance of the distribution).

[Figure not preserved]

How does it work?

- It chooses the component (the Gaussian) at random with probability $P(\omega_i)$
- It samples a point from $N(\mu_i, \sigma^2 I)$

Let us suppose we have $x_1, x_2, \ldots, x_n$ and $P(\omega_1), \ldots, P(\omega_K), \sigma$.

We can obtain the likelihood of the sample: $P(x \mid \omega_i, \mu_1, \mu_2, \ldots, \mu_K)$ (the probability that an observation from class $\omega_i$ would have value x given class means $\mu_1, \ldots, \mu_K$).

What we really want is to maximize $P(x \mid \mu_1, \mu_2, \ldots, \mu_K)$... Can we do it? How?

The Algorithm

The algorithm is composed of the following steps:

1. Initialize parameters:

   $\lambda_0 = \{\mu_1^{(0)}, \mu_2^{(0)}, \ldots, \mu_K^{(0)}, p_1^{(0)}, p_2^{(0)}, \ldots, p_K^{(0)}\}$

2. E-step:

   $P(\omega_j \mid x_k, \lambda_t) = \frac{P(x_k \mid \omega_j, \lambda_t) \, P(\omega_j \mid \lambda_t)}{P(x_k \mid \lambda_t)} = \frac{P(x_k \mid \omega_j, \mu_j^{(t)}, \sigma^2) \, p_j^{(t)}}{\sum_{l} P(x_k \mid \omega_l, \mu_l^{(t)}, \sigma^2) \, p_l^{(t)}}$

3. M-step:

   $\mu_i^{(t+1)} = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t) \, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$

   $p_i^{(t+1)} = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)}{R}$

   where R is the number of records.
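The E-step and M-step above can be sketched for one-dimensional data with a known, shared standard deviation, which matches the setting of the slides where only the means and the mixing probabilities are re-estimated. This is a minimal illustration: the function names, the fixed number of iterations, and the starting values are assumptions, and no guard is included for numerical underflow when a point is very far from every mean.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_gaussian_mixture(xs, mus, priors, sigma, n_iter=50):
    """EM for a 1-D Gaussian mixture with fixed, shared sigma."""
    mus, priors = list(mus), list(priors)
    R = len(xs)  # number of records
    for _ in range(n_iter):
        # E-step: P(w_j | x_k, lambda_t) for every record and component.
        resp = []
        for x in xs:
            joint = [gaussian_pdf(x, mu, sigma) * p
                     for mu, p in zip(mus, priors)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate means and mixing probabilities.
        for i in range(len(mus)):
            weight = sum(r[i] for r in resp)
            mus[i] = sum(r[i] * x for r, x in zip(resp, xs)) / weight
            priors[i] = weight / R
    return mus, priors

xs = [0.0, 0.2, -0.2, 4.0, 4.2, 3.8]
mus, priors = em_gaussian_mixture(xs, mus=[1.0, 3.0], priors=[0.5, 0.5], sigma=1.0)
print([round(mu, 2) for mu in mus], [round(p, 2) for p in priors])
```

The responsibilities computed in the E-step are the "soft" classification mentioned earlier: each record contributes to every component in proportion to its posterior probability.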


Self Organizing Feature Maps

Kohonen Self Organizing Feature Maps (a.k.a. SOM) provide a way to represent multidimensional data in much lower dimensional spaces.

- They implement a data compression technique similar to vector quantization.
- They store information in such a way that any topological relationships within the training set are maintained.

Example: mapping of colors from their three dimensional components (i.e., red, green and blue) into two dimensions.

Self Organizing Feature Maps: The Topology

- The network is a lattice of "nodes", each of which is fully connected to the input layer.
- Each node has a specific topological position and contains a vector of weights of the same dimension as the input vectors.
- There are no lateral connections between nodes within the lattice.

A SOM does not need a target output to be specified; instead, where the node weights match the input vector, that area of the lattice is selectively optimized to more closely resemble the data vector.

Self Organizing Feature Maps: The Algorithm

Training occurs in several steps over many iterations:

1. Initialize each node's weights
2. Present a random vector from the training set to the lattice
3. Examine every node to calculate which one's weights are most like the input vector (the winning node is commonly known as the Best Matching Unit, BMU)
4. Calculate the radius of the neighborhood of the BMU (this is a value that starts large, typically set to the radius of the lattice, but diminishes each time-step); any nodes found within this radius are deemed to be inside the BMU's neighborhood
5. Adjust each neighboring node's weights to make them more like the input vector. The closer a node is to the BMU, the more its weights get altered
6. Repeat from step 2 for N iterations

Practical Learning of Self Organizing Feature Maps

There are a few things that have to be specified in the previous algorithm:

- Choose the weight initialization
- Select the Best Matching Unit according to the distance of its weights from the input vector:

  $\|x - w_i\| = \sqrt{\sum_{k=1}^{p} (x[k] - w_i[k])^2}$

- Select the neighborhood according to some decreasing function:

  $h_{ij} = e^{-\frac{(i - j)^2}{2\sigma^2}}$

- Define the updating rule:

  $w_i(t+1) = \begin{cases} w_i(t) + \alpha(t)\,[x(t) - w_i(t)], & i \in N_i(t) \\ w_i(t), & i \notin N_i(t) \end{cases}$
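The training loop can be sketched on a one-dimensional lattice, where node positions are simply their indices i and j. This is a minimal illustration under several assumptions not fixed by the slides: weights are initialized uniformly at random, both the learning rate $\alpha(t)$ and the neighborhood width $\sigma$ decay exponentially, and the Gaussian $h_{ij}$ is used to scale the update of every node instead of a hard neighborhood cutoff. The function name `train_som` and its parameters are hypothetical.

```python
import math
import random

def train_som(data, n_nodes, n_iter=500, alpha0=0.5, seed=0):
    """Train a 1-D lattice of n_nodes weight vectors on the given data."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: random weight initialization (assumption).
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    sigma0 = n_nodes / 2.0  # initial neighborhood width: lattice radius
    for t in range(n_iter):
        decay = math.exp(-t / n_iter)      # assumed decay schedule
        alpha = alpha0 * decay             # learning rate alpha(t)
        sigma = max(sigma0 * decay, 1e-3)  # shrinking neighborhood width
        # Step 2: present a random training vector.
        x = rng.choice(data)
        # Step 3: Best Matching Unit = node with the closest weights.
        bmu = min(range(n_nodes),
                  key=lambda i: sum((x[k] - weights[i][k]) ** 2
                                    for k in range(dim)))
        # Steps 4-5: pull nodes toward x, more strongly near the BMU.
        for i in range(n_nodes):
            h = math.exp(-((i - bmu) ** 2) / (2.0 * sigma ** 2))
            for k in range(dim):
                weights[i][k] += alpha * h * (x[k] - weights[i][k])
    return weights

colors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
som = train_som(colors, n_nodes=10)
for w in som:
    print([round(v, 2) for v in w])
```

Using the Gaussian as a soft neighborhood keeps the code short; a hard cutoff as in the updating rule above would instead skip every node whose lattice distance from the BMU exceeds the current radius.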

Bibliography

- A Tutorial on Clustering Algorithms, online tutorial by M. Matteucci
- As usual, more info on del.icio.us
