Methods for Intelligent Systems


Lecture Notes on Clustering (II)
Davide Eynard (eynard@elet.polimi.it)
Department of Electronics and Information, Politecnico di Milano

Course Schedule

Date        Topic
28/03/2006  Clustering Introduction & Algorithms (I) (K-Means, Hierarchical)
11/04/2006  Clustering Algorithms (II) (Fuzzy, SOM, Gaussians, PDDP)
16/05/2006  How many clusters? (Evaluation and tuning)
20/06/2006  Monograph on Text Clustering (I)
21/06/2006  Monograph on Text Clustering (II) (14.15, AM2) (+ exercises)

Lecture outline

- PDDP
- Fuzzy C-Means
- Gaussian Mixtures
- Self-Organizing Maps

PDDP

PDDP stands for "Principal Direction Divisive Partitioning":

- Principal Direction, because the algorithm is based on the computation of the leading principal direction at each stage of the partitioning
- Partitioning, because we start by placing all data into one cluster and then keep splitting it: at every stage the clusters are disjoint and their union equals the entire set of documents
- Divisive, because it is a hierarchical divisive (as opposed to hierarchical agglomerative) clustering algorithm

PDDP Algorithm Description

The sample space consists of m samples, where each sample (document) is an n-vector of numerical values. Each document is represented by a column vector of attribute values d = (d_1, d_2, ..., d_n)^T whose i-th entry, d_i, is the relative frequency of the i-th word. Each document vector is normalized to have Euclidean length 1:

$d_i = \frac{TF_i}{\sqrt{\sum_j (TF_j)^2}}$

where TF_i is the number of occurrences of word i in the particular document d.

PDDP Algorithm Description

The entire set of documents is represented by an n x m matrix M = (d_1, ..., d_m) whose i-th column, d_i, is the column vector representing the i-th document. The algorithm proceeds by separating the entire set of documents into two partitions using principal directions. Each of the two partitions is then split into two subpartitions using the same process recursively. The result is a hierarchical structure of partitions arranged into a binary tree.

Two questions remain: What method is used to split a partition into two subpartitions? In what order are the partitions selected to be split?
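As a concrete illustration of this representation, here is a minimal Python sketch (not from the original notes) that builds unit-length term-frequency vectors for a toy corpus; the vocabulary and the counts are made up for the example.

```python
import numpy as np

def normalize_document(tf_counts):
    """Scale a raw term-frequency vector to unit Euclidean length,
    i.e. d_i = TF_i / sqrt(sum_j TF_j^2)."""
    tf = np.asarray(tf_counts, dtype=float)
    norm = np.linalg.norm(tf)
    return tf / norm if norm > 0 else tf

# Toy corpus: 3 documents over a 4-word vocabulary (each becomes a column of M).
raw_counts = [[2, 0, 1, 3],
              [0, 4, 0, 1],
              [1, 1, 2, 0]]
M = np.column_stack([normalize_document(doc) for doc in raw_counts])
print(M.shape)                      # (4, 3): n words x m documents
print(np.linalg.norm(M, axis=0))    # every column has Euclidean length 1
```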

Splitting a partition

The mean or centroid of the document set is

$w = \frac{d_1 + \cdots + d_m}{m} = \frac{1}{m} M e$

where e = (1, 1, ..., 1)^T is a vector of appropriate dimension. In the general case, the covariance matrix is

$C = (M - w e^T)(M - w e^T)^T = A A^T$

The eigenvectors corresponding to the k largest eigenvalues are called the principal components or principal directions.

Splitting a partition

A partition of p documents is represented by an n x p matrix M_p = (d_1, ..., d_p) where each d_i is an n-vector representing a document. The matrix M_p is a submatrix of M consisting of some selection of p columns of M, not necessarily the first p in the set! The principal directions of the matrix M_p are the eigenvectors of its sample covariance matrix C. We are interested in temporarily projecting each document onto the single leading eigenvector u (the principal direction). Besides reducing the dimensionality, this transformation often has the effect of removing much of the noise present in the data.
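To make the linear algebra concrete, the following Python sketch (a helper of mine, not part of the original notes) computes the centroid w and the leading principal direction u of a partition matrix M_p. It uses the SVD of the centered matrix A = M_p - w e^T, whose first left singular vector equals the leading eigenvector of C = A A^T.

```python
import numpy as np

def leading_principal_direction(Mp):
    """Return the leading principal direction u and the centroid w of the
    columns of Mp, i.e. the dominant eigenvector of C = (Mp - w e^T)(Mp - w e^T)^T."""
    w = Mp.mean(axis=1, keepdims=True)             # centroid of the partition
    A = Mp - w                                     # center every column (A = Mp - w e^T)
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, 0], w.ravel()                      # u: first left singular vector
```

The projections used later for splitting are then simply u^T(d_i - w), i.e. `u @ A` in this notation.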

Principal Component Analysis (figure slides)

Splitting a partition

The projection of the i-th document d_i is given by

$\sigma v_i = u^T (d_i - w)$

where σ is a positive constant. All the documents are translated so that their mean is at the origin, then they are projected onto the principal direction. The values v_1, ..., v_p are used to determine the splitting (according to their sign).

We still have to decide at each stage which node should be split next: a "scatter" value is used, measuring the distance between each document in the cluster and the overall mean of the cluster (a measure of its cohesiveness).

The algorithm

Start with the n x m matrix M of (scaled) document vectors and a desired number of clusters c_max.

1. Initialize the binary tree with a single root node
2. For c = 2, 3, ..., c_max do
3.   Select the node K with the largest scatter (scat) value
4.   Create nodes L := leftchild(K) and R := rightchild(K)
5.   Set indices(L) := indices of the non-positive entries in rightvec(K)
6.   Set indices(R) := indices of the positive entries in rightvec(K)
7.   Compute all the other fields for the nodes L, R
8. end

Result: a binary tree with c_max leaf nodes forming a partitioning of the entire data set.

Conclusions

The PDDP algorithm is at least as effective as an agglomerative algorithm, but much faster. Its expected running time is linear in the number of documents, whereas unmodified agglomerative algorithms typically have O(m^2) running time.
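Below is a compact, non-authoritative Python sketch of the loop just described. It keeps only the leaf clusters rather than the full binary tree, picks the node to split by its scatter value, and splits by the sign of the projections onto the leading principal direction; the function names are illustrative.

```python
import numpy as np

def scatter(Mp):
    """Scatter value: sum of squared distances of the columns from their centroid."""
    return float(np.sum((Mp - Mp.mean(axis=1, keepdims=True)) ** 2))

def pddp(M, c_max):
    """PDDP sketch: repeatedly split the leaf with the largest scatter by the
    sign of the projections onto its leading principal direction."""
    leaves = [np.arange(M.shape[1])]                  # start with all columns in one cluster
    while len(leaves) < c_max:
        k = int(np.argmax([scatter(M[:, idx]) for idx in leaves]))
        idx = leaves.pop(k)
        Mp = M[:, idx]
        A = Mp - Mp.mean(axis=1, keepdims=True)       # center the partition
        u = np.linalg.svd(A, full_matrices=False)[0][:, 0]
        v = u @ A                                     # projections sigma*v_i = u^T (d_i - w)
        left, right = idx[v <= 0], idx[v > 0]
        if len(left) == 0 or len(right) == 0:         # degenerate split: give up on this node
            leaves.append(idx)
            break
        leaves += [left, right]
    return leaves                                     # list of index arrays, one per cluster
```

For instance, `pddp(M, 4)` returns four index arrays that together partition the columns of M, assuming the data admit that many non-degenerate splits.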

Experimental results (figure slides)

Fuzzy C-Means

Fuzzy C-Means (FCM, developed by Dunn in 1973 and improved by Bezdek in 1981) is a clustering method that allows one piece of data to belong to two or more clusters. It is frequently used in pattern recognition and is based on the minimization of the following objective function:

$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \, \lVert x_i - c_j \rVert^2, \qquad 1 \le m < \infty$

where:
- m is any real number greater than 1 (the fuzziness coefficient),
- u_ij is the degree of membership of x_i in cluster j,
- x_i is the i-th d-dimensional measured data point,
- c_j is the d-dimensional center of cluster j,
- ||·|| is any norm expressing the similarity between the measured data and the center.
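As a quick way to check the formula, here is a small Python helper (mine, not from the notes) that evaluates J_m for a data matrix X (N x d), a membership matrix U (N x C) and cluster centers (C x d), assuming the Euclidean norm.

```python
import numpy as np

def fcm_objective(X, U, centers, m=2.0):
    """J_m = sum_i sum_j u_ij^m * ||x_i - c_j||^2, with the Euclidean norm."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, C)
    return float(np.sum((U ** m) * sq_dist))
```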

K-Means vs. FCM

With K-Means, a datum either belongs to centroid A or to centroid B.

With FCM, the same datum does not belong exclusively to one cluster; it may belong to several clusters with different values of the membership coefficient.

Data representation

With K-Means, the N x C membership matrix U is binary: each row assigns its datum to exactly one cluster, e.g.

$U_{KM} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix}$

With FCM, the entries are membership degrees and each row sums to 1, e.g.

$U_{FCM} = \begin{pmatrix} 0.8 & 0.2 \\ 0.3 & 0.7 \\ 0.6 & 0.4 \\ \vdots & \vdots \\ 0.9 & 0.1 \end{pmatrix}$

FCM Algorithm

The algorithm is composed of the following steps:

1. Initialize the membership matrix U = [u_ij], U^(0)
2. At step k, calculate the center vectors C^(k) = [c_j] from U^(k):

$c_j = \frac{\sum_{i=1}^{N} u_{ij}^m \, x_i}{\sum_{i=1}^{N} u_{ij}^m}$

3. Update U^(k) to U^(k+1):

$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}$

4. If $\lVert U^{(k+1)} - U^{(k)} \rVert < \varepsilon$ then STOP; otherwise return to step 2.
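The following Python sketch implements the iteration above under stated assumptions: random initialization of U, the Euclidean norm, and a tiny constant added to the distances to avoid division by zero. It illustrates the update equations and is not the course's reference implementation.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Sketch of FCM: alternate the center update c_j and the membership
    update u_ij until the change in U falls below eps."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)                    # each row sums to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # c_j = sum_i u_ij^m x_i / sum_i u_ij^m
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.linalg.norm(U_new - U) < eps:              # stop when ||U^(k+1) - U^(k)|| < eps
            U = U_new
            break
        U = U_new
    return centers, U
```

Usage: `centers, U = fuzzy_c_means(X, n_clusters=3)` returns the cluster centers and the fuzzy membership matrix for a data matrix X of shape (N, d).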

An Example (figure slides)

Clustering as a Mixture of Gaussians

A Gaussian mixture is a model-based clustering approach: it uses a statistical model for the clusters and attempts to optimize the fit between the data and the model.

- Each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete)
- The entire data set is modelled by a mixture of these distributions

A mixture model with high likelihood tends to have the following traits:
- Component distributions have high "peaks" (data in one cluster are tight)
- The mixture model "covers" the data well (dominant patterns in the data are captured by the component distributions)

Advantages of Model-Based Clustering

- well-studied statistical inference techniques are available
- flexibility in choosing the component distribution
- a density estimate is obtained for each cluster
- a "soft" classification is available

Mixture of Gaussians

It is the most widely used model-based clustering method: we can consider clusters as Gaussian distributions centered on their barycentres (in the accompanying figure, the grey circle represents the first variance of the distribution).

How does it work?

- it chooses the component (the Gaussian) ω_i at random with probability P(ω_i)
- it samples a point from N(µ_i, σ² I)

Suppose we have x_1, x_2, ..., x_n and P(ω_1), ..., P(ω_K), σ. We can obtain the likelihood of the sample: P(x | ω_i, µ_1, µ_2, ..., µ_K) (the probability that an observation from class ω_i would have value x given the class means µ_1, ..., µ_K).

What we really want is to maximize P(x | µ_1, µ_2, ..., µ_K)... Can we do it? How?

The Algorithm

The algorithm is composed of the following steps:

1. Initialize the parameters:

$\lambda_0 = \{\mu_1^{(0)}, \mu_2^{(0)}, \ldots, \mu_k^{(0)}, p_1^{(0)}, p_2^{(0)}, \ldots, p_k^{(0)}\}$

2. E-step:

$P(\omega_i \mid x_k, \lambda_t) = \frac{P(x_k \mid \omega_i, \lambda_t)\, P(\omega_i \mid \lambda_t)}{P(x_k \mid \lambda_t)} = \frac{P(x_k \mid \omega_i, \mu_i^{(t)}, \sigma^2)\, p_i^{(t)}}{\sum_j P(x_k \mid \omega_j, \mu_j^{(t)}, \sigma^2)\, p_j^{(t)}}$

3. M-step:

$\mu_i^{(t+1)} = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)} \qquad p_i^{(t+1)} = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)}{R}$

where R is the number of records.
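To make the E and M steps concrete, here is a Python sketch for the isotropic case described above, with a shared, fixed variance σ²; a general Gaussian mixture would also re-estimate the covariances in the M-step. Function and variable names are illustrative assumptions, not from the original notes.

```python
import numpy as np

def em_isotropic_gmm(X, K, sigma2=1.0, n_iter=50, seed=0):
    """EM sketch with fixed isotropic variance sigma2: the E-step computes the
    responsibilities P(w_i | x_k, lambda_t), the M-step updates means and priors."""
    rng = np.random.default_rng(seed)
    R = X.shape[0]                                     # R = number of records
    mu = X[rng.choice(R, size=K, replace=False)]       # initialize means on random points
    p = np.full(K, 1.0 / K)                            # initialize mixing proportions
    for _ in range(n_iter):
        # E-step: Gaussian densities times priors (the shared normalizer cancels out)
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        resp = p * np.exp(-sq_dist / (2.0 * sigma2))
        resp /= resp.sum(axis=1, keepdims=True)        # P(w_i | x_k, lambda_t)
        # M-step: responsibility-weighted means and updated priors
        Nk = resp.sum(axis=0)
        mu = (resp.T @ X) / Nk[:, None]
        p = Nk / R
    return mu, p, resp
```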

Self-Organizing Feature Maps

Kohonen Self-Organizing Feature Maps (a.k.a. SOMs) provide a way to represent multidimensional data in much lower-dimensional spaces.

- They implement a data compression technique similar to vector quantization
- They store information in such a way that any topological relationships within the training set are maintained

Example: mapping colors from their three-dimensional components (i.e., red, green and blue) into two dimensions.

Self-Organizing Feature Maps: The Topology

- The network is a lattice of "nodes", each of which is fully connected to the input layer
- Each node has a specific topological position and contains a vector of weights of the same dimension as the input vectors
- There are no lateral connections between nodes within the lattice
- A SOM does not need a target output to be specified; instead, where the node weights match the input vector, that area of the lattice is selectively optimized to more closely resemble the data vector

Self-Organizing Feature Maps: The Algorithm

Training occurs in several steps over many iterations:

1. Initialize each node's weights
2. Present a random vector from the training set to the lattice
3. Examine every node to determine which one's weights are most like the input vector (the winning node is commonly known as the Best Matching Unit, or BMU)
4. Calculate the radius of the neighborhood of the BMU (this value starts large, typically set to the radius of the lattice, and diminishes at each time-step); any nodes found within this radius are deemed to be inside the BMU's neighborhood
5. Adjust each neighboring node's weights to make them more like the input vector; the closer a node is to the BMU, the more its weights get altered
6. Repeat from step 2 for N iterations

Practical Learning of Self-Organizing Feature Maps

A few things have to be specified in the previous algorithm:

- Choosing the weight initialization
- Selecting the Best Matching Unit according to the distance between its weights and the input vector:

$\lVert x - w_i \rVert = \sqrt{\sum_{k=1}^{p} (x[k] - w_i[k])^2}$

- Selecting the neighborhood according to some decreasing function:

$h_{ij} = e^{-\frac{(i - j)^2}{2\sigma^2}}$

- Defining the updating rule:

$w_i(t+1) = \begin{cases} w_i(t) + \alpha(t)\,[x(t) - w_i(t)], & i \in N_i(t) \\ w_i(t), & i \notin N_i(t) \end{cases}$
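The Python sketch below ties these pieces together for a 2-D lattice: it finds the BMU, shrinks the neighborhood radius and the learning rate over time, and pulls the weights of the nodes inside the radius towards the input vector. The exponential decay schedules and the constants are assumptions of this sketch, not prescribed by the notes.

```python
import numpy as np

def train_som(X, rows, cols, n_iter=1000, alpha0=0.5, seed=0):
    """SOM training sketch: pick a random sample, find the Best Matching Unit,
    then update the weights of all nodes inside a shrinking neighborhood."""
    rng = np.random.default_rng(seed)
    W = rng.random((rows, cols, X.shape[1]))                 # random weight initialization
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    radius0 = max(rows, cols) / 2.0                          # initial radius: half the lattice
    tau = n_iter / np.log(radius0)                           # assumed decay constant (lattice > 2x2)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                          # random training vector
        dist = np.linalg.norm(W - x, axis=2)                 # weight distance of every node to x
        bmu = np.unravel_index(np.argmin(dist), dist.shape)  # Best Matching Unit coordinates
        radius = radius0 * np.exp(-t / tau)
        alpha = alpha0 * np.exp(-t / n_iter)                 # decreasing learning rate
        lat_sq = ((grid - np.array(bmu)) ** 2).sum(axis=-1)  # squared lattice distance to the BMU
        h = np.exp(-lat_sq / (2.0 * radius ** 2))            # neighborhood function h_ij
        h = np.where(lat_sq <= radius ** 2, h, 0.0)          # nodes outside the radius: no update
        W += alpha * h[:, :, None] * (x - W)                 # pull weights towards x
    return W
```

For instance, calling `train_som(colors, 20, 20)` on an array of RGB triples should roughly reproduce the color-mapping example mentioned earlier, with similar colors ending up in nearby lattice nodes.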

Bibliography

- "A Tutorial on Clustering Algorithms", online tutorial by M. Matteucci
- As usual, more info on del.icio.us