Information-Theoretic Co-clustering


Information-Theoretic Co-clustering. Authors: I. S. Dhillon, S. Mallela, and D. S. Modha. MALNIS presentation by Qiufen Qi and Zheyuan Yu, 20 May 2004.

Outline
1. Introduction
2. Information Theory Concepts
3. Co-Clustering Algorithm
4. Experimental Results
5. Conclusions and Future Work

1. Introduction
Document clustering: the grouping together of "similar" documents.
- Hard clustering: each document belongs to a single cluster.
- Soft clustering: each document is probabilistically assigned to clusters.

One-way clustering vs. co-clustering
One-way clustering:
- clustering documents based on their word distributions;
- clustering words by their co-occurrence in documents.
Co-clustering:
- word clustering induces document clustering while document clustering induces word clustering;
- implicit dimensionality reduction at each step;
- computationally economical.

Co-clustering Methods
Information-Theoretic Co-clustering: co-clustering by finding a pair of maps from rows to row-clusters and from columns to column-clusters, with minimum loss in mutual information. Dhillon et al. (2003).
Bipartite Spectral Graph Partitioning: co-clustering by finding minimum-cut vertex partitions in a bipartite graph between documents and words. Dhillon et al. (2001). Drawback: each word cluster needs to be associated with a document cluster.

2. Information Theory Concepts
Entropy of a random variable X with probability distribution p(x):
    H(p) = -Σ_x p(x) log p(x)
A measure of the average uncertainty in X.
The Kullback-Leibler (KL) divergence:
    D(p ‖ q) = Σ_x p(x) log ( p(x) / q(x) )
A measure of how different two probability distributions are. D(p ‖ q) ≥ 0, and D(p ‖ q) = 0 iff p = q. It is not strictly a distance.

Mutual information between random variables X and Y:
    I(X; Y) = H(X) - H(X|Y) = Σ_{x,y} p(x, y) log ( p(x, y) / (p(x) p(y)) )
The amount of information X contains about Y, and vice versa. I(X; Y) = I(Y; X) and I(X; Y) ≥ 0.
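For concreteness, the quantities above can be computed in a few lines of NumPy. This is an illustrative sketch, not code from the paper; base-2 logarithms are assumed, so all values are in bits.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

def mutual_information(pxy):
    """I(X;Y) = D(p(x,y) || p(x)p(y)) for a joint distribution given as a 2-D array."""
    pxy = np.asarray(pxy, dtype=float)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return kl_divergence(pxy.ravel(), np.outer(px, py).ravel())
```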

"Optimal" Co-Clustering
Find maps C_X and C_Y,
    C_X : {x_1, x_2, ..., x_m} → {x̂_1, x̂_2, ..., x̂_k}
    C_Y : {y_1, y_2, ..., y_n} → {ŷ_1, ŷ_2, ..., ŷ_l}
i.e. a hard clustering of the rows and columns such that the loss in mutual information, I(X; Y) - I(X̂; Ŷ), is minimized.
Example:
    p(X, Y) =
    [ .05  .05  .05   0    0    0
      .05  .05  .05   0    0    0
       0    0    0   .05  .05  .05
       0    0    0   .05  .05  .05
      .04  .04   0   .04  .04  .04
      .04  .04  .04   0   .04  .04 ]
is compressed to
    p(X̂, Ŷ) =
    [ .3   0
       0  .3
      .2  .2 ]
Loss: 0.0957

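The example can be checked numerically. A small sketch, assuming the row clustering {x1, x2}, {x3, x4}, {x5, x6} and the column clustering {y1, y2, y3}, {y4, y5, y6} that produce the compressed table above (the cluster assignments are inferred from the slide, not stated on it):

```python
import numpy as np

def mi_bits(pxy):
    """I(X;Y) in bits for a joint distribution given as a 2-D array."""
    px, py = pxy.sum(1), pxy.sum(0)
    m = pxy > 0
    return float(np.sum(pxy[m] * np.log2(pxy[m] / np.outer(px, py)[m])))

# p(X, Y) from the example above (rows x1..x6, columns y1..y6)
p = np.array([[.05, .05, .05, 0,   0,   0  ],
              [.05, .05, .05, 0,   0,   0  ],
              [0,   0,   0,   .05, .05, .05],
              [0,   0,   0,   .05, .05, .05],
              [.04, .04, 0,   .04, .04, .04],
              [.04, .04, .04, 0,   .04, .04]])

cx = np.array([0, 0, 1, 1, 2, 2])   # assumed row clustering {x1,x2}, {x3,x4}, {x5,x6}
cy = np.array([0, 0, 0, 1, 1, 1])   # assumed column clustering {y1,y2,y3}, {y4,y5,y6}

# p(xhat, yhat): sum p over each co-cluster block -> [[.3, 0], [0, .3], [.2, .2]]
p_hat = np.eye(3)[cx].T @ p @ np.eye(2)[cy]

print(mi_bits(p) - mi_bits(p_hat))  # ~0.0957, the loss reported on the slide
```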
Finding the optimal co-clustering
Lemma: the loss in mutual information can be expressed as the KL distance of p(X, Y) to an approximation q(X, Y):
    I(X; Y) - I(X̂; Ŷ) = D(p(X, Y) ‖ q(X, Y))
where q is the approximation of p given by
    q(x, y) = p(x̂, ŷ) p(x|x̂) p(y|ŷ),  for x ∈ x̂, y ∈ ŷ.
This is related to a data compression problem: transmit the cluster identities X̂ and Ŷ; transmit X given X̂; transmit Y given Ŷ.

Example: calculating q(x, y), the approximation of p(x, y).
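Since the slide's worked figure does not survive in the transcript, here is a sketch of the same calculation. It continues the previous snippet and reuses the p, cx, cy, and p_hat defined there:

```python
# q(x, y) = p(xhat, yhat) p(x|xhat) p(y|yhat) for the example's clustering
px, py = p.sum(axis=1), p.sum(axis=0)                    # marginals p(x), p(y)
p_xhat, p_yhat = p_hat.sum(axis=1), p_hat.sum(axis=0)    # marginals p(xhat), p(yhat)

q = p_hat[np.ix_(cx, cy)] * np.outer(px / p_xhat[cx], py / p_yhat[cy])

# By the lemma, D(p || q) equals the loss in mutual information computed before
m = p > 0
print(np.sum(p[m] * np.log2(p[m] / q[m])))               # ~0.0957 bits again
```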

Objective function for the loss in mutual information
The loss in mutual information can be expressed as a weighted sum of relative entropies between row distributions and row-cluster distributions:
    D(p(X, Y) ‖ q(X, Y)) = Σ_x̂ Σ_{x ∈ x̂} p(x) D(p(Y|x) ‖ q(Y|x̂))
and, equally, as a weighted sum of relative entropies between column distributions and column-cluster distributions:
    D(p(X, Y) ‖ q(X, Y)) = Σ_ŷ Σ_{y ∈ ŷ} p(y) D(p(X|y) ‖ q(X|ŷ))
This allows us to define a row-cluster prototype q(Y|x̂) and a column-cluster prototype q(X|ŷ), and leads to a "natural" algorithm.
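Both decompositions can be verified numerically for any joint distribution and any fixed co-clustering. A self-contained sketch; the random 6x8 distribution and the particular cluster assignments are arbitrary choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((6, 8)); p /= p.sum()            # an arbitrary joint distribution p(X, Y)
cx = np.array([0, 0, 1, 1, 2, 2])               # some row clustering C_X (k = 3)
cy = np.array([0, 0, 0, 0, 1, 1, 1, 1])         # some column clustering C_Y (l = 2)
k, l = 3, 2

p_hat = np.eye(k)[cx].T @ p @ np.eye(l)[cy]     # p(xhat, yhat)
px, py = p.sum(1), p.sum(0)
p_xhat, p_yhat = p_hat.sum(1), p_hat.sum(0)

def kl(a, b):
    m = a > 0
    return float(np.sum(a[m] * np.log2(a[m] / b[m])))

# Left-hand side: D(p || q) with q(x,y) = p(xhat,yhat) p(x|xhat) p(y|yhat)
q = p_hat[np.ix_(cx, cy)] * np.outer(px / p_xhat[cx], py / p_yhat[cy])
lhs = kl(p, q)

# Row form: sum over rows of p(x) D(p(Y|x) || q(Y|xhat)), with q(y|xhat) = p(yhat|xhat) p(y|yhat)
proto_r = (p_hat / p_xhat[:, None])[:, cy] * (py / p_yhat[cy])
rows = p / px[:, None]
row_form = sum(px[i] * kl(rows[i], proto_r[cx[i]]) for i in range(p.shape[0]))

# Column form: sum over columns of p(y) D(p(X|y) || q(X|yhat)), with q(x|yhat) = p(xhat|yhat) p(x|xhat)
proto_c = (p_hat.T / p_yhat[:, None])[:, cx] * (px / p_xhat[cx])
cols = p.T / py[:, None]
col_form = sum(py[j] * kl(cols[j], proto_c[cy[j]]) for j in range(p.shape[1]))

print(lhs, row_form, col_form)                  # the three numbers agree
```

The three printed values coincide, which is exactly the lemma restated row-wise and column-wise.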

3. Co-Clustering Algorithm
Input: p(X, Y), the joint probability distribution; k, the desired number of row clusters; l, the desired number of column clusters.
Output: the partition functions C_X and C_Y.
1. Initialization: set t = 0. Start with some initial partition functions C_X^(0) and C_Y^(0). Compute q^(0)(X̂, Ŷ) and the distribution of each row-cluster prototype q^(0)(Y|x̂), 1 ≤ x̂ ≤ k.
2. Row re-clustering: for each row x, assign it to the "closest" row-cluster prototype. Update C_X^(t+1); set C_Y^(t+1) = C_Y^(t).
3. Compute q^(t+1)(X̂, Ŷ) and the distribution of each column-cluster prototype q^(t+1)(X|ŷ), 1 ≤ ŷ ≤ l.

Co-Clustering Algorithm (cont'd)
4. Column re-clustering: for each column y, assign it to the "closest" column-cluster prototype. Update C_Y^(t+2); set C_X^(t+2) = C_X^(t+1).
5. Compute q^(t+2)(X̂, Ŷ) and the distribution of each row-cluster prototype q^(t+2)(Y|x̂).
6. If the change in the loss in mutual information, D(p(X, Y) ‖ q^(t)(X, Y)) - D(p(X, Y) ‖ q^(t+2)(X, Y)), is small (say 10^-3), return C_X^(t+2) and C_Y^(t+2); else set t = t + 2 and go to step 2. (A code sketch of steps 1-6 follows below.)
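A minimal NumPy sketch of steps 1-6, not the authors' implementation; the function names, the random initialization, base-2 logarithms, and the epsilon guards against empty clusters are assumptions of this sketch.

```python
import numpy as np

def cluster_joint(p, cx, cy, k, l):
    """p(xhat, yhat): total probability mass of each co-cluster block."""
    return np.eye(k)[cx].T @ p @ np.eye(l)[cy]

def kl_bits(p, q):
    """D(p || q) in bits, with the 0 log 0 = 0 convention."""
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

def coclustering_loss(p, cx, cy, k, l):
    """D(p(X,Y) || q(X,Y)) for q(x,y) = p(xhat,yhat) p(x|xhat) p(y|yhat)."""
    p_hat = cluster_joint(p, cx, cy, k, l)
    px, py = p.sum(1), p.sum(0)
    q = p_hat[np.ix_(cx, cy)] * np.outer(px / np.maximum(p_hat.sum(1)[cx], 1e-300),
                                         py / np.maximum(p_hat.sum(0)[cy], 1e-300))
    return kl_bits(p, q)

def reassign_rows(p, cx, cy, k, l):
    """Assign each row x to the row cluster whose prototype q(Y|xhat) is closest in KL distance."""
    p_hat = cluster_joint(p, cx, cy, k, l)
    py = p.sum(0)
    # prototype q(y|xhat) = p(yhat|xhat) p(y|yhat), one row per row cluster
    proto = (p_hat / np.maximum(p_hat.sum(1, keepdims=True), 1e-300))[:, cy] \
            * (py / np.maximum(p_hat.sum(0)[cy], 1e-300))
    rows = p / np.maximum(p.sum(1, keepdims=True), 1e-300)          # p(Y|x)
    d = np.where(rows[:, None, :] > 0,
                 rows[:, None, :] * (np.log2(np.maximum(rows[:, None, :], 1e-300))
                                     - np.log2(np.maximum(proto[None, :, :], 1e-300))),
                 0.0).sum(axis=2)                                    # D(p(Y|x) || q(Y|xhat))
    return d.argmin(axis=1)

def coclustering(p, k, l, tol=1e-3, max_iter=200, seed=0):
    """Alternate row and column re-clustering until the loss changes by less than tol."""
    rng = np.random.default_rng(seed)
    cx = rng.integers(0, k, p.shape[0])          # step 1: initial C_X
    cy = rng.integers(0, l, p.shape[1])          # step 1: initial C_Y
    prev = coclustering_loss(p, cx, cy, k, l)
    for _ in range(max_iter):
        cx = reassign_rows(p, cx, cy, k, l)      # step 2: row re-clustering
        cy = reassign_rows(p.T, cy, cx, l, k)    # step 4: column re-clustering (same code, transposed)
        cur = coclustering_loss(p, cx, cy, k, l)
        if prev - cur < tol:                     # step 6: stop when the improvement is small
            break
        prev = cur
    return cx, cy, cur
```

Note that the column step simply reuses the row routine on the transposed matrix, mirroring the symmetry between steps 2-3 and 4-5.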

Properties of the Co-Clustering Algorithm
- Co-clustering monotonically decreases the loss in mutual information.
- Co-clustering converges to a local minimum.
- It can be generalized to multi-dimensional contingency tables.
- Implicit dimensionality reduction at each step helps overcome sparsity and high dimensionality.
- Computationally efficient: O(n t (k + l)), where n is the number of non-zeros in the input joint distribution, t is the number of iterations, k is the desired number of row clusters, and l is the desired number of column clusters.

4. Experimental Results
- Algorithms to be compared
- Data sets
- Evaluation measures
- Results and discussion

Algorithms to be compared:
- IB-double: Information Bottleneck double clustering.
- IDC: Iterative Double Clustering.
- 1D-clustering: without any word clustering (information-theoretic method?).
Data sets:
- NG20: Binary, Multi5 and Multi10 (with and without subjects, 500 documents each).
- SMART: CLASSIC3 - MEDLINE (1033), CISI (1460) and CRANFIELD (1400).
The top 2000 words were selected by mutual information (frequency?) after the stop words were removed.

Evaluation Measures
Confusion matrix: each entry (i, j) gives the number of documents in cluster i that belong to true class j.
Micro-averaged precision:
    p = ( Σ_{i=1}^{l} a_i ) / N
where a_i is the number of correctly assigned documents in cluster i, l is the number of document clusters (classes), and N is the total number of documents in the data set.
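As a sketch of the measure just defined, the helper below computes micro-averaged precision from a confusion matrix, taking a_i to be the count of the dominant true class in cluster i (a common convention; the example matrix is made up for illustration and is not a result from the paper).

```python
import numpy as np

def micro_averaged_precision(confusion):
    """confusion[i, j] = number of documents in cluster i whose true class is j.
    Here a_i is taken as the count of the dominant class in cluster i."""
    confusion = np.asarray(confusion)
    a = confusion.max(axis=1)                   # a_i per cluster
    return float(a.sum() / confusion.sum())     # (sum of a_i) / N

# Hypothetical 3-cluster / 3-class confusion matrix, for illustration only
print(micro_averaged_precision([[480,  10,   5],
                                [ 12, 470,  20],
                                [  8,  20, 475]]))   # 0.95
```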

Results
Table 3: Co-clustering (0.9835) vs. 1D-clustering (0.9432).
Table 4: Co-clustering (0.98, 0.96) vs. 1D-clustering (0.67, 0.648).

Results (cont'd)
- Co-clustering performs much better than IB-double and 1D-clustering and is comparable with IDC.
- Word clustering can alleviate the problem of clustering in high dimensions. (Note: the peak values were selected.)

Results (cont'd)
- Different data sets achieve their maximum at different numbers of word clusters.

Results (cont'd)
- Correlation between Figures 2 and 3: the lower the loss in mutual information, the better the clustering.

Results (cont'd)
- Co-clustering converges quickly, in about 20 iterations, on all the tested data sets.


5. Conclusions and Future Work
- An information-theoretic approach to co-clustering.
- Implicit dimensionality reduction at each step to overcome sparsity and high dimensionality.
- The theoretical approach has the potential to extend to other problems:
  - multi-dimensional co-clustering;
  - MDL (Minimum Description Length) to choose the number of co-clusters;
  - generalizing co-clustering to an abstract multivariate clustering setting.

References
1. I. S. Dhillon, S. Mallela, and D. S. Modha. Information-Theoretic Co-clustering. In Proceedings of the Ninth ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 89-98, August 2003.