A Comparison of Document Clustering Techniques

A Comparison of Document Clustering Techniques. M. Steinbach, G. Karypis, V. Kumar. Presented by Leo Chen, Feb-01.

Road Map
- Background & Motivation (2)
- Basics (6): Vector Space Model, Cluster Quality Evaluation
- Agglomerative Clustering Algorithms (4)
- K-means & Bisecting K-means (6)
- Comparisons & Explanations (5)
- Conclusions (3)

Background & Motivation
Document clustering is widely used for:
- Improving precision in information retrieval systems [Rij79, Kow97] (how?)
- Finding the nearest neighbors of a document [BL85]
- Browsing a collection of documents [CKPT92]
- Organizing the results returned by a search engine [ZEMK97]
- Generating hierarchical clusters of documents [KS97]
- Producing an effective document classifier [AGY99]
Two families are compared: agglomerative hierarchical clustering and K-means. Is agglomerative hierarchical clustering better [DJ88, CKPT92, LA99], or is a simple and efficient variant of K-means better? Follow me!

Vector Space Model
- Document vector: D_tf = (tf_1, tf_2, ..., tf_n), where tf_i is the frequency of the i-th term in the document.
- Weight each term by its inverse document frequency (how? why?).
- Normalize each vector to unit length: ||d|| = 1.
- cosine(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) = d1 . d2 for unit vectors.
- Centroid of a cluster S: c = (1/|S|) * Σ_{d in S} d.
- Average pairwise similarity: (1/|S|^2) * Σ cosine(d1, d2) = (Σ d / |S|) . (Σ d / |S|) = c . c = ||c||^2.
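
The model above can be sketched concretely. This is a minimal illustration on a toy corpus (the function names and corpus are my own, not the paper's), assuming every document contains at least one term that does not occur in all documents, so no vector normalizes to zero. With unit-length vectors, the average of all pairwise cosines equals the squared length of the centroid, as on the slide.

```python
import math

def tfidf_vectors(docs):
    """Unit-length tf-idf vectors (term -> weight dicts) for a list of
    tokenized documents. A minimal sketch of the slide's model."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vecs = []
    for doc in docs:
        v = {}
        for term in doc:
            v[term] = v.get(term, 0) + 1                         # tf_i
        v = {t: tf * math.log(n / df[t]) for t, tf in v.items()}  # idf weight
        norm = math.sqrt(sum(w * w for w in v.values()))
        vecs.append({t: w / norm for t, w in v.items()})          # ||d|| = 1
    return vecs

def cosine(d1, d2):
    # For unit-length vectors, the cosine is just the dot product d1 . d2.
    return sum(w * d2.get(t, 0.0) for t, w in d1.items())

def centroid(vecs):
    # c = (1/|S|) * sum of all vectors in the cluster
    c = {}
    for v in vecs:
        for t, w in v.items():
            c[t] = c.get(t, 0.0) + w / len(vecs)
    return c
```

On any such corpus, summing all |S|^2 pairwise cosines and dividing by |S|^2 reproduces c . c = ||c||^2 up to floating-point error.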

Cluster Quality Evaluation (1)
Entropy [Sha48] (the lower, the better):
- Class distribution: p_ij is the probability that a member of cluster j belongs to class i.
- Entropy of cluster j: E_j = -Σ_i p_ij * log(p_ij)
- Total entropy: E_cs = Σ_j (n_j / n) * E_j
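
A small Python sketch of these two formulas (the slide leaves the log base unspecified; natural log is assumed here):

```python
import math

def cluster_entropy(labels):
    """E_j = -sum_i p_ij * log(p_ij), given the class labels of one
    cluster's members. Zero means the cluster is pure."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def total_entropy(clusters):
    """E_cs = sum_j (n_j / n) * E_j, the size-weighted average over all
    clusters; `clusters` is a list of label lists."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)
```

For example, a perfectly pure clustering has total entropy 0, while a cluster split evenly between two classes contributes log(2) per member.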

Cluster Quality Evaluation (2)
F-measure [LA99] (the higher, the better):
- For cluster j and class i: Recall(i, j) = n_ij / n_i, Precision(i, j) = n_ij / n_j
- F(i, j) = 2 * Recall(i, j) * Precision(i, j) / (Precision(i, j) + Recall(i, j))
- Overall F-measure: F = Σ_i (n_i / n) * max_j F(i, j)
Overall similarity: ||c||^2 (the higher, the better)
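
The overall F-measure can be sketched as follows (a minimal illustration; the representation of clusters and classes as sets of document ids is my own choice):

```python
def f_measure(clusters, classes):
    """Overall F = sum_i (n_i / n) * max_j F(i, j).
    clusters: list of sets of doc ids; classes: dict class -> set of doc ids."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for members in classes.values():
        best = 0.0
        for cluster in clusters:
            n_ij = len(cluster & members)       # docs of this class in cluster j
            if n_ij == 0:
                continue
            recall = n_ij / len(members)        # n_ij / n_i
            precision = n_ij / len(cluster)     # n_ij / n_j
            best = max(best, 2 * recall * precision / (recall + precision))
        total += len(members) / n * best
    return total
```

A clustering that exactly reproduces the classes scores 1.0; any mixing of classes across clusters lowers the score.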

Agglomerative Clustering Algorithm (1)
Simple agglomerative clustering algorithm:
1. Compute the similarity between all pairs of clusters
2. Merge the two most similar (closest) clusters
3. Update the similarity matrix
4. Repeat steps 2 & 3 until only a single cluster remains
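
The four steps above can be sketched generically; `similarity` stands for any cluster-similarity function (such as the three variants compared on the next slide). This is a naive sketch that recomputes pairwise similarities each round, not the paper's implementation:

```python
def agglomerate(items, similarity, k=1):
    """Repeatedly merge the most similar pair of clusters until k remain.
    `similarity` takes two clusters (lists of items); k=1 runs to a single
    cluster as on the slide."""
    clusters = [[x] for x in items]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):          # 1. all pairwise similarities
            for j in range(i + 1, len(clusters)):
                s = similarity(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # 2. merge the closest pair
        del clusters[j]                          # 3. (implicitly updates sims)
    return clusters
```

For instance, with one-dimensional points and similarity defined as the negated distance between cluster means, nearby points merge first.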

Agglomerative Clustering Algorithm (2)
Comparison of agglomerative clustering techniques:
- IST (Intra-Cluster Similarity): look at the similarity of all the documents in a cluster to their cluster centroid
- CST (Centroid Similarity Technique): look at the cosine similarity between the centroids of the two clusters
- UPGMA [DJ88, KR90]: define cluster similarity as the average pairwise similarity: similarity(cluster1, cluster2) = Σ cosine(d1, d2) / (size(cluster1) * size(cluster2))
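
The UPGMA formula is short enough to write out directly. A minimal sketch, assuming clusters are lists of unit-length vectors stored as tuples (so cosine reduces to a dot product):

```python
def upgma_similarity(c1, c2):
    """Average pairwise cosine between two clusters of unit vectors:
    sum of cosine(d1, d2) over all cross-cluster pairs, divided by
    size(cluster1) * size(cluster2)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sum(dot(d1, d2) for d1 in c1 for d2 in c2) / (len(c1) * len(c2))
```

For example, a singleton cluster {(1, 0)} compared against {(0, 1), (1, 0)} averages cosines 0 and 1, giving 0.5.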

Agglomerative Clustering Algorithm (3)
The experimental results show that the winner among the three agglomerative hierarchical techniques is UPGMA.

K-means & Bisecting K-means (1)
Basic K-means algorithm:
1. Select K points as the initial centroids
2. Assign all points to the closest centroid
3. Recompute the centroid of each cluster
4. Repeat steps 2 & 3 until the centroids don't change
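
A minimal Python sketch of the four steps, using Euclidean distance on plain vectors for simplicity (the paper works with cosine similarity on unit-length document vectors); the names and toy setup are illustrative:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic K-means on vectors stored as tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # 1. K initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # 2. assign to closest centroid
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        new = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]  # 3. recompute centroids
        if new == centroids:                      # 4. stop when they settle
            break
        centroids = new
    return centroids, clusters
```

On two well-separated pairs of points, any choice of two distinct initial centroids converges to the two pair midpoints.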

K-means & Bisecting K-means (2)
Basic bisecting K-means algorithm:
1. Pick a cluster to split (e.g., split the largest)
2. Find 2 sub-clusters using the basic K-means algorithm
3. Repeat step 2, the bisecting step, ITER times and take the split that produces the clustering with the highest overall similarity
4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached
This is a divisive hierarchical clustering algorithm. Complexity?
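
A self-contained sketch of the loop above, again on plain Euclidean vectors. Two assumptions to note: the cluster to split is always the largest, and the best of ITER trial bisections is chosen by lowest total squared error rather than the paper's overall-similarity criterion (for unit vectors the two orderings agree):

```python
import random

def bisecting_kmeans(points, target_k, trials=5, seed=0):
    """Split the largest cluster with 2-means until target_k clusters exist,
    keeping the best of `trials` trial splits each time."""
    rng = random.Random(seed)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def two_means(pts):
        # Tiny 2-means; returns None if a trial collapses to an empty half.
        cents = rng.sample(pts, 2)
        for _ in range(50):
            halves = ([], [])
            for p in pts:
                halves[dist2(p, cents[0]) > dist2(p, cents[1])].append(p)
            if not halves[0] or not halves[1]:
                return None
            new = [tuple(sum(x) / len(h) for x in zip(*h)) for h in halves]
            if new == cents:
                break
            cents = new
        return halves, cents

    def err(halves, cents):
        # Total within-cluster squared error of a candidate split.
        return sum(dist2(p, c) for h, c in zip(halves, cents) for p in h)

    clusters = [list(points)]
    while len(clusters) < target_k:
        big = max(clusters, key=len)             # 1. pick the largest cluster
        clusters.remove(big)
        best = min((r for r in (two_means(big) for _ in range(trials)) if r),
                   key=lambda r: err(*r))        # 3. keep the best bisection
        clusters.extend(best[0])                 # 2./4. add the 2 sub-clusters
    return clusters
```

Each pass removes one cluster and adds two, so the result is always a partition of the input into exactly target_k clusters.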

Comparison & Explanations (1)
- Bisecting K-means, with or without refinement, is better than regular K-means and UPGMA, with or without refinement, in most cases.
- Refinement significantly improves the performance of UPGMA for both the overall similarity and the entropy measures.
- Regular K-means is generally better than UPGMA.

Comparison & Explanations (2)
Why does agglomerative hierarchical clustering perform poorly?
- Documents share core vocabularies, so two documents can often be nearest neighbors without belonging to the same class; agglomerative algorithms, which rely on such local decisions, therefore make mistakes.
- K-means relies on global properties (each document's similarity to the cluster centroids), which helps it overcome these local minima and makes it better suited to document clustering.

Comparison & Explanations (3)
Why does bisecting K-means work better than regular K-means?
- Bisecting K-means tends to produce clusters of relatively uniform size, while regular K-means tends to produce clusters of widely different sizes.
- Bisecting K-means beats regular K-means on the entropy measure.

Conclusions
Possible improvements to K-means:
- Many runs
- Incremental updating
- A hybrid approach
My doubts: Measurement? Complexity? Is it bona fide K-means? Applicable scope?