Clustering (COSC 488) Nazli Goharian. Document Clustering.


Clustering (COSC 488) Nazli Goharian (nazli@ir.cs.georgetown.edu)

Document Clustering

Cluster Hypothesis: by clustering, documents relevant to the same topics tend to be grouped together. (C. J. van Rijsbergen, Information Retrieval, 2nd ed. London: Butterworths, 1979.)

What can be Clustered?

Collection (pre-retrieval):
- Reducing the search space to a smaller subset -- not generally used, due to the expense of generating the clusters
- Improving the UI by displaying groups of topics -- the clusters have to be labeled
- Scatter/Gather: the user-selected clusters are merged and re-clustered

Result set (post-retrieval):
- Improving the ranking (re-ranking)
- Utilizing it in query refinement -- relevance feedback
- Improving the UI by displaying clustered search results

Query:
- Understanding the intent of a user query
- Suggesting queries to users (query suggestion / recommendation)

Document/Web Clustering

Input: a set of documents, [k clusters]
Output: document assignments to clusters

Features:
- Text from the document/snippet (words: single terms; phrases)
- Link and anchor text
- URL
- Tags (social bookmarking websites allow users to tag documents)
- ...

Term weight: tf, tf-idf, ...
Distance measure: Euclidean, Cosine, ...

Evaluation:
- Manual -- difficult
- Web directories
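
The feature and weighting choices above map directly onto a standard vector-space pipeline. As a rough illustration (not part of the original slides), the sketch below builds tf-idf vectors for a few made-up snippets and partitions them into k clusters; the snippet texts and the value of k are hypothetical, and scikit-learn is assumed to be available.

```python
# Sketch: tf-idf document vectors + k-means assignment (hypothetical data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

snippets = [                      # hypothetical result-set snippets
    "apple releases new phone",
    "banana bread recipe with walnuts",
    "phone battery life comparison",
    "easy recipe for fruit salad",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(snippets)      # term-weighted feature vectors

k = 2                                             # number of clusters is an input
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)

for doc, cluster in zip(snippets, kmeans.labels_):
    print(cluster, doc)                           # document -> cluster assignment
```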

Result Set Clustering

Clusters are generated online (during query processing) from the retrieved results: URL, title, snippets, tags.

- To improve efficiency, clusters may be generated from document snippets.
- Clusters for popular queries may be cached.
- Clusters may be labeled into categories, providing the advantage of both query and category information for the search.
- The result set may be clustered as a whole or per site.
- Stemming can help, due to the limited size of the result set.

Cluster Labeling

The goal is to create meaningful labels. Approaches:
- Manually (not a good idea)
- Using already tagged documents (not always available)
- Using external knowledge such as Wikipedia, etc.
- Using each cluster's own data to determine the label:
  - the cluster centroid's terms/phrases -- frequency & importance
  - the title of the centroid document, or of the document closest to the centroid
- Also using other clusters' data to determine the label:
  - the cluster's hierarchical information (sibling/parent) for terms/phrases

Result Clustering Systems

- Northern Light (end of the 90s) -- used pre-defined categories
- Grouper (STC)
- Carrot
- CREDO
- WhatsOnWeb
- Vivisimo's Clusty (acquired by Yippy): generated clusters and labels dynamically
- ...etc.
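
One of the options above, labeling a cluster from its centroid's highest-weighted terms, is easy to sketch. The snippet below is an illustration, not a procedure from the slides; it assumes a fitted tf-idf vectorizer and k-means model such as the ones in the earlier sketch, and the names vectorizer, kmeans, and the choice of three label terms are assumptions.

```python
import numpy as np

def centroid_labels(vectorizer, kmeans, n_terms=3):
    """Label each cluster with the top-weighted terms of its centroid."""
    terms = np.array(vectorizer.get_feature_names_out())
    labels = {}
    for c, centroid in enumerate(kmeans.cluster_centers_):
        top = np.argsort(centroid)[::-1][:n_terms]   # highest tf-idf weights
        labels[c] = ", ".join(terms[top])
    return labels
```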

Query Clustering Approach to Query Suggestion

- Exploit information from past users' queries.
- Propose to a user a list of queries related to the one submitted (or the ones, considering past queries in the same session/log).
- Various approaches consider both query terms and documents.

(Tutorial by Salvatore Orlando, University of Venice, Italy & Fabrizio Silvestri, ISTI - CNR, Pisa, Italy, 2009.)

Baeza-Yates et al. use a clustering approach, a two-tier approach:
- An offline component clusters past queries using the query text along with the text of the clicked URLs.
- An online component recommends queries based on an incoming query, using the clusters generated in the offline mode.

(R. Baeza-Yates, C. Hurtado, and M. Mendoza, "Query Recommendation Using Query Logs in Search Engines," LNCS, Springer, 2004.)

Query Clustering Approach to Query Suggestion (cont'd)

Offline component:
- The clustering algorithm operates over queries enriched with a selection of terms extracted from the documents pointed to by the user-clicked URLs.
- Clusters are computed using an implementation of k-means, with different values of k (the SSE becomes smaller as k increases).
- Similarity between queries is computed according to a vector-space approach: vectors q of n dimensions, one per term.

Online component:
(I) Given an input query, the most representative (i.e., most similar) cluster is found; each cluster has a natural representative, its centroid.
(II) The queries of that cluster are then ranked according to:
- attractiveness of the query answer, i.e., the fraction of the documents returned by the query that captured the attention of users (clicked documents)
- similarity with respect to the input query (the same distance used for clustering)
- popularity of the query, i.e., the frequency of occurrence of the query

(R. Baeza-Yates, C. Hurtado, and M. Mendoza, "Query Recommendation Using Query Logs in Search Engines," LNCS, Springer, 2004.)
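
A rough sketch of the online component, under the assumptions that query and centroid vectors are plain dicts of term weights and that attractiveness and popularity scores were computed offline; combining the three signals by simple multiplication is an illustration, not the weighting used in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def suggest(query_vec, clusters, top_n=5):
    """clusters: list of dicts with a 'centroid' and member 'queries'
    (each query carries a vector, attractiveness, and popularity)."""
    # (I) pick the cluster whose centroid is most similar to the input query
    best = max(clusters, key=lambda c: cosine(query_vec, c["centroid"]))
    # (II) rank that cluster's queries by attractiveness, similarity, popularity
    ranked = sorted(
        best["queries"],
        key=lambda q: q["attractiveness"] * cosine(query_vec, q["vector"]) * q["popularity"],
        reverse=True,
    )
    return ranked[:top_n]
```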

Clustering

- Automatically group related data into clusters.
- An unsupervised approach -- no training data is needed.
- A data object may belong to only one cluster (hard clustering) or to overlapping clusters (soft clustering).
- The set of clusters may relate to each other (hierarchical clustering) or have no explicit structure between clusters (flat clustering).

Considerations

- Distance/similarity measures: various; example: Cosine
- Number of clusters: the cardinality of a clustering (# of clusters)
- Objective functions: evaluate the quality (structural properties) of the clusters; often defined using distance/similarity measures
- External quality measures, such as: an annotated document set; existing directories; manual evaluation of documents

Distance/Similarity Measures

Euclidean distance:
dist(d_i, d_j) = √( (d_i1 - d_j1)² + (d_i2 - d_j2)² + ... + (d_ip - d_jp)² )

Cosine similarity:
Sim(d_i, d_j) = ( Σ_{k=1..t} d_ik · d_jk ) / ( √(Σ_{k=1..t} d_ik²) · √(Σ_{k=1..t} d_jk²) )

Structural Properties of Clusters

Good clusters have:
- high intra-class similarity
- low inter-class similarity

Calculate the sum of squared error (commonly done in k-means). The goal is to minimize the SSE (intra-cluster variance):

SSE = Σ_{i=1..k} Σ_{p ∈ C_i} ||p - m_i||²

where C_i is the i-th cluster and m_i its centroid.
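
For concreteness, the two measures and the SSE objective written as small NumPy functions; this is a sketch assuming dense NumPy vectors (real document vectors would usually be sparse), with NumPy assumed to be available.

```python
import numpy as np

def euclidean(di, dj):
    """Euclidean distance between two term-weight vectors."""
    return np.sqrt(np.sum((di - dj) ** 2))

def cosine_sim(di, dj):
    """Cosine similarity between two term-weight vectors."""
    return (di @ dj) / (np.linalg.norm(di) * np.linalg.norm(dj))

def sse(points, assignments, centroids):
    """Sum of squared error: squared distance of each point to its cluster centroid."""
    return sum(np.sum((p - centroids[c]) ** 2) for p, c in zip(points, assignments))
```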

Quality Measures

- Macro-averaged precision -- measure the precision of each cluster (the ratio of members that belong to that cluster's class label) and average over all clusters.
- Micro-averaged precision -- precision computed over all elements in all clusters.
- Accuracy: (tp + tn) / (tp + tn + fp + fn)
- F1 measure

Clustering Algorithms

Hierarchical: a set of nested clusters is generated, represented as a dendrogram.
- Agglomerative (bottom-up) -- the more common approach
- Divisive (top-down)

Partitioning (flat clustering): no links (no overlap) among the generated clusters.
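
When labeled documents are available, the macro- and micro-averaged precision above can be computed directly from the cluster assignments. A small sketch under the assumption that each cluster is scored against its majority class label (one common convention; the slide does not pin down the exact definition):

```python
from collections import Counter

def cluster_precisions(clusters):
    """clusters: list of lists of true class labels, one inner list per cluster.
    A cluster's precision = fraction of members carrying its majority label."""
    per_cluster, correct, total = [], 0, 0
    for members in clusters:
        majority = Counter(members).most_common(1)[0][1]   # size of the majority class
        per_cluster.append(majority / len(members))
        correct += majority
        total += len(members)
    macro = sum(per_cluster) / len(per_cluster)   # average of per-cluster precisions
    micro = correct / total                       # over all elements in all clusters
    return macro, micro

# Example: two clusters of labeled documents
print(cluster_precisions([["sports", "sports", "politics"], ["politics", "politics"]]))
```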

The K-Means Clustering Method

- A flat clustering algorithm
- A hard clustering
- A partitioning (iterative) clustering

Start with k random cluster centroids and iteratively adjust (redistribute) until some termination condition is met. The number of clusters k is an input to the algorithm; the outcome is k clusters.

Algorithm:
1. Pick k documents as your initial k clusters.
2. Partition the documents among the k closest cluster centroids (centroid: the mean of the document vectors; consider only the most significant terms to reduce the distance computations).
3. Re-calculate the centroid of each cluster.
4. Re-distribute documents to clusters until a termination condition is met.

Relatively efficient: O(tkn), where n is the number of documents, k the number of clusters, and t the number of iterations. Normally, k, t << n.
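
A from-scratch sketch of the loop just described, using Euclidean distance over dense NumPy vectors; the data, the value of k, and the iteration cap are assumptions for illustration.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k documents as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Re-calculate each centroid as the mean of its members.
        new = np.array([points[assign == c].mean(axis=0) if np.any(assign == c)
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):   # termination: no change in the centroids
            break
        centroids = new
    return assign, centroids

pts = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.2, 7.9]])
print(kmeans(pts, k=2))
```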

Example: The K-Means Clustering Method

[Figure: k-means example on a 2-D point set (four panels, axes 0 to 10); source: Jiawei Han and Micheline Kamber]

Limiting Random Initialization in K-Means

Various methods, such as:
- Various values of k may be good candidates.
- Take a sample of the documents, perform hierarchical clustering on it, and use the resulting centroids as the initial centroids.
- Select more than k initial centroids (choose the ones that are farthest from each other).
- Perform clustering and merge the closer clusters.
- Try various starting seeds and pick the better choices.
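
One of the ideas listed above, selecting initial centroids that are far away from each other, can be sketched as a greedy farthest-first pass; the function below illustrates that heuristic (it is not the exact procedure on the slide) and assumes dense NumPy vectors.

```python
import numpy as np

def farthest_first_seeds(points, k, seed=0):
    """Greedy seeding: start from a random point, then repeatedly add the
    point farthest from all seeds chosen so far."""
    rng = np.random.default_rng(seed)
    seeds = [points[rng.integers(len(points))]]
    while len(seeds) < k:
        # Distance of every point to its nearest already-chosen seed.
        d = np.min([np.linalg.norm(points - s, axis=1) for s in seeds], axis=0)
        seeds.append(points[int(np.argmax(d))])
    return np.array(seeds)
```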

The K-Means Clustering Method

Re-calculating centroids:
- Update the centroids after each iteration (once all documents are assigned to clusters), or
- Update after each document is assigned -- more calculations, more order dependency.

Termination condition:
- A fixed number of iterations
- Reduction in re-distribution (no changes to the centroids)
- Reduction in SSE

Effect of Outliers

- Outliers are documents that are far from the other documents.
- Outlier documents create singletons (clusters with only one member).
- Outliers should be removed and should not be picked as initialization seeds (centroids).

Evaluating Quality in K-Means

Calculate the sum of squared error (commonly done in k-means). The goal is to minimize the SSE (intra-cluster variance):

SSE = Σ_{i=1..k} Σ_{p ∈ C_i} ||p - m_i||²
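
As a rough illustration of the point about seeds (an assumption about how one might operationalize it, not a procedure from the slides), documents that are unusually far from the rest can be screened out before picking initial centroids:

```python
import numpy as np

def non_outlier_indices(points, z=2.0):
    """Keep points whose mean distance to all others is within z standard
    deviations of the collection-wide mean distance."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    mean_dist = dists.sum(axis=1) / (len(points) - 1)   # exclude self (distance 0)
    cutoff = mean_dist.mean() + z * mean_dist.std()
    return np.where(mean_dist <= cutoff)[0]             # candidate seed pool
```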

Hierarchical Agglomerative Clustering (HAC)

- Treat each document as a singleton cluster, then merge pairs of clusters until reaching one big cluster of all documents.
- Any number k of clusters may be picked at any level of the tree (using thresholds, e.g. SSE).
- Each element belongs to one cluster or to its superset cluster, but never to more than one cluster at the same level.

Example

Singletons A, D, E, and B are clustered first:

[Dendrogram: leaves A, D, E, B, C; AD = {A, D}, BE = {B, E}, BCE = {B, C, E}, ABCDE = all documents]

Hierarchical Agglomerative Algorithm

1. Create an N x N doc-doc similarity matrix.
2. Each document starts as a cluster of size one.
3. Do until there is only one cluster:
   - Combine the best two clusters based on cluster similarity, using one of these criteria: single linkage, complete linkage, average linkage, centroid, Ward's method.
   - Update the doc-doc matrix.

Note: similarity is defined as vector-space similarity (e.g. Cosine) or Euclidean distance.

Merging Criteria

Different functions for computing the cluster similarity result in clusters with different characteristics. The goal is to minimize one of the following:
- Single link / MIN (minimum distance between documents of the two clusters)
- Complete linkage / MAX (maximum distance between documents of the two clusters)
- Average linkage (average of the pair-wise distances)
- Centroid (distance between centroids)
- Ward's method (intra-cluster variance)
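
The same linkage criteria are available off the shelf. As a sketch (assuming SciPy is installed, with made-up 2-D points standing in for document vectors), the following builds the merge history under each criterion and cuts it into k flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 5.1], [9.0, 1.0]])

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(points, method=method)                 # merge history (the dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut into k = 2 flat clusters
    print(f"{method:8s} -> {labels}")
```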

HAC's Cluster Similarities

[Figure: illustrations of the single link, average link, complete link, and centroid cluster-similarity criteria]

Example (Hierarchical Agglomerative)

Consider A, B, C, D, E as objects with the following similarities:

      A    B    C    D    E
A     -    2    7    9    4
B     2    -    9   11   14
C     7    9    -    4    8
D     9   11    4    -    2
E     4   14    8    2    -

The highest pair is E-B = 14.

Example (Cont'd)

So let's cluster E and B. We now have the structure:

[Tree so far: BE = {E, B}; A, C, D still singletons]

Now we update the matrix:

      A   BE    C    D
A     -    2    7    9
BE    2    -    8    2
C     7    8    -    4
D     9    2    4    -

Example (Cont'd)

The highest remaining pair is A-D = 9, so let's cluster A and D. We now have the structure:

[Tree so far: AD = {A, D}, BE = {E, B}; C still a singleton]

Now we update the matrix:

      AD   BE    C
AD     -    2    4
BE     2    -    8
C      4    8    -

Example (Cont'd)

The highest remaining pair is BE-C = 8, so let's cluster BE and C. At this point there are only two nodes that have not been clustered, AD and BCE. We now cluster them.

[Tree so far: BCE = {B, C, E}, AD = {A, D}]

Now we update the similarity matrix:

       AD   BEC
AD      -     2
BEC     2     -

At this point there is only one cluster.

Example (Cont'd)

Now we have clustered everything:

[Final dendrogram: ABCDE at the root; AD = {A, D} and BCE = {B, C, E}; BCE splits into BE = {B, E} and C]

How to do Query Processing

1. Calculate the centroid of each cluster.
2. Calculate the similarity coefficient (SC) between the query vector and each cluster centroid.
3. Pick the cluster with the higher SC.
4. Continue the process toward the leaves of the subtree of the cluster with the higher SC.
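
A small sketch of that top-down search, under the assumption that the cluster tree is stored as nested nodes, each carrying a centroid vector and either children or member documents (this node layout is an illustration, not a structure defined on the slides):

```python
import numpy as np

def cosine_sc(q, c):
    """Similarity coefficient between a query vector and a centroid."""
    return (q @ c) / (np.linalg.norm(q) * np.linalg.norm(c))

def route_query(query_vec, node):
    """Descend the cluster tree, always following the child whose centroid
    has the highest similarity coefficient with the query."""
    while node.get("children"):
        node = max(node["children"], key=lambda ch: cosine_sc(query_vec, ch["centroid"]))
    return node["documents"]      # leaf cluster: candidate documents to rank
```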

Analysis

Hierarchical clustering requires:
- O(n²) to compute the doc-doc similarity matrix.
- One node is added during each round of clustering, thus n steps.
- At each clustering step we must update the doc-doc matrix: finding the closest pair is O(n²), plus recomputing the similarities takes O(n) steps; thus O(n² + n) per step.

Thus we have: O(n²) + O(n)(n² + n) = O(n³).

(With an efficient implementation, finding the closest pair can in some cases be done in O(n log n) steps; then O(n²) + n(n log n + n) = O(n² log n).)

Thus, very expensive!

Buckshot Clustering

- A hybrid approach (HAC & k-means).
- To avoid building the full doc-doc matrix, Buckshot builds the similarity matrix only for a subset.
- The goal is to reduce the run time to O(kn) instead of the O(n³) or O(n² log n) of HAC.
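
A sketch of the hybrid idea, combining pieces from the earlier examples: hierarchical clustering on a √(kn)-sized random sample, followed by a single k-means-style assignment pass over all documents. SciPy and NumPy are assumed, and the parameter names and the use of average linkage are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def buckshot(points, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    d = min(n, max(k, int(np.sqrt(k * n))))            # sample size ~ sqrt(kn)
    sample = points[rng.choice(n, size=d, replace=False)]
    # HAC on the small sample only: the expensive matrix is d x d, not n x n.
    labels = fcluster(linkage(sample, method="average"), t=k, criterion="maxclust")
    centroids = np.array([sample[labels == c].mean(axis=0) for c in np.unique(labels)])
    # Assign every document to the closest resulting centroid (one k-means pass).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids
```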

Buckshot Algorithm

1. Randomly select d documents, where d is √n or √(kn).
2. Cluster these using a hierarchical clustering algorithm into k clusters: ~O(d²), i.e. O(kn) for d = √(kn).
3. Compute the centroid of each of the k clusters: O(d).
4. Scan the remaining documents and assign them to the closest of the k clusters (as in k-means): O(kn).

Thus: O(kn) + O(d) + O(kn) ~ O(kn).

Summary

- Clustering provides users an overview of the contents of a document collection.
- It is commonly used in organizing search results.
- Cluster labeling aims to make the clusters meaningful for users.
- Clustering can reduce the search space and improve efficiency, and potentially accuracy.
- HAC is computationally expensive.
- K-means is suited to clustering large data sets.
- Evaluating the quality of clusters is difficult.