Today's topic
CS347 Lecture 8, May 7, 2001. Prabhakar Raghavan.
Clustering documents.

Why cluster documents?
- Given a corpus, partition it into groups of related docs. Recursively, this can induce a tree of topics.
- Given the set of docs from the results of a search (say, "jaguar"), partition it into groups of related docs: semantic disambiguation.

Results list clustering example
- Cluster 1: Jaguar Motor Cars home page; Mike's XJS resource page; Vermont Jaguar owners club
- Cluster 2: Big cats; My summer safari trip; Pictures of jaguars, leopards and lions
- Cluster 3: Jacksonville Jaguars Home Page; AFC East Football Teams

What makes docs related?
- Ideal: semantic similarity. Practical: statistical similarity.
- We will use cosine similarity, treating docs as vectors.
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs. We will describe algorithms in terms of cosine distance.

Recall: doc as vector
- Each doc j is a vector of tf x idf values, one component for each term.
- Can normalize to unit length.
- So we have a vector space: terms are axes, and the n docs live in this space. Even with stemming, it may have 10000+ dimensions.

Intuition
[Figure: docs D1-D4 plotted as vectors along term axes t1, t2, t3]
Postulate: documents that are close together in vector space talk about the same things.

Cosine similarity
Cosine similarity of D_j, D_k:

  sim(D_j, D_k) = sum_{i=1..m} w_ij * w_ik

Aka normalized inner product.
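The cosine measure above can be sketched in a few lines of Python (a minimal illustration, not from the lecture; the function names and toy vectors are made up):

```python
import math

def normalize(vec):
    """Scale a tf-idf vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec))
    return [w / norm for w in vec]

def cosine_sim(d_j, d_k):
    """Normalized inner product: sum over terms i of w_ij * w_ik."""
    return sum(w_ij * w_ik for w_ij, w_ik in zip(normalize(d_j), normalize(d_k)))

# Two docs over a tiny 3-term vocabulary; same direction -> similarity 1
d1 = [1.0, 2.0, 2.0]
d2 = [2.0, 4.0, 4.0]
print(round(cosine_sim(d1, d2), 6))  # 1.0
```

With unit-length vectors the sum of products is exactly the cosine of the angle between the docs, which is why the slides can use it interchangeably as a (dis)similarity.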

Two flavors of clustering
- Given n docs and a positive integer k, partition the docs into k (disjoint) subsets.
- Given docs, partition them into an appropriate number of subsets. E.g., for query results, the ideal value of k is not known up front.
- Can usually take an algorithm for one flavor and convert it to the other.

Cluster centroid
- Centroid of a cluster = average of the vectors in the cluster. It is a vector, and need not be a doc.
- Centroid of (1,2,3), (4,5,6), (7,2,6) is (4,3,5).

Outliers in centroid computation
- Ignore outliers when computing the centroid.
- What is an outlier? A doc whose distance to the centroid is > M times the average distance, for some multiplier M (say M = 10).

Agglomerative clustering
- Given a target number of clusters k.
- Initially, each doc is viewed as a cluster: start with n clusters.
- Repeat: while there are > k clusters, find the closest pair of clusters and merge them.
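The centroid computation can be sketched directly, using the worked example from the slide (the function name is illustrative):

```python
def centroid(vectors):
    """Average of the vectors in a cluster -- itself a vector,
    not necessarily one of the docs."""
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dims)]

# The slide's example: centroid of (1,2,3), (4,5,6), (7,2,6)
print(centroid([(1, 2, 3), (4, 5, 6), (7, 2, 6)]))  # [4.0, 3.0, 5.0]
```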

Closest pair of clusters
- Example: n = 6, k = 3.
- Many variants of defining the closest pair of clusters.
- Closest pair: the two clusters whose centroids are the most cosine-similar.
[Figure: docs d1-d6, with the centroid shown after the first merge step]

Issues
- Have to discover the closest pairs. Compare all pairs? n^3 cosine similarity computations.
- Avoid this: recall techniques from Lecture 4. But the points are changing as the centroids change.
- Changes at each step are not localized; on a large corpus, memory management becomes an issue.
- How would you adapt sampling/pregrouping?

Exercise
Consider agglomerative clustering on n points on a line. Explain how you could avoid n^3 distance computations. How many will your scheme use?
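The merge loop can be sketched naively, exhibiting exactly the all-pairs cost the slide warns about (a toy illustration with made-up 2-D docs, not an efficient implementation):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(cluster):
    n = len(cluster)
    return [sum(v[i] for v in cluster) / n for i in range(len(cluster[0]))]

def agglomerate(docs, k):
    """Start with one cluster per doc; repeatedly merge the pair of
    clusters whose centroids are most cosine-similar, until k remain."""
    clusters = [[d] for d in docs]
    while len(clusters) > k:
        cents = [centroid(c) for c in clusters]
        # naive all-pairs comparison of centroids
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cosine(cents[ab[0]], cents[ab[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters

docs = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9), (1, 1), (0.5, 0.5)]
print(agglomerate(docs, 3))  # three direction-based groups
```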

Hierarchical clustering
- As clusters agglomerate, docs are likely to fall into a hierarchy of topics or concepts.
[Figure: dendrogram over d1-d5, with clusters {d1,d2}, {d4,d5}, {d3,d4,d5}]

Different algorithm: k-means
- Iterative algorithm.
- More locality within each iteration.
- Hard to get good bounds on the number of iterations.

Basic iteration
- At the start of the iteration, we have k centroids. They need not be docs, just some k points.
- Each doc is assigned to the nearest centroid.
- All docs assigned to the same centroid are averaged to compute a new centroid; thus we have k new centroids.

Iteration example
[Figure: docs and the current centroids]

Iteration example
[Figure: docs and the new centroids]

k-means clustering
- Begin with k docs as centroids. These could be any k docs, but k random docs are better. Why?
- Repeat the Basic Iteration until a termination condition is satisfied.

Termination conditions
Several possibilities, e.g.:
- A fixed number of iterations.
- Centroid positions don't change. Does this mean that the docs in a cluster are unchanged?

Convergence
- Why should the k-means algorithm ever reach a fixed point, i.e., a state in which the clusters don't change?
- k-means is a special case of a general procedure known as the EM algorithm, which under reasonable conditions is known to converge.
- The number of iterations could be large.
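The basic iteration can be sketched as follows. This is a minimal illustration with invented names and toy 2-D points; it uses Euclidean distance for brevity, while the lecture's cosine distance slots in the same way, and it terminates after a fixed number of iterations (the first termination condition above):

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean for brevity)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(docs, centroids, iterations=10):
    """Repeat the basic iteration: assign each doc to its nearest
    centroid, then recompute each centroid as the mean of its docs."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for d in docs:
            groups[nearest(d, centroids)].append(d)
        centroids = [
            [sum(v[i] for v in g) / len(g) for i in range(len(g[0]))] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

docs = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(docs, centroids=[(0, 0), (10, 10)]))  # [[0.0, 0.5], [10.0, 10.5]]
```

Here the centroids stop moving after the first iteration, which is the fixed point the Convergence slide asks about.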

Exercise
Consider running 2-means clustering on a corpus, each doc of which is from one of two different languages. What are the two clusters we would expect to see? Is agglomerative clustering likely to produce different results?

Multi-lingual docs
- Canadian/Belgian government docs: every doc exists in English and in an equivalent French version.
- Want to cluster by concepts rather than by language.
- Cross-lingual retrieval.

k not specified in advance
- Say, for the results of a query.
- Solve an optimization problem: penalize having lots of clusters, e.g., in a compressed summary of a list of docs.
- Tradeoff between having more clusters (better focus within each cluster) and having too many clusters.

k not specified in advance
- Given a clustering, define the Benefit for a doc to be its cosine similarity to its centroid.
- Define the Total Benefit to be the sum of the individual doc Benefits.
- Why is there always a clustering of Total Benefit n?

Penalize lots of clusters
- For each cluster, we have a Cost C. Thus for a clustering with k clusters, the Total Cost is kC.
- Define the Value of a clustering to be Total Benefit - Total Cost.
- Find the clustering of highest Value, over all choices of k.

Back to agglomerative clustering
- In a run of agglomerative clustering, we can try all values of k = n, n-1, n-2, ..., 1.
- At each, we can measure the Value, then pick the best choice of k.

Exercise
Suppose a run of agglomerative clustering finds k = 7 to have the highest Value amongst all k. Have we found the highest-Value clustering amongst all clusterings with k = 7?

Using clustering in applications
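The Value objective can be made concrete with a small sketch (illustrative names and toy 2-D docs; the per-cluster cost C is an assumed constant). It also answers the earlier question: the clustering with one doc per cluster always has Total Benefit n, since every doc has cosine similarity 1 to its own centroid:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def centroid(cluster):
    n = len(cluster)
    return [sum(v[i] for v in cluster) / n for i in range(len(cluster[0]))]

def value(clusters, C):
    """Value = Total Benefit - k*C, where a doc's Benefit is its
    cosine similarity to its cluster's centroid."""
    benefit = sum(cosine(d, centroid(c)) for c in clusters for d in c)
    return benefit - C * len(clusters)

docs = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9)]
one_per_doc = [[d] for d in docs]      # k = n: Total Benefit = n, but Cost = n*C
two_groups = [docs[:2], docs[2:]]      # k = 2: slightly lower Benefit, far lower Cost
print(value(two_groups, C=0.5) > value(one_per_doc, C=0.5))  # True
```

With a nonzero C, the singleton clustering is no longer optimal, which is exactly the penalty the slide introduces.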

Clustering to speed up scoring
- From Lecture 4, recall sampling and pregrouping.
- We wanted to find, given a query Q, the nearest docs in the corpus.
- We wanted to avoid computing the cosine similarity of Q to each of the n docs in the corpus.

Sampling and pregrouping (Lecture 4)
First run a pre-processing phase:
- Pick √n docs at random: call these the leaders.
- For each other doc, pre-compute its nearest leader. Docs attached to a leader are its followers.
- Likely: each leader has ~√n followers.
Process a query as follows:
- Given query Q, find its nearest leader L.
- Seek the nearest docs from among L's followers.

Instead of random leaders, cluster
First run a pre-processing phase:
- Cluster the docs into √n clusters.
- For each cluster, its centroid is the leader.
Process a query as follows:
- Given query Q, find its nearest leader L.
- Seek the nearest docs from among L's followers.

Navigation structure
- Given a corpus, agglomerate it into a hierarchy.
- Throw away the lower layers so you don't have n leaf topics, each containing a single doc.
[Figure: dendrogram over d1-d5]
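The leader/follower query processing above can be sketched as follows (illustrative names, a tiny 2-D corpus, and a pre-built follower assignment; in practice the leaders would be sampled or computed as cluster centroids as described above):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_leader(q, leaders):
    """Index of the leader most cosine-similar to the query."""
    return max(range(len(leaders)), key=lambda i: cosine(q, leaders[i]))

def search(q, leaders, followers, top=2):
    """Find q's nearest leader L, then rank only L's followers --
    avoiding a cosine computation against every doc in the corpus."""
    L = nearest_leader(q, leaders)
    return sorted(followers[L], key=lambda d: cosine(q, d), reverse=True)[:top]

leaders = [(1, 0), (0, 1)]
followers = {0: [(0.9, 0.1), (0.8, 0.3)],   # docs pre-attached to leader 0
             1: [(0.1, 0.9), (0.2, 0.8)]}   # docs pre-attached to leader 1
print(search((1, 0.05), leaders, followers))  # [(0.9, 0.1), (0.8, 0.3)]
```

Only the leaders and then one leader's followers are scored, which is where the savings over scoring all n docs comes from.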

Navigation structure
- Deciding how much to throw away needs human judgement.
- Can also induce the hierarchy top-down: e.g., use k-means, then recur on the clusters.
- Topics induced by clustering need human ratification.
- Need to address issues like partitioning at the top level by language.

Major issue: labeling
- After the clustering algorithm finds clusters, how can they be useful to the end user?
- Need a pithy label for each cluster: in search results, say "Football" or "Car" in the jaguar example; in topic trees, navigational cues.
- Often done by hand, a posteriori.

Labeling
- Common heuristic: list the 5-10 most frequent terms in the centroid vector. Drop stop-words; stem.
- Differential labeling by frequent terms: within the cluster "Computers", the child clusters all have the word "computer" as a frequent term. Use discriminant analysis of centroids for peer clusters.

Supervised vs. unsupervised learning
- Unsupervised learning: given a corpus, infer structure implicit in the docs, without prior training.
- Supervised learning: train a system to recognize docs of a certain type (e.g., docs in Italian, or docs about religion), then decide whether or not new docs belong to the class(es) trained on.
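The frequent-terms labeling heuristic can be sketched as below (a toy illustration: the vocabulary, weights, and stop-word list are made up, and stemming is omitted):

```python
def label(centroid_weights, vocabulary, stop_words, top=5):
    """Label a cluster with the top-weighted terms of its centroid
    vector, after dropping stop-words."""
    ranked = sorted(range(len(vocabulary)),
                    key=lambda i: centroid_weights[i], reverse=True)
    terms = [vocabulary[i] for i in ranked if vocabulary[i] not in stop_words]
    return terms[:top]

vocab = ["the", "jaguar", "football", "team", "of"]
weights = [0.9, 0.5, 0.8, 0.6, 0.7]   # illustrative centroid tf-idf weights
print(label(weights, vocab, stop_words={"the", "of"}, top=3))
# ['football', 'team', 'jaguar']
```

Note that without the stop-word filter the label would be dominated by "the" and "of", which is why the heuristic drops them first.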

Resources
Good demo of results-list clustering: cluster.cs.yale.edu