Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London

What Is Text Clustering? Text Clustering = Grouping a set of documents into classes of similar documents. Documents within a cluster should be similar. Documents from different clusters should be dissimilar. Classification vs. Clustering Classification: supervised learning Labeled data are given for training Clustering: unsupervised learning Only unlabeled data are available

A data set with clear cluster structure

The Cluster Hypothesis Documents in the same cluster behave similarly with respect to relevance to information needs.

Why Text Clustering? To improve retrieval recall. When a query matches a doc d, also return other docs in the cluster containing d. The hope is that the query car will then also return docs containing automobile.

Why Text Clustering? To improve retrieval speed Cluster Pruning: consider only documents in a small number of clusters as candidates for which we compute similarity scores.

Sec. 7.1.6 Cluster Pruning (Chapter 7, Section 7.1.6) Preprocessing: Pick √N docs at random; call these leaders. For every other doc, pre-compute its nearest leader. Docs attached to a leader are its followers; likely: each leader has ~√N followers. Query processing: Given query Q, find its nearest leader L. Seek the K nearest docs from among L's followers.
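
To make the leader/follower scheme concrete, here is a minimal Python sketch (assuming documents are length-normalized rows of a NumPy array, so a dot product is a cosine similarity; the function names preprocess and query are illustrative, not from the slides):

import numpy as np

def preprocess(docs, rng=np.random.default_rng(0)):
    # Pick ~sqrt(N) docs at random as leaders and attach every doc to its nearest leader.
    n = len(docs)
    n_leaders = max(1, int(np.sqrt(n)))
    leader_ids = rng.choice(n, size=n_leaders, replace=False)
    leaders = docs[leader_ids]
    nearest = np.argmax(docs @ leaders.T, axis=1)           # nearest leader for each doc
    followers = {j: np.where(nearest == j)[0] for j in range(n_leaders)}
    return leaders, followers

def query(q, docs, leaders, followers, K=10):
    # Find the nearest leader L, then rank only L's followers by cosine similarity.
    j = int(np.argmax(leaders @ q))
    cand = followers[j]
    scores = docs[cand] @ q
    return cand[np.argsort(-scores)[:K]]                    # ids of the K best candidates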

Sec. 7.1.6 Cluster Pruning (figure: a query point, the leaders, and each leader's followers)

Sec. 7.1.6 Cluster Pruning Why use random sampling? Fast Leaders reflect data distribution More sophisticated clustering techniques later

Sec. 7.1.6 Cluster Pruning General Variants Have each follower attached to b1=3 (say) nearest leaders. From query, find b2=4 (say) nearest leaders and their followers. Can recurse on leader/follower construction.

Sec. 7.1.6 Cluster Pruning Exercises: To find the nearest leader in step 1, how many cosine computations do we do? Why did we choose √N leaders in the first place? What is the effect of the constants b1, b2 on the previous slide? Devise an example where this is likely to fail, i.e., we miss one of the K nearest docs. Is that likely under random sampling?

Why Text Clustering? To improve user interface Navigation/Visualization of document collections Navigation/Visualization of search results Searching + Browsing

http://news.google.co.uk/

http://clusty.com/

What Makes a Good Clustering? External Criteria Consistency with the latent classes in gold standard (ground truth) data. Purity Normalized Mutual Information Rand Index Cluster F Measure Internal Criteria High intra-cluster similarity Low inter-cluster similarity

What Makes a Good Clustering? Purity (worked example from a figure of three clusters): majority class x (5 docs), majority class o (4 docs), majority class d (3 docs), out of 17 docs in total; purity = (5+4+3)/17 ≈ 0.71

What Makes a Good Clustering? Rand Index (RI) The percentage of decisions (on document-pairs) that are correct. Example: docs A1, A2 (class A) and B1, B2, B3 (class B) clustered as {A1, A2, B1} and {B2, B3}. Pair decisions: A1-A2 TP, A1-B1 FP, A1-B2 TN, A1-B3 TN, A2-B1 FP, A2-B2 TN, A2-B3 TN, B1-B2 FN, B1-B3 FN, B2-B3 TP. RI = (TP+TN)/(TP+FP+FN+TN) = 6/10 = 0.6
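
The two external criteria worked through above can be computed with a short sketch (assuming gold labels and cluster assignments are given as parallel Python lists; the helper names purity and rand_index are illustrative). On the Rand Index example it reproduces RI = 0.6.

from collections import Counter
from itertools import combinations

def purity(labels, clusters):
    # For each cluster, count how many docs carry its majority class; divide by N.
    by_cluster = {}
    for lab, clu in zip(labels, clusters):
        by_cluster.setdefault(clu, Counter())[lab] += 1
    return sum(c.most_common(1)[0][1] for c in by_cluster.values()) / len(labels)

def rand_index(labels, clusters):
    # A pair decision is correct if two docs share a cluster exactly when they share a class.
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_class = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return (tp + tn) / (tp + fp + fn + tn)

labels = ["A", "A", "B", "B", "B"]           # gold classes of A1, A2, B1, B2, B3
clusters = [1, 1, 1, 2, 2]                   # clustering {A1, A2, B1} and {B2, B3}
print(purity(labels, clusters))              # 0.8
print(rand_index(labels, clusters))          # 0.6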

Issues in Clustering Similarity/Distance between docs Ideally: semantic Practically: statistical e.g., cosine similarity or Euclidean distance For text clustering, the doc vectors usually need to be length normalized. Membership of docs Hard: each doc belongs to exactly one cluster Soft: A doc can belong to more than one cluster e.g., you may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes.
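
To illustrate the point about length normalization (a sketch, assuming dense NumPy vectors of, say, tf-idf weights; the example weights are made up): once doc vectors have unit length, cosine similarity is a plain dot product and Euclidean distance becomes a monotone function of it.

import numpy as np

def normalize(v):
    # Scale a vector to unit (L2) length; leave the zero vector unchanged.
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

d1 = normalize(np.array([3.0, 0.0, 1.0]))     # hypothetical term weights for doc 1
d2 = normalize(np.array([1.0, 2.0, 0.0]))     # hypothetical term weights for doc 2
cosine = float(d1 @ d2)                       # ~0.42
euclidean = float(np.linalg.norm(d1 - d2))    # on unit vectors: sqrt(2 - 2*cosine)
print(cosine, euclidean)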

Issues in Clustering Number of clusters Fixed in advance Discovered from data Structure of clusters Flat (partition), e.g., k-means. Hierarchical (tree), e.g., HAC.

K-Means Algorithm
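
The slide heading refers to the standard k-means (Lloyd's) procedure: pick k seeds, then repeat two steps until the assignment stops changing: reassign each doc to its nearest centroid, and recompute each centroid as the mean of the docs assigned to it. A minimal sketch (assuming docs are rows of a NumPy array and Euclidean distance; empty clusters simply keep their old centroid):

import numpy as np

def kmeans(X, k, max_iter=100, rng=np.random.default_rng(0)):
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # random seeds
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = np.argmin(dists, axis=1)                     # reassign clusters
        if np.array_equal(new_assign, assign):                    # converged: no doc moved
            break
        assign = new_assign
        for j in range(k):                                        # recompute centroids
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return assign, centroids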

Time Complexity Computing distance between two docs is O(m) where m is the dimensionality of the vectors. Reassigning clusters: O(kn) distance computations, or O(knm). Computing centroids: Each doc gets added once to some centroid: O(nm). Assume these two steps are each done once for i iterations: O(iknm).
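
For a rough sense of scale (the numbers are assumed purely for illustration): with n = 100,000 docs, m = 10,000 dimensions, k = 10 clusters and i = 25 iterations, O(iknm) means on the order of 25 × 10 × 100,000 × 10,000 = 2.5 × 10^11 basic operations, which is why sparse doc vectors and vectorized distance computations matter in practice.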

Stopping Criterion Fixed number of iterations Convergence: to reach a state in which clusters don't change. k-means is proven to converge. k-means usually converges quickly, i.e., the number of iterations needed for convergence is typically small.

K-Means Example (k = 2) (figure: a sequence of plots) Pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged!

K-Means Demo http://home.dei.polimi.it/matteucc/clustering/tutorial_html/appletkm.html

k-means Exercise (figure: digital cameras plotted by megapixels and zoom)

Seed Choice k-means (with k = 2) For seeds d2 and d5, the algorithm converges to {{d1, d2, d3}, {d4, d5, d6}}, a suboptimal clustering. For seeds d2 and d3, the algorithm converges to {{d1, d2, d4, d5}, {d3, d6}}, the global optimum.

Seed Choice Problem: The outcome of clustering in k-means depends on the initial seeds. Some seeds can result in a poor convergence rate, or in convergence to a sub-optimal clustering. Solutions: Excluding outliers from the seed set. Trying out multiple sets of random starting points and choosing the clustering with the lowest cost (see the sketch below). Obtaining good seeds from another method, e.g., hierarchical clustering, k-means++.
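
A sketch of the multiple-random-restarts remedy (reusing the illustrative kmeans() function above): run k-means from several random seedings and keep the clustering with the lowest cost, measured as the residual sum of squares (RSS) of docs to their assigned centroids.

import numpy as np

def kmeans_restarts(X, k, restarts=10):
    # Keep the clustering with the lowest RSS over several random initializations.
    X = np.asarray(X, dtype=float)
    best_cost, best = np.inf, None
    for seed in range(restarts):
        assign, centroids = kmeans(X, k, rng=np.random.default_rng(seed))
        cost = np.sum((X - centroids[assign]) ** 2)               # RSS of this clustering
        if cost < best_cost:
            best_cost, best = cost, (assign, centroids)
    return best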