Clustering. Bruno Martins. 1st Semester 2012/2013

Departamento de Engenharia Informática, Instituto Superior Técnico. 1st Semester 2012/2013.
Slides based on the official slides for the book Mining the Web, © Soumen Chakrabarti.

Outline
1. Motivation and Basic Concepts
2. Bottom-Up and Top-Down Clustering
3. Self-Organizing Maps, Multidimensional Scaling, and Latent Semantic Indexing

1. Motivation and Basic Concepts

Motivation

Problem: a query word can be ambiguous. E.g., the query "star" retrieves documents about astronomy, animals, etc.
Solution: cluster the documents returned in response to a query along the lines of the different topics.

Problem: manual construction of topic hierarchies and taxonomies.
Solution: preliminary clustering of large samples of web documents.

Problem: speeding up similarity search.
Solution: restrict the search for documents similar to a query to the most representative cluster(s).

An Example (figure)

Cluster Hypothesis: given a suitable clustering of a collection, if the user is interested in a document d (or a term t), they are likely to also be interested in the other members of the cluster to which d (or t) belongs.

Task: devise measures of similarity with which to cluster a collection of documents/terms into groups, so that the similarity within a cluster is larger than the similarity across clusters.

Concepts

Two important paradigms:
- Bottom-up agglomerative clustering
- Top-down partitioning

Visualization techniques: visualization of the corpus in a low-dimensional space.

Characterizing the entities:
- Internally: vector space model, probabilistic models
- Externally: measures of similarity/dissimilarity between pairs

Parameters

General parameters:
- Similarity measure $\rho(d_1, d_2)$, e.g., cosine similarity
- Distance measure $\delta(d_1, d_2)$, e.g., Euclidean distance
- Number of clusters: k

Problems:
- Large number of noisy dimensions
- The notion of noise is application dependent
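As a concrete illustration, here is a minimal Python sketch of the two measures named above; the vectors stand in for document representations (e.g., TF-IDF vectors), and the function names are mine, not from the slides.

```python
import math

def cosine_similarity(a, b):
    """rho(d1, d2): cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """delta(d1, d2): straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d1, d2 = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
print(cosine_similarity(d1, d2))   # ~0.730
print(euclidean_distance(d1, d2))  # ~1.732
```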

Approaches covered in this lecture:
- Bottom-up clustering
- Top-down clustering
- Self-organizing maps
- Multidimensional scaling
- Latent semantic indexing
- Generative models and probabilistic approaches: either a single topic per document, or documents corresponding to mixtures of multiple topics

2. Bottom-Up and Top-Down Clustering

Partition the document collection into k clusters $\{D_1, D_2, \ldots, D_k\}$.

Choices:
- Minimize intra-cluster distance: $\sum_i \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)$
- Maximize intra-cluster semblance: $\sum_i \sum_{d_1, d_2 \in D_i} \rho(d_1, d_2)$

If cluster representatives $\vec{D}_i$ are available:
- Minimize $\sum_i \sum_{d \in D_i} \delta(d, \vec{D}_i)$
- Maximize $\sum_i \sum_{d \in D_i} \rho(d, \vec{D}_i)$

Soft clustering: document d is assigned to cluster $D_i$ with confidence $z_{d,i}$. Find the $z_{d,i}$ so as to minimize $\sum_i \sum_{d} z_{d,i}\, \delta(d, \vec{D}_i)$, or maximize $\sum_i \sum_{d} z_{d,i}\, \rho(d, \vec{D}_i)$.

Two ways to get partitions: bottom-up clustering and top-down clustering.

Hierarchical Agglomerative Clustering

Initially, G is a collection of singleton groups, each containing one document d.
Repeat:
- Find the pair of groups Γ, Δ in G with the maximum similarity measure s(Γ ∪ Δ)
- Merge group Γ with group Δ
Use the record of merges to plot the hierarchical merging process (a dendrogram). To get the desired number of clusters, cut across the dendrogram at any level.
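A minimal sketch of this loop in Python, under the assumption that groups are scored by average pairwise inter-document similarity (group-average link, one of several possible choices for s); the function names are mine.

```python
def agglomerative_cluster(docs, sim, k):
    """Bottom-up clustering: repeatedly merge the most similar pair of
    groups until only k groups remain. `sim(d1, d2)` is the
    inter-document similarity measure."""
    groups = [[d] for d in docs]              # G starts as singleton groups

    def group_sim(g1, g2):                    # average pairwise similarity
        return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

    while len(groups) > k:
        # find the pair (Gamma, Delta) with maximum similarity
        i, j = max(((i, j) for i in range(len(groups))
                    for j in range(i + 1, len(groups))),
                   key=lambda p: group_sim(groups[p[0]], groups[p[1]]))
        merged = groups[i] + groups[j]        # merge Gamma with Delta
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)]
        groups.append(merged)
    return groups
```

A full implementation would also record each merge so the dendrogram can be plotted and cut at any level, and would cache group similarities instead of recomputing them from scratch.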

Example Dendrogram (figure)

Similarity Measures

Self-similarity: the average pairwise similarity between documents in a cluster Φ:

$$s(\Phi) = \frac{1}{\binom{|\Phi|}{2}} \sum_{d_1, d_2 \in \Phi} s(d_1, d_2)$$

where $s(d_1, d_2)$ is the inter-document similarity measure (e.g., cosine of TF-IDF vectors).

Other criteria: maximum/minimum pairwise similarity between documents in the clusters.

Complexity: O(n² log n) time, with O(n²) space.
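The self-similarity of a cluster follows directly from the definition above; a small sketch (`sim` is the inter-document similarity, and the function name is mine):

```python
from itertools import combinations

def self_similarity(cluster, sim):
    """s(Phi): average similarity over all C(|Phi|, 2) document pairs."""
    pairs = list(combinations(cluster, 2))
    return sum(sim(d1, d2) for d1, d2 in pairs) / len(pairs)
```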

Top-Down Clustering

Uses an internal representation both for the documents and for the clusters, and partitions the documents into k clusters.

Two variants:
- Hard: 0/1 assignment of documents to clusters
- Soft: documents belong to clusters with fractional scores

Termination: when the assignment of documents to clusters ceases to change much, or when the cluster centroids move negligibly over successive iterations.

The k-means Algorithm

Hard k-means:
1. Choose k arbitrary centroids
2. Assign each document to its nearest centroid
3. Recompute the centroids

The contribution of document d for updating the centroid of cluster c is:

$$\Delta\mu_c = \begin{cases} \eta\,(d - \mu_c) & \text{if } \mu_c \text{ is the centroid closest to } d \\ 0 & \text{otherwise} \end{cases}$$

$$\mu_c \leftarrow \mu_c + \Delta\mu_c$$

The parameter η is called the learning rate.
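A minimal hard k-means sketch in Python. This is the common batch variant (full reassignment, then centroid recomputation) rather than the incremental η update above; the function name and the fixed iteration count are my choices.

```python
import random

def kmeans(docs, k, iters=20):
    """Hard k-means over a list of equal-length vectors (tuples)."""
    centroids = random.sample(docs, k)            # choose k arbitrary centroids
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in docs:                            # assign to nearest centroid
            c = min(range(k),
                    key=lambda i: sum((x - y) ** 2
                                      for x, y in zip(d, centroids[i])))
            clusters[c].append(d)
        for i, members in enumerate(clusters):    # recompute the centroids
            if members:
                centroids[i] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, clusters
```

A production version would stop early once assignments cease to change, per the termination criterion above.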

The k-means Algorithm

Soft k-means: do not break close ties between document-to-cluster assignments, and do not let a document contribute solely to the single cluster that wins only narrowly. The contribution from document d to updating the centroid $\mu_c$ is related to the current similarity between the two:

$$\Delta\mu_c = \eta\, \frac{1/\|d - \mu_c\|^2}{\sum_\gamma 1/\|d - \mu_\gamma\|^2}\, (d - \mu_c)$$
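A sketch of this soft update for a single document; `mu` is a list of centroid vectors (as mutable lists), and the helper names are mine.

```python
def soft_update(d, mu, eta=0.1):
    """Move every centroid toward d, weighted by inverse squared distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) or 1e-12  # avoid /0
    weights = [1.0 / sqdist(d, m) for m in mu]
    total = sum(weights)
    for m, w in zip(mu, weights):
        for i in range(len(m)):
            m[i] += eta * (w / total) * (d[i] - m[i])
```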

An Example (online demo)

3. Self-Organizing Maps, Multidimensional Scaling, and Latent Semantic Indexing

Visualization Techniques

Goal: visualization of the corpus in a low-dimensional space.
- Hierarchical Agglomerative Clustering (HAC) lends itself easily to visualization
- Self-Organizing Map (SOM): a technique related to k-means
- Multidimensional Scaling (MDS): minimizes the distortion of inter-point distances in the low-dimensional embedding relative to the dissimilarities given in the input data
- Latent Semantic Indexing (LSI): linear transformations that reduce the number of dimensions

Self-Organizing Maps

Like soft k-means:
- Determines an association between clusters and documents
- Associates a representative vector with each cluster and iteratively refines it

Unlike k-means:
- The clusters are embedded in a low-dimensional space right from the beginning
- A large number of clusters can be initialized, even if many will eventually remain devoid of documents

Each cluster can be a slot in a square/hexagonal grid. The grid structure defines a neighborhood N(c) for each cluster c, and a proximity function h(γ, c) between clusters is also involved.

Update Rule

A data item d activates the node $c_d$ (the closest cluster) as well as its neighborhood nodes, e.g., all nodes within n hops. The update rule for a node γ under the influence of d is:

$$\mu_\gamma \leftarrow \mu_\gamma + \eta\, h(\gamma, c_d)\, (d - \mu_\gamma)$$
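A sketch of one SOM update step in Python. The slides leave h unspecified; here a Gaussian over squared grid distance is assumed, and the grid layout and names are mine.

```python
import math

def som_step(d, grid, eta=0.1, sigma=1.0):
    """One update: find the winner node c_d, then pull every node gamma
    toward d with strength eta * h(gamma, c_d). `grid` maps (row, col)
    positions to representative vectors stored as mutable lists."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    c_d = min(grid, key=lambda pos: sqdist(d, grid[pos]))   # winner node
    for gamma, mu in grid.items():
        grid_dist2 = (gamma[0] - c_d[0]) ** 2 + (gamma[1] - c_d[1]) ** 2
        h = math.exp(-grid_dist2 / (2 * sigma ** 2))        # h(gamma, c_d)
        for i in range(len(mu)):
            mu[i] += eta * h * (d[i] - mu[i])
```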

Example (figure)

Multidimensional Scaling

Goal: a distance-preserving low-dimensional embedding of the documents (often called Principal Coordinate Analysis).
- Symmetric inter-document distances are given a priori, or computed from the internal representation
- Coarse-grained user feedback: the user provides the similarity $d_{i,j}$ between documents i and j; with increasing feedback, the prior distances are overridden

Objective: minimize the stress of the embedding:

$$\text{stress} = \frac{\sum_{i,j} (\hat{d}_{i,j} - d_{i,j})^2}{\sum_{i,j} d_{i,j}^2}$$

where $\hat{d}_{i,j}$ is the inter-document distance in the low-dimensional embedding.
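Evaluating the stress of a candidate embedding follows directly from the formula; a sketch, where `d_hat` and `d` are the embedded and original distance matrices as nested lists (names are mine):

```python
def stress(d_hat, d):
    """Stress = sum_ij (d_hat_ij - d_ij)^2 / sum_ij d_ij^2."""
    n = len(d)
    num = sum((d_hat[i][j] - d[i][j]) ** 2 for i in range(n) for j in range(n))
    den = sum(d[i][j] ** 2 for i in range(n) for j in range(n))
    return num / den
```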

Problems

Stress is not easy to optimize. Iterative hill climbing:
1. Points (documents) are assigned random coordinates by some external heuristic
2. Points are moved, one at a time, by a small distance in the direction of locally decreasing stress

For n documents, each point takes O(n) time to be moved, giving O(n²) total time per relaxation. A faster approach is FastMap [Faloutsos and Lin, 1995].

Extended Similarity

In the vector space model, distinct terms are treated as having distinct meanings: car ≠ automobile. However, car and automobile are related: they are likely to co-occur often, and documents containing related words are themselves related. Exploiting this is useful for search and clustering.

Two basic approaches:
- A hand-made thesaurus (e.g., WordNet)
- Co-occurrence and associations

Latent Semantic Indexing

The vector-space model assigns a distinct orthogonal direction to each term. But not all terms are really orthogonal, e.g., car and automobile. We need a matrix of lower rank than the traditional document-term matrix; such a matrix can be found through Singular Value Decomposition (SVD).

Singular Value Decomposition (figure)
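The original slide presents the decomposition as a figure, not reproduced here. For reference, the standard rank-r truncated SVD it depicts can be written as follows (my notation, with D the diagonal matrix of singular values, chosen to match the querying formula on the next slide):

```latex
% Truncated SVD of the term-document matrix A:
%   U_r : first r left singular vectors (term space)
%   D_r : diagonal matrix of the r largest singular values
%   V_r : first r right singular vectors (document space)
A = U D V^{T} \approx A_r = U_r D_r V_r^{T}
```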

Querying

A query q is projected into the reduced space as:

$$\hat{q} = D_r^{-1}\, U_r^T\, q$$

Cosine similarity can now be applied to the projections. Results are often better than TF-IDF retrieval: the SVD filters out noise and discovers semantic associations between terms.
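A sketch of the projection with NumPy; the toy matrix, the choice of r, and the variable names are mine (rows of A index terms, columns index documents).

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
r = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_r, D_r_inv = U[:, :r], np.diag(1.0 / s[:r])

q = np.array([1., 0., 0., 1.])        # query vector over the 4 terms
q_hat = D_r_inv @ U_r.T @ q           # q_hat = D_r^{-1} U_r^T q

docs_r = Vt[:r].T                     # each row: a document in the r-dim space
sims = docs_r @ q_hat / (np.linalg.norm(docs_r, axis=1) * np.linalg.norm(q_hat))
print(sims)                           # cosine similarity of q to each document
```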

Examples (figures)

Problems

- Computationally expensive: LSI can run for minutes to hours on a collection of 10³ to 10⁴ documents
- Current implementations need to store the whole input matrix in memory, which is not feasible for the Web

Probabilistic Latent Semantic Indexing (PLSI) adds a sounder probabilistic model, based on a mixture decomposition derived from a latent class model.

Questions?