Web Document Clustering

Dustin Carson
1 Web Document Clustering IT

2 Contents IR and Web Search; Problems on Web Search Results; Document Clustering?; Why Cluster Documents on Web Search Results?; Applications of Document Clustering; Clustering Approaches; More Applications; Discussions

3 Typical IR Task Given: A corpus of textual natural-language documents. A user query in the form of a textual string. Find: A ranked set of documents that are relevant to the query.

4 IR System [Diagram: a document corpus and a query string are the inputs to the IR system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]

5 Relevance Relevance is a subjective judgment and may include: being on the proper subject; being timely (recent information); being authoritative (from a trusted source); satisfying the goals of the user and his/her intended use of the information (information need).

6 Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

7 Intelligent IR Taking into account the meaning of the words used. Taking into account the order of words in the query. Taking into account the authority of the source. Adapting to the user based on the direct or indirect feedback.

8 Web Search Application of IR to HTML documents on the World Wide Web. Differences with typical IR: Must assemble document corpus by spidering the web. Can use the structural layout information in HTML (XML). Can use the link structure of the web. Documents change uncontrollably.

9 Web Search System [Diagram: a web spider builds the document corpus; the corpus and a query string are the inputs to the IR system, which outputs ranked documents: 1. Page1, 2. Page2, 3. Page3, ...]

10 Web users' information needs Classification based on the type of answer expected by the Web user:
1. Question answering. Type of answer: a very short answer (a single sentence or part of a document). Query example: "What is the population of Korea?"
2. Named page finding / homepage finding. Type of answer: a single document containing specific information. Query example: "Authors' instructions for the KISS journal"
3. Topic relevance. Type of answer: a range of documents relevant to the topic of the information need. Query example: "Korea's immigration policy"
4. Online services. Type of answer: a range of documents that allow the user to access a particular service. Query example: "download mp3"
5. Topic distillation. Type of answer: a range of documents that are relevant key resources on the topic (relevance + quality). Query example: "highway safety"

11 Web search results Big problem: search results come as a long list ranked by relevance to the given query. For an ambiguous query such as "engine", results about search engines, car parts, and Engine Corp. are mixed together, so scanning the list is time-consuming when the query has multiple sub-topics. Possible solution: cluster the web search results (online). Evidence: relationships between the results; the Cluster Hypothesis (van Rijsbergen 1979): closely related documents tend to be relevant to the same requests. Potential results: aids user-engine interaction; supports browsing; helps users express their information need.

12 An example - Vivisimo

13 What is clustering? Partition unlabeled examples into disjoint subsets (clusters) such that examples within a cluster are very similar and examples in different clusters are very different. Clustering is the act of grouping similar objects into sets. It is unsupervised classification: there are no predefined classes. Typical applications: to get insight into data; as a preprocessing step.

14 What is clustering? (cont.)

15 What is document clustering? Document clustering: to group similar or related documents into a common cluster. E.g., Scatter/Gather.

16 Why not just document clustering? Web search results clustering is a version of document clustering, but: there are billions of pages; the pages are constantly changing; the data is mainly unstructured and heterogeneous; there is additional information to consider (e.g. links, clickthrough data).

17 Why cluster web documents?
1. For whole corpus analysis/navigation: better user interface.
2. For improving recall in search applications: better search results.
3. For better navigation of search results: effective user recall will be higher.
4. For speeding up vector space retrieval: faster search.

18 1. Corpus analysis/navigation Standard IR vs. document clusters: sometimes a table of contents (TOC) is very useful. [Figure: a back-of-the-book index (Aardvark, 15 Blueberry, 200 Capricorn, 1, Dog, Egypt, 65 Falafel, Giraffes, ) next to a table of contents: 1. Science of Cognition; 1.a. Motivations; 1.a.i. Intellectual Curiosity; 1.a.ii. Practical Applications; 1.b. History of Cognitive Psychology; 2. The Neural Basis of Cognition; 2.a. The Nervous System; 2.b. Organization of the Brain; 2.c. The Visual System; 3. Perception and Attention; 3.a. Sensory Memory; 3.b. Attention and Sensory Information Processing]

19 1. Corpus analysis/navigation (cont.) Document clustering can induce a tree of topics that allows the user to browse through the corpus to find information. Crucial need: meaningful labels for topic nodes. Yahoo! uses a manual hierarchy, which is often not available for a new document collection.

20 For visualizing a document collection and its themes Wise et al., "Visualizing the non-visual" (PNNL); ThemeScapes, Cartia. [Figure: mountain height = cluster size]

21 2. Improvement of search recall Cluster hypothesis: documents with similar text are related. Therefore, to improve search recall: cluster the docs in the corpus a priori (pre-clustering); when a query matches a doc D, also return the other docs in the cluster containing D. The hope: the query "car" will also return docs containing "automobile", because clustering grouped docs containing "car" together with those containing "automobile".
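
The pre-clustering idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a production retrieval system: the function name `search_with_cluster_expansion`, the toy documents, and the hand-assigned cluster ids in `cluster_of` are all assumptions made for the example, and matching is plain keyword lookup.

```python
def search_with_cluster_expansion(query, docs, cluster_of):
    """docs: {doc_id: text}; cluster_of: {doc_id: cluster_id}, computed offline.

    Returns (direct, expanded): the docs that literally contain the query term,
    and those docs plus every other member of their clusters.
    """
    direct = {d for d, text in docs.items() if query in text.lower().split()}
    hit_clusters = {cluster_of[d] for d in direct}
    expanded = {d for d in docs if cluster_of[d] in hit_clusters}
    return direct, expanded

# Hypothetical corpus: docs 1 and 2 were clustered together beforehand.
docs = {1: "used car prices", 2: "automobile dealer prices", 3: "jaguar habitat"}
cluster_of = {1: 0, 2: 0, 3: 1}
direct, expanded = search_with_cluster_expansion("car", docs, cluster_of)
```

Here the query "car" matches only doc 1 directly, while the expanded result also contains doc 2, which mentions "automobile" but not "car".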

22 3. Better navigation of search results For grouping search results thematically clusty.com / Vivisimo

23 3. Better navigation of search results - Kartoo.com -

24 3. Better navigation of search results - iboogie.com -

25 3. Better navigation of search results -mooter.com-

26 3. Better navigation of search results Given the results of a search (say, "Jaguar" or "NLP"), partition them into groups of related docs. Grouping the documents that use the same sense of a word can be viewed as a form of word sense disambiguation. E.g., "jaguar" may have the senses: the car company; the animal; the football team; the video game.

27 4. Speeding up vector space retrieval In vector space retrieval, we must find the doc vectors nearest to the query vector. This entails computing the similarity of the query to every doc, which is slow for some applications. By clustering the docs in the corpus a priori, we can search only the docs in the cluster(s) closest to the query. This is inexact but avoids exhaustive similarity computation.
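
A minimal Python sketch of this cluster-pruning idea follows. It assumes the clusters and their centroids were built offline, uses cosine similarity as the proximity measure, and searches only the single closest cluster; the function names and the toy vectors are hypothetical.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def cluster_pruned_search(query, clusters):
    """clusters: list of (centroid, members), members = [(doc_id, vector), ...]."""
    # Compare the query with the few centroids instead of with every document.
    _, members = max(clusters, key=lambda c: cosine(query, c[0]))
    # Exhaustive comparison happens only inside the chosen cluster.
    return max(members, key=lambda m: cosine(query, m[1]))[0]

# Hypothetical pre-built clusters over 2-dimensional document vectors.
clusters = [
    ((1.0, 0.1), [("d1", (1.0, 0.0)), ("d2", (0.9, 0.2))]),
    ((0.1, 1.0), [("d3", (0.0, 1.0)), ("d4", (0.2, 0.9))]),
]
best = cluster_pruned_search((0.3, 0.8), clusters)
```

The result is inexact in general: a document in a neighbouring cluster can be closer to the query than anything in the chosen cluster, which is the trade-off the slide describes.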

28 What Is A Good Clustering? Internal criterion: a good clustering produces high-quality clusters in which the intra-class (that is, intra-cluster) similarity is high and the inter-class similarity is low. The measured quality of a clustering depends on both the document representation and the similarity measure used. External criterion: the quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes; this is assessable with gold standard data.

29 Main issues Online or offline clustering? What to use as input: entire documents; snippets; structure information (links); other data (e.g. click-through); whether to use stop word lists, stemming, etc. How to define similarity: content (e.g. vector-space model); link analysis; usage statistics. How to group similar documents? How to label the groups?

30 Components Of Clustering System
1. Feature selection/extraction: choosing the most effective subset of features from the patterns.
2. Feature representation: one or more transformations of the input features (attributes) to produce new salient features.
3. Pattern proximity (similarity measure): the distance between pairs of patterns (feature vectors).
4. Clustering (grouping): hard (a partition of the data into groups) or fuzzy (each pattern has a variable degree of membership in each cluster).
5. Data abstraction: extracting a compact representation of a data set, for automatic analysis or for human readers.
6. Evaluation: measuring the performance of competing algorithms.

31 1. Feature Selection/Extraction What is the representation of documents? A document is a set of features. What are the features? Linguistic structure in documents: co-occurrences of terms (n-grams); semantic structure (case structure). Meta-data of documents: authors; citations. Co-citation: documents that cite the examined documents. Bibliographic coupling: documents that are cited by the examined documents.

32 1. Feature Selection/Extraction Which terms to use as axes for the vector space? It is better to use the highest-weight mid-frequency words, which are the most discriminating terms. Pseudo-linguistic heuristics, e.g.: drop stop-words; apply stemming/lemmatization; use only nouns/noun phrases. A good clustering should figure out some of these.

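33 2. Feature representation Documents are transformed into vectors of weighted terms. Data matrix (two modes), with one row per document $d_1, \dots, d_n$ and one column per term $t_1, \dots, t_m$:
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}$$
How to compute $x_{ij}$: tf*idf. $x_{ij} = tf_{ij} \cdot idf_j$ with $idf_j = \log(N / df_j)$, where $tf_{ij}$ is the frequency of term $t_j$ in document $d_i$, $N$ is the number of documents, and $df_j$ is the number of documents that contain $t_j$.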
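
As a small illustration of the tf*idf weighting on this slide, the following Python sketch builds the document-by-term matrix for a toy corpus. It assumes whitespace tokenization, raw term counts for tf, and the natural logarithm for idf; the function name `tfidf_matrix` and the example documents are made up for the illustration.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] = tf_ij * log(N / df_j)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # df_j: number of documents that contain term j at least once
    df = Counter(t for doc in tokenized for t in set(doc))
    matrix = []
    for doc in tokenized:
        tf = Counter(doc)  # tf_ij: raw count of term j in document i
        matrix.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, matrix

docs = ["jaguar car dealer", "jaguar animal habitat", "car engine car"]
vocab, X = tfidf_matrix(docs)
```

A term that occurs in every document gets weight 0 under this scheme, since log(N / N) = 0, which is why very frequent words contribute nothing to the vectors.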

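34 3. Pattern proximity (measure similarity) Dissimilarity matrix (one mode):
$$\begin{pmatrix} 0 & & & & \\ d_{21} & 0 & & & \\ d_{31} & d_{32} & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d_{n1} & d_{n2} & \cdots & \cdots & 0 \end{pmatrix}$$
How to compute $d_{ij}$: dissimilarity is a distance; similarity is the inverse of distance.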

35 Proximity Measure
Dissimilarity:
The Euclidean distance: $d_{ij} = \sqrt{\sum_{z=1}^{m} (x_{iz} - x_{jz})^2}$
The Manhattan distance: $d_{ij} = \sum_{z=1}^{m} |x_{iz} - x_{jz}|$
Weighted distance: $d_{ij} = \sqrt{\sum_{z=1}^{m} w_z (x_{iz} - x_{jz})^2}$
Similarity:
Inner product: $s_{ij} = \sum_{z=1}^{m} x_{iz} x_{jz}$
Cos coefficient: $s_{ij} = \dfrac{\sum_{z=1}^{m} x_{iz} x_{jz}}{\sqrt{\sum_{z=1}^{m} x_{iz}^2} \, \sqrt{\sum_{z=1}^{m} x_{jz}^2}}$
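
The proximity measures named on this slide can be written directly in Python. This is a plain-Python sketch over equal-length lists of term weights; the weighted variant assumes the weighted Euclidean form with one weight per term, and the choice to return 0.0 for the cosine of a zero vector is an assumption of the sketch.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def weighted_euclidean(x, y, w):
    # w holds one non-negative weight per dimension (term)
    return math.sqrt(sum(wz * (a - b) ** 2 for a, b, wz in zip(x, y, w)))

def inner_product(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    norm = math.sqrt(inner_product(x, x)) * math.sqrt(inner_product(y, y))
    return inner_product(x, y) / norm if norm else 0.0
```

Unlike the inner product, the cosine coefficient is unaffected by vector length, so a long document and a short one with the same term proportions get similarity 1.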

36 4. Clustering Approach and/or Algorithm (1) Using contents of documents (2) Using user's usage logs (3) Using current search engines (4) Using hyperlinks (5) Other classical methods

37 (1) Using Contents of Documents Clustering is based on the snippets returned by web search engines; such clusters can be as good as clusters created using the full text of the Web documents. Suffix Tree Clustering (STC): an incremental, O(n) time algorithm with three logical steps: (1) document cleaning, (2) identifying base clusters using a suffix tree, (3) combining these base clusters into clusters.

38 (2) Using user's usage logs Relevancy information is objectively reflected by the usage logs. An experimental result:
Cluster 1: /shuttle/missions/41-c/news, /shuttle/missions/61-b
Cluster 2: /history/apollo/sa-2/news/, /history/apollo/sa-2/images
Cluster 3: /software/winvn/userguide/3_3_2.htm, /software/winvn/userguide/3_3_4.htm

39 (3) Using current web search engines - Metacrawler -

40 (4) Using hyperlinks Cluster web documents based on both the textual and the hyperlink information, with the hyperlink structure used as the dominant factor in the similarity metric. Kleinberg's HITS (Hypertext Induced Topic Selection) algorithm is based purely on hyperlink information: it finds authority and hub documents for a user query. It tends to cover only the most popular topics and leave out the less popular ones.
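
The hub/authority computation behind HITS can be sketched as a simple power iteration in Python. This is a simplified sketch under stated assumptions: the link graph `links` is a hypothetical four-page example rather than a query-specific subgraph, scores are L2-normalized after each update, and the iteration count is fixed instead of testing for convergence.

```python
import math

def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    def normalize(scores):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {p: v / norm for p, v in scores.items()}

    for _ in range(iterations):
        # authority of p: sum of the hub scores of the pages that link to p
        auth = normalize({p: sum(hub[q] for q in links if p in links[q]) for p in pages})
        # hub of p: sum of the authority scores of the pages that p links to
        hub = normalize({p: sum(auth[q] for q in links.get(p, ())) for p in pages})
    return hub, auth

# Hypothetical link graph: h1 and h2 are link pages, a and b are content pages.
links = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
hub, auth = hits(links)
```

In this toy graph, page a receives links from both hubs and ends up with the highest authority score, while h1, which points to both content pages, is the strongest hub.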

41 Other classical approaches

42 Hierarchical Clustering Clusters are created in levels, actually creating a set of clusters at each level. Agglomerative (bottom up): initially each item is in its own cluster; iteratively, clusters are merged together. Divisive (top down): initially all items are in one cluster; large clusters are successively divided.

43 Hierarchical Clustering (cont.) Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition. [Figure: items a, b, c, d, e. In the agglomerative direction (Step 0 to Step 4), a and b merge into {a, b}, d and e into {d, e}, c and {d, e} into {c, d, e}, and finally everything into {a, b, c, d, e}; the divisive direction runs the same steps in reverse.]

44 Agglomerative Example [Figure: dendrogram over items A, B, C, D, E with a distance threshold]

45 Distance Between Two Clusters Single-link method (nearest neighbor): min distance. Complete-link method (furthest neighbor): max distance. Average-link method (all cross-cluster pairs): average distance.

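46 Single-Link Method (Euclidean distance) [Figure: distance matrices over items a, b, c, d at each step] Merge sequence: (1) {a, b}, c, d; (2) {a, b, c}, d; (3) {a, b, c, d}. Distances: d(a, b) = 2; after the first merge, d({a, b}, c) = 3, d({a, b}, d) = 5, d(c, d) = 4; after the second merge, d({a, b, c}, d) = 4.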

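47 Complete-Link Method (Euclidean distance) [Figure: distance matrices over items a, b, c, d at each step] Merge sequence: (1) {a, b}, c, d; (2) {a, b}, {c, d}; (3) {a, b, c, d}. Distances: d(a, b) = 2; after the first merge, d({a, b}, c) = 5, d({a, b}, d) = 6, d(c, d) = 4; after the second merge, d({a, b}, {c, d}) = 6.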
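
The two linkage rules can be compared with a short agglomerative clustering sketch in Python. The pairwise distances in `D` are an assumption: they are one assignment that is consistent with the merged distances that survive on the two slides above (2, then 3 and 4 for single-link; 2, then 4 and 6 for complete-link), not values taken from the original matrix. The function name `agglomerate` is likewise made up for the example.

```python
def agglomerate(dist, linkage):
    """Merge clusters until one remains; return a list of (cluster_a, cluster_b, distance).

    dist maps frozenset({p, q}) to a distance; linkage is min (single-link)
    or max (complete-link).
    """
    items = sorted({p for pair in dist for p in pair})
    clusters = [(p,) for p in items]
    merges = []

    def cluster_distance(c1, c2):
        return linkage(dist[frozenset((p, q))] for p in c1 for q in c2)

    while len(clusters) > 1:
        # pick the closest pair of clusters under the chosen linkage
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Assumed pairwise distances, chosen to be consistent with the slide values.
D = {
    frozenset("ab"): 2, frozenset("ac"): 5, frozenset("ad"): 6,
    frozenset("bc"): 3, frozenset("bd"): 5, frozenset("cd"): 4,
}
single = agglomerate(D, min)
complete = agglomerate(D, max)
```

With these distances, single-link attaches c to {a, b} at distance 3 and then d at distance 4, whereas complete-link first pairs c with d at distance 4 and joins the two pairs at distance 6.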

48 Example of the Chaining Effect Single-link (10 clusters) Complete-link (2 clusters)

49 Effect of Bias towards Spherical Clusters Single-link (2 clusters) Complete-link (2 clusters)

50 Partitioning clustering Partitioning method: construct a partition of a database D of n objects into a set of k clusters. Given k, find a partition into k clusters that optimizes the chosen partitioning criterion. Global optimum: exhaustively enumerate all partitions. Heuristic methods: the k-means and k-medoids algorithms. k-means (MacQueen 67): each cluster is represented by the center of the cluster. k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw 87): each cluster is represented by one of the objects in the cluster.

51 K-means algorithm
1. Randomly select the initial cluster centers.
2. Assign each object to a cluster using the distance between the object and each center.
3. Re-compute the center of each cluster.
4. Return to step 2 until the stopping criterion is satisfied.
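
The steps of the k-means algorithm can be sketched as follows in plain Python. The sketch assumes squared Euclidean distance, uses "assignments no longer change" as the stopping criterion, and keeps the old center when a cluster becomes empty; the `seed` parameter and the toy points are illustrative choices, not part of the slide.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # step 2: assign each point to its nearest center
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:  # stopping criterion: assignments are stable
            break
        assignment = new_assignment
        # step 3: recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if the cluster is empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

On this toy data the two well-separated groups of three points end up in different clusters; on less clean data the result depends on the initial centers, which is why k-means is usually restarted several times.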

52 K-means algorithm Example

53 What is the problem of k-means? The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. K-medoids: instead of taking the mean value of the objects in a cluster as the reference point, use a medoid, which is the most centrally located object in the cluster.
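
A one-dimensional toy example in Python makes the difference between the mean and the medoid concrete. The data values are made up, and the medoid is taken to be the member with the smallest total absolute distance to the other members.

```python
def mean_point(values):
    return sum(values) / len(values)

def medoid(values):
    # the member with the smallest total distance to all members
    return min(values, key=lambda v: sum(abs(v - u) for u in values))

cluster = [1, 2, 3, 4, 100]  # 100 is an outlier
center_by_mean = mean_point(cluster)
center_by_medoid = medoid(cluster)
```

The outlier pulls the mean to 22.0, far from most of the members, while the medoid stays at 3, an actual object in the middle of the cluster.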

54 5. Data Abstraction - Cluster labelling - How do we generate meaningful descriptions, or labels, for clusters? This is an open issue. Possible solutions: list the 5-10 most frequent features from the cluster (features such as index terms, noun phrases, proper names); show features that occur frequently in this cluster and not frequently in others (differential labeling); summarize multiple documents in the cluster.
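
Differential labeling can be sketched in Python by scoring each term by how much more frequent it is inside a cluster than outside it. The scoring rule used here (relative frequency inside minus relative frequency in the other clusters) is one simple assumption among many possible ones, and the function name and toy clusters are hypothetical. Clusters are assumed to be non-empty.

```python
from collections import Counter

def differential_labels(clusters, top_n=3):
    """clusters: list of clusters, each a list of token lists. Returns top terms per cluster."""
    counts = [Counter(t for doc in cluster for t in doc) for cluster in clusters]
    total = sum(counts, Counter())
    total_size = sum(total.values())
    labels = []
    for c in counts:
        size = sum(c.values())
        rest_size = total_size - size

        def score(term):
            inside = c[term] / size
            outside = (total[term] - c[term]) / rest_size if rest_size else 0.0
            return inside - outside

        # highest score first; ties broken alphabetically for reproducibility
        ranked = sorted(c, key=lambda t: (-score(t), t))
        labels.append(ranked[:top_n])
    return labels

clusters = [
    [["jaguar", "car", "dealer"], ["jaguar", "car", "engine"]],
    [["jaguar", "cat", "habitat"], ["jaguar", "cat", "prey"]],
]
labels = differential_labels(clusters, top_n=1)
```

Plain frequency would rank "jaguar" at the top of both clusters; the differential score cancels it out because it is equally frequent everywhere, leaving "car" and "cat" as the labels.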

55 6. Cluster validation How can we tell if a clustering is good or not? Ask users whether they agree with the clusters; however, agreement with human clustering is problematic (Macskassy et al., 1998). Use statistical techniques to measure qualities of the clusters, e.g. the purity, or the divergence from a random clustering. For information retrieval: do we manage to cluster relevant documents together? This is the holy grail for cluster-based IR, and it can be evaluated through cluster-based retrieval.

56 Approaches to evaluating Anecdotal: "I wrote this clustering algorithm and look what it found!" No benchmarks, no comparison possible. User inspection: experts evaluate the results and score them; expensive and time-consuming. Ground truth comparison: compare clustering results to a known taxonomy like Yahoo; the static prior taxonomy may be incomplete or wrong in places. Microeconomic / utility: net economic gain produced by an approach (vs. another approach); how do various clustering methods affect the quality of what's retrieved? Compare two IR algorithms: 1. send query, present ranked results; 2. send query, cluster results, present clusters. Purely quantitative measures: probability of generating the clusters found; average distance between cluster members.

57 Cluster Validity Assessment $\delta(C_i, C_j)$: inter-cluster distance. $\Delta(C_k)$: intra-cluster distance. [Figure: clusters annotated with $\Delta(C_k)$ and $\delta(C_i, C_j)$]

58 Inter-cluster distance $\delta(C_i, C_j)$: average linkage; centroid linkage; complete linkage; single linkage; average to centroids linkage

59 Intra-cluster distance $\Delta(C_k)$: complete diameter; average diameter; centroid diameter

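60 DB (Davies-Bouldin) Index
$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{S_i + S_j}{S_{ij}}, \qquad S_{ij} = \lVert z_i - z_j \rVert, \qquad S_i = \frac{1}{|C_i|} \sum_{x \in C_i} \lVert x - z_i \rVert$$
$K$: the number of clusters. $S_{ij}$: distance between the center $z_i$ of cluster $C_i$ and the center $z_j$ of cluster $C_j$. $S_i$: scatter within cluster $C_i$ (the average distance of its members to $z_i$).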

61 Davies-Bouldin (DB) Index (2) The objective: minimize the index over the number of clusters. The optimal number of clusters K is the one whose clustering minimizes the DB index. Small values of DB correspond to good clusters.
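
The Davies-Bouldin index can be computed with a few lines of Python. This sketch assumes Euclidean distance, takes the centroid as the cluster center, and defines the scatter of a cluster as the mean distance of its members to that centroid; the two toy clusterings are made up to show that a better-separated clustering scores lower. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math

def centroid(cluster):
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))

def davies_bouldin(clusters):
    z = [centroid(c) for c in clusters]
    # S_i: mean distance of the members of cluster i to its centroid z_i
    s = [sum(math.dist(x, zi) for x in c) / len(c) for c, zi in zip(clusters, z)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # worst-case ratio of combined scatter to centroid separation
        total += max((s[i] + s[j]) / math.dist(z[i], z[j]) for j in range(k) if j != i)
    return total / k

well_separated = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
close_together = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
db_far = davies_bouldin(well_separated)
db_near = davies_bouldin(close_together)
```

Both clusterings have the same scatter (1.0 per cluster), so the index differs only through the centroid distance: 2/10 = 0.2 for the well-separated pair and 2/4 = 0.5 for the closer pair.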

62 Dunn's Index
$$v_D = \min_{1 \le i \le K} \left\{ \min_{1 \le j \le K,\ j \ne i} \left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le k \le K} \Delta(C_k)} \right\} \right\}$$
The objective: to identify sets of clusters that are compact and well separated, i.e. to maximize the inter-cluster distances and minimize the intra-cluster distances. Large values of $v_D$ correspond to good clusters.
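
Dunn's index depends on which inter-cluster distance and which intra-cluster distance are plugged in. The Python sketch below assumes single linkage for the inter-cluster distance and the complete diameter for the intra-cluster distance, both with Euclidean distance; other combinations from the preceding slides would give different values. The toy clusterings are made up, and `math.dist` requires Python 3.8 or later.

```python
import math

def dunn_index(clusters):
    def delta(a, b):
        # single linkage: distance of the closest cross-cluster pair
        return min(math.dist(x, y) for x in a for y in b)

    def diameter(c):
        # complete diameter: distance of the farthest pair inside the cluster
        return max(math.dist(x, y) for x in c for y in c)

    max_diameter = max(diameter(c) for c in clusters)
    k = len(clusters)
    min_separation = min(
        delta(clusters[i], clusters[j]) for i in range(k) for j in range(i + 1, k)
    )
    return min_separation / max_diameter

well_separated = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
close_together = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
dunn_far = dunn_index(well_separated)
dunn_near = dunn_index(close_together)
```

Both clusterings have diameter 2, so the index is 10/2 = 5 for the well-separated pair and 4/2 = 2 for the closer pair, in line with larger values indicating better clusters.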

63 Other Cluster Validity Measures Calinski Harabasz (CH) Index; Index I; Xie and Beni; F-Measure; Λ-Measure; Silhouette; Symmetry; Graph-Based Boundary Analysis; Partition Coefficient; Separation Index; Classification Entropy (CE); etc.

64 Other clustering applications Relevance feedback (Iwayama, 2000). Search engine query logs: cluster queries on the basis of the documents that users selected from these queries (Beeferman & Berger, 2000). Recommender systems: cluster users based on preferences, purchases, etc. (Sarwar et al., 2002). Topic detection and tracking (TDT) and novelty detection: cluster news stories together and try to find emerging new topics (Allan, 2001); cluster previously seen information, and measure the novelty of incoming information by its distance to that already seen (TREC Novelty track, 2002, 2003).

65 Discussion In general, clustering presents problems and challenges: selection of attributes for clustering; selection of clustering method; generation of cluster representations; validity/quality of the generated clustering; updating clustering structures (e.g. inserting or deleting a document from a hierarchy); effectiveness gains are not always evident; computational expense.

66 Looking back Document clustering has been applied to IR for over 30 years. The research focus has shifted over the years: efficiency (70s), effectiveness (80s), other applications such as browsing and visualisation (90s), dynamic clustering (00s), and then? But there are still many unresolved issues: cluster representation; cluster-based search strategies; dynamic clustering (on a per-query basis); algorithmic aspects; document representation for clustering (e.g. reduce noise by using only the most important part of a document); new applications for clustering in IR.

67 As a conclusion Clustering is a useful tool for IR. It is theoretically solid and intuitively plausible. Although it has been used for over 30 years, there are still open issues. It is still an interesting area to research.

68 Some references
If you only read one article/reference:
Tombros, A. PhD Thesis, Chapter 3 (optionally 4 & 5).
Willett, P. Recent trends in hierarchic document clustering: A critical review. Information Processing & Management, 24(5).
More than worth a look:
van Rijsbergen, C.J. Information Retrieval. London: Butterworths, 2nd Edition, 1979.
Also recommended:
Hearst, M.A. and Pedersen, J.O. Re-examining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In Proceedings of ACM SIGIR '96, pages 76-84, 1996.

69 Some references
Zamir, O. and Etzioni, O. Web document clustering: A feasibility demonstration. In Proceedings of ACM SIGIR '98, pages 46-54.
Iwayama, M. Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering. In Proceedings of ACM SIGIR '00, pages 10-16.
Beeferman, D. and Berger, A. Agglomerative clustering of a search engine log. In Proceedings of the 6th International Conference on Knowledge Discovery in Data.
Sarwar, B.M., Karypis, G., Konstan, J. and Riedl, J. Recommender Systems for Large-Scale E-Commerce: Scalable Neighborhood Formation Using Clustering. Proceedings of the 5th International Conference on Computer and Information Technology.
Allan, J., Carbonell, J., Doddington, G., Yamron, J. and Yang, Y. Topic detection and tracking pilot study. In Topic Detection and Tracking Workshop Report, 2001.

70 Some references
Macskassy, S.A., Banerjee, A., Davidson, B.D. and Hirsh, H. Human performance on clustering web pages: a preliminary study. In Proceedings of the 4th Knowledge Discovery and Data Mining Conference (KDD-98).
Goldszmidt, M. and Sahami, M. A Probabilistic Approach to Full-Text Document Clustering. Technical Report ITAD-433-MS, SRI International.
Bradley, P., Fayyad, U. and Reina, C. Scaling EM (expectation maximization) algorithm to large databases. Microsoft Research Technical Report MSR-TR-98.
Cutting, D.R., Karger, D.R., Pedersen, J.O. and Tukey, J.W. Scatter/Gather: A cluster based approach to browsing large document collections. In Proceedings of ACM SIGIR '92, 1992.

71 Acknowledgement C. Manning and P. Raghavan Why cluster documents Cluster Validation, T. Tombros, Clustering for IR,

72 External Evaluation of Cluster Quality Assesses a clustering with respect to ground truth. Assume that there are C gold standard classes, while our clustering algorithm produces k clusters $\pi_1, \pi_2, \dots, \pi_k$, where cluster $\pi_i$ has $n_i$ members. Simple measure: purity, the ratio between the size of the dominant class in cluster $\pi_i$ and the size of cluster $\pi_i$:
$$\mathrm{Purity}(\pi_i) = \frac{1}{n_i} \max_{j} (n_{ij}), \quad j \in C$$
where $n_{ij}$ is the number of members of class j in cluster $\pi_i$. Other measures are the entropy of classes in clusters (or the mutual information between classes and clusters).

73 Purity [Figure: three clusters, Cluster I, Cluster II, and Cluster III, containing points from three classes] Cluster I: Purity = (1/6) max(5, 1, 0) = 5/6. Cluster II: Purity = (1/6) max(1, 4, 1) = 4/6. Cluster III: Purity = (1/5) max(2, 0, 3) = 3/5.
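
The purity values on this slide can be reproduced with a short Python sketch. Each cluster is given as a tuple of per-class member counts taken from the slide. The `overall_purity` function, which weights each cluster by its size, is not on the slide; it is a commonly used way to summarize purity for a whole clustering and is included here as an assumption.

```python
def cluster_purity(class_counts):
    """class_counts[j] = number of members of gold class j in this cluster."""
    return max(class_counts) / sum(class_counts)

def overall_purity(clusters):
    # size-weighted average of the per-cluster purities
    return sum(max(c) for c in clusters) / sum(sum(c) for c in clusters)

# Per-class counts for Cluster I, II, and III as given on the slide.
clusters = [(5, 1, 0), (1, 4, 1), (2, 0, 3)]
purities = [cluster_purity(c) for c in clusters]
total = overall_purity(clusters)
```

The per-cluster values are 5/6, 4/6, and 3/5; the size-weighted overall purity is (5 + 4 + 3) / 17 = 12/17.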
