Web Document Clustering
|
|
- Dustin Carson
- 5 years ago
- Views:
Transcription
1 Web Document Clustering IT
2 Contents IR and Web Search Problems on Web Search Results Document Clustering? Why Cluster Documents on Web Search Results? Applications of Document Clustering Clustering Approaches More Applications Discussions
3 Typical IR Task Given: A corpus of textual natural-language documents. A user query in the form of a textual string. Find: A ranked set of documents that are relevant to the query.
4 IR System Document corpus Query String IR System Ranked Documents 1. Doc1 2. Doc2 3. Doc3..
5 Relevance a subjective judgment may include: Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need).
6 Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
7 Intelligent IR Taking into account the meaning of the words used. Taking into account the order of words in the query. Taking into account the authority of the source. Adapting to the user based on the direct or indirect feedback.
8 Web Search Application of IR to HTML documents on the World Wide Web. Differences with typical IR: Must assemble document corpus by spidering the web. Can use the structural layout information in HTML (XML). Can use the link structure of the web. Documents change uncontrollably.
9 Web Search System Web Spider Document corpus Query String IR System 1. Page1 2. Page2 3. Page3.. Ranked Documents
10 Web users information needs classification based on the type of answers expected by the Web user 1. Question answering type of answer : a very short answer (single sentence, part of the document) query example What is the population of Korea? 2. Named page finding - Homepage finding type of answer : a single document containing specific information query example Authors instructions for the KISS journal 3. Topic relevance type of answer: a range of documents relevant to the topic of their information need query example KOREA s immigration policy 4. Online services type of answer: a range of documents that allow the user to access a particular service query example download mp3 5. Topic distillation type of answer: a range of documents that are relevant key resources on the topic (Relevance + Quality) query example highway safety
11 Web search results Big problem A long list of search results, ranked by their relevancies to the given query Engine : search engine, car part, Engine Corp. A time consuming task when multiple sub-topics of the given query are mixed together. Possible solution To (online) cluster web search results Evidence Relationships between the results Cluster Hypothesis (van Rijsbergen 1979): Closely related documents tend to be relevant to the same requests. Potential results Aids user-engine interaction Browsing Help users express his need
12 An example - Vivisimo
13 What is clustering? Partition unlabeled examples into disjoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different the act of grouping similar objects into sets unsupervised classification no predefined classes Typical applications to get insight into data as a preprocessing step
14 What is clustering? (cont.)
15 What is document clustering? Document clustering : To group similar documents into sets To group similar or related documents into a common cluster. Eg) Scatter/Gather
16 Why not just document clustering? Web search results clustering is a version of document clustering, but Billions of pages Constantly changing Data mainly unstructured and heterogeneous Additional information to consider (i.e. links, clickthrough data, etc.)
17 Why cluster web documents? 1. For whole corpus analysis/navigation Better user interface 2. For improving recall in search applications Better search results 3. For better navigation of search results Effective user recall will be higher 4. For speeding up vector space retrieval Faster search
18 1. Corpus analysis/navigation Standard IR Document clusters Some time, TOC is very useful Index Aardvark, 15 Blueberry, 200 Capricorn, 1, Dog, Egypt, 65 Falafel, Giraffes, Table of Contents 1. Science of Cognition 1.a. Motivations 1.a.i. Intellectual Curiosity 1.a.ii. Practical Applications 1.b. History of Cognitive Psychology 2. The Neural Basis of Cognition 2.a. The Nervous System 2.b. Organization of the Brain 2.c. The Visual System 3. Perception and Attention 3.a. Sensory Memory 3.b. Attention and Sensory Information Processing
19 1. Corpus analysis/navigation (cont.) Document clustering Can induce a tree of topics To allow user to browse through corpus to find information Crucial need: meaningful labels for topic nodes. Yahoo!: manual hierarchy Often not available for new document collection
20 For visualizing a document collection and its themes Wise et al, Visualizing the non-visual PNNL ThemeScapes, Cartia [Mountain height = cluster size]
21 2. Improvement of search recall Cluster hypothesis Documents with similar text are related Therefore, to improve search recall: To cluster docs in corpus a priori (pre-clustering) To return other docs in the cluster containing D when a query matches a doc D. Hope if we do this: The query car will also return docs containing automobile Because clustering grouped together docs containing car with those containing automobile.
22 3. Better navigation of search results For grouping search results thematically clusty.com / Vivisimo
23 3. Better navigation of search results - Kartoo.com -
24 3. Better navigation of search results - iboogie.com -
25 3. Better navigation of search results -mooter.com-
26 3. Better navigation of search results One can also view grouping documents with the same sense of a word as clustering Given the results of a search (say Jaguar, or NLP), partition into groups of related docs Can be viewed as a form of word sense disambiguation E.g., jaguar may have senses: The car company The animal The football team The video game
27 4. Speeding up vector space retrieval In vector space retrieval, To find nearest doc vectors to query vector To entail finding the similarity of the query to every doc slow (for some applications) By clustering docs in corpus a priori To find nearest docs in cluster close to query To be inexact but avoid exhaustive similarity computation
28 What Is A Good Clustering? Internal criterion: A good clustering will produce high quality clusters in which: the intra-class (that is, intra-cluster) similarity is high the inter-class similarity is low The measured quality of a clustering depends on both the document representation and the similarity measure used External criterion: The quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes Assessable with gold standard data
29 Main issues Online or offline clustering? What to use as input Entire documents Snippets Structure information (links) Other data (i.e. click-through) Use stop word lists, stemming, etc. How to define similarity? Content (i.e. vector-space model) Link analysis Usage statistics How to group similar documents? How to label the groups?
30 Components Of Clustering System 1. Feature selection/extraction the most effective subset from patterns 2. Feature representation one or more transformations of the input features (attributes) to produce new salient features. 3. Pattern proximity (measure similarity) The distance of pairs of patterns (features). 4. Clustering (Grouping) Hard (a partition of the data into groups) Fuzzy (each pattern has a variable degree) 5. Data abstraction Extracting a compact representation of a data set Automatic analysis, Human-oriented. 6. Evaluation to measure the performance of competing algorithms
31 1. Feature Selection/Extraction What is the representation of documents? document = a set of features What is the features? Linguistic structure in documents Co-occurrences of Terms : n-gram Semantic structure : case structure Meta-data of documents Authors Citation Co-citation document : documents cites the examined documents bibliographic coupling : documents are cited by the examined documents
32 1. Feature Selection/Extraction Which terms to use as axes for vector space? Better is to use highest weight mid-frequency words the most discriminating terms Pseudo-linguistic heuristics, e.g., drop stop-words stemming/lemmatization use only nouns/noun phrases Good clustering should figure out some of these
33 2. Feature representation Documents are transformed into vectors of weighted terms Data matrix (two modes) d d d 1 2 Μ n t x x x Μ n1 t x x x Μ n2 Λ Λ Λ Ο Λ t x x x m 1m 2m Μ nm How to compute x ij tf*idf x = tf idf ij ij i idf i = log N df i
34 3. Pattern proximity (measure similarity) Dissimilarity matrix (one mode) How to compute d ij Dissimilarity - Distance Similarity - inverse of distance 0 d21 d31 Μ dn1 d d 0 32 Μ n2 0 Μ
35 Proximity Measure Dissimilarity The Euclidean distance: The Manhattan distance: Weighted distance: Similarity Inner Product Cos Coefficient d d d s s ij ij ij ij ij = = = = m z= 1 m z= 1 m z= 1 m z= 1 m x w z= 1 = m ( x y ) iz z ( x iz ( x x 2 iz iz iz y jz jz 2 ( x y ) x iz x m jz jz ) ) x 2 jz jz 2
36 4. Clustering Approach and/or Algorithm (1) Using contents of documents (2) Using user s usage logs (3) Using current search engines (4) Using hyperlinks (5) Other classical methods
37 (1) Using Contents of Documents Based on snippets returned by web search engines. Be as good as clusters created using the full text of Web documents. Suffix Tree Clustering (STC) : incremental, O(n) time algorithm three logical steps: (1) document cleaning (2) identifying base clusters using a suffix tree (3) combining these base clusters into clusters
38 (2) Using user s usage logs Relevancy information is objectively reflected by the usage logs An experimental result on Cluster 1 Cluster 2 Cluster 3 /shuttle/missions/41-c/news /shuttle/missions/61-b /history/apollo/sa-2/news/ /history/apollo/sa-2/images /software/winvn/userguide/3_3_2.htm /software/winvn/userguide/3_3_4.htm.
39 (3) Using current web search engines - Metacrawler -
40 (4) Using hyperlinks Cluster web documents based on both the textual and hyperlink The hyperlink structure is used as the dominant factor in the similarity metric Kleinberg s HITS (Hypertext Induced Topic Selection) algorithm based purely on hyperlink information. authority and hub documents for a user query. only cover the most popular topics and leave out the less popular ones.
41 Other classical approaches
42 Hierarchical Clustering Clusters are created in levels actually creating sets of clusters at each level. Agglomerative Initially each item in its own cluster Iteratively clusters are merged together Bottom Up Divisive Initially all items in one cluster Large clusters are successively divided Top Down
43 Hierarchical Clustering (Cent) Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition. Step 0 Step 1 Step 2 Step 3 Step 4 a a b b a b c d e c c d e d d e e Step 4 Step 3 Step 2 Step 1 Step 0 agglomerative divisive
44 Agglomerative Example E D C B A E D C B A B A E C D 4 Threshold of A B C D E
45 Distance Between Two Clusters Single-Link Method : Nearest Neighbor Complete-Link : Furthest Neighbor Average-link : all cross-cluster pairs. Min distance Max distance Average distance
46 Single-Link Method Euclidean Distance a b c d a,b a,b,c a,b,c,d c d d (1) (2) (3) a b c b 2 c d a b c b 2 c d a, b c c 3 d 5 4 a, b, c d 4 Distance Matrix
47 Complete-Link Method Euclidean Distance a b c d a,b a,b a,b,c,d c d c,d (1) (2) (3) a b c b 2 c d a b c b 2 c d a, b c c 5 d 6 4 a, b c, d 6 Distance Matrix
48 Example of the Chaining Effect Single-link (10 clusters) Complete-link (2 clusters)
49 Effect of Bias towards Spherical Clusters Single-link (2 clusters) Complete-link (2 clusters)
50 Partitioning clustering Partitioning method: Construct a partition of a database D of n objects into a set of k clusters Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion Global optimal: exhaustively enumerate all partitions Heuristic methods: k-means and k-medoids algorithms k-means (MacQueen 67): Each cluster is represented by the center of the cluster k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw 87): Each cluster is represented by one of the objects in the cluster
51 K-means algorithm Initial center of cluster are randomly selected Assign objects to cluster using distances between center and object Re-compute the center of each cluster Return step2 until stopping criteria is satisfied
52 K-means algorithm Example
53 What is the problem of k-means? The k-means algorithm is sensitive to outliers! Since an object with an extremely large value may substantially distort the distribution of the data. K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster
54 5. Data Abstraction - Cluster labelling - How do we generate meaningful descriptions, or labels, for clusters? it is an open issue Possible solution To list 5-10 most frequent features from the cluster features: index terms, noun phrases, proper names, To show features that occur frequently in this cluster and not frequently in others Differential labeling To summarize multiple documents in the cluster
55 6. Cluster validation How can we tell if a clustering is good or not? ask users whether they agree with the clusters agreement with human clustering is problematic (Macskassy et al., 1998) use statistical techniques to measure qualities of the cluster, e.g. the purity, or the divergence from random clustering For information retrieval do we manage to cluster relevant documents together? This is the holy grail for cluster-based IR can be evaluated through cluster-based retrieval
56 Approaches to evaluating Anecdotal I wrote this clustering algorithm and look what it found! No benchmarks, no comparison possible User inspection Experts evaluate the results and score them Expensive / time-consuming Ground truth comparison To compare clustering results to a known taxonomy like Yahoo The static prior taxonomy may be incomplete/wrong in places Microeconomic / utility Net economic gain produced by an approach (vs. another approach) How do various clustering methods affect the quality of what s retrieved? Compare two IR algorithms 1. send query, present ranked results 2. send query, cluster results, present clusters Purely quantitative measures Probability of generating clusters found Average distance between cluster members
57 Cluster Validity Assessment δ(c i, C j ) : inter-cluster distance (C k ) : intra-cluster distance (C K ) δ(c i, C j )
58 Inter-cluster distance : Average linkage δ(c i, C j ) Centroid linkage Complete linkage Single linkage Average to centroids linkage
59 Intra-cluster distance: (C k ) Complete diameter Average diameter Centroid diameter
60 DB(Davies-Bouldin) Index S ij K 1 S i + S j DB = max K i j i= 1 Sij 1 = zi z j S = i x C { x zi } i C i K : the number of clusters S ij : Distance between the center z i of cluster i C i & the center z j of cluster j C j S i : Scatter between C i and C j
61 Davies-Bouldin (DB) Index (2) The objective: To minimize the index the number of clusters optimal number of clusters, K The clusters with minimized DB index Small values of DB correspond to good clusters
62 Dunn s Index v D = The objective To identify sets of clusters that are compact and well separated to maximize the inter-cluster distances and minimize the intra-cluster distances Large values of V D correspond to good clusters δ ( C i j min min 1 i K 1 j K max ) j i k 1 k K, C ) { ( C }
63 Other Cluster Validity Measures Calinski Harabasz (CH) Index Index I Xie and Beni F-Measure Λ-Measure Silhouette Symmetry Graph-Based Boundary Analysis Partition Coefficient Separation Index Classification Entropy (CE) Etc.
64 Other clustering applications Relevance feedback (Iwayama, 2000) Search engine query logs cluster queries on the basis of the documents that users selected from these queries (Beeferman & Berger, 2000) Recommender systems cluster users based on preferences, purchases, etc. (Sarwar et al., 2002) Topic detection and tracking (TDT) & Novelty Detection cluster news stories together and try to find emerging new topics (Allan, 2001) cluster previously seen information, and measure novelty of incoming information by its distance to that already seen (TREC Novelty track, 2002, 2003)
65 Discussion In general, clustering presents problems and challenges: Selection of attributes for clustering Selection of clustering method Generation of cluster representations Validity/quality of the generated clustering Updating clustering structures e.g. inserting or deleting a document from a hierarchy Effectiveness gains are not always evident Computational expenses
66 Looking back Document clustering has been applied to IR for over 30 years the research focus has shifted over the years efficiency (70s), effectiveness (80s), other applications (e.g. browsing, visualisation (90s)), dynamic clustering (00s),? But, there are still many unresolved issues: Cluster representation Cluster-based search strategies Dynamic clustering (per-query basis) Algorithmic aspects Document representation for clustering (e.g. reduce noise by using only the most important part of a document) New applications for clustering in IR
67 As a conclusion Clustering is a useful tool for IR It is theoretically solid and intuitively plausible Although it has been used for over 30 years there are still open issues It is still an interesting area to research
68 Some references If you only read one article/reference: Tombros, A., PhD Thesis, Chapter 3 (optionally 4 & 5) available at: Willett, P., Recent trends in hierarchic document clustering: A critical review. Information Processing & Management, 24(5): , More than worth to have a look at: van Rijsbergen, C.J., Information Retrieval. London: Butterworths, 2 nd Edition, 1979; available at Also recommended : Hearst, M.A. and Pedersen, J.O. Re-examining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In Proceedings of ACM SIGIR 96, pages 76-84, 1996.
69 Some references Zamir, O. and Etzioni, O. Web document clustering: A feasibility demonstration. In Proceedings of ACM SIGIR 09, pages 46-54, Iwayama, M. Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering. In Proceedings of ACM SIGIR 00, pages 10-16, Beeferman, D. and Berger, A. Agglomerative clustering of a search engine log. In Proceedings of the 6 th International Conference on Knowledge Discovery in Data, pages , B.M. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Recommender Systems for Large-Scale E-Commerce: Scalable Neighborhood Formation Using Clustering. Proceedings of the 5 th International Conference on Computer and Information Technology, J. Allan, J. Carbonell, G. Doddington, J. Yamron and Y. Yang. Topic detection and tracking pilot study. In Topic Detection and Tracking Workshop Report, 2001.
70 Some references Macskassy, S.A., Banerjee, A., Davidson, B.D., Hirsh, H. Human performance on clustering web pages: a preliminary study. In Proceedings of The 4 th Knowledge Discovery and Data Mining Conference (KDD-98), pp , Goldszmidt, M. and Sahami, M. A Probabilistic Approach to Full-Text Document Clustering. Technical Report ITAD-433-MS , SRI International, Available at Bradley, P., Fayyad, U., Reina, C. Scaling EM (expectation maximization) algorithm to large databases, Microsoft Research Technical Report, MSR-TR-98, Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W. Scatter/Gather: A cluster based approach to browsing large document collections. In Proceedings of ACM SIGIR 92, pages , 1992.
71 Acknowledgement C. Manning and P. Raghavan Why cluster documents Cluster Validation, T. Tombros, Clustering for IR,
72 External Evaluation of Cluster Quality Assesses clustering with respect to ground truth Assume that there are C gold standard classes, while our clustering algorithms produce k clusters, 1, 2,, k with n i members. Simple measure: purity, the ratio between the dominant class in the cluster i and the size of cluster i 1 Purity (π i ) = max j ( nij ) j C n i Others are entropy of classes in clusters (or mutual information between classes and clusters)
73 Purity Cluster I Cluster II Cluster III Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5
CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University
CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document
More informationCS47300: Web Information Search and Management
CS47300: Web Information Search and Management Text Clustering Prof. Chris Clifton 19 October 2018 Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti Document clustering Motivations Document
More informationClustering CE-324: Modern Information Retrieval Sharif University of Technology
Clustering CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch. 16 What
More informationhttp://www.xkcd.com/233/ Text Clustering David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Administrative 2 nd status reports Paper review
More informationHierarchical Document Clustering
Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationInformation Retrieval. Information Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent
More informationConcept-Based Document Similarity Based on Suffix Tree Document
Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri
More informationToday s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan
Today s topic CS347 Clustering documents Lecture 8 May 7, 2001 Prabhakar Raghavan Why cluster documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationINF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering
INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,
More informationClustering part II 1
Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:
More informationClustering. Bruno Martins. 1 st Semester 2012/2013
Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval
More informationInformation Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationUnderstanding Clustering Supervising the unsupervised
Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data
More informationClustering (COSC 488) Nazli Goharian. Document Clustering.
Clustering (COSC 488) Nazli Goharian nazli@ir.cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster
More information[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632
More informationLecture 8 May 7, Prabhakar Raghavan
Lecture 8 May 7, 2001 Prabhakar Raghavan Clustering documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics Given the set of docs from the results of
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationUnsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi
Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which
More informationDD2475 Information Retrieval Lecture 10: Clustering. Document Clustering. Recap: Classification. Today
Sec.14.1! Recap: Classification DD2475 Information Retrieval Lecture 10: Clustering Hedvig Kjellström hedvig@kth.se www.csc.kth.se/dd2475 Data points have labels Classification task: Finding good separators
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationData Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data
More informationLecture on Modeling Tools for Clustering & Regression
Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into
More informationPAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods
Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:
More informationINF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering
INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationClustering (COSC 416) Nazli Goharian. Document Clustering.
Clustering (COSC 416) Nazli Goharian nazli@cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,
More informationClustering. Chapter 10 in Introduction to statistical learning
Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What
More informationCS 6320 Natural Language Processing
CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Wiltrud Kessler & Hinrich Schütze Institute for Natural Language Processing, University of Stuttgart 0-- / 83
More informationCSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)
CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze Institute for
More informationFlat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017
Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:
More informationRoad map. Basic concepts
Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London What Is Text Clustering? Text Clustering = Grouping a set of documents into classes of similar
More informationDocument Clustering: Comparison of Similarity Measures
Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation
More informationINF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering
INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.
More informationCluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods
Cluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods Prof. S.N. Sawalkar 1, Ms. Sheetal Yamde 2 1Head Department of Computer Science and Engineering, Computer
More informationClustering Results. Result List Example. Clustering Results. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
More informationAutomatic Data Analysis in Visual Analytics Selected Methods
Automatic Data Analysis in Visual Analytics Selected Methods Multimedia Information Systems 2 VU (SS 2015, 707.025) Vedran Sabol Know-Center March 15 th, 2016 2 Lecture Overview Visual Analytics Overview
More informationEfficient Clustering of Web Documents Using Hybrid Approach in Data Mining
Efficient Clustering of Web Documents Using Hybrid Approach in Data Mining Pralhad Sudam Gamare 1, Ganpati A. Patil 2 1 P.G. Student, Computer Science and Technology, Department of Technology-Shivaji University-Kolhapur,
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationText Mining. Munawar, PhD. Text Mining - Munawar, PhD
10 Text Mining Munawar, PhD Definition Text mining also is known as Text Data Mining (TDM) and Knowledge Discovery in Textual Database (KDT).[1] A process of identifying novel information from a collection
More informationA NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING
A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING Jianhua Wang, Ruixu Li Computer Science Department, Yantai University, Yantai, Shandong, China Abstract: Key words: Document clustering methods
More informationA Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2
A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,
More informationClustering Web Documents using Hierarchical Method for Efficient Cluster Formation
Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College
More informationData Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394
Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 31 Table of contents 1 Introduction 2 Data matrix and
More informationClustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme
Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme Why do we need to find similarity? Similarity underlies many data science methods and solutions to business problems. Some
More informationText Documents clustering using K Means Algorithm
Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationExploratory Analysis: Clustering
Exploratory Analysis: Clustering (some material taken or adapted from slides by Hinrich Schutze) Heejun Kim June 26, 2018 Clustering objective Grouping documents or instances into subsets or clusters Documents
More informationPV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211
PV: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv IIR 6: Flat Clustering Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationClustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures
Clustering and Dissimilarity Measures Clustering APR Course, Delft, The Netherlands Marco Loog May 19, 2008 1 What salient structures exist in the data? How many clusters? May 19, 2008 2 Cluster Analysis
More informationCluster Analysis. Angela Montanari and Laura Anderlucci
Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationData Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1
Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A
More informationKapitel 4: Clustering
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.
More informationA Patent Retrieval Method Using a Hierarchy of Clusters at TUT
A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation
More informationInformation Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured
More informationDATA MINING - 1DL105, 1DL111
1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationBased on Raymond J. Mooney s slides
Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)
More informationCHAPTER THREE INFORMATION RETRIEVAL SYSTEM
CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost
More informationSOCIAL MEDIA MINING. Data Mining Essentials
SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate
More information9. Conclusions. 9.1 Definition KDD
9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]
More informationDatasets Size: Effect on Clustering Results
1 Datasets Size: Effect on Clustering Results Adeleke Ajiboye 1, Ruzaini Abdullah Arshah 2, Hongwu Qin 3 Faculty of Computer Systems and Software Engineering Universiti Malaysia Pahang 1 {ajibraheem@live.com}
More informationChapter 4: Text Clustering
4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can
More informationDATA MINING II - 1DL460. Spring 2014"
DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationINF4820, Algorithms for AI and NLP: Hierarchical Clustering
INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 16: Flat Clustering Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2009.06.16 1/ 64 Overview
More informationHigh Performance Text Document Clustering
Wright State University CORE Scholar Browse all Theses and Dissertations Theses and Dissertations 2007 High Performance Text Document Clustering Yanjun Li Wright State University Follow this and additional
More informationInternational Journal of Advanced Research in Computer Science and Software Engineering
Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:
More informationClustering of Web Search Results Based on Document Segmentation
Computer and Information Science; Vol. 6, No. 3; 23 ISSN 93-8989 E-ISSN 93-8997 Published by Canadian Center of Science and Education Clustering of Web Search Results Based on Document Segmentation Mohammad
More informationINF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task
More informationData Mining Algorithms
for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester
More informationCluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1
Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Hinrich Schütze Center for Information and Language Processing, University of Munich 04-06- /86 Overview Recap
More informationA New Measure of the Cluster Hypothesis
A New Measure of the Cluster Hypothesis Mark D. Smucker 1 and James Allan 2 1 Department of Management Sciences University of Waterloo 2 Center for Intelligent Information Retrieval Department of Computer
More informationCLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16
CLUSTERING CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16 1. K-medoids: REFERENCES https://www.coursera.org/learn/cluster-analysis/lecture/nj0sb/3-4-the-k-medoids-clustering-method https://anuradhasrinivas.files.wordpress.com/2013/04/lesson8-clustering.pdf
More informationCS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample
Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More informationInformation Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.
Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.
More informationAuthoritative K-Means for Clustering of Web Search Results
Authoritative K-Means for Clustering of Web Search Results Gaojie He Master in Information Systems Submission date: June 2010 Supervisor: Kjetil Nørvåg, IDI Co-supervisor: Robert Neumayer, IDI Norwegian
More informationECS 234: Data Analysis: Clustering ECS 234
: Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed
More informationApproaches to Mining the Web
Approaches to Mining the Web Olfa Nasraoui University of Louisville Web Mining: Mining Web Data (3 Types) Structure Mining: extracting info from topology of the Web (links among pages) Hubs: pages pointing
More information