Web Document Clustering

Size: px
Start display at page:

Download "Web Document Clustering"

Transcription

1 Web Document Clustering IT

2 Contents IR and Web Search Problems on Web Search Results Document Clustering? Why Cluster Documents on Web Search Results? Applications of Document Clustering Clustering Approaches More Applications Discussions

3 Typical IR Task Given: A corpus of textual natural-language documents. A user query in the form of a textual string. Find: A ranked set of documents that are relevant to the query.

4 IR System Document corpus Query String IR System Ranked Documents 1. Doc1 2. Doc2 3. Doc3..

5 Relevance a subjective judgment may include: Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need).

6 Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

7 Intelligent IR Taking into account the meaning of the words used. Taking into account the order of words in the query. Taking into account the authority of the source. Adapting to the user based on the direct or indirect feedback.

8 Web Search Application of IR to HTML documents on the World Wide Web. Differences with typical IR: Must assemble document corpus by spidering the web. Can use the structural layout information in HTML (XML). Can use the link structure of the web. Documents change uncontrollably.

9 Web Search System Web Spider Document corpus Query String IR System 1. Page1 2. Page2 3. Page3.. Ranked Documents

10 Web users information needs classification based on the type of answers expected by the Web user 1. Question answering type of answer : a very short answer (single sentence, part of the document) query example What is the population of Korea? 2. Named page finding - Homepage finding type of answer : a single document containing specific information query example Authors instructions for the KISS journal 3. Topic relevance type of answer: a range of documents relevant to the topic of their information need query example KOREA s immigration policy 4. Online services type of answer: a range of documents that allow the user to access a particular service query example download mp3 5. Topic distillation type of answer: a range of documents that are relevant key resources on the topic (Relevance + Quality) query example highway safety

11 Web search results Big problem A long list of search results, ranked by their relevancies to the given query Engine : search engine, car part, Engine Corp. A time consuming task when multiple sub-topics of the given query are mixed together. Possible solution To (online) cluster web search results Evidence Relationships between the results Cluster Hypothesis (van Rijsbergen 1979): Closely related documents tend to be relevant to the same requests. Potential results Aids user-engine interaction Browsing Help users express his need

12 An example - Vivisimo

13 What is clustering? Partition unlabeled examples into disjoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different the act of grouping similar objects into sets unsupervised classification no predefined classes Typical applications to get insight into data as a preprocessing step

14 What is clustering? (cont.)

15 What is document clustering? Document clustering : To group similar documents into sets To group similar or related documents into a common cluster. Eg) Scatter/Gather

16 Why not just document clustering? Web search results clustering is a version of document clustering, but Billions of pages Constantly changing Data mainly unstructured and heterogeneous Additional information to consider (i.e. links, clickthrough data, etc.)

17 Why cluster web documents? 1. For whole corpus analysis/navigation Better user interface 2. For improving recall in search applications Better search results 3. For better navigation of search results Effective user recall will be higher 4. For speeding up vector space retrieval Faster search

18 1. Corpus analysis/navigation Standard IR Document clusters Some time, TOC is very useful Index Aardvark, 15 Blueberry, 200 Capricorn, 1, Dog, Egypt, 65 Falafel, Giraffes, Table of Contents 1. Science of Cognition 1.a. Motivations 1.a.i. Intellectual Curiosity 1.a.ii. Practical Applications 1.b. History of Cognitive Psychology 2. The Neural Basis of Cognition 2.a. The Nervous System 2.b. Organization of the Brain 2.c. The Visual System 3. Perception and Attention 3.a. Sensory Memory 3.b. Attention and Sensory Information Processing

19 1. Corpus analysis/navigation (cont.) Document clustering Can induce a tree of topics To allow user to browse through corpus to find information Crucial need: meaningful labels for topic nodes. Yahoo!: manual hierarchy Often not available for new document collection

20 For visualizing a document collection and its themes Wise et al, Visualizing the non-visual PNNL ThemeScapes, Cartia [Mountain height = cluster size]

21 2. Improvement of search recall Cluster hypothesis Documents with similar text are related Therefore, to improve search recall: To cluster docs in corpus a priori (pre-clustering) To return other docs in the cluster containing D when a query matches a doc D. Hope if we do this: The query car will also return docs containing automobile Because clustering grouped together docs containing car with those containing automobile.

22 3. Better navigation of search results For grouping search results thematically clusty.com / Vivisimo

23 3. Better navigation of search results - Kartoo.com -

24 3. Better navigation of search results - iboogie.com -

25 3. Better navigation of search results -mooter.com-

26 3. Better navigation of search results One can also view grouping documents with the same sense of a word as clustering Given the results of a search (say Jaguar, or NLP), partition into groups of related docs Can be viewed as a form of word sense disambiguation E.g., jaguar may have senses: The car company The animal The football team The video game

27 4. Speeding up vector space retrieval In vector space retrieval, To find nearest doc vectors to query vector To entail finding the similarity of the query to every doc slow (for some applications) By clustering docs in corpus a priori To find nearest docs in cluster close to query To be inexact but avoid exhaustive similarity computation

28 What Is A Good Clustering? Internal criterion: A good clustering will produce high quality clusters in which: the intra-class (that is, intra-cluster) similarity is high the inter-class similarity is low The measured quality of a clustering depends on both the document representation and the similarity measure used External criterion: The quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes Assessable with gold standard data

29 Main issues Online or offline clustering? What to use as input Entire documents Snippets Structure information (links) Other data (i.e. click-through) Use stop word lists, stemming, etc. How to define similarity? Content (i.e. vector-space model) Link analysis Usage statistics How to group similar documents? How to label the groups?

30 Components Of Clustering System 1. Feature selection/extraction the most effective subset from patterns 2. Feature representation one or more transformations of the input features (attributes) to produce new salient features. 3. Pattern proximity (measure similarity) The distance of pairs of patterns (features). 4. Clustering (Grouping) Hard (a partition of the data into groups) Fuzzy (each pattern has a variable degree) 5. Data abstraction Extracting a compact representation of a data set Automatic analysis, Human-oriented. 6. Evaluation to measure the performance of competing algorithms

31 1. Feature Selection/Extraction What is the representation of documents? document = a set of features What is the features? Linguistic structure in documents Co-occurrences of Terms : n-gram Semantic structure : case structure Meta-data of documents Authors Citation Co-citation document : documents cites the examined documents bibliographic coupling : documents are cited by the examined documents

32 1. Feature Selection/Extraction Which terms to use as axes for vector space? Better is to use highest weight mid-frequency words the most discriminating terms Pseudo-linguistic heuristics, e.g., drop stop-words stemming/lemmatization use only nouns/noun phrases Good clustering should figure out some of these

33 2. Feature representation Documents are transformed into vectors of weighted terms Data matrix (two modes) d d d 1 2 Μ n t x x x Μ n1 t x x x Μ n2 Λ Λ Λ Ο Λ t x x x m 1m 2m Μ nm How to compute x ij tf*idf x = tf idf ij ij i idf i = log N df i

34 3. Pattern proximity (measure similarity) Dissimilarity matrix (one mode) How to compute d ij Dissimilarity - Distance Similarity - inverse of distance 0 d21 d31 Μ dn1 d d 0 32 Μ n2 0 Μ

35 Proximity Measure Dissimilarity The Euclidean distance: The Manhattan distance: Weighted distance: Similarity Inner Product Cos Coefficient d d d s s ij ij ij ij ij = = = = m z= 1 m z= 1 m z= 1 m z= 1 m x w z= 1 = m ( x y ) iz z ( x iz ( x x 2 iz iz iz y jz jz 2 ( x y ) x iz x m jz jz ) ) x 2 jz jz 2

36 4. Clustering Approach and/or Algorithm (1) Using contents of documents (2) Using user s usage logs (3) Using current search engines (4) Using hyperlinks (5) Other classical methods

37 (1) Using Contents of Documents Based on snippets returned by web search engines. Be as good as clusters created using the full text of Web documents. Suffix Tree Clustering (STC) : incremental, O(n) time algorithm three logical steps: (1) document cleaning (2) identifying base clusters using a suffix tree (3) combining these base clusters into clusters

38 (2) Using user s usage logs Relevancy information is objectively reflected by the usage logs An experimental result on Cluster 1 Cluster 2 Cluster 3 /shuttle/missions/41-c/news /shuttle/missions/61-b /history/apollo/sa-2/news/ /history/apollo/sa-2/images /software/winvn/userguide/3_3_2.htm /software/winvn/userguide/3_3_4.htm.

39 (3) Using current web search engines - Metacrawler -

40 (4) Using hyperlinks Cluster web documents based on both the textual and hyperlink The hyperlink structure is used as the dominant factor in the similarity metric Kleinberg s HITS (Hypertext Induced Topic Selection) algorithm based purely on hyperlink information. authority and hub documents for a user query. only cover the most popular topics and leave out the less popular ones.

41 Other classical approaches

42 Hierarchical Clustering Clusters are created in levels actually creating sets of clusters at each level. Agglomerative Initially each item in its own cluster Iteratively clusters are merged together Bottom Up Divisive Initially all items in one cluster Large clusters are successively divided Top Down

43 Hierarchical Clustering (Cent) Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition. Step 0 Step 1 Step 2 Step 3 Step 4 a a b b a b c d e c c d e d d e e Step 4 Step 3 Step 2 Step 1 Step 0 agglomerative divisive

44 Agglomerative Example E D C B A E D C B A B A E C D 4 Threshold of A B C D E

45 Distance Between Two Clusters Single-Link Method : Nearest Neighbor Complete-Link : Furthest Neighbor Average-link : all cross-cluster pairs. Min distance Max distance Average distance

46 Single-Link Method Euclidean Distance a b c d a,b a,b,c a,b,c,d c d d (1) (2) (3) a b c b 2 c d a b c b 2 c d a, b c c 3 d 5 4 a, b, c d 4 Distance Matrix

47 Complete-Link Method Euclidean Distance a b c d a,b a,b a,b,c,d c d c,d (1) (2) (3) a b c b 2 c d a b c b 2 c d a, b c c 5 d 6 4 a, b c, d 6 Distance Matrix

48 Example of the Chaining Effect Single-link (10 clusters) Complete-link (2 clusters)

49 Effect of Bias towards Spherical Clusters Single-link (2 clusters) Complete-link (2 clusters)

50 Partitioning clustering Partitioning method: Construct a partition of a database D of n objects into a set of k clusters Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion Global optimal: exhaustively enumerate all partitions Heuristic methods: k-means and k-medoids algorithms k-means (MacQueen 67): Each cluster is represented by the center of the cluster k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw 87): Each cluster is represented by one of the objects in the cluster

51 K-means algorithm Initial center of cluster are randomly selected Assign objects to cluster using distances between center and object Re-compute the center of each cluster Return step2 until stopping criteria is satisfied

52 K-means algorithm Example

53 What is the problem of k-means? The k-means algorithm is sensitive to outliers! Since an object with an extremely large value may substantially distort the distribution of the data. K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster

54 5. Data Abstraction - Cluster labelling - How do we generate meaningful descriptions, or labels, for clusters? it is an open issue Possible solution To list 5-10 most frequent features from the cluster features: index terms, noun phrases, proper names, To show features that occur frequently in this cluster and not frequently in others Differential labeling To summarize multiple documents in the cluster

55 6. Cluster validation How can we tell if a clustering is good or not? ask users whether they agree with the clusters agreement with human clustering is problematic (Macskassy et al., 1998) use statistical techniques to measure qualities of the cluster, e.g. the purity, or the divergence from random clustering For information retrieval do we manage to cluster relevant documents together? This is the holy grail for cluster-based IR can be evaluated through cluster-based retrieval

56 Approaches to evaluating Anecdotal I wrote this clustering algorithm and look what it found! No benchmarks, no comparison possible User inspection Experts evaluate the results and score them Expensive / time-consuming Ground truth comparison To compare clustering results to a known taxonomy like Yahoo The static prior taxonomy may be incomplete/wrong in places Microeconomic / utility Net economic gain produced by an approach (vs. another approach) How do various clustering methods affect the quality of what s retrieved? Compare two IR algorithms 1. send query, present ranked results 2. send query, cluster results, present clusters Purely quantitative measures Probability of generating clusters found Average distance between cluster members

57 Cluster Validity Assessment δ(c i, C j ) : inter-cluster distance (C k ) : intra-cluster distance (C K ) δ(c i, C j )

58 Inter-cluster distance : Average linkage δ(c i, C j ) Centroid linkage Complete linkage Single linkage Average to centroids linkage

59 Intra-cluster distance: (C k ) Complete diameter Average diameter Centroid diameter

60 DB(Davies-Bouldin) Index S ij K 1 S i + S j DB = max K i j i= 1 Sij 1 = zi z j S = i x C { x zi } i C i K : the number of clusters S ij : Distance between the center z i of cluster i C i & the center z j of cluster j C j S i : Scatter between C i and C j

61 Davies-Bouldin (DB) Index (2) The objective: To minimize the index the number of clusters optimal number of clusters, K The clusters with minimized DB index Small values of DB correspond to good clusters

62 Dunn s Index v D = The objective To identify sets of clusters that are compact and well separated to maximize the inter-cluster distances and minimize the intra-cluster distances Large values of V D correspond to good clusters δ ( C i j min min 1 i K 1 j K max ) j i k 1 k K, C ) { ( C }

63 Other Cluster Validity Measures Calinski Harabasz (CH) Index Index I Xie and Beni F-Measure Λ-Measure Silhouette Symmetry Graph-Based Boundary Analysis Partition Coefficient Separation Index Classification Entropy (CE) Etc.

64 Other clustering applications Relevance feedback (Iwayama, 2000) Search engine query logs cluster queries on the basis of the documents that users selected from these queries (Beeferman & Berger, 2000) Recommender systems cluster users based on preferences, purchases, etc. (Sarwar et al., 2002) Topic detection and tracking (TDT) & Novelty Detection cluster news stories together and try to find emerging new topics (Allan, 2001) cluster previously seen information, and measure novelty of incoming information by its distance to that already seen (TREC Novelty track, 2002, 2003)

65 Discussion In general, clustering presents problems and challenges: Selection of attributes for clustering Selection of clustering method Generation of cluster representations Validity/quality of the generated clustering Updating clustering structures e.g. inserting or deleting a document from a hierarchy Effectiveness gains are not always evident Computational expenses

66 Looking back Document clustering has been applied to IR for over 30 years the research focus has shifted over the years efficiency (70s), effectiveness (80s), other applications (e.g. browsing, visualisation (90s)), dynamic clustering (00s),? But, there are still many unresolved issues: Cluster representation Cluster-based search strategies Dynamic clustering (per-query basis) Algorithmic aspects Document representation for clustering (e.g. reduce noise by using only the most important part of a document) New applications for clustering in IR

67 As a conclusion Clustering is a useful tool for IR It is theoretically solid and intuitively plausible Although it has been used for over 30 years there are still open issues It is still an interesting area to research

68 Some references If you only read one article/reference: Tombros, A., PhD Thesis, Chapter 3 (optionally 4 & 5) available at: Willett, P., Recent trends in hierarchic document clustering: A critical review. Information Processing & Management, 24(5): , More than worth to have a look at: van Rijsbergen, C.J., Information Retrieval. London: Butterworths, 2 nd Edition, 1979; available at Also recommended : Hearst, M.A. and Pedersen, J.O. Re-examining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In Proceedings of ACM SIGIR 96, pages 76-84, 1996.

69 Some references Zamir, O. and Etzioni, O. Web document clustering: A feasibility demonstration. In Proceedings of ACM SIGIR 09, pages 46-54, Iwayama, M. Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering. In Proceedings of ACM SIGIR 00, pages 10-16, Beeferman, D. and Berger, A. Agglomerative clustering of a search engine log. In Proceedings of the 6 th International Conference on Knowledge Discovery in Data, pages , B.M. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Recommender Systems for Large-Scale E-Commerce: Scalable Neighborhood Formation Using Clustering. Proceedings of the 5 th International Conference on Computer and Information Technology, J. Allan, J. Carbonell, G. Doddington, J. Yamron and Y. Yang. Topic detection and tracking pilot study. In Topic Detection and Tracking Workshop Report, 2001.

70 Some references Macskassy, S.A., Banerjee, A., Davidson, B.D., Hirsh, H. Human performance on clustering web pages: a preliminary study. In Proceedings of The 4 th Knowledge Discovery and Data Mining Conference (KDD-98), pp , Goldszmidt, M. and Sahami, M. A Probabilistic Approach to Full-Text Document Clustering. Technical Report ITAD-433-MS , SRI International, Available at Bradley, P., Fayyad, U., Reina, C. Scaling EM (expectation maximization) algorithm to large databases, Microsoft Research Technical Report, MSR-TR-98, Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W. Scatter/Gather: A cluster based approach to browsing large document collections. In Proceedings of ACM SIGIR 92, pages , 1992.

71 Acknowledgement C. Manning and P. Raghavan Why cluster documents Cluster Validation, T. Tombros, Clustering for IR,

72 External Evaluation of Cluster Quality Assesses clustering with respect to ground truth Assume that there are C gold standard classes, while our clustering algorithms produce k clusters, 1, 2,, k with n i members. Simple measure: purity, the ratio between the dominant class in the cluster i and the size of cluster i 1 Purity (π i ) = max j ( nij ) j C n i Others are entropy of classes in clusters (or mutual information between classes and clusters)

73 Purity Cluster I Cluster II Cluster III Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Text Clustering Prof. Chris Clifton 19 October 2018 Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti Document clustering Motivations Document

More information

Clustering CE-324: Modern Information Retrieval Sharif University of Technology

Clustering CE-324: Modern Information Retrieval Sharif University of Technology Clustering CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch. 16 What

More information

http://www.xkcd.com/233/ Text Clustering David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Administrative 2 nd status reports Paper review

More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

Concept-Based Document Similarity Based on Suffix Tree Document

Concept-Based Document Similarity Based on Suffix Tree Document Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri

More information

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan Today s topic CS347 Clustering documents Lecture 8 May 7, 2001 Prabhakar Raghavan Why cluster documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

Clustering. Bruno Martins. 1 st Semester 2012/2013

Clustering. Bruno Martins. 1 st Semester 2012/2013 Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Clustering (COSC 488) Nazli Goharian. Document Clustering.

Clustering (COSC 488) Nazli Goharian. Document Clustering. Clustering (COSC 488) Nazli Goharian nazli@ir.cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Lecture 8 May 7, Prabhakar Raghavan

Lecture 8 May 7, Prabhakar Raghavan Lecture 8 May 7, 2001 Prabhakar Raghavan Clustering documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics Given the set of docs from the results of

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

DD2475 Information Retrieval Lecture 10: Clustering. Document Clustering. Recap: Classification. Today

DD2475 Information Retrieval Lecture 10: Clustering. Document Clustering. Recap: Classification. Today Sec.14.1! Recap: Classification DD2475 Information Retrieval Lecture 10: Clustering Hedvig Kjellström hedvig@kth.se www.csc.kth.se/dd2475 Data points have labels Classification task: Finding good separators

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Clustering (COSC 416) Nazli Goharian. Document Clustering.

Clustering (COSC 416) Nazli Goharian. Document Clustering. Clustering (COSC 416) Nazli Goharian nazli@cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Wiltrud Kessler & Hinrich Schütze Institute for Natural Language Processing, University of Stuttgart 0-- / 83

More information

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze Institute for

More information

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017 Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London What Is Text Clustering? Text Clustering = Grouping a set of documents into classes of similar

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.

More information

Cluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods

Cluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods Cluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods Prof. S.N. Sawalkar 1, Ms. Sheetal Yamde 2 1Head Department of Computer Science and Engineering, Computer

More information

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Clustering Results. Result List Example. Clustering Results. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Automatic Data Analysis in Visual Analytics Selected Methods

Automatic Data Analysis in Visual Analytics Selected Methods Automatic Data Analysis in Visual Analytics Selected Methods Multimedia Information Systems 2 VU (SS 2015, 707.025) Vedran Sabol Know-Center March 15 th, 2016 2 Lecture Overview Visual Analytics Overview

More information

Efficient Clustering of Web Documents Using Hybrid Approach in Data Mining

Efficient Clustering of Web Documents Using Hybrid Approach in Data Mining Efficient Clustering of Web Documents Using Hybrid Approach in Data Mining Pralhad Sudam Gamare 1, Ganpati A. Patil 2 1 P.G. Student, Computer Science and Technology, Department of Technology-Shivaji University-Kolhapur,

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD 10 Text Mining Munawar, PhD Definition Text mining also is known as Text Data Mining (TDM) and Knowledge Discovery in Textual Database (KDT).[1] A process of identifying novel information from a collection

More information

A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING

A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING Jianhua Wang, Ruixu Li Computer Science Department, Yantai University, Yantai, Shandong, China Abstract: Key words: Document clustering methods

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 31 Table of contents 1 Introduction 2 Data matrix and

More information

Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme

Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme Why do we need to find similarity? Similarity underlies many data science methods and solutions to business problems. Some

More information

Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Exploratory Analysis: Clustering

Exploratory Analysis: Clustering Exploratory Analysis: Clustering (some material taken or adapted from slides by Hinrich Schutze) Heejun Kim June 26, 2018 Clustering objective Grouping documents or instances into subsets or clusters Documents

More information

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 PV: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv IIR 6: Flat Clustering Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures Clustering and Dissimilarity Measures Clustering APR Course, Delft, The Netherlands Marco Loog May 19, 2008 1 What salient structures exist in the data? How many clusters? May 19, 2008 2 Cluster Analysis

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

SOCIAL MEDIA MINING. Data Mining Essentials

SOCIAL MEDIA MINING. Data Mining Essentials SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

More information

9. Conclusions. 9.1 Definition KDD

9. Conclusions. 9.1 Definition KDD 9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]

More information

Datasets Size: Effect on Clustering Results

Datasets Size: Effect on Clustering Results 1 Datasets Size: Effect on Clustering Results Adeleke Ajiboye 1, Ruzaini Abdullah Arshah 2, Hongwu Qin 3 Faculty of Computer Systems and Software Engineering Universiti Malaysia Pahang 1 {ajibraheem@live.com}

More information

Chapter 4: Text Clustering

Chapter 4: Text Clustering 4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

INF4820, Algorithms for AI and NLP: Hierarchical Clustering INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 16: Flat Clustering Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2009.06.16 1/ 64 Overview

More information

High Performance Text Document Clustering

High Performance Text Document Clustering Wright State University CORE Scholar Browse all Theses and Dissertations Theses and Dissertations 2007 High Performance Text Document Clustering Yanjun Li Wright State University Follow this and additional

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Clustering of Web Search Results Based on Document Segmentation

Clustering of Web Search Results Based on Document Segmentation Computer and Information Science; Vol. 6, No. 3; 23 ISSN 93-8989 E-ISSN 93-8997 Published by Canadian Center of Science and Education Clustering of Web Search Results Based on Document Segmentation Mohammad

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Hinrich Schütze Center for Information and Language Processing, University of Munich 04-06- /86 Overview Recap

More information

A New Measure of the Cluster Hypothesis

A New Measure of the Cluster Hypothesis A New Measure of the Cluster Hypothesis Mark D. Smucker 1 and James Allan 2 1 Department of Management Sciences University of Waterloo 2 Center for Intelligent Information Retrieval Department of Computer

More information

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16 CLUSTERING CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16 1. K-medoids: REFERENCES https://www.coursera.org/learn/cluster-analysis/lecture/nj0sb/3-4-the-k-medoids-clustering-method https://anuradhasrinivas.files.wordpress.com/2013/04/lesson8-clustering.pdf

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Authoritative K-Means for Clustering of Web Search Results

Authoritative K-Means for Clustering of Web Search Results Authoritative K-Means for Clustering of Web Search Results Gaojie He Master in Information Systems Submission date: June 2010 Supervisor: Kjetil Nørvåg, IDI Co-supervisor: Robert Neumayer, IDI Norwegian

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Approaches to Mining the Web

Approaches to Mining the Web Approaches to Mining the Web Olfa Nasraoui University of Louisville Web Mining: Mining Web Data (3 Types) Structure Mining: extracting info from topology of the Web (links among pages) Hubs: pages pointing

More information