Clustering CE-324: Modern Information Retrieval Sharif University of Technology
|
|
- Vernon Dalton
- 5 years ago
- Views:
Transcription
1 Clustering CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
2 Ch. 16 What is clustering? Clustering: grouping a set of objects into similar ones Docs within a cluster should be similar. Docs from different clusters should be dissimilar. The commonest form of unsupervised learning Unsupervised learning learning from raw data, as opposed to supervised data where a classification of examples is given A common and important task that finds many applications in IR and other places 2
3 Ch. 16 A data set with clear cluster structure How would you design an algorithm for finding the three clusters in this case? 3
4 Applications of clustering in IR 5
5 Search result clustering 6
6 7 yippy.com grouping search results
7 Yahoo! Hierarchy (30) agriculture biology physics CS space dairy crops botany cell forestry agronomy evolution magnetism relativity AI HCI courses craft missions Yahoo! Hierarchy isn t clustering but is the kind of output you want from clustering 8
8 Google News: automatic clustering gives an effective news presentation metaphor 9
9 Sec Scatter/Gather Scatter and gather approach instead of searching by entering terms 10
10 For visualizing a doc collection and its themes Wise et al, Visualizing the non-visual PNNL ThemeScapes, Cartia [Mountain height = cluster size] 11
11 Sec For improving search recall Cluster hypothesis: Docs in the same cluster behave similarly with respect to relevance to information needs Therefore, to improve search recall: Cluster docs in corpus a priori When a query matches a doc d, also return other docs in the cluster containing d Query car: also return docs containing automobile Because clustering grouped together docs containing car with those containing automobile. Why might this happen? 12
12 Sec Issues for clustering Representation for clustering Doc representation Vector space? Normalization? Centroids aren t length normalized Need a notion of similarity/distance How many clusters? Fixed a priori? Completely data driven? Avoid trivial clusters - too large or small too large: for navigation purposes you've wasted an extra user click without whittling down the set of docs much 13
13 Notion of similarity/distance Ideal: semantic similarity. Practical: term-statistical similarity We will use cosine similarity. For many algorithms, easier to think in terms of a distance (rather than similarity) We will mostly speak of Euclidean distance But real implementations use cosine similarity 14
14 Clustering algorithms categorization Flat algorithms (k-means) Usually start with a random (partial) partitioning Refine it iteratively Hierarchical algorithms Bottom-up, agglomerative Top-down, divisive 15
15 Hard vs. soft clustering Hard clustering: Each doc belongs to exactly one cluster More common and easier to do Soft clustering:a doc can belong to more than one cluster. 16
16 Partitioning algorithms Construct a partition of N docs into K clusters Given: a set of docs and the number K Find: a partition of docs into K clusters that optimizes the chosen partitioning criterion Finding a global optimum is intractable for many objective functions of clustering Effective heuristic methods: K-means and K-medoids algorithms 17
17 Sec K-means Assumes docs are real-valued vectors x (1),, x (N). Clusters based on centroids (aka the center of gravity or mean) of points in a cluster: μ j = 1 C j x (i) C j x (i) K-means cost function: J(C) = K j=1 x (i) C j x (i) μ j 2 C = {C 1, C 2,, C K } C j : the set of data points assigned to j-th cluster 18
18 Sec K-means algorithm Select K random points {μ 1, μ 2, μ K } as clusters initial centroids. Until clustering converges (or other stopping criterion): For each doc x (i) : Assign x (i) to the cluster C j such that dist(x (i), μ j ) is minimal. For each cluster C j μ j = i C j x(i) C j Reassignment of instances to clusters is based on distance to the current cluster centroids (can equivalently be in terms of similarities) 19
19 20 [Bishop]
20 21
21 Sec Termination conditions Several possibilities for terminal condition, e.g., A fixed number of iterations Doc partition unchanged J < θ: cost function falls below a threshold J < θ: the decrease in the cost function (in two successive iterations) falls below a threshold 22
22 Sec Convergence of K-means K-means algorithm ever reaches a fixed point in which clusters don t change. We must use tie-breaking when there are samples with the same distance from two or more clusters (by assigning it to the lower index cluster) 23
23 K-means decreases J(C) in each iteration (before convergence) Sec First, reassignment monotonically decreases J(C) since each vector is assigned to the closest centroid. Second, recomputation monotonically decreases each i C k x (i) μ k 2 : i Ck x (i) μ k 2 reaches minimum for μ k = 1 C k i C k x (i) K-means typically converges quickly 24
24 Sec Time complexity of K-means Computing distance between two docs: O(M) M is the dimensionality of the vectors. Reassigning clusters: O(KN) distance computations O(KNM). Computing centroids: Each doc gets added once to some centroid: O(NM). Assume these two steps are each done once for I iterations: O(IKNM). 25
25 Sec Seed choice Results can vary based on random selection of initial centroids. Some initializations get poor convergence rate, or convergence to sub-optimal clustering Try out multiple starting points Select good seeds using a heuristic (e.g., doc least similar to any existing mean) Initialize with the results of another method. Example showing sensitivity to seeds If you start with B and E as centroids you converge to {A,B,C} and {D,E,F} If you start with D and F, you converge to {A,B,D,E} {C,F} 26
26 Sec K-means issues, variations, etc. Computes the centroid only after all points are reassigned Instead, we can re-compute the centroid after every assignment It can improve speed of convergence of K-means Assumes clusters are spherical in vector space Sensitive to coordinate changes, weighting etc. Disjoint and exhaustive Doesn t have a notion of outliers by default But can add outlier filtering Dhillon et al. ICDM 2002 variation to fix some issues with small document clusters 27
27 How many clusters? Number of clusters K is given Partition n docs into predetermined number of clusters Finding the right number is part of the problem Given docs, partition into an appropriate no. of subsets. E.g., for query results - ideal value of K not known up front - though UI may impose limits. 28
28 How many clusters? How many clusters? Six Clusters Two Clusters Four Clusters 29
29 Selecting k 30
30 K not specified in advance Tradeoff between having better focus within each cluster and having too many clusters Solve an optimization problem: penalize having lots of clusters application dependent e.g., compressed summary of search results list. k = min k J min k + λk J min k : show the minimum value of J {C 1,C 2,,C k } obtained in e.g. 100 runs of k-means (with different initializations) 31
31 Penalize lots of clusters Benefit for a doc: cosine similarity to its centroid Total Benefit: sum of the individual doc Benefits. Why is there always a clustering of Total Benefit n? For each cluster, we have a Cost C. For K clusters, the Total Cost is KC. Value of a clustering = Total Benefit - Total Cost. Find clustering of highest value, over all choices of K. Total benefit increases with increasing K. But can stop when it doesn t increase by much. The Cost term enforces this. 32
32 Sec What is a good clustering? Internal criterion: intra-class (that is, intra-cluster) similarity is high inter-class similarity is low The measured quality of a clustering depends on both the doc representation and the similarity measure 33
33 Sec External criteria for clustering quality Quality: ability to discover some or all of patterns in gold standard data Assesses a clustering with respect to ground truth requires labeled data C gold standard classes C 1,, C c, while clustering produces K clusters, ω 1, ω 2,, ω K. 34
34 External criteria Purity Rand Index F measure Normalized Mutual Information (NMI) 35
35 Sec External evaluation of cluster quality Purity: ratio between the dominant class in the cluster ω i and the size of cluster ω i max ω i C j j=1,,c purity ω i = ω i TotalPurity ω 1,, ω c = 1 N i=1 K max j=1,,c ω i C j Biased because having N clusters maximizes purity 36
36 Sec Purity example Cluster I Cluster II Cluster III Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5 Total Purity = 1/17 (5+4+3) = 12/17 37
37 Sec Rand Index (RI) measures between pair decisions RI = 0.68 Number of points Same Cluster in clustering Different Clusters in clustering Same class in ground truth Different classes in ground truth
38 Sec Rand index and F-measure RI TP TN TP FP TN FN Compare with standard Precision and Recall: P TP TP FP R TP TP TN People also define and use a cluster F-measure, which is probably a better measure. 39 F β = β2 + 1 PR β 2 P + R
39 Ch. 17 Hierarchical clustering Build a tree-based hierarchical taxonomy (dendrogram) from a set of docs. animal vertebrate fish reptile amphib. mammal invertebrate worm insect crustacean One approach: recursive application of a partitional clustering algorithm. 40
40 Hierarchical clustering 41
41 Dendrogram: hierarchical clustering Clustering obtained by cutting the dendrogram at a desired level: Each connected component forms a cluster. 42
42 Hierarchical Agglomerative Clustering (HAC) Sec Starts with each doc in a separate cluster then repeatedly joins the closest pair of clusters, until there is only one cluster. The history of merging forms a binary tree or hierarchy. Note: the resulting clusters are still hard and induce a partition 43
43 Example Hierarchical Agglomerative Clustering (HAC)
44 45
45 Sec Closest pair of clusters Many variants to defining closest pair of clusters Single-link Similarity of the most similar pair (single-link) Complete-link Similarity of the furthest points, the least similar pair Centroid Clusters whose centroids (centers of gravity) are the most similar Average-link Average similarities between pairs of elements 46
46 Distances between cluster pairs Single-link Complete-link Centroid Average-link 47
47 Sec Single link Use maximum similarity of pairs: sim ( C, C ) max sim ( x, y ) Can result in straggly (long and thin) clusters due to chaining effect. i j x C i, y C After merging C i and C j, the similarity of the resulting cluster to another cluster, C k : sim (( C C ), C ) max( sim ( C, C ), sim ( C, C )) i j k i k j k j 48
48 Sec Single link: example 49
49 Sec Complete link Use minimum similarity of pairs: sim ( C, C ) min sim ( x, y ) i j x C, y C i Makes tighter, spherical clusters that are typically preferable. After merging C i and C j, the similarity of the resulting cluster to another cluster, C k : sim (( C C ), C ) min( sim ( C, C ), sim ( C, C )) i j k i k j k j 50
50 Sec Complete link: example 51
51 Sec Group average Similarity of two clusters = average similarity of all pairs within merged cluster. 1 sim ( C i, C j) sim ( x, y ) C C ( C C 1) x ( C i C j ) y ( C i C j ): y x i j i j Compromise between single and complete link. 52
52 Sec Computing group average similarity When we use dot product as the similarity measure we can compute similarity of two clusters in constant time if we maintain sum of vectors in each cluster. Compute similarity of clusters: s j x C j x If the length of doc vectors are assumed one sim ( c, c ) i j ( s s ) ( s s ) ( C C ) i j i j i j ( C C )( C C 1) i j i j 53
53 Sec Computational complexity First iteration: all HAC methods need to compute similarity of all pairs of instances, O(N 2 ). N 1 merging iterations: compute the distance between the most recently created cluster and all other existing ones. In order to maintain an overall O(N 2 ) performance, computing similarity to each other cluster must be done in constant time. Often O(N 3 ) if done naively or O(N 2 log N) if done more cleverly 54
54 55
55 Cluster Labeling 56
56 Final word and resources In clustering, clusters are inferred from the data without human input (unsupervised learning) However, in practice, it s a bit less clear many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of docs,... Resources IIR 16 except 16.5 IIR
CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University
CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document
More informationCS47300: Web Information Search and Management
CS47300: Web Information Search and Management Text Clustering Prof. Chris Clifton 19 October 2018 Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti Document clustering Motivations Document
More informationInforma(on Retrieval
Introduc*on to Informa(on Retrieval CS276: Informa*on Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 12: Clustering Today s Topic: Clustering Document clustering Mo*va*ons Document
More informationInforma(on Retrieval
Introduc*on to Informa(on Retrieval Clustering Chris Manning, Pandu Nayak, and Prabhakar Raghavan Today s Topic: Clustering Document clustering Mo*va*ons Document representa*ons Success criteria Clustering
More informationDD2475 Information Retrieval Lecture 10: Clustering. Document Clustering. Recap: Classification. Today
Sec.14.1! Recap: Classification DD2475 Information Retrieval Lecture 10: Clustering Hedvig Kjellström hedvig@kth.se www.csc.kth.se/dd2475 Data points have labels Classification task: Finding good separators
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London What Is Text Clustering? Text Clustering = Grouping a set of documents into classes of similar
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,
More informationFlat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017
Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:
More informationhttp://www.xkcd.com/233/ Text Clustering David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Administrative 2 nd status reports Paper review
More informationClustering. Partition unlabeled examples into disjoint subsets of clusters, such that:
Text Clustering 1 Clustering Partition unlabeled examples into disjoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different Discover
More informationBased on Raymond J. Mooney s slides
Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit
More informationCSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)
CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze Institute for
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Wiltrud Kessler & Hinrich Schütze Institute for Natural Language Processing, University of Stuttgart 0-- / 83
More informationHierarchical Clustering
Hierarchical Clustering Build a tree-based hierarchical taxonomy (dendrogram) from a set animal of documents. vertebrate invertebrate fish reptile amphib. mammal worm insect crustacean One approach: recursive
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 16: Flat Clustering Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2009.06.16 1/ 64 Overview
More informationPV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211
PV: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv IIR 6: Flat Clustering Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center
More informationAdministrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES
Administrative Machine learning: Unsupervised learning" Assignment 5 out soon David Kauchak cs311 Spring 2013 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Machine
More informationClustering algorithms
Clustering algorithms Machine Learning Hamid Beigy Sharif University of Technology Fall 1393 Hamid Beigy (Sharif University of Technology) Clustering algorithms Fall 1393 1 / 22 Table of contents 1 Supervised
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Hinrich Schütze Center for Information and Language Processing, University of Munich 04-06- /86 Overview Recap
More informationINF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task
More informationCluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University
Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University Kinds of Clustering Sequential Fast Cost Optimization Fixed number of clusters Hierarchical
More informationLecture 15 Clustering. Oct
Lecture 15 Clustering Oct 31 2008 Unsupervised learning and pattern discovery So far, our data has been in this form: x 11,x 21, x 31,, x 1 m y1 x 12 22 2 2 2,x, x 3,, x m y We will be looking at unlabeled
More informationMI example for poultry/export in Reuters. Overview. Introduction to Information Retrieval. Outline.
Introduction to Information Retrieval http://informationretrieval.org IIR 16: Flat Clustering Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2009.06.16 Outline 1 Recap
More informationUnsupervised learning, Clustering CS434
Unsupervised learning, Clustering CS434 Unsupervised learning and pattern discovery So far, our data has been in this form: We will be looking at unlabeled data: x 11,x 21, x 31,, x 1 m x 12,x 22, x 32,,
More informationINF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering
INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,
More informationClassification & Clustering. Hadaiq Rolis Sanabila
Classification & Clustering Hadaiq Rolis Sanabila hadaiq@cs.ui.ac.id Natural Language Processing and Text Mining Pusilkom UI 22 26 Maret 2016 CLASSIFICATION 2 Categorization/Classification Given: A description
More informationAdministrative. Machine learning code. Machine learning: Unsupervised learning
Machine learning: Unsupervised learning http://www.youtube.com/watch?v=or_-y-eilqo David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Machine
More informationClustering Results. Result List Example. Clustering Results. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to
More informationWhat to come. There will be a few more topics we will cover on supervised learning
Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster
More informationToday s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan
Today s topic CS347 Clustering documents Lecture 8 May 7, 2001 Prabhakar Raghavan Why cluster documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics
More informationINF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering
INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.
More informationBig Data Analytics! Special Topics for Computer Science CSE CSE Feb 9
Big Data Analytics! Special Topics for Computer Science CSE 4095-001 CSE 5095-005! Feb 9 Fei Wang Associate Professor Department of Computer Science and Engineering fei_wang@uconn.edu Clustering I What
More informationINF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering
INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised
More informationClustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search
Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2
More informationLecture 8 May 7, Prabhakar Raghavan
Lecture 8 May 7, 2001 Prabhakar Raghavan Clustering documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics Given the set of docs from the results of
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework
More informationExpectation Maximization!
Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University and http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Steps in Clustering Select Features
More informationk-means demo Administrative Machine learning: Unsupervised learning" Assignment 5 out
Machine learning: Unsupervised learning" David Kauchak cs Spring 0 adapted from: http://www.stanford.edu/class/cs76/handouts/lecture7-clustering.ppt http://www.youtube.com/watch?v=or_-y-eilqo Administrative
More informationClustering: Overview and K-means algorithm
Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationChapter 9. Classification and Clustering
Chapter 9 Classification and Clustering Classification and Clustering Classification and clustering are classical pattern recognition and machine learning problems Classification, also referred to as categorization
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster
More informationCLUSTERING. Quiz information. today 11/14/13% ! The second midterm quiz is on Thursday (11/21) ! In-class (75 minutes!)
CLUSTERING Quiz information! The second midterm quiz is on Thursday (11/21)! In-class (75 minutes!)! Allowed one two-sided (8.5x11) cheat sheet! Solutions for optional problems to HW5 posted today 1% Quiz
More informationClustering Algorithms for general similarity measures
Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative
More informationClustering: Overview and K-means algorithm
Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin
More informationClustering (COSC 416) Nazli Goharian. Document Clustering.
Clustering (COSC 416) Nazli Goharian nazli@cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
More informationText classification II CE-324: Modern Information Retrieval Sharif University of Technology
Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationData Clustering. Danushka Bollegala
Data Clustering Danushka Bollegala Outline Why cluster data? Clustering as unsupervised learning Clustering algorithms k-means, k-medoids agglomerative clustering Brown s clustering Spectral clustering
More informationArtificial Intelligence. Programming Styles
Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to
More informationText Documents clustering using K Means Algorithm
Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals
More informationRoad map. Basic concepts
Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?
More informationClustering. Bruno Martins. 1 st Semester 2012/2013
Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts
More informationCluster Analysis. Ying Shen, SSE, Tongji University
Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group
More informationCS 2750: Machine Learning. Clustering. Prof. Adriana Kovashka University of Pittsburgh January 17, 2017
CS 2750: Machine Learning Clustering Prof. Adriana Kovashka University of Pittsburgh January 17, 2017 What is clustering? Grouping items that belong together (i.e. have similar features) Unsupervised:
More information5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction
Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering
More informationINF4820, Algorithms for AI and NLP: Hierarchical Clustering
INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score
More informationClustering (COSC 488) Nazli Goharian. Document Clustering.
Clustering (COSC 488) Nazli Goharian nazli@ir.cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationUnsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi
Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which
More informationClustering: K-means and Kernel K-means
Clustering: K-means and Kernel K-means Piyush Rai Machine Learning (CS771A) Aug 31, 2016 Machine Learning (CS771A) Clustering: K-means and Kernel K-means 1 Clustering Usually an unsupervised learning problem
More informationIntroduction to Mobile Robotics
Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,
More informationUnsupervised Learning Partitioning Methods
Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form
More informationClust Clus e t ring 2 Nov
Clustering 2 Nov 3 2008 HAC Algorithm Start t with all objects in their own cluster. Until there is only one cluster: Among the current clusters, determine the two clusters, c i and c j, that are most
More informationExploratory Analysis: Clustering
Exploratory Analysis: Clustering (some material taken or adapted from slides by Hinrich Schutze) Heejun Kim June 26, 2018 Clustering objective Grouping documents or instances into subsets or clusters Documents
More informationUnsupervised Learning : Clustering
Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)
More informationLecture on Modeling Tools for Clustering & Regression
Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into
More informationCS Introduction to Data Mining Instructor: Abdullah Mueen
CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationAssociation Rule Mining and Clustering
Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationHard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering
An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other
More informationSupervised and Unsupervised Learning (II)
Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346 - Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised
More informationMachine Learning. Unsupervised Learning. Manfred Huber
Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training
More informationChapter 4: Text Clustering
4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (4/4) March 9, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More informationCSE 494/598 Lecture-11: Clustering & Classification
CSE 494/598 Lecture-11: Clustering & Classification LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **With permission, content adapted from last year s slides and from Intro to DM dmbook@cs.umn.edu
More informationTypes of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters
Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive
More informationUnsupervised Learning. Clustering and the EM Algorithm. Unsupervised Learning is Model Learning
Unsupervised Learning Clustering and the EM Algorithm Susanna Ricco Supervised Learning Given data in the form < x, y >, y is the target to learn. Good news: Easy to tell if our algorithm is giving the
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationUnsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis
7 Supervised learning vs unsupervised learning Unsupervised Learning Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute These patterns are then
More informationUnsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection
More informationCluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1
Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods
More informationClustering part II 1
Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:
More informationCS7267 MACHINE LEARNING
S7267 MAHINE LEARNING HIERARHIAL LUSTERING Ref: hengkai Li, Department of omputer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) Mingon Kang, Ph.D. omputer Science,
More informationINF 4300 Classification III Anne Solberg The agenda today:
INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15
More informationClustering and Dimensionality Reduction. Stony Brook University CSE545, Fall 2017
Clustering and Dimensionality Reduction Stony Brook University CSE545, Fall 2017 Goal: Generalize to new data Model New Data? Original Data Does the model accurately reflect new data? Supervised vs. Unsupervised
More informationHierarchical Clustering
Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00
More informationData Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data
More informationMachine Learning and Data Mining. Clustering (1): Basics. Kalev Kask
Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of
More informationK-Means Clustering 3/3/17
K-Means Clustering 3/3/17 Unsupervised Learning We have a collection of unlabeled data points. We want to find underlying structure in the data. Examples: Identify groups of similar data points. Clustering
More informationClassification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University
Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate
More informationClustering. Chapter 10 in Introduction to statistical learning
Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What
More information