Lecture 6: Clustering


1 Lecture 6: Clustering Information Retrieval Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group Lent

2 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering

3 Precision and recall

                                THE TRUTH
WHAT THE                        Relevant               Nonrelevant
SYSTEM     Retrieved            true positives (TP)    false positives (FP)
THINKS     Not retrieved        false negatives (FN)   true negatives (TN)

P = TP/(TP + FP)    R = TP/(TP + FN) 258
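The two formulas can be computed directly from sets of retrieved and relevant documents. A minimal sketch (the document IDs below are illustrative, not from the lecture):

```python
def precision_recall(retrieved, relevant):
    """Compute P = TP/(TP+FP) and R = TP/(TP+FN) from sets of doc IDs."""
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Illustrative: 4 documents retrieved, 3 relevant, 2 in common
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
```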

4 Precision/Recall Graph [Figure: ranked result list (documents d12, d57, d24, d26, d90, ...) with the precision and recall at each rank, and the corresponding precision/recall graph] 259

5 Avg 11pt prec: area under the normalised P/R graph

    P_11pt = (1/11) Σ_{j=0}^{10} (1/N) Σ_{i=1}^{N} P_i(r_j)

where P_i(r_j) is the interpolated precision of query i (of N queries) at recall level r_j ∈ {0.0, 0.1, ..., 1.0}. [Figure: precision/recall graph]
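For a single query, interpolated precision at the 11 standard recall levels can be sketched as follows (a hedged illustration; the relevance flags are invented, and interpolation-by-maximum is the usual textbook convention). Averaging these vectors over the N queries gives P_11pt:

```python
def eleven_point_precision(ranked_rel, n_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for one query.

    ranked_rel[i] is True iff the document at rank i+1 is relevant;
    n_relevant is the total number of relevant documents for the query.
    """
    points = []                      # (recall, precision) after each rank
    hits = 0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
        points.append((hits / n_relevant, hits / rank))
    # interpolated precision at level r: highest precision at any recall >= r
    return [max((p for r, p in points if r >= level / 10), default=0.0)
            for level in range(11)]
```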

6 Mean Average Precision (MAP)

    MAP = (1/|Q|) Σ_{j=1}^{|Q|} (1/N_j) Σ_{i=1}^{N_j} P(doc_i)

where Q is the set of queries, N_j is the number of relevant documents for query j, and P(doc_i) is the precision at the rank of the i-th relevant document.

Query 1: relevant docs (X) at ranks 1, 3, 6, 10, 20; P(doc_i) = 1.0, 0.67, 0.5, 0.4, 0.25; AVG = 0.564
Query 2: relevant docs (X) at ranks 1, 3, 15; P(doc_i) = 1.0, 0.67, 0.2; AVG = 0.623

    MAP = (0.564 + 0.623)/2 = 0.594
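The MAP computation can be sketched directly from ranked relevance judgements. A minimal illustration; the two queries below place relevant documents at ranks 1, 3, 6, 10, 20 and 1, 3, 15, in the style of the worked example:

```python
def average_precision(ranked_rel):
    """Average of P(doc_i) over the relevant documents in one ranked list."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    """MAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)

q1 = [r in (1, 3, 6, 10, 20) for r in range(1, 21)]  # relevant at ranks 1,3,6,10,20
q2 = [r in (1, 3, 15) for r in range(1, 16)]         # relevant at ranks 1,3,15
```

Evaluating `mean_average_precision([q1, q2])` reproduces the arithmetic of the slide up to rounding.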

7 What we need for a benchmark A collection of documents Documents must be representative of the documents we expect to see in reality. There must be many documents; a small set of abstracts (as in the Cranfield experiment) is no longer sufficient to model modern retrieval A collection of information needs...which we will often incorrectly refer to as queries Information needs must be representative of the information needs we expect to see in reality. Human relevance assessments We need to hire/pay judges or assessors to do this. Expensive, time-consuming Judges must be representative of the users we expect to see in reality. 262

8 Second-generation relevance benchmark: TREC TREC = Text Retrieval Conference, organized by the U.S. National Institute of Standards and Technology (NIST) TREC is actually a set of several different relevance benchmarks. Best known: TREC Ad Hoc, used for the first 8 TREC evaluations between 1992 and 1999. 1.89 million documents, mainly newswire articles, 450 information needs No exhaustive relevance judgments too expensive Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed. 263

9 Sample TREC Query <num> Number: 508 <title> hair loss is a symptom of what diseases <desc> Description: Find diseases for which hair loss is a symptom. <narr> Narrative: A document is relevant if it positively connects the loss of head hair in humans with a specific disease. In this context, thinning hair and hair loss are synonymous. Loss of body and/or facial hair is irrelevant, as is hair loss caused by drug therapy. 264

10 TREC Relevance Judgements Humans decide which document-query pairs are relevant. 265

11 Interjudge agreement at TREC [Table: for each information need, the number of documents judged and the number of judge disagreements] Observation: Judges disagree a lot. This has a large impact on the absolute performance numbers of each system, but virtually no impact on the ranking of systems. So the results of information retrieval experiments of this kind can reliably tell us whether system A is better than system B, even if judges disagree. 266

12 Example of a more recent benchmark: ClueWeb09 1 billion web pages, 25 terabytes (compressed: 5 terabytes) Collected January/February 2009, in 10 languages Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed) Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed) 267

13 Evaluation at large search engines Recall is difficult to measure on the web Search engines often use precision at top k, e.g., k = 10, or use measures that reward you more for getting rank 1 right than for getting rank 10 right. Search engines also use non-relevance-based measures. Example 1: clickthrough on first result Not very reliable if you look at a single clickthrough (you may realize after clicking that the summary was misleading and the document is nonrelevant)... but pretty reliable in the aggregate. Example 2: A/B testing 268

14 A/B testing Purpose: Test a single innovation Prerequisite: You have a large search engine up and running. Have most users use old system Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation Evaluate with an automatic measure like clickthrough on first result Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most 269

15 Reading MRS, Chapter 8 270

16 Upcoming What is clustering? Applications of clustering in information retrieval K-means algorithm Introduction to hierarchical clustering Single-link and complete-link clustering 271

17 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering

18 Clustering: Definition (Document) clustering is the process of grouping a set of documents into clusters of similar documents. Documents within a cluster should be similar. Documents from different clusters should be dissimilar. Clustering is the most common form of unsupervised learning. Unsupervised = there are no labeled or annotated data. 272

19 Difference between clustering and classification Classification: supervised learning classes are human-defined and part of the input to the learning algorithm output = membership in class only Clustering: unsupervised learning Clusters are inferred from the data without human input. Output = membership in class + distance from centroid ("degree of cluster membership") 273

20 The cluster hypothesis Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs. All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis. Van Rijsbergen's original wording (1979): "closely associated documents tend to be relevant to the same requests". 274

21 Applications of Clustering IR: presentation of results (clustering of documents) Summarisation: clustering of similar documents for multi-document summarisation clustering of similar sentences for re-generation of sentences Topic Segmentation: clustering of similar paragraphs (adjacent or non-adjacent) for detection of topic structure/importance Lexical semantics: clustering of words by cooccurrence patterns 275

22 Scatter-Gather 276

23 Clustering search results 277

24 Clustering news articles 278

25 Clustering topical areas 279

26 Clustering what appears in AAG conferences (AAG = Association of American Geographers) 280

27 Clustering patents 281

28 Clustering terms 282

29 Types of Clustering Hard clustering vs. soft clustering Hard clustering: every object is a member of exactly one cluster Soft clustering: objects can be members of more than one cluster Hierarchical vs. non-hierarchical clustering Hierarchical clustering: pairs of most-similar clusters are iteratively linked until all objects are in a clustering relationship Non-hierarchical clustering results in flat clusters of similar documents 283

30 Desiderata for clustering General goal: put related docs in the same cluster, put unrelated docs in different clusters. We'll see different ways of formalizing this. The number of clusters should be appropriate for the data set we are clustering. Initially, we will assume the number of clusters K is given. There also exist semi-automatic methods for determining K Secondary goals in clustering Avoid very small and very large clusters Define clusters that are easy to explain to the user Many others

31 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering

32 Non-hierarchical (partitioning) clustering Partitional clustering algorithms produce a set of k non-nested partitions corresponding to k clusters of n objects. Advantage: it is not necessary to compare each object to each other object; only comparisons between objects and cluster centroids are necessary Optimal partitioning clustering algorithms are O(kn) Main algorithm: K-means 285

33 K-means: Basic idea

Each cluster j (with n_j elements x_i) is represented by its centroid c_j, the average vector of the cluster:

    c_j = (1/n_j) Σ_{i=1}^{n_j} x_i

Measure of cluster quality: minimise the mean square distance between the elements x_i and the nearest centroid c_j (residual sum of squares):

    RSS = Σ_{j=1}^{k} Σ_{x_i ∈ j} d(x_i, c_j)²

Distance: Euclidean; length-normalised vectors in VS

We iterate two steps: reassignment: assign each vector to its closest centroid recomputation: recompute each centroid as the average of the vectors that were recently assigned to it 286

34 K-means algorithm

Given: a set s_0 = {x_1, ..., x_n} ⊆ R^m
Given: a distance measure d: R^m × R^m → R
Given: a function for computing the mean µ: P(R^m) → R^m
Select k initial centers c_1, ..., c_k
while stopping criterion not true, i.e. Σ_{j=1}^{k} Σ_{x_i ∈ s_j} d(x_i, c_j)² ≥ ε do
    for all clusters s_j do (reassignment)
        s_j := {x_i | for all c_l: d(x_i, c_j) ≤ d(x_i, c_l)}
    end
    for all means c_j do (centroid recomputation)
        c_j := µ(s_j)
    end
end
287
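The reassignment/recomputation loop can be sketched in a few lines of Python (a simplified illustration, assuming points as tuples, squared Euclidean distance, and random selection of the k initial centers; not a reference implementation):

```python
import random

def squared_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Component-wise mean of a non-empty list of tuples (the centroid)."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def kmeans(points, k, max_iters=100, seed=0):
    centroids = random.Random(seed).sample(points, k)   # k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # reassignment: each point joins the cluster of its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda j: squared_dist(p, centroids[j]))
            clusters[closest].append(p)
        # recomputation: each centroid becomes the mean of its cluster
        new = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if new == centroids:          # fixed point reached: stop
            break
        centroids = new
    return centroids, clusters
```

On well-separated data, e.g. `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)`, the loop reaches a fixed point after a handful of iterations.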

35 Worked Example: Set of points to be clustered Exercise: (i) Guess what the optimal clustering into two clusters is in this case; (ii) compute the centroids of the clusters 288

36 Worked Example: Random seeds + assign points to closest centroid Iteration One 289

37 Worked Example: Recompute cluster centroids Iteration One 290

38 Worked Example: Assign points to closest centroid Iteration One 291

39 Worked Example: Recompute cluster centroids Iteration Two 292

40 Worked Example: Assign points to closest centroid Iteration Two 293

41 Worked Example: Recompute cluster centroids Iteration Three 294

42 Worked Example: Assign points to closest centroid Iteration Three 295

43 Worked Example: Recompute cluster centroids Iteration Four 296

44 Worked Example: Assign points to closest centroid Iteration Four 297

45 Worked Example: Recompute cluster centroids Iteration Five 298

46 Worked Example: Assign points to closest centroid Iteration Five 299

47 Worked Example: Recompute cluster centroids Iteration Six 300

48 Worked Example: Assign points to closest centroid Iteration Six 301

49 Worked Example: Recompute cluster centroids Iteration Seven 302

50 Worked Example: Centroids and assignments after convergence 303

51 K-means is guaranteed to converge: Proof RSS decreases during each reassignment step, because each vector is moved to a closer centroid RSS decreases during each recomputation step: this follows from the definition of a centroid; the new centroid is the vector for which RSS_k reaches its minimum There is only a finite number of clusterings. Thus: we must reach a fixed point. Finite set + monotonically decreasing evaluation function ⇒ convergence Assumption: ties are broken consistently. 304

52 Other properties of K-means Fast convergence K-means typically converges in around 10-20 iterations (if we don't care about a few documents switching back and forth) However, complete convergence can take many more iterations. Non-optimality K-means is not guaranteed to find the optimal solution. If we start with a bad set of seeds, the resulting clustering can be horrible. Dependence on initial centroids Solution 1: Use i different random clusterings, choose the one with lowest RSS Solution 2: Use a prior hierarchical clustering step to find seeds with good coverage of the document space 305

53 Time complexity of K-means Reassignment step: O(KNM) (we need to compute KN document-centroid distances, each of which costs O(M)) Recomputation step: O(NM) (we need to add each of the document's ≤ M values to one of the centroids) Assume the number of iterations is bounded by I Overall complexity: O(IKNM) linear in all important dimensions 306

54 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering

55 Hierarchical clustering Imagine we now want to create a hierarchy in the form of a binary tree. Assumes a similarity measure for determining the similarity of two clusters. Up to now, our similarity measures were for documents. We will look at different cluster similarity measures. Main algorithm: HAC (hierarchical agglomerative clustering) 307

56 HAC: Basic algorithm Start with each document in a separate cluster Then repeatedly merge the two clusters that are most similar Until there is only one cluster. The history of merging is a hierarchy in the form of a binary tree. The standard way of depicting this history is a dendrogram. 308

57 A dendrogram NYSE closing averages Hog prices tumble Oil prices slip Ag trade reform. Chrysler / Latin America Japanese prime minister / Mexico Fed holds interest rates steady Fed to keep interest rates steady Fed keeps interest rates steady Fed keeps interest rates steady Mexican markets British FTSE index War hero Colin Powell War hero Colin Powell Lloyd s CEO questioned Lloyd s chief / U.S. grilling Ohio Blue Cross Lawsuit against tobacco companies suits against tobacco firms Indiana tobacco lawsuit Viag stays positive Most active stocks CompuServe reports loss Sprint / Internet access service Planet Hollywood Trocadero: tripling of revenues Back to school spending is up German unions split Chains may raise prices Clinton signs law 309

58 Term-document matrix to document-document matrix Log frequency weighting and cosine normalisation. Applying the proximity metric to all pairs of documents creates the document-document matrix, which reports similarities/distances between objects (documents):

         SaS           PaP           WH
SaS      P(SaS,SaS)
PaP      P(SaS,PaP)    P(PaP,PaP)
WH       P(SaS,WH)     P(PaP,WH)     P(WH,WH)

The diagonal is trivial (identity) As proximity measures are symmetric, the matrix is a triangle 310
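Building the document-document matrix from length-normalised vectors can be sketched as follows (the three toy term-weight vectors are invented for illustration; only the document names SaS, PaP, WH come from the slide):

```python
import math

def unit(v):
    """Length-normalise a vector, so that dot product = cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

docs = {"SaS": [1.0, 0.5, 0.0],     # illustrative weighted term vectors
        "PaP": [0.9, 0.6, 0.1],
        "WH":  [0.1, 0.2, 1.0]}
normed = {name: unit(v) for name, v in docs.items()}

# symmetric document-document matrix of cosine similarities
P = {(a, b): sum(x * y for x, y in zip(normed[a], normed[b]))
     for a in docs for b in docs}
```

In practice only the lower triangle needs to be stored, since P(a, b) = P(b, a) and the diagonal is always 1.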

59 Hierarchical clustering: agglomerative (bottom-up, greedy)

Given: a set X = {x_1, ..., x_n} of objects
Given: a function sim: P(X) × P(X) → R

for i := 1 to n do
    c_i := {x_i}
end
C := {c_1, ..., c_n}
j := n + 1
while |C| > 1 do
    (c_n1, c_n2) := argmax_{(c_u, c_v) ∈ C × C, c_u ≠ c_v} sim(c_u, c_v)
    c_j := c_n1 ∪ c_n2
    C := (C \ {c_n1, c_n2}) ∪ {c_j}
    j := j + 1
end

The similarity function sim: P(X) × P(X) → R measures similarity between clusters, not objects 311
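A direct (inefficient but faithful) rendering of the agglomerative loop, assuming a precomputed object-object similarity table and single link by default; all names and inputs are illustrative:

```python
from itertools import combinations

def cluster_sim(cu, cv, sim, linkage=max):
    """Cluster-cluster similarity from object-object similarities."""
    return linkage(sim[frozenset((x, y))] for x in cu for y in cv)

def hac(objects, sim, linkage=max):
    """Return the merge history; linkage=max is single link, min complete link."""
    clusters = [frozenset([o]) for o in objects]
    history = []
    while len(clusters) > 1:
        # merge the two currently most similar clusters
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: cluster_sim(*pair, sim, linkage))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((set(a), set(b)))
    return history
```

The returned history is exactly the binary tree a dendrogram depicts: each entry records one merge.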

60 Computational complexity of the basic algorithm First, we compute the similarity of all N × N pairs of documents. Then, in each of N iterations: We scan the O(N × N) similarities to find the maximum similarity. We merge the two clusters with maximum similarity. We compute the similarity of the new cluster with all other (surviving) clusters. There are O(N) iterations, each performing an O(N × N) scan operation. Overall complexity is O(N³). Depending on the similarity function, a more efficient algorithm is possible. 312

61 Hierarchical clustering: similarity functions Similarity between two clusters c_u and c_v (with similarity measure s) can be interpreted in different ways: Single Link Function: similarity of the two most similar members sim(c_u, c_v) = max_{x ∈ c_u, y ∈ c_v} s(x, y) Complete Link Function: similarity of the two least similar members sim(c_u, c_v) = min_{x ∈ c_u, y ∈ c_v} s(x, y) Group Average Function: average similarity of each pair of group members sim(c_u, c_v) = avg_{x ∈ c_u, y ∈ c_v} s(x, y) 313
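The three cluster similarity functions translate one-to-one into code (a sketch; `s` is any object-object similarity function, here an assumed toy example on numbers):

```python
def single_link(cu, cv, s):
    """Similarity of the two most similar members."""
    return max(s(x, y) for x in cu for y in cv)

def complete_link(cu, cv, s):
    """Similarity of the two least similar members."""
    return min(s(x, y) for x in cu for y in cv)

def group_average(cu, cv, s):
    """Average similarity over all cross-cluster pairs."""
    sims = [s(x, y) for x in cu for y in cv]
    return sum(sims) / len(sims)

# toy similarity: closeness of two numbers (illustrative only)
s = lambda x, y: 1 / (1 + abs(x - y))
```

Single link is an optimist (one close pair suffices), complete link a pessimist (the worst pair decides), and group average sits in between.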

62 Example: hierarchical clustering; similarity functions Cluster 8 objects a-h; Euclidean distances (2D) shown in diagram [Figure: 8 points a-h in the plane, with the table of pairwise distances] 314

63 Single Link is O(n²) [Distance tables before and after step 4 (a+b, c+d, e+f, g+h merged); min-min at each step] 315

64 Clustering Result under Single Link [Figure: dendrogram of objects a-h under single link] 316

65 Complete Link [Distance tables before and after step 4 (a+b, c+d, e+f, g+h merged); max-min at each step] 317

66 Complete Link [Distance tables after step 4 (a+b, c+d, e+f, g+h merged); max-min at each step] ab/ef and cd/gh merge next 318

67 Clustering result under complete link [Figure: dendrogram of objects a-h under complete link] Complete Link is O(n³) 319

68 Example: gene expression data An example from biology: cluster genes by function Survey 112 rat genes which are suspected to participate in development of the CNS Take 9 data points: 5 embryonic (E11, E13, E15, E18, E21), 3 postnatal (P0, P7, P14) and one adult Measure expression of each gene (how much mRNA is in the cell?) These measures are normalised logs; for our purposes, we can consider them as weights Cluster analysis determines which genes operate at the same time 320

69 Rat CNS gene expression data (excerpt) [Table: normalised log expression levels, with GenBank locus IDs, at time points E11-E21, P0-P14 and adult, for genes including keratin, cellubrevin, nestin, MAP2, GAP43, L1, NFL, NFM, NFH, synaptophysin, neno, S100 beta, GFAP, MOG, GAD65, pre-GAD67, GAD67, G67I80/86, G67I86, GAT1, ChAT, ACHE, ODC, TH, NOS, GRa1]

70 Rat CNS gene clustering single link [Dendrogram: Clustering of Rat Expression Data (Single Link/Euclidean), leaves labelled with gene names] 322

71 Rat CNS gene clustering complete link [Dendrogram: Clustering of Rat Expression Data (Complete Link/Euclidean), leaves labelled with gene names] 323

72 Rat CNS gene clustering group average link [Dendrogram: Clustering of Rat Expression Data (Av Link/Euclidean), leaves labelled with gene names] 324

73 Flat or hierarchical clustering? When a hierarchical structure is desired: use a hierarchical algorithm Humans are bad at interpreting hierarchical clusterings (unless cleverly visualised) For high efficiency, use flat clustering For deterministic results, use HAC HAC can also be applied if K cannot be predetermined (can start without knowing K) 325

74 Take-away Partitional clustering Provides less information but is more efficient (best: O(kn)) K-means Hierarchical clustering Best algorithms O(n²) complexity Single-link vs. complete-link (vs. group-average) Hierarchical and non-hierarchical clustering fulfil different needs 326

75 Reading MRS Chapters 16 and 17


More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 Information Retrieval Lecture 9: IR Evaluation 9 Ch. 7 Last Time The VSM Reloaded optimized for your pleasure! Improvements to the computation and selection

More information

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

Clustering (COSC 488) Nazli Goharian. Document Clustering.

Clustering (COSC 488) Nazli Goharian. Document Clustering. Clustering (COSC 488) Nazli Goharian nazli@ir.cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

More information

Clustering: Overview and K-means algorithm

Clustering: Overview and K-means algorithm Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Retrieval Evaluation. Hongning Wang

Retrieval Evaluation. Hongning Wang Retrieval Evaluation Hongning Wang CS@UVa What we have learned so far Indexed corpus Crawler Ranking procedure Research attention Doc Analyzer Doc Rep (Index) Query Rep Feedback (Query) Evaluation User

More information

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

INF4820, Algorithms for AI and NLP: Hierarchical Clustering INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (4/4) March 9, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Chris Manning, Pandu Nayak and Prabhakar Raghavan Evaluation 1 Situation Thanks to your stellar performance in CS276, you

More information

Web Information Retrieval. Exercises Evaluation in information retrieval

Web Information Retrieval. Exercises Evaluation in information retrieval Web Information Retrieval Exercises Evaluation in information retrieval Evaluating an IR system Note: information need is translated into a query Relevance is assessed relative to the information need

More information

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering

More information

Evaluation. David Kauchak cs160 Fall 2009 adapted from:

Evaluation. David Kauchak cs160 Fall 2009 adapted from: Evaluation David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture8-evaluation.ppt Administrative How are things going? Slides Points Zipf s law IR Evaluation For

More information

Administrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES

Administrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning BANANAS APPLES Administrative Machine learning: Unsupervised learning" Assignment 5 out soon David Kauchak cs311 Spring 2013 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Machine

More information

MIA - Master on Artificial Intelligence

MIA - Master on Artificial Intelligence MIA - Master on Artificial Intelligence 1 Hierarchical Non-hierarchical Evaluation 1 Hierarchical Non-hierarchical Evaluation The Concept of, proximity, affinity, distance, difference, divergence We use

More information

Clustering: Overview and K-means algorithm

Clustering: Overview and K-means algorithm Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin

More information

Clustering algorithms

Clustering algorithms Clustering algorithms Machine Learning Hamid Beigy Sharif University of Technology Fall 1393 Hamid Beigy (Sharif University of Technology) Clustering algorithms Fall 1393 1 / 22 Table of contents 1 Supervised

More information

Hierarchical Clustering 4/5/17

Hierarchical Clustering 4/5/17 Hierarchical Clustering 4/5/17 Hypothesis Space Continuous inputs Output is a binary tree with data points as leaves. Useful for explaining the training data. Not useful for making new predictions. Direction

More information

Building Test Collections. Donna Harman National Institute of Standards and Technology

Building Test Collections. Donna Harman National Institute of Standards and Technology Building Test Collections Donna Harman National Institute of Standards and Technology Cranfield 2 (1962-1966) Goal: learn what makes a good indexing descriptor (4 different types tested at 3 levels of

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Text Clustering Prof. Chris Clifton 19 October 2018 Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti Document clustering Motivations Document

More information

Behavioral Data Mining. Lecture 18 Clustering

Behavioral Data Mining. Lecture 18 Clustering Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems

More information

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Clustering Algorithms for general similarity measures

Clustering Algorithms for general similarity measures Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative

More information

CSE 494 Project C. Garrett Wolf

CSE 494 Project C. Garrett Wolf CSE 494 Project C Garrett Wolf Introduction The main purpose of this project task was for us to implement the simple k-means and buckshot clustering algorithms. Once implemented, we were asked to vary

More information

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #14: Clustering Seoul National University 1 In This Lecture Learn the motivation, applications, and goal of clustering Understand the basic methods of clustering (bottom-up

More information

What to come. There will be a few more topics we will cover on supervised learning

What to come. There will be a few more topics we will cover on supervised learning Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression

More information

Informa(on Retrieval

Informa(on Retrieval Introduc*on to Informa(on Retrieval Lecture 8: Evalua*on 1 Sec. 6.2 This lecture How do we know if our results are any good? Evalua*ng a search engine Benchmarks Precision and recall 2 EVALUATING SEARCH

More information

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann Search Engines Chapter 8 Evaluating Search Engines 9.7.2009 Felix Naumann Evaluation 2 Evaluation is key to building effective and efficient search engines. Drives advancement of search engines When intuition

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University Kinds of Clustering Sequential Fast Cost Optimization Fixed number of clusters Hierarchical

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other

More information

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Lecture 15 Clustering. Oct

Lecture 15 Clustering. Oct Lecture 15 Clustering Oct 31 2008 Unsupervised learning and pattern discovery So far, our data has been in this form: x 11,x 21, x 31,, x 1 m y1 x 12 22 2 2 2,x, x 3,, x m y We will be looking at unlabeled

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Expectation Maximization!

Expectation Maximization! Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University and http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Steps in Clustering Select Features

More information

Association Rule Mining and Clustering

Association Rule Mining and Clustering Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan Today s topic CS347 Clustering documents Lecture 8 May 7, 2001 Prabhakar Raghavan Why cluster documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics

More information

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Unsupervised Learning Partitioning Methods

Unsupervised Learning Partitioning Methods Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form

More information

Lecture 4 Hierarchical clustering

Lecture 4 Hierarchical clustering CSE : Unsupervised learning Spring 00 Lecture Hierarchical clustering. Multiple levels of granularity So far we ve talked about the k-center, k-means, and k-medoid problems, all of which involve pre-specifying

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

What is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology

What is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology Clustering Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods Data Clustering 1 What is Clustering? Form o unsupervised learning - no inormation

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Chapter 9. Classification and Clustering

Chapter 9. Classification and Clustering Chapter 9 Classification and Clustering Classification and Clustering Classification and clustering are classical pattern recognition and machine learning problems Classification, also referred to as categorization

More information

Cluster Analysis: Agglomerate Hierarchical Clustering

Cluster Analysis: Agglomerate Hierarchical Clustering Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical

More information

Clust Clus e t ring 2 Nov

Clust Clus e t ring 2 Nov Clustering 2 Nov 3 2008 HAC Algorithm Start t with all objects in their own cluster. Until there is only one cluster: Among the current clusters, determine the two clusters, c i and c j, that are most

More information