Lecture 6: Clustering
1 Lecture 6: Clustering Information Retrieval Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group Lent 2014
2 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering
3 Precision and recall. Contingency table (what the system thinks vs. the truth): Retrieved & Relevant = true positives (TP); Retrieved & Nonrelevant = false positives (FP); Not retrieved & Relevant = false negatives (FN); Not retrieved & Nonrelevant = true negatives (TN). P = TP/(TP+FP), R = TP/(TP+FN) 258
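To make the two formulas concrete, here is a minimal Python sketch (not from the lecture; the document ids and relevance judgements below are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    """Compute P = TP/(TP+FP) and R = TP/(TP+FN) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)          # relevant and retrieved
    fp = len(retrieved - relevant)          # retrieved but not relevant
    fn = len(relevant - retrieved)          # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved documents are relevant, out of 6 relevant overall.
print(precision_recall([1, 2, 3, 4, 5], [2, 3, 5, 8, 9, 10]))  # (0.6, 0.5)
```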
4 Precision/Recall Graph (figure: ranked document list with the precision and recall values obtained at each rank) 259
5 Avg 11pt prec: area under the normalised P/R graph. $P_{11pt} = \frac{1}{11} \sum_{j=0}^{10} \frac{1}{N} \sum_{i=1}^{N} \tilde{P}_i(r_j)$, where $r_j$ ranges over the 11 standard recall levels $0.0, 0.1, \dots, 1.0$, $N$ is the number of queries, and $\tilde{P}_i(r_j)$ is the interpolated precision of query $i$ at recall level $r_j$.
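A small sketch of how the interpolated values P~_i(r_j) for one query might be computed, under the usual reading of interpolated precision (maximum precision at any recall >= r_j); the function and data are my own illustration, not the lecturer's code:

```python
def interpolated_precision(ranked_relevance, n_relevant, recall_levels):
    """Interpolated precision P~(r) = max precision at any recall >= r,
    evaluated at each of the given standard recall levels."""
    points, tp = [], 0
    for k, rel in enumerate(ranked_relevance, start=1):
        tp += rel
        points.append((tp / n_relevant, tp / k))   # (recall, precision) after rank k
    out = []
    for r in recall_levels:
        candidates = [p for rec, p in points if rec >= r]
        out.append(max(candidates) if candidates else 0.0)
    return out

levels = [j / 10 for j in range(11)]               # the 11 standard recall levels
# one query: relevance (1/0) of the top 10 ranked documents; 4 relevant docs in total
p11 = interpolated_precision([1, 0, 1, 0, 0, 1, 0, 0, 1, 0], 4, levels)
print(sum(p11) / 11)                               # this query's 11-point average,
                                                   # before averaging over the N queries
```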
6 Mean Average Precision (MAP). $MAP = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{N_j} \sum_{i=1}^{N_j} P(doc_i)$, where $|Q|$ is the number of queries, $N_j$ the number of relevant documents for query $j$, and $P(doc_i)$ the precision at the rank of the $i$-th relevant document. (The slide works the formula through for two example queries.)
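The same formula as an illustrative Python sketch (my own names and invented data): each query contributes the sum of the precisions measured at the ranks of its relevant documents, divided by that query's number of relevant documents N_j, and MAP averages these per-query values.

```python
def average_precision(ranked_relevance, n_relevant):
    """AP for one query: mean of P(doc_i) over the N_j relevant documents."""
    tp, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            tp += 1
            precisions.append(tp / k)   # precision at the rank of this relevant doc
    # relevant documents that are never retrieved contribute precision 0
    return sum(precisions) / n_relevant if n_relevant else 0.0

def mean_average_precision(queries):
    """MAP = mean of the per-query average precisions."""
    return sum(average_precision(rels, n) for rels, n in queries) / len(queries)

# Two toy queries: (ranked relevance judgements, total number of relevant docs)
queries = [([1, 0, 1, 0, 0, 1, 0, 0, 1, 1], 5),
           ([0, 1, 0, 0, 1, 1, 0, 0, 0, 0], 3)]
print(mean_average_precision(queries))
```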
7 What we need for a benchmark A collection of documents Documents must be representative of the documents we expect to see in reality. There must be many documents; abstracts (as in the Cranfield experiment) are no longer sufficient to model modern retrieval. A collection of information needs... which we will often incorrectly refer to as queries. Information needs must be representative of the information needs we expect to see in reality. Human relevance assessments We need to hire/pay judges or assessors to do this. Expensive, time-consuming. Judges must be representative of the users we expect to see in reality. 262
8 Second-generation relevance benchmark: TREC TREC = Text Retrieval Conference, organized by the U.S. National Institute of Standards and Technology (NIST). TREC is actually a set of several different relevance benchmarks. Best known: TREC Ad Hoc, used for the first 8 TREC evaluations between 1992 and 1999: 1.89 million documents, mainly newswire articles, and 450 information needs. No exhaustive relevance judgments (too expensive). Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed. 263
9 Sample TREC Query <num> Number: 508 <title> hair loss is a symptom of what diseases <desc> Description: Find diseases for which hair loss is a symptom. <narr> Narrative: A document is relevant if it positively connects the loss of head hair in humans with a specific disease. In this context, thinning hair and hair loss are synonymous. Loss of body and/or facial hair is irrelevant, as is hair loss caused by drug therapy. 264
10 TREC Relevance Judgements Humans decide which document-query pairs are relevant. 265
11 Interjudge agreement at TREC (table: per information need, the number of documents judged and the number of disagreements). Observation: Judges disagree a lot. This means a large impact on the absolute performance numbers of each system, but virtually no impact on the ranking of systems. So, the results of information retrieval experiments of this kind can reliably tell us whether system A is better than system B, even if judges disagree. 266
12 Example of more recent benchmark: ClueWeb09 1 billion web pages, 25 terabytes (compressed: 5 terabytes). Collected January/February 2009, in 10 languages. Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed). Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed). 267
13 Evaluation at large search engines Recall is difficult to measure on the web. Search engines often use precision at top k, e.g., k = 10, or use measures that reward you more for getting rank 1 right than for getting rank 10 right. Search engines also use non-relevance-based measures. Example 1: clickthrough on first result. Not very reliable if you look at a single clickthrough (you may realize after clicking that the summary was misleading and the document is nonrelevant)... but pretty reliable in the aggregate. Example 2: A/B testing 268
14 A/B testing Purpose: Test a single innovation Prerequisite: You have a large search engine up and running. Have most users use old system Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation Evaluate with an automatic measure like clickthrough on first result Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most 269
15 Reading MRS, Chapter 8 270
16 Upcoming What is clustering? Applications of clustering in information retrieval K-means algorithm Introduction to hierarchical clustering Single-link and complete-link clustering 271
17 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering
18 Clustering: Definition (Document) clustering is the process of grouping a set of documents into clusters of similar documents. Documents within a cluster should be similar. Documents from different clusters should be dissimilar. Clustering is the most common form of unsupervised learning. Unsupervised = there are no labeled or annotated data. 272
19 Difference clustering vs. classification Classification: supervised learning; classes are human-defined and part of the input to the learning algorithm; output = membership in class only. Clustering: unsupervised learning; clusters are inferred from the data without human input; output = membership in class + distance from centroid ("degree of cluster membership"). 273
20 The cluster hypothesis Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance to information needs. All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis. Van Rijsbergen's original wording (1979): "closely associated documents tend to be relevant to the same requests". 274
21 Applications of Clustering IR: presentation of results (clustering of documents) Summarisation: clustering of similar documents for multi-document summarisation clustering of similar sentences for re-generation of sentences Topic Segmentation: clustering of similar paragraphs (adjacent or non-adjacent) for detection of topic structure/importance Lexical semantics: clustering of words by cooccurrence patterns 275
22 Scatter-Gather 276
23 Clustering search results 277
24 Clustering news articles 278
25 Clustering topical areas 279
26 Clustering what appears in AAG conferences (AAG = Association of American Geographers) 280
27 Clustering patents 281
28 Clustering terms 282
29 Types of Clustering Hard clustering vs. soft clustering Hard clustering: every object is a member of exactly one cluster Soft clustering: objects can be members of more than one cluster Hierarchical vs. non-hierarchical clustering Hierarchical clustering: pairs of most-similar clusters are iteratively linked until all objects are in a clustering relationship Non-hierarchical clustering results in flat clusters of similar documents 283
30 Desiderata for clustering General goal: put related docs in the same cluster, put unrelated docs in different clusters. We'll see different ways of formalizing this. The number of clusters should be appropriate for the data set we are clustering. Initially, we will assume the number of clusters K is given. There also exist semi-automatic methods for determining K. Secondary goals in clustering: avoid very small and very large clusters; define clusters that are easy to explain to the user; many others
31 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering
32 Non-hierarchical (partitioning) clustering Partitional clustering algorithms produce a set of k non-nested partitions corresponding to k clusters of n objects. Advantage: it is not necessary to compare each object to every other object; only comparisons between objects and cluster centroids are needed. Optimal partitioning clustering algorithms are O(kn). Main algorithm: K-means 285
33 K-means: Basic idea Each cluster j (with n_j elements x_i) is represented by its centroid c_j, the average vector of the cluster: $\vec{c}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} \vec{x}_i$. Measure of cluster quality: minimise the squared distance between elements x_i and their nearest centroid c_j: $RSS = \sum_{j=1}^{k} \sum_{\vec{x}_i \in s_j} d(\vec{x}_i, \vec{c}_j)^2$. Distance: Euclidean; length-normalised vectors in VS. We iterate two steps: reassignment: assign each vector to its closest centroid; recomputation: recompute each centroid as the average of the vectors that were recently assigned to it. 286
34 K-means algorithm Given: a set $s_0 = \{\vec{x}_1, \dots, \vec{x}_n\} \subseteq \mathbb{R}^m$; a distance measure $d : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$; a function for computing the mean $\mu : \mathcal{P}(\mathbb{R}^m) \to \mathbb{R}^m$. Select k initial centers $\vec{c}_1, \dots, \vec{c}_k$. while the stopping criterion ($\sum_{j=1}^{k} \sum_{\vec{x}_i \in s_j} d(\vec{x}_i, \vec{c}_j)^2 < \epsilon$) is not true do: for all clusters $s_j$ do (reassignment): $s_j := \{\vec{x}_i \mid \forall \vec{c}_l : d(\vec{x}_i, \vec{c}_j) \leq d(\vec{x}_i, \vec{c}_l)\}$; for all means $\vec{c}_j$ do (centroid recomputation): $\vec{c}_j := \mu(s_j)$. 287
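A compact Python sketch of the same loop (reassignment followed by centroid recomputation, stopping at a fixed point). This is my own illustrative implementation with plain Euclidean distance and invented points, not code from the course:

```python
import random

def euclidean_sq(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # select k initial centers
    assignment = [None] * len(points)
    for _ in range(max_iters):
        # reassignment: each point goes to its closest centroid
        new_assignment = [min(range(k), key=lambda j: euclidean_sq(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:         # fixed point reached: stop
            break
        assignment = new_assignment
        # recomputation: each centroid becomes the mean of its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                          # keep old centroid if cluster is empty
                centroids[j] = mean(members)
    return assignment, centroids

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
print(kmeans(points, k=2))
```

Stopping when the assignment no longer changes corresponds to the fixed point argued for in the convergence proof later in the lecture.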
35 Worked Example: Set of points to be clustered Exercise: (i) Guess what the optimal clustering into two clusters is in this case; (ii) compute the centroids of the clusters 288
36 Random seeds + Assign points to closest center Iteration One 289
37 Worked Example: Recompute cluster centroids Iteration One 290
38 Worked Example: Assign points to closest centroid Iteration One 291
39 Worked Example: Recompute cluster centroids Iteration Two 292
40 Worked Example: Assign points to closest centroid Iteration Two 293
41 Worked Example: Recompute cluster centroids Iteration Three 294
42 Worked Example: Assign points to closest centroid Iteration Three 295
43 Worked Example: Recompute cluster centroids Iteration Four 296
44 Worked Example: Assign points to closest centroid Iteration Four 297
45 Worked Example: Recompute cluster centroids Iteration Five 298
46 Worked Example: Assign points to closest centroid Iteration Five 299
47 Worked Example: Recompute cluster centroids Iteration Six 300
48 Worked Example: Assign points to closest centroid Iteration Six 301
49 Worked Example: Recompute cluster centroids Iteration Seven 302
50 Worked Ex.: Centroids and assignments after convergence 303
51 K-means is guaranteed to converge: Proof RSS decreases during each reassignment step, because each vector is moved to a closer centroid. RSS decreases during each recomputation step; this follows from the definition of a centroid: the new centroid is the vector for which RSS_k reaches its minimum. There is only a finite number of clusterings. Thus: we must reach a fixed point. (Finite set & monotonically decreasing evaluation function ⇒ convergence.) Assumption: ties are broken consistently. 304
52 Other properties of K-means Fast convergence: K-means typically converges in around 10-20 iterations (if we don't care about a few documents switching back and forth); however, complete convergence can take many more iterations. Non-optimality: K-means is not guaranteed to find the optimal solution; if we start with a bad set of seeds, the resulting clustering can be horrible. Dependence on initial centroids: Solution 1: use i clusterings, choose the one with lowest RSS; Solution 2: use a prior hierarchical clustering step to find seeds with good coverage of the document space. 305
53 Time complexity of K-means Reassignment step: O(KNM) (we need to compute KN document-centroid distances, each of which costs O(M)). Recomputation step: O(NM) (we need to add each of the document's < M values to one of the centroids). Assume the number of iterations is bounded by I. Overall complexity: O(IKNM), i.e. linear in all important dimensions. 306
54 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering
55 Hierarchical clustering Imagine we now want to create a hierarchy in the form of a binary tree. Assumes a similarity measure for determining the similarity of two clusters. Up to now, our similarity measures were for documents. We will look at different cluster similarity measures. Main algorithm: HAC (hierarchical agglomerative clustering) 307
56 HAC: Basic algorithm Start with each document in a separate cluster Then repeatedly merge the two clusters that are most similar Until there is only one cluster. The history of merging is a hierarchy in the form of a binary tree. The standard way of depicting this history is a dendrogram. 308
57 A dendrogram (the leaves are document titles): NYSE closing averages Hog prices tumble Oil prices slip Ag trade reform. Chrysler / Latin America Japanese prime minister / Mexico Fed holds interest rates steady Fed to keep interest rates steady Fed keeps interest rates steady Fed keeps interest rates steady Mexican markets British FTSE index War hero Colin Powell War hero Colin Powell Lloyd's CEO questioned Lloyd's chief / U.S. grilling Ohio Blue Cross Lawsuit against tobacco companies suits against tobacco firms Indiana tobacco lawsuit Viag stays positive Most active stocks CompuServe reports loss Sprint / Internet access service Planet Hollywood Trocadero: tripling of revenues Back to school spending is up German unions split Chains may raise prices Clinton signs law 309
58 Term-document matrix to document-document matrix. Log frequency weighting and cosine normalisation. The document-document matrix has one row and one column per document (here SaS, PaP, WH) and entry P(d_i, d_j) for each pair: P(SaS,SaS), P(SaS,PaP), P(SaS,WH), P(PaP,PaP), P(PaP,WH), P(WH,WH). Applying the proximity metric to all pairs of documents creates the document-document matrix, which reports similarities/distances between objects (documents). The diagonal is trivial (identity). As proximity measures are symmetric, the matrix is a triangle. 310
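A minimal sketch of this construction, assuming cosine similarity on log-weighted, length-normalised vectors as stated on the slide; the toy term counts for SaS, PaP and WH are illustrative only:

```python
import math

def log_tf_normalised(tf_vector):
    """Log frequency weighting followed by cosine (length) normalisation."""
    weights = [1 + math.log10(tf) if tf > 0 else 0.0 for tf in tf_vector]
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

def doc_doc_matrix(term_doc_columns):
    """Cosine similarities between all pairs of documents (columns of the
    term-document matrix). Symmetric, with a trivial diagonal of 1.0."""
    docs = [log_tf_normalised(col) for col in term_doc_columns]
    n = len(docs)
    return [[sum(a * b for a, b in zip(docs[i], docs[j])) for j in range(n)]
            for i in range(n)]

# Toy term-document matrix: columns are documents SaS, PaP, WH; rows are terms.
sas, pap, wh = [115, 10, 2, 0], [58, 7, 0, 0], [20, 11, 6, 38]
for row in doc_doc_matrix([sas, pap, wh]):
    print(["%.2f" % v for v in row])
```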
59 Hierarchical clustering: agglomerative (bottom-up, greedy) Given: a set $X = \{x_1, \dots, x_n\}$ of objects; a function $sim : \mathcal{P}(X) \times \mathcal{P}(X) \to \mathbb{R}$. for i := 1 to n do $c_i := \{x_i\}$; $C := \{c_1, \dots, c_n\}$; $j := n+1$; while $|C| > 1$ do: $(c_{n_1}, c_{n_2}) := \arg\max_{(c_u, c_v) \in C \times C,\, c_u \neq c_v} sim(c_u, c_v)$; $c_j := c_{n_1} \cup c_{n_2}$; $C := (C \setminus \{c_{n_1}, c_{n_2}\}) \cup \{c_j\}$; $j := j+1$. The similarity function $sim : \mathcal{P}(X) \times \mathcal{P}(X) \to \mathbb{R}$ measures similarity between clusters, not objects. 311
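The same greedy loop as a naive Python sketch (my own names; the cluster-level similarity sim is passed in, anticipating the linkage functions on the next slide):

```python
def hac(objects, sim):
    """Naive agglomerative clustering: repeatedly merge the two most similar clusters.
    `sim` scores two clusters (sets of objects); returns the merge history."""
    clusters = [frozenset([x]) for x in objects]
    merges = []                                    # the merge history is the dendrogram
    while len(clusters) > 1:
        # O(|C|^2) scan for the most similar pair of distinct clusters
        u, v = max(((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
                   key=lambda pair: sim(*pair))
        clusters = [c for c in clusters if c not in (u, v)] + [u | v]
        merges.append((u, v))
    return merges

# tiny demo: 1-D points, single-link similarity = negative smallest pairwise distance
single_link_sim = lambda cu, cv: -min(abs(x - y) for x in cu for y in cv)
for u, v in hac([1.0, 1.2, 4.0, 4.1, 9.0], single_link_sim):
    print(sorted(u), "+", sorted(v))
```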
60 Computational complexity of the basic algorithm First, we compute the similarity of all N × N pairs of documents. Then, in each of N iterations: we scan the O(N × N) similarities to find the maximum similarity; we merge the two clusters with maximum similarity; we compute the similarity of the new cluster with all other (surviving) clusters. There are O(N) iterations, each performing an O(N × N) scan operation. Overall complexity is O(N^3). Depending on the similarity function, a more efficient algorithm is possible. 312
61 Hierarchical clustering: similarity functions Similarity between two clusters c_u and c_v (with object-level similarity measure s) can be interpreted in different ways: Single Link Function: similarity of the two most similar members, $sim(c_u, c_v) = \max_{x \in c_u, y \in c_v} s(x, y)$. Complete Link Function: similarity of the two least similar members, $sim(c_u, c_v) = \min_{x \in c_u, y \in c_v} s(x, y)$. Group Average Function: average similarity of each pair of group members, $sim(c_u, c_v) = \operatorname{avg}_{x \in c_u, y \in c_v} s(x, y)$. 313
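Expressed in code, the three linkage functions can be written as factories that turn an object-level similarity s into a cluster-level similarity, pluggable into the hac sketch above (again an illustrative sketch, not the course's code):

```python
def single_link(s):
    """sim(c_u, c_v) = max over member pairs of s(x, y)."""
    return lambda cu, cv: max(s(x, y) for x in cu for y in cv)

def complete_link(s):
    """sim(c_u, c_v) = min over member pairs of s(x, y)."""
    return lambda cu, cv: min(s(x, y) for x in cu for y in cv)

def group_average(s):
    """sim(c_u, c_v) = average of s(x, y) over all member pairs."""
    return lambda cu, cv: sum(s(x, y) for x in cu for y in cv) / (len(cu) * len(cv))

# e.g. an object-level similarity for 1-D points, then plug into the hac sketch above:
s = lambda x, y: -abs(x - y)
# hac([1.0, 1.2, 4.0, 4.1, 9.0], complete_link(s))
```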
62 Example: hierarchical clustering; similarity functions. Cluster 8 objects a-h; Euclidean distances (2D) shown in diagram (figure: the points a-h in the plane and their pairwise distance matrix). 314
63 Single Link is O(n^2) (figure: distance matrix for a-h; reduced matrix after step 4, when the pairs a b, c d, e f, g h have been merged). Min-min at each step. 315
64 Clustering Result under Single Link (figure: dendrogram over a b c d e f g h) 316
65 Complete Link (figure: distance matrix for a-h; reduced matrix after step 4, when the pairs a b, c d, e f, g h have been merged). Max-min at each step. 317
66 Complete Link (figure: continuation of the previous matrix). Max-min at each step; ab/ef and cd/gh merge next. 318
67 Clustering result under complete link (figure: dendrogram over a b c d e f g h). Complete Link is O(n^3). 319
68 Example: gene expression data An example from biology: cluster genes by function. Survey of 112 rat genes which are suspected to participate in development of the CNS. Take 9 data points: 5 embryonic (E11, E13, E15, E18, E21), 3 postnatal (P0, P7, P14) and one adult. Measure expression of each gene (how much mRNA is in the cell?). These measures are normalised logs; for our purposes, we can consider them as weights. Cluster analysis determines which genes operate at the same time. 320
69 Rat CNS gene expression data (excerpt) (table: for each gene (keratin, cellubrevin, nestin, MAP2, GAP43, L1, NFL, NFM, NFH, synaptophysin, neno, S100 beta, GFAP, MOG, GAD65, pre-GAD67, GAD67, G67I80/86, G67I86, GAT1, ChAT, ACHE, ODC, TH, NOS, GRa1), its GenBank locus and its expression values at E11, E13, E15, E18, E21, P0, P7, P14 and adult)
70 Rat CNS gene clustering, single link (figure: dendrogram over the 112 genes). Clustering of Rat Expression Data (Single Link/Euclidean) 322
71 Rat CNS gene clustering, complete link (figure: dendrogram over the 112 genes). Clustering of Rat Expression Data (Complete Link/Euclidean) 323
72 Rat CNS gene clustering, group average link (figure: dendrogram over the 112 genes). Clustering of Rat Expression Data (Av Link/Euclidean) 324
73 Flat or hierarchical clustering? When a hierarchical structure is desired: hierarchical algorithm. Humans are bad at interpreting hierarchical clusterings (unless cleverly visualised). For high efficiency, use flat clustering. For deterministic results, use HAC. HAC can also be applied if K cannot be predetermined (it can start without knowing K). 325
74 Take-away Partitional clustering provides less information but is more efficient (best: O(kn)); K-means. Hierarchical clustering: best algorithms have O(n^2) complexity; single-link vs. complete-link (vs. group-average). Hierarchical and non-hierarchical clustering fulfil different needs. 326
75 Reading MRS, Chapters 16 and 17