Lecture 6: Clustering
1 Lecture 6: Clustering Information Retrieval Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group Lent 2014
2 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering
3 Precision and recall. Contingency table (what the system thinks vs. the truth): Retrieved & Relevant = true positives (TP); Retrieved & Nonrelevant = false positives (FP); Not retrieved & Relevant = false negatives (FN); Not retrieved & Nonrelevant = true negatives (TN). P = TP/(TP+FP), R = TP/(TP+FN) 258
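To make the two formulas concrete, here is a minimal Python sketch (not from the lecture; the document ids and relevance judgements below are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    """Compute P = TP/(TP+FP) and R = TP/(TP+FN) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)          # relevant and retrieved
    fp = len(retrieved - relevant)          # retrieved but not relevant
    fn = len(relevant - retrieved)          # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved documents are relevant, out of 6 relevant overall.
print(precision_recall([1, 2, 3, 4, 5], [2, 3, 5, 8, 9, 10]))  # (0.6, 0.5)
```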
4 Precision/Recall Graph (figure: ranked document list with the precision and recall values obtained at each rank) 259
5 Avg 11pt prec: area under the normalised P/R graph. $P_{11pt} = \frac{1}{11} \sum_{j=0}^{10} \frac{1}{N} \sum_{i=1}^{N} \tilde{P}_i(r_j)$, where $r_j$ ranges over the 11 standard recall levels $0.0, 0.1, \dots, 1.0$, $N$ is the number of queries, and $\tilde{P}_i(r_j)$ is the interpolated precision of query $i$ at recall level $r_j$.
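A small sketch of how the interpolated values P~_i(r_j) for one query might be computed, under the usual reading of interpolated precision (maximum precision at any recall >= r_j); the function and data are my own illustration, not the lecturer's code:

```python
def interpolated_precision(ranked_relevance, n_relevant, recall_levels):
    """Interpolated precision P~(r) = max precision at any recall >= r,
    evaluated at each of the given standard recall levels."""
    points, tp = [], 0
    for k, rel in enumerate(ranked_relevance, start=1):
        tp += rel
        points.append((tp / n_relevant, tp / k))   # (recall, precision) after rank k
    out = []
    for r in recall_levels:
        candidates = [p for rec, p in points if rec >= r]
        out.append(max(candidates) if candidates else 0.0)
    return out

levels = [j / 10 for j in range(11)]               # the 11 standard recall levels
# one query: relevance (1/0) of the top 10 ranked documents; 4 relevant docs in total
p11 = interpolated_precision([1, 0, 1, 0, 0, 1, 0, 0, 1, 0], 4, levels)
print(sum(p11) / 11)                               # this query's 11-point average,
                                                   # before averaging over the N queries
```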
6 Mean Average Precision (MAP). $MAP = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{N_j} \sum_{i=1}^{N_j} P(doc_i)$, where $|Q|$ is the number of queries, $N_j$ the number of relevant documents for query $j$, and $P(doc_i)$ the precision at the rank of the $i$-th relevant document. (The slide works the formula through for two example queries.)
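The same formula as an illustrative Python sketch (my own names and invented data): each query contributes the sum of the precisions measured at the ranks of its relevant documents, divided by that query's number of relevant documents N_j, and MAP averages these per-query values.

```python
def average_precision(ranked_relevance, n_relevant):
    """AP for one query: mean of P(doc_i) over the N_j relevant documents."""
    tp, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            tp += 1
            precisions.append(tp / k)   # precision at the rank of this relevant doc
    # relevant documents that are never retrieved contribute precision 0
    return sum(precisions) / n_relevant if n_relevant else 0.0

def mean_average_precision(queries):
    """MAP = mean of the per-query average precisions."""
    return sum(average_precision(rels, n) for rels, n in queries) / len(queries)

# Two toy queries: (ranked relevance judgements, total number of relevant docs)
queries = [([1, 0, 1, 0, 0, 1, 0, 0, 1, 1], 5),
           ([0, 1, 0, 0, 1, 1, 0, 0, 0, 0], 3)]
print(mean_average_precision(queries))
```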
7 What we need for a benchmark A collection of documents Documents must be representative of the documents we expect to see in reality. There must be many documents; abstracts (as in the Cranfield experiment) are no longer sufficient to model modern retrieval. A collection of information needs... which we will often incorrectly refer to as queries. Information needs must be representative of the information needs we expect to see in reality. Human relevance assessments We need to hire/pay judges or assessors to do this. Expensive, time-consuming. Judges must be representative of the users we expect to see in reality. 262
8 Second-generation relevance benchmark: TREC TREC = Text Retrieval Conference, organized by the U.S. National Institute of Standards and Technology (NIST). TREC is actually a set of several different relevance benchmarks. Best known: TREC Ad Hoc, used for the first 8 TREC evaluations between 1992 and 1999: 1.89 million documents, mainly newswire articles, and 450 information needs. No exhaustive relevance judgments (too expensive). Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed. 263
9 Sample TREC Query <num> Number: 508 <title> hair loss is a symptom of what diseases <desc> Description: Find diseases for which hair loss is a symptom. <narr> Narrative: A document is relevant if it positively connects the loss of head hair in humans with a specific disease. In this context, thinning hair and hair loss are synonymous. Loss of body and/or facial hair is irrelevant, as is hair loss caused by drug therapy. 264
10 TREC Relevance Judgements Humans decide which document-query pairs are relevant. 265
11 Interjudge agreement at TREC (table: per information need, the number of documents judged and the number of disagreements). Observation: Judges disagree a lot. This means a large impact on the absolute performance numbers of each system, but virtually no impact on the ranking of systems. So, the results of information retrieval experiments of this kind can reliably tell us whether system A is better than system B, even if judges disagree. 266
12 Example of more recent benchmark: ClueWeb09 1 billion web pages, 25 terabytes (compressed: 5 terabytes). Collected January/February 2009, in 10 languages. Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed). Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed). 267
13 Evaluation at large search engines Recall is difficult to measure on the web. Search engines often use precision at top k, e.g., k = 10, or use measures that reward you more for getting rank 1 right than for getting rank 10 right. Search engines also use non-relevance-based measures. Example 1: clickthrough on first result. Not very reliable if you look at a single clickthrough (you may realize after clicking that the summary was misleading and the document is nonrelevant)... but pretty reliable in the aggregate. Example 2: A/B testing 268
14 A/B testing Purpose: Test a single innovation Prerequisite: You have a large search engine up and running. Have most users use old system Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation Evaluate with an automatic measure like clickthrough on first result Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most 269
15 Reading MRS, Chapter 8 270
16 Upcoming What is clustering? Applications of clustering in information retrieval K-means algorithm Introduction to hierarchical clustering Single-link and complete-link clustering 271
17 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering
18 Clustering: Definition (Document) clustering is the process of grouping a set of documents into clusters of similar documents. Documents within a cluster should be similar. Documents from different clusters should be dissimilar. Clustering is the most common form of unsupervised learning. Unsupervised = there are no labeled or annotated data. 272
19 Difference clustering vs. classification Classification: supervised learning; classes are human-defined and part of the input to the learning algorithm; output = membership in class only. Clustering: unsupervised learning; clusters are inferred from the data without human input; output = membership in class + distance from centroid ("degree of cluster membership"). 273
20 The cluster hypothesis Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance to information needs. All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis. Van Rijsbergen's original wording (1979): "closely associated documents tend to be relevant to the same requests". 274
21 Applications of Clustering IR: presentation of results (clustering of documents) Summarisation: clustering of similar documents for multi-document summarisation clustering of similar sentences for re-generation of sentences Topic Segmentation: clustering of similar paragraphs (adjacent or non-adjacent) for detection of topic structure/importance Lexical semantics: clustering of words by cooccurrence patterns 275
22 Scatter-Gather 276
23 Clustering search results 277
24 Clustering news articles 278
25 Clustering topical areas 279
26 Clustering what appears in AAG conferences (AAG = Association of American Geographers) 280
27 Clustering patents 281
28 Clustering terms 282
29 Types of Clustering Hard clustering vs. soft clustering Hard clustering: every object is a member of exactly one cluster Soft clustering: objects can be members of more than one cluster Hierarchical vs. non-hierarchical clustering Hierarchical clustering: pairs of most-similar clusters are iteratively linked until all objects are in a clustering relationship Non-hierarchical clustering results in flat clusters of similar documents 283
30 Desiderata for clustering General goal: put related docs in the same cluster, put unrelated docs in different clusters. We'll see different ways of formalizing this. The number of clusters should be appropriate for the data set we are clustering. Initially, we will assume the number of clusters K is given. There also exist semi-automatic methods for determining K. Secondary goals in clustering: avoid very small and very large clusters; define clusters that are easy to explain to the user; many others
31 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering
32 Non-hierarchical (partitioning) clustering Partitional clustering algorithms produce a set of k non-nested partitions corresponding to k clusters of n objects. Advantage: it is not necessary to compare each object to every other object; only comparisons between objects and cluster centroids are needed. Optimal partitioning clustering algorithms are O(kn). Main algorithm: K-means 285
33 K-means: Basic idea Each cluster j (with n_j elements x_i) is represented by its centroid c_j, the average vector of the cluster: $\vec{c}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} \vec{x}_i$. Measure of cluster quality: minimise the squared distance between elements x_i and their nearest centroid c_j: $RSS = \sum_{j=1}^{k} \sum_{\vec{x}_i \in s_j} d(\vec{x}_i, \vec{c}_j)^2$. Distance: Euclidean; length-normalised vectors in VS. We iterate two steps: reassignment: assign each vector to its closest centroid; recomputation: recompute each centroid as the average of the vectors that were recently assigned to it. 286
34 K-means algorithm Given: a set $s_0 = \{\vec{x}_1, \dots, \vec{x}_n\} \subseteq \mathbb{R}^m$; a distance measure $d : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$; a function for computing the mean $\mu : \mathcal{P}(\mathbb{R}^m) \to \mathbb{R}^m$. Select k initial centers $\vec{c}_1, \dots, \vec{c}_k$. while the stopping criterion ($\sum_{j=1}^{k} \sum_{\vec{x}_i \in s_j} d(\vec{x}_i, \vec{c}_j)^2 < \epsilon$) is not true do: for all clusters $s_j$ do (reassignment): $s_j := \{\vec{x}_i \mid \forall \vec{c}_l : d(\vec{x}_i, \vec{c}_j) \leq d(\vec{x}_i, \vec{c}_l)\}$; for all means $\vec{c}_j$ do (centroid recomputation): $\vec{c}_j := \mu(s_j)$. 287
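A compact Python sketch of the same loop (reassignment followed by centroid recomputation, stopping at a fixed point). This is my own illustrative implementation with plain Euclidean distance and invented points, not code from the course:

```python
import random

def euclidean_sq(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # select k initial centers
    assignment = [None] * len(points)
    for _ in range(max_iters):
        # reassignment: each point goes to its closest centroid
        new_assignment = [min(range(k), key=lambda j: euclidean_sq(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:         # fixed point reached: stop
            break
        assignment = new_assignment
        # recomputation: each centroid becomes the mean of its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                          # keep old centroid if cluster is empty
                centroids[j] = mean(members)
    return assignment, centroids

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
print(kmeans(points, k=2))
```

Stopping when the assignment no longer changes corresponds to the fixed point argued for in the convergence proof later in the lecture.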
35 Worked Example: Set of points to be clustered Exercise: (i) Guess what the optimal clustering into two clusters is in this case; (ii) compute the centroids of the clusters 288
36 Random seeds + Assign points to closest center Iteration One 289
37 Worked Example: Recompute cluster centroids Iteration One 290
38 Worked Example: Assign points to closest centroid Iteration One 291
39 Worked Example: Recompute cluster centroids Iteration Two 292
40 Worked Example: Assign points to closest centroid Iteration Two 293
41 Worked Example: Recompute cluster centroids Iteration Three 294
42 Worked Example: Assign points to closest centroid Iteration Three 295
43 Worked Example: Recompute cluster centroids Iteration Four 296
44 Worked Example: Assign points to closest centroid Iteration Four 297
45 Worked Example: Recompute cluster centroids Iteration Five 298
46 Worked Example: Assign points to closest centroid Iteration Five 299
47 Worked Example: Recompute cluster centroids Iteration Six 300
48 Worked Example: Assign points to closest centroid Iteration Six 301
49 Worked Example: Recompute cluster centroids Iteration Seven 302
50 Worked Ex.: Centroids and assignments after convergence 303
51 K-means is guaranteed to converge: Proof RSS decreases during each reassignment step, because each vector is moved to a closer centroid. RSS decreases during each recomputation step; this follows from the definition of a centroid: the new centroid is the vector for which RSS_k reaches its minimum. There is only a finite number of clusterings. Thus: we must reach a fixed point. (Finite set & monotonically decreasing evaluation function ⇒ convergence.) Assumption: ties are broken consistently. 304
52 Other properties of K-means Fast convergence: K-means typically converges in around 10-20 iterations (if we don't care about a few documents switching back and forth); however, complete convergence can take many more iterations. Non-optimality: K-means is not guaranteed to find the optimal solution; if we start with a bad set of seeds, the resulting clustering can be horrible. Dependence on initial centroids: Solution 1: use i clusterings, choose the one with lowest RSS; Solution 2: use a prior hierarchical clustering step to find seeds with good coverage of the document space. 305
53 Time complexity of K-means Reassignment step: O(KNM) (we need to compute KN document-centroid distances, each of which costs O(M)). Recomputation step: O(NM) (we need to add each of the document's < M values to one of the centroids). Assume the number of iterations is bounded by I. Overall complexity: O(IKNM), i.e. linear in all important dimensions. 306
54 Overview 1 Recap/Catchup 2 Clustering: Introduction 3 Non-hierarchical clustering 4 Hierarchical clustering
55 Hierarchical clustering Imagine we now want to create a hierarchy in the form of a binary tree. Assumes a similarity measure for determining the similarity of two clusters. Up to now, our similarity measures were for documents. We will look at different cluster similarity measures. Main algorithm: HAC (hierarchical agglomerative clustering) 307
56 HAC: Basic algorithm Start with each document in a separate cluster Then repeatedly merge the two clusters that are most similar Until there is only one cluster. The history of merging is a hierarchy in the form of a binary tree. The standard way of depicting this history is a dendrogram. 308
57 A dendrogram (the leaves are document titles): NYSE closing averages Hog prices tumble Oil prices slip Ag trade reform. Chrysler / Latin America Japanese prime minister / Mexico Fed holds interest rates steady Fed to keep interest rates steady Fed keeps interest rates steady Fed keeps interest rates steady Mexican markets British FTSE index War hero Colin Powell War hero Colin Powell Lloyd's CEO questioned Lloyd's chief / U.S. grilling Ohio Blue Cross Lawsuit against tobacco companies suits against tobacco firms Indiana tobacco lawsuit Viag stays positive Most active stocks CompuServe reports loss Sprint / Internet access service Planet Hollywood Trocadero: tripling of revenues Back to school spending is up German unions split Chains may raise prices Clinton signs law 309
58 Term-document matrix to document-document matrix. Log frequency weighting and cosine normalisation. The document-document matrix has one row and one column per document (here SaS, PaP, WH) and entry P(d_i, d_j) for each pair: P(SaS,SaS), P(SaS,PaP), P(SaS,WH), P(PaP,PaP), P(PaP,WH), P(WH,WH). Applying the proximity metric to all pairs of documents creates the document-document matrix, which reports similarities/distances between objects (documents). The diagonal is trivial (identity). As proximity measures are symmetric, the matrix is a triangle. 310
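A minimal sketch of this construction, assuming cosine similarity on log-weighted, length-normalised vectors as stated on the slide; the toy term counts for SaS, PaP and WH are illustrative only:

```python
import math

def log_tf_normalised(tf_vector):
    """Log frequency weighting followed by cosine (length) normalisation."""
    weights = [1 + math.log10(tf) if tf > 0 else 0.0 for tf in tf_vector]
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

def doc_doc_matrix(term_doc_columns):
    """Cosine similarities between all pairs of documents (columns of the
    term-document matrix). Symmetric, with a trivial diagonal of 1.0."""
    docs = [log_tf_normalised(col) for col in term_doc_columns]
    n = len(docs)
    return [[sum(a * b for a, b in zip(docs[i], docs[j])) for j in range(n)]
            for i in range(n)]

# Toy term-document matrix: columns are documents SaS, PaP, WH; rows are terms.
sas, pap, wh = [115, 10, 2, 0], [58, 7, 0, 0], [20, 11, 6, 38]
for row in doc_doc_matrix([sas, pap, wh]):
    print(["%.2f" % v for v in row])
```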
59 Hierarchical clustering: agglomerative (bottom-up, greedy) Given: a set $X = \{x_1, \dots, x_n\}$ of objects; a function $sim : \mathcal{P}(X) \times \mathcal{P}(X) \to \mathbb{R}$. for i := 1 to n do $c_i := \{x_i\}$; $C := \{c_1, \dots, c_n\}$; $j := n+1$; while $|C| > 1$ do: $(c_{n_1}, c_{n_2}) := \arg\max_{(c_u, c_v) \in C \times C,\, c_u \neq c_v} sim(c_u, c_v)$; $c_j := c_{n_1} \cup c_{n_2}$; $C := (C \setminus \{c_{n_1}, c_{n_2}\}) \cup \{c_j\}$; $j := j+1$. The similarity function $sim : \mathcal{P}(X) \times \mathcal{P}(X) \to \mathbb{R}$ measures similarity between clusters, not objects. 311
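The same greedy loop as a naive Python sketch (my own names; the cluster-level similarity sim is passed in, anticipating the linkage functions on the next slide):

```python
def hac(objects, sim):
    """Naive agglomerative clustering: repeatedly merge the two most similar clusters.
    `sim` scores two clusters (sets of objects); returns the merge history."""
    clusters = [frozenset([x]) for x in objects]
    merges = []                                    # the merge history is the dendrogram
    while len(clusters) > 1:
        # O(|C|^2) scan for the most similar pair of distinct clusters
        u, v = max(((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
                   key=lambda pair: sim(*pair))
        clusters = [c for c in clusters if c not in (u, v)] + [u | v]
        merges.append((u, v))
    return merges

# tiny demo: 1-D points, single-link similarity = negative smallest pairwise distance
single_link_sim = lambda cu, cv: -min(abs(x - y) for x in cu for y in cv)
for u, v in hac([1.0, 1.2, 4.0, 4.1, 9.0], single_link_sim):
    print(sorted(u), "+", sorted(v))
```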
60 Computational complexity of the basic algorithm First, we compute the similarity of all N × N pairs of documents. Then, in each of N iterations: we scan the O(N × N) similarities to find the maximum similarity; we merge the two clusters with maximum similarity; we compute the similarity of the new cluster with all other (surviving) clusters. There are O(N) iterations, each performing an O(N × N) scan operation. Overall complexity is O(N^3). Depending on the similarity function, a more efficient algorithm is possible. 312
61 Hierarchical clustering: similarity functions Similarity between two clusters c_u and c_v (with object-level similarity measure s) can be interpreted in different ways: Single Link Function: similarity of the two most similar members, $sim(c_u, c_v) = \max_{x \in c_u, y \in c_v} s(x, y)$. Complete Link Function: similarity of the two least similar members, $sim(c_u, c_v) = \min_{x \in c_u, y \in c_v} s(x, y)$. Group Average Function: average similarity of each pair of group members, $sim(c_u, c_v) = \operatorname{avg}_{x \in c_u, y \in c_v} s(x, y)$. 313
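Expressed in code, the three linkage functions can be written as factories that turn an object-level similarity s into a cluster-level similarity, pluggable into the hac sketch above (again an illustrative sketch, not the course's code):

```python
def single_link(s):
    """sim(c_u, c_v) = max over member pairs of s(x, y)."""
    return lambda cu, cv: max(s(x, y) for x in cu for y in cv)

def complete_link(s):
    """sim(c_u, c_v) = min over member pairs of s(x, y)."""
    return lambda cu, cv: min(s(x, y) for x in cu for y in cv)

def group_average(s):
    """sim(c_u, c_v) = average of s(x, y) over all member pairs."""
    return lambda cu, cv: sum(s(x, y) for x in cu for y in cv) / (len(cu) * len(cv))

# e.g. an object-level similarity for 1-D points, then plug into the hac sketch above:
s = lambda x, y: -abs(x - y)
# hac([1.0, 1.2, 4.0, 4.1, 9.0], complete_link(s))
```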
62 Example: hierarchical clustering; similarity functions. Cluster 8 objects a-h; Euclidean distances (2D) shown in diagram (figure: the points a-h in the plane and their pairwise distance matrix). 314
63 Single Link is O(n^2) (figure: distance matrix for a-h; reduced matrix after step 4, when the pairs a b, c d, e f, g h have been merged). Min-min at each step. 315
64 Clustering Result under Single Link (figure: dendrogram over a b c d e f g h) 316
65 Complete Link (figure: distance matrix for a-h; reduced matrix after step 4, when the pairs a b, c d, e f, g h have been merged). Max-min at each step. 317
66 Complete Link (figure: continuation of the previous matrix). Max-min at each step; ab/ef and cd/gh merge next. 318
67 Clustering result under complete link (figure: dendrogram over a b c d e f g h). Complete Link is O(n^3). 319
68 Example: gene expression data An example from biology: cluster genes by function. Survey of 112 rat genes which are suspected to participate in development of the CNS. Take 9 data points: 5 embryonic (E11, E13, E15, E18, E21), 3 postnatal (P0, P7, P14) and one adult. Measure expression of each gene (how much mRNA is in the cell?). These measures are normalised logs; for our purposes, we can consider them as weights. Cluster analysis determines which genes operate at the same time. 320
69 Rat CNS gene expression data (excerpt) (table: for each gene (keratin, cellubrevin, nestin, MAP2, GAP43, L1, NFL, NFM, NFH, synaptophysin, neno, S100 beta, GFAP, MOG, GAD65, pre-GAD67, GAD67, G67I80/86, G67I86, GAT1, ChAT, ACHE, ODC, TH, NOS, GRa1), its GenBank locus and its expression values at E11, E13, E15, E18, E21, P0, P7, P14 and adult)
70 Rat CNS gene clustering, single link (figure: dendrogram over the 112 genes). Clustering of Rat Expression Data (Single Link/Euclidean) 322
71 Rat CNS gene clustering, complete link (figure: dendrogram over the 112 genes). Clustering of Rat Expression Data (Complete Link/Euclidean) 323
72 Rat CNS gene clustering, group average link (figure: dendrogram over the 112 genes). Clustering of Rat Expression Data (Av Link/Euclidean) 324
73 Flat or hierarchical clustering? When a hierarchical structure is desired: hierarchical algorithm. Humans are bad at interpreting hierarchical clusterings (unless cleverly visualised). For high efficiency, use flat clustering. For deterministic results, use HAC. HAC can also be applied if K cannot be predetermined (it can start without knowing K). 325
74 Take-away Partitional clustering provides less information but is more efficient (best: O(kn)); K-means. Hierarchical clustering: best algorithms have O(n^2) complexity; single-link vs. complete-link (vs. group-average). Hierarchical and non-hierarchical clustering fulfil different needs. 326
75 Reading MRS, Chapters 16 and 17