Text Analytics. Text Clustering. Ulf Leser

1 Text Analytics Text Clustering Ulf Leser

2 Text Classification
- Given a set D of docs and a set of classes C, a classifier is a function f: D → C
- Problem: finding a good classifier
  - A good classifier assigns as many docs as possible to their correct class
  - How do we know? Supervised learning
- Supervised learning
  - Obtain a set of docs with their classes
  - Find the characteristics of docs in each class (= build a model): What do they have in common? How do they differ from docs in other classes?
  - Encode the model in a classifier function f
  - f is the better, the more docs are assigned their correct class

3 Categorical Attributes

  ID  Age  Type of car  Risk
   1   23  Family       High
   2   17  Sports       High
   3   43  Sports       High
   4   68  Family       Low
   5   25  Truck        Low

Assume this classification was brought up by some insurance manager. What was in his head? Probably a set of rules, such as:

  if age > 50 then risk = low
  elseif age < 25 then risk = high
  elseif car = sports then risk = high
  else risk = low

4 A Third Approach

Why not:

  ID  Age  Type of car  Risk
   1   23  Family       High
   2   17  Sports       High
   3   43  Sports       High
   4   68  Family       Low
   5   25  Truck        Low

  if age = 17 and car = sports then risk = high
  elseif age = 23 and car = family then risk = high
  elseif age = 43 and car = sports then risk = high
  elseif age = 25 and car = truck then risk = low
  else risk = low

5 Overfitting
- This was an instance of our "perfect" classifier
- We always learn a model from a small sample of the real world
- Overfitting: if the model is too close to the training data, it performs perfectly on the training data but has also learned any bias present in the training data
- Thus, the rules do not generalize well
- Solution
  - Use an appropriate learning algorithm
  - Evaluate your method using cross-validation
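
As an illustration, a minimal sketch of cross-validated evaluation in Python (assuming scikit-learn; the random term-count data is made up purely to keep the example self-contained):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.poisson(lam=1.0, size=(200, 50))  # hypothetical term counts
y = rng.integers(0, 2, size=200)          # hypothetical class labels

# 5-fold cross-validation: train on 4/5 of the data, evaluate on the
# held-out fifth, rotate, and average. Accuracy on unseen data is a
# far better guard against overfitting than accuracy on training data.
print(cross_val_score(MultinomialNB(), X, y, cv=5).mean())
```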

6 Nearest Neighbor Classifiers
- Very simple and effective method
- Definition: Let D be a set of classified documents, m a distance function between any two documents, and d an unclassified doc.
  - A nearest-neighbor (NN) classifier assigns to d the class of the nearest document to d in D wrt. m
  - A k-nearest-neighbor (kNN) classifier assigns to d the most frequent class among the k nearest documents to d in D wrt. m
- Remarks
  - Obviously, a proper distance function is very important
  - In kNN, we may weight the k nearest docs according to their distance to d
  - We need to take care of multiple docs with the same distance
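
The definition translates almost directly into code. A minimal Python sketch of the kNN rule, where the distance function m, the collection D, and the toy points are all placeholders:

```python
from collections import Counter

def knn_classify(d, docs, labels, dist, k=3):
    """Assign to d the most frequent class among its k nearest docs.
    `dist` is the distance function m; ties in distance are broken by
    sort order here (a real implementation should handle them explicitly)."""
    nearest = sorted(range(len(docs)), key=lambda i: dist(d, docs[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy usage with Euclidean distance on 2-d points:
euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
print(knn_classify((0.9, 1.1), [(0, 0), (1, 1), (4, 4)],
                   ["A", "B", "B"], euclid, k=1))  # -> "B"
```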

7 Properties
- Basic idea: imagine a copy of d is in D. Of course, we then want to assign the class of this copy to d
- kNN is extremely simple and astonishingly good
- kNN in general is more robust than NN
  - [MS99]: 1NN reaches ~95% accuracy, MaxEnt ~96% (Reuters collection, class "earnings", 20 words with highest χ²-value)
- kNN is a lazy learner: actually, there is no learning and no model
- Major problem: performance (speed)
  - We need to compute the distance between d and every doc in D
  - Various suggestions to structure D to save computations
    - Clustering
    - Choose one representative per class and find the nearest representative (not good)
    - Multidimensional index structures and metric embeddings

8 Bayes Classification
- Simple method based on probability
- Given: a set D of docs and classes c_1, c_2, ..., c_m
- Docs are described as a set F of binary features
  - Usually the presence/absence of terms in d (= VSM representation)
- We seek p(c_i|d), the probability of a doc d ∈ D being a member of class c_i
- d eventually is assigned to the class c = argmax_{c_i} p(c_i|d)
- Replace d with its feature representation:

  p(c|d) = p(c|F[d]) = p(c|f_1[d], ..., f_n[d]) = p(c|t_1, ..., t_n)

9 Naïve Bayes
- We have (by Bayes' theorem, dropping the constant p(d)):

  p(c|d) ∝ p(t_1, ..., t_n|c) * p(c)

- The first term cannot be learned with any reasonably large training set
  - There are 2^n combinations of feature values
- Solution: be naïve. Assume statistical independence of all terms. Then

  p(t_1, ..., t_n|c) = p(t_1|c) * ... * p(t_n|c)

- And finally

  p(c|d) ∝ p(c) * ∏_{i=1..n} p(t_i|c)
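
A minimal sketch of the resulting decision rule in Python; the priors p(c) and conditionals p(t|c) are assumed to have been estimated from training counts (with smoothing), which is not shown:

```python
import math

def nb_classify(doc_terms, classes, prior, cond):
    """argmax_c p(c) * prod_i p(t_i|c), computed in log space
    to avoid numerical underflow for long documents."""
    return max(classes,
               key=lambda c: math.log(prior[c])
                             + sum(math.log(cond[(t, c)]) for t in doc_terms))

# Hypothetical estimates for two classes and two observed terms:
prior = {"earnings": 0.3, "other": 0.7}
cond = {("profit", "earnings"): 0.8, ("profit", "other"): 0.1,
        ("quarter", "earnings"): 0.6, ("quarter", "other"): 0.2}
print(nb_classify(["profit", "quarter"], ["earnings", "other"], prior, cond))
```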

10 Classification with ME
- The ME approach models the joint probability p(c,d) as

  p(c,d) = (1/Z) * ∏_{i=1..K} α_i^{f_i(d,c)}

  - Z is a normalization constant
  - The feature weights α_i are learned from the data
  - K is the number of features
- Classification with ME
  - We have p(c,d) = p(c|d) * p(d); again, p(d) can be dropped for ranking
  - Compute p(c,d) for all classes and return the class with the maximal value
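
A minimal sketch of this ranking step, assuming the feature functions and their weights α_i are already given (both are hypothetical here):

```python
import math

def me_rank(d, classes, features, alpha):
    """Score classes by the ME model p(c,d) = (1/Z) * prod_i alpha_i^{f_i(d,c)}.
    Z is identical for all classes and can be dropped for the argmax,
    so we compare sum_i f_i(d,c) * log(alpha_i) instead."""
    score = lambda c: sum(f(d, c) * math.log(a) for f, a in zip(features, alpha))
    return max(classes, key=score)

# Hypothetical binary feature and its learned weight:
features = [lambda d, c: int("profit" in d and c == "earnings")]
alpha = [2.5]
print(me_rank({"profit", "loss"}, ["earnings", "other"], features, alpha))
```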

11 Where is the Problem?
- We want the α_i to assert certain conditions on our joint probability distribution p(c,d)
- Counting distributions of single features over the training set does not in itself create a joint distribution
- Counting joint distributions: data sparsity problem (again)
- In NB, we additionally assumed statistical independence to come up with a joint distribution (using Bayes' theorem)
- ME goes another way and computes the probability distribution which maximizes the entropy of the joint distribution
- Thus, it makes as few assumptions as possible given the data
- This distribution is encoded in the feature weights α_i

12 Properties of Maximum Entropy Classifiers
- In general, ME should outperform NB
  - But not always; there is theory behind the discrepancies (not covered here)
- It does not assume independence of features
  - Two redundant features will simply get half of their weight each
- Very popular in statistical NLP
  - Some of the best POS taggers are ME-based
  - Some of the best NER systems are ME-based
- Several extensions
  - Maximum Entropy Markov Models
  - Conditional Random Fields

13 Class of Linear Classifiers
- Many common classifiers are (log-)linear classifiers
  - Naïve Bayes
  - Perceptron / Winnow
  - Rocchio
  - Linear and logistic regression
  - Maximum entropy
  - Support vector machines (with linear kernel)
- All compute a hyperplane which (hopefully) separates the two classes
- Despite the similarity, noticeable performance differences exist
  - Which of the infinitely many possible separating hyperplanes is chosen?
  - How are non-separable data sets handled?
- Experience: classifiers more powerful than linear often don't perform better (on text)

14 Linear Classifiers
- All learn a hyperplane which is used to separate classes in high-dimensional space
- For illustration, we stay in 2-dimensional space and look at binary classification problems only
- But which hyperplane?
Source: Xiaojin Zhu, SVM-cs540

15 Support Vector Machines (sketch)
- Compute the hyperplane which maximizes the margin, i.e., is as far away from any data point as possible
- Can be cast as a convex (quadratic) optimization problem and solved efficiently
- The solution only depends on the support vectors (the points closest to the hyperplane)
- Complication: usually the classes are not linearly separable
  - Minimize the error (misclassification)
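
A minimal sketch of a linear SVM on toy 2-d data (assumes scikit-learn; the points are made up). The parameter C controls the trade-off between a wide margin and misclassification errors, which is how non-separable data is handled:

```python
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)   # the solution depends only on these points
print(clf.predict([[2, 2]]))  # side of the separating hyperplane
```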

16 Problems not Linearly Separable
- Map high-dimensional data into an even higher-dimensional space
- Non-linearly separable sets may become linearly separable there
- Doing this efficiently requires a good deal of work ("kernel trick")

17 Content of this Lecture
- (Text) clustering
- Clustering algorithms
- Application

18 Clustering
- Clustering groups objects into (usually disjoint) sets
- Intuitively, each set should contain objects that are similar to each other and dissimilar to objects in any other set
  - We need a similarity or distance function
  - Two optimization goals
- Also called unsupervised learning
  - We don't know how many sets/classes we expect
  - We don't know what those sets should look like
  - We have no examples for set members
- Supervised learning = classification / categorization

19 Example 1

20 Clustering 1
- Intuition here: similarity corresponds to Euclidean distance
- Optimization for a good ratio of intra-cluster coherence and inter-cluster distance

21 Clustering 2
Better or worse?

22 Quality of a Clustering
- Let us measure cluster quality only by the average distance of objects within a cluster
- Definition: Let f be a clustering of a set of objects O into a set of classes C with |C| = k. Let m_c be the centre of all objects of class c (to be defined later), and let d(o,o') be the distance between two objects o and o'. Then, the k-score of f is

  q_k(f) = Σ_{c∈C} Σ_{o: f(o)=c} d(o, m_c)

- Remark
  - Similarly, we could define the k-score as the average distance across all object pairs within a cluster
  - Would relieve us from finding the centre of a set of objects
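
A minimal sketch of the k-score in Python, taking the centre m_c as the mean vector of a cluster's members (one common choice, as the slide leaves the centre "to be defined later"):

```python
import numpy as np

def k_score(points, assignment, dist):
    """Sum over all clusters c of d(o, m_c) for each object o with
    f(o) = c; the centre m_c is the mean vector of the cluster."""
    points = np.asarray(points, dtype=float)
    total = 0.0
    for c in set(assignment):
        members = points[np.array(assignment) == c]
        centre = members.mean(axis=0)
        total += sum(dist(o, centre) for o in members)
    return total

# Toy usage: two tight clusters of two points each.
print(k_score([[0, 0], [1, 1], [5, 5], [6, 6]], [0, 0, 1, 1],
              lambda a, b: np.linalg.norm(a - b)))
```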

23 6-Score
- Find the centre of all clusters, compute distances, aggregate
- Probably better than the 2-score of the clustering on the previous slide
- But ...

24 Disadvantage
- The optimal clustering is trivially reached for k = |O|
- We need to fix our definition

25 Quality of a Clustering 2
- Definition: Let f: O → C with |C| arbitrary. Let dist(o, c_i) be the average distance of o to all points of cluster c_i. We define
  - Inner score: a(o) = dist(o, f(o))
  - Outer score: b(o) = min(dist(o, c_i)) with c_i ≠ f(o)
  - The silhouette s(o) of an object:

    s(o) = (b(o) - a(o)) / max(a(o), b(o))

  - The silhouette s(f) of f: s(f) = Σ s(o)
- Note
  - b(o): how much would the score decrease if f(o) did not exist and o was assigned to its next-best other cluster?
  - s(o) ~ 0: point right between two clusters
  - s(o) ~ 1: point very close to only one (its own) cluster
  - s(o) ~ -1: point far away from its own cluster
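
A minimal sketch of the silhouette, averaged over all objects (singleton clusters get a(o) = 0 here, a simplification):

```python
import numpy as np

def silhouette(points, assignment, dist):
    """Mean of s(o) = (b(o) - a(o)) / max(a(o), b(o)): a(o) is the mean
    distance to the other members of o's cluster, b(o) the smallest
    mean distance to any other cluster."""
    points = np.asarray(points, dtype=float)
    clusters = {c: [i for i, x in enumerate(assignment) if x == c]
                for c in set(assignment)}
    scores = []
    for i, o in enumerate(points):
        own = assignment[i]
        a = np.mean([dist(o, points[j]) for j in clusters[own] if j != i] or [0.0])
        b = min(np.mean([dist(o, points[j]) for j in clusters[c]])
                for c in clusters if c != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

scikit-learn offers the same measure as sklearn.metrics.silhouette_score.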

26 Quality of Clustering 3
- The silhouette is a very technical definition
- Usually, we want to find intuitively appealing clusters
- Those might not at all conform to our definitions
Source: [FPPS96]

27 Text Clustering Applications
- Explorative data analysis: learn about the structure within your document collection
- Corpus preprocessing
  - Clustering provides a semantic index to the corpus
  - Group docs into clusters to ease navigation
  - Retrieval speed: index only one representative per cluster
- Processing of search results
  - Cluster all hits into groups of similar hits (in particular: duplicates)
- Improving search recall
  - Return a doc and all members of its cluster
  - Similar to automatic relevance feedback using the top-k docs
- Word sense disambiguation
  - The different senses of a word should appear as clusters

28 Processing Search Results
"The research breakthrough was labeling the clusters, i.e., grouping search results into folder topics" [Clusty.com blog]

29 Similarity between Documents
- All clustering methods require some form of distance or similarity function
  - Must be a metric: d(x,x) = 0, d(x,y) = d(y,x), d(x,y) ≤ d(x,z) + d(z,y)
- In contrast to search, we now compare two docs with each other, not a document and a query
- Nevertheless, the same methods are usually used
  - Compute TF/IDF values for all terms in the corpus
  - Represent documents as |K|-dimensional vectors
  - Use the cosine as distance function:

    sim(d_1, d_2) = cos(d_1, d_2) = (d_1 ∘ d_2) / (|d_1| * |d_2|)
                  = Σ_i (d_1[i] * d_2[i]) / ( sqrt(Σ_i d_1[i]²) * sqrt(Σ_i d_2[i]²) )
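
A minimal sketch of this pipeline (assumes scikit-learn; the two example sentences are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["text clustering groups similar documents",
        "classification assigns documents to known classes"]

X = TfidfVectorizer().fit_transform(docs)  # docs as |K|-dimensional TF/IDF vectors
print(cosine_similarity(X[0], X[1]))       # cosine similarity of the two docs
```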

30 Further Issues
- To increase speed, feature selection is necessary
  - We never counted the time it takes to compare two high-dimensional vectors
- Do not cluster on all terms; instead, use the most descriptive terms for your intended clustering
- Cluster labels
  - Use the representative, e.g., show the 5-10 terms with the highest TF/IDF values in the cluster centre

31 Content of this Lecture
- Text clustering
- Clustering algorithms
  - Hierarchical clustering
  - K-means
  - Soft clustering: EM algorithm
- Application

32 Classes of Cluster Algorithms
- Hierarchical clustering: iteratively creates a hierarchy of clusters
  - Bottom-up: start from |D| clusters and merge clusters until only one remains
  - Top-down: start from one cluster (including all docs) and split clusters until every doc is its own cluster
  - Or until some stop criterion is met
- Partitioning: heuristically partition all objects into k clusters
  - Guess a first partitioning and improve iteratively
  - k is a parameter of the method, not a result
- Other
  - Algorithmic: Max-Cut (partitioning) etc.
  - Density-based clustering
  - Minimum description length

33 Hierarchical Clustering
- Also called UPGMA: Unweighted Pair-Group Method with Arithmetic mean
- We only discuss the bottom-up approach
- Computes a binary tree (dendrogram)
- Simple algorithm
  - Compute the distance matrix M (distances between any pair of docs)
  - Choose the pair d_1, d_2 with the smallest distance
  - Compute x = m(d_1, d_2) (the centre point)
  - Remove d_1, d_2 from M; insert x
  - Distance between x and any d in M: average of the distances between d_1 and d and between d_2 and d
  - Loop until M is empty
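
A minimal sketch of bottom-up hierarchical clustering (assumes SciPy; the toy vectors stand in for document vectors). SciPy's method="average" is the UPGMA variant from the slide:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

docs = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.1], [4.2, 3.9]])
Z = linkage(docs, method="average", metric="euclidean")
print(Z)  # one row per merge: the two clusters joined and their distance
# scipy.cluster.hierarchy.dendrogram(Z) draws the binary tree
```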

34 Example: Distance Matrix
(pairwise distance matrix over the docs A-F)

35 Iteration
(successive distance matrices; merge steps: (B,D) → a, (E,F) → b, (A,b) → c, (C,G) → d, (d,c) → e, (a,e) → f)

36 Hierarchical Clustering

37 Properties
- Advantages
  - Simple and intuitive algorithm
  - The number of clusters is not an input of the method
  - Usually good-quality clusters
- Disadvantages
  - Very expensive: requires O(n²) space and time (at least) for the distance matrix; total runtime is O(n² log n) - why?
  - Not applicable as such to large doc sets
  - Does not really generate clusters

38 Intuition
- Hierarchical clustering organizes a doc collection
- Ideally, hierarchical clustering directly creates a directory of the corpus
- This is more of a wish
  - There are many, many ways to group objects; clustering will choose just one
  - There are no names for the groups

39 Branch Length
- Use branch length to symbolize distance
- Outlier detection

40 Variations
- Hierarchical clustering uses the distance between the centres of clusters to decide about the distance between clusters
- Other alternatives
  - Single link: distance of the two closest docs in both clusters
  - Complete link: distance of the two furthest docs
  - Average link: average distance between pairs of docs from both clusters
  - Centroid: distance between centre points
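
A hedged sketch of these variants (assumes SciPy; random toy data): changing `method` switches the cluster-distance definition while everything else stays the same.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

docs = np.random.default_rng(0).normal(size=(10, 2))
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(docs, method=method)  # same data, four distance definitions
    print(method, "-> last merge at distance", round(Z[-1, 2], 3))
```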

43 Single-link versus Complete-link

44 More Properties
- Single link
  - Optimizes a local criterion (only looks at the closest pair)
  - Similar to computing a minimal spanning tree, with cuts at the most expensive branches while going down the hierarchy
  - Creates elongated clusters (chaining effect)
- Complete link
  - Optimizes a global criterion (looks at the worst pair)
  - Creates more compact, more convex, spherical clusters

45 Content of this Lecture
- Text clustering
- Clustering algorithms
  - Hierarchical clustering
  - K-means
  - Soft clustering: EM algorithm
- Application

46 K-Means
- Partitioning method
- K-means probably is the best-known clustering algorithm
- Requires the number k of clusters to be predefined
- Algorithm
  - Guess k cluster centres at random (can use k docs, or k random points in doc space)
  - Loop forever
    - Assign all docs to their closest cluster centre
    - If no doc has changed its assignment, stop (or if sufficiently few docs have changed their assignment)
    - Otherwise, compute each new cluster centre as the centre of all points in the cluster
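
A minimal sketch of this loop in Python (assumes NumPy; empty clusters are not handled, which a real implementation must do):

```python
import numpy as np

def k_means(docs, k, iters=100, seed=0):
    """Pick k random docs as initial centres, then alternate assignment
    and centre update until assignments stop changing."""
    docs = np.asarray(docs, dtype=float)
    rng = np.random.default_rng(seed)
    centres = docs[rng.choice(len(docs), size=k, replace=False)]
    assignment = np.full(len(docs), -1)
    for _ in range(iters):
        # Assign every doc to its closest centre.
        dists = np.linalg.norm(docs[:, None, :] - centres[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break  # converged: no doc changed its cluster
        assignment = new_assignment
        # Recompute each centre as the mean of its cluster's points.
        centres = np.array([docs[assignment == c].mean(axis=0) for c in range(k)])
    return assignment, centres
```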

47 Example 1
- k = 3
- Choose random start points
Source: Stanford, CS 262 Computational Genomics

48 Example 2
Assign docs to the closest cluster centre

49 Example 3
Compute the new cluster centres

50 Example 4

51 Example 5

52 Example 6
Converged

53 Properties
- Usually, k-means converges quite fast
  - Let l be the number of iterations; complexity: O(l*k*n)
  - Assignment: n*k distance computations
  - New centres: summing up n vectors, k times
- Choosing the right start points is important: k-means essentially is a greedy heuristic and only finds local optima
  - Option 1: start several times with different start points
  - Option 2: compute a hierarchical clustering on a small random sample and choose the start points as its cluster centres (Buckshot algorithm)
- How to choose k? Try different k and use a quality score to find the best value (see the sketch below)
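
A minimal sketch of that model-selection loop (assumes scikit-learn; random stand-in data): n_init=10 restarts with different start points mitigate bad local optima, and the silhouette scores the candidate values of k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = np.random.default_rng(0).normal(size=(100, 5))  # stand-in doc vectors
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(docs)
    print(k, round(silhouette_score(docs, labels), 3))  # pick k with best score
```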

54 k-Means and Outliers
Try for k = 3

55 Help: K-Medoid
- Choose the doc in the middle of a cluster as representative
- PAM: Partitioning Around Medoids
- Advantages
  - Less sensitive to outliers
  - Also works for non-metric spaces, as no new centre point needs to be computed
- Disadvantages
  - More expensive: we need to compute all pairwise distances in each cluster in each round
  - Overall complexity is O(n³)
  - Can save re-computations at the expense of more space
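
A sketch of one PAM-style round (a simplified reading of the idea above, not the full PAM swap procedure): assign each object to its closest medoid, then make the new medoid of each cluster the member minimising the sum of distances to all other members.

```python
def k_medoid_step(dist, medoids):
    """One assignment + medoid-update round. `dist[i][j]` is a
    precomputed distance matrix, `medoids` a list of object indices."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
    return [min(ms, key=lambda c: sum(dist[c][o] for o in ms))
            for ms in clusters.values()]

# Toy usage: two well-separated pairs of objects.
dist = [[0, 1, 9, 8], [1, 0, 8, 9], [9, 8, 0, 1], [8, 9, 1, 0]]
print(k_medoid_step(dist, [0, 2]))
```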

56 k-Medoid and Outliers

57 Content of this Lecture
- Text clustering
- Clustering algorithms
  - Hierarchical clustering
  - K-means
  - Soft clustering: EM algorithm
- Application

58 Soft Clustering
- So far we assumed docs are assigned exactly one cluster
- Probabilistic interpretation: all docs pertain to all clusters with a certain probability
- Generative model
  - Assume we have k doc-producing devices (such as authors, topics, ...)
  - Each device produces docs that are normally distributed in vector space, with a device-specific mean and variance
  - Assume that the k devices have produced the D documents
  - Clustering can then be interpreted as recovering the mean and variance of each distribution / device
- Solution: Expectation Maximization (EM) algorithm

59 Expectation Maximization (rough sketch, no math)
- EM optimizes the set of parameters Θ of a multivariate normal distribution (means and variances of k clusters) given sample data
- Iterative process with two phases
  - Expectation: assuming an instantiation of Θ, we can assign each doc its most likely generator / cluster
  - Maximization: assuming an assignment of docs to generators, we can compute the optimal Θ using maximum likelihood estimation (or using a Bayesian approach including a-priori probabilities of the generators)
- Algorithm
  - Guess an initial Θ
  - Iterate through both steps until convergence
  - Finds a local optimum; convergence is guaranteed
- K-means: special case of EM clustering, assuming k normal distributions with different means yet equal and minimal variance
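
A minimal sketch of soft clustering with EM (assumes scikit-learn; the two "generators" below are made up): GaussianMixture fits k normal distributions and returns, per doc, the probability of belonging to each cluster.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
docs = np.vstack([rng.normal(0, 1, size=(50, 2)),   # generator 1
                  rng.normal(5, 1, size=(50, 2))])  # generator 2

gm = GaussianMixture(n_components=2, random_state=0).fit(docs)
print(gm.means_)                   # recovered means of the generators
print(gm.predict_proba(docs[:3]))  # soft assignment: p(cluster | doc)
```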

60 Content of this Lecture
- Text clustering
- Clustering algorithms
- Application: Clustering Phenotypes

61 Mining Phenotypes for Function Prediction

62 Phenotypes
- Observable characteristics of an organism, produced by the organism's genotype interacting with the environment
(figure: two individuals with a single-base difference (A vs. T) in a gene; gene → transcripts → proteins → phenotypes)

63 Phenotypes
- Observable characteristics of an organism, produced by the organism's genotype interacting with the environment
(figure: as before, with the two resulting phenotypes labelled Disease vs. Healthy)

64 Genotypes
- ATCGATCGATGA vs. ATCGACCGATGA
- Measuring genotypes: sequencing, microarrays, etc.
- Describing genotypes: Gene Ontology (> terms, 10 years of history, widely accepted)

65 Mining Phenotypes: General Idea
- Established: genes with similar genotypes likely have similar phenotypes
- Question: if genes generate very similar phenotypes, do they have the same genotype?

66 Phenotypes
- What is a phenotype?
  - A visible characteristic of an organism
  - Description of a disease
  - Response to a drug
  - Characterization of mutants
  - Results of RNAi / gene knock-out
  - Expression levels of genes
- Methods for the systematic measurement of phenotypes have been established for only a few years
- Describing phenotypes
  - Today: text, keywords, abstracts, home-grown vocabularies
  - Tomorrow: Mammalian Phenotype Ontology?

67 Approach
(diagram: the phenotype descriptions of gene A and gene B are compared for similarity; from similar phenotype descriptions, GO annotations are inferred and predicted across genes)

68 Phenodocs
- PhenomicDB: 411,102 phenotypes; short texts (<250 words)
- Remove small phenotypes
- Remove multi-gene phenotypes (phenotypes associated to more than one gene, ~500)
- Remove stop words, apply stemming
- Result: 39,610 phenodocs for 15,426 genes

69 K-Means Clustering
- Hierarchical clustering would require ~39,610² pairwise comparisons
- K-means: simple, iterative algorithm
- The number of clusters must be predefined; we experimented with different numbers of clusters

70 Properties: Phenodoc Similarity of Genes
(plots: pairwise similarity scores of phenodocs of genes in the same cluster, sorted by score; genes with 5 PTs in phenoclusters vs. a control of randomly selected genes)
- Result: phenodocs of genes in phenoclusters are highly similar to each other

71 PPI: Inter-Connectedness
- Interacting proteins often share function
- PPI data from the BioGRID database (not at all a complete dataset)
- In >200 clusters, >30% of genes interact with each other
- Control (random groups): 3 clusters
- Result: genes in phenoclusters interact with each other much more often than expected by chance
(figure: proteins and interactions from BioGRID; red proteins have no phenotypes in PhenomicDB)

72 Coherence of Functional Annotation
- Comparison of the GO annotation of genes in phenoclusters (data from Entrez Gene)
- Similarity of two GO terms: normalized number of shared ancestors
- Similarity of two genes: average of the top-k GO term pairs
- >200 clusters with score >0.4; control: 2 clusters
- Result: genes in phenoclusters have a much higher coherence in functional annotation than expected by chance
(figure: Gene Ontology excerpt, from Molecular Function and Biological Process down to terms such as Kinase Activity, Signal Transduction, and Protein Modification)

73 Function Prediction
- Can the increased functional coherence of clusters be exploited for function prediction?
- General approach
  - Compute phenoclusters
  - For each cluster, compute the set of associated genes (gene cluster)
  - In each gene cluster, predict the common GO terms to all genes (common: annotated to >50% of the genes in the cluster)
- Filtering clusters
  - Idea: find clusters a-priori which give hope for very good results
  - Filter 1: only clusters with >2 members and at least one common GO term
  - Filter 2: only clusters with GO coherence >0.4
  - Filter 3: only clusters with PPI-connectedness >33%

74 Evaluation
- How can we know how good we are? Cross-validation
  - Separate genes into training (90%) and test (10%) sets
  - Remove the annotations from the genes in the test set
  - Build clusters and predict functions on the training set
  - Compare the predicted with the removed annotations: precision and recall
  - Repeat and average the results (macro-average)
- Note: this punishes new and potentially valid annotations

75 Results for Different Filters

              Filter 1   Filter 1 & 2   Filter 1 & 3
  Precision   67.91%     62.52%         60.52%
  Recall      22.98%     26.16%         19.78%

- What if we consider predicted terms to be correct that are a little more general than the removed terms (Filter 1)?
  - One step more general: 75.6% precision, 28.7% recall
  - Two steps: 76.3% precision, 30.7% recall
- The less stringent the GO equality, the better the results; this is a common trick in studies using GO

76 Results for Different Cluster Sizes

(columns: four increasing values of k)
  Clusters w/ GO-sim = 1:      14 (5.6%)    26 (5.2%)    44 (5.9%)    71 (7.1%)
  Clusters w/ PPI >= 75%:      12 (4.8%)    34 (6.8%)    65 (8.7%)    88 (8.8%)
  Clusters w/ PPI >= 33%:      49 (19.6%)  119 (23.8%)  193 (25.7%)  252 (25.2%)
  Clusters for GO prediction:  73 (29.2%)  153 (30.6%)  230 (30.7%)  295 (29.5%)
  Precision:                   81.53%      77.16%       74.26%       71.73%
  Recall:                      16.90%      20.22%       24.45%       26.36%

- With increasing k
  - Clusters are smaller and more homogeneous
  - The number of predicted terms increases
  - The number of genes which receive annotations stays roughly the same
  - Precision decreases slowly, recall increases: an effect of the rapid increase in the number of predictions
