Text Analytics. Text Clustering. Ulf Leser


Text Classification Given a set D of docs and a set of classes C, a classifier is a function f: D → C. Problem: finding a good classifier. A good classifier assigns as many docs as possible to their correct class. How do we know? Supervised learning: obtain a set of docs together with their classes and find the characteristics of the docs in each class (= build a model). What do they have in common? How do they differ from docs in other classes? Encode the model in a classifier function f. f is the better, the more docs are assigned their correct class. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 2

Categorical Attributes

  ID   Age   Type of car   Risk
   1    23   Family        High
   2    17   Sports        High
   3    43   Sports        High
   4    68   Family        Low
   5    25   Truck         Low

Assume this classification was brought up by some insurance manager. What was in his head? Probably a set of rules, such as:

  if age > 50 then risk = low
  elseif age < 25 then risk = high
  elseif car = sports then risk = high
  else risk = low

Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 3

A Third Approach Why not:

  ID   Age   Type of car   Risk
   1    23   Family        High
   2    17   Sports        High
   3    43   Sports        High
   4    68   Family        Low
   5    25   Truck         Low

  if age = 17 and car = sports then risk = high
  elseif age = 23 and car = family then risk = high
  elseif age = 25 and car = truck then risk = low
  elseif age = 43 and car = sports then risk = high
  else risk = low

Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 4

Overfitting This was an instance of our perfect classifier. We always learn a model from a small sample of the real world. Overfitting: if the model is too close to the training data, it performs perfectly on the training data but has learned any bias present in the training data. Thus, the rules do not generalize well. Solution: Use an appropriate learning algorithm. Evaluate your method using cross-validation. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 5

Nearest Neighbor Classifiers Very simple and effective method Definition Let D be a set of classified documents, m a distance function between any two documents, and d an unclassified doc. A nearest-neighbor (NN) classifier assigns to d the class of the nearest document to d in D wrt. m A k-nearest-neighbor (knn) classifier assigns to d the most frequent class among the k nearest documents to d in D wrt. m Remark Obviously, a proper distance function is very important In knn, we may weight the k nearest docs according to their distance to d We need to take care of multiple docs with the same distance Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 6

Properties Basic idea: Imagine a copy of d is in D. Of course, we then want to assign the class of this copy to d. kNN is extremely simple and astonishingly good. kNN in general is more robust than NN. [MS99]: 1NN reaches ~95% accuracy, MaxEnt ~96% (Reuters collection, class earnings, 20 words with highest χ²-value). kNN is a lazy learner: actually, there is no learning and no model. Major problem: Performance (speed). We need to compute the distance between d and every doc in D. Various suggestions to structure D to save computations: Clustering; choosing one representative per class and finding the nearest representative (not good); multidimensional index structures and metric embeddings. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 7
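
A minimal sketch of a k-nearest-neighbor classifier as described above; the Euclidean distance and all function names are illustrative choices, not prescribed by the slides.

```python
import math
from collections import Counter

def euclidean(x, y):
    # the distance function m between two document vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(d, labeled_docs, k=3, dist=euclidean):
    """labeled_docs: list of (vector, class) pairs. Returns the most
    frequent class among the k nearest neighbors of d wrt. dist."""
    neighbors = sorted(labeled_docs, key=lambda vc: dist(d, vc[0]))[:k]
    votes = Counter(c for _, c in neighbors)
    return votes.most_common(1)[0][0]

# usage: a tiny training set D with two classes
D = [([1.0, 1.0], "A"), ([1.2, 0.9], "A"), ([5.0, 5.1], "B"), ([4.8, 5.3], "B")]
print(knn_classify([1.1, 1.0], D, k=3))  # -> "A"
```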

Bayes Classification Simple method based on probability. Given: a set D of docs and classes c_1, c_2, …, c_m. Docs are described as a set F of binary features, usually the presence/absence of terms in d (= VSM representation). We seek p(c_i|d), the probability of a doc d ∈ D being a member of class c_i. d eventually is assigned to the class c with p(c|d) = argmax_i p(c_i|d). Replace d with its feature representation: p(c|d) = p(c|F[d]) = p(c|f_1[d], …, f_n[d]) = p(c|t_1, …, t_n). Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 8

Naïve Bayes We have p(c|d) ∝ p(t_1, …, t_n|c) * p(c). The first term cannot be learned with any reasonably large training set: there are 2^n combinations of feature values. Solution: Be naïve. Assume statistical independence of all terms. Then p(t_1, …, t_n|c) = p(t_1|c) * … * p(t_n|c), and finally p(c|d) ∝ p(c) * ∏_{i=1..n} p(t_i|c). Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 9
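
A minimal sketch of a Naïve Bayes classifier over binary term features as defined above; the add-one smoothing and all names are illustrative assumptions, not taken from the slides.

```python
import math
from collections import defaultdict

def train_nb(docs, classes, vocab):
    """docs: list of term sets, classes: parallel list of class labels.
    Returns the priors p(c) and the conditionals p(t present | c),
    estimated by counting with add-one smoothing (an added assumption)."""
    prior, cond = {}, defaultdict(dict)
    for c in set(classes):
        idx = [i for i, ci in enumerate(classes) if ci == c]
        prior[c] = len(idx) / len(docs)
        for t in vocab:
            present = sum(1 for i in idx if t in docs[i])
            cond[c][t] = (present + 1) / (len(idx) + 2)
    return prior, cond

def classify_nb(d, prior, cond, vocab):
    # argmax_c  log p(c) + sum_i log p(t_i|c)  over the binary term features
    def score(c):
        s = math.log(prior[c])
        for t in vocab:
            p = cond[c][t]
            s += math.log(p) if t in d else math.log(1 - p)
        return s
    return max(prior, key=score)

# usage
vocab = {"ball", "game", "election", "vote"}
docs = [{"ball", "game"}, {"game"}, {"election", "vote"}, {"vote"}]
classes = ["sports", "sports", "politics", "politics"]
prior, cond = train_nb(docs, classes, vocab)
print(classify_nb({"game", "ball"}, prior, cond, vocab))  # -> "sports"
```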

Classification with ME The ME approach models the joint probability p(c,d) as p(c,d) = (1/Z) * ∏_{i=1..K} α_i^{f_i(d,c)}, where Z is a normalization constant, the feature weights α_i are learned from the data, and K is the number of features. Classification with ME: We have p(c,d) = p(c|d) * p(d). Again, p(d) can be dropped for ranking. Compute p(c,d) for all classes and return the class with the maximal value. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 10
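
A minimal sketch of the ME classification rule, assuming the feature weights α_i have already been learned (training, e.g. by iterative scaling, is not shown); the feature functions, weights, and names below are purely illustrative.

```python
import math

def me_score(d, c, features, alpha):
    # unnormalized p(c,d) ~ prod_i alpha_i^{f_i(d,c)}; Z cancels when ranking classes
    return math.exp(sum(math.log(alpha[i]) * f(d, c) for i, f in enumerate(features)))

def me_classify(d, classes, features, alpha):
    return max(classes, key=lambda c: me_score(d, c, features, alpha))

# illustrative binary feature functions f_i(d, c) and (assumed already learned) weights alpha_i
features = [
    lambda d, c: 1.0 if ("goal" in d and c == "sports") else 0.0,
    lambda d, c: 1.0 if ("vote" in d and c == "politics") else 0.0,
]
alpha = [2.5, 3.0]
print(me_classify({"goal", "match"}, ["sports", "politics"], features, alpha))  # -> "sports"
```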

Where is the Problem? We want the α to assert certain conditions on our joint probability distribution p(c,d). Counting distributions of single features over the training set does not in itself create a joint distribution. Counting joint distributions: data sparsity problem (again). In NB, we additionally assumed statistical independence to come up with a joint distribution (using Bayes' Theorem). ME goes another way and computes the probability distribution which maximizes the entropy of the joint distribution. Thus, it makes as few assumptions as possible given the data. This distribution is encoded in the feature weights α_i. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 11

Properties of Maximum Entropy Classifiers In general, ME should outperform NB, but not always. There is theory behind the discrepancies (not covered here). It does not assume independence of features: two redundant features will simply get half of the weight each. Very popular in statistical NLP: some of the best POS-taggers are ME-based, some of the best NER systems are ME-based. Several extensions: Maximum Entropy Markov Models, Conditional Random Fields. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 12

Class of Linear Classifiers Many common classifiers are (log-)linear classifiers: Naïve Bayes, Perceptron / Winnow, Rocchio, linear and logistic regression, Maximum Entropy, Support Vector Machines (with linear kernel). All compute a hyperplane which (hopefully) separates the two classes. Despite this similarity, noticeable performance differences exist: Which of the infinite number of possible separating hyperplanes is chosen? How are non-separable data sets handled? Experience: Classifiers more powerful than linear often don't perform better (on text). Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 13

Linear Classifiers All learn a hyperplane which is used to separate classes in high-dimensional space. For illustration, we stay in 2-dimensional space and look at binary classification problems only. But which hyperplane? Source: Xiaojin Zhu, SVM-cs540 Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 14

Support Vector Machines (sketch) Compute the hyperplane which maximizes the margin, i.e., which is as far away from any data point as possible. Can be cast as a convex (quadratic) optimization problem and solved efficiently. The solution only depends on the support vectors (the points closest to the hyperplane). Complication: usually the classes are not linearly separable; minimize the error (misclassification). Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 15

Problems not Linearly Separable Map high-dimensional data into an even higher-dimensional space. Non-linearly separable sets may become linearly separable. Doing this efficiently requires a good deal of work ("kernel trick"). Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 16

Content of this Lecture (Text) clustering Clustering algorithms Application Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 17

Clustering Clustering/NN groups/VBZ objects/NNS into/PP (usually/ADV disjoint/JJ) sets/NN. Intuitively, each set should contain objects that are similar to each other and dissimilar to objects in any other set. We need a similarity or distance function. Two optimization goals. Also called unsupervised learning: We don't know how many sets/classes we expect, we don't know what those sets should look like, and we have no examples for set members. Supervised learning = classification / categorization. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 18

Example 1 Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 19

Clustering 1 Intuition here: Similarity corresponds to Euclidean distance. Optimization for a good ratio of within-cluster coherence and between-cluster distance. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 20

Clustering 2 Better or worse? Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 21

Quality of a Clustering Let us measure cluster quality only by the average distance of objects within a cluster. Definition: Let f be a clustering of a set of objects O into a set of classes C with |C| = k. Let m_c be the centre of all objects of class c (to be defined later), and let d(o,o') be the distance between two objects o and o'. Then, the k-score of f is q_k(f) = Σ_{c∈C} Σ_{o: f(o)=c} d(o, m_c). Remark: Similarly, we could define the k-score as the average distance across all object pairs within a cluster; this would relieve us from finding the centre of a set of objects. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 22
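
A minimal sketch of the k-score computation; the Euclidean distance and the simple coordinate-wise centre are illustrative choices for d and m_c.

```python
import math

def centre(points):
    # m_c: coordinate-wise mean of a list of equally long vectors
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def k_score(clusters, dist):
    """clusters: list of clusters, each a list of vectors.
    k-score = sum over all clusters of the distances of their members to the cluster centre."""
    total = 0.0
    for c in clusters:
        m = centre(c)
        total += sum(dist(o, m) for o in c)
    return total

euclid = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
clusters = [[[0, 0], [0, 1]], [[5, 5], [6, 5]]]
print(k_score(clusters, euclid))  # -> 2.0 (each point is 0.5 away from its cluster centre)
```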

6-Score Find the centre of all clusters, compute the distances, aggregate. Probably better than the 2-score of the clustering on the previous slide. But … Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 23

Disadvantage The optimal clustering trivially is reached for k = |O|. We need to fix our definition. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 24

Quality of a Clustering 2 Definition: Let f: O → C with |C| arbitrary. Let dist(o, c_i) be the average distance of o to all points of cluster c_i. We define the inner score a(o) = dist(o, f(o)) and the outer score b(o) = min(dist(o, c_i)) over all c_i ≠ f(o). Let the silhouette s(o) be s(o) = (b(o) - a(o)) / max(a(o), b(o)). Then, the silhouette s(f) of f is Σ s(o). Note: b(o) measures how much the score would decrease if f(o) did not exist and o were assigned to its next-best other cluster. s(o) ≈ 0: point right between two clusters; s(o) ≈ 1: point very close to only one (its own) cluster; s(o) ≈ -1: point far away from its own cluster. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 25
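
A minimal sketch of the silhouette computation following the definition above; the Euclidean distance is an illustrative choice, and singleton clusters are not handled.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def avg_dist(o, cluster, dist=euclid):
    # average distance of point o to all points of a cluster (o itself excluded)
    others = [p for p in cluster if p is not o]
    return sum(dist(o, p) for p in others) / len(others)

def silhouette(clusters, dist=euclid):
    """clusters: list of clusters, each a list of points.
    Returns s(f) = sum over all o of s(o) = (b(o) - a(o)) / max(a(o), b(o))."""
    total = 0.0
    for i, c in enumerate(clusters):
        for o in c:
            a = avg_dist(o, c, dist)                          # inner score a(o)
            b = min(avg_dist(o, other, dist)                  # outer score b(o)
                    for j, other in enumerate(clusters) if j != i)
            total += (b - a) / max(a, b)
    return total

clusters = [[[0, 0], [0, 1]], [[5, 5], [6, 5]]]
print(silhouette(clusters))  # about 3.4: every s(o) is close to 1
```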

Quality of Clustering 3 The silhouette is a very technical definition. Usually, we want to find intuitively appealing clusters; those might not at all conform to our definitions. Source: [FPPS96] Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 26

Text Clustering Applications Explorative data analysis Learn about the structure within your document collection Corpus preprocessing Clustering provides a semantic index to corpus Group docs into clusters to ease navigation Retrieval speed: Index only one representative per cluster Processing of search results Cluster all hits into groups of similar hits (in particular: duplicates) Improving search recall Return doc and all members of its cluster Has similarity to automatic relevance feedback using top-k docs Word sense disambiguation The different senses of a word should appear as clusters Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 27

Processing Search Results The research breakthrough was labeling the clusters, i.e., grouping search results into folder topics [Clusty.com blog] Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 28

Similarity between Documents All clustering methods require some form of distance or similarity function. It must be a metric: d(x,x) = 0, d(x,y) = d(y,x), d(x,y) ≤ d(x,z) + d(z,y). In contrast to search, we now compare two docs with each other, and not a document and a query. Nevertheless, the same methods are usually used: Compute TF/IDF values for all terms in the corpus, represent documents as |K|-dimensional vectors, and use the cosine as distance function: sim(d_1, d_2) = cos(d_1, d_2) = (d_1 ∘ d_2) / (|d_1| * |d_2|) = Σ_i (d_1[i] * d_2[i]) / (sqrt(Σ_i d_1[i]²) * sqrt(Σ_i d_2[i]²)). Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 29
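
A minimal sketch of the TF*IDF vector representation and the cosine similarity used above; the concrete TF and IDF weighting variants are illustrative choices.

```python
import math

def cosine(d1, d2):
    # sim(d1, d2) = sum_i d1[i]*d2[i] / (|d1| * |d2|)
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one TF*IDF vector per doc
    over the corpus vocabulary (raw term frequency, idf = log(N/df))."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    idf = {t: math.log(n / sum(1 for d in docs if t in d)) for t in vocab}
    return [[d.count(t) * idf[t] for t in vocab] for d in docs]

docs = [["text", "clustering", "text"], ["text", "classification"], ["gene", "clustering"]]
v = tf_idf_vectors(docs)
print(round(cosine(v[0], v[2]), 3))  # two docs sharing only the term "clustering"
```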

Further Issues To increase speed, feature selection is necessary; we never counted the time it takes to compare two high-dimensional vectors. Do not cluster on 100,000 terms; instead, use the 1,000 most descriptive terms for your intended clustering. Cluster label: Use the representative, e.g., show the 5-10 terms with the highest TF/IDF values in the cluster centre. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 30

Content of this Lecture Text clustering Clustering algorithms Hierarchical clustering K-means Soft clustering: EM algorithm Application Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 31

Classes of Cluster Algorithms Hierarchical clustering: iteratively creates a hierarchy of clusters. Bottom-up: start from |D| clusters and merge clusters until only one remains. Top-down: start from one cluster (including all docs) and split clusters until every doc is its own cluster, or some stop criterion is met. Partitioning: heuristically partition all objects into k clusters; guess a first partitioning and improve iteratively; k is a parameter of the method, not a result. Other: algorithmic approaches such as Max-Cut (partitioning), density-based clustering, minimum description length. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 32

Hierarchical Clustering Also called UPGMA: Unweighted Pair-Group Method with Arithmetic mean. We only discuss the bottom-up approach. Computes a binary tree (dendrogram). Simple algorithm: Compute the distance matrix M (distances between any pair of docs). Choose the pair d_1, d_2 with the smallest distance. Compute x = m(d_1, d_2) (the centre point). Remove d_1, d_2 from M and insert x. The distance between x and any d in M is the average of the distances between d_1 and d and between d_2 and d. Loop until M is empty. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 33
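
A minimal sketch of bottom-up agglomerative clustering; it uses the average-link variant discussed later instead of explicit centre points, and it is not tuned for the O(n² log n) runtime mentioned below.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def avg_link(c1, c2, dist=euclid):
    # average distance between all pairs of docs from the two clusters
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def hierarchical(docs, dist=euclid):
    """Bottom-up clustering: every doc starts as its own cluster; repeatedly merge
    the closest pair of clusters. Returns the dendrogram as nested tuples."""
    clusters = [(d, [d]) for d in docs]            # (dendrogram node, member points)
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: avg_link(clusters[ij[0]][1], clusters[ij[1]][1], dist))
        (t1, m1), (t2, m2) = clusters[i], clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((t1, t2), m1 + m2))       # merge the two closest clusters
    return clusters[0][0]

points = [(0, 0), (0, 1), (5, 5), (6, 5), (10, 0)]
print(hierarchical(points))
```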

Example: Distance Matrix

       A    B    C    D    E    F   …
   A   -   90   84   91   81   82   …
   B        -   43   99   37   36   …
   C             -   14   35   21   87
   D                  -   28   34   …
   E                       -   95   …
   F                            -   …

Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 34

Iteration
Start: {A, B, C, D, E, F, G}
Merge (B,D) → a: {A, C, E, F, G, a}
Merge (E,F) → b: {A, C, G, a, b}
Merge (A,b) → c: {C, G, a, c}
Merge (C,G) → d: {a, c, d}
Merge (d,c) → e: {a, e}
Merge (a,e) → f
Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 35

Hierarchical Clustering Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 36

Properties Advantages: Simple and intuitive algorithm; the number of clusters is not an input of the method; usually good quality clusters. Disadvantages: Very expensive; requires O(n²) space and time (at least) for the distance matrix; total runtime is O(n² * log(n)) (why?). Not applicable as such to large doc sets. Does not really generate clusters. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 37

Intuition Hierarchical clustering organizes a doc collection. Ideally, hierarchical clustering directly creates a directory of the corpus. This is more of a wish: there are many, many ways to group objects, and clustering will choose just one; also, there are no names for the groups. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 38

Branch Length Use branch length to symbolize distance Outlier detection Outlier Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 39

Variations Hierarchical clustering uses the distance between the centers of clusters to decide about distance between clusters Other alternatives Single Link: Distance of the two closest docs in both clusters Complete Link: Distance of the two furthest docs Average Link: Average distance between pairs of docs from both clusters Centroid: Distance between centre points Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 40
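
For reference, a sketch of the four cluster-distance variants as functions over two clusters, assuming a point-level distance dist (e.g. the Euclidean or cosine-based distance from earlier); the names are illustrative.

```python
def single_link(c1, c2, dist):
    # distance of the two closest docs in both clusters
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2, dist):
    # distance of the two furthest docs
    return max(dist(a, b) for a in c1 for b in c2)

def average_link(c1, c2, dist):
    # average distance between pairs of docs from both clusters
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid_link(c1, c2, dist):
    # distance between the centre points of the two clusters
    centre = lambda c: [sum(p[i] for p in c) / len(c) for i in range(len(c[0]))]
    return dist(centre(c1), centre(c2))
```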


Single-link versus Complete-link Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 43

More Properties Single-link: Optimizes a local criterion (only look at the closest pair). Similar to computing a minimal spanning tree, with cuts at the most expensive branches when going down the hierarchy. Creates elongated clusters (chaining effect). Complete-link: Optimizes a global criterion (look at the worst pair). Creates more compact, more convex, spherical clusters. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 44

Content of this Lecture Text clustering Clustering algorithms Hierarchical clustering K-means Soft clustering: EM algorithm Application Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 45

K-Means Partitioning method K-Means probably is the best known clustering algorithm Requires the number k of clusters to be predefined Algorithm Guess k cluster centers at random Can use k docs, or k random points in doc-space Loop forever Assign all docs to their closest cluster center If no doc has changed its assignment, stop Or if sufficiently few docs have changed their assignment Otherwise, compute new cluster centre as centre of all points in cluster Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 46
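
A minimal sketch of the k-means loop described above: start from k randomly chosen docs and stop when no doc changes its assignment (the iteration cap is an added safeguard).

```python
import math, random

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centre(points):
    return [sum(p[i] for p in points) / len(points) for i in range(len(points[0]))]

def k_means(docs, k, dist=euclid, max_iter=100):
    """docs: list of vectors. Returns k clusters as lists of vectors."""
    centres = [list(d) for d in random.sample(docs, k)]   # guess: k docs as start centres
    assignment = [None] * len(docs)
    for _ in range(max_iter):
        # assign every doc to its closest cluster centre
        new_assignment = [min(range(k), key=lambda j: dist(d, centres[j])) for d in docs]
        if new_assignment == assignment:                   # no doc changed: stop
            break
        assignment = new_assignment
        # compute the new cluster centre as the centre of all points in the cluster
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:
                centres[j] = centre(members)
    return [[d for d, a in zip(docs, assignment) if a == j] for j in range(k)]

random.seed(0)
points = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6]]
print(k_means(points, k=2))
```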

Example 1 k=3. Choose random start points. Source: Stanford, CS 262 Computational Genomics Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 47

Example 2 Assign docs to closest cluster centre Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 48

Example 3 Compute new cluster centre Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 49

Example 4 Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 50

Example 5 Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 51

Example 6 Converged Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 52

Properties Usually, k-means converges quite fast. Let l be the number of iterations; complexity: O(l*k*n). Assignment: n*k distance computations. New centers: summing up n vectors k times. Choosing the right start points is important: k-means essentially is a greedy heuristic and only finds local optima. Option 1: Start several times with different start points. Option 2: Compute a hierarchical clustering on a small random sample and choose the start points as its cluster centers (Buckshot algorithm). How to choose k? Try different values of k and use a quality score to find the best one. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 53

k-means and Outlier Try for k=3 Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 54

Help: K-Medoid Choose the doc in the middle of a cluster as its representative. PAM: Partitioning Around Medoids. Advantage: Less sensitive to outliers; also works for non-metric spaces, as no new center point needs to be computed. Disadvantage: More expensive; we need to compute all pair-wise distances in each cluster in each round; overall complexity is O(n³). Can save re-computations at the expense of more space. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 55
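
A minimal PAM-style sketch: like k-means, but the representative of each cluster is an actual doc (the medoid), so no new centre point has to be computed. The naive initialization and the swap strategy (recomputing the medoid of each cluster) are simplifying assumptions.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def medoid(cluster, dist=euclid):
    # the member with the smallest summed distance to all other members
    return min(cluster, key=lambda m: sum(dist(m, o) for o in cluster))

def k_medoid(docs, k, dist=euclid, max_iter=100):
    medoids = docs[:k]                                  # naive start: the first k docs
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for d in docs:                                  # assign each doc to its closest medoid
            clusters[min(range(k), key=lambda j: dist(d, medoids[j]))].append(d)
        new_medoids = [medoid(c) if c else medoids[j] for j, c in enumerate(clusters)]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters

points = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [100, 100]]   # the last point is an outlier
print(k_medoid(points, k=2))
```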

k-medoid and Outlier Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 56

Content of this Lecture Text clustering Clustering algorithms Hierarchical clustering K-means Soft clustering: EM algorithm Application Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 57

Soft Clustering We always assumed that docs are assigned exactly one cluster. Probabilistic interpretation: All docs pertain to all clusters with a certain probability. Generative model: Assume we have k doc-producing devices (such as authors, topics, …). Each device produces docs that are normally distributed in vector space with a device-specific mean and variance. Assume that the k devices have produced the documents in D. Clustering can be interpreted as recovering the mean and variance of each distribution / device. Solution: Expectation Maximization algorithm (EM). Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 58

Expectation Maximization (rough sketch, no math) EM optimizes the set of parameters Θ of a multivariate normal mixture (mean and variance of k clusters) given sample data. Iterative process with two phases. Expectation: Assuming an instantiation of Θ, we can assign every doc its most likely generator / cluster. Maximization: Assuming an assignment of docs to generators, we can compute the optimal Θ using maximum likelihood estimation (or using a Bayes approach including a-priori probabilities of the generators). Algorithm: Guess an initial Θ, then iterate through both steps until convergence. Finds a local optimum; convergence is guaranteed. K-means is a special case of EM clustering assuming k normal distributions with different means yet equal and minimal variance. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 59
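
A minimal sketch of EM for a mixture of Gaussians, reduced to the one-dimensional case to keep it short (the lecture's setting is the multivariate case over doc vectors); the initialization and the variance floor are added assumptions.

```python
import math, random

def gauss(x, mean, var):
    # density of a 1-d normal distribution
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_1d(xs, k, iters=50):
    """EM for a 1-d mixture of k Gaussians. Returns weights, means, variances."""
    means = random.sample(xs, k)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each generator j for each point x, given current parameters
        resp = []
        for x in xs:
            p = [weights[j] * gauss(x, means[j], variances[j]) for j in range(k)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: maximum likelihood re-estimation of the parameters from the soft assignment
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj, 1e-6)
    return weights, means, variances

random.seed(1)
data = [0.1, 0.3, -0.2, 0.0, 5.1, 4.9, 5.3, 5.0]
print(em_1d(data, k=2))
```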

Content of this Lecture Text clustering Clustering algorithms Application Clustering Phenotypes Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 60

Mining Phenotypes for Function Prediction Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 61

Phenotypes Observable characteristics of an organism produced by the organism's genotype interacting with the environment. [http://www.whatislife.com/education/fact/glossary.htm] [Figure: two individuals differing in a single base of a gene (A vs. T), with the path gene → transcripts → proteins → phenotypes] Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 62

Phenotypes Observable characteristics of an organism produced by the organism's genotype interacting with the environment. [http://www.whatislife.com/education/fact/glossary.htm] [Figure: same as before, with the two individuals' phenotypes labeled Disease and Healthy] Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 63

Genotypes ATCGATCGATGA vs. ATCGACCGATGA. Measuring genotypes: Sequencing, microarrays, etc. Describing genotypes: Gene Ontology; >20,000 terms, 10 years of history, widely accepted. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 64

Mining Phenotypes: General Idea [Figure: Gene A with its genotype and phenotype, Gene B with its genotype and phenotype; the genotype-to-phenotype links are marked as established, the link between the genes as "?"] Established: Genes with similar genotypes likely have similar phenotypes. Question: If genes generate very similar phenotypes, do they have the same genotype? Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 65

Phenotypes What is a phenotype? A visible characteristic of an organism, the description of a disease, the response to a drug, a characterization of mutants, results of RNAi / gene knock-out, expression levels of genes. Methods for the systematic measurement of phenotypes have been established only for a few years. Describing phenotypes: Today: text, keywords, abstracts, home-grown vocabularies. Tomorrow: Mammalian Phenotype Ontology? Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 66

Approach [Figure: Gene A and Gene B, each with a GO annotation and a phenotype description; similarity between the phenotype descriptions is used to infer similarity between the genes and to predict GO annotations] Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 67

Phenodocs PhenomicDB: 411,102 phenotypes (short: <250 words). Processing: remove small phenotypes, remove multi-gene phenotypes (all phenotypes associated with more than one gene, ~500), remove stop words, stemming. Result: 39,610 phenodocs for 15,426 genes. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 68

K-Means Clustering Hierarchical clustering would require ~40,000 * 40,000 = 1,600,000,000 comparisons. K-means: simple, iterative algorithm; the number of clusters must be predefined. We experimented with 250 to 3,000 clusters. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 69

Properties: Phenodoc Similarity of Genes [Two plots for genes with 5 PTs: pairwise similarity (y-axis, 0 to 1) per gene (x-axis); left: genes in phenoclusters, right: control (random selection)] Pair-wise similarity scores of phenodocs of genes in the same cluster, sorted by score. Result: Phenodocs of genes in phenoclusters are highly similar to each other. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 70

PPI: Inter-Connectedness Interacting proteins often share function PPI from BIOGRID database Not at all a complete dataset In >200 clusters, >30% of genes interact with each other Control (random groups): 3 clusters Result: Genes in phenoclusters interact with each other much more often than expected by chance Proteins and interactions from BioGrid. Red proteins have no phenotypes in PhenomicDB Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 71

Coherence of Functional Annotation Comparison of the GO annotation of genes in phenoclusters (data from Entrez Gene). Similarity of two GO terms: normalized number of shared ancestors. Similarity of two genes: average of the top-k GO term pairs. >200 clusters with score >0.4; control: 2 clusters. Results: Genes in phenoclusters have a much higher coherence in functional annotation than expected by chance. [Figure: excerpt of the Gene Ontology hierarchy, e.g. Molecular Function, Biological Process, Physiological Process, Catalytic Activity, Cellular Process, Binding, Metabolism, Transferase Activity, Nucleotide Binding, Protein Metabolism, Kinase Activity, Cell Communication, Signal Transduction, Protein Modification] Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 72
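
A minimal sketch of the two similarity scores described above, assuming the GO hierarchy is given as a child-to-parents mapping; normalizing the number of shared ancestors by the union of ancestors is one plausible reading of the slide, and all term names are illustrative.

```python
def ancestors(term, parents):
    # all ancestors of a GO term in a child -> [parents] mapping
    seen, stack = set(), [term]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def go_term_sim(t1, t2, parents):
    a1, a2 = ancestors(t1, parents) | {t1}, ancestors(t2, parents) | {t2}
    return len(a1 & a2) / len(a1 | a2)      # normalized number of shared ancestors

def gene_sim(go1, go2, parents, k=3):
    # average of the top-k most similar GO term pairs of the two genes
    sims = sorted((go_term_sim(t1, t2, parents) for t1 in go1 for t2 in go2), reverse=True)[:k]
    return sum(sims) / len(sims)

# tiny illustrative GO fragment
parents = {"kinase activity": ["catalytic activity"],
           "catalytic activity": ["molecular function"],
           "binding": ["molecular function"]}
print(gene_sim({"kinase activity"}, {"binding", "catalytic activity"}, parents, k=1))
```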

Function Prediction Can the increased functional coherence of clusters be exploited for function prediction? General approach: Compute phenoclusters. For each cluster, compute the set of associated genes (gene cluster). In each gene cluster, predict the common GO terms for all genes (common: annotated to >50% of the genes in the cluster). Filtering clusters. Idea: Find clusters a priori which give hope for very good results. Filter 1: Only clusters with >2 members and at least one common GO term. Filter 2: Only clusters with GO coherence >0.4. Filter 3: Only clusters with PPI-connectedness >33%. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 73
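
A minimal sketch of the prediction step for one gene cluster, assuming a gene-to-GO-terms mapping; the gene and term names are hypothetical.

```python
from collections import Counter

def predict_common_terms(gene_cluster, annotations, threshold=0.5):
    """gene_cluster: list of gene ids; annotations: gene -> set of GO terms.
    Returns the GO terms annotated to more than `threshold` of the genes;
    these common terms are then predicted for every gene in the cluster."""
    counts = Counter(t for g in gene_cluster for t in annotations.get(g, set()))
    return {t for t, c in counts.items() if c / len(gene_cluster) > threshold}

annotations = {"geneA": {"kinase activity", "binding"},
               "geneB": {"kinase activity"},
               "geneC": {"kinase activity", "transport"}}
print(predict_common_terms(["geneA", "geneB", "geneC"], annotations))  # {'kinase activity'}
```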

Evaluation How can we know how good we are? Cross-validation: Separate the genes into a training set (90%) and a test set (10%). Remove the annotations from the genes in the test set. Build clusters and predict functions on the training set. Compare the predicted with the removed annotations: precision and recall. Repeat and average the results (macro-average). Note: This punishes new and potentially valid annotations. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 74
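
A minimal sketch of the precision/recall computation with macro-averaging over genes; the dictionaries and gene names are illustrative.

```python
def precision_recall(predicted, truth):
    """predicted, truth: gene -> set of GO terms. Macro-averaged precision and recall,
    comparing the predicted terms with the removed (held-out) annotations."""
    precisions, recalls = [], []
    for gene, pred in predicted.items():
        true = truth.get(gene, set())
        hits = len(pred & true)
        if pred:
            precisions.append(hits / len(pred))
        if true:
            recalls.append(hits / len(true))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

pred = {"geneA": {"kinase activity"}, "geneB": {"binding", "transport"}}
true = {"geneA": {"kinase activity", "binding"}, "geneB": {"binding"}}
print(precision_recall(pred, true))  # (0.75, 0.75)
```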

Results for Different Filters

                Filter 1   Filter 1 & 2   Filter 1 & 3
  # of groups      196          74             53
  # of terms       345         159            102
  # of genes      3213         711            409
  Precision      67.91%      62.52%         60.52%
  Recall         22.98%      26.16%         19.78%

What if we consider predicted terms to be correct that are a little more general than the removed terms (filter 1)? One step more general: 75.6% precision, 28.7% recall. Two steps: 76.3% precision, 30.7% recall. The less stringent the GO equality, the better the results; this is a common trick in studies using GO. Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 75

Results for Different Cluster Sizes

  K                        250          500          750        1,000      ...      2,750        3,000
  Clusters w/ GO-Sim 1   14 (5.6%)    26 (5.2%)    44 (5.9%)   71 (7.1%)   ...   273 (9.9%)   309 (10.3%)
    # Genes                 561          781          943        1155      ...      2094         2221
  Clusters w/ PPI 75%    12 (4.8%)    34 (6.8%)    65 (8.7%)   88 (8.8%)   ...   314 (11.4%)  353 (11.8%)
    # Genes                 785          988         1166        1263      ...      1810         1914
  Clusters w/ PPI 33%   49 (19.6%)  119 (23.8%)  193 (25.7%)  252 (25.2%)  ...   662 (24.1%)  717 (23.9%)
    # Genes                3362         4044         4296        4417      ...      4811         4833
  Clusters for GO-Pred. 73 (29.2%)  153 (30.6%)  230 (30.7%)  295 (29.5%)  ...   748 (27.2%)  816 (27.2%)
    # Genes                3465         4139         4344        4438      ...      5016         5115
  # Terms                   123          247          383         489      ...      1436         1557
  Precision               81.53%       77.16%       74.26%      71.73%     ...     63.92%       62.89%
  Recall                  16.90%       20.22%       24.45%      26.36%     ...     34.64%       34.61%
  Avg. Genes/Cluster         52           26           17          13      ...        4            4

With increasing k: clusters are smaller and more homogeneous, the number of predicted terms increases, the number of genes which receive annotations stays roughly the same, and precision decreases slowly while recall increases (an effect of the rapid increase in the number of predictions). Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 76