Text Analytics. Text Clustering. Ulf Leser

1 Text Analytics Text Clustering Ulf Leser

2 Text Classification
- Given a set D of docs and a set of classes C, a classifier is a function f: D → C
- Problem: finding a good classifier
  - A good classifier assigns as many docs as possible to their correct class
  - How do we know? Supervised learning
- Supervised learning
  - Obtain a set of docs with their classes
  - Find the characteristics of docs in each class (= build a model): What do they have in common? How do they differ from docs in other classes?
  - Encode the model in a classifier function f
  - f is the better, the more docs are assigned their correct class

3 Categorical Attributes

  ID  Age  Type of car  Risk
   1   23  Family       High
   2   17  Sports       High
   3   43  Sports       High
   4   68  Family       Low
   5   25  Truck        Low

Assume this classification was brought up by some insurance manager. What was in his head? Probably a set of rules, such as:

  if age > 50 then risk = low
  elseif age < 25 then risk = high
  elseif car = sports then risk = high
  else risk = low

4 A Third Approach

Why not:

  ID  Age  Type of car  Risk
   1   23  Family       High
   2   17  Sports       High
   3   43  Sports       High
   4   68  Family       Low
   5   25  Truck        Low

  if age = 17 and car = sports then risk = high
  elseif age = 23 and car = family then risk = high
  elseif age = 43 and car = sports then risk = high
  elseif age = 25 and car = truck then risk = low
  else risk = low

5 Overfitting
- This was an instance of our "perfect" classifier
- We always learn a model from a small sample of the real world
- Overfitting: if the model is too close to the training data, it performs perfectly on the training data but has also learned any bias present in the training data
- Thus, the rules do not generalize well
- Solution
  - Use an appropriate learning algorithm
  - Evaluate your method using cross-validation
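
As an illustration, a minimal sketch of cross-validated evaluation in Python (assuming scikit-learn; the random term-count data is made up purely to keep the example self-contained):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.poisson(lam=1.0, size=(200, 50))  # hypothetical term counts
y = rng.integers(0, 2, size=200)          # hypothetical class labels

# 5-fold cross-validation: train on 4/5 of the data, evaluate on the
# held-out fifth, rotate, and average. Accuracy on unseen data is a
# far better guard against overfitting than accuracy on training data.
print(cross_val_score(MultinomialNB(), X, y, cv=5).mean())
```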

6 Nearest Neighbor Classifiers
- Very simple and effective method
- Definition: Let D be a set of classified documents, m a distance function between any two documents, and d an unclassified doc.
  - A nearest-neighbor (NN) classifier assigns to d the class of the nearest document to d in D wrt. m
  - A k-nearest-neighbor (kNN) classifier assigns to d the most frequent class among the k nearest documents to d in D wrt. m
- Remarks
  - Obviously, a proper distance function is very important
  - In kNN, we may weight the k nearest docs according to their distance to d
  - We need to take care of multiple docs with the same distance
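
The definition translates almost directly into code. A minimal Python sketch of the kNN rule, where the distance function m, the collection D, and the toy points are all placeholders:

```python
from collections import Counter

def knn_classify(d, docs, labels, dist, k=3):
    """Assign to d the most frequent class among its k nearest docs.
    `dist` is the distance function m; ties in distance are broken by
    sort order here (a real implementation should handle them explicitly)."""
    nearest = sorted(range(len(docs)), key=lambda i: dist(d, docs[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy usage with Euclidean distance on 2-d points:
euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
print(knn_classify((0.9, 1.1), [(0, 0), (1, 1), (4, 4)],
                   ["A", "B", "B"], euclid, k=1))  # -> "B"
```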

7 Properties
- Basic idea: imagine a copy of d is in D. Of course, we then want to assign the class of this copy to d
- kNN is extremely simple and astonishingly good
- kNN in general is more robust than NN
  - [MS99]: 1NN reaches ~95% accuracy, MaxEnt ~96% (Reuters collection, class "earnings", 20 words with highest χ²-value)
- kNN is a lazy learner: actually, there is no learning and no model
- Major problem: performance (speed)
  - We need to compute the distance between d and every doc in D
  - Various suggestions to structure D to save computations
    - Clustering
    - Choose one representative per class and find the nearest representative (not good)
    - Multidimensional index structures and metric embeddings

8 Bayes Classification
- Simple method based on probability
- Given: a set D of docs and classes c_1, c_2, ..., c_m
- Docs are described as a set F of binary features
  - Usually the presence/absence of terms in d (= VSM representation)
- We seek p(c_i|d), the probability of a doc d ∈ D being a member of class c_i
- d eventually is assigned to the class c = argmax_{c_i} p(c_i|d)
- Replace d with its feature representation:

  p(c|d) = p(c|F[d]) = p(c|f_1[d], ..., f_n[d]) = p(c|t_1, ..., t_n)

9 Naïve Bayes
- We have (by Bayes' theorem, dropping the constant p(d)):

  p(c|d) ∝ p(t_1, ..., t_n|c) * p(c)

- The first term cannot be learned with any reasonably large training set
  - There are 2^n combinations of feature values
- Solution: be naïve. Assume statistical independence of all terms. Then

  p(t_1, ..., t_n|c) = p(t_1|c) * ... * p(t_n|c)

- And finally

  p(c|d) ∝ p(c) * ∏_{i=1..n} p(t_i|c)
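
A minimal sketch of the resulting decision rule in Python; the priors p(c) and conditionals p(t|c) are assumed to have been estimated from training counts (with smoothing), which is not shown:

```python
import math

def nb_classify(doc_terms, classes, prior, cond):
    """argmax_c p(c) * prod_i p(t_i|c), computed in log space
    to avoid numerical underflow for long documents."""
    return max(classes,
               key=lambda c: math.log(prior[c])
                             + sum(math.log(cond[(t, c)]) for t in doc_terms))

# Hypothetical estimates for two classes and two observed terms:
prior = {"earnings": 0.3, "other": 0.7}
cond = {("profit", "earnings"): 0.8, ("profit", "other"): 0.1,
        ("quarter", "earnings"): 0.6, ("quarter", "other"): 0.2}
print(nb_classify(["profit", "quarter"], ["earnings", "other"], prior, cond))
```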

10 Classification with ME
- The ME approach models the joint probability p(c,d) as

  p(c,d) = (1/Z) * ∏_{i=1..K} α_i^{f_i(d,c)}

  - Z is a normalization constant
  - The feature weights α_i are learned from the data
  - K is the number of features
- Classification with ME
  - We have p(c,d) = p(c|d) * p(d); again, p(d) can be dropped for ranking
  - Compute p(c,d) for all classes and return the class with the maximal value
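
A minimal sketch of this ranking step, assuming the feature functions and their weights α_i are already given (both are hypothetical here):

```python
import math

def me_rank(d, classes, features, alpha):
    """Score classes by the ME model p(c,d) = (1/Z) * prod_i alpha_i^{f_i(d,c)}.
    Z is identical for all classes and can be dropped for the argmax,
    so we compare sum_i f_i(d,c) * log(alpha_i) instead."""
    score = lambda c: sum(f(d, c) * math.log(a) for f, a in zip(features, alpha))
    return max(classes, key=score)

# Hypothetical binary feature and its learned weight:
features = [lambda d, c: int("profit" in d and c == "earnings")]
alpha = [2.5]
print(me_rank({"profit", "loss"}, ["earnings", "other"], features, alpha))
```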

11 Where is the Problem?
- We want the α_i to assert certain conditions on our joint probability distribution p(c,d)
- Counting distributions of single features over the training set does not in itself create a joint distribution
- Counting joint distributions: data sparsity problem (again)
- In NB, we additionally assumed statistical independence to come up with a joint distribution (using Bayes' theorem)
- ME goes another way and computes the probability distribution which maximizes the entropy of the joint distribution
- Thus, it makes as few assumptions as possible given the data
- This distribution is encoded in the feature weights α_i

12 Properties of Maximum Entropy Classifiers
- In general, ME should outperform NB
  - But not always; there is theory behind the discrepancies (not covered here)
- It does not assume independence of features
  - Two redundant features will simply get half of their weight each
- Very popular in statistical NLP
  - Some of the best POS taggers are ME-based
  - Some of the best NER systems are ME-based
- Several extensions
  - Maximum Entropy Markov Models
  - Conditional Random Fields

13 Class of Linear Classifiers
- Many common classifiers are (log-)linear classifiers
  - Naïve Bayes
  - Perceptron / Winnow
  - Rocchio
  - Linear and logistic regression
  - Maximum entropy
  - Support vector machines (with linear kernel)
- All compute a hyperplane which (hopefully) separates the two classes
- Despite the similarity, noticeable performance differences exist
  - Which of the infinitely many possible separating hyperplanes is chosen?
  - How are non-separable data sets handled?
- Experience: classifiers more powerful than linear often don't perform better (on text)

14 Linear Classifiers
- All learn a hyperplane which is used to separate classes in high-dimensional space
- For illustration, we stay in 2-dimensional space and look at binary classification problems only
- But which hyperplane?
Source: Xiaojin Zhu, SVM-cs540

15 Support Vector Machines (sketch)
- Compute the hyperplane which maximizes the margin, i.e., is as far away from any data point as possible
- Can be cast as a convex (quadratic) optimization problem and solved efficiently
- The solution only depends on the support vectors (the points closest to the hyperplane)
- Complication: usually the classes are not linearly separable
  - Minimize the error (misclassification)
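
A minimal sketch of a linear SVM on toy 2-d data (assumes scikit-learn; the points are made up). The parameter C controls the trade-off between a wide margin and misclassification errors, which is how non-separable data is handled:

```python
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)   # the solution depends only on these points
print(clf.predict([[2, 2]]))  # side of the separating hyperplane
```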

16 Problems not Linearly Separable
- Map high-dimensional data into an even higher-dimensional space
- Non-linearly separable sets may become linearly separable there
- Doing this efficiently requires a good deal of work ("kernel trick")

17 Content of this Lecture
- (Text) clustering
- Clustering algorithms
- Application

18 Clustering
- Clustering groups objects into (usually disjoint) sets
- Intuitively, each set should contain objects that are similar to each other and dissimilar to objects in any other set
  - We need a similarity or distance function
  - Two optimization goals
- Also called unsupervised learning
  - We don't know how many sets/classes we expect
  - We don't know what those sets should look like
  - We have no examples for set members
- Supervised learning = classification / categorization

19 Example 1

20 Clustering 1
- Intuition here: similarity corresponds to Euclidean distance
- Optimization for a good ratio of intra-cluster coherence and inter-cluster distance

21 Clustering 2
Better or worse?

22 Quality of a Clustering
- Let us measure cluster quality only by the average distance of objects within a cluster
- Definition: Let f be a clustering of a set of objects O into a set of classes C with |C| = k. Let m_c be the centre of all objects of class c (to be defined later), and let d(o,o') be the distance between two objects o and o'. Then, the k-score of f is

  q_k(f) = Σ_{c∈C} Σ_{o: f(o)=c} d(o, m_c)

- Remark
  - Similarly, we could define the k-score as the average distance across all object pairs within a cluster
  - Would relieve us from finding the centre of a set of objects
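
A minimal sketch of the k-score in Python, taking the centre m_c as the mean vector of a cluster's members (one common choice, as the slide leaves the centre "to be defined later"):

```python
import numpy as np

def k_score(points, assignment, dist):
    """Sum over all clusters c of d(o, m_c) for each object o with
    f(o) = c; the centre m_c is the mean vector of the cluster."""
    points = np.asarray(points, dtype=float)
    total = 0.0
    for c in set(assignment):
        members = points[np.array(assignment) == c]
        centre = members.mean(axis=0)
        total += sum(dist(o, centre) for o in members)
    return total

# Toy usage: two tight clusters of two points each.
print(k_score([[0, 0], [1, 1], [5, 5], [6, 6]], [0, 0, 1, 1],
              lambda a, b: np.linalg.norm(a - b)))
```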

23 6-Score
- Find the centre of all clusters, compute distances, aggregate
- Probably better than the 2-score of the clustering on the previous slide
- But ...

24 Disadvantage
- The optimal clustering is trivially reached for k = |O|
- We need to fix our definition

25 Quality of a Clustering 2
- Definition: Let f: O → C with |C| arbitrary. Let dist(o, c_i) be the average distance of o to all points of cluster c_i. We define
  - Inner score: a(o) = dist(o, f(o))
  - Outer score: b(o) = min(dist(o, c_i)) with c_i ≠ f(o)
  - The silhouette s(o) of an object:

    s(o) = (b(o) - a(o)) / max(a(o), b(o))

  - The silhouette s(f) of f: s(f) = Σ s(o)
- Note
  - b(o): how much would the score decrease if f(o) did not exist and o was assigned to its next-best other cluster?
  - s(o) ~ 0: point right between two clusters
  - s(o) ~ 1: point very close to only one (its own) cluster
  - s(o) ~ -1: point far away from its own cluster
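
A minimal sketch of the silhouette, averaged over all objects (singleton clusters get a(o) = 0 here, a simplification):

```python
import numpy as np

def silhouette(points, assignment, dist):
    """Mean of s(o) = (b(o) - a(o)) / max(a(o), b(o)): a(o) is the mean
    distance to the other members of o's cluster, b(o) the smallest
    mean distance to any other cluster."""
    points = np.asarray(points, dtype=float)
    clusters = {c: [i for i, x in enumerate(assignment) if x == c]
                for c in set(assignment)}
    scores = []
    for i, o in enumerate(points):
        own = assignment[i]
        a = np.mean([dist(o, points[j]) for j in clusters[own] if j != i] or [0.0])
        b = min(np.mean([dist(o, points[j]) for j in clusters[c]])
                for c in clusters if c != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

scikit-learn offers the same measure as sklearn.metrics.silhouette_score.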

26 Quality of Clustering 3
- The silhouette is a very technical definition
- Usually, we want to find intuitively appealing clusters
- Those might not at all conform to our definitions
Source: [FPPS96]

27 Text Clustering Applications
- Explorative data analysis: learn about the structure within your document collection
- Corpus preprocessing
  - Clustering provides a semantic index to the corpus
  - Group docs into clusters to ease navigation
  - Retrieval speed: index only one representative per cluster
- Processing of search results
  - Cluster all hits into groups of similar hits (in particular: duplicates)
- Improving search recall
  - Return a doc and all members of its cluster
  - Similar to automatic relevance feedback using the top-k docs
- Word sense disambiguation
  - The different senses of a word should appear as clusters

28 Processing Search Results
"The research breakthrough was labeling the clusters, i.e., grouping search results into folder topics" [Clusty.com blog]

29 Similarity between Documents
- All clustering methods require some form of distance or similarity function
  - Must be a metric: d(x,x) = 0, d(x,y) = d(y,x), d(x,y) ≤ d(x,z) + d(z,y)
- In contrast to search, we now compare two docs with each other, not a document and a query
- Nevertheless, the same methods are usually used
  - Compute TF/IDF values for all terms in the corpus
  - Represent documents as |K|-dimensional vectors
  - Use the cosine as distance function:

    sim(d_1, d_2) = cos(d_1, d_2) = (d_1 ∘ d_2) / (|d_1| * |d_2|)
                  = Σ_i (d_1[i] * d_2[i]) / ( sqrt(Σ_i d_1[i]²) * sqrt(Σ_i d_2[i]²) )
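
A minimal sketch of this pipeline (assumes scikit-learn; the two example sentences are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["text clustering groups similar documents",
        "classification assigns documents to known classes"]

X = TfidfVectorizer().fit_transform(docs)  # docs as |K|-dimensional TF/IDF vectors
print(cosine_similarity(X[0], X[1]))       # cosine similarity of the two docs
```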

30 Further Issues
- To increase speed, feature selection is necessary
  - We never counted the time it takes to compare two high-dimensional vectors
- Do not cluster on all terms; instead, use the most descriptive terms for your intended clustering
- Cluster labels
  - Use the representative, e.g., show the 5-10 terms with the highest TF/IDF values in the cluster centre

31 Content of this Lecture
- Text clustering
- Clustering algorithms
  - Hierarchical clustering
  - K-means
  - Soft clustering: EM algorithm
- Application

32 Classes of Cluster Algorithms
- Hierarchical clustering: iteratively creates a hierarchy of clusters
  - Bottom-up: start from |D| clusters and merge clusters until only one remains
  - Top-down: start from one cluster (including all docs) and split clusters until every doc is its own cluster
  - Or until some stop criterion is met
- Partitioning: heuristically partition all objects into k clusters
  - Guess a first partitioning and improve iteratively
  - k is a parameter of the method, not a result
- Other
  - Algorithmic: Max-Cut (partitioning) etc.
  - Density-based clustering
  - Minimum description length

33 Hierarchical Clustering
- Also called UPGMA: Unweighted Pair-Group Method with Arithmetic mean
- We only discuss the bottom-up approach
- Computes a binary tree (dendrogram)
- Simple algorithm
  - Compute the distance matrix M (distances between any pair of docs)
  - Choose the pair d_1, d_2 with the smallest distance
  - Compute x = m(d_1, d_2) (the centre point)
  - Remove d_1, d_2 from M; insert x
  - Distance between x and any d in M: average of the distances between d_1 and d and between d_2 and d
  - Loop until M is empty
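
A minimal sketch of bottom-up hierarchical clustering (assumes SciPy; the toy vectors stand in for document vectors). SciPy's method="average" is the UPGMA variant from the slide:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

docs = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.1], [4.2, 3.9]])
Z = linkage(docs, method="average", metric="euclidean")
print(Z)  # one row per merge: the two clusters joined and their distance
# scipy.cluster.hierarchy.dendrogram(Z) draws the binary tree
```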

34 Example: Distance Matrix
(pairwise distance matrix over the docs A-F)

35 Iteration
(successive distance matrices; merge steps: (B,D) → a, (E,F) → b, (A,b) → c, (C,G) → d, (d,c) → e, (a,e) → f)

36 Hierarchical Clustering

37 Properties
- Advantages
  - Simple and intuitive algorithm
  - The number of clusters is not an input of the method
  - Usually good-quality clusters
- Disadvantages
  - Very expensive: requires O(n²) space and time (at least) for the distance matrix; total runtime is O(n² log n) - why?
  - Not applicable as such to large doc sets
  - Does not really generate clusters

38 Intuition
- Hierarchical clustering organizes a doc collection
- Ideally, hierarchical clustering directly creates a directory of the corpus
- This is more of a wish
  - There are many, many ways to group objects; clustering will choose just one
  - There are no names for the groups

39 Branch Length
- Use branch length to symbolize distance
- Outlier detection

40 Variations
- Hierarchical clustering uses the distance between the centres of clusters to decide about the distance between clusters
- Other alternatives
  - Single link: distance of the two closest docs in both clusters
  - Complete link: distance of the two furthest docs
  - Average link: average distance between pairs of docs from both clusters
  - Centroid: distance between centre points
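
A hedged sketch of these variants (assumes SciPy; random toy data): changing `method` switches the cluster-distance definition while everything else stays the same.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

docs = np.random.default_rng(0).normal(size=(10, 2))
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(docs, method=method)  # same data, four distance definitions
    print(method, "-> last merge at distance", round(Z[-1, 2], 3))
```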

43 Single-link versus Complete-link

44 More Properties
- Single link
  - Optimizes a local criterion (only looks at the closest pair)
  - Similar to computing a minimal spanning tree, with cuts at the most expensive branches while going down the hierarchy
  - Creates elongated clusters (chaining effect)
- Complete link
  - Optimizes a global criterion (looks at the worst pair)
  - Creates more compact, more convex, spherical clusters

45 Content of this Lecture
- Text clustering
- Clustering algorithms
  - Hierarchical clustering
  - K-means
  - Soft clustering: EM algorithm
- Application

46 K-Means
- Partitioning method
- K-means probably is the best-known clustering algorithm
- Requires the number k of clusters to be predefined
- Algorithm
  - Guess k cluster centres at random (can use k docs, or k random points in doc space)
  - Loop forever
    - Assign all docs to their closest cluster centre
    - If no doc has changed its assignment, stop (or if sufficiently few docs have changed their assignment)
    - Otherwise, compute each new cluster centre as the centre of all points in the cluster
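
A minimal sketch of this loop in Python (assumes NumPy; empty clusters are not handled, which a real implementation must do):

```python
import numpy as np

def k_means(docs, k, iters=100, seed=0):
    """Pick k random docs as initial centres, then alternate assignment
    and centre update until assignments stop changing."""
    docs = np.asarray(docs, dtype=float)
    rng = np.random.default_rng(seed)
    centres = docs[rng.choice(len(docs), size=k, replace=False)]
    assignment = np.full(len(docs), -1)
    for _ in range(iters):
        # Assign every doc to its closest centre.
        dists = np.linalg.norm(docs[:, None, :] - centres[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break  # converged: no doc changed its cluster
        assignment = new_assignment
        # Recompute each centre as the mean of its cluster's points.
        centres = np.array([docs[assignment == c].mean(axis=0) for c in range(k)])
    return assignment, centres
```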

47 Example 1
- k = 3
- Choose random start points
Source: Stanford, CS 262 Computational Genomics

48 Example 2
Assign docs to the closest cluster centre

49 Example 3
Compute the new cluster centres

50 Example 4

51 Example 5

52 Example 6
Converged

53 Properties
- Usually, k-means converges quite fast
  - Let l be the number of iterations; complexity: O(l*k*n)
  - Assignment: n*k distance computations
  - New centres: summing up n vectors, k times
- Choosing the right start points is important: k-means essentially is a greedy heuristic and only finds local optima
  - Option 1: start several times with different start points
  - Option 2: compute a hierarchical clustering on a small random sample and choose the start points as its cluster centres (Buckshot algorithm)
- How to choose k? Try different k and use a quality score to find the best value (see the sketch below)
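
A minimal sketch of that model-selection loop (assumes scikit-learn; random stand-in data): n_init=10 restarts with different start points mitigate bad local optima, and the silhouette scores the candidate values of k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = np.random.default_rng(0).normal(size=(100, 5))  # stand-in doc vectors
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(docs)
    print(k, round(silhouette_score(docs, labels), 3))  # pick k with best score
```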

54 k-Means and Outliers
Try for k = 3

55 Help: K-Medoid
- Choose the doc in the middle of a cluster as representative
- PAM: Partitioning Around Medoids
- Advantages
  - Less sensitive to outliers
  - Also works for non-metric spaces, as no new centre point needs to be computed
- Disadvantages
  - More expensive: we need to compute all pairwise distances in each cluster in each round
  - Overall complexity is O(n³)
  - Can save re-computations at the expense of more space
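
A sketch of one PAM-style round (a simplified reading of the idea above, not the full PAM swap procedure): assign each object to its closest medoid, then make the new medoid of each cluster the member minimising the sum of distances to all other members.

```python
def k_medoid_step(dist, medoids):
    """One assignment + medoid-update round. `dist[i][j]` is a
    precomputed distance matrix, `medoids` a list of object indices."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
    return [min(ms, key=lambda c: sum(dist[c][o] for o in ms))
            for ms in clusters.values()]

# Toy usage: two well-separated pairs of objects.
dist = [[0, 1, 9, 8], [1, 0, 8, 9], [9, 8, 0, 1], [8, 9, 1, 0]]
print(k_medoid_step(dist, [0, 2]))
```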

56 k-Medoid and Outliers

57 Content of this Lecture
- Text clustering
- Clustering algorithms
  - Hierarchical clustering
  - K-means
  - Soft clustering: EM algorithm
- Application

58 Soft Clustering
- So far we assumed docs are assigned exactly one cluster
- Probabilistic interpretation: all docs pertain to all clusters with a certain probability
- Generative model
  - Assume we have k doc-producing devices (such as authors, topics, ...)
  - Each device produces docs that are normally distributed in vector space, with a device-specific mean and variance
  - Assume that the k devices have produced the D documents
  - Clustering can then be interpreted as recovering the mean and variance of each distribution / device
- Solution: Expectation Maximization (EM) algorithm

59 Expectation Maximization (rough sketch, no math)
- EM optimizes the set of parameters Θ of a multivariate normal distribution (means and variances of k clusters) given sample data
- Iterative process with two phases
  - Expectation: assuming an instantiation of Θ, we can assign each doc its most likely generator / cluster
  - Maximization: assuming an assignment of docs to generators, we can compute the optimal Θ using maximum likelihood estimation (or using a Bayesian approach including a-priori probabilities of the generators)
- Algorithm
  - Guess an initial Θ
  - Iterate through both steps until convergence
  - Finds a local optimum; convergence is guaranteed
- K-means: special case of EM clustering, assuming k normal distributions with different means yet equal and minimal variance
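
A minimal sketch of soft clustering with EM (assumes scikit-learn; the two "generators" below are made up): GaussianMixture fits k normal distributions and returns, per doc, the probability of belonging to each cluster.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
docs = np.vstack([rng.normal(0, 1, size=(50, 2)),   # generator 1
                  rng.normal(5, 1, size=(50, 2))])  # generator 2

gm = GaussianMixture(n_components=2, random_state=0).fit(docs)
print(gm.means_)                   # recovered means of the generators
print(gm.predict_proba(docs[:3]))  # soft assignment: p(cluster | doc)
```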

60 Content of this Lecture
- Text clustering
- Clustering algorithms
- Application: Clustering Phenotypes

61 Mining Phenotypes for Function Prediction

62 Phenotypes
- Observable characteristics of an organism, produced by the organism's genotype interacting with the environment
(figure: two individuals with a single-base difference (A vs. T) in a gene; gene → transcripts → proteins → phenotypes)

63 Phenotypes
- Observable characteristics of an organism, produced by the organism's genotype interacting with the environment
(figure: as before, with the two resulting phenotypes labelled Disease vs. Healthy)

64 Genotypes
- ATCGATCGATGA vs. ATCGACCGATGA
- Measuring genotypes: sequencing, microarrays, etc.
- Describing genotypes: Gene Ontology (> terms, 10 years of history, widely accepted)

65 Mining Phenotypes: General Idea
- Established: genes with similar genotypes likely have similar phenotypes
- Question: if genes generate very similar phenotypes, do they have the same genotype?

66 Phenotypes
- What is a phenotype?
  - A visible characteristic of an organism
  - Description of a disease
  - Response to a drug
  - Characterization of mutants
  - Results of RNAi / gene knock-out
  - Expression levels of genes
- Methods for the systematic measurement of phenotypes have been established for only a few years
- Describing phenotypes
  - Today: text, keywords, abstracts, home-grown vocabularies
  - Tomorrow: Mammalian Phenotype Ontology?

67 Approach
(diagram: the phenotype descriptions of gene A and gene B are compared for similarity; from similar phenotype descriptions, GO annotations are inferred and predicted across genes)

68 Phenodocs
- PhenomicDB: 411,102 phenotypes; short texts (<250 words)
- Remove small phenotypes
- Remove multi-gene phenotypes (phenotypes associated to more than one gene, ~500)
- Remove stop words, apply stemming
- Result: 39,610 phenodocs for 15,426 genes

69 K-Means Clustering
- Hierarchical clustering would require ~39,610² pairwise comparisons
- K-means: simple, iterative algorithm
- The number of clusters must be predefined; we experimented with different numbers of clusters

70 Properties: Phenodoc Similarity of Genes
(plots: pairwise similarity scores of phenodocs of genes in the same cluster, sorted by score; genes with 5 PTs in phenoclusters vs. a control of randomly selected genes)
- Result: phenodocs of genes in phenoclusters are highly similar to each other

71 PPI: Inter-Connectedness
- Interacting proteins often share function
- PPI data from the BioGRID database (not at all a complete dataset)
- In >200 clusters, >30% of genes interact with each other
- Control (random groups): 3 clusters
- Result: genes in phenoclusters interact with each other much more often than expected by chance
(figure: proteins and interactions from BioGRID; red proteins have no phenotypes in PhenomicDB)

72 Coherence of Functional Annotation
- Comparison of the GO annotation of genes in phenoclusters (data from Entrez Gene)
- Similarity of two GO terms: normalized number of shared ancestors
- Similarity of two genes: average of the top-k GO term pairs
- >200 clusters with score >0.4; control: 2 clusters
- Result: genes in phenoclusters have a much higher coherence in functional annotation than expected by chance
(figure: Gene Ontology excerpt, from Molecular Function and Biological Process down to terms such as Kinase Activity, Signal Transduction, and Protein Modification)

73 Function Prediction
- Can the increased functional coherence of clusters be exploited for function prediction?
- General approach
  - Compute phenoclusters
  - For each cluster, compute the set of associated genes (gene cluster)
  - In each gene cluster, predict the common GO terms to all genes (common: annotated to >50% of the genes in the cluster)
- Filtering clusters
  - Idea: find clusters a-priori which give hope for very good results
  - Filter 1: only clusters with >2 members and at least one common GO term
  - Filter 2: only clusters with GO coherence >0.4
  - Filter 3: only clusters with PPI-connectedness >33%

74 Evaluation
- How can we know how good we are? Cross-validation
  - Separate genes into training (90%) and test (10%) sets
  - Remove the annotations from the genes in the test set
  - Build clusters and predict functions on the training set
  - Compare the predicted with the removed annotations: precision and recall
  - Repeat and average the results (macro-average)
- Note: this punishes new and potentially valid annotations

75 Results for Different Filters

              Filter 1   Filter 1 & 2   Filter 1 & 3
  Precision   67.91%     62.52%         60.52%
  Recall      22.98%     26.16%         19.78%

- What if we consider predicted terms to be correct that are a little more general than the removed terms (Filter 1)?
  - One step more general: 75.6% precision, 28.7% recall
  - Two steps: 76.3% precision, 30.7% recall
- The less stringent the GO equality, the better the results; this is a common trick in studies using GO

76 Results for Different Cluster Sizes

(columns: four increasing values of k)
  Clusters w/ GO-sim = 1:      14 (5.6%)    26 (5.2%)    44 (5.9%)    71 (7.1%)
  Clusters w/ PPI >= 75%:      12 (4.8%)    34 (6.8%)    65 (8.7%)    88 (8.8%)
  Clusters w/ PPI >= 33%:      49 (19.6%)  119 (23.8%)  193 (25.7%)  252 (25.2%)
  Clusters for GO prediction:  73 (29.2%)  153 (30.6%)  230 (30.7%)  295 (29.5%)
  Precision:                   81.53%      77.16%       74.26%       71.73%
  Recall:                      16.90%      20.22%       24.45%       26.36%

- With increasing k
  - Clusters are smaller and more homogeneous
  - The number of predicted terms increases
  - The number of genes which receive annotations stays roughly the same
  - Precision decreases slowly, recall increases: an effect of the rapid increase in the number of predictions
