INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

Similar documents
INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

Introduction to Information Retrieval

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017

Introduction to Information Retrieval

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)

Information Retrieval and Web Search Engines

Clustering CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval and Web Search Engines

CSE 5243 INTRO. TO DATA MINING

Clustering CS 550: Machine Learning

Gene Clustering & Classification

Data Clustering. Danushka Bollegala

PV211: Introduction to Information Retrieval

Unsupervised Learning and Clustering

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Cluster Analysis: Agglomerate Hierarchical Clustering

Unsupervised Learning

Unsupervised Learning : Clustering

Information Retrieval and Organisation

Clustering (COSC 416) Nazli Goharian. Document Clustering.

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

INF 4300 Classification III Anne Solberg The agenda today:

What to come. There will be a few more topics we will cover on supervised learning

CSE 5243 INTRO. TO DATA MINING

Unsupervised Learning and Clustering

Based on Raymond J. Mooney s slides

Clustering Algorithms for general similarity measures

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

ECLT 5810 Clustering

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Unsupervised Learning

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

ECLT 5810 Clustering

Clustering (COSC 488) Nazli Goharian. Document Clustering.

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Data Informatics. Seon Ho Kim, Ph.D.

Introduction to Information Retrieval

Cluster Analysis. Ying Shen, SSE, Tongji University

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Text Documents clustering using K Means Algorithm

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Clustering. Partition unlabeled examples into disjoint subsets of clusters, such that:

Artificial Intelligence. Programming Styles

Unsupervised Learning Partitioning Methods

CHAPTER 4: CLUSTER ANALYSIS

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

Chapter 9. Classification and Clustering

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

University of Florida CISE department Gator Engineering. Clustering Part 2

CS47300: Web Information Search and Management

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

MI example for poultry/export in Reuters. Overview. Introduction to Information Retrieval. Outline.

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Machine Learning (BSMC-GA 4439) Wenke Liu


Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Natural Language Processing

Cluster Analysis. Summer School on Geocomputation. 27 June July 2011 Vysoké Pole

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering and Visualisation of Data

CSE 158. Web Mining and Recommender Systems. Midterm recap

Today s lecture. Clustering and unsupervised learning. Hierarchical clustering. K-means, K-medoids, VQ

Clustering algorithms

Lesson 3. Prof. Enza Messina

Cluster Analysis. CSE634 Data Mining

Network Traffic Measurements and Analysis

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

10701 Machine Learning. Clustering

Administrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES

Clustering: K-means and Kernel K-means

Clustering Part 3. Hierarchical Clustering

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering in Data Mining

Clustering: Overview and K-means algorithm

Association Rule Mining and Clustering

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning. Unsupervised Learning. Manfred Huber

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

6. Learning Partitions of a Set

Machine learning - HT Clustering

Road map. Basic concepts

Introduction to Mobile Robotics

Understanding Clustering Supervising the unsupervised

Exploratory Analysis: Clustering

CLUSTERING. Quiz information. today 11/14/13% ! The second midterm quiz is on Thursday (11/21) ! In-class (75 minutes!)

Search Engines. Information Retrieval in Practice

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme

Machine Learning for OR & FE

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

Big Data Analytics! Special Topics for Computer Science CSE CSE Feb 9

Transcription:

INF4820 Algorithms for AI and NLP: Evaluating Classifiers; Clustering
Erik Velldal & Stephan Oepen
Language Technology Group (LTG)
September 23, 2015

Agenda

Last week
- Supervised vs. unsupervised learning.
- Vector space classification.
- How to represent classes and class membership.
- Rocchio + kNN.
- Linear vs. non-linear decision boundaries.

Today
- Evaluation of classifiers.
- Unsupervised machine learning for class discovery: clustering.
- Flat vs. hierarchical clustering.
- k-means clustering.
- Vector space quiz.

Testing a classifier

- Vector space classification amounts to computing the boundaries in the space that separate the class regions: the decision boundaries.
- To evaluate the boundaries, we measure the number of correct classification predictions on unseen test items. There are many ways to do this.
- We want to test how well a model generalizes on a held-out test set.
- Labeled test data is sometimes referred to as the gold standard.
- Why can't we test on the training data?

Example: Evaluating classifier decisions

Predictions for a given class can be wrong or correct in two ways:

                         gold = positive        gold = negative
prediction = positive    true positive (TP)     false positive (FP)
prediction = negative    false negative (FN)    true negative (TN)
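
To make the 2-by-2 table concrete, here is a minimal Python sketch (the lecture does not prescribe any particular language) that counts TP, FP, FN and TN for one class from parallel lists of gold and predicted labels; the function name is illustrative.

```python
def confusion_counts(gold, predicted, target):
    """Count TP, FP, FN, TN for one class (`target`) from parallel label lists."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == target and g == target:
            tp += 1          # predicted positive, gold positive
        elif p == target and g != target:
            fp += 1          # predicted positive, gold negative
        elif p != target and g == target:
            fn += 1          # predicted negative, gold positive
        else:
            tn += 1          # predicted negative, gold negative
    return tp, fp, fn, tn
```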

Example: Evaluating classifier decisions

accuracy = (TP + TN) / N = (1 + 6) / 10 = 0.7
precision = TP / (TP + FP) = 1 / (1 + 1) = 0.5
recall = TP / (TP + FN) = 1 / (1 + 2) = 0.33
F-score = 2 * precision * recall / (precision + recall) = 0.4

Evaluation measures

accuracy = (TP + TN) / N = (TP + TN) / (TP + TN + FP + FN)
- The ratio of correct predictions.
- Not suitable for unbalanced numbers of positive / negative examples.

precision = TP / (TP + FP)
- The proportion of detected class members that were correct.

recall = TP / (TP + FN)
- The proportion of actual class members that were detected.
- Trade-off: positive predictions for all examples would give 100% recall but (typically) terrible precision.

F-score = 2 * precision * recall / (precision + recall)
- Balanced measure of precision and recall (the harmonic mean).
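
The worked example above corresponds to the counts TP = 1, FP = 1, FN = 2, TN = 6 (N = 10). Here is a small sketch of the four measures, using those counts as a sanity check; the function name is my own.

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from the confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, precision, recall, f_score

# The example from the slide: TP=1, FP=1, FN=2, TN=6
print(evaluate(1, 1, 2, 6))   # -> (0.7, 0.5, 0.333..., 0.4)
```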

Evaluating multi-class predictions

Macro-averaging
- Compute precision and recall for each class, and then compute global averages of these.
- The macro average will be highly influenced by the small classes.

Micro-averaging
- Sum TPs, FPs, and FNs for all points/objects across all classes, and then compute global precision and recall.
- The micro average will be highly influenced by the large classes.

A note on obligatory assignment 2b

- Builds on oblig 2a: vector space representation of a set of words based on BoW features extracted from a sample of the Brown corpus.
- For 2b we'll provide class labels for most of the words.
- Train a Rocchio classifier to predict labels for a set of unlabeled words.

Label         Examples
food          potato, food, bread, fish, eggs, ...
institution   embassy, institute, college, government, school, ...
title         president, professor, dr, governor, doctor, ...
place_name    italy, dallas, france, america, england, ...
person_name   lizzie, david, bill, howard, john, ...
unknown       department, egypt, robert, butter, senator, ...
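
A sketch of the two averaging schemes, assuming the per-class (TP, FP, FN) counts have already been collected into a dictionary; the interface is illustrative, not part of the assignment.

```python
def macro_micro(counts):
    """`counts` maps each class label to a (TP, FP, FN) triple.

    Macro: average the per-class precision/recall values.
    Micro: pool the counts across classes, then compute precision/recall.
    Returns ((macro_p, macro_r), (micro_p, micro_r)).
    """
    precs, recs = [], []
    tp_sum = fp_sum = fn_sum = 0
    for tp, fp, fn in counts.values():
        precs.append(tp / (tp + fp) if tp + fp else 0.0)
        recs.append(tp / (tp + fn) if tp + fn else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = (sum(precs) / len(precs), sum(recs) / len(recs))
    micro = (tp_sum / (tp_sum + fp_sum), tp_sum / (tp_sum + fn_sum))
    return macro, micro
```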

A note on obligatory assignment 2b

- For a given set of objects {o_1, ..., o_m}, the proximity matrix R is a square m × m matrix where R_ij stores the proximity of o_i and o_j.
- For our word space, R_ij would give the dot-product of the normalized feature vectors x_i and x_j, representing the words o_i and o_j.
- Note that if our similarity measure sim is symmetric, i.e. sim(x, y) = sim(y, x), then R will also be symmetric, i.e. R_ij = R_ji.
- Computing all the pairwise similarities once and then storing them in R can help save time in many applications.
- R will provide the input to many clustering methods.
- By sorting the row elements of R, we get access to an important type of similarity relation: nearest neighbors.
- For 2b we will implement a proximity matrix for retrieving kNN relations.

Two categorization tasks in machine learning

Classification
- Supervised learning, requiring labeled training data.
- Given some training set of examples with class labels, train a classifier to predict the class labels of new objects.

Clustering
- Unsupervised learning from unlabeled data.
- Automatically group similar objects together.
- No pre-defined classes: we only specify the similarity measure.
- "The search for structure in data" (Bezdek, 1981).
- General objective: partition the data into subsets, so that the similarity among members of the same group is high (homogeneity) while the similarity between the groups themselves is low (heterogeneity).
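
A possible NumPy sketch of such a proximity matrix, and of kNN retrieval by sorting a row of it; this is only an illustration, not the interface required for 2b.

```python
import numpy as np

def proximity_matrix(X):
    """R[i, j] = dot-product of the length-normalized rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_hat = X / np.where(norms == 0, 1, norms)   # avoid division by zero
    return X_hat @ X_hat.T                       # symmetric, since sim is

def knn(R, i, k):
    """Indices of the k nearest neighbors of object i (excluding i itself),
    found by sorting row i of the proximity matrix."""
    order = np.argsort(R[i])[::-1]               # most similar first
    return [j for j in order if j != i][:k]
```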

Example applications of cluster analysis
- Visualization and exploratory data analysis.
- Many applications within IR. Examples:
  - Speed up search: first retrieve the most relevant cluster, then retrieve documents from within the cluster.
  - Presenting the search results: instead of ranked lists, organize the results as clusters.
- Dimensionality reduction / class-based features.
- News aggregation / topic directories.
- Social network analysis: identify sub-communities and user segments.
- Image segmentation, product recommendations, demographic analysis, ...

Main types of clustering methods

Hierarchical
- Creates a tree structure of hierarchically nested clusters.
- Topic of the next lecture.

Flat
- Often referred to as partitional clustering.
- Tries to directly decompose the data into a set of clusters.
- Topic of today.

Flat clustering

- Given a set of objects O = {o_1, ..., o_n}, construct a set of clusters C = {c_1, ..., c_k}, where each object o_i is assigned to a cluster c_i.
- Parameters:
  - The cardinality k (the number of clusters).
  - The similarity function s.
- More formally, we want to define an assignment γ : O → C that optimizes some objective function F_s(γ).
- In general terms, we want to optimize for:
  - High intra-cluster similarity
  - Low inter-cluster similarity

Flat clustering (cont'd)

- Optimization problems are search problems: there's a finite number of possible partitionings of O.
- Naive solution: enumerate all possible assignments Γ = {γ_1, ..., γ_m} and choose the best one, γ̂ = argmin_{γ ∈ Γ} F_s(γ).
- Problem: exponentially many possible partitions.
- Approximate the solution by iteratively improving on an initial (possibly random) partition until some stopping criterion is met.
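
To see why brute-force enumeration is hopeless: the Stirling numbers of the second kind count the partitions of n objects into exactly k non-empty clusters. A small sketch using the standard recurrence (the function name is my own):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def partitions(n, k):
    """Stirling number of the second kind: the number of ways to split
    n objects into k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if n == 0 or k > n:
        return 0
    # Either object n forms a new cluster, or it joins one of the k existing ones.
    return partitions(n - 1, k - 1) + k * partitions(n - 1, k)

print(partitions(20, 4))   # already in the tens of billions for just 20 objects
```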

k-means

- Unsupervised variant of the Rocchio classifier.
- Goal: partition the n observed objects into k clusters C so that each point x_j belongs to the cluster c_i with the nearest centroid µ_i.
- Typically assumes Euclidean distance as the similarity function s.
- The optimization problem: minimize the within-cluster sum of squares, summed over all clusters, F_s = WCSS:

  WCSS = Σ_{c_i ∈ C} Σ_{x_j ∈ c_i} ||x_j − µ_i||²

- Equivalent to minimizing the average squared distance between objects and their cluster centroids (since n is fixed): a measure of how well each centroid represents the members assigned to the cluster.

k-means (cont'd)

Algorithm (sketched in code below)
- Initialize: compute centroids for k seeds.
- Iterate: assign each object to the cluster with the nearest centroid; compute new centroids for the clusters.
- Terminate: when the stopping criterion is satisfied.

Properties
- In short, we iteratively reassign memberships and recompute centroids until the configuration stabilizes.
- WCSS is monotonically decreasing (or unchanged) for each iteration.
- Guaranteed to converge, but not to find the global minimum.
- The time complexity is linear, O(kn).
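
A compact NumPy sketch of this procedure, using k random objects as seeds and Euclidean distance; the function name and defaults are mine, not the course's reference implementation.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: pick k random objects as seeds, then alternate between
    assigning points to the nearest centroid and recomputing centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # configuration stabilized
            break
        centroids = new_centroids
    # Final assignment against the final centroids, plus the objective value.
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    wcss = ((X - centroids[labels]) ** 2).sum()     # within-cluster sum of squares
    return labels, centroids, wcss
```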

k-means example for k = 2 in R² (Manning, Raghavan & Schütze 2008)

Comments on k-means

Seeding
- We initialize the algorithm by choosing random seeds that we use to compute the first set of centroids.
- Many possible heuristics for selecting seeds: pick k random objects from the collection; pick k random points in the space; pick k sets of m random points and compute centroids for each set; compute a hierarchical clustering on a subset of the data to find k initial clusters; etc.
- The initial seeds can have a large impact on the resulting clustering (because we typically end up only finding a local minimum of the objective function).
- Outliers are troublemakers.
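
Two of the listed seeding heuristics, sketched in NumPy (function names are mine); averaging m random points per seed can dampen the influence of individual outliers.

```python
import numpy as np

def seed_random_objects(X, k, rng):
    """Heuristic: pick k random objects from the collection as seeds."""
    return X[rng.choice(len(X), size=k, replace=False)]

def seed_subset_centroids(X, k, m, rng):
    """Heuristic: pick k sets of m random objects and use their centroids as seeds."""
    return np.array([X[rng.choice(len(X), size=m, replace=False)].mean(axis=0)
                     for _ in range(k)])
```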

Comments on k-means

Possible termination criteria
- A fixed number of iterations.
- Clusters or centroids are unchanged between iterations.
- A threshold on the decrease of the objective function (absolute or relative to the previous iteration).

Some close relatives of k-means
- k-medoids: like k-means, but uses medoids instead of centroids to represent the cluster centers.
- Fuzzy c-means (FCM): like k-means, but assigns soft memberships in [0, 1], where membership is a function of the centroid distance. The computations of both WCSS and centroids are weighted by the membership function.

Flat clustering: the good and the bad

Pros
- Conceptually simple, and easy to implement.
- Efficient: typically linear in the number of objects.

Cons
- The dependence on random seeds, as in k-means, makes the clustering non-deterministic.
- The number of clusters k must be pre-specified. Often no principled means of specifying k a priori.
- The clustering quality is often considered inferior to that of the less efficient hierarchical methods.
- Not as informative as the more structured clusterings produced by hierarchical methods.
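
For the fuzzy case, one standard formulation of the soft memberships (not spelled out on the slide) is u_ij = 1 / Σ_k (d_ij / d_ik)^(2/(m−1)), where d_ij is the distance from point i to centroid j and m > 1 is the fuzzifier. A hedged NumPy sketch with illustrative names:

```python
import numpy as np

def fcm_memberships(X, centroids, m=2.0, eps=1e-12):
    """Soft memberships in [0, 1] for fuzzy c-means: each point gets a
    membership weight for every centroid, based on relative distances.
    Each row sums to 1."""
    # dist[i, j] = Euclidean distance from point i to centroid j
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + eps
    # u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
    ratios = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=2)
```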

Connecting the dots
- Focus of the last two lectures: Rocchio / nearest centroid classification, kNN classification, and k-means clustering.
- Note how k-means clustering can be thought of as performing Rocchio classification in each iteration.
- Moreover, Rocchio can be thought of as a 1-nearest-neighbor classifier with respect to the centroids.
- How can this be? Isn't kNN non-linear and Rocchio linear?

Connecting the dots
- Recall that the kNN decision boundary is locally linear for each cell in the Voronoi diagram.
- For both Rocchio and k-means, we're partitioning the observations according to the Voronoi diagram generated by the centroids.

Next

Hierarchical clustering:
- Creates a tree structure of hierarchically nested clusters.
- Divisive (top-down): let all objects be members of the same cluster; then successively split the group into smaller and maximally dissimilar clusters until every object is its own singleton cluster.
- Agglomerative (bottom-up): let each object define its own cluster; then successively merge the most similar clusters until only one remains.
- How to measure the inter-cluster similarity ("linkage criteria").

Agglomerative clustering
- Initially: regards each object as its own singleton cluster.
- Iteratively agglomerates (merges) the groups in a bottom-up fashion.
- Each merge defines a binary branch in the tree.
- Terminates: when only one cluster remains (the root).

  parameters: {o_1, o_2, ..., o_n}, sim
  C ← {{o_1}, {o_2}, ..., {o_n}}
  T ← []
  for i = 1 to n − 1 do
      {c_j, c_k} ← argmax_{{c_j, c_k} ⊆ C, j ≠ k} sim(c_j, c_k)
      C ← C \ {c_j, c_k}
      C ← C ∪ {c_j ∪ c_k}
      T[i] ← {c_j, c_k}

- At each stage, we merge the pair of clusters that are most similar, as defined by some measure of inter-cluster similarity, sim.
- Plugging in a different sim gives us a different sequence of merges T (see the Python sketch below).
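
A naive Python sketch of this procedure over a precomputed proximity matrix R (quadratic search in each of the n − 1 merges, so O(n³) overall); the function names and the single-linkage helper are mine.

```python
def agglomerative(R, sim):
    """Naive agglomerative clustering over a proximity matrix R.

    R[i][j] holds the pairwise similarity of objects i and j, and `sim`
    computes an inter-cluster similarity (linkage criterion) for two
    clusters, given as tuples of object indices. Returns the merges T.
    """
    clusters = [(i,) for i in range(len(R))]   # singleton clusters
    merges = []
    while len(clusters) > 1:
        # Find the most similar pair of clusters under the linkage `sim`.
        a, b = max(((a, b) for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]),
                   key=lambda pair: sim(R, *pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)                 # the merged cluster
        merges.append((a, b))
    return merges

def single_link(R, a, b):
    """Single-linkage: similarity of the most similar pair across the clusters."""
    return max(R[i][j] for i in a for j in b)
```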

Dendrograms
- A hierarchical clustering is often visualized as a binary tree structure known as a dendrogram.
- A merge is shown as a horizontal line connecting two clusters.
- The y-axis coordinate of the line corresponds to the similarity of the merged clusters.
- We here assume dot-products of normalized vectors (self-similarity = 1).

Definitions of inter-cluster similarity
- So far we've looked at ways to define the similarity between
  - pairs of objects, and
  - objects and a class.
- Now we'll look at ways to define the similarity between collections.
- In agglomerative clustering, a measure of cluster similarity sim(c_i, c_j) is usually referred to as a linkage criterion:
  - Single-linkage
  - Complete-linkage
  - Centroid-linkage
  - Average-linkage
- The linkage criterion determines which pair of clusters we will merge to a new cluster in each step.
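
For comparison, SciPy's hierarchical clustering utilities implement these same linkage criteria and can draw the dendrogram directly; note that SciPy works with distances rather than the normalized dot-product similarities assumed above. The toy data and variable names below are illustrative, not from the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))          # 12 toy objects with 4 features each

# SciPy's `method` parameter corresponds to the linkage criteria on the
# slide: 'single', 'complete', 'average', 'centroid'.
Z = linkage(X, method='average', metric='euclidean')

dendrogram(Z)                         # merges drawn as horizontal lines
plt.ylabel('distance between merged clusters')
plt.show()
```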

Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press.