Classification. 1st Semester 2007/2008


Classification. Departamento de Engenharia Informática, Instituto Superior Técnico, 1st Semester 2007/2008. Slides based on the official slides for the book Mining the Web © Soumen Chakrabarti.

Outline 1 2 3

Organizing Knowledge
- Systematic knowledge structures:
  - Ontologies
  - The Dewey Decimal System, the ACM Computing Classification System
  - Patent subject classification
  - Web catalogs: Yahoo!, Dmoz
- Problem: manual maintenance

Supervised Learning
- Learning to assign objects to classes, given examples
- Learn a classifier

Classification vs. Data Mining
Text classification differs from classical data-mining settings:
- Lots of features and a lot of noise
- No fixed number of columns
- No categorical attribute values
- Data scarcity
- Larger number of class labels
- Hierarchical relationships between classes are less systematic

Classifiers
- Nearest-neighbor classifier: classifies documents according to the class distribution of their neighbors
- Bayesian classifier: discovers the class distribution most likely to have generated a test document
- Support vector machines: discover a hyperplane that separates the classes
- Rule induction: induces rules for classification over diverse features

Other Issues
- Tokenization and feature extraction, e.g., replacing monetary amounts by a special token, part-of-speech tagging, etc.
- Evaluating a text classifier:
  - Accuracy
  - Training speed and scalability
  - Simplicity, speed, and scalability for document modifications
  - Ease of diagnosis, interpretation of results, and adding human judgment and feedback

Outline 1 2 3

k Nearest Neighbors (kNN)
- Similar documents are expected to be assigned the same class label
- Similarity: vector space model + cosine similarity
- Training: index each document and remember its class label
- Testing: fetch the k most similar documents to the given document; the majority class wins
- Alternatives:
  - Weighted counts: counts of classes weighted by the corresponding similarity measure
  - Per-class offset: tuned by testing the classifier on a portion of the training data held out for this purpose

kNN Classifier

score(c, d_q) = b_c + \sum_{d \in kNN(d_q)} \mathrm{sim}(d_q, d)

where the sum runs over the k nearest neighbors of d_q that belong to class c, and b_c is the per-class offset.
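The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the slides: cosine_sim is assumed to be a similarity function supplied by the caller, train_docs a list of (vector, label) pairs, and bias an optional dictionary of per-class offsets b_c.

    from collections import defaultdict

    def knn_classify(d_q, train_docs, k, cosine_sim, bias=None):
        """Return the class maximizing score(c, d_q) = b_c + sum of similarities
        to the k nearest neighbors of d_q that belong to class c."""
        # Rank all training documents by similarity to the query document d_q.
        ranked = sorted(train_docs, key=lambda dc: cosine_sim(d_q, dc[0]), reverse=True)
        scores = defaultdict(float)
        for d, c in ranked[:k]:                    # the k most similar documents
            scores[c] += cosine_sim(d_q, d)        # similarity-weighted vote for class c
        for c, b_c in (bias or {}).items():        # optional per-class offsets b_c
            scores[c] += b_c
        return max(scores, key=scores.get)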

Properties of kNN
Advantages:
- Easy availability and reuse of an inverted index
- Collection updates are trivial
- Accuracy comparable to the best known classifiers
Problems:
- Classification efficiency: many inverted index lookups, scoring all candidate documents that overlap with d_q in at least one word, sorting by overall similarity, and picking the best k documents
- Space overhead and redundancy: data is stored at the level of individual documents, with no distillation

Improvements for kNN
To reduce space requirements and speed up classification:
- Find clusters in the data
- Store only a few statistical parameters per cluster
- Compare with documents in only the most promising clusters
However...
- Ad-hoc choices for the number and size of clusters and parameters
- k is corpus-sensitive

Probabilistic Document Classifiers
Assumptions:
1. A document can belong to exactly one class
2. Each class c has an associated prior probability P(c)
3. There is a class-conditional document distribution P(d|c) for each class
Given a document d, the probability of it having been generated by class c is:

P(c \mid d) = \frac{P(d \mid c)\, P(c)}{\sum_{\gamma} P(d \mid \gamma)\, P(\gamma)}

The class with the highest probability is assigned to d_q.

Learning the Document Distribution
- P(d|c) is estimated based on parameters Θ
- Θ is estimated based on two factors:
  1. Prior knowledge, before seeing any documents
  2. Terms in the training documents
Bayes optimal classifier:

P(c \mid d) = \sum_{\Theta} \frac{P(d \mid c, \Theta)\, P(c \mid \Theta)}{\sum_{\gamma} P(d \mid \gamma, \Theta)\, P(\gamma \mid \Theta)} \, P(\Theta \mid D)

- This is infeasible to compute
Maximum likelihood estimate:
- Replace the sum over Θ with the single value P(d \mid c, \Theta^{*}) for \Theta^{*} = \arg\max_{\Theta} P(\Theta \mid D)

Naïve Bayes Classifier
- Naïve assumption: independence between terms, so the joint term distribution is the product of the marginals
- Widely used owing to simplicity and speed of training, applying, and updating
- Two kinds of widely used marginals for text: the binary model and the multinomial model

Naïve Bayes Models
Binary model:
- Each parameter φ_{c,t} indicates the probability that a document in class c will mention term t at least once

P(d \mid c) = \prod_{t \in d} \phi_{c,t} \prod_{t \in W,\, t \notin d} (1 - \phi_{c,t})

Multinomial model:
- Each class has an associated die with |W| faces
- Each parameter θ_{c,t} denotes the probability of face t turning up on tossing the die
- Term t occurs n(d,t) times in document d; document length is a random variable denoted L

P(d \mid c) = P(L = l_d \mid c)\, P(d \mid l_d, c) = P(L = l_d \mid c) \binom{l_d}{\{n(d,t)\}} \prod_{t \in d} \theta_{c,t}^{\,n(d,t)}
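A minimal sketch of how the multinomial model is applied in practice, working in log space. It assumes the class priors and (smoothed) term probabilities θ_{c,t} have already been estimated; the names log_prior and log_theta are placeholders, not from the slides, and the length term and multinomial coefficient are dropped since they do not depend on the class.

    import math
    from collections import Counter

    def multinomial_nb_classify(doc_tokens, log_prior, log_theta):
        """log_prior[c] = log P(c); log_theta[c][t] = log theta_{c,t} (smoothed)."""
        counts = Counter(doc_tokens)                      # n(d, t) for each term t in d
        scores = {}
        for c in log_prior:
            scores[c] = log_prior[c] + sum(
                n * log_theta[c].get(t, math.log(1e-9))   # crude fallback for unseen terms; see smoothing below
                for t, n in counts.items())
        return max(scores, key=scores.get)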

Parameter Smoothing
- What if a test document d_q contains a term t that never occurred in any training document of class c? Then P(c \mid d_q) = 0, even if many other terms clearly hint at a high likelihood of class c having generated the document
- Thus, the MLE cannot be used directly
- We assume a prior distribution π(θ) on θ, e.g. the uniform distribution
- The posterior distribution is

\pi(\theta \mid \langle k, n \rangle) = \frac{P(\langle k, n \rangle \mid \theta)\, \pi(\theta)}{\int_0^1 P(\langle k, n \rangle \mid p)\, \pi(p)\, dp}

Laplace's Law of Succession
- The estimate \hat{\theta} is usually a property of the posterior distribution \pi(\theta \mid \langle k, n \rangle)
- We define a loss function: the penalty for picking a smoothed value \hat{\theta} against the true value \theta
- For the squared loss L(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2, the appropriate estimate is the posterior expectation E[\theta \mid \langle k, n \rangle]
- This yields \hat{\theta} = \frac{k + 1}{n + 2}
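A worked illustration of the smoothed estimate above, reading k as the number of observed occurrences out of n observations: the Laplace estimate (k + 1)/(n + 2) is never exactly zero. The function name and the toy counts are just for illustration.

    def laplace_estimate(k, n):
        """Posterior-mean estimate (k + 1) / (n + 2) under a uniform prior."""
        return (k + 1) / (n + 2)

    # A term never observed for class c still gets a small non-zero probability,
    # so a single unseen term no longer forces P(c | d_q) to zero.
    print(laplace_estimate(0, 100))   # 1/102, instead of the MLE value 0.0
    print(laplace_estimate(5, 100))   # 6/102, instead of the MLE value 0.05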

Performance Analysis
- The multinomial Naïve Bayes classifier generally outperforms the binary variant
- kNN may outperform Naïve Bayes, but Naïve Bayes is faster and more compact
- Naïve Bayes determines decision boundaries: regions of the term space where different classes have similar probabilities; documents in these regions are hard to classify
- Naïve Bayes is strongly biased

Discriminative Classification
- Naïve Bayes classifiers induce linear decision boundaries between classes in the feature space
- Discriminative classifiers directly map the feature space to class labels
- Class labels are encoded as numbers, e.g., +1 and -1 for a two-class problem
- For instance, we can try to find a vector α such that the sign of α · d + b directly predicts the class of d
- Two possible solutions: linear least-squares regression and support vector machines

Support Vector Machines
- Assumption: the training and test populations are drawn from the same distribution
- Hypothesis: the classes can be separated by a hyperplane
- A hyperplane that is close to many training data points has a greater chance of misclassifying test instances
- A hyperplane that passes through a no-man's-land has lower chances of misclassification
- Make a decision by thresholding: seek a hyperplane that maximizes the distance to any training point, and choose the class on the same side of the hyperplane as the test document

Discovering the Hyperplane
- Assume the training documents are separable by a hyperplane perpendicular to a vector α
- Seek the α which maximizes the distance from the hyperplane to the nearest training point
- This corresponds to solving the quadratic programming problem:

Minimize \frac{1}{2} \alpha \cdot \alpha subject to c_i (\alpha \cdot d_i + b) \geq 1, \; i = 1, \ldots, n

SVM Classifier

Non-Separable Classes
- The classes in the training data are not always separable, so we introduce slack variables ξ_i:

Minimize \frac{1}{2} \alpha \cdot \alpha + C \sum_i \xi_i subject to c_i (\alpha \cdot d_i + b) \geq 1 - \xi_i and \xi_i \geq 0, \; i = 1, \ldots, n

- Implementations solve the equivalent dual:

Maximize \sum_i \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j c_i c_j (d_i \cdot d_j) subject to \sum_i c_i \lambda_i = 0 and 0 \leq \lambda_i \leq C, \; i = 1, \ldots, n
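The slides describe the constrained QP and its dual. A rough, self-contained sketch can instead minimize the equivalent unconstrained form of the primal, 1/2 ||α||² + C Σ_i max(0, 1 - c_i(α · d_i + b)), by batch sub-gradient descent; the function name, learning rate, and epoch count below are illustrative choices, not from the slides.

    import numpy as np

    def train_linear_svm(D, c, C=1.0, lr=0.01, epochs=500):
        """Batch sub-gradient descent on the unconstrained primal objective
        1/2 ||alpha||^2 + C * sum_i max(0, 1 - c_i * (alpha . d_i + b)).
        D: (n, m) matrix of document vectors, c: array of +1/-1 labels."""
        n, m = D.shape
        alpha, b = np.zeros(m), 0.0
        for _ in range(epochs):
            margins = c * (D @ alpha + b)                 # c_i (alpha . d_i + b)
            viol = margins < 1                            # documents violating the margin
            # Sub-gradient: regularizer term plus hinge terms of the violators.
            g_alpha = alpha - C * (c[viol, None] * D[viol]).sum(axis=0)
            g_b = -C * c[viol].sum()
            alpha -= lr * g_alpha
            b -= lr * g_b
        return alpha, b

    def predict(alpha, b, d):
        """Choose the class by the side of the hyperplane the document falls on."""
        return 1 if d @ alpha + b >= 0 else -1

Practical packages solve the dual instead; there the documents appear only through the inner products d_i · d_j, which is what makes kernel functions possible (see the following slide).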

Analysis of SVMs
Complexity:
- Quadratic optimization problem, requiring on-demand computation of inner products
- Recent SVM packages work in linear time through clever selection of working sets (sets of λ_i)
Performance:
- Amongst the most accurate classifiers for text, with better accuracy than Naïve Bayes and most other classifiers
- Linear SVMs suffice: standard text classification tasks have classes that are almost separable using a hyperplane in feature space
- Non-linear SVMs can be obtained through kernel functions

Outline 1 2 3: Single-Class, Multiple-Class, Other Measures

Measures of Accuracy
Two cases:
1. Each document is associated with exactly one class, or
2. Each document is associated with a subset of classes

Single-Class Scenario
- For the first case, we can use a confusion matrix M, where M[i,j] is the number of test documents belonging to class i which were assigned to class j
- For a perfect classifier, only the diagonal elements M[i,i] would be nonzero
- Example:

M = \begin{pmatrix} 5 & 0 & 0 \\ 1 & 3 & 0 \\ 1 & 2 & 4 \end{pmatrix}

- If M is large, we summarize it as accuracy = \sum_i M[i,i] \, / \, \sum_{i,j} M[i,j]
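A worked check of the accuracy formula on the example matrix above (using numpy purely for convenience):

    import numpy as np

    M = np.array([[5, 0, 0],
                  [1, 3, 0],
                  [1, 2, 4]])            # confusion matrix from the slide

    accuracy = np.trace(M) / M.sum()     # correct (diagonal) over all test documents
    print(accuracy)                      # (5 + 3 + 4) / 16 = 0.75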

Multiple-Class Scenario
- One-vs.-rest: create a two-class problem for every class, e.g., sports vs. not-sports, science vs. not-science, etc., with a classifier for each case
- Accuracy is measured by recall and precision
- Let C_d be the correct classes for document d and C'_d be the set of classes estimated by the classifier:

precision = \frac{|C_d \cap C'_d|}{|C'_d|} \qquad recall = \frac{|C_d \cap C'_d|}{|C_d|}
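The per-document definitions translate directly to Python sets; the class labels in the example are made up for illustration.

    def per_document_pr(correct, predicted):
        """correct, predicted: the sets of class labels C_d and C'_d for one document."""
        overlap = len(correct & predicted)
        precision = overlap / len(predicted) if predicted else 0.0
        recall = overlap / len(correct) if correct else 0.0
        return precision, recall

    # Two of the three predicted classes are correct, and no correct class is missed.
    print(per_document_pr({"sports", "politics"}, {"sports", "politics", "science"}))
    # -> (0.666..., 1.0)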

Micro-Averaged Precision
- In a problem with n classes, let C_i be the set of documents in class i and let C'_i be the set of documents estimated to be of class i by the classifier
- Micro-averaged precision is defined as \frac{\sum_{i=1}^{n} |C_i \cap C'_i|}{\sum_{i=1}^{n} |C'_i|}
- Micro-averaged recall is defined as \frac{\sum_{i=1}^{n} |C_i \cap C'_i|}{\sum_{i=1}^{n} |C_i|}
- Micro-averaged precision/recall measures correctly classified documents overall, thus favoring large classes
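A small sketch of micro-averaging, assuming the per-class memberships are available as Python sets; C maps each class to its true member documents and C_hat to the documents the classifier assigned to it (both names are placeholders, and non-empty collections are assumed).

    def micro_averaged_pr(C, C_hat):
        """C[i]: set of documents truly in class i; C_hat[i]: set assigned to class i."""
        overlap = sum(len(C[i] & C_hat.get(i, set())) for i in C)
        precision = overlap / sum(len(s) for s in C_hat.values())
        recall = overlap / sum(len(s) for s in C.values())
        return precision, recall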

Macro-Averaged Precision
- In a problem with n classes, let P_i and R_i be the precision and recall, respectively, achieved by a classifier for class i
- Macro-averaged precision is defined as \frac{1}{n} \sum_{i=1}^{n} P_i
- Macro-averaged recall is defined as \frac{1}{n} \sum_{i=1}^{n} R_i
- Macro-averaged precision/recall measures performance per class, giving all classes equal importance
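The corresponding sketch for macro-averaging, reusing the same hypothetical C and C_hat structures: per-class precision P_i and recall R_i are computed first, then averaged with equal weight.

    def macro_averaged_pr(C, C_hat):
        """Average the per-class precision (P_i) and recall (R_i) over all classes."""
        precisions, recalls = [], []
        for i in C:
            predicted = C_hat.get(i, set())
            overlap = len(C[i] & predicted)
            precisions.append(overlap / len(predicted) if predicted else 0.0)
            recalls.append(overlap / len(C[i]) if C[i] else 0.0)
        return sum(precisions) / len(C), sum(recalls) / len(C)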

Other Measures
- Precision-recall graphs: show the tradeoff between precision and recall
- Break-even point: the point in the precision/recall graph where precision equals recall
- The F_1 measure: F_1 = \frac{2 P_i R_i}{P_i + R_i}, the harmonic mean of precision and recall, which discourages classifiers that trade one for the other
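A quick numerical check of the F_1 behaviour described above; the precision and recall values are made up for illustration.

    def f1(precision, recall):
        """Harmonic mean of precision and recall."""
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    print(f1(0.9, 0.9))    # 0.9
    print(f1(0.99, 0.5))   # ~0.66, well below the arithmetic mean of 0.745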

Questions?