Classification. Departamento de Engenharia Informática, Instituto Superior Técnico, 1st Semester 2007/2008. Slides based on the official slides of the book Mining the Web © Soumen Chakrabarti.
Outline
1. Single-Class
2. Multiple-Class
3. Other Measures
Organizing Knowledge Systematic knowledge structures Ontologies Dewey decimal system, ACM Computing Classification System Patent subject classification Web catalogs Yahoo, Dmoz Problem: Manual maintenance
Supervised Learning Learning to assign objects to classes given examples Learn a classifier
Classification vs. Data Mining Lots of features and a lot of noise No fixed number of columns No categorical attribute values Data scarcity Larger number of class labels Hierarchical relationships between classes less systematic
Classifiers
Nearest-neighbor classifier: classifies documents according to the class distribution of its neighbors
Bayesian classifier: discovers the class distribution most likely to have generated a test document
Support vector machines: discover a hyperplane that separates classes
Rule induction: induces rules for classification over diverse features
Other Issues Tokenization and feature extraction E.g.: replacing monetary amounts by a special token, part-of-speech tagging, etc. Evaluating text classifier Accuracy Training speed and scalability Simplicity, speed, and scalability for document modifications Ease of diagnosis, interpretation of results, and adding human judgment and feedback
Similar documents are expected to be assigned the same class label
Similarity: vector space model + cosine similarity
Training: index each document and remember its class label
Testing: fetch the k most similar documents to the given document; the majority class wins
Alternatives:
Weighted counts: counts of classes weighted by the corresponding similarity measure
Per-class offset: tuned by testing the classifier on a portion of the training data held out for this purpose
kNN Classifier

$$\mathrm{score}(c, d_q) = b_c + \sum_{d \in kNN(d_q),\; d \in c} \mathrm{sim}(d_q, d)$$
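The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: it uses cosine similarity over raw term-frequency vectors, sets the per-class offset b_c to zero, and the toy corpus, labels, and choice k = 3 are invented for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two term-frequency Counters
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(train, query, k=3):
    """train: list of (term-frequency Counter, class label) pairs."""
    q = Counter(query.split())
    # fetch the k most similar training documents
    neighbours = sorted(train, key=lambda dc: cosine(q, dc[0]), reverse=True)[:k]
    # score(c, d_q) = sum of similarities of neighbours belonging to c (b_c = 0)
    score = Counter()
    for vec, label in neighbours:
        score[label] += cosine(q, vec)
    return score.most_common(1)[0][0]

# hypothetical toy corpus for illustration
train = [(Counter("goal match team".split()), "sports"),
         (Counter("goal striker win".split()), "sports"),
         (Counter("atom energy lab".split()), "science"),
         (Counter("quantum energy theory".split()), "science")]
print(knn_classify(train, "team win goal"))  # -> sports
```

In practice the neighbour search would go through the inverted index rather than scanning the whole collection, which is exactly the efficiency concern raised on the next slide.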
Properties of kNN
Advantages:
- Easy availability and reuse of the inverted index
- Collection updates are trivial
- Accuracy comparable to the best known classifiers
Problems:
- Classification efficiency: many inverted index lookups, scoring all candidate documents that overlap with d_q in at least one word, sorting by overall similarity, and picking the best k documents
- Space overhead and redundancy: data is stored at the level of individual documents, with no distillation
Improvements for knn To reduce space requirements and speed up classification Find clusters in the data Store only a few statistical parameters per cluster Compare with documents in only the most promising clusters However... Ad-hoc choices for number and size of clusters and parameters k is corpus sensitive
Probabilistic Document Classifier
Assumptions:
1. A document can belong to exactly one class
2. Each class c has an associated prior probability P(c)
3. There is a class-conditional document distribution P(d|c) for each class
Given a document d, the probability of it being generated by class c is:

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{\sum_\gamma P(d \mid \gamma)\,P(\gamma)}$$

The class with the highest probability is assigned to d_q
Learning the Document Distribution
P(d|c) is estimated based on parameters Θ
Θ is estimated based on two factors:
1. Prior knowledge before seeing any documents
2. Terms in the training documents
Bayes Optimal Classifier:

$$P(c \mid d) = \sum_\Theta \frac{P(d \mid c, \Theta)\,P(c \mid \Theta)}{\sum_\gamma P(d \mid \gamma, \Theta)\,P(\gamma \mid \Theta)}\, P(\Theta \mid D)$$

This is infeasible to compute
Maximum Likelihood Estimate: replace the sum with the single term at $\arg\max_\Theta P(\Theta \mid D)$
Naïve Bayes Classifier Naïve assumption assumption of independence between terms joint term distribution is the product of the marginals Widely used owing to simplicity and speed of training, applying, and updating Two kinds of widely used marginals for text Binary model Multinomial model
Naïve Bayes Models
Binary model: each parameter φ_{c,t} indicates the probability that a document in class c will mention term t at least once

$$P(d \mid c) = \prod_{t \in d} \phi_{c,t} \prod_{t \in W,\, t \notin d} (1 - \phi_{c,t})$$

Multinomial model: each class has an associated die with |W| faces; each parameter θ_{c,t} denotes the probability of face t turning up on tossing the die; term t occurs n(d,t) times in document d; document length is a random variable denoted L

$$P(d \mid c) = P(L = l_d \mid c)\, P(d \mid l_d, c) = P(L = l_d \mid c) \binom{l_d}{\{n(d,t)\}} \prod_{t \in d} \theta_{c,t}^{\,n(d,t)}$$
Parameter Smoothing
What if a test document d_q contains a term t that never occurred in any training document of class c? Then P(c|d_q) = 0, even if many other terms clearly hint at a high likelihood of class c generating the document. Thus, MLE cannot be used directly.
We assume a prior distribution π(θ) on θ, e.g., the uniform distribution. The posterior distribution is

$$\pi(\theta \mid k, n) = \frac{P(k, n \mid \theta)\,\pi(\theta)}{\int_0^1 P(k, n \mid p)\,\pi(p)\,dp}$$
Laplace's Law of Succession
The estimate θ̂ is usually a property of the posterior distribution π(θ | k, n). We define a loss function: the penalty for picking a smoothed value θ̂ against the true value θ. For the loss function

$$L(\hat\theta, \theta) = (\hat\theta - \theta)^2$$

the appropriate estimate to use is the expectation $E[\theta \mid k, n]$. This yields

$$\hat\theta = \frac{k + 1}{n + 2}$$
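A minimal sketch of a multinomial naïve Bayes classifier with this kind of smoothing. Note the generalization: Laplace's (k+1)/(n+2) applies to a two-faced die; for a die with |W| faces the analogous smoothed estimate is (k+1)/(n+|W|), which is what the code uses. The toy corpus and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns per-class priors and smoothed term log-probs."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for t in text.split():
            term_counts[label][t] += 1
            vocab.add(t)
    n_docs = sum(class_docs.values())
    model = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        # smoothed estimate (k + 1) / (n + |W|) per term
        logprob = {t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
                   for t in vocab}
        unseen = math.log(1 / (total + len(vocab)))  # mass for unseen terms
        model[c] = (math.log(class_docs[c] / n_docs), logprob, unseen)
    return model

def classify(model, text):
    best, best_score = None, -math.inf
    for c, (logprior, logprob, unseen) in model.items():
        s = logprior + sum(logprob.get(t, unseen) for t in text.split())
        if s > best_score:
            best, best_score = c, s
    return best

docs = [("ball goal referee", "sports"), ("goal team match", "sports"),
        ("electron proton lab", "science"), ("theory electron experiment", "science")]
model = train_nb(docs)
print(classify(model, "goal match"))  # -> sports
```

Because of the smoothing, a query term absent from a class's training documents reduces that class's score but no longer forces its probability to zero.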
Performance Analysis
The multinomial naïve Bayes classifier generally outperforms the binary variant
kNN may outperform naïve Bayes, but naïve Bayes is faster and more compact
Naïve Bayes determines decision boundaries: regions of the term space where different classes have similar probabilities; documents in these regions are hard to classify
Naïve Bayes is strongly biased
Discriminative Classification
Naïve Bayes classifiers induce linear decision boundaries between classes in the feature space
Discriminative classifiers directly map the feature space to class labels
Class labels are encoded as numbers, e.g., +1 and −1 for a two-class problem
For instance, we can try to find a vector α such that the sign of α·d + b directly predicts the class of d
Two possible solutions: linear least-squares regression and support vector machines
Assumption: the training and test populations are drawn from the same distribution
Hypothesis: the classes can be separated by a hyperplane
A hyperplane that is close to many training data points has a greater chance of misclassifying test instances; a hyperplane that passes through a no-man's land has lower chances of misclassification
Therefore, seek a hyperplane that maximizes the distance to any training point
Make a decision by thresholding: choose the class on the same side of the hyperplane as the test document
Discovering the Hyperplane
Assume the training documents are separable by a hyperplane perpendicular to a vector α. Seek the α that maximizes the distance of any training point to the hyperplane. This corresponds to solving the quadratic programming problem:

$$\min \; \tfrac{1}{2}\,\alpha \cdot \alpha \quad \text{subject to} \quad c_i(\alpha \cdot d_i + b) \ge 1, \quad i = 1, \dots, n$$
SVM Classifier
Non-Separable Classes
Classes in the training data are not always separable, so we introduce slack variables ξ_i:

$$\min \; \tfrac{1}{2}\,\alpha \cdot \alpha + C \sum_i \xi_i \quad \text{subject to} \quad c_i(\alpha \cdot d_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \quad i = 1, \dots, n$$

Implementations solve the equivalent dual:

$$\max \; \sum_i \lambda_i - \tfrac{1}{2} \sum_{i,j} \lambda_i \lambda_j c_i c_j (d_i \cdot d_j) \quad \text{subject to} \quad \sum_i c_i \lambda_i = 0, \;\; 0 \le \lambda_i \le C, \quad i = 1, \dots, n$$
Analysis of SVMs
Complexity: a quadratic optimization problem requiring on-demand computation of inner products; recent SVM packages work in linear time by clever selection of working sets (sets of λ_i)
Performance: amongst the most accurate classifiers for text, with better accuracy than naïve Bayes and most other classifiers
Linear SVMs suffice: standard text classification tasks have classes that are almost separable by a hyperplane in feature space
Non-linear SVMs can be obtained through kernel functions
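To make the primal problem concrete, here is a hedged sketch that trains a linear SVM not by solving the dual QP (as the packages discussed above do), but by plain subgradient descent on the equivalent regularized hinge-loss objective ½‖α‖² + C·Σᵢ max(0, 1 − cᵢ(α·dᵢ + b)). The 2-D toy data, C, learning rate, and epoch count are all invented for illustration.

```python
import random

def train_linear_svm(points, labels, C=10.0, lr=0.01, epochs=200):
    """Subgradient descent on the regularized hinge loss; labels are +1/-1."""
    dim = len(points[0])
    a = [0.0] * dim
    b = 0.0
    rng = random.Random(0)  # fixed seed for reproducibility
    idx = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            margin = labels[i] * (sum(a[j] * points[i][j] for j in range(dim)) + b)
            # regularizer gradient, split evenly across examples
            grad_a = [a[j] / len(points) for j in range(dim)]
            grad_b = 0.0
            if margin < 1:  # hinge is active: example violates the margin
                grad_a = [grad_a[j] - C * labels[i] * points[i][j] for j in range(dim)]
                grad_b = -C * labels[i]
            a = [a[j] - lr * grad_a[j] for j in range(dim)]
            b -= lr * grad_b
    return a, b

def predict(a, b, x):
    # decision by thresholding: which side of the hyperplane is x on?
    return 1 if sum(ai * xi for ai, xi in zip(a, x)) + b >= 0 else -1

# two clearly separable clusters
points = [(2, 2), (3, 2), (2.5, 3), (-2, -2), (-3, -2), (-2.5, -3)]
labels = [1, 1, 1, -1, -1, -1]
a, b = train_linear_svm(points, labels)
print(predict(a, b, (2, 3)), predict(a, b, (-2, -3)))
```

This trades the exactness of the QP solution for simplicity; real text classifiers would use a dual or working-set solver over sparse term vectors.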
Outline
1. Single-Class
2. Multiple-Class
3. Other Measures
Measures of Accuracy
Two cases:
1. Each document is associated with exactly one class, or
2. each document is associated with a subset of classes
Single-Class Scenario
For the first case, we can use a confusion matrix M, where M[i,j] is the number of test documents belonging to class i which were assigned to class j. For a perfect classifier, only the diagonal elements M[i,i] would be nonzero. Example:

$$M = \begin{pmatrix} 5 & 0 & 0 \\ 1 & 3 & 0 \\ 1 & 2 & 4 \end{pmatrix}$$

When M is large, it is summarized by the overall accuracy:

$$\mathrm{accuracy} = \frac{\sum_i M[i,i]}{\sum_{i,j} M[i,j]}$$
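Computing the accuracy for the 3×3 example matrix above is a one-liner:

```python
# confusion matrix from the slide: rows = true class, columns = assigned class
M = [[5, 0, 0],
     [1, 3, 0],
     [1, 2, 4]]
# accuracy = sum of the diagonal over the sum of all entries
accuracy = sum(M[i][i] for i in range(len(M))) / sum(sum(row) for row in M)
print(round(accuracy, 3))  # 12/16 = 0.75
```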
Multiple-Class Scenario
One-vs.-rest: create a two-class problem for every class, e.g., sports vs. not-sports, science vs. not-science, etc., and train a classifier for each case. Accuracy is measured by recall and precision. Let C_d be the set of correct classes for document d and Ĉ_d be the set of classes estimated by the classifier:

$$\mathrm{precision} = \frac{|C_d \cap \hat{C}_d|}{|\hat{C}_d|} \qquad \mathrm{recall} = \frac{|C_d \cap \hat{C}_d|}{|C_d|}$$
Micro-Averaged Precision
In a problem with n classes, let C_i be the set of documents in class i and Ĉ_i the set of documents assigned to class i by the classifier. Micro-averaged precision is defined as

$$\frac{\sum_{i=1}^n |C_i \cap \hat{C}_i|}{\sum_{i=1}^n |\hat{C}_i|}$$

and micro-averaged recall as

$$\frac{\sum_{i=1}^n |C_i \cap \hat{C}_i|}{\sum_{i=1}^n |C_i|}$$

Micro-averaging counts correctly classified documents, thus favoring large classes
Macro-Averaged Precision
In a problem with n classes, let P_i and R_i be the precision and recall, respectively, achieved by the classifier for class i. Macro-averaged precision is defined as $\frac{1}{n}\sum_{i=1}^n P_i$ and macro-averaged recall as $\frac{1}{n}\sum_{i=1}^n R_i$. Macro-averaging measures performance per class, giving all classes equal importance
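The difference between the two averages is easy to see on a small example. Here C[i] is the set of documents truly in class i and C_hat[i] the set the classifier assigned to class i; the two-class data is invented for illustration, with one large and one small class.

```python
def micro_precision(C, C_hat):
    # pool correct assignments and total assignments across all classes
    hits = sum(len(C[i] & C_hat[i]) for i in C)
    assigned = sum(len(C_hat[i]) for i in C)
    return hits / assigned

def macro_precision(C, C_hat):
    # average the per-class precisions, each class weighted equally
    return sum(len(C[i] & C_hat[i]) / len(C_hat[i]) for i in C) / len(C)

C     = {"sports": {1, 2, 3, 4}, "science": {5}}
C_hat = {"sports": {1, 2, 3, 5}, "science": {4, 5}}
print(micro_precision(C, C_hat))  # 4/6 ≈ 0.667, dominated by the large class
print(macro_precision(C, C_hat))  # (0.75 + 0.5)/2 = 0.625
```

The micro average is pulled toward the large sports class, while the macro average gives the small science class equal weight, which is exactly the contrast the two slides describe.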
Other Measures
Precision-recall graphs: show the tradeoff between precision and recall
Break-even point: the point in the precision/recall graph where precision equals recall
The F_1 measure:

$$F_1 = \frac{2 P_i R_i}{P_i + R_i}$$

the harmonic mean of precision and recall, which discourages classifiers that trade one for the other
Questions?