Text Categorization (I)


CS473: Text Categorization (I). Luo Si, Department of Computer Science, Purdue University

Text Categorization (I): Outline
- Introduction to the task of text categorization
- Manual vs. automatic text categorization
- Text categorization applications
- Evaluation of text categorization
- K nearest neighbor (KNN) text categorization method

Text Categorization Tasks
Assign predefined categories to text documents/objects.
Motivation:
- Provide an organizational view of the data.
- Manual text categorization is costly: millions of dollars are spent on it in companies, governments, public libraries, and hospitals.
- Manual categorization is almost impossible for some large-scale applications (e.g., classification of Web pages).

Automatic Text Categorization
Learn an algorithm that automatically (or semi-automatically) assigns predefined categories to text documents/objects.
Procedure:
- Training: given a set of categories and labeled example documents, learn a method that maps a document to the correct category (or categories).
- Testing: predict the category (or categories) of a new document.
Automatic or semi-automatic categorization can significantly reduce the manual effort.

Text Categorization: Examples
[Example slides showing category taxonomies, including Medical Subject Headings as categories.]

Example: U.S. Census in 1990
- Included 22 million responses.
- Needed to be classified into industry categories (200+) and occupation categories (500+).
- Would have cost $15 million if conducted by hand.
Two alternative automatic text categorization methods were evaluated:
- Knowledge engineering (expert system)
- Machine learning (K nearest neighbor method)

Example: U.S. Census in 1990
A knowledge-engineering approach: an expert system designed by domain experts.
- Hand-coded rules (e.g., if "Professor" and "Lecturer" -> "Education")
- Development cost: 2 experts, 8 years (192 person-months)
- Accuracy = 47%
A machine-learning approach: K nearest neighbor (KNN) classification (details later; "find your language by what language your neighbors speak").
- Fully automatic
- Development cost: 4 person-months
- Accuracy = 60%

Many Applications!
- Web page classification (Yahoo-like category taxonomies)
- News article classification (more formal than most Web pages)
- Automatic email sorting (spam detection; sorting into different folders)
- Word sense disambiguation (Java programming vs. Java in Indonesia)
- Gene function classification (find the functions of a gene from the articles that discuss it)
- What are your favorite applications? ...

Techniques Explored in Text Categorization
- Rule-based expert systems (Hayes, 1990)
- Nearest neighbor methods (Creecy 92; Yang 94)
- Decision trees and symbolic rule induction (Apte 94)
- Naïve Bayes (language model) (Lewis 94; McCallum 98)
- Regression methods (Fuhr 92; Yang 92)
- Support vector machines (Joachims 98, 05; Hofmann 03)
- Boosting and bagging (Schapire 98)
- Neural networks (Wiener 95)

Text Categorization: Evaluation
Performance of different algorithms on the Reuters-21578 corpus (90 categories; 7,769 training docs; 3,019 test docs; Yang, JIR 1999).

Text Categorization: Evaluation
Contingency table per category (over all test docs):

                        Truth: True   Truth: False   Total
    Predicted Positive       a             b          a+b
    Predicted Negative       c             d          c+d
    Total                   a+c           b+d     n = a+b+c+d

a: number of true-positive docs; b: number of false-positive docs; c: number of false-negative docs; d: number of true-negative docs; n: total number of test documents.

Text Categorization: Evaluation
- Sensitivity: a/(a+c), the true-positive rate (the larger the better).
- Specificity: d/(b+d), the true-negative rate (the larger the better).
Both depend on the decision threshold; there is a trade-off between the two values.

Text Categorization: Evaluation
- Recall: $r = a/(a+c)$, the percentage of truly positive docs that are detected.
- Precision: $p = a/(a+b)$, how accurate the predicted-positive docs are.
- F-measure, the weighted harmonic average of precision and recall:
$$F_\beta = \frac{(\beta^2 + 1)\,p\,r}{\beta^2 p + r}, \qquad F_1 = \frac{2pr}{p + r}.$$
- Accuracy: $(a+d)/n$, how accurate all predictions are.
- Error: $(b+c)/n$, the error rate of the predictions. Note that Accuracy + Error = 1.
A short numeric sketch of these measures follows.
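As a concrete illustration, here is a minimal Python sketch that computes these measures from the contingency counts a (TP), b (FP), c (FN), d (TN) of one category; the counts used are hypothetical.

```python
def recall(a, c):
    # fraction of truly positive docs that were detected
    return a / (a + c)

def precision(a, b):
    # fraction of predicted-positive docs that are truly positive
    return a / (a + b)

def f_measure(p, r, beta=1.0):
    # F_beta = (beta^2 + 1) p r / (beta^2 p + r); beta = 1 gives 2pr/(p + r)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

a, b, c, d = 40, 10, 20, 30        # hypothetical counts for one category
n = a + b + c + d
p, r = precision(a, b), recall(a, c)
print(p, r, f_measure(p, r))       # 0.8, 0.666..., 0.727...
print((a + d) / n, (b + c) / n)    # accuracy 0.7 and error 0.3 sum to 1
```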

Text Categorization: Evaluation
- Micro F1-measure: pool a single contingency table over all categories and compute the F1 measure once. Treats each prediction with equal weight; favors algorithms that work well on large categories.
- Macro F1-measure: compute a contingency table for every category, compute the F1 measure separately for each, and average the values. Treats each category with equal weight; favors algorithms that work well on many small categories.
A minimal sketch contrasting the two is shown below.
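The sketch below contrasts the two averages on hypothetical per-category (TP, FP, FN) counts: one large category and two small ones.

```python
def f1(a, b, c):
    # F1 from TP=a, FP=b, FN=c; algebraically equals 2pr/(p + r)
    return 2 * a / (2 * a + b + c) if a else 0.0

# hypothetical (TP, FP, FN) counts per category: one large, two small
tables = [(900, 50, 50), (5, 10, 15), (2, 8, 6)]

# micro: pool all counts into a single table, then compute F1 once
A, B, C = (sum(t[i] for t in tables) for i in range(3))
print("micro F1:", f1(A, B, C))                      # ~0.93, dominated by the large category

# macro: compute F1 per category, then average with equal weight
print("macro F1:", sum(f1(*t) for t in tables) / 3)  # ~0.49, pulled down by the small ones
```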

K-Nearest Neighbor Classifier
- Also called instance-based learning or lazy learning: little or no cost in training, high cost in online prediction.
- Commonly used in pattern recognition for five decades.
- Theoretical error bound analyzed by Duda & Hart (2000).
- Applied to text categorization in the 1990s.
- Among the top-performing text categorization methods.

K-Nearest Neighbor Classifier
- Keep all training examples.
- Find the k examples that are most similar to the new document (the "neighbor" documents).
- Assign the category that is most common among these neighbor documents (the neighbors vote for the category), as in the sketch below.
- Can be improved by weighting each vote by the neighbor's distance (a closer neighbor has more weight/influence).
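A minimal sketch of this voting rule, assuming documents are term-weight dicts and `sim` is any document similarity (e.g., the cosine similarity defined later); the function names are illustrative, not from the slides.

```python
import heapq
from collections import Counter

def knn_vote(x, train, sim, k=5):
    """train: list of (document, category) pairs; returns the majority category."""
    neighbors = heapq.nlargest(k, train, key=lambda ex: sim(x, ex[0]))
    votes = Counter(cat for _, cat in neighbors)
    return votes.most_common(1)[0][0]
```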

K-Nearest Neighbor Classifier
Idea: find your language by what language your neighbors speak.
[Figure: a query point among red and blue points, with neighborhoods for k = 1, 5, 10.]
Use the K nearest neighbors to vote: 1-NN: Red; 5-NN: Blue; 10-NN: ?; distance-weighted 10-NN: Blue.

K Nearest Neighbor: Technical Elements
- Document representation
- Document distance measure: closer documents should have similar labels ("neighbors speak the same language")
- Number of nearest neighbors (value of K)
- Decision threshold

K Nearest Neighbor: Framework
- Training data: $D = \{(x_i, y_i)\}$, with document vectors $x_i \in \mathbb{R}^V$ and labels $y_i \in \{0, 1\}$; test document $x \in \mathbb{R}^V$.
- Scoring function, where the neighborhood $D_k(x)$ contains the $k$ training documents most similar to $x$:
$$\hat{y}(x) = \frac{1}{k} \sum_{x_i \in D_k(x)} \mathrm{sim}(x, x_i)\, y_i$$
- Classification: assign label 1 if $\hat{y}(x) \ge t_0$, otherwise 0.
- Document representation: tf.idf weighting for each dimension.
A minimal sketch of this framework follows.
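A minimal sketch of the framework above, assuming documents are sparse tf.idf vectors stored as dicts; the threshold value `t0` and all function names are illustrative choices, not from the slides.

```python
import heapq
import math

def cosine(x1, x2):
    # cosine similarity between two sparse tf.idf vectors (dicts)
    dot = sum(w * x2.get(t, 0.0) for t, w in x1.items())
    n1 = math.sqrt(sum(w * w for w in x1.values()))
    n2 = math.sqrt(sum(w * w for w in x2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn_score(x, train, k):
    # y_hat(x) = (1/k) * sum of sim(x, x_i) * y_i over the k nearest neighbors
    neighbors = heapq.nlargest(k, train, key=lambda ex: cosine(x, ex[0]))
    return sum(cosine(x, xi) * yi for xi, yi in neighbors) / k

def classify(x, train, k=5, t0=0.2):
    # assign the category (label 1) when the score reaches the threshold
    return 1 if knn_score(x, train, k) >= t0 else 0
```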

Choices of Similarity Functions
- Euclidean distance: $d(x_1, x_2) = \sqrt{\sum_v (x_{1v} - x_{2v})^2}$
- Kullback-Leibler divergence: $d(x_1, x_2) = \sum_v x_{1v} \log \frac{x_{1v}}{x_{2v}}$
- Dot product: $x_1 \cdot x_2 = \sum_v x_{1v}\, x_{2v}$
- Cosine similarity: $\cos(x_1, x_2) = \dfrac{\sum_v x_{1v}\, x_{2v}}{\sqrt{\sum_v x_{1v}^2}\,\sqrt{\sum_v x_{2v}^2}}$
- Kernel functions, e.g., the Gaussian kernel: $k(x_1, x_2) = e^{-d(x_1, x_2)^2 / 2\sigma^2}$
- Automatic learning of the metrics
Sketches of these functions follow.
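For completeness, sketches of the remaining options above for dense vectors (plain Python lists); cosine similarity appears in the previous sketch, and `sigma` is a free parameter of the Gaussian kernel.

```python
import math

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def kl_divergence(x1, x2):
    # assumes x1 and x2 are probability distributions over the same vocabulary
    return sum(a * math.log(a / b) for a, b in zip(x1, x2) if a > 0)

def dot(x1, x2):
    return sum(a * b for a, b in zip(x1, x2))

def gaussian_kernel(x1, x2, sigma=1.0):
    return math.exp(-euclidean(x1, x2) ** 2 / (2 * sigma ** 2))
```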

Choices of the Number of Neighbors (K)
Find the desired number of neighbors by cross-validation:
- Choose a subset of the available data as training data and the rest as validation data.
- Find the number of neighbors that works best on the validation data.
- The procedure can be repeated for different splits; pick the number that is consistently good across splits.
A sketch of this procedure follows.
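A sketch of this procedure under the earlier assumptions; `classify` is the kNN classifier from the framework sketch, and the candidate values of k and the 80/20 split are illustrative.

```python
import random

def choose_k(data, candidates=(1, 3, 5, 10, 30), seed=0):
    # data: list of (document, label) pairs; returns the best k on a held-out split
    rng = random.Random(seed)
    docs = data[:]
    rng.shuffle(docs)
    cut = int(0.8 * len(docs))
    train, valid = docs[:cut], docs[cut:]

    def accuracy(k):
        hits = sum(classify(x, train, k=k) == y for x, y in valid)
        return hits / len(valid)

    # repeat with different seeds to check that the chosen k is consistent
    return max(candidates, key=accuracy)
```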

TC: K-Nearest Neighbor Classifier
Theoretical error bound analyzed by Duda & Hart (2000) and Devroye et al. (1996): as $n \to \infty$ (number of docs), $k \to \infty$ (number of neighbors), and $k/n \to 0$ (ratio of neighbors to total docs), KNN approaches the minimum error.

Characteristics of KNN
Pros:
- Simple and intuitive; based on a local-continuity assumption.
- Widely used; provides a strong baseline in TC evaluation.
- No training needed, so training cost is low.
- Easy to implement; can use standard IR techniques (e.g., tf.idf).
Cons:
- Heuristic approach with no explicit objective function.
- Difficult to determine the number of neighbors.
- High online cost at test time: finding the nearest neighbors has high time complexity.

Text Categorization (I): Outline (recap)
- Introduction to the task of text categorization
- Manual vs. automatic text categorization
- Text categorization applications
- Evaluation of text categorization
- K nearest neighbor text categorization method
  - Lazy learning: no training
  - Local-continuity assumption: find your language by what language your neighbors speak

Bibliography
- Y. Yang. Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. SIGIR, 1994.
- D. D. Lewis. An Evaluation of Phrasal and Clustered Representations on a Text Categorization Task. SIGIR, 1992.
- A. McCallum. A Comparison of Event Models for Naïve Bayes Text Categorization. AAAI Workshop, 1998.
- N. Fuhr, S. Hartmann, et al. AIR/X: A Rule-Based Multistage Indexing System for Large Subject Fields. RIAO, 1991.
- Y. Yang and C. G. Chute. An Example-Based Mapping Method for Text Categorization and Retrieval. ACM TOIS, 12(3):252-277, 1994.
- T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. ECML, 1998.
- L. Cai and T. Hofmann. Hierarchical Document Categorization with Support Vector Machines. CIKM, 2004.
- R. E. Schapire, Y. Singer, et al. Boosting and Rocchio Applied to Text Filtering. SIGIR, 1998.
- E. Wiener, J. O. Pedersen, et al. A Neural Network Approach to Topic Spotting. SDAIR, 1995.

Bibliography
- R. O. Duda, P. E. Hart, et al. Pattern Classification (2nd Ed.). Wiley, 2000.
- L. Devroye, L. Györfi, et al. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
- Y. Yang. An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval, 1999.