Reading group on Ontologies and NLP:

Reading group on Ontologies and NLP: Machine Learning in Automated Text Categorization, by Fabrizio Sebastiani. ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47. 27th February 2014

Topic & Research question
Text categorization (TC) is the activity of labelling natural language texts with thematic categories from a predefined set. TC is a task of information retrieval (IR).
Applications: document indexing based on a controlled vocabulary, document filtering, automated metadata generation, word sense disambiguation, population of hierarchical catalogues of Web resources.

Topic & Research question
How? Two main approaches to TC:
Knowledge engineering (KE): manually defining a set of rules to classify docs under the given categories.
Machine learning (ML): a general inductive process builds an automatic text classifier by learning, from a set of preclassified docs, the characteristics of the given categories.
The paper focuses on the ML approach, for which no systematic treatment was available.

Contents
1 Introduction
2 Text categorization
3 Applications of text categorization
4 The machine learning approach to text categorization
5 Document indexing and dimensionality reduction
6 Inductive construction of text classifiers
7 Evaluation of text classifiers
8 Conclusion

2. Text categorization
A definition of text categorization
TC is the task of assigning a Boolean value to each pair $(d_j, c_i) \in D \times C$, where $D$ is a domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of predefined categories.
A value of $T$ assigned to $(d_j, c_i)$ indicates a decision to file $d_j$ under $c_i$, while a value of $F$ indicates a decision not to file $d_j$ under $c_i$.
The task is to approximate the unknown target function $\breve{\Phi} : D \times C \to \{T, F\}$ (which describes how documents ought to be classified) by means of a function $\Phi : D \times C \to \{T, F\}$ called the classifier (or rule, hypothesis, model), such that $\breve{\Phi}$ and $\Phi$ coincide as much as possible.
Effectiveness measures this coincidence.
Categories are just symbolic labels.
No metadata is available: classification must rely solely on the content (semantics) of the document.
The membership of a document in a category cannot be decided deterministically: inter-indexer inconsistency.

2. Text categorization
Single-label TC: the case in which exactly 1 category must be assigned to each document. Binary TC is a special case: a document must be assigned either to a category or to its complement.
Multi-label TC: the case in which any number of categories from 0 to |C| may be assigned to the same document.
An algorithm for binary classification can also be used for multi-label classification. This requires that the categories are stochastically independent of each other.
This paper: binary case.
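As an illustration of how a multi-label problem can be handed to a binary learner, the sketch below splits a multi-label labelling into one binary labelling per category. The function name `to_binary_problems` and the toy documents and categories are hypothetical and not taken from the paper; the split is only sound under the independence assumption stated above.

```python
from typing import Dict, List, Set


def to_binary_problems(
    labels: Dict[str, Set[str]], categories: List[str]
) -> Dict[str, Dict[str, bool]]:
    """For each category c, map every document to True (filed under c) or False."""
    return {
        c: {doc: (c in doc_categories) for doc, doc_categories in labels.items()}
        for c in categories
    }


# Hypothetical toy data: each document carries 0..|C| categories.
doc_labels = {"d1": {"sports"}, "d2": {"sports", "politics"}, "d3": set()}
categories = ["sports", "politics"]

binary = to_binary_problems(doc_labels, categories)
for category, labelling in binary.items():
    print(category, labelling)
```

Each entry of `binary` is an independent binary TC problem, so one binary classifier per category can be trained on it.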

2. Text categorization
Two different ways of using a text classifier:
Document-pivoted TC (DPC): given a document, we want to find all the categories under which it should be filed.
Category-pivoted TC (CPC): given a category, we want to find all the documents that should be filed under it.
The sets C and D might not be available in their entirety from the start. DPC is suitable when docs become available at different moments.
These decisions are important for the choice of the classifier-building method.

2. Text categorization
Hard categorization vs. ranking categorization:
Complete automation requires hard categorization: a T or F decision for each pair (d, c).
Semiautomatic settings (useful in critical applications, or when the quality of the training data is low) can use ranking categorization instead. Category-ranking TC ranks the categories according to their estimated appropriateness to d; document-ranking TC ranks the documents according to their estimated appropriateness to c.
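The difference between the two modes can be sketched as follows, assuming a classifier has already produced a score per category for one document. The scores, the 0.5 threshold, and the "Real Estate" category are made-up values for illustration only.

```python
from typing import Dict, List


def hard_categorize(scores: Dict[str, float], threshold: float = 0.5) -> Dict[str, bool]:
    """Hard categorization: turn each score into a T/F decision via a threshold."""
    return {category: score >= threshold for category, score in scores.items()}


def rank_categories(scores: Dict[str, float]) -> List[str]:
    """Category-ranking TC: order categories by estimated appropriateness."""
    return sorted(scores, key=lambda category: scores[category], reverse=True)


# Hypothetical scores for a single classified ad.
scores = {"Personals": 0.15, "Cars for Sale": 0.80, "Real Estate": 0.55}

decisions = hard_categorize(scores)
ranking = rank_categories(scores)
print(decisions)
print(ranking)
```

In the semiautomatic setting, a human expert would take the final decision by inspecting the top of `ranking` rather than trusting `decisions` directly.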

3. Applications of text categorization
Speech categorization, multimedia doc categorization, author identification, language identification, automatic identification of text genre, automated essay grading...
Automatic indexing for Boolean information retrieval systems: each doc is assigned one or more keywords from a controlled dictionary. If the entries in the controlled vocabulary are viewed as categories, text indexing is an instance of TC.
Document organization: e.g. in a newspaper, classified ads must be filed under categories such as Personals, Cars for Sale, etc.; automatic grouping of conference papers into sessions.

3. Applications of text categorization
Text filtering: the activity of classifying a stream of incoming documents, e.g. a newsfeed or an e-mail filter. Filtering can take place at the producer end or at the consumer end; adaptive filtering vs. routing or batch filtering.
Word sense disambiguation: the activity of finding, given the occurrence in a text of an ambiguous word, the sense of this particular word occurrence; e.g. Bank of England (financial institution) vs. bank of the river Thames (engineering artifact). Word occurrence contexts are viewed as docs and word senses as categories, which makes this single-label, document-pivoted TC. Related tasks: context-sensitive spelling correction, part-of-speech tagging, word choice selection, etc.
Hierarchical categorization of Web pages under the hierarchical catalogues hosted by popular Internet portals. Peculiarities: hypertextual nature of the docs, hierarchical structure of the category set.

4. The machine learning approach to text categorization
In the 80s: knowledge engineering (KE) techniques, e.g. the CONSTRUE system (fig. 1). Drawback: the knowledge acquisition problem.
Since the 90s, the ML approach to TC has gained popularity: supervised learning.
Advantages: classifiers are built automatically by a learner, and the learner is available off-the-shelf.

4.1. Training set, test set and validation set
The ML approach relies on the availability of an initial corpus of documents preclassified under the categories.
Prior to classifier construction, the initial corpus is split into two sets:
Training (and validation) set: the classifier is built by observing the characteristics of these docs.
Test set: used for testing the effectiveness of the classifier.
Train-and-test approach vs. k-fold cross-validation.
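The two evaluation protocols can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's procedure: the helper names, the 30% test fraction, the seed, and the ten-document corpus are all assumptions.

```python
import random
from typing import Iterator, List, Tuple


def train_test_split(
    docs: List[str], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Train-and-test: shuffle once, hold out a fixed fraction for testing."""
    shuffled = docs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(docs: List[str], k: int) -> Iterator[Tuple[List[str], List[str]]]:
    """k-fold cross-validation: each fold serves as the test set exactly once."""
    folds = [docs[i::k] for i in range(k)]
    for i in range(k):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, folds[i]


corpus = [f"doc{i}" for i in range(10)]  # hypothetical preclassified corpus
train, test = train_test_split(corpus)
splits = list(k_fold(corpus, k=5))
print(len(train), len(test), len(splits))
```

With k-fold cross-validation, a classifier is built k times and the k effectiveness figures are averaged, so every document is used for testing once.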

5. Document indexing and dimensionality reduction
Texts cannot be directly interpreted by a classifier.
Indexing procedure: maps a text $d_j$ into a compact representation of its content.
A text is usually represented as a vector of term weights $\vec{d}_j = \langle w_{1j}, \ldots, w_{|T|j} \rangle$, where $T$ is the set of terms that occur at least once in at least one document.
Terms are identified with words (bag of words) or with phrases. Function words are removed, and stemming may be applied.
Weights range between 0 and 1: they are computed with the standard tfidf function and then normalized by cosine normalization.
The Darmstadt Indexing Approach: considers properties of terms, docs, categories, etc. as basic dimensions of the learning space.
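A minimal sketch of this indexing step, assuming the common form of tfidf (raw term count times the log of the number of documents over the document frequency) followed by cosine normalization. Whitespace tokenization, the function name, and the three toy documents are assumptions; function-word removal and stemming are left out.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Bag-of-words tfidf weights, cosine-normalized so each weight lies in [0, 1]."""
    n_docs = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        raw = {t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors


# Hypothetical toy corpus.
docs = ["bank of england", "bank of the river thames", "river boat"]
vectors = tfidf_vectors(docs)
for vector in vectors:
    print(vector)
```

Terms that occur in fewer documents (such as "england" here) receive a higher weight than terms spread over many documents (such as "bank").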

5. Document indexing and dimensionality reduction
Dimensionality reduction: the high dimensionality of the term space might be problematic. Dimensionality reduction also reduces overfitting.
Methods:
Dimensionality reduction by term selection: document frequency, other information-theoretic term selection functions.
Dimensionality reduction by term extraction: term clustering, latent semantic indexing (LSI).
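Term selection by document frequency is the simplest of these methods to sketch: keep only the terms that occur in enough documents. The cutoff `min_df=2`, the function name, and the toy corpus are assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Set


def select_terms_by_df(docs: List[str], min_df: int = 2) -> Set[str]:
    """Keep the terms whose document frequency is at least min_df."""
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term for term, n in df.items() if n >= min_df}


# Hypothetical toy corpus.
docs = ["bank of england", "bank of the river thames", "river boat"]
selected = select_terms_by_df(docs, min_df=2)

# Project every document onto the reduced term space.
reduced = [[t for t in doc.lower().split() if t in selected] for doc in docs]
print(sorted(selected))
print(reduced)
```

Term extraction methods such as LSI differ in kind: they build new dimensions from combinations of the original terms instead of keeping a subset of them.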

7. Inductive construction of text classifiers
Determining thresholds
Probabilistic classifiers
Decision tree classifiers
Decision rule classifiers
Regression methods
On-line methods
The Rocchio method
Neural networks
Example-based classifiers
Building classifiers by support vector machines
Classifier committees

8. Evaluation of text classifiers
Measures of text categorization effectiveness
Precision and recall
Other measures of effectiveness
Measures alternative to effectiveness
Combined effectiveness measures
Benchmarks for text categorization
Which text classifier is best?
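For a single category, precision and recall can be computed from the classifier's T/F decisions and the expert's T/F judgments, as sketched below. F1 (the harmonic mean of the two) is shown as one common combined measure; the slide does not say which combined measure it has in mind, so that choice is an assumption, as are the function name and the toy decisions.

```python
from typing import List, Tuple


def precision_recall_f1(predicted: List[bool], actual: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one category from paired T/F decisions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical decisions for five test documents under one category.
predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]

precision, recall, f1 = precision_recall_f1(predicted, actual)
print(precision, recall, f1)
```

Here two of the three documents filed under the category are correct, and two of the three documents that belong to it are found.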

9. Conclusion
TC has numerous and important domains of application.
It is indispensable in many applications.
It improves the productivity of human classifiers.
It reaches effectiveness levels comparable to those of trained professionals.

Next reading group
To be announced at https://blog.hig.no/ontologies/