Reading group on Ontologies and NLP: Machine Learning in Automated Text Categorization, by Fabrizio Sebastiani. ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1–47. 27th February 2014
Topic & Research question Text categorization (TC) is the activity of labelling natural language texts with thematic categories from a predefined set. TC is a task of information retrieval (IR). Applications: document indexing based on a controlled vocabulary, document filtering, automated metadata generation, word sense disambiguation, population of hierarchical catalogues of Web resources.
Topic & Research question How? Two main approaches to TC: Knowledge engineering (KE): manually defining a set of rules to classify documents under the given categories. Machine learning (ML): a general inductive process that builds an automatic text classifier by learning, from a set of preclassified documents, the characteristics of the given categories. This paper focuses on the ML approach, of which there had been no systematic treatment.
Contents
1 Introduction
2 Text categorization
3 Applications of text categorization
4 The machine learning approach to text categorization
5 Document indexing and dimensionality reduction
6 Inductive construction of text classifiers
7 Evaluation of text classifiers
8 Conclusion
2. Text categorization A definition of text categorization TC is the task of assigning a Boolean value to each pair (d_j, c_i) ∈ D × C, where D is a domain of documents and C = {c_1, ..., c_|C|} is a set of predefined categories. A value of T assigned to (d_j, c_i) indicates a decision to file d_j under c_i, while a value of F indicates a decision not to file d_j under c_i. The task is to approximate the unknown target function Φ : D × C → {T, F} (which describes how documents ought to be classified) by means of a function Φ̂ : D × C → {T, F} called the classifier (or rule, hypothesis, model), such that Φ and Φ̂ coincide as much as possible. Effectiveness measures this coincidence. Categories are symbolic labels. No metadata is available: the classification must rely solely on the document's semantics. The membership of a document in a category cannot be decided deterministically (cf. inter-indexer inconsistency).
2. Text categorization Single-label TC: the case in which exactly 1 category must be assigned to each document. Binary TC: a document must be assigned either to a category or to its complement. Multi-label TC: the case in which any number of categories, from 0 to |C|, may be assigned to the same document. An algorithm for binary classification can be used for multi-label classification; this requires that categories are stochastically independent of each other. This paper deals with the binary case.
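The reduction of multi-label TC to |C| independent binary decisions can be sketched as follows. This is a minimal illustration, not the paper's method: the keyword-based binary classifiers stand in for learned ones, and the category names and keyword sets are invented for the example.

```python
# Multi-label TC via independent binary decisions: one binary classifier
# per category; a document receives every category whose classifier says T
# (so 0 to |C| labels are possible). Toy keyword classifiers for illustration.

def make_keyword_classifier(keywords):
    """Binary classifier: T if the document mentions any of the keywords."""
    def classify(doc):
        words = doc.lower().split()
        return any(k in words for k in keywords)
    return classify

# Hypothetical category set C with toy classifiers.
CLASSIFIERS = {
    "sports": make_keyword_classifier({"match", "goal", "team"}),
    "finance": make_keyword_classifier({"bank", "stocks", "market"}),
}

def multi_label(doc):
    """Assign every category whose binary classifier files the document under it."""
    return {c for c, clf in CLASSIFIERS.items() if clf(doc)}

print(multi_label("the team scored a late goal"))        # {'sports'}
print(multi_label("the bank reported a strong market"))  # {'finance'}
```

Note that treating the |C| decisions independently is exactly where the stochastic-independence assumption mentioned above enters.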
2. Text categorization Two different ways of using a text classifier: Document-pivoted TC (DPC): given a document, we want to find all the categories under which it should be filed. Category-pivoted TC (CPC): given a category, we want to find all the documents that should be filed under it. The sets C and D might not be available in their entirety from the start; DPC is suitable when documents become available at different moments. These decisions are important for the choice of the classifier-building method.
2. Text categorization Hard categorization vs. ranking categorization: Complete automation calls for hard categorization. Semiautomatic settings (useful in critical applications, or when the quality of the training data is low) call for ranking categorization: category-ranking TC (rank the categories according to their estimated appropriateness to a document d) or document-ranking TC (rank the documents according to their estimated appropriateness to a category c).
3. Applications of text categorization Speech categorization, multimedia document categorization, author identification, language identification, automatic identification of text genre, automated essay grading... Automatic indexing for Boolean information retrieval systems: each document is assigned one or more keywords from a controlled dictionary. If the entries in the controlled vocabulary are viewed as categories, text indexing is an instance of TC. Document organization: e.g. in a newspaper, classified ads must be categorized under categories such as Personals, Cars for Sale, etc.; automatic grouping of conference papers into sections.
3. Applications of text categorization Text filtering: the activity of classifying a stream of incoming documents, e.g. a newsfeed or an e-mail filter, at the producer end or the consumer end; adaptive filtering vs. routing or batch filtering. Word sense disambiguation: the activity of finding, given the occurrence in a text of an ambiguous word, the sense of this particular word occurrence; e.g. Bank of England (financial institution) vs. bank of the river Thames (engineering artifact). Treat word occurrence contexts as documents and word senses as categories: single-label, document-pivoted TC. Related tasks: context-sensitive spelling correction, part-of-speech tagging, word choice selection, etc. Hierarchical categorization of Web pages, under the hierarchical catalogues hosted by popular Internet portals. Peculiarities: the hypertextual nature of the documents and the hierarchical structure of the category set.
4. The machine learning approach to text categorization In the 80s: knowledge engineering (KE) techniques, e.g. the CONSTRUE system (fig. 1). Drawback: the knowledge acquisition problem. Since the 90s, the ML approach to TC has gained popularity: supervised learning. Advantages: the classifier is built automatically by a learner, and the learner is available off-the-shelf.
4.1. Training set, test set and validation set The ML approach relies on the availability of an initial corpus of preclassified documents. Prior to classifier construction, the initial corpus is split into two sets: the training (and validation) set, by observing whose characteristics the classifier is built, and the test set, used for testing the effectiveness of the classifiers. Train-and-test approach vs. k-fold cross-validation.
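The k-fold cross-validation alternative to train-and-test can be sketched as below: every document serves once as test data and k-1 times as training data, so the effectiveness estimate uses the whole initial corpus. This is a minimal sketch; the round-robin partition is one simple way to form the folds.

```python
# k-fold cross-validation: partition the initial corpus into k folds,
# then build k classifiers, each trained on k-1 folds and tested on the
# remaining one. Effectiveness is averaged over the k runs.

def k_fold_splits(docs, k):
    """Yield (training_set, test_set) pairs for k-fold cross-validation."""
    folds = [docs[i::k] for i in range(k)]   # round-robin partition into k folds
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test

docs = list(range(10))                       # stand-ins for preclassified documents
for train, test in k_fold_splits(docs, 5):
    assert len(train) == 8 and len(test) == 2
    assert sorted(train + test) == docs      # every document used exactly once
```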
5. Document indexing and dimensionality reduction Texts cannot be directly interpreted by a classifier. An indexing procedure maps a text d_j into a compact representation of its content. A text is usually represented as a vector of term weights d_j = (w_1j, ..., w_|T|j), where T is the set of terms that occur at least once in at least one document. Identify terms with words (bag of words) or phrases; removal of function words; stemming. Weights range between 0 and 1: use the standard tf-idf function, with the weights then normalized by cosine normalization. The Darmstadt Indexing Approach considers properties of texts, documents, categories, etc. as basic dimensions of the learning space.
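The tf-idf weighting with cosine normalization described above can be sketched as follows, assuming the common form tfidf(t, d) = tf(t, d) · log(N / df(t)), where N is the corpus size and df(t) the number of documents containing t. A minimal sketch, not a production indexer (no stemming or function-word removal).

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Map each text to a vector of term weights: tf * log(N / df),
    then cosine-normalize so weights lie in [0, 1] and document
    length is factored out."""
    n = len(corpus)
    tokenized = [doc.lower().split() for doc in corpus]   # naive bag of words
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        w = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(v * v for v in w.values()))
        vectors.append({t: v / norm for t, v in w.items()} if norm else w)
    return vectors
```

Note how a term occurring in every document gets idf log(N/N) = 0: it carries no discriminative information, which is the intuition behind the weighting.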
5. Document indexing and dimensionality reduction Dimensionality reduction: the high dimensionality of the term space might be problematic. Dimensionality reduction also reduces overfitting. Methods: dimensionality reduction by term selection (document frequency, other information-theoretic term selection functions) and dimensionality reduction by term extraction (term clustering, latent semantic indexing (LSI)).
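The simplest of the term-selection criteria listed above, document frequency, can be sketched in a few lines: keep only terms that occur in at least a threshold number of documents. The data and threshold below are invented for illustration.

```python
from collections import Counter

def select_by_df(tokenized_docs, min_df):
    """Dimensionality reduction by term selection: return the reduced
    term set containing only terms occurring in >= min_df documents."""
    df = Counter(t for toks in tokenized_docs for t in set(toks))
    return {t for t, n in df.items() if n >= min_df}

docs = [["cat", "sat"], ["cat", "ran"], ["dog", "sat"], ["cat"]]
print(select_by_df(docs, 2))   # {'cat', 'sat'} -- rare terms are dropped
```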
6. Inductive construction of text classifiers
Determining thresholds
Probabilistic classifiers
Decision tree classifiers
Decision rule classifiers
Regression methods
On-line methods
The Rocchio method
Neural networks
Example-based classifiers
Building classifiers by support vector machines
Classifier committees
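Of the learners listed above, the Rocchio method is compact enough to sketch here: the category profile is a weighted difference of the centroids of positive and negative training examples, and a document is scored by its cosine similarity to the profile. A minimal sketch assuming tf-idf-style weighted-term vectors as input; the toy documents and the beta/gamma values are illustrative, not prescribed.

```python
import math

def add(vec, other, scale=1.0):
    """Accumulate scale * other into vec (sparse dict vectors)."""
    for t, x in other.items():
        vec[t] = vec.get(t, 0.0) + scale * x
    return vec

def rocchio_profile(pos, neg, beta=16.0, gamma=4.0):
    """Rocchio: profile = beta * centroid(POS) - gamma * centroid(NEG),
    so terms typical of positive examples get positive weight and terms
    typical of negative examples get negative weight."""
    profile = {}
    for d in pos:
        add(profile, d, beta / len(pos))
    for d in neg:
        add(profile, d, -gamma / len(neg))
    return profile

def cosine(v, w):
    dot = sum(v.get(t, 0.0) * w[t] for t in w)
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

# Toy weighted-term documents for a hypothetical "sports" category.
pos = [{"goal": 1.0, "team": 1.0}, {"goal": 1.0, "match": 1.0}]
neg = [{"bank": 1.0, "market": 1.0}]
profile = rocchio_profile(pos, neg)
print(cosine(profile, {"goal": 1.0}) > cosine(profile, {"bank": 1.0}))  # True
```

Scoring against the profile yields a ranking; a hard (binary) classifier is then obtained by choosing a threshold, which is what the "Determining thresholds" item above refers to.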
7. Evaluation of text classifiers
Measures of text categorization effectiveness
Precision and recall
Other measures of effectiveness
Measures alternative to effectiveness
Combined effectiveness measures
Benchmarks for text categorization
Which text classifier is best?
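Precision, recall, and their harmonic-mean combination F1 (a standard combined effectiveness measure) can be computed for a binary category as sketched below; the document-id sets are invented for illustration.

```python
# Effectiveness of a binary classifier for one category:
#   precision = TP / (TP + FP)   (of the documents filed, how many correctly)
#   recall    = TP / (TP + FN)   (of the documents that belong, how many found)
#   F1        = harmonic mean of precision and recall

def prf1(predicted, relevant):
    """predicted, relevant: sets of document ids filed under the category
    by the classifier and by the expert, respectively."""
    tp = len(predicted & relevant)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1(predicted={1, 2, 3, 4}, relevant={1, 2, 5})
print(p, r, f)   # 0.5, 0.666..., 0.571...
```

Averaging these per-category figures over all of C (micro- or macro-averaging) gives the corpus-level effectiveness used to compare classifiers on benchmarks.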
8. Conclusion Numerous and important domains of application of TC. TC is indispensable in many applications, improves the productivity of human classifiers, and reaches effectiveness levels comparable to those of trained professionals.
Next reading group: to be announced at https://blog.hig.no/ontologies/