A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics


A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

Helmut Berger (1) and Dieter Merkl (2)

(1) Faculty of Information Technology, University of Technology, Sydney, NSW, Australia, hberger@it.uts.edu.au
(2) School of Computing and Information Technology, University of Western Sydney, NSW, Australia, d.merkl@uws.edu.au

Abstract. This paper gives an analysis of multi-class e-mail categorization performance, comparing a character n-gram document representation against a word-frequency based representation. Furthermore, the impact of using available e-mail specific meta-information on classification performance is explored and the findings are presented.

1 Introduction

The task of automatically sorting documents of a document collection into categories from a predefined set is referred to as Text Categorization. Text categorization is applicable in a variety of domains: document genre identification, authorship attribution and survey coding, to name but a few. One particular application is categorizing e-mail messages into legitimate and spam messages, i.e. spam filtering. Androutsopoulos et al. compare in [1] a Naïve Bayes classifier against an instance-based classifier to categorize e-mail messages into spam and legitimate messages, and conclude that these learning-based classifiers clearly outperform simple anti-spam keyword approaches. However, sometimes it is desired to classify e-mail messages into more than two categories. Consider, for example, an e-mail routing application which automatically sorts incoming messages according to their content and routes them to receivers that are responsible for a particular topic. The study presented herein compares the performance of different text-classification algorithms in such a multi-class setting. By nature, e-mail messages are short documents containing misspellings, special characters and abbreviations. This entails an additional challenge for text classifiers to cope with noisy input data.
To classify e-mail in the presence of noise, a method used for language identification is adapted in order to statistically describe e-mail messages. Specifically, character-based n-gram frequency profiles, as proposed in [2], are used as the features which represent each particular e-mail message. The comparison of the performance of categorization algorithms using character-based n-gram frequencies as elements of feature vectors with respect to multiple classes is described. The assumption is that applying text categorization to character-based n-gram frequencies will outperform

word-based frequency representations of e-mails. A related approach in [3] aims at authorship attribution and topic detection. The authors evaluate the performance of a Naïve Bayes classifier combined with n-gram language models and mention that the character-based approach yields better classification results than the word-based approach for topic detection in newsgroups. Their interpretation is that the character-based approach captures regularities that the word-based approach misses in this particular application. Besides the content contained in the body of an e-mail message, the e-mail header holds useful data that has an impact on the classification task. This study explores the influence of header information on classification performance thoroughly. Two different representations of each e-mail message were generated: one that contains all data of an e-mail message, and a second which consists only of the textual data found in the e-mail body. The impact on classification results when header information is discarded is shown.

2 Text Categorization

The task of automatically sorting documents of a document collection into categories from a predefined set is referred to as Text Categorization [4]. An important task in text categorization is to prepare the text in such a way that it becomes suitable for a text classifier, i.e. to transform the documents into an adequate representation. Cavnar and Trenkle describe in [2] a statistical model for representing documents, namely character n-gram frequency profiles. A character n-gram is defined as an n-character long slice of a longer string. As an example, for n = 2 the character bi-grams of "topic spotting" are {to, op, pi, ic, c_, _s, sp, po, ot, tt, ti, in, ng}. Note that the space character is represented by an underscore (_). In order to obtain such frequency profiles, n-grams of different lengths n are generated for each document in the collection. Then, the n-gram occurrences are counted on a per-document basis.
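The construction of such frequency profiles can be sketched in a few lines of Python. This is only an illustrative reimplementation of the scheme from [2], not the code used in the study; the underscore convention for spaces follows the example above.

```python
from collections import Counter

def char_ngrams(text, n):
    # Replace spaces with '_' so that word boundaries become visible
    # n-gram characters, as in the "topic spotting" example above.
    padded = text.replace(" ", "_")
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_profile(document, sizes=(2, 3)):
    # Count n-gram occurrences for one document,
    # for every length n in `sizes`.
    profile = Counter()
    for n in sizes:
        profile.update(char_ngrams(document, n))
    return profile

bigram_profile = ngram_profile("topic spotting", sizes=(2,))
```

For the full representation, the per-document profiles over n ∈ {2, 3} would then be assembled into feature vectors, from which a feature selection metric retains the top-scored n-grams.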
One objective of this study is to determine the influence of different document representations on the performance of different text-classification approaches. To this end, a character-based n-gram document representation with n ∈ {2, 3} is compared against a document representation based on word frequencies. In the word-frequency representation, the occurrences of each word in a document are counted on a per-document basis. Generally, the initial number of features extracted from text corpora is very large. Many classifiers are unable to perform their task in a reasonable amount of time if the number of features increases dramatically. Thus, appropriate feature selection strategies must be applied to the corpus. Another problem emerges if the amount of training data is very low in proportion to the number of features. In this case, classifiers produce a large number of hypotheses for the training data, which might result in overfitting [5]. So, it is important to reduce the number of features while retaining those that contain potentially useful information. The idea of feature selection is to score each potential feature according to a feature selection metric and then take the n top-scored features. For a recent survey on the performance of different feature selection metrics we

refer to [6]. For this study, the Chi-Squared feature selection metric is used. The Chi-Squared metric evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class. For the task of document classification, algorithms from three different machine learning areas were selected: a Naïve Bayes classifier [7], partial decision trees (PART) as a rule-learning approach [8], and a support vector machine trained with the sequential minimal optimization algorithm (SMO) [9] as a representative of kernel-based learning.

3 Experiments

The major objective of these experiments is comparing the performance of different text-classification approaches for multi-class categorization when applied to a noisy domain. By nature, e-mail messages are short documents containing misspellings, special characters and abbreviations; for that reason, e-mail messages constitute perfect candidates to evaluate this objective. Moreover, the varying length of e-mail messages entails an additional challenge for text-classification algorithms. The assumption is that applying text categorization to a character-based n-gram frequency profile will outperform the word-frequency approach. This presumption is backed by the fact that character-based n-gram models are regarded as more stable with respect to noisy data. Moreover, the impact on performance is assessed when the header information contained in e-mail messages is taken into account. Hence, two different corpus representations are generated to evaluate this issue. Note that all experiments were performed with 10-fold cross-validation to reduce the likelihood of overfitting to the training set. Furthermore, we gratefully acknowledge the WEKA machine learning project for their open-source software [10], which was used to perform the experiments.

3.1 Data

The document collection consists of 1,811 e-mail messages.
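The Chi-Squared scoring and top-n selection described above can likewise be sketched in pure Python. The experiments themselves relied on WEKA's implementation; the toy documents and class labels below are invented for illustration, with each document reduced to the set of bi-grams it contains.

```python
def chi2_score(n11, n10, n01, n00):
    # Chi-squared statistic of a 2x2 contingency table:
    # n11/n10 = in-class documents with/without the feature,
    # n01/n00 = out-of-class documents with/without the feature.
    n = n11 + n10 + n01 + n00
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / den if den else 0.0

def top_features(docs, labels, target, k):
    # Score every feature against the `target` class and keep the k best.
    vocab = set().union(*docs)
    def score(f):
        n11 = sum(1 for d, y in zip(docs, labels) if y == target and f in d)
        n10 = sum(1 for d, y in zip(docs, labels) if y == target and f not in d)
        n01 = sum(1 for d, y in zip(docs, labels) if y != target and f in d)
        n00 = sum(1 for d, y in zip(docs, labels) if y != target and f not in d)
        return chi2_score(n11, n10, n01, n00)
    return sorted(vocab, key=score, reverse=True)[:k]

# Invented toy corpus: each document is the set of bi-grams it contains.
docs = [{"to", "op", "pi"}, {"to", "sp"}, {"ng", "in"}, {"ng", "tt"}]
labels = ["topic", "topic", "other", "other"]
selected = top_features(docs, labels, "topic", 2)
```

Features that occur in only one class (here "to" and "ng") receive the highest scores, which is exactly the behaviour a selection metric needs for discarding uninformative n-grams.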
These messages were collected during a period of four months, from October 2002 until January 2003. The e-mails were received by a single e-mail user account at the Institut für Softwaretechnik, Vienna University of Technology, Austria. Besides the noisiness of the corpus, it contains messages in different languages as well. Multi-linguality introduces yet another challenge for text classification. At first, messages containing confidential information were removed from the corpus. Next, the corpus was manually classified according to 6 categories. Note that the categorization process was performed subsequent to the collection period. Due to the manual classification of the corpus, some of the messages might have been misclassified. Some of the introduced categories deal with closely related topics in order to assess the accuracy of classifiers on similar categories. Next, two representations of each message were generated. The first representation consists of all the data contained in the e-mail message, i.e. the complete header as well as the body. However, the e-mail header was not treated in a

special way. All non-Latin characters, apart from the blank character, were discarded; thus, all HTML tags remain part of this representation. Henceforth, we refer to this representation as the complete set. Furthermore, a second representation retaining only the data contained in the body of the e-mail message was generated. In this representation, HTML tags were discarded, too. Henceforth, we refer to it as the cleaned set. Because some of the e-mail messages contained no textual data in the body besides HTML tags and other special characters, the corpus of the cleaned set consists of fewer messages than the complete set. To provide the total figures, the complete set consists of 1,811 e-mails whereas the cleaned set is constituted by 1,692 e-mails. Subsequently, both representations were translated to lower-case characters. Starting from these two message representations, the statistical models are built. In order to test the performance of the text classifiers with respect to the number of features, we subsequently selected the top-scored n features with n ∈ {100, 200, 300, 400, 500, 1000, 2000}, determined by the Chi-Squared feature selection metric.

3.2 Results

In Figure 1, the classification accuracy of the text classifiers (y-axis) along the number of features (x-axis) is shown. In this case, the cleaned set is evaluated. Note that NBm refers to the multinomial Naïve Bayes classifier, PART refers to the partial decision tree classifier and SMO refers to the support vector machine using the SMO training algorithm. Figure 1(a) shows the percentage of correctly classified instances using character n-grams and Figure 1(b) depicts the results for word frequencies. Each curve corresponds to one classifier. Considering the character n-gram representation (cf. Figure 1(a)), NBm shows the lowest performance. It starts with 69.2% (100 features), increases strongly for 300 features (78.0%) and arrives at 82.7% for the maximum number of features.
PART classifies 78.3% of the instances correctly when 100 features are used, which is higher than the 76.7% achieved with the SMO classifier. However, as the number of features increases to 300, SMO gets ahead of PART and finally arrives at 91.0% correctly classified instances (PART: 86.1%). Hence, as long as the number of features is smaller than 500, either PART or SMO yields high classification results. As the number of features increases, SMO outperforms NBm and PART dramatically. In case of word frequencies a similar trend can be observed, but the roles have changed, cf. Figure 1(b). All classifiers start with low performance. Remarkably, SMO (65.7%) classifies fewer instances correctly than PART (76.0%) and NBm (68.6%). All three classifiers boost their classification results enormously as the number of features increases to 200. At last, the SMO classifier yields 91.0% and outperforms both NBm (85.8%) and PART (88.2%). Figure 2 shows the classification accuracy when the complete set is used for the classification task. Again, the left chart (cf. Figure 2(a)) represents the percentage of correctly classified instances for character n-grams and Figure 2(b) depicts the results for word frequencies. If NBm is applied to character n-grams, the classification task ends up in a random sorting of instances. The best result is achieved when 100 features are used (64.8%). As the number of features

Fig. 1. Classification performance of individual classifiers applied to the cleaned set: (a) character n-grams; (b) word frequencies.

grows, NBm's performance drops to its low of 54.2% (400 features), arriving at 62.7% for 2000 features. Contrarily, PART classifies 84.6% of the instances correctly using 100 features. However, increasing the number of features improves the classification performance of PART only marginally (2000 attributes: 89.1%). SMO starts at 76.1%, increases significantly as 200 features are used (82.8%) and, after a continuous increase, classifies 92.9% of the instances correctly as the maximum number of features is reached. In analogy to the results obtained with character n-grams, NBm shows poor performance when word frequencies are used, cf. Figure 2(b). Its top performance is 83.5%, obtained as the maximum number of features is reached. Interestingly, PART classifies 87.0% of instances correctly straight away, the highest of all values obtained with 100 features. However, PART's performance increases only marginally for larger numbers of features and reaches, at last, 91%. SMO starts between NBm and PART with 80.1%. Once 400 features are used, SMO jumps into first place with 91% and arrives at the peak result of 93.6% correctly classified instances when 2000 features are used.

4 Conclusion

In this paper, the results of three text-categorization algorithms are described in a multi-class categorization setting. The algorithms are applied to character n-gram frequency statistics and to a word-frequency based document representation. A corpus consisting of multi-lingual e-mail messages, manually split into multiple classes, was used. Furthermore, the impact of e-mail meta-information on classification performance was assessed. The assumption that a document representation based on character n-gram frequency statistics boosts categorization performance in a noisy domain such as e-mail filtering could not be verified.
The classifiers, especially PART and SMO, showed similar performance regardless of the chosen document representation. However, when applied to word frequencies, marginally better results were

Fig. 2. Classification performance of individual classifiers applied to the complete set: (a) character n-grams; (b) word frequencies.

obtained for all categorization algorithms. Moreover, when a word-based document representation was used, the percentage of correctly classified instances was higher in case of a small number of features. Using the word-frequency representation results in a minor improvement of classification accuracy. The results, especially those of SMO, showed that both document representations are feasible in multi-class e-mail categorization.

References

1. Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C., Stamatopoulos, P.: Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach. In: Proc. Workshop on Machine Learning and Textual Information Access (2000)
2. Cavnar, W.B., Trenkle, J.M.: N-gram-based text categorization. In: Proc. Int'l Symp. on Document Analysis and Information Retrieval (1994)
3. Peng, F., Schuurmans, D.: Combining naive Bayes and n-gram language models for text classification. In: Proc. European Conf. on Information Retrieval Research (2003) 335-350
4. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34 (2002) 1-47
5. Mitchell, T.: Machine Learning. McGraw-Hill (1997)
6. Forman, G.: An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3 (2003) 1289-1305
7. McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. In: Proc. AAAI-98 Workshop on Learning for Text Categorization (1998)
8. Frank, E., Witten, I.H.: Generating accurate rule sets without global optimization. In: Proc. Int'l Conf. on Machine Learning (1998) 144-151
9. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods - Support Vector Learning. MIT Press (1999) 185-208
10. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (2000)