An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization


Chih-Chin Lai (a) and Ming-Chi Tsai (b)

(a) Dept. of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 700
(b) Institute of Information Management, Shu-Te University, Kaohsiung County, Taiwan 824

E-mail: cclai@mail.nutn.edu.tw, lravati@pchome.com.tw

Abstract

The increasing volume of unsolicited bulk e-mail (also known as spam) has generated a need for reliable anti-spam filters. Using a classifier based on machine learning techniques to filter out spam e-mail automatically has drawn many researchers' attention. In this paper, we review some relevant ideas and conduct a set of systematic experiments on e-mail categorization with four machine learning algorithms applied to different parts of an e-mail. Experimental results reveal that the header of an e-mail provides very useful information to all of the machine learning algorithms considered for detecting spam e-mail.

Keywords: spam, e-mail categorization, machine learning

1. Introduction

In recent years, Internet e-mail has become a common and important medium of communication for almost everyone. However, spam, also known as unsolicited commercial/bulk e-mail, is a bane of e-mail communication. There are many serious problems associated with growing volumes of spam: it wastes not only storage space and communication bandwidth but also the time required to deal with it.

Several solutions have been proposed to overcome the spam problem. Among the proposed methods, much interest has focused on machine learning techniques for spam filtering. These include rule learning [3], Naïve Bayes [1, 6], decision trees [2], support vector machines [4], and combinations of different learners [7]. The concept common to these approaches is to filter out spam with a classifier that is learned from training data rather than constructed by hand, which can result in better performance [9]. From the machine learning viewpoint, spam filtering based on the textual content of e-mail can be viewed as a special case of text categorization, with the categories being spam and non-spam [5].

Sahami et al. [6] employed a Bayesian classification technique to filter junk e-mail. By making use of the extensible framework of Bayesian modeling, they could not only employ traditional document classification techniques based on the text of an e-mail but also easily incorporate domain knowledge aimed at filtering spam. Drucker et al. [4] used support vector machines (SVMs) to classify e-mails according to their contents and compared their performance with Ripper, Rocchio, and boosted decision trees. They concluded that boosted trees and SVMs had acceptable test performance in terms of accuracy and speed, although the training time of boosted trees was inordinately long. Androutsopoulos et al. [1] extended the Naïve Bayes (NB) filter proposed by Sahami et al. [6] by investigating the effect of different numbers of features and training-set sizes on the filter's performance. They also compared the performance of NB with a memory-based approach and found that both methods clearly outperform a typical keyword-based filter.

The objective of this paper is to evaluate four machine learning algorithms for spam e-mail categorization.
These techniques are Naïve Bayes (NB), term frequency-inverse document frequency (TF-IDF), K-nearest neighbor (K-NN), and support vector machines (SVMs). In addition, we study which parts of an e-mail can be exploited to improve categorization capability. We considered the following five combinations of an e-mail message: all (A), header (H), body (B), subject (S), and body with subject (B+S). The four methods are compared on these features to help evaluate the relative merits of the algorithms and to suggest directions for future work.

The rest of this paper is organized as follows. Section 2 gives a brief review of the four machine learning algorithms and the features we used. Section 3 presents the experimental results designed to evaluate the performance of the different experimental settings. The conclusions are summarized in Section 4.
2. Machine learning methods and features in the e-mail

In this section, we review the machine learning algorithms in the literature that have been used for e-mail categorization (or anti-spam filtering): Naïve Bayes (NB), term frequency-inverse document frequency (TF-IDF), K-nearest neighbor (K-NN), and support vector machines (SVMs).

2.1 Naïve Bayes

The Naïve Bayes (NB) classifier is a probability-based approach. Its basic concept is to decide whether an e-mail is spam by looking at which words occur in the message and which words are absent from it. In the literature, the NB classifier for spam is defined as

$C_{NB} = \arg\max_{c_i \in T} P(c_i) \prod_k P(w_k \mid c_i)$,  (1)

where $T$ is the set of target classes (spam or non-spam) and $P(w_k \mid c_i)$ is the probability that word $w_k$ occurs in an e-mail, given that the e-mail belongs to class $c_i$. The likelihood term is estimated as

$P(w_k \mid c_i) = \frac{n_k}{N}$,  (2)

where $n_k$ is the number of times word $w_k$ occurs in e-mails of class $c_i$, and $N$ is the total number of words in e-mails of class $c_i$.
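As a concrete illustration of Eqs. (1)-(2), the following minimal Python sketch trains and applies an NB spam classifier. The token-list input format, the log-space scoring, and the small epsilon guarding unseen words are our own illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def train_nb(emails, labels):
    """emails: list of token lists; labels: parallel list of 'spam'/'legit'."""
    prior = Counter(labels)                       # class counts for P(c_i)
    word_counts = {c: Counter() for c in prior}   # n_k: word counts per class
    total_words = Counter()                       # N: total words per class
    for tokens, c in zip(emails, labels):
        word_counts[c].update(tokens)
        total_words[c] += len(tokens)
    return prior, word_counts, total_words, len(labels)

def classify_nb(tokens, prior, word_counts, total_words, n_docs, eps=1e-9):
    best_class, best_score = None, float('-inf')
    for c in prior:
        # Eq. (1) in log space: log P(c_i) + sum_k log P(w_k | c_i)
        score = math.log(prior[c] / n_docs)
        for w in tokens:
            # Eq. (2): P(w_k | c_i) = n_k / N; eps is an added guard for
            # unseen words (an assumption, not part of the paper's estimate)
            score += math.log(word_counts[c][w] / total_words[c] + eps)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```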
2.2 Term frequency-inverse document frequency

The most frequently adopted representation of a set of messages is as term-weight vectors, as used in the Vector Space Model [8]. The term weights are real numbers indicating the significance of terms in identifying a document. Based on this concept, the weight of a term in an e-mail message can be computed by tf-idf. The tf (term frequency) is the number of times a term $t$ appears in an e-mail; the idf (inverse document frequency) decreases with the number of e-mails in the collection that contain $t$. The tf-idf weighting scheme is defined as

$w_{ij} = tf_{ij} \cdot \log\left(\frac{N}{df_i}\right)$,  (3)

where $w_{ij}$ is the weight of the $i$th term in the $j$th e-mail, $tf_{ij}$ is the number of times the $i$th term occurs in the $j$th e-mail, $N$ is the total number of e-mails in the collection, and $df_i$ is the number of e-mails in which the $i$th term occurs.
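A small sketch of the weighting in Eq. (3), assuming the e-mails are already tokenized; the sparse dictionary-of-weights representation is an illustrative choice of ours.

```python
import math
from collections import Counter

def tfidf_vectors(emails):
    """emails: list of token lists -> one {term: weight} dict per e-mail."""
    n = len(emails)                          # N: number of e-mails
    df = Counter()                           # df_i: e-mails containing term i
    for tokens in emails:
        df.update(set(tokens))
    vectors = []
    for tokens in emails:
        tf = Counter(tokens)                 # tf_ij: occurrences in e-mail j
        # Eq. (3): w_ij = tf_ij * log(N / df_i)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```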
2.3 K-nearest neighbor

The most basic instance-based method is the K-nearest neighbor (K-NN) algorithm. It is a very simple way to classify documents and shows very good performance on text categorization tasks [10]. To apply the K-NN method to e-mail classification, the e-mails of the training set must first be indexed and converted into a document vector representation. When classifying a new e-mail, the similarity between its document vector and each vector in the training set is computed; the categories of the k nearest neighbors are then determined, and the category that occurs most frequently is chosen.
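The procedure can be sketched as follows, reusing the tf-idf vectors above. Cosine similarity and k = 5 are assumptions on our part, since the paper does not specify the similarity measure or the value of k.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, train_vecs, train_labels, k=5):
    """Return the most frequent category among the k nearest neighbors."""
    neighbours = sorted(((cosine(query, v), lab)
                         for v, lab in zip(train_vecs, train_labels)),
                        reverse=True)[:k]
    return Counter(lab for _, lab in neighbours).most_common(1)[0][0]
```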
2.4 Support vector machine

The support vector machine (SVM) has become very popular in the machine learning community because of its good generalization performance and its ability to handle high-dimensional data through kernels. According to the description given by Woitaszek et al. [11], an e-mail may be represented by a feature vector $x$ composed of the various words from a dictionary formed by analyzing the collected e-mails. An e-mail is then classified as spam or non-spam by performing a simple dot product between the features of the e-mail and the SVM model's weight vector,

$y = w \cdot x + b$,  (4)

where $y$ is the result of classification, $w$ is the weight vector corresponding to the features in $x$, and $b$ is the bias parameter of the SVM model, determined by the training process.
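At classification time, Eq. (4) amounts to a dot product plus a bias. A minimal sketch, assuming a trained weight vector stored as a term-to-weight dictionary and a zero decision threshold (the sign convention is our assumption):

```python
def svm_classify(x, w, b):
    """Eq. (4): y = w . x + b over sparse {term: value} vectors."""
    y = sum(val * w.get(t, 0.0) for t, val in x.items()) + b
    return 'spam' if y >= 0 else 'legit'   # threshold choice is an assumption
```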
2.5 The structure of an e-mail

In addition to the text of its message, an e-mail carries additional information in the header. The header contains many fields, for example, trace information about the hosts a message has passed through (Received:), where the sender wants replies to go (Reply-To:), the unique ID of the message (Message-ID:), the format of the content (Content-Type:), and so on. Figure 1 illustrates the header of an e-mail.

Besides comparing the categorization performance of the learning algorithms, we intended to determine which parts of an e-mail have a critical influence on the classification results. Therefore, five features of an e-mail, all (A), header (H), body (B), subject (S), and body with subject (B+S), are used to evaluate the performance of the four machine learning algorithms. Furthermore, we also considered four cases according to whether the stemming and stopping procedures were applied.

Received: from chen2 (localhost [127.0.0.1]) by ipx.ntntc.edu.tw
    (8.12.9+Sun/8.12.9) with ESMTP id i791mh4h028241;
    Mon, 9 Aug 2004 09:48:49 +0800 (CST)
From: =?big5?b?s6+pdrfx?= <robert@ipx.ntntc.edu.tw>
To: <david@ipx.ntntc.edu.tw>
Subject: =?big5?b?soquykvku6gp+g==?=
Date: Mon, 9 Aug 2004 09:50:07 +0800
Message-ID: <000001c47db3$334b3000$2a8547cb@chen2>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----_NextPart_000_0001_01C47DF6.416E7000"
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook, Build 10.0.2627

Figure 1. The header of an e-mail
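A sketch of how the five feature combinations might be extracted from a raw message, here using Python's standard email module; the exact header fields kept and the decoding strategy are assumptions of ours, as the paper does not describe its extraction step.

```python
from email import message_from_string

def email_parts(raw_message):
    """Split a raw message into the five feature combinations A/H/B/S/B+S."""
    msg = message_from_string(raw_message)
    header = '\n'.join(f'{name}: {value}' for name, value in msg.items())
    subject = msg.get('Subject', '')
    body = ''
    for part in msg.walk():                      # handles multipart messages
        if part.get_content_type() == 'text/plain':
            payload = part.get_payload(decode=True)
            if payload:
                body += payload.decode('utf-8', errors='replace')
    return {'A': header + '\n' + body, 'H': header, 'B': body,
            'S': subject, 'B+S': body + '\n' + subject}
```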
3. Experimental results and discussion

To test the performance of the four methods described above, two corpora were used. The first corpus (Corpus I) consists of our own e-mails collected over a recent three-month period; after deleting e-mails whose messages were too short or contained no content, it comprises 1050 spam and 1057 non-spam messages. The second corpus (Corpus II) is available at www.spamassassin.org and contains 2100 spam and 2107 non-spam messages. We ran experiments with different training and test sets: the first pair of training and test sets was created by splitting each corpus at a ratio of 20:80, and the second and third pairs at 30:70 and 40:60, respectively.

In e-mail classification tasks, performance is often measured in terms of accuracy. Let $N_{legit}$ and $N_{spam}$ denote the total numbers of non-spam and spam messages, respectively, to be classified by the machine learning method, and let $n(C \mid V)$ be the number of messages belonging to category $C$ that the method classified as category $V$ (here $C, V \in \{legit, spam\}$). Accuracy is defined as

$Accuracy = \frac{\text{number of e-mails correctly categorized}}{\text{total number of e-mails}} = \frac{n(legit \mid legit) + n(spam \mid spam)}{N_{legit} + N_{spam}}$.  (5)
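Eq. (5) translates directly into code; the percent scaling matches the values reported in Tables 1 and 2.

```python
def accuracy(true_labels, predicted_labels):
    """Eq. (5): fraction of correctly categorized e-mails, in percent."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)
```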
The overall performances of the learning algorithms in the different experiments are shown in Tables 1 and 2. From the results, we observed the following phenomena.

1. Good performance of NB, TF-IDF, and SVM with header information. NB and TF-IDF performed reasonably consistently and well across the different experimental settings, while SVM performed well on every feature except the subject. It seems that the subject alone does not carry enough information for high-accuracy classification with SVM.

2. Poor performance of the K-NN method. K-NN performed the worst among all methods considered, in all cases. However, the more preprocessing is applied (i.e., stemming and stopping together), the better K-NN performs.

3. No effect of stemming, but stopping can enhance e-mail classification. Stemming did not yield any significant improvement in performance for any algorithm, though it decreased the size of the feature set. On the other hand, when the stopping procedure is employed, that is, when words that do not carry meaning in natural language are ignored, performance improves. This phenomenon is especially pronounced for the K-NN method, as shown in Figure 2.

4. Good performance with the header. For all four machine learning algorithms, performance with the header was the best. This means that much information can be derived from the header, and the fields in the header can help classify e-mails correctly.

5. Poor performance with the subject or body. The poorest performance of each algorithm occurs with the subject or the body. The reason may be that the former provides too little useful information, while the latter contains too much useless information for classifying e-mails.
From these observations, we know that although some learning algorithms can achieve satisfactory results, we may try to improve the results by combining some of them. Here, we integrate the TF-IDF and NB methods and apply the combination to the two corpora. The experimental results are shown in the last column of Tables 1 and 2; they show that accuracy can be improved by the new hybrid approach.

Figure 2. Performance of the K-NN method on all features (A, H, B, S, B+S), with and without the stopping procedure, in Corpus I (accuracy, %)

4. Conclusion

The detection of spam e-mail is an important issue in information technology, and machine learning has a central role to play in this topic. In this paper, we presented an empirical evaluation of four machine learning algorithms for spam e-mail categorization. These approaches, NB, TF-IDF, K-NN, and SVM, were applied to different parts of an e-mail in order to compare their performance. Experimental results indicate that NB, TF-IDF, and SVM yield better performance than K-NN. We also found, at least on our test corpora, that classification based on the header was more accurate than classification based on any other part of an e-mail. In addition, we combined two of the methods (TF-IDF and NB) to achieve more accurate categorization, and found that integrating different learning algorithms seems to be a promising direction.

Acknowledgements

This work was partially supported by the National Science Council, Taiwan, R.O.C., under grant NSC 92-2218-E-024-004. Comments from the anonymous referees are highly appreciated.

References

[1] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos, and P. Stamatopoulos, Learning to filter spam e-mail: A comparison of a Naïve Bayesian and a memory-based approach, Proceedings of the Workshop on Machine Learning and Textual Information Access, pp. 1-13, 2000.
[2] X. Carreras and L. Márquez, Boosting trees for anti-spam email filtering, Proceedings of the 4th Int'l Conf. on Recent Advances in Natural Language Processing, pp. 58-64, 2001.
[3] W.W. Cohen, Learning rules that classify e-mail, Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, pp. 18-25, 1996.
[4] H. Drucker, D. Wu, and V.N. Vapnik, Support vector machines for spam categorization, IEEE Trans. Neural Networks, vol. 10, no. 5, pp. 1048-1054, 1999.
[5] A. Kołcz and J. Alspector, SVM-based filtering of e-mail spam with content-specific misclassification costs, Proceedings of the TextDM'01 Workshop on Text Mining, 2001.
[6] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, A Bayesian approach to filtering junk e-mail, Learning for Text Categorization: Papers from the AAAI Workshop, pp. 55-62, 1998.
[7] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos, Stacking classifiers for anti-spam filtering of e-mail, Proceedings of the 6th Conf. on Empirical Methods in Natural Language Processing, pp. 44-50, 2001.
[8] G. Salton, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer, Addison-Wesley, 1989.
[9] K.-M. Schneider, A comparison of event models for Naïve Bayes anti-spam e-mail filtering, Proceedings of the 10th Conf. of the European Chapter of the Association for Computational Linguistics, pp. 307-314, 2003.
[10] Y. Yang, An evaluation of statistical approaches to text categorization, Journal of Information Retrieval, vol. 1, pp. 67-88, 1999.
[11] M. Woitaszek, M. Shaaban, and R.
Czernikowski, Identifying junk electronic mail in Microsoft Outlook with a support vector machine, Proceedings of the 2003 Symposium on Applications and the Internet, pp. 166-169, 2003.

Table 1. Performance (accuracy, %) of the four machine learning algorithms in Corpus I. For each feature, the four rows correspond to the four preprocessing cases, i.e., whether the stemming and stopping procedures were applied.

Feature | NB    | TFIDF | K-NN  | SVM   | TFIDF+NB
A       | 87.60 | 88.37 | 49.20 | 91.11 | 90.42
A       | 88.32 | 90.19 | 87.72 | 92.26 | 91.46
A       | 87.57 | 88.14 | 51.31 | 91.29 | 90.64
A       | 88.71 | 90.08 | 88.99 | 92.15 | 90.97
H       | 93.36 | 94.25 | 70.38 | 92.99 | 93.87
H       | 93.21 | 95.29 | 87.71 | 92.95 | 95.30
H       | 93.23 | 94.70 | 73.02 | 92.87 | 93.59
H       | 93.46 | 95.24 | 87.58 | 92.86 | 94.90
B       | 87.46 | 89.31 | 47.50 | 83.34 | 92.04
B       | 88.74 | 90.08 | 81.65 | 85.79 | 90.69
B       | 85.78 | 89.41 | 46.80 | 83.53 | 91.66
B       | 89.47 | 90.19 | 81.97 | 85.84 | 90.39
S       | 83.71 | 83.35 | 62.55 | 77.29 | 84.92
S       | 83.85 | 88.17 | 82.54 | 74.74 | 86.35
S       | 83.60 | 93.61 | 67.19 | 77.97 | 85.45
S       | 84.04 | 87.64 | 82.64 | 74.17 | 84.04
B+S     | 87.22 | 86.64 | 78.40 | 84.71 | 88.48
B+S     | 87.81 | 88.88 | 76.58 | 87.20 | 89.83
B+S     | 87.62 | 87.48 | 48.92 | 85.17 | 89.36
B+S     | 88.84 | 88.90 | 79.49 | 87.17 | 90.68
Average | 88.18 | 89.50 | 70.10 | 86.27 | 90.25

Table 2. Performance (accuracy, %) of the four machine learning algorithms in Corpus II. Rows are organized as in Table 1.

Feature | NB    | TFIDF | K-NN  | SVM   | TFIDF+NB
A       | 87.56 | 89.49 | 48.57 | 91.11 | 91.71
A       | 88.73 | 90.30 | 88.64 | 91.58 | 91.78
A       | 87.56 | 88.52 | 48.71 | 91.13 | 92.46
A       | 88.76 | 90.08 | 89.99 | 92.28 | 92.72
H       | 93.18 | 91.95 | 69.75 | 93.00 | 95.14
H       | 93.22 | 91.40 | 89.21 | 92.89 | 91.61
H       | 93.23 | 90.86 | 73.15 | 92.59 | 92.01
H       | 91.02 | 89.89 | 88.48 | 92.71 | 92.64
B       | 84.33 | 79.29 | 46.98 | 89.82 | 88.06
B       | 89.29 | 84.10 | 82.23 | 89.41 | 89.71
B       | 83.31 | 87.47 | 46.49 | 85.41 | 89.91
B       | 89.18 | 84.08 | 83.11 | 89.10 | 90.19
S       | 83.40 | 82.84 | 65.36 | 77.29 | 85.69
S       | 83.82 | 87.84 | 81.89 | 74.78 | 84.81
S       | 84.02 | 83.74 | 61.68 | 77.97 | 87.62
S       | 84.04 | 87.49 | 82.03 | 74.10 | 87.88
B+S     | 86.49 | 84.58 | 47.15 | 85.51 | 88.46
B+S     | 87.51 | 88.52 | 79.34 | 87.00 | 89.81
B+S     | 85.43 | 84.57 | 47.03 | 84.80 | 88.55
B+S     | 88.77 | 85.50 | 81.06 | 87.21 | 89.53
Average | 87.64 | 87.13 | 70.04 | 86.98 | 90.01