An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization

Chih-Chin Lai (a), Ming-Chi Tsai (b)
(a) Dept. of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 700
(b) Institute of Information Management, Shu-Te University, Kaohsiung County, Taiwan
cclai@mail.nutn.edu.tw, lravati@pchome.com.tw

Abstract

The increasing volume of unsolicited bulk e-mail (also known as spam) has generated a need for reliable anti-spam filters. Using a classifier based on machine learning techniques to automatically filter out spam e-mail has drawn many researchers' attention. In this paper, we review some of the relevant ideas and conduct a set of systematic experiments on e-mail categorization with four machine learning algorithms applied to different parts of e-mail. Experimental results reveal that the header of an e-mail provides very useful information to all of the machine learning algorithms considered for detecting spam e-mail.

Keywords: spam, e-mail categorization, machine learning

1. Introduction

In recent years, Internet e-mail has become a common and important medium of communication for almost everyone. However, spam, also known as unsolicited commercial/bulk e-mail, is a bane of e-mail communication, and there are many serious problems associated with its growing volume. Spam is not only a waste of storage space and communication bandwidth, but also a waste of the time needed to deal with it. Several solutions have been proposed to overcome the spam problem. Among them, much interest has focused on machine learning techniques for spam filtering, including rule learning [3], Naïve Bayes [1, 6], decision trees [2], support vector machines [4], and combinations of different learners [7]. The concept common to these approaches is that a classifier is used to filter out spam e-mail and that this classifier is learned from training data rather than constructed by hand, which can result in better performance [9].

From the machine learning viewpoint, spam filtering based on the textual content of e-mail can be viewed as a special case of text categorization, with the categories being spam and non-spam [5]. Sahami et al. [6] employed a Bayesian classification technique to filter junk e-mail. By making use of the extensible framework of Bayesian modeling, they could not only employ traditional document classification techniques based on the text of an e-mail, but also easily incorporate domain knowledge aimed at filtering spam e-mails. Drucker et al. [4] used support vector machines (SVMs) to classify e-mails according to their contents and compared their performance with Ripper, Rocchio, and boosted decision trees. They concluded that boosted trees and SVMs had acceptable test performance in terms of accuracy and speed, although the training time of boosted trees is inordinately long. Androutsopoulos et al. [1] extended the Naïve Bayes (NB) filter proposed by Sahami et al. [6] by investigating the effect of different numbers of features and training-set sizes on the filter's performance. They also compared the performance of NB with a memory-based approach and found that both methods clearly outperform a typical keyword-based filter.

The objective of this paper is to evaluate four machine learning algorithms for spam e-mail categorization: Naïve Bayes (NB), term frequency-inverse document frequency (TF-IDF), K-nearest neighbor (K-NN), and support vector machines (SVMs).
In addition, we study which parts of an e-mail can be exploited to improve the categorization capability. We considered the following five combinations of an e-mail message: all (A), header (H), body (B), subject (S), and body with subject (B+S). The four methods above are compared on these features to help evaluate the relative merits of the algorithms and to suggest directions for future work.

The rest of this paper is organized as follows. Section 2 gives a brief review of the four machine learning algorithms and of the e-mail features we used. Section 3 presents the experimental results designed to evaluate the performance of the different experimental settings. The conclusions are summarized in Section 4.

2. Machine learning methods and features in the e-mail

In this section, we review the machine learning algorithms in the literature that have been used for e-mail categorization (or anti-spam filtering): Naïve Bayes (NB), term frequency-inverse document frequency (TF-IDF), K-nearest neighbor (K-NN), and support vector machines (SVMs).

2.1 Naïve Bayes

The Naïve Bayes (NB) classifier is a probability-based approach. Its basic idea is to decide whether an e-mail is spam or not by looking at which words are present in the message and which are absent. In the literature, the NB classifier for spam is defined as

C_{NB} = \arg\max_{c_i \in T} P(c_i) \prod_{k} P(w_k \mid c_i),    (1)

where T is the set of target classes (spam or non-spam) and P(w_k | c_i) is the probability that word w_k occurs in an e-mail, given that the e-mail belongs to class c_i. The likelihood term is estimated as

P(w_k \mid c_i) = \frac{n_k}{N},    (2)

where n_k is the number of times word w_k occurs in e-mails of class c_i, and N is the total number of words in e-mails of class c_i.
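To make Eqs. (1)-(2) concrete, the following is a minimal sketch of the NB decision rule described above. The tokenizer, the helper names, and the use of log-probabilities with add-one smoothing are our own assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def train_nb(emails, labels):
    """Estimate P(c_i) and the word counts needed for P(w_k | c_i) = n_k / N (Eq. 2)."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}   # n_k for each word, per class
    total_words = {c: 0 for c in classes}           # N, per class
    for text, c in zip(emails, labels):
        tokens = text.lower().split()               # naive whitespace tokenizer (assumption)
        word_counts[c].update(tokens)
        total_words[c] += len(tokens)
    return prior, word_counts, total_words

def classify_nb(text, prior, word_counts, total_words):
    """Return argmax_{c in T} P(c) * prod_k P(w_k | c), as in Eq. (1)."""
    scores = {}
    for c in prior:
        # Work in log-space to avoid underflow; add-one smoothing is an extra
        # assumption, since Eq. (2) alone assigns zero probability to unseen words.
        score = math.log(prior[c])
        vocab_size = len(word_counts[c])
        for w in text.lower().split():
            score += math.log((word_counts[c][w] + 1) / (total_words[c] + vocab_size))
        scores[c] = score
    return max(scores, key=scores.get)

# Example usage
prior, wc, tw = train_nb(["buy cheap pills now", "meeting at noon tomorrow"],
                         ["spam", "legit"])
print(classify_nb("cheap pills", prior, wc, tw))    # -> 'spam'
```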
2.2 Term frequency-inverse document frequency

The most often adopted representation of a set of e-mail messages is as term-weight vectors, as used in the Vector Space Model [8]. The term weights are real numbers indicating the significance of terms for identifying a document. Based on this concept, the weight of a term in an e-mail message can be computed by tf-idf. The tf (term frequency) is the number of times that a term t appears in an e-mail; the idf (inverse document frequency) is the inverse of the frequency, in the set of e-mails, of documents that contain t. The tf-idf weighting scheme is defined as

w_{ij} = tf_{ij} \cdot \log\left(\frac{N}{df_i}\right),    (3)

where w_ij is the weight of the i-th term in the j-th e-mail, tf_ij is the number of times that the i-th term occurs in the j-th e-mail, N is the total number of e-mails in the collection, and df_i is the number of e-mails in which the i-th term occurs.

2.3 K-nearest neighbor

The most basic instance-based method is the K-nearest neighbor (K-NN) algorithm. It is a very simple method for classifying documents and shows very good performance on text categorization tasks [10]. To apply the K-NN method to classifying e-mails, the e-mails of the training set are first indexed and converted into a document vector representation. When classifying a new e-mail, the similarity between its document vector and each vector in the training set is computed; the categories of the k nearest neighbors are then determined, and the category that occurs most frequently among them is chosen.

2.4 Support vector machine

The support vector machine (SVM) has become very popular in the machine learning community because of its good generalization performance and its ability to handle high-dimensional data through kernels. Following the description given in Woitaszek et al. [11], an e-mail may be represented by a feature vector x composed of the various words from a dictionary formed by analyzing the collected e-mails. An e-mail is then classified as spam or non-spam by a simple dot product between its feature vector and the SVM model weight vector,

y = \mathbf{w} \cdot \mathbf{x} + b,    (4)

where y is the result of the classification, w is the weight vector corresponding to the features in x, and b is the bias parameter of the SVM model determined during training.
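As an illustration of the tf-idf weighting of Eq. (3) and the K-NN vote of Section 2.3, the sketch below builds weight vectors and classifies a new message by a majority vote among its k most similar training e-mails. The use of cosine similarity and the helper names are our assumptions; the paper does not state which similarity measure was used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """w_ij = tf_ij * log(N / df_i), as in Eq. (3)."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))                    # df_i: number of e-mails containing term i
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors, df, N

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, train_vectors, train_labels, df, N, k=3):
    """Choose the category occurring most frequently among the k nearest neighbors."""
    tf = Counter(query.lower().split())
    qvec = {t: tf[t] * math.log(N / df[t]) for t in tf if df[t] > 0}
    nearest = sorted(range(len(train_vectors)),
                     key=lambda i: cosine(qvec, train_vectors[i]),
                     reverse=True)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

# Example usage
train = ["win money now", "free money offer", "project meeting notes", "lunch meeting today"]
labels = ["spam", "spam", "legit", "legit"]
vecs, df, N = tfidf_vectors(train)
print(knn_classify("free money", vecs, labels, df, N, k=3))   # -> 'spam'
```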

2.5 The structure of an e-mail

In addition to the text of the message, an e-mail carries additional information in its header. The header contains many fields, for example, trace information about the hosts a message has passed through (Received:), where the sender wants replies to go (Reply-To:), a unique ID for the message (Message-ID:), the format of the content (Content-Type:), and so on. Figure 1 illustrates the header of an e-mail.

Besides comparing the categorization performance of the learning algorithms, we wanted to find out which parts of an e-mail have a critical influence on the classification results. Therefore, five feature sets of an e-mail, all (A), header (H), body (B), subject (S), and body with subject (B+S), are used to evaluate the performance of the four machine learning algorithms. Furthermore, we also considered the four cases arising from whether or not the stemming and stopping procedures were applied.

[Figure 1. The header of an e-mail, showing fields such as Received:, From:, To:, Subject:, Date:, Message-ID:, MIME-Version:, Content-Type:, X-Priority:, and X-Mailer:.]
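As an illustration of how the five feature sets (A, H, B, S, B+S) could be extracted from a raw message, the sketch below uses Python's standard email module. It reflects our own reading of the feature split, not the authors' preprocessing code; attachments and HTML parts are simply ignored here.

```python
from email import message_from_string

def email_parts(raw: str) -> dict:
    """Split a raw RFC 2822 message into the five feature sets used in the paper."""
    msg = message_from_string(raw)
    header = "\n".join(f"{name}: {value}" for name, value in msg.items())
    subject = msg.get("Subject", "")
    if msg.is_multipart():
        # Concatenate only the plain-text parts of a multipart message.
        body = "\n".join(part.get_payload()
                         for part in msg.walk()
                         if part.get_content_type() == "text/plain")
    else:
        body = msg.get_payload()
    return {
        "A": raw,                      # all: the complete e-mail
        "H": header,                   # header fields only
        "B": body,                     # body text only
        "S": subject,                  # subject line only
        "B+S": subject + "\n" + body,  # body together with subject
    }

# Example usage
raw = ("From: robert@example.com\nTo: david@example.com\n"
       "Subject: meeting\n\nSee you at noon.")
parts = email_parts(raw)
print(parts["S"])    # -> meeting
print(parts["B"])    # -> See you at noon.
```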
3. Experimental results and discussion

In order to test the performance of the four methods described above, two corpora were used. The first corpus (Corpus I) consists of our own e-mails collected over a recent three-month period; after deleting e-mails whose messages were too short or contained no content, we obtained 1050 spam and 1057 non-spam messages. The second corpus (Corpus II) is a publicly available archive containing 2100 spam and 2107 non-spam messages. We ran experiments with different training and test sets: the first pair of training and test sets was created by splitting each corpus at a ratio of 20:80, and the second and third pairs use ratios of 30:70 and 40:60, respectively.

In classification tasks, performance is often measured in terms of accuracy. Let N_legit and N_spam denote the total numbers of non-spam and spam messages, respectively, to be classified by the machine learning method, and let n(C -> V) be the number of messages belonging to category C that the method classified as category V (here C, V ∈ {legit, spam}). The accuracy is the number of e-mails correctly categorized divided by the total number of e-mails:

\mathrm{Accuracy} = \frac{n(\mathrm{legit} \to \mathrm{legit}) + n(\mathrm{spam} \to \mathrm{spam})}{N_{\mathrm{legit}} + N_{\mathrm{spam}}}.    (5)

The overall performance of the considered learning algorithms in the different experiments is shown in Tables 1 and 2. From the results, we observed the following phenomena.

1. Good performance of NB, TF-IDF, and SVM with header information. NB and TF-IDF performed reasonably consistently and well across the different experimental settings, while SVM performed well on every feature set except the subject. It seems that the subject alone does not provide enough information for high-accuracy classification with SVM.

2. Poor performance of the K-NN method. K-NN performed the worst among all considered methods in all cases. However, the more preprocessing is applied (i.e., stemming and stopping used together), the better K-NN performs.

3. No effect of stemming, but stopping can enhance classification. Stemming did not yield any significant improvement for any of the algorithms, although it decreased the size of the feature set. On the other hand, when the stopping procedure is employed, that is, when words that carry no meaning in natural language are ignored, performance improves. This effect is especially obvious for the K-NN method, as shown in Figure 2.

4. Good performance with the header. For all four machine learning algorithms, performance with the header was the best. This means that much information can be derived from the header, and the header fields help to classify e-mails correctly.

5. Poor performance with the subject or body. The poorest performance of each algorithm occurs with the subject or the body alone. A likely reason is that the former provides too little useful information, while the latter contains too much information that is useless for classifying e-mails.
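Every entry in Tables 1 and 2 reports the accuracy of Eq. (5). As a minimal reference sketch (with our own helper naming, not the authors' evaluation code), the measure amounts to the following computation:

```python
def accuracy(true_labels, predicted_labels):
    """Eq. (5): correctly categorized e-mails over all e-mails, i.e.
    (n(legit->legit) + n(spam->spam)) / (N_legit + N_spam)."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Example usage: 3 of 4 messages classified correctly -> 0.75
print(accuracy(["spam", "legit", "spam", "legit"],
               ["spam", "legit", "legit", "legit"]))
```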

From these observations we know that, although some of the learning algorithms can achieve satisfactory results, we may try to improve the results further by combining some of them. Here, we integrate the TF-IDF and NB methods and apply the combination to the two corpora. The experimental results are shown in the last column of Tables 1 and 2; they show that the accuracy can be improved by the new hybrid approach.

[Figure 2. Performance of the K-NN method on all feature sets (A, H, B, S, B+S), with and without the stopping procedure, in Corpus I. The vertical axis is accuracy (%).]

4. Conclusion

The detection of spam e-mail is an important issue for information technology, and machine learning has a central role to play in this topic. In this paper, we presented an empirical evaluation of four machine learning algorithms for spam e-mail categorization. These approaches, NB, TF-IDF, K-NN, and SVM, were applied to different parts of an e-mail in order to compare their performance. Experimental results indicate that NB, TF-IDF, and SVM yield better performance than K-NN. We also found, at least with our test corpora, that classification using the header was more accurate than classification using the other parts of an e-mail. In addition, we combined two methods (TF-IDF and NB) to pursue more accurate categorization and found that integrating different learning algorithms appears to be a promising direction.

Acknowledgements

This work was partially supported by the National Science Council, Taiwan, R.O.C., under grant NSC E. Comments from the anonymous referees are highly appreciated.

References

[1] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos, and P. Stamatopoulos, "Learning to filter spam e-mail: A comparison of a Naïve Bayesian and a memory-based approach," Proceedings of the Workshop on Machine Learning and Textual Information Access, pp. 1-13, 2000.
[2] X. Carreras and L. Márquez, "Boosting trees for anti-spam email filtering," Proceedings of the 4th Int'l Conf. on Recent Advances in Natural Language Processing, 2001.
[3] W.W. Cohen, "Learning rules that classify e-mail," Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, 1996.
[4] H. Drucker, D. Wu, and V.N. Vapnik, "Support vector machines for spam categorization," IEEE Trans. Neural Networks, vol. 10, no. 5, pp. 1048-1054, 1999.
[5] A. Kołcz and J. Alspector, "SVM-based filtering of e-mail spam with content-specific misclassification costs," Proceedings of the TextDM'01 Workshop on Text Mining, 2001.
[6] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk e-mail," Learning for Text Categorization: Papers from the AAAI Workshop, 1998.
[7] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos, "Stacking classifiers for anti-spam filtering of e-mail," Proceedings of the 6th Conf. on Empirical Methods in Natural Language Processing, 2001.
[8] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989.
[9] K.-M. Schneider, "A comparison of event models for Naïve Bayes anti-spam e-mail filtering," Proceedings of the 10th Conf. of the European Chapter of the Association for Computational Linguistics, 2003.
[10] Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol. 1, pp. 69-90, 1999.
[11] M. Woitaszek, M. Shaaban, and R. Czernikowski, "Identifying junk electronic mail in Microsoft Outlook with a support vector machine," Proceedings of the 2003 Symposium on Applications and the Internet, 2003.

Table 1. Performance of the four machine learning algorithms in Corpus I.

Table 2. Performance of the four machine learning algorithms in Corpus II.

[Each table reports accuracy (%) for NB, TF-IDF, K-NN, SVM, and the TF-IDF+NB hybrid on each feature set (A, H, B, S, and B+S) under the four combinations of the stemming and stopping preprocessing steps, together with per-algorithm averages; the numeric entries were not preserved in this transcription.]
