An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization

Chih-Chin Lai a, Ming-Chi Tsai b
a Dept. of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 700
b Institute of Information Management, Shu-Te University, Kaohsiung County, Taiwan 824
E-mail: cclai@mail.nutn.edu.tw, lravati@pchome.com.tw

Abstract

The increasing volume of unsolicited bulk e-mail (also known as spam) has generated a need for reliable anti-spam filters. Using a classifier based on machine learning techniques to automatically filter out spam e-mail has drawn many researchers' attention. In this paper, we review some of the relevant ideas and conduct a set of systematic experiments on e-mail categorization, in which four machine learning algorithms are applied to different parts of an e-mail. Experimental results reveal that the header of an e-mail provides very useful information to all of the considered machine learning algorithms for detecting spam e-mail.

Keywords: spam, e-mail categorization, machine learning

1. Introduction

In recent years, Internet e-mail has become a common and important medium of communication for almost everyone. However, spam, also known as unsolicited commercial/bulk e-mail, is a bane of e-mail communication. There are many serious problems associated with growing volumes of spam. Spam is not only a waste of storage space and communication bandwidth, but also a waste of the time required to deal with it. Several solutions have been proposed to overcome the spam problem. Among the proposed methods, much interest has focused on machine learning techniques for spam filtering. They include rule learning [3], Naïve Bayes [1, 6], decision trees [2], support vector machines [4], and combinations of different learners [7]. The basic concept common to these approaches is to use a classifier to filter out spam, where the classifier is learned from training data rather than constructed by hand.
Therefore, it can result in better performance [9]. From the machine learning viewpoint, spam filtering based on the textual content of e-mail can be viewed as a special case of text categorization, with the categories being spam and non-spam [5]. Sahami et al. [6] employed a Bayesian classification technique to filter junk e-mail. By making use of the extensible framework of Bayesian modeling, they could not only employ traditional document classification techniques based on the text of an e-mail, but also easily incorporate domain knowledge aimed at filtering spam e-mails. Drucker et al. [4] used support vector machines (SVMs) to classify e-mails according to their contents and compared their performance with Ripper, Rocchio, and boosted decision trees. They concluded that boosted trees and SVMs had acceptable test performance in terms of accuracy and speed; however, the training time of boosted trees is inordinately long. Androutsopoulos et al. [1] extended the Naïve Bayes (NB) filter proposed by Sahami et al. [6] by investigating the effect of different numbers of features and training-set sizes on the filter's performance. They also compared the performance of NB to a memory-based approach and found that both methods clearly outperform a typical keyword-based filter. The objective of this paper is to evaluate four machine learning algorithms for spam e-mail categorization: Naïve Bayes (NB), term frequency-inverse document frequency (TF-IDF), K-nearest neighbor (K-NN), and support vector machines (SVMs). In addition, we study which parts of an e-mail can be exploited to improve categorization capability. We considered
the following five parts of an e-mail message: all (A), header (H), body (B), subject (S), and body with subject (B+S). The four methods are compared on these feature sets to help evaluate their relative merits and to suggest directions for future work.

The rest of this paper is organized as follows. Section 2 gives a brief review of the four machine learning algorithms and the features we used. Section 3 presents the experimental results designed to evaluate the performance of different experimental settings. The conclusions are summarized in Section 4.

2. Machine learning methods and features in the e-mail

In this section, we review the machine learning algorithms in the literature that have been used for e-mail categorization (or anti-spam filtering): Naïve Bayes (NB), term frequency-inverse document frequency (TF-IDF), K-nearest neighbor (K-NN), and support vector machines (SVMs).

2.1 Naïve Bayes

The Naïve Bayes (NB) classifier is a probability-based approach. Its basic idea is to decide whether an e-mail is spam or not by looking at which words occur in the message and which words are absent from it. In the literature, the NB classifier for spam is defined as

    C_NB = argmax_{c_i in T} P(c_i) * prod_k P(w_k | c_i),    (1)

where T is the set of target classes (spam or non-spam), and P(w_k | c_i) is the probability that word w_k occurs in an e-mail, given that the e-mail belongs to class c_i. The likelihood term is estimated as

    P(w_k | c_i) = n_k / N,    (2)

where n_k is the number of times word w_k occurs in e-mails of class c_i, and N is the total number of words in e-mails of class c_i.

2.2 Term frequency-inverse document frequency

The most often adopted representation of a set of messages is as term weight vectors, as used in the Vector Space Model [8]. The term weights are real numbers indicating the significance of terms in identifying a document.
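The decision rule of Eqs. (1)-(2) can be sketched in a few lines of Python. This is only a toy illustration: the training data below are hypothetical, and add-one smoothing is added to avoid zero probabilities, which the estimate in Eq. (2) leaves implicit.

```python
import math
from collections import Counter

# Hypothetical toy training set: tokenized e-mails with class labels.
train = [
    (["free", "win", "money"], "spam"),
    (["win", "free", "prize"], "spam"),
    (["meeting", "schedule", "notes"], "legit"),
    (["project", "meeting", "report"], "legit"),
]

prior = Counter(label for _, label in train)   # class counts for P(c_i)
words = {c: Counter() for c in prior}          # per-class word counts n_k
for tokens, label in train:
    words[label].update(tokens)
vocab = len({w for cnt in words.values() for w in cnt})

def classify(tokens):
    """Return argmax_{c in T} P(c) * prod_k P(w_k | c), computed in log space."""
    best, best_lp = None, -math.inf
    for c in prior:
        n_total = sum(words[c].values())       # N: total words in class c
        lp = math.log(prior[c] / len(train))   # log P(c_i)
        for w in tokens:
            # Eq. (2), with add-one smoothing (not part of the paper's estimate).
            lp += math.log((words[c][w] + 1) / (n_total + vocab))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

On this toy data, `classify(["free", "prize"])` returns "spam" and `classify(["meeting", "report"])` returns "legit".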
Based on this concept, the weight of a term in an e-mail message can be computed by tf-idf. The tf (term frequency) indicates the number of times that a term t appears in an e-mail. The idf (inverse document frequency) is the inverse of the frequency of documents, in the set of e-mails, that contain t. The tf-idf weighting scheme is defined as

    w_ij = tf_ij * log(N / df_i),    (3)

where w_ij is the weight of the ith term in the jth e-mail, tf_ij is the number of times that the ith term occurs in the jth e-mail, N is the total number of e-mails in the collection, and df_i is the number of e-mails in which the ith term occurs.

2.3 K-nearest neighbor

The most basic instance-based method is the K-nearest neighbor (K-NN) algorithm. It is a very simple method for classifying documents and has shown very good performance on text categorization tasks [10]. To apply the K-NN method to e-mail classification, the e-mails of the training set have to be indexed and converted into a document vector representation. When classifying a new e-mail, the similarity between its document vector and each vector in the training set is computed. Then the categories of the k nearest neighbors are determined, and the category that occurs most frequently is chosen.

2.4 Support vector machine

The support vector machine (SVM) has become very popular in the machine learning community because of its good generalization performance and its ability to handle high-dimensional data by using kernels. According to the description given by Woitaszek et al. [11], an e-mail may be represented by a feature vector x composed of the various words from a dictionary formed by analyzing the collected e-mails.
Thus, an e-mail is classified as spam or non-spam by computing a simple dot product between the features of the e-mail and the SVM model weight vector,

    y = w · x + b,    (4)

where y is the classification result, w is the weight vector whose entries correspond to those of the feature vector x, and b is the bias parameter of the SVM model, determined by the training process.

2.5 The structure of an e-mail
In addition to the text of the message, an e-mail carries further information in its header. The header contains many fields, for example, trace information about the hosts a message has passed through (Received:), where the sender wants replies to go (Reply-To:), the unique ID of the message (Message-ID:), the format of the content (Content-Type:), etc. Figure 1 illustrates the header of an e-mail. Besides comparing the categorization performance of the learning algorithms, we wanted to find out which parts of an e-mail have a critical influence on the classification results. Therefore, five feature sets of an e-mail, namely all (A), header (H), body (B), subject (S), and body with subject (B+S), are used to evaluate the performance of the four machine learning algorithms. Furthermore, we also considered four cases according to whether the stemming and stopping procedures were applied.

Received: from chen2 (localhost [127.0.0.1]) by ipx.ntntc.edu.tw (8.12.9+Sun/8.12.9) with ESMTP id i791mh4h028241; Mon, 9 Aug 2004 09:48:49 +0800 (CST)
From: =?big5?b?s6+pdrfx?= <robert@ipx.ntntc.edu.tw>
To: <david@ipx.ntntc.edu.tw>
Subject: =?big5?b?soquykvku6gp+g==?=
Date: Mon, 9 Aug 2004 09:50:07 +0800
Message-ID: <000001c47db3$334b3000$2a8547cb@chen2>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----_NextPart_000_0001_01C47DF6.416E7000"
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook, Build 10.0.2627

Figure 1. The header of an e-mail

3. Experimental results and discussion

In order to test the performance of the above-mentioned four methods, two corpora were used. The first corpus (Corpus I) consists of our own e-mails collected over a recent three-month period. For the experiments, we deleted e-mails whose messages were too short or contained no content, obtaining 1050 spam and 1057 non-spam messages. The second corpus (Corpus II) is available at www.spamassassin.org. This archive contains 2100 spam and 2107 non-spam messages.
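The five feature sets (A, H, B, S, B+S) can be extracted from a raw message with Python's standard email module. A minimal sketch follows; the message text is a made-up example, not one of the corpus messages.

```python
from email import message_from_string

# Hypothetical raw message: header fields, a blank line, then the body.
RAW = """From: alice@example.com
To: bob@example.com
Subject: hello
Content-Type: text/plain

win a free prize now
"""

msg = message_from_string(RAW)

header = "\n".join(f"{k}: {v}" for k, v in msg.items())  # all header fields
subject = msg.get("Subject", "")
body = msg.get_payload()                                  # text after the headers

features = {
    "A": header + "\n" + body,     # the entire message
    "H": header,                   # header only
    "B": body,                     # body only
    "S": subject,                  # subject only
    "B+S": subject + "\n" + body,  # body with subject
}
```

Each of the five strings would then be tokenized (with optional stemming and stopping) before being handed to a classifier.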
We ran experiments with different training and test sets. The first pair of training and test sets was created by splitting each corpus at a ratio of 20:80; the second and third pairs use ratios of 30:70 and 40:60, respectively. In e-mail classification tasks, performance is often measured in terms of accuracy. Let N_legit and N_spam denote the total numbers of non-spam and spam messages, respectively, to be classified by the machine learning method, and n(C -> V) the number of messages belonging to category C that the method classified as category V (here C, V in {legit, spam}). The accuracy is defined by the following formula:

    Accuracy = (number of e-mails correctly categorized) / (total number of e-mails)
             = (n(legit -> legit) + n(spam -> spam)) / (N_legit + N_spam).    (5)

The overall performance of the considered learning algorithms in the different experiments is shown in Tables 1 and 2. From the results, we make the following observations.

1. Good performance of NB, TF-IDF, and SVM with header information. NB and TF-IDF performed consistently well across the different experimental settings. SVM also performed well, except with the subject feature; it seems that the subject alone does not carry enough information for high-accuracy classification with SVM.

2. Poor performance of the K-NN method. K-NN performed the worst among all considered methods in all cases. However, the more preprocessing is applied (i.e., stemming and stopping together), the better K-NN performs.

3. No effect of stemming, but stopping can enhance e-mail classification. Stemming did not yield any significant performance improvement for any algorithm, though it decreased the size of the feature set. On the other hand, when the stopping procedure is employed, that is, when words that do not carry meaning in natural language are ignored, we obtain better performance. This phenomenon is especially obvious for the K-NN method, as shown in Figure 2.

4. Good performance with the header.
For all four machine learning algorithms, performance with the header was the best. This means that much useful information can be derived from the header, and that the fields in the header help classify e-mails correctly.

5. Poor performance with the subject or body. The poorest performance of each algorithm occurs with the subject or the body alone. The reason may be that the former provides too little useful information, while the latter contains too much useless information for classifying e-mails.

From these observations, we know that although some learning algorithms can achieve satisfactory results, we
may try to improve the results by combining some of them. Here, we integrate the TF-IDF and NB methods and apply the combination to the two corpora. The experimental results are shown in the last column of Tables 1 and 2; they show that accuracy can be improved by this hybrid approach.

Figure 2. Performance of the K-NN method on all features, with and without the stopping procedure, in Corpus I

4. Conclusion

The detection of spam e-mail is an important issue in information technology, and machine learning has a central role to play in this topic. In this paper, we presented an empirical evaluation of four machine learning algorithms for spam e-mail categorization. These approaches, NB, TF-IDF, K-NN, and SVM, were applied to different parts of an e-mail in order to compare their performance. Experimental results indicate that NB, TF-IDF, and SVM yield better performance than K-NN. We also found, at least with our test corpora, that classification with the header was more accurate than with other parts of an e-mail. In addition, we combined two methods (TF-IDF and NB) to achieve the most accurate categorization; integrating different learning algorithms appears to be a promising direction.

Acknowledgements

This work was partially supported by the National Science Council, Taiwan, R.O.C., under grant NSC 92-2218-E-024-004. Comments from anonymous referees are highly appreciated.

References

[1] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos, and P. Stamatopoulos, "Learning to filter spam e-mail: A comparison of a Naïve Bayesian and a memory-based approach," Proceedings of the Workshop on Machine Learning and Textual Information Access, pp. 1-13, 2000.
[2] X. Carreras and L. Márquez, "Boosting trees for anti-spam email filtering," Proceedings of the 4th Int'l Conf. on Recent Advances in Natural Language Processing, pp. 58-64, 2001.
[3] W.W. Cohen, "Learning rules that classify e-mail," Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, pp. 18-25, 1996.
[4] H. Drucker, D. Wu, and V.N. Vapnik, "Support vector machines for spam categorization," IEEE Trans. Neural Networks, vol. 10, no. 5, pp. 1048-1054, 1999.
[5] A. Kołcz and J. Alspector, "SVM-based filtering of e-mail spam with content-specific misclassification costs," Proceedings of the TextDM'01 Workshop on Text Mining, 2001.
[6] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk e-mail," Learning for Text Categorization: Papers from the AAAI Workshop, pp. 55-62, 1998.
[7] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos, "Stacking classifiers for anti-spam filtering of e-mail," Proceedings of the 6th Conf. on Empirical Methods in Natural Language Processing, pp. 44-50, 2001.
[8] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989.
[9] K.-M. Schneider, "A comparison of event models for Naïve Bayes anti-spam e-mail filtering," Proceedings of the 10th Conf. of the European Chapter of the Association for Computational Linguistics, pp. 307-314, 2003.
[10] Y. Yang, "An evaluation of statistical approaches to text categorization," Journal of Information Retrieval, vol. 1, pp. 67-88, 1999.
[11] M. Woitaszek, M. Shaaban, and R. Czernikowski, "Identifying junk electronic mail in Microsoft Outlook with a support vector machine," Proceedings of the 2003 Symposium on Applications and the Internet, pp. 166-169, 2003.
Table 1. Performance (accuracy %) of four machine learning algorithms in Corpus I

Feature  Stemming  Stopping  NB     TF-IDF  K-NN   SVM    TFIDF+NB
A        no        no        87.60  88.37   49.20  91.11  90.42
A        no        yes       88.32  90.19   87.72  92.26  91.46
A        yes       no        87.57  88.14   51.31  91.29  90.64
A        yes       yes       88.71  90.08   88.99  92.15  90.97
H        no        no        93.36  94.25   70.38  92.99  93.87
H        no        yes       93.21  95.29   87.71  92.95  95.30
H        yes       no        93.23  94.70   73.02  92.87  93.59
H        yes       yes       93.46  95.24   87.58  92.86  94.90
B        no        no        87.46  89.31   47.50  83.34  92.04
B        no        yes       88.74  90.08   81.65  85.79  90.69
B        yes       no        85.78  89.41   46.80  83.53  91.66
B        yes       yes       89.47  90.19   81.97  85.84  90.39
S        no        no        83.71  83.35   62.55  77.29  84.92
S        no        yes       83.85  88.17   82.54  74.74  86.35
S        yes       no        83.60  93.61   67.19  77.97  85.45
S        yes       yes       84.04  87.64   82.64  74.17  84.04
B+S      no        no        87.22  86.64   78.40  84.71  88.48
B+S      no        yes       87.81  88.88   76.58  87.20  89.83
B+S      yes       no        87.62  87.48   48.92  85.17  89.36
B+S      yes       yes       88.84  88.90   79.49  87.17  90.68
Average                      88.18  89.50   70.10  86.27  90.25

Table 2. Performance (accuracy %) of four machine learning algorithms in Corpus II

Feature  Stemming  Stopping  NB     TF-IDF  K-NN   SVM    TFIDF+NB
A        no        no        87.56  89.49   48.57  91.11  91.71
A        no        yes       88.73  90.30   88.64  91.58  91.78
A        yes       no        87.56  88.52   48.71  91.13  92.46
A        yes       yes       88.76  90.08   89.99  92.28  92.72
H        no        no        93.18  91.95   69.75  93.00  95.14
H        no        yes       93.22  91.40   89.21  92.89  91.61
H        yes       no        93.23  90.86   73.15  92.59  92.01
H        yes       yes       91.02  89.89   88.48  92.71  92.64
B        no        no        84.33  79.29   46.98  89.82  88.06
B        no        yes       89.29  84.10   82.23  89.41  89.71
B        yes       no        83.31  87.47   46.49  85.41  89.91
B        yes       yes       89.18  84.08   83.11  89.10  90.19
S        no        no        83.40  82.84   65.36  77.29  85.69
S        no        yes       83.82  87.84   81.89  74.78  84.81
S        yes       no        84.02  83.74   61.68  77.97  87.62
S        yes       yes       84.04  87.49   82.03  74.10  87.88
B+S      no        no        86.49  84.58   47.15  85.51  88.46
B+S      no        yes       87.51  88.52   79.34  87.00  89.81
B+S      yes       no        85.43  84.57   47.03  84.80  88.55
B+S      yes       yes       88.77  85.50   81.06  87.21  89.53
Average                      87.64  87.13   70.04  86.98  90.01
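For completeness, the tf-idf weighting of Eq. (3) can be illustrated with a small worked example. The toy collection below is hypothetical; it is not drawn from either corpus.

```python
import math

# Hypothetical toy collection of tokenized e-mails.
docs = [
    ["free", "money", "free"],
    ["meeting", "schedule"],
    ["free", "meeting"],
]

N = len(docs)  # total number of e-mails in the collection

def tfidf(term, doc):
    """w_ij = tf_ij * log(N / df_i), following Eq. (3)."""
    tf = doc.count(term)               # tf_ij: occurrences of the term in this e-mail
    df = sum(term in d for d in docs)  # df_i: e-mails containing the term
    return tf * math.log(N / df)

# "free" occurs twice in docs[0] and appears in 2 of the 3 e-mails,
# so its weight there is 2 * log(3/2).
w = tfidf("free", docs[0])
```

A term that appears in every e-mail gets idf = log(N/N) = 0 and therefore zero weight, which is why stop-word-like terms contribute little under this scheme.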