Probabilistic Anti-Spam Filtering with Dimensionality Reduction


ABSTRACT

One of the biggest problems of e-mail communication is the massive delivery of spam messages. Every day, billions of unwanted messages are sent by spammers, and this number does not stop growing. Helpfully, there are different approaches able to automatically detect and remove most of these messages, and the best-known ones are based on Bayesian decision theory. However, many machine learning techniques applied to text categorization share the same difficulty: the high dimensionality of the feature space. Many term selection methods have been proposed in the literature. Nevertheless, it is still unclear how the performance of naive Bayes anti-spam filters depends on the methods applied for reducing the dimensionality of the feature space. In this paper, we compare the performance of the most popular term selection techniques combined with variations of the original naive Bayes anti-spam filter.

Categories and Subject Descriptors

I.5 [Pattern Recognition]: Applications; I.2.7 [Artificial Intelligence]: Natural Language Processing - Text analysis

General Terms

Anti-spam Filtering

Keywords

Term selection technique, naive Bayes anti-spam filter, dimensionality reduction

1. INTRODUCTION

The term spam is generally used to denote unsolicited commercial e-mail. Spam messages are annoying to most users because they clutter their mailboxes. This problem can be quantified in economic terms, since many hours are wasted every day by workers: not just the time they spend reading spam, but also the time they spend deleting those messages. According to a report published by McAfee in March 2009 (see footnote 1), the cost in lost productivity is approximately $0.50 per user per day, assuming that users spend 30 seconds dealing with two spam messages each day and that their spam filter works at 95 percent accuracy (the average accuracy achieved by the majority of available anti-spam filters). Therefore, the productivity loss per employee per year due to spam is approximately $182.50. A company with 1,000 workers earning $30 per hour would thus suffer $182,500 per year in lost productivity. This works out to more than $41,000 per 1 percent of spam allowed into the company.

Many methods have been proposed to automatically classify messages [3]. However, among all proposed techniques, machine learning algorithms have achieved the most success [1, 3]. These methods include approaches that are considered top performers in text categorization, such as Boosting, Support Vector Machines (SVM) and naive Bayes classifiers [1, 5]. The latter currently appear to be very popular in commercial and open-source spam filters. This is probably due to their simplicity, which makes them easy to implement, their linear computational complexity, and their accuracy, which in spam filtering is comparable to that of more elaborate learning algorithms [5]. A major difficulty in dealing with text categorization problems using approaches based on Bayesian probability is the high dimensionality of the feature space [1].
The native feature space consists of the unique terms (characters or words) that occur in e-mail messages, which can amount to tens or hundreds of thousands of terms even for a moderate-sized e-mail collection. This is prohibitively high for most learning algorithms (exceptions are k-nearest neighbors and SVM). Hence, it is highly desirable to reduce the native space without sacrificing categorization accuracy, and it is also desirable to achieve such a goal automatically [4, 8].

In this paper, we present a comparative study of the five most used term selection techniques combined with four variants of the original naive Bayes algorithm for anti-spam filtering, in order to examine how the term selection techniques affect the categorization accuracy of different anti-spam filters based on Bayesian decision theory.

The remainder of this paper is organized as follows: Section 2 presents details of the term selection techniques. The naive Bayes anti-spam filters are described in Section 3. Section 4 presents the performance measurements used for comparing the achieved results. Section 5 describes the methodology we employ in our experiments. Experimental results are shown in Section 6. Finally, Section 7 offers conclusions and future work.

Footnote 1: See http://www.mcafee.com/us/local_content/reports/mar_spam_report.pdf

2. DIMENSIONALITY REDUCTION

In text categorization, the high dimensionality of the term space T may be problematic. In fact, many algorithms used for classification cannot scale to high values of |T|. As a consequence, a pass of dimensionality reduction is often applied before classification, whose effect is to reduce the size of the vector space from |T| to |T'| << |T|; the set T' is called the reduced term set [7].

Techniques for term selection attempt to select, from the original set T, the subset T' of terms (with |T'| << |T|) that, when used for document indexing, yields the highest effectiveness. For selecting the best terms, we have to use a function that selects and ranks terms according to how good they are. A computationally easy alternative is to keep the |T'| << |T| terms that receive the highest score according to a function that measures the importance of the term for the text categorization task.

2.1 Representation

Assuming that each message m is composed of a set of terms m = {t_1, ..., t_n}, where each term t_k corresponds to a word ("adult", for example), a set of words ("to be removed") or a single character ("$"), we can represent each message by a vector x = (x_1, ..., x_n), where x_1, ..., x_n are the values of the attributes X_1, ..., X_n associated with the terms t_1, ..., t_n. In the simplest case, each term represents a single word and all attributes are Boolean: X_i = 1 if the message contains t_i, or X_i = 0 otherwise. Alternatively, attributes may be integer values computed as term frequencies (TF), representing how many times each term occurs in the message. A third alternative is to associate each attribute X_i with a normalized TF, x_i = t_i(m) / |m|, where t_i(m) is the number of occurrences of the term represented by X_i in m, and |m| is the length of m measured in term occurrences. The normalized TF takes into account the term repetition versus the size of the message [1].

2.2 Term selection techniques

In the following, we describe the five most used Term Space Reduction (TSR) techniques. Probabilities are interpreted on an event space of documents (for example, P(\bar{t}_k, c_i) denotes the probability that, for a random message m, term t_k does not occur in m and m belongs to category c_i) and are estimated by counting occurrences in the training set Tr. Since there are only two categories, spam (c_s) and legitimate (c_l), some functions are specified locally to a specific category; in order to assess the value of a term t_k in a global, category-independent sense, either the sum f_sum(t_k) = \sum_i f(t_k, c_i), the weighted sum f_wsum(t_k) = \sum_i P(c_i) f(t_k, c_i), or the maximum f_max(t_k) = \max_i f(t_k, c_i) of their category-specific values f(t_k, c_i) is computed.

2.2.1 Document frequency (DF)

DF is given by the frequency of messages containing a term t_k in the training set Tr:

DF(t_k) = \frac{|Tr_{t_k}|}{|Tr|},

where |Tr_{t_k}| represents the number of messages containing the term t_k in the training set Tr and |Tr| is the number of available messages [8].

2.2.2 Information gain (IG)

IG is frequently employed as a term-goodness criterion in the field of machine learning [6]. It measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a message [4]. The IG of a term t_k is computed according to

IG(t_k) = \sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t \in \{t_k, \bar{t}_k\}} P(t, c) \log \frac{P(t, c)}{P(t) P(c)}.

2.2.3 Mutual information (MI)

MI is a criterion commonly used in statistical language modeling of word associations and related applications [8]. The mutual information between t_k and c_i is defined as

MI(t_k, c_i) = \log \frac{P(t_k, c_i)}{P(t_k) P(c_i)}.

MI(t_k, c_i) has a natural value of zero if t_k and c_i are independent. To measure the goodness of a term for global feature selection, we can combine the category-specific scores of a term in three alternative ways: f_sum, f_wsum or f_max.
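As an illustration of the ranking step, the sketch below scores each term of a Boolean-represented training set with DF, IG and the f_max-combined MI, and then keeps the top-ranked fraction of the vocabulary. It is a minimal sketch under assumed data structures (lists of token sets with "spam"/"legitimate" labels), not the tooling used in our experiments.

```python
import math
from collections import Counter

def term_scores(messages, labels, classes=("spam", "legitimate")):
    """Score terms with DF, IG and MI (combined with f_max) from Boolean occurrences."""
    n = len(messages)
    n_c = Counter(labels)        # |Tr_ci|: documents per category
    n_t = Counter()              # |Tr_tk|: documents containing t_k
    n_tc = Counter()             # documents containing t_k that belong to c_i
    for terms, c in zip(messages, labels):
        for t in set(terms):
            n_t[t] += 1
            n_tc[(t, c)] += 1
    scores = {}
    for t in n_t:
        p_t = n_t[t] / n
        ig, mi_max = 0.0, float("-inf")
        for c in classes:
            p_c = n_c[c] / n
            # IG sums P(t, c) * log(P(t, c) / (P(t) P(c))) over presence/absence
            for p_term, p_joint in ((p_t, n_tc[(t, c)] / n),
                                    (1.0 - p_t, (n_c[c] - n_tc[(t, c)]) / n)):
                if p_joint > 0 and p_term > 0:
                    ig += p_joint * math.log(p_joint / (p_term * p_c))
            # MI(t_k, c_i) = log(P(t_k, c_i) / (P(t_k) P(c_i))), combined with f_max
            if n_tc[(t, c)] > 0:
                mi_max = max(mi_max, math.log((n_tc[(t, c)] / n) / (p_t * p_c)))
        scores[t] = {"DF": p_t, "IG": ig, "MI_max": mi_max}
    return scores

def select_terms(scores, metric="IG", fraction=0.4):
    """Keep the top-ranked fraction of the vocabulary according to one metric."""
    ranked = sorted(scores, key=lambda t: scores[t][metric], reverse=True)
    return set(ranked[: max(1, int(fraction * len(ranked)))])
```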
2.2.4 χ² statistic

The χ² statistic measures the lack of independence between the term t_k and the class c_i. It can be compared to the χ² distribution with one degree of freedom to judge extremeness. The χ² statistic has a natural value of zero if t_k and c_i are independent. We can calculate the χ² statistic for the term t_k in the class c_i by

\chi^2(t_k, c_i) = \frac{|Tr| \, [P(t_k, c_i) P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) P(\bar{t}_k, c_i)]^2}{P(t_k) P(\bar{t}_k) P(c_i) P(\bar{c}_i)}.

2.2.5 Odds ratio (OR)

OR measures the ratio between the odds of the term appearing in a relevant document and the odds of it appearing in a non-relevant one. The odds ratio between t_k and c_i is given by

OR(t_k, c_i) = \frac{P(t_k | c_i) (1 - P(t_k | \bar{c}_i))}{(1 - P(t_k | c_i)) P(t_k | \bar{c}_i)}.

An OR of 1 indicates that term t_k is equally likely in both classes c_i and \bar{c}_i. An OR greater than 1 indicates that t_k is more likely in c_i. On the other hand, an OR less than 1 indicates that t_k is less likely in c_i. As with MI, to measure the goodness of a term for global feature selection we can combine the category-specific scores using the functions f_sum, f_wsum or f_max.

3. NAIVE BAYES ANTI-SPAM FILTERS

Naive Bayes anti-spam filtering (NB) has become the most popular mechanism to distinguish spam messages from legitimate e-mail [5]. From Bayes' theorem and the theorem of total probability, the probability that a message with vector x = (x_1, ..., x_n) belongs to a category c_i in {c_s, c_l} is

P(c_i | x) = \frac{P(c_i) P(x | c_i)}{P(x)}.

Since the denominator does not depend on the category, the filter classifies each message into the category that maximizes P(c_i) P(x | c_i). In spam filtering, this is equivalent to classifying a message as spam (c_s) whenever

\frac{P(c_s) P(x | c_s)}{P(c_s) P(x | c_s) + P(c_l) P(x | c_l)} > T,

with T = 0.5. By varying T, we can opt for more true negatives at the expense of fewer true positives, or vice versa. The a priori probabilities P(c_i) can be estimated as the frequency of occurrence of documents belonging to the category c_i in the training set Tr. On the other hand, P(x | c_i) is practically impossible to estimate directly, because we would need the training set Tr to contain messages identical to the one we want to classify. However, the NB classifier makes the simplifying assumption that the terms in a message are conditionally independent and that the order in which they appear is irrelevant. The probabilities P(x | c_i) are estimated differently in each NB version [5]. Several studies have found the NB classifier to be surprisingly effective in the spam filtering task [1, 5], despite the fact that its independence assumption is usually oversimplistic.
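As a toy illustration of this decision criterion (the priors and likelihoods below are invented numbers, not values from our experiments), the following snippet flags a message as spam whenever the normalized spam score exceeds the threshold T, and shows how raising T trades caught spam for safer treatment of legitimate mail.

```python
def is_spam(prior_spam, lik_spam, prior_legit, lik_legit, threshold=0.5):
    """True when P(c_s)P(x|c_s) / (P(c_s)P(x|c_s) + P(c_l)P(x|c_l)) > T."""
    spam_mass = prior_spam * lik_spam      # P(c_s) * P(x | c_s)
    legit_mass = prior_legit * lik_legit   # P(c_l) * P(x | c_l)
    return spam_mass / (spam_mass + legit_mass) > threshold

# The spam score here is 8/14 ~ 0.57: the message is flagged with the default
# T = 0.5 but kept with a more conservative T = 0.8.
print(is_spam(0.4, 2e-9, 0.6, 1e-9))                 # True
print(is_spam(0.4, 2e-9, 0.6, 1e-9, threshold=0.8))  # False
```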

In the following, we describe four well-known versions of the NB anti-spam filter available in the literature.

3.1 Multinomial term frequency naive Bayes

The multinomial term frequency NB (MN TF NB) represents each message as a set of terms m = {t_1, ..., t_n}, counting for each t_k how many times it appears in m. In this sense, m can be represented by a vector x = (x_1, ..., x_n), where each x_k corresponds to the number of occurrences of t_k in m. Moreover, each message m of category c_i can be interpreted as the result of independently picking |m| terms from T' with replacement and probability P(t_k | c_i) for each t_k. Hence, P(x | c_i) is the multinomial distribution:

P(x | c_i) = P(|m|) \, |m|! \prod_{k=1}^{n} \frac{P(t_k | c_i)^{x_k}}{x_k!}.

The criterion for classifying a message as spam becomes

\frac{P(c_s) \prod_{k=1}^{n} P(t_k | c_s)^{x_k}}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \prod_{k=1}^{n} P(t_k | c_i)^{x_k}} > T,

and the probabilities P(t_k | c_i) are estimated with a Laplacian prior,

P(t_k | c_i) = \frac{1 + N_{t_k, c_i}}{n + N_{c_i}},

where N_{t_k, c_i} is the number of occurrences of term t_k in the training messages of category c_i, and N_{c_i} = \sum_{k=1}^{n} N_{t_k, c_i}.

3.2 Multinomial Boolean naive Bayes

The multinomial Boolean NB (MN Boolean NB) is similar to the MN TF NB, including the estimates of P(t_k | c_i), except that each attribute x_k is Boolean.

3.3 Multivariate Bernoulli naive Bayes

Let T' = {t_1, ..., t_n} be the resulting set of terms after term selection. The multivariate Bernoulli NB (MV Bernoulli NB) represents each message m as a set of terms by recording only the presence or absence of each term. Therefore, m can be represented as a binary vector x = (x_1, ..., x_n), where each x_k shows whether or not t_k occurs in m. The probabilities P(x | c_i) are computed by [5]

P(x | c_i) = \prod_{k=1}^{n} P(t_k | c_i)^{x_k} (1 - P(t_k | c_i))^{1 - x_k}.

The criterion for classifying a message as spam becomes

\frac{P(c_s) \prod_{k=1}^{n} P(t_k | c_s)^{x_k} (1 - P(t_k | c_s))^{1 - x_k}}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \prod_{k=1}^{n} P(t_k | c_i)^{x_k} (1 - P(t_k | c_i))^{1 - x_k}} > T,

where the probabilities P(t_k | c_i) are estimated with a Laplacian prior,

P(t_k | c_i) = \frac{1 + |Tr_{t_k, c_i}|}{2 + |Tr_{c_i}|},

where |Tr_{t_k, c_i}| is the number of training messages of category c_i that contain the term t_k, and |Tr_{c_i}| is the total number of training messages of category c_i.

3.4 Flexible Bayes

Flexible Bayes (FB) represents the probabilities P(x_k | c_i) as the average of L_{k,c_i} normal distributions with different mean values but the same standard deviation \sigma_{c_i}:

P(x_k | c_i) = \frac{1}{L_{k,c_i}} \sum_{l=1}^{L_{k,c_i}} g(x_k; \mu_{k,c_i,l}, \sigma_{c_i}),

where L_{k,c_i} is the number of different values that the attribute X_k takes in the training set Tr of category c_i. Each of these values is used as the mean \mu_{k,c_i,l} of a normal distribution of the category c_i. However, all distributions of a category c_i are taken to have the same \sigma_{c_i} = \frac{1}{\sqrt{|Tr_{c_i}|}} [1].

4. PERFORMANCE MEASUREMENTS

Let S and L be the sets of spam and legitimate messages, respectively. The possible prediction results are: true positives (TP), the set of spam messages correctly classified; true negatives (TN), the set of legitimate messages correctly classified; false negatives (FN), the set of spam messages incorrectly classified as legitimate; and false positives (FP), the set of legitimate messages incorrectly classified as spam. Some well-known evaluation measurements are: true positive rate (TPr), true negative rate (TNr), spam precision (Spr), legitimate precision (Lpr), ROC curves [3], precision-recall [5], accuracy rate (Acc), and Total Cost Ratio (TCR) [1]. TCR was first proposed by [1] and offers an indication of the improvement provided by the filter. A greater TCR indicates better performance, and for TCR < 1, not using the filter is better. However, we opt to use the Matthews Correlation Coefficient [2]:

MCC = \frac{|TP| \, |TN| - |FP| \, |FN|}{\sqrt{(|TP| + |FP|)(|TP| + |FN|)(|TN| + |FP|)(|TN| + |FN|)}},

since it provides more information than TCR. It returns a real value between -1 and +1. A coefficient equal to +1 indicates a perfect prediction; 0, an average random prediction; and -1, an inverse prediction. According to [2], MCC is one of the best measures, since it often provides a more balanced evaluation of the prediction than other measures such as the proportion of correct predictions (accuracy), especially if the two classes are of very different sizes.
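To connect Sections 3 and 4, the sketch below trains the MV Bernoulli NB variant with the Laplacian estimates given above, classifies with the threshold rule of Section 3, and provides the MCC of Section 4 for scoring the predictions. It is a minimal sketch under assumed inputs (token lists labeled "spam" or "legitimate" and a vocabulary set produced by a term selection technique), not the exact implementation used in our experiments.

```python
import math
from collections import defaultdict

def train_mv_bernoulli(messages, labels, vocabulary):
    """Estimate P(c_i) and the Laplacian P(t_k|c_i) = (1 + |Tr_tk,ci|) / (2 + |Tr_ci|)."""
    n_docs = defaultdict(int)                          # |Tr_ci|
    doc_freq = defaultdict(lambda: defaultdict(int))   # |Tr_tk,ci|
    for terms, c in zip(messages, labels):
        n_docs[c] += 1
        for t in set(terms) & vocabulary:
            doc_freq[c][t] += 1
    priors = {c: n_docs[c] / len(messages) for c in n_docs}
    cond = {c: {t: (1 + doc_freq[c][t]) / (2 + n_docs[c]) for t in vocabulary}
            for c in n_docs}
    return priors, cond

def classify_mv_bernoulli(terms, priors, cond, vocabulary, threshold=0.5):
    """Classify as spam when the normalized spam score exceeds the threshold T."""
    present = set(terms) & vocabulary
    log_score = {}
    for c in priors:
        s = math.log(priors[c])
        for t in vocabulary:                 # absent terms also contribute
            p = cond[c][t]
            s += math.log(p if t in present else 1.0 - p)
        log_score[c] = s
    top = max(log_score.values())            # rescale to avoid underflow
    norm = {c: math.exp(v - top) for c, v in log_score.items()}
    return "spam" if norm["spam"] / sum(norm.values()) > threshold else "legitimate"

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient as defined in Section 4."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```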
5. METHODOLOGY

We use the six well-known Enron corpora (E1-E6) [5] in our experiments. In order to provide an aggressive dimensionality reduction, we sort the messages according to their arrival date and perform the training stage using the first 90% of the received messages. The remaining ones are set aside for classification. After collecting all terms of the training messages, we remove all irrelevant terms. In this case, we consider irrelevant all terms that appear in fewer than five training messages (less than 0.1% of each corpus). Once the training stage has finished, we apply a term selection technique to reduce the dimensionality of the term space. In order to provide a complete evaluation, we vary the number of terms from 10% to 100% of all terms retained in the preprocessing stage. Next, we classify the test messages using the naive Bayes anti-spam filters presented in Section 3. We perform all possible combinations of NB anti-spam filters and term selection methods.

The first studies on naive Bayes anti-spam filters employed an unbalanced classification using cost-sensitive learning, varying the value of the threshold T depending on the usage scenario. However, recent works have shown that the choice of T is very difficult because it can vary for each user. The same works indicate that setting T = 0.5 seems reasonable and generally offers a good average prediction [5].
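The preprocessing protocol described at the beginning of this section can be sketched as follows (the function names and the (date, terms, label) tuple format are illustrative assumptions): messages are ordered by arrival date, the oldest 90% are used for training, and terms occurring in fewer than five training messages are discarded before term selection.

```python
from collections import Counter

def chronological_split(messages, train_ratio=0.9):
    """messages: (arrival_date, terms, label) tuples; train on the oldest 90%."""
    ordered = sorted(messages, key=lambda m: m[0])
    cut = int(train_ratio * len(ordered))
    return ordered[:cut], ordered[cut:]

def retain_frequent_terms(train_messages, min_doc_count=5):
    """Drop terms that occur in fewer than five training messages."""
    doc_count = Counter(t for _, terms, _ in train_messages for t in set(terms))
    return {t for t, count in doc_count.items() if count >= min_doc_count}
```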

6. EXPERIMENTAL RESULTS

In the following, we present the results achieved for each corpus. In the remainder of this section, consider the following abbreviations: MN Boolean NB as MN Bool, MN term frequency NB as MN TF, MV Bernoulli NB as Bern, and flexible Bayes as FB. Due to space limitations, we present the best results and the best average prediction achieved by each NB classifier using the different term selection techniques (TSTs). We define the best result as the maximum MCC attained by the combination of one NB classifier and one TST using a specific number of terms (see footnote 2). Tables 1, 3, 5, 7, 9 and 11 present the best performance achieved by each classifier for each Enron dataset. Tables 2, 4, 6, 8, 10 and 12 detail the set of results attained by the two classifiers that accomplished the best performance.

Table 1: E1 - The best result achieved by each filter.
Filter    TST       |T'| (% of |T|)   MCC
Bern      OR wsum   40                0.872
MN Bool   OR wsum   50                0.861
MN TF     OR wsum   50                0.844
FB        IG        70                0.833

Table 2: E1 - Two best individual performances.
Measurement       Bern & OR wsum   MN Bool & OR wsum
|T'| (% of |T|)   40               50
TPr (%)           98.67            92.00
Spr (%)           84.09            88.46
TNr (%)           92.39            95.11
Lpr (%)           99.42            96.69
Acc (%)           94.21            94.21
TCR               5.000            5.000
MCC               0.872            0.861

Note that, in Table 2, the combination of MV Bernoulli NB with OR wsum attained the same TCR as MN Boolean NB with OR wsum; however, their MCC values differ. This happens because the MCC offers a more balanced evaluation of the prediction, particularly when the classes are of different sizes, as discussed in Section 4. The E1 dataset has more legitimate messages than spam, hence false negatives affect the classifiers' performance more than false positives.

Table 3: E2 - The best result achieved by each filter.
Filter    TST      |T'| (% of |T|)   MCC
Bern      OR sum   40                0.952
MN TF     OR sum   40                0.874
MN Bool   OR sum   40                0.861
FB        IG       10                0.855

Table 8 shows another drawback of TCR. MV Bernoulli NB with IG using 20% of |T| achieved a perfect prediction (MCC = 1.000 and TCR = +∞) for Enron 4. On the other hand, MN Boolean NB with OR wsum using 40% of |T| classified only one spam message incorrectly as legitimate (|FP| = 0, |FN| = 1) and accomplished MCC = 0.996 and TCR = 450. If we analyzed only the TCR, we would wrongly conclude that the first combination was much better than the second one.

The results indicate that the classifiers' performance generally worsens when the complete set of terms T is used for classification. However, we found that retaining between 40% and 60% of |T| usually achieves the best performance.

Table 4: E2 - Two best individual performances.
Measurement       Bern & OR sum   MN TF & OR sum
|T'| (% of |T|)   40              40
TPr (%)           99.33           81.33
Spr (%)           93.71           100.00
TNr (%)           97.71           100.00
Lpr (%)           99.77           93.98
Acc (%)           98.13           95.23
TCR               13.636          5.357
MCC               0.952           0.874

Table 5: E3 - The best result achieved by each filter.
Filter    TST      |T'| (% of |T|)   MCC
Bern      IG       30                0.973
MN Bool   IG       10                0.936
MN TF     IG       10                0.884
FB        MI max   20                0.880

Table 6: E3 - Two best individual performances.
Measurement       Bern & IG   MN Bool & IG
|T'| (% of |T|)   30          10
TPr (%)           99.33       92.00
Spr (%)           96.75       98.57
TNr (%)           98.76       99.50
Lpr (%)           99.75       97.09
Acc (%)           98.91       97.46
TCR               25.000      10.714
MCC               0.973       0.936

Table 7: E4 - The best result achieved by each filter.
Filter    TST       |T'| (% of |T|)   MCC
Bern      IG        20                1.000
MN Bool   OR wsum   40                0.996
MN TF     OR wsum   40                0.996
FB        OR wsum   40                0.974

Table 8: E4 - Two best individual performances.
Measurement       Bern & IG   MN Bool & OR wsum
|T'| (% of |T|)   20          40
TPr (%)           100.00      99.78
Spr (%)           100.00      100.00
TNr (%)           100.00      100.00
Lpr (%)           100.00      99.34
Acc (%)           100.00      99.83
TCR               +∞          450.000
MCC               1.000       0.996

Even a set of selected terms composed of only 10% to 30% of |T| offers better results than the set with all terms of T. Collectively, regarding the term selection techniques, the reported experiments indicate that {OR, IG} > {χ², DF} >> MI, where ">" means "performs better than". However, we found that IG and χ² are less sensitive to the variation of |T'|. On the other hand, MI presented the worst individual and average performance.

Footnote 2: A comprehensive set of results, including all tables and figures, is available at http://blind_revision

Table 9: E5 - The best result achieved by each filter.
Filter    TST       |T'| (% of |T|)   MCC
Bern      OR sum    50                0.972
MN Bool   OR wsum   50                0.967
MN TF     OR sum    50                0.954
FB        χ²        10                0.931

Table 10: E5 - Two best individual performances.
Measurement       Bern & OR sum   MN Bool & OR wsum
|T'| (% of |T|)   50              50
TPr (%)           99.46           99.18
Spr (%)           98.92           98.92
TNr (%)           97.33           97.33
Lpr (%)           98.65           97.99
Acc (%)           98.84           98.65
TCR               61.333          52.571
MCC               0.972           0.967

Table 11: E6 - The best result achieved by each filter.
Filter    TST      |T'| (% of |T|)   MCC
Bern      OR sum   50                0.923
MN Bool   OR sum   60                0.897
FB        IG       10                0.873
MN TF     OR sum   50                0.829

Table 12: E6 - Two best individual performances.
Measurement       Bern & OR sum   MN Bool & OR sum
|T'| (% of |T|)   50              60
TPr (%)           96.89           98.22
Spr (%)           99.09           96.72
TNr (%)           97.33           90.00
Lpr (%)           91.21           94.41
Acc (%)           97.00           96.17
TCR               25.000          19.565
MCC               0.923           0.897

We also verify that the NB classifiers' performance is highly sensitive to the quality of the terms selected by the TSTs and to the number of selected terms |T'|. With respect to the classifiers, the results indicate that {MV Bernoulli NB} > {MN Boolean NB, MN term frequency NB} > {flexible Bayes}. MV Bernoulli NB attained the best individual performance for the majority of the datasets used in our experiments. It should be emphasized that MV Bernoulli NB is the only approach that takes into account the absence of terms in the messages. Our experiments also show that filters using Boolean attributes achieve better results than those based on term frequencies.

7. CONCLUSIONS AND FURTHER WORK

In this paper, we presented an evaluation of feature selection methods for dimensionality reduction in the anti-spam filtering domain. We compared the performance achieved by four NB anti-spam filters with different kinds of representation, applied to classify messages as legitimate or spam on six public, large and real e-mail datasets, after a pass of dimensionality reduction performed by five term selection techniques with a varying number of selected terms.

Regarding the TSTs, we found OR and IG to be the most effective for aggressive term removal without losing categorization accuracy. On the other hand, the employment of MI generally offers poor results, which frequently worsen the classifier's performance. The results also indicate that IG and χ² are less sensitive to the variation of |T'|. Among all classifiers, MV Bernoulli NB achieved the best performance. The results also verify that Boolean attributes perform better than term frequency ones. We further confirm that the performance of NB classifiers is highly sensitive to the selected attributes and to the number of terms selected by the TSTs in the training stage: the better the term selection technique, the better the classifier's prediction.

Future work should take into consideration that spam filtering is a coevolutionary problem: while the filter tries to evolve its prediction capacity, spammers try to evolve their spam messages in order to outwit the classifiers. An efficient approach should have an effective way to adjust its rules in order to detect changes in spam features. In this regard, collaborative filters can be used to assist the classifier by accelerating the adaptation of the rules and increasing the classifier's performance. Moreover, spammers generally insert a large amount of noise in spam messages in order to hinder probability estimation. Thus, filters should have a flexible way to compare terms during classification. Approaches based on fuzzy logic can be employed to make this comparison more flexible.

8. REFERENCES
[1] I. Androutsopoulos, G. Paliouras, and E. Michelakis. Learning to filter unsolicited commercial e-mail. Technical Report 2004/2, National Centre for Scientific Research "Demokritos", Athens, Greece, March 2004.
[2] P. Baldi, S. Brunak, Y. Chauvin, C. Andersen, and H. Nielsen. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16(5):412-424, May 2000.
[3] G. Cormack. Email spam filtering: A systematic review. Foundations and Trends in Information Retrieval, 1(4):335-455, 2008.
[4] G. Forman and E. Kirshenbaum. Extremely fast text feature extraction for classification and indexing. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 1221-1230, Napa Valley, CA, USA, November 2008.
[5] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive Bayes - which naive Bayes? In Proceedings of the 3rd International Conference on Email and Anti-Spam, pages 1-5, Mountain View, CA, USA, July 2006.
[6] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[7] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, March 2002.
[8] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, pages 412-420, Nashville, TN, USA, July 1997.