CMPT 585: Intranet and Internet Security Fall 2008 Montclair State University

Title Page

CMPT 585: Intranet and Internet Security, Fall 2008
Montclair State University
Computing Security Project

Project Topic: Bayesian Spam Detection Mechanisms and Future of Anti-Spam Filters

Project member: Hiroki Yamakawa
Supervisor: Dr. Stefan A. Robila

Table of Contents

Title Page
Table of Contents
Abstract
1. Introduction
2. Bayes Theorem
3. Bayesian Spam Filters
3.1 SpamBayes
3.2 POPFile
3.3 Experimental Results
3.4 Evaluation of Bayesian Spam Filters
4. Vulnerabilities
5. New Trends in Attacks against Bayesian Spam Filters
5.1 Image Spam
5.2 Image Spam with Content Obfuscation Techniques
6. Other Spam Filtering Strategies
7. Conclusion
References

Tables and Figures

Table 3.1 Test Record
Table 3.2 Breakdown of classification
Table 3.3 Test results in confusion matrices
Table 3.4 Confusion matrix
Table 3.5 Test results in accuracy, precision, and recall
Figure 5.1 Actual image spam
Figure 5.2 The same image as Figure 5.1 with a speckle
Figure 5.3 Content obfuscation techniques used to confuse OCR
Figure 5.4 Image spam that is difficult for OCR to read

Abstract

A number of spam filtering techniques have been developed to tackle the problems associated with spam e-mails, such as decreased productivity among employees and compromised computer security caused by malicious mails. Among these techniques, Bayesian spam filters have gained popularity and credibility as one of the most effective ways to detect spam e-mails. This paper first reviews Bayes' Theorem, the probability theorem that forms the backbone of Bayesian spam filters. With this theorem, Bayesian spam filters statistically analyze the text content of e-mails and generate the probability that each e-mail is spam. Two Bayesian spam filters, SpamBayes and POPFile, are analyzed in detail, and an experiment conducted to evaluate their filtering performance supports the widely reported effectiveness of Bayesian spam filters. Bayesian spam filters have some known vulnerabilities to certain spam tricks. One of the most recent and most troubling of these tricks is image spam. Some techniques used by image spam are studied alongside the vulnerabilities of Bayesian spam filters. Other spam filtering techniques, such as header verification, are reviewed to show the range of possible approaches to spam filtering. They can complement Bayesian spam filters or compensate for their vulnerabilities.

1. Introduction

Since the Internet and e-mail became major media of communication in nearly every aspect of our lives in the late 1990s and early 2000s, they have revolutionized the way we do business and socialize. The Internet has, to list just a few examples, made information gathering and publishing easy and has provided the convenience of online shopping and financial management. E-mail has brought about benefits that are just as convenient and valuable. These new technologies have, however, created new kinds of problems at the same time. One of the major problems affecting almost all users of e-mail is so-called "spam," a label coined for the unsolicited messages that inundate the inboxes of millions of people every day. To deal with this problem, a number of spam detection and filtering methods have been developed.

This project examines a leading anti-spam technique called "Bayesian spam filtering" and analyzes current spam trends and the future of spam filtering technologies. For the analysis of anti-spam products actually available today, the open source software POPFile and SpamBayes is used. Their application of Bayesian spam filtering methods and their effectiveness are evaluated. Experiments are conducted not only to measure performance, but also to consider possible exploitation of the vulnerabilities of Bayesian spam filtering.

This paper is organized as follows. Section 2 reviews Bayes' Theorem in order to explain the essential mechanism of many current spam filters. Section 3 studies actual Bayesian spam filters and conducts some experiments on them. Section 4 studies vulnerabilities of Bayesian spam filters and provides some example attacks exploiting them. Section 5 analyzes new trends in attacks against Bayesian spam filters and possible countermeasures against the new attacks. Section 6 looks at other spam filtering strategies being researched today. Section 7 concludes the project.

2. Bayes Theorem

Bayesian spam filtering uses Bayes' Theorem to distinguish spam e-mails from legitimate e-mails. Bayes' Theorem is a probability theorem developed by the British mathematician Thomas Bayes in the 18th century [1]. The theorem is expressed as follows:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

P(A): probability of event A
P(B): probability of event B
P(A|B): probability of event A given the occurrence of event B
P(B|A): probability of event B given the occurrence of event A

In spam mail filtering, the above formula can be re-expressed as:

$$P(\text{spam} \mid \text{words}) = \frac{P(\text{words} \mid \text{spam})\,P(\text{spam})}{P(\text{words})}$$

Bayesian spam filtering scans a set of pre-labeled texts (either "spam" or "legitimate," for example) and calculates the frequency and probability of each word in the text. This initial process returns the so-called "prior probabilities," P(spam) and P(words) [2]. P(spam) represents the probability that any scanned e-mail is spam. It is calculated as the total number of spam e-mails in the set divided by the total number of all e-mails in the set. P(words) represents the probability that any scanned e-mail contains the particular set of words. P(words|spam) is the probability that the particular set of words is contained in the e-mail given that the e-mail is spam. Once these three probabilities are obtained, the "posterior probability," P(spam|words), can be calculated using the formula above. This probability tells how likely the given e-mail is to be spam. If the probability is sufficiently high (e.g. 90% or above), the e-mail is considered to be spam.

There is, however, a major problem with the use of the above formula: calculating the value of P(words|spam) is almost impossible in real situations [3]. The term words represents a particular set of words in a given e-mail. It is usually represented as a binary vector whose elements indicate the presence or absence of each word in a pre-defined set. For example, if (W_1, W_2, ..., W_n) is the pre-defined set of n words, and the e-mail e_1 contains the words W_1, W_5, W_7, W_8, and W_10 (e_1 = <W_1, W_5, W_7, W_8, W_10>), the binary vector is words = <1,0,0,0,1,0,1,1,0,1,0,0,0,...,0>, with 1 indicating the presence of the corresponding word W_i. Accurately filtering e-mails for spam usually requires a set of a few hundred words. With n words there are 2^n possible values of words, so estimating P(words|spam), the probability of observing one particular combination of words given that the e-mail is spam, is intractable [3].

To address this issue, the Naïve Bayes Classifier assumption is incorporated into spam filtering. The assumption states that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature [4]. In the case of spam filtering, this means that, given that an e-mail is spam, each word appears in it independently of the other words. With this assumption, the numerator of the probability formula, P(words|spam)P(spam), is transformed as follows (the denominator remains the same):

$$P(\text{words} \mid \text{spam})\,P(\text{spam}) = P(\text{spam}, W_1, W_2, \ldots, W_n) = P(\text{spam}) \prod_{i=1}^{n} P(W_i \mid \text{spam})$$

In this way, P(words|spam) is replaced by the product of the individual P(W_i|spam). Since calculating P(W_i|spam) is much easier than calculating P(words|spam), and the results reasonably reflect the probability in practice, incorporating the assumption into spam filtering seems justifiable [5].
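As a rough illustration of the calculation above, the following Python sketch estimates P(spam|words) under the naïve independence assumption. The tiny training set, the function names, and the add-one smoothing are assumptions made for this example and are not taken from SpamBayes or POPFile. For brevity, the sketch multiplies only the probabilities of the words present in the e-mail rather than using the full presence/absence vector.

```python
import math
from collections import Counter

# Hypothetical pre-labeled training set (illustrative only).
TRAINING = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]


def train(messages):
    """Count e-mails per label and, per label, the e-mails containing each word."""
    doc_counts = Counter()
    word_docs = {"spam": Counter(), "ham": Counter()}
    for text, label in messages:
        doc_counts[label] += 1
        word_docs[label].update(set(text.lower().split()))
    return doc_counts, word_docs


def p_spam_given_words(text, doc_counts, word_docs):
    """Posterior P(spam | words) using P(label) * product of P(W_i | label)."""
    words = set(text.lower().split())
    total = sum(doc_counts.values())
    log_joint = {}
    for label in ("spam", "ham"):
        # Prior: fraction of training e-mails carrying this label.
        logp = math.log(doc_counts[label] / total)
        for w in words:
            # P(W_i | label) with add-one smoothing (an assumption of this sketch).
            logp += math.log((word_docs[label][w] + 1) / (doc_counts[label] + 2))
        log_joint[label] = logp
    # P(words) is the sum of the joint probabilities over both labels.
    peak = max(log_joint.values())
    denom = sum(math.exp(v - peak) for v in log_joint.values())
    return math.exp(log_joint["spam"] - peak) / denom


if __name__ == "__main__":
    counts, words = train(TRAINING)
    print(p_spam_given_words("buy cheap pills", counts, words))
    print(p_spam_given_words("monday meeting", counts, words))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters once the word set grows to the few hundred words mentioned above.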

3. Bayesian Spam Filters

Most of today's spam filters use the Naïve Bayes Classifier explained above as the main mechanism, or as part of the mechanisms, to combat spam [6]. Some of the popular spam filters that take advantage of the Naïve Bayes Classifier are SpamAssassin [7], McAfee SpamKiller [8], Mozilla Thunderbird [9], Apple Mail [10], POPFile [11], and SpamBayes [12]. This paper mainly studies two popular Bayesian spam filters available as open source software, SpamBayes and POPFile, both of which implement the proposals made in A Plan for Spam, published by Paul Graham [13]. It is one of the most famous and widely publicized works proposing the application of the Naïve Bayes Classifier to spam filtering [2].

3.1 SpamBayes

SpamBayes was developed by Tim Peters and others in August 2002, soon after Paul Graham published A Plan for Spam [14]. The initial implementation of Graham's proposal (using the Naïve Bayes Classifier in spam filtering) proved successful but also gave rise to some problems. E-mails were classified only as either definite spam or definite ham (a label given to legitimate e-mails). In other words, every e-mail was assigned to one of the two classes with a very strong probability value (confidence) [14]. This is a major issue when considering the cost of false positives. In spam filtering, a false positive is a case where a non-spam, legitimate mail is incorrectly classified as spam. Usually, the cost of a false positive is higher than the cost of a false negative, a case where a spam mail is incorrectly classified as non-spam. For example, suppose a legitimate, very important business message is sent by e-mail from a company's headquarters to its branch office, but the filtering system at the branch office incorrectly labels the e-mail as spam and filters it out of the regular inbox. False positive problems like this are much more severe than cases in which spam mails are incorrectly classified as non-spam and creep into the regular inbox.

If SpamBayes could classify e-mails only as spam or non-spam and nothing in between, the weight given to recalculation (recalibration) of the classifier after a false positive would have to be very large relative to the weight given after a false negative. That is, the classifier should adjust itself so that, if it were to make mistakes, it would incorrectly classify non-spam mails as spam much less frequently than it would incorrectly classify spam mails as non-spam. However, giving unreasonably large weights to false positive cases would lead to poor classification accuracy: the filter would accept many spam mails, along with most non-spam mails, as legitimate. The goal of spam filtering is to achieve an optimal balance in the trade-off between false positives and false negatives, while keeping the number of both as low as possible [15].

To help achieve this goal, SpamBayes incorporates a chi-squared test. The chi-squared test measures the statistical significance of two hypotheses: that a given message is spam, and that a given message is non-spam [14]. The probability calculated by the test indicates how likely it is to observe a particular mail under the hypothesis that the mail is spam or, separately, under the hypothesis that the mail is non-spam [14] [16]. The probabilities for the two hypotheses are combined and scaled to generate an overall message spam score in the range 0 to 1 [14]. The higher the score (the closer it is to 1), the more likely the mail is spam. The lower the score, the less likely the mail is spam. SpamBayes labels all e-mails with a score between 0.20 and 0.90 as unsure. These asymmetric boundaries, which demand stronger evidence for a spam label than for a ham label, reduce the number of false positives, and also of false negatives, by declining to assign a definite label of either spam or ham to mails that are difficult to classify [14].

3.2 POPFile

POPFile was developed by John Graham-Cumming and initially released in September 2002 [17]. Just as SpamBayes relies extensively on the Naïve Bayes Classifier to sort e-mails into spam, ham, or unsure, POPFile uses the Naïve Bayes Classifier for its filtering mechanism. POPFile is, however, not strictly a spam filter. It is rather a general classifier that can classify e-mails as spam or non-spam or under whatever classification users define [11]. Users can create an arbitrary number of classification labels. For example, one can create labels such as Business for job-related e-mails, School for school-related e-mails, and Personal for e-mails personally addressed to the user, along with Spam and Non-Spam labels. POPFile is more flexible than SpamBayes in this sense. However, POPFile does not have a mechanism similar to the spam score used by SpamBayes to better handle difficult-to-classify e-mails.
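The spam score and unsure band that SpamBayes uses, and that POPFile lacks, can be sketched roughly as follows. This is a minimal sketch assuming Fisher-style chi-squared combining of per-word spam probabilities, in the spirit of the description in [14]. The function names are illustrative, the per-word probabilities are taken as given, and details such as which words are selected as evidence are omitted. The cutoffs 0.20 and 0.90 come from the text above. Treating scores exactly at a cutoff as definite is an assumption.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-squared variable with an even number of
    degrees of freedom (dof) exceeds x2, using the closed-form series."""
    assert dof > 0 and dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def spam_score(word_probs):
    """Combine per-word spam probabilities (each strictly between 0 and 1)
    into one score in [0, 1]; 0.5 means the evidence is balanced."""
    if not word_probs:
        return 0.5
    n = len(word_probs)
    ln_not_spam = sum(math.log(1.0 - p) for p in word_probs)
    ln_spam = sum(math.log(p) for p in word_probs)
    spam_evidence = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0


def label(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Three-way labeling with the asymmetric unsure band."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


if __name__ == "__main__":
    for probs in ([0.99, 0.98, 0.95], [0.01, 0.02, 0.05], [0.99, 0.01]):
        score = spam_score(probs)
        print(probs, round(score, 4), label(score))
```

When strongly spammy and strongly hammy words appear together, both hypotheses are well supported and the score falls near 0.5, so the message is labeled unsure rather than forced into either class.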
The tendency of the Naïve Bayes Classifier to return polarized results (either strongly spam or strongly non-spam), observed by Meyer and Whateley [14], may not be specifically addressed in POPFile, given that POPFile accommodates an arbitrary number of user-defined classification labels. POPFile does not give particular importance to spam filtering; it tries to sort e-mails into all the user-defined labels equally well. Even though POPFile has a default unclassified label, this label is provided for cases in which classification under any of the user-defined labels is difficult. That is, the unclassified label is not specifically reserved for cases that lie in the middle of the spectrum between spam and non-spam. It may seem, therefore, that POPFile is less well configured for the sole purpose of spam filtering. The results of comparative experiments are presented in the next section to discuss this issue.

3.3 Experimental Results

First, to measure the overall effectiveness of Bayesian spam filters, both SpamBayes and POPFile are tested for classification accuracy with 150 e-mails taken from the corpus of spam and non-spam e-mails provided by SpamAssassin [7]. Out of the 150 e-mails, 72 are spam and 78 are non-spam. The test results are summarized in Tables 3.1, 3.2, and 3.3.

                   Spam   Ham   Unsure/Unclassified   Total
Actual Test Set     72     78            0             150
SpamBayes           48     77           25             150
POPFile             66     73           11             150

Table 3.1 Test Record
Count of e-mails classified as Spam, Non-Spam, or Unsure/Unclassified by SpamBayes and POPFile. The Actual Test Set row shows the real number of spam and non-spam e-mails used for this test.

               SpamBayes                              POPFile
               Spam         Non-Spam     Unsure       Spam         Non-Spam     Unclassified
Test Result    (Positive)   (Negative)                (Positive)   (Negative)
True           47           71                        60           67
False          1            6                         6            6

Table 3.2 Breakdown of classification
The values in the True row are the numbers of correct classifications made by SpamBayes and POPFile for each label (Spam, Non-Spam, and Unsure). The values in the False row are the numbers of incorrect classifications, representing false positives and false negatives.

SpamBayes
                         Real Value
Test Result              Spam          Non-Spam
Positive (Spam)          47 (True)     1 (False)
Negative (Non-Spam)      6 (False)     71 (True)
Unsure                   19            6

POPFile
                         Real Value
Test Result              Spam          Non-Spam
Positive (Spam)          60 (True)     6 (False)
Negative (Non-Spam)      6 (False)     67 (True)
Unclassified             6             5

Table 3.3 Test results in confusion matrices

Confusion Matrix
                   Real Value
Test Result        Spam                   Non-Spam
Spam               True Positive (TP)     False Positive (FP)
Non-Spam           False Negative (FN)    True Negative (TN)

Table 3.4 Confusion matrix

Table 3.1 provides a quick look at the test result. This table does not in itself show the accuracy of the classification made by either SpamBayes or POPFile. However, it may indicate a tendency of SpamBayes to be more careful in labeling e-mails as spam. As discussed earlier, in order to reduce the number of false positives, SpamBayes assigns the unsure label to e-mails with spam scores between 0.20 and 0.90 [14]. Table 3.2 seems to support this interpretation. SpamBayes returns only 1 false positive (that is, a non-spam mail classified as spam) compared with 6 false negatives, while POPFile returns 6 false positives and 6 false negatives. The better performance of SpamBayes in terms of false positives is probably achieved by assigning the unsure label to the mails with scores between 0.20 and 0.90. On the other hand, POPFile merely allocates the unclassified label to mails to which it cannot confidently assign any label. In this sense, SpamBayes is clearly better configured for spam filtering, although the better performance comes at the price of having to manually classify more unsure e-mails.

Table 3.3 presents the false positive and false negative cases in a confusion matrix for each spam filter. The row labels (Spam, Non-Spam) specify the classification made by the filter, and the column labels (Spam, Non-Spam) denote the actual classification. Table 3.4 shows the meaning of the value in each cell of the confusion matrix. To quantify the performance, three statistical measures traditionally used for classification quality are used: accuracy, precision, and recall.
Accuracy is defined as [18] (refer to Table 3.4):

Accuracy = (TP + TN) / (TP + FP + FN + TN)

If the number of unsure/unclassified cases is taken into consideration:

Accuracy = (TP + TN) / (TP + FP + FN + TN + US + UNS)

where US is the number of spam mails labeled by the filters as unsure/unclassified, and UNS is the number of non-spam mails labeled as unsure/unclassified. In other words, accuracy is the number of correctly classified e-mails divided by the total number of e-mails used for the test (either including or excluding the unsure/unclassified e-mails). Simply put, it measures how well the filters correctly label e-mails.
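As a worked example of the two accuracy formulas, the short Python sketch below plugs in the counts from Table 3.3. The function and variable names are chosen for this illustration only.

```python
def accuracy(tp, fp, fn, tn, unsure_spam=0, unsure_nonspam=0):
    """Accuracy = correct classifications / all e-mails counted.

    Leaving both unsure counts at 0 gives the formula without
    unsure/unclassified e-mails; passing US and UNS gives the formula
    that includes them in the denominator.
    """
    return (tp + tn) / (tp + fp + fn + tn + unsure_spam + unsure_nonspam)


# Counts taken from Table 3.3.
SPAMBAYES = dict(tp=47, fp=1, fn=6, tn=71)
SPAMBAYES_UNSURE = dict(unsure_spam=19, unsure_nonspam=6)
POPFILE = dict(tp=60, fp=6, fn=6, tn=67)
POPFILE_UNSURE = dict(unsure_spam=6, unsure_nonspam=5)

if __name__ == "__main__":
    print("SpamBayes w/o unsure: ", round(accuracy(**SPAMBAYES), 2))
    print("SpamBayes with unsure:", round(accuracy(**SPAMBAYES, **SPAMBAYES_UNSURE), 2))
    print("POPFile w/o unsure:   ", round(accuracy(**POPFILE), 2))
    print("POPFile with unsure:  ", round(accuracy(**POPFILE, **POPFILE_UNSURE), 2))
```

Including the unsure/unclassified e-mails lowers SpamBayes's accuracy more than POPFile's, because SpamBayes sets aside 25 e-mails as unsure while POPFile leaves only 11 unclassified.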

Precision measures the exactness of classification. It is the number of e-mails correctly identified as belonging to a class divided by the total number of e-mails identified, both correctly and incorrectly, as belonging to that class [19].

Precision = TP / (TP + FP)               in terms of spam classification

Because this test is not a binary classification test in a pure sense (there are three classes: spam, non-spam, and unsure/unclassified), calculating the precision for non-spam also seems warranted. The formula can be expressed as:

Precision (negative) = TN / (TN + FN)    in terms of non-spam classification

Recall measures the completeness of classification. It is the number of e-mails correctly identified as belonging to a class divided by the total number of e-mails that actually belong to the class [19].

Recall = TP / (TP + FN)                  in terms of spam classification
Recall (negative) = TN / (TN + FP)       in terms of non-spam classification

In Table 3.5 the recall denominator is the total number of e-mails that actually belong to the class, so e-mails labeled unsure/unclassified count toward it (for example, all 72 spam e-mails for SpamBayes rather than 47 + 6 = 53).

Table 3.5 below summarizes the test results in terms of accuracy, precision, and recall.

                                       SpamBayes                         POPFile
                                       Spam            Non-Spam          Spam            Non-Spam
Accuracy (w/o unsure/unclassified)     118/125 = 0.94                    127/139 = 0.91
Accuracy (with unsure/unclassified)    118/150 = 0.79                    127/150 = 0.85
Precision                              47/48 = 0.98    71/77 = 0.92      60/66 = 0.91    67/73 = 0.92
Recall                                 47/72 = 0.65    71/78 = 0.91      60/72 = 0.83    67/78 = 0.86

Table 3.5 Test results in accuracy, precision, and recall

Table 3.5 also reveals the difference in how SpamBayes and POPFile approach the classification. The very high precision and relatively low recall of SpamBayes for spam classification show that the filter achieves high spam precision at the expense of spam recall. That is, to reduce as far as possible the number of times it classifies non-spam e-mails as spam, it classifies difficult e-mails as unsure instead of risking a false positive.

On the other hand, with POPFile, the precision and recall measures for spam and non-spam are very similar to each other. This may be because, as discussed in the previous section, POPFile does not particularly try to reduce the number of false positives in the spam filtering sense. In this test it happens to have one label called spam and another called non-spam, but POPFile simply tries to classify e-mails into one of those categories, and if assigning either label is statistically difficult, it gives the mail an unclassified label as a last resort.

3.4 Evaluation of Bayesian Spam Filters

Given the experimental results and statistical performance measures, it seems reasonable to say that spam filters based on the Naïve Bayes Classifier are effective in distinguishing spam e-mails from non-spam e-mails. The test used only 150 e-mails, but the accuracy measures steadily increased throughout the test period. As more e-mails are processed by the filters, the accuracy can be expected to keep increasing until it reaches the level achievable by the classification model used. According to Yerazunis [20], over twenty Bayesian spam filters available on Sourceforge.net [21] in 2004 were reported to have demonstrable accuracies on the order of 99.9%.

So, why are Bayesian spam filters so effective? One major reason is that they continuously adjust their spam/non-spam probability values while processing new documents along with the corresponding user-verified labels. This means that the filtering system can improve its accuracy by continuously updating its accumulated knowledge (statistics) of what is considered spam and what is not, as more e-mails with corresponding labels are fed into the system. Once the system has gained substantial knowledge of spam statistics, it can scan new unlabeled e-mails and state the probability of their being spam or non-spam with great accuracy.

Another reason for the strong performance of Bayesian spam filters is that most of those available to the public are trained by individual users to filter their own unique set of e-mails. Bayesian spam filters can be, and are, used on the server side or incorporated into server software that handles multiple users [2] [22], but many personal users who adopt spam filters such as SpamBayes and POPFile install them as proxies that sit between their mail servers and their e-mail client software, such as Microsoft Outlook and Mozilla Thunderbird [11] [12]. Their filters are therefore used only for their personal set of e-mails. As a result, the filters can fine-tune themselves effectively for better performance.

4. Vulnerabilities

As powerful and effective as Bayesian spam filters can be, a number of attacks have been proposed to get past their filtering mechanisms. Many of those attacks exploit the very strength of the adaptive learning mechanisms inherent in Bayesian spam filters and the way they treat an e-mail message as a "bag of words" [6], that is, as a mere collection of words, disregarding grammar and word order [23]. These attacks are called Bayesian poisoning [6] [24]. Bayesian poisoning tries to confuse Bayesian spam filters into accepting spam e-mails as legitimate by inserting into the spam mails (or "poisoning" the mails with) random words and/or certain words that do not usually appear in spam mails [24].

The creator of POPFile, John Graham-Cumming [11], presented at the Spam Conference held at MIT in 2004 [24] a very interesting way of attacking Bayesian spam filters by using Bayesian spam filters [25]. He collected a set of spam mails that had got past his well-trained POPFile. Then, after inserting five random words into each spam mail, he sent the spam mails to target e-mail clients protected by Bayesian spam filters. Using so-called web bugs, he collected data on the success and failure of the spam in reaching the clients' inboxes. With the feedback data sent by the web bugs from the target clients, he trained another POPFile, to be used only for the purpose of attacking other Bayesian spam filters. With this POPFile trained specifically for the attacks, he could gather a collection of words, which he calls "kryptonite," that can, when inserted into spam mails, cause the target Bayesian filters to accept the spam mails as non-spam [25]. Graham-Cumming's attack may be effective, but it is probably not practical in the real world, as he himself and Stern et al. point out: sending and keeping track of millions, or even trillions, of spam mails to train the attacking Bayesian filters would most likely be implausible [6] [25].

The attack proposed by Stern et al. takes a different approach to making Bayesian spam filters less effective. It tries to degrade the accuracy of the spam filters by causing them to frequently classify legitimate mails as spam [6]. This attack is carried out by inserting into spam mails a large number of the words most commonly used in regular conversation. If target spam filters are continuously fed the maliciously processed mails, they are expected to start associating those commonly used words with spam mails and begin to classify legitimate mails as spam. However, for this attack to work in practice, there needs to be strong coordination among all the major spam senders to adopt the same tactics, so that many target filters among the general public can be effectively mistrained. There is, however, a so-called prisoner's dilemma associated with this attack strategy. If all the spammers abide by the same tactics, everybody will benefit from effectively incapacitating Bayesian spam filters. If, however, any one of them chooses to deviate from the mass-orchestrated tactics, that spammer has a greater chance of devising spam messages that can actually get past the spam filters of the general public. Therefore, there is a strong incentive for spammers to break away from this consortium of spam attacks, making it difficult to coordinate such attacks among all the major spam senders [6].
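The effect of word insertion on a bag-of-words filter can be illustrated with a small Python sketch. The per-word likelihoods below are hypothetical values invented for this example; they are not measurements from POPFile, SpamBayes, or any of the cited studies. The sketch only shows why adding words that are common in legitimate mail lowers a naïve Bayes posterior.

```python
import math

# Hypothetical likelihoods (P(word | spam), P(word | ham)); illustrative only.
LIKELIHOODS = {
    "viagra": (0.30, 0.001),
    "cheap": (0.20, 0.01),
    "meeting": (0.005, 0.15),
    "agenda": (0.002, 0.10),
    "thanks": (0.01, 0.20),
}


def spam_posterior(words, p_spam=0.5):
    """Naive Bayes posterior P(spam | words), computed through log-odds.

    Words missing from LIKELIHOODS are ignored, and word order and
    repetition play no role (the "bag of words" view).
    """
    log_odds = math.log(p_spam / (1.0 - p_spam))
    for word in set(words):
        if word in LIKELIHOODS:
            p_w_spam, p_w_ham = LIKELIHOODS[word]
            log_odds += math.log(p_w_spam / p_w_ham)
    return 1.0 / (1.0 + math.exp(-log_odds))


ORIGINAL = ["viagra", "cheap"]
# The same message "poisoned" with words typical of legitimate mail.
POISONED = ORIGINAL + ["meeting", "agenda", "thanks"]

if __name__ == "__main__":
    print("original:", spam_posterior(ORIGINAL))
    print("poisoned:", spam_posterior(POISONED))
```

Because each word contributes independently to the log-odds, a few strongly "hammy" words can outweigh the spammy ones. This is the property that the kryptonite words described above exploit.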

5. New Trends in Attacks against Bayesian Spam Filters

The attacks described in the previous section seem to be effective only in controlled experimental settings, but in the real world spammers have recently been adopting new methods to get their messages past spam filters.

5.1 Image Spam

One of the most popular methods relies on image spam, in which spammers embed their text messages in images so that Bayesian spam filters cannot effectively analyze the content of the mails for spam classification [26]. Below is an example of image spam, taken from the Image Spam Dataset [27].

Figure 5.1 Actual image spam, from the Image Spam Dataset [27]

Bayesian spam filters cannot effectively classify mails such as the one above as spam, because the messages appear to the filters only as images, and the text in the images cannot be recognized, let alone analyzed, for spam detection. To counteract image spam, researchers have developed two basic methods: fingerprinting and Optical Character Recognition (OCR). Fingerprinting uses an MD5 checksum to convert images into hash values and compares them with the hash values of images known to be spam or non-spam. OCR captures the text embedded in images and transforms it into plain text. The plain text is then analyzed by traditional Bayesian filters for spam classification [28].
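The fingerprinting method can be sketched in a few lines of Python using the standard hashlib module. The names fingerprint, is_known_spam, and KNOWN_SPAM_FINGERPRINTS are hypothetical, and the byte strings merely stand in for the contents of image attachments.

```python
import hashlib


def fingerprint(image_bytes: bytes) -> str:
    """Return the MD5 checksum of the raw image data as a hex string."""
    return hashlib.md5(image_bytes).hexdigest()


def is_known_spam(image_bytes: bytes, known_fingerprints: set) -> bool:
    """True if this exact image was previously recorded as spam."""
    return fingerprint(image_bytes) in known_fingerprints


# Stand-in for the bytes of an image already identified as spam.
SPAM_IMAGE = b"\x89PNG...placeholder bytes for a spam image..."
KNOWN_SPAM_FINGERPRINTS = {fingerprint(SPAM_IMAGE)}

if __name__ == "__main__":
    print(is_known_spam(SPAM_IMAGE, KNOWN_SPAM_FINGERPRINTS))
    # Flip the lowest bit of the last byte to mimic a one-pixel change.
    altered = SPAM_IMAGE[:-1] + bytes([SPAM_IMAGE[-1] ^ 0x01])
    print(is_known_spam(altered, KNOWN_SPAM_FINGERPRINTS))
```

The comparison is an exact match on the hash value, so the lookup recognizes only byte-for-byte copies of images that have been seen before.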

5.2 Image Spam with Content Obfuscation Techniques

Fingerprinting and OCR have provided researchers with tools to analyze image-based spam mails in a way Bayesian spam filters cannot. However, spammers can easily devise image spam mails that escape detection by fingerprinting and OCR. To make fingerprinting ineffective, a simple change to the image, such as modifying a few pixels, is enough, because it causes the hash value to change [28]. Figure 5.2 shows an example of a tiny change in the image that alters the fingerprint.

Figure 5.2 The same image as Figure 5.1 with a speckle added in the lower left corner.

To defeat OCR detection, spammers can use a number of content obfuscation techniques. Figure 5.3 shows four different techniques that make it difficult for OCR to read the message [26]. Figure 5.4 shows an example of image spam applying different techniques [28].

Figure 5.3 Content obfuscation techniques used to confuse OCR [26]

Figure 5.4 Image spam that is difficult for OCR to read [28]

Current OCR technology is not sophisticated enough to accurately detect and read text embedded in images in this way. As one approach to this issue, Biggio et al. [26] propose a method that detects a number of known image spam characteristics (such as the use of many different colors or of particular colors, or detectable patterns that result from well-known anti-OCR techniques) and, based on the presence of those characteristics, determines whether or not the e-mails are image spam. This approach does not rely on OCR to classify e-mails as image spam or not. Rather than trying to read the text embedded in images, it tries to determine whether the images in the e-mails fit the characteristics of image spam. However, the method is designed to complement, not replace, OCR in detecting image spam, and reading the text embedded in images remains just as important for better judgment in spam classification [26].

6. Other Spam Filtering Strategies

A number of other spam filtering strategies and techniques are currently being researched, developed, and/or used to combat spam mails, which have been growing steadily in number and sophistication. This section briefly reviews an e-mail header verification approach proposed by Trevino et al. [29], behavioral blacklisting by Ramachandran et al. [30], server reputation management by Mujica [31], and Spamlet, a system designed by Dallmeyer et al. to obstruct spammers' operations [32].

Header verification techniques have been around since the early stages of spam filtering development, but Trevino et al. [29] claim that they are still effective and consume few mail server resources. Header verification techniques read and analyze only the header part of e-mails, which contains, among other things, information about the SMTP servers used by the senders of the mails. In the process, the identities of the SMTP servers are checked against the Domain Name System (DNS), and only mails from senders whose SMTP identities are verified are allowed to pass through the filter. Since, unlike Bayesian spam filtering, this method requires no reading of text for statistical word analysis, header verification should be able to handle image spam as well as it does non-image spam [29].

SpamTracker is a spam filtering system developed by Ramachandran et al. [30] that filters e-mails based on the sending behavior of the senders. Traditionally, the reputation of senders is checked by looking up their IP addresses in publicly accessible spam blacklists such as those provided by Spamhaus [33] and SpamCop [34]. SpamTracker, however, analyzes the sending behaviors and patterns of senders and uses a clustering algorithm to place them in either a spam sender group or a non-spam sender group. The sending patterns analyzed by SpamTracker include the domains the senders target and the methods and processes used to target those domains. The experiments conducted by Ramachandran et al. show that this technique works because senders of spam mails do cluster together in behavioral analysis and can be distinguished from those who do not send spam mails [30].

In the paper Reputation Management for All Servers [31], Mujica is less optimistic about the capabilities of current spam filters and explains the great importance of reputation management on the part of all e-mail servers. If all legitimate e-mail servers manage their reputations and have themselves listed on whitelists, the problem of spam mails can be handled more easily and effectively. He proposes a freely available reputation management system (RM system) for all mail servers, which offers XML web services for queries about IP address reputation and a better web interface for reputation management. It also sets out a framework for reputation scoring processes [31].

Dallmeyer et al. [32] take a different approach to fighting spam mails. They developed a system called Spamlet, which interacts with spammers without human management by replying to spam messages. It relies on an artificial intelligence algorithm and, by engaging spammers in fake transactions, tries to consume spammers' time and resources.

7. Conclusion

This paper studied the mechanisms and effectiveness of Bayesian spam filters. The reports cited and the experiments conducted indicate that Bayesian spam filters are effective when classifying emails based on the text content of the messages. Since messages sent by spammers tend to differ from the legitimate messages that individuals receive from friends, business colleagues, and so on, the statistical analysis of word usage in emails seems not just effective but also relevant.

However, as researchers and developers improve the performance of spam filters, spammers devise new methods to evade them. Image spam is a prominent example of a method developed to defeat spam filters, especially Bayesian spam filters. Today, spam filtering appears to be moving toward the combination and integration of many different filtering techniques in order to handle a variety of ever-changing spam tactics. At the same time, individual techniques, both new and existing, continue to be researched and developed to improve filtering performance.

As long as spam messages involve text, whether it is embedded in images or even in audio or video, text analysis is likely to remain important and relevant for spam classification. If so, Bayesian spam filtering will continue to make up a vital part of spam detection mechanisms for the foreseeable future.

References:

[1] "Bayes' Theorem", [online] 2008, (Accessed: October 2008)
[2] "Bayesian Spam Filter", [online] 2008, (Accessed: October 2008)
[3] Cormac O'Brien and Carl Vogel, Spam filters: Bayes vs. chi-squared; letters vs. words, Technical Report TCD-CS , Trinity College Dublin, April
[4] "Naive Bayes Classifier", [online] 2008, (Accessed: October 2008)
[5] Trevor Stone, Parameterization of Naive Bayes for Spam Filtering, University of Colorado at Boulder,
[6] Henry Stern, Justin Mason, and Michael Shepherd, A Linguistics-Based Attack on Personalized Statistical Classifiers, Dalhousie University, March 25,
[7] SpamAssassin,
[8] McAfee SpamKiller, html
[9] Mozilla Thunderbird,
[10] Apple Mail,
[11] POPFile,
[12] SpamBayes,
[13] Paul Graham, A Plan for Spam, [online] 2002, (Accessed: November 2008)
[14] T. A. Meyer and B. Whateley, SpamBayes: Effective Open-Source, Bayesian Based, Email Classification System, Conference on Email and Anti-Spam (CEAS 04)
[15] Type I and Type II Errors, [online] (Accessed: November 2008)
[16] Chi-Square Test, [online]
[17] POPFile, [online]

[18] Accuracy and Precision, [online] (Accessed: November 2008)
[19] Precision and Recall, [online] (Accessed: November 2008)
[20] William S. Yerazunis, The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It, 2004 MIT Spam Conference, 2004
[21] Sourceforge.net,
[22] GFiMailEssentials,
[23] Bag of Words Model, [online]
[24] Bayesian Poisoning, [online]
[25] John Graham-Cumming, How to Beat an Adaptive Spam Filter, Spam Conference 2004, available at
[26] Battista Biggio, Giorgio Fumera, Ignazio Pillai, and Fabio Roli, Image Spam Filtering by Content Obscuring Detection, CEAS 2007 Fourth Conference on Email and Anti-Spam, August 2-3,
[27] Image Spam Dataset, [online]
[28] Image Based Spam, Red Condor, Inc, 2007, available at
[29] Alberto Trevino and J. J. Ekstrom, Spam Filtering Through Header Relay Detection, Brigham Young University, 2007
[30] Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, Filtering Spam with Behavioral Blacklisting, ACM , 2007
[31] Alberto Mujica, Reputation Management for All Servers, Reputation Technologies Inc.
[32] Kenneth P. Dallmeyer, Peter C. Nelson, Elias D. Block, and Brandon R. Elvidge, Return to Spamlet, Artificial Intelligence Laboratory, University of Illinois at Chicago
[33] Spamhaus,
[34] SpamCop,
