CMPT 585: Intranet and Internet Security Fall 2008 Montclair State University


Computing Security Project

Project Topic: Bayesian Spam Detection Mechanisms and Future of Anti-Spam Filters

Project member: Hiroki Yamakawa
Supervisor: Dr. Stefan A. Robila

Table of Contents

Abstract
1. Introduction
2. Bayes' Theorem
3. Bayesian Spam Filters
3.1 SpamBayes
3.2 POPFile
3.3 Experimental Results
3.4 Evaluation of Bayesian Spam Filters
4. Vulnerabilities
5. New Trends in Attacks against Bayesian Spam Filters
5.1 Image Spam
5.2 Image Spam with Content Obfuscation Techniques
6. Other Spam Filtering Strategies
7. Conclusion
References

Tables and Figures

Table 3.1 Test Record
Table 3.2 Breakdown of classification
Table 3.3 Test results in confusion matrices
Table 3.4 Confusion matrix
Table 3.5 Test results in accuracy, precision, and recall
Figure 5.1 Actual image spam
Figure 5.2 The same image as Figure 5.1 with a speckle
Figure 5.3 Content obfuscation techniques used to confuse OCR
Figure 5.4 Image spam that is difficult for OCR to read

Abstract

A number of spam filtering techniques have been developed to tackle the problems associated with spam e-mails, such as decreased productivity among employees and compromised computer security caused by malicious mails. Among these techniques, Bayesian spam filters have gained popularity and credibility as one of the most effective ways to detect spam e-mails. This paper first reviews Bayes' Theorem, the probability theorem that forms the backbone of Bayesian spam filters. With this theorem, Bayesian spam filters statistically analyze the text content of e-mails and generate the probability that each e-mail is spam. Two Bayesian spam filters, SpamBayes and POPFile, are analyzed in detail, and the experiment conducted to evaluate their spam filtering performance supports the widely reported effectiveness of Bayesian spam filters. Bayesian spam filters have some known vulnerabilities to certain spam tricks. One of the most recent and most troubling of these tricks is image spam. Some techniques used by image spam are studied alongside the vulnerabilities of Bayesian spam filters. Other spam filtering techniques, such as header verification, are reviewed to show the range of possible approaches to spam filtering. They can complement Bayesian spam filters or compensate for their vulnerabilities.

1. Introduction

Since the Internet and e-mail became major media of communication in nearly every aspect of our lives in the late 1990s and early 2000s, they have revolutionized the way we do business and socialize. The Internet has, to list just a few examples, made information gathering and publishing easy, and has provided the convenience of online shopping and financial management. E-mail has also brought about changes that are just as convenient and valuable as those effected by the Internet. These new technologies have, however, created new kinds of problems at the same time. One of the major problems affecting almost all users of e-mail is so-called "spam," a label coined for unsolicited messages that inundate the inboxes of millions of people every day. To deal with this problem, a number of spam detection and filtering methods have been developed.

This project studies a cutting-edge anti-spam technique called "Bayesian spam filtering" and analyzes current spam trends and the future of spam filtering technologies. For the analysis of anti-spam products actually available today, the open source software POPFile and SpamBayes are used. Their application of Bayesian spam filtering methods and their effectiveness are evaluated. Experiments are conducted not only to measure performance, but also to consider possible exploitation of the vulnerabilities of Bayesian spam filtering.

This paper is organized as follows. Section 2 reviews Bayes' Theorem in order to explain the essential mechanism of many current spam filters. Section 3 studies actual Bayesian spam filters and conducts some experiments on them. Section 4 studies vulnerabilities of Bayesian spam filters and provides some example attacks exploiting those vulnerabilities. Section 5 analyzes the new trends in attacks against Bayesian spam filters and possible countermeasures against them. Section 6 takes a look at other spam filtering strategies being researched today. Section 7 concludes the project.

2. Bayes' Theorem

Bayesian spam filtering uses Bayes' Theorem to distinguish spam e-mails from legitimate e-mails. Bayes' Theorem is a probability theorem developed by a British mathematician, Thomas Bayes, in the 18th century [1]. The theorem is expressed as follows:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

P(A): probability of event A
P(B): probability of event B
P(A | B): probability of event A given the occurrence of event B
P(B | A): probability of event B given the occurrence of event A

In spam mail filtering, the above formula can be re-expressed as:

$$P(\text{spam} \mid \text{words}) = \frac{P(\text{words} \mid \text{spam})\,P(\text{spam})}{P(\text{words})}$$

Bayesian spam filtering scans a set of pre-labeled texts (either "spam" or "legitimate," for example) and calculates the frequency and probability of each word in the text. This initial process returns the so-called "prior probabilities," P(spam) and P(words) [2]. P(spam) represents the probability that any scanned e-mail is spam. Its value is the total number of spam e-mails in the set divided by the total number of all e-mails in the set. P(words) represents the probability that any scanned e-mail contains the particular set of words. P(words | spam) is the probability that the particular set of words is contained in the e-mail given that the e-mail is spam. Once these three probabilities are obtained, the "posterior probability," P(spam | words), can be calculated using the formula above. This probability tells how likely the given e-mail is to be spam. If the probability is sufficiently high (e.g. greater than or equal to 90%), the e-mail is considered to be spam.

There is, however, a major problem with the use of the above formula: calculating the value of P(words | spam) is almost impossible in real situations [3]. The term words represents a particular set of words in a given e-mail. It is usually represented as a binary vector whose elements indicate the presence or absence of each word in a pre-defined set of words. For example, if (W_1, W_2, ..., W_n) represents the pre-defined set of n words, and if the e-mail e_1 contains the words W_1, W_5, W_7, W_8, and W_10 (e_1 = <W_1, W_5, W_7, W_8, W_10>), the binary vector is expressed as words = <1,0,0,0,1,0,1,1,0,1,0,0,0,...,0>, with 1 indicating the presence of the corresponding word W_i. Accurately filtering e-mails for spam usually requires a set of a few hundred words. This means that the number of possible values for words is so great that computing P(words | spam), the probability of observing the particular set of words in the e-mail given that the e-mail is spam, is intractable [3].

In order to address this issue, the Naïve Bayes Classifier assumption is incorporated in the spam filtering. The assumption states that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature [4]. In spam filtering, this means that, given that the e-mail is spam, each word appears in the e-mail independently of the other words. With this assumption, the numerator of the probability formula, P(words | spam)P(spam), is transformed as follows (the denominator remains the same):

$$P(\text{words} \mid \text{spam})\,P(\text{spam}) = P(\text{spam}, W_1, W_2, \ldots, W_n) = P(\text{spam}) \prod_{i=1}^{n} P(W_i \mid \text{spam})$$

In this way, P(words | spam) is transformed into the product of P(W_i | spam) over i = 1, ..., n. Since calculating P(W_i | spam) is much easier than calculating P(words | spam), and the results of the computation reasonably reflect the probability in practice, the incorporation of the assumption in spam filtering seems justifiable [5].
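The factorized formula above can be sketched in a few lines of Python. This is only a minimal illustration, not the method of any particular filter: the function names `train` and `spam_probability` are hypothetical, the add-one smoothing is an assumption added so that an unseen word does not force the product to zero, and, as a simplification, only the words present in a message are scored (the absent words of the binary vector are ignored).

```python
import math


def train(labeled_messages):
    """Count, for each class, how many messages contain each word."""
    counts = {"spam": {}, "ham": {}}
    totals = {"spam": 0, "ham": 0}
    for words, label in labeled_messages:
        totals[label] += 1
        for word in set(words):
            counts[label][word] = counts[label].get(word, 0) + 1
    return counts, totals


def spam_probability(words, counts, totals):
    """Estimate P(spam | words) under the naive independence assumption."""
    n_all = totals["spam"] + totals["ham"]
    log_joint = {}
    for label in ("spam", "ham"):
        # log P(label) + sum over words of log P(W_i | label)
        score = math.log(totals[label] / n_all)
        for word in set(words):
            # Add-one smoothing (an assumption, not taken from the text).
            p = (counts[label].get(word, 0) + 1) / (totals[label] + 2)
            score += math.log(p)
        log_joint[label] = score
    # P(words) equals the sum of the two joint terms, so normalizing
    # the joints yields the posterior without computing P(words) directly.
    peak = max(log_joint.values())
    scaled = {k: math.exp(v - peak) for k, v in log_joint.items()}
    return scaled["spam"] / (scaled["spam"] + scaled["ham"])
```

Working in log space avoids numerical underflow when a message contains a few hundred scored words, which is the vocabulary size mentioned above.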

3. Bayesian Spam Filters

Most of today's spam filters use the Naïve Bayes Classifier explained above as the main mechanism, or as part of the mechanisms, to combat spam [6]. Some of the popular spam filters that take advantage of the Naïve Bayes Classifier are SpamAssassin [7], McAfee SpamKiller [8], Mozilla Thunderbird [9], Apple Mail [10], POPFile [11], and SpamBayes [12]. This paper mainly studies two popular Bayesian spam filters available as open source software, SpamBayes and POPFile, both of which implement the proposals made in A Plan for Spam by Paul Graham [13]. It is one of the most famous and widely publicized works proposing the application of the Naïve Bayes Classifier to spam filtering [2].

3.1 SpamBayes

SpamBayes was developed by Tim Peters and others in August 2002, soon after Paul Graham published A Plan for Spam [14]. The initial implementation of Graham's proposal (using the Naïve Bayes Classifier in spam filtering) proved successful but also gave rise to some problems. E-mails were only classified as either definite spam or definite ham (a label given to legitimate e-mails). In other words, every e-mail was assigned to one of the two classes with a very strong probability value (confidence) [14]. This is a major issue when considering the cost of false positives. In spam filtering, a false positive is a case where a legitimate, non-spam mail is incorrectly classified as spam. Usually, the cost of a false positive is higher than the cost of a false negative, a case where a spam mail is incorrectly classified as non-spam. For example, suppose a legitimate, very important business message is sent by e-mail from a company's headquarters to its branch office, but the filtering system at the branch office incorrectly labels the e-mail as spam and filters it out of the regular inbox.
False positive problems like this are much more severe than cases in which some spam mails are incorrectly classified as non-spam and creep into the regular inbox. If SpamBayes could classify e-mails only as spam or non-spam and nothing in between, the weight given to the recalculation (recalibration) of the classifier for false positives would have to be very large relative to the weight given for false negatives. That is, the classifier of SpamBayes would have to adjust itself so that, if it were to make mistakes, it would incorrectly classify non-spam mails as spam much less frequently than it would incorrectly classify spam mails as non-spam. However, giving unreasonably large weights to false positive cases would lead to poor classification accuracy, with many spam mails, as well as most of the non-spam mails, recognized as legitimate. The goal of spam filtering is to achieve an optimal balance in the trade-off between false positives and false negatives, while keeping the number of both as low as possible [15].

To help achieve this goal, SpamBayes incorporates a chi-squared test. The chi-squared test measures the statistical significance of the hypothesis that a given message is spam and, separately, of the hypothesis that a given message is non-spam [14]. The probability calculated by the test indicates how likely it is to observe a particular mail assuming the hypothesis that the mail is spam, or, separately, how likely it is to observe that mail assuming the hypothesis that the mail is non-spam [14] [16]. The probabilities for the two hypotheses are combined and scaled to generate an overall message spam score in the range 0 to 1 [14]. The higher the score (the closer the score is to 1), the more likely the mail is spam. The lower the score, the less likely the mail is spam. SpamBayes labels all e-mails with a score between 0.20 and 0.90 as unsure. These asymmetric boundaries, biased against the spam label, effectively reduce the number of false positives, and also of false negatives, by declining to assign a definite label of either spam or ham to mails that are difficult to classify [14].

3.2 POPFile

POPFile was developed by John Graham-Cumming and initially released in September 2002 [17]. Just as SpamBayes relies extensively on the Naïve Bayes Classifier to sort e-mails into spam, ham, or unsure, POPFile uses the Naïve Bayes Classifier for its filtering mechanism. POPFile is, however, not a spam filter in the strict sense. It is rather a general e-mail classifier that can classify e-mails as spam or non-spam or under whatever classification users define [11]. Users can create an arbitrary number of classification labels. For example, one can create labels such as Business for job-related e-mails, School for school-related e-mails, and Personal for e-mails personally addressed to him or her, along with Spam and Non-Spam labels. POPFile is more flexible than SpamBayes in this sense. However, POPFile does not have a mechanism similar to the spam score used by SpamBayes to better handle difficult-to-classify e-mails.
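The score-to-label rule that SpamBayes applies to its spam score, and that POPFile lacks, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `label_message` is hypothetical, the cutoffs 0.20 and 0.90 are the ones quoted above, and the treatment of scores exactly equal to a cutoff (counted here as unsure) is an assumption, since the text does not specify it.

```python
def label_message(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0, 1] to 'spam', 'ham', or 'unsure'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score > spam_cutoff:
        return "spam"
    if score < ham_cutoff:
        return "ham"
    # Everything between the two cutoffs is left for the user to decide.
    return "unsure"
```

Because the spam cutoff sits much closer to 1 than the ham cutoff sits to 0, a message needs stronger evidence to be labeled spam than to be labeled ham, which is the asymmetry the section describes.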
The tendency observed by Meyer and Whateley for the Naïve Bayes Classifier to return polarized results (either strongly spam or strongly non-spam) [14] may not be specifically addressed in POPFile, given that POPFile is designed to accommodate an arbitrary number of user-defined classification labels. POPFile does not give particular importance to spam filtering; it tries to filter e-mails into all the user-defined labels equally well. Even though POPFile has a default unclassified label, this label is provided for cases in which classification under any of the user-defined labels is difficult. That is, the unclassified label is not specifically given to cases that lie in the middle of the spectrum between spam and non-spam. It may seem, therefore, that POPFile is less optimally configured for the sole purpose of spam filtering. The results of comparative experiments are presented in the next section to examine this issue.

3.3 Experimental Results

First, to measure the overall effectiveness of Bayesian spam filters, both SpamBayes and POPFile are tested for classification accuracy with 150 e-mails taken from the corpus of spam and non-spam e-mails provided by SpamAssassin [7]. Out of the 150 e-mails, 72 are spam and 78 are non-spam. The test results are summarized in Table 3.1, Table 3.2, and Table 3.3.

| | Spam | Ham | Unsure/Unclassified | Total |
|---|---|---|---|---|
| Actual Test Set | 72 | 78 | - | 150 |
| SpamBayes | 48 | 77 | 25 | 150 |
| POPFile | 66 | 73 | 11 | 150 |

Table 3.1 Test Record. Count of e-mails classified as Spam, Non-Spam, or Unsure/Unclassified by SpamBayes and POPFile. The Actual Test Set row shows the real number of spam and non-spam e-mails used for this test.

| Test Result | SpamBayes: Spam (Positive) | SpamBayes: Non-Spam (Negative) | SpamBayes: Unsure | POPFile: Spam (Positive) | POPFile: Non-Spam (Negative) | POPFile: Unclassified |
|---|---|---|---|---|---|---|
| True | 47 | 71 | | 60 | 67 | |
| False | 1 | 6 | | 6 | 6 | |

Table 3.2 Breakdown of classification. The values in the True row are the numbers of correct classifications made by SpamBayes and POPFile for each label (Spam, Non-Spam, and Unsure). The values in the False row are the numbers of incorrect classifications, representing false positives and false negatives.

SpamBayes

| Test Result | Real Value: Spam | Real Value: Non-Spam |
|---|---|---|
| Positive (Spam) | 47 (True) | 1 (False) |
| Negative (Non-Spam) | 6 (False) | 71 (True) |
| Unsure | 19 | 6 |

POPFile

| Test Result | Real Value: Spam | Real Value: Non-Spam |
|---|---|---|
| Positive (Spam) | 60 (True) | 6 (False) |
| Negative (Non-Spam) | 6 (False) | 67 (True) |
| Unclassified | 6 | 5 |

Table 3.3 Test results in confusion matrices

| Test Result | Real Value: Spam | Real Value: Non-Spam |
|---|---|---|
| Spam | True Positive (TP) | False Positive (FP) |
| Non-Spam | False Negative (FN) | True Negative (TN) |

Table 3.4 Confusion matrix

Table 3.1 provides a quick look at the test results. This table does not in itself tell the accuracy of the classification made by either SpamBayes or POPFile. However, the table may indicate a tendency of SpamBayes to be more careful in labeling e-mails as spam. As discussed earlier, in order to reduce the number of false positives, SpamBayes assigns the unsure label to e-mails with spam scores between 0.20 and 0.90 [14]. Table 3.2 seems to support this interpretation. SpamBayes returns only 1 false positive (that is, a non-spam mail classified as spam) compared with 6 false negatives, while POPFile returns 6 false positives and 6 false negatives. The better performance of SpamBayes in terms of false positives is probably achieved by assigning the unsure label to the mails with scores between 0.20 and 0.90. POPFile, on the other hand, merely allocates the unclassified label to mails to which it cannot confidently assign either the spam or the non-spam label. In this sense, SpamBayes is clearly better configured for spam filtering tasks, although the better performance comes at the price of having to manually classify more unsure e-mails.

Table 3.3 presents the false positive and false negative cases in a confusion matrix for each spam filter. The row labels (Spam, Non-Spam) specify the classification made by the filter, and the column labels (Spam, Non-Spam) denote the actual classification. Table 3.4 shows the meaning of the value in each cell of the confusion matrix. In order to quantify the performance, three statistical measures traditionally used for classification quality are used: accuracy, precision, and recall.
Accuracy is defined as [18] (see Table 3.4 for the notation):

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

If the number of unsure/unclassified cases is taken into consideration:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN + US + UNS}$$

where US is the number of spam mails labeled by the filters as unsure/unclassified, and UNS is the number of non-spam mails labeled as unsure/unclassified. In other words, accuracy is the number of correctly classified e-mails divided by the total number of e-mails used for the test (either including or excluding the unsure/unclassified e-mails). Simply put, it measures how well the filters correctly label e-mails.

Precision is a measure of the exactness of classification. It is the number of correctly identified e-mails belonging to a class divided by the total number of e-mails identified, both correctly and incorrectly, as belonging to that class [19]. In terms of spam classification:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Since this test is not a binary classification test in the pure sense (there are three classes: spam, non-spam, and unsure/unclassified), calculating the precision for non-spam classification also seems warranted. The formula can be expressed as:

$$\text{Precision (negative)} = \frac{TN}{TN + FN}$$

Recall measures the completeness of classification. It is the number of correctly identified e-mails belonging to a class divided by the total number of e-mails that actually belong to that class [19]. In terms of spam classification and non-spam classification, respectively:

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Recall (negative)} = \frac{TN}{TN + FP}$$

Table 3.5 below summarizes the test results in terms of accuracy, precision, and recall. In Table 3.5, the recall denominators are the actual class totals (72 spam, 78 non-spam), so e-mails labeled unsure/unclassified count against recall.

| Measure | SpamBayes | POPFile |
|---|---|---|
| Accuracy (w/o unsure/unclassified) | 118/125 = 0.94 | 127/139 = 0.91 |
| Accuracy (with unsure/unclassified) | 118/150 = 0.79 | 127/150 = 0.85 |

| Measure | SpamBayes: Spam | SpamBayes: Non-Spam | POPFile: Spam | POPFile: Non-Spam |
|---|---|---|---|---|
| Precision | 47/48 = 0.98 | 71/77 = 0.92 | 60/66 = 0.91 | 67/73 = 0.92 |
| Recall | 47/72 = 0.65 | 71/78 = 0.91 | 60/72 = 0.83 | 67/78 = 0.86 |

Table 3.5 Test results in accuracy, precision, and recall

Table 3.5 also reveals the difference in the approaches that SpamBayes and POPFile take to classification. The very high precision and the relatively low recall of SpamBayes for spam classification show that the filter achieves high spam precision at the expense of spam recall. That is, to classify non-spam e-mails as spam as rarely as possible, it labels difficult e-mails as unsure instead of risking a false positive.
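The measures defined above can be recomputed directly from the confusion matrices in Table 3.3. The following is a minimal sketch; the function name `metrics` and the dictionary keys are hypothetical, and the recall denominators include the unsure/unclassified e-mails so that the results match the fractions shown in Table 3.5.

```python
def metrics(tp, fp, fn, tn, unsure_spam=0, unsure_ham=0):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    decided = tp + fp + fn + tn
    total = decided + unsure_spam + unsure_ham
    return {
        "accuracy_decided": (tp + tn) / decided,
        "accuracy_all": (tp + tn) / total,
        "precision_spam": tp / (tp + fp),
        "precision_ham": tn / (tn + fn),
        # Recall is taken over every message that truly belongs to the
        # class, so unsure messages lower it (as in Table 3.5).
        "recall_spam": tp / (tp + fn + unsure_spam),
        "recall_ham": tn / (tn + fp + unsure_ham),
    }


# Counts taken from Table 3.3.
spambayes = metrics(tp=47, fp=1, fn=6, tn=71, unsure_spam=19, unsure_ham=6)
popfile = metrics(tp=60, fp=6, fn=6, tn=67, unsure_spam=6, unsure_ham=5)
```

Comparing `spambayes["precision_spam"]` with `spambayes["recall_spam"]` makes the trade-off discussed above concrete: precision is close to 1 while recall is roughly two thirds.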

With POPFile, on the other hand, the measures of precision and recall for spam and non-spam are very similar to each other. This may be because, as discussed in the previous section, POPFile does not particularly try to reduce the number of false positives in the spam filtering sense. In this test, it happens to have one label called spam and another called non-spam, but POPFile simply tries to classify each e-mail under one of those categories, and if assigning either label is statistically difficult, it gives the mail the unclassified label as a last resort.

3.4 Evaluation of Bayesian Spam Filters

Given the experimental results and the statistical performance measures, it seems reasonable to say that spam filters based on the Naïve Bayes Classifier are effective in distinguishing spam e-mails from non-spam e-mails. The test used only 150 e-mails, but the accuracy measures steadily increased throughout the test period. As more e-mails are processed by the filters, the accuracy can be expected to keep increasing until it reaches the level achievable by the classification model used. According to Yerazunis [20], over twenty Bayesian spam filters available on Sourceforge.net [21] in 2004 were reported to have demonstrable accuracies on the order of 99.9%.

So, why are Bayesian spam filters so effective? One major reason is that they continuously adjust the spam/non-spam probability values while processing new documents along with their corresponding, user-verified labels. This means that the filtering system can improve its accuracy by continuously adjusting its accumulated knowledge (statistics) of what is considered to be spam and what is not, as more e-mails with corresponding labels are fed into the system.
As soon as the system has gained substantial knowledge of spam statistics, it can scan new unlabeled e-mails and state the probabilities of their being spam or non-spam with great accuracy.

Another reason for the strong performance of Bayesian spam filters is that most of those available to the public are personally trained by individual users to better filter their own unique set of e-mails. Bayesian spam filters can be, and are, used on the server side, or incorporated in server software that handles multiple users [2] [22], but many personal users who adopt spam filters such as SpamBayes and POPFile install them as proxies that sit between their mail servers and their e-mail client software, such as Microsoft Outlook and Mozilla Thunderbird [11] [12]. Their filters are therefore used only for their personal set of e-mails. As a result, the filters can effectively fine-tune themselves for better performance.
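The continuous adjustment described above can be illustrated with a small sketch of how word statistics might be updated each time a user confirms or corrects a label. The class `WordStats` and its method names are hypothetical and do not correspond to the internals of SpamBayes or POPFile; the sketch only shows that learning and correcting amount to incrementing and decrementing per-class counts.

```python
from collections import Counter


class WordStats:
    """Per-class word counts that are updated as labeled mail arrives."""

    def __init__(self):
        self.message_counts = Counter()
        self.word_counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, words, label):
        """Add one user-verified message to the statistics."""
        self.message_counts[label] += 1
        self.word_counts[label].update(set(words))

    def unlearn(self, words, label):
        """Remove a message that was previously learned under `label`."""
        self.message_counts[label] -= 1
        self.word_counts[label].subtract(set(words))

    def correct(self, words, wrong_label, right_label):
        """Move a misclassified message to the label the user chose."""
        self.unlearn(words, wrong_label)
        self.learn(words, right_label)

    def spam_ratio(self, word):
        """Fraction of messages containing `word` that were spam, or None if unseen."""
        spam = self.word_counts["spam"][word]
        ham = self.word_counts["ham"][word]
        return spam / (spam + ham) if spam + ham else None
```

Because each user feeds the filter only his or her own mail, the counts converge on that user's vocabulary, which is the personalization effect noted above.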

4. Vulnerabilities

As powerful and effective as Bayesian spam filters can be, a number of attacks designed to defeat their filtering mechanisms have been proposed. Many of those attacks exploit the very strength of Bayesian spam filters, their adaptive learning mechanism, and the way they treat an e-mail message as a bag of words [6], that is, as a mere collection of words, disregarding grammar and word order [23]. Such attacks are called Bayesian poisoning [6] [24]. Bayesian poisoning tries to confuse Bayesian spam filters into accepting spam e-mails as legitimate by inserting into the spam mails (or poisoning the mails with) random words and/or words that do not usually appear in spam mails [24].

The creator of POPFile, John Graham-Cumming [11], presented at the Spam Conference held at MIT in 2004 [24] a very interesting way of attacking Bayesian spam filters using Bayesian spam filters [25]. He collected a set of spam mails that had slipped past his well-trained POPFile. Then, after inserting five random words into each spam mail, he sent the spam mails to target e-mail clients protected by Bayesian spam filters. Using so-called web bugs, he collected data on the success and failure of the spam in reaching the clients' inboxes. With the feedback data sent by the web bugs from the target clients, he trained another POPFile, to be used only for the purpose of attacking other Bayesian spam filters. With this POPFile trained specifically for the attacks, he could gather a collection of words, which he calls kryptonite, that can, when inserted into spam mails, cause the target Bayesian filters to accept the spam mails as non-spam [25]. Graham-Cumming's attack may be effective, but it is probably not practical in the real world, as he himself and Stern et al. point out: sending and keeping track of millions, or even trillions, of spam mails to train the attacking Bayesian filters would most likely be infeasible [6] [25].
The attack proposed by Stern et al. takes a different approach to making Bayesian spam filters less effective. It tries to degrade the accuracy of the spam filters by causing them to frequently classify legitimate mails as spam [6]. The attack is carried out by inserting into spam mails a large number of the words most commonly used in regular conversation. If the target spam filters are continuously fed these maliciously processed mails, they are expected to start associating the commonly used words with spam mails and to begin classifying legitimate mails as spam. For this attack to really work in practice, however, there needs to be strong coordination among all the major spam senders to adopt the same tactics, so that many target filters among the general public can be effectively mistrained.

There is, however, a so-called prisoner's dilemma associated with this attack strategy. If all the spammers abide by the same tactics, everybody will benefit by effectively incapacitating Bayesian spam filters. If, however, any one of them chooses to deviate from the mass-orchestrated tactics, that spammer has a greater chance of devising spam messages that can actually get past the spam filters of the general public. There is therefore a strong incentive for spammers to break away from this consortium of spam attacks, which makes it difficult to coordinate such attacks among all the major spam senders [6].
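The effect that word insertion has on a bag-of-words score can be illustrated with a toy calculation. This is only a sketch: the per-word likelihoods in `LIKELIHOODS` are invented for illustration and are not measurements from any filter or from the experiments in this paper, and the function name `spam_score` is hypothetical.

```python
import math

# Hypothetical (P(word | spam), P(word | ham)) pairs, invented for illustration.
LIKELIHOODS = {
    "viagra": (0.30, 0.001),
    "discount": (0.20, 0.02),
    "meeting": (0.005, 0.15),
    "agenda": (0.002, 0.10),
    "thanks": (0.01, 0.20),
}


def spam_score(words, prior_spam=0.5):
    """Naive Bayes posterior P(spam | words) over the known words only."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for word in words:
        if word in LIKELIHOODS:
            p_spam, p_ham = LIKELIHOODS[word]
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


original = ["viagra", "discount"]
# The same message padded with words that mostly appear in legitimate mail.
poisoned = original + ["meeting", "agenda", "thanks"]
```

With these invented numbers, `spam_score(original)` is above 0.99 while `spam_score(poisoned)` falls below 0.2. Word order and grammar play no part in the score, which is why padding alone is enough to move it.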

5. New Trends in Attacks against Bayesian Spam Filters

The attacks described in the previous section seem to be effective only in controlled experimental settings, but in the real world spammers have recently been adopting new methods to get their messages past spam filters.

5.1 Image Spam

One of the most popular methods relies on image spam, in which spammers embed their text messages in images so that Bayesian spam filters cannot effectively analyze the content of the mails for spam classification [26]. Below is an example of image spam, taken from the Image Spam Dataset [27].

Figure 5.1 Actual image spam [27]

Bayesian spam filters cannot effectively classify mails such as the one above as spam because the messages appear to the filters only as images, and the text in the images cannot be recognized, let alone analyzed, for spam detection.

To counteract image spam, researchers have developed two basic methods: fingerprinting and Optical Character Recognition (OCR). Fingerprinting uses an MD5 checksum to convert images into hash values and compares them with the hash values of images known to be spam or non-spam. OCR captures the text embedded in images and transforms it into plain text. The plain text is then analyzed by traditional Bayesian filters for spam classification [28].
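Fingerprinting as described above can be sketched with Python's standard `hashlib` module. The function names `fingerprint` and `is_known_spam_image` are hypothetical, and the set of known spam fingerprints is assumed to be maintained elsewhere; how real filters store and share such fingerprints is not covered by the text.

```python
import hashlib


def fingerprint(image_bytes):
    """Return the MD5 checksum of the raw image bytes as a hex string."""
    return hashlib.md5(image_bytes).hexdigest()


def is_known_spam_image(image_bytes, known_spam_fingerprints):
    """Check an attached image against fingerprints of known spam images."""
    return fingerprint(image_bytes) in known_spam_fingerprints
```

The lookup is an exact match, so it recognizes only images that are byte-for-byte identical to ones already seen.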

5.2 Image Spam with Content Obfuscation Techniques

Fingerprinting and OCR have provided researchers with tools to analyze image-based spam mails in a way that Bayesian spam filters cannot. However, spammers can easily devise image spam that escapes detection by fingerprinting and OCR. To make fingerprinting ineffective, a simple change in the image, such as modifying a few pixels, is all that is needed, because it causes the hash value to change [28]. The image below (Figure 5.2) shows an example of a tiny change in the image that alters the fingerprint.

Figure 5.2 The same image as Figure 5.1 with a speckle added in the lower left corner.

To defeat OCR detection, spammers can use a number of content obfuscation techniques. Figure 5.3 shows four different techniques that make it difficult for OCR to read the message [26]. Figure 5.4 shows an example of image spam applying different techniques [28].

Figure 5.3 Content obfuscation techniques used to confuse OCR [26]

Figure 5.4 Image spam that is difficult for OCR to read [28]

Current OCR technology is not sophisticated enough to accurately detect and read text embedded in images in such a way. As one approach to this issue, Biggio et al. [26] propose a method that detects a number of known image spam characteristics (such as the use of many different colors or of particular colors, or detectable patterns that result from well-known anti-OCR techniques) and, based on the presence of those characteristics, determines whether or not the e-mails are image spam. This approach does not rely on OCR to classify e-mails as image spam or not. Rather than trying to read the text embedded in images, it tries to determine whether the images in the e-mails fit the characteristics of image spam. However, the method is designed to complement, not replace, OCR in detecting image spam, and reading the text embedded in images remains just as important for better judgment in spam classification [26].
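The weakness of fingerprinting noted at the start of this section, where a change to a few pixels alters the hash value, can be demonstrated with a short sketch. The byte string below is only a stand-in for raw pixel data, and the helper names `md5_hex` and `flip_one_byte` are hypothetical.

```python
import hashlib


def md5_hex(data):
    """Return the MD5 checksum of `data` as a hex string."""
    return hashlib.md5(data).hexdigest()


def flip_one_byte(data, index):
    """Return a copy of `data` with one bit of one byte changed,
    standing in for a single altered pixel (a speckle)."""
    changed = bytearray(data)
    changed[index] ^= 0x01
    return bytes(changed)


image = bytes(range(256)) * 4          # stand-in for raw pixel data
speckled = flip_one_byte(image, 10)    # differs from `image` in one byte
```

Although `image` and `speckled` differ in a single byte, `md5_hex(image)` and `md5_hex(speckled)` do not match, so a fingerprint list built from the first image does not recognize the second.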

6. Other Spam Filtering Strategies

A number of other spam filtering strategies and techniques are currently being researched, developed, and/or used to combat spam mails, which have steadily been growing in number and sophistication. This section briefly reviews an e-mail header verification approach proposed by Trevino et al. [29], behavioral blacklisting by Ramachandran et al. [30], server reputation management by Mujica [31], and Spamlet, a system designed by Dallmeyer et al. to obstruct spammers' operations [32].

Header verification techniques have been around since the early stages of spam filtering development, but Trevino et al. [29] claim that they are still effective and consume few mail server resources. Header verification techniques read and analyze only the header part of e-mails, which contains, among other things, information about the SMTP servers used by the senders of the mails. In the process, the identities of the SMTP servers are checked with the Domain Name System (DNS), and only the mails from senders whose SMTP identities are verified are allowed to pass through the filter. Since, unlike Bayesian spam filtering, this method requires no text reading for statistical word analysis, header verification should be able to handle image spam as well as it does non-image spam [29].

SpamTracker is a spam filtering system developed by Ramachandran et al. [30] that filters e-mails based on the sending behavior of the senders. Traditionally, the reputation of senders is checked by looking up their IP addresses in publicly accessible spam blacklists such as those provided by Spamhaus [33] and SpamCop [34]. SpamTracker, however, analyzes the sending behaviors and patterns of senders and uses a clustering algorithm to place them in either a spam sender group or a non-spam sender group. Among the sending patterns analyzed by SpamTracker are the domains the senders target and the methods and processes used to target those domains.
The experiments conducted by Ramachandran et al. show that this technique works because senders of spam mails do cluster together in behavioral analysis, and they can be distinguished from those who do not send spam mails [30].

In the paper Reputation Management for All Servers [31], Mujica is less optimistic about the capabilities of current spam filters and explains the great importance of reputation management on the part of all e-mail servers. If all legitimate e-mail servers manage their reputation and have themselves listed on white lists, the problem of spam mails can be handled more easily and effectively. He proposes a freely available reputation management system (RM system) for all mail servers, which offers XML web services for queries about IP address reputation and a better web interface for reputation management. It also sets out a framework for reputation scoring processes [31].

Dallmeyer et al. [32] take a different approach to fighting spam mails. They developed a system called Spamlet, which interacts with spammers without human management by replying to spam messages. It relies on an artificial intelligence algorithm and, by engaging spammers in fake transactions, tries to consume the spammers' time and resources.
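Of the strategies in this section, header verification lends itself most readily to a short sketch. The code below is a rough approximation and not the procedure of Trevino et al. [29], whose exact checks are not described here: it assumes that the relay host names can be read from the "from" clause of the Received headers and that a successful DNS lookup counts as verification. The function names are hypothetical, and the resolver is passed in as a parameter so that the check can be exercised without network access.

```python
import re
import socket
from email import message_from_string


def claimed_relay_hosts(raw_message):
    """Return the host names announced in the Received headers, newest first."""
    msg = message_from_string(raw_message)
    hosts = []
    for header in msg.get_all("Received", []):
        match = re.match(r"\s*from\s+(\S+)", header)
        if match:
            hosts.append(match.group(1).lower())
    return hosts


def dns_resolves(host):
    """Return True if the host name resolves to an address (needs network)."""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:
        return False


def passes_header_check(raw_message, resolver=dns_resolves):
    """Accept a message only if it names relays and all of them resolve."""
    hosts = claimed_relay_hosts(raw_message)
    return bool(hosts) and all(resolver(host) for host in hosts)
```

Only the headers are inspected, so the body, whether text or image, never has to be parsed, which is the property that makes the approach attractive against image spam.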

7. Conclusion

This paper studied the mechanisms and effectiveness of Bayesian spam filters. The reports cited and the experiments conducted indicate that Bayesian spam filters are highly effective when classifying emails based on the text content of the mails. Since messages sent by spammers tend to differ from the legitimate messages that individuals receive from friends, business colleagues, and so on, the statistical analysis of word usage in emails seems not just effective, but also relevant. However, as researchers and developers improve the performance of spam filters, spammers devise new methods to get past them. Image spam is a prominent example of a method developed to defeat spam filters, especially Bayesian spam filters.

Today, the trend in spam filtering seems to be moving toward the combination and integration of many different filtering techniques to better handle a variety of ever-changing spam tactics. At the same time, new and existing techniques are continuously researched and developed individually to improve spam filtering performance. As long as spam messages involve text, whether it is embedded in images or even in audio or video, text analysis is likely to remain important and relevant for spam classification. If so, the Bayesian spam filtering technique will continue to be a vital part of spam detection mechanisms for the foreseeable future.

References:

[1] "Bayes' Theorem", [online] 2008, (Accessed: October 2008)
[2] "Bayesian Spam Filter", [online] 2008, (Accessed: October 2008)
[3] Cormac O'Brien and Carl Vogel, Spam filters: Bayes vs. chi-squared; letters vs. words, Technical Report TCD-CS, Trinity College Dublin, April
[4] "Naive Bayes Classifier", [online] 2008, (Accessed: October 2008)
[5] Trevor Stone, Parameterization of Naive Bayes for Spam Filtering, University of Colorado at Boulder
[6] Henry Stern, Justin Mason, and Michael Shepherd, A Linguistics-Based Attack on Personalized Statistical Classifiers, Dalhousie University, March 25
[7] SpamAssassin
[8] McAfee SpamKiller
[9] Mozilla Thunderbird
[10] Apple Mail
[11] POPFile
[12] SpamBayes
[13] Paul Graham, A Plan for Spam, [online] 2002, (Accessed: November 2008)
[14] T. A. Meyer and B. Whateley, SpamBayes: Effective Open-Source, Bayesian Based, Classification System, Conference on Email and Anti-Spam (CEAS 04)
[15] Type I and Type II Errors, [online] (Accessed: November 2008)
[16] Chi-Square Test, [online]
[17] POPFile, [online]
[18] Accuracy and Precision, [online] (Accessed: November 2008)
[19] Precision and Recall, [online] (Accessed: November 2008)
[20] William S. Yerazunis, The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It, 2004 MIT Spam Conference, 2004
[21] Sourceforge.net
[22] GFI MailEssentials
[23] Bag of Words Model, [online]
[24] Bayesian Poisoning, [online]
[25] John Graham-Cumming, How to Beat an Adaptive Spam Filter, Spam Conference 2004
[26] Battista Biggio, Giorgio Fumera, Ignazio Pillai, and Fabio Roli, Image Spam Filtering by Content Obscuring Detection, CEAS 2007 Fourth Conference on Email and Anti-Spam, August 2-3
[27] Image Spam Dataset, [online]
[28] Image Based Spam, Red Condor, Inc, 2007
[29] Alberto Trevino and J. J. Ekstrom, Spam Filtering Through Header Relay Detection, Brigham Young University, 2007
[30] Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, Filtering Spam with Behavioral Blacklisting, ACM, 2007
[31] Alberto Mujica, Reputation Management for All Servers, Reputation Technologies Inc.
[32] Kenneth P. Dallmeyer, Peter C. Nelson, Elias D. Block, and Brandon R. Elvidge, Return to Spamlet, Artificial Intelligence Laboratory, University of Illinois at Chicago
[33] Spamhaus
[34] SpamCop


Global Journal of Engineering Science and Research Management ADVANCED K-MEANS ALGORITHM FOR BRAIN TUMOR DETECTION USING NAIVE BAYES CLASSIFIER Veena Bai K*, Dr. Niharika Kumar * MTech CSE, Department of Computer Science and Engineering, B.N.M. Institute of Technology,

More information

Objectives CINS/F1-01

Objectives CINS/F1-01 Email Security (1) Objectives Understand how e-mail systems operate over networks. Classify the threats to the security of e-mail. Study how S/MIME and PGP can be used to add security to e-mail systems.

More information

Text Classification. Dr. Johan Hagelbäck.

Text Classification. Dr. Johan Hagelbäck. Text Classification Dr. Johan Hagelbäck johan.hagelback@lnu.se http://aiguy.org Document Classification A very common machine learning problem is to classify a document based on its text contents We use

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

Basic Concepts in Intrusion Detection

Basic Concepts in Intrusion Detection Technology Technical Information Services Security Engineering Roma, L Università Roma Tor Vergata, 23 Aprile 2007 Basic Concepts in Intrusion Detection JOVAN GOLIĆ Outline 2 Introduction Classification

More information

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Naïve Bayes Classification Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Things We d Like to Do Spam Classification Given an email, predict

More information

to Stay Out of the Spam Folder

to Stay Out of the Spam Folder Tips and Tricks to Stay Out of the Spam Folder At SendGrid we are very serious about email deliverability. We live and breathe it each day. Similar to how Google keeps adjusting its search algorithm to

More information

A New Measure of the Cluster Hypothesis

A New Measure of the Cluster Hypothesis A New Measure of the Cluster Hypothesis Mark D. Smucker 1 and James Allan 2 1 Department of Management Sciences University of Waterloo 2 Center for Intelligent Information Retrieval Department of Computer

More information

This module presents the star schema, an alternative to 3NF schemas intended for analytical databases.

This module presents the star schema, an alternative to 3NF schemas intended for analytical databases. Topic 3.3: Star Schema Design This module presents the star schema, an alternative to 3NF schemas intended for analytical databases. Star Schema Overview The star schema is a simple database architecture

More information

Spam Protection Guide

Spam  Protection Guide Spam Email Protection Guide Version 1.0 Last Modified 5/29/2014 by Mike Copening Contents Overview of Spam at RTS... 1 Types of Spam... 1 Spam Tricks... 2 Imitation of 3 rd Party Email Template... 2 Spoofed

More information

Panda Security 2010 Page 1

Panda Security 2010 Page 1 Panda Security 2010 Page 1 Executive Summary The malware economy is flourishing and affecting both consumers and businesses of all sizes. The reality is that cybercrime is growing exponentially in frequency

More information

How to Stay Compliant with SMS Marketing

How to Stay Compliant with SMS Marketing How to Stay Compliant with SMS Marketing Ensure your text campaigns deliver value to customers and keep your business secure GREAT TIPS INSIDE Even legitimate marketers can fall foul of mobile spamming,

More information