
University of Houston Clear Lake
School of Science & Computer Engineering

Project Report

Prepared for: Dr. Liwen Shih
Prepared by: Joseph Hayes
April 17, 2008
Course Number: CSCI 5634.01

University of Houston Clear Lake - School of Science & Computer Engineering
2700 Bay Area Blvd, Houston, Texas 77058
T 281.283.3700

Spam Filtering: An Alternative Approach

Table of Contents

I. Abstract - Brief overview of the spam filtering problem statement, the approach taken, and the results and findings
II. Introduction - Background and history of spam filtering, and a brief highlight of the proposed techniques
III. Objectives - Goals of the project, in terms of techniques and desired results in quantitative terms
IV. Methods and Procedures - In-depth explanation of the methods involved in the spam filtering of the project
V. Results and Discussion - Results of the above techniques and discussion of the outcome
VI. References - Sources referencing the material described in the introduction and other portions of the project
VII. Appendices - Project poster, data, etc.

Abstract

Due to the ever-increasing amount of unsolicited email, commonly referred to as spam, techniques have arisen for combating such messages. Although prevalent, Bayesian filters often misclassify legitimate email. We provide a supervised neural network approach to the filtering problem. This technique shows promising results, with zero misclassifications on a corpus of moderate size.

Introduction

Effective spam filtering has long been a problem. The difficulty lies in the need for very small false rejection rates, as the misclassification of a valid email is rarely tolerable. Because senders of unsolicited email ("spammers") very often disguise their messages to appear as valid correspondence, perfect filtering is impossible. Borrowing terminology from the field of biometrics, it would be possible to calculate the equal error rate of a given test set. From this value, we could produce a system capable of balancing the false rejection rate with the false acceptance rate [1].
However, this approach would typically yield unacceptable results, as false rejections come at a much higher cost to the user: missing an important email usually outweighs the improper acceptance of a particular spam message. Therefore, a virtually non-existent false rejection rate is necessary, so we instead focus on reducing the false acceptance rate as much as possible while maintaining a minimal false rejection rate.

Traditional spam filters use Bayesian techniques; that is, statistical methods based on Bayes' Theorem. As described by Obied, the probability that an email message belongs to a particular class can be calculated using Bayes' Theorem as follows:

P(C | f_1, ..., f_n) = P(C) * Π_i P(f_i | C) / P(f_1, ..., f_n)

where each class C represents either the class of spam messages or the class of non-spam messages, and each f_i represents a particular feature used in classifying email messages [2]. Most Bayesian spam filters assume no a priori knowledge; initial probabilities are unknown. Through training, these filters can learn to differentiate effectively between spam and non-spam messages.

An alternative to Bayesian techniques, as suggested by this paper, involves machine learning through a supervised neural network. Other machine learning techniques have recently been attempted, including variants of k-Nearest-Neighbor (k-NN) classifiers [3]. An advantage of these techniques is that they commonly make use of an adjustable threshold, allowing the sensitivity of the filter to be modified at will. The approach presented here does not provide such options; rather, we seek to show that our method performs filtering with enough precision to eliminate the need for such adjustments.

For practically any spam filtering scheme, we must build and maintain databases containing the frequency with which specific words typically appear in both spam and ham (legitimate) messages. One problem is determining how to store and search for these words efficiently. A common practice, frequently used in Bayesian filtering, is to maintain two separate hash tables, one for each classification (spam or ham). Each entry in a hash table is a key-value pair: a specific word paired with the number of occurrences (for that classification) found during the training phase.
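To make the formula above concrete, the following sketch applies it to two word-frequency tables. The counts, words, and helper names are all illustrative (they are not taken from the project's databases), and add-one smoothing is assumed so that unseen words do not zero out the product:

```python
from collections import Counter

# Hypothetical word-frequency tables; in the report these are hash tables
# built during preprocessing, one per classification. Counts are made up.
spam_counts = Counter({"viagra": 50, "free": 30, "meeting": 2})
ham_counts = Counter({"meeting": 40, "report": 25, "free": 5})

def class_probability(words, counts, prior):
    """Naive-Bayes style score: P(C) * prod_i P(f_i | C), using add-one
    smoothing so a word absent from the table does not zero the product."""
    total = sum(counts.values())
    vocab = len(set(spam_counts) | set(ham_counts))
    p = prior
    for w in words:
        p *= (counts[w] + 1) / (total + vocab)
    return p

def classify(message):
    # Compare the (unnormalized) posterior for each class; the shared
    # denominator P(f_1, ..., f_n) cancels and can be ignored.
    words = message.lower().split()
    p_spam = class_probability(words, spam_counts, 0.5)
    p_ham = class_probability(words, ham_counts, 0.5)
    return "spam" if p_spam > p_ham else "ham"
```

With these toy tables, `classify("free viagra")` yields "spam" and `classify("meeting report")` yields "ham", since the denominator P(f_1, ..., f_n) is common to both classes and only the numerators need to be compared.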
Here, the term "training phase" refers to the training phase of Bayesian filters, which may continue even after the initial setup. For our purposes, we will consider this process part of the preprocessing, and will reserve the term "training" to refer explicitly to the training of the neural network itself. Once these databases have been built, either from publicly available corpora or from past personal emails, the neural network can be trained.

To implement the supervised learning approach, we require access to an adequate number of properly classified email messages. From this corpus, we will extract a minimal number of features. For each message, we will determine a spam score and a ham score. Each word in the message is compared against both databases. If a particular word appears in a database, the frequency of the word (stored in the hash table of that database) is added to the appropriate spam or ham score for the message. Therefore, the spam score for a given message is directly related to the frequency with which the message's words appear in the spam database. The same is true for the ham score. Using only the spam score and the ham score of each email, we seek to provide effective classification of the messages.

Objectives

We attempt to show that the spam filtering problem can be simplified into a two-dimensional classification problem, requiring only the spam and ham word frequencies for each message. We seek to show that a neural network can perform the non-linear partitioning of this feature space. The purpose of this paper is not only to show that a neural network approach is feasible, but also to demonstrate the accuracy of such a method. We provide quantitative results supporting our claims.

Methods & Procedures

The first step in any spam filtering process is the acquisition of an adequately sized corpus. Although several corpora are publicly available, the majority of these packages do not differentiate between spam and legitimate messages [4]. Many other compilations include a large number of spam messages but no ham whatsoever. For this reason, the ham corpus used consisted primarily of personal emails collected over the last few months. The spam corpus used is freely and publicly available at http://spamassassin.apache.org/publiccorpus/ , with over 500 spam email messages included in the creation of our databases.

The ham corpus contained only ham email messages in the emlx format, requiring a conversion utility. A small program, written in AppleScript, was used to convert the email messages to UTF-8, the standard email format recommended by the Internet Mail Consortium (IMC) [5]. Once both sets of messages were stored in this common format, we proceeded. To create and maintain the hash tables necessary to store the respective ham and spam word counts, we modified an open-source Perl script. The script performed the database creation, saving the hash tables in long-term storage.
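The database-creation step can be sketched as follows. This is a re-sketch in Python, not the modified Perl script itself, and the function name, tokenization rule, and sample text are all assumptions for illustration:

```python
import re
from collections import Counter

def build_table(text):
    """Lower-case a combined message file and count word occurrences,
    producing one frequency table (hash table) per classification."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

# Illustrative stand-ins for the combined spam and ham files; the real
# runs processed 500 spam and 282 ham messages.
spam_table = build_table("Buy now! Limited offer. Buy cheap.")
ham_table = build_table("Project meeting notes attached; see notes.")
```

Each resulting table maps a word to its occurrence count for that classification, exactly the key-value structure described earlier; a word absent from a `Counter` simply reports a count of zero.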
All of the spam messages were combined into a single file, as were all of the ham messages. We proceeded by calling the script on each file, signifying the appropriate classification in each case. A total of 500 spam messages were placed in the database, resulting in 699,100 unique words. Similarly, a total of 282 ham messages resulted in the storage of 294,531 unique words. This process is shown in Appendix A.

After the hash tables were built, we again modified the Perl script to allow a list of files to be passed as an argument, together with a flag signifying the appropriate classification. For each file in the list, we calculated the spam and ham scores, as described in the previous section. These values, together with a boolean flag signifying the proper classification, were written to an output file. This task completed the preprocessing portion of the project.
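The per-message scoring step that produces each line of the output file can be sketched as follows; the tables, message, and function name are illustrative, not the project's actual data:

```python
from collections import Counter

def scores(message, spam_table, ham_table):
    """Sum the stored frequency of each word found in the spam and ham
    tables, yielding the two features used as neural network input."""
    words = message.lower().split()
    spam_score = sum(spam_table[w] for w in words)
    ham_score = sum(ham_table[w] for w in words)
    return spam_score, ham_score

# Tiny illustrative tables; real runs used the full databases.
spam_table = Counter({"free": 30, "offer": 12})
ham_table = Counter({"meeting": 40, "free": 5})

# One line of the preprocessing output: spam score, ham score, and the
# boolean flag giving the message's known classification.
row = (*scores("free offer today", spam_table, ham_table), True)
```

Here `row` evaluates to `(42, 5, True)`: "free" and "offer" contribute 30 + 12 to the spam score, "free" alone contributes 5 to the ham score, and "today" appears in neither table.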

Using only two values as input (the spam and ham scores), we expect the plot of these points to be non-linearly separable. This characteristic can be seen in the generalized Venn diagram below, and is further exemplified by the overlap present in the plot of the actual input values:

[Figure: distribution of ham score (roughly 55,000-120,000) versus spam score (roughly 55,000-300,000), showing the overlap between the spam and ham classes.]

To ensure correct classification, we must ensure that the type of network chosen is capable of non-linear separation. For this reason, a feed-forward neural network was chosen, using the backpropagation algorithm for training. To implement the neural network, the software package EasyNN-plus was used. The output file from the preprocessing stage was used as input to the network, after the ordering of the data was randomized. As an initial approach, a single hidden layer was used. The program was allowed to choose the most appropriate number of nodes in the hidden layer, resulting in a total of five nodes. Thus, the network topology became a 2-5-1 feed-forward network. A visualization of the network topology is given in Appendix B.
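The 2-5-1 topology and backpropagation training can be sketched as follows. This is a toy illustration only: the actual work used EasyNN-plus, and the XOR data below merely stands in for the (spam score, ham score) inputs as a simple non-linearly separable example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2 inputs -> 5 hidden nodes -> 1 output, matching the 2-5-1 topology.
W1, b1 = rng.normal(0, 1, (2, 5)), np.zeros(5)
W2, b2 = rng.normal(0, 1, (5, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1 + b1)        # hidden layer activations
    return h, sigmoid(h @ W2 + b2)  # single output node

# XOR: the classic case a single-layer (linear) network cannot separate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [0]], float)

loss_before = float(((forward(X)[1] - y) ** 2).mean())
lr = 0.8  # illustrative learning rate
for _ in range(5000):
    h, out = forward(X)
    d_out = (out - y) * out * (1 - out)  # output delta (sigmoid derivative)
    d_h = (d_out @ W2.T) * h * (1 - h)   # delta backpropagated to hidden layer
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)
loss_after = float(((forward(X)[1] - y) ** 2).mean())
```

After training, the mean squared error drops well below its initial value, confirming that the hidden layer lets the network carve a non-linear boundary.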

For the training, the learning rate was initially set to 0.80 and allowed both to decay and to be optimized. The momentum was set to zero, as there was no ordering of the data. We specified that the first 400 values should be used for training. We reserved 71 values for validation purposes, leaving 300 values for testing. Training took place over 27,916 epochs and took approximately 45 seconds. The network evaluated the validation examples correctly after only several hundred epochs; however, training was continued until all errors were below 0.5%. Although the error could be further reduced, the algorithm was stopped at this point to prevent overtraining. Once the training had completed, the network was queried against the testing data, and the results were output to a file for further analysis.

Results & Discussion

The neural network approach to spam filtering shows promising results. On the test set, all messages were properly classified: there were neither false positives nor false negatives, an ideal outcome. To generate these results, five nodes were required in the hidden layer. We can infer from this that the partitioning of the messages in the feature space is not trivial, and is decidedly non-linear. The network effectively classified the entire corpus of messages, verifying the plausibility of such a technique. Some of this information can be seen in Appendix C.

Although we were able to perform the classification on the given test set, a larger corpus needs to be tested to ensure the accuracy of the system. It also remains to be shown that the weights can be modified easily to accommodate new input, so that the network retains high precision as email message content evolves over time. This neural network filtering approach is not suggested as a stand-alone solution.
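The two error types discussed above can be counted directly from the network's classified output. The predictions and labels below are hypothetical stand-ins, not the project's actual test data:

```python
def confusion(preds, labels):
    """Return (false positives, false negatives): ham flagged as spam,
    and spam accepted as ham, respectively. True means 'spam'."""
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if l and not p)
    return fp, fn

# Hypothetical results for six test messages.
labels = [True, True, False, False, True, False]
preds  = [True, True, False, False, True, False]
fp, fn = confusion(preds, labels)  # a perfect run gives (0, 0)
```

As argued in the Introduction, false positives (legitimate mail rejected) are far more costly than false negatives, so the two counts should be weighted accordingly when tuning any filter.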
Other commonplace techniques, such as email blacklists, should first be used to evaluate candidate messages in order to improve the effectiveness of the overall system [6]. In this paper, we consider only a corpus of preprocessed messages, that is, messages that have already undergone initial filtering. We offer this approach as a replacement for the Bayesian filter commonly used as a component of an entire filtering system. It should also be noted that the proposed neural network does not perform feature extraction; rather, we depend on a separate preprocessing component for this task. In this paper, we perform the feature extraction somewhat manually, although the process could easily be automated by a separate unit.

References

[1] Author unknown, "About FAR, FRR, and EER," HumanScan, [Online Document], Mar. 2004, [cited 2008 April 17]. Available: http://www.bioid.com/sdk/docs/about_eer.htm
[2] A. Obied, "Bayesian Spam Filtering," Department of Computer Science, University of Calgary, 2007.
[3] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos, "A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists," Information Retrieval, vol. 6, pp. 48-73, 2003.
[4] G. Cormack and T. Lynam, "Spam Track Guidelines - TREC 2005-2007," [Online Document], Jul. 2007, [cited 2008 April 17]. Available: http://plg.uwaterloo.ca/~gvcormac/spam/
[5] P. Hoffman, "Display of Internationalized Mail Addresses Through Address Mapping," Internet Mail Consortium, [Online Document], Feb. 2003, [cited 2008 April 17]. Available: http://www.watersprings.org/pub/id/draft-hoffman-iea-headermap-00.txt
[6] J. Kong, B. Rezaei, N. Sarshar, V. Roychowdhury, and P. Boykin, "Collaborative Spam Filtering Using E-Mail Networks," Computer, vol. 39, no. 8, pp. 67-73, Aug. 2006.

Appendix A - Building Hash Tables

Appendix B - Network Topology

Appendix C - Analysis of Results