
University of Houston Clear Lake
School of Science & Computer Engineering

Project Report

Prepared for: Dr. Liwen Shih
Prepared by: Joseph Hayes
April 17, 2008
Course Number: CSCI 5634.01

University of Houston Clear Lake - School of Science & Computer Engineering
2700 Bay Area Blvd, Houston, Texas 77058
T 281.283.3700

Spam Filtering: An Alternative Approach

Table of Contents

I. Abstract - Brief overview of the spam filtering problem statement, the approach taken, and the results and findings
II. Introduction - Background and history of spam filtering, and a brief highlight of the proposed techniques
III. Objectives - Goals of the project, in terms of techniques and desired results in quantitative terms
IV. Methods and Procedures - In-depth explanation of the methods involved in the spam filtering of the project
V. Results and Discussion - Results of the above techniques and discussion of the outcome
VI. References - Sources referencing the material described in the introduction and other portions of the project
VII. Appendices - Project poster, data, etc.

Abstract

Due to the ever-increasing amount of unsolicited email, commonly referred to as spam, techniques have arisen for combating such messages. Although prevalent, Bayesian filters often misclassify legitimate email. We provide a supervised neural network approach to the filtering problem. This technique shows promising results, with zero misclassifications on a corpus of moderate size.

Introduction

Effective spam filtering has long been a problem. The difficulty lies in the need for very small false rejection rates, as the misclassification of a valid email is rarely tolerable. Because senders of unsolicited email ("spammers") very often disguise their messages to appear as valid correspondence, perfect filtering is impossible. Borrowing terminology from the field of biometrics, it would be possible to calculate the equal error rate of a given test set. From this value, we could produce a system capable of balancing the false rejection rate with the false acceptance rate [1].
However, this approach would typically yield unacceptable results, as false rejections come at a much higher cost to the user: missing an important email usually outweighs the improper acceptance of a particular spam message. Therefore, a virtually non-existent false rejection rate is necessary, so we instead focus on reducing the false acceptance rate as much as possible while maintaining a minimal false rejection rate.

Traditional spam filters use Bayesian techniques; that is, statistical methods based on Bayes' Theorem. As described by Obied, the probability that an email message belongs to a particular class can be calculated using Bayes' Theorem as follows:

P(C | f_1, ..., f_n) = P(C) * Π_i P(f_i | C) / P(f_1, ..., f_n)

where each class C represents either the class of spam messages or the class of non-spam messages, and each f_i represents a particular feature used in classifying email messages [2]. Most Bayesian spam filters assume no a priori knowledge; initial probabilities are unknown. Through training, these filters can learn to differentiate effectively between spam and non-spam messages.

An alternative to Bayesian techniques, as suggested by this paper, involves machine learning through a supervised neural network. Other machine learning techniques have recently been attempted, including variants of k-Nearest-Neighbor (k-NN) classifiers [3]. An advantage of these techniques is that they commonly make use of an adjustable threshold, allowing the sensitivity of the filter to be modified at will. The approach presented here does not provide such options; rather, we seek to show that our method performs filtering with enough precision to eliminate the need for such adjustments.

For practically any spam filtering scheme, we must build and maintain databases containing the frequency with which specific words typically appear in both spam and ham (legitimate) messages. One problem is determining how to store and search for these words efficiently. A common practice, frequently used in Bayesian filtering, is to maintain two separate hash tables, one for each classification (spam or ham). Each entry in a hash table is a key-value pair: a specific word paired with the number of occurrences (for that classification) found during the training phase.
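To make the formula above concrete, the following sketch applies it to two word-frequency tables. The counts, words, and helper names are all illustrative (they are not taken from the project's databases), and add-one smoothing is assumed so that unseen words do not zero out the product:

```python
from collections import Counter

# Hypothetical word-frequency tables; in the report these are hash tables
# built during preprocessing, one per classification. Counts are made up.
spam_counts = Counter({"viagra": 50, "free": 30, "meeting": 2})
ham_counts = Counter({"meeting": 40, "report": 25, "free": 5})

def class_probability(words, counts, prior):
    """Naive-Bayes style score: P(C) * prod_i P(f_i | C), using add-one
    smoothing so a word absent from the table does not zero the product."""
    total = sum(counts.values())
    vocab = len(set(spam_counts) | set(ham_counts))
    p = prior
    for w in words:
        p *= (counts[w] + 1) / (total + vocab)
    return p

def classify(message):
    # Compare the (unnormalized) posterior for each class; the shared
    # denominator P(f_1, ..., f_n) cancels and can be ignored.
    words = message.lower().split()
    p_spam = class_probability(words, spam_counts, 0.5)
    p_ham = class_probability(words, ham_counts, 0.5)
    return "spam" if p_spam > p_ham else "ham"
```

With these toy tables, `classify("free viagra")` yields "spam" and `classify("meeting report")` yields "ham", since the denominator P(f_1, ..., f_n) is common to both classes and only the numerators need to be compared.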
Here, the term "training phase" refers to the training phase of Bayesian filters, which may continue even after the initial setup. For our purposes, we will consider this process part of the preprocessing, and will reserve the term "training" to refer explicitly to the training of the neural network itself. Once these databases have been built, either from publicly available corpora or from past personal emails, the neural network can be trained.

To implement the supervised learning approach, we require access to an adequate number of properly classified email messages. From this corpus, we will extract a minimal number of features. For each message, we will determine a spam score and a ham score. Each word in the message is compared against both databases. If a particular word appears in a database, the frequency of the word (stored in the hash table of that database) is added to the appropriate spam or ham score for the message. Therefore, the spam score for a given message is directly related to the frequency with which the message's words appear in the spam database. The same is true for the ham score. Using only the spam score and the ham score of each email, we seek to provide effective classification of the messages.

Objectives

We attempt to show that the spam filtering problem can be simplified into a two-dimensional classification problem, requiring only the spam and ham word frequencies for each message. We seek to show that a neural network can perform the non-linear partitioning of this feature space. The purpose of this paper is not only to show that a neural network approach is feasible, but also to demonstrate the accuracy of such a method. We provide quantitative results supporting our claims.

Methods & Procedures

The first step in any spam filtering process is the acquisition of an adequately sized corpus. Although several corpora are publicly available, the majority of these packages do not differentiate between spam and legitimate messages [4]. Many other compilations include a large number of spam messages but no ham whatsoever. For this reason, the ham corpus used consisted primarily of personal emails collected over the last few months. The spam corpus used is freely and publicly available at http://spamassassin.apache.org/publiccorpus/ , with over 500 spam email messages included in the creation of our databases.

The ham corpus contained only ham email messages in the emlx format, requiring a conversion utility. A small program, written in AppleScript, was used to convert the email messages to UTF-8, the standard email format recommended by the Internet Mail Consortium (IMC) [5]. Once both sets of messages were stored in this common format, we proceeded. To create and maintain the hash tables necessary to store the respective ham and spam word counts, we modified an open-source Perl script. The script performed the database creation, saving the hash tables in long-term storage.
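The database-creation step can be sketched as follows. This is a re-sketch in Python, not the modified Perl script itself, and the function name, tokenization rule, and sample text are all assumptions for illustration:

```python
import re
from collections import Counter

def build_table(text):
    """Lower-case a combined message file and count word occurrences,
    producing one frequency table (hash table) per classification."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

# Illustrative stand-ins for the combined spam and ham files; the real
# runs processed 500 spam and 282 ham messages.
spam_table = build_table("Buy now! Limited offer. Buy cheap.")
ham_table = build_table("Project meeting notes attached; see notes.")
```

Each resulting table maps a word to its occurrence count for that classification, exactly the key-value structure described earlier; a word absent from a `Counter` simply reports a count of zero.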
All of the spam messages were combined into a single file, as were all of the ham messages. We proceeded by calling the script on each file, signifying the appropriate classification in each case. A total of 500 spam messages were placed in the database, resulting in 699,100 unique words. Similarly, a total of 282 ham messages resulted in the storage of 294,531 unique words. This process is shown in Appendix A.

After the hash tables were built, we again modified the Perl script to allow a list of files to be passed as an argument, together with a flag signifying the appropriate classification. For each file in the list, we calculated the spam and ham scores, as described in the previous section. These values, together with a boolean flag signifying the proper classification, were written to an output file. This task completed the preprocessing portion of the project.
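The per-message scoring step that produces each line of the output file can be sketched as follows; the tables, message, and function name are illustrative, not the project's actual data:

```python
from collections import Counter

def scores(message, spam_table, ham_table):
    """Sum the stored frequency of each word found in the spam and ham
    tables, yielding the two features used as neural network input."""
    words = message.lower().split()
    spam_score = sum(spam_table[w] for w in words)
    ham_score = sum(ham_table[w] for w in words)
    return spam_score, ham_score

# Tiny illustrative tables; real runs used the full databases.
spam_table = Counter({"free": 30, "offer": 12})
ham_table = Counter({"meeting": 40, "free": 5})

# One line of the preprocessing output: spam score, ham score, and the
# boolean flag giving the message's known classification.
row = (*scores("free offer today", spam_table, ham_table), True)
```

Here `row` evaluates to `(42, 5, True)`: "free" and "offer" contribute 30 + 12 to the spam score, "free" alone contributes 5 to the ham score, and "today" appears in neither table.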

Using only two values as input (the spam and ham scores), we expect the plot of these points to be non-linearly separable. This characteristic can be seen in the generalized Venn diagram below, and is further exemplified by the overlap present in the plot of the actual input values:

[Figure: distribution of ham score (roughly 55,000-120,000) versus spam score (roughly 55,000-300,000), showing the overlap between the spam and ham classes.]

To ensure correct classification, we must ensure that the type of network chosen is capable of non-linear separation. For this reason, a feed-forward neural network was chosen, using the backpropagation algorithm for training. To implement the neural network, the software package EasyNN-plus was used. The output file from the preprocessing stage was used as input to the network, after the ordering of the data was randomized. As an initial approach, a single hidden layer was used. The program was allowed to choose the most appropriate number of nodes in the hidden layer, resulting in a total of five nodes. Thus, the network topology became a 2-5-1 feed-forward network. A visualization of the network topology is given in Appendix B.
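The 2-5-1 topology and backpropagation training can be sketched as follows. This is a toy illustration only: the actual work used EasyNN-plus, and the XOR data below merely stands in for the (spam score, ham score) inputs as a simple non-linearly separable example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2 inputs -> 5 hidden nodes -> 1 output, matching the 2-5-1 topology.
W1, b1 = rng.normal(0, 1, (2, 5)), np.zeros(5)
W2, b2 = rng.normal(0, 1, (5, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1 + b1)        # hidden layer activations
    return h, sigmoid(h @ W2 + b2)  # single output node

# XOR: the classic case a single-layer (linear) network cannot separate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [0]], float)

loss_before = float(((forward(X)[1] - y) ** 2).mean())
lr = 0.8  # illustrative learning rate
for _ in range(5000):
    h, out = forward(X)
    d_out = (out - y) * out * (1 - out)  # output delta (sigmoid derivative)
    d_h = (d_out @ W2.T) * h * (1 - h)   # delta backpropagated to hidden layer
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)
loss_after = float(((forward(X)[1] - y) ** 2).mean())
```

After training, the mean squared error drops well below its initial value, confirming that the hidden layer lets the network carve a non-linear boundary.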

For the training, the learning rate was initially set to 0.80 and allowed both to decay and to be optimized. The momentum was set to zero, as there was no ordering of the data. We specified that the first 400 values should be used for training. We reserved 71 values for validation purposes, leaving 300 values for testing. Training took place over 27,916 epochs and took approximately 45 seconds. The network evaluated the validation examples correctly after only several hundred epochs; however, training was continued until all errors were below 0.5%. Although the error could be further reduced, the algorithm was stopped at this point to prevent overtraining. Once the training had completed, the network was queried against the testing data, and the results were output to a file for further analysis.

Results & Discussion

The neural network approach to spam filtering shows promising results. On the test set, all messages were properly classified: there were neither false positives nor false negatives, an ideal outcome. To generate these results, five nodes were required in the hidden layer. We can infer from this that the partitioning of the messages in the feature space is not trivial, and is decidedly non-linear. The network effectively classified the entire corpus of messages, verifying the plausibility of such a technique. Some of this information can be seen in Appendix C.

Although we were able to perform the classification on the given test set, a larger corpus needs to be tested to ensure the accuracy of the system. It also remains to be shown that the weights can be modified easily to accommodate new input, so that the network retains high precision as email message content evolves over time. This neural network filtering approach is not suggested as a stand-alone solution.
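The two error types discussed above can be counted directly from the network's classified output. The predictions and labels below are hypothetical stand-ins, not the project's actual test data:

```python
def confusion(preds, labels):
    """Return (false positives, false negatives): ham flagged as spam,
    and spam accepted as ham, respectively. True means 'spam'."""
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if l and not p)
    return fp, fn

# Hypothetical results for six test messages.
labels = [True, True, False, False, True, False]
preds  = [True, True, False, False, True, False]
fp, fn = confusion(preds, labels)  # a perfect run gives (0, 0)
```

As argued in the Introduction, false positives (legitimate mail rejected) are far more costly than false negatives, so the two counts should be weighted accordingly when tuning any filter.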
Other commonplace techniques, such as email blacklists, should first be used to evaluate candidate messages in order to improve the effectiveness of the overall system [6]. In this paper, we consider only a corpus of preprocessed messages, that is, messages that have already undergone initial filtering. We offer this approach as a replacement for the Bayesian filter commonly used as a component of an entire filtering system. It should also be noted that the proposed neural network does not perform feature extraction; rather, we depend on a separate preprocessing component for this task. In this paper, we perform the feature extraction somewhat manually, although the process could easily be automated by a separate unit.

References

[1] Author unknown, "About FAR, FRR, and EER," HumanScan, [Online Document], Mar. 2004, [cited 2008 April 17]. Available: http://www.bioid.com/sdk/docs/about_eer.htm
[2] A. Obied, "Bayesian Spam Filtering," Department of Computer Science, University of Calgary, 2007.
[3] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos, "A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists," Information Retrieval, vol. 6, pp. 48-73, 2003.
[4] G. Cormack and T. Lynam, "Spam Track Guidelines - TREC 2005-2007," [Online Document], Jul. 2007, [cited 2008 April 17]. Available: http://plg.uwaterloo.ca/~gvcormac/spam/
[5] P. Hoffman, "Display of Internationalized Mail Addresses Through Address Mapping," Internet Mail Consortium, [Online Document], Feb. 2003, [cited 2008 April 17]. Available: http://www.watersprings.org/pub/id/draft-hoffman-iea-headermap-00.txt
[6] J. Kong, B. Rezaei, N. Sarshar, V. Roychowdhury, and P. Boykin, "Collaborative Spam Filtering Using E-Mail Networks," Computer, vol. 39, no. 8, pp. 67-73, Aug. 2006.

Appendix A - Building Hash Tables

Appendix B - Network Topology

Appendix C - Analysis of Results