Detecting Spam with Artificial Neural Networks

Andrew Edstrom
University of Wisconsin - Madison

Abstract

This is my final project for CS 539. In this project, I demonstrate the suitability of neural networks for the task of classifying spam emails. I discuss how I attained a classification accuracy of 94.6% through minor changes in network configuration and the momentum alpha parameter, ultimately outperforming existing research on this same dataset.

Keywords: Artificial Intelligence, Machine Learning, Neural Networks, Spam Detection

I. Introduction

Neural networks are powerful tools for machine learning tasks that involve classification. They are used in a wide range of applications, including recommendation engines, computer vision, and dashboard customization. Because of their versatility, they are emerging as one of the primary tools in the machine learning professional's toolkit. However, neural networks are not as widely used in spam email classification as one might expect. Instead, most modern spam filters employ naïve Bayes classifiers, due in large part to Paul Graham's famed article "A Plan for Spam." Naïve Bayes is a strong approach to spam classification, with high accuracy and a low false-positive rate, but by itself it may not be enough to achieve the 99.99+% accuracy that we would like to see.

Google reported that introducing neural networks into Gmail's spam filters took them from 99.5% to 99.9% accuracy, suggesting that neural networks may be useful for enhancing spam filters, especially when used in conjunction with Bayesian classification and other methodologies. However, there is not much research on the use of neural networks for spam detection, and most of the existing research holds the network configuration, momentum, and learning rate constant, investigating the effectiveness of the network across datasets rather than the suitability of different network configurations for the task. In my project, I have done the opposite, holding the dataset constant while adjusting the network configuration and parameters in order to find the ideal network configuration for spam classification.

II. Work Performed

Because I wanted to focus on network configuration rather than dataset preparation, I chose to use the UCI Spambase dataset. In this dataset, each email is assigned a label of spam or not spam. There are 4601 emails, all of which have been processed to extract a number of features, including the frequency of certain spammy words and the number of capital letters used. Before performing any experiments, I randomized this dataset once. I used this same random ordering in each of my trials, so as not to skew my results.

To implement my neural network, I first attempted to use Caffe, a deep learning library from UC Berkeley. However, I encountered numerous problems while attempting to get it to build on my computer. I found Caffe to be a very poorly documented open-source library that depended on numerous other poorly documented open-source libraries, all of which were themselves quite tricky to install. It was shamefully complicated even to obtain all of the dependencies, some of which required days of waiting for an application to be approved before they could be downloaded.
Unfortunately, in order to compile Caffe, one must install the correct versions of all the correct libraries and put each in the correct location in the file system, which is completely different for each library. Once the dependencies are downloaded and installed, you must set numerous environment variables and manually configure a makefile of several hundred lines. Each time you change any one of these pieces along the way, it takes about 30 minutes of compilation to determine whether the change fixed the problem. After well over 10 hours of fierce conflict with Caffe, I decided to explore other options.

After playing with several libraries, I settled on modifying a Matlab implementation of a feedforward MLP with backpropagation by Hesham Eraqi. I chose this implementation as a basis for my project because it made it easy to change the network configuration, momentum alpha, number of epochs, and learning rate, all by changing a single line in the Configurations/Parameters section. Eraqi's MLP implementation only supported calculation of training error, so I added code to evaluate the network with a testing set once it had finished training.

I also added code to calculate and display final results. After all 10 trials have completed, the testing errors are averaged. The average testing error and the average training error are both displayed, because a large difference between the two is a good indicator that the network is overfitting. I also display the network configuration, learning rate, and momentum alpha.

After some initial exploratory trials, I found that a learning rate of 0.1 was ideal. Trials with epoch counts between 200 and 2000 showed no improvement in accuracy over 199 epochs, so I used 199 epochs for each trial. My actual experiments consisted of 29 trials, each with its own configuration and parameter settings.

III. Results

Figure 1 shows that all of my experiments yielded accuracies in the 92-95% range, demonstrating that neural networks achieve fairly high accuracy on this task regardless of the configuration or parameters used. Across my trials I tested a wide range of configurations, from a single hidden layer of eight neurons, to two hidden layers of five neurons, to three hidden layers of 50, 50, and 200 neurons. It seems that any neural network will perform fairly well, no matter its setup, but through fine-tuning we can increase the performance by several percentage points.

Figure 1

I tested several numbers of hidden layers (Figure 2), and I tried many different sizes for each layer. However, no matter the number of neurons per layer, a single layer proved to be ideal. Both my lowest error and my lowest average error across trials came from networks with one layer.
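To give a rough sense of how different the tested configurations are in capacity, the sketch below counts trainable weights and biases for the three configurations mentioned above. It assumes 57 input features (the count given in the UCI Spambase documentation, not stated in this paper), a single output unit, and one bias per neuron; none of these details are confirmed for the Matlab implementation actually used, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected feedforward network.

    layer_sizes lists the number of units per layer, input layer first.
    Assumes every non-input unit has one bias term (an assumption here).
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # weights between adjacent layers
        total += fan_out           # one bias per receiving unit
    return total


N_INPUTS = 57   # assumption: Spambase feature count from the UCI documentation
N_OUTPUTS = 1   # assumption: a single spam / not-spam output unit

# Hidden-layer configurations described in the text.
hidden_configs = {
    "one layer of 8": [8],
    "two layers of 5": [5, 5],
    "three layers of 50, 50, 200": [50, 50, 200],
}

for name, hidden in hidden_configs.items():
    sizes = [N_INPUTS] + hidden + [N_OUTPUTS]
    print(f"{name}: {count_parameters(sizes)} parameters")
```

Under these assumptions the largest configuration has well over ten thousand parameters for 4601 emails, while the small ones have a few hundred.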

Figure 2

Once I determined that one layer was sufficient, I tried several different numbers of neurons for this layer (Figure 3). Preliminary tests showed that any number over 15 caused overfitting; however, I ran one experiment with 40 neurons to confirm. Interestingly, 11 performed best, outperforming both 10 and 12 by almost 0.5%. Together with the previous experiment, this made it clear that a simple network worked best in my trials. My best results came from networks with one hidden layer of 11 hidden neurons. Networks with more layers or more neurons per layer often reached a very low average training error, sometimes below 0.5%, while testing error increased past 6%. I took this as a clear sign of overfitting.
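The overfitting check described above (average training error versus average testing error across trials) can be sketched roughly as follows. This is an illustrative reconstruction in Python, not the Matlab code used in the project; the function names, the 3% gap threshold, and the error values are all hypothetical placeholders.

```python
from statistics import mean


def summarize_trials(train_errors, test_errors):
    """Average per-trial error rates and report the train/test gap."""
    avg_train = mean(train_errors)
    avg_test = mean(test_errors)
    return {
        "avg_train": avg_train,
        "avg_test": avg_test,
        "gap": avg_test - avg_train,
    }


def looks_overfit(summary, max_gap=0.03):
    """Flag a configuration whose testing error far exceeds its training error.

    max_gap is a hypothetical threshold chosen only for illustration.
    """
    return summary["gap"] > max_gap


# Made-up error rates (fractions, not percentages) for a large network:
# training error near 0.5%, testing error near 6%.
train_errors = [0.004, 0.005, 0.006]
test_errors = [0.058, 0.060, 0.062]

summary = summarize_trials(train_errors, test_errors)
print(summary, "overfit?", looks_overfit(summary))
```

A small gap with a low testing error is the pattern one would hope to see for the simpler configurations.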

Figure 3

The final parameter I tried adjusting was the momentum alpha (Figure 4). This variable had a surprisingly large effect on the error rate. I performed several experiments, holding the network configuration and learning rate constant, and found that networks with a momentum alpha of 0.1 dramatically outperformed those with higher or lower momentum alphas.
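For readers unfamiliar with the momentum alpha, the sketch below shows the classical momentum form of a weight update, in which each step adds a fraction alpha of the previous step to the current gradient step. Whether the Matlab implementation used here applies exactly this form is an assumption; `momentum_step` and the toy objective are hypothetical and serve only as illustration.

```python
def momentum_step(weights, gradients, prev_deltas, learning_rate=0.1, alpha=0.1):
    """Apply one gradient-descent update with classical momentum.

    delta(t) = -learning_rate * gradient + alpha * delta(t - 1)
    weight   = weight + delta(t)
    """
    deltas = [
        -learning_rate * g + alpha * d
        for g, d in zip(gradients, prev_deltas)
    ]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas


# Toy problem (hypothetical): minimize f(w) = w^2, whose gradient is 2w.
weights, deltas = [1.0], [0.0]
for step in range(3):
    gradients = [2.0 * w for w in weights]
    weights, deltas = momentum_step(weights, gradients, deltas)
    print(step, weights)
```

With alpha set to 0 this reduces to plain gradient descent; larger values let earlier steps carry more weight in the current update.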

Figure 4

IV. Conclusions

Through my experiments, I found that the best configuration for spam detection on the UCI Spambase dataset with a neural network is 11 hidden neurons in a single hidden layer and a momentum alpha of 0.1. My results are consistent with the findings of Idris, who used a neural network to classify spam on this same dataset and attained an accuracy of 94.3%. Most of my results fell in this general range, though after tweaking and experimentation I was able to train a network that slightly beat that result. Using what I found to be the ideal configuration, I attained an accuracy of 94.6%. This shows that fine-tuning of network configuration and parameters is quite important in neural network research. Even though every configuration I tested performed fairly well, adding just a single neuron can have a nontrivial effect on error rate. In my case, tiny changes like this sometimes reduced my error rate by as much as half a percentage point.

V. Future Work

Future researchers on this topic might consider looking at the false-positive rate of networks with different configurations. In spam detection, false positives are essentially unacceptable, and one of the primary advantages of naïve Bayes is that it promises a low false-positive rate. If one could develop a neural network with a very low false-positive rate, neural networks would seem a much more viable option for commercial spam detection. It would be quite interesting to see whether the network that yielded the lowest error rate also yielded the lowest false-positive rate.

References

Graham, P. (2002). A Plan for Spam.

Idris, I. (2014). E-mail Spam Classification with Artificial Neural Network and Negative Selection Algorithm. International Journal of Computer Science, 1.

Massey, B., et al. (2003, June 9). Learning Spam: Simple Techniques for Freely-Available Software. Proceedings of the FREENIX Track, 2003 USENIX Annual Technical Conference, pp. 63-76, Berkeley, CA, USA.

Metz, C. (2015, July 9). Google Says Its AI Catches 99.9 Percent of Gmail Spam. Wired.com. Accessed May 12, 2016. http://www.wired.com/2015/07/google-says-ai-catches-99-9-percent-gmail-spam/all/1

Sallab, A. A., & Rashwan, M. A. (2012). E-Mail Classification Using Deep Networks. Journal of Theoretical and Applied Information Technology, 37(2), 241-251.