Bayesian Spam Detection System Using Hybrid Feature Selection Method

2016 International Conference on Manufacturing Science and Information Engineering (ICMSIE 2016)
ISBN: 978-1-60595-325-0

JUNYING CHEN, SHUNFENG ZHOU and HUAQING MIN

ABSTRACT

With the rapid development of the Internet, the amount of text information has increased dramatically. How to identify, classify and handle this information effectively and accurately has become a major challenge. In this paper, we use a term frequency hybrid filter, which combines a refined naïve Bayesian classifier with a new hybrid feature selection method, to detect spam. Our experimental results show that the hybrid feature selection method achieves better spam detection performance than traditional feature selection methods.

INTRODUCTION

By the end of 2015, the number of Chinese Internet users had exceeded 650 million. E-mail has become an important means of communicating, obtaining information and looking for jobs. However, in recent years, the growing volume of spam has not only disrupted people's daily work and life, but also caused huge economic losses to society [1]. The current mainstream spam-blocking approach is to collect a large amount of spam and use it to train a classifier, which can then identify spam among new e-mail messages [2-4]. However, spammers can attack widely used spam filters that rely on specific spam detection algorithms. Such attacks seriously affect the effectiveness and practicality of current anti-spam technology, which therefore needs to be improved. In this paper, we put forward a new hybrid feature selection method based on refined naïve Bayesian classification, called the term frequency hybrid filter. The experimental results demonstrate that the resulting classifier achieves improved performance.
Junying Chen, Shunfeng Zhou, Huaqing Min, Guangzhou Key Laboratory of Robotics and Intelligent Software, School of Software Engineering, South China University of Technology, Guangzhou, Guangdong, China, 510006

SPAM DETECTION SYSTEM DESIGN

The block diagram of the spam detection system is shown in Figure 1. Before classification, each e-mail is pre-processed, which converts it into plain text. The sentences are then split into a word list, a representation known as the vector space model. To reduce computation time and suppress noise, the classifier usually selects only part of the word features [5]. Furthermore, dimensionality reduction is performed on the dataset in advance to improve performance. Finally, the trained classifier is used to classify a new e-mail and output the classification result.

Figure 1. Spam detection system (blocks: Spam Dataset, Features Selection, A new mail, Classifier, Result).

Refined Naïve Bayesian Classification Algorithm Description

If the feature w_i appears in document d, then the probability that document d belongs to class C_i is given by:

(1)

In this paper, we refine the classifier by also considering the case in which the feature w_i does not appear in document d:

(2)

Assuming that any two features are independent, then according to the naïve Bayesian classification algorithm, document d belongs to class C_i if and only if [6]:

(3)

Hybrid Feature Selection Module

A large number of documents produces a large feature set, which makes training and classification slow and introduces considerable noise. As a result, a dimensionality reduction method is needed. General dimensionality reduction methods include feature extraction and feature selection. The feature selection methods used in text classification include term frequency, information gain [7], mutual information [8] and chi-square detection. However, traditional feature selection algorithms do little to improve classifier performance.

In this paper, we put forward a new hybrid feature selection method, called the term frequency hybrid filter. First, all feature words are sorted by frequency. Then information gain, mutual information, chi-square detection or a combination of them is chosen as the filter feature selection algorithm. If a word's index in the word list sorted by the filter feature selection algorithm is greater than k, the word is filtered out and selection continues with the next word; otherwise, the word is selected as a component of the feature set. Generally, k is 40%, 50% or 60% of the total number of features, depending on the actual dataset. This hybrid feature selection method takes advantage of the classifying ability of high-frequency words, but filters out high-frequency words with low classifying ability. Therefore, the term frequency hybrid filter combines the advantages of the term frequency method and other feature selection algorithms.

EXPERIMENTS AND RESULTS

The e-mail dataset was collected from a user mailbox and consists of 811 mails in total, including 490 spam and 321 non-spam messages. Attachments were removed from each mail, leaving the subject, sender address, main body and attachment file names. 10-fold cross-validation was performed on the dataset, and the reported result is the average of the 10 tests. Recall rate and F1 score, which are widely used in evaluating machine learning algorithms, were used as evaluation measures.
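As a reminder of how these two measures are computed, the following minimal Python sketch derives recall and F1 from true and predicted labels. Treating spam as the positive class, the function name `recall_and_f1` and the toy labels are assumptions made for illustration; the paper does not state which class was taken as positive.

```python
def recall_and_f1(y_true, y_pred, positive="spam"):
    """Return (recall, F1) for the class named by `positive` (assumed: spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # non-spam flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return recall, 0.0
    return recall, 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical, not taken from the paper's dataset).
truth = ["spam", "spam", "spam", "ham", "ham"]
guess = ["spam", "spam", "ham", "spam", "ham"]
recall, f1 = recall_and_f1(truth, guess)
print(f"recall={recall:.4f} f1={f1:.4f}")
```

In a 10-fold setting, such a function would be applied to each held-out fold and the ten values averaged.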
The F1 score reflects both how correctly and how completely an algorithm identifies spam, while the recall rate is related to the likelihood of misjudgment. Term frequency, information gain, mutual information, chi-square detection and three hybrid feature selection combinations were applied to dataset classification. The three hybrid combinations use information gain, mutual information and chi-square detection, respectively, as the filter feature selection method to sort all features, and select the first 50% of words as feature set components. After feature selection, the refined naïve Bayesian classifier was used to classify the dataset, and the evaluation results are shown in Table I.

As shown in Table I, the mutual information method had the highest recall rate (0.9922), but its F1 score (0.8513) was too low for use in normal cases. Hybrid feature selection combination I achieved a good balance between recall rate and F1 score, with the highest F1 score (0.9824) and a recall rate of 0.9686. By applying the hybrid feature selection method, useless high-frequency words were filtered out, so the performance of the naïve Bayesian classifier was improved.

TABLE I. The performance of different features selection methods.

Features selection method                   Recall rate   F1 score
Term frequency (first 400 words)            0.9704        0.9716
Information gain (first 1500 words)         0.9519        0.9605
Mutual information (first 1000 words)       0.9922        0.8513
Chi-square detection (first 400 words)      0.9242        0.9574
Hybrid features selection combination I     0.9686        0.9824
Hybrid features selection combination II    0.9610        0.9738
Hybrid features selection combination III   0.9505        0.9663

CONCLUSIONS

In this paper, we refined the naïve Bayesian classifier to increase its spam detection accuracy. With an appropriate hybrid feature selection method, as investigated in this paper, the classifier's detection performance can be improved and the computational complexity reduced. The experiments described in this paper demonstrate that our refined naïve Bayesian classifier combined with the hybrid feature selection method can meet everyday spam detection requirements.

ACKNOWLEDGEMENTS

This work was supported by the Guangzhou Science and Technology Program (Key Laboratory Project, No. 15180007) and the Fundamental Research Funds for the Central Universities (No. 2015ZM081).
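The term frequency hybrid filter described in the Hybrid Feature Selection Module section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the `n_features` cap on the size of the feature set, and the toy counts and scores are all hypothetical, and the filter scores are assumed to have been computed beforehand by information gain, mutual information or chi-square detection.

```python
def term_frequency_hybrid_filter(term_counts, filter_scores, n_features, k_ratio=0.5):
    """Pick high-frequency terms whose filter rank lies within the top k.

    term_counts   -- dict: term -> frequency in the training set
    filter_scores -- dict: term -> score from a filter method (higher is better)
    n_features    -- maximum number of terms to keep (assumed parameter)
    k_ratio       -- k as a fraction of all features (the paper uses 0.4-0.6)
    """
    # Rank every term by its filter score, best first, and keep the top k.
    ranked = sorted(filter_scores, key=filter_scores.get, reverse=True)
    k = int(len(ranked) * k_ratio)
    within_k = set(ranked[:k])

    selected = []
    # Walk the terms from most to least frequent (ties broken alphabetically).
    for term in sorted(term_counts, key=lambda t: (-term_counts[t], t)):
        if term not in within_k:
            continue  # filter rank is beyond k: filter it out and continue
        selected.append(term)
        if len(selected) == n_features:
            break
    return selected


# Toy example (hypothetical numbers): "the" and "and" are frequent but
# carry little class information, so the filter step removes them.
counts = {"the": 50, "and": 45, "free": 30, "meeting": 20, "viagra": 10, "hello": 5}
scores = {"the": 0.01, "and": 0.02, "free": 0.40, "meeting": 0.35, "viagra": 0.50, "hello": 0.05}
print(term_frequency_hybrid_filter(counts, scores, n_features=2))  # ['free', 'meeting']
```

The walk over terms in frequency order preserves the bias toward high-frequency words, while the membership test against the top-k filter ranks removes frequent words with little classifying ability.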

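The refined naïve Bayesian classifier described in the system design section uses the absence of a feature as evidence alongside its presence. One standard model with that property is the multivariate Bernoulli naïve Bayes, sketched below in Python. This is an assumption about the general form only and may differ from the paper's equations (1)-(3); the Laplace smoothing, the function names and the toy documents are hypothetical choices made for illustration.

```python
import math


def train_bernoulli_nb(docs, labels, vocabulary):
    """Estimate class priors P(C) and per-class feature probabilities P(w|C).

    docs       -- list of sets of words, one set per document
    labels     -- list of class labels, aligned with docs
    vocabulary -- the selected feature words
    """
    priors, cond = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = len(class_docs) / len(docs)
        # Laplace smoothing (an assumption) keeps every probability in (0, 1).
        cond[c] = {
            w: (sum(w in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for w in vocabulary
        }
    return priors, cond


def classify(doc, priors, cond):
    """Return the class with the highest log-posterior score for doc."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for w, p in cond[c].items():
            # A present word contributes P(w|C); an absent one contributes 1 - P(w|C).
            score += math.log(p) if w in doc else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Toy training set (hypothetical, not the paper's data).
docs = [{"free", "win"}, {"free", "offer"}, {"meeting", "agenda"}, {"meeting", "free"}]
labels = ["spam", "spam", "ham", "ham"]
vocab = ["free", "win", "offer", "meeting", "agenda"]
priors, cond = train_bernoulli_nb(docs, labels, vocab)
print(classify({"free", "win"}, priors, cond))
print(classify({"meeting", "agenda"}, priors, cond))
```

Working in log space avoids numerical underflow when the feature set contains hundreds of words.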
REFERENCES

1. Kanich, C., et al. 2008. Spamalytics: An empirical analysis of spam marketing conversion, 15th ACM Conference on Computer and Communications Security, 2008.
2. Alpaydin, E. 2014. Introduction to machine learning, MIT Press, pp. 640.
3. Harrington, P. 2012. Machine learning in action, Manning Publications Co., pp. 230.
4. Hearst, M. A., Dumais, S. T., Osman, E., et al. 1998. Support Vector Machines, IEEE J. Intelligent Systems and their Applications, 13(4), pp. 18-28.
5. Guyon, I. and Elisseeff, A. An introduction to variable and feature selection, The Journal of Machine Learning Research, 3, pp. 1157-1182.
6. Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., et al. 2000. An evaluation of naive Bayesian anti-spam filtering, Workshop on Machine Learning in the New Information Age, 2012.
7. Kent, J. T. Information gain and a general measure of correlation, Biometrika, 70(1), pp. 163-173.
8. Fraser, A. M. and Swinney, H. L. Independent coordinates for strange attractors from mutual information, Physical Review A, 33(2), pp. 1134.