Bayesian Spam Detection System Using Hybrid Feature Selection Method


2016 International Conference on Manufacturing Science and Information Engineering (ICMSIE 2016) ISBN: 978-1-60595-325-0

JUNYING CHEN, SHUNFENG ZHOU and HUAQING MIN

Junying Chen, Shunfeng Zhou, Huaqing Min, Guangzhou Key Laboratory of Robotics and Intelligent Software, School of Software Engineering, South China University of Technology, Guangzhou, Guangdong, China, 510006

ABSTRACT

With the rapid development of the Internet, the amount of text information has increased dramatically. How to identify, classify and handle this information effectively and accurately has therefore become a major challenge. In this paper, we use a term frequency hybrid filter, which combines a refined naïve Bayesian classifier with a new hybrid feature selection method, to detect spam. In our experiments, the hybrid feature selection method achieved better spam detection performance than traditional feature selection methods.

INTRODUCTION

By the end of 2015, the number of Chinese Internet users had exceeded 650 million. E-mail has become an important means of communicating, obtaining information and looking for jobs. In recent years, however, the growing volume of spam has not only disrupted people's daily work and life but also caused large economic losses to society [1]. The current mainstream spam-blocking method collects a large amount of spam and uses it to train a classifier, which can then identify spam among new e-mail messages [2-4]. However, spammers can attack widely used spam filters that rely on specific spam detection algorithms. Such attacks seriously reduce the effectiveness and practicality of current anti-spam technology, so this technology needs to be improved. In this paper, we put forward a new hybrid feature selection method based on refined naïve Bayesian classification, called the term frequency hybrid filter. The experimental results show that this classifier improves detection performance.

SPAM DETECTION SYSTEM DESIGN

The block diagram of the spam detection system is shown in Figure 1. Before e-mail classification, a pre-processing step converts the e-mails into plain text. The sentences are then split into a word list, which gives the vector space model representation. To reduce calculation time and suppress noise, the classifier usually selects only part of the word features [5]. Furthermore, dimensionality reduction is performed on the dataset in advance to improve performance. Finally, the trained classifier is used to classify a new e-mail and output the classification result.

Figure 1. Spam detection system. (Block labels: Spam Dataset, Features Selection, A new mail, Classifier, Result.)

Refined Naïve Bayesian Classification Algorithm Description

If the feature w_i appears in document d, then the probability that document d belongs to class C_i is given by:

(1)

In this paper, we refined the classifier by also considering the case in which the feature w_i does not appear in document d:

(2)

Assuming that any two features are independent, the naïve Bayesian classification algorithm assigns document d to class C_i if and only if [6]:

(3)

Hybrid Feature Selection Module

A large number of documents produces a large feature set, which makes training and classification slow and introduces a lot of noise. A dimensionality reduction method is therefore needed. General dimensionality reduction methods include feature extraction and feature selection. The feature selection methods used in text classification include term frequency, information gain [7], mutual information [8] and the chi-square test (referred to here as chi-square detection). However, traditional feature selection algorithms do not improve classifier performance much.

In this paper, we put forward a new hybrid feature selection method, called the term frequency hybrid filter. First, all feature words are sorted by frequency. Then information gain, mutual information, chi-square detection or a combination of them is chosen as the filter feature selection algorithm, and all words are also sorted by that criterion. The frequency-sorted list is then traversed in order: if a word's index in the list sorted by the filter feature selection algorithm is greater than k, the word is filtered out and the traversal continues; otherwise the word is selected as a component of the feature set. Generally, k is 40%, 50% or 60% of the total number of features, depending on the actual dataset. This hybrid feature selection method takes advantage of the classifying ability of high-frequency words but filters out high-frequency words with low classifying ability. The term frequency hybrid filter therefore combines the advantages of the term frequency method and of the other feature selection algorithms.

EXPERIMENTS AND RESULTS

The e-mail dataset was collected from a user mailbox and consists of 811 mails in total: 490 spam and 321 non-spam. Attachments were removed from each mail, while the subject, sender address, main body and attachment file names were kept. 10-fold cross-validation was performed on the dataset, and the reported result is the average of the 10 tests. Recall rate and F1 score were used as evaluation measures; both are widely used in evaluating machine learning algorithms.
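The term frequency hybrid filter described in the Hybrid Feature Selection Module can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: chi-square detection computed on document presence is used as the filter criterion, ties are broken alphabetically, and the names `hybrid_select`, `k_ratio` and `n_select` are hypothetical (the paper does not state how many words the hybrid combinations finally keep).

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/class contingency table.

    a: spam docs containing the word,     b: non-spam docs containing it,
    c: spam docs not containing the word, d: non-spam docs not containing it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def hybrid_select(docs, labels, k_ratio=0.5, n_select=400):
    """Sketch of the term frequency hybrid filter.

    docs: list of token lists; labels: 1 for spam, 0 for non-spam.
    k_ratio: fraction of the vocabulary that passes the filter (assumed 0.5).
    n_select: size of the final feature set (assumed, not given in the paper).
    """
    term_freq = Counter(tok for doc in docs for tok in doc)
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam

    # Document frequency of every word in each class.
    spam_df, ham_df = Counter(), Counter()
    for doc, y in zip(docs, labels):
        (spam_df if y == 1 else ham_df).update(set(doc))

    # Rank of every word under the filter criterion (0 = most discriminative).
    scores = {
        w: chi_square(spam_df[w], ham_df[w], n_spam - spam_df[w], n_ham - ham_df[w])
        for w in term_freq
    }
    by_score = sorted(scores, key=lambda w: (-scores[w], w))
    rank = {w: i for i, w in enumerate(by_score)}
    k = int(k_ratio * len(term_freq))

    # Walk the frequency-sorted list and skip words ranked beyond k by the filter.
    selected = []
    for w, _ in sorted(term_freq.items(), key=lambda kv: (-kv[1], kv[0])):
        if rank[w] >= k:
            continue
        selected.append(w)
        if len(selected) == n_select:
            break
    return selected
```

On a toy corpus in which a word such as "the" occurs in every mail, that word is the most frequent but has a chi-square score of zero, so it is skipped even though a pure term frequency ranking would select it first.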
The F1 score reflects both how correctly and how completely an algorithm identifies spam, while the recall rate relates to the likelihood of misjudgment. Word frequency, information gain, mutual information, chi-square detection and three hybrid feature selection combinations were applied to classify the dataset. The three hybrid feature selection combinations use, respectively, information gain, mutual information and chi-square detection as the filter feature selection method to sort all features, and keep the first 50% of the words as feature set components. After feature selection, the refined naïve Bayesian classifier was used to classify the dataset, and the evaluation results are shown in Table I.

As shown in Table I, the mutual information method had the highest recall rate, but its F1 score was too low for practical use. Hybrid feature selection combination I achieved a good balance between recall rate and F1 score. By applying the hybrid feature selection method, useless high-frequency words were filtered out, so the performance of the naïve Bayesian classifier was improved.

TABLE I. The performance of different feature selection methods.

Feature selection method                     Recall rate   F1 score
Term frequency (first 400 words)             0.9704        0.9716
Information gain (first 1500 words)          0.9519        0.9605
Mutual information (first 1000 words)        0.9922        0.8513
Chi-square detection (first 400 words)       0.9242        0.9574
Hybrid feature selection combination I       0.9686        0.9824
Hybrid feature selection combination II      0.9610        0.9738
Hybrid feature selection combination III     0.9505        0.9663

CONCLUSIONS

In this paper, we refined the naïve Bayesian classifier, increasing its spam detection correctness. As investigated in this paper, applying an appropriate hybrid feature selection method both improves the classifier's detection performance and reduces the computational complexity. The experiments described in this paper demonstrate that our refined naïve Bayesian classifier combined with the hybrid feature selection method can meet everyday spam detection requirements.

ACKNOWLEDGEMENTS

This work is supported by Guangzhou Science and Technology Program (Key Laboratory Project, No. 15180007) and the Fundamental Research Funds for the Central Universities (No. 2015ZM081).
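Table I reports recall rate and F1 score but not precision. Assuming the usual definition F1 = 2PR / (P + R), which the paper does not spell out, the precision implied by each row can be back-computed as a rough consistency check. Because the table entries are averages over 10 folds, the result is only approximate. The names `implied_precision` and `table_1` are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    return 2 * precision * recall / (precision + recall)


def implied_precision(recall, f1):
    """Solve F1 = 2PR / (P + R) for P; requires 2 * recall > f1."""
    return f1 * recall / (2 * recall - f1)


# (recall rate, F1 score) pairs copied from Table I.
table_1 = {
    "Term frequency (first 400 words)": (0.9704, 0.9716),
    "Information gain (first 1500 words)": (0.9519, 0.9605),
    "Mutual information (first 1000 words)": (0.9922, 0.8513),
    "Chi-square detection (first 400 words)": (0.9242, 0.9574),
    "Hybrid combination I": (0.9686, 0.9824),
    "Hybrid combination II": (0.9610, 0.9738),
    "Hybrid combination III": (0.9505, 0.9663),
}

for name, (recall, f1) in table_1.items():
    print(f"{name}: implied precision ~ {implied_precision(recall, f1):.4f}")
```

Under these assumptions, the mutual information row implies a precision of roughly 0.75, which would explain why its F1 score is low despite having the highest recall rate.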

REFERENCES

1. Kanich, C., et al. 2008. Spamalytics: An empirical analysis of spam marketing conversion, 15th ACM Conference on Computer and Communications Security.
2. Alpaydin, E. 2014. Introduction to machine learning, MIT Press, pp. 640.
3. Harrington, P. 2012. Machine learning in action, Manning Publications Co., pp. 230.
4. Hearst, M. A., Dumais S. T., Osuna E., et al. 1998. Support Vector Machines, IEEE J. Intelligent Systems and their Applications, 13(4), pp. 18-28.
5. Guyon, I. and Elisseeff, A. An introduction to variable and feature selection, The Journal of Machine Learning Research, 3, pp. 1157-1182.
6. Androutsopoulos, I., Koutsias J., Chandrinos K.V., et al. 2000. An evaluation of naive Bayesian anti-spam filtering, Workshop on Machine Learning in the New Information Age.
7. Kent, J. T. Information gain and a general measure of correlation, Biometrika, 70(1), pp. 163-173.
8. Fraser, A. M. and Swinney, H. L. Independent coordinates for strange attractors from mutual information, Physical Review A, 33(2), pp. 1134.