An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization
Chih-Chin Lai (a), Ming-Chi Tsai (b)
(a) Dept. of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 700
(b) Institute of Information Management, Shu-Te University, Kaohsiung County, Taiwan
cclai@mail.nutn.edu.tw, lravati@pchome.com.tw

Abstract

The increasing volume of unsolicited bulk e-mail (also known as spam) has generated a need for reliable anti-spam e-mail filters. Using a classifier based on machine learning techniques to automatically filter out spam e-mail has drawn many researchers' attention. In this paper, we review some relevant ideas and conduct a set of systematic experiments on e-mail categorization with four machine learning algorithms applied to different parts of an e-mail. Experimental results reveal that the header of an e-mail provides very useful information to all of the considered machine learning algorithms for detecting spam e-mail.

Keywords: spam, e-mail categorization, machine learning

1. Introduction

In recent years, Internet e-mail has become a common and important medium of communication for almost everyone. However, spam, also known as unsolicited commercial/bulk e-mail, is a bane of e-mail communication. There are many serious problems associated with growing volumes of spam. Spam is not only a waste of storage space and communication bandwidth, but also a waste of the time needed to deal with it. Several solutions have been proposed to overcome the spam problem. Among the proposed methods, much interest has focused on machine learning techniques for spam filtering. They include rule learning [3], Naïve Bayes [1, 6], decision trees [2], support vector machines [4], and combinations of different learners [7]. The basic concept common to these approaches is to use a classifier to filter out spam e-mail, where the classifier is learned from training data rather than constructed by hand; this can result in better performance [9].
From the machine learning viewpoint, spam filtering based on the textual content of e-mail can be viewed as a special case of text categorization, with the categories being spam or non-spam [5]. Sahami et al. [6] employed a Bayesian classification technique to filter junk e-mail. By making use of the extensible framework of Bayesian modeling, they could not only employ traditional document classification techniques based on the text of the e-mail, but could also easily incorporate domain knowledge aimed at filtering spam e-mails. Drucker et al. [4] used the support vector machine (SVM) for classifying e-mails according to their contents and compared its performance with Ripper, Rocchio, and boosting decision trees. They concluded that boosting trees and SVM had acceptable test performance in terms of accuracy and speed; however, the training time of boosting trees is inordinately long. Androutsopoulos et al. [1] extended the Naïve Bayes (NB) filter proposed by Sahami et al. [6] by investigating the effect of different numbers of features and training-set sizes on the filter's performance. They also compared the performance of NB to a memory-based approach and found that both of the above-mentioned methods clearly outperform a typical keyword-based filter. The objective of this paper is to evaluate four machine learning algorithms for spam e-mail categorization. These techniques are Naïve Bayes (NB), term frequency-inverse document frequency (TF-IDF), K-nearest neighbor (K-NN), and support vector machines (SVMs). In addition, we study the different parts of an e-mail that can be exploited to improve the categorization capability. We considered
the following five parts of an e-mail message: all (A), header (H), body (B), subject (S), and body with subject (B+S). The above-mentioned four methods with these features are compared to help evaluate the relative merits of these algorithms and to suggest directions for future work. The rest of this paper is organized as follows. Section 2 gives a brief review of the four machine learning algorithms and the features we used. Section 3 presents the experimental results designed to evaluate the performance of the different experimental settings. The conclusions are summarized in Section 4.

2. Machine learning methods and features in the e-mail

In this section, we review the machine learning algorithms in the literature that have been used for e-mail categorization (or anti-spam filtering). They include Naïve Bayes (NB), term frequency-inverse document frequency (TF-IDF), K-nearest neighbor (K-NN), and support vector machines (SVMs).

2.1 Naïve Bayes

The Naïve Bayes (NB) classifier is a probability-based approach. Its basic concept is to determine whether an e-mail is spam or not by looking at which words are found in the message and which words are absent from it. In the literature, the NB classifier for spam is defined as follows:

    C_{NB} = \arg\max_{c_i \in T} P(c_i) \prod_k P(w_k \mid c_i),    (1)

where T is the set of target classes (spam or non-spam), and P(w_k | c_i) is the probability that word w_k occurs in an e-mail, given that the e-mail belongs to class c_i. The likelihood term is estimated as

    P(w_k \mid c_i) = n_k / N,    (2)

where n_k is the number of times word w_k occurs in e-mails of class c_i, and N is the total number of words in e-mails of class c_i.

2.2 Term frequency-inverse document frequency

The most often adopted representation of a set of e-mail messages is as term-weight vectors, as used in the Vector Space Model [8]. The term weights are real numbers indicating the significance of terms in identifying a document. Based on this concept, the weight of a term in an e-mail message can be computed by the tf-idf weighting scheme.
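As a concrete illustration, the Naïve Bayes decision rule of Eqs. (1) and (2) can be sketched in a few lines of Python. The toy training set and the add-one smoothing of the likelihood estimate are illustrative assumptions, not details taken from the paper:

```python
import math
from collections import Counter

# Toy training set (purely illustrative): tokenised e-mails with labels.
train = [
    (["cheap", "pills", "buy", "now"], "spam"),
    (["meeting", "agenda", "monday"], "legit"),
    (["buy", "cheap", "watches"], "spam"),
    (["project", "meeting", "notes"], "legit"),
]

classes = sorted({label for _, label in train})
prior = {c: sum(1 for _, lab in train if lab == c) / len(train) for c in classes}
counts = {c: Counter() for c in classes}
for words, label in train:
    counts[label].update(words)
vocab = {w for words, _ in train for w in words}

def p_word(w, c):
    # Eq. (2), here with add-one (Laplace) smoothing so that an
    # unseen word does not zero out the product in Eq. (1).
    return (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))

def classify(words):
    # Eq. (1): argmax over classes of P(c) * prod_k P(w_k | c).
    return max(classes, key=lambda c: prior[c] * math.prod(p_word(w, c) for w in words))

print(classify(["buy", "cheap", "pills"]))  # -> spam
```

In practice the product of many small probabilities is computed as a sum of logarithms to avoid floating-point underflow; the direct product is kept here only to mirror Eq. (1).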
The tf (term frequency) indicates the number of times that a term t appears in an e-mail. The idf (inverse document frequency) is the inverse of the frequency of documents, within the set of e-mails, that contain t. The tf-idf weighting scheme is defined as

    w_{ij} = tf_{ij} \cdot \log(N / df_i),    (3)

where w_{ij} is the weight of the ith term in the jth e-mail, tf_{ij} is the number of times that the ith term occurs in the jth e-mail, N is the total number of e-mails in the collection, and df_i is the number of e-mails in which the ith term occurs.

2.3 K-nearest neighbor

The most basic instance-based method is the K-nearest neighbor (K-NN) algorithm. It is a very simple method for classifying documents and has shown very good performance on text categorization tasks [10]. To apply the K-NN method to classify e-mails, the e-mails of the training set have to be indexed and converted into a document vector representation. When classifying a new e-mail, the similarity between its document vector and each vector in the training set is computed. Then the categories of the k nearest neighbors are determined, and the category that occurs most frequently is chosen.

2.4 Support vector machine

The support vector machine (SVM) has become very popular in the machine learning community because of its good generalization performance and its ability to handle high-dimensional data by using kernels. According to the description given in Woitaszek et al. [11], an e-mail may be represented by a feature vector x that is composed of the various words from a dictionary formed by analyzing the collected e-mails. Thus, an e-mail is classified as spam or non-spam by performing a simple dot product between the features of the e-mail and the SVM model weight vector,

    y = w \cdot x + b,    (4)

where y is the result of classification, w is the weight vector corresponding to the features in x, and b is the bias parameter of the SVM model, determined by the training process.

2.5 The structure of an e-mail
In addition to the text of an e-mail message, an e-mail carries additional information in its header. The header contains many fields, for example, trace information about the hosts a message has passed through (Received:), where the sender wants replies to go (Reply-To:), the unique ID of the message (Message-ID:), the format of the content (Content-Type:), etc. Figure 1 illustrates the header of an e-mail. Besides comparing the categorization performance of the learning algorithms, we intended to figure out which parts of an e-mail have a critical influence on the classification results. Therefore, five feature sets of an e-mail, all (A), header (H), body (B), subject (S), and body with subject (B+S), are used to evaluate the performance of the four machine learning algorithms. Furthermore, we also considered four cases, depending on whether the stemming or the stopping procedure was applied.

    Received: from chen2 (localhost [ ]) by ipx.ntntc.edu.tw (Sun/8.12.9) with ESMTP id i791mh4h028241; Mon, 9 Aug :48: (CST)
    From: =?big5?b?s6+pdrfx?= <robert@ipx.ntntc.edu.tw>
    To: <david@ipx.ntntc.edu.tw>
    Subject: =?big5?b?soquykvku6gp+g==?=
    Date: Mon, 9 Aug :50:
    Message-ID: <000001c47db3$334b3000$2a8547cb@chen2>
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="----_NextPart_000_0001_01C47DF6.416E7000"
    X-Priority: 3 (Normal)
    X-MSMail-Priority: Normal
    X-Mailer: Microsoft Outlook, Build

Figure 1. The header of an e-mail

3. Experimental results and discussion

In order to test the performance of the four above-mentioned methods, two corpora were used. The first corpus (Corpus I) consists of our own e-mails collected over a recent three-month period. For the experiments, we deleted some e-mails whose messages were too short or did not contain any content, and obtained 1050 spam and 1057 non-spam e-mails. The second corpus (Corpus II) is a publicly available archive containing 2100 spam and 2107 non-spam messages. We ran experiments with different training and test sets. The first pair of training and test sets was created by splitting each corpus at a ratio of 20:80.
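Each such split can be produced by shuffling a corpus once and slicing it at the training ratio. A stdlib-only sketch, covering the three ratios used in the experiments (the `make_split` helper and the placeholder corpus are illustrative assumptions):

```python
import random

def make_split(corpus, train_ratio, seed=0):
    # Shuffle a copy of the corpus and cut it at the requested ratio.
    items = list(corpus)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

corpus = [f"mail_{i}" for i in range(100)]  # placeholder messages
for ratio in (0.2, 0.3, 0.4):               # the 20:80, 30:70 and 40:60 splits
    train, test = make_split(corpus, ratio)
    print(f"{int(ratio * 100)}:{100 - int(ratio * 100)} -> {len(train)} train, {len(test)} test")
```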
The second pair and the third one were split at ratios of 30:70 and 40:60, respectively. In classification tasks, performance is often measured in terms of accuracy. Let N_legit and N_spam denote the total numbers of non-spam and spam messages, respectively, to be classified by the machine learning method, and n(C → V) the number of messages belonging to category C that the method classified as category V (here C, V ∈ {legit, spam}). The accuracy is defined by the following formula:

    Accuracy = (number of e-mails correctly categorized) / (total number of e-mails)
             = [n(legit → legit) + n(spam → spam)] / (N_legit + N_spam)    (5)

The overall performances of the considered learning algorithms in the different experiments are shown in Tables 1 and 2. From the results, we observed the following phenomena.

1. Good performance of NB, TF-IDF, and SVM with header information. NB and TF-IDF performed reasonably consistently and well across the different experimental settings, while SVM performed well except on the subject feature. It seems that the subject alone does not carry enough information for high-accuracy classification with SVM.

2. Poor performance of the K-NN method. K-NN performed the worst among all considered methods, in all cases. However, the more preprocessing is applied (i.e., stemming and stopping together), the better K-NN performs.

3. No effect of stemming, but stopping enhances classification. Stemming did not bring any significant performance improvement for any of the algorithms, though it decreased the size of the feature set. On the other hand, when the stopping procedure is employed, that is, when words that do not carry meaning in natural language are ignored, we obtain better performance. This phenomenon is especially obvious for the K-NN method, as shown in Figure 2.

4. Good performance with the header. For all four machine learning algorithms, the performance with the header was the best. This means that much information can be derived from the header, and its fields help classify e-mails correctly.

5.
Poor performance with the subject or the body. The poorest performance of each algorithm occurs with the subject or the body. The reason may be that the former provides too little useful information, while the latter contains too much useless information for classifying e-mails. From this observation, we know that although some learning algorithms can achieve satisfactory results, we
may try to improve the results by combining some of them. Here, we integrate the TF-IDF and NB methods and apply the combination to the two corpora. The experimental results are shown in the last column of Tables 1 and 2. They show that the accuracy can be improved by the new hybrid approach.

[Figure 2. Performance of the K-NN method on all features (A, H, B, S, B+S), with and without the stopping procedure, in Corpus I; vertical axis: accuracy (%)]

4. Conclusion

The detection of spam e-mail is an important issue for information technologies, and machine learning has a central role to play in this topic. In this paper, we presented an empirical evaluation of four machine learning algorithms for spam e-mail categorization. These approaches, NB, TF-IDF, K-NN, and SVM, were applied to different parts of an e-mail in order to compare their performance. Experimental results indicate that NB, TF-IDF, and SVM yield better performance than K-NN. We also found, at least on our test corpora, that classification with the header was more accurate than with any other part of an e-mail. In addition, we combined two of the methods (TF-IDF and NB) to achieve more correct categorization; integrating different learning algorithms actually seems to be a promising direction.

Acknowledgements

This work was partially supported by the National Science Council, Taiwan, R.O.C., under grant NSC E. Comments from the anonymous referees are highly appreciated.

References

[1] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos, and P. Stamatopoulos, Learning to filter spam e-mail: A comparison of a Naïve Bayesian and a memory-based approach, Proceedings of the Workshop on Machine Learning and Textual Information Access, pp. 1-13.
[2] X. Carreras and L. Márquez, Boosting trees for anti-spam e-mail filtering, Proceedings of the 4th Int'l Conf. on Recent Advances in Natural Language Processing.
[3] W.W.
Cohen, Learning rules that classify e-mail, Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access.
[4] H. Drucker, D. Wu, and V.N. Vapnik, Support vector machines for spam categorization, IEEE Trans. Neural Networks, vol. 10, no. 5.
[5] A. Kołcz and J. Alspector, SVM-based filtering of e-mail spam with content-specific misclassification costs, Proceedings of the TextDM'01 Workshop on Text Mining.
[6] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, A Bayesian approach to filtering junk e-mail, Learning for Text Categorization: Papers from the AAAI Workshop.
[7] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos, Stacking classifiers for anti-spam filtering of e-mail, Proceedings of the 6th Conf. on Empirical Methods in Natural Language Processing.
[8] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley.
[9] K.-M. Schneider, A comparison of event models for Naïve Bayes anti-spam e-mail filtering, Proceedings of the 10th Conf. of the European Chapter of the Association for Computational Linguistics.
[10] Y. Yang, An evaluation of statistical approaches to text categorization, Journal of Information Retrieval, vol. 1.
[11] M. Woitaszek, M. Shaaban, and R. Czernikowski, Identifying junk electronic mail in Microsoft Outlook with a support vector machine, Proceedings of the 2003 Symposium on Applications and the Internet, 2003.
Table 1. Performance of the four machine learning algorithms in Corpus I
(Columns: feature set, stemming on/off, stopping on/off, and accuracy for NB, TF-IDF, K-NN, SVM, and TF-IDF+NB. Rows: the feature sets A, H, B, S, and B+S, each under the four preprocessing combinations, plus an average row. [Numeric entries lost in text extraction.])

Table 2. Performance of the four machine learning algorithms in Corpus II
(Same layout as Table 1. [Numeric entries lost in text extraction.])
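The accuracy figures reported in the tables follow Eq. (5). A small helper that evaluates it from the four counts; the counts below are hypothetical, not values taken from the tables:

```python
def accuracy(n_legit_legit, n_spam_spam, n_legit_total, n_spam_total):
    # Eq. (5): correctly categorized e-mails over all e-mails.
    return (n_legit_legit + n_spam_spam) / (n_legit_total + n_spam_total)

# Hypothetical counts for illustration: of the 1057 legit and 1050 spam
# messages in Corpus I, suppose 1000 and 980 were classified correctly.
print(f"{accuracy(1000, 980, 1057, 1050):.4f}")
```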
More informationA Feature Selection Method to Handle Imbalanced Data in Text Classification
A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University
More informationDecision Science Letters
Decision Science Letters 3 (2014) 439 444 Contents lists available at GrowingScience Decision Science Letters homepage: www.growingscience.com/dsl Identifying spam e-mail messages using an intelligence
More informationarxiv: v1 [cs.lg] 12 Feb 2018
Email Classification into Relevant Category Using Neural Networks arxiv:1802.03971v1 [cs.lg] 12 Feb 2018 Deepak Kumar Gupta & Shruti Goyal Co-Founders: Reckon Analytics deepak@reckonanalytics.com & shruti@reckonanalytics.com
More informationA Survey on Postive and Unlabelled Learning
A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled
More informationA BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK
A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific
More informationFrequent Inner-Class Approach: A Semi-supervised Learning Technique for One-shot Learning
Frequent Inner-Class Approach: A Semi-supervised Learning Technique for One-shot Learning Izumi Suzuki, Koich Yamada, Muneyuki Unehara Nagaoka University of Technology, 1603-1, Kamitomioka Nagaoka, Niigata
More informationAn Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm
Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy
More informationA generalized additive neural network application in information security
Lecture Notes in Management Science (2014) Vol. 6: 58 64 6 th International Conference on Applied Operational Research, Proceedings Tadbir Operational Research Group Ltd. All rights reserved. www.tadbir.ca
More informationThe Comparative Study of Machine Learning Algorithms in Text Data Classification*
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
More informationInformation-Theoretic Feature Selection Algorithms for Text Classification
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute
More informationSemantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman
Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information
More informationProbabilistic Anti-Spam Filtering with Dimensionality Reduction
Probabilistic Anti-Spam Filtering with Dimensionality Reduction ABSTRACT One of the biggest problems of e-mail communication is the massive spam message delivery Everyday billion of unwanted messages are
More informationAutomatic Domain Partitioning for Multi-Domain Learning
Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels
More informationDetecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach
Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA
More informationApplication of Support Vector Machine Algorithm in Spam Filtering
Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification
More informationAn Ensemble Data Mining Approach for Intrusion Detection in a Computer Network
International Journal of Science and Engineering Investigations vol. 6, issue 62, March 2017 ISSN: 2251-8843 An Ensemble Data Mining Approach for Intrusion Detection in a Computer Network Abisola Ayomide
More informationA Comparative Study of Classification Based Personal Filtering
A Comparative Study of Classification Based Personal E-mail Filtering Yanlei Diao, Hongjun Lu and Deai Wu Department of Computer Science The Hong Kong University of Science and Technology Clear Water Bay,
More informationAnswering Assistance by Semi-Supervised Text Classification
In Intelligent Data Analysis, 8(5), 24. Email Answering Assistance by Semi-Supervised Text Classification Tobias Scheffer Humboldt-Universität zu Berlin Department of Computer Science Unter den Linden
More informationText classification II CE-324: Modern Information Retrieval Sharif University of Technology
Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationA Novel PAT-Tree Approach to Chinese Document Clustering
A Novel PAT-Tree Approach to Chinese Document Clustering Kenny Kwok, Michael R. Lyu, Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong
More informationContent Based Spam Filtering
2016 International Conference on Collaboration Technologies and Systems Content Based Spam E-mail Filtering 2nd Author Pingchuan Liu and Teng-Sheng Moh Department of Computer Science San Jose State University
More informationAutomated Online News Classification with Personalization
Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798
More informationCombining Bayesian and Rule Score Learning: Automated Tuning for SpamAssassin
Combining Bayesian and Rule Score Learning: Automated Tuning for SpamAssassin Alexander K. Seewald Austrian Research Institute for Artificial Intelligence Freyung 6/6, A-1010 Vienna, Austria alexsee@oefai.at,
More informationClassifying Bug Reports to Bugs and Other Requests Using Topic Modeling
Classifying Bug Reports to Bugs and Other Requests Using Topic Modeling Natthakul Pingclasai Department of Computer Engineering Kasetsart University Bangkok, Thailand Email: b5310547207@ku.ac.th Hideaki
More informationMODULE 7 Nearest Neighbour Classifier and its variants LESSON 11. Nearest Neighbour Classifier. Keywords: K Neighbours, Weighted, Nearest Neighbour
MODULE 7 Nearest Neighbour Classifier and its variants LESSON 11 Nearest Neighbour Classifier Keywords: K Neighbours, Weighted, Nearest Neighbour 1 Nearest neighbour classifiers This is amongst the simplest
More informationFiltering Spam Using Fuzzy Expert System 1 Hodeidah University, Faculty of computer science and engineering, Yemen 3, 4
Filtering Spam Using Fuzzy Expert System 1 Siham A. M. Almasan, 2 Wadeea A. A. Qaid, 3 Ahmed Khalid, 4 Ibrahim A. A. Alqubati 1, 2 Hodeidah University, Faculty of computer science and engineering, Yemen
More informationHigh Reliability Text Categorisation Systems
University of Cagliari Department of Electrical and Electronic Engineering High Reliability Text Categorisation Systems Doctoral Thesis of: Dott. Ing. Ignazio Pillai Tutor: Prof. Ing. Fabio Roli Dottorato
More informationImproving Imputation Accuracy in Ordinal Data Using Classification
Improving Imputation Accuracy in Ordinal Data Using Classification Shafiq Alam 1, Gillian Dobbie, and XiaoBin Sun 1 Faculty of Business and IT, Whitireia Community Polytechnic, Auckland, New Zealand shafiq.alam@whitireia.ac.nz
More informationDesigning and Building an Automatic Information Retrieval System for Handling the Arabic Data
American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far
More informationTRANSDUCTIVE TRANSFER LEARNING BASED ON KL-DIVERGENCE. Received January 2013; revised May 2013
International Journal of Innovative Computing, Information and Control ICIC International c 2014 ISSN 1349-4198 Volume 10, Number 1, February 2014 pp. 303 313 TRANSDUCTIVE TRANSFER LEARNING BASED ON KL-DIVERGENCE
More informationA comparative study for content-based dynamic spam classification using four machine learning algorithms
Available online at www.sciencedirect.com Knowledge-Based Systems xxx (2008) xxx xxx www.elsevier.com/locate/knosys A comparative study for content-based dynamic spam classification using four machine
More informationChapter-8. Conclusion and Future Scope
Chapter-8 Conclusion and Future Scope This thesis has addressed the problem of Spam E-mails. In this work a Framework has been proposed. The proposed framework consists of the three pillars which are Legislative
More informationIn = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most
In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More informationUsing Self-Organizing Maps for Sentiment Analysis. Keywords Sentiment Analysis, Self-Organizing Map, Machine Learning, Text Mining.
Using Self-Organizing Maps for Sentiment Analysis Anuj Sharma Indian Institute of Management Indore 453331, INDIA Email: f09anujs@iimidr.ac.in Shubhamoy Dey Indian Institute of Management Indore 453331,
More informationAn Empirical Study of Lazy Multilabel Classification Algorithms
An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
More informationLetter Pair Similarity Classification and URL Ranking Based on Feedback Approach
Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach P.T.Shijili 1 P.G Student, Department of CSE, Dr.Nallini Institute of Engineering & Technology, Dharapuram, Tamilnadu, India
More information