An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization

Chih-Chin Lai (a), Ming-Chi Tsai (b)
(a) Dept. of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 700
(b) Institute of Information Management, Shu-Te University, Kaohsiung County, Taiwan
cclai@mail.nutn.edu.tw, lravati@pchome.com.tw

Abstract

The increasing volume of unsolicited bulk e-mail (also known as spam) has generated a need for reliable anti-spam filters. Using a classifier based on machine learning techniques to automatically filter out spam e-mail has drawn many researchers' attention. In this paper, we review some of the relevant ideas and conduct a set of systematic experiments on e-mail categorization with four machine learning algorithms applied to different parts of e-mail. Experimental results reveal that the header of an e-mail provides very useful information to all of the machine learning algorithms considered for detecting spam e-mail.

Keywords: spam, e-mail categorization, machine learning

1. Introduction

In recent years, Internet e-mail has become a common and important medium of communication for almost everyone. However, spam, also known as unsolicited commercial/bulk e-mail, is a bane of e-mail communication, and there are many serious problems associated with its growing volume. Spam is not only a waste of storage space and communication bandwidth, but also a waste of the time needed to deal with it. Several solutions have been proposed to overcome the spam problem. Among them, much interest has focused on machine learning techniques for spam filtering, including rule learning [3], Naïve Bayes [1, 6], decision trees [2], support vector machines [4], and combinations of different learners [7]. The concept common to these approaches is that a classifier is used to filter out spam e-mail and that this classifier is learned from training data rather than constructed by hand, which can result in better performance [9].

From the machine learning viewpoint, spam filtering based on the textual content of e-mail can be viewed as a special case of text categorization, with the categories being spam and non-spam [5]. Sahami et al. [6] employed a Bayesian classification technique to filter junk e-mail. By making use of the extensible framework of Bayesian modeling, they could not only employ traditional document classification techniques based on the text of an e-mail, but also easily incorporate domain knowledge aimed at filtering spam e-mails. Drucker et al. [4] used support vector machines (SVMs) to classify e-mails according to their contents and compared their performance with Ripper, Rocchio, and boosted decision trees. They concluded that boosted trees and SVMs had acceptable test performance in terms of accuracy and speed, although the training time of boosted trees is inordinately long. Androutsopoulos et al. [1] extended the Naïve Bayes (NB) filter proposed by Sahami et al. [6] by investigating the effect of different numbers of features and training-set sizes on the filter's performance. They also compared the performance of NB with a memory-based approach and found that both methods clearly outperform a typical keyword-based filter.

The objective of this paper is to evaluate four machine learning algorithms for spam e-mail categorization: Naïve Bayes (NB), term frequency-inverse document frequency (TF-IDF), K-nearest neighbor (K-NN), and support vector machines (SVMs).
In addition, we study which parts of an e-mail can be exploited to improve the categorization capability. We considered the following five combinations of an e-mail message: all (A), header (H), body (B), subject (S), and body with subject (B+S). The four methods above are compared on these features to help evaluate the relative merits of the algorithms and to suggest directions for future work.

The rest of this paper is organized as follows. Section 2 gives a brief review of the four machine learning algorithms and of the e-mail features we used. Section 3 presents the experimental results designed to evaluate the performance of the different experimental settings. The conclusions are summarized in Section 4.

2. Machine learning methods and features in the e-mail

In this section, we review the machine learning algorithms in the literature that have been used for e-mail categorization (or anti-spam filtering): Naïve Bayes (NB), term frequency-inverse document frequency (TF-IDF), K-nearest neighbor (K-NN), and support vector machines (SVMs).

2.1 Naïve Bayes

The Naïve Bayes (NB) classifier is a probability-based approach. Its basic idea is to decide whether an e-mail is spam or not by looking at which words are present in the message and which are absent. In the literature, the NB classifier for spam is defined as

C_{NB} = \arg\max_{c_i \in T} P(c_i) \prod_{k} P(w_k \mid c_i),    (1)

where T is the set of target classes (spam or non-spam) and P(w_k | c_i) is the probability that word w_k occurs in an e-mail, given that the e-mail belongs to class c_i. The likelihood term is estimated as

P(w_k \mid c_i) = \frac{n_k}{N},    (2)

where n_k is the number of times word w_k occurs in e-mails of class c_i, and N is the total number of words in e-mails of class c_i.
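To make Eqs. (1)-(2) concrete, the following is a minimal sketch of the NB decision rule described above. The tokenizer, the helper names, and the use of log-probabilities with add-one smoothing are our own assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def train_nb(emails, labels):
    """Estimate P(c_i) and the word counts needed for P(w_k | c_i) = n_k / N (Eq. 2)."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}   # n_k for each word, per class
    total_words = {c: 0 for c in classes}           # N, per class
    for text, c in zip(emails, labels):
        tokens = text.lower().split()               # naive whitespace tokenizer (assumption)
        word_counts[c].update(tokens)
        total_words[c] += len(tokens)
    return prior, word_counts, total_words

def classify_nb(text, prior, word_counts, total_words):
    """Return argmax_{c in T} P(c) * prod_k P(w_k | c), as in Eq. (1)."""
    scores = {}
    for c in prior:
        # Work in log-space to avoid underflow; add-one smoothing is an extra
        # assumption, since Eq. (2) alone assigns zero probability to unseen words.
        score = math.log(prior[c])
        vocab_size = len(word_counts[c])
        for w in text.lower().split():
            score += math.log((word_counts[c][w] + 1) / (total_words[c] + vocab_size))
        scores[c] = score
    return max(scores, key=scores.get)

# Example usage
prior, wc, tw = train_nb(["buy cheap pills now", "meeting at noon tomorrow"],
                         ["spam", "legit"])
print(classify_nb("cheap pills", prior, wc, tw))    # -> 'spam'
```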
2.2 Term frequency-inverse document frequency

The most often adopted representation of a set of e-mail messages is as term-weight vectors, as used in the Vector Space Model [8]. The term weights are real numbers indicating the significance of terms for identifying a document. Based on this concept, the weight of a term in an e-mail message can be computed by tf-idf. The tf (term frequency) is the number of times that a term t appears in an e-mail; the idf (inverse document frequency) is the inverse of the frequency, in the set of e-mails, of documents that contain t. The tf-idf weighting scheme is defined as

w_{ij} = tf_{ij} \cdot \log\left(\frac{N}{df_i}\right),    (3)

where w_ij is the weight of the i-th term in the j-th e-mail, tf_ij is the number of times that the i-th term occurs in the j-th e-mail, N is the total number of e-mails in the collection, and df_i is the number of e-mails in which the i-th term occurs.

2.3 K-nearest neighbor

The most basic instance-based method is the K-nearest neighbor (K-NN) algorithm. It is a very simple method for classifying documents and shows very good performance on text categorization tasks [10]. To apply the K-NN method to classifying e-mails, the e-mails of the training set are first indexed and converted into a document vector representation. When classifying a new e-mail, the similarity between its document vector and each vector in the training set is computed; the categories of the k nearest neighbors are then determined, and the category that occurs most frequently among them is chosen.

2.4 Support vector machine

The support vector machine (SVM) has become very popular in the machine learning community because of its good generalization performance and its ability to handle high-dimensional data through kernels. Following the description given in Woitaszek et al. [11], an e-mail may be represented by a feature vector x composed of the various words from a dictionary formed by analyzing the collected e-mails. An e-mail is then classified as spam or non-spam by a simple dot product between its feature vector and the SVM model weight vector,

y = \mathbf{w} \cdot \mathbf{x} + b,    (4)

where y is the result of the classification, w is the weight vector corresponding to the features in x, and b is the bias parameter of the SVM model determined during training.
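As an illustration of the tf-idf weighting of Eq. (3) and the K-NN vote of Section 2.3, the sketch below builds weight vectors and classifies a new message by a majority vote among its k most similar training e-mails. The use of cosine similarity and the helper names are our assumptions; the paper does not state which similarity measure was used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """w_ij = tf_ij * log(N / df_i), as in Eq. (3)."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))                    # df_i: number of e-mails containing term i
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors, df, N

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, train_vectors, train_labels, df, N, k=3):
    """Choose the category occurring most frequently among the k nearest neighbors."""
    tf = Counter(query.lower().split())
    qvec = {t: tf[t] * math.log(N / df[t]) for t in tf if df[t] > 0}
    nearest = sorted(range(len(train_vectors)),
                     key=lambda i: cosine(qvec, train_vectors[i]),
                     reverse=True)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

# Example usage
train = ["win money now", "free money offer", "project meeting notes", "lunch meeting today"]
labels = ["spam", "spam", "legit", "legit"]
vecs, df, N = tfidf_vectors(train)
print(knn_classify("free money", vecs, labels, df, N, k=3))   # -> 'spam'
```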

2.5 The structure of an e-mail

In addition to the text of the message, an e-mail carries additional information in its header. The header contains many fields, for example, trace information about the hosts a message has passed through (Received:), where the sender wants replies to go (Reply-To:), a unique ID for the message (Message-ID:), the format of the content (Content-Type:), and so on. Figure 1 illustrates the header of an e-mail.

Besides comparing the categorization performance of the learning algorithms, we wanted to find out which parts of an e-mail have a critical influence on the classification results. Therefore, five feature sets of an e-mail, all (A), header (H), body (B), subject (S), and body with subject (B+S), are used to evaluate the performance of the four machine learning algorithms. Furthermore, we also considered the four cases arising from whether or not the stemming and stopping procedures were applied.

[Figure 1. The header of an e-mail, showing fields such as Received:, From:, To:, Subject:, Date:, Message-ID:, MIME-Version:, Content-Type:, X-Priority:, and X-Mailer:.]
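As an illustration of how the five feature sets (A, H, B, S, B+S) could be extracted from a raw message, the sketch below uses Python's standard email module. It reflects our own reading of the feature split, not the authors' preprocessing code; attachments and HTML parts are simply ignored here.

```python
from email import message_from_string

def email_parts(raw: str) -> dict:
    """Split a raw RFC 2822 message into the five feature sets used in the paper."""
    msg = message_from_string(raw)
    header = "\n".join(f"{name}: {value}" for name, value in msg.items())
    subject = msg.get("Subject", "")
    if msg.is_multipart():
        # Concatenate only the plain-text parts of a multipart message.
        body = "\n".join(part.get_payload()
                         for part in msg.walk()
                         if part.get_content_type() == "text/plain")
    else:
        body = msg.get_payload()
    return {
        "A": raw,                      # all: the complete e-mail
        "H": header,                   # header fields only
        "B": body,                     # body text only
        "S": subject,                  # subject line only
        "B+S": subject + "\n" + body,  # body together with subject
    }

# Example usage
raw = ("From: robert@example.com\nTo: david@example.com\n"
       "Subject: meeting\n\nSee you at noon.")
parts = email_parts(raw)
print(parts["S"])    # -> meeting
print(parts["B"])    # -> See you at noon.
```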
3. Experimental results and discussion

In order to test the performance of the four methods described above, two corpora were used. The first corpus (Corpus I) consists of our own e-mails collected over a recent three-month period; after deleting e-mails whose messages were too short or contained no content, we obtained 1050 spam and 1057 non-spam messages. The second corpus (Corpus II) is a publicly available archive containing 2100 spam and 2107 non-spam messages. We ran experiments with different training and test sets: the first pair of training and test sets was created by splitting each corpus at a ratio of 20:80, and the second and third pairs use ratios of 30:70 and 40:60, respectively.

In classification tasks, performance is often measured in terms of accuracy. Let N_legit and N_spam denote the total numbers of non-spam and spam messages, respectively, to be classified by the machine learning method, and let n(C -> V) be the number of messages belonging to category C that the method classified as category V (here C, V ∈ {legit, spam}). The accuracy is the number of e-mails correctly categorized divided by the total number of e-mails:

\mathrm{Accuracy} = \frac{n(\mathrm{legit} \to \mathrm{legit}) + n(\mathrm{spam} \to \mathrm{spam})}{N_{\mathrm{legit}} + N_{\mathrm{spam}}}.    (5)

The overall performance of the considered learning algorithms in the different experiments is shown in Tables 1 and 2. From the results, we observed the following phenomena.

1. Good performance of NB, TF-IDF, and SVM with header information. NB and TF-IDF performed reasonably consistently and well across the different experimental settings, while SVM performed well on every feature set except the subject. It seems that the subject alone does not provide enough information for high-accuracy classification with SVM.

2. Poor performance of the K-NN method. K-NN performed the worst among all considered methods in all cases. However, the more preprocessing is applied (i.e., stemming and stopping used together), the better K-NN performs.

3. No effect of stemming, but stopping can enhance classification. Stemming did not yield any significant improvement for any of the algorithms, although it decreased the size of the feature set. On the other hand, when the stopping procedure is employed, that is, when words that carry no meaning in natural language are ignored, performance improves. This effect is especially obvious for the K-NN method, as shown in Figure 2.

4. Good performance with the header. For all four machine learning algorithms, performance with the header was the best. This means that much information can be derived from the header, and the header fields help to classify e-mails correctly.

5. Poor performance with the subject or body. The poorest performance of each algorithm occurs with the subject or the body alone. A likely reason is that the former provides too little useful information, while the latter contains too much information that is useless for classifying e-mails.
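Every entry in Tables 1 and 2 reports the accuracy of Eq. (5). As a minimal reference sketch (with our own helper naming, not the authors' evaluation code), the measure amounts to the following computation:

```python
def accuracy(true_labels, predicted_labels):
    """Eq. (5): correctly categorized e-mails over all e-mails, i.e.
    (n(legit->legit) + n(spam->spam)) / (N_legit + N_spam)."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Example usage: 3 of 4 messages classified correctly -> 0.75
print(accuracy(["spam", "legit", "spam", "legit"],
               ["spam", "legit", "legit", "legit"]))
```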

From these observations we know that, although some of the learning algorithms can achieve satisfactory results, we may try to improve the results further by combining some of them. Here, we integrate the TF-IDF and NB methods and apply the combination to the two corpora. The experimental results are shown in the last column of Tables 1 and 2; they show that the accuracy can be improved by the new hybrid approach.

[Figure 2. Performance of the K-NN method on all feature sets (A, H, B, S, B+S), with and without the stopping procedure, in Corpus I. The vertical axis is accuracy (%).]

4. Conclusion

The detection of spam e-mail is an important issue for information technology, and machine learning has a central role to play in this topic. In this paper, we presented an empirical evaluation of four machine learning algorithms for spam e-mail categorization. These approaches, NB, TF-IDF, K-NN, and SVM, were applied to different parts of an e-mail in order to compare their performance. Experimental results indicate that NB, TF-IDF, and SVM yield better performance than K-NN. We also found, at least with our test corpora, that classification using the header was more accurate than classification using the other parts of an e-mail. In addition, we combined two methods (TF-IDF and NB) to pursue more accurate categorization and found that integrating different learning algorithms appears to be a promising direction.

Acknowledgements

This work was partially supported by the National Science Council, Taiwan, R.O.C., under grant NSC E. Comments from the anonymous referees are highly appreciated.

References

[1] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos, and P. Stamatopoulos, "Learning to filter spam e-mail: A comparison of a Naïve Bayesian and a memory-based approach," Proceedings of the Workshop on Machine Learning and Textual Information Access, pp. 1-13, 2000.
[2] X. Carreras and L. Márquez, "Boosting trees for anti-spam email filtering," Proceedings of the 4th Int'l Conf. on Recent Advances in Natural Language Processing, 2001.
[3] W.W. Cohen, "Learning rules that classify e-mail," Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, 1996.
[4] H. Drucker, D. Wu, and V.N. Vapnik, "Support vector machines for spam categorization," IEEE Trans. Neural Networks, vol. 10, no. 5, pp. 1048-1054, 1999.
[5] A. Kołcz and J. Alspector, "SVM-based filtering of e-mail spam with content-specific misclassification costs," Proceedings of the TextDM'01 Workshop on Text Mining, 2001.
[6] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk e-mail," Learning for Text Categorization: Papers from the AAAI Workshop, 1998.
[7] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos, "Stacking classifiers for anti-spam filtering of e-mail," Proceedings of the 6th Conf. on Empirical Methods in Natural Language Processing, 2001.
[8] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989.
[9] K.-M. Schneider, "A comparison of event models for Naïve Bayes anti-spam e-mail filtering," Proceedings of the 10th Conf. of the European Chapter of the Association for Computational Linguistics, 2003.
[10] Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol. 1, pp. 69-90, 1999.
[11] M. Woitaszek, M. Shaaban, and R. Czernikowski, "Identifying junk electronic mail in Microsoft Outlook with a support vector machine," Proceedings of the 2003 Symposium on Applications and the Internet, 2003.

Table 1. Performance of the four machine learning algorithms in Corpus I.

Table 2. Performance of the four machine learning algorithms in Corpus II.

[Each table reports accuracy (%) for NB, TF-IDF, K-NN, SVM, and the TF-IDF+NB hybrid on each feature set (A, H, B, S, and B+S) under the four combinations of the stemming and stopping preprocessing steps, together with per-algorithm averages; the numeric entries were not preserved in this transcription.]
