INCORPORATING KEYWORD-BASED FILTERING TO DOCUMENT CLASSIFICATION FOR EMAIL SPAMMING
TAK-LAM WONG, KAI-ON CHOW, FRANZ WONG
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong

Abstract: Email spamming causes serious problems on the Internet, resulting in a huge waste of resources and attracting high attention from the research community. Automatic document classification and keyword-based filtering are two kinds of techniques which have been applied to filter spam emails with satisfactory results. This paper proposes a formal method for incorporating keyword-based filtering into document classification. To account for the potentially high cost of misclassifying a ham email as a spam email in real-world situations, a cost-sensitive evaluation metric is adopted to evaluate our approaches. We conducted extensive experiments on real-world data showing promising results.

Keywords: Spam filtering; Document classification; Keyword-based filtering; Cost-sensitive evaluation

1. Introduction

The increasing amount of spamming on the Internet has generated a large number of negative impacts [7,10]. For example, spamming can lead to a huge waste of processing power, transmission bandwidth, and storage space. Email spamming, which is the delivery of emails by unauthorized parties mostly for advertising purposes, is one of the most common forms of spamming. Figure 1 shows an excerpt of a spam email. A spam email normally contains advertising content such as a list of brand names. In 1997, about 10% of the emails received by corporate networks were spam [6]. However, in 2003, Dunn reported that 75% of the emails in an account were spam [8]. This raises the need for effective measures to reduce spam emails. The rationale of spam filtering is to identify spam emails and prevent them from being delivered to users. Document classification techniques, which aim at categorizing documents into different predefined categories, have been applied to spam filtering with satisfactory results.
By making use of a machine learning approach, a document classifier is automatically trained from a set of training examples. One advantage of automatic document classification is that it can handle the uncertainty involved effectively. However, unlike ordinary document classification, in which the costs of misclassifying documents into different classes are the same, a serious problem in spam filtering is the high cost of classifying a ham email, which refers to an email sent by a known authority, as spam. For instance, a retailer may lose customers if the filtering method incorrectly classifies some emails from customers as spam and hence the retailer cannot provide an immediate response to those customers. On the contrary, classifying a spam email as ham may only cause browsing trouble to users, with relatively limited impact.

Figure 1. An excerpt of a spam email:
  "Original replica Rolex and other watches for men and ladies from $ Follow this promotional link to see lowest prices. BMW Breguet... Vacheron Constantin Vip Rolex -- Phone: Mobile: troublemakeraffects@telesp.net.br"

Keyword-based filtering is another approach to spam filtering with effective results, and it is commonly employed in commercial software. The idea is to maintain a list of spam words; emails that contain spam words are classified as spam. The advantage of this approach is that it is light-weight and effective. Also, it allows users to modify the list of spam words based on their prior knowledge. However, keyword-based filtering is considered ad-hoc and is unable to handle uncertainty because it makes hard decisions in the filtering process. Software products that combine the two techniques are not commonly seen.

In this paper, we develop an approach to formally incorporate keyword-based filtering into different document classification techniques, in particular multinomial naïve Bayesian and K-nearest neighbour. Our approaches soften the hard decision made by keyword-based filtering, yet preserve its effectiveness. To tackle the different costs of misclassifying a ham email as spam and misclassifying a spam email as ham, a cost-sensitive method is used to evaluate the different filtering techniques. We have conducted experiments on real-world data to evaluate our approaches in tackling email spamming while considering the costs of the different types of misclassification. Empirical results demonstrate that our approaches are more effective than approaches using document classification alone.

2. Related Works

Several document classification based techniques have been proposed for spam filtering. Sahami et al. applied a Bayesian approach to filter spam emails [11]. Their approach considers a set of problem-specific features such as the domain type of the senders. For example, it is unlikely that spam emails are sent from .edu domains. Carreras and Marquez proposed a method using boosting techniques to improve the performance [4]. Their approach makes use of the AdaBoost algorithm to discover a set of weak rules and then combines several weak rules to form accurate rules for filtering spam emails. Androutsopoulos et al. investigated the effect of the size of the feature set, the size of the training set, the use of lemmatization, and the use of stop words in training naïve Bayesian classifiers for spam filtering [1]. Sakkis et al. employed a memory-based learning method, which corresponds to the K-nearest neighbour algorithm, for the task [12]. Different factors, such as the value of K and the features adopted, have been investigated in their work. A detailed comparison between naïve Bayesian and memory-based methods has been reported in [2]. Statistical techniques have also been employed to train document classifiers for spam filtering [3]. One characteristic of these document classification techniques is that they can effectively handle the uncertainty involved in the problem.

Besides document classification techniques, rule induction techniques have been proposed for filtering spam emails. Cohen proposed a method for learning filtering rules from a set of training examples [5]. The rules are keyword-spotting rules that classify emails based on keywords contained in different fields of the emails. Hidalgo et al. designed and developed a system with a set of heuristics for detecting spam emails [9]. Such heuristics include some special characters, for example # and !, contained in emails. These methods are mainly based on keyword matching in emails. An advantage of applying filtering rules to spam filtering is that they are fast and effective. However, the rules are quite restrictive and make only hard decisions in filtering.

3. Document Classification

3.1. Multinomial Naïve Bayesian

Figure 2. A generative model for generating emails (plate notation: a parameter α generates the class c of an email, which together with a parameter β generates its N words w; the process repeats for the M emails in a collection).

Spam filtering can be considered a document classification problem, in which a document refers to an email and the classes are spam email and ham email. An email can be considered a bag of words. The generation process can be represented by the generative model depicted in Figure 2. This model represents the dependence of the different variables. Shaded and unshaded nodes correspond to observed and unobserved variables respectively.
A block refers to the repetition of the variables inside it. In the model, a multinomial parameter α specifies a multinomial distribution which generates the class c of an email. Together with another parameter β, c describes another multinomial distribution which generates the N words w contained in the email. Suppose a collection contains M emails; the generation process then repeats M times. Since the emails are given in spam filtering, w is an observed variable in the model. However, α, β, and c are unobservable because their values are hidden for an email.

Based on the generative model, the probability of generating an email d_i given the parameters α and β is expressed as follows:

  P(d_i | α, β) = P(c | α) ∏_{k=1}^{N_i} P(w_k | c; β)    (1)

where w_k refers to the k-th word in the email and N_i denotes the number of words of the i-th email in the collection. Recall that our objective in spam filtering is to determine the class of a given email. Essentially, we determine the probability of the document d_i belonging to the class c_j given the document and the parameters, P(c = c_j | d_i; α, β). According to Bayes' theorem, we have:

  P(c = c_j | d_i; α, β) ∝ P(c = c_j | α) ∏_{k=1}^{N_i} P(w_k | c_j; β)    (2)

Once we obtain the values of P(c | α) and P(w | c; β), we can infer the class label of an email. We determine these probabilities as follows. Let V be the vocabulary of words contained in all emails. Then we have:

  P(c = c_j | α) = (1 + Σ_{i=1}^{M} f(d_i, c_j)) / (M + |C|)    (3)

where M is the number of emails in the collection; f(d_i, c_j) is an indicator function which equals 1 if document d_i is labeled as class c_j, and 0 otherwise; and |C| denotes the number of classes in the underlying problem. In spam filtering, |C| is equal to 2 because the classes are ham and spam. Next, we have:

  P(w_k | c_j; β) = (1 + Σ_{i=1}^{M} g(d_i, w_k, c_j)) / (|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{M} g(d_i, w_s, c_j))    (4)

where |V| denotes the size of the vocabulary and g(d_i, w_k, c_j) is an indicator function which equals 1 if d_i is labeled as c_j and contains the word w_k, and 0 otherwise.

3.2. K-Nearest Neighbour

K-nearest neighbour (K-NN) is a classification technique based on a similarity measure between instances and has been applied in document classification. In spam filtering, each email can be represented by a vector d_i. The t-th entry of d_i, denoted by w_it, refers to the weighting of the t-th word of the vocabulary V in the email.
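As an illustrative sketch only (not the authors' code; the function names and toy data are hypothetical), the smoothed estimates of Equations (3) and (4) and the classification rule of Equation (2) can be implemented roughly as follows. Note that Equation (4) counts a document at most once per word, i.e. g is a document-frequency indicator:

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab, classes=("spam", "ham")):
    """Estimate P(c) per Eq. (3) and P(w|c) per Eq. (4).

    docs: list of word lists; labels: class label per doc; vocab: word set."""
    M = len(docs)
    prior, cond = {}, {}
    for c in classes:
        class_docs = [set(d) for d, l in zip(docs, labels) if l == c]
        prior[c] = (1 + len(class_docs)) / (M + len(classes))         # Eq. (3)
        df = Counter(w for d in class_docs for w in d if w in vocab)  # g(.,w,c)
        denom = len(vocab) + sum(df[w] for w in vocab)
        cond[c] = {w: (1 + df[w]) / denom for w in vocab}             # Eq. (4)
    return prior, cond

def classify_nb(doc, prior, cond, vocab):
    """Pick the class maximising Eq. (2), using log probabilities."""
    def log_post(c):
        return math.log(prior[c]) + sum(
            math.log(cond[c][w]) for w in doc if w in vocab)
    return max(prior, key=log_post)
```

On a toy collection with two spam and two ham emails, `prior["spam"]` evaluates to (1+2)/(4+2) = 0.5, matching Equation (3).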
We define the similarity sim(d_p, d_q) between two emails d_p and d_q, known as the cosine similarity, as follows:

  sim(d_p, d_q) = (d_p · d_q) / (||d_p|| ||d_q||)    (5)

where ||d|| refers to the norm of the vector d. Suppose we have a collection of emails denoted by D. To determine the class label of an email d_i, we first compute the similarity between d_i and each of the emails in D. Next, the K most similar emails are identified. The score of d_i belonging to class c_j is defined as follows:

  score(d_i, c_j) = Σ_{d_p ∈ KNN} sim(d_i, d_p) y(d_p, c_j)    (6)

where KNN refers to the set of K nearest neighbours of d_i in D, and y(d_p, c_j) is an indicator function which equals 1 if d_p is labeled as c_j, and 0 otherwise. Finally, the class label of d_i is decided as follows:

  c = argmax_{c_j} score(d_i, c_j)    (7)

4. Incorporating Keyword-Based Filtering in Different Classifiers

Keyword-based filtering is commonly applied in existing commercial spam filtering systems because it is fast and effective. The idea of keyword-based filtering is to check whether an email contains spam words. If an email contains any spam words, it is considered to be spam. Table 1 depicts a list of spam words. This list is also used in the experiments in this paper. Though keyword-based filtering is effective, it is ad-hoc, makes hard decisions, and is unable to handle uncertainty. As a consequence, we incorporate keyword-based filtering in multinomial naïve Bayesian and K-NN classifiers to improve the effectiveness and yet handle the uncertainty involved in spam filtering.

Table 1. A list of spam words for spam email filtering:
  gamble, gambling, casino, poker, pill, drug, medicine, viagra, sex, fuck, suck, porn, lonely girl, penis, nude, adult, lesbian, virus, 100%, earn money, credit card, cheap, cash, insurance, free, undeliverable mail
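The plain keyword-based filter described above amounts to a single hard rule. A minimal sketch over a subset of the Table 1 list (the function name is hypothetical, and the naive substring matching is a simplification of what a real system would tokenize):

```python
# A subset of the spam words of Table 1 (illustrative only).
SPAM_WORDS = {
    "gamble", "gambling", "casino", "poker", "pill", "drug", "medicine",
    "viagra", "porn", "nude", "virus", "100%", "earn money", "credit card",
    "cheap", "cash", "insurance", "free", "undeliverable mail",
}

def keyword_filter(email_text):
    """Hard decision: spam iff the email contains any listed word or phrase."""
    text = email_text.lower()
    return "spam" if any(w in text for w in SPAM_WORDS) else "ham"
```

This makes the hard-decision problem concrete: a single ham email that happens to mention, say, "insurance" is irrevocably classified as spam, which is exactly the behaviour the incorporation schemes below are meant to soften.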
Recall that a multinomial naïve Bayesian classifier is characterized by a set of probabilities P(c | α) and P(w | c; β). P(w | c; β) can be interpreted as how likely an email of class c is to contain the word w. For example, if a number of spam emails contain a word such as "viagra" while only a few ham emails contain this word, P(w = viagra | c = spam; β) will be significantly higher than P(w = viagra | c = ham; β). We incorporate keyword-based filtering into naïve Bayesian filtering based on this idea. For each spam word w listed in Table 1, we set P(w | c = spam; β) to a higher value and reduce the values of P(w | c = spam; β) for the other words accordingly, while we set P(w | c = ham; β) to a smaller value and increase the values of P(w | c = ham; β) for the other words accordingly.

To incorporate keyword-based filtering in K-NN, we artificially create spam emails, each of which consists of only one spam word. Each document is represented by a vector in which each entry is the weight of a word in the vocabulary V. Next, we set the weight of the spam word in the artificial email to a higher value. These documents are labeled as spam. To determine the class of an email d_i, the similarity between d_i and all the emails in D, as well as the similarity between d_i and the artificial spam emails, are calculated. The class label is then determined according to Equations 6 and 7. Consequently, an email that contains spam words will be more likely to be classified as spam than by using a multinomial naïve Bayesian or K-NN classifier alone. On the other hand, this softens the decision made by using keyword-based filtering only.

5. Experimental Results

We evaluate our proposed methods on the TREC 2005 Spam Corpus provided by the Text Retrieval Conference. It consists of emails of which about 57% are spam. As discussed in Section 1, about 70% of the emails to an account are spam. To simulate the real situation, we randomly select emails to construct the training set and the testing set. Both the training set and the testing set contain 70% spam emails and 30% ham emails.
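The K-NN incorporation described in Section 4 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the names are hypothetical, raw word counts stand in for the TF-IDF weighting used in the experiments, and the artificial emails carry unit weight, since cosine similarity (Eq. 5) is scale-invariant for a single-entry vector, the boosted weight the paper assigns to it cancels in this simplified form:

```python
import math
from collections import Counter

def cosine(u, v):
    """Eq. (5): cosine similarity between two sparse word-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) \
         * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(doc, train, k=5, spam_words=()):
    """K-NN (Eqs. 5-7) with Section 4's artificial one-word spam emails.

    doc: list of words; train: list of (word-weight Counter, label) pairs.
    Each spam word adds one artificial spam-labelled neighbour."""
    pool = list(train) + [(Counter({w: 1.0}), "spam") for w in spam_words]
    vec = Counter(doc)
    neighbours = sorted(pool, key=lambda p: cosine(vec, p[0]), reverse=True)[:k]
    score = Counter()
    for d_p, label in neighbours:          # Eq. (6): similarity-weighted votes
        score[label] += cosine(vec, d_p)
    return score.most_common(1)[0][0]      # Eq. (7): highest-scoring class
```

An email containing a spam word is guaranteed a nonzero-similarity spam neighbour, yet strong ham neighbours can still outvote it, which is how the hard keyword decision is softened.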
Each email is preprocessed by removing stop words and stemming each word. We also use the Chi-square method for feature selection; the resulting feature set contains about 400 words. For K-NN, K is set to 5 and we adopt the commonly used term frequency-inverse document frequency (TF-IDF) weighting to calculate the weight of each word. After that, four sets of experiments were conducted for evaluation. The first and second sets of experiments apply the multinomial naïve Bayesian and K-NN approaches respectively for filtering spam emails. These two sets of experiments can be treated as the baselines for our methods of incorporating keyword-based filtering. The third and fourth sets incorporate keyword-based filtering into multinomial naïve Bayesian and K-NN respectively, as described in Section 4. The spam words used are listed in Table 1.

We adopt macro F-measure (F_macro) and micro F-measure (F_micro) as evaluation metrics in our experiments. However, one problem of the F-measure is that it does not consider the costs of different types of misclassification. As exemplified in Section 1, the cost of classifying a ham email as spam is normally relatively higher. As a result, we use another, cost-sensitive metric for our evaluation. Let λ be a parameter denoting the cost of classifying a ham email as spam relative to the cost of classifying a spam email as ham. For example, λ = 9 if classifying a ham email as spam is 9 times more expensive than classifying a spam email as ham. We then define the weighted error rate (Err_λ) as follows:

  Err_λ = (λ n_{ham→spam} + n_{spam→ham}) / (λ N_ham + N_spam)    (8)

where n_{ham→spam} and n_{spam→ham} refer to the number of ham emails incorrectly classified as spam and the number of spam emails incorrectly classified as ham respectively, and N_ham and N_spam refer to the total number of ham emails and spam emails respectively. We also define the baseline error (Err_λ^b) as follows:

  Err_λ^b = N_spam / (λ N_ham + N_spam)    (9)

Err_λ^b can be interpreted as the error for the case when there is no filter and all the emails are considered ham.
The total cost ratio (TCR_λ) is then a single value representing the comparison to the baseline:

  TCR_λ = Err_λ^b / Err_λ = N_spam / (λ n_{ham→spam} + n_{spam→ham})    (10)

The higher the TCR_λ of a spam filter, the better its performance. If TCR_λ is less than 1, the filter is worse than having no filter at all.
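Equations (8)-(10) reduce to two one-line functions. A minimal sketch with hypothetical names (note that Eq. (10)'s denominator is zero for a perfect filter, so a real implementation would guard that case):

```python
def weighted_error(n_ham_as_spam, n_spam_as_ham, n_ham, n_spam, lam=9):
    """Eq. (8): error rate with ham->spam mistakes counted lambda times."""
    return (lam * n_ham_as_spam + n_spam_as_ham) / (lam * n_ham + n_spam)

def total_cost_ratio(n_ham_as_spam, n_spam_as_ham, n_ham, n_spam, lam=9):
    """Eq. (10): baseline error (Eq. 9, no filter) over the filter's error.

    Values above 1 mean the filter beats the classify-everything-as-ham
    baseline; values below 1 mean it is worse than no filter."""
    return n_spam / (lam * n_ham_as_spam + n_spam_as_ham)
```

For example, a filter that makes 10 ham-to-spam and 50 spam-to-ham errors on 300 ham and 700 spam emails scores TCR_9 = 700 / (9·10 + 50) = 5.0, while classifying everything as ham always gives TCR_λ = 1.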
Table 2. The experimental results of different approaches to spam filtering evaluated using TCR_λ. Bayes and K-NN refer to the multinomial naïve Bayesian and K-nearest neighbour approaches respectively. Bayes++ and K-NN++ refer to the approaches incorporating keyword-based filtering into multinomial naïve Bayesian and K-NN respectively.

           Bayes    K-NN    Bayes++    K-NN++
  TCR_1
  TCR_9

Table 2 shows the experimental results of the different approaches for the cases λ = 1 and λ = 9 respectively. Bayes and K-NN refer to the multinomial naïve Bayesian and K-nearest neighbour approaches respectively. Bayes++ and K-NN++ refer to the approaches that incorporate keyword-based filtering into multinomial naïve Bayesian and K-NN respectively. Recall that the higher the value of TCR_λ, the better the performance of the filter, and the higher the value of λ, the higher the cost of misclassifying a ham email as spam. It can be observed that when the cost of misclassifying a ham email as spam and that of misclassifying a spam email as ham are the same (i.e. λ = 1), the performance with keyword-based filtering incorporated is merely comparable to the performance without it. However, when we set a higher cost for misclassifying a ham email as spam, TCR_9 of K-NN++ is 0.14 higher than TCR_9 of K-NN, while TCR_9 of Bayes++ is 0.1 higher than TCR_9 of Bayes. Our approaches incorporating keyword-based filtering outperform the approaches which do not. Furthermore, it can be observed that multinomial naïve Bayesian is more sensitive to the cost of misclassifying a ham email as spam: when λ is increased from 1 to 9, the value of TCR_λ changes from 8.89 to 3.21 for Bayes and from 8.84 to 3.31 for Bayes++. K-NN, on the other hand, is more stable with respect to the misclassification cost. Table 3 shows the experimental results of the different approaches evaluated using the F-measure.
Recall that the F-measure (F_macro and F_micro) assigns the same cost to misclassifying a ham email as spam and misclassifying a spam email as ham. It can be observed that the results of Bayes++ are similar to those of Bayes, while the results of K-NN++ are similar to those of K-NN.

Table 3. The experimental results of different approaches to spam filtering evaluated using the F-measure. Bayes and K-NN refer to the multinomial naïve Bayesian and K-nearest neighbour approaches respectively. Bayes++ and K-NN++ refer to the approaches incorporating keyword-based filtering into multinomial naïve Bayesian and K-NN respectively.

           Bayes    K-NN    Bayes++    K-NN++
  F_macro
  F_micro

6. Conclusions and Future Work

We propose approaches for formally incorporating keyword-based filtering, which is fast and effective but considered ad-hoc, into document classification techniques, in particular multinomial naïve Bayesian and K-nearest neighbour. One property of our approaches is that they soften the hard decision made by keyword-based filtering. To evaluate the performance of our approaches effectively, we use a cost-sensitive evaluation method which assigns different costs to misclassifying a ham email as spam and misclassifying a spam email as ham. Empirical results show that our approaches achieve better performance when a higher cost is set for misclassifying a ham email as spam.

We intend to extend our work in several directions. One possible direction is to automatically discover spam words from emails using machine learning approaches. Methods such as latent semantic indexing have been applied to clustering words in documents of similar topics. It is likely that we can employ similar techniques to automatically discover spam words from spam emails. Another direction is to increase the expressiveness of the list of spam words. For example, the spam words may be organized as an ontology, providing richer information to the filtering system.

References

[1] I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, and C. Spyropoulos, "An evaluation of naive Bayesian anti-spam filtering," Proceedings of the Workshop on Machine Learning in the New Information Age, pp. 9-17.
[2] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. Spyropoulos, and P. Stamatopoulos, "Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach," Proceedings of the Fourth PKDD Workshop on Machine Learning and Textual Information Access, pp. 1-13.
[3] A. Bratko, G. Cormack, B. Filipic, T. Lynam, and B. Zupan, "Spam filtering using statistical data compression models," Journal of Machine Learning Research, Vol. 7, No. 12.
[4] X. Carreras and L. Marquez, "Boosting trees for anti-spam email filtering," Proceedings of the Fourth International Conference on Recent Advances in Natural Language Processing.
[5] W. Cohen, "Learning rules that classify e-mail," Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access.
[6] L. F. Cranor and B. A. LaMacchia, "Spam!," Communications of the ACM, Vol. 41, No. 8.
[7] B. Davison, M. Najork, and T. Converse, Proceedings of the Second International Workshop on Adversarial Information Retrieval on the Web, Technical Report LU-CSE, Department of Computer Science and Engineering, Lehigh University, 2006.
[8] J. E. Dunn, "Spam: some figures on the threat," id=1372.
[9] J. Hidalgo, M. Lopez, and E. Sanz, "Combining text and heuristics for cost-sensitive spam filtering," Proceedings of the Fourth Conference on Computational Language Learning and the Second Learning Language in Logic Workshop.
[10] G. Mishne, D. Carmel, and R. Lempel, "Blocking blog spam with language model disagreement," Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web, pp. 1-6.
[11] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk e-mail," Proceedings of the AAAI-98 Workshop on Learning for Text Categorization.
[12] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. Spyropoulos, and P. Stamatopoulos, "A memory-based approach to anti-spam filtering for mailing lists," Information Retrieval, Vol. 6, No. 1.
More informationContent Based Spam Filtering
2016 International Conference on Collaboration Technologies and Systems Content Based Spam E-mail Filtering 2nd Author Pingchuan Liu and Teng-Sheng Moh Department of Computer Science San Jose State University
More informationSNS College of Technology, Coimbatore, India
Support Vector Machine: An efficient classifier for Method Level Bug Prediction using Information Gain 1 M.Vaijayanthi and 2 M. Nithya, 1,2 Assistant Professor, Department of Computer Science and Engineering,
More informationSimilarity search in multimedia databases
Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationSchematizing a Global SPAM Indicative Probability
Schematizing a Global SPAM Indicative Probability NIKOLAOS KORFIATIS MARIOS POULOS SOZON PAPAVLASSOPOULOS Department of Management Science and Technology Athens University of Economics and Business Athens,
More informationarxiv: v1 [cs.lg] 12 Feb 2018
Email Classification into Relevant Category Using Neural Networks arxiv:1802.03971v1 [cs.lg] 12 Feb 2018 Deepak Kumar Gupta & Shruti Goyal Co-Founders: Reckon Analytics deepak@reckonanalytics.com & shruti@reckonanalytics.com
More informationInformation Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationGenerating Estimates of Classification Confidence for a Case-Based Spam Filter
Generating Estimates of Classification Confidence for a Case-Based Spam Filter Sarah Jane Delany 1, Pádraig Cunningham 2, and Dónal Doyle 2 1 Dublin Institute of Technology, Kevin Street, Dublin 8, Ireland
More informationAutomatic Domain Partitioning for Multi-Domain Learning
Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels
More informationProject Report: "Bayesian Spam Filter"
Humboldt-Universität zu Berlin Lehrstuhl für Maschinelles Lernen Sommersemester 2016 Maschinelles Lernen 1 Project Report: "Bayesian E-Mail Spam Filter" The Bayesians Sabine Bertram, Carolina Gumuljo,
More informationImproving Recognition through Object Sub-categorization
Improving Recognition through Object Sub-categorization Al Mansur and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570,
More informationChapter-8. Conclusion and Future Scope
Chapter-8 Conclusion and Future Scope This thesis has addressed the problem of Spam E-mails. In this work a Framework has been proposed. The proposed framework consists of the three pillars which are Legislative
More informationDetecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach
Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA
More informationKarami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings.
Online Review Spam Detection by New Linguistic Features Amir Karam, University of Maryland Baltimore County Bin Zhou, University of Maryland Baltimore County Karami, A., Zhou, B. (2015). Online Review
More informationClassification and Clustering
Chapter 9 Classification and Clustering Classification and Clustering Classification/clustering are classical pattern recognition/ machine learning problems Classification, also referred to as categorization
More informationMODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS
MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500
More information2. Design Methodology
Content-aware Email Multiclass Classification Categorize Emails According to Senders Liwei Wang, Li Du s Abstract People nowadays are overwhelmed by tons of coming emails everyday at work or in their daily
More informationWEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS
WEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS Juan Martinez-Romo and Lourdes Araujo Natural Language Processing and Information Retrieval Group at UNED * nlp.uned.es Fifth International Workshop
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More informationCollaborative Spam Mail Filtering Model Design
I.J. Education and Management Engineering, 2013, 2, 66-71 Published Online February 2013 in MECS (http://www.mecs-press.net) DOI: 10.5815/ijeme.2013.02.11 Available online at http://www.mecs-press.net/ijeme
More informationAn Improvement of Centroid-Based Classification Algorithm for Text Classification
An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,
More informationMulti-Dimensional Text Classification
Multi-Dimensional Text Classification Thanaruk THEERAMUNKONG IT Program, SIIT, Thammasat University P.O. Box 22 Thammasat Rangsit Post Office, Pathumthani, Thailand, 12121 ping@siit.tu.ac.th Verayuth LERTNATTEE
More informationBayesTH-MCRDR Algorithm for Automatic Classification of Web Document
BayesTH-MCRDR Algorithm for Automatic Classification of Web Document Woo-Chul Cho and Debbie Richards Department of Computing, Macquarie University, Sydney, NSW 2109, Australia {wccho, richards}@ics.mq.edu.au
More informationVALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER
VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018
More informationFinal Report - Smart and Fast Sorting
Final Report - Smart and Fast Email Sorting Antonin Bas - Clement Mennesson 1 Project s Description Some people receive hundreds of emails a week and sorting all of them into different categories (e.g.
More informationDeveloping Focused Crawlers for Genre Specific Search Engines
Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com
More informationBayesian Spam Detection
Scholarly Horizons: University of Minnesota, Morris Undergraduate Journal Volume 2 Issue 1 Article 2 2015 Bayesian Spam Detection Jeremy J. Eberhardt University or Minnesota, Morris Follow this and additional
More informationSemantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman
Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information
More informationInformation-Theoretic Feature Selection Algorithms for Text Classification
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute
More informationBayesian Spam Filtering Using Statistical Data Compression
Global Journal of researches in engineering Numerical Methods Volume 11 Issue 7 Version 1.0 December 2011 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc.
More informationMODULE 7 Nearest Neighbour Classifier and its variants LESSON 11. Nearest Neighbour Classifier. Keywords: K Neighbours, Weighted, Nearest Neighbour
MODULE 7 Nearest Neighbour Classifier and its variants LESSON 11 Nearest Neighbour Classifier Keywords: K Neighbours, Weighted, Nearest Neighbour 1 Nearest neighbour classifiers This is amongst the simplest
More informationAn Assessment of Case Base Reasoning for Short Text Message Classification
An Assessment of Case Base Reasoning for Short Text Message Classification Matt Healy 1, Sarah Jane Delany 1, and Anton Zamolotskikh 2 1 Dublin Institute of Technology, Kevin Street, Dublin 8, Ireland
More informationProbabilistic Anti-Spam Filtering with Dimensionality Reduction
Probabilistic Anti-Spam Filtering with Dimensionality Reduction ABSTRACT One of the biggest problems of e-mail communication is the massive spam message delivery Everyday billion of unwanted messages are
More informationText Classification. Dr. Johan Hagelbäck.
Text Classification Dr. Johan Hagelbäck johan.hagelback@lnu.se http://aiguy.org Document Classification A very common machine learning problem is to classify a document based on its text contents We use
More informationCombining Text Embedding and Knowledge Graph Embedding Techniques for Academic Search Engines
Combining Text Embedding and Knowledge Graph Embedding Techniques for Academic Search Engines SemDeep-4, Oct. 2018 Gengchen Mai Krzysztof Janowicz Bo Yan STKO Lab, University of California, Santa Barbara
More informationCS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai
CS 8803 AIAD Prof Ling Liu Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai Under the supervision of Steve Webb Motivations and Objectives Spam, which was until
More informationFeature selection. LING 572 Fei Xia
Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection
More informationSupervised classification of law area in the legal domain
AFSTUDEERPROJECT BSC KI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationCost-sensitive Boosting for Concept Drift
Cost-sensitive Boosting for Concept Drift Ashok Venkatesan, Narayanan C. Krishnan, Sethuraman Panchanathan Center for Cognitive Ubiquitous Computing, School of Computing, Informatics and Decision Systems
More informationA Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion
A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham
More informationImproving the methods of classification based on words ontology
www.ijcsi.org 262 Improving the methods of email classification based on words ontology Foruzan Kiamarzpour 1, Rouhollah Dianat 2, Mohammad bahrani 3, Mehdi Sadeghzadeh 4 1 Department of Computer Engineering,
More informationAnnotated Suffix Trees for Text Clustering
Annotated Suffix Trees for Text Clustering Ekaterina Chernyak and Dmitry Ilvovsky National Research University Higher School of Economics Moscow, Russia echernyak,dilvovsky@hse.ru Abstract. In this paper
More informationCS 188: Artificial Intelligence Fall Machine Learning
CS 188: Artificial Intelligence Fall 2007 Lecture 23: Naïve Bayes 11/15/2007 Dan Klein UC Berkeley Machine Learning Up till now: how to reason or make decisions using a model Machine learning: how to select
More informationCS294-1 Final Project. Algorithms Comparison
CS294-1 Final Project Algorithms Comparison Deep Learning Neural Network AdaBoost Random Forest Prepared By: Shuang Bi (24094630) Wenchang Zhang (24094623) 2013-05-15 1 INTRODUCTION In this project, we
More informationTo earn the extra credit, one of the following has to hold true. Please circle and sign.
CS 188 Spring 2011 Introduction to Artificial Intelligence Practice Final Exam To earn the extra credit, one of the following has to hold true. Please circle and sign. A I spent 3 or more hours on the
More informationFeature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News
Selecting Model in Automatic Text Categorization of Chinese Industrial 1) HUEY-MING LEE 1 ), PIN-JEN CHEN 1 ), TSUNG-YEN LEE 2) Department of Information Management, Chinese Culture University 55, Hwa-Kung
More informationHigh Reliability Text Categorisation Systems
University of Cagliari Department of Electrical and Electronic Engineering High Reliability Text Categorisation Systems Doctoral Thesis of: Dott. Ing. Ignazio Pillai Tutor: Prof. Ing. Fabio Roli Dottorato
More informationStudy on Classifiers using Genetic Algorithm and Class based Rules Generation
2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules
More informationCombination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset
International Journal of Computer Applications (0975 8887) Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset Mehdi Naseriparsa Islamic Azad University Tehran
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More informationDomain Specific Search Engine for Students
Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationClassifying Bug Reports to Bugs and Other Requests Using Topic Modeling
Classifying Bug Reports to Bugs and Other Requests Using Topic Modeling Natthakul Pingclasai Department of Computer Engineering Kasetsart University Bangkok, Thailand Email: b5310547207@ku.ac.th Hideaki
More informationRobust Relevance-Based Language Models
Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new
More information